- 05-02-2011, 08:10 PM #1
Member
- 05-02-2011, 11:12 PM #2
Moderator
The lesson on Regular Expressions (The Java™ Tutorials > Essential Classes) will show you how to parse the HTML and find specific content. That being said, if by "change" you mean content changed by a script such as JavaScript, the problem becomes far more complicated, because the script never runs when you simply download the page source.
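As a rough sketch of where that lesson leads, the snippet below pulls a single value out of a line of HTML with Pattern and Matcher. The markup (`<span id="price">`), the class name `ValueExtractor` and the method `extractPrice` are all made up for illustration; swap in whatever tag actually surrounds the value you're after.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ValueExtractor {
    // Hypothetical markup: the value sits inside <span id="price">...</span>.
    // The reluctant (.*?) stops at the first closing tag instead of the last one.
    private static final Pattern PRICE =
            Pattern.compile("<span id=\"price\">(.*?)</span>");

    /** Returns the captured value, or null if the text has no matching tag. */
    static String extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<p>Total: <span id=\"price\">19.99</span> USD</p>";
        System.out.println(extractPrice(line));
    }
}
```

The parentheses in the pattern form a capturing group, so `group(1)` returns only the text between the tags rather than the whole match.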
- 05-02-2011, 11:28 PM #3
Member
Yeah, I just want to parse HTML generated by PHP.
- 05-03-2011, 12:05 AM #4
Senior Member
Use a URL and a BufferedReader like so:
Java Code:
// Needs java.net.URL, java.io.BufferedReader and java.io.InputStreamReader
URL url = new URL("direct url to page");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
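To go from the open reader to text you can actually search, something along these lines should work. It's only a sketch: the class name `PageReader`, the method `readPage` and the example.com address are placeholders, and UTF-8 is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PageReader {
    /** Reads everything behind the URL into one String, ending each line with '\n'. */
    static String readPage(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the reader and its stream when done.
        // UTF-8 is an assumption; the page may declare another charset.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass the real page URL as the first argument.
        URL url = new URL(args.length > 0 ? args[0] : "http://example.com/");
        System.out.println(readPage(url));
    }
}
```

Holding the whole page in one String is fine for small pages; for large ones, processing each line inside the loop avoids keeping everything in memory.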
- 05-03-2011, 12:31 AM #5
Member
I've gotten that far, but once I'm there, how do I designate exactly which values to pull?
- 05-03-2011, 01:57 AM #6
Member
Well, the solution is to parse the page line by line and find what you need with regular expressions or simple string operations.
Hehe, actually I'm writing an article on a multithreaded web crawler, so what I need are the links to other HTML pages, i.e. the URLs inside <a href="..."> ... </a> tags.
Here's how I do it, perhaps you can adapt it to your own design.
Java Code:
// Needs java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader and java.net.URL.
// theURL, allSublinks, crawler and Link come from the rest of the crawler class (not shown here).
public void scrapeYourself() throws IOException {
    // try-with-resources releases the reader and the underlying stream, even if parsing fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(theURL.openConnection().getInputStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            /* See if an <a href=...> tag is present. Stuff (link name, target attribute)
             * can sit between <a href="" and </a>, but consider the simplest case for now. */
            if (line.contains("<a href=\"http:") && line.contains("</a>")
                    && line.indexOf("<a href=\"http:") < line.indexOf("</a>")) {
                String[] linkTokensWithinLine = line.split("<a href=\"");
                for (String token : linkTokensWithinLine) {
                    /* Ignore the part of the line before the first <a href=...> split */
                    if (!token.contains("</a>")) {
                        continue;
                    }
                    int start = token.indexOf("http:");
                    int end = (start < 0) ? -1 : token.indexOf("\"", start + 7);
                    if (start < 0 || end < 0) {
                        // Link is in javascript:/etc... format, so just ignore it
                        // (previously this fell through and added a null or stale link).
                        continue;
                    }
                    /* Add the proper URL to the list of this link's sublinks */
                    this.allSublinks.add(new Link(new URL(token.substring(start, end)), this.crawler));
                }
            }
        }
    }
}
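For comparison, here is a sketch of the same link extraction done with Pattern/Matcher instead of split and indexOf. It is not the crawler's actual code: the class `LinkExtractor` and the method `extractLinks` are hypothetical names, and the pattern deliberately handles only double-quoted, absolute http/https links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Matches <a ... href="http..."> and captures the URL between the quotes.
    // [^>]*? lets other attributes (e.g. target) come before href, but never
    // crosses the end of the tag. Single-quoted or relative links are skipped.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href=\"(https?:[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every absolute http/https link found in the given HTML text. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // find() advances to the next match each call
            links.add(m.group(1));  // group(1) is the captured URL
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://a.com/x\">A</a> "
                + "<a href=\"javascript:void(0)\">J</a> "
                + "<a target=\"_blank\" href=\"https://b.org\">B</a>";
        for (String link : extractLinks(line)) {
            System.out.println(link);
        }
    }
}
```

A regex like this is still no substitute for a real HTML parser, but it copes with several links on one line and skips javascript: links without relying on a caught exception.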
- 05-03-2011, 02:25 AM #7
Moderator
Have you looked at the link I provided above? Did you try using anything you learned from it, and if so, what? A Pattern/Matcher combo lets you match and capture specific pieces of text far more flexibly than manual string operations, and it's a powerful tool every programmer should have in their arsenal. You've given little insight into which tag you wish to grab and what you want from it, so posting an SSCCE or an example will help as well.