Thread: web extraction

  1. #1
    murali is offline Member
    Join Date
    Dec 2008

    How can I extract web content using Java?

    Particularly, how do I access the hyperlinks on a page?

  2. #2
    neilcoffey is offline Senior Member
    Join Date
    Nov 2008

    To extract the "raw" page:
    - construct a URL object with the URL you want to "download"
    - call openConnection() on it, then getInputStream() to get a stream to read the raw bytes of the page from
    - wrap an InputStreamReader around that input stream so you can read characters; when you initialise it, set an appropriate character encoding (use the charset named in the value returned by urlconnection.getContentType() if there is one, otherwise try "UTF-8" or "ISO-8859-1"; note that getContentEncoding() reports compression such as "gzip", not the character set)
    - pull out the characters and put them into a StringBuilder (Java 5+) or StringBuffer
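
    The steps above could be sketched roughly as follows. This is only a minimal sketch, not a robust downloader: the class and method names (PageFetcher, charsetFrom, readAll, fetch) are made up for illustration, the charset is taken from the Content-Type header via getContentType(), and falling back to UTF-8 when no charset is declared is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Picks the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing or names no usable charset.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Downloads the "raw" page at the given address as text.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        return readAll(conn.getInputStream(), charsetFrom(conn.getContentType()));
    }
}
```

    You would then call something like PageFetcher.fetch("http://example.com/") (a placeholder address) and hand the resulting String to whatever does the link extraction.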

    Then, to extract the links from the page, I'd recommend using regular expressions (the link is to a tutorial I wrote in case it helps, but there are others on the web). So the idea is:
    - call Pattern.compile(), passing in a pattern representing the format of a link. As a first approximation (read about regular expressions to understand why, and how you might improve this!), something like:
    Java Code:
      Pattern p = Pattern.compile("<a.*?href=\"(.*?)\"[^>]*>");
    - then, call p.matcher(), passing in the StringBuilder/StringBuffer from earlier
    - now, in a loop, call find() on the matcher
    - while find() returns true, call group(1) on the matcher to pull out the URL of the link
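
    Put together, the matching loop might look something like the sketch below. It uses the first-approximation pattern from above unchanged, so it shares its limitations (for example, it is case-sensitive and only sees double-quoted href values). The names LinkExtractor and extractLinks are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // First-approximation link pattern; group 1 captures the href value.
    private static final Pattern LINK =
            Pattern.compile("<a.*?href=\"(.*?)\"[^>]*>");

    // Returns the href values of all links found, in document order.
    // Accepts any CharSequence, so a StringBuilder can be passed directly.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

    The returned strings are whatever appears in the href attribute, so relative links would still need resolving against the page's own URL before you can fetch them.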

    If you're writing a spider, you'll have to think about:
    - recursive calling or using a Stack object to handle recursively pulling out links from pages
    - storing in a HashSet or similar the links you've already explored, so you don't repeat them (store the String version of the URL, not the URL object itself)
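
    As a rough illustration of those two points, here is a sketch of the non-recursive version. A few things in it are my own choices rather than anything fixed: it uses an ArrayDeque as the stack instead of java.util.Stack, a LinkedHashSet (a HashSet that remembers insertion order) for the visited URLs, a linksOf function standing in for "download the page and extract its links", and a maxPages limit so the crawl cannot run forever.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class Spider {

    // Visits pages reachable from start, at most maxPages of them.
    // linksOf is a hypothetical hook: given a URL string, it returns the
    // (already absolute) URLs of the links found on that page.
    static Set<String> crawl(String start,
                             Function<String, List<String>> linksOf,
                             int maxPages) {
        Set<String> visited = new LinkedHashSet<>(); // URL strings already explored
        Deque<String> pending = new ArrayDeque<>();  // used as a stack
        pending.push(start);
        while (!pending.isEmpty() && visited.size() < maxPages) {
            String url = pending.pop();
            if (!visited.add(url)) {
                continue; // already explored, don't repeat it
            }
            for (String link : linksOf.apply(url)) {
                if (!visited.contains(link)) {
                    pending.push(link);
                }
            }
        }
        return visited;
    }
}
```

    Keeping the download and parsing behind linksOf means the traversal can be tried out on a small in-memory map of pages before pointing it at real sites.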

  3. #3
    roots is offline Moderator
    Join Date
    Jan 2008

    I use the combination of HTML Cleaner and XPath.
    don't worry newbie, we got you covered.

  4. #4
    fishtoprecords is offline Senior Member
    Join Date
    Jun 2008

    @neilcoffey's approach will work, but it's not robust. It's much better to use a well-tested library such as Apache's HttpClient. It can handle things like chunked responses, MIME encoding, etc.
