  1. #1
    nijil (Member)

    Extract contents for a search engine (text, URLs)

    I need to extract the contents from a URL (any kind of URL). I want to get the whole content of a page into an object or a file, and from there extract the text and the URLs that the page links to, for a search engine. I tried using regular expressions, but I heard that JDOM offers a better solution through its tree structure and functions, so I tried to build a Document from a URL, but it started showing errors. Is JDOM the way to go, or is there a better way than using regex? If you have a solution in JDOM, please post it.
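
    The first step, pulling a page's raw content into a String, can be done with the standard library alone. A minimal sketch (the URL below is just a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class PageFetcher {

        // Download the raw HTML of a page into a String.
        public static String fetch(String address) throws Exception {
            URL url = new URL(address);
            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
            }
            return page.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(fetch("http://www.example.com/")); // placeholder URL
        }
    }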

  2. #2
    travishein (Senior Member, Canada)


    I think JDOM is only going to work if we can ensure the source HTML is well-formed XHTML. In the general case, HTML might not have all of its closing tags.
    If you want to use the tree-type structure, I think there is a good HTML parser from the Mozilla project: Mozilla Java Html Parser

    It has been a long time since I used it, but I remember it being fairly quick to learn how to work with.

    Edit: no, that's not the project I was using; I meant this one:

    http://htmlparser.sourceforge.net

    lol, too many projects out there, right? :)
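
    For what it's worth, link and text extraction with that library ends up quite short. A rough sketch from memory (the LinkBean and StringBean class names are from the org.htmlparser packages as I remember them, so double-check against the project docs; the URL is a placeholder):

    import java.net.URL;

    import org.htmlparser.beans.LinkBean;
    import org.htmlparser.beans.StringBean;

    public class PageScan {
        public static void main(String[] args) {
            String page = "http://www.example.com/"; // placeholder URL

            // Collect every URL the page links to.
            LinkBean linkBean = new LinkBean();
            linkBean.setURL(page);
            for (URL link : linkBean.getLinks()) {
                System.out.println(link);
            }

            // Collect the visible text of the page, without the link URLs.
            StringBean stringBean = new StringBean();
            stringBean.setLinks(false);
            stringBean.setURL(page);
            System.out.println(stringBean.getStrings());
        }
    }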

  3. #3
    FON (Senior Member, Belgrade, Serbia)


    If you are trying to create a crawler, you could start with Sun's
    'Writing a Web Crawler in the Java Programming Language',
    where you can find the key concepts and even source code for a basic crawler.

    Writing a Web Crawler in the Java Programming Language

    There is an entire use case described, so take a look and see whether it fulfills your needs. I really think many concepts will become clear to you, such as:
    - how to get a list of URLs
    - how to move through that list
    - how to deal with different file types
    ...
    You can always add your own parsers and your own logic with the help of some API
    like JDOM... (a rough sketch of the basic crawl loop follows below)

    I really hope this will help...
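
    To make that loop concrete, here is a bare-bones sketch of the crawl queue (my own illustration, not the code from Sun's article; extractLinks() is a stub you would back with a real HTML parser):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SimpleCrawler {

        private final Set<String> visited = new HashSet<String>();    // pages already fetched
        private final Deque<String> queue = new ArrayDeque<String>(); // pages still to fetch

        public void crawl(String seed, int maxPages) {
            queue.add(seed);
            while (!queue.isEmpty() && visited.size() < maxPages) {
                String url = queue.poll();
                if (!visited.add(url)) {
                    continue; // already seen this page
                }
                // Fetch and parse the page here (index its text for the search engine),
                // then enqueue every link it points to.
                for (String link : extractLinks(url)) {
                    if (!visited.contains(link)) {
                        queue.add(link);
                    }
                }
            }
        }

        // Stub: plug in a real HTML parser here (JDOM on well-formed XHTML, or the
        // HTML Parser project on ordinary HTML) and return the page's outgoing links.
        private List<String> extractLinks(String url) {
            return new ArrayList<String>();
        }
    }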

  4. #4
    nijil (Member)

    Thanks... I am adding a new thread...

    Thanks... I am trying to use the Mozilla parser, but I can't get it running. Do we need to build the Mozilla Firefox code to use the parser?

  5. #5
    travishein (Senior Member, Canada)


    My bad,

    use HTML Parser (http://htmlparser.sourceforge.net) instead.

    See my reply with the example invocation on your other thread.
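
    Not the reply referenced above, but for readers of this thread, a basic invocation of that library might look roughly like this (the Parser, NodeClassFilter and LinkTag names are from the org.htmlparser packages as I recall them, so treat them as assumptions and check the docs):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class LinkDump {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; pass the page you want to scan.
            Parser parser = new Parser("http://www.example.com/");

            // Keep only the <a> tags, then print where each one points.
            NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " : " + link.getLinkText());
            }
        }
    }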

