- 02-23-2010, 08:07 PM #1
Extract contents for a search engine (text, URLs)
I need to extract the contents of a URL, for all kinds of URLs. I want to read the whole content of a page into an object or a file, and from there extract the text and the URLs that the page links to, for use in a search engine. I tried regular expressions, but I heard that JDOM offers a better solution through its tree structure and functions, so I tried to build a document from a URL, but it started showing errors. Is JDOM the way to go, or is there a better way than regex? If you have a solution in JDOM, please post it.
Last edited by nijil; 02-23-2010 at 08:10 PM.
- 02-24-2010, 03:32 AM #2
I think JDOM will only work if we can ensure the source HTML is well-formed XHTML. In the general case, HTML might not have all of its closing tags.
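To illustrate the well-formedness point, here is a minimal sketch that uses the JDK's built-in XML parser as a stand-in for JDOM (an assumption made so the example has no external dependency; JDOM builds on a SAX parser and would be expected to reject the same input). The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: checks whether markup can be parsed as XML at all.
class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Throws SAXException on the first well-formedness error.
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. an element that is never closed
        } catch (Exception e) {
            // ParserConfigurationException or IOException: not a markup problem.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML-style markup, every element closed.
        System.out.println(isWellFormedXml("<html><body><p>Hi</p></body></html>"));
        // Typical real-world HTML: <br> is never closed.
        System.out.println(isWellFormedXml("<html><body><p>Hi<br></body></html>"));
    }
}
```

The second input is valid HTML but not well-formed XML, which is likely the kind of error the original poster ran into.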
If you want to use the tree structure, I think there is a good HTML parser from the Mozilla project: Mozilla Java Html Parser.
It has been a long time since I used it, but I remember it being fairly quick to learn.
Edit: no, that's not the project I was using. I meant this one:
http://htmlparser.sourceforge.net
lol, too many projects out there, right? :)
Last edited by travishein; 02-28-2010 at 10:29 PM.
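As a rough sketch of what a lenient HTML parser gives you, the example below collects link targets and visible text from a page that has unclosed tags. It uses the HTML parser that ships with the JDK (javax.swing.text.html), not the HTML Parser library linked above, since that library's API is not shown in this thread. The class name LinkAndTextExtractor and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records href values and text while the parser runs.
class LinkAndTextExtractor extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Separate text fragments with a space so words do not run together.
        text.append(data).append(' ');
    }

    List<String> getLinks() { return links; }

    String getText() { return text.toString().trim(); }

    // Parses HTML from any Reader, e.g. one wrapped around a URL's input stream.
    static LinkAndTextExtractor parse(Reader html) throws IOException {
        LinkAndTextExtractor callback = new LinkAndTextExtractor();
        new ParserDelegator().parse(html, callback, true); // true: ignore charset declarations
        return callback;
    }

    // Convenience wrapper for in-memory HTML.
    static LinkAndTextExtractor parseString(String html) {
        try {
            return parse(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Note the unclosed <p> tags: an XML parser would reject this input.
        LinkAndTextExtractor result = parseString(
                "<html><body><p>Hello <a href=\"http://example.com/a\">link</a><p>World</body></html>");
        System.out.println(result.getLinks());
        System.out.println(result.getText());
    }
}
```

For a real page, the Reader could wrap the stream returned by java.net.URL.openStream(). Relative links would still need to be resolved against the page URL before being stored.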
- 02-24-2010, 09:32 PM #3
If you are trying to create a crawler, you could start with Sun's article 'Writing a Web Crawler in the Java Programming Language', where you can find the key concepts and even source code for a basic crawler.
It describes an entire use case, so take a look and see whether it fulfills your needs. I really think it will clear up many concepts for you, such as:
- how to get list of URLs
- how to move through that list
- how to deal with different file types
...
You can always add your own parsers and logic with the help of an API like JDOM.
I really hope this helps.
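The "list of URLs" and "move through that list" ideas can be sketched roughly as a queue of pages to visit plus a set of pages already seen. This is not the code from the Sun article. The class name CrawlFrontier and the linksOf parameter are hypothetical, and link extraction is passed in as a function so the loop can be tried without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical breadth-first crawl loop over an abstract "links of this page" function.
class CrawlFrontier {

    static List<String> crawl(String seed,
                              Function<String, List<String>> linksOf,
                              int maxPages) {
        Deque<String> toVisit = new ArrayDeque<>(); // URLs waiting to be processed
        Set<String> seen = new HashSet<>();         // URLs already queued, to avoid loops
        List<String> visited = new ArrayList<>();   // URLs processed, in order

        toVisit.add(seed);
        seen.add(seed);
        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            String url = toVisit.remove();
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) { // add() returns false if the link was already seen
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" standing in for real pages.
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("a", "d"));
        System.out.println(
                crawl("a", u -> web.getOrDefault(u, Collections.emptyList()), 10));
    }
}
```

In a real crawler, linksOf would download the page and run it through an HTML parser. Handling different file types would mean checking the content type before parsing.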
- 02-27-2010, 09:39 PM #4
Thanks. I am adding a new thread.
Thanks. I am trying to use the Mozilla parser, but I can't get it running. Do we need to build the Mozilla Firefox code to use the parser?
- 02-28-2010, 10:30 PM #5
My bad, use HTML Parser instead.
See my reply with the example invocation on your other thread.