Thread: Search Engine
01-05-2008, 10:23 AM
roots (Moderator)
This concept is called information extraction, and it is worth doing some research on. There is a Java framework called Web-Harvest that does exactly this. Have a look at it as well.

Basically, what you need is:

1. The pages to search. An example would be a particular search results page from Google. You need to prepare a complete list of page addresses, either generated dynamically (based on previously collected data) or by some other method.

2. Use Apache HttpComponents or another good HTTP library to download the raw HTML content.

3. Now you need to extract the data from that raw HTML page. You have the following options:
   a. Write your own stream parsers for the different websites (this is what I did).
   b. Use an HTML-to-XML conversion library such as JTidy and extract the data as if Google had served you XML.
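
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not a finished crawler: the class name PageScraper, the "?page=N" URL pattern, and the example.com addresses are made up for the example, and it uses the JDK's built-in java.net.http.HttpClient (Java 11+) in place of HttpComponents so that it runs without extra libraries. The extraction step is a very naive hand-written scanner that only picks up double-quoted href values; a real site would need a parser written for its actual markup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class PageScraper {

    // Step 1: build the list of page addresses.
    // The "?page=N" pattern is a hypothetical example, not any real site's scheme.
    public static List<String> buildPageUrls(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(baseUrl + "?page=" + i);
        }
        return urls;
    }

    // Step 2: download the raw HTML of one page using the JDK HTTP client.
    public static String download(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Step 3: scan the HTML and collect every double-quoted href value.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        String marker = "href=\"";
        int pos = 0;
        while ((pos = html.indexOf(marker, pos)) != -1) {
            int start = pos + marker.length();
            int end = html.indexOf('"', start);
            if (end == -1) {
                break; // unterminated attribute, stop scanning
            }
            links.add(html.substring(start, end));
            pos = end + 1;
        }
        return links;
    }

    public static void main(String[] args) {
        // Inline sample so the demo needs no network access;
        // in real use the HTML would come from download(url).
        String html = "<a href=\"http://a.example/1\">A</a> <a href=\"/b\">B</a>";
        System.out.println(buildPageUrls("https://example.com/results", 2));
        System.out.println(extractLinks(html));
    }
}
```

In real use you would loop over buildPageUrls(...), call download(url) for each address, and pass the result to your own extraction code.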

Once you have the extracted information, you can store it in Lucene or any other indexing engine of your choice. That part should not be a problem.
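
To give an idea of what the indexing step does, here is a tiny in-memory inverted index in plain Java. To be clear, this is not Lucene and not its API; TinyIndex and its methods are invented for illustration only, as a stand-in that runs without any library. A real indexing engine adds analysis, ranking, and on-disk storage on top of the same basic idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an indexing engine: maps each word to the
// ids of the stored texts that contain it.
public class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Store one extracted text and index its lower-cased words.
    public int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Return the stored texts containing the given single word (case-insensitive).
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.emptySet());
        for (int id : ids) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Lucene indexing guide");
        index.add("HTML parsing with JTidy");
        System.out.println(index.search("lucene"));
    }
}
```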


Let me warn you: Google does not allow automated systems to keep searching for long. :-)