I am developing a crawler. The idea is that it searches for content through most of the search engines (Google, Yahoo, Ask, etc.) and produces the results.
I don't know how to search the content within these sites. My plan is to retrieve the value (i.e., the content), then append it to the search site's URL and write an XML file dynamically.
Is this the correct way? Please guide me; I am really stuck.
Waiting for your reply.
Look into Lucene. It is a powerful API if you want to search inside the data you have crawled. That is the best way, I think.
I downloaded Lucene, but it searches within files, and I want to go out to the internet and search there. Later I downloaded Nutch, but I have a problem running the .war file it produces.
Could you please help me run Nutch and see the output?
This concept is called information extraction; please do some research on it. There is a Java application framework that does exactly this, called Web-Harvest. Have a look at it as well.
Basically, what you need is:
1. Decide which pages to search. An example would be a particular search-result page from Google. You need to prepare a complete list of page addresses, either generated dynamically (based on previously collected data) or by some other method.
2. Use HttpComponents or some other good HTTP framework to download the raw HTML content (a sketch of steps 2 and 3 follows after this list).
3. Extract the data from that raw HTML page. You have the following options:
   1. Write your own stream parsers for the different websites (I did the same).
   2. Use an HTML-to-XML conversion library such as JTidy and extract the data as if Google had served you XML.
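Here is a rough sketch of steps 2 and 3, assuming you have HttpClient 4.x and JTidy on the classpath. The query, the URL, and the link extraction are only placeholders; you would swap in your own page list and per-site extraction rules.

import java.io.ByteArrayInputStream;
import java.net.URLEncoder;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class SearchPageFetcher {

    public static void main(String[] args) throws Exception {
        // Step 1: build the page address for one query
        // (a Google result page here; note they may block bots).
        String url = "https://www.google.com/search?q="
                + URLEncoder.encode("java web crawler", "UTF-8");

        // Step 2: download the raw HTML content with HttpClient.
        String html;
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            html = EntityUtils.toString(
                    client.execute(new HttpGet(url)).getEntity());
        }

        // Step 3: convert the messy HTML into a DOM tree with JTidy,
        // then read it as if the site had served you XML.
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes("UTF-8")), null);

        // Pull every link out of the page; a real extractor would
        // filter these down to the actual result entries.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            String href = ((Element) anchors.item(i)).getAttribute("href");
            if (href.length() > 0) {
                System.out.println(href);
            }
        }
    }
}

Once JTidy has cleaned the page up, you could also query the DOM with XPath, which scales better than hand-written parsing when you target several sites.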
When you have the extracted information, you can store it in Lucene or any other indexing engine of your choice. That should not be a problem.
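For that last step, here is a bare-bones sketch, assuming a Lucene 5+ style API (StringField/TextField); the index path, field names, and sample values are just my own choices.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ExtractedPageIndexer {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("crawl-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one extracted page: the URL stored as-is,
        // the text analyzed so it can be searched.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page", Field.Store.YES));
            doc.add(new TextField("content", "text you extracted from the page", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index the same way you would search files.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new QueryParser("content", analyzer).parse("extracted"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("url"));
            }
        }
    }
}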
Let me warn you: Google does not allow automated systems to run searches for long. :-)