Thread: Search Engine

  1. #1
    SSam Varghese is offline Member
    Join Date
    Dec 2007
    Posts
    4
    Rep Power
    0

    Search Engine

    Hi,
    I am developing a crawler. The idea is that it sends a query to several search engines (Google, Yahoo, Ask, etc.) and produces the combined results.

    I don't know how to search the content within these sites. At the moment I take the search term, append it to the search site's address, and write an XML file dynamically.

    Is this the correct way? Please guide me, I am in great trouble.

    Waiting for your reply.

    Regards

  2. #2
    JavaBean is offline Moderator
    Join Date
    May 2007
    Posts
    1,270
    Rep Power
    9

    Look up Lucene. It is a powerful API if you want to search inside the data you have crawled. I think that is the best way.

  3. #3
    SSam Varghese is offline Member
    Thanks for the reply

  4. #4
    SSam Varghese is offline Member
    Hi Sir,
    I downloaded Lucene, but it only searches within local files. I want to go out to the internet and search there. Later I downloaded Nutch, but I have a problem running the .war file it produces.

    Could you please help me run Nutch and see the output?

  5. #5
    roots is offline Moderator
    Join Date
    Jan 2008
    Location
    Dallas
    Posts
    293
    Rep Power
    7

    This concept is called information extraction; please do some research on it. There is a Java framework called Web-Harvest that does exactly this, so have a look at it as well.

    Basically, you need the following:

    1. A list of pages to search, for example a particular search results page from Google. Prepare the complete list of page addresses, either generated dynamically (based on previous data collection) or by some other method.

    2. HttpComponents or another good HTTP library to download the raw HTML content.

    3. A way to extract the data from the raw HTML page. You have the following options:
       1. Write your own stream parsers for the different websites (this is what I did).
       2. Use an HTML-to-XML conversion library such as JTidy and extract the data as if Google had served you XML.
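
    To make the three steps above a bit more concrete, here is a minimal sketch that uses only the JDK (Java 11 or later). It is not Web-Harvest, HttpComponents, or JTidy: java.net.http.HttpClient stands in for the HTTP library, and extractLinks assumes the page has already been converted to well-formed XML, which is the job a library like JTidy would do. The class name, the method names, and the example.com addresses are placeholders made up for illustration, not real search endpoints.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExtractionSketch {

    // Step 1: build one page address per site by appending the URL-encoded query.
    static List<String> buildSearchUrls(List<String> baseUrls, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        List<String> urls = new ArrayList<>();
        for (String base : baseUrls) {
            urls.add(base + encoded);
        }
        return urls;
    }

    // Step 2: download the raw content of one page as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 3, option 2: given well-formed XML, collect every link target with XPath.
    static List<String> extractLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Raw HTML is usually not well-formed XML, so parsing it directly fails here.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder addresses, not real search endpoints.
        List<String> urls = buildSearchUrls(
                List.of("https://search-one.example.com/?q=", "https://search-two.example.com/?q="),
                "java crawler");
        System.out.println(urls);

        // Offline sample standing in for a page that was already tidied into XML.
        String page = "<html><body><a href=\"http://example.com/a\">A</a></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

    In a real run you would call fetch(url) for each address from step 1 and pass the result through the HTML-to-XML conversion before handing it to extractLinks. The main method skips the download so the sketch can be tried without network access.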

    Once you have the extracted information, you can store it in Lucene or any other indexing engine of your choice. That part should not be a problem.
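
    To show roughly what an indexing engine does with the extracted text, here is a toy in-memory inverted index. It is only a stand-in for illustration and is not Lucene's API: TinyIndex, add, and search are made-up names, and a real engine adds ranking, persistence, and much better text analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {

    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index one extracted page under the given id (for example its URL).
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents that contain every term of the query.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Lowercase the text and split it on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

    For example, after add("page1", "Java crawler tutorial") and add("page2", "Java search engine"), a search for "java crawler" matches only page1, while a search for "java" matches both pages.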


    Let me warn you: Google does not allow automated systems to keep searching for long. :-)
    Don't worry, newbie, we've got you covered.

