Java Forums

  #1 (permalink)  
Old 12-17-2007, 02:21 PM
Member
 
Join Date: Dec 2007
Posts: 3
SSam Varghese is on a distinguished road
Search Engine
Hi,
I am developing a crawler. The idea is that it sends a query to the major search engines (Google, Yahoo, Ask, etc.) and produces the results.

I don't know how to search the content of these sites. At the moment I take the value (the search content), append it to the search site's address, and write an XML file dynamically.

Is this the correct way? Please guide me, I am in great trouble.

Waiting for your reply.

Regards
  #2 (permalink)  
Old 12-18-2007, 07:36 AM
Moderator
 
Join Date: May 2007
Posts: 1,272
JavaBean is on a distinguished road
Look up Lucene. It is a powerful API if you want to search inside the data you have crawled. I think that is the best way.
  #3 (permalink)  
Old 12-23-2007, 09:52 AM
Member
 
Join Date: Dec 2007
Posts: 3
SSam Varghese is on a distinguished road
Thanks for the reply.
  #4 (permalink)  
Old 12-30-2007, 11:42 AM
Member
 
Join Date: Dec 2007
Posts: 3
SSam Varghese is on a distinguished road
Hi Sir,
I downloaded Lucene, but it only searches within files. I want to go out to the internet and search there. Later I downloaded Nutch, but I have a problem running the .war file it provides.

Could you please help me run Nutch and see the output?
  #5 (permalink)  
Old 01-05-2008, 09:23 AM
Moderator
 
Join Date: Jan 2008
Location: Dallas
Posts: 263
roots is on a distinguished road
This concept is called information extraction. Please do some research on it. There is a Java application framework called Web-Harvest that does exactly this. Have a look at it as well.

Basically, what you need is:

1. Decide which pages to search. An example would be a particular search result page from Google. You need to prepare a complete list of page addresses, either generated dynamically (based on previously collected data) or by some other method.

2. Use HttpComponents or another good HTTP framework to download the raw HTML content.

3. Extract the data from that raw HTML page. You have the following options:
   1. Write your own stream parser for each website (this is what I did).
   2. Use an HTML-to-XML conversion library such as JTidy and extract the data as if Google had served you XML.
Once you have the extracted information, you can store it in Lucene or any other indexing engine of your choice. That should not be a problem.
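
To give an idea of what an indexing engine does with the extracted text, here is a toy in-memory inverted index. It is not Lucene and does not use the Lucene API; it only illustrates the term-to-document mapping that such an engine builds for you. The class and method names are invented for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndexSketch {

    // term -> ids of the documents that contain it (sorted for stable output)
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split the text into lowercase alphanumeric terms and record the document id for each.
    void add(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single term; returns an empty set when nothing matches.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A real engine adds tokenizing rules, ranking and on-disk storage on top of this, which is the reason to use Lucene rather than rolling your own.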


Let me warn you: Google does not allow automated systems to keep searching for long. :-)
__________________
dont worry newbie, we got you covered.
  #6 (permalink)  
Old 01-05-2008, 09:26 AM
Moderator
 
Join Date: Jan 2008
Location: Dallas
Posts: 263
roots is on a distinguished road
web information extraction - Google Search
HttpComponents - HttpComponents Overview
Web-Harvest Project Home Page
Tip: Convert from HTML to XML with HTML Tidy

These links might be useful as well. Excuse my English in the earlier post.