heveen (Member), 07-03-2009, 12:17 AM
Web Spider - Extract URLs
Hello friends,
I am totally new to Java programming. I am taking a module, 'Network Programming with Java', and I have to develop a simple web spider.

Can anyone suggest how to do it?
In particular, how do I extract the URLs from the page at a specific URL?
Below is part of my code:
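import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class PageFetcher {
    public static void main(String[] args) {
        try {
            // "mainURL" is a placeholder; replace it with the page to fetch
            URL u = new URL("mainURL");
            URLConnection uc = u.openConnection();
            String contentType = uc.getContentType(); // may be null if the server sends no type
            int contentLength = uc.getContentLength();

            if ((contentType != null && contentType.startsWith("text/")) || contentLength == -1) {
                // Read from the connection opened above rather than calling u.openStream() again
                try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
                     FileWriter out = new FileWriter("home.html")) {
                    String str;
                    while ((str = in.readLine()) != null) {
                        out.write(str);
                        out.write(System.lineSeparator()); // readLine() strips the line break
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}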

How do I extract the URLs from here?
I have tried manipulating the strings, but in vain.
I just want to extract at least 5 links from the content I get.

Please help me with this.

Thanks a lot
serjant (Senior Member, Ukraine, Zaporozhye), 07-07-2009, 08:54 AM
You need to write your own parser for each kind of content. If it is HTML/PHP/ASP (Content-Type: text/html), the Java htmlparser library will help you; search the web for it. If it is an MS Office file, use the POI library. If it is a PDF file, PDFBox, iText or jPod will help you do that.
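
To illustrate the parser approach for text/html content, here is a minimal sketch. Note that it does not use the htmlparser library mentioned above; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), only because that needs no extra download. The class name ParserLinkExtractor and the method extractLinks are made up for this example, and it assumes the page has already been read into a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href value of every <a> tag in a page.
class ParserLinkExtractor {

    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser calls handleStartTag once for each opening tag it meets.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/b\">B</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Links such as /b come back exactly as written in the page; turning them into absolute URLs is a separate step that this sketch leaves out.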
neeti (Member), 07-09-2009, 02:15 PM
Hi,
You can use regular expressions to extract URLs from the text.
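
A rough sketch of the regular-expression idea, assuming the page content has been collected into a single String. The class name RegexLinkExtractor, the method extractLinks and the maxLinks parameter are invented for this example. The pattern only handles quoted href attributes on <a> tags, so treat it as a starting point rather than a complete HTML link matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href values out of <a> tags with a regex.
class RegexLinkExtractor {

    // Group 1 is the opening quote (" or '), group 2 is the URL up to the matching quote.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns at most maxLinks URLs, in the order they appear in the page.
    static List<String> extractLinks(String html, int maxLinks) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (links.size() < maxLinks && m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A></p>";
        for (String link : extractLinks(html, 5)) {
            System.out.println(link);
        }
    }
}
```

Calling extractLinks(pageText, 5) stops after five links, which matches the "at least 5 links" goal in the first post. Unquoted href values and links built by scripts are not matched by this pattern.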
All times are GMT +2. The time now is 05:01 PM.



Copyright ©2006 - 2007, www.java-forums.org