  1. #1
    heveen is offline Member

    Web Spider - Extract URLs

    Hello friends,
    I am totally new to Java programming. I am taking a module, 'Network Programming with Java', and I have to develop a simple web spider.

    Can anyone suggest how to do it?
    In particular, how do I extract the URLs found in the page at a specific URL?

    Below is part of my code:
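    // Needs: import java.io.*; import java.net.URL; import java.net.URLConnection;
    try
    {
        URL u = new URL("mainURL"); // placeholder: replace with the address of the page to fetch
        URLConnection uc = u.openConnection();
        String contentType = uc.getContentType(); // may be null if the server sends no Content-Type
        int contentLength = uc.getContentLength();

        if ((contentType != null && contentType.startsWith("text/")) || contentLength == -1)
        {
            // Read from the connection that is already open; u.openStream() would open a second one.
            // try-with-resources closes both the reader and the writer.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
                 FileWriter out = new FileWriter("home.html"))
            {
                String str;
                while ((str = in.readLine()) != null)
                {
                    out.write(str);
                    out.write(System.lineSeparator()); // readLine() strips the line terminator
                }
            }
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }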

    How do I extract the URLs from here?
    I have tried manipulating the strings, but in vain.
    I just want to extract at least 5 links from the content I get.

    Please help me with this.

    Thanks a lot.

  2. #2
    serjant is offline Senior Member

    You need a parser that matches the content type. If the page is HTML, PHP, or ASP output (Content-Type: text/html), the Java htmlparser library will help you; search the web for it. For MS Office files use the POI library, and for PDF files PDFBox, iText, or JPOD will help.
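
    To illustrate the parser-based approach for text/html content, here is a minimal sketch. Note that it does not use the htmlparser library mentioned above; as a stand-in it uses the HTML parser that ships with the JDK (javax.swing.text.html), so it runs without extra jars. The class name ParserLinkExtractor and the method extractLinks are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library.
class ParserLinkExtractor {

    // Collects the href value of every <a> start tag, in document order.
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // Third argument: ignore any charset declared inside the document.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/b.html\">B</a></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

    In the spider, the BufferedReader built from the URLConnection could be passed to extractLinks directly instead of the StringReader. Relative links such as /b.html still have to be resolved against the page address before they can be fetched.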

  3. #3
    neeti is offline Member

    Hi,
    You can use regular expressions to extract URLs from the text.
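
    A minimal sketch of the regular-expression approach, assuming the links of interest are href attributes of <a> tags with quoted values. The class name RegexLinkExtractor, the method extractLinks, and the maxLinks parameter are made up for this example; the limit is there because the original question asks for only about 5 links. A regex like this is only an approximation of HTML parsing and will miss unquoted or script-generated links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class RegexLinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the opening quote, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns at most maxLinks href values, in the order they appear.
    static List<String> extractLinks(String html, int maxLinks) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (links.size() < maxLinks && m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A HREF='/b.html'>B</A></p>";
        for (String link : extractLinks(html, 5)) {
            System.out.println(link);
        }
    }
}
```

    Since a tag can span several lines, it is safer to append the lines read from the page to a StringBuilder and run the pattern once over the whole text than to match each line separately.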
