Thread: help me..how to get links in html file

  1. #1

    Hi everybody,
    I have just started studying Java and I have a problem.
    I want to get all the links in an HTML file and write them to a text file, using only core Java.
    I can't work out how to do it. Please help.

  2. #2
    fishtoprecords (Senior Member)

    What's your definition of a link?

    Basic approach: set up code to open the file and read each line.
    For each line, scan for <a href=, which is the start of the anchor.
    Scan to the end of the attribute value; whatever is between the = and that end is the link.
    Write out the link.
    Repeat until done.
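
A minimal sketch of the steps above in core Java. The class name HrefLineScanner, the helper names, and the two command-line arguments (input HTML file, output text file) are assumptions for illustration, and the files are assumed to be UTF-8. It only finds anchors where href is the first attribute and the whole value sits on one line, so treat it as a starting point rather than a complete solution.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class HrefLineScanner {

    private static final String MARKER = "<a href=";

    // Case-insensitive indexOf, so that <A HREF= is found as well.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns every href value found in one line, in order of appearance.
    static List<String> linksInLine(String line) {
        List<String> links = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOfIgnoreCase(line, MARKER, from);
            if (start < 0) {
                break;
            }
            int valueStart = start + MARKER.length();
            if (valueStart >= line.length()) {
                break;
            }
            char quote = line.charAt(valueStart);
            int valueEnd;
            if (quote == '"' || quote == '\'') {
                // Quoted value: it ends at the matching quote character.
                valueStart++;
                valueEnd = line.indexOf(quote, valueStart);
                if (valueEnd < 0) {
                    break; // closing quote is not on this line
                }
            } else {
                // Unquoted value: it ends at whitespace or '>'.
                valueEnd = valueStart;
                while (valueEnd < line.length()
                        && !Character.isWhitespace(line.charAt(valueEnd))
                        && line.charAt(valueEnd) != '>') {
                    valueEnd++;
                }
            }
            links.add(line.substring(valueStart, valueEnd));
            from = valueEnd;
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java HrefLineScanner <input.html> <output.txt>");
            return;
        }
        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get(args[0]), StandardCharsets.UTF_8);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                     Paths.get(args[1]), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String link : linksInLine(line)) {
                    out.println(link);
                }
            }
        }
    }
}
```

Run it as: java HrefLineScanner page.html links.txt (both file names are placeholders).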

  3. #3
    Fubarable (Moderator)

    Perhaps you need to use an HTML parser? I'm no pro at this, but if it were me I'd search Google or this forum (or other large Java fora) on this subject, as I'm sure it's been addressed many times. I've also seen utilities that can massage HTML so that it can be read by XPath; I think one is called JTidy, but again, I've only seen these in passing. Sorry I couldn't give more info, but perhaps a pro will come by and give you more. Good luck!

    edit: never mind, just do what fishtoprecords suggests!

  4. #4
    fishtoprecords (Senior Member)

    Quote (Fubarable): Perhaps you need to use an HTML parser?

    That's the real answer, but I think that fails the "core Java" restriction.

  5. #5

    Sorry, but the HTML file also contains bookmark links and JavaScript links, and I don't want those. I only want the real links.
    How can I do that? I can't work it out. Please help.

  6. #6
    Norm (Moderator)

    Quote: real links..
    Please give some examples of the types of links you want to skip and the types that are "real".

  7. #7
    eMAX (Member)

    Quote (Norm): Please give some examples of the types of links you want to skip and the types that are "real".

    Suppose we have a tag like this:

    "<a href=\"#\" onclick=\"javascript:return false;\">java-forum</a><br>"

    Here, what follows "a href=" is not a real link. How can we reject this case?
    (Sorry, I'm a newbie.)
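
One possible way to reject such cases, sketched under an assumption: nobody in the thread has defined "real" yet, so here it is taken to mean "not empty, not a bare #bookmark, and not a javascript: pseudo-link". The class and method names are made up for illustration, and the sample values in main are placeholders.

```java
import java.util.Locale;

class RealLinkFilter {

    // Assumed definition of a "real" link: a non-empty href that is neither
    // a same-page bookmark ("#" or "#section") nor a "javascript:" pseudo-URL.
    static boolean isRealLink(String href) {
        if (href == null) {
            return false;
        }
        String h = href.trim();
        if (h.isEmpty() || h.startsWith("#")) {
            return false;
        }
        return !h.toLowerCase(Locale.ROOT).startsWith("javascript:");
    }

    public static void main(String[] args) {
        String[] samples = {
            "#", "#top", "javascript:return false;", "page.html", "page.html#top"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isRealLink(s));
        }
    }
}
```

Note that a value such as page.html#top is kept, because it points at another document. Adjust the rules if your definition of a real link differs.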

  8. #8
    eMAX (Member)

    In reply to fishtoprecords's basic approach (post #2):

    Uhh...
    - What if there is more than one "<a href=" tag on a line?
    - What if we have "<a href="#""?
    - Or "a href=&acute java-forum . com &acute" // &acute = '
    - Or "a href=&quot java-forum . com &quot" // &quot = "
    - ...
    Last edited by eMAX; 10-04-2008 at 05:25 PM.

  9. #9
    Supamagier (Senior Member)

    Use a regex to extract the links, or write your own method that strips the unneeded characters from each line.
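
A rough sketch of the regex route using java.util.regex from core Java. The pattern and the class name HrefRegex are my own assumptions, not something posted in this thread. It copes with several anchors on one line, with double-quoted, single-quoted, and unquoted values, and with href not being the first attribute. It does not decode entity-encoded quotes, and no regex will handle every piece of malformed HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegex {

    // <a, optional other attributes, then href = "value" | 'value' | value
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s+)?href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns all href values found in the given HTML text, in order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is non-null per match.
            String value = m.group(1);
            if (value == null) {
                value = m.group(2);
            }
            if (value == null) {
                value = m.group(3);
            }
            links.add(value);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"x.html\">X</a><A class=c HREF='y.html'>Y</a> <a href=z.html>Z</a>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Because the matcher runs over a whole string, you can read the entire file into one String first, which also covers anchors whose attributes are split across lines.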

  10. #10
    litty.joseph (Member)

    href scanner

    Here is a really "simple" href scanner:
    dominomill.com/blog/?p=6

    It does the job at hand, but it may not be the best way to do it. Of course, it doesn't use an HTML parser, so it will miss the tricky links in HTML.

  11. #11
    makpandian (Senior Member)

    Code for this Thread

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;

    class GetLinks {
        public static void main(String[] args) {
            if (args.length < 1) {
                System.err.println("Usage: java GetLinks <html-file-or-http-url>");
                System.exit(1);
            }

            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();

            // The Document class does not yet
            // handle charsets properly.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // try-with-resources closes the reader once parsing is done.
            try (Reader rd = getReader(args[0])) {

                // Parse the HTML.
                kit.read(rd, doc, 0);

                // Iterate through the elements
                // of the HTML document.
                ElementIterator it = new ElementIterator(doc);
                javax.swing.text.Element elem;
                while ((elem = it.next()) != null) {
                    // Content inside an <a> tag stores the anchor's attributes
                    // under the HTML.Tag.A key; for other elements this is null.
                    AttributeSet s = (AttributeSet)
                            elem.getAttributes().getAttribute(HTML.Tag.A);
                    if (s != null) {
                        // Named anchors (<a name=...>) have no href, so skip nulls.
                        Object href = s.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.exit(1);
            }
            // Exit status 0 signals success.
            System.exit(0);
        }

        // Returns a reader on the HTML data. If 'uri' begins
        // with "http:", it's treated as a URL; otherwise,
        // it's assumed to be a local filename.
        static Reader getReader(String uri) throws IOException {
            if (uri.startsWith("http:")) {

                // Retrieve from Internet.
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            } else {

                // Retrieve from file.
                return new FileReader(uri);
            }
        }
    }


    The code above should extract all the links in an HTML file and print them.
    To run the application, pass the HTML file as a command-line argument to the main program.
    I hope this code is useful to you.

    Thank you.
    If you have any problems running it, mail me.

  12. #12
    Tolls (Moderator)

    Makpandian hasn't posted here in a while.
    Please use [code] tags [/code] when posting code.

    If you read their code (which I realise is bloody difficult without tags) you would see that it accepts an 'http' url as a parameter as well.
    See the last if/else statement in the code.
    If it begins with 'http:' then it will open a URL connection and read from that rather than a file.

  13. #13
    atia akram (Member)

    Sir, what changes do I have to make to extract image links? Please help me.
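
One way to approach the image question, as a sketch rather than a drop-in change to the code posted earlier: it uses the same Swing HTML parser, but through a parser callback that collects the src attribute of every <img> tag. The class name GetImageLinks and the method imageSources are assumptions for illustration, and the input is assumed to be a local file.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class GetImageLinks {

    // Parses the HTML from the reader and returns the src of each <img> tag.
    static List<String> imageSources(Reader html) throws IOException {
        final List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <img> has no closing tag, so the parser reports it as a simple tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        // The third argument 'true' tells the parser to ignore charset directives.
        new ParserDelegator().parse(html, callback, true);
        return sources;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java GetImageLinks <html-file>");
            return;
        }
        try (Reader rd = new FileReader(args[0])) {
            for (String src : imageSources(rd)) {
                System.out.println(src);
            }
        }
    }
}
```

The same callback can collect anchors too: override handleStartTag, test for HTML.Tag.A, and read HTML.Attribute.HREF instead.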
