Results 1 to 8 of 8
  1. #1
    nijil is offline Member
    Join Date
    Feb 2010
    Posts
    14
    Rep Power
    0

    Default Help with the Java Mozilla HTML parser

    I want to use this particular parser for HTML parsing, but I can't get it working and I don't know the right way to set it up. Somewhere it says you need a Mozilla distribution, and I don't know what to do with that; I just want to use it. I am using NetBeans as the IDE on Linux, but a solution for Windows will do fine. I would have loved to find the solution by myself, but I am running out of time. Please help.

    It would also be great if you could provide a program to test whether the parser is working properly.

    Mozilla Java Html Parser

  2. #2
    travishein's Avatar
    travishein is offline Senior Member
    Join Date
    Sep 2009
    Location
    Canada
    Posts
    684
    Rep Power
    6

    Default

    Hmm, my bad, I think I linked to the wrong parser. The one they have appears to be a Java wrapper around the C parser in the Mozilla browser, which is just a lot of terrible complexity really. And yes, it is a mess to try to figure out how to build that thing, too.

    The one I had used before appears to be this one:

    HTML Parser - HTML Parser

    They have built all the supporting pieces in Java instead of just wrapping the C code, and it just plain works.

    Download the zip file of their bin distribution; inside its lib folder you will find htmlparser.jar. Copy it and add it to your project's classpath settings.

    Here is a sample use of their API to parse the hidden form fields out of an HTML document.

    Java Code:
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Hashtable;
    import java.util.Map;
    
    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.lexer.InputStreamSource;
    import org.htmlparser.lexer.Lexer;
    import org.htmlparser.lexer.Page;
    import org.htmlparser.tags.InputTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.util.SimpleNodeIterator;
    
    
    public class SampleHtmlParser {
    
      /**
       * Reads all hidden field elements from an HTML source.
       * @param inputStream an input stream, such as from FileInputStream, or the body from a commons-httpclient GET method.
       * @return a map of hidden field element name-value pairs found.
       */
      public Map<String,String> parseHiddenFields(InputStream inputStream) throws IOException {
        // try to read out the hidden fields from the form.
        Parser parser = new Parser(new Lexer(new Page(new InputStreamSource(inputStream))));
        NodeFilter filter = new NodeFilter() {
          public boolean accept(Node n) {
            if (n.getText().toLowerCase().indexOf("input") >= 0) {
              return true;
            }            
            return false;
          }
        };
        
        Map<String, String> hiddenFields= new Hashtable<String, String>();
        
        try {
          NodeList nodes = parser.parse(filter);
          for (SimpleNodeIterator it = nodes.elements(); it.hasMoreNodes();) {
            Node node = it.nextNode();
            if (InputTag.class.isAssignableFrom(node.getClass())) {
              InputTag tag = (InputTag) node;
              String type = tag.getAttribute("type");
              if ("hidden".equalsIgnoreCase(type)) {
                String name = tag.getAttribute("name");
                String value = tag.getAttribute("value");
                
                // guard against missing attributes: Hashtable rejects null keys and values
                if (name != null && value != null) {
                  hiddenFields.put(name, value);
                }
              }
            }
          } // for
        }
        catch (ParserException ex) {
          throw new IOException("unable to parse content: " + ex.getMessage(), ex);
        }
        return hiddenFields;
      }
    }

  3. #3
    nijil is offline Member
    Join Date
    Feb 2010
    Posts
    14
    Rep Power
    0

    Default

    Thanks. I also tried the Jericho parser, but in the extraction part, using the getAttributeValue function, I can only get href links. What about src links? One more doubt: how do we get the content of a relative link, for example /direct/contact.html, which is not exactly an http link, or will the parser take care of that?


    This is some code for text extraction in Jericho; toward the middle you can see the href part.


    import au.id.jericho.lib.html.*;
    import java.util.*;
    import java.io.*;
    import java.net.*;

    public class ExtractText {
      public static void main(String[] args) throws Exception {
        String sourceUrlString = "http://mozillaparser.sourceforge.net/download.html";
        if (args.length == 0)
          System.err.println("Using default argument of \"" + sourceUrlString + '"');
        else
          sourceUrlString = args[0];
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));

        // Call fullSequentialParse manually as most of the source will be parsed.
        source.fullSequentialParse();

        System.out.println("Document title:");
        String title = getTitle(source);
        System.out.println(title == null ? "(none)" : title);

        System.out.println("\nDocument description:");
        String description = getMetaValue(source, "description");
        System.out.println(description == null ? "(none)" : description);

        System.out.println("\nDocument keywords:");
        String keywords = getMetaValue(source, "keywords");
        System.out.println(keywords == null ? "(none)" : keywords);

        System.out.println("\nLinks to other documents:");
        List linkElements = source.findAllElements(HTMLElementName.A);
        for (Iterator i = linkElements.iterator(); i.hasNext();) {
          Element linkElement = (Element) i.next();
          String href = linkElement.getAttributeValue("href");
          if (href == null) continue;
          // An A element can contain other tags, so extract the text from its content:
          String label = linkElement.getContent().getTextExtractor().toString();
          System.out.println(href + " (" + label + ")");
        }

        System.out.println("\nAll text from BODY (excluding content inside SCRIPT and STYLE elements):");
        Element bodyElement = source.findNextElement(0, HTMLElementName.BODY);
        Segment contentSegment = (bodyElement == null) ? source : bodyElement.getContent();
        System.out.println(contentSegment.getTextExtractor().setIncludeAttributes(true).toString());
      }

      private static String getTitle(Source source) {
        Element titleElement = source.findNextElement(0, HTMLElementName.TITLE);
        if (titleElement == null) return null;
        // The TITLE element never contains other tags, so just decode it, collapsing whitespace:
        return CharacterReference.decodeCollapseWhiteSpace(titleElement.getContent());
      }

      private static String getMetaValue(Source source, String key) {
        for (int pos = 0; pos < source.length();) {
          StartTag startTag = source.findNextStartTag(pos, "name", key, false);
          if (startTag == null) return null;
          if (startTag.getName() == HTMLElementName.META)
            return startTag.getAttributeValue("content"); // attribute values are automatically decoded
          pos = startTag.getEnd();
        }
        return null;
      }
    }

  4. #4
    travishein's Avatar
    travishein is offline Senior Member
    Join Date
    Sep 2009
    Location
    Canada
    Posts
    684
    Rep Power
    6

    Default

    I think fetching the contents of the link would require working with an HTTP client, such as commons-httpclient, where we would have it GET the URL for us.

    That is, the parser on its own is only good for decoding a single HTML document at a time.

    I was assuming you would have commons-httpclient or another HTTP client fetch the original file, parse it into the pieces you were interested in with this HTML parser, and then, if there were absolute links you wanted to traverse, invoke a GET on those as well. For relative links, knowing the URL of the page that was fetched, you would compute what the URL path to that relative location would be, then fetch those.

    I would think a kind of "fetch queue" would be handy for this: keep a list of absolute URLs (all relative URLs having first been converted into absolute ones), and as you observe links in the current page that should also be fetched, add them to the queue, so the next time the processing loop runs it fetches those too. It also needs to track what has already been fetched, so it does not go fetch the same thing again.

    This whole crawl or spider of many pages can get quite complicated, as there are a lot of edge cases to think about, and it is likely very difficult to build one from scratch. Have you had a look at the Apache Nutch project? About Nutch
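
    To illustrate the queue idea, here is a minimal sketch (the class and method names are my own invention, not from any library): a queue of pending absolute URLs backed by a "seen" set, so a URL is only ever fetched once.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a crawler "fetch queue": absolute URLs are queued
// at most once; anything already seen is silently dropped.
public class FetchQueue {
  private final Queue<String> pending = new ArrayDeque<String>();
  private final Set<String> seen = new HashSet<String>();

  /** Queue an absolute URL; returns false if it was already seen. */
  public boolean offer(String absoluteUrl) {
    if (!seen.add(absoluteUrl)) {
      return false; // already fetched or already queued
    }
    return pending.offer(absoluteUrl);
  }

  /** Next URL to fetch, or null when nothing is pending. */
  public String poll() {
    return pending.poll();
  }

  public boolean isEmpty() {
    return pending.isEmpty();
  }
}
```

    The processing loop would then poll a URL, GET it, parse out its links, convert them to absolute form, and offer them back to the queue; duplicates are dropped automatically.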

  5. #5
    nijil is offline Member
    Join Date
    Feb 2010
    Posts
    14
    Rep Power
    0

    Default Some more clarification

    So you are telling me we need another program which actually downloads the file and gives it as input to the parser, which extracts the links and text and hands the links back to that program to continue downloading, using a queue. Fine, I am OK with that. My question is: how do you actually get those links from an HTML page when they are not given in "href" format, e.g. src links and links in JavaScript? How do we know those are actually links, either absolute or relative? In HttpComponents, which one should we use, the client or the core? I guess it is the client. One more thing: will it get a file if the protocol is "https://"? And can it get other files too, like XML, PDF, and docs?

  6. #6
    travishein's Avatar
    travishein is offline Senior Member
    Join Date
    Sep 2009
    Location
    Canada
    Posts
    684
    Rep Power
    6

    Default

    I haven't gotten around to building a utility to read a page, extract its links, and then read those links. But I would theorize that we know the site's URL from the first page we fetched, so if the original HTML page was http://somesite.com/somepath/index.html and in it we had an anchor tag with href=, or a javascript tag with src="../main.css", we would have to parse out the path component of the original request (e.g. http://somesite.com/somepath/), concatenate it as the prefix to the relative URI, and, where we see .., do some kind of translation to go up one folder (e.g. to ask for http://somesite.com/main.css).
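
    Actually, the JDK can already do this translation: java.net.URI.resolve() implements standard relative-reference resolution, so there is no need to hand-roll the ".." handling. A small sketch (the helper method name is mine):

```java
import java.net.URI;

public class ResolveLinks {
  // Resolve a (possibly relative) link against the URL of the page it was found on.
  static String resolve(String pageUrl, String link) {
    return URI.create(pageUrl).resolve(link).toString();
  }

  public static void main(String[] args) {
    String page = "http://somesite.com/somepath/index.html";
    System.out.println(resolve(page, "../main.css"));          // http://somesite.com/main.css
    System.out.println(resolve(page, "contact.html"));         // http://somesite.com/somepath/contact.html
    System.out.println(resolve(page, "/direct/contact.html")); // http://somesite.com/direct/contact.html
  }
}
```

    This also answers the earlier question about links like /direct/contact.html: a leading slash is resolved against the site root.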

    Not sure what you mean about the core and client parts? commons-httpclient has just the httpclient.jar file, right? HttpComponents - HttpComponents Downloads, the binary zip download. (The newer HttpComponents project is split into httpcore and httpclient; for fetching pages you want the client, which builds on the core.)

    The https protocol is supported by commons-httpclient, though it uses the Java security mechanism for certificates. This means it will have problems with self-signed SSL certificates on development machines unless you import the certificate into a local Java keystore. I have not gotten into figuring that out; it has only come up for me when using https with valid certificates.

    Other file types, like PDF files and images, can be fetched too. In the response object we get back from invoking the HTTP GET, we can see the Content-Type HTTP header set by the server, and here is where you would switch between the HTML parser and a handler that saves binary content out. It is possible to get an input stream to the response content and write it out.
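
    That dispatch decision can be sketched with a tiny helper (the method name is mine; note that the header may carry a charset suffix that has to be stripped before comparing):

```java
import java.util.Locale;

public class ContentTypes {
  // Decide whether a Content-Type response header denotes an HTML page
  // that should go to the parser, as opposed to binary content to save out.
  static boolean isHtml(String contentTypeHeader) {
    if (contentTypeHeader == null) {
      return false;
    }
    // strip any parameters, e.g. "text/html; charset=UTF-8" -> "text/html"
    String mediaType = contentTypeHeader.split(";")[0].trim().toLowerCase(Locale.ROOT);
    return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
  }
}
```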

  7. #7
    nijil is offline Member
    Join Date
    Feb 2010
    Posts
    14
    Rep Power
    0

    Default Well, that clears up a lot of things. Great, and thanks. One small clarification

    I understood almost everything, but how do you actually get the URLs written in src and other tags, other than href?

  8. #8
    travishein's Avatar
    travishein is offline Senior Member
    Join Date
    Sep 2009
    Location
    Canada
    Posts
    684
    Rep Power
    6

    Default

    but how do you actually get the URLs written in src and other tags, other than href?
    This is done using their node API, where we determine whether the tag we have found is, for example, an "img" tag, and then extract the value of its "src" attribute.

    For example, modifying my earlier example above, where I found all the <input type="hidden"> elements:
    Java Code:
    // ... (requires import org.htmlparser.tags.ImageTag in addition to the earlier imports)
    Parser parser = new Parser(new Lexer(new Page(new InputStreamSource(inputStream))));
    NodeFilter filter = new NodeFilter() {
      public boolean accept(Node n) {
        return n.getText().toLowerCase().indexOf("img") >= 0;
      }
    };

    try {
      NodeList nodes = parser.parse(filter);
      for (SimpleNodeIterator it = nodes.elements(); it.hasMoreNodes();) {
        Node node = it.nextNode();
        if (ImageTag.class.isAssignableFrom(node.getClass())) {
          ImageTag tag = (ImageTag) node;
          // resolves the src attribute; see also: http://htmlparser.sourceforge.net/javadoc/index.html
          String srcHref = tag.getImageURL();
        }
      } // for
    }
    catch (ParserException ex) {
      throw new IOException("unable to parse content: " + ex.getMessage(), ex);
    }
    Last edited by travishein; 03-01-2010 at 06:53 PM.

