Java XML

The Benefits of JAXP

by , 11-29-2011 at 02:58 AM (1032 Views)
One of the most important technologies available in Java is the set of APIs for working with XML. There are basically two ways to work with XML documents. SAX is an event-driven approach that processes XML using callbacks to handle the relevant events. DOM builds an in-memory tree structure of the XML document. Sun Microsystems created the Java API for XML Processing (JAXP), a toolkit that makes XML manageable for all developers. It is a key component for exploiting the possibilities of XML technology, such as building web services.

In this article I'm assuming that you have some basic knowledge of XML, although you may not know very much about XML parsing. If not, there are a large number of books available to help you understand the basics of XML. Now let's get started!

There are two key things for any developer using JAXP to remember when deciding which of the two APIs to use for parsing an XML document. If you are focused on making one pass through the document and want to use the events this generates to capture key information, then the Simple API for XML (SAX) is the API you want. If you are looking to manipulate, transform or query a document, then you are better off using the Document Object Model (DOM). In fact, the JAXP interfaces are better thought of as abstraction layers than as parser APIs, since you are able to plug in the parsers that you prefer to perform these operations.

The Basics
Whatever you are trying to do, the process of using a parser is generally the same. The steps are the following:
  1. Create a parser object
  2. Pass your XML document to the parser
  3. Process the results

With this process in mind, one can start to build applications that take advantage of XML. Of course, building applications or web services is more involved than this, but it shows the typical flow for an application using XML.
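
The three steps can be sketched in a few lines of Java. This is only a minimal illustration, assuming the DOM flavor of JAXP that is described later in this article; the class name ThreeStepParse and the sample XML string are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical example class, not part of JAXP itself.
public class ThreeStepParse {
    /** Parses the XML text and returns the name of its root element. */
    static String rootName(String xml) {
        try {
            // Step 1: create a parser object
            DocumentBuilder parser =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Step 2: pass the XML document to the parser
            Document doc = parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Step 3: process the results
            return doc.getDocumentElement().getTagName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<order><item>book</item></order>"));
    }
}
```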



Types of parsers
There are different ways to categorize parsers. Some parsers support the Document Object Model (DOM), and others support the Simple API for XML (SAX). Parsers built on these models are written in a number of languages, including Java, Perl and C++. One can also differentiate between validating and non-validating parsers. XML documents that follow the basic tagging rules are called well-formed documents. XML documents that use a schema (or, for older documents, a DTD) and follow the rules defined in that schema or DTD are called valid documents. The XML specification requires all parsers to report errors when they find that a document is not well-formed. Validation, however, is a completely different issue. Validating parsers validate XML documents as they parse them. Non-validating parsers do not check validity at all. In other words, if an XML document is well-formed, a non-validating parser doesn't care whether the document follows the rules specified in its schema (if any).
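
To make the difference between well-formed and valid concrete, here is a small sketch. It assumes a made-up document with an inline DTD that requires a to element inside note; the class name ValidityCheck and the sample data are illustrative only. The same document is parsed once without and once with validation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class used only to illustrate the two parser modes.
public class ValidityCheck {
    // Well-formed but invalid: the DTD says <note> must contain a <to> element.
    static final String INVALID_NOTE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"
        + "<note><from>Ann</from></note>";

    /** Returns true if the document parses without errors in the given mode. */
    static boolean parsesCleanly(String xml, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validation errors are recoverable by default, so turn them
            // into exceptions to make them visible to the caller.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or invalid in validating mode
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-validating: " + parsesCleanly(INVALID_NOTE, false));
        System.out.println("validating:     " + parsesCleanly(INVALID_NOTE, true));
        System.out.println("not well-formed: " + parsesCleanly("<note><to>Bob</note>", false));
    }
}
```

A document that is not well-formed is rejected in both modes, while the invalid note is only rejected when validation is turned on.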

The benefits of a non-validating parser
The benefit of using a non-validating parser is the gain in speed and efficiency from skipping validation of the document. It takes a significant amount of effort for an XML parser to process a schema and make sure that every element in an XML document follows the rules of the schema. Skipping validation makes sense when you are confident that the XML document is already valid (because it was produced within your organization or comes from a trusted source), so there's no point in validating it again. Another scenario is when you simply want to find all of the XML tags in a document. Once you have found them, you can extract and process the data they contain.

The Simple API for XML (SAX)
The SAX API is an event-driven means of working with the contents of XML documents. It was developed by David Megginson and other members of the XML-Dev mailing list.
When you parse an XML document with a SAX parser, the parser generates events at various points in your document. You then use callback functions to decide what to do with each of those events. A SAX parser generates events at the start and end of a document, at the start and end of an element, when it finds characters inside an element, and at several other points. You write the Java code (callback) that handles each event, and you decide what to do with the information you get from the parser.

Working with SAX
In the SAX model, we send our XML document to the parser, and the parser notifies us when certain events happen. It's up to us to decide what we want to do with those events; if we ignore them, the information in the event is discarded. The SAX API defines a number of events, and you write Java code to handle the ones you care about. If you don't care about a certain type of event, you don't have to write any code at all: just ignore the event, and the parser will discard it. Here is a list of the most commonly used SAX events. There are other SAX events, but they are not relevant for this article. The callbacks for these events are part of the DefaultHandler class in the org.xml.sax.helpers package.
  • startDocument - Signals the start of the document.
  • endDocument - Signals the end of the document.
  • startElement - Signals the start of an element. The parser fires this event when all of the contents of the opening tag have been processed. This includes the name of the tag and any attributes it might have.
  • endElement - Signals the end of an element.
  • characters - Contains character data, similar to a DOM Text node.
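
To see the order in which these events arrive, the following sketch records each callback in a list. The class name EventLogger and the sample document are assumptions made for this illustration. Note that a parser is allowed to deliver one run of text in several characters calls, so real handlers should not rely on getting it in one piece.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records every SAX event it receives.
public class EventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("startElement:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters:" + new String(ch, start, length));
    }

    /** Parses the XML text and returns the events in the order they fired. */
    static List<String> log(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        for (String event : log("<a><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```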

A Simple SAX Parser using JAXP
A simple SAX parser typically follows this routine:
  • Create a SAXParser instance using the SAXParserFactory, which instantiates a specific vendor's parser implementation.
  • Register callback implementations (by extending DefaultHandler or another callback class).
  • Start parsing and sit back as your callback implementations are fired off.



JAXP's SAX component provides a simple means for doing all of this. JAXP lets you specify the parser implementation through a Java system property. The default parser is Sun's version of Xerces. You can switch to another implementation by changing the system property or the classpath, without any need to recompile your code. That is the beauty of JAXP.
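
One way to check which implementation JAXP picked is to print the class names at run time. The sketch below is a hedged illustration; the class name WhichParser is made up, and the names it prints depend on your JDK and classpath.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical utility that reports the parser implementation in use.
public class WhichParser {
    /** Name of the factory class that JAXP selected at run time. */
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Name of the vendor XMLReader wrapped by the JAXP SAXParser. */
    static String readerClassName() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            return parser.getXMLReader().getClass().getName();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Factory:   " + factoryClassName());
        System.out.println("XMLReader: " + readerClassName());
    }
}
```

To try another implementation, put its jar on the classpath and start the JVM with the javax.xml.parsers.SAXParserFactory system property set to that implementation's factory class; the program itself stays unchanged.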

Once you have set up the factory, invoking newSAXParser() returns a ready-to-use instance of the JAXP SAXParser class. This class wraps an underlying SAX parser (an instance of the SAX class org.xml.sax.XMLReader). It also protects you from using any vendor-specific additions to the parser class. This class allows actual parsing behavior to be kicked off. Figure 1 shows the handler with all the callbacks.

Java Code:
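import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SimpleHandler extends DefaultHandler {
    // SAX callback implementations; DefaultHandler supplies empty defaults
    // for ContentHandler, ErrorHandler, etc., so we override only what we need.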

    private Writer out;

    public SimpleHandler() throws SAXException {
        try {
            out = new OutputStreamWriter(System.out, "UTF-8");
        } catch (IOException e) {
            throw new SAXException("Error getting output handle.", e);
        }
    }

    public void startDocument() throws SAXException {
        print("<?xml version=\"1.0\"?>\n");
    }

    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
        throws SAXException {

        print("<" + qName);
        if (atts != null) {
            for (int i=0, len = atts.getLength(); i<len; i++) {
                print(" " + atts.getQName(i) + 
                      "=\"" + atts.getValue(i) + "\"");
            }
        }
        print(">");
    }

    public void endElement(String uri, String localName, 
                           String qName) throws SAXException {
        print("</" + qName + ">\n");
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        print(new String(ch, start, len));
    }

    private void print(String s) throws SAXException {
        try {
            out.write(s);
            out.flush();
        } catch (IOException e) {
            throw new SAXException("IO Error Occurred.", e);
        }
    }
}
Figure 1

The next figure shows the steps for how to create, configure, and use a SAX factory.

Java Code:
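import java.io.File;

import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SimpleSAXParsing {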
    public static void main(String[] args) {
        try {
            if (args.length != 1) {
                System.err.println ("Usage: java SimpleSAXParsing [filename]");
                System.exit (1);
            }
            // Get SAX Parser Factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Turn on validation, and turn off namespaces
            factory.setValidating(true);
            factory.setNamespaceAware(false);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new File(args[0]), new SimpleHandler());
        } catch (ParserConfigurationException e) {
            System.out.println("The underlying parser does not support " +
                               "the requested features.");
        } catch (FactoryConfigurationError e) {
            System.out.println("Error occurred obtaining SAX Parser Factory.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Figure 2

Working with Document Object Model (DOM)
The Document Object Model defines an interface that enables programs to access and update the style, structure, and contents of XML documents. XML parsers that support the DOM implement that interface. When you use a DOM parser to parse an XML document, you get back a tree structure that contains all of the elements of the document. The DOM provides a variety of functions you can use to examine the contents and structure of the document. Here are the methods you will commonly use:
  • Document.getDocumentElement() Returns the root element of the document.
  • Node.getFirstChild() Returns the first child of a given Node.
  • Node.getLastChild() Returns the last child of a given Node.
  • Node.getNextSibling() Returns the next sibling of a given Node.
  • Node.getPreviousSibling() Returns the previous sibling of a given Node.
  • Element.getAttribute(attrName) For a given Element, returns the value of the attribute with the requested name.
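
As a quick illustration of these methods, the sketch below walks a tiny document. The class name DomNavigation and the sample XML are invented for the example; note that getAttribute is called on an Element, since that is the interface that defines it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class showing basic DOM navigation.
public class DomNavigation {
    /** Parses XML text into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    /** Collects the names of the element children, in document order. */
    static List<String> childElementNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<library name=\"city\"><book/><dvd/><book/></library>");
        Element root = doc.getDocumentElement();
        System.out.println("root: " + root.getTagName());
        System.out.println("name attribute: " + root.getAttribute("name"));
        System.out.println("children: " + childElementNames(root));
        // Walk backwards from the last child to its previous sibling
        System.out.println("second to last: "
            + root.getLastChild().getPreviousSibling().getNodeName());
    }
}
```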


A Simple DOM Parser using JAXP
Using DOM with JAXP is almost the same as using SAX. The differences are primarily in the names of the classes and the return types. JAXP is responsible for returning an org.w3c.dom.Document object from parsing. This object represents the XML document and is made up of DOM nodes that represent the elements, attributes, and other XML constructs.

Unlike with SAX, we don't have any callback handler, so it is just a matter of parsing the XML document and then using the DOM object to address our needs. In this example, we show how to write out the DOM tree both forwards and in reverse.

Java Code:
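import java.io.File;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.w3c.dom.Text;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XMLDocumentWriter {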
    /** Output will use this encoding */
    static final String outputEncoding = "UTF-8";

    /** Output goes here */
    private PrintWriter out;

    /** Constants used for JAXP 1.2 */
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    /** Initialize the output stream */
    public XMLDocumentWriter(PrintWriter out) { this.out = out; }

    /** Close the output stream. */
    public void close() { out.close(); }

    /** Output a DOM Node (such as a Document) to the output stream */
    public void write(Node node) { write(node, ""); }

    /**
     * Output the specified DOM Node object, printing it using the specified
     * indentation string
     **/
    public void write(Node node, String indent) {
        // The output depends on the type of the node
        switch(node.getNodeType()) {
        case Node.DOCUMENT_NODE: {       // If it's a Document node
            Document doc = (Document)node;
            out.println(indent + "<?xml version='1.0'?>");  // Output header
            Node child = doc.getFirstChild();   // Start looping through nodes
            while(child != null) {              // Loop until there are no more nodes
                write(child, indent);           // Output node
                child = child.getNextSibling(); // Get next node
            }
            break;
        } 
        case Node.DOCUMENT_TYPE_NODE: {  // If it's a <!DOCTYPE> tag
            DocumentType doctype = (DocumentType) node;
            // Note that DOM Level 1 does not give us information about
            // the public or system ids of the doctype, so we can't output
            // a complete <!DOCTYPE> tag here. DOM Level 2 adds
            // getPublicId() and getSystemId() for this purpose.
            out.println("<!DOCTYPE " + doctype.getName() + ">");
            break;
        }
        case Node.ELEMENT_NODE: {        // Most nodes are Elements
            Element elt = (Element) node;
            out.print(indent + "<" + elt.getTagName());   // Begin start tag
            NamedNodeMap attrs = elt.getAttributes();     
            for(int i = 0; i < attrs.getLength(); i++) { 
                Node a = attrs.item(i);
                out.print(" " + a.getNodeName() + "='" +  
                          fixup(a.getNodeValue()) + "'"); 
            }
            out.println(">");                             // Finish start tag
            // Increase indent
            String newindent = indent + "    ";          
            Node child = elt.getFirstChild();             
            while(child != null) {                        
                write(child, newindent);                  
                child = child.getNextSibling();           
            }

            out.println(indent + "</" +                   // Output end tag
                        elt.getTagName() + ">");
            break;
        }
        case Node.TEXT_NODE: {                   // Plain text node
            Text textNode = (Text)node;
            String text = textNode.getData().trim();   
            if ((text != null) && text.length() > 0)   
                out.println(indent + fixup(text));     
            break;
        }
        case Node.PROCESSING_INSTRUCTION_NODE: {  // Handle PI nodes
            ProcessingInstruction pi = (ProcessingInstruction)node;
            out.println(indent + "<?" + pi.getTarget() +
                               " " + pi.getData() + "?>");
            break;
        }
        case Node.ENTITY_REFERENCE_NODE: {        // Handle entities
            out.println(indent + "&" + node.getNodeName() + ";");
            break;
        }
        case Node.CDATA_SECTION_NODE: {           // Output CDATA sections
            CDATASection cdata = (CDATASection)node;
            // Careful! Don't put a CDATA section in the program itself!
            out.println(indent + "<" + "![CDATA[" + cdata.getData() +
                        "]]" + ">");
            break;
        }
        case Node.COMMENT_NODE: {                 // Comments
            Comment c = (Comment)node;
            out.println(indent + "<!--" + c.getData() + "-->");
            break;
        }
        default:   // Hopefully, this won't happen too much!
            System.err.println("Ignoring node: " + node.getClass().getName());
            break;
        }
    }

    
    /** Output a DOM Node (such as a Document) to the output stream in reverse order*/
    public void reverse(Node node) { reverse(node, ""); }

    /**
     * Output the specified DOM Node object, printing it using the specified
     * indentation string in reverse order
     **/
    public void reverse(Node node, String indent) {
        // The output depends on the type of the node
        switch(node.getNodeType()) {
        case Node.DOCUMENT_NODE: {       // If its a Document node
            Document doc = (Document)node;
            out.println(indent + "<?xml version='1.0'?>");  // Output header
            Node child = doc.getLastChild();   // Get the last node
            while(child != null) {              // Loop 'till no more nodes
                reverse(child, indent);           // Output node
                child = child.getPreviousSibling(); // Get previous node
            }
            break;
        } 
        case Node.DOCUMENT_TYPE_NODE: {  // It is a <!DOCTYPE> tag
            DocumentType doctype = (DocumentType) node;
            // Similar to write method in terms of what's possible for <!DOCTYPE> tag
            out.println("<!DOCTYPE " + doctype.getName() + ">");
            break;
        }
        case Node.ELEMENT_NODE: {        // Most nodes are Elements
            Element elt = (Element) node;
            out.print(indent + "<" + elt.getTagName());   // Begin start tag
            NamedNodeMap attrs = elt.getAttributes();     
            for(int i = attrs.getLength()-1; i >= 0; i--) {  // Loop through them in reverse
                Node a = attrs.item(i);
                out.print(" " + a.getNodeName() + "='" +  
                          fixup(a.getNodeValue()) + "'"); 
            }
            out.println(">");                             // Finish start tag

            String newindent = indent + "    ";           // Increase indent
            Node child = elt.getLastChild();             // Get last child
            while(child != null) {                        // Loop 
                reverse(child, newindent);                  // Output child
                child = child.getPreviousSibling();           // Get previous child
            }

            out.println(indent + "</" +                   // Output end tag
                        elt.getTagName() + ">");
            break;
        }
        case Node.TEXT_NODE: {                   // Plain text node
            Text textNode = (Text)node;
            String text = textNode.getData().trim();   // Strip off space
            if ((text != null) && text.length() > 0)   // If non-empty
                out.println(indent + fixup(text));     // print text
            break;
        }
        case Node.PROCESSING_INSTRUCTION_NODE: {  // Handle PI nodes
            ProcessingInstruction pi = (ProcessingInstruction)node;
            out.println(indent + "<?" + pi.getTarget() +
                               " " + pi.getData() + "?>");
            break;
        }
        case Node.ENTITY_REFERENCE_NODE: {        // Handle entities
            out.println(indent + "&" + node.getNodeName() + ";");
            break;
        }
        case Node.CDATA_SECTION_NODE: {           // Output CDATA sections
            CDATASection cdata = (CDATASection)node;
            // Careful! Don't put a CDATA section in the program itself!
            out.println(indent + "<" + "![CDATA[" + cdata.getData() +
                        "]]" + ">");
            break;
        }
        case Node.COMMENT_NODE: {                 // Comments
            Comment c = (Comment)node;
            out.println(indent + "<!--" + c.getData() + "-->");
            break;
        }
        default:   // Hopefully, this won't happen too much!
            System.err.println("Ignoring node: " + node.getClass().getName());
            break;
        }
    }

    // This method replaces reserved characters with entities.
    String fixup(String s) {
        StringBuilder sb = new StringBuilder();
        int len = s.length();
        for(int i = 0; i < len; i++) {
            char c = s.charAt(i);
            switch(c) {
            default: sb.append(c); break;
            case '<': sb.append("&lt;"); break;
            case '>': sb.append("&gt;"); break;
            case '&': sb.append("&amp;"); break;
            case '"': sb.append("&quot;"); break;
            case '\'': sb.append("&apos;"); break;
            }
        }
        return sb.toString();
    }

    
    private static void usage() {
        System.err.println("Usage: XMLDocumentWriter [-options] <file.xml>");
        System.err.println("       -dtd = DTD validation");
        System.err.println("       -xsd | -xsdss <file.xsd> = W3C XML Schema validation using xsi: hints");
        System.err.println("           in instance document or schema source <file.xsd>");
        System.err.println("       -ws = do not create element content whitespace nodes");
        System.err.println("       -co[mments] = do not create comment nodes");
        System.err.println("       -cd[ata] = put CDATA into Text nodes");
        System.err.println("       -e[ntity-ref] = create EntityReference nodes");
        System.err.println("       -usage or -help = this message");
        System.exit(1);
    }

    
    /**
     * @param args command-line options followed by the XML file name
     */
    public static void main(String[] args) throws Exception {
        String filename = null;
        boolean dtdValidate = false;
        boolean xsdValidate = false;
        String schemaSource = null;

        boolean ignoreWhitespace = false;
        boolean ignoreComments = false;
        boolean putCDATAIntoText = false;
        boolean createEntityRefs = false;

        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-dtd")) {
                dtdValidate = true;
            } else if (args[i].equals("-xsd")) {
                xsdValidate = true;
            } else if (args[i].equals("-xsdss")) {
                if (i == args.length - 1) {
                    usage();
                }
                xsdValidate = true;
                schemaSource = args[++i];
            } else if (args[i].equals("-ws")) {
                ignoreWhitespace = true;
            } else if (args[i].startsWith("-co")) {
                ignoreComments = true;
            } else if (args[i].startsWith("-cd")) {
                putCDATAIntoText = true;
            } else if (args[i].startsWith("-e")) {
                createEntityRefs = true;
            } else if (args[i].equals("-usage")) {
                usage();
            } else if (args[i].equals("-help")) {
                usage();
            } else {
                filename = args[i];

                // Must be last arg
                if (i != args.length - 1) {
                    usage();
                }
            }
        }
        if (filename == null) {
            usage();
        }

        // Step 1: create a DocumentBuilderFactory and configure it
        DocumentBuilderFactory dbf =
            DocumentBuilderFactory.newInstance();

        // Set namespaceAware to true to get a DOM Level 2 tree with nodes
        // containing namespace information.  This is necessary because the
        // default value from JAXP 1.0 was defined to be false.
        dbf.setNamespaceAware(true);

        // Set the validation mode to either: no validation, DTD
        // validation, or XSD validation
        dbf.setValidating(dtdValidate || xsdValidate);
        if (xsdValidate) {
            try {
                dbf.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            } catch (IllegalArgumentException x) {
                // This can happen if the parser does not support JAXP 1.2
                System.err.println(
                    "Error: JAXP DocumentBuilderFactory attribute not recognized: "
                    + JAXP_SCHEMA_LANGUAGE);
                System.err.println(
                    "Check to see if parser conforms to JAXP 1.2 spec.");
                System.exit(1);
            }
        }

        // Set the schema source, if any.  See the JAXP 1.2 maintenance
        // update specification for more complex usages of this feature.
        if (schemaSource != null) {
            dbf.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));
        }

        // Optional: set various configuration options
        dbf.setIgnoringComments(ignoreComments);
        dbf.setIgnoringElementContentWhitespace(ignoreWhitespace);
        dbf.setCoalescing(putCDATAIntoText);
        // The opposite of creating entity ref nodes is expanding them inline
        dbf.setExpandEntityReferences(!createEntityRefs);

        // Step 2: create a DocumentBuilder that satisfies the constraints
        // specified by the DocumentBuilderFactory
        DocumentBuilder db = dbf.newDocumentBuilder();

        // Set an ErrorHandler before parsing
        OutputStreamWriter errorWriter =
            new OutputStreamWriter(System.err, outputEncoding);
        db.setErrorHandler(
            new MyErrorHandler(new PrintWriter(errorWriter, true)));

        // Step 3: parse the input file
        Document doc = db.parse(new File(filename));

        // Print out the DOM tree
        OutputStreamWriter outWriter =
            new OutputStreamWriter(System.out, outputEncoding);
        XMLDocumentWriter xmlDocWriter = new XMLDocumentWriter(new PrintWriter(outWriter, true));
        xmlDocWriter.write(doc);
        xmlDocWriter.reverse(doc);
    }

    // Error handler to report errors and warnings
    private static class MyErrorHandler implements ErrorHandler {
        /** Error handler output goes here */
        private PrintWriter out;

        MyErrorHandler(PrintWriter out) {
            this.out = out;
        }

        /**
         * Returns a string describing parse exception details
         */
        private String getParseExceptionInfo(SAXParseException spe) {
            String systemId = spe.getSystemId();
            if (systemId == null) {
                systemId = "null";
            }
            String info = "URI=" + systemId +
                " Line=" + spe.getLineNumber() +
                ": " + spe.getMessage();
            return info;
        }

        // The following methods are standard SAX ErrorHandler methods.
        // See SAX documentation for more info.

        public void warning(SAXParseException spe) throws SAXException {
            out.println("Warning: " + getParseExceptionInfo(spe));
        }
        
        public void error(SAXParseException spe) throws SAXException {
            String message = "Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }

        public void fatalError(SAXParseException spe) throws SAXException {
            String message = "Fatal Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }
    }


}
Figure 3

Key Point
The key point I'll make is that in working with the Nodes in the DOM tree, you have to check the type of each Node before you work with it. Certain methods, such as getAttributes, return null for some node types. If you don't check the node type, you'll get unexpected results (at best) and exceptions (at worst).
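
Here is a small sketch of that check in practice. It counts the attributes in a tree and only calls getAttributes on Element nodes; the class name NodeTypeCheck and the sample document are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class showing why the node type must be checked.
public class NodeTypeCheck {
    /** Parses XML text into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    /** Counts all attributes in the subtree rooted at the given node. */
    static int countAttributes(Node node) {
        int count = 0;
        // Only Element nodes carry attributes; for text, comment and
        // document nodes getAttributes() returns null.
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            count += node.getAttributes().getLength();
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            count += countAttributes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<a x=\"1\"><b y=\"2\" z=\"3\">text</b><!-- c --></a>");
        System.out.println("attributes: " + countAttributes(doc));
        // The text node inside <b> has no attribute map at all
        Node text = doc.getDocumentElement().getFirstChild().getFirstChild();
        System.out.println("text node attributes: " + text.getAttributes());
    }
}
```

Without the type check, the call to getLength() would throw a NullPointerException as soon as the traversal reached the Document, text or comment node.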

Which parser should you use?
Use a DOM parser when:
  • You need to know a lot about the structure of a document
  • You need to move parts of the document around (you might want to sort certain elements, for example)
  • You need to use the information in the document more than once

Use a SAX parser when:
  • You only need to extract a few elements from an XML document
  • You don't have much memory to work with
  • You're only going to use the information in the document once (as opposed to parsing the information once, then using it many times later)
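
As an illustration of the "extract a few elements" case, here is a hedged sketch of a SAX handler that collects only the text of title elements and ignores everything else. The class name TitleExtractor, the element name title and the sample data are all made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps only the text of <title> elements.
public class TitleExtractor extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString());
            current = null;
        }
    }

    /** Returns the text of every title element, in document order. */
    static List<String> titles(String xml) {
        TitleExtractor handler = new TitleExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(titles(
            "<books><book><title>Dune</title><price>9</price></book>"
            + "<book><title>Emma</title></book></books>"));
    }
}
```

Only the collected titles are held in memory; the rest of the document is discarded as it streams past, which is the trade-off that makes SAX attractive for large inputs.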


In this article, we have covered some of the basics related to using JAXP and the benefits it provides in relation to XML processing. In a future article we will look at some of the more advanced functions used with JAXP for both SAX and DOM parsers.


Updated 11-30-2011 at 03:01 PM by Java XML

Categories
JAXP , DOM , SAX.
