Monday, May 15, 2006

A sample SAX parser

I thought of writing a SAX parser, just to see how it would work with the web.xml file for the Ajax example.

Here is the XML file. (This is the file that the line of code 'm.parseXml("C:/temp/ajaxtest/web.xml");' picks up.)

NOTE:- Blogger tries to render HTML tags in a blog. So where you see [] in tags, please read them as <>.

[?xml version="1.0" encoding="UTF-8"?]
[web-app testAttribute="test1"]
[servlet]
[servlet-name]SimpleTestServlet[/servlet-name]
[servlet-class]suchak.ajax.test.SimpleTestServlet[/servlet-class]
[/servlet]
[servlet]
[servlet-name]AnotherSimpleTestServlet[/servlet-name]
[servlet-class]suchak.ajax.test.AnotherSimpleTestServlet[/servlet-class]
[/servlet]
[servlet-mapping]
[servlet-name]SimpleTestServlet[/servlet-name]
[url-pattern]*.cmd[/url-pattern]
[/servlet-mapping]
[/web-app]

The ContentHandler interface is the key interface of the SAX parsing API. It has a convenience implementation called DefaultHandler, which provides empty implementations of the callbacks, so one can override just the methods one needs. I also chose to use fully qualified names so that it is clear at first glance where each class comes from, without having to look up the imports.

I am using the Apache Xerces parser, so if you try this out you will need to include the Apache Xerces libraries in your classpath.
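To illustrate overriding only the methods one needs, here is a minimal sketch. It is not the class from this post: the ServletCounter class and its countServlets helper are hypothetical names made up for the illustration, and it assumes the parser bundled with the JDK (obtained through javax.xml.parsers.SAXParserFactory) rather than a separately installed one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: counts the servlet elements it sees.
class ServletCounter extends DefaultHandler {

    private int servletCount = 0;

    // Only startElement is overridden; every other callback keeps
    // the empty implementation inherited from DefaultHandler.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if ("servlet".equals(qName)) {
            servletCount++;
        }
    }

    // Parses an XML string with the JDK's default SAX parser
    // (an assumption of this sketch) and returns the count.
    static int countServlets(String xml) {
        try {
            ServletCounter handler = new ServletCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.servletCount;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<web-app><servlet/><servlet/><servlet-mapping/></web-app>";
        System.out.println("servlet elements: " + countServlets(xml));
    }
}
```

Because the factory's default parser is not namespace aware, qName carries the plain element name, which is why the sketch compares against it.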

Here is the class. Please note the ContentHandler itself is an inner class called TestContentHandler.

import java.io.File;

public class Main {

public static void main(String [] args){
Main m = new Main();
m.parseXml("C:/temp/ajaxtest/web.xml");
}

private void parseXml(String file) {
try {
org.xml.sax.InputSource is = new org.xml.sax.InputSource(
new java.io.FileInputStream(file));
org.xml.sax.XMLReader reader =
org.xml.sax.helpers.XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
reader.setContentHandler(new Main.TestContentHandler());
log(" -- Staring XML Parsing ! --");
reader.parse(is);
log(" -- Finished XML Parsing ! --");
} catch (java.io.FileNotFoundException e) {
e.printStackTrace();
} catch (org.xml.sax.SAXException e) {
e.printStackTrace();
} catch (java.io.IOException e) {
e.printStackTrace();
}

}

public class TestContentHandler extends org.xml.sax.helpers.DefaultHandler {

String currentElement = null;
java.util.HashMap elementValues = new java.util.HashMap();

public void startElement(
String namespaceURI,
String localName,
String qName,
org.xml.sax.Attributes attributes)
throws org.xml.sax.SAXException {
log(
" Start element " +
" ---- qName :::: '" + qName + "'"
);
// Element is starting to be read, so record it as the current element and
// create a fresh buffer to collect its character data.
currentElement = qName;
elementValues.put(currentElement, new StringBuffer());

int noofattributes = attributes.getLength();

for(int i=0;i<noofattributes;i++){
log(
" Attribute found for element :: '" + qName + "'" +
" att qname :: " + attributes.getQName(i) +
" att type :: " + attributes.getType(i) +
" att value :: " + attributes.getValue(i));
}
}

public void characters(
char[] ch,
int start,
int length)
throws org.xml.sax.SAXException {
log( "Current element being processed in chars :: '" +
currentElement + "'");
String tempString = new String(ch,start,length);
//log( "Value of the above element is :: '" + tempString + "'");
if(tempString.trim().equals("")){
// if empty string just log it and do nothing
log( " -- White space -- ");
}else{

/* This part of the code appends the string to the current element's buffer,
* as per http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/
* ContentHandler.html#characters(char[],%20int,%20int)
* The Parser will call this method to report each chunk of character
* data. SAX parsers may return all contiguous character data in a
* single chunk, or they may split it into several chunks; however,
* all of the characters in any single event must come from the same
* external entity so that the Locator provides useful information.
* The application must not attempt to read from the array outside of
* the specified range.
*/
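StringBuffer buffer = (StringBuffer) elementValues.get(currentElement);
// The buffer is already removed if text follows a child's end tag, so guard against null.
if (buffer != null) {
buffer.append(tempString);
}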
}
}

public void endElement(
String namespaceURI,
String localName,
String qName)
throws org.xml.sax.SAXException {
log(
"End element " +
" ---- qName :: '" + qName + "'"
);
/*
* Current element is fully read, so it is ok to
* remove the element value from the map
*
*/

String value = ((StringBuffer)(elementValues.get(qName))).toString();
// Print the value out if it is not an empty string
if(!value.equals("")){
log(" Value of the above current element :: " + qName +
" is :: "+ value);
}
elementValues.remove(qName);
/*
* We do not reset the current element here as the startElement
* method should do that at the start of an element
*
*/
}
}

private void log(String s){
System.out.println(s);
}
}



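As noted in the code comment above, the parser may deliver the text of a single element across several calls to the characters method (for example, depending on its buffer size), which is why the handler appends each chunk to that element's buffer.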
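The chunk-appending idea can also be sketched with a stack of buffers instead of a map keyed by element name, so that nested elements with the same name do not overwrite each other. This is an alternative sketch, not the code used above: the TextCollector class and its collect helper are hypothetical, and it assumes the JDK's bundled parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: collects "name=text" pairs.
class TextCollector extends DefaultHandler {

    // One buffer per open element; the top belongs to the innermost element.
    private final Deque<StringBuilder> buffers = new ArrayDeque<>();
    // Entries for elements with non-blank text, in the order the elements end.
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        buffers.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so always append.
        if (!buffers.isEmpty()) {
            buffers.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is fully read: take its buffer off the stack.
        String text = buffers.pop().toString().trim();
        if (!text.isEmpty()) {
            values.add(qName + "=" + text);
        }
    }

    // Parses an XML string with the JDK's default SAX parser
    // (an assumption of this sketch) and returns the collected pairs.
    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<servlet>\n"
                + "  <servlet-name>SimpleTestServlet</servlet-name>\n"
                + "  <servlet-class>suchak.ajax.test.SimpleTestServlet</servlet-class>\n"
                + "</servlet>";
        for (String value : collect(xml)) {
            System.out.println(value);
        }
    }
}
```

Trimming happens only once, in endElement, so a value that arrives split across several chunks is reassembled before any whitespace is dropped.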

Here is the output:-

-- Staring XML Parsing ! --

Start element ---- qName :::: 'web-app'
Attribute found for element :: 'web-app' att qname :: testAttribute att type :: CDATA att value :: test1
Current element being processed in chars :: 'web-app'
-- White space --

Start element ---- qName :::: 'servlet'
Current element being processed in chars :: 'servlet'
-- White space --

Start element ---- qName :::: 'servlet-name'
Current element being processed in chars :: 'servlet-name'
End element ---- qName :: 'servlet-name'
Value of the above current element :: servlet-name is :: SimpleTestServlet

Current element being processed in chars :: 'servlet-name'
-- White space --
Start element ---- qName :::: 'servlet-class'

Current element being processed in chars :: 'servlet-class'
End element ---- qName :: 'servlet-class'
Value of the above current element :: servlet-class is :: suchak.ajax.test.SimpleTestServlet

Current element being processed in chars :: 'servlet-class'
-- White space --

Current element being processed in chars :: 'servlet-class'
-- White space --

End element ---- qName :: 'servlet'
Current element being processed in chars :: 'servlet-class'
-- White space --

Start element ---- qName :::: 'servlet'
Current element being processed in chars :: 'servlet'
-- White space --

Start element ---- qName :::: 'servlet-name'
Current element being processed in chars :: 'servlet-name'
End element ---- qName :: 'servlet-name'
Value of the above current element :: servlet-name is :: AnotherSimpleTestServlet

Current element being processed in chars :: 'servlet-name'
-- White space --

Start element ---- qName :::: 'servlet-class'
Current element being processed in chars :: 'servlet-class'
End element ---- qName :: 'servlet-class'
Value of the above current element :: servlet-class is :: suchak.ajax.test.AnotherSimpleTestServlet

Current element being processed in chars :: 'servlet-class'
-- White space --

Current element being processed in chars :: 'servlet-class'
-- White space --

End element ---- qName :: 'servlet'
Current element being processed in chars :: 'servlet-class'
-- White space --

Current element being processed in chars :: 'servlet-class'
-- White space --

Start element ---- qName :::: 'servlet-mapping'
Current element being processed in chars :: 'servlet-mapping'
-- White space --

Start element ---- qName :::: 'servlet-name'
Current element being processed in chars :: 'servlet-name'
End element ---- qName :: 'servlet-name'
Value of the above current element :: servlet-name is :: SimpleTestServlet

Current element being processed in chars :: 'servlet-name'
-- White space --

Start element ---- qName :::: 'url-pattern'
Current element being processed in chars :: 'url-pattern'
End element ---- qName :: 'url-pattern'
Value of the above current element :: url-pattern is :: *.cmd

Current element being processed in chars :: 'url-pattern'
-- White space --

Current element being processed in chars :: 'url-pattern'
-- White space --

End element ---- qName :: 'servlet-mapping'

Current element being processed in chars :: 'url-pattern'
-- White space --

End element ---- qName :: 'web-app'
-- Finished XML Parsing ! --


As seen, all whitespace, including carriage returns, is also reported through the characters method. I did not see a way to tell the Apache Xerces parser not to report whitespace in the characters method.

However, Oracle has a neat XML parser in which one can do this:

SAXParser parser = new SAXParser();
parser.setPreserveWhitespace(false);

Here are more details :- http://www.oracle.com/technology/pub/articles/wang-whitespace.html
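With a standard SAX parser, one possible route is to rely on a DTD: the SAX API says a validating parser must report whitespace in element content through ignorableWhitespace rather than characters. The sketch below assumes the JDK's bundled parser with validation switched on. The small inline DTD is made up for this illustration (the web.xml above has no DTD), and the WhitespaceDemo class and its members are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: separates real text from ignorable whitespace.
class WhitespaceDemo extends DefaultHandler {

    // Illustrative document; the DTD below is an assumption written for this sketch.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE web-app [\n"
            + "  <!ELEMENT web-app (servlet)>\n"
            + "  <!ELEMENT servlet (servlet-name)>\n"
            + "  <!ELEMENT servlet-name (#PCDATA)>\n"
            + "]>\n"
            + "<web-app>\n"
            + "  <servlet>\n"
            + "    <servlet-name>SimpleTestServlet</servlet-name>\n"
            + "  </servlet>\n"
            + "</web-app>\n";

    final StringBuilder text = new StringBuilder();
    int ignorableChars = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data the parser does not classify as ignorable.
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Whitespace between child elements, where the DTD allows no text.
        ignorableChars += length;
    }

    static WhitespaceDemo parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // A validating parser uses the DTD's content models.
            factory.setValidating(true);
            WhitespaceDemo handler = new WhitespaceDemo();
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        WhitespaceDemo handler = parse(SAMPLE);
        System.out.println("text: '" + handler.text + "'");
        System.out.println("ignorable whitespace chars: " + handler.ignorableChars);
    }
}
```

Without a DTD or schema the parser has no way to know which whitespace is insignificant, so in that case trimming inside the handler, as done in the class above, remains the simpler option.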

All in all, it seems pretty simple to write a custom SAX parser with current tools.

1 Comments:

Blogger Twisted said...

> Blogger tries to render HTML tags in a blog. So where you see [] in tags, please read them as <>.

You need to use the HTML entity equivalent for < and > and many other special characters since they have a special meaning in HTML. Or instead of writing your blog post in the "Compose" mode, write only text in "Compose" mode, publish the post then add the related code and XML descriptor in "HTML Mode". It would then be rendered correctly.

Also consider using <pre> tags around code you post to maintain formatting. All this can be done unfortunately in the HTML Mode only.

Also you would require a few changes in the Blog template to make your code and deployment descriptors stand out. Get back to me or just post a comment on this blog post for more details.

--STS

11:45 PM  
