
Push and Pull: complementary sides of XML parsing

Most SAX parsers are built on top of a pull parsing layer. It is an interesting challenge to expose both the pull and the push layers to the user, because this lets an application use pull parsing where needed without having to stop using the SAX API.

Pull and push are not the only two ways to parse XML, and each model can be converted into the other. A pull parser can be converted into a push parser because, during pull parsing, the caller controls the parsing loop and can push each event on to a handler (for an example, see the description of the SAX2 driver for XPP2 at the end of this paper). A push parser can also be converted into a pull parser, but this requires either buffering all of the events produced by the SAX callbacks, or an extra thread that "pulls" more data from the SAX parser and stays suspended until the user asks for more events.
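
The pull-to-push direction can be sketched as follows. This is only an illustration under stated assumptions: the JDK's StAX reader (javax.xml.stream) stands in for the pull parser and is not the XPP2 API discussed in this paper, and PushHandler and PullToPushDriver are hypothetical names, reduced to three kinds of events.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical push-style callback interface, reduced to three events. */
interface PushHandler {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

/** Drives a pull parser in a loop and pushes each event to a handler. */
class PullToPushDriver {
    static void parse(String xml, PushHandler handler) throws XMLStreamException {
        // Assumption: StAX is used here as a stand-in pull parser.
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        // The driver owns the loop: it pulls one event, then pushes it on.
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    handler.startElement(reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    handler.characters(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    handler.endElement(reader.getLocalName());
                    break;
                default:
                    break; // comments, processing instructions, etc. are ignored
            }
        }
        reader.close();
    }
}
```

Because the driver is an ordinary loop, no buffering and no extra thread are needed in this direction.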

The push-to-pull conversion is best exemplified by Pull Parser Wrapper for SAX [8], which converts a SAX-compliant parser into an XML pull parser. However, it requires an extra thread to convert events, and this requirement has proved very difficult to meet in server environments and impossible in EJB container environments. For an example of such difficulties, it is useful to look at the design considerations of Apache SOAP Axis (see, for example, the notes from the Axis Face-to-Face Meeting at http://xml.apache.org/axis/docs/F2F-2.html).
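
For comparison, the buffering variant of the push-to-pull conversion, which avoids the extra thread, might look roughly like the sketch below. It is not the Pull Parser Wrapper for SAX: the class BufferedSaxPullParser, its event constants and getValue() are hypothetical names chosen for illustration, built on the standard JAXP SAX classes.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Queue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Pull-style view over a SAX parse: every event is buffered up front. */
class BufferedSaxPullParser {
    static final int START_TAG = 1, CONTENT = 2, END_TAG = 3, END_DOCUMENT = 4;

    /** One recorded SAX callback. */
    private static final class Event {
        final int type;
        final String value;
        Event(int type, String value) { this.type = type; this.value = value; }
    }

    private final Queue<Event> events = new ArrayDeque<Event>();
    private Event current;

    BufferedSaxPullParser(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that local names are reported
        // The SAX parser runs to completion here, pushing into the buffer.
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override public void startElement(String uri, String local,
                        String raw, Attributes attrs) {
                    events.add(new Event(START_TAG, local));
                }
                @Override public void characters(char[] ch, int start, int length) {
                    events.add(new Event(CONTENT, new String(ch, start, length)));
                }
                @Override public void endElement(String uri, String local, String raw) {
                    events.add(new Event(END_TAG, local));
                }
            });
    }

    /** Advances to the next buffered event and returns its type. */
    int next() {
        current = events.poll();
        return current == null ? END_DOCUMENT : current.type;
    }

    /** Tag name or text of the current event, or null at end of document. */
    String getValue() {
        return current == null ? null : current.value;
    }
}
```

The price of avoiding the thread is that the whole document is parsed and held in memory before the first event can be pulled.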

Pull parsing is ideally suited to applications that need to transform input XML into other formats. Because such transformations are typically complex, these applications must maintain state during parsing. With SAX, the state has to be maintained between callbacks so that the correct action can be chosen for each SAX event. With pull parsing, the application can be structured naturally: it pulls the next event from the XML input only when it is ready to process it.

Let's look at the example data record below, which represents one person with a name and two addresses:

<person>
<name>Alek</name>
<home_address>
<street>101 Sweet Home</street>
<phone>333-3333</phone>
</home_address>
<work_address>
<street>303 Office Street</street>
<phone>444-4444</phone>
</work_address>
</person>

This XML input can naturally be mapped to Java classes:

class Person {
        String name;
        Address homeAddress;
        Address workAddress;
}

class Address {
        String street;
        String phone;
}

To process this input with SAX, the startElement callback would need to do some work depending on the start tag name, but that alone is not enough: a phone must be stored in a different place depending on whether the previous tag was home_address or work_address. Here is an example of such code:

    Person person = new Person();
    StringBuffer buf = new StringBuffer();
    Address address;
    
    public void startElement(String uri, String local, String raw,
                             Attributes attrs) throws SAXException {
        // StringBuffer has no clear(); reset it for the new element
        buf.setLength(0);
        // the address must exist before its street and phone are closed
        if("home_address".equals(local)) { 
           address = person.homeAddress = new Address(); 
        } else if("work_address".equals(local)) { 
           address = person.workAddress = new Address(); 
        }
    }
    public void characters(char ch[], int start, int length)
        throws SAXException {
        buf.append(ch, start, length);
    }
    public void endElement(String uri, String local, String raw)
        throws SAXException {
        if("name".equals(local)) { 
           person.name = buf.toString(); 
         } else if("phone".equals(local)) { 
           address.phone = buf.toString(); 
         } else if("street".equals(local)) { 
           address.street = buf.toString(); 
         } else if("person".equals(local)
                   || "home_address".equals(local)
                   || "work_address".equals(local)) { 
           // container elements carry no text of their own
         } else {
           throw new SAXException("unexpected element "+local);
         }
    }

Although at first glance the code may look sufficient, it has inherent problems because it does not track state between callbacks and therefore does not know its position in the parsed XML structure. In particular, such a SAX program does not validate its input and can even produce incorrect conversions. For example, this code would process the input XML below by incorrectly overwriting the home phone with 666-6666 instead of keeping 333-3333.

<person>
<name>Alek</name>
<home_address>
<phone>333-3333</phone>
</home_address>
<phone>666-6666</phone>
</person>

This can be fixed by keeping track of how deeply nested the current start/end element is and by using extra flags to make sure that phone is set only once and that address always refers to the current home or work address. However, this requires adding extra state variables and validation code.
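
One way such extra state might be organized is sketched below, as an assumption rather than the approach taken by any particular parser: a stack of open element names records the current position, and a hypothetical allowed() helper rejects any element that appears under the wrong parent. PersonHandler and its parse() helper are illustrative names; the Person and Address classes are repeated so that the sketch is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class Address { String street; String phone; }
class Person { String name; Address homeAddress; Address workAddress; }

/** SAX handler that tracks its position with an explicit element stack. */
class PersonHandler extends DefaultHandler {
    final Person person = new Person();
    private final Deque<String> path = new ArrayDeque<String>(); // open elements
    private final StringBuilder buf = new StringBuilder();
    private Address address; // non-null only inside an address element

    /** Hypothetical helper: may 'child' appear directly inside 'parent'? */
    private static boolean allowed(String parent, String child) {
        if (parent == null) return "person".equals(child); // document root
        switch (parent) {
            case "person":
                return "name".equals(child) || "home_address".equals(child)
                    || "work_address".equals(child);
            case "home_address":
            case "work_address":
                return "street".equals(child) || "phone".equals(child);
            default:
                return false; // leaf elements may not contain children
        }
    }

    @Override
    public void startElement(String uri, String local, String raw, Attributes attrs)
            throws SAXException {
        String parent = path.peek(); // null when no element is open
        if (!allowed(parent, local)) {
            throw new SAXException("unexpected element " + local + " inside " + parent);
        }
        if ("home_address".equals(local)) {
            address = person.homeAddress = new Address();
        } else if ("work_address".equals(local)) {
            address = person.workAddress = new Address();
        }
        path.push(local);
        buf.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buf.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String raw) throws SAXException {
        path.pop();
        String text = buf.toString();
        if ("name".equals(local)) {
            person.name = text;
        } else if ("street".equals(local)) {
            address.street = text;
        } else if ("phone".equals(local)) {
            if (address.phone != null) throw new SAXException("duplicate phone");
            address.phone = text;
        } else if (local.endsWith("_address")) {
            address = null; // leaving the address: a later phone has no target
        }
    }

    static Person parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that local names are reported
        PersonHandler handler = new PersonHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.person;
    }
}
```

With this handler the stray phone element in the second input is rejected, but only because the document structure has been encoded a second time, by hand, in allowed().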

With a pull parser, converting the XML input into a Person object is very natural and follows the hierarchical relation between the Person and Address objects. Let's look at how it could be done if a simple pull parser were available:

        PullParser parser = new PullParser(input);
        Person person = parsePerson(parser);

        public Person parsePerson(PullParser parser) throws ValidationException 
        {
                Person person = new Person();
                while(true) {
                   int eventType = parser.next();
                   if(eventType == parser.START_TAG) {
                      String tag = parser.getStartTagName();
                      if("name".equals(tag)) {
                        person.name = readContent(parser);                
                      } else if("home_address".equals(tag)) {
                        person.homeAddress = parseAddress(parser);
                      } else if("work_address".equals(tag)) {
                        person.workAddress = parseAddress(parser);
                      } else {
                        throw new ValidationException(
                          "unknown field "+tag+" in person record");
                      }
                   } else if(eventType == parser.END_TAG) {
                     break;
                   } else {
                     throw new ValidationException("invalid XML input");
                   }
                }
                return person;
        }
        public Address parseAddress(PullParser parser) throws ValidationException 
        {
                Address address = new Address();
                while(true) {
                   int eventType = parser.next();
                   if(eventType == parser.START_TAG) {
                      String tag = parser.getStartTagName();
                      if("street".equals(tag)) {
                        address.street = readContent(parser);                
                      } else if("phone".equals(tag)) {
                        address.phone = readContent(parser);                
                      } else {
                        throw new ValidationException(
                          "unknown field "+tag+" in address record");
                      }
                   } else if(eventType == parser.END_TAG) {
                     break;
                   } else {
                     throw new ValidationException("invalid XML input");
                   }
                }
                return address;
        }
        public String readContent(PullParser parser) throws ValidationException 
        {
                if(parser.next() != parser.CONTENT) {
                  throw new ValidationException("expected string content");
                }
                String content = parser.readContent();                
                if(parser.next() != parser.END_TAG) {
                  throw new ValidationException(
                    "expected end tag after string content");
                }
                return content;
        }

The structure of the code naturally reflects the organization of the data structures and is therefore much easier to maintain. State is kept on the call stack as a natural consequence of method calls, which can be nested as deeply as necessary. Notice also that, with pull parsing, the second input triggers a ValidationException.
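
For readers who want to run this idea, the same recursive structure can be approximated with the JDK's StAX reader. This is a sketch under assumptions, not the simple pull parser imagined in this section: XMLStreamReader replaces the hypothetical PullParser, XMLStreamException replaces ValidationException, and StaxPersonReader is an illustrative name. The Person and Address classes are repeated so that the sketch is self-contained.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class Address { String street; String phone; }
class Person { String name; Address homeAddress; Address workAddress; }

/** Recursive-descent reader for the person record, built on StAX. */
class StaxPersonReader {
    static Person parse(String xml) throws XMLStreamException {
        XMLStreamReader r =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        r.nextTag(); // advance to the root start tag
        r.require(XMLStreamReader.START_ELEMENT, null, "person");
        Person person = parsePerson(r);
        r.close();
        return person;
    }

    // Called with the reader on <person>; returns with it on </person>.
    static Person parsePerson(XMLStreamReader r) throws XMLStreamException {
        Person person = new Person();
        // nextTag() skips whitespace and stops at the next start or end tag
        while (r.nextTag() == XMLStreamReader.START_ELEMENT) {
            String tag = r.getLocalName();
            if ("name".equals(tag)) {
                person.name = r.getElementText();
            } else if ("home_address".equals(tag)) {
                person.homeAddress = parseAddress(r);
            } else if ("work_address".equals(tag)) {
                person.workAddress = parseAddress(r);
            } else {
                throw new XMLStreamException("unknown field " + tag + " in person record");
            }
        }
        return person;
    }

    // Called with the reader on an address start tag; returns on its end tag.
    static Address parseAddress(XMLStreamReader r) throws XMLStreamException {
        Address address = new Address();
        while (r.nextTag() == XMLStreamReader.START_ELEMENT) {
            String tag = r.getLocalName();
            if ("street".equals(tag)) {
                address.street = r.getElementText();
            } else if ("phone".equals(tag)) {
                address.phone = r.getElementText();
            } else {
                throw new XMLStreamException("unknown field " + tag + " in address record");
            }
        }
        return address;
    }
}
```

No position-tracking variables are needed here: a phone element that appears directly inside person reaches the final else branch of parsePerson and is rejected.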


Aleksander Andrzej Slominski
2002-02-10