

Implementing XML Pull Parser

Any XML parser has to tokenize its input. In a push parser, tokenization can be combined with the higher-level parsing logic, because push callbacks can be invoked as soon as interesting input is read. Pull parsing is different: when an interesting event is seen, the pull parser must return control to the user. It therefore needs to maintain internal state so that it can resume parsing when the user requests the next event.
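To make the point about retained state concrete, here is a minimal sketch of a resumable tokenizer. It is not XPP2 code: the class `TinyPullTokenizer`, its `Event` enum, and the methods `next()` and `value()` are hypothetical names chosen for illustration, and the sketch ignores attributes, comments, entities, and well-formedness checks. The only state it keeps between calls is the current position in the input.

```java
// Hypothetical toy tokenizer, not part of XPP2: it only shows how a pull
// parser keeps its position between calls so that parsing can resume.
final class TinyPullTokenizer {
    enum Event { START_TAG, END_TAG, TEXT, END_DOCUMENT }

    private final char[] buf;
    private int pos;        // state retained between calls to next()
    private String value;   // tag name or text of the most recent event

    TinyPullTokenizer(String xml) {
        this.buf = xml.toCharArray();
    }

    /** Scans just far enough to produce one event, then returns to the caller. */
    Event next() {
        if (pos >= buf.length) {
            value = null;
            return Event.END_DOCUMENT;
        }
        if (buf[pos] == '<') {
            boolean isEnd = pos + 1 < buf.length && buf[pos + 1] == '/';
            int nameStart = pos + (isEnd ? 2 : 1);
            int close = nameStart;
            while (close < buf.length && buf[close] != '>') {
                close++;
            }
            if (close == buf.length) {
                throw new IllegalStateException("unterminated tag at offset " + pos);
            }
            value = new String(buf, nameStart, close - nameStart);
            pos = close + 1;   // remember where to resume
            return isEnd ? Event.END_TAG : Event.START_TAG;
        }
        int start = pos;
        while (pos < buf.length && buf[pos] != '<') {
            pos++;
        }
        value = new String(buf, start, pos - start);
        return Event.TEXT;
    }

    String value() {
        return value;
    }
}
```

A caller drives the sketch by invoking `next()` in a loop until `END_DOCUMENT` is returned and may stop at any event, which is the property that distinguishes pull parsing from push callbacks.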

In XPP2 it is possible to parse some parts of the input with the pull API and to hand selected XML subtrees to a push parser that provides the SAX2 API. From within a SAX2 callback the user can continue pull parsing or even create another nested SAX2 push parser; this recursive nesting can be as deep as required.

Java Performance Constraints

When implementing XPP2, one of the most important tasks was to ensure good overall performance. We identified that creating an XmlStartTag object for every start tag would incur a very high overhead, but we did not want to remove the possibility of recording parser state for later use (otherwise building an in-memory XML object model from XPP2 events would be difficult). The other consideration was to make sure that parser performance is not too low when compared with simple string tokenization and with other XML parsers. These requirements also helped us to estimate an acceptable XML parsing overhead. An interesting finding was the difference between processing char[] and processing String, even when an advanced JIT such as HotSpot is used (see [1] for more details on the performance evaluations).


The most important part of an XML pull parser is the tokenizer, which is responsible for breaking the input into tokens that are later used to produce events.

To make this task as efficient as possible, the tokenizer in XPP2 is a state machine driven by input characters.

To avoid reading the input character by character, the input is buffered internally. In J2ME it is important to make sure that the buffer size does not exceed a hard limit (so that XML parsing does not consume all available memory). If the hard limit is exceeded, for example by very long element content, an exception is thrown and parsing stops. It is also desirable to establish a soft limit that indicates the preferred internal buffer size (it must always be less than the hard limit).
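The interplay of the two limits can be sketched as follows. This is an illustration under stated assumptions, not the XPP2 implementation: the class `BoundedCharBuffer` and all of its method names are hypothetical, the doubling growth policy is an assumption, and the choice of unchecked exceptions is made only to keep the example short. The buffer starts at the soft limit, discards characters that precede the current token, and grows only when a single token does not fit, never beyond the hard limit.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch of an input buffer with a soft and a hard size limit.
final class BoundedCharBuffer {
    private final Reader in;
    private final int hardLimit;
    private char[] buf;
    private int start; // first char of the token currently being built
    private int pos;   // next char to hand out
    private int end;   // one past the last valid char in buf

    BoundedCharBuffer(Reader in, int softLimit, int hardLimit) {
        if (softLimit <= 0 || softLimit >= hardLimit) {
            throw new IllegalArgumentException("need 0 < softLimit < hardLimit");
        }
        this.in = in;
        this.hardLimit = hardLimit;
        this.buf = new char[softLimit]; // preferred size
    }

    /** Declares that a new token starts here; earlier chars may be dropped. */
    void markTokenStart() {
        start = pos;
    }

    /** Returns the next char, or -1 at end of input. */
    int read() {
        if (pos == end && !fill()) {
            return -1;
        }
        return buf[pos++];
    }

    /** Returns the chars read since the last markTokenStart(). */
    String token() {
        return new String(buf, start, pos - start);
    }

    int capacity() {
        return buf.length;
    }

    private boolean fill() {
        if (start > 0) {
            // Compact: slide the current token to the front of the buffer.
            System.arraycopy(buf, start, buf, 0, end - start);
            pos -= start;
            end -= start;
            start = 0;
        }
        if (end == buf.length) {
            // The current token alone fills the buffer: grow, up to the hard limit.
            if (buf.length >= hardLimit) {
                throw new IllegalStateException(
                        "token exceeds hard buffer limit of " + hardLimit + " chars");
            }
            buf = Arrays.copyOf(buf, Math.min(hardLimit, buf.length * 2));
        }
        try {
            int n = in.read(buf, end, buf.length - end);
            if (n <= 0) {
                return false;
            }
            end += n;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch short tokens never cause the buffer to grow past the soft limit, because compaction frees the space used by earlier tokens; only a single long token, such as long element content, pushes the buffer toward the hard limit and eventually triggers the exception.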


The parser in XPP2 implements all of the XML 1.0 specification requirements for a non-validating XML parser except for parsing the internal DTD. This decision was made to keep the size of XPP2 small; moreover, as XML schemas are going to replace DTDs, we think that support for DTD parsing is no longer desired by users. The other limitation of the parser is that the input must be Unicode as represented by the Java Reader and char types. The user is responsible for detecting the input encoding (such as UTF-8) and for using Java's built-in mechanisms to convert the input to a Reader.
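As a brief illustration of the last point, the standard Java library can decode a byte stream into a Reader once the encoding is known. The helper class `ReaderSetup` and its method `utf8Reader` are hypothetical names; `InputStreamReader` and `StandardCharsets` are standard Java classes. The sketch assumes the caller has already determined that the input is UTF-8.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a byte stream that is known to be UTF-8.
final class ReaderSetup {
    private ReaderSetup() {
    }

    /** Wraps the stream in a Reader that decodes UTF-8 bytes into Java chars. */
    static Reader utf8Reader(InputStream in) {
        // No extra BufferedReader is added here because, as described above,
        // the parser buffers its input internally.
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

The resulting Reader is what would then be handed to the parser as its input.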

Support for streaming

The XPP2 pull model is inherently well suited to streaming. The internal buffer's support for soft and hard size limits allows us to ensure that memory consumption is kept under tight control during parsing.

Namespace handling

In situations where the XML input does not contain XML namespaces, there can be a slight boost in performance (around 5%) from avoiding the overhead of validating and maintaining XML namespace prefix declarations. XPP2 therefore allows the user to decide before parsing whether namespace support should be switched on or off.
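The bookkeeping that is skipped when namespaces are switched off can be pictured with a small sketch. It is not taken from XPP2: the class `PrefixScopes` and its methods are hypothetical, and a real parser would also handle the default namespace and validate qualified names. The sketch only shows that every element start and end has to push and pop prefix declarations, and that every prefixed name needs a lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-element namespace prefix bookkeeping.
final class PrefixScopes {
    private final List<String[]> decls = new ArrayList<>(); // {prefix, uri} pairs
    private final List<Integer> marks = new ArrayList<>();  // decls.size() at each open element

    /** Called for every start tag, before its declarations are recorded. */
    void startElement() {
        marks.add(decls.size());
    }

    /** Records an xmlns:prefix="uri" declaration found on the current start tag. */
    void declare(String prefix, String uri) {
        decls.add(new String[] {prefix, uri});
    }

    /** Called for every end tag; drops the declarations made by that element. */
    void endElement() {
        int mark = marks.remove(marks.size() - 1);
        while (decls.size() > mark) {
            decls.remove(decls.size() - 1);
        }
    }

    /** Returns the URI bound to the prefix in the innermost scope, or null. */
    String resolve(String prefix) {
        for (int i = decls.size() - 1; i >= 0; i--) {
            if (decls.get(i)[0].equals(prefix)) {
                return decls.get(i)[1];
            }
        }
        return null;
    }
}
```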

Aleksander Andrzej Slominski