XPB4J User Guide

covers XPB4J-0.90

Last Updated: June 24, 2002

Guide For the Impatient
Introduction
Getting XPB4J
Running XPB4J
Measurement Process
Input Data Files
XStat Processing
XPB4J Framework
Known Limitations

Guide For the Impatient

Here are the instructions for running XPB4J out of the box:

Download the XPB4J distribution file. Unzip it to get the directory tree rooted at xpb4j-0.90.
Make sure that you have J2SDK1.4.x installed and the environment variable JAVA_HOME set to the base directory of the J2SDK1.4.x installation.
Make sure that you have Jakarta Ant 1.4 or above installed and that the PATH environment variable includes Ant's bin directory.
Change to the XPB4J home directory (xpb4j-0.90) and issue the command:

$ ant run

This will compile the XPB4J sources and execute the default XStat Processing code and print the performance data on your screen. It will use the sample data file res0.xml kept in directory xmldata and JAXP libraries of J2SDK, along with other XML parsing libraries included with XPB4J.
Alternatively, find a set of measurements and analysis at XML Processing Measurements using XPB4J.

If you want to learn more and experiment around, read on.

Introduction

XML Processing Benchmark for Java (XPB4J) is a Java based performance measurement and comparison program for XML processing software. XML operations such as parsing, transformation, validation, encryption/decryption, custom access/manipulation, or any combination of these, applied to one or more XML files and/or byte streams, are considered XML processing.

Specific examples of such processing include:

validation of an XML file against a specified XML schema file;
creation/verification of digital signature as per XML Digital Signature standard;
validation of XML content as per a given set of business rules;
merging two or more XML files as per a specified set of rules using XSLT stylesheet or otherwise;
creating memory objects from XML content and vice-versa as per specified data binding rules;

XPB4J can be used to measure the performance of any of these processing activities, provided you hook up the processing code to XPB4J. By default, XPB4J includes the processing code for XStat Processing, a kind of XML content analysis that collects certain statistical information about an XML document.

XPB4J doesn't define any benchmark standard; it simply defines a framework to execute and measure performance characteristics of Processing Activities. If the same operation can be performed with different Processing Methods (say, using different parsing APIs such as SAX, DOM, JDOM or Pull Parser API) then the performance characteristics of these can be measured and compared. One could also use different parsers and/or transformers and compare the results for the same processing method.

I wrote XPB4J primarily to

learn about different XML processing APIs;
enable myself and my fellow programmers to experiment with different ways of doing the same processing and understand the trade-offs and hence help us make better design and deployment choices;
track evolution of XML processing software with respect to their performance characteristics.

I have exercised XPB4J for XStat Processing using different parsing APIs and parser implementations. The different processing methods used are:

SAX -- linear scan using a JAXP compliant SAX parser;
DOM -- building W3C DOM structure using a JAXP compliant DOM parser;
PULL -- linear scan using a Pull Parser;
JDOM -- building JDOM structure using JDOM software;
XSLT -- using Java extension functions in an XSL stylesheet and using an XSLT transformer; and
DOM4J -- building DOM4J tree using a DOM4J Processor.

You can find my observations and conclusions in XML Processing Measurements using XPB4J. You could also run XPB4J on your machine with your favourite parser/transformer and your typical input and observe the results.

If your interest is in finding out performance and memory usage of your own custom processing, you can write your own classes using XPB4J Framework to invoke your processing and collect the relevant data. Unfortunately, documentation for this framework doesn't exist right now. For now, I would suggest that you look at the source code and figure it out by yourself. This is actually simpler than you might expect.

Note: The directory path and execution script name in this document use the MS WINDOWS convention. Their UNIX equivalents can be derived simply by replacing \ by / in path names.

Getting XPB4J

You can either download the released software as a single .zip file or get the latest code from the CVS archive anonymously. If you download the zipped file xpb4j-0.90.zip, unzip it in your working directory to get the directory tree starting at the XPB4J home directory xpb4j-0.90.

For cvs access, you should have CVS client software. If you are a Linux user, you should have it already. If you are a Windows user, you can either get the command line version as part of Cygwin toolkit or install the excellent WinCVS.

The instructions given below are for the command line version on a Windows machine. Instructions for Linux and the different flavours of UNIX are very similar.

Set the environment variable CVSROOT.
$ set CVSROOT=:pserver:anonymous@cvs.xpb4j.sourceforge.net:/cvsroot/xpb4j
Login anonymously to the CVS server:
$ cvs login
Hit Enter when asked for password.
Checkout the files:
$ cvs -z8 co xpb4j

You get the latest snapshot from CVS archive. But beware, it may not work ( or even compile !! ).

Running XPB4J

As illustrated in section Guide For the Impatient, running XPB4J with default arguments is very simple.

By default, configurable parameters are taken from the build.properties file, jar files are loaded from the lib directory (in addition to those loaded by the JVM) and the input data set consists of all .xml files in the xmldata directory. To change these defaults, you can either modify the Ant script build.xml or change the contents of the above-mentioned file and directories. It is also possible to set certain parameters as Ant properties on the command line.

I have found it convenient to

copy the appropriate .jar files to ( and remove unwanted .jar files from ) lib directory to change parser and processor implementations.
copy the appropriate .xml data files to ( and remove unwanted .xml files from ) xmldata directory to change input data files.
Set Ant properties xpb.loopcount and/or xpb.runcount in the command line itself for a specific invocation.

Examples of invocation commands include:


$ ant run
$ ant run-dom
$ ant run -Dxpb.loopcount=1000
$ ant run -Dxpb.loopcount=10 -Dxpb.runcount=10
$ ant help

Refer to Ant build script and the section on Measurement Process for more details on these.

A successful execution of XPB4J writes the measurements in file pdata.xml and processing results in file results.xml.

Measurement Process

Execution of the measurement program is started by invoking the command "ant run".

The measurement program executes the equivalent of the following loop for measurements:

  // Code for illustration only. It is a fragment: runcount, loopcount, gcFlag,
  // inputFiles and process() are assumed to be defined elsewhere.
  Runtime rt = Runtime.getRuntime();
  for (int r = 0; r < runcount; r++)  // 4 runs
    {
    rt.gc();  // Hope that this will force garbage collection; it is only a request.
    long startMem = rt.totalMemory() - rt.freeMemory();
    long startTime = System.currentTimeMillis();
    for (int l = 0; l < loopcount; l++)  // 100 loops
      {
      if (gcFlag)  // off by default
        System.gc();
      for (int i = 0; i < inputFiles.length; i++)  // Do the processing.
        process(inputFiles[i]);
      }
    long endTime = System.currentTimeMillis();
    long endMem = rt.totalMemory() - rt.freeMemory();
    // Average time per processing iteration.
    System.out.println("Processing Time: " + (endTime - startTime)/loopcount + " milli secs.");
    System.out.println("Memory Use: " + (endMem - startMem)/1024 + " KB.");
    }

This loop is actually spread over two different classes in the code. You can find the corresponding source code in files PMethod.java and XPBmain.java under package org.xperf.xpb. The important thing to note is that all the runs are within the same execution of the JVM (so that the warmup overhead is incurred only once) and each run consists of a number of processing iterations (so that the measurement window is large enough to get a meaningful average processing time). Each processing iteration processes all the specified input files.
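The loop described above can be sketched as a small self-contained program. This is only an illustration under stated assumptions, not the code in PMethod.java or XPBmain.java: the class MeasureSketch, the method measure() and the Runnable standing in for "process all input files" are hypothetical names, and System.nanoTime() is used here in place of System.currentTimeMillis() for finer resolution.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the measurement loop; not the actual XPB4J source.
class MeasureSketch {

    /** Outcome of one run: average time per iteration and change in used heap. */
    static final class RunResult {
        final double avgMillisPerLoop;
        final long memDeltaKB;

        RunResult(double avgMillisPerLoop, long memDeltaKB) {
            this.avgMillisPerLoop = avgMillisPerLoop;
            this.memDeltaKB = memDeltaKB;
        }
    }

    /** Executes runcount runs of loopcount iterations each, all in the same JVM. */
    static List<RunResult> measure(Runnable workload, int runcount, int loopcount,
                                   boolean gcEachLoop) {
        if (runcount < 1 || loopcount < 1) {
            throw new IllegalArgumentException("runcount and loopcount must be >= 1");
        }
        Runtime rt = Runtime.getRuntime();
        List<RunResult> results = new ArrayList<>();
        for (int r = 0; r < runcount; r++) {
            rt.gc(); // a request only; the JVM may ignore it
            long startMem = rt.totalMemory() - rt.freeMemory();
            long startTime = System.nanoTime();
            for (int l = 0; l < loopcount; l++) {
                if (gcEachLoop) {
                    System.gc();
                }
                workload.run(); // stands in for processing every input file once
            }
            long elapsedNanos = System.nanoTime() - startTime;
            long endMem = rt.totalMemory() - rt.freeMemory();
            results.add(new RunResult(elapsedNanos / 1e6 / loopcount,
                                      (endMem - startMem) / 1024));
        }
        return results;
    }

    public static void main(String[] args) {
        // Dummy workload that allocates a little memory on every iteration.
        Runnable workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                sb.append(i);
            }
        };
        for (RunResult rr : measure(workload, 4, 100, false)) {
            System.out.println("Processing Time: " + rr.avgMillisPerLoop + " milli secs.");
            System.out.println("Memory Use: " + rr.memDeltaKB + " KB.");
        }
    }
}
```

Because the memory figure is a difference of two heap snapshots, it can come out negative when the collector happens to run between them; this is a property of the measurement approach, not of the sketch alone.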

Input Data Files

The XPB4J distribution includes input data files obtained from Google through Google's Web API by issuing SOAP requests with the search string "Bill Gates". Ten files, res0.xml, res1.xml, ..., res9.xml, were created by saving the returned documents, each holding 10 search result entries: res0.xml contains the first to the tenth, res1.xml the eleventh to the twentieth, and so on. Each file is a valid SOAP document and is approximately 10KB in size. A big file holding all the entries, res.xml, was created by combining the entries of all the other files. Note that res.xml is not a textual concatenation of the files; it contains the totality of the search result entries, thus preserving the structure of a valid SOAP document, with a total size that is slightly less than the sum of the individual file sizes.

File res0.xml can be found in directory xmldata and all the files, including res0.xml, can be found in directory xmldata\google.

Note: I collected these files sometime in the middle of May 2002. Due to dynamic nature of the Web, if you try the same query now, you may not get exactly same search results.

Random XML data can also be generated by the supplied rxgen utility. This program takes either an element count or an approximate size in KB as an argument and generates a somewhat random XML document. Look at the source file org.xperf.xpb.RandXMLGen.java (included in the distribution) to understand how the random file is generated.

Examples of rxgen invocation commands include:


$ rxgen -elemcount 100 > xmldata\rxgen100.xml
$ rxgen -datasize 100 > xmldata\rxgen100KB.xml
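
To give a feel for what "somewhat random" XML generation by element count can look like, here is a minimal sketch. It is not the algorithm used by org.xperf.xpb.RandXMLGen; the class RandXmlSketch, the method generate(), the element names and the seed parameter are all hypothetical and chosen only for illustration.

```java
import java.util.Random;

// Hypothetical sketch of a random XML generator; not the rxgen implementation.
class RandXmlSketch {
    private static final String[] NAMES = {"item", "entry", "note", "data"};
    private static final int MAX_DEPTH = 5;

    /** Returns a well-formed document with exactly elemCount elements, root included. */
    static String generate(int elemCount, long seed) {
        if (elemCount < 1) {
            throw new IllegalArgumentException("elemCount must be >= 1");
        }
        Random rnd = new Random(seed); // fixed seed gives a reproducible document
        int[] remaining = {elemCount - 1}; // the root element uses one
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n<root>");
        emitChildren(sb, rnd, remaining, 1);
        return sb.append("</root>\n").toString();
    }

    // Emits child elements at the given depth while the element budget lasts.
    private static void emitChildren(StringBuilder sb, Random rnd, int[] remaining, int depth) {
        while (remaining[0] > 0) {
            String name = NAMES[rnd.nextInt(NAMES.length)];
            remaining[0]--;
            sb.append('<').append(name)
              .append(" id=\"").append(rnd.nextInt(1000)).append("\">");
            if (depth < MAX_DEPTH && rnd.nextBoolean()) {
                emitChildren(sb, rnd, remaining, depth + 1); // nested elements
            } else {
                sb.append("text").append(rnd.nextInt(100)); // leaf with character data
            }
            sb.append("</").append(name).append('>');
            if (depth > 1 && rnd.nextInt(3) == 0) {
                return; // randomly go back up one level
            }
        }
    }

    public static void main(String[] args) {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        System.out.print(generate(count, System.currentTimeMillis()));
    }
}
```

Generating by approximate size instead of element count could follow the same pattern, with the budget expressed in characters written rather than elements emitted.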

XStat Processing

XStat processing consists of scanning one or more XML files and collecting the following statistical information for each element in the file:

the number of times the element occurred
the number of times it had a particular element as a parent
the number of times it had a particular element as a child
the number of times it had a particular attribute
the amount of character data that had at least some non-whitespace characters
whether the element was always empty

Note that this processing does require some bookkeeping and is not a simple parse of the file. Memory tree oriented APIs like DOM, JDOM and DOM4J make it simple to write the processing code, whereas linear scan APIs like SAX and XmlPull require extra data structures to be maintained. Acknowledgement: I have borrowed the idea behind this processing from the article Using The Perl XML::Parser Module.
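
The extra bookkeeping needed with a linear scan API can be sketched with a SAX handler that keeps a stack of open elements. This is an illustration only, not the code under org.xperf.xpb.xstat.sax: the names XStatSaxSketch, ElemStat and collect() are hypothetical, and "always empty" is assumed here to mean that no occurrence had a child element or non-whitespace character data.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of XStat-style statistics gathering with SAX.
class XStatSaxSketch extends DefaultHandler {

    /** Statistics kept for one element name. */
    static final class ElemStat {
        int count;
        final Map<String, Integer> parents = new TreeMap<>();
        final Map<String, Integer> children = new TreeMap<>();
        final Map<String, Integer> attributes = new TreeMap<>();
        long charDataLength;        // only chunks with some non-whitespace are counted
        boolean alwaysEmpty = true; // assumption: no children and no such text
    }

    final Map<String, ElemStat> stats = new TreeMap<>();
    private final Deque<String> open = new ArrayDeque<>(); // names of open elements

    private ElemStat stat(String name) {
        return stats.computeIfAbsent(name, k -> new ElemStat());
    }

    private static void bump(Map<String, Integer> m, String key) {
        m.merge(key, 1, Integer::sum);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        ElemStat s = stat(qName);
        s.count++;
        String parent = open.peek();
        if (parent != null) {
            bump(s.parents, parent);
            ElemStat p = stat(parent);
            bump(p.children, qName);
            p.alwaysEmpty = false;
        }
        for (int i = 0; i < atts.getLength(); i++) {
            bump(s.attributes, atts.getQName(i));
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!new String(ch, start, length).trim().isEmpty()) {
            ElemStat s = stat(open.peek());
            s.charDataLength += length;
            s.alwaysEmpty = false;
        }
    }

    /** Parses the given XML text with the default JAXP SAX parser. */
    static Map<String, ElemStat> collect(String xml) {
        XStatSaxSketch handler = new XStatSaxSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.stats;
    }

    public static void main(String[] args) {
        Map<String, ElemStat> stats = collect("<a><b/><b x='1'>hi</b><c><b/></c></a>");
        for (Map.Entry<String, ElemStat> e : stats.entrySet()) {
            ElemStat s = e.getValue();
            System.out.println(e.getKey() + ": count=" + s.count + " parents=" + s.parents
                    + " children=" + s.children + " attributes=" + s.attributes
                    + " chars=" + s.charDataLength + " alwaysEmpty=" + s.alwaysEmpty);
        }
    }
}
```

The stack is the "extra data structure": SAX reports only the current event, so the handler has to remember which element is the parent at any point.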

XStat Processing Methods

XPB4J includes code to perform XStat processing using the following Processing Methods:

SAX -- The XML input is accessed using the SAX API and relevant information is stored in a suitable data structure. The SAX parser is accessed using JAXP API. Refer to sources under package org.xperf.xpb.xstat.sax for details.
DOM -- The XML input is converted into a W3C DOM object and is traversed to gather the relevant statistics. The DOM parser is accessed using JAXP API. Refer to sources under package org.xperf.xpb.xstat.dom for details.
PULL -- The XML input is accessed using the Pull Parser API available at the Pull Parser site and relevant information is stored in a suitable data structure, as with the SAX processing method. Refer to sources under package org.xperf.xpb.xstat.pull for details.
JDOM -- The XML input is converted into JDOM document and is traversed to gather the relevant statistics. This is very similar to DOM processing method. Refer to sources under package org.xperf.xpb.xstat.jdom for details.
XSLT -- A stylesheet with Java extension functions is applied to the XML input using an XSL transformer. The stylesheet fires the appropriate Java function on encountering element nodes, attributes and text nodes. The transformer is obtained and invoked using JAXP API. Refer to sources under package org.xperf.xpb.xstat.xslt for details.
DOM4J -- The XML input is converted into DOM4J document and is traversed to gather the relevant statistics. This is very similar to DOM processing method. Refer to sources under package org.xperf.xpb.xstat.dom4j for details.
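
For contrast with the linear scan methods, the tree-based approach listed above (DOM, and similarly JDOM and DOM4J) can be sketched as a recursive walk over an in-memory document. This is a reduced illustration that counts only element occurrences and parent/child pairs; it is not the code under org.xperf.xpb.xstat.dom, and the names XStatDomSketch, visit() and collect() are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of XStat-style counting over a W3C DOM tree.
class XStatDomSketch {
    /** Occurrences per element name. */
    final Map<String, Integer> elementCount = new TreeMap<>();
    /** Occurrences per "parent/child" pair of element names. */
    final Map<String, Integer> parentChildCount = new TreeMap<>();

    // The tree already records parent and children, so no explicit stack is needed.
    void visit(Element e) {
        elementCount.merge(e.getTagName(), 1, Integer::sum);
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element child = (Element) n;
                parentChildCount.merge(e.getTagName() + "/" + child.getTagName(),
                                       1, Integer::sum);
                visit(child);
            }
        }
    }

    /** Builds a DOM with the default JAXP parser and walks it from the root. */
    static XStatDomSketch collect(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        XStatDomSketch sketch = new XStatDomSketch();
        sketch.visit(doc.getDocumentElement());
        return sketch;
    }

    public static void main(String[] args) {
        XStatDomSketch s = collect("<a><b/><b>hi</b><c><b/></c></a>");
        System.out.println("elements: " + s.elementCount);
        System.out.println("parent/child pairs: " + s.parentChildCount);
    }
}
```

The processing code is shorter than the SAX variant because the whole document is held in memory, which is exactly the trade-off between memory use and coding effort that the measurements are meant to expose.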

XPB4J Framework

TBD.

Known Limitations

Here is a partial list of known issues/limitations:

The XSL stylesheet for the XSLT processing method for XStat uses XALAN processor specific extensions and has been found to work with the J2SDK1.4.0 JAXP processor. However, it may not be portable to other processors. It should be simple to write stylesheets for other processors. If you do, please share them with me and I will include them.
Javadocs for the XPB4J Framework do not exist. However, the framework itself is quite simple, and if you really want to use it for your own processing code, you should be able to do so by looking at the code and using the XStat code as a sample.