Sunday, January 23, 2011

Parsing large XML files.

Today I will write about parsing XML files from the Java perspective. Recently I faced the task of reading and extracting certain data from an unusually large XML file, over half a GB. The most convenient approach to parsing XML files is usually a parser that implements the DOM API. For those of you who are not familiar with this mechanism: in a nutshell, it's a tree-based API which reads the entire document into memory and represents it as a tree of objects, to which we have random access. It makes working with XML documents very convenient, because we can easily retrieve any interesting node from the tree and read the data that we need. Unfortunately, there is a huge drawback to this solution: it requires lots of memory, depending on the implementation up to several times the size of the XML document, which in my case forced me to look for another approach.
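To make the tree-based approach concrete, here is a minimal sketch using the JDK's built-in DOM parser. It assumes a document shaped like the calendar file shown later in this post; the class name DomSketch and the helper eventTypes are made up for illustration only.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DomSketch {
    // Hypothetical helper: returns the "type" attribute of every <event> element.
    static List<String> eventTypes(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The whole document is read and turned into a tree of objects here.
        Document doc = builder.parse(in);
        // Random access: ask the tree for all <event> elements at once.
        NodeList events = doc.getElementsByTagName("event");
        List<String> types = new ArrayList<>();
        for (int i = 0; i < events.getLength(); i++) {
            Element event = (Element) events.item(i);
            types.add(event.getAttribute("type"));
        }
        return types;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "ourFile.xml";
        try (InputStream in = new FileInputStream(path)) {
            System.out.println(eventTypes(in));
        }
    }
}
```

This is pleasant for small files, but the call to parse() is exactly where the memory cost described above is paid.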
After a while, I came across StAX, which stands for Streaming API for XML. It is a very different API from DOM. The first important thing: it does not convert the document into a tree. Instead, it treats the document as what it is, a stream. But that's not everything. StAX is a pull streaming model, which means that the programmer asks the parser for the next piece of the document and so decides when to start, pause or resume the parsing process.
Fine! I guess that's enough for an introduction. Let's have a look at the following example of parsing an XML file. First of all, here is what the XML file looks like:
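A small sketch of what "pull" means in practice: the loop below asks for one event at a time and simply stops asking once it has what it wants. The class PullSketch and the helper firstWhere are hypothetical names, and the element name "where" is taken from the calendar file used in this post.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PullSketch {
    // Hypothetical helper: returns the text of the first <where> element,
    // or null if the document has none.
    static String firstWhere(InputStream in) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        try {
            while (reader.hasNext()) {
                // We pull the next event; the parser never pushes anything at us.
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "where".equals(event.asStartElement().getName().getLocalPart())) {
                    // Read the element's text and return; we stop pulling events here.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "ourFile.xml";
        try (InputStream in = new FileInputStream(path)) {
            System.out.println(firstWhere(in));
        }
    }
}
```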

<calendar>
    <event type="party">
        <where>Club Mojito</where>
        <whom>My friends</whom>
    </event>
    <event type="meeting">
        <where>A building</where>
        <whom>Project Manager</whom>
        <date>12/09/11</date>
    </event>
    <event type="lunch">
        <where>Canteen</where>
    </event>
</calendar>

And the code that parses the file using StAX:

package stAX;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StAXExample {
    public static void main(String[] args) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // try-with-resources closes the file stream when parsing ends
        try (InputStream in = new FileInputStream("ourFile.xml")) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            String currentElement = ""; // local name of the element we are inside
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();
                if (event.isStartElement()) {
                    StartElement startElement = event.asStartElement();
                    currentElement = startElement.getName().getLocalPart();
                    System.out.println("Start element: " + startElement.getName());
                    @SuppressWarnings("unchecked")
                    Iterator<Attribute> it1 = startElement.getAttributes();
                    while (it1.hasNext()) {
                        Attribute attribute = it1.next();
                        System.out.println(" Attribute name: " + attribute.getName() + ", value: " + attribute.getValue());
                    }
                }
                if (event.isEndElement()) {
                    currentElement = "";
                }
                // print the text content of <whom> elements only
                if (event.isCharacters() && currentElement.equals("whom")) {
                    System.out.println(event.asCharacters().getData());
                }
            }
            eventReader.close();
        } catch (IOException | XMLStreamException e) {
            // report I/O and parse errors instead of swallowing them
            e.printStackTrace();
        }
    }
}

At the top of main() we initialize our parser. The most important thing is to get an implementation of the XMLEventReader interface. This is the main interface of StAX's event iterator API, and it gives us access to all the events that we need. To get an implementation, we need an XMLInputFactory and an InputStream for the XML file we want to parse.
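The factory can also be configured before the reader is created. The following is only a sketch of one possible setup: it uses the standard XMLInputFactory properties IS_COALESCING and SUPPORT_DTD, while the class ReaderSetupSketch and the helpers open and countStartElements are invented for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

class ReaderSetupSketch {
    // Hypothetical helper: builds an XMLEventReader with a few optional settings.
    static XMLEventReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text chunks into a single characters event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Do not process DTDs; the calendar file does not use one.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        return factory.createXMLEventReader(in);
    }

    // Hypothetical helper: counts opening tags to show the reader in use.
    static int countStartElements(InputStream in) throws XMLStreamException {
        XMLEventReader reader = open(in);
        int count = 0;
        while (reader.hasNext()) {
            if (reader.nextEvent().isStartElement()) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        String path = args.length > 0 ? args[0] : "ourFile.xml";
        // Closing the reader does not close the stream, so the stream is managed here.
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            System.out.println(countStartElements(in) + " start elements");
        }
    }
}
```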
In the next part of the code we can see how the XMLEventReader is used, together with the XMLEvent objects it returns. The methods that we take advantage of are as follows:
hasNext() (XMLEventReader) checks if there are more events
nextEvent() (XMLEventReader) returns the next event
isStartElement() (XMLEvent) checks if the event is a start element, which means, for instance, an opening tag
isEndElement() (XMLEvent) same as the previous method, except that it relates to an end element, that is, a closing tag
isCharacters() (XMLEvent) checks if the event is plain text, which means the text between opening and closing tags
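To see which events these methods actually distinguish, a small trace can help. The sketch below is an illustration only: EventTraceSketch and trace are hypothetical names, and the whitespace filter and the coalescing property are my additions to keep the trace short.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class EventTraceSketch {
    // Hypothetical helper: lists start, end and text events in document order.
    static List<String> trace(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask for one characters event per run of text.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(xml);
        List<String> lines = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                lines.add("START " + event.asStartElement().getName().getLocalPart());
            } else if (event.isEndElement()) {
                lines.add("END " + event.asEndElement().getName().getLocalPart());
            } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                lines.add("TEXT " + event.asCharacters().getData());
            }
            // Other events (start/end of document, comments, ...) match none of the checks.
        }
        reader.close();
        return lines;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : trace(new StringReader("<event><where>Canteen</where></event>"))) {
            System.out.println(line);
        }
    }
}
```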

Ok, so now we know how to get opening and closing tags, as well as the content between them. But let's say we would also like to get the attributes of a tag and their values. Inside the isStartElement() branch I achieve this with an iterator. The StartElement interface has a method getAttributes() that returns an iterator over the attributes, which we can treat as an Iterator<Attribute>. The Attribute interface in turn has the methods getName() and getValue(), which we use to get the name and value of each of the tag's attributes.
It is that simple. From now on, if you have a large XML file, you will know how the big boys handle it ;-).
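If only one attribute is of interest, iterating is not required: StartElement also offers getAttributeByName(). The sketch below uses it to pick out events of one type from the calendar file; the class AttributeSketch and the helper placesFor are hypothetical names chosen for this example.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class AttributeSketch {
    // Hypothetical helper: returns the <where> text of every <event>
    // whose "type" attribute equals wantedType.
    static List<String> placesFor(InputStream in, String wantedType) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        List<String> places = new ArrayList<>();
        boolean insideWanted = false;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    String name = start.getName().getLocalPart();
                    if ("event".equals(name)) {
                        // Returns null when the attribute is absent.
                        Attribute type = start.getAttributeByName(new QName("type"));
                        insideWanted = type != null && wantedType.equals(type.getValue());
                    } else if (insideWanted && "where".equals(name)) {
                        places.add(reader.getElementText());
                    }
                } else if (event.isEndElement()
                        && "event".equals(event.asEndElement().getName().getLocalPart())) {
                    insideWanted = false;
                }
            }
        } finally {
            reader.close();
        }
        return places;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "ourFile.xml";
        try (InputStream in = new FileInputStream(path)) {
            System.out.println(placesFor(in, "meeting"));
        }
    }
}
```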
