Monday, April 16, 2012

Converting Python from the dom to etree

Overview

As part of a technology upgrade for an application, I recently rewrote a collection of Python classes to use xml.etree instead of xml.dom. The application used pretty much all of the libraries: reading in XML strings to create in-memory representations and then searching them for various values, as well as taking a non-XML data structure, creating an XML representation of it, and writing that out as a string for future use.

Before starting, I went looking for a document discussing this kind of rewrite, in hopes of avoiding any gotchas that might be lurking in the process. I was a bit surprised to find - well, not find - such a document. Either this process isn't all that common, or there aren't any gotchas. In either case, it's complicated enough that I think it's worth discussing the differences.

Converting code from using the DOM to using an ElementTree API is relatively straightforward. I'm going to cover the major details, pointing out any gotchas I've found. However, this is not a guide to either ElementTree or the DOM. It should be sufficient for most such conversions, but you may well be able to improve a specific bit of code if you study the ElementTree documentation.

The ElementTree API was designed for Python, so you are unlikely to find it elsewhere. Objects provide standard Python APIs to access related parts of the tree, rather than having methods and/or attributes from the DOM spec for that.

There are multiple implementations of the API. Recent CPython versions ship with ElementTree and cElementTree. lxml provides an ElementTree implementation based on the libxml C library, and thus includes both fast XML processing and features that aren't available in the included versions.
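Because the implementations share one API, a common way to pick between them is a guarded import. This is only a sketch: it assumes lxml is an optional third-party install and falls back to the standard library module when it is missing.

```python
# Prefer lxml when it is installed (assumption: it is optional),
# otherwise fall back to the implementation in the standard library.
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# The rest of the code talks to whichever implementation was imported.
root = ET.fromstring("<config><item name='a'/></config>")
root_tag = root.tag
child_count = len(root)
```

Code written this way should stick to the parts of the API that both implementations provide.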

Extracting data

Data in an XML tree comes in three forms:
  1. Element nodes, which hold everything.
  2. Text in a node.
  3. Attributes.
The changes get harder to deal with as you go down the list, so I'll deal with them in reverse order.


Attributes

Attributes are easy. Instead of getAttribute, you have get. getAttributeNode (and similar) don't exist in ElementTree. The attributes in ElementTree look like dictionary items on the node, except that there is no __getitem__ for them and no way to iterate over them on the node itself (which means that in doesn't work on them!). If you really want a dict, you can get it from the node's attrib attribute.

The gotcha here is that getAttribute returns an empty string if the attribute doesn't exist. get is the standard dictionary function, and has a second argument (which defaults to None) that is returned if the attribute isn't there. So watch for hasAttribute tests used to deal with the default value in some way, and rewrite them as appropriate.
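A small side-by-side sketch of the attribute gotcha, using xml.dom.minidom as the DOM implementation. The sample document and the "guest" default are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = '<user name="ann"/>'  # hypothetical sample document

dom_node = minidom.parseString(XML).documentElement
et_node = ET.fromstring(XML)

# An attribute that exists: both APIs return its value.
dom_name = dom_node.getAttribute("name")
et_name = et_node.get("name")

# An attribute that is missing: the DOM returns an empty string,
# ElementTree returns None unless a default is supplied.
dom_missing = dom_node.getAttribute("role")
et_missing = et_node.get("role")
et_missing_with_default = et_node.get("role", "")

# A typical hasAttribute test and its ElementTree rewrite.
if dom_node.hasAttribute("role"):
    dom_role = dom_node.getAttribute("role")
else:
    dom_role = "guest"
et_role = et_node.get("role", "guest")

# A real dict of the attributes, when one is needed.
attrs = dict(et_node.attrib)
```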

Text

Text is where things start getting strange. Text in the DOM is stored in a TextNode node type. In ElementTree, text is stored in the node's text and tail attributes. To find all the text in a DOM node, you iterate over the children of the node looking for TextNodes, and get the text from their data attribute.

In ElementTree, the text attribute holds any text that immediately follows the opening tag. The tail attribute holds any text that immediately follows the closing tag. So the equivalent process is to get the node's text value, then walk the child nodes, collecting their tail values.

In well-designed schemas (which don't have what is known as mixed content, with both text and nodes as children), both processes are much easier. The DOM pattern is to get the first child, make sure it's a text node, and then use its data attribute. In ElementTree, you can just use the text attribute.
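The two text-gathering processes can be sketched as follows, again with xml.dom.minidom on the DOM side. The helper names dom_direct_text and et_direct_text and the sample markup are invented for this example.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<p>Hello <b>big</b> wide <i>old</i> world</p>"  # mixed content sample

def dom_direct_text(node):
    """Join the data of every text node that is a direct child of node."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == minidom.Node.TEXT_NODE
    )

def et_direct_text(node):
    """Same text, built from node.text plus the tail of each child."""
    parts = [node.text or ""]  # text and tail are None when empty
    parts.extend(child.tail or "" for child in node)
    return "".join(parts)

dom_text = dom_direct_text(minidom.parseString(XML).documentElement)
et_text = et_direct_text(ET.fromstring(XML))
```

Note the `or ""` guards: text and tail are None rather than an empty string when there is nothing there.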
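For the simple case without mixed content, the two patterns look roughly like this. The sample element is made up, and the empty-string fallbacks are my own choice rather than anything either API requires.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<title>Converting to etree</title>"  # hypothetical sample

# DOM: take the first child, check that it is a text node, read its data.
dom_node = minidom.parseString(XML).documentElement
first = dom_node.firstChild
if first is not None and first.nodeType == minidom.Node.TEXT_NODE:
    dom_text = first.data
else:
    dom_text = ""

# ElementTree: the text attribute is the whole story.
et_text = ET.fromstring(XML).text or ""

# An empty element has text set to None, not "".
empty_text = ET.fromstring("<title/>").text
```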

Elements

The easy part is the name of the tag: it's tagName in the DOM (or localName, if you want the name without any namespace prefix) and tag in ElementTree. However, there is a major gotcha in dealing with elements at all. A DOM node is always true. An ElementTree element is false if it has no children. Even if it has attributes or text, it will be false if there are no children. This is standard Python behavior for lists. It means that tests like if node: need to be rewritten as if node is not None:. On the other hand, testing for no children is simpler. The DOM has a childNodes attribute to provide a list of child nodes, including text nodes. With ElementTree, the node itself provides the Python list interface to the children.
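A sketch of the rewritten tests. The helpers exists and has_children are hypothetical names. The code deliberately avoids truth-testing an element, both because of the gotcha above and because, as far as I know, newer Python releases warn about it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <note> has text but no children.
root = ET.fromstring('<order id="7"><note>rush</note><items/></order>')

note = root.find("note")      # an element with text and no children
missing = root.find("nope")   # None, because nothing matched

def exists(node):
    """Replacement for the DOM-era 'if node:' test."""
    return node is not None

def has_children(node):
    """The element itself behaves like a list of its child elements."""
    return len(node) > 0

# List-style access to children, in place of childNodes.
first_child_tag = root[0].tag
child_tags = [child.tag for child in root]
```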

More importantly, the only way to search for nodes in the DOM is via the getElementsByTagName and getElementsByTagNameNS methods. These walk the entire subtree of the node, looking for that name.
ElementTree provides the getiterator method (spelled iter in newer versions, which have dropped the old name) that does pretty much what the DOM methods do. It takes a plain tag name. The limited XPath expressions that ElementTree supports are accepted by find, findall and iterfind instead. How limited those expressions are depends on the implementation, but intelligent use of them can move some checks into a C code library to improve performance.

There are two gotchas with getiterator. The first is that it can return an iterator instead of a list (up to the implementation), so you may need to pass it to list, depending on what's done with it and whether you want to maintain portability across ElementTree implementations. The second is that, unlike the DOM methods, it includes the current element in the search. So you need to ensure that the current element doesn't match the search expression, or skip it explicitly.
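Both gotchas can be handled in one small wrapper. This sketch uses iter, the name current ElementTree versions use for this method, and a hypothetical helper called descendants.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in which the root matches its own search.
root = ET.fromstring(
    "<section><section><para/></section><para/></section>"
)

def descendants(node, tag):
    """Mimic getElementsByTagName: matching descendants, never node itself.

    Returning a list also sidesteps the iterator-versus-list difference
    between implementations.
    """
    return [el for el in node.iter(tag) if el is not node]

all_sections = list(root.iter("section"))      # includes root itself
inner_sections = descendants(root, "section")  # root filtered out
paras = descendants(root, "para")
```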

However, if you know something about the schema, you may be able to improve performance here. ElementTree nodes have find and findall methods that return the first child matching the search, or a list of all such children, respectively. Both accept a tag name or a limited XPath expression. So, if you know that you're looking for a child node, these can be used instead of getiterator to save searching grandchildren, etc.
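To illustrate the difference in scope, here is a sketch with a made-up document in which one book is nested inside another. The full-subtree walk uses iter, the current name for the iterator method.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first <book> contains another <book>.
root = ET.fromstring(
    "<library>"
    "<book><title>A</title><book><title>Nested</title></book></book>"
    "<book><title>B</title></book>"
    "</library>"
)

# find / findall with a bare tag name look at direct children only.
child_books = root.findall("book")
first_book = root.find("book")

# A path expression can step down one level at a time...
first_title = root.find("book/title").text

# ...or ask for the whole subtree, like the iterator does.
all_books_by_path = root.findall(".//book")
all_books_by_iter = list(root.iter("book"))
```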

Odds 'n Ends

A few random things that don't fit in the above.

Adding Nodes

In the DOM, you add a node by using an appropriate createFoo method (createElement, createTextNode and so on) from the Document object it's going to be added to, then add it where you want it. For ElementTree, you should usually use the SubElement function. That will create the appropriate element and append it to the node's children.
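Building the same small tree both ways might look like this. The element and attribute names are invented, and xml.dom.minidom stands in for the DOM.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# DOM: create each node from the Document, then attach it.
doc = minidom.Document()
dom_root = doc.createElement("people")
doc.appendChild(dom_root)
dom_person = doc.createElement("person")
dom_person.setAttribute("name", "ann")
dom_person.appendChild(doc.createTextNode("hello"))
dom_root.appendChild(dom_person)
dom_xml = dom_root.toxml()

# ElementTree: SubElement creates the child and appends it in one step.
et_root = ET.Element("people")
et_person = ET.SubElement(et_root, "person", name="ann")
et_person.text = "hello"
et_xml = ET.tostring(et_root, encoding="unicode")
```

Passing encoding="unicode" to tostring asks for a str rather than encoded bytes.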

Comments and Processing Instructions

The ElementTree implementation in the standard library, by default, ignores comments and processing instructions when it parses XML. This can cause problems if you change implementations. It's also something to be wary of if your application uses those.
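A quick way to see the difference is to parse the same string with both libraries and look at the children of the root. The sample document is made up, and the comparison is against xml.dom.minidom with the standard library parser in its default configuration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<root><!-- note --><?target data?><item/></root>"  # hypothetical

# The DOM keeps the comment and the processing instruction as nodes.
dom_root = minidom.parseString(XML).documentElement
dom_kinds = [child.nodeType for child in dom_root.childNodes]

# The standard library ElementTree parser drops both by default.
et_root = ET.fromstring(XML)
et_tags = [child.tag for child in et_root]
```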