Overview
As part of a technology upgrade for an application, I recently rewrote a collection of Python classes to use xml.etree instead of xml.dom. The application used pretty much all of the libraries' features: reading in XML strings to create in-memory representations and searching them for various values, as well as taking a non-XML data structure, creating an XML representation of it, and writing that out as a string for future use.
Before starting, I went looking for a document discussing this kind of rewrite, in hopes of avoiding any gotchas that might be lurking in the process. I was a bit surprised to find - well, not find - such a document. Either this process isn't all that common, or there aren't any gotchas. In either case, it's complicated enough that I think it's worth discussing the differences.
Converting code from using the DOM to using an ElementTree API is relatively straightforward. I'm going to cover the major details, pointing out any gotchas I've found. However, this is not a guide to either ElementTree or the DOM. It should be sufficient for most such conversions, but you may well be able to improve a specific bit of code if you study the ElementTree documentation.
The ElementTree API was designed for Python, so you are unlikely to find it elsewhere. Objects provide standard Python APIs to access related parts of the tree, rather than having methods and/or attributes from the DOM spec for that.
There are multiple implementations of the API. Recent CPython versions ship with ElementTree and cElementTree. lxml provides an ElementTree implementation based on the libxml C library, and thus includes both fast XML processing and features that aren't available in the included versions.
Extracting data
Data in an XML tree comes in three forms:
- Element nodes, which hold everything.
- Attributes on a node.
- Text in a node.
Attributes are easy. Instead of getAttribute, you have get. getAttributeNode (and similar) don't exist in ElementTree. The attributes in ElementTree look like dictionary items on the node - except for the lack of __getitem__ and the ability to iterate over them, both of which the node reserves for its child elements (which means that in doesn't work on attributes!). If you really want a dict, you can get it from the node's attrib attribute.
The gotcha here is that getAttribute returns an empty string if the attribute doesn't exist. get is the standard dictionary function, and has a second argument (which defaults to None) that is returned if the attribute isn't there. So watch for hasAttribute tests used to deal with the default value in some way, and rewrite them as appropriate.
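Here is a minimal sketch of the attribute differences, assuming xml.dom.minidom as the DOM implementation. The sample document, the "colour" attribute and the "red" default are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = '<item id="42"/>'  # hypothetical sample document

dom_node = minidom.parseString(XML).documentElement
et_node = ET.fromstring(XML)

# An attribute that exists: both APIs return the same string.
dom_id = dom_node.getAttribute("id")
et_id = et_node.get("id")

# An attribute that is missing: the DOM returns "", while ElementTree
# returns None unless a default is passed as the second argument.
dom_missing = dom_node.getAttribute("colour")
et_missing = et_node.get("colour")
et_default = et_node.get("colour", "")

# A DOM-style hasAttribute test used to supply a default...
if dom_node.hasAttribute("colour"):
    dom_colour = dom_node.getAttribute("colour")
else:
    dom_colour = "red"
# ...collapses into a single get call.
et_colour = et_node.get("colour", "red")

# The attrib attribute is a real dict when one is needed.
attrs = dict(et_node.attrib)
```

Code that relied on the DOM's empty-string result can keep that behavior by passing "" as the default.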
Text
Text is where things start getting strange. Text in the DOM is stored in a TextNode node type. In ElementTree, text is stored in the node's text and tail attributes. To find all the text in a DOM node, you iterate over the children of the node looking for TextNodes, and get the text from their data attributes.
In ElementTree, the text attribute holds any text that immediately follows the tag opening. The tail attribute holds any text that immediately follows the tag closing. So the equivalent process is to get the node's text value, then walk the child nodes, collecting their tail values.
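The two processes can be sketched side by side on a mixed-content node. This assumes xml.dom.minidom for the DOM side; the helper names dom_direct_text and et_direct_text and the sample markup are hypothetical.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<p>Hello <b>big</b> wide <i>open</i> world</p>"  # mixed content

def dom_direct_text(node):
    """Join the data of every text node that is a direct child of node."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )

def et_direct_text(elem):
    """Same result in ElementTree: elem.text, then each child's tail."""
    # text and tail are None when there is no text, hence the "or".
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)

dom_text = dom_direct_text(minidom.parseString(XML).documentElement)
et_text = et_direct_text(ET.fromstring(XML))
```

Both helpers skip the text inside the b and i children, since that text belongs to those elements rather than to the p node.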
In well-designed schemas (which don't have what is known as mixed content, with both text and nodes as children), both processes are much easier. The DOM pattern is to get the first child, make sure it's a text node, and then use its data attribute. In ElementTree, you can just use the text attribute.
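For the simple case without mixed content, a sketch of both patterns might look like this (again assuming xml.dom.minidom; the sample element is invented):

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = "<name>Guido</name>"  # hypothetical element holding only text

# DOM: fetch the first child, check that it is a text node, read its data.
dom_node = minidom.parseString(XML).documentElement
first = dom_node.firstChild
if first is not None and first.nodeType == first.TEXT_NODE:
    dom_value = first.data
else:
    dom_value = ""

# ElementTree: the text attribute already holds the value.
# It is None for an empty element, hence the "or".
et_value = ET.fromstring(XML).text or ""
```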
Elements
The easy part is the name of the tag: it's one of the name attributes in the DOM, such as tagName or nodeName (depending on exactly what you want), and tag in ElementTree. However, there is a major gotcha in dealing with elements at all. A DOM node is always true. An ElementTree element is false if it has no children. Even if it has attributes or text, it will be false if there are no children. This is standard Python behavior for lists. It means that tests like if node: need to be rewritten as if node is not None:. On the other hand, testing for no children is simpler. The DOM has a childNodes attribute to provide a list of child nodes - including text nodes. With ElementTree, the node itself provides the Python list interface to the children.
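A short sketch of these points, assuming xml.dom.minidom on the DOM side. The sample tree is invented, and the find method used to produce a missing result is covered further down. The code deliberately avoids evaluating an element's truth value and shows only the rewritten form of the test.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

XML = '<root><leaf kind="a">text</leaf><branch><leaf/></branch></root>'

dom_root = minidom.parseString(XML).documentElement
et_root = ET.fromstring(XML)

# The tag name: tagName in the DOM, tag in ElementTree.
dom_name = dom_root.tagName
et_name = et_root.tag

# The node behaves like a list of its child elements.
leaf = et_root[0]
leaf_child_count = len(leaf)  # 0, even though leaf has text and an attribute
et_children = [child.tag for child in et_root]

# The DOM equivalent has to filter childNodes, which can also hold text nodes.
dom_children = [
    child.tagName
    for child in dom_root.childNodes
    if child.nodeType == child.ELEMENT_NODE
]

# Existence tests must compare against None rather than use "if node:",
# because a childless element such as leaf counts as false.
missing = et_root.find("nope")
leaf_found = leaf is not None
missing_found = missing is not None
```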
More importantly, the only way to search for nodes in the DOM is via the getElementsByTagName and getElementsByTagNameNS methods. These walk the entire subtree of the node, looking for that name.
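As an illustration of the DOM side, here is a sketch using xml.dom.minidom with an invented document. It shows that the search reaches grandchildren and does not return the node it is called on.

```python
from xml.dom import minidom

XML = "<list><item>a</item><group><item>b</item></group></list>"
dom_root = minidom.parseString(XML).documentElement

# Both the direct child and the grandchild are returned, in document order.
items = dom_root.getElementsByTagName("item")
item_texts = [node.firstChild.data for node in items]

# The node the method is called on is not part of its own result,
# even when its tag name matches.
nested = minidom.parseString("<item><item>inner</item></item>").documentElement
nested_count = len(nested.getElementsByTagName("item"))
```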
ElementTree provides the getiterator method, which does pretty much what the DOM methods do, matching on a tag name. The standard library has since renamed it to iter and, as of Python 3.9, dropped the old name. The limited XPath expressions that ElementTree supports are handled by the find and findall methods described below. How limited the XPath expressions are depends on the implementation, but intelligent use of these can move some checks into a C code library to improve performance.
There are two gotchas with getiterator. The first is that it can return a one-shot iterator instead of a list (up to the implementation), so you may need to pass it to list before indexing it, taking its length, or looping over it twice, depending on what's done with it and whether you want to maintain portability across ElementTree implementations. The second is that, unlike the DOM methods, it includes the current element in the search. So you need to ensure that the current element doesn't match the search expression, or skip it in the results.
However, if you know something about the schema, you may be able to improve performance here. ElementTree nodes have find and findall methods that return the first child matching the search, or an iterable over all such children, respectively. Both accept simple tag names as well as limited XPath expressions. So, if you know that you're looking for a child node, these can be used instead of getiterator to save searching grandchildren, etc.
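The ElementTree search methods can be sketched as follows. This uses the standard library's xml.etree.ElementTree, where the subtree walk is currently spelled iter; the sample document is invented so that the root's own tag matches the search.

```python
import xml.etree.ElementTree as ET

XML = "<item><item>child</item><group><item>grandchild</item></group></item>"
root = ET.fromstring(XML)

# The subtree walk includes the starting element itself.
# Wrapping it in list makes the result reusable and indexable.
all_items = list(root.iter("item"))

# Skip the starting element to mimic the DOM methods.
descendants_only = [elem for elem in all_items if elem is not root]

# With a plain tag name, find and findall look at direct children only.
first_child = root.find("item")
child_items = root.findall("item")

# A limited XPath expression: ".//item" reaches descendants at any depth.
deep_items = root.findall(".//item")
```

When the schema guarantees that the wanted nodes are direct children, the plain-tag find or findall call avoids visiting the rest of the subtree.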
Odds 'n Ends
A few random things that don't fit in the above.
Adding Nodes
In the DOM, you add a node by using an appropriate create method (createElement, createTextNode, and so on) from the Document object it's going to be added to, then add it where you want it. For ElementTree, you should usually use the SubElement function. That will create the appropriate element and append it to the node's children.
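A sketch of building the same small tree both ways, assuming xml.dom.minidom for the DOM. The element names, the sku attribute and its value are invented for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# DOM: create each node from the Document, then attach it by hand.
doc = minidom.Document()
dom_root = doc.createElement("order")
doc.appendChild(dom_root)
dom_item = doc.createElement("item")
dom_item.setAttribute("sku", "A1")
dom_item.appendChild(doc.createTextNode("widget"))
dom_root.appendChild(dom_item)
dom_xml = dom_root.toxml()

# ElementTree: SubElement creates the child and appends it in one step.
# Attributes can be passed as keyword arguments; text is a plain attribute.
et_root = ET.Element("order")
et_item = ET.SubElement(et_root, "item", sku="A1")
et_item.text = "widget"
et_xml = ET.tostring(et_root, encoding="unicode")
```

Passing encoding="unicode" to tostring asks for a str rather than bytes, which makes the result easy to compare with the DOM's toxml output.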