Friday, April 20, 2012

Processes - a modern approach to concurrency

I seem to keep running into an assumption that the only solution for concurrent operations is having multiple threads of control in a shared address space - aka "threading". This is in spite of this model having the rather nasty downside of exposing everything in memory to shared access and modification, making it very easy to inadvertently write ill-behaved code. It's as easy as linking with a standard library instead of the special "thread-safe" version of that same library - assuming there is a "thread-safe" version of the library!

Letting concurrent operations share address space is an old model. There is a modern alternative that people either dismiss or aren't aware of that doesn't suffer from the problems mentioned above. Unfortunately, more primitive operating systems don't support it very well, so languages and libraries that want to be cross-platform tend to use or recommend the more primitive threading model. I suspect this is why the assumption of it being the only solution is so widespread. I'm going to write this as if you didn't know about the alternative, even though it's very likely most of you do.

Some history

The shared address space model used by threads is very old. Early computers didn't support multiple address spaces, so it was forced on them. A typical system would divide the address space into parts, some for use by the system and some for use by user programs, and you could only have one normal user program running at a time - because there was only one address space.

Of course, there are user programs that you want to always be running, or to run concurrently with some other programs. Such programs could be written to run in the existing shared address space, but had to be written specifically to do so. There were lots of issues with doing this, so the less said about this form of concurrency, the better.

When hardware support for multiple address spaces was first introduced, systems still had address spaces shared between the user and system, and concurrent operations were still painful. It was mostly used to support simultaneous users, rather than concurrent operations in a single application. If you wanted that, you still had no choice but to write code specifically intended to share a single address space. It might be possible to create a new address space, but that was hideously expensive and convoluted, and communication between address spaces was painful at best.

As an aside, some systems did make things a bit easier - they insisted that user programs be position independent code, so they could be loaded anywhere in the shared address space. This meant that concurrent operations didn't require special code, as all user programs were special.

The breakthrough idea was putting each user program in its own address space, and making communication between programs cheap and easy. To be practical, this meant that creating a new address space also had to be cheap and easy, at least compared to doing the same thing on earlier systems. The resulting construct is what we know as a process.

The cost of processes for concurrency

While non-shared address spaces for concurrent processing do solve the problems that come from everything being shared, they come with some extra costs. Those costs aren't as great as they appear at first glance.

Creation costs

In a modern system, a new process is created by copying the parent. This involves the cost of creating a new thread of control - which has to be paid whether or not the address space is shared. It also involves copying the descriptors for the data space, marking the new ones copy-on-write. This isn't very expensive, as it all happens in memory. The cost goes up as the child writes to non-shared memory, since the system automatically copies the affected parts of memory. This is probably more expensive than properly locking all access to that memory, but there is no way to create a locking bug. It's similar to garbage collection: you pay with performance for having something else pick up unused memory, but in return you eliminate the possibility of bugs that use freed memory. That's generally considered a good trade off.
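As a sketch of how cheap this is in practice, the Unix fork call (exposed in Python as os.fork) duplicates the calling process; the kernel copies only the page tables, marking the pages themselves copy-on-write. This example is Unix-only and just illustrates the mechanics:

```python
import os

pid = os.fork()  # duplicate the process; pages are shared copy-on-write
if pid == 0:
    # Child: starts with an identical view of memory. Any write it makes
    # causes the kernel to copy just the affected page.
    print("child pid:", os.getpid())
    os._exit(0)
else:
    os.waitpid(pid, 0)  # wait for the child to finish
    print("parent pid:", os.getpid())
```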

Memory costs

The threading model uses minimal extra memory. It just needs to allocate a new stack for each thread.

The process model doesn't allocate a new stack up front, but will create more stack as needed for each process, copying the old data as the inherited stack gets reused. More importantly, as objects get written to, it will create copies of the pages being written. This is an unavoidable cost of not sharing those objects.

For many applications, the latter isn't really a problem. Code won't be written to, and static data structures shouldn't be written to. However, if the language uses reference counting as part of its garbage collection strategy, that could cause lots of writes that aren't really needed. On the other hand, every one of those writes represents a lock acquire and release in the threading model. So even in this case, it's the classic memory vs. performance trade off.

Data Sharing costs

The trade offs here are different than with threads. In some cases, processes have a lower cost. In other cases, they have a greater cost.

The base cost in the threading model is that every object that is written to by multiple threads needs to be identified, and either moved into thread-local storage (if it shouldn't be shared) or wrapped with the proper locks (if it should). Assuming you get all the locking correct (which, in my experience, is highly unlikely), every time you touch such data, it costs a little extra. You may wind up paying a lot if there's contention.

In the process model, data can be shared in three different ways. First, anything created in the parent process will be available to the child. Alterations made by the child won't be visible to the parent, but this model handles a surprising number of cases. In particular, configuration, user preference, and similar things all work well this way.
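A sketch of that first kind of sharing, using Python's multiprocessing module (the CONFIG dict and worker function here are made up for illustration): data built by the parent is simply visible in the child, and the child's changes stay private to it:

```python
import multiprocessing as mp

# Hypothetical configuration built by the parent before any children exist.
CONFIG = {"verbose": True, "retries": 3}

def worker(q):
    # The child starts with its own copy of CONFIG: reads cost nothing
    # extra, and writes are never visible to the parent.
    CONFIG["retries"] = 0
    q.put(CONFIG["retries"])

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    child_view = q.get()      # 0: the child's private change
    p.join()
    print(child_view, CONFIG["retries"])  # parent still sees 3
```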

Second, data can be communicated over an external channel of some sort. For small bits of memory, they just get serialized and passed back and forth in memory. For data that needs to be passed in one direction, this works well. This model of communicating sequential processes is very popular, even in systems that use threading for concurrency.
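The external-channel model can be sketched with multiprocessing's Pipe, which pickles objects on one end and rebuilds them on the other - no memory is shared at all:

```python
import multiprocessing as mp

def producer(conn):
    # The dict is serialized, written to the channel, and reconstructed
    # in the receiving process.
    conn.send({"result": [1, 2, 3]})
    conn.close()

if __name__ == "__main__":
    # duplex=False gives a one-way channel: (receiving end, sending end).
    recv_end, send_end = mp.Pipe(duplex=False)
    p = mp.Process(target=producer, args=(send_end,))
    p.start()
    print(recv_end.recv())  # {'result': [1, 2, 3]}
    p.join()
```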

If the data is very large - so that it won't reasonably fit in memory - then it winds up stored on disk. In this case, the costs are the same whether you have a shared address space or not, as the communication will generally happen via disk. You can even get the effect of sharing the OS object for the open file if the parent creates the object before creating the child - at least in an OS that properly supports this model of concurrency - though sharing that object (and with it things like the file offset) is not usually desirable.

Finally, the processes can explicitly arrange to share a chunk of memory. This will need locks just like the threading model. It may also require that the objects stored in it be position independent, unless the processes can negotiate an address for the memory. The major downside here is that support for such sharing in most high-level languages isn't very good - at least not yet. It does work very well for arrays of machine primitive types, which is common in cases that have lots of shared data.
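Python's multiprocessing.Array is one example of exactly this: a shared array of machine primitives, with an associated lock that the code must use, just as a threaded version would:

```python
import multiprocessing as mp

def bump(shared, n):
    for _ in range(n):
        # get_lock() guards the read-modify-write, exactly the kind of
        # locking a shared-address-space version would need.
        with shared.get_lock():
            shared[0] += 1

if __name__ == "__main__":
    shared = mp.Array("i", 1)  # one shared C int, starting at 0
    workers = [mp.Process(target=bump, args=(shared, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(shared[0])  # 4000
```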

Another benefit

Another major downside of some of these sharing methods - and of threading itself - is that they can't work between processes on different machines. This means that if you need more concurrency than one machine can comfortably handle, you lose. If you're building a truly scalable system, you'll choose one of the methods that can work between machines. The threading model is unusable in this situation.

Summary

My point isn't that processes are the solution for concurrency. My point is that they are a perfectly acceptable solution for many problems needing concurrency. They are a more modern approach than the shared address space model of threads, and like many such things (garbage collection, run-time type checking, lazy evaluation, STM, etc.) they make the programmer's job easier at some cost in performance and flexibility.

If you automatically start using threads when you need concurrent operations, you're ignoring an important advance in software development. If you assume that processes can't meet your performance requirements without measuring it, you've committed the error of premature optimization.

So next time you need concurrent operations, consider using processes before opening the can of worms that threading represents.

Monday, April 16, 2012

Converting Python from the dom to etree

Overview

As part of a technology upgrade for an application, I recently rewrote a collection of Python classes to use xml.etree instead of xml.dom. The application used pretty much all of both libraries: reading in XML strings to create in-memory representations, then searching them for various values; and taking a non-XML data structure, creating an XML representation of it, then writing that out as a string for future use.

Before starting, I went looking for a document discussing this kind of rewrite, in hopes of avoiding any gotchas that might be lurking in the process. I was a bit surprised to find - well, not find - such a document. Either this process isn't all that common, or there aren't any gotchas. In either case, it's complicated enough that I think it's worth discussing the differences.

Converting code from using the DOM to using an ElementTree API is relatively straightforward. I'm going to cover the major details, pointing out any gotchas I've found. However, this is not a guide to either ElementTree or the DOM. It should be sufficient for most such conversions, but you may well be able to improve a specific bit of code if you study the ElementTree documentation.

The ElementTree API was designed for Python, so you are unlikely to find it elsewhere. Objects provide standard Python APIs to access related parts of the tree, rather than having methods and/or attributes from the DOM spec for that.

There are multiple implementations of the API. Recent CPython versions ship with ElementTree and cElementTree. lxml provides an ElementTree implementation based on the libxml2 C library, and thus includes both fast XML processing and features that aren't available in the included versions.

Extracting data

Data in an XML tree comes in three forms:
  1. Element nodes, which hold everything.
  2. Text in a node.
  3. Attributes.
The changes get harder to deal with as you go down the list, so I'll deal with them in reverse order.


Attributes

Attributes are easy. Instead of getAttribute, you have get. getAttributeNode (and similar) don't exist in ElementTree. The attributes in ElementTree look like dictionary items on the node - except for the lack of __getitem__ and the ability to iterate over them (which means that in doesn't work on them!). If you really want a dict, you can get it from the node's attrib attribute.

The gotcha here is that getAttribute returns an empty string if the attribute doesn't exist. get is the standard dictionary function, and has a second argument (which defaults to None) that is returned if the attribute isn't there. So watch for hasAttribute tests used to deal with the default value in some way, and rewrite them as appropriate.
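A minimal illustration of that difference in defaults, using a made-up element:

```python
import xml.etree.ElementTree as ET

node = ET.fromstring('<server host="example.org"/>')

# DOM: node.getAttribute("port") would return "" for a missing attribute.
# ElementTree's get returns None, unless you supply your own default:
print(node.get("host"))        # example.org
print(node.get("port"))        # None
print(node.get("port", "80"))  # 80
```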

Text

Text is where things start getting strange. Text in the DOM is stored in a TextNode node type. In ElementTree, text is stored in the node's text and tail attributes. To find all the text in a DOM node, you iterate over the children of the node looking for TextNodes, and get the text from their data attribute.

In ElementTree, the text attribute holds any text that immediately follows the tag opening. The tail attribute holds any text that immediately follows the tag closing. So the equivalent process is to get the node's text value, then walk the child nodes, collecting their tail values.

In well-designed schemas (which don't have what is known as mixed content, with both text and nodes as children), both processes are much easier. The DOM pattern is to get the first child, make sure it's a text node, and then use its data attribute. In ElementTree, you can just use the text attribute.
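Both cases can be seen in a couple of lines (the sample documents here are invented):

```python
import xml.etree.ElementTree as ET

# Simple case: a leaf element's text is just its text attribute.
leaf = ET.fromstring("<name>etree</name>")
print(leaf.text)           # etree

# Mixed content: text before the first child is .text on the parent,
# text after each child is that child's .tail.
root = ET.fromstring("<p>one <b>two</b> three</p>")
print(repr(root.text))     # 'one '
print(root.find("b").text) # two
print(repr(root[0].tail))  # ' three'
```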

Elements

The easy part is the name of the tag: it's tagName or name in the DOM (depending on exactly what you want) and tag in ElementTree. However, there is a major gotcha in dealing with elements at all. A DOM node is always true. An ElementTree element is false if it has no children. Even if it has attributes or text, it will be false if there are no children. This is standard Python behavior for lists. It means that tests like if node: need to be rewritten as if node is not None:. On the other hand, testing for no children is simpler. The DOM has a childNodes attribute to provide a list of child nodes - including text nodes. With ElementTree, the node itself provides the Python list interface to the children.
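The truthiness gotcha and the list interface are easy to demonstrate with a small invented document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<doc><item id="1">text</item></doc>')
item = root.find("item")

# item has an attribute and text, but no children, so it is falsy.
print(bool(item))        # False
print(item is not None)  # True  -- the test you actually want

# The element itself acts as a list of its children.
print(len(root))         # 1
```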

More importantly, the only way to search for nodes in the DOM is via the getElementsByTagName and getElementsByTagNameNS methods. These walk the entire subtree of the node, looking for that name.

ElementTree provides the getiterator method (spelled iter in Python 2.7 and later) that does pretty much what the DOM methods do. It is somewhat more powerful, in that it accepts limited XPath expressions as well as simple tag names. How limited the XPath expressions are depends on the implementation, but intelligent use of these can move some checks into a C code library to improve performance.

There are two gotchas with getiterator. The first is that it can return an iterator rather than a list (it's up to the implementation), so you may need to pass the result to list, depending on what's done with it and whether you want to maintain portability across ElementTree implementations. The second is that, unlike the DOM methods, it includes the current element in the search. So you need to ensure that the current element doesn't match the search expression.

However, if you know something about the schema, you may be able to improve performance here. ElementTree nodes have find and findall methods that return the first node matching the search, or a list of all matching children, respectively. Matching is the same as for getiterator. So, if you know that you're looking for a child node, these can be used instead of getiterator to save searching grandchildren, etc.
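The difference in search scope is visible with a small nested document. This sketch uses the modern spelling iter(), which replaced getiterator() in current Pythons:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<lib><book><title>A</title></book><book><title>B</title></book></lib>"
)

# iter() walks the whole subtree (and would include root itself if the
# tag matched), so it finds titles two levels down.
print([t.text for t in root.iter("title")])  # ['A', 'B']

# findall()/find() only match direct children (or a limited XPath).
print(len(root.findall("book")))             # 2
print(root.find("book").find("title").text)  # A
```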

Odds 'n Ends

A few random things that don't fit in the above.

Adding Nodes

In the DOM, you add a node by using an appropriate createFoo method (createElement, createTextNode, and so on) from the Document object it's going to be added to, then adding it where you want it. For ElementTree, you should usually use the SubElement function. That will create the appropriate element and append it to the node's children.
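Building a small (invented) document this way takes one call per node:

```python
import xml.etree.ElementTree as ET

root = ET.Element("config")
# SubElement creates the child element and appends it to root in one step.
child = ET.SubElement(root, "option", name="verbose")
child.text = "true"

print(ET.tostring(root).decode())
# <config><option name="verbose">true</option></config>
```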

Comments and Processing Instructions

The ElementTree implementation in the standard library ignores comments and processing instructions when it parses XML. This can cause problems if you change implementations. It's also something to be wary of if your application uses those.