Custom Search

Sunday, March 17, 2013

Idiomatic C and Arduino

The few of you who read this blog regularly will be wondering about "C". Not Python, not some new-fangled functional language, but good old-fashioned C, the workhorse language of the last 30 years or so. That's because I'm talking Arduino, and C - sorta kinda - is the language of choice for that platform.

For those not familiar with it, Arduino is an open source hardware platform built around the ATMega micro-controller CPUs. The boards - whose design is covered by a Creative Commons Attribution Share-Alike license - provide the micro-controller and support circuitry to make using external devices easy, and have a standardized form factor for adding expansion hardware. There is also an IDE for writing code, for the C/C++ framework to run on the controller. The boards - or clones of them - are available for as little as $15 shipped

These are 8 bit CPUs with memory sizes (flash, eeprom, and ram) measured in K, or maybe 10s of K. There is no way the run time system of a modern language will run on them, though there is a forth system or two available, as well as some custom interpreters. You can use a modern language to control the hardware if you run it on the desktop with some kind of serial link between the devices. But normally you cross-compile a small language - like C or a DSL for hardware systems like Atom - for the thing. C isn't much of a reach - these systems aren't much less powerful than the machines C was developed on, though just enough so that a C compiler probably won't fit.

The project

As mentioned, the Arduino boards are designed to make it easy to use external hardware, and the standard examples start with LEDs. They're cheap, easy to hook up, and hard to screw up.

So, you give a geek a bunch of LEDs that can be controlled by a computer, and the first thing they're going to do is build a binary clock, unless they have a grid of LEDs, in which case they'll play life on them. Because that is, after all cool - unless you're not a geek, in which case it's geeky.

Once I noticed that there were wearable Arduino platforms available, I figured a binary wrist watch was a natural. So I built a prototype:

I'm not going to discuss the code here, but it has been added to the blog repository. What I want to discuss are the implementations of the binary clocks I found.

The code

None of the implementations felt like idiomatic C to me. Possibly the problem is with me - I learned C long enough ago that =+ was a valid operator, and this code is more C++ than C. Possibly it's the way hardware people (which is where Arduino's roots are) approach C. Possibly it's the result of learning programming with relatively modern languages instead of things like C and it's predecessors. But we're going to look at one function: display_value(uint8_t *, uint16_t)

display_value accepts a pointer to an array of pin numbers (Arduino output devices, which for this purpose can be set to either 0 or 1 to turn an LED off or on), and a 16 bit unsigned value to display on those LEDs. A common variation is to use a starting pin number and assume the pins are assigned in order, but this doesn't work well with lots of LEDs, as some pins have special capabilities, and wiring constraints may cause you to want to use pins in strange orders. The array of pin numbers allows them to be assigned to whatever pins are convenient.

One common version of the function (usually just coded in line) looks like so:
void
display_value(uint8_t *pins, uint16_t value) {

  for (int8_t i = 0; i < 8; i++) {
    digitalWrite(pins[i], value & (1 << i)) ;
  }
}

The digitalWrite function sets an output pin to the given value. This version is bad on the ATmega processors because they don't have multi-bit shift/rotate operations - you have to do them one bit at a time. So the 1 << i turns into an inner loop. The Arduino framework includes bit twiddling functions that would presumable avoid this, but they don't seem to be in common use, and aren't recommended in the language documentation about >> and <<. In fact, that the processors needs a loop to do variable multi-bit shifts isn't mentioned, either.

The second form avoids that, by using a mask variable that it modifies with each pass through the loop:
void
display_value(uint8_t *pins, uint16_t value) {

  for (uint8_t bit = 0x8, i = 0; bit; i++, bit >>= 1) {
    digitalWrite(pins[i], value & bit) ;
  }
}

This version doesn't have the inner loop for the shift, and when disassembled takes about half the instructions of the first version, and a bit less stack space.

The version I wrote as "natural" C changes the interface slightly, using a terminal zero in the array for the last element, then doing pointer arithmetic instead of an explicit index:
void
display_value(uint8_t *pins, uint16_t value) {

  for (; *pins; pins++, value >>= 1) {
    digitalWrite(*pins, value & 1) ;
  }
}

When disassembled, this one is slightly shorter and uses slightly less stack than the second version. However, if it were open coded, it would require another variable, or two, depending on whether or not value could be destroyed.

As I said, this feels natural to me. I learned to do array processing in C with pointer arithmetic. I can see that that might no longer be fashionable - after all, it is error-prone. However, that it allows me to avoid introducing two new variables seems to more than make up for that, as those introduce there own errors.

What I'm wondering is if there's some good reason to avoid the third version for Arduino projects, or modern C in general. Disassembling it didn't turn up any. Anyone else have an opinion?

Monday, December 31, 2012

3d printers and the thing store

A recent discussion about 3d printers led me to thinking about what they can do now, and what they might do in the future, and where it might go. I thought of something I hadn't seen mentioned before, and am now going to justify why I think it's the important part of how 3d printers are going to change the world.

Predicting the future is easy - it's going to be just like the past, except for what's different. The trick is figuring out what's going to be different.


The hardware

We're at the very start of the 3d printer revolution. Most home printers can only print in one material at a time, with low resolution, fixed color, and limited choices for materials. Commercial printers can print in three or more materials at a time, with much higher resolution, and in a wide variety of materials, ranging from the cheap plastics used by home printers, through stone and stone-like materials, to precious metals. Both are going to get better, but how much so?

The closest analogy I can think of to 3d printers is paper printers. My first paper printer was an Epson dot matrix printer, back in the mid '70s. It printed upper and lower case characters, using 9 lines of dots per character. Since then, they've added sufficient resolution to do graphics - the first generation of graphics printers being about a hundred dpi, the latest being well over a thousand, colors - going from a single cartridge that was either color (and unable to print a true black) or black that you swapped out by hand for black and white - to four or five or more inks. The printed material has gone from fan-fold perforated paper to multiple trays feeding in sheets that might can include photo paper in different grades. The quality has gone from being described as "unacceptable for a business letter" to photo-quality printing at home. Meanwhile the price has gone from the better part of a thousand dollars to practically giving them away in order to sell ink cartridges for it.

I can't see any reason that 3d printers shouldn't follow a similar path. So sometime in 20 to 40 years, I expect to be able to buy a 3d printer capable of printing in 3 to 5 different materials - and many common materials - at resolutions I can't distinguish, adding color to the material as it goes. And I expect to buy it for at most a few hundred dollars, depending on inflation - because the printer vendor is going to want to sell me the materials to print with.


And what we will do with it

That hardware evolution presumes that they are going to become nearly as common as paper printers, or computers. Which begs the question - what are all those people going to do with them?

Paper printers aren't a good thing to look at here, because when they first showed up, the potential market was minuscule, consisting of people geeky enough to buy one of those early, eight-bit computers that predated IBM entering the desktop computer market. Instead, let's look at media.

Computers and the internet have already made it possible to create things at home and then deliver them to mass audiences that used to require corporate-level assets to deliver to that audience, if not create in the first place. How have those affected us?

The first thing to note is that very few people actually create such works. Ok, there are hundreds of millions people posting status updates. There are probably a similar number writing blogs, a longer form. But how many are writing novels, or even short stories? Sure, it's more than it used to be - the ease of selling ebooks through amazon or similar markets makes self-publishing trivial and cheap. But I'd be surprised it it's a lot more - probably less than an order of magnitude total, especially if you count things that used to get published in fanzines or passed around by hand, set in a universe a professional author created. And most authors making money at it are still going through the traditional publishing houses, though some - most notably web comics - are experimenting with alternatives, and I fully expect that one or more of those will eventually displace the publishing houses. But the real difference is that the consumer can now purchase electronic copies online, to be delivered to their device instantly, and (if they are savvy shoppers) significantly cheaper than the hard copies. The online booksellers were already killing the brick and mortar bookstores, but ebooks are their death knell.

Music is similar. Yes, there are sites where you can find music produced by amateurs, but mostly it's still produced by people trying to make money at it. The number of people producing complete albums (which I admit are also threatened) is still relatively small. For most people, the difference that computers and the internet have brought to music is the ability to first, order cd's without having to go to a store, and more recently, download the music or just add it to a playlist for later streaming.

Video shows the same pattern: lots of people uploading short clips to sites like youtube, and another group creating vlogs (a contraction of "video blog", blog itself being a contraction of "web log") of some kind or another. A few are actually creating art, as opposed to commentaries or home videos - but they are few and far between. Even fewer are creating feature-length movies, though there are some leveraging the new technology to do that. Again, the real difference for the masses is the delivery mechanism. I can now purchase and download - or possibly stream - professionally produced video material over the network. Again, network based ordering was already killing the brick and mortar stores, but streaming is causing people to cut the cable that feeds their television.

Notice the pattern here? Early adopters were creators, because these technologies allowed them to do things they didn't otherwise have the resources to do. Sure, more people started creating as the tools get better and easier to use, but for most people it's the easier delivery system that's the important difference.

Again, I don't see anything that would make 3d printers different. The early adopters are creators, because these things make creating things much, much easier than it was before. But the mass market users are going to be consumers. People who will want to log into a thing store, click "buy" and have an object printed on their desk. Why pay extra for next day delivery when you can get it now? Not only that, it should cost less in total, because you won't have to pay shipping and handling through multiple intermediaries. Options to pick materials (or classes of materials) and color schemes may well be popular, but at this point I'm trying to predict details in a monetization scheme that doesn't exist yet, and the only accurate prediction I can make is that any predictions I make are almost certainly wrong.

But that's the future I see for 3d printers - online stores selling things to print on them. They will sound the death knell for any brick and mortar store competing on price. In other words, just like the past, except where it's different.

Saturday, December 22, 2012

Type checking for algorithmic errors

While listening to a recent podcast about Yesod - a web framework designed to be type safe - the host commented that he didn't buy into static type checking, because it was neither necessary nor sufficient, as it couldn't catch algorithmic bugs. While he's right that static typing isn't necessary - otherwise dynamic languages wouldn't work - or sufficient - it can't catch all bugs, he's wrong in that it can catch algorithmic bugs. At least some of them. The example he gave - that if you have a month and a day, the type system can't insure that the two are a valid date - is one such case. I threw together a quick example of that, which I'll present here.

Strategy

The idea is to encode the algorithm state in the data type. So there's an input data type, call it Step0Type. Then the processing at each step will accept data of type Step(N-1)Type and produce data of type StepNType. This means that if you have an error where you skipped a step on some piece of data, you'll try passing type Step(N-1)Type to a function expecting StepNType, so the type checker will catch the error.

Most practical examples only need one step. If you're passing strings to an external command processor - a shell, an SQL server or whatever - that's suspectable to to data injection attacks, having a type that's been sanitized against such attacks that the function that actually handles the IO requires will allow the type checker to flag attempts to pass unsanitized commands to the external processor. Yesod uses this technique for HTML, CSS and JavaScript text, each having a type indicating it's been quoted, to insure that each winds up only where it belongs, and properly quoted.

For the date example, the input type is three values consisting of a year, month and day of the month. The output data type represents a date known to be valid. You then only allow users of the type to access the function that that creates valid dates. So, we're going to write a Date type that holds the year, month and day, and a date function that takes a year, month and day as arguments, verifies that they represent a valid date and returns the appropriate Date. The package doesn't export the primitive Date constructor, but does export the function that constructs only valid dates, so that any Date objects that appear in client code will be valid. Functions declared to accept Date parameters will never see an invalid date.

Code

The data types:
type Year = Int
type Day = Int

data Month = Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
           deriving (Show, Eq, Ord, Enum)

data Date = Date Year Month Day deriving (Show, Eq, Ord)
These are pretty generic declarations. The only typing information is that Year and Day are introduced as aliases for Int. Now the function that checks that a date is valid and returns the Date object, plus a couple of helpers:
date :: Year -> Month -> Day -> Date
date year month day | year == 1752 && month == Sep && day > 2 && day < 14 =
  error "Date not in US calendar."
                    | day < 1 = error "Month days start at 1."
                    | day > daysInMonth year month = error "Day not in month."
                    | otherwise = Date year month day

daysInMonth :: Year -> Month -> Day
daysInMonth year month | month `elem` [Jan, Mar, May, Jul, Aug, Oct, Dec] = 31
                       | month `elem` [Apr, Jun, Sep, Nov] = 30
                       | month == Feb = if isLeapYear year then 29 else 28

isLeapYear :: Year -> Bool
isLeapYear year | year > 1752 && year `mod` 400 == 0 = True
                | year > 1752 && year `mod` 100 == 0 = False
                | otherwise = year `mod` 4 == 0
The US converted to the Gregorian calendar in September of 1752. Before that, every fourth year was a leap year. IsLeapYear reflects that. The date was also adjusted, so that the day after 1752 Sep 2 is 1752 Sep 14. date checks for that first, then that the day is in the given month.

Finally, to control the export, the module starts with:
-- A date type that enforces the US calendar restrictions on dates.

module Date (Date, Month(..), date, year, month, day) where
This explicitly exports the Date and Month
types, along with the constructors for Month, the date function described above, and some getters that we haven't discussed. Most notably, it does not export the Date constructor, so that the only way client code can construct a Date is to use the date function.

While I wouldn't recommend this as a real, general-purpose date package, the error handling is simply to raise an exception, which is probably not the best choice, the entire file - including exporting the getters for the date type - can be found in the googlecode repository listed on the right, in the haskell directory.

Friday, November 23, 2012

The changing nature of my identity

When I was in school, a large chunk of my self-identification
was as a geek and a hacker. The former was more accurately a
"computer geek", and the latter was something I was proud to be
called by talented computer people. As time has gone by,
those labels have been redefined in ways that I don't really
identify with. So I'm going to rant a little about how programmers
have had their identity appropriated.


Crackers

Hackers were originally people who modified things
("hacked" them) to make them useful in ways the creators hadn't
anticipated. Computers, being the ultimate malleable tool, were
also ultimately attractive to hackers. A computer could be hacked
by the simple act of writing a program for it. More interesting
hacks required adding or modifying the hardware, but that almost
always required software as well.

Eventually, the bare phrase hacker came to be synonymous with
computer hacker. A small subset of these people used their
skills for criminal - or, more usually, delinquent -
purposes. Unfortunately, breaking into a computer owned by a major
corporation or government - no matter how little value it or it's
contents had - is more newsworthy than anything that merely made
computers faster or more capable. So those are the people the news
media applied the term to.

This had two problems. First, most of the media-labeled
hackers don't really have any hacking skills. They are - or were -
juvenile delinquents, doing the equivalent of tagging
computers. Yes, a few are truly talented, just as a few graffiti
artists are also talented artists. Second, the media failed to
notice that the vast majority of hackers only used their skills
for good.

What the media did was sort of like using mechanic to mean
car thief. Yes, some car thieves are talented mechanics, but by
no means all of them. And most mechanics only use their talents to
fix cars for people, not steal them from people.

In any case, I continue referring to skilled friends as
hackers, and explaining to anyone who asks why I hang out with
criminals how the media misused the phrase.


Faux Geeks


Faux Geek Girls

Just to clarify, there's a meme current on the internet flaming
about women who put on sexy outfits and go to cons. My complaint
is not about such women.

Apparently, some people feel that they are just pretending to
be interested in things to attract the attention of men who can do
those things. There were women like that around when I was in
school. We called them either cheerleaders or groupies,
depending on what the men did. If geeks now have groupies, I'm
sad. Because it didn't happen when I was a lad! Any man who
objects to such behavior is an idiot, and should have his right to
reproduce revoked, no matter how unlikely it is that he would
exercise it.

Alternatively, these women somehow fail to meet some standard
that that person has for what it means to be geek. OK, that one
I agree with - I'll get to that soon - but I'm still not going to
complain about women who want to put on sexy costumes and attend
conventions! Complaining about that is still an idiotic idea.


Faux Geeks

A geek is, of course, someone who bites the heads off chickens
in a carnival sideshow. OK, they've also had their identity
stolen. When I was in school, it was used to denote anybody who
did things that most of the public considered about as repulsive
as biting the heads off chickens, especially if they obsessed
about it.

What that meant in practice was that people who enjoyed working
with complicated problems - to the point of obsessing about them -
were called geeks. Of course, you pretty much had to
obsess about these problems in order to get them right. If you
tried to let the details take care of themselves - well, not
getting the O-rings on your spaceship right could make it blow up
on launch. Likewise, you couldn't pay attention to pennies and
let the pounds take care of themselves
. If you built a networked
mail system that was fast and reliable but had no security structure,
your servers wind up spending most of their time delivering spam.

Unsurprisingly, we brought that same obsession to our
entertainments. While in some cases it led to pastimes that
required that kind of obsession and didn't attract others, in
other cases it led to past-times that were avoided by others
simply because they had the geek label.

Faux Geeks are into those pastimes - possibly even
obsessively - without applying that obsessiveness to anything
constructive. Knowing the plot to all 79 original Star Trek
episodes - or all 83 Lost in Space episodes - isn't constructive,
and probably not a critical skill for any job. The only skill
displayed is memorization - and not even a particularly impressive
feat of that. Professions not generally regarded as requiring
intelligence are liable to require more memorization skills than
that. Some football players, for instance, have to memorize over
200 offensive plays as well which each of them can be changed to
at the line of scrimmage - and that list is liable to change on a
weekly basis. And it's also clear which of those is worth more: I
don't know of any job that pays for knowing about old scifi
shows. Learning football plays can pay for your education, and
people get paid millions of dollars a year for jobs that include
doing that.

I'm ecstatic that there are now lots of people into these
pastimes. That means there are lots more people consuming those
products. So there's lots more money to be made. So more people
are making them, and more people can spend their full time doing
so. I do wish it had happened sooner, but hey, late is better than
never.

No, what I'm annoyed about is that - once again - part of my
self-identification is being diluted. If something is labeled as,
say "Geeking out", I can no longer expect to find intellectually
challenging technological tidbits in it. Instead, I find games,
and movie or book reviews, or other lightweight fair that just
isn't what I'd call geeky.

In this case, I no longer call myself a geek, because - well,
the term as currently used may apply, but it's not the identity I
want.

Tuesday, August 21, 2012

Monads in Python

Just a quick note based on a recent Aha moment, showing
that learning Haskell is improving my Python coding.

C# and Python's string.find

I'm translating some C# into Python for a client, because -
well, C# just isn't suited to big data problems. They keep running
into problems with it, and asked me to come aboard to redo this in
Python. This code involves lots of string processesing. It
translates from C# into Python by changing the method names and
block delimiters, so I get lots of things like:


i = data.find('target1')
if i > -1:
    i = data.find('target2', i + 7)
    if i > -1:
        i2 = data.find('targetend', i + 7)
 if i > -1:
     result = data[i + 7:i2]

Yeah, it's ugly. But I recognized the pattern: that's just
Haskell's Maybe Monad! string.find returns a value of
type Maybe Int, with -1 as the
Nothing value.

Haskell's Maybe monad and string.index

Haskell has a better way to string together the result of Maybe
functions that cause any maybe to skip to the end, using the
>> operator. But Python also has a better way, using
string.index as the function returning a value of
type Maybe Int. This lets me rewrite the code as:


try:
    i = data.index('target1')
    i = data.index('target2', i + 7)
    i2 = data.index('targetend')
    result = data[i + 7:i2]
except ValueError:
    pass

Here, the Nothing value is the
ValueError exception. For absolute technical
correctness, the assignment to result ought to be in the
else: clause of the try statement, but
either way reads better than the string of if's
crawling across the page


Friday, April 20, 2012

Processes - a modern approach to concurrency

I seem to keep running into an assumption that the only solution for concurrent operations is having multiple threads of control in a shared address space - aka "threading". This is in spite of this model having the rather nasty downside of exposing everything in memory to shared access and modification, making it very easy to inadvertently write ill-behaved code. It's as easy as linking with a standard library instead of the special "thread-safe" version of that same library - assuming there is a "thread-safe" version of the library!

Letting concurrent operations share address space is an old model. There is a modern alternative that people either dismiss or aren't aware of that doesn't suffer from the problems mentioned above. Unfortunately, more primitive operating systems don't support it very well, so languages and libraries that want to be cross-platform tend to use or recommend the more primitive threading model. I suspect this is why the assumption of it being the only solution is so widespread. I'm going to write this as if you didn't know about the alternative, even though it's very likely most of you do.

Some history

The shared address space model used by threads is very old. Early computers didn't support multiple address spaces, so it was forced on them. A typical system would divide the address space into parts, some for use by the system and some for use by user programs, and you could only have one normal user program running at a time - because there was only one address space.

Of course, there are user programs that you want to always be running, or to run concurrently with some other programs. Such programs could be written to run in the existing shared address space, but had to be written specifically to do so. There were lots of issues with doing this, so the less said about this form of concurrency, the better.

When hardware support for multiple address spaces was first introduced, systems still had address spaces shared between the user and system, and concurrent operations were still painful. It was mostly used to support simultaneous users, rather than concurrent operations in a single application. If you wanted that, you still had no choice but to write code specifically intended to share a single address space. It might be possible to create a new address space, but that was hideously expensive and convoluted, and communications between them was painful at best.

As an aside, some systems did make things a bit easier - they insisted that user programs be position independent code, so they could be loaded anywhere in the shared address space. This meant that concurrent operations didn't require special code, as all user programs were special.

The breakthrough idea is putting each user program in it's own address space, and making communications between them cheap and easy. To be practical, this meant that creating a new address space also had to be cheap and easy, at least compared to doing the same thing on earlier systems. The resulting things are known as processes.

The cost of processes for concurrency

While non-shared address spaces for the concurrent processing does solve the problems that come from everything being shared, they do come with some extra costs. Those aren't as great as they appear at first glance.

Creation costs

In a modern system, a new process is created by copying the parent. This involves the cost of creating a new thread of control - which has to be paid if the address space is shared or not. It also involves creating a copy of the descriptors for the data space, making the new ones copy on write. This isn't very expensive, as it all happens in memory. The cost will go up as the child writes to non-shared memory, since the system will automatically copy those parts of memory. This is probably more expensive than properly locking all access to that memory, but there is no way to create a locking bug. This is similar to garbage collection, where you pay with performance for having something else pick up unused memory, but in return you eliminate the possibility of bugs that use freed memory. This is generally considered a good trade off.

Memory costs

The threading model uses minimal extra memory. It just needs to allocate a new stack for each thread.

The process model doesn't create a new stack, but will create more stack as needed for each process, and copy the old data in the new process as it gets reused. More importantly, as objects get written to, it will create copies of the pages being written. This is an unavoidable cost of not sharing those objects.

For many applications, the latter isn't really a problem. Code won't be written to, and static data structures shouldn't be written to. However, if the language uses reference counting as part of it's garbage collection strategy, that could cause lots of writes that aren't really needed. On the other hand, every one of those writes represents a lock acquire and release in the threading model. So even in this case, it's the classic memory vs. performance trade off.

Data Sharing costs

The trade offs here are different than with threads. In some cases, processes have a lower cost. In other cases, they have a greater cost.

The base cost in the threading model is that every object that is written to by multiple threads needs to be identified, and either moved into thread-local storage (if it shouldn't be shared) or wrapped with the proper locks (if it should). Assuming you get all the locking correct (which, in my experience, is highly unlikely), every time you touch such data, it costs a little extra. You may wind up paying a lot if there's contention.

In the process model, data can be shared in three different ways. First, anything created in the parent process will be available to the child. Alterations made by the child won't be visible to the parent, but this model handles a surprising number of cases. In particular, configuration, user preference, and similar things all work well this way.

Second, data can be communicated over an external channel of some sort. For small bits of memory, they just get serialized and passed back and forth in memory. For data that needs to be passed in one direction, this works well. This model of communicating sequential processes is very popular, even in systems that use threading for concurrency.

If the data is very large - so that it won't reasonably fit in memory - then it winds up stored on disk. In this case, the costs are the same whether you have a shared address space or not, as the communication will generally happen via disk. You can even get the effect of sharing the OS object for the open file if the parent creates the object before creating the child - at least in an OS that properly supports this model of concurrency. That's not usually desirable.

Finally, the processes can explicitly arrange to share a chunk of memory. This will need locks just like the threading model. It may also require that the objects stored in it be position independent, unless the processes can negotiate an address for the memory. The major downside here is that support for such sharing in most high-level language isn't very good - at least not yet. It does work very well for arrays of machine primitive types, which is a common in cases that have lots of shared data.

Another benefit

Actually, another major downside of a number of these models is that they can't work between processes that aren't on the same machine. This means that if you need more concurrency than one machine can comfortable handle, you lose. If you're building a truly scaleable system, you'll chose one of the methods that can work between machines. But the threading model is unusable in this situation.

Summary

My point isn't that processes are the solution for concurrency. My point is that they are a perfectly acceptable solution for many problems needing concurrency. They are a more modern approach than the shared address space model of threads, and like many such things (garbage collection, run-time type checking, lazy evaluation, STM, etc.) they make the programmers job easier at some cost in performance and flexibility.

My point is that, if you automatically start using threads when you need concurrent operations, you're ignoring an important advance in software development. If you assume that processes can't meet your performance requirement without measuring it, you've committed` the error of premature optimization.

So next time you need concurrent operations, consider using processes before oping the can of worms that threading represents.

Monday, April 16, 2012

Converting Python from the dom to etree

Overview

As part of a technology upgrade for an application, I recently rewrote a collection of Python classes to use xml.etree instead of xml.dom. The application used pretty much all of the libraries: reading in xml strings to create in-memory representations, then searching them for various values. Taking a non-xml data structure and creating an xml representation of that, then writing it out in a string for future use.

Before starting, I went looking for a document discussing this kind of rewrite, in hopes of avoiding any gotchas that might be lurking in the process. I was a bit surprised to find - well, not find - such a document. Either this process isn't all that common, or there aren't any gotchas. In either case, it's complicated enough that I think it's worth discussing the differences.

Converting code from using the DOM to using an ElementTree API is relatively straightforward. I'm going to cover the major details, pointing out any gotchas I've found. However, this is not a guide to either ElementTree or the DOM. It should be sufficient for most such conversions, but you may well be able to improve a specific bit of code if you study the ElementTree documentation.

The ElementTree API was designed for Python, so you are unlikely to find it elsewhere. Objects provide standard Python API's to access related parts of the DOM, rather than having methods and/or attributes from the DOM spec for that.

There are mutiple implementations of the API. Recent cPython versions ship with ElementTree and cElementTree. lxml provides an ElementTree implementation based on the libxml C library, and thus includes both fast XML processing and features that aren't available in the included versions.

Extracting data

Data in an XML tree comes in three forms:
  1. Element nodes, which hold everyting.
  2. Text in a node.
  3. Attributes.
The changes get harder to deal with as you go down the list, so I'll deal with them in reverse order.


Attributes

Attributes are easy. Instead of getAttribute, you have get. getAttributeNode (and similar) don't exist in ElementTree. The attributes in ElementTree look like dictionary items on the node - except for lack of __getitem__ and the ability to iterate over them (which means that in doesn't work on them!). If you really want a dict, you can get it from the nodes attrib element.

The gotcha here is that getAttribute returns an empty string if the attribute doesn't exist. get is the standard dictionary function, and has a second argument (which defaults to None) that is returned if the attribute isn't there. So watch for hasAttribute tests used to deal with the default value in some way, and rewrite them as appropriate.

Text

Text is where things start getting strange. Text in the DOM is stored in a TextNode node type. In ElementTree, text is stored in the nodes text and tail attributes. To find all the text in a DOM node, you iterate over the children of the node looking for TextNodes, and get the text from their data attribute.

In ElementTree, the text attribute holds any text that immediately follows the tag opening. The tail attribute holds any text that immediately follows the tag closeing. So the equivalent process is to get the nodes text value, then walk the child nodes, collecting their tail values.

In well-designed schemas (which don't have what is known as mixed content, with both text and nodes as children), both processes are much easier. The DOM pattern is to get the first child, make sure it's a text node, and then use it's data attribute. In ElementTree, you can just use the text attribute.

Elements

The easy part is the name of the tag: it's tagName or name in the DOM (depending on exactly what you want) and tag in ElementTree. However, there is a major gotcha in dealing with elements at all. A DOM node is always true. An ElementTree element is false if it has no children. Even if it has attributes or text, it will be false if there are no children. This is standard Python behavior for lists. It means that tests like if node: need to be rewritten as if node is not None:. On the other hand, testing for no children is simpler. The DOM has a childNodes attribute to provide a list of child nodes - including text nodes. With ElementTree, the node itself provides the Python list interface to the children.

More importantly, the only way to search for nodes in the DOM is via getElementsByTagName and getElementsByTagnameNS methods. These walk the entire subtree of the node, looking for that name.
ElementTree provides the getIterator method that does pretty much what the DOM methods do. It is somewhat more powerfull, in that it accepts limited XPath expressions as well as simple tag names. How limited the XPath expressions are depends on the implementation, but intelligent use of these can move some checks into a C code library to improve performance.

There are two gotchas with getIterater. The first is that it can return an iterator instead of an iterable (up to the implementation), so you may need to pass it to list to get an iterable, depending on what's done with it and whether you want to maintain portability across ElementTree implementations. The second is that, unlike the DOM methods, it includes the current element in the search. So you need to insure that the current element doesn't match the search expression.

However, if you know something about the schema, you may be able to improve performance here. ElementTree nodes have find and findAll methods that return the first node matching the search, or an iterable over all such children, respectively. Matching is the same as for getIterator. So, if you know that you're looking for a child node, these can be used instead of getIterator to save searching grandchildren, etc.

Odds 'n Ends

A few random things that don't fit in the above.

Adding Nodes

In the DOM, you add a node by using an appropriate create*Foo* method from the Document object it's going to be added to, then add it where you want it. For ElementTree, you should usually use the the SubElement function. That will create the appropriate element and append it to the nodes children.

Comments and Processing Instructions

The ElementTree implementation in the standard library ignores processing instructions when it parses XML. This can cause problems if you change implementations. It's also something to be wary of if your application uses those.