Sunday, December 30, 2012

3d printers and the thing store

A recent discussion about 3d printers led me to think about what they can do now, what they might do in the future, and where it all might lead. I thought of something I hadn't seen mentioned before, and am now going to justify why I think it's the key to how 3d printers are going to change the world.

Predicting the future is easy - it's going to be just like the past, except for what's different. The trick is figuring out what's going to be different.


The hardware

We're at the very start of the 3d printer revolution. Most home printers can only print in one material at a time, with low resolution, fixed color, and limited choices for materials. Commercial printers can print in three or more materials at a time, with much higher resolution, and in a wide variety of materials, ranging from the cheap plastics used by home printers, through stone and stone-like materials, to precious metals. Both are going to get better, but how much so?

The closest analogy I can think of to 3d printers is paper printers. My first paper printer was an Epson dot matrix printer, back in the mid '70s. It printed upper and lower case characters, using 9 lines of dots per character. Since then, printers have gained enough resolution to do graphics, the first generation managing about a hundred dpi and the latest well over a thousand. They've gained color, going from a single cartridge - either color (and unable to print a true black) or black, swapped out by hand - to four or five or more inks. The printed material has gone from fan-fold perforated paper to multiple trays feeding in sheets that can include photo paper in different grades. The quality has gone from being described as "unacceptable for a business letter" to photo-quality printing at home. Meanwhile, the price has gone from the better part of a thousand dollars to practically giving them away in order to sell ink cartridges.

I can't see any reason that 3d printers shouldn't follow a similar path. So sometime in 20 to 40 years, I expect to be able to buy a 3d printer capable of printing 3 to 5 different materials at a time - drawn from many common materials - at resolutions I can't distinguish, adding color to the material as it goes. And I expect to buy it for at most a few hundred dollars, depending on inflation - because the printer vendor is going to want to sell me the materials to print with.


And what we will do with it

That hardware evolution presumes that 3d printers are going to become nearly as common as paper printers, or computers. Which raises the question - what are all those people going to do with them?

Paper printers aren't a good thing to look at here, because when they first showed up, the potential market was minuscule, consisting of people geeky enough to buy one of those early, eight-bit computers that predated IBM entering the desktop computer market. Instead, let's look at media.

Computers and the internet have already made it possible to create things at home and then deliver them to mass audiences - things that used to require corporate-level assets to deliver, if not to create in the first place. How have those technologies affected us?

The first thing to note is that very few people actually create such works. Ok, there are hundreds of millions of people posting status updates. There are probably a similar number writing blogs, a longer form. But how many are writing novels, or even short stories? Sure, it's more than it used to be - the ease of selling ebooks through Amazon or similar markets makes self-publishing trivial and cheap. But I'd be surprised if it's a lot more - probably less than an order of magnitude total, especially if you count things that used to get published in fanzines or passed around by hand, set in a universe a professional author created. And most authors making money at it are still going through the traditional publishing houses, though some - most notably web comics - are experimenting with alternatives, and I fully expect that one or more of those will eventually displace the publishing houses. But the real difference is that the consumer can now purchase electronic copies online, to be delivered to their device instantly, and (if they are savvy shoppers) significantly cheaper than the hard copies. The online booksellers were already killing the brick and mortar bookstores, but ebooks are their death knell.

Music is similar. Yes, there are sites where you can find music produced by amateurs, but mostly it's still produced by people trying to make money at it. The number of people producing complete albums (which I admit are also threatened) is still relatively small. For most people, the difference that computers and the internet have brought to music is the ability to, first, order CDs without having to go to a store and, more recently, to download the music or just add it to a playlist for later streaming.

Video shows the same pattern: lots of people uploading short clips to sites like YouTube, and another group creating vlogs (a contraction of "video blog", blog itself being a contraction of "web log") of some kind or another. A few are actually creating art, as opposed to commentaries or home videos - but they are few and far between. Even fewer are creating feature-length movies, though there are some leveraging the new technology to do that. Again, the real difference for the masses is the delivery mechanism. I can now purchase and download - or possibly stream - professionally produced video material over the network. Again, network-based ordering was already killing the brick and mortar stores, but streaming is causing people to cut the cable that feeds their television.

Notice the pattern here? Early adopters were creators, because these technologies allowed them to do things they didn't otherwise have the resources to do. Sure, more people started creating as the tools got better and easier to use, but for most people it's the easier delivery system that's the important difference.

Again, I don't see anything that would make 3d printers different. The early adopters are creators, because these things make creating things much, much easier than it was before. But the mass market users are going to be consumers. People who will want to log into a thing store, click "buy" and have an object printed on their desk. Why pay extra for next day delivery when you can get it now? Not only that, it should cost less in total, because you won't have to pay shipping and handling through multiple intermediaries. Options to pick materials (or classes of materials) and color schemes may well be popular, but at this point I'm trying to predict details in a monetization scheme that doesn't exist yet, and the only accurate prediction I can make is that any predictions I make are almost certainly wrong.

But that's the future I see for 3d printers - online stores selling things to print on them. They will sound the death knell for any brick and mortar store competing on price. In other words, just like the past, except where it's different.

Saturday, December 22, 2012

Type checking for algorithmic errors

While listening to a recent podcast about Yesod - a web framework designed to be type safe - the host commented that he didn't buy into static type checking, because it was neither necessary nor sufficient, as it couldn't catch algorithmic bugs. He's right that static typing isn't necessary - otherwise dynamic languages wouldn't work - and that it isn't sufficient - it can't catch all bugs. But he's wrong about algorithmic bugs: the type system can catch at least some of them. The example he gave - that if you have a month and a day, the type system can't ensure that the two make a valid date - is one such case. I threw together a quick example of that, which I'll present here.

Strategy

The idea is to encode the algorithm state in the data type. So there's an input data type, call it Step0Type. Then the processing at each step will accept data of type Step(N-1)Type and produce data of type StepNType. This means that if you have an error where you skipped a step on some piece of data, you'll try passing type Step(N-1)Type to a function expecting StepNType, so the type checker will catch the error.

Most practical examples only need one step. If you're passing strings to an external command processor - a shell, an SQL server, or whatever - that's susceptible to data injection attacks, you can introduce a type meaning "sanitized against such attacks" and require that type in the function that actually does the IO. The type checker will then flag any attempt to pass an unsanitized command to the external processor. Yesod uses this technique for HTML, CSS and JavaScript text, each having a type indicating it's been quoted, to ensure that each winds up only where it belongs, and properly quoted.
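
For comparison with a dynamically typed language, here's a minimal sketch of that one-step idea in Python, checkable with an external static checker such as mypy; the names RawCommand, SafeCommand, sanitize and run_query are all invented for illustration:

from typing import NewType

# Two distinct types for the checker; both are plain str at run time.
RawCommand = NewType('RawCommand', str)
SafeCommand = NewType('SafeCommand', str)

def sanitize(cmd: RawCommand) -> SafeCommand:
    # Stand-in for real escaping; here it just doubles single quotes.
    return SafeCommand(cmd.replace("'", "''"))

def run_query(cmd: SafeCommand) -> None:
    # The only function that talks to the external processor.
    print('running:', cmd)

raw = RawCommand("SELECT name FROM users WHERE id = '42'")
run_query(sanitize(raw))  # accepted: the command went through sanitize
# run_query(raw)          # flagged by the checker: RawCommand is not SafeCommand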

For the date example, the input type is three values consisting of a year, month and day of the month. The output data type represents a date known to be valid. You then only allow users of the type to access the function that creates valid dates. So, we're going to write a Date type that holds the year, month and day, and a date function that takes a year, month and day as arguments, verifies that they represent a valid date, and returns the appropriate Date. The package doesn't export the primitive Date constructor, but does export the function that constructs only valid dates, so that any Date objects that appear in client code will be valid. Functions declared to accept Date parameters will never see an invalid date.

Code

The data types:
type Year = Int
type Day = Int

data Month = Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
           deriving (Show, Eq, Ord, Enum)

data Date = Date Year Month Day deriving (Show, Eq, Ord)
These are pretty generic declarations. The only typing information is that Year and Day are introduced as aliases for Int. Now the function that checks that a date is valid and returns the Date object, plus a couple of helpers:
date :: Year -> Month -> Day -> Date
date year month day | year == 1752 && month == Sep && day > 2 && day < 14 =
  error "Date not in US calendar."
                    | day < 1 = error "Month days start at 1."
                    | day > daysInMonth year month = error "Day not in month."
                    | otherwise = Date year month day

daysInMonth :: Year -> Month -> Day
daysInMonth year month | month `elem` [Jan, Mar, May, Jul, Aug, Oct, Dec] = 31
                       | month `elem` [Apr, Jun, Sep, Nov] = 30
                       | month == Feb = if isLeapYear year then 29 else 28

isLeapYear :: Year -> Bool
isLeapYear year | year > 1752 && year `mod` 400 == 0 = True
                | year > 1752 && year `mod` 100 == 0 = False
                | otherwise = year `mod` 4 == 0
The US converted to the Gregorian calendar in September of 1752. Before that, every fourth year was a leap year; isLeapYear reflects that. The calendar was also adjusted, so that the day after 1752 Sep 2 is 1752 Sep 14. date checks for that first, then that the day is in the given month.

Finally, to control the export, the module starts with:
-- A date type that enforces the US calendar restrictions on dates.

module Date (Date, Month(..), date, year, month, day) where
This explicitly exports the Date and Month
types, along with the constructors for Month, the date function described above, and some getters that we haven't discussed. Most notably, it does not export the Date constructor, so that the only way client code can construct a Date is to use the date function.

While I wouldn't recommend this as a real, general-purpose date package - the error handling simply raises an exception, which is probably not the best choice - the entire file, including the getters for the Date type, can be found in the googlecode repository listed on the right, in the haskell directory.

Friday, November 23, 2012

The changing nature of my identity

When I was in school, a large chunk of my self-identification
was as a geek and a hacker. The former was more accurately a
"computer geek", and the latter was something I was proud to be
called by talented computer people. As time has gone by,
those labels have been redefined in ways that I don't really
identify with. So I'm going to rant a little about how programmers
have had their identity appropriated.


Crackers

Hackers were originally people who modified things
("hacked" them) to make them useful in ways the creators hadn't
anticipated. Computers, being the ultimate malleable tool, were
also ultimately attractive to hackers. A computer could be hacked
by the simple act of writing a program for it. More interesting
hacks required adding or modifying the hardware, but that almost
always required software as well.

Eventually, the bare phrase hacker came to be synonymous with
computer hacker. A small subset of these people used their
skills for criminal - or, more usually, delinquent -
purposes. Unfortunately, breaking into a computer owned by a major
corporation or government - no matter how little value it or its
contents had - is more newsworthy than anything that merely made
computers faster or more capable. So those are the people the news
media applied the term to.

This had two problems. First, most of the media-labeled
hackers don't really have any hacking skills. They are - or were -
juvenile delinquents, doing the equivalent of tagging
computers. Yes, a few are truly talented, just as a few graffiti
artists are also talented artists. Second, the media failed to
notice that the vast majority of hackers only used their skills
for good.

What the media did was sort of like using "mechanic" to mean
"car thief". Yes, some car thieves are talented mechanics, but by
no means all of them. And most mechanics only use their talents to
fix cars for people, not steal them from people.

In any case, I continue referring to skilled friends as
hackers, and explaining to anyone who asks why I hang out with
criminals how the media misused the phrase.


Faux Geeks


Faux Geek Girls

Just to clarify, there's a meme current on the internet flaming
about women who put on sexy outfits and go to cons. My complaint
is not about such women.

Apparently, some people feel that these women are just pretending
to be interested in things to attract the attention of men who can do
those things. There were women like that around when I was in
school. We called them either cheerleaders or groupies,
depending on what the men did. If geeks now have groupies, I'm
sad. Because it didn't happen when I was a lad! Any man who
objects to such behavior is an idiot, and should have his right to
reproduce revoked, no matter how unlikely it is that he would
exercise it.

Alternatively, these women somehow fail to meet some standard
that the complainer has for what it means to be a geek. OK, that one
I agree with - I'll get to that soon - but I'm still not going to
complain about women who want to put on sexy costumes and attend
conventions! Complaining about that is still an idiotic idea.


Faux Geeks

A geek is, of course, someone who bites the heads off chickens
in a carnival sideshow. OK, they've also had their identity
stolen. When I was in school, it was used to denote anybody who
did things that most of the public considered about as repulsive
as biting the heads off chickens, especially if they obsessed
about it.

What that meant in practice was that people who enjoyed working
with complicated problems - to the point of obsessing about them -
were called geeks. Of course, you pretty much had to
obsess about these problems in order to get them right. If you
tried to let the details take care of themselves - well, not
getting the O-rings on your spaceship right could make it blow up
on launch. Likewise, you couldn't pay attention to pennies and
let the pounds take care of themselves. If you built a networked
mail system that was fast and reliable but had no security structure,
your servers would wind up spending most of their time delivering spam.

Unsurprisingly, we brought that same obsession to our
entertainments. While in some cases it led to pastimes that
required that kind of obsession and didn't attract others, in
other cases it led to pastimes that were avoided by others
simply because they had the geek label.

Faux Geeks are into those pastimes - possibly even
obsessively - without applying that obsessiveness to anything
constructive. Knowing the plot to all 79 original Star Trek
episodes - or all 83 Lost in Space episodes - isn't constructive,
and probably not a critical skill for any job. The only skill
displayed is memorization - and not even a particularly impressive
feat of that. Professions not generally regarded as requiring
intelligence are liable to require more memorization skills than
that. Some football players, for instance, have to memorize over
200 offensive plays, as well as which plays each of them can be
changed to at the line of scrimmage - and that list is liable to
change on a weekly basis. And it's also clear which of those is worth more: I
don't know of any job that pays for knowing about old scifi
shows. Learning football plays can pay for your education, and
people get paid millions of dollars a year for jobs that include
doing that.

I'm ecstatic that there are now lots of people into these
pastimes. That means there are lots more people consuming those
products. So there's lots more money to be made. So more people
are making them, and more people can spend their full time doing
so. I do wish it had happened sooner, but hey, late is better than
never.

No, what I'm annoyed about is that - once again - part of my
self-identification is being diluted. If something is labeled as,
say "Geeking out", I can no longer expect to find intellectually
challenging technological tidbits in it. Instead, I find games,
and movie or book reviews, or other lightweight fare that just
isn't what I'd call geeky.

In this case, I no longer call myself a geek, because - well,
the term as currently used may apply, but it's not the identity I
want.

Tuesday, August 21, 2012

Monads in Python

Just a quick note based on a recent Aha moment, showing
that learning Haskell is improving my Python coding.

C# and Python's string.find

I'm translating some C# into Python for a client, because -
well, C# just isn't suited to big data problems. They keep running
into problems with it, and asked me to come aboard to redo this in
Python. This code involves lots of string processing. It
translates from C# into Python by changing the method names and
block delimiters, so I get lots of things like:


i = data.find('target1')
if i > -1:
    i = data.find('target2', i + 7)
    if i > -1:
        i2 = data.find('targetend', i + 7)
        if i2 > -1:
            result = data[i + 7:i2]

Yeah, it's ugly. But I recognized the pattern: that's just
Haskell's Maybe Monad! string.find returns a value of
type Maybe Int, with -1 as the
Nothing value.

Haskell's Maybe monad and string.index

Haskell has a better way to string together the results of
Maybe-returning functions, where any Nothing skips straight to the
end, using the >>= operator. But Python also has a better way, using
string.index as the function returning a value of
type Maybe Int. This lets me rewrite the code as:


try:
    i = data.index('target1')
    i = data.index('target2', i + 7)
    i2 = data.index('targetend', i + 7)
    result = data[i + 7:i2]
except ValueError:
    pass

Here, the Nothing value is the
ValueError exception. For absolute technical
correctness, the assignment to result ought to be in the
else: clause of the try statement, but
either way reads better than the string of if's
crawling across the page.
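
For completeness, here's that variant:

try:
    i = data.index('target1')
    i = data.index('target2', i + 7)
    i2 = data.index('targetend', i + 7)
except ValueError:
    pass
else:
    result = data[i + 7:i2]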


Friday, April 20, 2012

Processes - a modern approach to concurrency

I seem to keep running into an assumption that the only solution for concurrent operations is having multiple threads of control in a shared address space - aka "threading". This is in spite of this model having the rather nasty downside of exposing everything in memory to shared access and modification, making it very easy to inadvertently write ill-behaved code. It's as easy as linking with a standard library instead of the special "thread-safe" version of that same library - assuming there is a "thread-safe" version of the library!

Letting concurrent operations share address space is an old model. There is a modern alternative that people either dismiss or aren't aware of that doesn't suffer from the problems mentioned above. Unfortunately, more primitive operating systems don't support it very well, so languages and libraries that want to be cross-platform tend to use or recommend the more primitive threading model. I suspect this is why the assumption of it being the only solution is so widespread. I'm going to write this as if you didn't know about the alternative, even though it's very likely most of you do.

Some history

The shared address space model used by threads is very old. Early computers didn't support multiple address spaces, so it was forced on them. A typical system would divide the address space into parts, some for use by the system and some for use by user programs, and you could only have one normal user program running at a time - because there was only one address space.

Of course, there are user programs that you want to always be running, or to run concurrently with some other programs. Such programs could be written to run in the existing shared address space, but had to be written specifically to do so. There were lots of issues with doing this, so the less said about this form of concurrency, the better.

When hardware support for multiple address spaces was first introduced, systems still had address spaces shared between the user and system, and concurrent operations were still painful. It was mostly used to support simultaneous users, rather than concurrent operations in a single application. If you wanted that, you still had no choice but to write code specifically intended to share a single address space. It might be possible to create a new address space, but that was hideously expensive and convoluted, and communications between them was painful at best.

As an aside, some systems did make things a bit easier - they insisted that user programs be position independent code, so they could be loaded anywhere in the shared address space. This meant that concurrent operations didn't require special code, as all user programs were special.

The breakthrough idea was putting each user program in its own address space, and making communications between them cheap and easy. To be practical, this meant that creating a new address space also had to be cheap and easy, at least compared to doing the same thing on earlier systems. The resulting things are known as processes.

The cost of processes for concurrency

While non-shared address spaces for concurrent processing do solve the problems that come from everything being shared, they do come with some extra costs. Those aren't as great as they appear at first glance.

Creation costs

In a modern system, a new process is created by copying the parent. This involves the cost of creating a new thread of control - which has to be paid whether or not the address space is shared. It also involves creating a copy of the descriptors for the data space, marking the new ones copy-on-write. This isn't very expensive, as it all happens in memory. The cost will go up as the child writes to non-shared memory, since the system will automatically copy those parts of memory. This is probably more expensive than properly locking all access to that memory, but there is no way to create a locking bug. This is similar to garbage collection, where you pay with performance for having something else pick up unused memory, but in return you eliminate the possibility of bugs that use freed memory. This is generally considered a good trade-off.
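
As a rough illustration, here's a minimal Unix-only sketch using Python's os.fork: the list built before the fork is shared copy-on-write, and only the pages the child writes to get copied:

import os

data = list(range(1000000))  # built before the fork; shared copy-on-write

pid = os.fork()
if pid == 0:
    # Child: reading is free; this write makes the OS copy only the
    # touched pages. (In CPython, reference-count updates also count
    # as writes - see the note on reference counting below.)
    data[0] = -1
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print(data[0])  # still 0: the child's write never reached the parent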

Memory costs

The threading model uses minimal extra memory. It just needs to allocate a new stack for each thread.

The process model doesn't allocate a new stack up front, but will create more stack as needed for each process, copying the old data into the new process as it gets used. More importantly, as objects get written to, it will create copies of the pages being written. This is an unavoidable cost of not sharing those objects.

For many applications, the latter isn't really a problem. Code won't be written to, and static data structures shouldn't be written to. However, if the language uses reference counting as part of its garbage collection strategy, that can cause lots of writes that aren't really needed. On the other hand, every one of those writes represents a lock acquire and release in the threading model. So even in this case, it's the classic memory vs. performance trade-off.

Data Sharing costs

The trade offs here are different than with threads. In some cases, processes have a lower cost. In other cases, they have a greater cost.

The base cost in the threading model is that every object that is written to by multiple threads needs to be identified, and either moved into thread-local storage (if it shouldn't be shared) or wrapped with the proper locks (if it should). Assuming you get all the locking correct (which, in my experience, is highly unlikely), every time you touch such data, it costs a little extra. You may wind up paying a lot if there's contention.

In the process model, data can be shared in three different ways. First, anything created in the parent process will be available to the child. Alterations made by the child won't be visible to the parent, but this model handles a surprising number of cases. In particular, configuration, user preference, and similar things all work well this way.

Second, data can be communicated over an external channel of some sort. Small bits of data just get serialized and passed back and forth in memory. For data that needs to be passed in one direction, this works well. This model of communicating sequential processes is very popular, even in systems that use threading for concurrency.
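
Here's a minimal sketch of that style using Python's multiprocessing module (the worker and queue names are mine): the config dict created in the parent is simply inherited by the child, while the work items travel serialized over the queues:

from multiprocessing import Process, Queue

config = {'scale': 3}  # created in the parent; the child inherits a copy

def worker(jobs, results):
    # Items arrive serialized over the channel, one direction per queue.
    while True:
        item = jobs.get()
        if item is None:  # sentinel: no more work
            break
        results.put(item * config['scale'])

if __name__ == '__main__':
    jobs, results = Queue(), Queue()
    child = Process(target=worker, args=(jobs, results))
    child.start()
    for n in (1, 2, 4):
        jobs.put(n)
    jobs.put(None)
    print([results.get() for _ in range(3)])  # [3, 6, 12]
    child.join()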

If the data is very large - so that it won't reasonably fit in memory - then it winds up stored on disk. In this case, the costs are the same whether you have a shared address space or not, as the communication will generally happen via disk. You can even get the effect of sharing the OS object for the open file if the parent creates the object before creating the child - at least in an OS that properly supports this model of concurrency - though sharing a file offset that way is not usually desirable.

Finally, the processes can explicitly arrange to share a chunk of memory. This will need locks just like the threading model. It may also require that the objects stored in it be position independent, unless the processes can negotiate an address for the memory. The major downside here is that support for such sharing in most high-level languages isn't very good - at least not yet. It does work very well for arrays of machine primitive types, which is common in cases that have lots of shared data.
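
And a minimal sketch of the explicit-sharing case, using multiprocessing's shared array of machine ints - note the locking, which is exactly the discipline threads would need:

from multiprocessing import Array, Lock, Process

def bump(counts, lock, idx):
    for _ in range(10000):
        with lock:  # shared writes need locks, just as with threads
            counts[idx] += 1

if __name__ == '__main__':
    counts = Array('i', 4)  # four machine ints in shared memory
    lock = Lock()
    kids = [Process(target=bump, args=(counts, lock, i)) for i in range(4)]
    for k in kids:
        k.start()
    for k in kids:
        k.join()
    print(list(counts))  # [10000, 10000, 10000, 10000]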

Another benefit

Actually, another major downside of a number of these models is that they can't work between processes that aren't on the same machine. This means that if you need more concurrency than one machine can comfortably handle, you lose. If you're building a truly scalable system, you'll choose one of the methods that can work between machines. The threading model is unusable in this situation.

Summary

My point isn't that processes are the solution for concurrency. My point is that they are a perfectly acceptable solution for many problems needing concurrency. They are a more modern approach than the shared address space model of threads, and like many such things (garbage collection, run-time type checking, lazy evaluation, STM, etc.) they make the programmer's job easier at some cost in performance and flexibility.

My point is that, if you automatically start using threads when you need concurrent operations, you're ignoring an important advance in software development. And if you assume that processes can't meet your performance requirements without measuring, you've committed the error of premature optimization.

So next time you need concurrent operations, consider using processes before opening the can of worms that threading represents.

Monday, April 16, 2012

Converting Python from the dom to etree

Overview

As part of a technology upgrade for an application, I recently rewrote a collection of Python classes to use xml.etree instead of xml.dom. The application used pretty much all of both libraries: reading in XML strings to create in-memory representations, then searching them for various values; and taking a non-XML data structure, creating an XML representation of it, then writing it out as a string for future use.

Before starting, I went looking for a document discussing this kind of rewrite, in hopes of avoiding any gotchas that might be lurking in the process. I was a bit surprised to find - well, not find - such a document. Either this process isn't all that common, or there aren't any gotchas. In either case, it's complicated enough that I think it's worth discussing the differences.

Converting code from using the DOM to using an ElementTree API is relatively straightforward. I'm going to cover the major details, pointing out any gotchas I've found. However, this is not a guide to either ElementTree or the DOM. It should be sufficient for most such conversions, but you may well be able to improve a specific bit of code if you study the ElementTree documentation.

The ElementTree API was designed for Python, so you are unlikely to find it elsewhere. Objects provide standard Python APIs to access related parts of the tree, rather than having methods and/or attributes from the DOM spec for that.

There are multiple implementations of the API. Recent CPython versions ship with ElementTree and cElementTree. lxml provides an ElementTree implementation based on the libxml2 C library, and thus includes both fast XML processing and features that aren't available in the included versions.

Extracting data

Data in an XML tree comes in three forms:
  1. Element nodes, which hold everything.
  2. Text in a node.
  3. Attributes.
The changes get harder to deal with as you go down the list, so I'll deal with them in reverse order.


Attributes

Attributes are easy. Instead of getAttribute, you have get. getAttributeNode (and similar) don't exist in ElementTree. The attributes in ElementTree look like dictionary items on the node - except for the lack of __getitem__ and the ability to iterate over them (which means that in doesn't work on them!). If you really want a dict, you can get it from the node's attrib attribute.

The gotcha here is that getAttribute returns an empty string if the attribute doesn't exist. get is the standard dictionary method, with a second argument (defaulting to None) that is returned if the attribute isn't there. So watch for hasAttribute tests used to deal with the default value in some way, and rewrite them as appropriate.
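
For example (the color attribute is invented), the conversion looks something like:

import xml.etree.ElementTree as ET

node = ET.fromstring('<room title="Lounge"/>')

# DOM version needed an explicit test, since a missing attribute
# came back as an empty string:
#   color = node.getAttribute('color') if node.hasAttribute('color') else 'black'

# ElementTree version: the default goes in the call.
color = node.get('color', 'black')
print(color)  # black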

Text

Text is where things start getting strange. Text in the DOM is stored in a TextNode node type. In ElementTree, text is stored in the node's text and tail attributes. To find all the text in a DOM node, you iterate over the children of the node looking for TextNodes, and get the text from their data attribute.

In ElementTree, the text attribute holds any text that immediately follows the tag opening. The tail attribute holds any text that immediately follows the tag closing. So the equivalent process is to get the node's text value, then walk the child nodes, collecting their tail values.

In well-designed schemas (which don't have what is known as mixed content, with both text and nodes as children), both processes are much easier. The DOM pattern is to get the first child, make sure it's a text node, and then use its data attribute. In ElementTree, you can just use the text attribute.
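
Roughly, the two patterns look like this side by side (the DOM version is left as a comment):

import xml.etree.ElementTree as ET

node = ET.fromstring('<p>one<b>two</b>three</p>')

# DOM pattern for all text directly under a node:
#   text = ''.join(c.data for c in node.childNodes
#                  if c.nodeType == c.TEXT_NODE)

# ElementTree: the node's text, plus the tail of each child.
text = (node.text or '') + ''.join(child.tail or '' for child in node)
print(text)  # onethree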

Elements

The easy part is the name of the tag: it's tagName or nodeName in the DOM (depending on exactly what you want) and tag in ElementTree. However, there is a major gotcha in dealing with elements at all. A DOM node is always true. An ElementTree element is false if it has no children - even if it has attributes or text. This is standard Python behavior for lists. It means that tests like if node: need to be rewritten as if node is not None:. On the other hand, testing for no children is simpler. The DOM has a childNodes attribute to provide a list of child nodes - including text nodes. With ElementTree, the node itself provides the Python list interface to the children.
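
A tiny demonstration of the gotcha:

import xml.etree.ElementTree as ET

root = ET.fromstring('<root><leaf name="x"/></root>')
leaf = root.find('leaf')

print(bool(leaf))        # False: no children, despite the attribute
print(leaf is not None)  # True: what converted tests should check
print(len(root))         # 1: the node itself acts as the child list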

More importantly, the only way to search for nodes in the DOM is via the getElementsByTagName and getElementsByTagNameNS methods. These walk the entire subtree of the node, looking for that name. ElementTree provides the getiterator method, which does pretty much what the DOM methods do. It is somewhat more powerful, in that it accepts limited XPath expressions as well as simple tag names. How limited the XPath expressions are depends on the implementation, but intelligent use of them can move some checks into a C code library to improve performance.

There are two gotchas with getiterator. The first is that it can return an iterator instead of an iterable (it's up to the implementation), so you may need to pass the result to list, depending on what's done with it and whether you want to maintain portability across ElementTree implementations. The second is that, unlike the DOM methods, it includes the current element in the search. So you need to ensure that the current element doesn't match the search expression.

However, if you know something about the schema, you may be able to improve performance here. ElementTree nodes have find and findall methods that return the first node matching the search, or an iterable over all matching children, respectively. Matching is the same as for getiterator. So, if you know that you're looking for a child node, these can be used instead of getiterator to save searching grandchildren, etc.
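
To make the difference concrete (tag names invented; later releases spell getiterator as iter):

import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<rooms><room name="lounge"><room name="inner"/></room></rooms>')

print([r.get('name') for r in root.findall('room')])
# ['lounge'] - direct children only

print([r.get('name') for r in root.getiterator('room')])
# ['lounge', 'inner'] - the whole subtree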

Odds 'n Ends

A few random things that don't fit in the above.

Adding Nodes

In the DOM, you add a node by using the appropriate create* method (createElement, createTextNode, and so on) on the Document object it's going to be added to, then adding it where you want it. For ElementTree, you should usually use the SubElement function. That creates the appropriate element and appends it to the node's children.
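
The conversion looks something like this:

import xml.etree.ElementTree as ET

# DOM style: create on the document, then attach.
#   elem = doc.createElement('room')
#   parent.appendChild(elem)

# ElementTree style: SubElement creates and appends in one step.
root = ET.Element('rooms')
lounge = ET.SubElement(root, 'room', name='lounge')
lounge.text = 'The Lounge'
print(ET.tostring(root))
# <rooms><room name="lounge">The Lounge</room></rooms>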

Comments and Processing Instructions

The ElementTree implementation in the standard library ignores comments and processing instructions when it parses XML. This can cause problems if you change implementations. It's also something to be wary of if your application uses those.

Tuesday, March 6, 2012

Configuration with Python class objects

Another old bit of writing. While the space used for the demonstration - web applications - is now thoroughly dominated by template engines and frameworks, the core concepts are still valid. Resurrected by request.

Abstract

The requirements for configuring a complex program include such things as separate namespaces for different parts of the program, the ability to configure an object as "like that one, only...", and the ability to create new configurations on the fly. In general, the configuration facility should provide programming language constructs for advanced users. All this should be possible while keeping the language simple enough for unsophisticated users.

Python provides a modern, object-oriented language for building configuration values, with a regular and straightforward syntax. Its class objects can hold configuration information in class variables. Classes can inherit from each other, and hence can be made to resemble each other, prototype-style. Finally, classes can be instantiated to create configuration objects at run time.

The configuration problem

Writing software configuration tools has always been a balancing act between the needs for power and flexibility, and the need to make program configuration simple enough for non-technical users. Much of this has been solved on the desktop with the appearance of standardized APIs and configuration wizards, but it has resurfaced with a vengeance in software for the WWW.

A typical web program consists of a collection of interfaces to data objects, which have similar but not identical configurations, accessed by users who have different capabilities - some can write, some can only read, and others can create - and different views of the same objects. Worse yet, some of the configured objects may need slightly different configurations, created dynamically at run time. This is all coupled with the need for the HTML generated by the application to be created by a nontechnical designer. Thus the HTML must be part of the configuration, not part of the source code. While a configuration wizard might be able to handle these needs, many WWW applications are custom built in a highly competitive environment that does not allow time for such frills.

Python provides a very effective tool for configuring such programs. While it may seem like overkill to use an object-oriented programming language as a configuration language, Python has a history of use as an embedded scripting language for non-technical users. See [Anderson], [Beazley] and [Hinsen] for recent examples. A little artful work with class statements allows the namespace to be partitioned. This leaves having to quote strings that would be unquoted tokens in a custom configuration language as the only serious wart.

Configuration objects as a solution

The fundamental goal of configuration is to allow the user to control the appearance and properties of the program's objects without having to write program code. In an object-oriented programming environment, the simple way to do this is to attach a configuration object to each program object, where the attributes of the configuration object are the values the program object uses to determine its appearance and behavior.

This leads to a parallel object structure, with the programmer creating one class hierarchy and the configuration creating the second. The objects created by the programmer define behavior. The objects the configuration creates select a behavior, and control the presentation.

The configuration objects connect to each programmed object in the system in a pattern related to both the State and Command patterns [Gamma]. This pattern, which I refer to as the Configuration pattern, when combined with Python's object model, neatly solves the configuration problem.

The problem then becomes how the configuration file is used to create these objects. In general, the only objects that the Python interpreter creates without programming are class objects. However, class objects have attributes that can be read just like instance attributes. Since each class creates a separate namespace with a simple syntax, this makes them an ideal solution to this problem.

Details from a working example

Having described the solution, we now present it in detail with extracts from a working program that uses this method for program configuration. The program in question is a web-based chat server. All of the HTML presented by the system comes from the configuration files.

The chat server's basic interaction with the user is to present a list of rooms, each of which represents an area with some set of discussion rules. The user selects one of the rooms and is then shown the contents of the room. The first part of the room is a control panel. Part of this is used to control how the user sees the room, and part to input messages to the room and control the appearance of those messages. The second part of the room is the list of messages that the user sees.

The basic objects for this application are obvious, and implemented as you would expect. There are user objects that reflect the user's control panel settings. There are message objects that are used to hold messages. There are room objects that hold specifics about the room, such as a list of active users and the messages they can see.

As described above, parallel to this class hierarchy is a configuration class hierarchy. Every object in the system has a configuration object attached that controls its behavior and presentation. Behavior is controlled by the configuration object's attribute settings. Presentation is controlled by a configuration object attribute with a string value, which is applied with the Python % operator, usually with the object being presented as the right operand. This means that each class needs to provide the Python dictionary interface for that operator.

The simplest object in this application is a message. The message class definition is:

class message:
 "A message in a room."
 def __init__(my, dispatch, room):
  my._user = dispatch.request.session
  my.time = time.time()
  my.fromid = my._user['userid']
  my.nickname = my._user['nickname']

  # Message text
  my.message = string.strip(dispatch.form['message'].value)
  my.text = config.message.format % my
  del my._user

 def __str__(my): return my.text

 def __getitem__(my, key):
  if key == 'time':
   return time.ctime(my.time)
  if hasattr(my,  key):
   return getattr(my, key)
  return my._user[key]

The one atypical feature of the message class is that the HTML to display is created when the object is created and then returned by the __str__ method. It does this because the HTML representation of the message never changes, and it could be displayed thousands of times. Most other objects change between displays, and hence rebuild the HTML when they are displayed.

The other features - an __str__ method to return the HTML, and __getitem__ that calculates some values, uses attributes if they exist, and otherwise pulls values from a component object - are standard for objects in this system. A typical configuration for a message is:

class message:
 format = '''%(format_image)s%(format_status)s%(face)s<br>
   %(message)s<br>From %(location)s on %(time)s
   Back to <a href="#top">top</a>.'''
 to_wrapper = '<font color=red>%s</font>'
 from_wrapper = '%s'

The format attribute is used in the class and completely controls the format of messages. Most of the values used in it come from the _user object. The two _wrapper attributes are used to wrap messages to and from a user so they can be distinguished without having to rebuild the HTML every time the message is displayed.

As can be seen, the text used to build configuration objects is relatively simple, and doesn't require programming skills. In practice, a few strings need to be quoted where they wouldn't be in a custom configuration language, and a trailing comma is required to distinguish a singleton tuple from an expression; those are the only real drawbacks of the syntax.

Showing the flexibility of this technique calls for a larger example. In a system of chat rooms, the rooms themselves are the most important objects. They are where messages are displayed, and users move in and out of them. This particular system also provides a couple of different types of dynamically created rooms, all of which have to be dealt with.

The configuration objects to deal with a collection of rooms looks like:

class room:
 # Format to display the body of a room. Does not
 # include messages! This has title &/ name as
 # formatting items.
 body_format = '<h3 align=center id="top">%(title)s</h3>%(header)s'

 # Description of the room, displayed in body_format
 # as %(header)
 header = ""

 # Format to display the head of a room. Same
 # formatting items as above.
 head_format = '<TITLE>Chat Room - %(title)s</TITLE>'

 # Maximum number of lines/chars in a message in
 # this room. 0 or None means no limit
 maxlines = 20
 maxchars = 2000

 # How long a room will hang around unused.
 timeout = 15 * 60

# Public rooms timeout VERY SLOWLY
class public(room):
    timeout = 24 * 60 * 60

# Each room in rooms must have an entry here with a title.
# Other values for the room config can be overridden in the
# class statement for each room
class friends(room): title = "The Friends Room"
class lounge(room):
 title = "The Lounge"
 header = """<P>This room is the <b>beta</b> chat room.
  It's where we test new features before putting them
  online everywhere.</P>"""
class poetry(room):
 title = "The Poetry Room"
 maxlines = 100
 maxchars = 10000

# List rooms created at startup
rooms = (lounge, poetry, friends)

As with the message configuration object, the bulk of this configuration object is taken up with formatting the resulting HTML. A few attributes at the end control the behavior of each room. Public rooms are created on the fly, and time out much more slowly than the default - 24 hours instead of 15 minutes - and hence override the timeout attribute. The first real room is the Friends room, which sets the title for the room. The Lounge adds a note to the top of the room display by overriding the header attribute. The Poetry room allows very long messages by overriding both the maxlines and maxchars attributes.

The last feature this example can illustrate is the ability to create new configurations on the fly. The chat system lets users enter a room name to create a new room dynamically. The configuration for such a room is the public room configuration, with a title added. So the configuration object can be created by instantiating config.public and then setting its title attribute:

 newroom = room()
 newroom.config = config.public()
 newroom.config.title = newtitle

This creates a new room and attaches a configuration object with the new title.

Summary

As can be seen, the configuration pattern combined with Python's object model provides a solution to each element of the configuration problem. Python itself provides the configuration language and programming constructs. The Python class construct provides namespaces and the ability to create configurations that inherit from other configurations. The ability to instantiate configuration classes and change them provides this same ability dynamically.

References

[Anderson]
Charles Anderson. "Experiences with Extension Programming and Scripting in Python." In Proceedings of the Sixth International Python Conference, 1997.
[Beazley]
David M. Beazley and Peter S. Lomdahl. "Feeding a Large-scale Physics Application to Python." In Proceedings of the Sixth International Python Conference, 1997.
[Gamma]
Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns - Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[Hinsen]
Konrad Hinsen. "The Molecular Modeling Toolkit: a case study of a large scientific application in Python." In Proceedings of the Sixth International Python Conference, 1997.

Friday, March 2, 2012

The political implications of the iOS App Store

You may well be wondering how something like an online software store can have political implications. The answer is that the infrastructure that makes it work can, at least in the ongoing war on general-purpose computing. If you aren't familiar with this war, read the article.

By delivering iOS with the App Store as the only authorized means of adding software, Apple has shown that it's possible to create a popular computing platform that isn't a general-purpose computing platform. I don't believe they did this by - as Cory Doctorow claims in the above article - delivering a device with malware preinstalled. Instead, they heavily censor the software that can be installed, turning iOS devices into platforms for running a small set of approved software appliances.

Implications for Apple

Computer manufacturers are normally shielded from liability for any crimes committed using their products because the computers have so many legal uses, and they can't control the software that the customer installs. Apple has lost this last claim - they exercise absolute control over the software their customers install.

At some point, someone will use an iOS device and an app from the store in committing a crime. Given the deep pockets theory - which says you pay for damages depending on your ability to pay, not your responsibility - and Apple's very deep pockets, some bright lawyer will decide to try to show that Apple bears some responsibility for this crime because they allowed the app to be sold in the store.

Whether Apple will lose and have to pay, or pay to make the lawsuit go away, or fight and win is yet to be seen. But the lawsuit is certainly bound to happen.

Implications for users

The unauthorized way to turn an iOS device back into a general purpose computer is to jailbreak it. That is a violation of the DMCA, but jailbreaking is currently legal because the US Copyright Office provides an exemption to the DMCA for actions required to unlock phones for use on other carriers. Since jailbreaking is required for that, it's covered.

The question at hand is - how long will that stay true once the government realizes that that exemption allows people to use iOS devices for all the illegal activities that Apple prevents? Again, the answer is unknown, but the continued existence of the exemption - which must be reviewed at regular intervals - will get less and less likely as the war goes on.

Implications for other manufacturers

Those fighting to keep the public's computers from being able to perform actions that the government has been convinced are bad for the country - basically, to keep the public's computers from being general-purpose computers - will love iOS and the App Store. They demonstrate that it's possible to make money building and selling computing platforms that can have arbitrary restrictions placed on what they can do.

Since Apple can do this, why can't Dell, Gateway, Microsoft, HTC, and everyone else who makes computers or operating systems? If they try and resist - which doesn't seem likely, as such stores are profit centers - I'd expect it to become a legal requirement in order to sell computers.

Recommendations

If you like not having the government dictate what you can and can't do with your computers, you need to support platforms that don't let anyone do that. Avoid iOS. Avoid buying software through manufacturers' stores. Avoid hardware that won't let you run alternative operating systems.

In a sentence: protect your right to choose by making sure you buy and use systems that preserve it.

Sunday, February 26, 2012

An evaluation of Haskell

I've now been writing Haskell code on and off in my spare time for most of a year. I've done small calculations in the REPL, and written a few small utilities. I still don't think I'm an expert in the language, but do think I've gotten enough of it under my belt to evaluate it compared to the other languages I've used.

First comment: I don't think this will ever be an immensely popular language. The syntax is weird. While Perl shows that that alone can't keep a language from becoming popular, it is an obstacle. Second, it's got the worst learning curve of any language I've ever run into. I'll expand on these points where it's appropriate.

The syntax

The syntax is weird, but it's also one of the things that keeps the language terse. Consider this fragment from my analysis of FBS controversies posts:


getYear :: String -> (Bool, Bool, Bool)
getYear = controversy . sort . (\ (t, r) -> map (getData t) r) . getRows . getTable . parseTags

Note that the first line is a type declaration, and optional here. This is basically a one-line function definition, though a relatively long line. Let's try it in Python:


def getYear(year):
    t, r = getRows(getTable(parseTags(year)))
    return controversy(sorted([getData(t, x) for x in r]))

That uses functional idioms for Python, and is probably how I'd write it if I were writing Python and had that set of tools (chances are I'd have approached the problem differently in Python, though).

Three lines. Shorter lines, yes, but much denser lines as well. And a couple of extra variables. One is required. One I could get rid of by using map(functools.partial(getData, t), r) instead of a list comprehension, but that makes the code longer.

Maybe a more functional language would help. Say Clojure:


(defn getYear
  [year]
  (let [[t r] (getRows (getTable (parseTags year)))]
    (controversy (sort (for [x r] (getData t x))))))

A pretty direct translation of the Python code, and again probably how I'd write the function in Clojure - with the same caveat that I might not have used this approach in Clojure at all. This time it's four lines long, but not quite as dense.

I think the Haskell code is better than either. The weird syntax - which allows for automatic currying and functional composition as an operator - is part of what makes the language powerful. Haskell without those features would be like LISP without macros.

Functional

As hinted at above, Haskell has built-in operators that work on functions. I'm not familiar with any mainstream language that has such builtins. I can see how some might let you add such a feature, as they let you define your own operators, or at least define how the built-in operators work on your objects. But most languages that have functional features just let you pass them around and assign them, not create new ones with operators!

Functional programming in general makes for a steep learning curve, at least if all you're familiar with is object-oriented programming. That Haskell's functional nature is so deep makes it worse, because what looks like operators are function applications, and some function applications do control flow, and you have to learn how to use all of that appropriately.

A shade too much type checking

The debate about dynamic vs. static typing is all about development time. Dynamic typing advocates say that having to actually type in type information - in extreme cases, as many as three times - slows you down. Static typing advocates say that having the compiler check your types saves you time. Both claim that their time savings outweigh the loss from not having the other.

Haskell has both. The compiler infers types from the code, so you don't have to type them in. Well, not very often, anyway. In some cases, the compiler can't narrow the type down enough to generate code, so you have to provide some. The convention is that you write out the type information for top-level functions, which usually meets any such requirements. If that seems onerous, you can load your code into the REPL and ask for the types.

The only place I've found where typing causes problems is with the mixed mode arithmetic. This requires explicit conversions unless everything is either of the same type or a constant. Constants will be inferred to have the appropriate type. So you can say 1 / 2 and get .5, because the types assigned to 1 and 2 can be converted to float, and then divided. You can't say (length [1]) / 2 because length [1] is an Int, and that has to be explicitly converted to float. Once you're used to this, it's not a problem. But the first time you run into it, it's nasty.

Libraries

It has been observed by a number of people that programming is now more about connecting libraries than writing elegant solutions to problems. Haskell has a fine library, with lots of interesting tools in it, including most of the common ones.

In some ways, this makes the learning curve worse. Since control flow is handled by functions, the library has lots of interesting control flow tools in it, as well as the usual collection of data processing tools. Learning those extra control flow tools extends the learning curve.

In fact, there are classes for doing control flow that are so important that the language has syntax to make them easier to use. Learning how to use all of those is another learning curve extension, and one I'm still working on.

On the plus side, the libraries are in some ways easier to explore than those for dynamic languages. There are databases of the standard library that you can search by type. So if you need a way to apply a function to every element of a list, you can ask for it (for the curious, the query is (a -> b) -> [a] -> [b]) and get a list of available functions from the library that match. Of course, this does require being able to write Haskell type expressions, but that's a requirement for many things.

My status

Personally, I already had years of experience with a number of functional languages. The functional nature of Haskell didn't bother me. Getting used to operators that are just functions - and a syntax to treat them that way - took a bit of time.

Haskell can be run as a script or compiled into a binary. It fits well into the Unix environment. There are even back ends to generate JavaScript, if that's what you want to do.

On the other hand, the Haskell infrastructure seems to be missing some of the parts required to develop enterprise-size applications. The HaskellWiki page about it provides a list.

I think it's good for a lot of things I want to do. I think that learning more about it will make me a better programmer, and exercise the mathematics degree I haven't otherwise used much. I'm going to keep plugging away at the learning curve, while still keeping an eye open for languages with good concurrency support that might be a bit more acceptable to mainstream programmers.

Tuesday, January 17, 2012

An analysis of FBS controversies: Part III, the solution

While the title is "the solution", this is, of course, my proposed solution. The problem to be solved is that the current system creates controversies by leaving "deserving" teams out of the playoff.

Why a playoff is not the solution

A playoff would seem to be the obvious solution, but there are some real reasons not to change. Ignoring the financial interests, a real playoff system - whether it's 8 teams (the major conference champions plus wild cards) or 16 (all FBS conference champions plus wild cards) - changes the nature of the regular season. The more teams allowed into the playoffs, the less important - and hence less intense and less exciting - the regular season becomes.

While it's now a cliche, with the current system every game counts. If a team is in contention for a national championship, losing any game changes the nature of their season. It may be enough to take them out of the national championship picture. At the very least, they'll need help - in the form of other teams beating contending teams - in order to have a shot at the national championship.

The conference championship is a different thing entirely. The non-conference games aren't counted in conference standings - or only counted indirectly, in that the rankings are used as last-resort tie-breakers, and the non-conference games count there. Losing a game to a conference opponent that's already got more losses than you doesn't - by itself - change the nature of your season. If you win the rest of the conference games, you'll be tied with someone that you beat, and hence hold the tiebreaker.

So a team can be completely out of the national championship race and still be a contender for a conference championship. A playoff system changes that because if you're in contention for a conference title, that puts you in contention for a national championship. This makes the regular season nothing more than a qualifier for the playoffs, and much less exciting.

The Flex Playoff

The goal is to preserve the "every game matters" nature of the regular season, while at the same time avoiding controversies when it comes to picking teams for the playoff.

The solution - to me, at least - seems obvious: let the number of teams in the playoff change, so that each year only the "deserving" teams are included. This means the format changes each year, but the bowl system is flexible enough to accommodate that.

To do this, take the top two teams in the rankings, and anyone with no more losses than the number two team, seeded by ranking. That eliminates controversies - at least so far as choosing the teams is concerned. It means that a team that loses a game needs help - in the form of all but one other undefeated team losing - to have a chance to get into the national championship game. It means the bias against the mid-majors will only lower their seeding; they'll get a chance to prove they deserve better on the field.
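To make the rule concrete, here's a hypothetical Haskell sketch of the selection; representing a team as a (name, losses, rank) triple is my own invention:

import Data.List (sortOn)

-- Take the top two teams by ranking, plus anyone with no more losses
-- than the #2 team. Assumes at least two teams in the input.
flexField :: [(String, Int, Int)] -> [(String, Int, Int)]
flexField teams = top2 ++ filter qualifies rest
  where
    ranked = sortOn (\(_, _, rank) -> rank) teams
    (top2, rest) = splitAt 2 ranked
    (_, cutoff, _) = ranked !! 1          -- losses of the #2 team
    qualifies (_, losses, _) = losses <= cutoff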

This is still using the regular season as a qualifier for the playoffs, but the level to qualify is nearly the same as it is now, so the regular season is nearly as important as it is now. The format would be the same as an 8-team tournament, with teams facing a non-qualifying seed (i.e. seed #5 when only three teams qualify) getting a bye. With fewer than 5 qualifiers, the first round is skipped and there's a four-team tournament. With only two qualifiers, there's just a championship game.

So a flex playoff will have all the fairness desired of a regular playoff, in that no deserving team will be left out. It will be more fair than a playoff, in that no undeserving team will be allowed in. It will maintain the intensity of the regular season at the level it is now, because qualifying for a flex playoff requires the same record as qualifying for the championship game. It combines the best features of both systems.

Sunday, January 15, 2012

An analysis of FBS controversies: Part II, the analysis

Now that I have a tool that can report on how much controversy any given scheme would have, I can start looking at numbers!

Which seasons?

First, I have to decide the range of seasons to check. The earliest season I can use is 1872. See the writeup on the code for a discussion of why 1869 through 1871 can't be used. An argument could be made that I should start with "the modern era", whatever that is. Arguments can be made for years as early as 1906 and as late as 1985 - and many in between. I'm going to take the "more data is better" approach, and start with 1872.

Likewise, the last year to use is open to debate. Here, the issue is bowl games. Ideally, the analysis would use records from the regular season and any conference championships, which are played on the last day of the regular season. Unfortunately, the data source also includes bowl games. Before 1992, the bowl games were no better organized than the regular season: conferences and bowls had contracts for teams to play in specific games, so you would get a selection of highly ranked teams playing each other apparently randomly, with little chance of the two highest ranked teams meeting. While these games may well have changed which systems would have been controversial each year, I believe the effect will average out over the long term.

This changed in 1992, when various entities started trying to get the best two teams playing each other. There was limited success until 1999, when the current system - agreed to by the major conferences - was started. Because the system matched the top two teams in 1992, I'm going to stop with the 1991 season.

Majors vs. mid-majors

One of the controversial features of the current system is that the FBS schools are split into two groups, "majors" and "mid-majors". The major (aka AQ or BCS) schools get ranked above the mid-majors, even if they have more losses. The explanation for this is that schools in each group mostly play schools in the same group, so mid-major schools don't play major schools very often, and hence have an easier schedule. I purposely ignored that, because an undefeated mid-major team causes controversy, especially if there are no undefeated major teams. Again, the controversy the code finds may be different from the controversy the press sees, and again I expect that to average out over the long run.

And the numbers are

Poll          Championship  Plus-1
65%           54%           62%

As you can see, there isn't a lot of difference to be seen between the three options. The championship game we have now seems to generate slightly less controversy, which is what we've experienced. Going to a plus-1 from what we have now would raise the controversy level back to almost the point we were at before.

Conclusion

Our current system with a single championship game is less likely to generate controversy than going to a plus-1 system.

Friday, January 13, 2012

An analysis of FBS controversies: Part I, the code

America's college Football Bowl Subdivision (FBS) has a history of controversy when it comes to picking a champion. And a history of proposed solutions. I've decided to examine how well those solutions would have worked if applied to previous seasons.

The football background

If you're a football fan, you can skip this section.

The FBS has one unusual property. Unlike every other organized college sport in the US, and every other nationally organized sport anywhere in the world, it has no clear-cut way to pick a champion for each season. Instead, various groups - some as small as a single person - analyze the games played during the season and rank the teams according to that play, with the highest ranked team at the end of the season being crowned "champion". If the major groups agree with each other, there's a "consensus champion". If not, then you have a controversy.

People familiar with other sports might be amazed that there's ever a consensus. However, FBS teams play few enough games that winning every game in a season is possible, and it's not at all unusual for one or more teams to be undefeated in a season. It's also hard enough that it's not unusual if there are no undefeated teams at the end of a season.

So the most important criterion any ranking system considers is "number of losses and ties". (Football fans who decided to read this section are now saying "But what about non-AQ schools?" For now, I'm going to ignore that, and I'll justify doing so in the sequel.) The other criteria are where the controversy comes in, as a case can be made that any two undefeated teams are equally good, and ditto for any two one-loss teams (provided they haven't played each other), etc.

The proposed solutions - short of an actual playoff, also to be discussed in the sequel - involve picking some small number of teams from the top to play each other. This system has actually been in use since the 1992 season to pick the top two teams. It doesn't seem to be any less controversial, and the formula for picking the two teams has been adjusted nearly continuously to try and deal with those controversies. A "plus-1" system gets a lot of talk; it has multiple definitions, but usually involves picking the top 4 teams.

The task

The FBS has a controversy any time a clear winner fails to emerge - or, since the Bowl Coalition and its successors started picking two teams for a championship game, any time there's not a clear choice of two teams. So the goal of the project is to see how well picking 1, 2 or 4 teams (the historical, current and "plus-1" approaches) to get a chance to play for the national championship would have done in previous seasons.

Deciding if there would have been controversy for a system is easy. If the last team to make the cutoff - either the first place team, the second place team or the fourth place team - has the same number of losses/ties as some team that doesn't make it, the system with that cutoff has a controversy for that season.

The code

Given that definition, the code is straightforward: for each season, get the records of all the teams, sort them based on losses/ties, and then check the records of the two teams on either side of each cutoff. This part presents the code for doing that.

So, let's start by getting the record for a season. Since that involves IO, the function that does it will work in the IO monad, giving us loadYear:


-- liftM is from Control.Monad; simpleHTTP, getRequest and
-- getResponseBody are from Network.HTTP.
loadYear :: Int -> IO (Bool, Bool, Bool)
loadYear =
  liftM getYear . (>>= getResponseBody) . simpleHTTP . getRequest . makeURL
          
makeURL :: Show a => a -> String
makeURL year = "http://www.jhowell.net/cf/cf" ++ show year ++ ".htm"

makeURL is a helper function that just turns the integer year into a URL for the data on www.jhowell.net. That URL is fed to a series of functions from the Network.HTTP module to produce a page.

This idiom - a series of functions separated by '.' - is common in Haskell code. '.' is the compose operator, producing a new function that is the result of applying the right function and then the left function. This idiom and an alternative are going to show up in a number of places.
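In its simplest form, with a made-up example:

import Data.Char (toUpper)

-- (f . g) x == f (g x): the rightmost function is applied first.
shout :: String -> String
shout = map toUpper . reverse     -- reverse the string, then upcase it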

The interesting part is the first two steps, which actually happen last. The step before them, simpleHTTP, takes a string value and produces a value in the IO monad. The form (>>= getResponseBody) is a function that will apply getResponseBody - which also works in the IO monad - to the value in the monad, returning the result from getResponseBody. liftM getYear is similar, except that getYear (presented below) does not work in the IO monad. liftM applies it to the value in the monad, and returns its result in the monad. I can't use liftM on getResponseBody because getResponseBody, unlike getYear, returns a value already in the monad.

I've just talked a lot about monads. If you're not familiar with them, you're probably wondering what I'm talking about. A monad is a class - by which I mean a group of functions that can be applied to members of the class - that functional programming languages use to hide plumbing of various types. In this case, the IO monad hides the state information for whatever IO is being done. One of the properties of a monad is that it can be viral. If a function uses a value in some monad, the function has to return a value in that monad, unless the monad includes a function to unwrap the value. Without that, the using function can extract the value from the monad and pass it to functions that aren't in the monad (known as pure functions) and get a value back, but the value of the function itself has to stay in the monad. After all, if a function does IO, then any function that calls it also does IO.
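A minimal sketch of that virality, with made-up names:

twice :: Int -> Int           -- a pure function
twice x = x * 2

-- Reading input does IO, so any function using it stays in IO.
twiceTheInput :: IO Int
twiceTheInput = do
  s <- getLine                -- extract the String from the IO action
  return (twice (read s))     -- use it purely, re-wrap with return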

So makeURL and getRequest are pure functions, returning pure values. simpleHTTP takes that value and does IO, so it returns a value in the IO monad. The next two functions - (>>= getResponseBody) and liftM getYear - have to accept and return values in the IO monad.

The next group of functions are easier, I promise. They're all pure. getYear takes the string representing the page for a season, and maps it to a set of three booleans, indicating whether or not that season's results would be a controversy for a poll, a championship or a plus-1 system.


getYear :: String -> (Bool, Bool, Bool)
getYear = controversy . sort . (\ (t, r) -> map (getData t) r) . getRows . getTable . parseTags

And we once again see that list of function compositions. parseTags is from the Text.HTML.TagSoup module, and turns the string into a list of Tag String elements. getTable extracts the table from that list:


getTable :: [Tag String] -> [[Tag String]]
getTable = partitions (isOpenTag "tr") . drop 2 
           . takeWhile (not . isCloseTag "table")
           . dropWhile (not . isOpenTag "table")

And again I have a list of function compositions. dropWhile skips over elements that its first argument says to skip - in this case, any element that is not a "table" open tag. takeWhile accepts elements that its first argument says to accept - in this case, any element that is not a "table" close tag. drop just returns the list without its first two elements, which are the "table" open tag and the text between it and the start of the first row. Finally, partitions (also from the Text.HTML.TagSoup module) turns the list of Tags into a list of lists of Tags, starting a new list at every "tr" open tag.

isOpenTag and isCloseTag are very similar, using Haskell's pattern matching to check for TagOpen or TagClose elements in the list, and then comparing the text in the tag to their first argument to make sure this is the open or close for the proper element:


isOpenTag :: String -> Tag String -> Bool
isOpenTag tag (TagOpen t _) = map toLower t == tag
isOpenTag _ _ = False

isCloseTag :: String -> Tag String -> Bool
isCloseTag tag (TagClose t) = map toLower t == tag
isCloseTag _ _ = False

The table from getTable is then passed to getRows. getRows is more complicated, in that it has to check the first row to see if this year has a column for ties (the rules changed in the '90s, making ties impossible). It returns a Boolean indicating whether or not there was a ties column, along with the list of rows after the first one.


getRows :: [[Tag String]] -> (Bool, [[Tag String]])
getRows rs = (isTies (head rs), tail rs)
  where isTies r = maybe False (const True) (find (== TagText "Ties") r)

isTies is a helper function only visible in the getRows namespace. It runs code in the Maybe monad, which can either hold a value or not (i.e. - be Nothing). This monad has an extraction function, maybe, that can return values outside the monad. Its arguments are a value to return if there is no value in the monad (i.e. - it's Nothing), a function to apply to the value if there is one, and the value to be checked. In this case, I return False if there was no "Ties" column, and use a function that always returns True if there is one.
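A quick hypothetical GHCi session showing those three arguments in action:

ghci> maybe 0 (+ 1) (Just 41)
42
ghci> maybe 0 (+ 1) Nothing
0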

The plumbing the Maybe monad hides is error checking. You can pass a value in the monad through a string of functions that return values in the monad, and once one returns Nothing, the rest will be skipped, passing the Nothing to the top level without having to explicitly code tests for it at each step. I'm not taking advantage of that here, but would have if I'd wanted to treat missing pages as seasons with no teams (as in 1871).
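Here's a minimal sketch of that style, assuming readMaybe from Text.Read; each step can produce Nothing, and (>>=) short-circuits the rest when one does:

import Text.Read (readMaybe)

halfOf :: String -> Maybe Int
halfOf s = readMaybe s >>= \n ->
           if even n then Just (n `div` 2) else Nothing

-- halfOf "10" == Just 5; halfOf "7" == Nothing; halfOf "x" == Nothing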

The next step in the processing is to extract the controversy information for each season. That's done by (\ (t, r) -> map (getData t) r), an anonymous function that accepts the tie Bool and then maps getData t over the rows. getData t uses Haskell's automatic currying to create a function that already has the ties Bool and will process a row:


getData :: Bool -> [Tag String] -> (Int, Int)
getData True = (\gs -> (head gs, head $ tail gs)) 
               . map getLosses . take 2 . dropToLosses
-- (,0) is a tuple section, which needs GHC's TupleSections extension.
getData False = (,0) . getLosses . head . dropToLosses

The two variants of getData are almost identical. Both start with dropToLosses, which partitions the list on data elements, then apply getLosses either to the first element (if there are no ties) or the first two (if there may have been ties), and finally put the results in a tuple, using a 0 for ties if they weren't possible.
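About that currying, in miniature, with made-up names:

scale :: Int -> Int -> Int
scale factor n = factor * n

triple :: Int -> Int
triple = scale 3    -- scale partially applied: a new one-argument function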

To summarize the list tweakings: I started with a list of Tags, turned that into a list of lists, each sub-list being a row, then processed each sub-list through getData, which internally turns it into yet another list of lists - this time of the data elements in the row - before reducing it to a pair of numbers.
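Or, as a recap of the types involved:

-- parseTags   :: String         -> [Tag String]           -- flat tag list
-- getTable    :: [Tag String]   -> [[Tag String]]         -- one sub-list per row
-- getRows     :: [[Tag String]] -> (Bool, [[Tag String]])
-- getData t   :: [Tag String]   -> (Int, Int)             -- (losses, ties)
-- controversy :: [(Int, Int)]   -> (Bool, Bool, Bool)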

dropToLosses is simple:


dropToLosses :: [Tag String] -> [[Tag String]]
dropToLosses = drop 8 . partitions (isOpenTag "td")

It does the partitioning, then drops the first eight sub-lists, representing things in the table before the losses column.

getLosses is almost as simple:


getLosses :: [Tag String] -> Int
getLosses = read . fromTagText . head . tail

It drops the first element, the "td" tag, then extracts the first element after that, which is the text in the "td" tag, then uses read to convert that from a string into an integer.

The last step in getYear is to create a triple of Bools, saying whether the first and second teams had the same record, whether the second and third had the same record, and whether the fourth and fifth had the same record. Because list indices start at 0, I have to subtract one from each value in the comparison:


controversy :: [(Int, Int)] -> (Bool, Bool, Bool)
controversy records = (records !! 0 == records !! 1,
                       records !! 1 == records !! 2,
                       records !! 3 == records !! 4)

genStats generates the stats; since it calls loadYear, it works in the IO monad. It maps loadYear over a list of years, then calculates and returns the percentage of times each type of controversy occurs in that list of years:


-- foldl' is from Data.List.
genStats :: [Int] -> IO (Int, Int, Int)
genStats years = do
  pages <- mapM loadYear years
  return $ percents (foldl' counts (0, 0, 0) pages)
                    (fromIntegral $ length pages)
  where
    percents (a, b, c) l = (a % l, b % l, c % l)
    (%) a b = round $ 100 * a / b
    counts (a, b, c) (ab, bb, cb) = (minc a ab, minc b bb, minc c cb)
    minc x xb = x + if xb then 1 else 0

genStats gets the list of controversy triples via loadYear, then uses a quad of helper functions to calculate percentages. percents takes the triple of counts and the length of the list and uses % to generate the percentages. This is the one place where Haskell's type system gets in the way. It can't coerce the integer from length into a type that can be used with /, so it has to be explicitly coerced with fromIntegral. counts generates the actual counts, using minc to increment the current value of each count depending on whether or not there was a controversy for that count on this page.
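For what it's worth, in a hypothetical GHCi session, running it over the full range should reproduce the percentages reported in Part II:

ghci> genStats [1872 .. 1991]
(65,54,62)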

I use makeYears to process the arguments. It takes a list of Strings, and returns a list of Ints. If the input list has only one element, its value as an Int is wrapped in a list and returned. If it has two elements, they are taken from the list, turned into Ints, and the range from the first to the second is returned. Otherwise an error message is printed:


makeYears :: [String] -> [Int]
makeYears args =
  case length args of
    1 -> [read (head args)]
    2 -> (\l -> [head l .. head $ tail l]) $ map read args
    _ -> error "Arguments must be first and optional last year."

Finally, the main routine - which in Haskell, as in other languages, is where compiled programs start - uses a chain of functions in the IO monad to get the results and print them. This time, the unusual function is makeYears, which isn't in the IO monad, but feeds its value to genStats, which is.


-- getArgs is from System.Environment, printf from Text.Printf.
main =
  getArgs >>= genStats . makeYears >>= uncurry3 (printf format)
  where
    format = "Percentage of seasons with controversies using\n" ++
             "Poll: %d%%, Game: %d%%, Plus 1: %d%%\n"
    -- uncurry3 isn't in the Prelude; a definition like this is assumed:
    uncurry3 f (a, b, c) = f a b c

That's the code. You can, of course, fetch it from the repository if you want to build it and see the results yourself. I'll be discussing that in the next entry.

Dealing with the first three years

This code should work for all football seasons from 1872 forward. There was no football played in 1871. The first season was played in 1869. There were fewer than five teams playing in 1869 and 1870. Dealing with them would require deciding whether the empty lower slots should count as a season without controversy or not be counted at all, and modifying the code to match.

In 1869, there were two one-loss teams, so a poll had a controversy, and a championship game didn't. This table also requires special handling, as the pre-season ranking column is missing.

In 1870, there were three teams: one undefeated, and two with one loss. The code would have counted this as no controversy for a poll, and a controversy for a championship. However, Princeton beat Rutgers and Rutgers beat Columbia, so the rankings for 1, 2 and 3 aren't controversial.

In 1871, there was no college football. Worse yet, that page generates an error instead of an empty list, so it would require special handling as well.

Since it would have taken three changes to include just two more seasons - and problematic ones at that - these are ignored.