Sunday, September 5, 2010

A Monkey in Clojure

One of the things contributing to the hiatus was unpacking and cataloging my office library. During that process, I found a book I couldn't remember buying, much less reading: The Practice of Programming by Kernighan and Pike. It naturally went to the top of my reading list.

Chapter three in TPoP presents the same algorithm in five different languages: C, C++, Java, awk and Perl. The algorithm in question is what's commonly known as a Markov Chain, this one dealing with words. Markov Chains generate output that resembles an input sequence by recording the suffixes that follow fixed-length prefixes in the input. If the sequence is composed of words, you get text that resembles the input more closely as the prefix length increases. If the sequences are composed of letters, you get text that goes from gibberish to ever-longer words from the input as the prefix length increases. These were a popular project when I was in college, though we called them "monkeys", because, according to a then-popular meme, a million character monkeys with length-zero prefixes would eventually output all of Shakespeare. I'm using the more colorful name.

On being reminded of the problem, I was instantly struck by how well suited clojure was to it: consume a sequence of words to build a map from vectors of words (the prefix) to collections of words (the possible suffixes), then use that map to generate a sequence of output words. Should be a slam-dunk.

With a major caveat - there are many different ways to tweak this algorithm. Besides the obvious choice of characters or words or other objects, you can fix the prefix length or make it variable, and there are various ways to start the sequence and to cause the generation phase to terminate. If you're doing words, there are also a variety of ways of parsing the input, dealing with punctuation, etc. All of those can make things more complicated. Kernighan and Pike chose a simple model: any printing character belongs to the word, punctuation included. They also used sentinel values both to start the sequence of words and as an indicator that the generated text should stop. As they point out, this is in general a good practice, because it makes the resulting programs simpler.
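To make that concrete, here's a rough sketch - mine, not from TPoP - of the map for the four-word input "the quick brown fox" with a prefix length of 2, using nil as the sentinel the way the code below does:

;; each two-word prefix maps to the suffixes seen after it
{[nil nil]         ("the")
 [nil "the"]       ("quick")
 ["the" "quick"]   ("brown")
 ["quick" "brown"] ("fox")
 ["brown" "fox"]   (nil)}

Generation starts at the all-sentinel prefix and stops when it happens to pick the sentinel as a suffix.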

So, the first step was to see what was available on the web. While Google turned up a couple of markov chain implementations, they didn't sit quite right with me: they were closer in length to the C++ version than the awk and Perl versions. I felt that it should be possible to do better than that - that a problem this well suited to clojure should be expressible in about the same number of lines of code as it is in Perl, awk or the Python version I found in the Python Cookbook. This is not to say the other implementations I found were bad - they were solving slightly different - and more intricate - versions of the problem. At least one could be used for either character or word monkeys.

First step - generate a sequence of words from an input file. Since we don't want to have to keep all of the words in memory, we're going to make it a lazy sequence. This being my first use of lazy sequences in clojure, I'm not sure I actually managed that goal. Anyway, read-words returns a lazy sequence of the words from standard input. It goes to a little bit of trouble - trimming whitespace from the beginning of the input and then filtering empty strings - to deal with odd formatting on the input.

;;;; A variable-order word monkey (markov chain of words).

(ns monkey (:use [clojure.contrib.str-utils2 :only (split trim)]))

(defn read-words []
  (letfn [(read-unfiltered []
            (if-let [line (read-line)]
              (lazy-cat (split (trim line) #"[\s]+")
                        (read-unfiltered))))]
    (filter seq (read-unfiltered))))

The sub-function read-unfiltered does most of the work, using lazy-cat to build a lazy sequence from the result of splitting each line on whitespace after trimming any leading and trailing whitespace. The main body just filters out empty strings from read-unfiltered, using seq as the predicate, which makes the call to trim superfluous.
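A quick REPL check - my own sketch, not from TPoP - binding *in* to a string to stand in for piping a file to standard input (the doall is there because with-in-str restores *in* when it exits):

user=> (with-in-str "  The quick\n\nbrown  fox " (doall (read-words)))
("The" "quick" "brown" "fox")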

The sequence of words is going to be used to create a native clojure data structure: a hash-map that uses a vector of words as the key and a collection of words - the possible suffixes - as the value. That takes all of two functions and six lines of code:

(defn add-map [markov-map prefix-plus]
  (let [prefix (vec (butlast prefix-plus))]
    (assoc markov-map prefix (conj (markov-map prefix) (last prefix-plus)))))

(defn build-map [prefix-len col]
  (reduce add-map {} (partition (inc prefix-len) 1
                                (concat (repeat prefix-len nil) (col) [nil]))))

add-map adds the suffix to the map entry for the prefix that it is passed, using a fairly standard clojure idiom for adding to a map entry. This avoids some of the issues seen in the other implementations because conjing a value onto the nil returned by a map lookup on a missing key gives you back a collection containing just that value. build-map uses partition to create a lazy sequence of lists containing prefix and suffix from the input sequence, which is what add-map expects to be passed. Reduce is then used to build up the complete map by adding all of those in turn to an initially empty map.
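Two quick REPL sketches of those idioms (mine, not from the book): conj onto the nil from a missing key yields a one-element list, and partition produces the overlapping prefix-plus-suffix windows - here with a prefix length of 2 and nil padding at both ends:

user=> (conj nil "fox")
("fox")
user=> (partition 3 1 (concat [nil nil] ["the" "quick" "brown"] [nil]))
((nil nil "the") (nil "the" "quick") ("the" "quick" "brown") ("quick" "brown" nil))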

Given that I started with a lazy sequence of words, it only made sense to finish with one. The last two functions generate that sequence, and then print the part of the results we want:

(defn generate [markov-map prefix]
  (let [suffixes (markov-map (vec prefix))]
    (if-let [suffix (nth suffixes (rand-int (count suffixes)))]
      (lazy-cat [suffix] (generate markov-map (concat (rest prefix) [suffix]))))))

(defn word-monkey
  ([maxgen] (word-monkey maxgen 2))
  ([maxgen prefix-len]
   (let [markov-map (build-map prefix-len read-words)]
     (dorun (dec maxgen)
            (map println (generate markov-map (repeat prefix-len nil)))))))

The generate function takes a map and an initial prefix, then returns a lazy sequence of words that follow that prefix by calling itself with the same map and a new prefix. The word-monkey function builds the map, then uses dorun to print either the whole sequence or the maximum number of words wanted from it, one word per line, as per TPoP.
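A hypothetical invocation (the file name is made up), again using with-in-str to stand in for standard input; this asks for 300 words with the default prefix length of 2:

(with-in-str (slurp "tpop-input.txt")
  (word-monkey 300))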

Two issues here: one is that the code is a bit wider than I would have liked. My initial version assigned the lazy sequences to variables to shorten the lazy-cat call, but that meant I was holding the head of those sequences, which meant I had them all in memory, defeating the purpose of using lazy sequences in the first place. The other is that we cheat a little on TPoP by using nil for the sentinel value. It has the requisite properties, in that it won't be generated by read-words, but using nil saves having to create and assign one.

As expected, the clojure version at 23 lines is about the same length - counting only code lines, not blank or comment lines - as the similarly-counted Perl (18), awk (20) and Python (20) versions. The difference is that the word-monkey function is a function instead of the body of the script, and - unlike the others - this version doesn't fix the prefix length at 2, but allows it to be set by the invoker. Dropping those two things would make it three or four lines shorter.

As a final dose of goodness, the add-map, build-map and generate functions will work with sequences of any objects, so long as nil isn't a valid object. The character-monkey version using them is:

(defn char-monkey [file maxgen prefix-len]
  (let [markov-map (build-map prefix-len (fn [] (slurp file)))]
    (dorun (dec maxgen)
           (map print (generate markov-map (repeat prefix-len nil))))))
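For instance, a hypothetical call (the file name is made up) that generates a couple of thousand characters with a prefix length of 6:

(char-monkey "hamlet.txt" 2000 6)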