Custom Search

Tuesday, February 23, 2010

There's not an app for that, take 2

In order to keep better track of what iPhone users are missing out on, I've set up a page listing Android applications - or at least categories of them - that Apple doesn't allow on the iPhone. You can find it under Pages on the right hand side of any page on the blog, listed as Apps you won't find in the app store.

Monday, February 22, 2010

Phrase groups in clojure

A recent interview with yags* consisted of solving problems and writing demo code for them, which is pretty standard for a gs. One of the problems captured my attention, because it looked like a perfect problem to explore the strengths and weaknesses of clojure.

The problem is simple enough to state: Given a set of lists of words, remove every phrase group that occurs in all the lists. A phrase group is any sequence of three or more words in a row. Two things make this interesting for clojure: 1) it's a purely functional problem: the output result depends only on the input values, and 2) it involves manipulating relatively deep structures - at least in the solution I came up with. That is something working with immutable data structures typically makes painful.

Since I'm still exploring clojure, I chose to do this in Python in the interview. The code here - and in the repository at bitbucket - uses the same algorithm, expressed with the high-level data structures of clojure instead of Python.

First, a couple of short helper functions:
(defn mfilter
"Return a hash-map built by removing entries for which (pred (key entry))
returns false from mapin."

[pred mapin]
(apply hash-map (apply concat (filter #(pred %) mapin))))

(defn enumerate
"Return pairs of an index into sequence, and the value at that index"
[sequence]
(map vector (iterate inc 0) sequence))
As the document string for mfilter says, it returns a copy of a hash-map built from a map by removing entries for which the predicate applied to the key is false. Likewise, enumerate counts the elements in a sequence, starting at 0, and returns pairs of them and the counter.

Note that we only have to worry about phrases of length three. A phrase of length 4 will show up as two phrases of length three with an overlap, and one of length five as three such phrases, all overlapping. So we can ignore phrases longer than three words. And of course, phrases shorter than three words aren't phrase groups, so we ignore them as well. So the list of phrases in an input list is a list triples that has two fewer elements than the input list, as the last two elements don't start phrases.

The data structure used for my solution is a dictionary, index by phrase groups - meaning triples of words. Each entry is also a dictionary, indexed by an input lists index in the list of input lists. The entries in those dictionaries is a list of places that that phrase group starts in the input list. I'm going to call this the phrase dictionary.

That leads us to the first building block function, which accepts a phrase dictionary and a phrase, along with the index of a list and the index of that phrase in the list, and returns the update phrase dictionary:
(defn add-phrase-to-phrase-dict [phrase-dict phrase list-index phrase-index]
(if (or (phrase-dict phrase) (= list-index 0))
(update-in phrase-dict [phrase list-index] conj phrase-index)
phrase-dict))
This uses the clojure function update-in, which is something I don't ever remember seeing in a lisp before (though cl's setf might be close). It finds an element in a nested structure - like the phrase dictionary - using it's first argument, which is a sequence of indices into the structure. That's [phrase list-index] , so this finds the list of places where phrase appears in the input list list-index. Further, if the elements aren't there, it creates the intermediate map needed, and return nils if the last lookup fails. The resulting value gets passed to the second argument, along with any remaining arguments, and the result of that call is used in this position in the new version of the structure that is returned. While this might sound expensive, since everything in the structure is immutable, the two versions can actually share everything but the values along the path to the updated value.

The conditional application of update-in is an optimization. If the phrase we're updating isn't in the dictionary, we want two different behaviors: if this is the first input list, we want to add that phrase to the dictionary. Otherwise, we can ignore the phrase, because it isn't in at least one input list - the first one. Adding it won't change the final result, but will generate more work. So we check for it there, and then only add it if this is the first input list. The alternative case returns the input phrase dictionary unmodified.

We now need to invoke this function on every phrase in every input list. Let's start with a function to invoke it on every phrase in a single input list:
(defn add-list-to-phrase-dict [phrase-dict list-index list]
(reduce (fn [phrase-dict [phrase-index phrase]]
(add-phrase-to-phrase-dict phrase-dict phrase list-index phrase-index))
phrase-dict
(enumerate (map vector list (rest list) (nthnext list 2)))))
As input, we get the phrase dictionary we're going to update, the index of the input list in the list of lists, and the input list itself. Wanting to update a value based on processing values in a list is what the lisp function reduce is for. It takes a function, an initial value, and a list, then calls that function with the initial value and each element in the list. So our function - introduced by (fn - takes two arguments, a phrase dict and a pair of phrase-index and phrase, and returns the result of calling add-phrase-to-phrase-dict with them and the input list values. The initial value is the input phrase dictionary. We generate the phrase list by mapping vector over the input list, and the results of removing the first and then second element from it, giving three-element vectors of three consecutive words - which are our phrases. Again, the primitive does the right thing for us, and stops producing maps when the last list runs out, so the last two words in the initial input list don't start phrases. We pass that list to enumerate to get the index and phrase pairs we need.

Given that, we basically repeat the same construct to deal with all the input lists:
(defn build-phrase-dict [lists]
(let [phrase-dict (reduce (fn [phrase-dict [list-index list]]
(add-list-to-phrase-dict phrase-dict
list-index list))
{}
(enumerate lists))
list-count (count lists)]
(mfilter #(= (count (val %)) list-count) phrase-dict)))
The input is a sequence of lists. That's the sequence we reduce over here, once again passing it to enumerate to generate our indices. This time, the initial value is an empty map. The most important change is that, instead of returning the result of reduce directly, we use the mfilter function defined earlier to remove any elements that don't have as many entries as we had input lists. That's the criteria for inclusion - that the phrase be in every list. If it isn't in some list, then it's dictionary entry won't have an entry for that list, and hence the count won't match the length of the list of lists.

That's the first half the problem - building up the phrase dictionary. We now need to use that to remove the phrases from the input lists. Oddly enough, the phrases themselves don't matter - we just care about their positions. So we write a function that takes the list of phrase positions, and removes them from a list:
(defn remove-phrases-from-list [phrase-starts list]
(mfilter (fn [[x _]] (not-any? #(and (>= x %) (< x (+ % 3))) phrase-starts))
(apply sorted-map (mapcat vector (iterate inc 0) list))))
Here's the second mfilter - this time, it's remove words from the list. We start by creating a sorted map indexed by word position of the words in the list. Then mfilter uses not-any? to check each word's position against the phrase start positions, specifically that the position is not greater than the start position and less than three plus the start posistion. If a position falls into that range for any phrase in the phrase-starts list, it's removed.

So all that's left is printing the result:
(defn remove-shared-phrases [lists]
(let [phrases (apply merge-with concat (vals (build-phrase-dict lists)))]
(doseq [[idx list] (enumerate lists)]
(println (vals (remove-phrases-from-list (phrases idx) list))))))
This actually combines everything so far: building the phrase dictionary from the list, extracting the list of phrase positions from that, using merge-with to concat the lists in each of the element dictionaries into a single list for each file, then using doseq to call println on the result of removing the phrases in each list from the list. At this point, we see why the output map was a sorted-map: that forces the results of each list to print in the proper order.

All in all, this has been an illuminating exercise. Clojure's functions for dealing with deep structures and non-list structures work very well with them, providing what is still a very Lisp-like environment, even though we're not working with lists. In particular, the handling of what in python were special cases is right by default, which feels very much like classic lisp behavior. I'd say Clojure held up very well here.

*) That would be Yet Another Google Spinoff.

Saturday, February 13, 2010

mired in code now available from bitbucket

To make it easier to get the code published here I've set up a mercurial repository at bitbucket.org. Assuming you have mercurial installed, then you can get a current copy of the code with:

$ hg clone https://bitbucket.org/mwm/mired-in-code/

That will also leave the most recent code checked out. You can update the repository and check out the most recent updates with:

$ hg pull -u

For anything more advanced, you'll want to check out mercurials documentation with:

$ hg help

Friday, February 12, 2010

bitchin web framework - templates

One of the advantages of making the templates functions is that clojure  does almost all the work. Tags basically come in two flavors, depending on whether they render newlines between the elements of their contents to make the resulting HTML more readable. So we could define a tag like so:
(defn html
([] (format "<%s />" "html"))
([attribs & args]
(if (associative? attribs)
(str "<" "html" (make-attrib-strings attribs)
(if (nil? args)
" />"
(format ">%s%s%s</%s>" \n (str-join \n args) \n "html")))
(apply html (concat [{} attribs] args)))))
Where make-attrib-strings handles the attribute map, and looks like:
(defn- make-attrib-strings [attrs]
(apply str (map #(format " %s=%s" (if (keyword? %) (name %) %)
(quoted-attribute (attrs %)))
(keys attrs))))

(defn quoted-attribute [input]
(format "\"%s\""
(apply str (replace {\& "&amp;", \" "&quot;", \> "&gt;"} (str input)))))
These two functions are fairly straightforward. quoted-attrib takes a string, and returns it with the characters that are dangerous inside of attributes - '"', '&' and '>' (the latter only for very old browsers, but it's easy to fix as well) and returns a string with them replaced by the appropriate XHTML entities. We have to process the output through str, as the result of replace is a sequence. Likewise, we pass the input through str, just in case they gave us a symbol (which we will use later). make-attrib-string is passed an associative object, and maps the keys through a function that turns it into a stringo f the form key="value", with value being quoted by quoted-attribute. It does check to see if the key is a keyword, and if so uses it's name, so that users can use :href instead of 'href. Again, we pass the results to str to turn it into a string for output.

The problem with the above approach is that it requires writing out that rather long function for every tag we want to use. Fortunately, clojure provides us with macros, so we can just write a macro to do all that for us, like so:
(defmacro make-tags
([sep name & names]
`(do
(make-tags ~sep ~name)
(make-tags ~sep ~@names)))
([sep name]
(let [name# (first (re-split #"_" (str name)))]
`(defn ~name
([] (format "<%s />" ~name#))
([attribs# & args#]
(if (associative? attribs#)
(str "<" ~name# (make-attrib-strings attribs#)
(if (nil? args#)
" />"
(format ">%s%s%s</%s>"
~sep (str-join ~sep args#) ~sep ~name#)))
(apply ~name (concat [{} attribs#] args#))))))))
The first clause is a standard clojure idiom - we want to allow multiple tag names, so we handle the case with more than one name by calling the macro with one name, and then again with the rest of the names.

The second clause is the code from the html function above, with the
string or symbol html taken out, and replaced by an
expression involving name. With one hiccup - some of the html tag names are core clojure function names, so we can't use them. Instead, we use map_ and meta_, and extract the string into the symbol name#. The # causes clojure to generate a unique symbol for this binding, so we don't have to worry about name collisions. The rest of the macro is the function from above, with ~name or ~name# (the ~ causing evaluation) replacing what was html, and ~sep.

So now we can put that all together, along with two invocations of the macro to create all the tags we normally use. And of course, the user gets the make-tags function if they want to use custom tags of some kind or another.
(ns org.mired.bitchin.html
(:use [clojure.contrib.str-utils :only (str-join re-split)]))

(defmacro make-tags
([sep name & names]
`(do
(make-tags ~sep ~name)
(make-tags ~sep ~@names)))
([sep name]
(let [name# (first (re-split #"_" (str name)))]
`(defn ~name
([] (format "<%s />" ~name#))
([attribs# & args#]
(if (associative? attribs#)
(str "<" ~name# (make-attrib-strings attribs#)
(if (nil? args#)
" />"
(format ">%s%s%s</%s>"
~sep (str-join ~sep args#) ~sep ~name#)))
(apply ~name (concat [{} attribs#] args#))))))))

(defn quoted-attribute [input]
(format "\"%s\""
(apply str (replace {\& "&amp;", \" "&quot;", \> "&gt;"} (str input)))))

(defn- make-attrib-strings [attrs]
(apply str (map #(format " %s=%s" (if (keyword? %) (name %) %)
(quoted-attribute (attrs %)))
(keys attrs))))


;; The standard html tags with newlines between element in the contents
(make-tags "\n" html head meta_ base link script style body div dl ol ul li table
tr colgroup col thead tbody tfoot form select input)

;; And now the same, with nothing between elements in the contents.
(make-tags "" title h1 h2 h3 h4 h5 h6 p pre address blockquote hr dd dt th td
caption option label textarea button a b i big small strong em q
tt a cite dfn abbr acronym code samp kbd var sub del ins font span img
br map_ area object param embed noembed)
As expected, the tests had to be debugged as well. Mostly, it was spacing inside of tags. However, the design changed just slightly - the template system doesn't automatically quote characters in content strings (but it does in attribute strings). The recursive nature of the invocations makes it hard - if not impossible - to keep track of what has and has not been quoted, and quoting a second time would be a failure. This may be a reason to change from a code representation to a vector representation, if the problem can be solved.

Sunday, February 7, 2010

The bitchin web framework - template testing

I've been building web sites for nearly twenty years now. In that time, I've learned some lessons that apply here:

1) I shouldn't do visual or interface design. My eye for graphics sucks, and my mind doesn't work the way most users do when it comes to UI issues.

2) CSS is sufficient for visual design. The existence of "div soup" web pages demonstrates that. Since those can go in a stylesheet file, we get automatic separation of design and content. Given #1, this is good for me.

3) If you're doing HTML, the reader always gets the final say about presentation! There are browser with options - or extensions that add options - to turn off CSS completely, or use the users own CSS. Worst comes to worst, they can use View Source on your page. Attempting to tell them how to set their browser will, if you're lucky, get you laughed at before they read your text. If you're not lucky, they'll take their business elsewhere.

With those in mind, bitchin is a programmers framework. Templates are meant to be embedded in code, with design decisions going into a style sheet.

A template is a data structure that can be "evaluated" in a context to fill in values from various data sources to provide content, producing HTML text - or a fragment thereof. In most programming languages, we have a tool for that already - it's called a function! Further, modern HTML is an XML application, and the parallels between S-expressions and XML have been pointed out numerous times. Given that Clojure functions are S-expressions - well, mostly - I'm going to use them as bitchin templates.

So a tag is a callable object. If the first argument is a Clojure map - or any associative object - the key/value pairs will turn into attributes for the tag. Any remaining arguments will be evaluated to generate the contents of the tag. So a typical simple web page template might look like:
(html (head (title "A short test"))
(body (h1 "A short test"))
(p "And some text")))
And then render as:
<html>
<head>
<title>A short test</title>
</head>
<body>
<h1>A short test</h1>
<p>And some text</p>
</body>
</html>
So the first step is test code. While I am a fan of test suites, I'm not a fan of test driven development. The problem is - the tests are code, and hence likely to have bugs. The only proper test for them is the code they're supposed to test. So to do TDD on the tests, you need to write the code to be tested first. The reality is that it doesn't matter which you write first - you're going to debug them in parallel. So I normally write them in parallel as well.

In this case, the tests are providing examples for how the code should behave. As such, that's worth writing out explicitly. Given the lose nature of XML, I'll probably still wind up debugging them in parallel.

The test code uses the clojure.test unit testing framework, dividing the tests up into groups, doing a simple equality comparison between the string we want out and the code that should generate it:
;; Unit tests

(ns org.mired.bitchin.test
[:use [clojure.test] :only (is)]
[:use org.mired.bitchin.html])

(deftest inline-elements
(is (= "<p />" (p)) "empty contents")
(is (= "<map />" (map_)) "keyword tag")
(is (= "<p>now</p>" (p "now")) "string contents")
(is (= "<p>now</p>" (p 'now)) "symbol contents")
(is (= "<p>:now</p>" (p :now)) "keyword contents")
(is (= "<p><br /></p>" (p (br))) "element content")
(is (= "<p>now &amp;</p>" (p "now &")) "ampersand encoding")
(is (= "<p>now &lt;</p>" (p "now <")) "less than encoding")
(is (= "<p>now is the</p>" (p "now " 'is " the")) "multipart contents"))

(deftest block-elements
(is (= "<div />" (div)) "empty contents")
(is (= "<meta />" (meta_)) "keyword tag")
(is (= "<div>\nthen\n</div>" (div "then")) "string contents")
(is (= "<div>\nthen\n</div>" (div 'then)) "symbol contents")
(is (= "<div>\n:then\n</div>" (div :then)) "keyword contents")
(is (= "<div>\n<br />\n</div>" (div (br))) "element contents")
(is (= "<p>now &amp;</p>" (p "now &")) "ampersand encoding")
(is (= "<p>now &lt;</p>" (p "now <")) "less than encoding")
(is (= "<div>\nthen\nwas\nthe\n</div>" (div "then" 'was "the"))
"multipart contents"))

(deftest mixed-content
(is (= "<div>\n<span>now is the</span>\n<b>time</b>\n</div>"
(div (span "now is the") (b 'time)))))

(deftest attributes-block
(is (= "<div border=\"0\" />" (div {:border 0})) "keyword name")
(is (= "<div border=\"0\" />" (div {'border 0})) "symbol name")
(is (= "<div border=\"0\" />" (div {"border" 0})) "string name")
(is (= "<div width=\"100%\" />" (div {:width "100%"})) "string value")
(is (= "<div border=\"0\" width=\"100%\" />" (div {:border 0, 'width "100%"}))
"multiple attributes")
(is (= "<div border=\"0\" width=\"100%\">\n<span width=\"90%\">Text</span>\n</div>"
(div {"border" 0, :width "100%"} (span {'width "90%"} "Text")))
"multiple attributes with content"))

(deftest attributes-inline
(is (= "<span border=\"0\" />" (span {:border 0})) "keyword name")
(is (= "<span border=\"0\" />" (span {'border 0})) "symbol name")
(is (= "<span border=\"0\" />" (span {"border" 0})) "string name")
(is (= "<span width=\"100%\" />" (span {:width "100%"})) "string value")
(is (= "<span border=\"0\" width=\"100%\" />" (span {:border 0, 'width "100%"}))
"multiple attributes")
(is (= "<span class=\"now&amp;then\" />" (span {:class "now&then"}))
"ampersand encoding")
(is (= "<span class=\"&quot;FOO&quot;\" />" (span {:class "\"FOO\""}))
"quote encoding")
(is (= "<span class=\"x&gt;3\" />" (span {:class "x>3"}))
"greater than encoding")
(is (= "<span border=\"0\" width=\"100%\"><b width=\"90%\">Text</b></span>"
(span {"border" 0, :width "100%"} (b {'width "90%"} "Text")))
"multiple attributes with content"))

(run-tests 'org.mired.bitchin.test)