America's college Football Bowl Subdivision (FBS) has a history
of controversy when it comes to picking a champion. And a history
of proposed solutions. I've decided to examine how well those
solutions would have worked if applied to previous seasons.
The football background
If you're a football fan, you can skip this section.
The FBS has one unusual property. Unlike every other organized
college sport in the US, and every other nationally organized
sport anywhere in the world, it has no clear cut way to pick a
champion for each season. Instead, various groups - some as small
as a single person - analyze the games played during the season
and rank the teams according to that play, with the highest ranked
team at the end of the season being crowned "champion". If the
major groups agree with each other, there's a "consensus
champion". If not, then you have a controversy.
People familiar with other sports might be amazed that there's
ever a consensus. However, FBS teams play few enough games that
winning every game in a season is possible, and it's not at all
unusual for one or more teams to be undefeated in a season. It's
also hard enough that it's not unusal if there are no undefeated
teams at the end of a season.
So the most important criteria any ranking system considers is
"number of losses and ties". (Football fans who decided to read
this section are now saying "But what about non-AQ schools". For
now, I'm going to ignore that, and I'll justify doing so in the
sequel.) The other criteria are where the controvery comes in, as
a case can be made that any two undefeated teams are equally good,
and ditto for any two one-loss teams (provided they haven't played
each other), etc.
The proposed solutions - short of an actual playoff, also to be
discussed in the sequel - involve picking some small number of
teams from the top to play each other. This system has actually
been in use since the 1992 season to pick the top two teams. It
doesn't seem to be any less controversial, and the formula for
picking the two teams has been adjusted nearly continuously to try
and deal with those controversies. A "plus-1" system gets a lot of
talk, even though it has multiple definitions, but usually
involves picking the top 4 teams.
The task
Since the FBS has a controversy any time either a clear winner
fails to emerge, or - since the Bowl Coalition and it's succesors
starting picking two teams for a championship game - if there's not
a clear choice of two teams, the goal of the project is to see how
often picking either 1, 2 or 4 teams (historical, current and
"plus-1" approaches) who get a chance to play for the national
championship would have done on previous seasons.
Deciding if there would have been controversy for a system is
easy. If the last team to make the cutoff - either the first place
team, the second place team or the fourth place team - has the
same number of losses/ties as some team that doesn't get such a
chance, the system with that cutoff has a conroversy for that
season.
The code
Given that definition, the code is straightforward: for each
season, get the records of all the teams, sort them based on
losses/ties, and the check the records of the two teams on either
side of each cutoff. This part presents the code for doing
that.
So, let's start by getting the record for a season. Since that
involves IO, the function that does it will work in the IO
monad, giving us loadYear:
loadYear :: Int -> IO (Bool, Bool, Bool)
loadYear =
liftM getYear . (>>= getResponseBody) . simpleHTTP . getRequest . makeURL
makeURL :: Show a => a -> String
makeURL year = "http://www.jhowell.net/cf/cf" ++ show year ++ ".htm"
makeURL is a helper function that just turns the
integer year into a URL for the data on www.jhowell.net. That url is fed
to a series of functions from the Network.HTTP module
to produce a page
This idiom - a series of functions separated by
'.' - is common in Haskell code. '.' is
the compose operator, producing a new function that is the
result of applying the right function and then the left
function. This idiom and an alternative are going to show up in a
number of places.
The interesting part is the first two steps, which actually
happen last. The step before them - simpleHTTP takes
a string value and produces a value in the IO monad. The form
(>>= getResponseBody) is a function that will
apply getResponseBody - which also work in the IO
monad - to the value in the monad, returning the result from
getResponseBody. liftM getYear is
similar, except that getYear (presented below) does
not work in the IO monad. liftM applies it to the
value in the monad, and returns it's result in the monad. I can't
use liftM on getResponseBody because
getResponseBody, unlike getYear returns
a value in the monad.
I've just talked a lot about monads. If you're not
familiar with them, you're probably wondering what I'm talking
about. A monad is a class - by which I mean a group of functions
that can be applied to members of the class - that functional
programming languages use to hide plumbing of various types. In
this case, the IO monad hides the state information for whatever
IO is being done from. One of the properties of a monad is that it
can be viral. If a function uses a value in some monad, the
function has to return a value in that monad unless the monad
includes a function to unwrap the value. Without that, while the
using function can extract the value from the monad to pass it to
functions that aren't in the monad (know as pure functions)
and get a value back, the value of the function has to be in the
monad. After all, if a function does IO, then any function that
calls it also does IO.
So makeURL and getRequest are pure
functions, returning pure values. simpleHTTP takes
that value and does IO, so it returns a value in the IO monad. The
next two functions - (>>= getResponseBody) and
liftM getYear have to accept and return values in the
IO monad.
The next group of functions are easier, I promise. They're all
pure. getYear takes the string representing the page
for a season, and maps it to a set of three booleans, indicating
whether or not that seasons results would be a controversy for a
poll, a championship or a plus one system.
getYear :: String -> (Bool, Bool, Bool)
getYear = controversy . sort . (\ (t, r) -> map (getData t) r) . getRows . getTable . parseTags
And we once again see that list of function
compositions. parseTags is from the
Network.HTTP module, and turns the string into a list
of Tag String elements. getTable
extracts the table from that list:
getTable :: [Tag String] -> [[Tag String]]
getTable = partitions (isOpenTag "tr") . drop 2
. takeWhile (not . isCloseTag "table")
. dropWhile (not . isOpenTag "table")
And again I have a list of function compositions.
dropWhile skips over elements that it's first
argument says to - in this case, any element that is not a
"table" open tag, takeWhile accepts
elements that it's first argument says to accept - in this case,
any element that is not a "table" close tag.
drop just returns the list without it's first two
elements, which are the "table" open tag and the text
between it and the start of the first row. Finally,
partitions (also from the network.HTTP
module) turns the list of Tags into a list of lists
of Tags, starting a new list at every
"tr" open tag.
isOpenTag and isCloseTag are very
similar, using haskell's argument matching facility to check for
TagOpen or TagClose elements in the
list, and then comparing the text in the tag to their first
argument to make sure this is the open or close for the proper
element:
isOpenTag :: String -> Tag String -> Bool
isOpenTag tag (TagOpen t _) = map toLower t == tag
isOpenTag _ _ = False
isCloseTag :: String -> Tag String -> Bool
isCloseTag tag (TagClose t) = map toLower t == tag
isCloseTag _ _ = False
The table from getTable is then passed to
getRows. getRows is more complicated, in
that it has to check the first row to see if this year has a
column for ties (the rules changed in the 90s, making ties
impossible). It returns a Boolean indicating whether or not there
was a tie along with list of rows after the first one.
getRows :: [[Tag String]] -> (Bool, [[Tag String]])
getRows rs = (isTies (head rs), tail rs)
where isTies r = maybe False (const True) (find (== TagText "Ties") r)
isTies is a helper function only visible in the
getRows namespace. It runs code in the Maybe
monad, which can either hold a value or not (i.e - is
Nothing). This monad has an extraction function
maybe that can return values outside the monad.
It's arguments are a
value to return if there is no value in the monad (i.e. - it's
Nothing), a function to apply to the value if there
is one, and the value to be checked. In this case, I return
False if there was no "Ties" column, and
use a function that always returns True if there is a
"Ties" column.
The piping the maybe monad is used to hide is error
checking. You can pass a value in the monad through a string of
functions that return a value in the monad, and once one returns a
Nothing the rest will be skipped, passing the
Nothing to the top level without having to explicitly
code tests for it at each step. I'm not taking advantage of that
here, but would have if I'd wanted to treat missing pages as
seasons with no teams (as in 1871).
The next step in the processing is to extract the controversy
information for each season. That's done by (\ (t, r) ->
map (getData t) r), an anonymous function that accepts the
tie Bool and then maps getData t over
the rows. getData t uses Haskell's automatic currying
to create a function that already has the ties Bool
and will process a row:
getData :: Bool -> [Tag String] -> (Int, Int)
getData True = (\gs -> (head gs, head $ tail gs))
. map getLosses . take 2 . dropToLosses
getData False = (,0) . getLosses . head . dropToLosses
The two variants of getData are almost
identical. Both start with dropToLosses, which
partitions the list on data elements, then apply
getLosses either to the first element (if there are
no ties) or the first two (if there may have been ties) and
finally puts the results in a tuple, using a 0 for ties if they
weren't possible.
To summarize the list tweakings: I started with a list of
Tag's, turned that into a list of lists, each
sub-list being a row. The processed each sub-list through
getData, which once again turns it into a list of
lists - this time of data elements in the row.
dropToLosses is simple:
dropToLosses :: [Tag String] -> [[Tag String]]
dropToLosses = drop 8 . partitions (isOpenTag "td")
It does the partitioning, then drops the first eight sub-lists,
representing things in the table before the losses column.
getLosses is almost as simple:
getLosses :: [Tag String] -> Int
getLosses = read . fromTagText . head . tail
It drops the first element, the "td" tag, then
extracts the fist element after that, which is the text in the
"td" tag, then uses read to convert that from a
string into an integer.
The last step in getData is to create a triple of
Bools, saying whether the first and second teams had
the same record, whether the second and third had the same record,
and whether the fourth and fifth had the same record. Because list
indices start at 0, I have to subtract one from each value in the
comparison:
controversy :: [(Int, Int)] -> (Bool, Bool, Bool)
controversy records = (records !! 0 == records !! 1,
records !! 1 == records !! 2,
records !! 3 == records !! 4)
genStats generates the stats, but starts by calling
loadYear, putting it in the IO monad. genStats
uses loadYear on a
list of years, then calculates and returns the percentage of times
each type of controversy occurs in that list of years:
genStats :: [Int] -> IO (Int, Int, Int)
genStats years = do
pages <- mapM loadYear years
return $ percents (foldl' counts (0, 0, 0) pages)
(fromIntegral $ length pages)
where
percents (a, b, c) l = (a % l, b % l, c % l)
(%) a b = round $ 100 * a / b
counts (a, b, c) (ab, bb, cb) = (minc a ab, minc b bb, minc c cb)
minc x xb = x + if xb then 1 else 0
genStats gets the list of controversy triples via
loadYear, then uses a quad of helper functions to calculate
percentages. percents takes the triple of counts and
the length of the list and uses % to generate the
percentages. This is the one place where Haskell's type system
gets in the way. It can't coerce the integer from length into a
type that can be used with /, so it has to be
explicitly coerced with fromIntegral.
counts generates the actual counts, using
minc to increment the current value of each count
depending on whether or not there was a controversy for that count
on this page.
I use makeYears to process the arguments. It takes
a list of Strings, and returns a list of
Ints. If the input list has only one element, it's
value as an Int is wrapped in a list and returned.
If it has two elements, they are taken from the list, turned into
Ints, and the range from the first to the second is
returned. Otherwise an error message is printed:
makeYears :: [String] -> [Int]
makeYears args =
case length args of
1 -> [read (head args)]
2 -> (\l -> [head l .. head $ tail l]) $ map read args
otherwise -> error "Arguments must be first and optional last year."
Finally, the main routine - which in Haskell as in
other languages, is where compiled programs start - uses a chain
of functions in the IO monad to get the results and print
them. This time, the unusual function is makeYears,
which isn't in the IO monad, but returns a value to
genStats, which is.
main =
getArgs >>= genStats . makeYears >>= uncurry3 (printf format)
where
format = "Percentage of seasons with controversies using\n" ++
"Poll: %d%%, Game: %d%%, Plus 1: %d%%\n"
That's the code. You can, of course, fetch it from the repository if you want
to build it and see the results yourself. I'll be discussing that
in the next entry.
Dealing with the first three years
This code should work for all football seasons from 1872
forward. There was no football played in 1871. The first season
was played in 1869. There were
fewer than five teams playing in 1869 and 1870. Dealing
with them would require deciding if no teams in the lower slots
should be counted as year played without controversy, or not
counted at all, and the code modified to match.
In 1869, there were two one-loss teams, so a poll had a
controversy, and a championship game didn't. This table also
requires special handling, as the pre-season ranking column is
missing.
In 1870, there there three teams, one undefeated, and two with
one loss. The code would have counted this as no controversy for a
poll, and a controversy for a championship. However, Princeton
beat Rutgers and Rutgers beat Columbia, so the rankings for 1, 2
and 3 aren't controversial.
In 1871, there was no college football. Worse yet, that page
generates an error instead of an empty list, so handling it would
require special handling.
Since it would have taken three changes to include just two
more seasons - and problematic ones at that - these are
ignored.