America's college Football Bowl Subdivision (FBS) has a history of controversy when it comes to picking a champion. And a history of proposed solutions. I've decided to examine how well those solutions would have worked if applied to previous seasons.
The football background
If you're a football fan, you can skip this section.
The FBS has one unusual property. Unlike every other organized college sport in the US, and every other nationally organized sport anywhere in the world, it has no clear cut way to pick a champion for each season. Instead, various groups - some as small as a single person - analyze the games played during the season and rank the teams according to that play, with the highest ranked team at the end of the season being crowned "champion". If the major groups agree with each other, there's a "consensus champion". If not, then you have a controversy.
People familiar with other sports might be amazed that there's ever a consensus. However, FBS teams play few enough games that winning every game in a season is possible, and it's not at all unusual for one or more teams to be undefeated in a season. It's also hard enough that it's not unusal if there are no undefeated teams at the end of a season.
So the most important criteria any ranking system considers is "number of losses and ties". (Football fans who decided to read this section are now saying "But what about non-AQ schools". For now, I'm going to ignore that, and I'll justify doing so in the sequel.) The other criteria are where the controvery comes in, as a case can be made that any two undefeated teams are equally good, and ditto for any two one-loss teams (provided they haven't played each other), etc.
The proposed solutions - short of an actual playoff, also to be discussed in the sequel - involve picking some small number of teams from the top to play each other. This system has actually been in use since the 1992 season to pick the top two teams. It doesn't seem to be any less controversial, and the formula for picking the two teams has been adjusted nearly continuously to try and deal with those controversies. A "plus-1" system gets a lot of talk, even though it has multiple definitions, but usually involves picking the top 4 teams.
The task
Since the FBS has a controversy any time either a clear winner fails to emerge, or - since the Bowl Coalition and it's succesors starting picking two teams for a championship game - if there's not a clear choice of two teams, the goal of the project is to see how often picking either 1, 2 or 4 teams (historical, current and "plus-1" approaches) who get a chance to play for the national championship would have done on previous seasons.
Deciding if there would have been controversy for a system is easy. If the last team to make the cutoff - either the first place team, the second place team or the fourth place team - has the same number of losses/ties as some team that doesn't get such a chance, the system with that cutoff has a conroversy for that season.
The code
Given that definition, the code is straightforward: for each season, get the records of all the teams, sort them based on losses/ties, and the check the records of the two teams on either side of each cutoff. This part presents the code for doing that.
So, let's start by getting the record for a season. Since that
involves IO, the function that does it will work in the IO
monad, giving us loadYear
:
loadYear :: Int -> IO (Bool, Bool, Bool)
loadYear =
liftM getYear . (>>= getResponseBody) . simpleHTTP . getRequest . makeURL
makeURL :: Show a => a -> String
makeURL year = "http://www.jhowell.net/cf/cf" ++ show year ++ ".htm"
makeURL
is a helper function that just turns the
integer year into a URL for the data on www.jhowell.net. That url is fed
to a series of functions from the Network.HTTP
module
to produce a page
This idiom - a series of functions separated by
'.
' - is common in Haskell code. '.
' is
the compose operator, producing a new function that is the
result of applying the right function and then the left
function. This idiom and an alternative are going to show up in a
number of places.
The interesting part is the first two steps, which actually
happen last. The step before them - simpleHTTP
takes
a string value and produces a value in the IO monad. The form
(>>= getResponseBody)
is a function that will
apply getResponseBody
- which also work in the IO
monad - to the value in the monad, returning the result from
getResponseBody
. liftM getYear
is
similar, except that getYear
(presented below) does
not work in the IO monad. liftM
applies it to the
value in the monad, and returns it's result in the monad. I can't
use liftM
on getResponseBody
because
getResponseBody
, unlike getYear
returns
a value in the monad.
I've just talked a lot about monads. If you're not familiar with them, you're probably wondering what I'm talking about. A monad is a class - by which I mean a group of functions that can be applied to members of the class - that functional programming languages use to hide plumbing of various types. In this case, the IO monad hides the state information for whatever IO is being done from. One of the properties of a monad is that it can be viral. If a function uses a value in some monad, the function has to return a value in that monad unless the monad includes a function to unwrap the value. Without that, while the using function can extract the value from the monad to pass it to functions that aren't in the monad (know as pure functions) and get a value back, the value of the function has to be in the monad. After all, if a function does IO, then any function that calls it also does IO.
So makeURL
and getRequest
are pure
functions, returning pure values. simpleHTTP
takes
that value and does IO, so it returns a value in the IO monad. The
next two functions - (>>= getResponseBody)
and
liftM getYear
have to accept and return values in the
IO monad.
The next group of functions are easier, I promise. They're all
pure. getYear
takes the string representing the page
for a season, and maps it to a set of three booleans, indicating
whether or not that seasons results would be a controversy for a
poll, a championship or a plus one system.
getYear :: String -> (Bool, Bool, Bool)
getYear = controversy . sort . (\ (t, r) -> map (getData t) r) . getRows . getTable . parseTags
And we once again see that list of function
compositions. parseTags
is from the
Network.HTTP
module, and turns the string into a list
of Tag String
elements. getTable
extracts the table from that list:
getTable :: [Tag String] -> [[Tag String]]
getTable = partitions (isOpenTag "tr") . drop 2
. takeWhile (not . isCloseTag "table")
. dropWhile (not . isOpenTag "table")
And again I have a list of function compositions.
dropWhile
skips over elements that it's first
argument says to - in this case, any element that is not a
"table"
open tag, takeWhile
accepts
elements that it's first argument says to accept - in this case,
any element that is not a "table"
close tag.
drop
just returns the list without it's first two
elements, which are the "table"
open tag and the text
between it and the start of the first row. Finally,
partitions
(also from the network.HTTP
module) turns the list of Tag
s into a list of lists
of Tag
s, starting a new list at every
"tr"
open tag.
isOpenTag
and isCloseTag
are very
similar, using haskell's argument matching facility to check for
TagOpen
or TagClose
elements in the
list, and then comparing the text in the tag to their first
argument to make sure this is the open or close for the proper
element:
isOpenTag :: String -> Tag String -> Bool
isOpenTag tag (TagOpen t _) = map toLower t == tag
isOpenTag _ _ = False
isCloseTag :: String -> Tag String -> Bool
isCloseTag tag (TagClose t) = map toLower t == tag
isCloseTag _ _ = False
The table from getTable
is then passed to
getRows
. getRows
is more complicated, in
that it has to check the first row to see if this year has a
column for ties (the rules changed in the 90s, making ties
impossible). It returns a Boolean indicating whether or not there
was a tie along with list of rows after the first one.
getRows :: [[Tag String]] -> (Bool, [[Tag String]])
getRows rs = (isTies (head rs), tail rs)
where isTies r = maybe False (const True) (find (== TagText "Ties") r)
isTies
is a helper function only visible in the
getRows
namespace. It runs code in the Maybe
monad, which can either hold a value or not (i.e - is
Nothing
). This monad has an extraction function
maybe
that can return values outside the monad.
It's arguments are a
value to return if there is no value in the monad (i.e. - it's
Nothing
), a function to apply to the value if there
is one, and the value to be checked. In this case, I return
False
if there was no "Ties"
column, and
use a function that always returns True
if there is a
"Ties"
column.
The piping the maybe monad is used to hide is error
checking. You can pass a value in the monad through a string of
functions that return a value in the monad, and once one returns a
Nothing
the rest will be skipped, passing the
Nothing
to the top level without having to explicitly
code tests for it at each step. I'm not taking advantage of that
here, but would have if I'd wanted to treat missing pages as
seasons with no teams (as in 1871).
The next step in the processing is to extract the controversy
information for each season. That's done by (\ (t, r) ->
map (getData t) r)
, an anonymous function that accepts the
tie Bool
and then maps getData t
over
the rows. getData t
uses Haskell's automatic currying
to create a function that already has the ties Bool
and will process a row:
getData :: Bool -> [Tag String] -> (Int, Int)
getData True = (\gs -> (head gs, head $ tail gs))
. map getLosses . take 2 . dropToLosses
getData False = (,0) . getLosses . head . dropToLosses
The two variants of getData
are almost
identical. Both start with dropToLosses
, which
partitions the list on data elements, then apply
getLosses
either to the first element (if there are
no ties) or the first two (if there may have been ties) and
finally puts the results in a tuple, using a 0 for ties if they
weren't possible.
To summarize the list tweakings: I started with a list of
Tag
's, turned that into a list of lists, each
sub-list being a row. The processed each sub-list through
getData
, which once again turns it into a list of
lists - this time of data elements in the row.
dropToLosses
is simple:
dropToLosses :: [Tag String] -> [[Tag String]]
dropToLosses = drop 8 . partitions (isOpenTag "td")
It does the partitioning, then drops the first eight sub-lists, representing things in the table before the losses column.
getLosses
is almost as simple:
getLosses :: [Tag String] -> Int
getLosses = read . fromTagText . head . tail
It drops the first element, the "td"
tag, then
extracts the fist element after that, which is the text in the
"td"
tag, then uses read to convert that from a
string into an integer.
The last step in getData
is to create a triple of
Bool
s, saying whether the first and second teams had
the same record, whether the second and third had the same record,
and whether the fourth and fifth had the same record. Because list
indices start at 0, I have to subtract one from each value in the
comparison:
controversy :: [(Int, Int)] -> (Bool, Bool, Bool)
controversy records = (records !! 0 == records !! 1,
records !! 1 == records !! 2,
records !! 3 == records !! 4)
genStats
generates the stats, but starts by calling
loadYear
, putting it in the IO monad. genStats
uses loadYear
on a
list of years, then calculates and returns the percentage of times
each type of controversy occurs in that list of years:
genStats :: [Int] -> IO (Int, Int, Int)
genStats years = do
pages <- mapM loadYear years
return $ percents (foldl' counts (0, 0, 0) pages)
(fromIntegral $ length pages)
where
percents (a, b, c) l = (a % l, b % l, c % l)
(%) a b = round $ 100 * a / b
counts (a, b, c) (ab, bb, cb) = (minc a ab, minc b bb, minc c cb)
minc x xb = x + if xb then 1 else 0
genStats
gets the list of controversy triples via
loadYear
, then uses a quad of helper functions to calculate
percentages. percents
takes the triple of counts and
the length of the list and uses %
to generate the
percentages. This is the one place where Haskell's type system
gets in the way. It can't coerce the integer from length into a
type that can be used with /
, so it has to be
explicitly coerced with fromIntegral
.
counts
generates the actual counts, using
minc
to increment the current value of each count
depending on whether or not there was a controversy for that count
on this page.
I use makeYears
to process the arguments. It takes
a list of String
s, and returns a list of
Int
s. If the input list has only one element, it's
value as an Int
is wrapped in a list and returned.
If it has two elements, they are taken from the list, turned into
Int
s, and the range from the first to the second is
returned. Otherwise an error message is printed:
makeYears :: [String] -> [Int]
makeYears args =
case length args of
1 -> [read (head args)]
2 -> (\l -> [head l .. head $ tail l]) $ map read args
otherwise -> error "Arguments must be first and optional last year."
Finally, the main
routine - which in Haskell as in
other languages, is where compiled programs start - uses a chain
of functions in the IO monad to get the results and print
them. This time, the unusual function is makeYears
,
which isn't in the IO monad, but returns a value to
genStats
, which is.
main =
getArgs >>= genStats . makeYears >>= uncurry3 (printf format)
where
format = "Percentage of seasons with controversies using\n" ++
"Poll: %d%%, Game: %d%%, Plus 1: %d%%\n"
That's the code. You can, of course, fetch it from the repository if you want to build it and see the results yourself. I'll be discussing that in the next entry.
Dealing with the first three years
This code should work for all football seasons from 1872 forward. There was no football played in 1871. The first season was played in 1869. There were fewer than five teams playing in 1869 and 1870. Dealing with them would require deciding if no teams in the lower slots should be counted as year played without controversy, or not counted at all, and the code modified to match.
In 1869, there were two one-loss teams, so a poll had a controversy, and a championship game didn't. This table also requires special handling, as the pre-season ranking column is missing.
In 1870, there there three teams, one undefeated, and two with one loss. The code would have counted this as no controversy for a poll, and a controversy for a championship. However, Princeton beat Rutgers and Rutgers beat Columbia, so the rankings for 1, 2 and 3 aren't controversial.
In 1871, there was no college football. Worse yet, that page generates an error instead of an empty list, so handling it would require special handling.
Since it would have taken three changes to include just two more seasons - and problematic ones at that - these are ignored.
No comments:
Post a Comment