Monday, May 23, 2011

Creating a Hugo Awards DB with Clojure and Sqlite

In my previous post I wrote code to retrieve the Hugo Award – Best Novel winners and nominees for the 2000s and write the results into a text file. While I was working on that post I thought it would be nice to have a database with this information. In this post I will walk you through the code I wrote to create a sqlite database using the clojure.contrib.sql library.

The Goal

By the end of the post I will have code that retrieves the nominees and winners since 2000, then creates and loads a sqlite database with the parsed data. The nominees table will have the following columns: id, year, title, author, winner, read_it, own_it, want_it.

The Setup

The first thing to do is get sqlite installed on your machine. Since I’m working on Windows I grabbed these two downloads: sqlite shell and sqlite-dll. After downloading the files make sure they are in your PATH.

Next, I created a hugoDB branch to my hugo project on github to keep the code from this post separate from the original hugo post.

You will also need leiningen installed. Leiningen is a Clojure build tool that helps you manage your projects. I use it in all of my Clojure projects. The install takes no time at all, just follow the instructions on the project’s page and you will be ready for business.

Creating/Updating the HugoDB Project.clj File

Once the hugoDB branch was created I updated the project.clj file to include the sqlitejdbc library. The library allows me to connect to a sqlite database. Here is the updated project file:

If you want to start a fresh project, check out Create the Hugo Project in my previous post. After creating the project add the sqlitejdbc dependency. To ensure you have all necessary dependencies run the following command:

lein deps

The deps command will download and install any dependencies that are not already in the project’s lib directory. Now that the project.clj file has been updated and the dependencies are in place, let's move on to the cmdline.clj file.

The Code

The code consists of an updated cmdline.clj file to support the database generation process and a command line option. I have also updated hugo/parser.clj, removing unnecessary functions and adding a new one. The database related code is in two files. The first file is hugo/db/createsqlite.clj, which handles the database creation and loading. The second file, hugo/db/sqlite.clj, contains code that retrieves and inserts data. I removed the hugo/text-formatting.clj file from this branch since I am not writing out to a text file.

The cmdline.clj file

The -main function was updated to support command line options using the clojure.contrib.command-line/with-command-line macro. The macro makes it possible to map a command line option to a local variable. For this ‘app’ the only option is the --drop option. When I run (-main "--drop" "true") the local variable bound to that option will contain the string "true". The macro also has the --help option built in. It prints the comment that is directly under the with-command-line line and a description of each option you’ve defined. Here’s what (-main "--help") prints out when I run it in the REPL:

[Image: REPL output of (-main "--help")]

After the command line support is in place I need to check whether --drop is true. The if form decides whether the nominees table must first be dropped using the hugo.db.createsqlite/drop-table function.

Next, I grab the links from the Hugo Awards History page. Since I’m only interested in the winners from 2000 on I only pass the first 12 links to the create-and-load-db function. That is the function that kicks off the database creation and loading process.

The create-and-load-db function calls a wrapper function called get-data. I created the function so I can make use of the map function to retrieve the nominees data. The hugo.parser/parse-best-novel-nominees function is called to parse the data from the web pages. I will get into the details of parse-best-novel-nominees a little later. After all the data has been parsed I create the database by calling the hugo.db.createsqlite/create-db function. The create-db function creates the nominees table in the hugo database. The last line of the create-and-load-db function is where the data is inserted into the newly created nominees table by calling the hugo.db.createsqlite/process-awards function.

That’s it for the cmdline.clj file. This file exists simply to allow me to create the database from the command line. Now let's take a look at the database code by walking through the hugo.db.createsqlite.clj file.

The hugo.db.createsqlite.clj file

The first three lines are the typical namespace and dependency declarations. I have included dependencies on my hugo.db.sqlite.clj file and the clojure.contrib.sql library giving it a short name of sql. The short name allows me to call functions in the library by prefacing them with sql/. The sql library allows me to use JDBC to access the database.

hugo.db.sqlite/db contains the information needed by JDBC to locate and connect to the database. In order to create the database I needed to add the :create flag to the connection. I did this by calling the merge function with the db variable and the :create flag, which gave me new-db-conn. Now I have a connection that allows me to create the database.
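To make the merge step concrete, here is a minimal sketch. The keys and values in the db map are my assumptions for illustration (the real map lives in hugo.db.sqlite/db); only the merge call itself is what the paragraph above describes.

```clojure
;; Hypothetical connection map; the real values live in hugo.db.sqlite/db.
(def db
  {:classname   "org.sqlite.JDBC"
   :subprotocol "sqlite"
   :subname     "db/hugo.sqlite3"})

;; merge returns a new map with :create added; db itself is left unchanged.
(def new-db-conn (merge db {:create true}))
```

Because Clojure maps are immutable, the original db map can still be used for ordinary queries while new-db-conn is used only for creation.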

Once I had the new-db-conn connection it was time to get down to business. The ‘gateway’ function into the database creation process is the create-db function.

There really isn’t much to this function but there are a few things I would like to point out. First, with-connection will ‘wrap’ the code that makes up the body of the call, which in our case is the line:

(create-tables)

Wrap? What I mean by wrap is any database related code within the ‘with-connection’ body will use the database connection created by the with-connection call. When the body has finished the connection is closed.

The create-db function contains the call to create-tables, which is where my nominees table is created. Inside create-tables, the first parameter of the create-table function is the name of the table to create. The vectors that follow the table name are the definition for each column. Each vector has the name of the column followed by its data type. Any special description of the column, like primary key or unique, is listed after the data type.
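To show what those column vectors express, here is a small sketch that turns them into the SQL they stand for. This is not how clojure.contrib.sql is implemented; create-table-sql and column->sql are hypothetical helpers, and while the column names come from the goal above, the data types and the primary key are guesses on my part.

```clojure
(require '[clojure.string :as string])

;; Assumed column definitions: names are from the post,
;; data types and constraints are guesses for illustration.
(def nominee-columns
  [[:id      :integer "PRIMARY KEY"]
   [:year    :text]
   [:title   :text]
   [:author  :text]
   [:winner  :integer]
   [:read_it :integer]
   [:own_it  :integer]
   [:want_it :integer]])

(defn column->sql
  "Renders one column vector as SQL text: name, type, then any extras."
  [spec]
  (string/join " " (map #(if (keyword? %) (name %) (str %)) spec)))

(defn create-table-sql
  "Hypothetical helper: builds the CREATE TABLE statement the specs describe."
  [table specs]
  (str "CREATE TABLE " (name table)
       " (" (string/join ", " (map column->sql specs)) ")"))
```

Calling (create-table-sql :nominees nominee-columns) yields a single CREATE TABLE statement, which is a handy way to sanity check a column list before handing it to the library.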

Once the database table has been created it's time to parse and load the data into the nominees table. The last call in cmdline/create-and-load-db is to the process-awards function. The process-awards function takes the parsed award data and feeds each nominee to the add-new-nominees function using the map function. The add-new-nominees function doesn’t have much meat to it. I'm finding that with Clojure you can get a lot done with very little code. First the function grabs the year from the category struct which is the first parameter to the add-nominee function. Next, the map function is used to insert a record for each entry in the category struct’s book sequence. Each item is the second parameter for the add-nominee function. When the map call completes there will be a record for each of the year's nominees in the nominees table. When the process-awards function finishes I will have a nominees table loaded with all nominees/winners since 2000.
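Here is a rough sketch of that flow that runs without a database. An atom stands in for the nominees table, the function bodies are my reconstruction rather than the code in createsqlite.clj, and the :year and :books keys are assumed from the description of the category struct. One thing worth knowing: map is lazy, so when it is used for side effects like inserts the result has to be forced (here with doall) or nothing gets written.

```clojure
;; An atom stands in for the nominees table so the sketch runs without sqlite.
(def nominees-table (atom []))

(defn add-nominee
  "Stand-in for the real insert: records one row for a book in the given year."
  [year book]
  (swap! nominees-table conj (assoc book :year year)))

(defn add-new-nominees
  "Adds one row per book in a category (assumed keys :year and :books)."
  [category]
  (let [year (:year category)]
    ;; doall forces the lazy map so every insert actually runs.
    (doall (map #(add-nominee year %) (:books category)))))

(defn process-awards
  "Feeds each parsed category to add-new-nominees."
  [awards]
  (doall (map add-new-nominees awards)))
```

If the return values are not needed, doseq is a common alternative to (doall (map ...)) for this kind of side-effecting loop.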

Running the HugoDB code

To create the database you can either run the code from within the REPL or the command line. Here's how to run it in the REPL. First run the following command from the project’s home directory:

lein repl

The first time you run the app you'll just need to call the -main function like so:

(-main)

After you run it one time you'll need to run it with --drop true to drop the nominees table before you start creating the database. Here's how to call -main with --drop true:

(-main "--drop" "true")

If you choose to run it from the command line run it like this the first time:

lein run

After running the app once you'll need to run it like this:

lein run --drop true

Now that we have a database of the nominees it's time to do some querying to ensure that we have loaded the data correctly. There are two ways for us to accomplish this. First I will use the sqlite3 command line tools to run SQL against the database. After that I’ll use Clojure in the REPL to show a few select functions. Now it’s time to fire up sqlite.

The Sqlite Shell

Jump back to the command prompt and cd into the project’s home directory. From there change into the db directory and get a listing of its contents. The directory should have a file named hugo.sqlite3. That is the database file the app just created. To get to a querying interface run the command (assuming you have added sqlite3 to your PATH):

sqlite3 hugo.sqlite3

To view the tables in our database enter .tables from the sqlite prompt and you should see our single table nominees listed. Let's make sure that the 2004 records were loaded correctly.

[Image: sqlite3 query results for the 2004 nominees]

The columns we care about here are id, year, title, author and whether or not the book was the winner plus three other columns that are there for my next post. The first record has 1 in the winner column which indicates it was the winner. All of the nominees for the year were also properly saved.

The hugo.db.sqlite.clj file

In addition to the sqlite shell I wrote a few functions that will retrieve the nominees from the database. The first one I will use is get-nominees. It does what you might expect: it returns all of the records in the nominees table.

The get-nominees function introduces function overloading (multiple arities) in Clojure. If get-nominees is called without parameters it will call get-sql, passing in the sql statement defined in the get-all-nominees var, which returns all of the records. However, if a year is passed in, the function will return all winners/nominees for the given year. Before calling the get-sql function I add a predicate with a placeholder to the base sql statement. The new string is the first item in the vector. The next item is the value that will replace the placeholder when get-sql calls the with-query-results function. There is really not much to get-nominees; most of the work is done in the get-sql function.

All of the ‘get’ type functions are based on the sqlite.clj/get-sql function. It wraps the call to the clojure.contrib.sql/with-query-results function. The first parameter is the name that will be bound to the sequence of query results. The next parameter is a vector holding the sql statement and its parameters. The doall call forces the lazy sequence that contains the results into a ‘real’ sequence that is returned to the caller.
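The two arities can be sketched without touching the database. The helper name nominees-query and the exact SQL text are assumptions of mine; the point is only the shape of the vector (statement first, placeholder values after) that gets handed on for the query.

```clojure
;; Assumed base statement; the post's get-all-nominees var plays this role.
(def get-all-nominees "select * from nominees")

(defn nominees-query
  "Hypothetical helper: builds the [sql & params] vector for a query."
  ;; No arguments: every record.
  ([] [get-all-nominees])
  ;; One argument: add a predicate with a ? placeholder, then the value.
  ([year] [(str get-all-nominees " where year = ?") year]))
```

Passing the year as a separate value rather than splicing it into the string lets JDBC handle quoting, which is the usual reason to prefer placeholders.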
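Why doall matters is easier to see with a toy example. Everything below is a stand-in I made up (a flag in an atom instead of a real JDBC connection), but the failure mode is the same: a lazy sequence that escapes the connection's scope tries to fetch its rows after the connection is gone.

```clojure
;; A flag in an atom stands in for a real database connection.
(def connection-open? (atom false))

(defn fetch-row
  "Pretend row fetch that only works while the connection is open."
  [n]
  (if @connection-open?
    {:id n}
    (throw (IllegalStateException. "connection closed"))))

(defn with-fake-connection
  "Opens the pretend connection, calls f, then always closes it again."
  [f]
  (reset! connection-open? true)
  (try
    (f)
    (finally (reset! connection-open? false))))

;; Lazy: no row is fetched until the caller walks the sequence,
;; and by then the connection has been closed.
(defn lazy-rows []
  (with-fake-connection #(map fetch-row [1 2 3])))

;; Eager: doall walks the sequence while the connection is still open.
(defn eager-rows []
  (with-fake-connection #(doall (map fetch-row [1 2 3]))))
```

Calling (eager-rows) returns the three rows, while realizing the result of (lazy-rows) throws, which is the situation the doall in get-sql avoids.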

Summary

Creating a sqlite database and running queries against it with Clojure is straightforward. Adding support for command line options is trivial. In my next 'Hugo' post I am going to create a UI that will allow me to view the data. Stay tuned!

As part of my Clojure learning process I appreciate any and all comments on my code. Following my last post I had great comments that helped improve my code and expand my Clojure knowledge. Please keep the comments coming.

Resources

clojure, clojure-contrib, enlive, leiningen, sqlite, my previous post.

Code

The code for this project can be found on the hugoDB branch of the Hugo project.  You can download the entire hugoDB branch of the project here.  The code and database files I discussed in the post can be viewed here:  cmdline.clj, createsqlite.clj, sqlite.clj and the database.

Tuesday, May 10, 2011

Parsing Web Pages with Clojure and enlive

As my infatuation with Clojure grows I thought I would write some code to retrieve all of the works that have either won or been nominated for the Hugo award's Best Novel category. I know it’s geeky but it is information that I can use so why not use it as a source to learn more Clojure?

Please keep in mind that I am writing this blog from my perspective as a Clojure noob. Any and all feedback on the post or the code is welcome. Even if you think it is minor, please pass it along. With that said, let's get on with the post!

Goal

The goal of this post is to have code that will retrieve the winners and nominees for the Best Novel category since 2000 and write them to a text file with the following layout:

Year Hugo Awards - Best Novel
         Title – Author (Winner) 
         Title – Author 
         Title – Author ...

The Setup

UPDATE: As @Bendlas mentioned below, leiningen will install the clojure jars. You can skip the first paragraph and go right to installing leiningen. I have tested it on an Ubuntu VM and when I ran lein repl leiningen downloaded the clojure jars. On Windows you will need to have curl.exe or wget.exe installed to get it to work. Thanks again @Bendlas.

If you don’t already have Clojure installed you can get everything you need from the download page. While you are there go ahead and grab the clojure-contrib.zip as well. Assuming you already have Java on your machine the next step is to add the Clojure and clojure-contrib directories to your CLASSPATH.

Once you have Clojure installed the next thing to install is lein. It is a Clojure build tool that helps you manage your projects. In my short period of time in the Clojure world lein has been a great tool and I have found it has many useful plugins. The install takes no time at all, just follow the instructions on the project’s page and you will be ready for business. Now that Clojure and lein are installed I'm ready to start the ‘Hugo’ project.

Creating the Hugo Project

To create the project, run lein with the parameters below:

lein new hugo

The command will create a directory structure for our project. For a little more information on the project directory structure lein creates take a look at my previous post Getting Started with Ring and Compojure - Clojure Web Programming.  I have a little more detail there.

I updated the project.clj file to include dependencies on enlive and clojure-contrib. A line was added to indicate which namespace my main function is located in. The new line allows me to run this 'app' from the command line.

The project setup is complete, now it's time to start parsing!

The Code

The code for this project is in three source files under the project’s src directory.  The cmdline.clj file houses the app’s main function which allows the app to be started from the command line. The hugo/parser.clj file contains the code that retrieves and parses the web pages.  The last file is hugo/text-formatting.clj which contains the code to format the output.  In this post I will walk through the cmdline.clj and hugo/parser.clj files.

The cmdline.clj file

As you can see, there isn’t much to this file. The file exists to create a class that allows me to run hugo from the command line and provides a concise way to run the code within the REPL. The first five lines set up the namespace, include my code’s namespaces and load the clojure.contrib.duck-streams library, which I will use to write the results out to the output file.

Since I want to run this application from the command line I need to generate a java class file. To do this I use the :gen-class directive. The generated cmdline.class file will be placed in the project’s classes directory. Any method in the java class will look in the source clj file for a function with the same name but preceded by a -. That is why the -main function exists. It is what is called when I execute the app outside of the REPL.

The -main function calls the hugo.parser/get-award-links function to retrieve the links to each year’s awards page from the Hugo awards history page. The links are returned as a sequence. Since I only want the entries for 2000 to 2010 the code grabs the first 12 links, which are passed to the prep-for-file function.

The prep-for-file function is where the real parsing is kicked off. I will discuss the parsing in more detail later. For now just know that the data retrieved from the URL is formatted by the hugo.text-formatting/format-output function and the map function. The results are converted to a string by using the apply and str functions.

When the parsing and formatting is complete the results are passed to the clojure.contrib.duck-streams/spit function. The spit function, I really like that name, writes the results to a file named hugo_awards_best_novel.txt. That's it. I've given you the 5 second tour of the cmdline.clj file, now it's time to take a look at the HTML parsing.

The hugo/parser.clj file

As you might expect this is where all the parsing code lives. The first function called is the get-award-links. The function is responsible for parsing out all of the links to the annual awards pages.

The get-award-links Function

The first task this function does is to retrieve the page’s HTML tags using the fetch-url function which wraps enlive’s html-resource function. The html-resource function retrieves a web page and returns its HTML tags in a sequence that is passed to other enlive functions as input.

Once the page has been parsed, I call enlive’s select function passing the tag sequence as the first parameter. Select's second parameter tells the function which tags I want out of the tags sequence using something similar to CSS selectors. In this case I’m telling select to grab all A tags inside of LI tags that are members of the page_item class and are within the DIV tag with the id of content. The second vector tells select that I want the text for each link so I can use it for the year value later. After the parsing of the tags is done the map function grabs the attrs of each tag that is returned, which gives the following for each link:

[Image: the attrs map returned for each link]

When this function call is completed a sequence of maps is returned, each with a title and href for a year in which the Hugo awards were given. The results of this call are passed to the prep-for-file function in the cmdline.clj file.
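The last step, pulling the attrs out of each selected tag, can be tried without enlive because each node is an ordinary map with :tag, :attrs and :content keys. The nodes below are hand-written stand-ins with made-up hrefs and titles, not real output from the Hugo site.

```clojure
;; Hand-written stand-ins for the nodes select returns;
;; the hrefs and titles are invented for illustration.
(def link-nodes
  [{:tag :a
    :attrs {:href "http://example.com/2010" :title "2010 Hugo Awards"}
    :content ["2010 Hugo Awards"]}
   {:tag :a
    :attrs {:href "http://example.com/2009" :title "2009 Hugo Awards"}
    :content ["2009 Hugo Awards"]}])

;; A keyword is also a function, so (map :attrs ...) pulls out each attrs map.
(def award-links (map :attrs link-nodes))

;; take limits the result to the most recent years, as -main does with 12.
(def recent-links (take 12 award-links))
```

Since take stops early when the sequence is shorter than the count, asking for 12 links is safe even if fewer are found.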

The get-awards-per-year Function

Now that I have the links to all the award pages it is time to gather all the data on each category. The function creates a sequence of category structs that contain the award category, the nominees/winners in the category and the year the award was given.

The year’s page is retrieved using the fetch-url function and the results are stored in the page-content variable. Next, the parse-award-page function is called passing in the page-content as its only parameter. It returns a sequence that contains lists for each award given that year that will look like this: (("Best Novel") (array maps for each nominee/winner)). I will refer to this sequence as the category sequence from here out. Right now the parse-award-page function looks like a black box, but I promise to get into the details in a bit.

The results of the parse-award-page call are passed to map to create a sequence that contains category structs. The category struct is defined as:
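Going by the three keys described in the sections that follow, a minimal sketch of such a definition could look like this. The key names and their order are taken from the prose, so treat the exact form as an assumption; the sample values are placeholders.

```clojure
;; Keys follow the post's description: award title, books, year string.
(defstruct category :award :books :year)

;; struct takes the struct definition, then one value per key in order.
(def example-category
  (struct category "Best Novel" [] "2004 Hugo Awards"))
```

A struct map behaves like any other map, so (:award example-category) reads a field the usual way.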

Getting the award string

In the map function call I am using an anonymous function to create a category struct. When the map call completes it returns a sequence of category structs for the given year. Creating a new struct is easy, just pass in the name of the struct to create and a value for each of the keys in the struct. The category struct's first key is the :award key. The value is parsed with this code: (apply str (first %)). Since the first item in the category sequence is a string in a lazy sequence representing the award's title I need to use apply str instead of just str. If I called (str (first %)) what I would get back is something like this: clojure.lang.LazySeq@5784711f which is obviously not what we want. 
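The difference between str and apply str on a lazy sequence is easy to reproduce in the REPL. The sequence below is a stand-in I built with map; the real one comes out of the parsing code.

```clojure
;; A lazy sequence holding one string, like the title part of a category pair.
(def title-seq (map identity ["Best Novel"]))

;; str on the sequence itself prints the object, not its contents.
(def wrong (str title-seq))

;; apply str spreads the items out, so this is the same as (str "Best Novel").
(def right (apply str title-seq))
```

If the sequence held several strings, apply str would concatenate all of them, which is why it also works for :content values made of multiple text pieces.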

Getting the books sequence

Grabbing the books that represent the nominees/winners is almost as easy as the award. Since I know that the nominee/winners are stored in the second part of the category sequence lists I use the second function to retrieve them.

(get-book-info (rest (second %)))

I’m using rest here because for some reason the first entry in the sequence is "\n", and I’m not sure why. In the future I will figure it out but for now I’m using the rest call to get to the ‘guts’ of the book sequence. The results of the rest call are passed to a helper function that returns a sequence of work structs that will be stored in the category struct’s books key.

Getting the year string

The last key in the category struct is the year key. It will store a string that begins with the year and ends with “Hugo Awards”.  The code to retrieve the ‘year’ makes use of the select statement, grabs the first element in the returned sequence and converts the value of :content to a string. Here's what the code looks like:

(apply str (:content (first (html/select page-content #{[:div#content :h2]}))))

Now I have a value for each of the category struct’s keys. The struct provides a much easier way to work with the data. At this point all of the parsing has been completed. All that is left is for the cmdline/prep-for-file function to format the data and write it out to the file. Since that is pretty straightforward I'm going to leave that code out of the post. Before I wrap up this post I’d like to dive into the hugo.parser/parse-award-page function, where the real parsing happens.

The parse-award-page Function

Once the year’s award page has been retrieved, its tag sequence is passed to parse-award-page. The function grabs the category title and the nominees/winners and creates a sequence of lists. Here’s how it is done. All of the nominees/winners are found in the map function call. The call to select returns all UL tags found in the content DIV tag. Each tag is passed to the anonymous function which just pulls out the :content key from the tag’s array map, creating a sequence of book titles.

The category titles are parsed on the line that has the split-at function call. Again, the select function is called to find all P tags that are within the content DIV. The text for the first child of the P tag is returned, creating a sequence of category titles. The split-at function is called to ‘remove’ the first four P tag results since they contain information on where the awards banquet was held.

After both the titles and the nominees/winners sequences are created the interleave function is called. Interleave creates a single sequence by combining the two sequences one item at a time. The first item in the titles sequence is added to the new sequence followed by the first item in the nominees sequence, the second from titles is followed by the second nominees item, etc. When interleave returns I have one sequence that looks something like ("award title" "nominees" "award title" "nominees" …).

Having the sequence provided by interleave is nice but it isn’t going to work for what I want. I need to pair the category title with the nominees/winners for the category. This is where the partition function comes in. According to the partition documentation the function "returns a lazy sequence of lists of n items each", where n in our case is 2. When the parse-award-page function completes it returns a sequence of lists that match each category up with its nominees/winners, which is exactly what I need in the get-awards-per-year function.
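The split-at step can be tried on plain data. The strings below are placeholders rather than text from the awards page; split-at returns a vector of two sequences, and the second one is everything after the first four items.

```clojure
;; Placeholder paragraph texts: four lead-in entries, then category titles.
(def paragraphs
  ["intro 1" "intro 2" "intro 3" "intro 4" "Best Novel" "Best Novella"])

;; split-at returns [first-n rest]; second keeps only the part after the split.
(def titles (second (split-at 4 paragraphs)))
```

When the leading items are never needed, (drop 4 paragraphs) gives the same result in one call.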
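Here is the interleave and partition combination on small stand-in data (the titles and entries are placeholders I made up, shaped like the sequences described above).

```clojure
;; Stand-in data shaped like the two parsed sequences.
(def titles   [["Best Novel"] ["Best Novella"]])
(def nominees [["\n" "book a" "book b"] ["\n" "story a"]])

;; interleave alternates items: title, nominees, title, nominees ...
(def combined (interleave titles nominees))

;; partition 2 groups that flat sequence back into (title nominees) pairs.
(def categories (partition 2 combined))
```

Mapping list over both sequences, (map list titles nominees), produces the same pairs in one step, so that is an option if the intermediate flat sequence is not needed.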

How do I run it from the command line?

If you are like me most of the clojure you write is either run through the REPL or as a web app. I had no idea how to run this ‘app’ from the command line. After checking out the leiningen project again I noticed that there is a command called uberjar. What uberjar does is create a jar file that bundles everything up that is needed to run your app from the command line. The jar file uses the naming convention of:

<project name>-<version info>-standalone.jar

Remember I’m a .NET guy by day so I don’t have a real in-depth knowledge of jar files yet. I just know that they allow me to run the app from the command line. Once the jar file has been created I can run the app from the command line using this command:

java -jar hugo-0.0.3-SNAPSHOT-standalone.jar 

Summary

Parsing HTML using clojure is relatively easy using enlive. Using enlive I was able to parse the Hugo Awards information to create a text file with all of the Best Novel category nominees and winners ( hugo_awards_best_novels.txt ) since 2000.

One More Thing…

When you run the project you may encounter an IOException like this:

[Image: the IOException message]

You can resolve the issue by visiting the URL through a web browser. I believe I can get around this issue by setting the user-agent for my enlive html-resource call but I couldn’t figure out how to do it. If anyone has a suggestion please leave me a comment.
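One possible direction, which I have not tried against the Hugo site, is to open the connection through java.net directly and set the header there. The helper name open-with-user-agent is hypothetical; it only prepares the connection, and no request is sent until a stream is asked for.

```clojure
(import '(java.net URL))

(defn open-with-user-agent
  "Hypothetical helper: returns a not yet connected URLConnection for url
  with the User-Agent request header set."
  [url user-agent]
  (doto (.openConnection (URL. url))
    (.setRequestProperty "User-Agent" user-agent)))
```

Calling (.getInputStream conn) on the result performs the request and returns a stream. My understanding is that enlive's html-resource accepts an input stream as well as a URL, so the stream could be passed to it, but treat that as an assumption to verify.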

Resources

clojure, clojure-contrib, enlive, lein

Code

Download entire project. Code files: cmdline.clj, parser.clj and text_formatting.clj

The output file: hugo_awards_best_novels.txt