Monday, May 23, 2011

Creating a Hugo Awards DB with Clojure and Sqlite

In my previous post I wrote code to retrieve the Hugo Award – Best Novel winners and nominees for the 2000s and write the results into a text file. While I was working on that post I thought it would be nice to have a database with this information. In this post I will walk you through the code I wrote to create a sqlite database using clojure.contrib.sql library.

The Goal

By the end of the post I will have code that will retrieve the nominees and winners since 2000 and create then load sqlite database with the parsed data . The nominees table will have the following columns: id, year, title, author, winner, read_it, own_it, want_it.

The Setup

The first thing to do is get sqlite installed on your machine. Since I’m working on I windows I grabbed these two downloads: sqlite shell and sqlite-dll. After downloading the files make sure they are in your PATH.

Next, I created a hugoDB branch to my hugo project on github to keep the code from this post separate from the original hugo post.

You will also need leiningen installed. Leiningen is a Clojure build tool that helps you manage your projects. I use in all of my clojure projects. The install takes no time at all, just follow the instructions on the project’s page and you will be ready for business.

Creating/Updating the HugoDB Project.clj File

Once the hugoDB branch was created I updated the project.clj file to include the sqlitejdbc library. The library allows me to connect to a sqlite database. Here is the updated project file:

If you want to start a fresh project, check out Create the Hugo Project in my previous post. After creating the project add the sqlitejdbc dependency. To ensure you have all necessary dependencies run the following command:

lein deps

The deps command will download and install any dependencies that are not already in the project’s lib directory. Now that the project.clj file has been updated and the dependencies are in place, lets move on to the cmdline.clj file.

The Code

The code consists of an updated cmdline.clj file to support the database generation process and a command line option. I have also updated the hugo/parser.clj removing unnecessary functions and added a new one.  The database related code is in two files.  The first file is hugo/db/createsqlite.clj which handles the database creation and loading.  The second file, hugo/db/sqlite.clj, contains code that retrieves and inserts data. I removed the hugo/text-formatting.clj file from this branch since I am not writing out to a text file.

The cmdline.clj file

The –main function was updated to support command line options using the clojure.contrib.command-line/with-command-line macro. The macro makes it possible to map a command line option to a local variable. For this ‘app’ the only option is the –-drop option. When I run (-main “--drop” “true”) the local variable will contain the string “true”. The macro also has the –-help option built in. It prints the comment that is directly under the with-command-line line and a description of each option you’ve defined. Here’s what (-main “--help”) prints out when I run it in the REPL:

image

After the command line support is in place I need to check to see if –-drop true. The if function is used to see if I need to drop the nominees table using the hugo.db.createsqlite/drop-table function first.

Next, I grab the links from the Hugo Awards History page. Since I’m only interested in the winners from 2000 on I only pass the first 12 links to the create-and-load-db function. That is the function that kicks off the database creation and loading process begins.

The create-and-load-db function calls a wrapper function called get-data. I created the function so I can make use of the map function to retrieve the nominees data. The hugo.parser/parse-best-novel-nominees is called to parse the data from the web pages. I will get into the details parse-best-novel-nominees a little later. After all the data has been parsed I create the database by calling hugo.db.createsqlite/create-db function. The create-db function creates the nominees table in the hugo database. The last line of the create-and-load-db function is where the data is inserted into the newly created nominees table by calling the hugo.db.createsqlite/process-awards function.

That’s it for the cmdline.clj file. This file exists simply to allow me to create the database from the command line. Now lets take a look at the database code by walking you through the hugo.db.createsqlite.clj file.

The hugo.db.createsqlite.clj file

The first three lines are the typical namespace and dependency declarations. I have included dependencies on my hugo.db.sqlite.clj file and the clojure.contrib.sql library giving it a short name of sql. The short name allows me to call functions in the library by prefacing them with sql/. The sql library allows me to use JDBC to access the database.

hugo.db.sqlite/db contains the information needed by JDBC to locate and connect to the database. In order to create the database I needed to add the :create flag to the connection. I was able to do this by using the merge function with the db variable and :create flag allowed me to create the new-db-conn. Now I have a connection that allows me to create the database.

Once I had the new-db-conn connection it was time to get down to business. The ‘gateway’ function into the database creation process is the create-db function.

There really isn’t much to this function but there are a few things I would like to point out. First the with-connection function will ‘wrap’ the code that makes up the body of the call which in our case the line:

(create-tables)

Wrap? What I mean by wrap is any database related code within the ‘with-connection’ body will use the database connection created by the with-connection call. When the body has finished the connection is closed.

The create-db function contains the call to the create-tables call which is where my nominees table is created. The first parameter of the create-table function is the name of the table to create. The vectors that follow the table name are the definition for each column. Each vector has the name of the column followed by its data type. Any special description of the column like primary keys, unique, etc.. is listed after the data type.

Once the database table has been created it's time to parse and load the data into the nominees table. The last call in cmdline/create-and-load-db is to the process-awards function. The process-awards function takes the parsed award data and feeds each nominee to the add-new-nominees function using the map function. The add-new-nominees function doesn’t have much meat to it. I'm finding that with Clojure you can get a lot done with very little code. First the function grabs the year from the category struct which is the first parameter to the add-nominee function. Next, the map function is used to insert a record for each entry in the category struct’s book sequence. Each item is the second parameter for the add-nominee function. When the map call completes there will be a record for each of the year's nominees in the nominees table. When the process-awards function finishes I will have a nominees table loaded with all nominees/winners since 2000.

Running the HugoDB code

To create the database you can either run the code from within the REPL or the command line. Here's how to run it in the REPL. First run the following command from the project’s home directory:

lein repl

The first time you run the app you'll just need to call the -main function like so:

(-main)

After you run it one time you'll need to run it with the --drop true to drop the nominees table before you start creating the database. Here's how to call the main with --drop true

(-main "--drop" "true")

If you choose to run it from the command line run it like this the first time:

lein run

After running the app once you'll need to run it like this:

lein run --drop true

Now that we have a database of the nominees its time to do some querying to ensure that we have loaded the data correctly. There are two ways for us to accomplish this. First I will use the sqlite3 command line tools to run SQL against the database. After that I’ll use Clojure in the REPL to show show a few select functions. Now it’s time to fire up sqlite.

The Sqlite Shell

Jump back to the command prompt and cd into the project’s home page. From there change into the db directory and get a listing of its contents. The directory should have a file named hugo.sqlite3.  That is the database file the app just created. To get to a querying interface run the command (assuming you have added sqlite3 to your PATH):

sqlite3 hugo.sqlite3

To view the tables in our database enter .tables from the sqlite prompt and you should see our single table nominees listed.& Lets make sure that the 2004 records were loaded correctly.

image

The columns we care about here are id, year, title, author and whether or not the book was the winner plus three other columns that are there for my next post. The first record has 1 in the winner column which indicates it was the winner. All of the nominees for the year were also properly saved.

The hugo.db.sqlite.clj file

In addition to the sqlite shell I wrote a few functions that will retrieve the nominees from the database. The first one I will call use is get-nominees. It does what you might expect returns all of the records in the nominees table.

The get-nominees function introduces function overloading in Clojure. If get-nominees is called without parameters it will call get-sql passing in the sql statement defined in the get-all-nominees var which returns all of the records. However, if a year is passed in the function will return all winners/nominees for the given year. Before calling the get-sql function I add a predicate to base sql statement with a place holder. The new string is the first item in the vector. The next item is the value that will replace the placeholder when get-sql calls the with-query-results function. Really not much to get-nominees, most of the work is done in the get-sql function.

The functions that all the ‘get’ type functions are based on the function sqlite.clj/get-sql . It wraps the call to clojure.contrib.sql/with-query-results function. The first parameter is a sequence that will contain the results of the query. The next parameter is the sql statement and parameters to be ran. The doall statement forces the lazy sequence that contains the results into a ‘real’ sequence that is returned to the caller.

Summary

Creating a sqlite database and running queries against it with Clojure is straight forward. Adding support for command line options is trivial. In my next 'Hugo' post I am going create a UI that will allow me to view the data. Stay tuned!

As part of my Clojure learning process I appreciate any and all comments on my code. Following my last post I had great comments that helped improve my code and expand my Clojure knowledge. Please keep the comments coming.

Resources

clojure, clojure-contrib, enlive, leiningen, sqlite, my previous post.

Code

The code for this project can be found on the hugoDB branch of the Hugo project.  You can download the entire hugoDB branch of the project here.  The code and database files I discussed in the post can be viewed here:  cmdline.clj, createsqlite.clj, sqlite.clj and the database.

3 comments: