Tuesday, May 10, 2011

Parsing Web Pages with Clojure and enlive

As my infatuation with Clojure grows I thought I would write some code to retrieve all of the works that have either won or been nominated for the Hugo award's Best Novel category. I know it’s geeky but it is information that I can use so why not use it as a source to learn more Clojure?

Please keep in mind that I am writing this blog from my perspective as a Clojure noob. Any and all feedback on the post or the code is welcome.  Even if you think it is minor, please pass it along. With that said, lets get on with post!

Goal

By the end of the post is to have code that will retrieve the winners and nominees for the Best Novel category since 2000 and write them to a text file with the following layout:

Year Hugo Awards - Best Novel
         Title – Author (Winner) 
         Title – Author 
         Title – Author ...

The Setup

UPDATE As @Bendlas mentioned below leiningen will install the clojure jars. You can skip the first paragraph and go right to installing leiningen. I have tested it on a Ubuntu VM and when I ran lein repl leiningen downloaded the clojure jars. On Windows you will need to have curl.exe or wget.exe installed to get it to work. Thanks again @Bendlas.

If you don’t already have Clojure installed you can get everything you need from the download page. While you are there go ahead and grab the clojure-contrib.zip as well. Assuming you already have Java on your machine the next step is to add the Clojure and clojure-contrib directories to your CLASSPATH.

Once you have Clojure installed the next thing to install is lein. It is a Clojure build tool that helps you manage your projects. In my short period of time in the Clojure world lein has been a great tool and I have found it has many useful plugins. The install takes no time at all, just follow the instructions on the project’s page and you will be ready for business. Now that Clojure and lein are installed I'm ready to start the ‘Hugo’ project.

Creating the Hugo Project

To create the project using run lein with the parameters below:

lein new hugo

The command will create a directory structure for our project. For a little more information on the project directory structure lein creates take a look at my previous post Getting Started with Ring and Compojure - Clojure Web Programming.  I have a little more detail there.

I updated the project.clj file to include dependencies on enlive and clojure-contrib. A line was added to indicate which namespace my main function is located in. The new line allows me to run this 'app' from the command line.

The project setup is complete, now its time to start parsing!

The Code

The code for this project is in three source files under the project’s src directory.  The cmdline.clj file houses the app’s main function which allows the app to be started from the command line. The hugo/parser.clj file contains the code that retrieves and parses the web pages.  The last file is hugo/text-formatting.clj which contains the code to format the output.  In this post I will walk through the cmdline.clj and hugo/parser.clj files.

The cmdline.clj file

As you can see, there isn’t much to this file.  The file exists to create a class that allows me to run hugo from the command line and provides a concise way to run the code within the REPL. The first five lines set up the namespace, include my code’s namespaces and loads the clojure-contrib.duck-streams library which I will use to write the results out to the output file. 

Since I want to run this application from the command line I need to generate a java class file.  To do this I use the :gen-class macro.  The generated cmdline.class file will be placed in the project’s classes directory. Any methods in the java class will look in the source clj file for a method by the same name but preceded by a –.  That is why the –main function exists.  It is what is called when I execute the app outside of the REPL.

The –main method calls the hugo.parser/get-award-links function to retrieve the links to each year’s awards page from the Hugo awards history page.  The links are returned as a sequence, since I only want the entries for 2000 to 2010 the code grabs the first 12 links, which are passed to the prep-for-file function. 

The prep-for-file function is where the real parsing is kicked off, I will discuss the parsing in more detail later. For now just know that the data retrieved from the URL is formatted by the hugo.text-formatting/format-output function and the map function. The results are converted to a string by using the apply and str functions.

When the parsing and formatting is complete the results are passed to the clojure.contrib.duck-streams/spit function. The spit function, I really like that name, writes the results to a file named hugo_awards_best_novel.txt. That's it. I've given you the 5 second tour of the cmdline.clj file, now its time to take a look at the HTML parsing.

The hugo/parser.clj file

As you might expect this is where all the parsing code lives. The first function called is the get-award-links. The function is responsible for parsing out all of the links to the annual awards pages.

The get-award-links Function

The first task this function does is to retrieve the page’s HTML tags using the fetch-url function which wraps enlive’s html-resource function. The html-resource function retrieves a web page and returns its HTML tags in a sequence that is passed to other enlive functions as input.

Once the page has been parsed, I call enlive’s select function passing the tag sequence as the first parameter. Select's second parameter tells the function which tags I want out of the tags sequence using something similar to CSS selectors. In this case I’m telling select to grab all a tags inside of LI tags that are members of the page_item class and are within DIV tag that with the id of content. The second vector tells select that I want the text for each link so I can use it for the year value later. After the parsing of the tags is done the map function will grab the attrs for each tag that is returned which returns the following for each link:

image

When this function call is completed a map is returned with a title and href for each year that the Hugo awards were given.  The results of this call are passed to the prep-for-file function in the cmdline.clj file. 

The get-awards-per-year Function

Now that I have the links to all the award pages it is time to gather all the data on each category. The function creates a sequence of category structs that contain the award category, the nominees/winners in the category and the year the award was given.

The year’s page is retrieved using the fetch-url function and the results are stored in the page-content variable. Next, the parse-award-page function is called passing in the page-content as its only parameter. It returns a sequence that contains lists for each award given that year that will look like this: ((“Best Novel”) (array maps for each nominee/winner)). I will refer to this sequence as the category sequence from here out. Right now the parse-award-page function looks like a black box I promise to get into the details in a bit.

The results of the parse-award-page call are passed map to create a sequence that contains category structs.  The category struct is defined as:

Getting the award string

In the map function call I am using an anonymous function to create a category struct. When the map call completes it returns a sequence of category structs for the given year. Creating a new struct is easy, just pass in the name of the struct to create and a value for each of the keys in the struct. The category struct's first key is the :award key. The value is parsed with this code: (apply str (first %)). Since the first item in the category sequence is a string in a lazy sequence representing the award's title I need to use apply str instead of just str. If I called (str (first %)) what I would get back is something like this: clojure.lang.LazySeq@5784711f which is obviously not what we want. 

Getting the books sequence

Grabbing the books that represent the nominees/winners is almost as easy as the award. Since I know that the nominee/winners are stored in the second part of the category sequence lists I use the second function to retrieve them.

(get-book-info (rest (second %)))

I’m using rest here because for some reason the first entry in the sequence is “\n” I’m not sure why. In the future I will figure it out but for now I’m using the rest call to get to the ‘guts’ of the book sequence. The results of the rest call are passed to a helper function that returns a sequence of work structs that will be stored in the category struct’s books key.

Getting the year string

The last key in the category struct is the year key. It will store a string that begins with the year and ends with “Hugo Awards”.  The code to retrieve the ‘year’ makes use of the select statement, grabs the first element in the returned sequence and converts the value of :content to a string. Here's what the code looks like:

(apply str (:content (first (html/select page-content #{[:div#content :h2]}))))

Now I have a value for each of the category struct’s keys. The struct provides a much easier way to work with the data. At this point all of the parsing has been completed. All that is left is for the cmdline/prep-for-file function to format the data and write it out to the file. Since that is pretty straight forward I'm going to leave that code out of the post. Before I wrap up this post I’d like to dive into the hugo.parser/parse-award-page function, where the real parsing happens.

The parse-award-page Function

Once the year’s award page has been retrieved, its tag sequence function is passed to parse-award-page. The function grabs the category title and the nominees/winners and creates a sequence of lists. Here’s how it is done. All of the nominees/winners are found in the map function call. The sequence returned from the call to select returns all UL tags found in content DIV tag. Each tag is passed to the anonymous function which just pulls out the :content key from the tag’s array map creating a sequence of book titles.

The category titles are parsed on the line that has the split-at function call. Again, the select function is called to find all P tags that are within the content DIV. The text for the the first child of the P tag is returned creating a sequence of category titles. The split-at function is called to ‘remove’ the first four P tag results since the contain information on where the awards banquet was held.

After both the titles and then nominees/winners sequences are created the interleave function is called. Interleave creates a single sequence by combining the two sequences one item at a time. How the function works is the first item in the titles sequence is added to the new sequence followed by the first item in the nominees sequence, the second from titles is followed by the second nominees item, etc. When interleave returns I have one sequence that looks something like ( “award title” “nominees” “award title” “nominees”….).

Having the sequence provided by interleave is nice but it isn’t going to work for what I want. I need to pair the category title with the nominees/winners for the category. This is where the partition function comes in. According to the partition documentation the function will “create a lazy sequence of n items” which in our cause is 2. When the parse-award-page function completes it returns a sequence of lists that match the category up with it’s nominees/winner which is exactly what I need in the get-awards-per-year function.

How do I run it from the command line?

If you are like me most of the clojure you write is either run through the REPL or as a web app. I had no idea how to run this ‘app’ from the command line. After checking out the leiningen project again I noticed that there is a command called uberjar. What uberjar does is create a jar file that bundles everything up that is needed to run your app from the command line. The jar file uses the naming convention of:

<project name>-<version info>-standalone.jar

Remember I’m a .NET guy by day so I don’t have a real in-depth knowledge of jar files yet. I just know that they allow my to run the app from the command line. Once the jar file has been created I can run the app from the command line using this command:

java -jar hugo-0.0.3-SNAPSHOT-standalone.jar 

Summary

Parsing HTML using clojure is relatively easy using enlive. Using enlive I was able to parse the Hugo Awards information to create a text file with all of the Best Novel category nominees and winners ( hugo_awards_best_novels.txt ) since 2000.

One More Thing…

When you run the project you may encounter an IOException like this:

image

You can resolve the issue by visiting the URL through a web browser. I believe I can get around this issue by setting the user-agent for my enlive html-resource call but I couldn’t figure out how to do it. If anyone has a suggestion please leave me a comment.

Resources

clojure, clojure-contrib, enlive, lein

Code

Download entire project. Code files: cmdline.clj, parser.clj and text_formatting.clj

The output file: hugo_awards_best_novels.txt

12 comments:

  1. Hey,

    thanks for the writeup. It's a great read.

    I have some slight improvements on the technical side.

    - Leiningen is smart enough to setup its own environment. You can skip installing the clojure jars and download/run the lein script right away.

    - :keywords already are functions, so can write
    (map :attrs (html/select ...)) instead of
    (map #(:attrs %) (html/select ...))

    - let can make multiple bindings at once, so instead of
    (let [foo 1] (let [bar 2] ..)) you can write
    (let [foo 1, bar 2] ..), comma being optional.

    - (apply str ..) is idiomatic and fine, (clojure.string/join ..) might be a bit more straightforward though. This one is a matter of personal taste.

    Again, good work! Nice to see people leaving some breadcrumbs while they get up to speed.

    cheers

    ReplyDelete
  2. @Bendlas thank you for the comments. Part of the reason I'm writing these Clojure posts is for guidance like you provided. The :keywords tip will make the code cleaner and easier to read same for the let hint. On the Leiningen comment are you saying that on a machine without clojure installed I can just download the leiningen script and run it to install clojure and everything else? If so that is great.

    Thank you for taking the time to give me pointers on the code. I am looking for guidance. I am really enjoying Clojure. Thanks again.

    I'm going to update the code/post tomorrow.

    -Rob

    ReplyDelete
  3. Nice.
    Do you have more parsing that you do?

    - Jan

    ReplyDelete
  4. @Jan, I haven't done any other parsing using Clojure. I'm sure I will in the future.

    ReplyDelete
  5. I've changed the code over to the suggestions provided by @Bendlas.

    ReplyDelete
  6. @Rob, lein self-install works on windows too, it needs curl.exe or wget.exe on the path though.

    Anyways, the error messages are quite helpful if something's missing.

    ReplyDelete
  7. Thanks @Bendlas. Updated the update.

    ReplyDelete
  8. With recent versions of Leiningen you should be able to use "lein run" if you're running it from a checkout. You only need to create an uberjar if you want to distribute the project to folks who don't have Leiningen.

    Since you're using Clojure 1.2, you can just use clojure.core/spit; you don't need to use duck-streams any more.

    Looks good otherwise though; cheers.

    ReplyDelete
  9. Thank you for the tips Phil, lein run is much nicer than uberjar. I will update the code and blog to point out lein run and the spit function is now in clojure.core.

    ReplyDelete
  10. To set the user-agent - and use a very nice Clojure lib - see this http://pseudofish.com/blog/2011/04/27/set-the-http-user-agent-string-in-clojure/

    ReplyDelete
  11. @maacl, thanks for the tip. I'll try it out.

    ReplyDelete