Tuesday, November 22, 2011

Hugo-clr: Parsing Web Pages with Clojure-clr and HtmlAgilityPack

When I first became interested in learning Clojure I was in the middle of a science fiction reading kick and was looking for new authors to read. So I decided to pick up Clojure by writing code to parse the winners and nominees for the Best Novel category on the Hugo Awards web site. While I was writing the code I decided to share my experience as a Clojure noob (I still am one) in a three-part blog series that covered what I did with Clojure on the JVM (Parsing Web Pages with Clojure and Enlive, Creating a Hugo Award DB with Clojure and Sqlite, and Creating a Simple UI for the Hugo DB).

About a month ago I decided to really give Clojure-clr a try, so I thought I would go through the same process I did with the JVM version. Why? I thought it would give me a good way to compare and contrast the JVM and CLR versions of Clojure. I’m no Clojure guru, and I’m still new to the world of parentheses, but doing the same project will allow me to point out the differences I came across between the CLR and JVM versions. With that said, let’s make sure you have your Clojure-clr environment set up.

Setup

Since the CLR world doesn’t have lein or a lein equivalent, I have to do the configuration by hand. The first step is to install Clojure-clr if you haven’t already installed it. My post Getting Started with Clojure-clr will walk you through the steps. After setting up Clojure-clr, download the HtmlAgilityPack. It is the .NET library I am using to parse the Hugo web pages. If you want the HtmlAgilityPack lib and the source code together, you can grab them here: https://github.com/rippinrobr/hugo-clr/tree/hugoclr-parser and follow along that way. Just make sure you have the code from the hugoclr-parser branch. With the setup complete it is time to start looking at some code.

hugoclr.clj

The hugoclr.clj file is where the -main function lives. It calls hugoclr.parser/get-awards to retrieve the award pages and parse out the nominees and winners data, then passes the results to the hugoclr.data.csv/write-to-file function to write the data out to a comma-delimited file.

There are only a couple of items I’d like to point out in the hugoclr.clj file. First is the way that the HtmlAgilityPack library is loaded.

(assembly-load-from "..\\libs\\HtmlAgilityPack.dll")

The function assembly-load-from is new to Clojure-clr; it was added in the 1.3 release. It is a wrapper around the System.Reflection.Assembly/LoadFrom call. I find assembly-load-from more Clojure-esque and less typing, so I’ve started using it.

The next line of interest is the :gen-class line. Using :gen-class is what triggers the generation of the hugoclr.exe file. If I didn’t add that line to my source I would only generate DLLs when I compile hugoclr. That’s it for hugoclr.clj. Its main purpose in life is to kick off the parsing and pass the results to the hugoclr.data.csv/write-to-file function. Next, I’ll discuss the workhorse of the project, the hugoclr/parser.clj file.
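To make the relationship between :gen-class and -main concrete, here is a minimal sketch of what such a file could look like. The two stub functions stand in for hugoclr.parser/get-awards and hugoclr.data.csv/write-to-file; their bodies, the sample data, and the output file name are assumptions made for illustration and are not the code from the repository.

```clojure
(ns hugoclr
  ;; Per the post, this is the line that makes the compiler produce
  ;; hugoclr.exe instead of only DLLs.
  (:gen-class))

;; Hypothetical stand-in for hugoclr.parser/get-awards.
(defn get-awards []
  [{:year "2011" :heading "Best Novel"}])

;; Hypothetical stand-in for hugoclr.data.csv/write-to-file.
;; It only reports what it would do and returns the category count.
(defn write-to-file [awards file-name]
  (println "would write" (count awards) "categories to" file-name)
  (count awards))

(defn -main [& args]
  ;; Kick off the parsing and hand the results to the writer.
  (write-to-file (get-awards) "hugo.txt"))
```

Calling (-main) from the REPL runs the same code path the compiled executable would use as its entry point.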

hugoclr/parser.clj

The hugoclr/parser.clj file is where most of the work is done. It fetches the web pages, parses the award page links, grabs the data from the awards pages, and converts the data into records that will be used later. The entry point into the file is the get-awards function.

get-awards / get-html-elements / fetch-url

The get-awards function is the ‘main’ function of the hugoclr/parser.clj file. It is what drives the parsing process. The function starts by calling the get-html-elements function, passing it the URL of the history page and an XPATH that, when applied, returns a sequence of anchor tags starting with the 2011 awards page link.

Next get-html-elements passes the URL to fetch-url.  fetch-url makes a request to the URL by creating a HtmlAgilityPack.HtmlWeb object and making a call to the HtmlWeb.Load method.  The HtmlWeb.Load  method ‘converts’ the retrieved web page into a HtmlDocument object.

The XPATH that was passed to get-html-elements is then applied to the returned HtmlDocument by calling SelectNodes. SelectNodes applies the XPATH and returns a sequence of HtmlNode objects that represent the anchor tags on the Hugo History page. Since I only want the anchor tags that will lead me to the awards pages, I use the map function to pass the HtmlNode objects through the validate-award-link function. The result of the map call is a sequence of links to award pages or nulls. The nulls take the place of links that were not award page links. I remove them by calling filter with a function that only keeps non-null entries. At the end of this process I have a sequence of valid award page links.
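The fetch and select steps might look roughly like the sketch below. It is CLR-only and assumes HtmlAgilityPack.dll has already been loaded; the function bodies are my reconstruction from the description above, not the code in the repository. One detail worth knowing: in HtmlAgilityPack, SelectNodes is defined on HtmlNode, so the sketch goes through the document's DocumentNode property.

```clojure
;; Assumes the assembly was loaded first, for example with
;; (assembly-load-from "..\\libs\\HtmlAgilityPack.dll").

(defn fetch-url
  "Requests url and returns the page as an HtmlAgilityPack.HtmlDocument."
  [url]
  (.Load (HtmlAgilityPack.HtmlWeb.) url))

(defn get-html-elements
  "Returns the HtmlNode objects on the page at url that match xpath.
   SelectNodes returns nil when nothing matches."
  [url xpath]
  (let [doc (fetch-url url)]
    (.SelectNodes (.DocumentNode doc) xpath)))
```

Because SelectNodes can return nil, callers that map over the result are safe (mapping over nil gives an empty sequence), but callers that use interop on the result directly should check for nil first.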

The last step of collecting the nominees and winners data is to parse each individual award page. I start by taking the first 12 links from the awards-link sequence and passing each one to the parse-awards-page function using the map function. Each link is then processed in the parse-awards-page function, which returns a sequence of Category records that represent each awards category for the given year. Now I have a sequence of Category sequences ready to be written out to a file. Before I go over that part of the code I would like to walk you through the parse-awards-page function.
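The validate, filter, take and map steps described so far can be sketched in plain Clojure. To keep the sketch runnable without HtmlAgilityPack it works on href strings rather than HtmlNode objects. The link pattern, the stubbed parse-awards-page, and the get-awards signature are all assumptions for illustration.

```clojure
(defn validate-award-link
  "Returns href when it looks like an awards page link, otherwise nil.
   The pattern is a made-up example, not the real site's URL scheme."
  [href]
  (when (re-find #"hugo-history/\d{4}-hugo-awards" href)
    href))

(defn parse-awards-page
  "Stub: the real function fetches the page and builds Category records."
  [link]
  {:link link})

(defn get-awards
  "Takes the hrefs found on the history page and returns parsed pages."
  [hrefs]
  (->> hrefs
       (map validate-award-link) ; award links or nil
       (filter identity)         ; drop the nils
       (take 12)                 ; 2011 back through 2000
       (map parse-awards-page)))
```

For example, (get-awards ["/hugo-history/2011-hugo-awards/" "/about/"]) keeps only the first href. The map and filter pair could also be collapsed into a single (keep validate-award-link hrefs), since keep discards nil results.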

You may be asking yourself why I’m only taking the 2000s. The answer is simple: I’m lazy. While writing the JVM and CLR versions I found that if I didn’t load the pages in a browser first I was unable to retrieve them programmatically. So if I wanted to process all of the pages I would have had to load them all. I’d be bored before I got out of the 90s, so I cut it off at 2000.

parse-awards-page

parse-awards-page uses the get-html-elements function to get a HtmlDocument object that represents the awards page to parse. The function then passes the object to the create-category-record function, which, as you might expect, creates a Category record for each award category on the awards page. Since each page has more than one award category, parse-awards-page returns a sequence of Category records.

create-category-record

As I said earlier, the Category record is the data structure that represents the nominees and winners of a particular Hugo Award category.  The first step in creating a Category record is to find the paragraph tag that appears just before the category’s UL tag.  The paragraph tag contains the year the award was given and the name of the award.     

Once I have the paragraph node, the next step is to find all of the list item tags in the award category’s unordered list. All but the first of the li tags contain the text that describes the nominees and winners for the award category currently being parsed. The nominee/winner li nodes are passed through a filter to make sure that only the li tags are kept.

Now that I have the paragraph and li tags I’m ready to create the Category record. The get-category-heading and get-year functions simply parse the text from the paragraph tag and return the award name and year. The li tags are passed to the create-works-seq function, which creates a sequence of Work records that represent the nominees and winners for the category. Once each category on the page has been parsed, control returns to the parse-awards-page function so it can continue parsing the award pages until they have all been processed.
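Here is a rough, string-based sketch of how the pieces named above could fit together. The record fields, the assumed paragraph text ("2011 Best Novel") and the assumed li text ("Title by Author (Publisher)") are all guesses made for illustration; the real code works on HtmlNode objects and may split the text differently.

```clojure
(require '[clojure.string :as str])

;; Field names are assumptions for this sketch.
(defrecord Work [title publisher])
(defrecord Category [year heading works])

(defn get-year
  "Pulls the first four-digit number out of the paragraph text."
  [p-text]
  (re-find #"\d{4}" p-text))

(defn get-category-heading
  "Returns the paragraph text with the year removed."
  [p-text]
  (str/trim (str/replace-first p-text #"\d{4}" "")))

(defn create-work
  "Splits 'Some Title (Some Publisher)' into a Work record.
   When there is no parenthesized part, publisher is nil."
  [li-text]
  (let [[_ title publisher] (re-find #"^(.*?)\s*\((.*)\)\s*$" li-text)]
    (Work. (or title (str/trim li-text)) publisher)))

(defn create-works-seq [li-texts]
  (map create-work li-texts))

(defn create-category-record [p-text li-texts]
  (Category. (get-year p-text)
             (get-category-heading p-text)
             (create-works-seq li-texts)))
```

A call such as (create-category-record "2011 Best Novel" ["Some Title by Some Author (Some Publisher)"]) yields one Category holding one Work.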

A Quick Side Note: Records vs. Structs

When I wrote the JVM version of this ‘application’ I used structs to model the categories and works. Using structs worked fine for what I was doing. However, when I started writing the CLR project I was in the middle of reading the book The Joy of Clojure: Thinking the Clojure Way by Michael Fogus and Chris Houser. The authors mention that records have some advantages over structs and that, for that reason, structs are falling out of favor. Some of the advantages of records are that they are created quicker than structs and take up less memory. They also look up keys quicker than array or hash maps. After reading that I went with records instead of structs in the CLR version. By the way, I have really enjoyed reading The Joy of Clojure and I would highly recommend it.
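For readers who have not used either, a small side-by-side sketch shows how similar the two look from the calling side. The names and fields here are hypothetical and are not taken from either version of the project.

```clojure
;; Struct version: a map with a fixed set of base keys.
(defstruct work-struct :title :publisher)
(def struct-work (struct work-struct "Some Title" "Some Publisher"))

;; Record version: a generated type with named fields.
(defrecord WorkRecord [title publisher])
(def record-work (WorkRecord. "Some Title" "Some Publisher"))

;; Both answer keyword lookups, so the calling code reads the same.
(println (:title struct-work))
(println (:title record-work))
```

Because both support keyword access, switching from structs to records mostly means changing how the values are defined and constructed, not how they are read.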

And Now Back to the Code…

Now that we have parsed all of the 2000s award pages, the only step left is to write the results out to a comma-delimited text file. In the -main function the results of get-awards are passed to hugoclr.data.csv/write-to-file as its first parameter, with the name of the output file as its second parameter. Let’s walk through the last bit of code, the hugoclr/data/csv.clj file.

hugoclr/data/csv.clj

The write-to-file function does exactly what its name implies: it writes something to a file. In our case it takes the awards, converts each record into a comma-delimited line, and then writes the lines to the output file.

First I create a writable stream using .NET’s System.IO.StreamWriter class. I’ve told the stream to write the results to c:\temp\hugo.txt. I could have used the spit function but I decided to use a .NET library here. Once I have the stream I pass each category to the delimit function, which simply cleans the title and publisher strings and places a comma between all of the Work record’s fields. After each category has been converted, the lines are reduced into a single string, and that string is written to the output file. Running the code produces the comma-delimited output file.
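As a sketch of the conversion step only, the functions below turn categories into one comma-delimited string. The field names, the definition of "clean" (strip commas and surrounding whitespace), the column order and the helper name awards->csv are assumptions; the sketch also uses clojure.string/join where the post describes a reduce. The StreamWriter call is shown inside a comment form because it only runs on the CLR.

```clojure
(require '[clojure.string :as str])

(defn clean
  "Removes commas and surrounding whitespace so a field cannot break the CSV.
   What the real code cleans is not shown in the post; this is a guess."
  [s]
  (-> (or s "") (str/replace "," "") str/trim))

(defn delimit
  "Returns one comma-delimited line per work in the category."
  [{:keys [year heading works]}]
  (for [{:keys [title publisher]} works]
    (str/join "," [year heading (clean title) (clean publisher)])))

(defn awards->csv
  "Joins the lines for every category into a single string."
  [categories]
  (str/join "\n" (mapcat delimit categories)))

(comment
  ;; CLR-only: write the string with System.IO.StreamWriter.
  ;; `categories` is whatever get-awards returned.
  (with-open [w (System.IO.StreamWriter. "c:\\temp\\hugo.txt")]
    (.Write w (awards->csv categories))))
```

Stripping commas is the simplest way to keep the columns aligned; quoting the fields would preserve the original text at the cost of a slightly more complex reader later on.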

Running hugoclr

Now that I’ve walked you through the guts of the code it is time to show you what it looks like when it runs.  First, I will show you how to run it in the REPL.

[Image: running hugoclr from the REPL]

It is pretty straightforward. Fire up the REPL, load the hugoclr.clj file and then call the -main function. From there the code grabs the link page, parses it, and lets you know where it is in the process by telling you which page it is retrieving.

Remember, you must ‘prime’ the app before you run the code by loading each page in your favorite browser. I’m not sure why this is required. If anyone knows why this is happening and knows a way around please let me know.

Next, I will compile and run the code from the command line.

[Image: compiling and running hugoclr from the command line]

One thing to keep in mind: when you compile your CLR code with the Clojure 1.3.0 Debug build on the .NET 4.0 CLR, the executable and DLLs generated are placed in the compiler’s directory. Obviously the results are the same whichever way I run it.

Summary

Parsing the Hugo Awards list for the winners in the 2000s wasn’t all that different from the JVM version. I did find the HtmlAgilityPack library a little easier to work with when parsing the web pages. This is probably due to my familiarity with HtmlAgilityPack, since I’ve used it in a few C# projects. Another reason I found it easier this time around is probably that I’m a ‘little’ more comfortable writing Clojure code. I still have a long way to go, though, before I’m fluent in it.

Writing Clojure in the CLR environment wasn’t much different in this part of the project than in the JVM version. In the CLR world we don’t have things like lein, but so far I haven’t come across any issues that would prevent me from continuing to become familiar with Clojure-clr in hopes of using it at my day job, which may happen as soon as the next few days.

My next post in this series will be on taking the data from the csv file, creating a SQL Server table, and loading the new table with the data from the file.

Since I am still pretty green in the Clojure world please feel free to leave a comment if you see something that is not idiomatic Clojure or if there is a better way to do something.  I’m eager for any and all feedback.

Resources

Clojure-clr: I’m using the 1.3 version with .NET 4.0 and HtmlAgilityPack
The Joy of Clojure: Thinking the Clojure Way

The Code

You can download the code for this post from https://github.com/rippinrobr/hugo-clr/tree/hugoclr-parser. Just make sure you are on the hugoclr-parser branch.
