Tuesday, November 30, 2010

Project Chadwick #1 - Willie McCovey’s Career Batting Average

Before I get started on this post, if you aren't familiar with Project Chadwick, here's a quick overview.

I have started this off with an easy problem: calculating Willie McCovey’s career batting average. Starting this way made it easy for me to concentrate on learning enough to get something running without taking up too much time. In this post I will tackle a solution in all four of the languages I’m interested in: Erlang, F#, Ruby and Objective-C. As time goes on and the code for each solution grows, I will probably move to a single post per language for each problem.

A little background on my experience with the languages I have chosen. I have been using Ruby for a few months now. We have converted our build process from MSBuild over to rake and albacore, used Rails to do a proof of concept, and written other Ruby scripts to do administrative things. With Erlang, F# and Objective-C, however, I am truly learning them as I go through this process. If you see room for improvement in any of my solutions, please feel free to share it.

As for my setup for creating these solutions, I am running the F# on Windows 7 and all other languages on Linux. The F# may also move to Linux once I get my F# environment set up there. With that said, let's get started!

Batting Average Formula: Hits/At Bats

Languages: F#, Erlang, Ruby and Objective-C

Data: download from here.

Without further ado, here are the solutions in the functional languages (Erlang and F#):

Erlang Solution

A quick overview of the syntax you see in the script. First, a % starts a comment; the script uses the %% convention. Erlang requires that all variable names start with a capital letter (or an underscore). Also, once a variable has been bound to a value it cannot be changed. A function's body follows the -> sign. In functions with multi-line bodies, commas separate the expressions. The main function is an example of this. The last line of the function’s body ends with a period.

The first line of the file sets things up so I can run this like any other script in Linux. After the comment lines the sum functions are defined. Each function is uniquely identified by the module it appears in, its name and its ‘arity’. What is arity? Arity is the number of arguments the function takes. In our case the first sum function has an arity of 2 and the second has an arity of 1. As far as Erlang is concerned these are two distinct functions that simply share a name. The sum functions are used to total up the career hits and at bats.

The main function does what you’d expect it to do: it is where the batting average is calculated. The first two code lines set up lists of the career hits and career at bats for McCovey. I had wanted to use tuples here but I was not getting the correct average with them. So I junked the tuples and went with something I knew would work. The last line of main simply prints out “Willie McCovey’s career batting average is 0.270”. The placeholder is replaced with the result of sum(Hits)/sum(Abs). The signature of the io:format function looks like this:

io:format([IoDevice,] Format, Data) 

io is the module that houses the format function. format’s parameters are as follows. IoDevice is optional; if it is left out, stdout is used. If we were writing to a file we would pass the file handle as the IoDevice. Format is the string with as many placeholders as you need. The Data parameter is the list of values used to replace the placeholders. The number of items in the Data list must equal the number of placeholders in the string.

The Erlang solution was quick and to the point.

The F# Solution

F# is a .NET functional language. I am learning this language in hopes that I can use it to do some more complicated comparisons quicker than I could in C#. With that said, let's jump into the F# solution.

As you can see, it is very similar to the Erlang code. With F# we define the sum function in one line but it handles the same two situations as the Erlang sum functions do. Head and tail have the same meaning as H and T do in Erlang. It adds the head to the result of the recursive call to sum on the tail. When sum reaches the end of the list or receives an empty list it returns zero. A real difference between the Erlang and F# code is that we have to add the rec keyword to the function’s definition to indicate that this is a recursive function. The printfn function is called to write the results out to stdout. Notice that in the calls to sum we do not use parentheses. That is because in F# parentheses are used for precedence, to create pairs and tuples, and to denote the unit value (F#'s counterpart to void). It takes a little getting used to not using parentheses in function calls, but after you do, the code seems a little cleaner.
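The head/tail recursion that both functional solutions rely on can be sketched in Ruby, the one language here I already know. This is only an illustration of the idea, not a translation of the Erlang or F# code: the name sum mirrors the text, while the sample numbers are made-up placeholders rather than McCovey's real season lines.

```ruby
# Recursive sum: an empty list totals 0; otherwise add the head
# to the sum of the tail (the same two cases described above).
def sum(list)
  return 0 if list.empty?

  head, *tail = list
  head + sum(tail)
end

# Placeholder data for illustration only.
hits = [68, 157, 126]
at_bats = [192, 491, 441]

# to_f forces floating-point division instead of integer division.
average = sum(hits).to_f / sum(at_bats)
puts format('%.3f', average)
```

Ruby does not pattern-match on function heads the way Erlang does, so the empty-list case becomes an explicit guard clause.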

The Ruby Solution

Now that we have the functional languages done, let's move to something a little more familiar to most of us, object oriented programming, although this Ruby solution really doesn’t do much with objects.

For this script I’ve combined the hits and at bats for each season as an array.  So the hits_ab array is an array of arrays.  Next I initialize a variable to store the total hits and one for the total at bats.  These are set to 0.0 so that when I do the division I do not have to convert the sums to floats using the to_f method. 

The each loop shows off one of the cool features of Ruby, the use of blocks. The do ... end is a block of code that is passed to the each method as a parameter. In our case the block takes each of the arrays that are ‘in’ the hits_ab array and stores the first entry in the h variable and the second entry in the ab variable.

The last line prints the results to stdout after formatting the average. The #{} in a Ruby string is how you put the value of a variable, a method call, or in our case the formatted result of the division into a string.
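Here is a minimal sketch of the approach described above. The hits_ab name and the h and ab block variables come from the description; total_hits, total_abs and average are assumed names, and the pairs are made-up placeholders rather than McCovey's real season lines.

```ruby
# Each entry is a [hits, at_bats] pair for one season.
# Placeholder numbers for illustration only.
hits_ab = [[68, 192], [157, 491], [126, 441]]

# Start the totals as floats so the division below is not integer division.
total_hits = 0.0
total_abs = 0.0

# The block receives each inner array, destructured into h and ab.
hits_ab.each do |h, ab|
  total_hits += h
  total_abs += ab
end

# Format to three decimal places, then interpolate with #{}.
average = format('%.3f', total_hits / total_abs)
puts "Sample batting average is #{average}"
```

Swapping in the real season-by-season pairs from the data download should produce the career figure.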

The Objective-C Solution

Ok, I have to admit this up front: I’m really learning this language as I go through this process. This is my weakest of the 4 languages here, so please feel free to guide me toward the proper way of doing things if you see something wrong.

Most of this solution is really straight C code but it gets the job done. The first Objective-C or obj-c related line is the #include<Foundation/Foundation.h>. This header file is the base header file for obj-c. The second obj-c line is the NSAutoreleasePool line. The NSAutoreleasePool object supports the reference-counted memory management system of Cocoa, an Apple development framework. All of the objects in the pool are sent a release message when the NSAutoreleasePool object is sent the drain message. In the above code the drain message is sent in the [pool drain]; line.

More on the NSAutoreleasePool line

In this line of code, NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];, there are some interesting things going on. Let's look at the inner bit of code, [NSAutoreleasePool alloc]. In obj-c, method calls are written like so: [object methodname]. What the inner call is doing is allocating the memory that the *pool pointer will point to. When the new object is returned we call its init method, which acts as its constructor. Now we have a fully initialized NSAutoreleasePool object.

 

Summary

Admittedly this was an easy problem to solve so the code for each language was very similar. It also hints that the functional languages tend to require less code to do the same thing.  Ruby is in between the functional and Objective-C code.  I’m sure I could make the Ruby solution look even more like the functional languages if I put more time into it.

Remember, I am learning these languages as I go so if you see something I’m doing completely wrong or not as gracefully as I could don’t hesitate to pass it along.  I would greatly appreciate it.  As these problems get more complicated and the code required grows, I will do each language one at a time.

Up next: I will find the top 5 on-base percentages since the Giants moved to San Francisco.

Language Resources

For additional information on these languages you can visit these sites:

Erlang:        http://www.erlang.org/
F#:              The F# Survival Guide
Objective-C: Learning Objective-C: A Primer
Ruby:          http://www.ruby-lang.org/

Saturday, November 27, 2010

Introducing Project Chadwick

Project Chadwick is a set of sabermetric/baseball statistics formulas I am using to learn a few new programming languages: Erlang, F#, Objective-C and Ruby.  This project was inspired by Project Euler which is a series of math problems designed to keep your math and computer skills sharp.

Why Baseball Statistics?

Because I’m a baseball fan. As a kid I spent many hours reading box scores and whatever other stats the papers would publish. I have been curious about these languages so I thought I would use something I enjoy to learn these programming languages.

How it will work

I will present the problem and formula I am solving/using and then give a walk through of the code.  For the smaller/easier problems I will show multiple languages at once.  When the code gets long or the formulas are more involved I’ll have a post per language. 

Remember, I’m doing this to learn these languages so if you have any tips and/or hints please feel free to pass them along.

Why is it called Project Chadwick?

It is named after the person who created the baseball box score. Henry Chadwick was born in England and was a cricket reporter who started reporting on baseball in 1857. For more information on Henry Chadwick check out his wiki page, or for more history on baseball statistics check out The Numbers Game by Alan Schwarz.

My Sources

One of the main sources I have for my formulas is the book Baseball Hacks by Joseph Adler. The statistical data I will use comes from http://www.baseball-databank.org/. It contains statistics from 1871 to 2009. You can download a MySQL database or files to load into Excel from the databank site. If you do not wish to store the data yourself you can view the data at http://www.baseball-reference.com/.

The Languages

If you want to use the same languages I am you can download and install them from the following places:

I have the first problem posted here: http://rob-rowe.blogspot.com/p/project-chadwick-problems.html. Over the next few days I will be adding the problems and my solutions. This should be a fun process, and I hope you join me in this project.

Sunday, November 21, 2010

Part 3 – Creating the Domain class with CodeDom and IronRuby

This is the final post in a three part series in which I discuss how I used IronRuby to generate data access code.  In this post I’m going to discuss how IronRuby was used to generate a C# domain class for each model.

All source files can be downloaded from here.

Generating the Domain/Service Layer Code

Now that we have models to truck the data from the DAL to the presentation layer, we need the code to move the data between the two layers. In this version of the project I used the CodeDom to generate the C# code. This will change as I move this project more into our production environment. I plan on moving towards a templating engine such as Ruby’s ERB or the ASP.NET MVC 3 Razor view engine. In the beginning of the project I had intended to write this layer using the Emit approach, but I found it to be overkill. At this layer the chances are also better that we will need to modify the generated code, which is not possible if we go down the Emit path. With that said, let's dive into the CodeDom approach.

In this post I will walk you through how I created the class and added a private field, the constructor and the Get methods.

System.CodeDom – The Setup

In order to generate the code we must first create a CodeCompileUnit object. Think of it as a container for the code tree we are about to create. Next we create a CodeNamespace object, passing in the namespace that will contain the class we are creating. The last bit of setup we’ll do is to import any namespaces our class will need using the CodeNamespaceImport object. Here’s what the setup code looks like:

Creating the Domain Class

Now we can actually create our class type using the CodeTypeDeclaration class, passing in the name of the type we are creating. This call creates a type but we still need to indicate we are creating a class by setting the CodeTypeDeclaration.IsClass property to true. We’ll also make the class public by setting the TypeAttributes property to TypeAttributes.Public. After creating the class type we’ll add it to the namespace by passing the type to the @nspace_holder.Types.Add method. The code for creating the class type is below.

Adding a Field

Our class will need a field to store the reference to the repository it will use. The first step is to create a CodeMemberField object and set the attributes to private, give it the name _repo and set the type to be IRepository.  Next we’ll add it to our class by adding it to the Members list.

Adding the Constructor

Once we have the field to hold the repository we need to set it to something.  The constructor will have one parameter IRepository repo.  The body of the constructor will have a check to ensure that the repo parameter is not null.  If it is null it will throw an ArgumentNullException.

Creating the constructor object is as simple as creating a new CodeConstructor object. We make the constructor public by setting its attributes to MemberAttributes.Public. Next, the IRepository parameter is added to the Parameters list by creating a CodeParameterDeclarationExpression object, passing in the type and name of the parameter.

Once we have created the parameter we need to grab a reference to it so we can use it to set the class’s _repo field to the value of the parameter. We pass the parameter and field references to the CodeAssignmentStatement constructor.  This object will be used as the true statement in the if/else code which is created when we call the create_if_else_statement method.

As you can see, this method is pretty straightforward. You pass in a condition to test and the statements to execute when the condition is true or false. It returns the statement that we will add to the constructor object’s Statements list. The last step is to add the constructor to the class's Members list.

Creating the Get Methods

The domain class has two Get methods. The first one returns an IQueryable that when executed would return all records. The second will return the record that matches the Id parameter that is passed in.

All methods we create are started by calling basic_method. This method instantiates a CodeMemberMethod object and sets its Attributes to public, along with the name of the method and the return type.

The first get method

This method will return an IQueryable. In order to set that up we create a CodeTypeReference object, passing IQueryable<T>, where T is the EF model type, to the constructor. Our next step is to add the return statement to the method’s body. This is done by creating a CodeMethodReturnStatement, passing the results of the create_repo_method_call method to its constructor.

The create_repo_method_call is a way to create calls to the repository class. It creates a CodeMethodInvokeExpression object that represents the method we will call in our method. It takes 3 parameters. The first is a CodeTypeReferenceExpression which in our case is the _repo field. The second parameter is the name of the method we are going to call; in this case it's Get. The final parameter takes an array of parameters that the called method receives. If there are no parameters it is an empty array.

After the create_repo_method_call returns and the results of the call are stored in the method’s Statements property the method is added to the class’s Members list.

The Second get method

The second Get method takes an Id parameter which is used to find the single record that matches it. After the method’s CodeMemberMethod object is created, the first thing we do is add the Id parameter, following the same process we used with the constructor, and create a reference to it. Next, a CodeMethodInvokeExpression object is created to make the call to the repository’s Get method. In addition to this call there will be two other method calls chained to it. The first is a LINQ Where method that takes a CodeSnippetExpression object as its parameter. Finally a FirstOrDefault method is added to the chain. Here is the code for the add_get_methods.

The Complete Domain/Service Generated Through this Process

Summary

Now we have the domain/service layer classes to go with the models we created in Part 2. The C# classes have methods to Get, Add, Update and Delete records that map to the models we’ve created. These classes also have methods to map between the EF models and our models. Generating these two layers of code and adding them to our MVC projects gives us the potential to have a basic application up and running quickly.

What’s next with this project? On the System.Reflection.Emit portion of the project I will be adding the ability to store data from one model class into multiple tables, add attributes to the properties, and generate models from non-database data sources such as flat files, URLs and HL7 messages. I will continue to generate the models using the System.Reflection.Emit process. I will be changing my approach to how the source code is generated to use either Ruby’s ERB templating engine or the new Razor view engine in ASP.NET MVC. I believe this approach will make it easier for us to make changes to the process and easier to maintain in the long run. I will be adding the ability to generate unit tests, controller class boilerplate and perhaps HTML views as well.

I enjoyed working on this series and I hope you enjoyed it! 

Monday, November 8, 2010

My CodeCamp RDU 2010 Experience

This past Saturday was my first CodeCamp and my first time speaking at an event. My talk covered a project I’ve been working on that uses IronRuby to generate data access layer and domain layer classes. The slides can be found here. Giving a talk is something everyone should do at least once. It helped me get a better understanding of the subject I was speaking about and a new perspective on what people go through to put a talk together.

Since my talk was in the afternoon I spent the morning going to talks.  The first talk I attended was titled ‘Unit Testing for the rest of us’.  It was a nice introduction to unit testing and why you should incorporate it into your projects.  You can check out the slides here.  Eli did a great job with the talk.

The next talk was ‘Dynamic Programming in a Statically Typed World’ by Jim Wooley.  The talk wasn’t quite what I thought it was going to be but I still learned a few things. The thing about attending any talk is you will always learn at least one thing that you can apply.  The next day I was working on a personal project and I remembered the dynamic keyword demo that Jim gave and used it to resolve a sticky problem. 

Since lunch was right after Jim’s talk, a co-worker and I shared a table with Jim. We had a great conversation around writing your own IQueryable implementation and a little bit about natural language processing. A definite positive to attending local talks is the availability of the speakers. After most talks they are open to continuing any discussions that may have occurred during their talk.

After lunch I went to Mike O’Brien’s talk ‘What is F# and Why Should I Care?’. It was a nice introduction to functional programming using F#. The last talk I attended was also given by Mike. It was titled ‘Release Management with Go/Ruby/Rake/Albacore’. It was a little short and did not spend as much time on the Ruby portion, which is what I was hoping for. However, it was a nice introduction to Go. The talk also spawned a conversation about other CI servers out there, such as TeamCity.

Overall it was a good experience.  I met a few people and had nice discussions.  Like I said, I have already used information I gained on Saturday.  If there is a CodeCamp in your area I would recommend attending and/or speaking at it.

The slides from my talk are here.  The source is here.