My Clojure Adventure: Baseball


Tuesday, January 25, 2011

Using SpecFlow to test my F# Baseball Stats Library

As part of my Ruby indoctrination I picked up a copy of The RSpec Book: Behaviour Driven Development with RSpec, Cucumber, and Friends (The Facets of Ruby Series). I’m about halfway through the book and I find the Behavior Driven Development (BDD) process very comfortable. Writing tests this way just ‘seems right’ and the process has already improved my Ruby code. However, during my day job I write C# code, so I started looking around to see what BDD options are available for .NET. That’s when I found SpecFlow, which plays the same role in .NET that Cucumber plays in the Ruby world. To get up to speed with SpecFlow I’ve decided to use it to test the obp function I wrote in Project Chadwick #2–Top 5 SF Giants OBP (F# Version) as I build out my baseball stats library. Before I get started I will describe the setup.

The Setup

Step 0. Create F# Library Project

On the File menu in Visual Studio select New Project. In the New Project dialog, open the Other Languages option, click on Visual F#, then select F# Library.

I named this project BaseballStats. Remove the Module.fs and Script.fsx files; I’ll add my own *.fs file when the time comes and we won’t need the Script.fsx file. That’s it for the F# project until it is time to create the test project.

Step 1. Create the Test Project

Since I am using MSTest for the testing I am using a normal C# test project. I named mine BaseballStats.AcceptanceTests. Once the project is created I need to add the following references:

FSharp.Core assembly
F# library project you created in Step 0, in my case the BaseballStats project.

Delete the test class that was automatically added to the test project; I won’t be using it. In real life I’d create a unit test project as well, so let’s just pretend we did that here.

Now that we have the test project created, it’s time to install SpecFlow.

Step 3. Install SpecFlow

If you already have SpecFlow installed then you can jump down to the NuGet section, otherwise go ahead and grab the SpecFlow installer from here. I downloaded and ran the installer so that I had the file templates available in the ‘Add New Item’ dialog. After the install I ran the command below from the NuGet Console to add the necessary DLLs to the test project.

Install-Package -Id SpecFlow -Project BaseballStats.AcceptanceTests

SpecFlow uses NUnit as its test runner by default but it can be configured to use MSTest. Since I’m using MSTest I need to update the app.config file so it looks like this:

<specFlow>
  <!-- Possible values include NUnit (default), MsTest, xUnit -->
  <unitTestProvider name="MsTest" />
</specFlow>

The setup is complete. It’s time to get on with the testing!

The Testing

Step 4. Create a Feature

To create a SpecFlow feature file, right click on the BaseballStats.AcceptanceTests project, select ‘Add…’ > ‘New Item…’ and then select SpecFlow feature file. Name it CalculatingObp.feature.

Any file with the .feature suffix will also have a code-behind file that SpecFlow will use to call the step definition methods. The step definitions are what is run to perform the tests. I will define the steps after I have finished describing the feature. When a new feature file is created you will see the following:

Feature: Addition
In order to avoid silly mistakes
As a math idiot
I want to be told the sum of two numbers

@mytag
Scenario: Add two numbers
Given I have entered 50 into the calculator
And I have entered 70 into the calculator
When I press add
Then the result should be 120 on the screen

Obviously this isn’t the feature I want to describe, but let’s take a minute to discuss it. The file describes the feature we are working on in an almost plain English style. The text under the Feature line is a narrative to help remind me what I want the feature to do. It has no real bearing on the code that we will use to test the feature.

The Scenario section does have an impact on the test’s code. The Given, And, When and Then statements will be used in the step definition file and they will drive the test.

Here’s what our feature for calculating the OBP looks like:

Feature: Calculating OBP
In order to determine the effectiveness of a batter
As a baseball fan
I want to be able to calculate a player's On Base Percentage

Scenario: Calculate a Season's On Base Percentage (OBP)
Given A batter had "389" ABs, "104" Hits, "56" BBs, "0" HBPs, and "4" SFs
When I run the calculation 
Then I should see the result "0.356"

The feature is used to describe how I am going to calculate a batter’s seasonal OBP. When I save the feature file a code behind file is created that contains code that SpecFlow will use to find the step definition.

Step 5. Create the Steps

Now that I have the feature description in place I’m ready to test. Let’s run the SpecFlow test and see what happens. To run the test, make sure the feature file is the selected tab in VS and click the ‘Run Test in Current Context’ button. When you run the test the results will say ‘Inconclusive’. View the test run details and you will see a statement that says there were no matching steps found for the first line of the scenario. A little further down in the ‘Standard Console Output’ section you will see that SpecFlow has provided us with boilerplate code for the step definitions that looks like:

Given A batter had "389" ABs, "104" Hits, "56" BBs, "0" HBPs, and "4" SFs
-> No matching step definition found for the step. Use the following code to create one:

[Binding]
public class StepDefinitions
{
[Given(@"A batter had ""389"" ABs, ""104"" Hits, ""56"" BBs, ""0"" HBPs, and ""4"" SFs")]
public void GivenABatterHad389ABs104Hits56BBs0HBPsAnd4SFs()
{
ScenarioContext.Current.Pending();
}
}

When I run the calculation I should see
-> No matching step definition found for the step. Use the following code to create one:
[Binding]
public class StepDefinitions
{
[When(@"I run the calculation I should see")]
public void WhenIRunTheCalculationIShouldSee()
{
ScenarioContext.Current.Pending();
}
}

Then I should see the result "0.356"
-> No matching step definition found for the step. Use the following code to create one:
[Binding]
public class StepDefinitions
{
[Then(@"I should see the result ""0\.356""")]
public void ThenIShouldSeeTheResult0_356()
{
ScenarioContext.Current.Pending();
}
}

It is time to add a step definition file to our testing project. Right click on the AcceptanceTests project and select Add New Item. Select the SpecFlow Step Definition option and give it the name ObpStepDefinitions.

Remove the template code that is inserted into the ObpStepDefinitions file and replace it with the boilerplate methods, from the ‘Standard Console Output’ area of the test results, that were generated when I ran the SpecFlow test. My step definitions should now look like this:

Run the test again and this time the test should report:

Assert.Inconclusive failed. One or more step definitions are not implemented yet.
ObpSteps.GivenABatterHad389ABs104Hits56BBs0HBPsAnd4SFs()

This is a good sign! It means that SpecFlow now sees the steps and attempts to execute them, but because the only code in the first method is the ScenarioContext.Current.Pending() call, execution halts there, at the first step. Now it’s time to put some actual code into the method bodies. I am going to start with the Given step, using the numeric values in the Given statement as inputs for my test. How can I do that? With SpecFlow I can use regexes in the Given attribute to grab the numeric values so they can be used to run the OBP calculation. The values grabbed by the regexes are passed to the step method as parameters, and SpecFlow converts them to the data types specified in the method signature. My updated step looks like this:

This step is responsible for retrieving and storing the input values that will be used to calculate the OBP. Now when I run the test I see:

Assert.Inconclusive failed. One or more step definitions are not implemented yet.
ObpSteps.WhenIRunTheCalculationIShouldSee()

Again, it is a good sign. It actually executed the first step and is now trying to run the When step, but it encounters the Pending method again. This is the step where we will do the OBP calculation. Since we do not have the Baseball.obp function in the F# code yet, we are adding code to the step that 'we wish we had', a phrase the authors use repeatedly in the RSpec book. Here is the updated step definition.

Since the Baseball.obp function doesn't exist yet we will not be able to build the project. So I am going to switch to the F# BaseballStats project and write just enough code to allow us to build and run the test. First we create a BaseballStats.fsi file followed by a BaseballStats.fs file. In F# projects a file’s order of appearance matters in the build process. Make sure that the fsi file appears before the fs file in the project’s listing. To move a file up or down right click on the file you wish to move and choose the appropriate movement direction.

The BaseballStats.fsi file is a signature file; you can think of it like a C/C++ header file. It describes the functions that are available in the BaseballStats.fs file. The val obp line describes the function's signature: it takes 5 float parameters and returns a float value.

The BaseballStats.fs file is where the function is implemented. In the real world we’d write just enough code to allow us to run the test again, and when it failed we’d drop into unit testing or an RSpec .NET equivalent until the obp function was fully functional. Then we’d come back to SpecFlow, run the test and get green. In order to keep this post as brief as possible I’m not going to illustrate the process here. I have added the entire obp function, but pretend we went through the process I just discussed. Once I’ve added the obp function to the fs file, I build the solution and run the test. It still comes up as Inconclusive. This time it is due to the last step not being defined.
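To give a feel for the shape of the code, here is a minimal sketch of what an obp function with five float parameters could look like. This is not the code from the download: the parameter order (borrowed from the order of the values in the Given step), the zero-denominator guard, and the rounding at the end are all assumptions made for illustration.

```fsharp
// Sketch only. In a signature file the function would be described roughly as:
//   val obp : float -> float -> float -> float -> float -> float
// Assumed parameter order: at bats, hits, walks, hit by pitch, sacrifice flies.
// OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
let obp (ab: float) (hits: float) (bb: float) (hbp: float) (sf: float) : float =
    let denominator = ab + bb + hbp + sf
    // Returning 0.0 for an empty stat line is an illustrative choice, not a rule.
    if denominator = 0.0 then 0.0
    else (hits + bb + hbp) / denominator

// Round to three decimal places before comparing with the expected value.
let roundedObp = System.Math.Round(obp 389.0 104.0 56.0 0.0 4.0, 3)
printfn "%.3f" roundedObp
```

With the values from the scenario the numerator is 104 + 56 + 0 = 160 and the denominator is 389 + 56 + 0 + 4 = 449, which rounds to the 0.356 the feature expects.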

The final step in my test is to assert that the calculated OBP equals the expected value. Again I’m using a regex to grab the expected value, which will be compared against a rounded-off version of the result from the Batting.obp call. Once we have green we know that the feature is working for this set of test data.

Here is what the detailed view of the test run should look like:

That’s it, we have green! The OBP function performed as we had expected. Obviously we haven’t fully tested it but we now know that the function works with valid inputs. So what happens when we provide negative values or values that make the denominator zero? SpecFlow has a way that will allow me to use this single scenario to test all the possible permutations I can think of without writing additional scenarios. I’m going to save that topic for a later post.

My Thoughts on SpecFlow

SpecFlow gives .NET developers a way to get BDD into our projects. In the beginning using SpecFlow doesn’t seem to flow as smoothly as Cucumber and RSpec in the Ruby world. This may be because I haven’t used SpecFlow enough, or it could be due to the differences between C# and Ruby. Overall, I like the BDD style of development that SpecFlow brings to the .NET world. BDD seems to fit my way of thinking better. I am going to continue to use SpecFlow in my side projects and will work on incorporating it into my ‘day job’ environment.

Resources

You can download the source here

SpecFlow: project web site

TekPub’s free video on SpecFlow

F#: fsharp.net

Sunday, December 12, 2010

Project Chadwick #2–Top 5 SF Giants OBP (F# Version)

Before I get started on this post if you aren't familiar with Project Chadwick here's a quick overview.

The data for this problem can be downloaded from here

In this problem we are going to find the Top 5 On Base Percentage Seasons since the Giants moved to San Francisco. I’m not convinced that I’m fully thinking like a functional programmer, but I think I’m starting to ‘get it’. With that said, let’s get started.

New F# Concepts

In my second F# script, I’m using a few concepts that I didn’t use in the first solution, namely records and multi-line functions.

The Record Type

type Batter = { Last : string; First : string; Season : int; AB : float; OBP : float }

So what is a record? It looks like a different way to declare a class. Well, it may look that way but records are not the same as classes. Records allow you to group data into types and access the data using fields. Fields in records are immutable by default, which is a guarantee a class does not give you automatically. Also, records cannot be inherited. There are other differences that are beyond the scope of my current F# knowledge, but as my F# knowledge expands we may delve into the remaining differences.
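Here is a small sketch of working with the Batter record. The player and the numbers are made up purely for illustration; only the type definition comes from my script.

```fsharp
type Batter = { Last : string; First : string; Season : int; AB : float; OBP : float }

// Made-up values, not real statistics.
let batter = { Last = "Example"; First = "Player"; Season = 2010; AB = 500.0; OBP = 0.350 }

// Fields are read with dot notation.
printfn "%s, %s (%d): %.3f" batter.Last batter.First batter.Season batter.OBP

// Fields cannot be reassigned; "changing" one means building a copy
// with the copy-and-update syntax. The original value is untouched.
let adjusted = { batter with OBP = 0.400 }
printfn "%.3f -> %.3f" batter.OBP adjusted.OBP
```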

Functions

A function is declared using the let statement. The keyword let is followed by the name of the function, a list of space-delimited parameters, and optionally a return type. The body of the function is determined by white space. All lines that are indented after the declaration are considered part of the function’s body until a line is encountered at the same ‘level’ of indentation as the let statement. The return value of a function is the result of the last line executed.
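As a quick illustration of those rules, here is a sketch of a multi-line function. The name battingAverage and the type annotations are my own choices for the example and do not appear in the script.

```fsharp
// let, then the name, the space-delimited parameters, and an optional return type.
let battingAverage (hits: float) (atBats: float) : float =
    // Every indented line belongs to the body.
    let average = hits / atBats
    // The last expression evaluated is the return value; there is no return keyword.
    System.Math.Round(average, 3)

// Back at the indentation level of the let, so the function body has ended.
printfn "%.3f" (battingAverage 104.0 389.0)
```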

Seq.toList, Pipe Forward Operator and List.Map

When we read in the content of the data file using the ReadAllLines method, the file content is returned as a string array. In F# it is easier to work with lists than arrays, at least with what F# knowledge I have. So to convert the file content into a list I used the Seq.toList function. The pipe forward operator, |>, sends or ‘pipes’ the output of the ReadAllLines call to Seq.toList as its parameter. You can think of the pipe forward operator as similar to the pipe on the UNIX command line. The result is stored as a string list, one entry per line of the file, in the stats variable.

let stats = File.ReadAllLines(@".\data\sf_giants_batting.csv") |> Seq.toList
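To see that the pipe is just another way of writing a function call, here is a small sketch that uses an in-memory array as a stand-in for the file, so it runs without the data file. The two lines of content are made up.

```fsharp
// Stand-in for File.ReadAllLines: a string array with a header row.
let lines = [| "last,first,season"; "Example,Player,2010" |]

// These two definitions do the same thing.
let viaPipe = lines |> Seq.toList
let direct = Seq.toList lines

printfn "%b" (viaPipe = direct)
printfn "%d lines" viaPipe.Length
```

The piped form reads left to right in the order the work happens, which pays off once several steps are chained together.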

The List.map operation allows us to take a list and pass each item as a parameter to a function in one line. In addition to calling the function it creates a new list with the results of each call; in this case we aren’t using the returned list. In my solution I’m using it like a one-line foreach call. I’m calling the create_batters function, passing in the tail of the stats list that I read in. Why just the tail? Because the head is the line that contains the column headings.

List.map create_batters stats.Tail
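Here is a self-contained sketch of the same pattern. The rows and the lastName helper are hypothetical stand-ins for the file contents and for create_batters. It also shows List.iter, which I understand to be the better fit when the function is only called for its side effects and the returned list would be thrown away.

```fsharp
// Made-up rows standing in for the file contents; the first row is the header.
let rows = [ "last,first"; "Alpha,Al"; "Beta,Bo" ]

// Hypothetical helper: pull the first column out of a comma-separated line.
let lastName (line: string) = line.Split(',').[0]

// rows.Tail skips the header row; List.map builds a new list of results.
let lastNames = List.map lastName rows.Tail

// When only the side effect matters (printing here), List.iter avoids building a list.
List.iter (printfn "%s") lastNames
```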

An Overview of My Solution

Since I’m just getting started with F#, I find myself writing F# code that looks like C#. I think my functions reflect that. However, the last line makes me think that I’m starting to make the turn on understanding functional programming and how I can use it. Here is the last line:

List.map print_batter (Seq.take 5 all_obps |> Seq.toList)

In my first go round of this script I had a for loop that went from 0 to 4 to print the first five items in the list. As I was writing this post up I started looking at the loop thinking I could improve it. I remembered reading about the Seq.take method that takes the number of items specified from the given sequence. So I removed the for loop and plopped in the following code in its place:

List.map print_batter (Seq.take 5 all_obps)

When I ran the new and improved script I received the following error:

chadwick-2-top5-obp.fsx(64,24): error FS0001: This expression was expected to have type Batter list but here has type seq<'a>

I noticed that the error message specifically stated that it was given a sequence but it expected a list. So I tacked on the |> Seq.toList call and was able to get it to work. That type of code is what gets me excited about functional programming. I’m looking forward to getting to the point where I can truly use the functional programming aspects of F#.
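The fix can be sketched on its own with a plain list of floats. The values below are made up and simply stand in for the sorted all_obps list.

```fsharp
// Made-up OBP values, already sorted from highest to lowest.
let allObps = [ 0.450; 0.440; 0.430; 0.420; 0.410; 0.400 ]

// Seq.take returns a seq, so it is converted back to a list for the List functions.
let top5 = allObps |> Seq.take 5 |> Seq.toList

List.iter (printfn "%.3f") top5

// Seq.take raises an exception when the source has fewer items than requested;
// Seq.truncate is the forgiving variant that returns at most that many.
let atMost5 = [ 0.450; 0.440 ] |> Seq.truncate 5 |> Seq.toList
printfn "%d" atMost5.Length
```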

Here is my entire solution:

I enjoyed working on this solution. While it’s nothing big in the grand scheme of things, it was my first ‘real’ F# script. As always, any critiques, nudges or hints would be greatly appreciated. My number one goal of going through this process is to learn the 4 languages.

Up Next…

I will be adding more problems to the list shortly and solving this problem in either Ruby or Objective-C next.

Tuesday, November 30, 2010

Project Chadwick #1- Willie McCovey’s Career Batting Average

Before I get started on this post if you aren't familiar with Project Chadwick here's a quick overview.

I have started this off with an easy problem, calculating Willie McCovey’s career batting average. Starting this way made it easy for me to concentrate on learning enough to get something running without taking up too much time. In this post I will tackle a solution in all four of the languages I’m interested in: Erlang, F#, Ruby and Objective-C. As time goes on and the code for each solution grows I will probably move to a single post per language for each problem.

A little background on my experience with the languages I have chosen. I have been using Ruby for a few months now. We have converted our build process from MSBuild over to rake and albacore, used rails to do a proof of concept, and written other Ruby scripts to do administrative things. However with Erlang, F# and Objective-C I am truly learning them as I go through this process. If you see room for improvement in any of my solutions please feel free to share them.

As for my setup for creating these solutions, I am running the F# on Windows 7 and all other languages on Linux. The F# may also move to Linux once I get my F# environment set up there. With that said, let’s get started!

Batting Average Formula: Hits/At Bats

Languages: F#, Erlang, Ruby and Objective-C

Data: download from here.

Without further ado here are the functional languages (F# and Erlang) solutions:

Erlang Solution

A quick overview of the syntax you see in the script. First, %% is used to write comments. Erlang requires that all variables start with a capital letter. Also, once a variable has been assigned a value it cannot be changed. Function bodies are preceded by the -> sign. For functions that have multi-line bodies, commas are used to separate the expressions. The main function is an example of this. The last line of the function’s body has a period as its last character.

The first line of the file sets things up so I can run this like any other script in Linux. After the comment lines the sum functions are defined. Each function is uniquely identified by the module it appears in, its name and its ‘arity’. What is arity? Arity is the number of arguments the function has. In our case the first sum function has an arity of 2 and the second has an arity of 1. The sum functions are two distinct functions; they have nothing to do with each other. The sum functions are used to total up the career hits and at bats.

The main function does what you’d expect it to do: it is where the batting average is calculated. The first two code lines set up lists of the career hits and career at bats for McCovey. I had wanted to use tuples here but I was not getting the correct average using them. So I junked the tuples and went with something I knew would work. The last line of main simply prints out “Willie McCovey’s career batting average is 0.270”. The placeholder is replaced with the result of sum(Hits)/sum(Abs). The function signature for the io:format function looks like this:

io:format([IoDevice,] Format, Data)

io is the module that houses the format function. format’s parameters are IoDevice, Format and Data. If IoDevice is left out, stdout is used; if we were writing to a file we would pass the file handle as the IoDevice. Format is the string with as many placeholders as you need. The Data parameter is used to replace the placeholders. The number of items in the Data list must equal the number of placeholders in the string.

The Erlang solution was quick and to the point.

The F# Solution

F# is a .NET functional language. I am learning this language in hopes that I can use it to do some more complicated comparisons quicker than I could in C#. With that said, let’s jump into the F# solution.

As you can see it is very similar to the Erlang code. With F# we define the sum function in one line but it handles the same two situations as the Erlang sum functions do. Head and tail have the same meaning as H and T do in Erlang. It adds the head to the result of the recursive call on sum. When sum is at the end of the list or receives an empty list it returns zero. A real difference between the Erlang and F# code is that we have to add the rec keyword to the function’s definition to indicate that this is a recursive function. The printfn function is called to write the results out to stdout. Notice that in the calls to sum we do not use parentheses. That is because in F# parentheses are used for precedence, to create pairs and tuples, and to denote the unit value (F#’s counterpart to void). It takes a little getting used to not using parentheses in function calls, but after you do, the code seems a little cleaner.
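For readers who cannot see the embedded code, a recursive sum along the lines described above could be sketched like this. The two short lists are made-up numbers, not McCovey’s season totals, and the layout is spread over a few lines for readability rather than written as a one-liner.

```fsharp
// rec marks the function as recursive.
let rec sum list =
    match list with
    | [] -> 0.0                          // empty list or end of the list
    | head :: tail -> head + sum tail    // add the head to the sum of the rest

// Made-up sample data standing in for the per-season hits and at bats.
let hits = [ 10.0; 20.0 ]
let atBats = [ 40.0; 60.0 ]

// No parentheses around the arguments in the calls to sum.
printfn "Batting average: %.3f" (sum hits / sum atBats)
```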

The Ruby Solution

Now that we have the functional languages done, let’s move to something a little more familiar to most of us: object oriented programming. This Ruby solution really doesn’t do much with objects, though.

For this script I’ve combined the hits and at bats for each season as an array. So the hits_ab array is an array of arrays. Next I initialize a variable to store the total hits and one for the total at bats. These are set to 0.0 so that when I do the division I do not have to convert the sums to floats using the to_f method.

The each loop shows off one of the cool features of Ruby: blocks. The do … end is a block of code that is passed to the each method as a parameter. In our case the block is taking each of the arrays that are ‘in’ the hit_ab array and storing the first entry in the h variable and the second entry in the ab variable.

The last line prints the results to stdout after formatting the average. The #{} in a Ruby string is how you put the value of a variable, method call, or in our case the formatted results of the division into a string.

The Objective-C Solution

Ok, I have to admit this up front: I’m really learning this language as I go through this process. This is my weakest of the 4 languages here, so please feel free to guide me into the proper way of doing things if you see something wrong.

Most of this solution is really straight C code but it gets the job done. The first Objective-C (obj-c) related line is the #include<Foundation/Foundation.h>. This header file is the base header file for obj-c. The second obj-c line is the NSAutoreleasePool line. The NSAutoreleasePool object supports the reference-counted memory management system of Cocoa, an Apple development framework. All of the objects in the pool are sent a release message when the NSAutoreleasePool object is sent the drain message. In the above code the drain message is sent in the [pool drain]; line.

More on the NSAutoreleasePool line

In this line of code NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; there are some interesting things going on. Let’s look at the inner bit of code, [NSAutoreleasePool alloc]. In obj-c method calls are done like so: [object methodname]. What the inner call is doing is allocating the memory that the *pool pointer will point to. When the new object is returned we call its init method, which is its constructor call. Now we have a fully initialized NSAutoreleasePool object.

Summary

Admittedly this was an easy problem to solve so the code for each language was very similar. It also hints that the functional languages tend to require less code to do the same thing. Ruby is in between the functional and Objective-C code. I’m sure I could make the Ruby solution look even more like the functional languages if I put more time into it.

Remember, I am learning these languages as I go so if you see something I’m doing completely wrong or not as gracefully as I could don’t hesitate to pass it along. I would greatly appreciate it. As these problems get more complicated and the code required grows, I will do each language one at a time.

Up next: I will find the top 5 on base percentages since the Giants moved to San Francisco.

Language Resources

For additional information on these languages you can visit these sites:

Erlang:        http://www.erlang.org/
F#:              The F# Survival Guide
Objective-C: Learning Objective-C: A Primer
Ruby:          http://www.ruby-lang.org/

Saturday, November 27, 2010

Introducing Project Chadwick

Project Chadwick is a set of sabermetric/baseball statistics formulas I am using to learn a few new programming languages: Erlang, F#, Objective-C and Ruby. This project was inspired by Project Euler which is a series of math problems designed to keep your math and computer skills sharp.

Why Baseball Statistics?

Because I’m a baseball fan. As a kid I spent many hours reading box scores and whatever other stats the papers would publish. I have been curious about these languages so I thought I would use something I enjoy to learn them.

How it will work

I will present the problem and formula I am solving/using and then give a walk through of the code. For the smaller/easier problems I will show multiple languages at once. When the code gets long or the formulas are more involved I’ll have a post per language.

Remember, I’m doing this to learn these languages so if you have any tips and/or hints please feel free to pass them along.

Why is it called Project Chadwick?

It is named after the person who created the baseball box score. Henry Chadwick was born in England and was a cricket reporter who started reporting on baseball in 1857. For more information on Henry Chadwick check out his wiki page, or for more history on baseball statistics check out The Numbers Game by Alan Schwarz.

My Sources

One of the main sources I have for my formulas is the book Baseball Hacks by Joseph Adler. The statistical data I will use comes from http://www.baseball-databank.org/. It contains statistics from 1871 to 2009. You can download a MySQL database or files to load into Excel from the databank site. If you do not wish to store the data yourself you can view the data at http://www.baseball-reference.com/.

The Languages

If you want to use the same languages I am you can download and install them from the following places:

Ruby: http://www.ruby-lang.org/en/downloads/
Objective-C: I’m not sure if it’s possible to run it on Windows. I will be writing my Objective-C code on Ubuntu. Here are the instructions for setting up Ubuntu for Objective-C.
Erlang: http://www.erlang.org/download.html
F#: http://msdn.microsoft.com/en-us/fsharp/cc835251.aspx

I have the first problem posted here http://rob-rowe.blogspot.com/p/project-chadwick-problems.html. Over the next few days I will be adding the problems and my solutions. This should be a fun process and I hope you join me in this project.

Monday, June 14, 2010

Back from Tech-ED and I have OData Fever!

The title sums it up. I am in the middle of an OData fit. In my free time I am working on compiling baseball statistics from the Baseball Databank and Retrosheet sites to create one stats database that is available using OData. I’m not sure how long it will take but I will have it up some time in the future.

What is OData?!

“The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today.” – taken from OData.org

Serving data up this way frees us developers from some of the headache of cross platform data access (well, at least those of us who primarily write for the Windows platforms). There are already quite a few client libraries out there to consume OData (PHP, Java, etc.).

I may have a few blog entries about OData and how to set it up on the server and client sides. I haven’t thought it all the way through yet.

Why should I care about your fever?

If you are a baseball fan you will have access to historical data free of charge ready to use for whatever you can dream up. In addition to the feed I plan on having a small app or two to demonstrate how to use the data and/or to query the data.

If you aren’t a baseball fan but are curious about OData then you may want to check back periodically to check my status. I’m sure as I run into issues and/or figure things out I’ll blog about them.

Now if I could only find a few more hours to put into my day I could get this all done quickly.