Code Kata #4 - Data Munging Part One
The fourth Code Kata in Dave Thomas’ Code Kata blog asks you to do the following:
write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).
He provides a weather.dat file that you download and parse to obtain the solution. In this post, I will only cover the first part of the code kata. I completed that part over the course of three days, working up to four hours on one of them and thinking about the problem when I was away from the computer. Let’s take a look at the solution I developed.
The Completed Code
TDD is Software Improv
It is unfortunate that I didn’t capture the dynamic interplay between writing/rewriting the tests and writing/rewriting the code in this example. I think I originally heard it from David Chelimsky at SCNA 2011 who said that Test Driven Development is Software Improv. TDD is a sweeping movement involving writing your tests, writing code to pass your tests, refactoring your code, refactoring your tests, critiquing your code to determine how you should write your next test, and critiquing your tests to inform the design of your code. TDD truly is an art form.
Even as I write this post, I understand how difficult it is to capture the TDD process in writing. Have you ever seen live improv as a series of photos? For one thing, you won’t catch the humor in the dialogue. You might laugh at the pictures, but you won’t get the jokes unless you understand the context in which they were delivered; that’s where the dialogue is important. But listening to the dialogue alone isn’t enough either, because you would miss the subtle nuances in the performers’ gestures. The photos wouldn’t capture those, since they are taken minutes, or even seconds, apart. Now what happens if you see live improv as an audience member? You finally get it. In the same way, TDD cannot be fully expressed in a static blog post. I do not do it justice by simply talking about it; however, I will do my best to describe what happens in the process of doing TDD.
How’d you get there?
At the beginning, I wrote very trivial test cases that checked the return type of the read method and made sure that its return values were not nil. Of course, you don’t see these test cases in the final product. That’s because those tests acted like scaffolding that I used temporarily to get started. After I built the right structure in my program, I wrote other tests that rendered the scaffolding tests obsolete.
TDD is an incremental process. You should only write the minimal amount of code needed to make your tests pass. I think this is where I didn’t execute TDD as well as I could have. Once I understood the pattern of the code I needed to write, I went ahead and wrote it as production code. In the real world, the patterns won’t be as obvious, and you will have to write code for individual cases until you see the overarching shape of the code that covers the general case.
The example I will cite here is the test of a single line of data, which checks that I get the right value for the spread between the maximum and minimum temperature. The next thing I did was write the code that does that for every line of data. What I probably should have done first was write a regular expression hardcoded to that particular string. Then I should have added another case where there is an asterisk next to the maximum or minimum temperature, as is true for day 9 and day 26 in the weather.dat file. When you work on a larger project, it may not be obvious how to generalize the production code you need to write. Thus, it is a good habit to build your program incrementally when you do TDD.
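As a rough sketch of what those increments could look like, here is one test per case: a plain line first, then the two asterisk cases. The helper name spread_for, the pattern, and the sample lines (modeled on the layout of weather.dat) are illustrative assumptions, and Minitest is used only as an example test framework.

```ruby
require "minitest/autorun"

# Hypothetical helper: returns the max/min temperature spread for one
# line of weather data. The pattern skips the day number, then reads
# two integers, each optionally followed by an asterisk.
def spread_for(line)
  match = line.match(/\A\s*\d+\s+(\d+)\*?\s+(\d+)\*?/)
  match[1].to_i - match[2].to_i
end

class SpreadTest < Minitest::Test
  # First increment: a plain line with no asterisks.
  def test_plain_line
    assert_equal 29, spread_for("   1  88    59    74")
  end

  # Second increment: an asterisk beside the minimum temperature.
  def test_asterisk_beside_minimum
    assert_equal 54, spread_for("   9  86    32*   59")
  end

  # Third increment: an asterisk beside the maximum temperature.
  def test_asterisk_beside_maximum
    assert_equal 33, spread_for("  26  97*   64    81")
  end
end
```

Each test would be written, watched fail, and made to pass before the next one is added, so the pattern only grows the optional asterisk once a test demands it.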
Programming With Lambdas
Recently, I have been making more of an effort to introduce lambdas in my code. Lambdas and Procs were one of the most nebulous ideas for me when I first started programming in Ruby. Only recently did I start playing with passing lambdas as arguments to methods. Originally, the no_data method wasn’t in my code. It came about during the refactoring phase, when I wanted to make the read method more concise (and was also curious to see if I could extract the code block into something more meaningful). The code had read like so:
Now, there’s nothing wrong with that. However, if I were another programmer trying to read this code, I would be asking myself, “What are you matching against?” So I decided to write this instead:
To me, that reads more clearly than the first example. The syntax may be a little difficult for newer programmers to follow, but it is clear that I am rejecting the lines of the file that hold no data. Here is the code for the no_data method:
The name of the method communicates its function: it defines what data is not. However, I probably could have chosen a better name for regex. Looking at it now, the name doesn’t communicate its intent. Perhaps I should have written something like this:
That would have been more helpful to someone trying to find out how the no_data method works.
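To make the refactoring concrete, here is a minimal sketch of the idea. The constant name WEATHER_DATA_PATTERN, the pattern itself, and the sample lines are illustrative assumptions; only the no_data name and the use of reject with a lambda come from the discussion above.

```ruby
# Assumed pattern: a day number followed by two temperatures, each of
# which may carry a trailing asterisk.
WEATHER_DATA_PATTERN = /\A\s*\d+\s+\d+\*?\s+\d+\*?/

# Returns a lambda that is true for lines carrying no weather data.
def no_data
  ->(line) { line !~ WEATHER_DATA_PATTERN }
end

# Sample lines modeled on the layout of weather.dat.
lines = [
  "  Dy MxT   MnT   AvT",
  "",
  "   1  88    59    74",
  "   9  86    32*   59",
  "  mo  82.9  60.5  71.7"
]

# Before: the block spells out the match inline.
inline = lines.reject { |line| line.match(WEATHER_DATA_PATTERN).nil? }

# After: the lambda returned by no_data is passed as the block with &.
named = lines.reject(&no_data)

puts named
```

Both calls keep the same two data lines; the second simply gives the condition a name that a reader can look up.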
Being Expressive with Regular Expressions
If that same programmer wanted to see what regex looks like, they would see this:
Originally, I had written down this expression:
What is that expression supposed to be doing? It captures three groups of numbers that are separated by spaces, but those numbers don’t have any meaning at all! That’s when I decided to use the ‘x’ option, which allows whitespace in the regular expression, and to add named groups in order to layer the expression with meaning. Notice that I gave the groups names such as <min>, all of them relevant to the problem at hand. Unfortunately, those names alone don’t convey much meaning, and I probably should have been more descriptive: “Maximum of what?”, “Minimum of what?”, “Are we talking about heights? Weights? Distances?” A better implementation would have been:
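As a rough sketch of the contrast, with both patterns and the group names day, max_temperature, and min_temperature chosen here for illustration, the terse form and the extended form might look like this:

```ruby
# Terse version: three anonymous groups, meaning left to the reader.
TERSE = /\A\s*(\d+)\s+(\d+)\*?\s+(\d+)\*?/

# Extended version: the x option ignores whitespace in the pattern and
# allows comments, and each group carries a descriptive name.
DESCRIPTIVE = /
  \A \s*
  (?<day>\d+)             \s+      # day of the month
  (?<max_temperature>\d+) \*? \s+  # maximum, with an optional asterisk
  (?<min_temperature>\d+) \*?      # minimum, with an optional asterisk
/x

# Sample line modeled on the layout of weather.dat.
line = "  26  97*   64    81"
data = DESCRIPTIVE.match(line)

puts data[:day]
puts data[:max_temperature]
puts data[:min_temperature]
```

Both patterns match the same text; the second one tells the reader what each captured number stands for.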
The reason I didn’t use longer names such as min_temperature was the way I implemented the temperature method in my program.
Take a look at the following code:
Had I changed the names in my regular expressions, I would have had to change the above code to this:
Why is temperature mentioned twice in a method call? It adds too much noise. I should have just gotten rid of the temperature method and attempted a better implementation like so:
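One possible shape for that idea, sketched under assumptions: the method name temperature_spread, the pattern, and the group names are hypothetical, chosen so that each value is read straight from a descriptively named group instead of going through a separate temperature method.

```ruby
# Assumed pattern with descriptive group names.
PATTERN = /\A\s*(?<day>\d+)\s+(?<max_temperature>\d+)\*?\s+(?<min_temperature>\d+)\*?/

# Hypothetical replacement for the temperature method: read both values
# from the named groups and return their difference.
def temperature_spread(line)
  data = PATTERN.match(line)
  data[:max_temperature].to_i - data[:min_temperature].to_i
end

# Sample line modeled on the layout of weather.dat.
puts temperature_spread("   9  86    32*   59")
```

With the names carried by the groups themselves, the word temperature appears once per value and the subtraction reads like the problem statement.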
Even the above implementation is too long to read, but it communicates its intent better than the method I had written.
When doing TDD, you always need to keep in mind the other programmer who will be reading your code. It is not enough to make the program work; it needs to be readable and easy to understand for the rest of your team.
There are other things I left out because they would have made this post too long; one of them is the implementation that finds the smallest spread and the day on which it occurs. I’d like to see your implementation of this code kata. Perhaps there is some other way of implementing it that has not occurred to me. I welcome your insight and invite you to leave feedback in the comments below.