One Defect, Two Fixes

In A Tale of Two TDDers, I quickly documented what I see as two different defect-fixing behaviors of Test-Driven Development practitioners. I spent all of about ten minutes writing that blog. It generated a lot of interesting discussion, some of which bordered on deep misunderstanding. More detail would have helped. And so I dug through old support emails and version control history to document a real-world defect fix. It comes from Greatest Hits, an eLearning product that my colleagues and I created in 2006 and actively maintain to this day.

In this story, an album refers to one of our skill-building courses on practices like Refactoring, Test-Driven Development, etc. A Compilation represents either an Album or Box Set (collection of albums). Finally, the story refers to support email. Students can ask questions, post comments or report errors directly on any eLearning page. When that happens, emails get sent and then Industrial Logic’s coaches click on links in those support emails and respond to students directly on the Greatest Hits pages.

Those of us who develop the Greatest Hits product primarily pair or mob program, use Test-Driven Development (TDD) to produce most code, refactor liberally, check in to master (we don’t use branches) and deploy/release to production continuously. Our code base has high test coverage, including thousands of valuable, high-speed microtests, over a hundred acceptance tests, under a hundred Selenium tests and, for our continuous deployment safety, a few deployment tests. Yet even with all of this safety, we get the occasional defect. And that brings us to our story.

On January 2nd, 2013, at 7:50am CST, a support email from a student came in. A few minutes later, Tim, a coach at Industrial Logic, clicked on the support link in an email and went directly to the album page and location of the student’s question. No problems. But when Tim right-clicked on the “My Home” link on the page to open a new tab, he saw our OMG screen with a blank error message:

three people looking confused

At 7:58am, Tim reported the issue to our development email list.

At 8:39am, he documented the sequence that he and Curtis, another IL coach, had taken to reproduce the defect.

So was the system down? No. It worked, just not in the very specific sequence that Tim had just used. And, because of the defect, a null value had been saved in our Bookmark table for Tim’s account, which meant that the system generated a Null Pointer Exception when Tim tried to log in.

At 9:07, Tim emailed again, updating everyone about a conversation that he, Curtis and I had:

Josh (technically on vacation) jumped in and it looks like he has a fix for it. I've dropped the troublesome two entries in the bookmarks table.

According to git history, the first fix I checked in was at 9:13am. I don’t know if I paired with Curtis on this fix or not. It’s likely that I did, since he was around and whenever two or more of us aren’t with clients, we pair or mob on production code. However, it’s still unclear from the check-in comment.

It’s also important to note that the first fix wasn’t simply hacked into the code and checked in with a “hope and pray” strategy. Manual testing was done to first replicate the problem locally. Next, the code was changed and more manual testing was done to see that it fixed the issue. Finally, all tests were run to ensure that nothing had been broken. The fix was live soon after the check-in, since our continuous deployment pipeline pushes all changes that pass our build (including our large suite of tests) automatically to production.

From the time that Tim reported this issue, until a first fix was live, took 80 minutes. The fix I made was to simply add back in a null check that had recently been removed during refactoring. What’s this? A null check!? Yes, a null check. I don’t like them and generally avoid them. A better design would not require null checking.

So how did this null check get into our code? It was actually only a few days old. According to git history, while refactoring the PageAction class on December 28, 2012, Jaju, Karthik and I changed how we obtained the AlbumId when saving a bookmark. The old code was this:

String albumId = getUser().getCurrentAlbum().getAlbum().getId();

We changed it to:

String albumId = getStringParameter(COMPILATION);
if (albumId == null)
  return;

A null check?! Why did we add it? Apparently, something sporadic and hard to reproduce was happening sometimes with bookmarking, based on the git comment that we posted when checking in this code:

Changed PageAction to rely more on parameters and less on the getUser() calls to get an AlbumID or DevLanguage. Hoping this helps with a sporadic error we sometimes get related to bookmarking pages. --Joshua, Jaju, Karthik

The sporadic bookmarking error to which we referred was happening every once in a while when running our suite of Selenium tests. There aren’t any support emails about it that I could find. We wanted it fixed. Here’s the relevant code we touched in PageAction:

private String displayPage(String path, String devLanguage, ViewableAlbum viewableAlbum, User user) throws Exception {
    user.setDevLanguage(devLanguage);
    user.setCurrentLocatorFrom(path);

    setBookmarkFor(user);

    ...
}

private void setBookmarkFor(User user) {
    String pageKey = bookmarkablePageKeyFor(user);
    if (pageKey == null)
        return;
    String albumId = getStringParameter(COMPILATION);
    if (albumId == null)
        return;
    Language devLanguage = Language.instantiateFrom(getStringParameter(DEV_LANGUAGE_PARAMETER));
    user.setBookmark(albumId, pageKey, devLanguage);
}

I was not happy about this new null check in the setBookmarkFor(…) method! We already had a null check for pageKey. Adding that new null check was not even part of refactoring, since it’s new behavior. It also appears that we were not very diligent that day about test-driving that null check into the code, as there is no new test for it, which is quite rare for us. I suspect that we were relying on our suite of Selenium tests to confirm that we’d fixed a subtle, race condition.

In any event, that new null check bugged me and I vowed to remove it. Sure enough, two days later, on December 31st, 2012, Jaju and I paired to remove the null check. In fact, we got rid of the entire setBookmarkFor(…) method on PageAction when we realized it was largely a duplication of a similar method on the User class called setBookmark(…):

private String bookmarkAndDisplayPage(String path, String devLanguage, ViewableAlbum viewableAlbum, User user, Page page,
        String albumId)
        throws Exception {

    user.setBookmark(albumId, page.getKey(), Language.instantiateFrom(devLanguage));

    user.setDevLanguage(devLanguage);
    user.setCurrentLocatorFrom(path);
    parameters().put("viewableAlbum", viewableAlbum);
    return process(DisplayPageAction.class);
}

Much better! We’d removed unnecessary code and produced a simpler, clearer design without null checks! Unfortunately, we also opened the door for the defect that Tim would experience on Jan 2nd, 2013.

So, how exactly did we fix the defect? Two ways: a quick-and-safe fix and a better, more permanent, test-driven fix.

Quick-and-safe? The phrase alone might raise someone’s blood pressure. Is there even such a thing? I believe so. But words matter and to be clear, I don’t associate quickness with haste. For me, quickness is associated with craft. For others, it is forever associated with “haste” and an abundance of software development dysfunction, like constant pressure to cut corners, perform shoddy work, never return to improve a temporary solution, etc.

Quick-and-safe can in fact be both quick and safe. In this situation, given that Tim couldn’t log in and therefore couldn’t respond to a student question, it was important to find the quickest and safest solution to help both Tim and the student. In this case, the first fix wasn’t particularly complicated or dangerous because:

It was clear that the defect resulted from the recently removed null check.
We had a clear set of steps to follow to replicate the defect.
We could use those steps to manually confirm that adding the null check solved the issue.
We could run our large suite of automated tests to ensure we hadn’t broken anything else.
We knew we would fix the defect with a better, test-driven solution, once we figured out how to really solve the problem once and for all.

So what exactly did my quick-and-safe fix look like? Here’s my comment at 9:13am in git:

Added back a null check on AlbumId. We'll need better test coverage for this.

And here’s relevant code in PageAction:

private String bookmarkAndDisplayPage(String path, String devLanguage, ViewableAlbum viewableAlbum, User user, Page page,
        String albumId)
        throws Exception {

    bookmarkPage(devLanguage, user, page, albumId);

    user.setDevLanguage(devLanguage);
    user.setCurrentLocatorFrom(path);
    parameters().put("viewableAlbum", viewableAlbum);
    parameters().put(CURRENT_PAGE, page);

    return process(DisplayPageAction.class);
}

private void bookmarkPage(String devLanguage, User user, Page page, String albumId) {
    if (albumId == null)
        return;
    user.setBookmark(albumId, page.getKey(), Language.instantiateFrom(devLanguage));
}

It was 9:17, when I wrote the following to our developer email list:

Just checked in the fix for this. While refactoring PageAction (to make it more thread-save), we got rid of a null check on AlbumId. That essentially allowed a situation where a null AlbumId could be added to our Bookmark table. Given such a null in the DB table, when a user logged back in, we had a problem. So it's fixed, but needs a good test case - I'd think in Selenium to make sure we never re-introduce this bug.

At 9:46, I wrote a follow-up email:

Just took a shower and realized that the scenario which led to us seeing this bug is perfectly valid and the correct bookmark should've been recorded, not a null one. So my null check now fixes the problem, but doesn't get to the root cause. The steps to re-create this are: 1. Make sure you are logged out. 2. Click on a URL in a support email (I'm simulating this by pasting in the relevant part of the URL into the browser on my local build). 3. Login and it will forward you to the correct page just fine. 4. Right-click the My Home link and open it in a different tab. Whammo - the bug appears. As I said, I fixed this by checking if AlbumId is null in PageAction, but there should be no good reason for it being null in the above case. It should be recording a good bookmark (since you got to the page just fine). So further research is warranted.

Curtis and I then worked on the problem, pair-programming a better, test-driven design. We removed the temporary fix code and added a failing test to an existing test class:

@Test
public void userCanBookmarkAlbumWhenCompilationParameterIsNull() throws Exception {
    addAlbumWithValidLicense("testAlbum", LANGUAGE);
    parameters().put(COMPILATION, null);
    assertProcessSuccessfulFor("testAlbum/commonCodeSmells/deadCodeRemoval");
    Assert.assertThat(user.getBookmark().albumId, Matchers.equalTo("testAlbum"));
}

We figured out that we needed to be using a Locator in the PageAction class. A Locator helps figure out where a user is in our system given a variety of URLs. It’s also able to figure out the correct AlbumID if and when it is initially null:

private String getAlbumId(Locator locator) {
    String albumId = getStringParameter(COMPILATION);
    return albumId == null ? locator.getAlbumId() : albumId;
}

PageAction could now go back to simply asking the User class to save the bookmark. Our new test was now passing. To make doubly sure that we’d solved the problem, we manually tested the scenario that Tim had used. It worked fine. It was January 2, at 12:45pm, almost 5 hours since Tim had first reported the defect. Our 2nd fix was live. We never had any more troubles with bookmarking of pages.

From the time Tim discovered the problem until it was fixed, the first fix took 80 minutes, while the 2nd fix took five hours. Now you could argue that there were delays (showers, emails being written and sent) during those 5 hours, so the second fix could have been faster. But it would never have been as fast as the first fix because the 2nd fix required a deeper level of analysis, design and testing to produce a more permanent fix.

But did we take on too much risk on the first fix? Not at all. We deemed it safe enough, given our small change, our manual testing of a very specific scenario and the large battery of automated tests we used to confirm correct behavior of the entire system.

This story illustrates how two principles drive our behavior: Deliver Value Continuously and Make Safety A Prerequisite. Our first fix helped us deliver a fix to production sooner and with sufficient safety. Our second fix delivered a better design and greater safety via a new automated test. One defect, two fixes. Does this mean one should not use TDD for the first fix? Not at all. In another scenario, it could be that TDDing the fix is quick and safe and requires no second fix. We don’t have one recipe. Instead, we are guided by principles.

What do you think? Please share your thoughts and experiences. And if you made it this far, thanks for reading!

Workshops

eLearning Albums

Case Studies

Pages

Blog Posts

People

Browse eLearning

Take a Workshop

Read a Blog