Duplicated Logic

The Single Point of Truth (SPoT) principle states that every fact and every algorithm is expressed in one (easily located) place in a codebase. In The Pragmatic Programmer, Dave Thomas and Andy Hunt put it this way: every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

There are, ironically, multiple ways of expressing this simple idea, and perhaps you’ve known this principle by other names:

  • Once And Only Once (OAOO)
  • Don’t Repeat Yourself (DRY)
  • Single Source of Truth (SSOT)

This isn’t a new concept in our field. Notable people have been telling us for many years that duplicated code is a “code smell” that progressively bloats and degrades our code. It leads to maintenance problems and to errors incorrectly reported as “regressions”: a bug fixed in one copy resurfaces in another copy that was never fixed.

Coincidences and Accidents

Some people misunderstand this to mean that no sequence of characters and punctuation marks may ever repeat – a very naive, purely textual reading. We didn’t say “no text ever repeats” because that would be absurd. For instance:

legs_on_a_dog = 4
legs_on_a_chair = 4
years_between_US_presidential_elections = 4

There is no duplication in the above snippet. None. Yes, the phrase ‘legs_on’ appears twice, and the number 4 appears three times, but these are all independent, separate facts.

Their coincidence is, well, coincidental. It’s not meaningful at all. If a chair has 3 legs, that would not affect presidential elections or dogs. These are independent facts, each appearing only once in the snippet above.

As with everything in software development, things that change together are put together, and things that change independently are kept separate. That’s the heart of software design (coupling and cohesion). Munging these “4”-related things together would couple things that change independently, which is a no-no.

So, in short, we never create an abstraction around accidental coincidences.
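
To make that concrete, here is a minimal sketch of what abstracting the coincidence would look like, and why it backfires. The constant name and the “fix” are invented purely for this illustration:

# WRONG: one shared "source of truth" for three unrelated facts.
# COMMON_FOUR is a name invented purely for this illustration.
COMMON_FOUR = 4

legs_on_a_dog = COMMON_FOUR
legs_on_a_chair = COMMON_FOUR
years_between_US_presidential_elections = COMMON_FOUR

# Suppose we later support three-legged chairs. Changing COMMON_FOUR
# to 3 "fixes" chairs while silently breaking dogs and elections.
# These facts change independently, so they must stay separate.

The repeating digit 4 was never duplicated knowledge; the abstraction manufactures coupling that was not there.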

This is where static code analysis tools can get us into trouble if we accept their findings without inspection. Static analysis tools can identify textual duplication, but humans need to determine whether it’s coincidental or a duplication of facts in the codebase.

Since the three facts above are independent, we satisfy the SPoT principle by keeping them separate.

The Rule of Three

So, we’ve spotted some duplication. Should we remove the duplication right now? Is it a moral imperative or a choice? Sometimes we tackle duplication right away, perhaps most of the time we do, but it is a choice. Sometimes it seems too early to worry about it as we’re not sure how meaningful the duplication is (after all, we don’t remove coincidental/accidental similarities).

The Rule of Three, also known as “Three Strikes and You Refactor,” states that once similar code appears for the third time, it should be extracted. It was popularized by Martin Fowler in his book, Refactoring, where he attributed it to Don Roberts. The idea behind this heuristic is that some duplication is easier to work with than a wrong abstraction.

Some may think that this is at odds with the SPoT principle, but we don’t see it that way. It is just one way of dealing with duplication. We may decide when to extract duplicated elements of our code based on their frequency, how confident we are in the abstraction that is forming, and how simple it would be to modify that abstraction in the future as we learn more. If we write well-tested, intention-revealing, simple code, a wrong abstraction is much less costly. If refactoring and TDD are part of our team’s daily practices, we may choose to refactor before the third occurrence presents itself.
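
As a hypothetical illustration (the domain and names are ours, not from any real codebase), here is what “three strikes” might look like in practice:

# Strike one and strike two: two call sites inline the same formula.
receipt_total = round(100.00 * (1 + 0.19), 2)
invoice_total = round(250.00 * (1 + 0.19), 2)

# Strike three: a third caller needs the same calculation. By now we
# trust the abstraction, so the knowledge moves to one named place.
def gross_price(net, tax_rate=0.19):
    return round(net * (1 + tax_rate), 2)

quote_total = gross_price(42.00)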

Spotting Duplication: A Story

Let’s look deeper, though, at how the problems tend to occur and play out. An example might be helpful:

Joan is working on a module in the AwesomeSauce codebase. She collects the data she wants from the database, loops through the data (filtering out irrelevant entries), and performs some calculations, which she then stores in an array and returns embedded in a JSON document.

She extracts the calculations into a private function and breaks out the filtering criteria into a separate lambda expression. The code reads well, makes sense, and passes all of its tests. She commits the code.
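
The story doesn’t show Joan’s actual code, but its shape might look something like this sketch (every name and data field here is our assumption):

import json

# Joan's filtering criteria, broken out as a separate lambda.
is_relevant = lambda entry: entry.get("status") == "active"

def _calculate(entry):
    # The calculations Joan extracted into a private function.
    return entry["amount"] * entry["rate"]

def build_response(entries):
    # Filter the query results, calculate, and wrap in a JSON document.
    results = [_calculate(e) for e in entries if is_relevant(e)]
    return json.dumps({"results": results})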

The next day, Manish is working on a report. He acquires the data he needs from the database, filtering out the irrelevant data. He extracts the filtering criteria to a private function. He performs the calculations he needs while looping through the remainder, and returns the result as an array.

The next day, Kristof takes on some new work. He needs to acquire some filtered data. He looks at the relevant classes, but there is no method that collects the data he wants or allows him to perform a query with a filter. He writes a private helper function to acquire his data from the database. He writes his calculations into another private helper function and calls that function in a loop that iterates over all the data. He returns the result as a list.

Now the filtering criteria, the data collection, and the calculations exist in private helper functions in “edge” classes: the places where data is consumed.

None of these features have made their way into the domain model, even though they are common behaviors.

Manish has a new feature to implement. He realizes that he’s already done something similar, so he goes to his report and copies the code he wrote last time. This time, he has been experimenting with newer functional programming features, so he revises his copied code to work using filter, map, and collect. The new code is much cleaner than the old code. Because it’s now quite a small code expression, he leaves it inline in the function he’s writing.

Some of you may have recognized that Manish literally did a copy/paste operation, whereas the others did not: they repeatedly re-invented the process of querying, filtering, and calculating.

In Manish’s case, the new version of the code is a substantial improvement over the old version, and no static analysis tool will recognize that it is doing the same work in a slightly different way.
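
To see why, compare two sketches of the same computation (names invented for illustration). The loop-style version and the filter/map rewrite share almost no text, yet they encode identical knowledge:

def totals_loop(entries):
    # Loop-and-accumulate style, as in the earlier private helpers.
    results = []
    for entry in entries:
        if entry["status"] == "active":
            results.append(entry["amount"] * entry["rate"])
    return results

def totals_functional(entries):
    # Filter/map style: textually different, semantically identical.
    kept = filter(lambda e: e["status"] == "active", entries)
    return list(map(lambda e: e["amount"] * e["rate"], kept))

A textual duplicate detector compares token sequences; these two functions share only a handful of tokens, so the duplicated knowledge goes unflagged.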

On the other side of the organization, Erin and Mike have been fixing bugs together. There is a problem with rounding in a calculation. They correct the error and ship a patch to production.

A few minutes later, the bug is re-detected! This time, it’s in a report. Erin locates the bug, and sure enough, there is a rounding error that they’ve already fixed once before. While Erin remembers what they did, she doesn’t really recall where it was, and she’s in a hurry to fix the code, so she repeats the fix in this new piece of code.

In the morning, Mike has a report that the bug has recurred! He loads Kristof’s changes from last week and sees the same error that he and Erin fixed yesterday. He makes the fix again.

Two weeks later, there is a bug report from the field. There seems to be a rounding error. Erin and Mike have become accustomed to this “groundhog day” scenario and have each memorized the change that corrects the rounding error, so that they can retype it and quickly turn around errors from production each time the problem recurs.
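
The story doesn’t say what the rounding error actually was, but a common shape for it in Python looks like this (an illustrative assumption on our part, not the actual bug):

from decimal import Decimal, ROUND_HALF_UP

# Binary floats can't represent 2.675 exactly, and round() rounds
# half to even, so naive rounding surprises currency code:
print(round(2.675, 2))  # 2.67, not the 2.68 an accountant expects

def round_currency(value):
    # The fix Erin and Mike keep retyping. If it lived in exactly one
    # place in the domain model, it would need fixing exactly once.
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(round_currency(2.675))  # 2.68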

How many opportunities to remove duplication did you spot in the above story? At least seven?

Great. Now consider:

  • What would it have taken for this story NOT to have unfolded this way?
  • How would each of the players have known not to repeat or reinvent the code and the defect?
  • What should Erin and Mike have done differently, and why?
  • What would have been different if they were following the SPoT principle?
  • Does the Rule of Three protect the team sufficiently?
  • Where should the duplicated methods have been kept, so that the developers could avoid reinventing the same code and errors?

Related Discussion:

We had an open discussion in our Industrial Logic Twitter Space. Here is the recording of that conversation.