Harmony and Identity: Generalizing Code

Posted July 21, 2016 by Bill Wake

Do you find yourself creating code for similar methods or classes, and feel like it shouldn't be so hard?

You may be able to Harmonize Code and Introduce Identity Elements to make methods and classes melt away.

Harmonizing

Harmonize: (v) "bring into consonance or accord."

While "harmonize" is a good musical term, this sense of harmonize is related to the idea of harmonizing standards: ensuring they can line up in a compatible way.

Methods or classes can sometimes be harmonized in a way that makes it easier to remove the duplication among them.

Here's a real (but simplified) example:

writeCopyright() {
    stream.write("// Copyright 2016\n")
    stream.write("// All rights reserved\n")
    stream.write("// ********************\n")
}

writeHtmlCopyright() {
    stream.write("<!--\nCopyright 2016\n")
    stream.write("All rights reserved\n");
    stream.write("-->\n")
}

writeHashCopyright() {
    stream.write("# Copyright 2016 #\n")
    stream.write("# All rights reserved #\n")
}

In real life, these three methods were on the same class, but they could have easily been methods on different subclasses.

I've seen variations of this program structure a number of times. In one case, there were multiple combinations of fields to be sent in an email message. In another case, there were dozens of subclasses that mostly just had different data values and a couple minor variations of code.

Synoptic Harmony

Have you ever seen a synoptic ("one-view") Bible?

Synoptic view of the Bible

Synoptic view of the Bible

In the Christian Bible, several books describe the same events, in similar but not identical ways. A synoptic Bible shows the books in parallel, with events lined up (where they can be) so you can compare the various descriptions.

We'll use a similar idea with code:

  1. Write methods in parallel so we can see how they line up.
  2. Devise a generalization that covers all parts.
  3. Simplify conditionals and repetitions.
  4. Translate it into code.

Write Methods in Parallel

We can take a synoptic view of our code:

What gets written?
DescriptionwriteCopyright()writeHtmlCopyright()writeHashCopyright
bodyPrefix <!-- 
1 linePrefix// #
1 lineCopyright 2016Copyright 2016Copyright 2016
1 lineSuffix  #
1 nl\n\n\n
2 linePrefix//  
2 lineAll rights reservedAll rights reservedAll rights reserved
2 lineSuffix  #
2 nl\n\n\n
bodySuffix// **********\n-->\n 

No single method uses all the pieces - that's normal. The numbers on the left show where a pattern repeats. Notice that there are some rows where the same thing happens across all the methods; that's typical and desirable. (If no rows overlap, that suggests the methods aren't similar enough to harmonize.)

Devise a single generalization that "covers" each method.

In words, we can generalize these methods as:

	an optional body prefix
	followed by one or more formatted lines, each consisting of:
		an optional line prefix
		a line of text
		an optional line suffix
		a newline
	all followed by an optional body suffix

It's handy to be able to write this as a regular expression:

        bodyPrefix? (linePrefix? line lineSuffix? nl)+ bodySuffix?

where "?" means "0 or 1", and "+" means "1 or more". You may be used to regular expressions for string matching; we're doing something very similar by matching code chunks rather than strings.

Each of the three methods conforms to the generalized description.

This is a big step: it tells us the form of a generalized routine - provided we can figure out when each piece should appear!

Translate Into Code (V1)

In a bit, we'll look at simplifying the generalization.

But first, let's pause and consider how to move this back to code.

Harmonizing took us from individual solutions to a generalized form.

That gives us a new problem: how does the generalized form know which parts to activate? We may be able to devise a scheme where it all "just works", or we may have to pass in information that "programs" the new method to know what to do.

Pseudocode for our method might look something like this:

writeGeneralCopyright() {
    if (shouldWriteBodyPrefix) 
        stream.write(bodyPrefix)
    end

    writeCopyrightLine(lines[0])       // We called for a one-line minimum
    for (int i = 1; i < lines.count(); i++)
        writeCopyrightLine(lines[i])
    end

    if (shouldWriteBodySuffix) 
        stream.write(bodySuffix)
    end
}

writeCopyrightLine(line) {
    if (shouldWriteLinePrefix) 
        stream.write(linePrefix)
    end
    
    stream.write(line)

    if (shouldWriteLineSuffix) 
        stream.write(lineSuffix)
    end

    stream.write("\n")
}

Here's one way to use this. For simplicity, we'll assume there are boolean fields that tell what to write, and string fields for the prefixes and suffixes:

writeHashCopyright() {
    shouldWriteBodyPrefix = false
    shouldWriteLinePrefix = true
    shouldWriteLineSuffix = true
    shouldWriteBodySuffix = false
 
    linePrefix = "#"
    lineSuffix = "#"

    writeGeneralCopyright()
}

There are three things I'd like to improve:

  1. Fields are used as implicit parameters. We could look at passing arguments (though it's a long list), or perhaps creating an object representing the configuration. We won't pursue that further.
  2. Selecting the desired parts (by setting the booleans) is ugly.
  3. The need for a one-line minimum adds complexity.

Simplify: Leverage Identities

Every time we use "?" in a regular expression, we introduce the challenge, "Should this element be present or not?"

So, we'd like to remove the "?" operators.

In this case, we can do that - because writing to a stream is very much like appending strings.

Is there a string that has the effect of adding nothing (as required by writeCopyright() and writeHashCopyright() for the bodyPrefix case)? Of course there is - the empty string.

The empty string is an identity element with respect to appending strings. Just as adding 0 to a number leaves it the same, appending the empty string has no effect:

        "string" + ""   <=>   "string"

Since we can use the empty string for bodyPrefix etc. as needed, we can simplify our regular expression:

        bodyPrefix (linePrefix line lineSuffix nl)+ bodySuffix

We just have to ensure that the values are set correctly:

valuewriteCopyright()writeHtmlCopyright()writeHashCopyright()
bodyPrefix"""<!--"""
linePrefix"//""""# "
lineSuffix""""" #"
bodySuffix"//*********\n""-->\n"""


We can now create a common method that reflects the regular expression and that takes these values as parameters (along with the lines). We might even be able to inline the methods that call the common method, eliminating them entirely.

Identity elements (such as empty strings) help eliminate conditionals.

For more examples, see Wikipedia on "Identity Element".

When the Conditional Doesn't Simplify

Sometimes there is no suitable identity element.

For example, in the case of email, there is a distinction between an empty subject line (present but consists only of an empty string), and an absent subject line (not present at all). We can't use "" to represent both cases.

There are a few ways to work around this challenge:

  1. The Null Object
  2. A Null Object is a "do-nothing" object. It's usually designed to provide identity data where it can, and do-nothing methods when asked to work.

    In the email example, we could imagine a field-writing object. A subject that is present (empty or not) would have an object that writes a given field to a stream. A missing subject could have a null object would do nothing when asked to write.

  3. Null Value or Optional Type
  4. Many languages have a null value that can be used to distinguish empty string ("") from no value (null). A related approach is an optional type that can queried to understand whether a value is present.

    This saves us from the complexity of "extra" objects, but requires more complicated and error-prone code:

    	if (subject != null)
    	    write(subject)
    
  5. Boolean Flag
  6. Rather than use a null pointer, we can use a boolean flag to indicate whether the value is present. (This is especially useful with primitive types.) It's the approach we used in V1 above.

    For example:

    	if (ageRecorded)
    	    write(age + " years")
    

    (We can go C-stye and use a set of flags with values 1, 2, 4, 8, … and an "&" check.)

    Simplify Repetition: That Pesky Plus (+)

    Just as we like to eliminate "?" from our regular expression, it's often useful to question "+" ("1 or more") as well.

    Is it important to require at least one iteration? Sometimes it is, but "0 or more" ("*") is usually easier to work with.

    In the copyright case, we could pass in an array of lines, and be perfectly content if there were no lines to write. As a matter of fact, we found the copyright examples above when we needed to make it possible to not get any copyright notice.

    In our example, the regular expression becomes:

            bodyPrefix (linePrefix line lineSuffix nl)* bodySuffix
    

    With an empty bodyPrefix, no lines to write, and an empty bodySuffix, we can get "no notice" for free.

    With typical language constructs, "*" is usually easier to work with than enforcing the need for a one-line minimum.

    Translating to Code (V2)

    Using the identity technique and the simplified loop, let's look at a better version:

    writeGeneralCopyright(lines, bodyPrefix, linePrefix, lineSuffix, bodySuffix) {
        stream.write(bodyPrefix)
      
        for (line in lines)
            writeCopyrightLine(line)
        end
    
       stream.write(bodySuffix)
    }
    
    writeCopyrightLine(line) {
       stream.write(linePrefix)
       stream.write(line)
       stream.write(lineSuffix)
       stream.write("\n")
    }
    
    writeHashCopyright() {
        writeGeneralCopyright(lines, "", "#", " #", "")
    }
    

    Sidebar: Jackson Structured Programming

    The harmonization approach I've used reminds me of another method, not often mentioned these days: Jackson Structured Programming (aka JSP).

    In JSP, you define a grammar for the input and output, then create a generalized grammar that unites and reconciles them. (It's not always possible; you might have a "structure clash" that requires a different approach.)

    In the harmonization above, we're using a similar idea to harmonize across methods, a grammar for what code gets run.

    See Principles of Program Design, by M.A. Jackson.

    A Lurking Alternative: Interpreter

    Once we've gone to the trouble of describing our code using a grammar or regular expression, it suggests another alternative: the Interpreter pattern.

    An interpreter takes a description, then uses that to figure out what to do.

    It's not always useful, but the benefit is that we can make it easier to change the description without requiring any (or much) code to change.

    Limitations

    The harmonization approach is not always useful.

    We can generalize any set of methods (e.g., A? B? C? can generalize three unrelated methods a(), b(), and c()). But generalizing only benefits us if there is commonality - duplication - that we can identify and reduce.

    There are several other possible trouble areas (and if there are n, there are probably n+1):

    • Concurrency - We may need to be sure that everything is accessed in a similar order, and under the same locking protocols etc. as the original code.
    • Exceptions - Similarly, we need to be sure that exceptions can't mess up the flow from the original design. There are usually ways to do that, but they can be clunky.
    • Recursion - If a method calls itself, the flow is more complicated to describe (and you may not be able to find a good description).

    Don't let these limits scare you too much - a lot of code never has these issues.

    Summary

    When several bits of code are suspiciously similar, you may be able to Harmonize them - line them up and devise a generalized rule that describes them all together.

    1. Write methods in parallel to align corresponding parts.
    2. Devise a generalization that covers all methods.
    3. Simplify conditionals and loops if you can.
    4. Translate your generalization into code.
    To harmonize code: (1) Align similar code, (2) Generalize, (3) Simplify, and (4) Translate back into code. [Tweet this]


    On unlucky days, you find there's no benefit to this technique, or no convenient way to make it work with your code.

    On average days, you can create a generalized method fairly easily, but have to develop a scheme that tells the generalized method which parts to activate.

    On lucky days, you can find identities (such as empty string or 0) that make if statements melt away, and you get rid of a lot of duplication!


    Thanks to Chris Freeman, Tim Ottinger, and Kevin Rutherford for early feedback.
  • unroar

    One issue you don’t mention is that the more abstracted code is more difficult to understand. The original code, with literal strings, is very easy to understand. A new developer encountering the generalized code will require some thinking and some time to understand what is going on.

    The conflict between simplicity and DRY creates a balance that developers have to strike. Below some level of repetition the generalized code is not worth it. But deciding what level is the tipping point is context-dependent and it’s hard to make a universal rule of thumb.

    • Drazen Dotlic

      > But deciding what level is the tipping point is context-dependent and it’s hard to make a universal rule of thumb.

      Nicely put. In practice, this is the reason we often get overarchitected messes on one hand and copy & paste galore on the other, but rarely a balanced approach 🙁

    • Bill Wake

      You’re right, there’s definitely a balance to strike. But I’ve seen cases where this technique gets rid of dozens of classes or methods, and changes go from “maybe next release” to “configure these parameters”. I especially agree with your last sentence! –Thanks, Bill

  • zhech

    A similar idea I’ve read about before is commonality and variable analysis.