
Conway’s Law states:

"Any organization that designs a system...will inevitably produce a design whose structure is a copy of the organization's communication structure."

It’s often paraphrased as saying that the structure of software developed by an organization reflects the structure of that organization. What if this sort of reflection applies in other areas?

If your data is in CSV format or a data table, the code to handle it tends to look something like this:

for each row:
    Do something with the fields on this row

This “sort of” looks like it might reflect the data structure, but there’s not really much to work with. Even if it is some sort of reflection, it’s hardly a surprise that a series of rows is processed as exactly that.
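
To make that concrete, here is roughly what the pattern might look like in C#. The file name, the comma delimiter and what counts as “doing something” are all assumptions made purely for illustration:

using System;
using System.IO;

foreach (var line in File.ReadLines("orders.csv"))
{
    // Naive split on commas; a real CSV parser would also handle quoting
    // and embedded delimiters.
    var fields = line.Split(',');

    // Do something with the fields on this row
    Console.WriteLine(string.Join(" | ", fields));
}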

Let’s take a look at some JSON data:

{
    "order_summaries": [
        {
            "reference": "AB19234",
            "customer": {
                "name": "John Smith",
                "address": {
                    "town": "Toronto",
                    "county_state_province": "Ontario",
                    "country": "Canada"
                }
            }
        },
        {
            "reference": "ZX54390",
            "customer": {
                "name": "Fred Jones",
                "address": {
                    "town": "Worcester",
                    "county_state_province": "Worcestershire",
                    "country": "United Kingdom"
                }
            }
        },
        {
            "reference": "LK196854",
            "customer": {
                "name": "Dave Anderson",
                "address": {
                    "town": "Toronto",
                    "county_state_province": "Ontario",
                    "country": "Canada"
                }
            }
        }
    ]
}

An initial piece of code to parse this, assuming we’re using a parser library of some sort, might look like:

var orderSummariesNode = document["order_summaries"];

foreach (var orderNode in orderSummariesNode)
{
    string reference = orderNode["reference"].Value();
    string name = orderNode["customer"]["name"].Value();
    string town = orderNode["customer"]["address"]["town"].Value();
    string countyStateOrProvince = orderNode["customer"]["address"]["county_state_province"].Value();
    string country = orderNode["customer"]["address"]["country"].Value();
   
    // Code to create new order object and add it to a collection
}

This works, but it isn’t “pretty”: there’s a lot of inefficient repeated evaluation, and then there’s the question of what happens if a node is missing. Some libraries can handle this, and will return null if any node along the path is missing. Others will just throw an exception at the first sign of missing data. In C# there are the null-conditional operators - ?[] and ?. - which can help stop missing nodes from causing too many problems. However, these are not available in all languages, and they don’t make the code above any prettier or more efficient.
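
As a rough sketch, here is the same loop using those operators, sticking with the hypothetical parser API from the examples above. It assumes the indexer itself returns null for a missing key rather than throwing, which depends entirely on the library:

var orderSummariesNode = document?["order_summaries"];

if (orderSummariesNode != null)
{
    foreach (var orderNode in orderSummariesNode)
    {
        // Each ?[] and ?. short-circuits to null as soon as anything on the
        // path is missing, so a missing "customer" or "address" node no
        // longer causes an exception - but every field still re-evaluates
        // the path from the top, so the inefficiency remains.
        string reference = orderNode?["reference"]?.Value();
        string name = orderNode?["customer"]?["name"]?.Value();
        string town = orderNode?["customer"]?["address"]?["town"]?.Value();
        string countyStateOrProvince = orderNode?["customer"]?["address"]?["county_state_province"]?.Value();
        string country = orderNode?["customer"]?["address"]?["country"]?.Value();

        // Code to create new order object and add it to a collection
    }
}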

There’s also a “cut and paste” look about the code, but assuming there are suitable unit tests and decent coverage, we’ll presume it works.

If you’re reviewing code like this, bear in mind there are many reasons why it may look the way it does. Maybe the developer was inexperienced with the data format. Maybe they just didn’t feel the need to make it pretty. Or maybe there was a deadline looming, it was late in the evening with a sense of panic, constant interruptions and status requests from people who should have known better, and the developer was feeling the frustration and misanthropy that sort of atmosphere can create.

Regardless, the great thing about test driven development (TDD) is that it gives us the security of knowing whether the code above works, and the confidence to tidy it up when we have a moment and have regained some level of motivation.

Refactoring the code could give us something like:

var orderSummariesNode = document["order_summaries"];

if (orderSummariesNode != null)
{
    foreach (var summaryNode in orderSummariesNode)
    {
        string reference = summaryNode["reference"].Value();
        var customerNode = summaryNode["customer"];
       
        if (customerNode != null)
        {
            string name = customerNode["name"].Value();
            var addressNode = customerNode["address"];
           
            if (addressNode != null)
            {
                string town = addressNode["town"].Value();
                string countyStateOrProvince = addressNode["county_state_province"].Value();
                string country = addressNode["country"].Value();
               
                // Code to create new order object and add it to a collection
            }
        }
    }
}

Now there’s no duplication of evaluation, so the code is more efficient, and the outline of the code is starting to resemble that of the data, so its function is clearer.

Assuming we’re still using the same unit tests, and that they pass, this is a reasonable refactoring. However, have we made the code more fragile BECAUSE it reflects the structure of the data? Not necessarily in the case of additional or missing data nodes, which both examples handle at least to some extent, but what about when the schema changes more drastically? If data nodes are moved around, more substantial code changes may be required: the first example might only need its indexer paths changing, whereas the second may need structural changes to its nested conditions. Either way the code will need to be retested with new data.

As the developer, it’s up to you to decide which way to go when you write code to process data, although increased clarity, efficiency and speed are quite persuasive when schemas don’t change often. Incidentally, this is how it should be: a schema, if not set in stone, at least shouldn’t be changing frequently. If it is, and you’re not working with a product that’s under development, that points to a more serious problem with the producer.

Of course, there’s a chance the data structure may never change, which means we don’t need to worry about changing the code. The problem with this sort of assumption is that unless you are the one producing the data this code processes, or at least someone you can yell at is responsible for producing it, the schema is going to change at some point. If you’re dealing with a third-party data source, there’s a good chance you won’t know about any changes until this code fails to decode anything.

In this sort of situation you need to assume the worst all of the time, and generally expect the unexpected.
