
Fertile data: why building quality into your process is important
By
Published: July 19, 2018
When it comes to data quality, most technologists are familiar with the adage "Garbage in, garbage out"; and yet today, most organizations appear to be content with wallowing in junk. Just 3% of organizations examined by Harvard Business Review had data that met basic quality standards.
That's worrying. If you're planning to increase your use of AI and machine learning to automate some decision making, the actions you take will be fatally compromised if your data isn't good. Little wonder then that senior IT decision makers see data quality as a pressing concern.
There's often a fear that data quality initiatives will become a Sisyphean task, where endless effort is expended to little effect. But it need not be that way. Huge quality improvements can be realized by applying the principles we routinely use in software development to data.
Fertile data: soil not oil
We're often told that data is the new oil, but it's a pretty weak analogy. Having lots of data won't make you rich. Many of the clients we work with have an abundance of data, but they also struggle to derive significant value from it.

Instead of thinking about data as oil, we prefer to think of it as soil: something that rewards the time and effort you invest in cultivating it. And that's where data quality comes in. Even if data quality isn't your primary driver for an initiative, by making it a routine aspect of the way you work, you can make small investments that pay off in the long run.
Recently, I was working for a large, multinational client that wanted to improve the effectiveness of its online store: to better understand the customer payment conversion funnel, and to spot any important patterns in incomplete journeys.
And these folks have a lot of user behavior data: they capture events for everything you click or type while browsing their store, from the buttons you press to buy a product to the number of scrolls or swipes you make while reading a product description. There are petabytes of anonymous user behavior data, and it grows by hundreds of terabytes every day.

We'd spoken to the client about the customer journey and the happy path customers would take through the store. But as I went through the data to find examples to validate my understanding, I kept coming up with instances where that didn't happen: customers were following some path that we didn't properly understand. And that's one of the challenges with big data: you find your edge cases actually happen quite frequently.
The problem for us was that, in trying to come up with ways to improve this web store, we now needed to understand these edge cases. Why were customers making unexpected decisions to abandon a purchase? Were we seeing this unusual behavior because of a bug? Was it because we didn't capture an important bit of information? Were we capturing the wrong information?
To investigate this, we weren't just looking at user clicks, but also at all the logs for back-end processes triggered at different points throughout the journey. We were actually doing a lot of QA by trying to understand this data: going back to engineers with bugs on both the server side and the front-end side. They were aware of these small things but didn't really see the positive impact of fixing them.
Once engineers on both sides were able to see the cumulative effect of these minor data issues, they were keen to get them fixed. But ideally, you should be looking to fix these systems before they're put into production.
Cultivating your data
You might take the view that data quality is important, but that it's something that exists separately from your day-to-day work: that there's a data quality department or function that will take care of it at some point down the line. That's where we can learn from how the QA practice evolved in the broader field of software development: we also used to have a separate QA department or function, but Agile and Lean thinking helped us bring quality into the software development process.

But if you think about fertile data, caring for quality so that you can really deliver powerful insights, you'll start to appreciate that you need to cultivate it closer to the source: to treat data quality as an integral part of everything you do.
That's not to say you will solve all your data quality problems. But if you only ever fix them downstream, you'll never get round to fixing the source, and it can cost you more later on, since more consumers will have to deal with the same issues. Or, worse, they will start consuming the "clean data" from the downstream system instead of the source of truth. This is a common anti-pattern: aggregated data from an enterprise data warehouse is used as input to transactional systems, simply because the warehouse has already gone through the pain of fixing earlier data quality issues.
So what do we mean by making data quality part of your routines? We can draw a parallel with software development practices by looking from two perspectives: the activities we do in pre-production and those we carry out once we're live.
In pre-production, if you really care about cultivating your data, and you find there's a problem, you stop the line and fix it so that you don't produce junk further down the road.
You can think about testing your data quality much as you'd think about testing your code.
You'd typically have a bunch of tests to determine whether your code is working. For data, we think it's useful to have a published schema for the events you expect to produce, and an idea of how those events flow. That allows you to perform some basic sanity checks on your data, to ensure you are not producing something that doesn't fit with the schema. Schema evolution testing and publishing can also be added to your deployment pipeline, to ensure changes are backward- and forward-compatible.
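As a rough illustration, here's a minimal sketch of that kind of sanity check in Python using the jsonschema library; the "checkout_started" event and its fields are hypothetical, not taken from any real schema.

```python
# A minimal sketch of a pre-production sanity check against a published event
# schema. The event name and fields below are illustrative assumptions.
from jsonschema import validate, ValidationError

CHECKOUT_STARTED_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"const": "checkout_started"},
        "session_id": {"type": "string"},
        "timestamp": {"type": "string"},
        "cart_value": {"type": "number", "minimum": 0},
    },
    "required": ["event_type", "session_id", "timestamp"],
}

def assert_fits_schema(event: dict) -> None:
    """Stop the line in pre-production if an event doesn't fit the published schema."""
    try:
        validate(instance=event, schema=CHECKOUT_STARTED_SCHEMA)
    except ValidationError as error:
        raise AssertionError(f"Event violates published schema: {error.message}") from error
```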
Sometimes this process is more like functional testing. For instance, we might expect an element to look like a date, so we could assert that the data is in the right format and within a valid range of dates: if you're getting sales data from the 1800s, you might have a data quality issue.
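A functional-style check on a date field might look something like the sketch below; the "sale_date" field name and the year-2000 cut-off are illustrative assumptions.

```python
# A small functional-style data test: assert a value parses as a date and
# falls within a plausible range. Field name and bounds are hypothetical.
from datetime import date, datetime

def check_sale_date(raw_value: str) -> date:
    parsed = datetime.strptime(raw_value, "%Y-%m-%d").date()  # raises ValueError if malformed
    if not (date(2000, 1, 1) <= parsed <= date.today()):
        # Sales data "from the 1800s" would be caught here.
        raise AssertionError(f"Sale date out of expected range: {parsed}")
    return parsed
```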
But the "stop the line" thinking doesn't work so well for systems in production.
You can't always just stop a production system to fix a data quality issue.
I learned this lesson from a client that was building a product that ingested data from government sources. We were consuming a data feed with the aim of surfacing interesting information to their customers, and we were working with the client to build that product.
That meant we'd had to make various assumptions about the data that was coming from this external feed. But we kept hitting problems with the data: sometimes fields were missing, or the data didn't make sense. And that was just breaking our product features.
To work out what was happening, we started writing tests that captured our assumptions about the data, to establish what sort of patterns would break our system. As we built those tests, we found more and more data quality issues. As a result, we incorporated those tests into our data ingestion pipeline, so every day, when the client gave us a new dump of the data, we ran our tests against it.
The problem we faced here was that we were able to spot things that might break the product, but stopping the data pipeline to fix them would mean that the product would surface outdated information.
And that's why I started to question the idea of stopping the line for production issues. Instead, we could ingest the data that was good and simply mark the records that were wrong for follow-up.
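One way to sketch that idea, assuming a simple batch of record dictionaries and some validity check, is to partition each daily dump into records to ingest and records to set aside for follow-up:

```python
# A sketch of "ingest the good, flag the bad": keep the pipeline flowing with
# valid records while quarantining the rest. Record shape and checks are
# illustrative assumptions, not the client's actual pipeline.
from typing import Callable, Iterable

def split_daily_dump(records: Iterable[dict],
                     is_valid: Callable[[dict], bool]) -> tuple[list[dict], list[dict]]:
    to_ingest, to_follow_up = [], []
    for record in records:
        (to_ingest if is_valid(record) else to_follow_up).append(record)
    return to_ingest, to_follow_up

# The product keeps surfacing fresh information from `to_ingest`, while
# `to_follow_up` is logged or stored for someone to investigate later.
```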
This approach is more similar to the software development practices for doing QA in production: observability and monitoring become the preferred approach.
You can set up thresholds to define your normal expectations and only take action when you get alerts that a threshold has been crossed. Whereas in pre-production you might want to aim for as close to perfection as possible, once you're in production you're looking at what level of quality is acceptable, and adapting as you learn more about the data flows.
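In code, that kind of threshold check can be as simple as the sketch below; the 2% default and the alerting call are assumptions for illustration, not figures from the project.

```python
# A minimal threshold-based monitoring check: only raise an alert when the
# share of bad records crosses the level you consider abnormal. The 2% default
# is an illustrative assumption.
def check_error_rate(total_records: int, bad_records: int, threshold: float = 0.02) -> None:
    error_rate = bad_records / total_records if total_records else 0.0
    if error_rate > threshold:
        # In a real system this would feed your monitoring and alerting tooling.
        print(f"ALERT: data error rate {error_rate:.1%} exceeds threshold {threshold:.0%}")
```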
Making the change stick
While I firmly believe that we can learn a lot about improving data quality from looking at software development best practice, that doesn't mean solving these issues is easy. A lot of our clients don't have the structures in place to be able to implement these types of practices, so there's a cost of entry that will require an upfront investment.

It is also unlikely that a single data quality program initiated by senior decision makers is going to solve your problems. Instead, data quality needs to become part of your development process, so that the teams producing or consuming data are the ones thinking about data quality all the time.
These are big changes for many organizations. And if you want to cultivate and use data to empower your business, we think it's the right way to go.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.