Data quality is a battle against entropy over time. More often than not, quality is a function of both sheer determination and the unwavering conviction that better data leads to improved and more consistent outcomes. Although data quality isn’t by any means a new topic, the renewed interest is considerably more recent.
This resurrected interest can likely be tied to the confounding promises of “big data” and “data is the new oil” – a nirvana in which the more data that is collected the better the outcome. This fifteen-year-old notion – of quantity over quality – feels about as dead as the dinosaurs that the metaphor is built upon.
In a surprise that absolutely everyone saw coming, it turns out that data quality had taken a back seat. It is conceptually difficult to define, hard to measure and a constantly moving target. While measuring the volume of data isn’t complex, measuring the quality of anything (let alone data) is plagued with difficulty.
The difficulty in measuring data quality arises from a number of sources. Assessments are often subjective (“this isn’t giving me the right numbers for last month”) as well as comparative, making an absolute judgement elusive. Who decides what is considered quality and what isn’t – and who then owns the responsibility of not just identifying but rectifying any issues?
Challenging things are mostly worth doing though – and ensuring that you have high quality, validated data is no exception.
High quality data results in increased trust and adoption, and often unlocks solutions or identifies problems that weren’t previously visible. Conversely, problems identified in data at any stage of collection, processing or visualisation erode trust in a way that can have long-standing effects.
Data that can’t be trusted can have widespread effects beyond the dataset itself. Individuals or teams may be less likely to trust data from this source or team in the future, or they may decide that the tool or vendor collecting or processing the data is itself tainted. In certain situations this may result in the establishment of alternate tools or processes that are perceived to be more trustworthy. In others, it creates an environment where more decisions are based on feel instead of being supported or rejected by available information.
There are a number of cornerstones of data quality – and in the first instalment of this series I’m going to focus on accuracy, reliability, timeliness and predictability. In the second post I’ll cover some of what I consider to be the more overlooked components of quality, including safety, explainability, privacy and bias.
I hope that from reading this post you can take away an idea of how to think about and measure aspects of data quality, what the challenges are in doing so, and some incremental steps towards asking and answering questions around measurement.
Accuracy refers to how far a measurement is from the “true” value. It’s not always possible to identify a true value, but careful calibration and testing can increase the likelihood that we detect drift between our measurements and what we are intending to capture.
Much like writing tests for code (as in test-driven development), we can and should write tests and assert values for data: given an input, we expect a certain predictable output or transformation.
These tests are often determined by rules that attempt to map data into simple understandable constraints:
- A price should be a number in dollars, with two decimal places, between 0.01 and 1000.
- The name of a product should be a string and should not exceed 255 characters.
Tests can easily assert structure, but to assert whether a value is correct often requires running tests in an environment that checks expected versus actual values.
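As a minimal sketch of the two rules above, the checks could be written as plain Python functions. The function names `check_price` and `check_product_name` are hypothetical, and treating a numeric string such as "19.99" as "not a number" is an assumption about how strictly the rule should be read.

```python
from decimal import Decimal

def check_price(value):
    """Return a list of rule violations for a dollar price (empty if valid)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float, Decimal)):
        return ["price is not a number"]
    price = Decimal(str(value))
    if not price.is_finite():
        return ["price is not a finite number"]
    problems = []
    # A negative exponent counts the digits after the decimal point.
    if price.as_tuple().exponent < -2:
        problems.append("price has more than 2 decimal places")
    if not (Decimal("0.01") <= price <= Decimal("1000")):
        problems.append("price is outside 0.01..1000")
    return problems

def check_product_name(value):
    """Return a list of rule violations for a product name (empty if valid)."""
    if not isinstance(value, str):
        return ["name is not a string"]
    if len(value) > 255:
        return ["name exceeds 255 characters"]
    return []
```

Checks like these only assert structure. Whether 19.99 is the right price for a given product still needs a comparison against an expected value.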
Multiple grammars have been developed for expressing these constraints, both in isolation (for an individual data point) and together (for example, two measurements that are related to each other). JSON Schema is a good example of this and has validation libraries in many programming languages. Schemas defined in JSON are human readable and can carry validation rules in addition to the typing and nullability requirements that formats such as Avro and Protobuf express.
In addition to validating a single event, we should consider, for every measure, the level at which we are measuring. Is it an individual data point, a collection of data points belonging to the same object, or a collection of related events (in time or in nature)? Good data quality practice tests all three, as issues may be isolated or related.
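To illustrate, the price and name rules could be expressed as a JSON Schema document. The sketch below uses real JSON Schema keywords (`type`, `required`, `properties`, `maxLength`, `minimum`, `maximum`), but the `validate` function is a hand-rolled illustration that understands only that small subset. It is not a substitute for one of the existing validation libraries, and `PRODUCT_SCHEMA` is a hypothetical schema.

```python
import json

# Hypothetical schema for a product event, written as a JSON Schema document.
PRODUCT_SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["name", "price"],
  "properties": {
    "name":  {"type": "string", "maxLength": 255},
    "price": {"type": "number", "minimum": 0.01, "maximum": 1000}
  }
}
""")

_TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, but JSON treats it as its own type.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}

def validate(instance, schema, path="$"):
    """Check `instance` against the keyword subset above; return error strings."""
    expected = schema.get("type")
    if expected is not None and not _TYPE_CHECKS[expected](instance):
        return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, str) and "maxLength" in schema:
        if len(instance) > schema["maxLength"]:
            errors.append(f"{path}: longer than {schema['maxLength']} characters")
    if _TYPE_CHECKS["number"](instance):
        if "minimum" in schema and instance < schema["minimum"]:
            errors.append(f"{path}: {instance} is below minimum {schema['minimum']}")
        if "maximum" in schema and instance > schema["maximum"]:
            errors.append(f"{path}: {instance} is above maximum {schema['maximum']}")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: required property is missing")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    return errors
```

Keeping the rules in a schema document, separate from the code that enforces them, means producers and consumers can share the same definition.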
- How close is the measured value to the “true” value? Can this value be calibrated against a known / ground truth?
- Is a system set up to identify drift between observed values and actual values (where that is possible)?
- Does accuracy lie with the producer or consumer? Ideally pushing this upstream towards the producer makes more sense.
- Can the accuracy of a single data point or event influence data collected in the past or future or are they independent?
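Where a ground truth is available, the first two questions can be turned into a simple check. The sketch below compares observed measurements against reference values and flags drift when the mean absolute error exceeds a tolerance. The function names and the choice of mean absolute error as the metric are assumptions for illustration; other metrics may suit a given dataset better.

```python
from statistics import mean

def mean_absolute_error(observed, reference):
    """Average absolute difference between paired observed and reference values."""
    if not observed or len(observed) != len(reference):
        raise ValueError("observed and reference must be non-empty and equal in length")
    return mean(abs(o - r) for o, r in zip(observed, reference))

def detect_drift(observed, reference, tolerance):
    """Return (drifted, error), where drifted is True if error exceeds tolerance."""
    error = mean_absolute_error(observed, reference)
    return error > tolerance, error

# Usage: calibrate a sensor against known reference weights (made-up values).
reference = [10.0, 20.0, 30.0]
drifted, error = detect_drift([11.0, 22.0, 33.0], reference, tolerance=1.5)
```

Running a check like this on a schedule, with the error recorded each time, gives a history that makes gradual drift visible.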
Data reliability is all about consistency. The ability to measure something truthfully (accuracy) is useful – but if we can’t measure the same thing consistently over a time period then we can’t make claims about its reliability.
When thinking about reliability we should ask ourselves:
- Completeness: Are we collecting the full set of data we expect? What are the proportions of NULL, missing, or incorrect values?
- Consistency: Is data collected in a consistent manner? If the same data was collected repeatedly would the recorded output be close together (precision) or disparate? Writing tests for data – and running them regularly (e.g., as part of a CI/CD pipeline) is useful in determining consistency over time. Whilst passing tests indicate stability, failing tests may indicate flaky collection or processing.
- Storage: How robust is the mechanism for storing data? This is likely to vary depending on whether data is being stored in memory, on disk, in an object store etc. If the same data is stored in two different systems, how closely do their representations match?
- Recoverability: Is it possible to detect data that has been corrupted (e.g., with checksums) and, if so, either to correct for that corruption or to discard the data partially or entirely? Is data replicated across regions in the event of outages or natural disasters? How long must the data be stored for? This may depend on industry and legislative requirements.
- Transit: When data is transmitted, is it done reliably, e.g., in a lossless manner? Data truncation, encoding and deliverability issues can impact data on its way from source to destination. Can we assert that data from source to sink meets our expectations?
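Two of the questions above lend themselves to short checks: the proportion of missing values (completeness) and whether a record arrived unchanged (recoverability and transit). The sketch below is one possible approach, with hypothetical function names, and it assumes records are JSON-serialisable dictionaries.

```python
import hashlib
import json

def missing_ratio(records, fields):
    """For each field, return the fraction of records where it is absent or None."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if record.get(field) is None) / total
        for field in fields
    }

def checksum(record):
    """SHA-256 digest of a record, serialised so that key order does not matter."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def arrived_intact(received_record, digest_from_source):
    """True if the record at the sink hashes to the digest computed at the source."""
    return checksum(received_record) == digest_from_source

# Usage with made-up records: half of the prices are missing.
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": None},
    {"id": 3},
    {"id": 4, "price": 1.0},
]
ratios = missing_ratio(records, ["id", "price"])
```

A checksum detects corruption but does not repair it. Correcting the data would need redundancy, such as a replica or an error-correcting code.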
Predictability is a property that emerges from both accuracy and reliability. Predictability enables us to hold expectations, based on the present, about how data will behave and be processed in the future. Unpredictable data may change over time, whether because of changes at the data source or because of context that the user is unaware of.
Poor data quality on either of these measures impacts the predictability component of data quality.
- Are changes in the underlying sources or collection instruments detected and documented? How are sources and tools versioned over time?
- If data changes in the source system in an unpredictable manner:
  - Are systems in place to detect changes in source data that may impact downstream systems?
  - When systems fail or degrade, what is the expected behaviour (e.g., partial data missing, data missing entirely, data arriving late) and does that behaviour occur?
- If data changes in the source system in an expected manner:
  - Is the data semantically compatible (e.g., are old and new values both in dollars, or has one moved to euros)?
  - Is the data structurally compatible (e.g., has a captured value been renamed, removed or changed in type)? This often depends on the data store and any indexing mechanisms.
- Are changes in the environment that may impact quality captured (e.g., changes in browser functionality, availability of APIs)?
- Is there an explicit contract between the producer and consumer of data as to how this data should behave? Is this a mutually agreed upon contract, or a contract that only the consumer or the producer follows? A producer may include algorithms, tools and people.
- Is there bias (implicit or explicit) within the collection or processing of the data?
  - For bias corrections, is there a mechanism in place to detect or correct for drift in these assumptions?
- Is data sampled?
  - If so, how is the data sampled and how does this impact analysis or processing? (e.g., data collected immediately post-purchase may influence NPS)
- Do certain data points or processes exist that may propagate error over time? (e.g., bot traffic influencing the outcome of an A/B test). Can these variables / effects be accounted or controlled for?
- Can data from users be distinguished from data sent by bots, crawlers, spiders or exploit scanners? (e.g., excluding bot traffic from viewability analysis)
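The structural compatibility question above can be checked mechanically when each version of a source is described as a mapping of field names to types. The sketch below is illustrative only. The function names are hypothetical, and treating removals and type changes as breaking while treating additions as safe is an assumption that will not hold for every data store.

```python
def structural_changes(old_schema, new_schema):
    """Compare two {field_name: type_name} mappings and list what changed."""
    old_fields, new_fields = set(old_schema), set(new_schema)
    return {
        "removed": sorted(old_fields - new_fields),
        "added": sorted(new_fields - old_fields),
        "retyped": sorted(
            field for field in old_fields & new_fields
            if old_schema[field] != new_schema[field]
        ),
    }

def is_backward_compatible(changes):
    """Assumption: removed or retyped fields break consumers, new fields do not."""
    return not changes["removed"] and not changes["retyped"]

# Usage with made-up schema versions: a rename shows up as one removal
# plus one addition, and the price changes type.
v1 = {"product_name": "string", "price": "number"}
v2 = {"name": "string", "price": "string", "currency": "string"}
changes = structural_changes(v1, v2)
```

A comparison like this only covers structure. Semantic changes, such as a price switching from dollars to euros while remaining a number, need a separate check or an explicit contract.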
The ability to collect data on time enables us to reason about the world and make decisions faster (though not necessarily to make better decisions).
When thinking about timeliness we should consider the differences between:
- The time taken to collect data
- The time taken to send data (may be very fast when processing locally, or significantly longer e.g., IoT to the cloud)
- The time taken to process data (to modify, validate or enrich)
- The time taken to invoke an action or decision (reaction and response time)
- The time to capture data from the action back into the system (feedback mechanism) – this could be an indicator that an action has been performed or data about the outcome of a given action (if available)
Collecting and processing data fast is great, but without the capability to detect changes or perform an action it is often of limited utility and in some circumstances a hindrance. Thinking about time allows us to set service level agreements (SLAs) around when data is expected to enter or exit each of these gates, so that we can set clear measures and alarms around the deliverability of data.
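As a rough sketch of how such SLAs could be measured, the example below records a timestamp as an event passes each stage, computes the time spent between consecutive stages, and reports which transitions exceeded their limit. The stage names, function names and thresholds are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical stage names, in the order an event passes through them.
STAGES = ["collected", "sent", "processed", "actioned", "feedback_captured"]

def stage_latencies(timestamps):
    """Return the elapsed time between each pair of consecutive recorded stages."""
    return {
        f"{start}->{end}": timestamps[end] - timestamps[start]
        for start, end in zip(STAGES, STAGES[1:])
        if start in timestamps and end in timestamps
    }

def sla_breaches(latencies, slas):
    """Return the names of transitions whose latency exceeds the agreed limit."""
    return sorted(
        name for name, limit in slas.items()
        if name in latencies and latencies[name] > limit
    )

# Usage with made-up timestamps and limits.
t0 = datetime(2024, 1, 1, 12, 0, 0)
event = {
    "collected": t0,
    "sent": t0 + timedelta(seconds=2),
    "processed": t0 + timedelta(seconds=10),
    "actioned": t0 + timedelta(seconds=11),
}
slas = {
    "collected->sent": timedelta(seconds=5),
    "sent->processed": timedelta(seconds=5),
}
breaches = sla_breaches(stage_latencies(event), slas)
```

Measuring each transition separately shows where the time is being spent, which a single end-to-end number would hide.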
We’ve looked at accuracy, reliability, predictability and timeliness with respect to data quality in this post. Data quality is reasonably clear in theory; however, quantifying quality in the real world is often difficult, error-prone and murky. It is important to recognise that improving data quality is an incremental process, and to measure improvement rather than aim for an absolute assessment. In the next post we’ll look at lineage and ownership, privacy and explainability.