A Lexicon for "Data-Quality"

Of late I have been thinking a lot about data: how we consume it, how we store it, its often temporal nature, what it means to have "good" data, and even how we dispose of it.

source: dokuti project ( https://github.com/GrindrodBank/dokuti )

You may have heard the saying "garbage in, garbage out", which dates back to the era of the first punch-card computers. You may also have heard executives talk about "data quality". These ideas inform our mental model of data, but unfortunately they do not take us far enough.

Why is the single notion of "quality" not a good-enough way to think about your data? Mainly because the word means different things to different people. For the executive it might mean something about the data's predictive capability; for a customer relationship manager it might mean that all the mandatory fields have been captured; for the developer it might simply mean that every field is of the correct type and format.

So, of necessity, we need a broader lexicon than just "quality" when dealing with our data.

Let's start with the idea of data-lineage. This concept means that we know where our data comes from. Data has a way of being populated from various sources at various times under varying conditions. ETL (Extract, Transform & Load) is a common example where the lineage of data gets obscured unless we also keep meta-data about its origin. Useful related concepts here include: identifying source documents & data, idempotent regeneration of operational data from these source data, and recording all relevant ETL processing in between. Further, once data has been operationalised, its lineage is also determined by all the operational changes made in the normal course of business, i.e. we need to keep track of what changed, when, by whom and for which reasons.
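As a minimal sketch of what keeping lineage meta-data during ETL could look like, the Python below attaches a small lineage record to every transformed row. The names used here (`Lineage`, `fingerprint`, `transform_customer`, `etl`, and the `crm_export.csv` source label) are hypothetical and chosen for illustration only; they are not taken from any particular tool.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Lineage:
    source: str        # where the raw record came from (hypothetical label)
    source_hash: str   # fingerprint of the raw record
    transform: str     # name of the ETL step that produced the output
    loaded_at: str     # ISO-8601 UTC timestamp of the load


def fingerprint(raw: dict) -> str:
    """Stable SHA-256 fingerprint of a raw record (keys sorted for determinism)."""
    payload = json.dumps(raw, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def transform_customer(raw: dict) -> dict:
    """Pure transformation: the same raw input always yields the same output."""
    return {
        "name": raw["name"].strip().title(),
        "email": raw["email"].strip().lower(),
    }


def etl(raw: dict, source: str, now: Optional[datetime] = None) -> dict:
    """Transform one raw record and attach lineage meta-data to the result."""
    now = now or datetime.now(timezone.utc)
    lineage = Lineage(
        source=source,
        source_hash=fingerprint(raw),
        transform="transform_customer",
        loaded_at=now.isoformat(),
    )
    return {"data": transform_customer(raw), "lineage": lineage}


raw = {"name": "  ada LOVELACE ", "email": " Ada@Example.COM "}
first = etl(raw, source="crm_export.csv")
second = etl(raw, source="crm_export.csv")
# Regenerating from the same source record gives the same operational data.
assert first["data"] == second["data"]
assert first["lineage"].source_hash == second["lineage"].source_hash
```

Because the transformation is a pure function of the source record, re-running the load reproduces the same operational data, and the stored fingerprint lets us confirm later which source record a row was derived from.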

Whereas lineage is mostly concerned with where your data has been, the closely related concept of data-provenance is concerned with who has owned the data (in addition to where it has been). This extends over the data's entire existence, from creation or source, through its archival and right up to the moment of its destruction. We interpret "ownership" to include who may have had "access" to the data. Data-provenance therefore encompasses the idea of who has had what kind of access, when, and to which data, for the data's lifetime. This is particularly relevant in a regulatory environment which prioritises consumer data protection rights, such as GDPR (General Data Protection Regulation - EU, 2016) & PoPI (Protection of Personal Information Act - South Africa, 2013), or in the card payment space where PCI-DSS (Payment Card Industry Data Security Standard) may be applicable.

One key concept implied when referring to data quality is the veracity of the data, that is, its truthfulness or correctness. Whilst this is often assumed because "computers never lie" (or at least not yet), there are many reasons why the veracity of the data consumed by business cannot just be taken for granted. These reasons range from the technical, such as data type conversion errors, to design flaws that allow for conflicting semantic interpretations of free-form input data, to poor data validation, to the absence of ongoing data validation to catch the "cruft" that gathers in the corners of databases & file-systems over time. To have any confidence in data veracity, data needs to be continually, and preferably automatically, tested in the same way that code, reports and processes are tested.
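To make the idea of testing data like code a little more concrete, here is a minimal Python sketch of an automated record check. The field names (`customer_id`, `amount`, `captured_on`) and the rules themselves are assumptions made purely for illustration; real rules would come from your own data dictionary.

```python
from datetime import date
from typing import Dict, List

# Hypothetical mandatory fields for an illustrative record layout.
REQUIRED = ("customer_id", "amount", "captured_on")


def find_problems(record: dict) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Mandatory fields must be present and non-empty.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # The amount must convert cleanly to a non-negative number.
    amount = record.get("amount")
    if amount not in (None, ""):
        try:
            if float(amount) < 0:
                problems.append("negative amount")
        except (TypeError, ValueError):
            problems.append("amount is not numeric")

    # The capture date must be a valid ISO date (YYYY-MM-DD).
    captured = record.get("captured_on")
    if captured not in (None, ""):
        try:
            date.fromisoformat(captured)
        except (TypeError, ValueError):
            problems.append("captured_on is not an ISO date")

    return problems


def audit(records: List[dict]) -> Dict[int, List[str]]:
    """Map the index of every failing record to its list of problems."""
    report = {}
    for index, record in enumerate(records):
        problems = find_problems(record)
        if problems:
            report[index] = problems
    return report


rows = [
    {"customer_id": "C1", "amount": "12.50", "captured_on": "2019-03-01"},
    {"customer_id": "", "amount": "12,50", "captured_on": "01/03/2019"},
]
for index, problems in audit(rows).items():
    print(index, problems)
```

A check like this could run on a schedule against the live data, in the same way a test suite runs against code, so that cruft is reported as it appears rather than discovered during a once-off clean-up.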

Data veracity is one of several factors that give rise to data reliability. Reliability really means that business can rely on the data & its structure or format. If, however, the data remains "truthful" but changes in other unexpected ways (in its format, structure, taxonomy or even accessibility, for example) then it cannot be relied upon as expected. Data-reliability is as much a consequence of system & database design as it is of ongoing validation & maintenance.

As data grows, a structural form of entropy often embeds itself in the data, and therefore in the applications that access that data, and eventually in the business processes that rely on those systems. This occurs because, in the early stages, the taxonomy of the data was not clearly understood, nor were the structures, groupings and relationships between data effectively understood or designed. Data simply tends to accumulate organically, just as growing organisations do. Simply put, life happens!

Bringing meaning to your data by arriving at a meaningful data dictionary, revisiting your data entities and the relationships between them, and normalising data to eradicate duplication will go a long way towards unlocking the value you have in your data. Establishing a data-warehouse to further explore your data then becomes a doable next step.

Other ideas that we might consider for a useful data lexicon include: timeliness (especially in the context of ETL & dependencies on external sources of data) and temporal (such as historical) data; data aggregation for handling large data sets, along with data anonymity (as may be required in aggregated data for compliance and data sharing reasons); and finally data obfuscation, which allows developers & testers to work on full & consistent data sets that are obfuscated for privacy while still preserving integrity in format & in the relationships between data entities.
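As one possible illustration of obfuscation that preserves format and relationships, the Python sketch below uses a keyed hash (HMAC) so that the same input always maps to the same obfuscated value. This is an assumed technique chosen for the example, not a description of any particular tool; `SECRET_KEY`, `obfuscate_email` and `obfuscate_digits` are hypothetical names, and the digit mapping is slightly biased, so treat it as a sketch rather than a vetted privacy control.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice keep it out of source control.
SECRET_KEY = b"replace-me"


def pseudonym(value: str, length: int = 12) -> str:
    """Keyed, deterministic token: the same input always gives the same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def obfuscate_email(email: str) -> str:
    """Replace an email with a stable fake address that is still email-shaped."""
    return f"user_{pseudonym(email.strip().lower())}@example.invalid"


def obfuscate_digits(number: str) -> str:
    """Replace each digit with a keyed digit, keeping length and separators."""
    digest = hmac.new(SECRET_KEY, number.encode("utf-8"), hashlib.sha256).digest()
    out = []
    used = 0
    for ch in number:
        if ch.isdigit():
            out.append(str(digest[used % len(digest)] % 10))
            used += 1
        else:
            out.append(ch)  # keep "+", spaces and dashes where they were
    return "".join(out)


customers = [{"email": "ada@example.com", "phone": "+27 82 555 0199"}]
orders = [{"customer_email": "ada@example.com", "total": "12.50"}]

safe_customers = [
    {"email": obfuscate_email(c["email"]), "phone": obfuscate_digits(c["phone"])}
    for c in customers
]
safe_orders = [
    {"customer_email": obfuscate_email(o["customer_email"]), "total": o["total"]}
    for o in orders
]
# The relationship between the two data sets survives obfuscation.
assert safe_orders[0]["customer_email"] == safe_customers[0]["email"]
```

Because the mapping is deterministic, values used as keys across tables still join after obfuscation, and because separators and lengths are kept, format validation in the application under test continues to behave as it would on real data.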

So one could argue that data quality is indeed a real thing, and as part of our brave new lexicon it is. Encompassing several of the other concepts above, such as veracity, it extends to the smallest issues, such as camel-case versus all-caps in captured fields, the simple consistency of date formats, or that of the numeric positional notation used. Data clean-ups are often performed as once-off exercises, but data quality requires continuous, ongoing vigilance over your data. Automated quality checking is one way to achieve this.
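A small example of such an automated check, sketched in Python under stated assumptions: it profiles a column of date strings and reports how many values fall into each format. The format labels and the list in `KNOWN_FORMATS` are hypothetical; a real check would list whichever formats your own sources are known to produce.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

# Hypothetical set of date formats we expect to encounter, checked in order.
KNOWN_FORMATS = {
    "ISO": "%Y-%m-%d",
    "day-first": "%d/%m/%Y",
    "compact": "%Y%m%d",
}


def classify(value: str) -> str:
    """Return the label of the first known format that parses the value."""
    for label, fmt in KNOWN_FORMATS.items():
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            continue
    return "unrecognised"


def format_profile(values: Iterable[str]) -> Counter:
    """Count how many values in a column fall into each date format."""
    return Counter(classify(v) for v in values)


def is_consistent(values: Iterable[str]) -> bool:
    """A column is consistent when every value uses one recognised format."""
    profile = format_profile(values)
    return len(profile) == 1 and "unrecognised" not in profile


column = ["2019-03-01", "01/03/2019", "2019-03-02", "March 3rd"]
print(format_profile(column))
print(is_consistent(column))
```

Run regularly, a profile like this turns "the dates look a bit messy" into a number that can be tracked over time, which is what distinguishes ongoing vigilance from a once-off clean-up.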

The above discussion is far from complete, but hopefully it highlights the need to avoid the overly simplistic approach of labelling all data challenges as "data-quality" issues, and opens the door to adopting a broader and more meaningful lexicon by which to comprehend and address your data.

If you need help with your data, feel free to contact us for assistance.

#data #database #dataquality #ETL #PostgreSQL
