How strictly should we govern different data sets?
Whilst most of us could do with more data governance, it’s not true that we should always be governing data more tightly
Data governance may be an old discipline, but it has been receiving a lot of attention recently. It was one of the two most talked-about topics at a CDAO event I recently attended in Amsterdam (the other was treating data-as-a-product), and it was at the core of a very interesting debate between Oliver Laslett of Lightdash and Oliver Hughes of Count at an Analytics Engineering London event last month. It was also a big part of Tristan Handy’s Coalesce keynote.
The recent debates about data contracts are really debates about data governance, because data contracts are tools for defining and enforcing data governance. (So the debate decomposes into two issues: should we be governing data more tightly, and if so, what is the best engineering approach to doing so?)
In this post I want to unpick:
Why is data governance having a moment? Why are so many people arguing for more data governance, now?
What is the right level of governance to apply to different data sets? We often assume that more governance is better. That isn’t correct: there are practical, pragmatic steps teams can take to figure out the right level of governance for a specific data set at a specific point in time, and to evolve their approach as the way that data is used in the organization changes.
What is data governance?
There are lots of definitions of data governance that are accurate (on the “what” of data governance) but not helpful if we want to understand why data governance is valuable:
“A system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.” (Data Governance Institute)
“The exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.” (DAMA International)
“The specification of decision rights and accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics.” (Gartner Glossary)
I think it’s helpful to think about data governance in terms of control. The more strictly we control data through its lifecycle, the higher degree of assurance and confidence we have that it is used in a clearly specified set of ways. We can differentiate between how strictly we control:
1. How data is created, processed and delivered to the different consumers (applications and people) that want to use the data.
2. How those data consumers interpret and use the data.
If we tightly govern (1), we end up delivering data to consumers with a high level of assurance around what it means and how accurate and complete it is. If we tightly govern (2), we have a high degree of assurance about how the data is interpreted and used.
Why is data governance having a moment? Why are so many people arguing for more data governance, now?
A number of factors drive the need to govern data more tightly:
Regulation forces us to govern data more tightly, so we can represent to regulators that the way we use data is compliant.
Data socialization and democratization drive us to govern data more tightly. The more that different data consumers are able to self-serve on the data, the higher the risk that someone somewhere misinterprets it, or uses it in a way that we are not comfortable with, whether from an ethical or a methodological standpoint. A nice example comes from accounting: when public companies produce accounts, the processes for calculating the line items in the financial statements need to be very tightly governed, both to ensure that any analyst or shareholder interpreting the data can do so without being misled, and to make it easy to compare line items between different companies, safe in the knowledge that you are comparing apples with apples. Financial audits are checks on that process.
Criticality of use cases drives a need for more tightly governed data. If we are using the data to make a life-or-death decision about how to treat a patient, or whether and where to launch a missile, we want much higher levels of assurance about both the data and the way the data is used to reach the decision than if we’re targeting an advert or plotting traffic trends over time.
Given the above, it’s not surprising that data governance is an increasingly hot topic:
Data and the use of data is being increasingly regulated.
Organizations are socializing more and more data. Behavioral data, for example, was once produced by one team (the digital team) and consumed by that same team; it is now produced by multiple teams and consumed by multiple teams across the business, for a whole host of insights and data applications including personalization, customer segmentation, product analytics, lead scoring, churn reduction and more.
Organizations are increasingly using data not just to build reports and power analysis, but to drive business-critical data applications (pricing optimization and fraud prevention, to give just two examples).
Many organizations have made significant progress socializing more and more data sets, and using data to power more business-critical applications. Previous efforts to do this were gated by technical challenges around collecting the data in a single location, processing it at scale and velocity, and delivering it into tools that enable data consumers to explore and activate it. But as the toolset for working with data has come on leaps and bounds (cloud data warehouses, new tools for building and running data pipelines, next-generation AI and BI tooling, and so on), the constraint has shifted: now it is the quality, consistency and level of assurance around the data itself that is holding organizations back. An increasing amount of attention is therefore shifting to governance to deal with this new constraint, which is why governance, observability, lineage and semantic layers have become hot topics.
Data governance costs
This is an obvious but important point: there is a real cost to implementing data governance. It is cheap to slurp up data at scale and bung it in a data lake. It is harder to be precise about the data we want to create: to tightly specify the engineering processes that will be used to generate, process and deliver that data, and to document and rigorously enforce those processes. There is a return on that investment though: the data will be easier to socialize, because there is a high level of assurance that it means what it says, and it is clear how to use it correctly to drive insight and value. That makes it easier for data consumers to build on, and easier for security teams to ensure compliance with regulation.
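To make that concrete, here is a minimal sketch of what a tight specification for a data set might look like when written down as a contract. Everything in it (the purchase_event name, the fields, the allowed values) is hypothetical and purely illustrative; in practice teams express this kind of contract in whatever format their tooling expects (a schema registry entry, JSON Schema, dbt tests and so on).

```python
# Hypothetical contract for a "purchase_event" data set.
# Field names, types and rules are illustrative only, not a real schema.
PURCHASE_EVENT_CONTRACT = {
    "name": "purchase_event",
    "owner": "product-team",   # the team accountable for producing the data
    "version": "1.0.0",        # consumers can pin against a known version
    "fields": {
        "event_id":     {"type": str,   "required": True},   # unique per event
        "user_id":      {"type": str,   "required": True},
        "funnel_step":  {"type": str,   "required": True,
                         "allowed": {"view", "add_to_basket", "checkout", "purchase"}},
        "basket_value": {"type": float, "required": False},  # optional
        "occurred_at":  {"type": str,   "required": True},   # ISO 8601 timestamp
    },
}
```

The value is not the file itself but the agreement it encodes: the producing team commits to delivering data of this shape, and consumers can build against it without having to ask around.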
So for any data set there’s a trade-off: what’s the right amount of resource to put into governance, to drive an appropriate level of return? Too much and we’re wasting time and effort governing data that might not be used to drive value. Too little and we expose ourselves to unnecessary regulatory risk, and make it too hard for data consumers to build value-creating applications on top of the data.
Data governance is not one size fits all: different data sets require different levels of governance
The right level of governance is going to vary data set by data set, because we use different data sets in different ways. Data sets that are more highly regulated, more widely socialized and used to power more critical applications need to be more tightly governed than data sets that are not. Within an organization there will be a spectrum, with very tightly governed data sets (e.g. master data, financial data) at one end and completely ungoverned data sets at the other (e.g. data related to experiments performed by individual teams, consumed only by those teams, for the duration of those experiments). Most data sets will sit somewhere between these two extremes.
Data governance is not one size fits all: the level of governance for a given data set could well change over the lifecycle of the data
As the uses for a particular data set change, the level of governance that needs to be applied to that data set will need to evolve.
Consider the example of data that describes how customers engage with a purchase funnel. This data might have originally been created by a product team that wants to optimize the purchase funnel. The team is experimenting with different approaches to the funnel, and as they do, they’re running multiple experiments and evolving the way they generate this data very rapidly. Given they are the only people using the data, that is fine. The data can be completely ungoverned. (Or more precisely, informally governed - if someone in the team isn’t sure how to interpret the data, she can ask her teammates, who have all the relevant knowledge.)
Now consider a second team who are working on personalizing the user experience. They want to use the data to understand whether a personalized experience makes it more likely that a user will progress down the purchase funnel. Further, they want to use the data to drive the personalization - if they see that a user has progressed down the purchase funnel with particular items in their basket, that’s valuable product affinity information they can use to build features to power their personalization models.
At this point the level of governance needs to increase. There are now two teams consuming the data. Further, the second team is using the data not just to drive human decision making, but to drive a production data application. The two teams could rely on informal collaboration to ensure that the second team’s use of the data stays correct as the first team evolves it. Or the organization could formalize the specification for the data, and the enforcement of that specification, such that the second team (and a third, if it comes along) can self-serve on the data based on the specification. The latter makes more sense as more people consume the data, and as more business-critical applications are built on top of it. So a data set that started off requiring little governance ends up needing more and more as it is adopted by more and more data consumers in the organization.
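To sketch what “enforcement of the specification” could look like in practice, here is a minimal, hypothetical check that validates incoming records against the contract sketched earlier, before they are delivered to downstream consumers. It assumes the PURCHASE_EVENT_CONTRACT structure above; a real implementation would more likely live in pipeline tooling (schema registry checks, dbt tests, CI checks on the producing service) than in hand-rolled code.

```python
def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record (empty list means valid)."""
    violations = []
    fields = contract["fields"]
    for name, rules in fields.items():
        if name not in record:
            if rules.get("required"):
                violations.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, rules["type"]):
            violations.append(f"{name}: expected {rules['type'].__name__}, got {type(value).__name__}")
        allowed = rules.get("allowed")
        if allowed is not None and value not in allowed:
            violations.append(f"{name}: {value!r} is not an allowed value")
    # Flag fields the contract doesn't know about, so the contract evolves deliberately.
    for name in record:
        if name not in fields:
            violations.append(f"unexpected field: {name}")
    return violations

# Example: a pipeline could run this on every batch and quarantine failing records,
# so the personalization team never builds on data that has silently drifted.
bad = validate_record({"event_id": "e1", "funnel_step": "browse"}, PURCHASE_EVENT_CONTRACT)
# -> ["missing required field: user_id", "funnel_step: 'browse' is not an allowed value",
#     "missing required field: occurred_at"]
```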
Seen in this light, the “pain” of discovering that a data set is not sufficiently governed is a good thing - it suggests that the business is using the data in more sophisticated ways, building new use cases and data applications that (hopefully) deliver real value for the business.
So how should we determine what is the “right” level of data governance for a specific data set?
This is more art than science, but I’d suggest that “pain” is a good guide. If your organization is experiencing pain because data is not sufficiently governed - if there are things you’re trying to do that are being held back by a lack of assurance around the data - then that’s a strong signal that it’s time to up the level of data governance. Conversely, if you’re not experiencing that pain, now is not the time to invest in governance.
This recommendation is at odds with that of a lot of folk I see advocating for data teams to invest in data governance early. The thought, I believe, is that by investing in data governance as early as possible, data teams can avoid the pain of discovering that data quality is a blocker to building a data application or use case down the line. This is misguided. It only makes sense to invest in data governance if it’s going to provide value - if it’s going to reduce regulatory risk, power data socialization or enable more powerful use cases. If not, it is wasted effort. It might turn out that the effort is not wasted, if there is a need at a later stage to socialize the data or build more critical applications on top of it - but at that point we’re gambling: the number of data sets in a modern organization is enormous, and it’s crazy to govern all of them in the hope that there’s a payback down the line.
I’m old enough to remember when the conventional wisdom was to collect all the data you could, so that when you decided you wanted to do something with it, you already had it. That didn’t work, because when you came to use the data you inevitably discovered it wasn’t fit for purpose, and all the expense of collecting it was for nothing. It is the same with data governance - you can spend lots of time and effort tightly governing data, but unless you know how that data is going to be used, you run the risk that the effort (which is more costly than just collecting the data) will be in vain: the data still won’t be fit for purpose, or worse, there is no value-creating data application or use case to be built on it.
So if you’re not experiencing the pain of ungoverned data - don’t bother investing in data governance. There’s some other constraint holding back your data ambition (or something constraining the ambition itself), and that is where your focus and effort should go. And when you do start to experience that pain - take it as a positive. It means your organization is evolving its use of data, and now the investment in governance will pay off.