"Data is the 21st century oil". If you work anywhere in the vicinity of data, odds are you've heard some variation of this statement at least once. But while the value of data and data-driven decision making is becoming increasingly more apparent, it is not immediately obvious how to build and maintain a healthy data ecosystem.
In fact, this is not a trivial endeavor at all. One of the main reasons is that there are several pitfalls that are easy to fall into, and which only become apparent down the road, when your data grows in size or number of users. At that point, it becomes difficult to fix the problems without disrupting your company's workflow.
The goal of this blog post is to hopefully shed some light on what a healthy data ecosystem looks like, so that you can exploit your company's data to its full potential.
What is a data ecosystem?
We define a company's data ecosystem as the collection of processes a company uses to collect, store, analyze, and leverage data - as well as all the participants in these processes, both developers and consumers.
We can draw a lot of parallels between the components of a data ecosystem and the process of oil extraction. Take a look at the following diagram, which details the life-cycle of oil:
We have crude oil on the ground, which gets extracted by the oil rig; it then gets transported to the refinery, where it is modified according to the needs of different final consumers. Finally, the refined oil is delivered to different locations.
A lot of parallels can be drawn between this diagram and the processes/technologies used in a data ecosystem.
First, there's the crude oil. This is your source data - unprocessed data detailing some part of your company's activity. For example, logs from the devices you sell, or information about sales of your products. This data will be at the root of your data ecosystem.
You have the oil storage tanks, which are where the crude oil is stored in bulk. In the data ecosystem, this role is taken by the data lake.
Then, you have the oil tanker transporters. These are your data ETL (extract, transform, load) processes, your CI/CD (continuous integration/continuous deployment) pipelines, and other automated processes involved in moving the data to the "refinery", which will be your data warehouses. Usually, these are tables containing a complete overview of your data.
Last but not least, we have the final consumers - petrol stations, airports, and ports. Each of these requires a different type of processed oil. These processed oils, intended for usage by final consumers, are what we call "data marts". They can be, for example, specialized views over your data warehouse that filter only the data of interest for some final consumer.
It's important to note that all these different components can have different owners: some people will need to own the original data sources (the land the crude sits on), others will need to own the pipeline so they can actually fix it if it breaks, and still others might own the tanks where the oil is stored. This is why we always advocate for good DevOps and automation practices, as these allow the multiple systems to interact while minimizing human intervention (and error).
As you read the next chapters, keep the oil diagram in mind, and try to identify what parts of the process each chapter refers to.
The pillars of a solid data ecosystem
Many people have tried to conceptualize or summarize the different components of a data ecosystem, and everyone has their own opinion on what the pillars of a healthy one are. My own opinion on the matter is constantly evolving as I broaden my knowledge of this topic.
There are, however, a few core ideas whose proper implementation - or lack thereof - has had a very noticeable impact in the different data ecosystems I have worked in.
They are the following:
- data accessibility
- data discovery
- data lineage
- data ownership
These are, in my opinion, the pillars of a strong data ecosystem. In the following chapters I will briefly explain them and give some examples of how to implement them in a sustainable way.
Data accessibility
Out of the four concepts mentioned, this is probably the easiest one to understand. Your data only has value if it's accessible to the people who can build something with it, or make decisions based on it.
To stay with the oil metaphor, it doesn't really matter if you're sitting on a huge pool of oil if you don't have the tools necessary to drill down into it and the pipes to bring it to the surface.
Situations like these are particularly easy to spot in older companies, where most of the data sits in old, proprietary databases under lock and key, accessible only to a select few. Most other people can only look at the data through the keyhole, and are not empowered to explore it and transform it according to their needs.
Another concept that gets a lot of attention nowadays is that of a "self-service" data ecosystem. The focus of this type of ecosystem is to remove unnecessary barriers to data access in your company, and to incentivize everyone - even people in roles that traditionally don't use data - to learn how they can make informed, data-driven decisions and improve their workflow.
The best way to achieve this is through good data accessibility processes. More concretely, your company's data should be easily accessible:
- to the data engineers: it should be straightforward to write new code to move a piece of data around the company.
- to the final data consumers: there should be a single, easy-to-access point of contact with the data for all dashboarding and analytical needs.
At DareData, we tend to favor streamlined data lake/data warehouse architectures. Our implementation usually looks something like this:
- we grab all data from your different systems, and store it all on a cloud-based, cost-effective data lake (your oil tanks!). This is our recommended solution to keep track of your company's data history. This usually solves the data engineering accessibility issue, because it's usually quite simple to build connectors between whatever data source you might have and a modern data lake.
- we ingest the data from a data lake into a cloud data warehousing solution, with a SQL-like interface (your refinery!). This is where all your crude oil will be processed and converted to something usable. This component solves the consumer accessibility issue by establishing a single point of contact with the final processed data, instead of several different databases/APIs spread across your infrastructure.
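To make the second step concrete, here is a minimal, hypothetical sketch of lake-to-warehouse ingestion, using JSON files as stand-ins for data lake objects and SQLite as a stand-in warehouse (the `sales` table and its fields are invented for the example):

```python
import json
import sqlite3
from pathlib import Path

def ingest_to_warehouse(lake_dir: Path, conn: sqlite3.Connection) -> int:
    """Load every raw JSON record found in the 'lake' into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id TEXT PRIMARY KEY, amount REAL)"
    )
    loaded = 0
    for path in sorted(lake_dir.glob("*.json")):
        record = json.loads(path.read_text())
        # Idempotent upsert, so re-running the ingestion is safe
        conn.execute(
            "INSERT OR REPLACE INTO sales (order_id, amount) VALUES (?, ?)",
            (record["order_id"], record["amount"]),
        )
        loaded += 1
    conn.commit()
    return loaded
```

In a real setup the lake would be a cloud object store and the warehouse a managed SQL service, but the shape of the process - enumerate raw objects, normalize, upsert into a single queryable table - stays the same.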
This architecture has worked pretty well for us in enabling our customers to explore their data effectively.
A pitfall of data accessibility: never control data access based on individual user identities! Although it might seem tempting to start managing permissions at the user level, in the long run this usually results in unmanageably long permission lists which tend to become outdated and make you lose track of exactly who has access to what data. Always use role-based access! This involves assigning one or more roles to every data consumer, based, for example, on what department they are in, and then assigning permissions to the roles instead of the individuals. This way you can enable/disable roles for specific people as the landscape of your company changes, which is much more manageable in the long run.
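A role-based scheme like this can be sketched in a few lines; the roles, users, and resource names below are purely illustrative:

```python
# Permissions attach to roles, never to individual users.
ROLE_PERMISSIONS = {
    "analyst": {"sales_mart", "marketing_mart"},
    "data_engineer": {"data_lake", "warehouse", "sales_mart", "marketing_mart"},
}

# Users only ever get roles; changing a user's access means changing roles.
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"data_engineer"},
}

def can_access(user: str, resource: str) -> bool:
    """A user may access a resource if any of their roles grants it."""
    return any(
        resource in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

When alice moves to another department, you swap her role rather than auditing a long list of per-user grants.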
Data discovery
Having a lot of accessible data might be valuable in itself, but to really generate business value, data analysts need to be able to easily sift through all the data and discover the exact pieces they need to compose their analyses. This is the equivalent of having an organized oil rig and refinery operation, where everyone knows exactly where each type of oil is stored, in what amounts, and when the latest batch arrived.
You can probably picture why having good data discovery processes is essential to a healthy data ecosystem. However, as a data ecosystem grows, these processes are also among the hardest ones to correctly maintain.
The best way to enable good data discovery processes is to consistently label all your data. Every data artifact should be clearly described, and tagged with short, easily understandable terms. These can be related, for example, to the functional area that particular artifact pertains to.
The responsibility of documenting all data must be gradually instilled in all data engineers and data producers in your company. This can be challenging, because it's easy to overlook documentation in favor of more urgent business needs and deliverables.
There are a few ways to make the documentation process less cumbersome. You can, for example, supply data engineers with pre-built forms that contain a set of agreed-upon tags and a few free-text fields for descriptions.
Another idea I have been entertaining is that of "self-documenting code": design your data-related code and CI/CD pipelines in a way that enables the data engineers to focus only on the functional side of the data processes. Then, some automatic process generates the documentation directly from the code and code comments. This makes it simpler for the data engineers to document their data processes, since they can do it during their normal workflow, and also has the advantage of standardizing the documentation style across all data artifacts; but it is definitely a challenging problem and needs to be carefully implemented.
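As a toy illustration of this idea, the sketch below assumes data processes are plain Python functions and generates a catalog entry from each docstring, using a hypothetical `tags:` convention invented for the example:

```python
import inspect

def transform_sales(raw_rows):
    """Aggregate raw sales events into daily totals.

    tags: sales, daily, aggregation
    """
    ...

def generate_catalog(functions):
    """Build a documentation catalog straight from code docstrings."""
    catalog = {}
    for fn in functions:
        doc = inspect.getdoc(fn) or ""
        lines = doc.splitlines()
        summary = lines[0] if lines else ""
        tags = []
        for line in lines:
            # Pick up the agreed-upon "tags:" line, if the author added one
            if line.startswith("tags:"):
                tags = [t.strip() for t in line[len("tags:"):].split(",")]
        catalog[fn.__name__] = {"summary": summary, "tags": tags}
    return catalog
```

Running `generate_catalog` in a CI/CD step would keep the documentation in lockstep with the code, with a uniform style across all artifacts.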
After having implemented this layer of metadata, and ensuring that it is growing smoothly alongside your data, you can then develop some sort of search engine over it to make sure everyone in your company can find the data they need.
If you have a huge variety of data, you might even want to consider hiring a data steward - a relatively recent role appearing in some companies, responsible for ensuring that data policies and standards are put into practice, and for assisting the company in leveraging its domain data assets to full capacity.
A pitfall of data discovery: make sure to clearly tag all your final data artifacts (those intended for consumption by data analysts) differently from your intermediary data artifacts (those used during the process of building the final data artifacts). As the number of automated data processes in your company increases, it's easy to end up with a "needle-in-a-haystack" situation (or a library of Babel) where for every useful search result, there are dozens of intermediary artifacts that are not suitable for downstream usage.
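A minimal sketch of such a tagging scheme, with invented artifact names, might hide intermediary artifacts from search results by default:

```python
# Hypothetical metadata catalog: every artifact carries tags, including
# whether it is a "final" or an "intermediate" artifact.
CATALOG = [
    {"name": "sales_daily_mart", "tags": {"sales", "final"}},
    {"name": "sales_staging_tmp", "tags": {"sales", "intermediate"}},
    {"name": "marketing_mart", "tags": {"marketing", "final"}},
]

def search(term, include_intermediate=False):
    """Return artifacts matching a tag, hiding intermediates by default."""
    return [
        a["name"]
        for a in CATALOG
        if term in a["tags"]
        and (include_intermediate or "intermediate" not in a["tags"])
    ]
```

An analyst searching for "sales" then sees only the consumable mart, while engineers can still opt in to the staging artifacts.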
Data lineage
Data lineage refers to the ability to trace a particular data artifact back to its origins, telling which data process or processes originated it and how it evolved over time. Having this ability makes it much simpler to detect errors in some source data, and to immediately identify all downstream data processes that might be affected by the faulty data. Going back to the oil diagram, this is essentially equivalent to having functional GPS systems on all your oil tanker transporters, so you know the exact route of each shipment of oil.
As the number of data processes in your company increases, having good data lineage becomes crucial to keep track of how the data is flowing between producers and consumers. Data lineage processes usually tie in very well with data discovery processes, because you can leverage the network of metadata that you have built to enable data discovery. One way to do this is, for example, to tag all your data artifacts with their "parents" (that is, the data artifacts that generated them), so that an analyst can click their way up the hierarchy until reaching the source data (which can be, for example, data that comes directly from your factory and hasn't been processed yet).
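The parent-tag idea could be sketched as follows, with a hypothetical three-level hierarchy; walking up the parents of an artifact leads back to its sources:

```python
# Hypothetical metadata: each artifact records its direct parents.
PARENTS = {
    "sales_mart": ["sales_warehouse"],
    "sales_warehouse": ["raw_sales_lake"],
    "raw_sales_lake": [],  # source data: no parents
}

def trace_to_sources(artifact):
    """Walk the parent tags upward until reaching the source data."""
    sources, stack, seen = [], [artifact], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = PARENTS.get(node, [])
        if not parents:
            sources.append(node)  # nothing above it: this is a source
        stack.extend(parents)
    return sources
```

This is exactly the "click your way up the hierarchy" navigation, just expressed as a graph walk over the metadata.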
Another good way to enable data lineage is to develop your ETL (extract, transform, load) processes as DAGs - directed acyclic graphs - using an orchestrator such as Airflow or Digdag, for example. This allows you to establish dependencies and SLAs (service-level agreements) between your data processes and set up automatic warnings in case of failure, so that those responsible for downstream processes can be notified.
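Orchestrator specifics aside, the underlying mechanism can be illustrated with a toy dependency graph in plain Python: given a failed task, compute every task that transitively depends on it, i.e. whose owners should be warned (the task names are invented):

```python
# Toy stand-in for an orchestrator's dependency graph: each task lists
# the tasks it depends on, forming a DAG as Airflow or Digdag would model it.
DEPENDS_ON = {
    "ingest_raw": [],
    "build_warehouse": ["ingest_raw"],
    "sales_mart": ["build_warehouse"],
    "marketing_mart": ["build_warehouse"],
}

def downstream_of(failed_task):
    """All tasks that (transitively) depend on the failed task."""
    affected = set()
    changed = True
    while changed:  # keep propagating until no new task is affected
        changed = False
        for task, deps in DEPENDS_ON.items():
            if task not in affected and (
                failed_task in deps or affected & set(deps)
            ):
                affected.add(task)
                changed = True
    return affected
```

A real orchestrator maintains this graph for you and can fire the warnings automatically, but the notification logic boils down to this transitive closure.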
Having the ability to easily trace the lineage of your data artifacts is definitely a hard thing to implement, but it pays off down the road and usually results in a lot less downtime in case of critical errors.
A pitfall of data lineage: it's very common to neglect data lineage documentation and instead rely on a few critical people to hold this knowledge. As the number of data processes in your company increases, their inter-dependencies grow exponentially, and the task of keeping track of lineage quickly becomes harder than what a few key people can handle. Instead, share the data lineage documentation with the data users, and empower them to understand where the data is coming from. This way, in case of an error, they can make an informed request for help to the data engineers, who can then jump straight into fixing the faulty data.
Data ownership
Last but not least, we should talk about data ownership. This refers to the idea that every piece of data must have one or more owners, who are responsible for ensuring that it is up-to-date, healthy, and fit for consumption by downstream processes. Thinking back on the oil metaphor, this is the same as having someone responsible for ensuring that each batch of petrol that arrives at the gas station is fit for consumption and has not degraded in quality.
This seems pretty straight-forward, but it's important to consider the effects of "time-erosion" in your data ownership processes:
- When someone leaves the company, what happens to the data artifacts under their ownership?
- How can you ensure that over time, your data ecosystem does not become a wasteland of orphan data artifacts, whose purpose is not clear to anyone, but that no one feels comfortable in removing (for fear of disrupting some downstream data process)?
Weak data ownership standards exacerbate bad data discovery and data lineage practices, making it very easy to end up in a situation like the one described above within just a few years of company activity. One way to prevent this is to be very strict in enforcing data ownership.
You can, for example, set up automated reminders for data owners to clean-up their data artifacts. Another good idea is to include a data artifact handover process in a checklist for employees leaving the company.
A more extreme idea I have been considering is that of "aggressive data pruning". This consists of tagging your data artifacts as outdated if they haven't been updated in a few weeks, and automatically notifying their owner. Then, if the owner does not update the data within a few days, simply delete it from your data ecosystem.
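A pruning policy along these lines could be sketched as a simple decision function; the thresholds below are arbitrary and would need tuning for a real ecosystem:

```python
from datetime import datetime, timedelta

# Assumed thresholds, purely illustrative
STALE_AFTER = timedelta(weeks=3)
GRACE_PERIOD = timedelta(days=5)

def pruning_action(last_updated, notified_at, now):
    """Decide what an aggressive-pruning job would do with one artifact."""
    if now - last_updated < STALE_AFTER:
        return "keep"
    if notified_at is None:
        return "notify_owner"  # first strike: tag as outdated, warn the owner
    if now - notified_at < GRACE_PERIOD:
        return "wait"  # owner still has time to react
    return "delete"  # only ever with a solid backup system in place!
```

A scheduled job would run this over the whole catalog, turning ownership into something actively enforced rather than a label on a wiki page.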
This concept might be a bit too extreme for practical usage. It would at the very least require an excellent data backup system, to allow reverting accidental deletion of critical data. But at the same time, I think it illustrates the importance of data ownership and how keeping a bit of pressure on the data owners to keep their data up-to-date can contribute to a healthier data ecosystem in the long term.
A pitfall of data ownership: make sure to attribute ownership not only to the data artifacts themselves, but also to the processes that generate them. It's easy to end up in a situation where there is someone who can clearly explain a certain piece of data in terms of business knowledge, but no one who can actually explain the transformations that generated that piece of data. To prevent this, it's a good idea to explicitly enforce two owners: one functional owner and one business owner (in many cases they might be the same person).
There are a lot of things to consider when building a healthy data ecosystem for your company. At DareData, we are building up our knowledge of what exactly this process looks like, and what challenges lie along the way. We hope that the ideas discussed in this article help you make the right decisions for your company. And if you want to discuss this topic some more, we would love to have a call!