Data Mess to Data Mesh
Pop-Tarts and Hurricanes
In the early 2000s, Walmart began mining its terabytes, if not petabytes, of data to find trends. The story goes that an insightful data analyst blended Walmart sales and store data with outside data sources, and combining these data sets led to a surprising realization: hurricanes drive up sales of Pop-Tarts to roughly seven times the normal amount. These kinds of insights (e.g. Google’s identification of searches for flu-like symptoms and their correlation to outbreaks) emerge when businesses combine data from different parts of the organization.
Driven by the desire to find these kinds of insights, most businesses try to consolidate data from various sources into a single repository. The laudable goal is to break down data silos, improve data accessibility, and enhance data quality. The most common approach is to create a centralized data lake that promises to facilitate more efficient data analysis and enable cross-functional collaboration.
Data Lakes Quickly Become Data Swamps
The go-to solution for most enterprises is to centralize data, in the hope that doing so will let teams harness the potential of vast and varied data sources. The allure lies in the promise of a centralized repository that can store massive volumes of raw and unstructured data, allowing for comprehensive analytics and data-driven decision-making while fostering innovation. The envisioned benefits include improved accessibility, scalability, and flexibility, since data lakes are designed to accommodate diverse data types without upfront structuring.
The reality of implementing a data lake often falls short of these goals. Dreams of crystal-clean data lakes usually end up as smelly, soupy data swamps. The up-front architecture, designed as a well-organized and easily navigable repository, usually devolves into a chaotic landscape where data quality, structure, and governance erode. The sheer volume and diversity of the data, combined with inadequate management, produce a murky environment where locating relevant information becomes challenging.
Another side effect of centralizing data in a data lake is the slow provisioning of entitlements for user access. In the traditional data lake model, granting permissions and access rights to different users or teams can be time-consuming, and as the volume of data grows, access requirements become more complex, increasing the administrative burden of managing entitlements. Delays in provisioning access undercut the agility organizations expect from their data lakes and delay the use of data for critical business decisions.
Slow entitlement provisioning also exacerbates governance challenges, making it difficult to enforce security protocols and maintain a structured approach to data access. As a result, organizations face the risk of a data swamp, operational inefficiencies, and potential security vulnerabilities.
Float Above the Swamp With a Data Mesh
A Data Mesh is the best way to combat the entropy inherent in enterprise data solutions and provide faster time to insight. The approach originated with Zhamak Dehghani at ThoughtWorks in 2019. A Data Mesh is a decentralized approach to data architecture that addresses the challenges large enterprises face by emphasizing domain-oriented, decentralized data ownership. In contrast to traditional centralized models, a Data Mesh breaks down data silos by treating data as a product that is owned by the domains and distributed through a centralized “marketplace”. The marketplace makes data products searchable, and their product nature ensures usability. This makes information more accessible and usable, which in turn allows the business to act on its data.
A Data Mesh architecture generally comprises three components: Data Products, Data Infrastructure, and Federated Governance.
Data as a Product
Treating data as a product means viewing data as a valuable deliverable with its own development lifecycle. Like a software product, the data is curated, refined, and presented in a user-friendly form that aligns with specific business requirements. This approach emphasizes creating well-documented, reliable datasets that cater to the needs of developers, analysts, and other stakeholders. By treating data as a product, organizations adopt practices that prioritize data quality, accessibility, and user satisfaction, paralleling the meticulous development processes applied to software. This mindset ensures that data becomes a purposeful and valuable asset within the software development ecosystem.
This approach benefits consumers by streamlining the adoption of “upstream” data: it eases access, improves usability, and provides a clear understanding of the data, which empowers users to make informed decisions. Treating data as a product enhances user satisfaction and enables more effective, more valuable use of information.
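As a concrete illustration, the sketch below models a data product as a small, self-describing contract. It is a minimal, hypothetical example (the `DataProduct` class, its field names, and the sample sales product are assumptions, not a standard), but it captures the idea that a data product ships with an owner, a schema, and a freshness commitment rather than landing as an anonymous table in a lake.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProduct:
    """Describes a domain-owned data product published to the marketplace."""
    name: str                      # e.g. "store-sales-daily"
    domain: str                    # owning domain, e.g. "retail-ops"
    owner: str                     # accountable team or contact
    description: str               # what the dataset contains and how to use it
    schema: Dict[str, str]         # column name -> type: the published contract
    freshness_sla_hours: int       # how stale the data is allowed to become
    tags: List[str] = field(default_factory=list)  # keywords for discovery

# A hypothetical sales product of the kind the hurricane analysis could have consumed.
store_sales = DataProduct(
    name="store-sales-daily",
    domain="retail-ops",
    owner="retail-data-team@example.com",
    description="Daily item-level sales per store, joined with store metadata.",
    schema={"store_id": "string", "item_id": "string", "sale_date": "date", "units": "int"},
    freshness_sla_hours=24,
    tags=["sales", "stores", "daily"],
)
```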
Data Infrastructure
Data infrastructure is crucial in a Data Mesh because it is the backbone that supports decentralized data ownership. The infrastructure must provide the tools and technology for seamless data flow between domains, and it facilitates the creation, storage, and processing of diverse data types while maintaining reliability. Use of the infrastructure needs to be self-service, so that domain teams can publish and deliver their data products without waiting on a central team. This enables efficient collaboration, ensuring that data can be shared and used across the organization, and fosters a decentralized yet interconnected ecosystem in which each domain manages its own data while contributing to broader organizational goals.
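To make the self-service idea concrete, here is a toy sketch, continuing the hypothetical `DataProduct` above, of a marketplace that domain teams publish to and consumers search. A real platform would sit on a catalog, storage, compute, and access controls; the `Marketplace` class and its methods are illustrative assumptions only.

```python
from typing import Dict, List

class Marketplace:
    """Toy self-service registry: domain teams publish, consumers search."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        """A domain team registers its product without a central gatekeeper."""
        self._products[product.name] = product

    def search(self, keyword: str) -> List[DataProduct]:
        """Consumers discover products by tag or description."""
        keyword = keyword.lower()
        return [
            p for p in self._products.values()
            if keyword in p.description.lower() or keyword in (t.lower() for t in p.tags)
        ]

marketplace = Marketplace()
marketplace.publish(store_sales)
print([p.name for p in marketplace.search("sales")])  # ['store-sales-daily']
```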
Federated Data Governance
Federated data governance is integral to a Data Mesh, embodying a decentralized approach to managing and overseeing data across diverse domains. Unlike the traditional centralized governance models described in the data lake section above, federated data governance empowers individual domain teams to define and enforce policies tailored to their specific data products. This distributed structure lets each domain maintain autonomy while adhering to overarching organizational guidelines. It involves collaboration between domain-specific governance bodies, fostering collective responsibility for data quality, security, and compliance. Federated data governance also enhances transparency: each domain is accountable for its own data assets, which reduces the risk of data silos and builds a unified understanding of data across the organization. The approach facilitates agility, allowing domains to adapt governance practices to their unique requirements while contributing to a cohesive, well-governed data ecosystem within the broader Data Mesh framework.
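One common way to realize federated governance is policy as code: a small set of global checks every product must pass, plus checks each domain defines for itself. The sketch below, again building on the hypothetical `DataProduct`, shows the shape of such a review; the specific policies are invented for illustration, not prescribed by Data Mesh.

```python
from typing import Callable, List

# A governance policy is just a named check that returns a list of violations.
Policy = Callable[[DataProduct], List[str]]

def global_policies(product: DataProduct) -> List[str]:
    """Organization-wide guardrails every domain must satisfy."""
    violations = []
    if not product.owner:
        violations.append("product must name an accountable owner")
    if product.freshness_sla_hours > 48:
        violations.append("freshness SLA must be 48 hours or better")
    return violations

def retail_ops_policies(product: DataProduct) -> List[str]:
    """Domain-specific rules the retail-ops team adds for its own products."""
    violations = []
    if "store_id" not in product.schema:
        violations.append("retail-ops products must expose store_id")
    return violations

def review(product: DataProduct, policies: List[Policy]) -> List[str]:
    """Federated review: the union of global and domain policy results."""
    return [v for policy in policies for v in policy(product)]

print(review(store_sales, [global_policies, retail_ops_policies]))  # -> []
```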
Final Thoughts
A Data Mesh is a relatively new approach to managing data and provisioning access to it through entitlements. While it is a powerful approach, it is no silver bullet, and there is no application, platform, or service that enterprises can buy to “solve” the problem.
When embarking on the journey, take significant care in designing the infrastructure and governance models, because both shape how domain teams “package” and “deliver” their data products to end users.