A Data Lake is a large repository of data that offers easy availability of and access to internal or external data, both structured and unstructured.
With a universe of data at your disposal, the true value lies in the ability to search and discover data in ways that may not have been considered before. Imagine: data from different systems and external sources, all in one place! Great! But… the data must be searchable, scalable, and usable to support research activities, analytics discovery, modeling, or even plain old operational reporting.
To support the value of the Lake, NEOS proposes Zones, each of which has its own purpose and characteristics. As we will see, without an organized structure, data runs the risk of being commingled, unsearchable, and very quickly unusable.
As we dig deeper into the Data Lake concept, two basic definitions will become important: Ingestion and Curated Data.
- Ingestion is the ability to load data into the lake in its natural state, which may be unstructured, incomplete, or even inaccurate.
- Curated data is data that has been verified to a certain degree of quality and fidelity and is modeled to support specific business cases.
A properly zoned data lake will allow an organization the flexibility to quickly make available new sources of data while also providing the benefit of quality-checked and augmented data for downstream consumers. Data scientists can be given access to data for research, and consuming systems can reliably access the data needed for operational reporting or business analytics.
In all, NEOS recommends five Zones for a Data Lake: a Raw Zone, a Structured Zone, a Curated Zone, one or more Consumer Zones, and an Analytics Zone. Each of these Zones performs a specific function within a Data Lake.
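As a rough orientation, the five Zones and the way data might move between them can be sketched in a few lines of Python. The `Zone` enum, the `READS_FROM` mapping, and the `upstream_zones` helper are illustrative names only. The read paths for the Curated and Consumer Zones are an assumption based on the warehouse and data mart analogy used in this article, not a prescribed NEOS design.

```python
from enum import Enum


class Zone(Enum):
    RAW = "raw"
    STRUCTURED = "structured"
    CURATED = "curated"
    CONSUMER = "consumer"
    ANALYTICS = "analytics"


# Assumed read paths: which Zones each Zone draws its data from.
# The Curated and Consumer entries are an illustrative assumption.
READS_FROM = {
    Zone.RAW: [],
    Zone.STRUCTURED: [Zone.RAW],
    Zone.CURATED: [Zone.STRUCTURED],
    Zone.CONSUMER: [Zone.CURATED],
    Zone.ANALYTICS: [Zone.RAW, Zone.STRUCTURED, Zone.CURATED],
}


def upstream_zones(zone):
    """Return every Zone the given Zone depends on, directly or indirectly."""
    seen = []
    stack = list(READS_FROM[zone])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(READS_FROM[current])
    return seen


if __name__ == "__main__":
    for zone in Zone:
        names = sorted(z.value for z in upstream_zones(zone))
        print(f"{zone.value}: depends on {names}")
```

A mapping like this makes it easy to answer lineage questions such as "which Zones must be healthy before the Consumer Zone can be refreshed?".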
As its name implies, the Raw Zone in a Data Lake stores the data in its raw form. In most cases, this means that the data is not suitable for analytics or reporting use (e.g. every field is a text string). However, it may still provide value to data scientists. A general best practice, when ingesting data from a source, is to ingest all of the data from that source regardless of how much of it will currently be used by consumers. We call this the “touch it, take it” principle. As a result, there will be far more data in the Raw Zone than will ever exist in any other Zone of the Lake, and the Raw Zone will contain attributes and columns from sources that may never make the journey to the Curated or Consumer Zones. Another best practice is to establish the lineage of data, which is the exercise of formally tracking the origin of data elements as they enter the Lake. Establishing lineage, and augmenting it with data definitions and other management techniques, will allow users to search and understand the data in the Lake.
Benefit of the Raw Zone: the Raw Zone allows new sources to be ingested quickly and the lineage of the data in Zones further along in processing to be established.
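To make the “touch it, take it” principle and lineage tracking concrete, here is a minimal sketch in Python. It assumes a CSV extract as the source, and the names `LineageRecord`, `ingest_raw`, `orders_db`, and `orders` are hypothetical. Every column is loaded, every value stays an untyped string, and a lineage record notes where the data came from and when.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical lineage entry describing one ingested source object."""
    source_system: str
    source_object: str
    ingested_at: str
    columns: list


def ingest_raw(source_system, source_object, csv_text):
    """Load every column as an untyped string and record its origin."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]  # no casting, no filtering
    lineage = LineageRecord(
        source_system=source_system,
        source_object=source_object,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        columns=list(reader.fieldnames or []),
    )
    return rows, lineage


# Illustrative extract: the second row is incomplete, and it is kept anyway.
sample = "id,amount,notes\n1,19.99,first order\n2,,missing amount\n"
raw_rows, lineage = ingest_raw("orders_db", "orders", sample)

if __name__ == "__main__":
    print(raw_rows)
    print(lineage)
```

Note that the incomplete row is not rejected at this stage; judging quality is left to later Zones.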
The Structured Zone contains structures that provide a first stage of transformation of the data from the Raw Zone. A Structured Zone will contain data within typed table structures. For data that has been ingested from structured sources, the additional benefit gained from the Structured Zone is that the data is available in typed columns and not merely as strings. For unstructured sources, processing needs to be done on the data to get it into the Structured Zone. For example, if video files are ingested, the audio may be transcribed to populate a transcript column in a table that holds information about each video file. Information about the various scenes in a video may also be loaded into typed tables in the Structured Zone.
A properly designed ingestion process will automatically ingest data into the Raw Zone and make all of that data available in the Structured Zone. This is an important aspect of the Structured Zone and what distinguishes it from the Raw Zone on one side and the Curated Zone on the other side. Without processes to quickly add new sources to the Raw Zone and process them for the Structured Zone, much of the benefit of the Structured Zone is lost.
Benefit of the Structured Zone: this is the first destination where data research can be performed to start making sense of the data in the Lake. Going forward, any new sources that are added can quickly be put into a form that is useful for analytics.
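The move from raw strings to typed columns can be illustrated with a short Python sketch. The `SCHEMA` mapping and the `to_structured` function are hypothetical, and the choice to store unparseable values as `None` rather than reject the row is an assumption made for illustration: the Structured Zone is meant to carry all of the data forward, with quality checks left to the Curated Zone.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical schema: column name -> function that casts a raw string.
SCHEMA = {
    "id": int,
    "amount": Decimal,
    "order_date": date.fromisoformat,
}


def to_structured(raw_row, schema=SCHEMA):
    """Cast one raw (all-string) row into typed columns."""
    typed = {}
    for column, cast in schema.items():
        value = raw_row.get(column)
        if value is None or value.strip() == "":
            typed[column] = None  # empty in the source
            continue
        try:
            typed[column] = cast(value.strip())
        except (ValueError, InvalidOperation):
            typed[column] = None  # unparseable; flagged later during curation
    return typed


if __name__ == "__main__":
    print(to_structured({"id": "1", "amount": "19.99", "order_date": "2024-01-15"}))
    print(to_structured({"id": "2", "amount": "n/a", "order_date": ""}))
```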
The Curated Zone contains data that is often organized in a data model which combines like data from a variety of sources (often referred to as a canonical model). This Zone may be used to feed an external data warehouse, or it may serve as the organization’s data warehouse. The data from one source may be augmented with calculations, aggregations, or data from another source in this Zone. Checks for referential integrity, data quality, or missing data can be performed in this Zone. Whereas the Raw and Structured Zones contain all data ingested from various sources, the Curated Zone contains only data that will be used by consumers.
Benefit of the Curated Zone: the data in this Zone has been quality-checked and combined with like sources.
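The kinds of checks described for the Curated Zone might look like the following Python sketch. The function `curate_orders` and the field names `customer_id`, `amount`, and `region` are hypothetical. The sketch performs a missing-data check and a referential integrity check, then augments each passing order with data from a second source.

```python
def curate_orders(orders, customers):
    """Return (curated_rows, rejected_rows) after basic quality checks."""
    known_customers = {c["customer_id"]: c for c in customers}
    curated, rejected = [], []
    for order in orders:
        if order.get("amount") is None:
            # Missing-data check
            rejected.append((order, "missing amount"))
        elif order.get("customer_id") not in known_customers:
            # Referential integrity check against the customer source
            rejected.append((order, "unknown customer"))
        else:
            # Augment the order with an attribute from another source
            customer = known_customers[order["customer_id"]]
            curated.append({**order, "customer_region": customer["region"]})
    return curated, rejected


if __name__ == "__main__":
    customers = [{"customer_id": "C1", "region": "EMEA"}]
    orders = [
        {"order_id": 1, "customer_id": "C1", "amount": 20},
        {"order_id": 2, "customer_id": "C9", "amount": 35},
        {"order_id": 3, "customer_id": "C1", "amount": None},
    ]
    curated, rejected = curate_orders(orders, customers)
    print(curated)
    print(rejected)
```

Keeping the rejected rows alongside a reason, rather than silently dropping them, gives the data team something to act on.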
The Consumer Zone of a Data Lake provides an easy access point for consumers. While the Curated Zone can be thought of as the data warehouse of the Data Lake, the Consumer Zone can be thought of as the data marts for the Lake. The data in this Zone is organized so that consumers can easily fetch what they need. Feeds to downstream systems can be created from this Zone of the Lake. It is important to weigh the ease of access for consumers against the cost of storage space in this Zone.
Benefit of the Consumer Zone: provides data to consumers in a friendly format for business applications and reporting.
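As an illustration of a data mart style view, the sketch below pre-aggregates curated orders into one row per region, which is the kind of shape a report or downstream feed could read directly. The function `build_region_sales_mart` and the field names are hypothetical. The trade-off mentioned above shows up here too: the summary duplicates information that already exists in curated form, in exchange for simpler access.

```python
from collections import defaultdict
from decimal import Decimal


def build_region_sales_mart(curated_orders):
    """Aggregate curated orders into one report-friendly row per region."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": Decimal("0")})
    for order in curated_orders:
        entry = totals[order["customer_region"]]
        entry["order_count"] += 1
        entry["total_amount"] += order["amount"]
    # Sort by region so consumers see a stable, predictable order.
    return [{"region": region, **values} for region, values in sorted(totals.items())]


if __name__ == "__main__":
    curated_orders = [
        {"customer_region": "EMEA", "amount": Decimal("19.99")},
        {"customer_region": "APAC", "amount": Decimal("5.00")},
        {"customer_region": "EMEA", "amount": Decimal("10.01")},
    ]
    for row in build_region_sales_mart(curated_orders):
        print(row)
```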
The Analytics Zone is an area in the Data Lake that allows data scientists to analyze and experiment with data in the Lake. This is the data scientists' “sandbox” environment. Access to data from the Raw, Structured, and even Curated Zones can be provided, but most of the data that data scientists will be interested in will be found in the Structured Zone.
Depending on the tools being used, the Analytics Zone may be populated in a truly self-service manner, or it may require a request to a technical data team to populate data into it. Data scientists should be allowed to import their own data into this Zone if it is for a one-off purpose.
A good practice here is to proactively engage data scientists in the data governance and operations processes, so that they understand the attributes and limitations of the data sets. Remember that not all of the data in the Lake is 100% accurate or complete; it is there for exploration.
Benefit of an Analytics Zone: allows data scientists access to all data in the Lake and allows them to combine it with data from outside the Lake without having to go through an effort of setting up an entire project. In fact, most of the data a data scientist would want to analyze should be available to them within hours if not minutes depending on the amount of data desired.