By: Rob Nocera, Partner
A properly zoned data lake gives an organization the flexibility to quickly ingest and make available new sources of data while also providing the benefit of quality-checked and augmented data for downstream consumers. In this type of lake, data scientists are given quick access to data, and consuming systems can reliably access the data they need for reporting or even for operations.
A data lake is a large repository of all types of data, and to make the most of it, it should provide both quick ingestion methods and access to quality curated data. While curated data provides quality inputs for downstream systems, it also takes the longest to move from the point of intake to the consumption layer, not just in terms of actual processing time once transforms are in place, but also in terms of the time it takes to develop the processes necessary to transform the data. A properly zoned data lake allows access to data in various states of transformation.
In all, NEOS recommends five zones for a data lake: a Raw Zone, a Structured Zone, a Curated Zone, a Consumer Zone, and an Analytics Zone. Each of these zones provides a benefit to the data lake and is described below.
As its name implies, the Raw Zone in a data lake stores data in its raw form. In most cases, this means that the data is not suitable for consumption (e.g., every field is a text string). However, it may still provide value to data scientists. A general best practice when ingesting data from a source is to ingest all of the data from that source, regardless of how much of it will currently be used by consumers. There will be far more data in the Raw Zone than will ever exist in any other zone of the lake. As a result, the Raw Zone will contain attributes and columns from sources that never make the journey to the Curated or Consumer Zones.
The benefit of a Raw Zone is that it allows new sources to be ingested quickly and the lineage of the data in Zones further along in processing to be established.
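The idea of landing everything as strings, with lineage carried alongside each record, can be sketched in a few lines of Python. This is an illustrative sketch only: the `ingest_raw` helper, the field names, and the source name are hypothetical, not part of any particular product.

```python
import csv
import io
from datetime import datetime, timezone

def ingest_raw(source_name: str, csv_text: str) -> list[dict]:
    """Land a source file in the Raw Zone: every field stays a string,
    and each record carries lineage metadata about where it came from."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "_source": source_name,       # lineage: originating system
            "_ingested_at": ingested_at,  # lineage: load timestamp
            **row,                        # raw fields, untyped
        }
        for row in reader
    ]

# A hypothetical feed from a policy administration system.
feed = (
    "policy_id,premium,effective_date\n"
    "P-100,1250.50,2021-03-01\n"
    "P-101,980.00,2021-03-02\n"
)
raw = ingest_raw("policy_admin", feed)
```

Note that `premium` remains the string `"1250.50"` here; typing it is deliberately deferred to the Structured Zone, which keeps ingestion of a new source fast and schema-light.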
The Structured Zone in a data lake is the first stage of transformation of ingested data. This zone contains structures that provide some increased value over the pure raw data in the Raw Zone. A Structured Zone will contain data within typed table structures. For data that has been ingested from structured sources, the additional benefit gained from the Structured Zone is that the data is available in typed columns and not merely as strings. For non-structured sources, processing needs to be done on the data to get it into the Structured Zone. For example, if video files are ingested, a process to transcribe the audio may be performed to populate a transcribed column in a table providing information about the video file. Information about the various scenes in a video may also be loaded into typed tables in the Structured Zone.
A properly designed ingestion process will automatically ingest data into the Raw Zone and make all of that data available in the Structured Zone. This is an important aspect of the Structured Zone and what distinguishes it from the Raw Zone on one side and the Curated Zone on the other side. Without processes to quickly add new sources to the Raw Zone and process them for the Structured Zone, much of the benefit of the Structured Zone is lost.
The benefit of the Structured Zone is that it contains the first stage of transformation the data needs before a data scientist can start to make sense of it. New sources can quickly be put into a form that is useful for analytics.
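For a structured source, the Raw-to-Structured step is essentially promoting string fields into typed columns against a declared schema. A minimal sketch, assuming a hypothetical policy schema (the column names and `to_structured` helper are illustrative):

```python
from datetime import date
from decimal import Decimal

# Illustrative schema for one source: column name -> parser for the typed column.
POLICY_SCHEMA = {
    "policy_id": str,
    "premium": Decimal,
    "effective_date": date.fromisoformat,
}

def to_structured(raw_row: dict) -> dict:
    """Promote a Raw Zone row (all strings) into typed columns.
    Lineage fields (prefixed with "_") pass through unchanged."""
    structured = {k: v for k, v in raw_row.items() if k.startswith("_")}
    for column, parse in POLICY_SCHEMA.items():
        structured[column] = parse(raw_row[column])
    return structured

row = to_structured({
    "_source": "policy_admin",
    "policy_id": "P-100",
    "premium": "1250.50",
    "effective_date": "2021-03-01",
})
```

Because the schema is declared as data rather than hard-coded into transform logic, adding a new source to the Structured Zone is mostly a matter of registering its schema, which is what keeps this zone quick to extend.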
The Curated Zone in a data lake contains curated data that is often stored in a data model, which combines like data from a variety of sources (often referred to as a canonical model). This Zone may be used to feed an external data warehouse or serve as the organization’s data warehouse. The data from one source may be augmented with calculations, aggregations, or data from another source in this zone. Checks for referential integrity, data quality or missing data can be performed in this zone. Whereas the Raw and Structured Zones contain all data ingested from various sources, the Curated Zone only contains data that will be used by consumers.
The benefit of a Curated Zone is that the data in this zone has been quality-checked and combined with like data from other sources, making it more consumer-friendly.
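The curation step described above, combining sources into a canonical record while holding back rows that fail quality checks, might look like the following. The two sources, field names, and the monthly-to-annual calculation are all hypothetical examples, not a prescribed model:

```python
def curate(policies: list[dict], clients: list[dict]) -> tuple[list[dict], list[dict]]:
    """Build canonical records by joining two sources and applying simple
    quality checks; rows that fail are held back rather than published."""
    clients_by_id = {c["client_id"]: c for c in clients}
    curated, rejected = [], []
    for p in policies:
        client = clients_by_id.get(p["client_id"])
        # Referential-integrity and data-quality checks happen here.
        if client is None or p["premium"] <= 0:
            rejected.append(p)
            continue
        curated.append({
            "policy_id": p["policy_id"],
            "client_name": client["name"],        # augmented from a second source
            "annual_premium": p["premium"] * 12,  # derived calculation (monthly -> annual)
        })
    return curated, rejected

policies = [
    {"policy_id": "P-100", "client_id": "C-1", "premium": 100},
    {"policy_id": "P-101", "client_id": "C-9", "premium": 50},  # no matching client
]
clients = [{"client_id": "C-1", "name": "Acme Corp"}]
published, held_back = curate(policies, clients)
```

The key contrast with the earlier zones is visible in the return value: only rows that pass the checks are published, which is why the Curated Zone holds less data than the Raw and Structured Zones.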
The Consumer Zone of a data lake provides an easy access point for consumers. While the Curated Zone can be thought of as the data warehouse of the data lake, the Consumer Zone can be thought of as the data marts for the lake. The data in this zone is organized so that consumers can easily fetch what they need. Feeds to downstream systems can be created from this zone of the lake. It is important to weigh the ease of access for consumers against the cost of space in this zone.
The benefit of a Consumer Zone is that it provides data to consumers in a format friendly to those downstream systems.
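In the data-mart analogy, a Consumer Zone feed is often just a narrow projection of curated records, shaped for one consumer. A minimal sketch (the `build_mart` helper and column names are illustrative):

```python
def build_mart(curated: list[dict], columns: list[str]) -> list[dict]:
    """Project curated records into a narrow, consumer-friendly view,
    the way a data mart exposes only what a downstream system needs."""
    return [{c: row[c] for c in columns} for row in curated]

curated = [
    {"policy_id": "P-100", "client_name": "Acme Corp", "annual_premium": 1200},
    {"policy_id": "P-102", "client_name": "Globex", "annual_premium": 2400},
]
# A reporting system that only needs identifiers and premiums.
reporting_feed = build_mart(curated, ["policy_id", "annual_premium"])
```

Each projection like this costs storage, which is the trade-off noted above between ease of access for consumers and the cost of space in this zone.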
The Analytics Zone is an area in the data lake that allows data scientists to analyze and experiment with data in the lake. Data scientists should be able to access data from the Raw, Structured, and even Curated Zones as needed, but most of the data they are interested in will be found in the Structured Zone. Depending on the tools being used, the Analytics Zone may be populated in a truly self-service manner or may require a request to a technical data team. Data scientists should also be allowed to import their own data into this zone when it is a one-off spreadsheet or source.
The benefit of an Analytics Zone is that it allows data scientists access to all data in the lake and lets them combine it with data from outside the lake without having to go through the effort of setting up an entire project. In fact, most of the data a data scientist would want to analyze should be available to them within hours, if not minutes, depending on the amount of data desired.
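Combining lake data with a one-off outside source, without a formal ingestion project, can be as simple as an in-memory join against an ad hoc spreadsheet export. A sketch under that assumption (the `adhoc_enrich` helper and the `segment` column are hypothetical):

```python
import csv
import io

def adhoc_enrich(lake_rows: list[dict], spreadsheet_csv: str, key: str) -> list[dict]:
    """Join lake data with a one-off spreadsheet a data scientist brought
    into the Analytics Zone, keyed on a shared identifier column."""
    extra = {r[key]: r for r in csv.DictReader(io.StringIO(spreadsheet_csv))}
    # Merge spreadsheet columns onto each lake row; rows with no match pass through.
    return [{**row, **extra.get(row[key], {})} for row in lake_rows]

lake_rows = [{"policy_id": "P-100", "premium": 100}]
sheet = "policy_id,segment\nP-100,retail\n"  # one-off spreadsheet export
enriched = adhoc_enrich(lake_rows, sheet, "policy_id")
```

Nothing here touches the Raw, Structured, or Curated Zones, which is the point: experiments stay in the Analytics Zone and never pollute the governed zones.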
Without clear boundaries between data, a once clean data lake can quickly become a disorganized data swamp. However, by properly implementing these five zones, an organization can ensure the quality of data received by downstream consumers while still retaining the ability to quickly and easily ingest new sources of data.