20 Jun 2017
A Data Lake is a large repository of data that offers easy availability of and access to internal or external data, both structured and unstructured.
With a universe of data at your disposal, true value is reflected in the ability to search and discover data in ways that may not have been considered before. Imagine, data from different systems and external sources all in one place! Great! But… the data must be searchable, scalable, and usable to support research activities, analytics discovery, modeling or even plain-old operational reporting.
To support the value of the Lake, NEOS proposes Zones, each of which has its own purpose and characteristics. As we will see, without an organized structure, data runs the risk of being commingled, unsearchable, and very quickly unusable.
As we dig deeper into the Data Lake concept, two basic definitions will become important: Ingestion and Curated Data.
- Ingestion is the ability to load data into the lake in its natural state, which may be unstructured, incomplete, or inaccurate.
- Curated data is data that has been verified to a defined level of quality and fidelity and is modeled to support specific business cases.
A properly zoned data lake will allow an organization the flexibility to quickly make available new sources of data while also providing the benefit of quality-checked and augmented data for downstream consumers. Data scientists can be given access to data for research, and consuming systems can reliably access the data needed for operational reporting or business analytics.
In all, NEOS recommends five Zones for a Data Lake: a Raw Zone, a Structured Zone, a Curated Zone, one or more Consumer Zones, and an Analytics Zone. Each of these Zones performs a specific function within a Data Lake.
As its name implies, the Raw Zone in a Data Lake stores the data in its raw form. In most cases, this means that the data is not suitable for analytics or reporting use (e.g. every field is a text string). However, it may still provide value to data scientists. A general best practice, when ingesting data from a source, is to ingest all of the data from that source regardless of how much of it will currently be used by consumers. We call this the “touch it, take it” principle. There will be far more data in the Raw Zone than will ever exist in any other Zone of the Lake. As a result, the Raw Zone will contain attributes and columns from sources that may never make the journey to the Curated or Consumer Zones. Another best practice is to establish the lineage of data, which is the exercise of formally tracking the origin of data elements as they enter the Lake. Establishing lineage, and augmenting it with data definitions and other management techniques, will allow users to search and understand the data in the Lake.
Benefit of the Raw Zone: the Raw Zone allows new sources to be ingested quickly and the lineage of the data in Zones further along in processing to be established.
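To make the “touch it, take it” and lineage ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a NEOS implementation: the function name `ingest_raw`, the folder layout, and the fields of the lineage record are all hypothetical, and a local temporary directory stands in for the Lake's storage.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(payload: bytes, source_system: str, name: str, raw_root: Path) -> dict:
    """Store a payload unchanged and write a lineage record beside it.

    Hypothetical helper: the layout <raw_root>/<source_system>/<name> and the
    lineage fields are assumptions made for this sketch.
    """
    target_dir = raw_root / source_system
    target_dir.mkdir(parents=True, exist_ok=True)

    # "Touch it, take it": persist every byte, with no filtering or parsing.
    (target_dir / name).write_bytes(payload)

    # Lineage: record where the object came from and when it entered the Lake.
    lineage = {
        "source_system": source_system,
        "object_name": name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    (target_dir / (name + ".lineage.json")).write_text(json.dumps(lineage, indent=2))
    return lineage


# Usage: a temporary directory plays the role of the Raw Zone.
with tempfile.TemporaryDirectory() as tmp:
    record = ingest_raw(b"id,amount\n1,9.99\n", "billing", "invoices.csv", Path(tmp))
    stored = (Path(tmp) / "billing" / "invoices.csv").read_bytes()

print(record["source_system"], record["object_name"])
```

Keeping a checksum in the lineage record is one possible way to let later Zones confirm that they are reading exactly what was ingested.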
The Structured Zone contains structures that provide a first stage of transformation of the data from the Raw Zone. A Structured Zone will contain data within typed table structures. For data that has been ingested from structured sources, the additional benefit gained from the Structured Zone is that the data is available in typed columns and not merely as strings. For non-structured sources, processing needs to be done on the data to get it into the Structured Zone. For example, if video files are ingested, a process to transcribe the audio may be performed to populate a transcribed column in a table providing information about the video file. Information about the various scenes in a video may also be loaded into typed tables in the Structured Zone.
A properly designed ingestion process will automatically ingest data into the Raw Zone and make all of that data available in the Structured Zone. This is an important aspect of the Structured Zone and what distinguishes it from the Raw Zone on one side and the Curated Zone on the other side. Without processes to quickly add new sources to the Raw Zone and process them for the Structured Zone, much of the benefit of the Structured Zone is lost.
Benefit of the Structured Zone: this is the first destination where data research can be performed to start making sense of the data in the Lake. Going forward, any new sources that are added can quickly be put into a form that is useful for analytics.
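The move from all-string raw data to typed columns can be sketched in a few lines of Python. This is only an illustration: the column names, the `SCHEMA` mapping, and the sample data are hypothetical, and a real Lake would more likely keep column types in its metadata than in code.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical column types for one source (an assumption for this sketch).
SCHEMA = {
    "order_id": int,
    "order_date": date.fromisoformat,
    "amount": Decimal,
}


def to_typed_rows(raw_text: str, schema: dict) -> list[dict]:
    """Convert raw CSV text, where every field is a string, into typed rows."""
    typed = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        typed_row = {}
        for column, value in row.items():
            # Columns missing from the schema stay as text; empty fields become None.
            convert = schema.get(column, str)
            typed_row[column] = convert(value) if value != "" else None
        typed.append(typed_row)
    return typed


# Sample raw data as it might sit in the Raw Zone.
raw_text = (
    "order_id,order_date,amount,note\n"
    "1001,2017-06-20,19.99,rush\n"
    "1002,2017-06-21,,\n"
)
rows = to_typed_rows(raw_text, SCHEMA)
print(rows[0])
```

Note that every column is carried across, including `note`, which has no declared type. That matches the idea that all ingested data is made available in the Structured Zone, not only what consumers use today.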
The Curated Zone contains data that is often organized in a data model which combines like-data from a variety of sources (often referred to as a canonical model). This Zone may be used to feed an external data warehouse, or serve as the organization’s data warehouse. The data from one source may be augmented with calculations, aggregations, or data from another source in this Zone. Checks for referential integrity, data quality or missing data can be performed in this Zone. Whereas the Raw and Structured Zones contain all data ingested from various sources, the Curated Zone contains only data that will be used by consumers.
Benefit of the Curated Zone: the data in this Zone has been quality-checked and combined with like sources.
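As a rough illustration of the checks and augmentation described for this Zone, the Python sketch below applies a missing-data check and a referential-integrity check, then adds an aggregation. The record layout, the function name `curate_orders`, and the rejection reasons are assumptions made for the example.

```python
from collections import defaultdict


def curate_orders(orders, customers):
    """Separate orders that pass the checks from rejects, then total by customer."""
    known_customers = {c["customer_id"] for c in customers}
    curated, rejected = [], []
    for order in orders:
        if order.get("amount") is None:
            # Missing-data check.
            rejected.append((order, "missing amount"))
        elif order["customer_id"] not in known_customers:
            # Referential-integrity check against the customer source.
            rejected.append((order, "unknown customer"))
        else:
            curated.append(order)

    # Augmentation: an aggregate computed only from rows that passed the checks.
    totals = defaultdict(float)
    for order in curated:
        totals[order["customer_id"]] += order["amount"]
    return curated, rejected, dict(totals)


# Hypothetical sample data from two sources.
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 20.0},
    {"order_id": 2, "customer_id": "C1", "amount": 5.0},
    {"order_id": 3, "customer_id": "C9", "amount": 7.5},
    {"order_id": 4, "customer_id": "C2", "amount": None},
]
curated, rejected, totals = curate_orders(orders, customers)
print(totals)
```

Rejected rows are returned with a reason rather than silently dropped, so the failures can be reported back to the owners of the source data.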
The Consumer Zone of a Data Lake provides an easy access point for consumers. While the Curated Zone can be thought of as the data warehouse of the Data Lake, the Consumer Zone can be thought of as the data marts for the Lake. The data in this Zone is organized so that consumers can easily fetch what they need. Feeds to downstream systems can be created from this Zone of the Lake. It is important to weigh the ease of access for consumers against the cost of storage space in this Zone.
Benefit of the Consumer Zone: provides data to consumers in a friendly format for business applications and reporting.
The Analytics Zone is an area in the Data Lake that allows data scientists to analyze and experiment with data in the Lake. This is the data scientists’ “sandbox” environment. Access to data from the Raw, Structured, and even Curated Zones can be provided, but most of the data they will be interested in will be found in the Structured Zone.
Depending on the tools being used, the Analytics Zone may be populated in a truly self-service manner or may require a request to a technical data team to populate data into it. Data scientists should be allowed to import their own data into this Zone if it is for a one-off purpose.
A good practice here is to proactively engage the data scientists in the data governance and operations processes so that they understand the attributes and limitations of the data sets because, as noted earlier, not all of the data will be 100% accurate or complete; it is there for exploration.
Benefit of an Analytics Zone: allows data scientists access to all data in the Lake and allows them to combine it with data from outside the Lake without having to go through an effort of setting up an entire project. In fact, most of the data a data scientist would want to analyze should be available to them within hours if not minutes depending on the amount of data desired.
Top 5 Most Often Asked Questions About Data Lakes
What is a data lake anyway?
A data lake allows you to store any type of digital data in its raw format. The idea of a data lake is a simple one that technology such as Hadoop has made possible. Data stored in a lake can be structured data, such as data from existing relational databases or flat files exported from other systems. It can be XML documents or JSON messages received from internal or external services. It can even be recorded call center phone calls or videos, or every tweet mentioning your industry. Once persisted in the lake, that data can then be analyzed in a variety of ways.
Why would I want a data lake?
A data lake allows you to bring all of your data from disparate sources into one place and analyze it. New sources are easy to add because it stores them in their native format. Consumers of this data are typically people who are highly skilled at data manipulation and analysis.
In an increasing number of cases, data lakes are used to front an enterprise data warehouse. This allows new sources to be ingested into the lake quickly. Sometimes those sources may not be used immediately, but when a use for that data is found, it can be transformed within the lake and exported to a warehouse in a more readily consumable format. For instance, you may store every tweet about your company in the lake but export only the statistical analysis of those tweets to the warehouse.
How is a data lake different from a data warehouse?
The two major differences between a data lake and a data warehouse are the way that data is stored and the point in the process where transformation takes place.
A data lake typically stores data in its raw format, whatever that format is. If some sort of transformation is needed to put the data into a different structure for analysis or reporting, it happens after the data is persisted, when it is on its way to the consuming system. This concept is referred to as “schema on read,” which means that the lake doesn’t need a special structure to store the data since it can store the data in its original format. When reading the data, it may make sense to transform it into a new structure that is more easily consumed. For instance, you may store résumés in PDF and Doc format in the lake as PDFs and Docs. If you want to see how many “Hadoop” résumés you have stored, you could put in a process to count how many résumés contain the word “Hadoop” and make that data available.
A data warehouse, on the other hand, uses a “schema on write” process. To store data in a data warehouse, you first need a data model to map that data to. If you have the same type of data coming in from various systems in varying formats, a large part of the effort of setting up a warehouse for that data is creating a model that holds it all. This effort is postponed in a data lake and may not even be necessary at all depending on how the data is used.
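The “schema on read” idea can be shown with a small Python sketch in the spirit of the résumé example. It rests on a simplifying assumption: the résumé text has already been extracted and is stored as raw JSON messages, because reading PDF or Doc files needs tooling that is out of scope here. The field names and sample records are hypothetical.

```python
import json

# Raw records kept exactly as received; field names differ between sources.
# Hypothetical data, for illustration only.
raw_records = [
    '{"candidate": "A. Lee", "resume_text": "Built Hadoop and Spark pipelines"}',
    '{"name": "B. Cruz", "text": "Java developer, some hadoop experience"}',
    '{"candidate": "C. Diaz", "resume_text": "Front-end engineer"}',
]


def read_resume(raw: str) -> dict:
    """Apply a structure at read time, tolerating differing field names."""
    doc = json.loads(raw)
    return {
        "candidate": doc.get("candidate") or doc.get("name"),
        "text": doc.get("resume_text") or doc.get("text") or "",
    }


def count_mentions(records, keyword: str) -> int:
    """Count the stored records whose text mentions the keyword, ignoring case."""
    keyword = keyword.lower()
    return sum(1 for raw in records if keyword in read_resume(raw)["text"].lower())


hadoop_count = count_mentions(raw_records, "Hadoop")
print(hadoop_count)
```

Nothing about the stored records had to change to answer the question. Under “schema on write,” the mapping in `read_resume` would have had to be designed before either source could be loaded.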
What is a data swamp?
One criticism of a data lake is also one of its strengths. Since a data lake can persist so many different types of data and since it is easy to ingest data into the lake, without proper governance in place it can quickly become a dumping ground for all types of uncategorized and untracked data. This disorganized mess of data is referred to as a “data swamp.” It is vital that proper governance processes be put into place so that users and administrators understand what is in the lake, where it is, where it came from, and how it can be used.
How can I make my data lake successful?
One very good goal for a data lake is to eventually make it a self-service station for data consumers. Since ideally all data flowing through the enterprise would find its way to the lake, that makes it a one-stop superstore for enterprise data. The catch is that you need to be able to find that data, and you need to be able to validate its source. Without good governance policies, that won’t be possible. Making sure that the data lake supports metadata is extremely important for governance and the success of the lake.
With that governance in place, you are in a position to make the data in the data lake easily searchable so that potential consumers can easily find what they are looking for. With good business metadata, they will also be able to understand how to use the data they have found.
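To illustrate how metadata makes the lake searchable, here is a minimal in-memory catalog in Python. It is a sketch only: the `DatasetEntry` fields, the `Catalog` class, and the sample entries are hypothetical, and a production lake would normally rely on a dedicated metadata or catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Business and technical metadata for one data set (hypothetical fields)."""
    name: str
    zone: str
    source_system: str
    description: str
    tags: list = field(default_factory=list)


class Catalog:
    """A toy metadata catalog: register data sets, then search them by keyword."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        """Return names of data sets whose description or tags mention the term."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.description.lower()
            or any(term in tag.lower() for tag in e.tags)
        )


catalog = Catalog()
catalog.register(DatasetEntry(
    name="billing_invoices_raw", zone="raw", source_system="billing",
    description="Invoice exports exactly as received", tags=["finance", "invoices"],
))
catalog.register(DatasetEntry(
    name="customer_orders", zone="curated", source_system="crm",
    description="Orders joined to customer master data", tags=["sales"],
))
print(catalog.search("invoice"))
```

Because each entry records its zone and source system, a consumer who finds a data set can also see where it came from, which is the validation step that governance is meant to support.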
In short, the technology exists to enable you to create a successful data lake in your organization. Doing so will require more than technology. Like any large initiative, the technology is only a portion of the project; the rest involves putting the people and processes in place to ensure it is successful.