A data lake allows you to store any type of digital data in its raw format. The idea of a data lake is a simple one that technology such as Hadoop has made possible. Data stored in a lake can be structured data, such as data from existing relational databases or flat files exported from other systems. It can be XML documents or JSON messages received from internal or external services. It can even be recorded call center phone calls, videos, or every tweet mentioning your industry. Once persisted in the lake, that data can then be analyzed in a variety of ways.
A data lake allows you to bring all of your data from disparate sources into one place and analyze it. New sources are easy to add because the lake stores them in their native format. Consumers of this data are typically people who are highly skilled at data manipulation and analysis.
In an increasing number of cases, data lakes are used to front an enterprise data warehouse. This allows new sources to be ingested into the lake quickly. Some of those sources may not be used immediately, but when a use for that data is found, it can be transformed within the lake and exported to the warehouse in a format better suited to consumption. For instance, you may store every tweet about your company in the lake but export only the statistical analysis of those tweets to the warehouse.
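The tweet example above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample tweets, the field names `created_at` and `text`, and the function name `summarize_tweets` are all hypothetical, and a real lake would read from distributed storage rather than from a string in memory.

```python
import json
from collections import Counter

# Hypothetical raw tweets as they might sit in the lake: one JSON document per line.
RAW_TWEETS = """\
{"id": 1, "created_at": "2024-05-01", "text": "Loving the new release"}
{"id": 2, "created_at": "2024-05-01", "text": "Support was slow today"}
{"id": 3, "created_at": "2024-05-02", "text": "Great service"}
"""


def summarize_tweets(raw_lines):
    """Reduce raw tweet records to per-day counts suitable for a warehouse table."""
    counts = Counter()
    for line in raw_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines in the raw file
        tweet = json.loads(line)
        counts[tweet["created_at"]] += 1
    # Rows shaped as (day, tweet_count), sorted so the export is stable.
    return sorted(counts.items())


# Only the summary rows would be exported; the raw tweets stay in the lake.
warehouse_rows = summarize_tweets(RAW_TWEETS)
```

The raw records remain available in the lake, so a different summary can be computed later without re-ingesting anything.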
The two major differences between a data lake and a data warehouse are the way that data is stored and the point in the process where transformation takes place.
A data lake typically stores data in its raw format, no matter what that format is. If some sort of transformation is needed to put the data into a different structure for analysis or reporting, it happens after the data is persisted, when the data is on its way to the consuming system. This concept is referred to as “schema on read,” which means that the lake doesn’t need a special structure to store the data since it can store the data in its original format. When reading the data, it may make sense to transform it into a new structure that is more easily consumed. For instance, you may receive résumés as PDF and Doc files and store them in the lake exactly as they arrived. If you want to see how many “Hadoop” résumés you have stored, you could put in a process to count how many résumés contain the word “Hadoop” and make that data available.
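A minimal sketch of schema on read, using the résumé count as the example. It makes several simplifying assumptions: plain-text files in a temporary directory stand in for the lake, the function name `count_documents_mentioning` is hypothetical, and the keyword match is a raw byte search. Real PDF or Doc files would need a text-extraction step before searching.

```python
import tempfile
from pathlib import Path


def count_documents_mentioning(lake_dir, keyword):
    """Scan raw files at read time and count those that contain the keyword."""
    needle = keyword.lower().encode("utf-8")
    count = 0
    for path in Path(lake_dir).rglob("*"):
        # Files were stored untouched; interpretation happens only now, on read.
        if path.is_file() and needle in path.read_bytes().lower():
            count += 1
    return count


def demo():
    """Write three stand-in resumes to a temporary 'lake' and count matches."""
    with tempfile.TemporaryDirectory() as lake_dir:
        lake = Path(lake_dir)
        (lake / "alice.txt").write_text("Five years of Hadoop and Spark", encoding="utf-8")
        (lake / "bob.txt").write_text("Relational database administrator", encoding="utf-8")
        (lake / "carol.txt").write_text("Built HADOOP clusters", encoding="utf-8")
        return count_documents_mentioning(lake, "Hadoop")


hadoop_resumes = demo()
```

Nothing about the question being asked had to be known when the files were written; the same files could later be scanned for any other keyword.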
A data warehouse, on the other hand, uses a “schema on write” process. To store data in a data warehouse, you first need a data model to map that data to. If you have the same type of data coming in from various systems in varying formats, a large part of the effort of setting up a warehouse for that data is creating a model that holds it all. This effort is postponed in a data lake and may not even be necessary at all depending on how the data is used.
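For contrast, schema on write can be sketched with an in-memory SQLite database from the Python standard library. The `customer` table, the two source systems (`crm` and `billing`), and their field names are all hypothetical; the point is only that the model and the per-source mappings must exist before any row can be stored.

```python
import sqlite3


def to_customer_row(record, source):
    """Map a source-specific record onto the shared (name, email) model."""
    if source == "crm":
        return (record["full_name"], record["email"].lower())
    if source == "billing":
        name = f'{record["first"]} {record["last"]}'
        return (name, record["contact_email"].lower())
    # A new source cannot be loaded until someone defines its mapping.
    raise ValueError(f"no mapping defined for source {source!r}")


conn = sqlite3.connect(":memory:")
# The data model is fixed up front, before any data arrives.
conn.execute("CREATE TABLE customer (name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)")

incoming = [
    ({"full_name": "Ada Lovelace", "email": "Ada@example.com"}, "crm"),
    ({"first": "Alan", "last": "Turing", "contact_email": "alan@example.com"}, "billing"),
]
for record, source in incoming:
    conn.execute(
        "INSERT INTO customer (name, email) VALUES (?, ?)",
        to_customer_row(record, source),
    )

rows = conn.execute("SELECT name, email FROM customer ORDER BY name").fetchall()
```

The mapping function is where the modeling effort described above shows up: each additional source format needs its own branch before its data can be written.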
One of a data lake's strengths is also the source of a common criticism. A data lake can persist many different types of data, and ingesting data into it is easy. Without proper governance in place, it can therefore quickly become a dumping ground for all types of uncategorized and untracked data. This disorganized mess of data is referred to as a “data swamp.” It is vital that proper governance processes be put into place so that users and administrators understand what is in the lake, where it is, where it came from, and how it can be used.
A worthwhile goal for a data lake is to eventually make it a self-service station for data consumers. Since ideally all data flowing through the enterprise would find its way to the lake, the lake becomes a one-stop superstore for enterprise data. The catch is that you need to be able to find that data, and you need to be able to validate its source. Without good governance policies, that won’t be possible. Making sure that the data lake supports metadata is extremely important for governance and for the success of the lake.
With that governance in place, you are in a position to make the data in the data lake searchable so that potential consumers can readily find what they are looking for. With good business metadata, they will also be able to understand how to use the data they have found.
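One way to picture the metadata that governance depends on is a small catalog that records what each dataset is, where it lives, and where it came from, and that can be searched. The sketch below is purely illustrative: the `DatasetEntry` fields, the `Catalog` class, and the sample paths are assumptions, not the interface of any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Business and technical metadata for one dataset in the lake."""
    name: str
    location: str      # where it is
    source: str        # where it came from
    data_format: str   # how it is stored
    description: str   # what it is and how it can be used
    tags: tuple = ()


class Catalog:
    """In-memory registry of dataset entries with a simple keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"dataset {entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def search(self, term):
        """Return entries whose name, description, or tags match the term."""
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term == t.lower() for t in e.tags)
        ]


catalog = Catalog()
catalog.register(DatasetEntry(
    name="company_tweets",
    location="/lake/raw/social/tweets/",  # hypothetical path
    source="social media feed",
    data_format="JSON",
    description="Every tweet mentioning the company, stored as received.",
    tags=("social", "marketing"),
))
catalog.register(DatasetEntry(
    name="resumes",
    location="/lake/raw/hr/resumes/",  # hypothetical path
    source="recruiting system export",
    data_format="PDF and Doc",
    description="Applicant resumes in their original file formats.",
    tags=("hr",),
))

matches = catalog.search("tweet")
```

Refusing to register a dataset without a source and description is one simple guard against untracked data accumulating in the lake.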
In short, the technology exists to enable you to create a successful data lake in your organization. Doing so will require more than technology. Like any large initiative, the technology is only a portion of the project; the rest involves putting the people and processes in place to ensure it is successful.