Data lakes are a product of the 21st century and the explosion in the variety of data generation and handling. They are also part of the breakdown of information silos and the push to reshape access to data in pursuit of better business insights. But a data lake isn't just a place to dump and fish out data. Without management and curation, data lakes can quickly turn into data swamps, where nobody can be certain of finding anything.
Data lakes were once seen as a replacement for data warehouses, but over time it became clear that the two actually serve different purposes. Since then, the lines have blurred, though the primary use cases for the two are still distinct.
In general, data warehouses serve a more limited, streamlined function than data lakes. According to AWS, data warehouses are typically oriented toward a particular business line or operation. Unlike data lakes, data warehouses are usually SQL-based, data is groomed for use beforehand, and trust in the data is high. Also, the relational databases within data warehouses are geared toward optimizing query speed.
Data lake definition in AWS
A data lake solution is much broader in scope, and it offers a different set of advantages - and dangers. Data lakes can mean different things to different organizations and in various contexts. Most visions of a data lake, however, roughly follow the AWS definition that, "A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale."
This article focuses on the AWS data lake service in particular, so we'll take that definition as a given.
The contents of a data lake on AWS, or any other offering, may or may not be unstructured, and AWS points out that data can be put into the data lake 'as-is'. This can make for cheaper, faster and more flexible data entry. Typically, data lakes contain data ranging from unstructured social media posts and comments to structured transaction and inventory database data to binary files such as audio and video.
However, 'as-is' does not mean that files are dumped in without a second thought. Metadata tags and other markers become vital components of the data in data lakes. This metadata tagging is what makes it possible to find and share data across an organization, providing access to users ranging from data scientists and data engineers to business analysts for a variety of purposes. This is in contrast to data warehouses, which often serve as the source of curated information for business analysts. Without proper metadata tagging, a data lake becomes a morass very quickly.
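As a concrete illustration, attaching metadata at ingest time can be as simple as passing user-defined tags alongside each object upload. The sketch below is a minimal example in Python with boto3; the bucket, key, and tag names are placeholders, and the request-building helper is separated out so the tagging convention stays explicit and testable.

```python
def build_put_object_args(bucket, key, body, tags):
    """Assemble arguments for s3.put_object, attaching user-defined
    metadata tags (stored by S3 as x-amz-meta-* headers)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "Metadata": {k: str(v) for k, v in tags.items()},
    }

def upload_with_metadata(bucket, key, body, tags):
    """Upload an object with its metadata. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above stays dependency-free
    s3 = boto3.client("s3")
    return s3.put_object(**build_put_object_args(bucket, key, body, tags))
```

For example, a raw social media dump might be tagged with its source system, ingest date, and owning team before it lands in the lake, so it can later be found by any of those dimensions.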
AWS data lake architecture
One feature of data lakes is that the contents are stored in their native formats. This allows the data to be kept in a raw state, but it also imposes handling issues. Creating an environment that enables data professionals to work with the content has led to changes in data lake architecture. This also parallels the drive to break down silos in enterprises and further increase the availability of data for gleaning insights.
As virtualization and cloud technology developed, the ability to expand storage at will, move data more freely and generate virtual machines created new possibilities for data lakes. Data access and processing sped up as a result. Amazon, with its focus on cloud storage, quickly became a major player in data lake development, and the Amazon data lake has been an important part of the industry since its inception.
"The evolution from older architectures created the modern data lake, but it was also part of a great acceleration of change in the IT industry. Everything from storage to communications to compute has been affected by this acceleration, and keeping up takes focus. At Proxet, working with data itself as well as in applications such as ML/AI is our livelihood, and we keep a close eye on the latest developments in the field."
— Vlad Medvedovsky, CEO at Proxet (ex-Rails Reactor), a custom software development company.
Today's cloud-native data lake architecture is designed for multiple users accessing the same data, each in isolation. The capacity to increase users and user workloads without degrading system performance for the most important jobs is vital, so resource allocation has been a focus of this generation of data lakes.
To make the most out of today's data lakes, a solution should be able to handle:
- Multiple clusters and data sharing
- Adding users without reducing performance
- Scaling compute and storage resources independently
- Data loading and querying simultaneously without a performance drop
AWS data lake formation: first steps
Without tools designed to assist with the process, data lake setup can take weeks. Such work includes:
- loading data from the different sources involved,
- monitoring data flows,
- setting up partitions,
- initializing encryption,
- defining and monitoring transformation jobs,
- reorganizing data,
- combing out and removing redundant data, and
- matching linked records.
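Much of the loading and cataloging work above is typically delegated to AWS Glue crawlers, which scan an S3 path and register the discovered tables in the Glue Data Catalog. The following is a hedged sketch in Python with boto3; the crawler name, IAM role ARN, database, and S3 path are all placeholders:

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Assemble arguments for glue.create_crawler: crawl an S3 path and
    write the discovered table definitions into a Glue Catalog database."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(name, role_arn, database, s3_path):
    """Create and start the crawler. Requires boto3 and AWS credentials."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**build_crawler_config(name, role_arn, database, s3_path))
    glue.start_crawler(Name=name)
```

Keeping the configuration in a plain builder function makes it easy to version-control and review crawler definitions alongside the rest of the ingestion pipeline.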
After all this is done and the lake is filled, there is still a lot of time required just on manual maintenance and updating tasks. Furthermore, you will also need to fine-tune permissions and access to datasets. Human access is only one part of this equation, as analytics and machine learning tools will also need to reach into the data lake.
Amazon faced the problem of creating and curating data lakes head-on. The result, AWS Lake Formation, was released to the general public in 2019 to minimize the time needed to create a data lake and to increase the consistency of the tagging applied as data is imported into the lake. Lake Formation helps cut the time needed to fill the lake from weeks to days. Moreover, the resulting lake is centralized, curated, and secured, keeping your data both in a prepared form for analysis and in its original state.
Lake Formation streamlines processes ranging from defining data sources to access and security. Once that is done, it aids in collecting and cataloging data and moving it into the Amazon Simple Storage Service (S3) data lake. You can also use ML algorithms to prepare and categorize data, as well as set security levels for sensitive data down to the cell level.
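For instance, granting a single analyst read access to one cataloged table goes through the Lake Formation permissions API. A minimal sketch with boto3 follows; the principal ARN, database, and table names are illustrative placeholders:

```python
def build_grant_request(principal_arn, database, table, permissions):
    """Assemble arguments for lakeformation.grant_permissions
    on a single Glue Catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }

def grant_table_access(principal_arn, database, table, permissions=("SELECT",)):
    """Apply the grant. Requires boto3 and AWS credentials."""
    import boto3
    lf = boto3.client("lakeformation")
    return lf.grant_permissions(
        **build_grant_request(principal_arn, database, table, permissions)
    )
```

Because grants are expressed against catalog entities rather than raw S3 paths, the same mechanism covers human users and the analytics or ML services that reach into the lake.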
Maintenance is also streamlined by the creation of a centralized data catalog with descriptions of the datasets available and their intended usages. Users can choose their own Amazon services for analytics and machine learning for data processing.
Current data lake solutions are more streamlined than their predecessors, but, as the following overview of a data lake in AWS shows, it's still quite an undertaking.
AWS creates a client data lake via AWS Lake Formation. AWS CloudFormation is then used to deploy the components that comprise the infrastructure supporting the data in the lake. A data lake API is then created, which uses Amazon API Gateway to provide access to data lake microservice functions related to data and data packages, search, and administration.
The microservices utilize the following for audit, management, and storage:
- Amazon S3 (storage)
- AWS Glue (compute resources management)
- Amazon Athena (query engine)
- Amazon DynamoDB (NoSQL DB service)
- Amazon OpenSearch Service (search results publication) and
- Amazon CloudWatch Logs (monitoring)
Access to the data pool is provided by an S3 bucket configured for static website hosting. CloudFront is used for content delivery for the Amazon data lake solution. User access to the console and data lake API is handled by Amazon Cognito.
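On the consumption side, Athena is the query layer: it runs SQL directly against the cataloged data in S3 and writes results back to a bucket. A minimal sketch, assuming boto3 and placeholder database and result-bucket names:

```python
def build_query_request(sql, database, output_s3):
    """Assemble arguments for athena.start_query_execution;
    query results are written to the given S3 location."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_query(sql, database, output_s3):
    """Submit the query and return its execution ID.
    Requires boto3 and AWS credentials."""
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        **build_query_request(sql, database, output_s3)
    )
    return resp["QueryExecutionId"]
```

The execution ID can then be polled for completion, which is how analytics dashboards and notebooks typically consume lake data without ever moving it out of S3.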
AWS data lake costs
Before cloud computing and systems like AWS data lakes, large-scale data accumulation and processing was done physically on-site in data warehouses. Cloud-based data warehouses are the norm these days, and the price of utilizing a data warehouse has dropped significantly. However, because data warehouses by definition contain data that is more highly cleaned and structured than in data lakes, the cost of the data there is higher than that in a data lake.
Keeping the cost of data in a data lake at a minimum, though, should not come at the expense of the quality of the organization of the data therein. Excessively prioritizing cost savings at this point is likely to turn your data lake into a data swamp, especially if the data being entered is very diverse.
AWS data lake pricing can be confusing for the uninitiated. There can also be confusion between the data lake itself and AWS Lake Formation, which has some free services. Lake Formation is used to speed up and simplify the process of creating the data lake, but it isn't required.
There are free tiers within AWS Lake Formation. For example, the first million objects stored in a month are stored for free, as are the first million requests; the second million requests cost $1. Other services have fees attached, and prices can vary by geography. Filtering data in the US East region, for example, costs $2.25 per TB.
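To make the arithmetic concrete, here is a back-of-the-envelope estimator for just those two line items (request charges and data filtering), using the US East figures quoted above. It is a sketch under those assumptions only; it ignores S3 storage, Glue, Athena, and data transfer charges, which usually dominate a real bill:

```python
FREE_REQUESTS = 1_000_000            # first million requests per month are free
PRICE_PER_MILLION_REQUESTS = 1.00    # USD per million requests beyond the free tier
FILTERING_PRICE_PER_TB = 2.25        # USD per TB filtered (US East)

def estimate_lake_formation_cost(requests_per_month, tb_filtered):
    """Rough monthly estimate for Lake Formation request and filtering fees."""
    billable = max(0, requests_per_month - FREE_REQUESTS)
    request_cost = billable / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    filtering_cost = tb_filtered * FILTERING_PRICE_PER_TB
    return round(request_cost + filtering_cost, 2)

# 2M requests and 4 TB filtered: $1 in requests + $9 in filtering
# estimate_lake_formation_cost(2_000_000, 4) → 10.0
```

Even a toy model like this makes the per-action pricing discussed below easier to reason about: each factor you can estimate up front becomes one term in the sum.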
When trying to plan and optimize data lake costs before creating one, there are several factors to take into consideration beyond the work listed above. In particular, try to derive good estimates for:
- the volume and variety of the data
- the degree of processing and transformation required
- analytics and storage requirements
- visualization and API access
- upkeep, especially managing metadata and the data catalog
Some of these points will not require paid services from AWS, but because the AWS model prices each action separately, knowing how much of each action is required is key to understanding the costs involved and to monitoring data lake usage to avoid cost inefficiencies. Moving data costs money, for example, so having to shift terabytes because of a poorly constructed data lake can be an expensive waste of funds.
AWS data lake vs other solutions
There are two questions to consider when weighing the creation of an AWS data lake. First, as mentioned previously, is a data lake the way to go at all? Second, if it is, which one suits best? The Amazon data lake offering is well tested by the market, but it is by no means the only one out there.
Data lake versus data warehouse versus lakehouse
We previously mentioned that data lakes are preferred over data warehouses when the data is more varied, and that highly curated data warehouses are faster than data lakes. However, as data warehouses become increasingly cloud-native, and as it becomes clear that data going into a lake needs at least some processing and transformation if data swamps are to be avoided, the differences between the two narrow. Indeed, a middle ground has appeared in the form of the lakehouse, which attempts to combine the speed and ease of operation that business analysts prize in data warehouses with the flexibility that data engineers need from data lakes. Where a given solution falls on this spectrum depends heavily on the type of user accessing the data and that user's goals.
Since data engineers and data scientists are finding positions even in seemingly unrelated fields, the probability that a data lake will provide a given enterprise the insight it needs is increasing. In that case, a comparison of data lakes is in order.
Other data lakes
Some data lakes on the market leverage AWS to optimize the data lake for specific use cases. This can push them closer to data warehouses or lakehouses. For example, Snowflake is a data platform optimized for speed as well as possessing serious data science and engineering chops. While known for its data warehouses, Snowflake's data handling solution covers the gamut from data warehouses to lakehouses and back to data lakes.
There are many non-AWS data lakes as well. Google BigLake, for example, is a data lake, though it is branded as more of a data lakehouse than the data lake AWS offers. The difference stems not only from the tools that the platforms connect to, but also from the opportunities that users have to work with the data.
In both cases, differentiating them from the Amazon data lake solution takes a close reading of the capabilities of the platforms.