Data lakes are part of a fundamental shift in the storage, use and handling of data. Over the last 20 years, enterprises have increasingly relied on the availability of data for measuring performance and forecasting. Broadening access to data that at one time would have been siloed has changed the way that whole industries operate. However, this change has brought about a need for a different means of data management and curation. Without it, data lakes become data swamps, or amorphous jumbles of files.
Cloud computing is an integral part of most data lake operations. Even within a private data lake, cloud technology is vital to ensuring availability: it enables disaster recovery, for example, when one site goes down. As a consequence, the largest cloud storage companies have become leaders in offering cloud data lake services.
In this post, we will look at the data lake concept, what the cloud stores, and how cloud data lakes work. We will also cover why cloud data lakes matter going forward. Cloud data lakes take time and effort to set up, though the effort has lessened as new tools have become available. Still, setting up a data lake platform in the cloud requires planning and a long-term view of maintenance. Knowing what to expect, and why the effort can be worthwhile, can be the difference between creating a data lake and creating a swamp.
What is a cloud data lake?
Let's start with a data lake definition. Whether it lives in the cloud or at a single on-premises site, a data lake acts as storage for large quantities of data in a format-agnostic manner.
Implementations generally fit the AWS data lakes definition: "A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale."
Implementations that do not fit this definition are outside the scope of this article.
The contents of a data lake do not have to be structured; part of the idea of a data lake is being able to put data in 'as-is'. This lowers the data entry burden and speeds up ingestion. Data lakes can contain everything from structured database files to unstructured social media material to binary items such as audio and video content.
That said, 'as-is' is not the same as copy-pasting files. A clean data lake needs metadata tags and other markers for the files inside. These tags enable search systems to find data so that it can be shared within an enterprise. If the idea of a data lake is to provide access to users ranging from data scientists and data engineers to business analysts, for a variety of purposes, then making that data findable is vital. In this regard, data lakes differ from data warehouses: the latter usually take a more structured approach, and curation of processed data is generally a given.
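The role those tags play can be sketched with a toy in-memory catalog. All of the names below (`LakeObject`, `Catalog`, the sample paths and tags) are hypothetical, and a real lake would back this with a catalog service rather than a Python list; the point is only that ingestion records tags, and search uses them to make raw files findable.

```python
from dataclasses import dataclass, field

@dataclass
class LakeObject:
    """A raw file stored 'as-is', plus the metadata that keeps it findable."""
    path: str       # storage key, e.g. "raw/2024/06/clicks.json"
    format: str     # "json", "parquet", "mp3", ...
    tags: set = field(default_factory=set)

class Catalog:
    """Minimal in-memory catalog: ingestion registers objects, search finds them."""
    def __init__(self):
        self.objects = []

    def ingest(self, path, format, tags):
        self.objects.append(LakeObject(path, format, set(tags)))

    def search(self, *required_tags):
        """Return the paths of every object carrying all requested tags."""
        return [o.path for o in self.objects
                if set(required_tags) <= o.tags]

catalog = Catalog()
catalog.ingest("raw/2024/06/clicks.json", "json", {"marketing", "clickstream"})
catalog.ingest("raw/2024/06/call.mp3", "mp3", {"support", "audio"})
catalog.ingest("raw/2024/06/orders.parquet", "parquet", {"marketing", "sales"})

print(catalog.search("marketing"))
# Without tags, the only way to find these files would be to scan every path.
```

A business analyst can now ask for everything tagged "marketing" without knowing the storage layout, which is exactly the findability that separates a lake from a swamp.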
This difference in approach is part of why data lakes were at first seen as the successor to the data warehouse. As usage of both forms developed, however, their relative advantages became clear and their use cases diverged. Some overlap still exists, and hybrid forms of the two are emerging.
In short, data warehouses prioritize speed within a relatively limited scope. Amazon points out that data warehouses are used primarily for on-premises work, where speed is of the essence. To this end, data warehouses rely on SQL-based databases, data is curated before entry, and access is generally restricted.
Cloud Lake Technologies
The set of technologies that makes cloud-based data lakes possible arose from multiple parallel revolutions in the data realm during the 2000s and 2010s. The advent of cloud platforms on the one hand and NoSQL databases on the other, along with improvements in the speed and reliability of communications, laid the groundwork for implementing the cloud-based data lake concept.
Current cloud data lake services enable a variety of users to access the same data without interfering with each other's operations or changing the data involved. Scaling user numbers and workloads without a drop in system efficiency is vital, which makes resource allocation an essential selling point for all data lake vendors.
A cloud data lake platform needs to ensure the following:
- Data sharing across users and workloads
- Capacity for multiple compute clusters
- Performance that holds steady as user numbers grow
- Independent management of compute and storage resources
- Simultaneous data ingestion, query, manipulation, and storage without performance loss
From the biggest cloud storage companies to the smallest cloud-based private data lakes, the process of filling and maintaining a data lake remains essentially the same. Because setting up a data lake is complex, the range of services that help with the task has had to evolve as well. The AWS data lake service portfolio, for example, now helps:
- load data
- monitor data flow
- set up partitions
- initialize encryption
- define and monitor transformation jobs
- reorganize data
- find and remove redundant data, and
- match linked records.
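Two of the steps above, setting up partitions and removing redundant data, can be illustrated with a small stdlib-only sketch. The path layout and function names are illustrative assumptions, not any AWS API; the partition scheme shown is the common Hive-style `key=value` layout that query engines can prune by, and deduplication here is a simple content-hash pass.

```python
import hashlib
from datetime import date

def partition_key(dataset, day, filename):
    """Hive-style partition path, the kind of layout a query engine can prune by."""
    return f"{dataset}/year={day.year}/month={day.month:02d}/{filename}"

def deduplicate(records):
    """Drop records whose content hashes to something already seen."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

print(partition_key("clicks", date(2024, 6, 3), "part-0001.json"))
# clicks/year=2024/month=06/part-0001.json
print(deduplicate(["a,1", "b,2", "a,1"]))
# ['a,1', 'b,2']
```

Partitioning by date keeps queries from scanning the whole lake, and hashing content rather than comparing file names catches duplicates that arrive under different names.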
One criticism from the early days of data lakes was the amount of manual updating and maintenance they required. Access and permissions were not automated either, and access by other machines and analytics platforms complicated the issue further.
Cloud Data Lake Setup
Continuing with Amazon as an example, the company's response has evolved over time. The current solution is AWS Lake Formation, which became generally available in 2019. Lake Formation is designed to cut down the effort and time the initial workflow requires, to ease maintenance and, just as important, to make tagging more consistent. Consistent tagging is a necessity: without it, a data lake quickly becomes a data swamp. It also improves security and the accuracy of analysis, since the proper items can be reached, but only by the right people.
AWS Lake Formation does more than statically create a framework for the data lake. It handles the processes within the formation of the data lake and the actual data inflow. Furthermore, it helps users define data sources as well as user access and security. Lake Formation simplifies collecting and cataloging data, as well as moving it into the actual S3 data lake. Data preparation and categorization, including security, can be set via ML algorithms down to the cell level.
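The idea of fine-grained, tag-driven access can be sketched locally. This is not Lake Formation's API; the column tags, grant table, and `visible_row` function below are hypothetical, and real cell-level security is enforced by the service at query time. The sketch only shows the principle: columns carry tags, users carry grants, and a cell is visible only where the two intersect.

```python
# Hypothetical column-level tags and per-user grants (illustrative names only).
COLUMN_TAGS = {
    "customer_id": {"pii", "sales"},
    "email":       {"pii"},
    "order_total": {"sales", "finance"},
}

GRANTS = {
    "analyst": {"sales"},
    "auditor": {"finance"},
    "admin":   {"pii", "sales", "finance"},
}

def visible_row(user, row):
    """Return only the cells whose column tags intersect the user's grants."""
    allowed = GRANTS.get(user, set())
    return {col: val for col, val in row.items()
            if COLUMN_TAGS.get(col, set()) & allowed}

row = {"customer_id": 17, "email": "a@example.com", "order_total": 99.5}
print(visible_row("analyst", row))
# The analyst sees the 'sales'-tagged cells; the email column stays hidden.
```

Granting tags instead of listing tables one by one is what keeps permissions manageable as the lake grows: a new column tagged "pii" is protected the moment it is cataloged.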
Amazon streamlines maintenance through the creation and use of a unified data catalog containing dataset and usage descriptions. Unlike many of the upstream data lake services, users can choose their own Amazon services for analytics and machine learning for data processing.
AWS Lake Formation is first used to create the lake, and AWS CloudFormation then deploys the supporting infrastructure for the data that goes into it. Next, the data lake's API is created. Amazon API Gateway then gives users access to microservice functions for the lake's data and data packages, as well as the data lake's search and administration needs.
AWS uses the following to store, manage and audit within the data lake:
- S3 for storage
- Glue for data cataloging and transformation jobs
- Athena, a query engine
- DynamoDB, a NoSQL database service
- OpenSearch Service for publishing search results, and
- CloudWatch Logs for monitoring
An S3 bucket configured for static website hosting provides access to the data pool. CloudFront delivers content. Amazon Cognito manages user access to the console and data lake API.
Advantages of cloud data lakes
Cloud data lakes offer levels of availability that single-site storage cannot match commercially for most enterprises. This availability extends to disaster recovery, security, and maintenance. Likewise, the ability to hold a variety of file types, and to store data raw or only partially processed, gives data lakes an advantage over older forms of data storage. Taken together, the ability to share data easily, in any form, even under adverse conditions goes far toward explaining the popularity of cloud data lakes.
Cloud-based storage is best known through Amazon Web Services, which launched its S3 storage service to the public in 2006. Storage in the cloud makes the same data available from any one of multiple sites, which reduces a variety of threats. Physical disruption of service can be almost completely removed because multiple copies exist in multiple locations.
Using cloud helps an IT department meet the general 3-2-1 rule for disaster recovery, namely:
- keep three copies of the data,
- store them on at least two different media, and
- keep one copy off-site.
Enterprises whose backup needs tend more toward archiving may prefer tape or virtual tape storage, while a cloud data lake is better suited to companies that need to restore quickly to a recent point in time.
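The 3-2-1 rule is simple enough to encode as a compliance check. The function and the sample inventory below are illustrative assumptions; a real audit would pull the inventory from backup tooling rather than a hand-written list.

```python
def meets_321(copies):
    """Check the 3-2-1 rule: >=3 copies, >=2 media types, >=1 off-site copy."""
    media = {c["medium"] for c in copies}
    offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and offsite

backups = [
    {"medium": "local_disk", "offsite": False},
    {"medium": "tape",       "offsite": False},
    {"medium": "cloud_s3",   "offsite": True},   # the cloud data lake copy
]
print(meets_321(backups))      # True
print(meets_321(backups[:2]))  # False: only two copies, none off-site
```

Note how the cloud copy satisfies two requirements at once: it is a distinct storage medium, and it is off-site by construction.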
The major cloud data lake service providers are geared toward helping clients with disaster recovery. One example is AWS Elastic Disaster Recovery (AWS DRS), which continuously replicates source servers into AWS so that workloads can be recovered within minutes.
Disaster recovery is important for any enterprise, as the range of issues that the topic encompasses is broad. Events as large as a hurricane and as small as an accidental deletion all fit here. Regardless, a cloud data lake for disaster recovery can help minimize the downtime caused by the event.
Cloud data lakes offer a route beyond Security Information and Event Management (SIEM). As the variety and volume of data grow, along with the number of uses a single datum can undergo, the need to ensure that only the right people have access to that information, and that it is handled appropriately, will only rise.
"Cloud lakes offer a way forward with security that takes into account the needs of security admins as well as those in the data field and in different business units. Data is no longer just 'everywhere', it needs to be both available and protected. And the underlying technology has to keep pace with the dizzying pace of change in both business and tech. Cloud data lakes give us that flexibility."
— Vlad Medvedovsky, CEO at Proxet (formerly Rails Reactor), a custom software development company.
Servers don't live forever, and neither does the rest of the hardware in a data center. Servers tend to be replaced every three to five years, according to serverwatch.com. Your mileage may vary, but intensely used servers with added hard drives may need replacing even sooner.
And then there are failures. Virtual machines, like physical ones, can fail. OS problems, issues with file backups, communications outages: these are just some of the issues showing that virtual machines, for all their abilities, are not immune to stoppages and malfunctions.
Whether maintenance is planned or not, cloud data lakes make maintaining equipment easier because there is no single point of failure to disrupt business continuity. Even setting disaster recovery aside, cloud data lakes help IT admins plan further ahead and more clearly.
Cloud data lake technology has changed the way that enterprises from the smallest, locally-centered companies to the world's largest corporations take care of their data. With cloud data lakes, disaster recovery, maintenance planning and security all become more robust and flexible.
Cloud data lakes also promote access to data for any worker within an organization with a recognized need for it. The resulting increase in business intelligence from better and more timely data analysis has changed fields from chemistry and medicine to aerospace. As the variety, velocity, volume and vulnerability of data continue to grow, the storage and handling of that data must adjust accordingly. Cloud data lakes offer a solution for most enterprise needs in that realm.
Learn a step-by-step framework for constructing an optimal modern data stack — hear Proxet's CTO cover crucial elements like build vs buy choices, open source tools, typical mistakes, and how we can assist.