If you or your company works with data, chances are you have already heard of data lakes. Even if you don’t work with data directly, it is still likely that you have some sort of contact with data lakes at your work—just that you might not even realize it.
Over the last decade or so, the idea of big data has grown prominent in the business sector. Instead of guessing and speculating on how things are going, why not rely on some hard facts—in this case, data—to help us predict trends and make the right decisions?
However, data exists in many forms: photos, charts, PDF files of conference transcripts … The question is how to harness all of this data and make use of it systematically. This is probably why data lakes are getting popular these days.
But wait wait wait … let’s take a step back. What are data lakes exactly? What do they do? And are they really the best approach for data analysis?
No worries, we are going to explain everything.
Data Lakes for Dummies: Basics
If we are to define data lakes in simple terms, they can be seen as, well, a lake of data kept in its native format, from which data can be pulled out (or fished out, if you prefer to see it that way) whenever it is needed for analytic purposes. The data can be pictures, videos, audio recordings, PDF files … anything, really.
Data lakes are essentially data architectures. What does that mean? It means that they are a way to store and utilize data. Or in other words, they can be seen as a design pattern to house data for later use.
Still a bit confused? Let’s look at some solid examples, but simplified ones—now, pretend we are all back in school for a moment.
Tom’s Little Assignment
Our protagonist today is a 12-year-old boy called Tom. Unlucky for Tom, he was given an assignment on identifying the nationwide unemployment rate and suggesting possible solutions.
He started his research and came across a bunch of data—research notes scribbled on his napkins, audio recordings with those who are unemployed, and PDF reports he found on the government websites about the unemployment rate over the last ten years.
Now, imagine you have all these data—normally, you would store them in a hierarchical order. Something like this:
Homework -> Audio Recordings -> 2022 -> Detroit
Well, then what about the scribbled notes? In his case, he could probably type them manually into a text document. But that would be kind of annoying, wouldn’t it? On top of that, arranging every single file and fitting each one into the right folder can also be a tedious process.
So Tom said screw it and tossed everything onto his desktop because, luckily for him, his dad is a data scientist.
Fishing Out the Data
Below is a brief transcription of the conversation between Tom Sr. and Tom Jr.
— Dad, I need to know the cause of unemployment based on these data.
— Say no more, son.
So Tom Sr., being the genius that he is, wrote algorithms that fished out the data Tom Jr. had collected, specifically the data related to the causes of unemployment. After a day of work, he passed the processed data, along with a detailed analysis, to his son so he could hand in his homework.
Of course, this is a humorous over-simplification of data lakes, as real-world versions are a bit different and much more complex in nature, but we hope you got the general idea of how it works.
Now, onto the data lake architecture itself.
Data Lake Structure
Rather than structuring the data up front for later calculations and analysis, data lakes keep the data in its native form and use algorithms to pull out the meaningful data when an analysis calls for it, as loosely demonstrated by the example above.
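To make the "fishing" idea a little more concrete, here is a minimal Python sketch. It assumes the lake is just a folder of files kept in their original formats, and the file names, file contents, and the `fish_out` helper are all made up for illustration. Real data lakes use far more capable tooling, so treat this as a toy model rather than how any particular product works.

```python
import json
import tempfile
from pathlib import Path


def fish_out(lake_dir: Path, keyword: str) -> list[str]:
    """Return the names of files in the lake whose text mentions the keyword."""
    matches = []
    for path in sorted(lake_dir.rglob("*")):
        if not path.is_file():
            continue
        # Everything is read as plain text here; a real lake would need
        # format-specific readers for audio, images, PDFs, and so on.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if keyword.lower() in text.lower():
            matches.append(path.name)
    return matches


def build_demo_lake(root: Path) -> None:
    """Drop a few invented files into the lake as-is, with no common schema."""
    (root / "napkin_notes.txt").write_text(
        "Factory closed; layoffs cited as main cause.", encoding="utf-8"
    )
    (root / "interview_detroit.json").write_text(
        json.dumps({"city": "Detroit", "transcript": "I lost my job after the layoffs."}),
        encoding="utf-8",
    )
    (root / "report_2022.csv").write_text(
        "year,note\n2022,regional rates\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    build_demo_lake(lake)
    # Nothing was organized up front; relevance is decided at query time.
    relevant = fish_out(lake, "layoffs")
    print(relevant)
```

Notice that nothing was sorted into folders beforehand. The decision about what counts as relevant is made only when the question is asked.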
Structured vs. Unstructured Data
There are different forms of data, but when it comes to data structure and architecture, they can normally be separated into structured and unstructured data. Structured data refers to data that has been processed and formatted for predefined purposes, often in rows and columns; imagine electronic health records (EHRs) in hospitals, for example.
Meanwhile, unstructured data refers to data in their native format. In the case of medical care, that’d be scans of doctor’s notes, email communications, or X-ray images, among many others.
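The difference is easier to see side by side. The sketch below is illustrative only: the `VisitRecord` fields, the patient IDs, and the sample note are invented and do not come from any real health record system. It shows that structured data can be queried directly by field, while unstructured data needs an extraction step first.

```python
import re
from dataclasses import dataclass


@dataclass
class VisitRecord:
    """A structured row: every record has the same, predefined fields."""
    patient_id: str
    visit_year: int
    diagnosis: str


# Structured data: rows and columns, ready to filter by field.
structured = [
    VisitRecord("P001", 2022, "fracture"),
    VisitRecord("P002", 2022, "asthma"),
]

# Unstructured data: a free-text note kept exactly as it was written.
unstructured_note = "Pt P003 seen today, complains of wheezing. Likely asthma."

# Querying structured data is a simple field comparison.
asthma_structured = [r.patient_id for r in structured if r.diagnosis == "asthma"]

# Querying unstructured data means extracting the facts first
# (here with a crude pattern match and keyword check).
match = re.search(r"\bP\d{3}\b", unstructured_note)
patient_in_note = match.group(0) if match else None
mentions_asthma = "asthma" in unstructured_note.lower()

print(asthma_structured, patient_in_note, mentions_asthma)
```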
However, to give you a better understanding of data lake structures, perhaps it will be easier if we give you two other examples of data architecture.
The first is the database. One of the more basic forms of data storage, a database is an organized collection of data. Imagine the data in a sales system, where customer and order info is neatly entered and categorized for easy access.
Data warehouses, on the other hand, are large storage locations for data from multiple sources. They are mostly used for business analytics, helping corporations make informed decisions. By design, data warehouses are also filtered and more structured.
Take the data from various sources, organize and structure it, then use it to generate reports and analytics: you get the rough idea. Data warehouses are popular with medium-to-large-sized companies. While some companies develop their own data warehouses in-house (no pun intended), others outsource to companies that provide software development services.
Now, let’s take a look at the data lake structure.
So, what is data lake architecture? As we have explained before, data lakes can be seen as lakes of data in their native formats where researchers and data scientists can fish out the data they need for their analysis. Or as James Dixon, the man who named data lakes, put it:
"If you think of a datamart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
— James Dixon, Founder and CTO of Pentaho.
Of course, since data lakes take in all sorts of data, that technically includes structured data as well. The point is that no matter what kind of data it is, it can live in a data lake.
Data Warehouses vs. Data Lakes
By now you probably have a somewhat clear understanding of what data lakes are. However, what’s the point of data lakes? Why are there different types of data architectures? Why do some people choose data lakes as opposed to, say, data warehouses?
Flexibility is one of the major benefits of data lakes.
A simple comparison would be model planes vs. Legos. Data warehouses are like model planes, with every piece prebuilt with a purpose in mind. One simply needs to put the pieces together following a predetermined scheme, and voilà, here’s your result. Data warehouses only use data that has already been structured and processed: the data is fed into a system built around the user’s needs, and the results come out the other end.
Meanwhile, data lakes are like Legos in this case: You have all the pieces lying in front of you, and given that you have enough skills, you can build whatever you want. Need to build something else instead? Easy. Simply change the algorithms and you can utilize the data for an entirely different analysis. With data lakes, you can reconfigure everything as you please—and that’s one of the major advantages of data lakes.
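The Lego idea can be sketched in a few lines of Python. The example below assumes the lake holds raw JSON lines of differing shapes; the event types, cities, and amounts are invented for illustration. The same untouched records feed two unrelated analyses, and switching between them only means calling a different function, not restructuring the stored data.

```python
import json
from collections import Counter

# Raw events kept as they arrived; note that the fields differ per record.
raw_events = [
    '{"type": "order", "city": "Detroit", "amount": 40}',
    '{"type": "support_call", "city": "Detroit", "minutes": 12}',
    '{"type": "order", "city": "Austin", "amount": 25}',
]


def events_per_city(lines):
    """Analysis 1: how many events of any kind came from each city."""
    return Counter(json.loads(line)["city"] for line in lines)


def total_order_amount(lines):
    """Analysis 2: sum the amounts, looking only at order events."""
    total = 0
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "order":
            total += event.get("amount", 0)
    return total


# Same pile of bricks, two different builds.
print(events_per_city(raw_events))
print(total_order_amount(raw_events))
```

A warehouse would typically have decided on one table layout before loading the data. Here the structure is applied only at the moment each question is asked.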
Data Lake Use Cases
As we mentioned at the very beginning of this article, there are many data lake use cases—you might be interacting with them every day without realizing it.
Medical care – we actually have a full article covering the use of data lakes in the medical sector. Utilizing data lakes in clinical management not only simplifies the process but also allows for further research in the future with all the data gathered.
Marketing – understanding where the market goes is paramount in the marketing sector, and with the enormous amount of data available in real-time, data lakes provide the perfect solution for marketers to make the right strategic decisions on the spot.
Banking – we probably shouldn’t be surprised that the banking sector relies heavily on data lakes. From customer experiences to risk mitigation, data lakes are the ideal architecture for banking-related artificial intelligence (AI) analysis.
Insurance – similar to the banking sector, data lakes provide an ideal solution for insurance companies to run their AI analysis. With the amount of data available, tasks like risk assessment can be carried out with much higher accuracy.
"Data lakes are just an architecture—a tool. It’s about how we make use of them."
— Vlad Medvedovsky, CEO at Proxet (ex - Rails Reactor) - a custom software development company.
Data Lakes Challenges
That is not to say that data lakes don’t come with their own set of challenges. While they do offer some clear advantages over other architectures, there are still some obstacles that hinder their widespread use.
High Cost
While there are open-source tools for building data lakes (such as Hadoop), the high cost stems from labor and maintenance: hiring a data engineer to maintain the system is not cheap. Glassdoor estimates that the salary for a data lake engineer can reach USD 133,000 a year.
Lack of Expertise
The lack of experienced personnel who can run and maintain a data lake system also contributes to the rising cost of data lake solutions. In fact, there has been a constant shortage of experienced data engineers in the industry.
Rapid Data Growth
The amount of data held in data lakes grows rapidly compared to other solutions, to the point where storing and processing it can sometimes exceed what today’s infrastructure can handle. That also translates, in part, into rising costs, as companies need to pay more for the computing resources required.
Data Quality
As data lakes take in all sorts of data, the quality of that data can be questionable, which, in turn, affects the analysis. Good data engineers can identify and use the meaningful data, but unreliable and corrupt data still presents a problem in itself.
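One common way to soften this problem is to screen records before they reach the analysis. The sketch below is a simplified illustration, not a full data quality framework: the `split_by_quality` helper, the required field names, and the sample records are all assumptions made up for this example.

```python
import json


def split_by_quality(lines, required_fields):
    """Separate raw JSON lines into usable records and rejected lines."""
    usable, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Corrupt data: the line is not even valid JSON.
            rejected.append(line)
            continue
        has_all_fields = isinstance(record, dict) and all(
            record.get(field) not in (None, "") for field in required_fields
        )
        if has_all_fields:
            usable.append(record)
        else:
            # Unreliable data: a required field is missing or empty.
            rejected.append(line)
    return usable, rejected


raw = [
    '{"city": "Detroit", "status": "unemployed"}',  # fine
    '{"city": "", "status": "employed"}',           # empty city
    '{"city": "Austin"',                            # truncated JSON
    '{"status": "unemployed"}',                     # missing city
]

usable, rejected = split_by_quality(raw, ["city", "status"])
print(len(usable), "usable,", len(rejected), "rejected")
```

Keeping the rejected lines instead of silently dropping them lets an engineer inspect what went wrong at the source.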
Limited Access for Ordinary Users
This is more of an industry-wide challenge: due to the expertise required to run and maintain a data lake system, the architecture remains inaccessible to a majority of users who are looking to transform their data pipeline through data lakes. Despite the availability of open-source tools, it’s still difficult for ordinary users to run and maintain a data lake. This is why some businesses work with companies that provide software development services instead.
Future of Data Lakes
While it’d be difficult to speculate on how the industry will develop in the coming years, we are confident that many of these shortcomings will be overcome as the industry and technology develop further. With that in mind, we believe that data lakes will take on a more prominent role in big data and data science as a whole as they mature.
Like other novel technologies of their time, data lakes will likely become more accessible in the coming years. With the right applications, the advantages of data lakes, namely their flexibility and easy access to a large pool of data, including historical data, far outweigh those of the currently available alternatives.
As different industries are beginning to embrace big data in their business strategies, data lakes will become an indispensable tool in the process.
"Large enterprises tend to run into a problem with a data warehouse, where they're unable to integrate lots of their datasets into the data warehouse, and they're slowed down by those technical limitations of needing to apply the schema on write. If a company is frozen in that, that problem, then a data lake could be the solution for them."
— Michael Knopf, senior software engineer at TruSTAR, a cybersecurity platform vendor.
With a whopping estimated compound annual growth rate (CAGR) of 29.9 percent between 2021 and 2026, according to a report by Mordor Intelligence, the global data lake market is estimated to reach a staggering USD 17.60 billion by 2026. Based on the report, we have a strong belief that businesses will embrace data lakes as part of their strategies in the coming years.
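For readers who want to see what a CAGR figure implies, here is a small worked calculation. It assumes the 29.9 percent rate compounds once a year over the five years from 2021 to 2026, which is our reading of the figures above and not something we have checked against the report itself. The function names are ours, and the implied starting value is back-of-the-envelope arithmetic, not a number taken from the report.

```python
def compound_growth(start_value, annual_rate, years):
    """Value after growing at a fixed annual rate for the given number of years."""
    return start_value * (1 + annual_rate) ** years


def implied_start_value(end_value, annual_rate, years):
    """Work backwards from the end value to the starting value it implies."""
    return end_value / (1 + annual_rate) ** years


END_VALUE_USD_BN = 17.60  # estimated 2026 market size, from the text above
CAGR = 0.299              # 29.9 percent, from the text above
YEARS = 5                 # assumption: 2021 to 2026, compounded annually

implied_2021 = implied_start_value(END_VALUE_USD_BN, CAGR, YEARS)
print(f"Implied 2021 market size: USD {implied_2021:.2f} billion")

# Sanity check: growing the implied start value forward returns the 2026 figure.
print(f"Grown forward to 2026: USD {compound_growth(implied_2021, CAGR, YEARS):.2f} billion")
```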
Proxet is a software development firm with years of experience in developing flexible AI-based solutions utilizing data lakes. Instead of starting anew and developing a data lake system on your own, we have the experience to help create a tailor-made data lake solution for you, saving you the time and cost required to build one—contact us for more information today.