
A beginner's guide to data lakes

What is a data lake?

James Dixon, CTO of the business intelligence software platform Pentaho, is believed to have coined the term data lake when he contrasted this form of storage with a data mart.

"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various lake users can come to examine, dive in, or take samples."

In short, a data lake is a storage repository that can accommodate a steady stream of incoming data, from multiple sources, in its original format. It can sit on-premises, in the cloud (with a provider such as Google, Microsoft, Oracle, or Amazon), or in a hybrid of the two.

Data lakes are typically built on Hadoop or other big data technologies that enable organizations to store large volumes of data cost-effectively.

What does a data lake do?

Fundamentally, a data lake holds data in its rawest form, without requiring it to be processed or analyzed first.

This data may come from relational sources (operational databases or line-of-business applications) or non-relational ones (mobile apps, IoT devices, and social media).

Once the data has been imported, people across your organization, such as data scientists, developers, and business analysts, can crawl, catalog, index, and analyze it without having to run it through a separate analytics system.
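To make the "store as-is, then catalog" idea concrete, here is a minimal sketch in Python. It is an illustration only: a local folder stands in for the lake, and the function names (ingest, build_catalog, demo), the raw/<source>/<date>/ layout, and the sample sources (orders_db, iot_sensors) are all assumptions made for this example, not part of any particular data lake product.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under raw/<source>/<date>/ (hypothetical layout)."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target = lake_root / "raw" / source / day / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, cleaning, or schema check on the way in
    return target


def build_catalog(lake_root: Path) -> list[dict]:
    """Crawl the lake and record each object's location, source, format, and size."""
    catalog = []
    for path in sorted((lake_root / "raw").rglob("*")):
        if path.is_file():
            rel = path.relative_to(lake_root)
            catalog.append({
                "path": rel.as_posix(),
                "source": rel.parts[1],  # the folder directly under raw/
                "format": path.suffix.lstrip(".") or "unknown",
                "bytes": path.stat().st_size,
            })
    return catalog


def demo(lake_root: Path) -> list[dict]:
    """Land one relational export and one non-relational event, then catalog both."""
    ingest(lake_root, "orders_db", "orders.csv", b"id,total\n1,9.99\n2,24.50\n")
    event = json.dumps({"device": "t-17", "temp_c": 21.4}).encode()
    ingest(lake_root, "iot_sensors", "reading.json", event)
    return build_catalog(lake_root)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for entry in demo(Path(tmp)):
            print(entry)
```

The point of the sketch is the ordering: files land untouched, and structure (here, a simple catalog) is worked out afterwards by whoever needs it. A production lake would typically use object storage and a dedicated catalog service rather than a local folder and a Python list.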

How could a data lake benefit my business?

Because the data is imported "as-is," it can be worked on by a wide range of applications, including:

  • big data processing,
  • data visualization,
  • machine-learning tools and AI.

This level of analytical agility can translate into substantial ROI: a survey by Aberdeen found that organizations with a data lake outperformed similar companies by 9% in organic revenue growth. Separately, MarketsandMarkets estimated that the market for data lakes would be worth almost $9bn by 2021.

Do I need a data lake?

This is a reasonable question to ask, given the doom-laden warnings about data lakes turning into "swamps" overflowing with petabytes of useless data.

However, according to an article on Forbes.com by Shant Hovsepian, co-founder and CTO of Arcadia Data, most organizations that use data lakes have positive things to say, particularly about their ability to enable non-technical users to analyze data.

Among the leading cheerleaders for the technology is Epic Games, which uses a data lake to store and analyze the colossal amount of data that Fortnite, one of the world's most popular games, generates from its game clients, servers, and services.

How do I secure information stored in this way?

The flexibility and agility of data lakes make them a potential security nightmare, especially from a regulatory compliance point of view. They allow you to dump data in its original format, they can become a sandbox in which analysts and developers can play, and they are often hosted in the cloud.

Authentication, access controls, and data encryption must be applied; all traffic to the lake must be secured and scrutinized; and the data must be backed up to guard against the risk of a ransomware attack.
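Two of these controls, access control and verified backups, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the READ_POLICY table, the role names, and the raw/ and curated/ prefixes are hypothetical, and the helper names (can_read, sha256_of, back_up) are invented for this example. Authentication, encryption, and traffic monitoring are not shown and would normally be handled by your storage platform.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical policy: the lake prefixes each role is allowed to read.
READ_POLICY = {
    "data_scientist": ("raw/", "curated/"),
    "business_analyst": ("curated/",),
}


def can_read(role: str, object_path: str) -> bool:
    """Deny by default; allow only paths under a prefix granted to the role."""
    prefixes = READ_POLICY.get(role, ())
    return any(object_path.startswith(prefix) for prefix in prefixes)


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large objects do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir and confirm the copy is byte-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / source.name
    shutil.copy2(source, copy)
    if sha256_of(copy) != sha256_of(source):
        raise IOError(f"backup of {source} failed verification")
    return copy
```

Note the deny-by-default choice in can_read: an unknown role gets no access rather than full access. The prefix check is deliberately naive (it does not normalize paths), so treat it as a teaching aid rather than a real authorization layer. For ransomware protection, backups would also need to live somewhere the lake's own credentials cannot overwrite.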
