Don't let your data lakes decay into data swamps

Grace Liu, Senior VP of IT Strategy and Global Applications at Seagate Technology, suggests four ways to keep enterprise data lakes vibrant and insightful

Data remains an intangible asset on most companies' balance sheets - and its value is too often indeterminable and not fully tapped. Figures compiled by IDC for Seagate's Rethink Data report estimate only 32% of data available to enterprises is fully leveraged, with the remaining 68% untapped and unused.

To extract the most value from the data, companies are increasingly building cloud-based 'data lakes': platforms that centralise the storage of all types of data. A data lake provides elastic storage capacity and flexible I/O throughput, draws on many different data sources, and supports multiple computing and analytics engines.

A data lake can be hundreds of petabytes in size, or even larger. The big risk in any data lake project is therefore that, if left unattended, it turns into a data swamp: a repository where unleveraged yet potentially useful data sits dormant on storage media, a massive and mostly idle pool of 'sunk' data that is effectively inaccessible to end users - a waste.

To keep data lakes from morphing into swamps - and to keep them fresh, vibrant, and full of insights - CIOs, CTOs, and data architects should act on the following four points.


1. Have a clear vision of the business problem you're trying to solve
With a clearly defined objective in mind, it should be relatively straightforward to target the data you need to collect and the machine learning techniques best suited to gleaning insight from that data. Most business outcomes can then be improved by a matching investment in storage infrastructure.

In advertising, a data lake analytics engine can be used to target campaigns more accurately, reaching the right groups of people through the right channels. A data lake can handle data collection, storage, and analytics across the full life cycle of data management. By doing exactly that, Yeahmobi, a Chinese marketing services company, has reduced its overall operating costs by approximately 50%.

In the manufacturing sector, data lakes have been used to improve yield rates by combining artificial intelligence (AI) and deep learning algorithms with the relevant manufacturing parameters.

For the above to work effectively, new data must be introduced into the data lake continuously so that the right software applications can extract optimal results.
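As a minimal sketch of what that continuous introduction of new data can look like, assuming an Apache Spark environment and a hypothetical Parquet-based lake layout (the bucket names, paths, and schema here are illustrative only):

```python
# Minimal, illustrative sketch: appending a fresh batch of records to a
# Parquet-based data lake. Bucket names, paths, and the schema are
# hypothetical; a real pipeline would run this on a schedule or as a
# streaming job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest-example").getOrCreate()

# Read today's raw export (for example, campaign or sensor data).
new_batch = spark.read.json("s3a://example-landing-zone/2024-06-01/")

# Append it to the lake, partitioned by date so analytics engines can
# prune partitions when querying (assumes an event_date column exists).
(new_batch
    .write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://example-lake/events/"))
```

Partitioning by date is one common way to keep later queries and model-training runs fast as the lake grows.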

2. Capture and store all the information you can
Organisations need to be able to capture the right data, identify it, store it where it is needed, and provide it to decision makers in a usable way. Activating data - putting it to use - starts with data capture.

Given the overwhelming growth of data due to the proliferation of IoT applications and 5G deployments, enterprises cannot keep up and do not currently capture all available data. But increasingly enterprises are learning to capture and save as much data as they can in order not to miss out on its complete value: the value that's there today and the value that will come alive in the use cases of tomorrow. If the data is not stored, this value never materialises.

In the early days of data lakes, it was the power user who had the ability to dive in, swim in the lake, and find the right data. Nowadays, Structured Query Language (SQL) has made big inroads into the data lake and given ordinary users more access to the data. For these users the focus is more on outcomes, with AI and machine learning (ML) being introduced to sift through the data and look for patterns. ML now powers near real-time analytics, advanced analytics, and visualisation.
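As an illustration of this kind of SQL access, here is a minimal sketch assuming the same hypothetical Spark environment and an illustrative Parquet dataset of events in the lake (the names and schema are not from the article):

```python
# Minimal sketch: giving ordinary users SQL access to files in a data lake.
# The bucket, path, and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql-example").getOrCreate()

# Register the raw lake files as a queryable view.
events = spark.read.parquet("s3a://example-lake/events/")
events.createOrReplaceTempView("events")

# Analysts can now explore the lake with plain SQL rather than custom code.
daily_by_channel = spark.sql("""
    SELECT event_date, channel, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date, channel
    ORDER BY event_date, channel
""")
daily_by_channel.show()
```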

The data lake landscape has evolved rapidly, with the emphasis now much more on turning the right data into value.


"Data lakes need auditing and refreshing. Enterprises should review the datasets they're managing in a cloud-based data lake - or they will find that the data lake becomes increasingly harder (muddier) to use. Worse yet, the organisation's data scientists will find it more difficult - if not impossible - to find the patterns they're searching for in the data."
Transferring data to a well-managed cloud storage service helps companies move the data their businesses generate every day into a scalable data architecture. For example, Twitter transferred 300 PB of data to the Google Cloud Storage service. Moving that much data over the network took months to complete, but faster options, such as physical data transfer services, are available to businesses.

3. Periodically evaluate the data
Data lakes need auditing and refreshing. Enterprises should regularly review the datasets they're managing in a cloud-based data lake - or they will find the data lake becoming harder (muddier) to use. Worse yet, the organisation's data scientists will find it more difficult - if not impossible - to find the patterns they're searching for in the data.

The use of cloud storage services, along with AI and automation software, is expected to have the most impact on making massive data lakes more manageable; together they are the most practical way of ploughing through that much information. A good approach is to pick one data set, select a machine learning technique to work through it, and, once it has produced a favourable result, apply it to others. In fraud detection at a bank, for example, AI-based systems are designed to learn which transactions are likely to be fraudulent based on transaction frequency, transaction size, and the type of retailer involved.
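To make the fraud-detection example concrete, here is a minimal, illustrative sketch using an off-the-shelf anomaly-detection technique (scikit-learn's IsolationForest); the features mirror those mentioned above, but the data and settings are invented for illustration and are not a production model:

```python
# Minimal sketch: learning which transactions look fraudulent from features
# such as transaction frequency, transaction size, and retailer category.
# The numbers below are made up for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [transactions per day, transaction size, retailer category code]
transactions = np.array([
    [3, 42.50, 1],
    [2, 19.99, 4],
    [1, 75.00, 2],
    [40, 9800.00, 7],   # unusually frequent and large
    [4, 33.10, 1],
])

# Unsupervised anomaly detection: flag transactions unlike the rest.
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(transactions)   # -1 = anomalous, 1 = normal

for row, label in zip(transactions, labels):
    status = "suspicious" if label == -1 else "normal"
    print(row, status)
```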

Data that has aged or is no longer relevant can be transferred to another repository and retained there - you never know when that data may offer new, yet-undiscovered value. To move it, an enterprise can again use a data movement service designed to shift massive amounts of data across private, public, or hybrid-cloud environments. Such services deliver fast, simple, and secure edge storage and data transfer, which can accelerate time to insight.
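The article does not name a specific data movement service, but as a rough sketch of the underlying idea - moving aged objects out of the active lake into a separate archive repository - something like the following could be run against any S3-compatible storage (the bucket names and the 180-day cut-off are assumptions):

```python
# Minimal sketch: moving aged objects to a separate archive bucket using an
# S3-compatible API (boto3). Bucket names and the 180-day cut-off are
# illustrative; a commercial data movement service would handle this at
# far larger scale.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=180)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-data-lake"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            # Copy the aged object into the archive repository, then remove
            # it from the active lake.
            s3.copy_object(
                Bucket="example-archive",
                Key=obj["Key"],
                CopySource={"Bucket": "example-data-lake", "Key": obj["Key"]},
            )
            s3.delete_object(Bucket="example-data-lake", Key=obj["Key"])
```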

4. Engage mass data operations
Mass data operations, or DataOps, is defined by IDC as the discipline of connecting data creators with data consumers. DataOps should be part of every successful data management strategy. In addition to DataOps, a sound data management strategy includes data orchestration from endpoints to core, data architecture, and data security. The goal of data management is to give users a holistic view of data and enable them to access and derive optimal value from it: both data in motion and data at rest.

DERIVING VALUE FROM DATA
Businesses today are generating massive amounts of enterprise data, which is forecast to grow at an average annual rate of 42% from 2020 to 2022, according to the Rethink Data report.

A new Seagate-commissioned IDC survey (see URL below) found that enterprises frequently move this data among different storage locations, including endpoints, edge, and cloud. Of the more than one thousand businesses surveyed, over half move data between storage locations daily, weekly, or monthly. With the average physical data transfer exceeding 140TB, the faster businesses can move this data from edge to cloud, the quicker they can unlock insights and derive value from it.

Given the rapid pace of digitisation - accelerated in many cases by the pandemic - many organisations are gathering and managing more data than ever before. Cultivating vibrant and insightful data lakes will lay the groundwork for long-term success of enterprise data management strategies, in turn enabling the success of digital infrastructure and business initiatives.

More info: www.seagate.com/promos/future-proofing-storage-whitepaper/