Toxic Data is contaminating data lakes and data warehouses. How can you clean it up before it’s too late?

by | Oct 23, 2019 | Data privacy, Data Privacy Solutions, Data Privacy Technology, Data Science, Privacy blog

Data is the new oil. Understandably, over the past few years, organizations have been gathering larger and larger quantities of it. However, a reckoning is on the way. New regulations such as CCPA mean that most of this data carries an inherent risk that could disrupt organizations if it is not dealt with.

Toxic data lakes and data warehouses

Websites, apps, social media – they all form part of how organizations use the digital space to gather consumer information, and then use that information to generate better solutions and services. All of this information is being stored in data lakes and data warehouses. 

The big problem with storing all of this data is that the majority of it is personal information. And under the new privacy regulations, personal information has to be handled with special care. Mismanaging this information opens the door to fines that could reach nine figures, as well as to the loss of customer trust and revenue.

In light of the new era of privacy regulations, most of the data sitting in data lakes and data warehouses is highly toxic.

Unfortunately, organizations are having a hard time measuring their privacy exposure and adopting processes and technologies to control and reduce risk. The toxicity of data lakes and warehouses keeps rising, and it is a ticking time bomb.


Decontaminating before it is too late

Data governance has been the traditional way in which organizations have tried to control the risk exposure of their data assets. However, traditional data governance needs to evolve to cover the rise of privacy risk. 

Modern-day data governance must contain the following elements to be able to clean the data lakes and warehouses:

  • Comprehensive privacy risk measurement: Reducing privacy risk without being able to measure the risk is like flying a plane without instruments. Organizations need to be able to measure their privacy risk exposure as well as understand how each data consumer impacts this risk.


  • Privacy enhanced data discovery and classification: In order to measure and reduce privacy risk, organizations need to know what data they have. This discovery and classification need to incorporate privacy terminology to be effective in measuring privacy risk.


  • Variety of privacy-preserving techniques: Reducing privacy risk requires an understanding of how the data’s analytical value gets degraded. Utilising a variety of privacy techniques, like differential privacy and k-anonymity, allows organizations to reduce privacy risk while preserving analytical value.


  • Automatic policy enforcement: Making sure that the data coming in and out of the data lakes and warehouses complies with policy is a huge endeavour that can’t be done manually. Organizations need systems that support and automate policy enforcement.


  • Data governance reports: Knowing exactly who accessed what data is a must for any data governance process.
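
To make the measurement and privacy-technique elements above more concrete, here is a minimal sketch in Python. It assumes k-anonymity is used as the risk measure and age bucketing as the privacy-preserving technique. The records, the field names (age, zip, purchases) and the helper names k_anonymity and generalise_age are all hypothetical and chosen only for illustration; a real data lake would need a far more complete approach.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for the given quasi-identifier fields (0 if no records)."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalise_age(record, width=10):
    """Return a copy of the record with the exact age replaced by a range.
    Hypothetical helper: coarser ages lower risk but also lower precision."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Illustrative records only; these are not real data.
records = [
    {"age": 34, "zip": "94107", "purchases": 3},
    {"age": 36, "zip": "94107", "purchases": 7},
    {"age": 35, "zip": "94107", "purchases": 1},
    {"age": 52, "zip": "10001", "purchases": 4},
    {"age": 58, "zip": "10001", "purchases": 2},
]

# Before: every (age, zip) combination is unique, so k is 1.
raw_k = k_anonymity(records, ["age", "zip"])

# After: ages are bucketed into decades, so records hide in larger groups.
generalised = [generalise_age(r) for r in records]
generalised_k = k_anonymity(generalised, ["age", "zip"])

print(f"k before generalisation: {raw_k}")
print(f"k after generalisation:  {generalised_k}")
```

In this sketch, a higher k means each person is harder to single out. The purchases column is left untouched, which illustrates the trade-off between reducing privacy risk and preserving analytical value.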


Cleaning toxic data out of your data lake and warehouse is possible, as long as you implement data governance tools that are suited to understanding and managing the privacy risk inherent in your data assets.
