The Three Greatest Regulatory Threats to Your Data Lakes

Nov 13, 2019

Emerging privacy laws restrict the use of data lakes for analytics. But organizations that invest in privacy automation can keep using these valuable business resources for strategic operations and innovation.


Over the past five years, as businesses have increased their dependence on customer insights to make informed business decisions, the amount of data stored and processed in data lakes has risen to unprecedented levels. In parallel, privacy regulations have emerged across the globe. These regulations have limited the functionality of data lakes and turned the analytical process from a corporate asset into a business nightmare.

Under GDPR and CCPA, data generally cannot be used for purposes beyond those initially specified, which shuts off the flow of insights from data lakes. As a consequence, most data science and analytics activities fail to meet the standards of privacy regulations. Under GDPR, this can result in fines of up to €20 million or 4% of a business’s annual global revenue, whichever is higher.

However, businesses don’t need to choose between compliance and insights. Instead, a new mindset and approach should be adopted to meet both needs. To continue to thrive in the current regulatory climate, enterprises need to do three things:

  1. Anonymize data to preserve its use for analytics
  2. Manage the privacy governance strategy within the organization
  3. Apply privacy protection at scale to unlock data lakes


Anonymize data to preserve its use for analytics

While the restrictions vary slightly, privacy regulations worldwide establish that customer data should only be used for purposes that the subject is aware of and has given permission for. Under GDPR, for example, a business that intends to use customer data for an additional purpose must, in most cases, first obtain consent from the individual. As a result, data in data lakes can only be made available for a new use after processes have been implemented to notify and request permission from every subject for every use case. This is impractical and unreasonable. Not only will it trigger a mass of data erasure requests, but it will also slow and limit the benefits of data lakes.

Don’t get us wrong. We think protecting consumer privacy is important. We just think this is the wrong way to go about it.

Instead, businesses should anonymize or pseudonymize the data in their data lakes. Properly anonymized data falls outside the scope of regulations such as GDPR, while pseudonymized data remains in scope but is recognized as a safeguard that lowers risk. This will unlock data lakes and protect privacy, regaining the business advantage of customer insights while protecting individuals. It is the best of both worlds.
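To make the idea concrete, here is a minimal Python sketch of pseudonymization using a keyed hash from the standard library. The field names, the sample record, and the key are all hypothetical, and a real deployment would keep the key in a managed secret store separate from the data lake. On its own this only replaces direct identifiers; fields such as age remain and may still need further treatment.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be held in a key management
# system, stored separately from the data lake.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumed names of the columns that directly identify a person.
DIRECT_IDENTIFIERS = ("name", "email")


def pseudonymize(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of the record with direct identifiers replaced by keyed hashes."""
    out = dict(record)
    for field in DIRECT_IDENTIFIERS:
        if field in out:
            digest = hmac.new(key, str(out[field]).encode("utf-8"), hashlib.sha256)
            # Truncated hex digest used as a stable pseudonym for joins and counts.
            out[field] = digest.hexdigest()[:16]
    return out


record = {"name": "Ada Smith", "email": "ada@example.com", "age": 34, "spend": 120.5}
print(pseudonymize(record))
```

Because the same input and key always yield the same pseudonym, analysts can still join and count by customer without seeing the underlying name or email.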


Manage the privacy governance strategy within the organization

Across an organization, stakeholders often operate in isolation, pursuing their own objectives with individualized processes and tools. This has led to fragmentation between legal, risk and compliance, IT security, data science, and business teams. As a consequence, mismatched values have created friction between privacy protection and analytics priorities.

The solution is to implement an enterprise-wide privacy control system that generates quantifiable assessments of the re-identification risk and information loss. This enables businesses to set predetermined risk thresholds and optimize their compliance strategies for minimal information loss. By allowing companies to measure the balance of risk and loss, privacy stakeholder silos can be broken, and a balance can be found that ensures data lakes are privacy-compliant and valuable.
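As a rough illustration of what quantifiable assessments could look like, the Python sketch below scores a toy dataset. It assumes a k-anonymity style measure, where worst-case re-identification risk is one divided by the size of the smallest group sharing the same quasi-identifiers, and a deliberately crude information-loss proxy based on how wide the age bands are. The records, function names, and threshold are illustrative assumptions, not a description of any particular product.

```python
from collections import Counter

# Toy records of (age, postal_prefix); both act as quasi-identifiers.
# All values are made up for illustration.
RECORDS = [
    (34, "M5V"), (36, "M5V"), (35, "M5V"),
    (52, "K1A"), (58, "K1A"), (51, "K1A"),
]


def generalize(record, age_band):
    """Coarsen age into bands of width `age_band`; keep the postal prefix as-is."""
    age, postal = record
    return (age // age_band * age_band, postal)


def reidentification_risk(records):
    """Worst-case risk: 1 / size of the smallest group of identical records."""
    groups = Counter(records)
    return 1 / min(groups.values())


def information_loss(age_band, age_range=100):
    """Crude proxy: share of the assumed age range hidden inside one band."""
    return (age_band - 1) / (age_range - 1)


def choose_band(records, bands, max_risk):
    """Pick the narrowest band (least loss) whose risk meets the threshold."""
    for band in sorted(bands):
        generalized = [generalize(r, band) for r in records]
        if reidentification_risk(generalized) <= max_risk:
            return band
    return None  # no candidate band satisfies the threshold


band = choose_band(RECORDS, bands=[1, 5, 10, 20], max_risk=0.5)
print("chosen age band:", band)
print("information loss:", round(information_loss(band), 3))
```

For this toy data, a band of 10 years is the narrowest option that brings the worst-case risk under the 0.5 threshold. The point of the sketch is that once risk and loss are both numbers, legal and analytics teams can negotiate over the same threshold.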


Apply privacy protection at scale to unlock data lakes

Anonymization is not as simple as removing direct personal identifiers such as names. Nor is manual de-identification a viable approach to ensuring privacy compliance in data lakes. In fact, the volume and velocity at which data accumulates in data lakes make traditional methods of anonymization impossible. What’s more, without a quantifiable risk score, businesses can never be certain that their data is truly anonymized.

But applying blanket solutions like masking and tokenization strips the data of its analytical value. Most businesses struggle with this dilemma, but they do not need to. Through privacy automation, companies can ensure defensible anonymization is applied at scale.

Modern privacy automation solutions assess, quantify, and assure privacy protection by measuring the risk of re-identification. Then they apply advanced techniques such as differential privacy to the dataset to optimize for both privacy protection and preservation of analytical value.
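Differential privacy can be illustrated with the Laplace mechanism applied to a simple count. The Python sketch below is a minimal example under stated assumptions: the ages are invented, the function names are hypothetical, and the epsilon value is arbitrary. Production systems also track a cumulative privacy budget across queries, which is not shown here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Made-up ages; the analyst asks how many customers are over 40.
ages = [23, 45, 67, 31, 52, 48]
rng = random.Random()  # unseeded, so each run returns a different noisy answer
print(dp_count(ages, lambda age: age > 40, epsilon=1.0, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy, while a larger epsilon keeps the answer closer to the true count. That is the privacy versus analytical value trade-off described above, expressed as a single tunable parameter.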

The law provides clear guidance about using anonymization to meet privacy compliance, demanding the implementation of organizational and technical controls. Data-driven businesses should de-identify their data lakes by integrating privacy automation solutions into their governance framework and data pipelines. Such action will enable organizations to regain the value of their data lakes and remove the threat of regulatory fines and reputational damage.
