The Three Greatest Regulatory Threats to Your Data Lakes

Emerging privacy laws restrict the use of data lakes for analytics. But organizations that invest in privacy automation can keep using these valuable business resources for strategic operations and innovation.


Over the past five years, as businesses have increased their dependence on customer insights to make informed business decisions, the amount of data stored and processed in data lakes has risen to unprecedented levels. In parallel, privacy regulations have emerged across the globe. This has limited the functionality of data lakes and turned the analytical process from a corporate asset into a business nightmare.

Under GDPR and CCPA, data cannot be used for purposes beyond those initially specified, which shuts off the flow of insights from data lakes. As a consequence, most data science and analytics actions fail to meet the standards of privacy regulations. Under GDPR, this can result in fines of up to 4% of a business’s annual global revenue.

However, businesses don’t need to choose between compliance and insights. Instead, a new mindset and approach should be adopted to meet both needs. To continue to thrive in the current regulatory climate, enterprises need to do three things:

  1. Anonymize data to preserve its use for analytics
  2. Manage the privacy governance strategy within the organization
  3. Apply privacy protection at scale to unlock data lakes


Anonymize data to preserve its use for analytics

While the restrictions vary slightly, privacy regulations worldwide establish that customer data should only be used for purposes that the subject is aware of and has given permission for. GDPR, for example, requires that a business that intends to use customer data for an additional purpose first obtain consent from the individual. As a result, data in data lakes can only be made available for use after processes have been implemented to notify and request permission from every subject for every use case. This is impractical and unreasonable. Not only will it result in a mass of requests for data erasure, but it will slow and limit the benefits of data lakes.

Don’t get us wrong. We think protecting consumer privacy is important. We just think this is the wrong way to go about it.

Instead, businesses should anonymize the data in their data lakes, which takes it out of the scope of privacy regulations. (Pseudonymization reduces risk, but under GDPR pseudonymized data is still personal data.) This will unlock data lakes and protect privacy, regaining the business advantage of customer insights while protecting individuals. The best of both worlds.


Manage the privacy governance strategy within the organization

Across an organization, stakeholders operate in isolation, pursuing their own objectives with individualized processes and tools. This has led to fragmentation between legal, risk and compliance, IT security, data science, and business teams. In consequence, a mismatch between values has led to dysfunction between privacy protection and analytics priorities. 

The solution is to implement an enterprise-wide privacy control system that generates quantifiable assessments of the re-identification risk and information loss. This enables businesses to set predetermined risk thresholds and optimize their compliance strategies for minimal information loss. By allowing companies to measure the balance of risk and loss, privacy stakeholder silos can be broken, and a balance can be found that ensures data lakes are privacy-compliant and valuable.
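To make "quantifiable assessments" concrete, here is a minimal sketch of one common way to score re-identification risk and information loss. It assumes a simple model in which a record's risk is one divided by the number of records sharing its quasi-identifier values, and it measures information loss as the fraction of records suppressed. The field names, the sample data, and the `RISK_THRESHOLD` value are all hypothetical; a real privacy control system would use richer risk and utility metrics.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk), where a record's risk is
    1 / (number of records sharing its quasi-identifier values)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    risks = [1.0 / class_sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)

def suppress_small_classes(records, quasi_identifiers, k):
    """Drop records whose equivalence class has fewer than k members.
    Returns the kept records and the information loss (fraction dropped)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    kept = [r for r, key in zip(records, keys) if class_sizes[key] >= k]
    loss = 1 - len(kept) / len(records)
    return kept, loss

# Hypothetical sample data: age_band and postcode act as quasi-identifiers.
records = [
    {"age_band": "30-39", "postcode": "M5V", "spend": 120},
    {"age_band": "30-39", "postcode": "M5V", "spend": 80},
    {"age_band": "40-49", "postcode": "M4C", "spend": 200},
    {"age_band": "40-49", "postcode": "M4C", "spend": 150},
    {"age_band": "50-59", "postcode": "M6G", "spend": 60},
]
QUASI_IDENTIFIERS = ["age_band", "postcode"]
RISK_THRESHOLD = 0.5  # hypothetical threshold agreed by privacy stakeholders

max_risk, avg_risk = reidentification_risk(records, QUASI_IDENTIFIERS)
compliant_before = max_risk <= RISK_THRESHOLD

# Apply a privacy action, then re-measure both sides of the trade-off.
protected, info_loss = suppress_small_classes(records, QUASI_IDENTIFIERS, k=2)
max_risk_after, _ = reidentification_risk(protected, QUASI_IDENTIFIERS)
compliant_after = max_risk_after <= RISK_THRESHOLD
```

Because risk and loss are both reported as numbers, legal, security, and data science teams can argue about a threshold rather than about each other's tools.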


Apply privacy protection at scale to unlock data lakes

Anonymization is not as simple as removing direct personal identifiers such as names. Nor is manual deidentification a viable approach to ensuring privacy compliance in data lakes. In fact, the volume and velocity at which data is accumulated in data lakes make traditional methods of anonymization impossible. What’s more, without a quantifiable risk score, businesses can never be certain that their data is truly anonymized.

But applying blanket solutions like masking and tokenization strips the data of its analytical value. This dilemma is something most businesses struggle with. However, there is no need. Through privacy automation, companies can ensure defensible anonymization is applied at scale. 

Modern privacy automation solutions assess, quantify, and assure privacy protection by measuring the risk of re-identification. Then they apply advanced techniques such as differential privacy to the dataset to optimize for privacy-protection and preservation of analytical value.
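As an illustration of the kind of technique mentioned here, the following is a minimal sketch of the Laplace mechanism, a standard building block of differential privacy, applied to a counting query. The function names, the sample data, and the choice of epsilon are assumptions made for the example; it is not a description of any particular product's implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.
    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng=None):
    """Answer 'how many values satisfy predicate?' with epsilon-differential privacy.
    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon."""
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: ages of six customers.
ages = [23, 35, 41, 52, 67, 38]
noisy_answer = private_count(ages, lambda a: a >= 40, epsilon=1.0)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon keeps more analytical value. That single parameter is the dial between protection and utility.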

The law provides clear guidance about using anonymization to meet privacy compliance, demanding the implementation of organizational and technical controls. Data-driven businesses should de-identify their data lakes by integrating privacy automation solutions into their governance framework and data pipelines. Such action will enable organizations to regain the value of their data lakes and remove the threat of regulatory fines and reputational damage.

All organizations need to be moving toward Privacy by Design

Organizations should think about privacy the same way they think about innovation, R&D, and other major organizational processes. Privacy isn’t a one-time compliance check; it’s an integral element to an organization’s functioning. 


What is Privacy By Design? 

Privacy by design (PbD) was developed in the 1990s to meet the increasing need for privacy assurance. PbD is a proactive approach to managing and preventing privacy-invasive events by making privacy an organization’s default operating system. This is achieved through privacy operations management, where IT systems, business practices, and networked data systems are built with privacy in mind from step one.


Why Should Organizations Implement PbD?

Automatically embedding privacy into your organization’s processes provides many benefits: strengthening customer trust, reducing the likelihood of future breaches, and reducing costs.


Strengthening Customer Trust

  • The seventh foundational principle of PbD emphasizes respect for user privacy. This translates into a privacy system that is completely customer-centric. Communicating to stakeholders that privacy is taken seriously, treating personal information with the utmost care, and committing to alignment with the Fair Information Practices (FIP) principles all increase customer trust in an organization. PbD makes it easy to demonstrate and prove how customers’ personal data is automatically safeguarded from privacy and security related threats. This approach signals organizational maturity, allowing for a competitive edge.

Reducing Future Breaches 

  • Neglecting privacy and categorizing it as a function that should be managed only when new or amended data privacy laws are enforced or when a data breach occurs is detrimental to an organization’s growth and increases risk. There will always be an element of organizational privacy risk, but that risk can be tremendously reduced by implementing a default privacy system. Such a system provides several benefits such as preventing privacy invasions before they happen, and allowing for seamless delivery of data privacy.

Cost Reduction

  • The average cost of a data breach is $8.9 million USD. That’s a lump sum that could have been allocated to more critical organizational needs, rather than to a breach that could have been prevented. PbD can eliminate unnecessary incident response costs while also avoiding penalties associated with noncompliance with data privacy laws. PbD is scalable and applicable to a wide variety of privacy frameworks (FIP, GAPP, APEC) and global privacy laws (GDPR, CCPA). By embedding PbD into an organization’s IT and networked data systems, privacy and compliance teams can rest assured that the risk of data breach is minimized, privacy laws are adhered to, and expenses are reduced.

PbD is a dire necessity that is critical to the future success of an organization. Understanding this, privacy risk prevention should be a top goal of all organizations and PbD is a proactive way to achieve it.

    Why privacy automation is the only route to CCPA de-identification compliance

    The volume and variety of big data are surpassing the functionality of traditional privacy management. With the California Consumer Privacy Act (CCPA) coming into effect on January 1, 2020, it is more critical than ever for every organization operating in California to make real changes in how they manage their data. The only viable solution is privacy automation.

    Traditional data privacy management approaches are slow, unscalable, and imperfect

    Across organizations, data drives results. Yet the velocity at which data is growing threatens to turn this “new oil” from a profit-driver into a fine-magnifier.

    Organizations are continuously collecting data in massive volumes, while data consumers utilize that information to perform their day to day jobs. This ceaseless cycle of data acquisition and analysis makes it almost impossible for organizations to monitor and manage all their data.

    Yet today, data privacy management is often performed manually, with a survey-based approach. These processes do not scale. Not only are they unreliable, but manual implementation slows down data analysis and has made it impossible to stay current with privacy regulations. On top of this, first-generation techniques such as encryption, masking and hashing no longer cut it. In consequence, privacy and compliance teams are seen to be preventing companies from unlocking their most valuable resource. 

    In reality, compliance is impossible with manual human review. It would be like cutting your lawn with a pair of scissors. 

    Privacy compliance requires a unified effort from the various departments and privacy-related stakeholders within an organization. This requires the right tools and processes.

    Now, with the CCPA coming into effect on January 1, 2020, organizations are being put to the test. For the first time, enterprises with operations in California will be held accountable to strict privacy regulations. There is an urgent need to build a manageable and effective data privacy strategy.

    Under the CCPA, personal data cannot be used for secondary purposes unless explicit notice and the opportunity to opt out have been provided to each user. These secondary purposes, like data science and monetization, are what make data so valuable – why risk opt-outs?

    If data has been de-identified or aggregated, it is no longer restricted. However, the standards for data classification as “de-identified or aggregated” are extremely high, and traditional methods of anonymization, like tokenization and hashing, will not cut it. It is only when advanced privacy techniques (differential privacy, k-anonymization) are applied correctly that data science and monetization can continue.
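One way to see why hashing alone falls short is a dictionary attack: when the identifier comes from a small, enumerable space, anyone can hash every candidate and match the results. The sketch below assumes unsalted SHA-256 and fictional phone numbers; the function names are made up for the example.

```python
import hashlib

def hash_identifier(value: str) -> str:
    """'Anonymize' an identifier with an unsalted SHA-256 hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def dictionary_attack(hashed_values, candidates):
    """Re-identify hashed values by hashing every plausible candidate."""
    lookup = {hash_identifier(c): c for c in candidates}
    return {h: lookup[h] for h in hashed_values if h in lookup}

# A dataset released with hashed phone numbers (fictional numbers).
released = [hash_identifier(p) for p in ["415-555-0101", "415-555-0199"]]

# An attacker enumerates every number in the same exchange: only 10,000 hashes.
candidates = [f"415-555-{n:04d}" for n in range(10000)]
recovered = dictionary_attack(released, candidates)
```

Every hashed value in `released` maps straight back to its original number, which is why a hashed identifier is better thought of as a pseudonym than as anonymized data.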

    As a result, the complex structures of the average organization require a single enterprise-wide, end-to-end, automated solution to meet data and privacy compliance regulations: Privacy Automation.


    Privacy automation: the only tool that can ensure CCPA compliance

    Privacy automation assesses, quantifies and assures privacy by measuring the risk of re-identification, applying privacy-protection techniques, and providing audit reports throughout the whole process. With AI and a combination of the most advanced privacy techniques, this solution will simplify the compliance process and allow for privacy rules definition, risk assessments, application of privacy actions, and compliance reporting to happen within a single application. This process is part of what is known as Privacy by Design and Privacy by Default.

    With Privacy Automation, metadata classification becomes possible. This lets you generate an automated and easy-to-understand privacy risk score.

    Automation extends enterprise-wide, harmonizing the needs of Risk and Compliance and data science teams, and ensuring regulations are abided by. This allows companies to unlock data in a manner that protects and adds value to consumers, and that is safer than manual privacy protection.

    With privacy automation, enterprises can leverage state-of-the-art solutions to innovate without limitation or fear. In consequence, it is the only tool that will realistically enable enterprises to become CCPA-compliant by January 2020.

    For more information, read our blog, The Business Incentives to Automate Privacy Compliance Under CCPA.

    Automated risk assessment tools are the future of data compliance

    Privacy regulations such as GDPR, CCPA, and LGPD require organizations to acquire consent in order to use their customers’ data for any purpose beyond the narrow one for which it was originally collected, unless that data has been anonymized.

    How do organizations know if their data has been properly anonymized, and how do they prove it?

    These two questions present a huge burden for enterprises, and answering them properly means implementing significant changes in the way they have been doing business. No longer can they process data internally, or release it for third-party use, without explicit consent. This is a huge and potentially paralysing change. 

    The first step that organizations need to take is to analyze their data to assess the risk of re-identification. They should know, beyond all doubt, the probability that their data could lead to the exposure of personally identifiable information. Once they have this knowledge, they can take appropriate actions to reduce the risk. The second step is to ensure that data that is de-identified retains analytical value, so that organizations can generate the insights they rely on for data science and data analytics. 

    But for many organizations, this process could take a long time, and cause a significant loss of revenue and competitive advantage. Having the ability to automatically assess the risk of re-identification, apply privacy actions, and retain analytical value will allow organizations to continue to grow and innovate while remaining compliant.


    How AI-driven attribute tagging enables powerful risk assessment

    In order to carry out proper risk assessment, you need your data to be correctly tagged. The attributes that must be tagged are direct identifiers and indirect or quasi-identifiers, both sensitive and non-sensitive. But tagging data manually is a time-consuming process. Automatic tagging greatly reduces costs, increases compliance, and allows organizations to stay ahead.

    Artificial intelligence can really help here. A neural net, for example, can be trained to recognize direct and indirect identifiers. Once the model is ready, it can be used to automatically tag your data. Better still, its understanding can evolve over time as your data changes.

    Once the data is properly tagged, a risk assessment can occur that takes into account these attributes. That risk assessment can then provide a metric that an organization can utilize to decide on the appropriate privacy actions.

    These privacy actions will reduce the risk of re-identification, but will also cause information loss. Therefore, these actions must consider the use of the data so that the right attributes retain the proper fidelity, while still reducing risk. The organization at this point can automate this process by recording the steps taken and then applying those same steps automatically for each additional dataset. Additionally, the actions can be different for different use cases and still enable an automatic process.
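The idea of recording the steps once and replaying them on each new dataset, with different steps per use case, can be sketched as a small recipe table. The action functions, the use-case names, and the record fields below are hypothetical and chosen only to illustrate the pattern.

```python
def drop_name(record):
    """Remove a direct identifier."""
    out = dict(record)
    out.pop("name", None)
    return out

def generalize_age(record):
    """Replace an exact age with a ten-year band (a quasi-identifier)."""
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    return out

def truncate_postcode(record):
    """Keep only the first three characters of the postcode."""
    out = dict(record)
    out["postcode"] = record["postcode"][:3]
    return out

# Recorded privacy actions, one ordered recipe per use case.
# The fraud recipe keeps exact age because that attribute needs full fidelity there.
RECIPES = {
    "marketing_analytics": [drop_name, generalize_age, truncate_postcode],
    "fraud_modelling": [drop_name, truncate_postcode],
}

def apply_recipe(dataset, use_case):
    """Replay the recorded steps for a use case on every record of a dataset."""
    steps = RECIPES[use_case]
    result = []
    for record in dataset:
        for step in steps:
            record = step(record)
        result.append(record)
    return result

new_batch = [{"name": "A. Customer", "age": 34, "postcode": "M5V 2T6"}]
marketing_view = apply_recipe(new_batch, "marketing_analytics")
fraud_view = apply_recipe(new_batch, "fraud_modelling")
```

Each action returns a copy, so the same incoming batch can feed several use cases without one recipe affecting another.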

    With these automated systems, an enterprise can implement “Privacy by Design.” Privacy regulations want to see this framework in business processes, in order to enforce compliance. Adopting this approach will ensure that your organization is ready for the future.

    Toxic Data is contaminating data lakes and data warehouses. How can you clean it up before it’s too late?

    Data is the new oil. Understandably, over the past few years, organizations have been gathering larger and larger quantities of it. However, a reckoning is on the way. New regulations such as CCPA mean that most of this data carries an inherent risk that could disrupt organizations if not dealt with.

    Toxic data lakes and data warehouses

    Websites, apps, social media – they all form part of how organizations use the digital space to gather consumer information, and then use that information to generate better solutions and services. All of this information is being stored in data lakes and data warehouses. 

    The big problem with storing all of this data is that the majority of it is personal information. And under the new privacy regulations, personal information has to be handled with special care. Mismanagement of this information opens the door to fines that could go up to nine digits, as well as to the loss of customer trust and revenue. 

    In light of the new era of privacy regulations, most of the data sitting in data lakes and data warehouses is highly toxic.

    Unfortunately, organizations are having a hard time measuring their privacy exposure and adopting processes and technologies to control and reduce risk. The toxicity of data lakes and warehouses keeps going up and is a ticking bomb waiting to explode.   


    Decontaminating before it is too late

    Data governance has been the traditional way in which organizations have tried to control the risk exposure of their data assets. However, traditional data governance needs to evolve to cover the rise of privacy risk. 

    Modern-day data governance must contain the following elements to be able to clean the data lakes and warehouses:

    • Provide a comprehensive privacy risk measure: Reducing privacy risk without being able to measure the risk is like flying a plane without instruments. Organizations need to be able to measure their privacy risk exposure as well as understand how each data consumer impacts this risk.


    • Privacy enhanced data discovery and classification: In order to measure and reduce privacy risk, organizations need to know what data they have. This discovery and classification need to incorporate privacy terminology to be effective in measuring privacy risk.


    • Variety of privacy-preserving techniques: Reducing privacy risk requires an understanding of how the data’s analytical value gets degraded. Utilising a variety of privacy techniques, like differential privacy and k-anonymity, allows organizations to reduce privacy risk while preserving analytical value.


    • Automatic policy enforcement: Making sure that the data coming in and out of the data lakes and warehouses complies with policy is a huge endeavour that can’t be done manually. Organizations need systems that support and automate policy enforcement.


    • Data governance reports: Knowing exactly who accessed what data is a must for any data governance process.


    Cleaning your data lake and warehouse from toxic data is possible as long as you implement data governance tools that are suited for understanding and managing the privacy risk inherent in your data assets. 

    Forget Third-party Datasets – the Future is Data Partnerships that Balance Compliance and Analytical Value

    Organizations are constantly gathering information from their customers. However, they are always driven to acquire extra data on top of this. Why? Because more data equals better insights into customers, and better ability to identify potential leads and cross-sell products. Historically, to acquire more data, organizations would purchase third-party datasets. Though these come with unique problems, such as occasionally poor data quality, the benefits used to outweigh the problems. 

    But not anymore. Unfortunately for organizations, since the introduction of the EU General Data Protection Regulation (GDPR), buying third-party data has become extremely risky. 

    GDPR has changed the way in which data is used and managed, by requiring customer consent in all scenarios other than those in which the intended use falls under a legitimate business interest. Since third-party data is acquired by the aggregator from other sources, in most cases, the aggregators don’t have the required consent from the customers. This puts any third-party data purchaser in a non-compliant situation that could expose them to fines, reputational damage, and additional overhead compliance costs.

    If organizations can no longer rely on third-party data, how can they maximize the value of the data they already have? 

    By changing their focus. 

    The importance of data partnerships and second-party data

    Instead of acquiring third-party data, organizations should establish data partnerships and access second-party data. This new approach has two main advantages. One, second-party data is the first-party data of another organization, so it is of high quality. Two, there are fewer concerns about customer consent, as the organization that owns this data has direct consent from the customer.

    That said, to establish a successful data partnership, there are three things that have to be taken into consideration: privacy protection, IP protection, and data analytical value.   

    Privacy Protection

    Even when customer consent is present, the data that is going to be shared should be privacy-protected in order to comply with GDPR, safeguard customer information, and reduce risk. Privacy protection should be understood as a reduction in the probability of re-identifying a specific individual in a dataset. GDPR, as well as other privacy regulations, refers to anonymization as the maximum level of privacy protection, wherein an individual can no longer be re-identified.

    Privacy protection can be achieved with different techniques. Common approaches include differential privacy, encryption, the adding of “noise,” and suppression. Regardless of which privacy technique is applied, it is important to always measure the risk of re-identification of the data.

    IP (Intellectual Property) Protection

    Some organizations are okay with selling their data. Others are very reluctant, because they understand that once the data is sold, they can no longer control it and its value and IP are lost. IP control is a big barrier when trying to establish data partnerships.

    Fortunately, there is a way to establish data partnerships and ensure that IP remains protected.

    Recent advances in cryptographic techniques have made it possible to collaborate with data partners and extract insights without having to expose the raw data. The first of these techniques is called Secure Multiparty Computation.

    As its name implies, with Secure Multiparty Computation, multiple parties can perform computations on their datasets as if they were collocated but without revealing any of the original data to any of the parties. The second technique is Fully Homomorphic Encryption. With this technique, data is encrypted in a way in which computations can be performed without the need for decrypting the data. 

    Because the original raw data is never exposed across partners, both of these advanced techniques allow organizations to augment their data, extract insights and protect IP safely and securely.
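To give a feel for how Secure Multiparty Computation can work, here is a minimal sketch of additive secret sharing, one of the simplest building blocks used in such protocols. Three partners compute a joint total without any of them seeing another's input. Everything runs in a single process purely for illustration; the modulus, the function names, and the sample figures are assumptions, and a real deployment would add networking, authentication, and protection against dishonest parties.

```python
import secrets

PRIME = 2**61 - 1  # modulus for the shares; an illustrative choice

def share(value, n_parties):
    """Split a value into n random-looking shares that sum to it modulo PRIME.
    Any subset of fewer than n shares reveals nothing about the value."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values):
    """Simulate a joint sum: party i sends share j of its value to party j.
    Each party only ever sees one share from every other party."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    return sum(partial_sums) % PRIME

# Hypothetical inputs: each partner's private count of customers in a segment.
partner_counts = [120, 45, 300]
joint_total = secure_sum(partner_counts)
```

The partners learn the combined total, which is the insight they wanted, while each individual count stays with its owner.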

    Analytical Value

    The objective of any data partnership is to acquire more insights into customers and prospects. For this reason, any additional data that is acquired needs to add analytical value. But maintaining this value becomes difficult when organizations need to preserve privacy and IP protection. 

    Fortunately, there is a solution. Firstly, organizations should identify common individuals in both datasets. This is extremely important, because you want to acquire data that adds value. By using Secure Multiparty Computation, the data can be matched and common individuals identified, without exposing any of the sensitive original data. 

    Secondly, organizations must use software that balances privacy and information loss. Without this, the resulting data will be high on privacy protection and extremely low on analytical value, making it useless for extracting insights.

    Thanks to the new privacy regulations sweeping the world, acquiring third-party datasets has become extremely risky and costly. Organizations should change their strategy and engage in data partnerships that will provide them with higher quality data. However, for these partnerships to add real value, privacy and IP have to be protected, and data has to maintain its analytical value.

    For more about CryptoNumerics’ privacy automation solutions, read our blog here.
