The Big Data Era Demands Privacy by Design

Most commodities come with a price. This applies to everything from tangible items like our morning coffee to intangible items such as our online banking transactions. And the commoditization of goods and services in today’s economy is evolving to the point where even personal privacy is associated with a price tag.

Increasingly, enterprises perceive big data and privacy as being in competition with one another: privacy comes with a price, and that price detracts from the profits of big data. However, this is a myth. Organizations don’t have to choose between the two. Privacy can be prioritised, and big data innovations can also prosper.

How? Through Privacy by Design (PbD). With PbD, companies can meet privacy compliance and legal requirements, while simultaneously creating useful data for analytics.


What is Big Data Analytics and why should it be concerned about privacy?


Big data is extremely powerful, due to its scale and its ability to process structured and unstructured data in real time and to produce linkages between seemingly unrelated, non-identifiable data. Big data analytics is characterized by five core components: the Five V’s of big data.

Big Data has the ability to process large volumes of a variety of both structured and unstructured data at high velocity (source). Additionally, the data is clean and accurate (veracity), and produces necessary value (source). Once the data is processed, data scientists and analysts are able to identify patterns, infer situations, predict behaviours, and understand trends to drive business decisions (source).

Let’s put into perspective the amount of data that is processed in the modern world. Every single day:

  1. 294 billion emails are sent;
  2. 500 million tweets are shared;
  3. 3.5 billion Google searches are conducted.

Currently, the digital universe contains 4.4 zettabytes of data (source).

The increasing amount of data generated creates concerns for consumers about how and what data is being processed. 88% of respondents in a Deloitte data ethics survey said they would cease their business relationship with an organization that uses their data unethically (source). Consumers are more likely to share their data if they know that the organization is handling their data with privacy in mind (source).

The more data organizations hold, and the more powerful their big data analytics capabilities become, the more they are exposed to unpredictable risks. These risks can include:

  • Data misuse and breach
  • Loss of consumer trust 
  • Fines for privacy-related non-compliance
  • Revenue loss


How can organizations sustain big data innovation while considering privacy?

There is no doubt that the scale and diversity of the big data value chain create challenges for privacy implementation. This is mainly because big data processing contradicts some of the core values of privacy preservation, such as minimization. Data minimization involves processing lean data, and collecting only relevant and necessary data for analysis. The power of big data, by contrast, lies in collecting and storing large volumes of rich information before it is used. There is an obvious tension here.

However, rather than treating data misuse and breaches as the inevitable result of a lack of privacy controls, organizations can proactively implement privacy controls as a default setting by adopting the PbD framework.

PbD was conceived in the 1990s by Dr. Ann Cavoukian, the Information and Privacy Commissioner of Ontario (source). The framework established that privacy should be integrated into all organizational processes, from technology to business processes and operations. What is beneficial about PbD is its potential to scale. It is designed so that even big data pipelines can readily implement the framework to ensure that data is processed in a way that does not jeopardize personal privacy.


How is this done? The step-by-step implementation guide:

PbD must be an integral part of the Big Data value chain:

  1. Data Acquisition/Collection
  2. Data Analysis
  3. Data Curation
  4. Data Storage
  5. Data Usage

The European Union Agency for Network and Information Security (ENISA) created a PbD Engineering Framework that is suited specifically for Big Data. The process involves implementing eight strategies (minimize, hide, separate, aggregate, inform, control, enforce, and demonstrate) throughout the big data value chain to allow for seamless privacy without sacrificing analytically valuable data (source).

Here are the PbD principles of behaviour that data controllers within organizations should adhere to:


Data Collection, Analysis, and Curation

  1. Minimize: Define what data needs to be collected. Avoid collecting data that serves no purpose for analytics.
  2. Aggregate: Implement anonymization techniques to remove all personally identifiable information (PII).
  3. Hide: Implement techniques such as encryption, identity masking, and secure file sharing that allow users to control what data is being processed.
  4. Inform: Inform all users about what data is collected to allow for transparency.
  5. Control: Implement opt-in measures, and make opt-out tools available throughout big data processing.
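
As a rough illustration of the Minimize and Hide strategies above, the Python sketch below keeps only an allow-list of fields and replaces the direct identifier with a keyed hash. The field names, the ALLOWED_FIELDS set, and the demo key are hypothetical and not part of the ENISA framework. Note also that keyed hashing is pseudonymization rather than anonymization, since anyone holding the key can re-link records.

```python
import hashlib
import hmac

# Hypothetical allow-list: the only fields this analytics use case needs.
ALLOWED_FIELDS = {"user_id", "country", "purchase_total"}

def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def minimize_record(record: dict, key: bytes) -> dict:
    """Drop fields outside the allow-list and pseudonymize the user ID."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    if "user_id" in kept:
        kept["user_id"] = pseudonymize(str(kept["user_id"]), key)
    return kept

raw = {
    "user_id": "u-1001",
    "email": "a@example.com",  # not needed for analytics, so it is dropped
    "country": "CA",
    "purchase_total": 42.5,
}
clean = minimize_record(raw, key=b"demo-key-not-for-production")
```

In practice the key would live in a secrets manager and the allow-list would be agreed per purpose, so that a new analysis cannot silently widen what is collected.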

Data Storage

  1. Separate: Keep data separate. Avoid central warehouses and instead run computations across different databases, using privacy-preserving analytics in distributed systems to protect personal data.

Data Use

  1. Aggregate: Consider the level of aggregation of metadata to avoid re-identification of individuals as well as to meet legal obligations. Implement privacy-preserving techniques like anonymization to mitigate the risk of potential re-identification.
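
One common way to act on the Aggregate strategy is to publish only group-level counts and to suppress groups that are too small to hide an individual. The sketch below is a minimal example under assumed names (`aggregate_with_suppression`, a `region` field, and a threshold of 5); the right threshold depends on the data and on legal advice.

```python
from collections import Counter

def aggregate_with_suppression(records, group_key, min_count=5):
    """Count records per group and drop groups smaller than min_count."""
    counts = Counter(r[group_key] for r in records)
    return {group: n for group, n in counts.items() if n >= min_count}

# Hypothetical data: seven records in one region, two in another.
records = [{"region": "North"}] * 7 + [{"region": "South"}] * 2
report = aggregate_with_suppression(records, "region", min_count=5)
```

Here the two-person "South" group is left out of the report, because a count that small could make re-identification easier.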

All Components

  1. Enforce and Demonstrate: Enforce a privacy policy that also meets legal requirements such as the GDPR, CCPA, and HIPAA, and demonstrate compliance with the policies set forth.

If this framework is applied, enterprises will no longer need to place a price tag on big data or privacy. When PbD is implemented across the organization, both can be achieved.

What is ‘Privacy by Default’ and why is it the future of data compliance?

If we approach data privacy as a fundamental right of individuals, then it must become a founding principle of innovation and technology. Privacy by Default addresses the increasing awareness of data privacy and ensures that businesses will consider consumer values throughout the product lifecycle.

What is Privacy by Default?

Privacy by Default is a principle of data protection designed to ensure that privacy is baked into the framework of new software, in an effort to provide data subjects with the highest level of protection. Put simply, Privacy by Default is the notion that active consent must be given for data handlers to access a subject’s information. Privacy by Default ensures that businesses are bound to uphold and consider privacy values. The approach holds businesses accountable for their actions and intentions.

To achieve Privacy by Default, data protection must be integrated throughout the product lifecycle, from design to implementation. This means that upon the release of a product or service to the public, the strictest privacy settings must be the default, without requiring action from the end-user. Further, any information required from the user in order to enable a product’s optimal use should only be kept for as long as is necessary to provide the product or service.

For example, when a consumer creates a new social media account and inputs their information, the default setting should be to keep their data private. While there may be places to include their birthday, gender, or location on the platform, users must opt-in, or provide consent, for this information to be shared.
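
The social media example can be sketched in a few lines of Python. The class and field names below (`ProfileVisibility`, `Account`, `opt_in`) are hypothetical; the point is only that every sharing flag starts as `False` and changes only through an explicit opt-in.

```python
from dataclasses import dataclass, field, fields

@dataclass
class ProfileVisibility:
    """Sharing flags for a hypothetical profile; everything starts private."""
    share_birthday: bool = False
    share_gender: bool = False
    share_location: bool = False

@dataclass
class Account:
    username: str
    visibility: ProfileVisibility = field(default_factory=ProfileVisibility)

    def opt_in(self, setting: str) -> None:
        """Enable sharing for one setting after explicit user consent."""
        valid = {f.name for f in fields(self.visibility)}
        if setting not in valid:
            raise ValueError(f"unknown setting: {setting}")
        setattr(self.visibility, setting, True)

account = Account("new_user")     # nothing is shared at creation
account.opt_in("share_location")  # the user chooses to share location only
```

Because the defaults live in the data model rather than in a settings page, a newly added field is private unless someone deliberately decides otherwise.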


Why is it expected?


Privacy by Default is not a new idea, but it is now a legal requirement under the GDPR (the EU General Data Protection Regulation). In Article 25, the GDPR spells out that:

The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. In particular, such measures shall ensure that by default personal data are not made accessible without the individual’s intervention to an indefinite number of natural persons.

This law reflects a shift in the importance society ascribes to privacy. Now more than ever, consumers expect businesses to protect their personal information. This was demonstrated in a 2019 study in the Journal of Consumer Policy, which determined that individuals believe that their privacy should be assumed and that it is the responsibility of the businesses they interact with to ensure this privacy.

Consumers expect their personal data to be processed carefully, transparently, and only for uses they consented to, by default. Now, under GDPR, businesses have no choice, as hefty regulatory penalties mean non-compliance will cost businesses their reputation, continuity, and profits. 

As a result, Privacy by Default forces businesses to rethink privacy: it is not a burden but a best practice. Now that privacy is considered a fundamental right of users, and a social value, it must become a founding principle of innovation and technology.


What does this mean for your business?


Privacy by Default is the future of data compliance and must become a business priority – not only in Europe, where it is mandated by law, but across the globe. The shift in privacy centricity signals that Privacy by Default will soon be mandated everywhere. Consequently, strategic investments should be made today to get ahead of the curve and demonstrate a privacy-first mindset to consumers. When you value their values, it will make a world of difference.

Under GDPR, organizations should enable users to manage their accounts so that they can define their permissions and determine what information they want to share and make usable to organizations. This means that data minimization is key to providing Privacy by Default, as only necessary personal data should be gathered. Moreover, data should only be stored for as long as is needed to fulfill the purpose for which it was collected, and deleted or anonymized after that time has passed.
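
A retention rule of this kind can be automated. The following is a minimal sketch, assuming a hypothetical one-year retention period and hypothetical `name` and `email` fields; stripping direct identifiers alone is usually not enough to count as anonymization, so treat this as an illustration of the mechanism only.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)         # hypothetical retention period
IDENTIFYING_FIELDS = {"name", "email"}  # hypothetical direct identifiers

def apply_retention(records, now, retention=RETENTION):
    """Strip identifying fields from records held longer than the retention period."""
    result = []
    for record in records:
        if now - record["collected_at"] > retention:
            record = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        result.append(record)
    return result

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    {"name": "Ann", "email": "ann@example.com", "plan": "pro",
     "collected_at": datetime(2022, 6, 1, tzinfo=timezone.utc)},
    {"name": "Bob", "email": "bob@example.com", "plan": "free",
     "collected_at": datetime(2023, 9, 1, tzinfo=timezone.utc)},
]
cleaned = apply_retention(records, now)
```

The first record is past retention and loses its identifying fields, while the second is kept intact until its own retention period ends.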

The key here is that data can be maintained so long as it is anonymized and not linked to the consumer in any way. This means that by investing in privacy automation solutions, your business can continue to derive the insights that matter to you, whilst reassuring and appealing to consumers with Privacy by Default. 

The future is private, but that does not mean it is business-depleting. Coupling Privacy by Default with anonymization is the competitive advantage your business needs.

Why Private Set Intersection (PSI), Differential Privacy (DP) and Secure Multi-Party Computation (SMC) are the future of data privacy

Privacy regulations like GDPR and CCPA are changing the way data is collected and used.  Data-driven organizations that use data collaboration to understand their customers and research organizations that rely on data collaboration to advance research are being restricted.  As more privacy regulations come online, what can organizations do to future-proof their use of data, whilst still adhering to privacy regulations?

Technology is now available that will allow organizations to continue to collaborate without ever exposing or moving the underlying data. 

Private Set Intersection (PSI) enables organizations to identify common individuals without revealing anything else. This is key to being able to properly organize data into a geometry that is ultimately consumable by computational algorithms.

Differential Privacy (DP) places mathematical guarantees on privacy in the presence of any amount of side information including knowledge about who is in the intersection. 

Secure Multi-Party Computation (SMC) enables organizations to jointly compute a function while keeping the inputs from being observed.  

All three of these are in fact a perfect combination of mathematical guarantees on how to do useful things with data while preserving privacy and intellectual property.

PSI, DP and SMC in action

Picture this: One data owner has information about cancer rates in the general population and another has information about food purchases over twenty years. A researcher is trying to understand how long-term patterns of food consumption might lead to cancer. To gain this understanding, they need to match food purchases with cancer diagnostics. They need to intersect data in the food purchase panel with the cancer diagnostics to build an attribution analysis. It is a requirement of the numerical algorithm that all the pieces line up appropriately.  PSI allows these data owners to find the commonality between the two data sets without revealing anything about the members that do not overlap. 
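
To make the PSI step more concrete, here is a toy Diffie-Hellman-style intersection in Python. It is a sketch under loud assumptions: both parties are simulated in one process, the modulus is far too small and too simple for real use, the identifiers are made up, and nothing here is claimed to be the protocol a production PSI product uses. The idea is that each side blinds hashed identifiers with its own secret exponent, and because exponentiation commutes, doubly blinded values match only for shared identifiers.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # Mersenne prime; toy-sized, NOT a secure parameter choice

def hash_to_group(item: str) -> int:
    """Map an item to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret() -> int:
    """Pick a random exponent coprime to P - 1 so blinding is a bijection."""
    while True:
        s = secrets.randbelow(P - 3) + 2
        if math.gcd(s, P - 1) == 1:
            return s

def blind(values, secret):
    """Raise each value to a party's secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]

# Hypothetical identifiers held by each data owner.
food_panel_ids = ["ann", "bob", "cara"]
registry_ids = ["bob", "cara", "dave"]
a, b = new_secret(), new_secret()  # each party keeps its own secret

panel_once = blind([hash_to_group(x) for x in food_panel_ids], a)   # panel -> registry
panel_twice = blind(panel_once, b)                                  # registry -> panel, same order
registry_once = blind([hash_to_group(x) for x in registry_ids], b)  # registry -> panel
registry_twice = set(blind(registry_once, a))                       # computed by the panel owner

common = [x for x, v in zip(food_panel_ids, panel_twice) if v in registry_twice]
```

The panel owner learns that "bob" and "cara" are shared, but sees only blinded values for "dave". A real deployment would use a vetted library, a proper elliptic-curve group, and shuffling of the returned values.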

At this point, Differential Privacy and Secure Multi-party Computation take over, as we compute the attribution between food and cancer diagnosis. Applying Differential Privacy will create uncertainty around the PSI operations. Even though all parties know with certainty who was included in the original problem formulation, applying Differential Privacy guarantees that the output of any analysis will be uncertain as to who was included in that analysis within certain probabilistic boundaries. 
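
A standard way to obtain this kind of guarantee for a simple count is the Laplace mechanism, sketched below. The function name `dp_count`, the example count, and the epsilon value are assumptions for illustration; the source does not say which mechanism is used, and a production system should rely on a vetted differential privacy library rather than hand-rolled floating-point noise.

```python
import random

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

rng = random.Random(0)  # seeded only to make the example repeatable
noisy = dp_count(1200, epsilon=0.5, rng=rng)
```

Smaller values of epsilon add more noise and give stronger privacy; larger values give more accurate answers and weaker privacy.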

Finally, the attribution analysis can take place using Secure Multi-party Computation. Secure Multi-party Computation never moves or exposes the underlying data but yields results that are consistent with co-locating the data. It is a very powerful approach that relies on secret shares protected with one-time pad encryption, a technique that is provably unbreakable when each pad is truly random and never reused. All the operations in the analysis are computed with Secure Multi-party Computation and require communication between the parties. The result is an attribution analysis that has been properly constructed without compromising data privacy, the IP of each data owner, or data residency requirements.
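
The secret-share idea can be illustrated with additive secret sharing, one of the simplest SMC building blocks. In this sketch (hypothetical values and names, both parties simulated in one process), each private number is split into random-looking shares, each party adds the shares it holds, and only the final sum is reconstructed. Each share is masked by a uniformly random value, which is what gives it the one-time-pad flavor described above.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical modulus; must exceed any possible sum

def share(value: int, n_parties: int) -> list[int]:
    """Split value into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Combine shares to recover the shared value."""
    return sum(shares) % MODULUS

# Two data owners each split a private count into two shares.
a_shares = share(1200, 2)
b_shares = share(345, 2)

# Each party adds the shares it holds locally, never seeing a raw input.
partial_sums = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

# Only the combined result is ever reconstructed.
total = reconstruct(partial_sums)
```

A full attribution analysis needs multiplication and comparison as well as addition, which require more involved protocols and the communication between parties that the article mentions.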

As regulations continue to evolve and threaten to clamp down on an organization’s ability to generate insights, new technology holds promise for not just organizations but also for consumers that demand privacy protection. Secure Multi-party Computation, Private Set Intersection, and Differential Privacy will make it possible for organizations to continue to generate insights and satisfy future privacy regulations thereby future-proofing their data.

Is your GDPR-restricted data toxic or clean?

Do your data lakes and warehouses contain personal information? Then you may have data that is toxic in the view of GDPR. If you have not obtained consent for every purpose that you plan to process data for, or haven’t anonymized the personal information, then under GDPR, your business has a significant exposure that could cost hundreds of millions.


When GDPR was implemented in May 2018, few businesses realized the impact it would have on data science and analytics. A year and a half in, the ramifications are indisputable. There have been more than €405 million in fines issued, and brands like British Airways have been irreparably harmed. Today, privacy infractions land on the front page, meaning data lakes pose a monumental threat to the longevity of your business.

The fact is, untold amounts of personal information are being collected, integrated, and stored in data lakes and data warehouses in almost every business. In many cases, this data is being stored for purposes beyond the original purpose for which it was collected.

In light of the new era of privacy regulations and legal compliance, most of the data sitting in data lakes and warehouses should be considered highly toxic for GDPR compliance.

Toxic data will result in regulatory penalties and a loss of consumer trust

Under GDPR, data controllers must establish a specific legal basis for each and every purpose for which personal data is used. If a business intends to use customer data for an additional purpose, then it must first obtain explicit consent from the individual.

As a result, all data in data lakes can only be made available for use after processes have been implemented to notify and request permission from every subject for every use case. This is impractical and unreasonable. Not only will it result in a mass of requests for data erasure, but it will slow and limit the benefits of data lakes. 

The risk is what we refer to as toxic data. This is identifiable data that you are processing in ways that you have not obtained consent for under GDPR. Left in a toxic state, your data lakes put your business at risk of fines of up to 4% of your annual global revenue.
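
For a sense of scale, the upper bound for the most serious infringements under Article 83(5) of the GDPR is the higher of EUR 20 million or 4% of total worldwide annual turnover. The helper below (a hypothetical function name, with a made-up turnover figure) computes that ceiling; actual fines are set case by case and are often far lower.

```python
def gdpr_max_fine_eur(annual_global_turnover_eur: float) -> float:
    """Ceiling under Article 83(5): the higher of EUR 20 million or 4% of turnover."""
    return max(20_000_000.0, 0.04 * annual_global_turnover_eur)

# Hypothetical company with EUR 1 billion in annual global turnover.
ceiling = gdpr_max_fine_eur(1_000_000_000)
```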

Worse yet, the European DPAs have been strict with their enforcement, leading to a flood of GDPR fines and a mass loss of customer confidence for many major data-driven companies. You need to act now before it is too late.

Anonymize your data to remove it from the scope of GDPR

Toxic data exposes your organization to significant business, operational, security, and compliance overheads and risks. Luckily, there is another way to clean your data lakes without undertaking the process of obtaining individual and meaningful consent: anonymize your data.

Rather than scramble to minimize data and update data inventory systems to comply, businesses should invest in automated defensible anonymization systems that can be implemented at an architectural point of control with regard to data lakes and warehouses.

Once data has been anonymized, it is no longer considered personal data. As such, it is no longer regulated by GDPR, and consent is not required to process it.
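
Whether a dataset really counts as anonymized has to be assessed, not assumed. One widely used yardstick is k-anonymity: after generalizing quasi-identifiers, every combination should be shared by at least k records. The sketch below uses hypothetical fields (`age`, `postal_code`) and a hypothetical generalization rule; k-anonymity on its own may not satisfy the legal standard for anonymization, so this is an illustration of the check rather than a compliance test.

```python
from collections import Counter

def generalize(record: dict) -> tuple:
    """Coarsen quasi-identifiers: 10-year age band and 3-character postal prefix."""
    age_band = (record["age"] // 10) * 10
    return (f"{age_band}-{age_band + 9}", record["postal_code"][:3])

def is_k_anonymous(records, k: int) -> bool:
    """True if every generalized quasi-identifier combination occurs at least k times."""
    groups = Counter(generalize(r) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical records: three people fall into the same generalized group.
records = [
    {"age": 31, "postal_code": "M5V2T6"},
    {"age": 35, "postal_code": "M5V1J1"},
    {"age": 38, "postal_code": "M5V3L9"},
]
safe = is_k_anonymous(records, k=3)

# Adding one person who is unique in their group breaks the property.
risky = is_k_anonymous(records + [{"age": 72, "postal_code": "K1A0B1"}], k=3)
```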

The impact on your business of using toxic data could be very damaging. If you want to leverage and monetize your data without risking violations and fines, you need to put it outside of the scope of GDPR. To do this, you need to decontaminate your data lakes.

Businesses essentially have two choices: 

(a) maintain the status quo and retain toxic information in data lakes and warehouses, or 

(b) anonymize your data using provable, automated, state-of-the-art solutions, so that GDPR is not applicable.

One option will save your brand reputation and bottom line. The other is a mass of expensive regulatory complications and litigation exposures.
