Six Things to Look for in Privacy Protection Software

This is the fourth blog in our Crash Course in Privacy series.


Enterprises want to:

  • Leverage their data assets
  • Comply with privacy regulations
  • Reduce the risk of exposing consumer information

To maintain data utility while protecting privacy, here is a list of six key things you should consider in data privacy software:

1) Allows you to understand the privacy risk of your data set

It is easy to think that removing information like names and IDs eliminates privacy risk. However, as the Netflix Prize case showed, a data set contains a lot of additional information that can be used to re-identify someone even after those fields have been removed. Therefore, it is important to know the probability of re-identification for individuals in your data set after you have applied privacy protection. Other, lesser-known types of privacy risk, such as membership disclosure and attribute disclosure, could also matter to you.

The software you use should help you understand and manage these risks.
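
As a rough illustration of what "probability of re-identification" can mean, the sketch below groups records by their quasi-identifier values and assigns each record a risk of 1 divided by the size of its group. This risk model, the field names, and the sample records are assumptions made for the example, not a description of how any particular product measures risk.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, average_risk) across all records.

    Assumed risk model: a record's risk is 1 / (number of records that
    share the same quasi-identifier values).
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    risks = [1 / class_sizes[key] for key in keys]
    return max(risks), sum(risks) / len(risks)


# Hypothetical records with names and IDs already removed.
records = [
    {"zip": "10001", "age": 34, "gender": "F"},
    {"zip": "10001", "age": 34, "gender": "F"},
    {"zip": "10001", "age": 51, "gender": "M"},
    {"zip": "10002", "age": 29, "gender": "M"},
]

max_risk, avg_risk = reidentification_risk(records, ["zip", "age", "gender"])
print(max_risk, avg_risk)
```

Two of the four records are unique on zip code, age, and gender, so their risk under this model is 1 even though no name or ID is present.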

2) Enables you to understand information loss and maintain the analytical value

Every time you apply anonymization techniques to your dataset, the information is transformed. This transformation either redacts, generalizes, or replaces the original data, causing some information loss. Depending on what the data will be used for, you need to be able to understand the impact on your data quality. Your data quality could vary widely even with the same privacy risk, so knowing this makes a huge difference when using privacy-protected data for analytics.

Software that helps you understand the information loss and maintain analytical value after de-identification is critical.
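
To make information loss concrete, here is a minimal sketch that generalizes ages into bands and scores the loss as the average band span divided by the span of the original values. The metric, the function names, and the sample ages are assumptions chosen for illustration; real tools may use different data quality metrics.

```python
def generalize_age(age, band_width):
    """Map an exact age to the inclusive (low, high) band that contains it."""
    low = (age // band_width) * band_width
    return low, low + band_width - 1


def information_loss(ages, band_width):
    """Average band span divided by the span of the original values.

    0.0 means exact values were kept; 1.0 means every value was replaced
    by a band at least as wide as the whole observed range.
    """
    domain_span = max(ages) - min(ages)
    if domain_span == 0:
        return 0.0
    losses = []
    for age in ages:
        low, high = generalize_age(age, band_width)
        losses.append(min((high - low) / domain_span, 1.0))
    return sum(losses) / len(losses)


ages = [23, 27, 34, 38, 41, 45, 52, 63]  # hypothetical data
loss_5 = information_loss(ages, band_width=5)
loss_20 = information_loss(ages, band_width=20)
print(loss_5, loss_20)
```

Wider bands give a higher score, which is the kind of quantifiable signal that lets you compare two transformations before using the data for analytics.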

3) Protects all attribute types

To achieve optimal privacy protection while balancing data quality, all data elements need to be classified appropriately. Misclassifying a data element (as an identifier, quasi-identifier, sensitive attribute, or insensitive attribute) could lead to insufficient privacy protection or excessive data quality loss.

The right privacy-protection software should support all four attribute types (identifier, quasi-identifier, sensitive, insensitive) and allow you to customize the classification of your data elements based on your needs.
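
A customizable classification could be represented along the following lines. This is only a sketch: the column names, the `CLASSIFICATION` mapping, and the helper functions are hypothetical and exist only to show the four attribute types side by side.

```python
from enum import Enum


class AttributeType(Enum):
    IDENTIFIER = "identifier"
    QUASI_IDENTIFIER = "quasi-identifier"
    SENSITIVE = "sensitive"
    INSENSITIVE = "insensitive"


# Hypothetical classification; adjust it per data set and use case.
CLASSIFICATION = {
    "name": AttributeType.IDENTIFIER,
    "zip": AttributeType.QUASI_IDENTIFIER,
    "age": AttributeType.QUASI_IDENTIFIER,
    "diagnosis": AttributeType.SENSITIVE,
    "visit_count": AttributeType.INSENSITIVE,
}


def unclassified_columns(columns, classification):
    """Return the columns that have no attribute type assigned yet."""
    return [c for c in columns if c not in classification]


def columns_of_type(classification, attribute_type):
    """Return the column names assigned to one attribute type, sorted."""
    return sorted(c for c, t in classification.items() if t is attribute_type)


print(unclassified_columns(["name", "email"], CLASSIFICATION))
print(columns_of_type(CLASSIFICATION, AttributeType.QUASI_IDENTIFIER))
```

Flagging unclassified columns before any transformation runs is one simple guard against the misclassification problem described above.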

To learn more about the data attributes, read “Why privacy is important.”

4) Supports a range of privacy techniques and is tunable

Each privacy technique has pros and cons depending on what the data will be used for. For example, masking protects a field well but removes its analytical value completely. You should look for software that supports a range of privacy protection techniques, with tunable parameters for each, so that you can find the right balance for your needs.

5) Applies consistent privacy policies

Satisfying privacy regulations is often a cumbersome, manual process. Being able to create privacy frameworks and share them across the organization, so that every team applies them the same way, is key. As a result, you should look for software that allows you and your team to apply consistent privacy policies.

6) Your data stays where you can protect it

You are looking to privacy-protect your data, so the software you use should work in the environment where you are already protecting your data. Using software that runs locally in your environment will remove an additional layer of risk.



Why Masking and Tokenization Are Not Enough

This is the third blog in our Crash Course in Privacy series.


Protecting consumer privacy is much more complex than just removing personally identifiable information (PII). Other types of information, such as quasi-identifiers, can re-identify individuals or expose sensitive information when combined. There are four types of information, called attributes, that are frequently referred to when applying privacy techniques:

  • Identifiers: Unique information that identifies a specific individual in a dataset. Examples of identifiers are names, social security numbers, and bank account numbers. Any field that is unique for each row is also an identifier.
  • Quasi-identifiers: Information that on its own is not sufficient to identify a specific individual but when combined with other quasi-identifiers makes it possible to re-identify an individual. Examples of quasi-identifiers are zip code, age, nationality, and gender.
  • Sensitive: Information that is more general among the population, making it difficult to identify an individual with it. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data. Let’s say we have a set of quasi-identifiers that form a group of men aged 40-50. A sensitive attribute could be “diagnosed with heart disease”. Without the quasi-identifiers, the probability of identifying who has heart disease is low, but once combined with the quasi-identifiers, the probability is high.
  • Insensitive: Information that is not identifying, quasi-identifying, or sensitive, and that you do not want to be transformed.
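
The heart disease example can be sketched in a few lines. The code below computes, for each quasi-identifier group, the share of members with a given sensitive value. The records, field names, and function name are made up for illustration.

```python
from collections import defaultdict


def attribute_disclosure_risk(records, quasi_identifiers, sensitive, value):
    """For each quasi-identifier group, return the share of members
    whose sensitive attribute equals `value`."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        totals[key] += 1
        if r[sensitive] == value:
            matches[key] += 1
    return {key: matches[key] / totals[key] for key in totals}


# Hypothetical data set: half of all records carry the diagnosis.
records = [
    {"gender": "M", "age_band": "40-50", "diagnosis": "heart disease"},
    {"gender": "M", "age_band": "40-50", "diagnosis": "heart disease"},
    {"gender": "M", "age_band": "40-50", "diagnosis": "heart disease"},
    {"gender": "M", "age_band": "40-50", "diagnosis": "none"},
    {"gender": "F", "age_band": "20-30", "diagnosis": "heart disease"},
    {"gender": "F", "age_band": "20-30", "diagnosis": "none"},
    {"gender": "F", "age_band": "20-30", "diagnosis": "none"},
    {"gender": "F", "age_band": "20-30", "diagnosis": "none"},
]

risk = attribute_disclosure_risk(
    records, ["gender", "age_band"], "diagnosis", "heart disease"
)
print(risk)
```

In this made-up data, knowing only that someone is in the data set suggests a 50% chance of the diagnosis, while also knowing that they are a man aged 40-50 raises it to 75%, without ever identifying which record is theirs.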

Apart from knowing the types of information that need to be protected, it is also important to know how privacy techniques affect data quality. There is always a trade-off between protecting privacy and retaining analytical value. The following is a review of some common privacy techniques:

  • Masking: Replaces existing information with other information that looks real but is of no use to anyone who might misuse it, and is not reversible. This approach is typically applied to identifying fields, such as name, credit card number, and social security number. The only masking techniques that sufficiently distort the identifying fields are suppression, randomization, and coding. These techniques render the masked data useless for analysis.
  • Tokenization: This is the process of encoding direct identifiers, like email addresses, into another value (a token) and storing the mapping between each token and its original value so that records can be relinked in the future. When businesses employ this technique, they leave the indirect identifiers (quasi-identifiers) as they are. Yet combining quasi-identifiers is a proven method of re-identifying individuals in a dataset.
  • k-anonymity: This technique transforms quasi-identifiers so that each group (also called an equivalence class) has at least k members, meaning every record is indistinguishable from at least k-1 others. Transformation occurs by generalizing and/or suppressing the quasi-identifiers. For example, if k is set to 5, then every group must contain at least 5 individuals. As k increases, the data becomes more general and the risk of re-identification falls, but analytical value falls as well. By balancing privacy risk with data quality, the resulting data can still be used for analysis.
  • Differential Privacy: This technique uses randomness to reduce the probability of determining whether a particular individual is in a dataset. One approach is to add random noise to aggregate results. For example, suppose two professors each publish a report, using data from different months, on the number of students with international parents. A smart student notices that the two counts differ by 1 and deduces that Joe, who dropped out last month, is the missing student. Now that student knows that Joe had international parents. If both professors had used differential privacy, each reported count would include random noise, so the difference between the reports would no longer reliably point to one student, making it very difficult to re-identify the missing student. There are many approaches to utilizing differential privacy. The most promising approaches provide significant privacy guarantees while maintaining the analytical value of the data.
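
As a simplified sketch of the k-anonymity idea described in the list above, the code below generalizes two quasi-identifiers (age into 10-year bands, zip code to a 3-digit prefix) and then suppresses any record whose group still has fewer than k members. The band width, prefix length, field names, and sample records are assumptions for the example. This is a naive generalize-then-suppress pass, not an optimal k-anonymity algorithm.

```python
from collections import Counter


def generalize(record, age_band=10, zip_digits=3):
    """Coarsen the quasi-identifiers: age into a band, zip code to a prefix."""
    low = (record["age"] // age_band) * age_band
    zip_code = record["zip"]
    return {
        **record,
        "age": f"{low}-{low + age_band - 1}",
        "zip": zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits),
    }


def k_anonymize(records, k, quasi_identifiers=("age", "zip")):
    """Generalize, then suppress records whose group has fewer than k members."""
    generalized = [generalize(r) for r in records]

    def group_key(r):
        return tuple(r[q] for q in quasi_identifiers)

    sizes = Counter(group_key(r) for r in generalized)
    return [r for r in generalized if sizes[group_key(r)] >= k]


# Hypothetical records; salary plays the role of the sensitive attribute.
records = [
    {"age": 41, "zip": "10001", "salary": 52000},
    {"age": 44, "zip": "10002", "salary": 61000},
    {"age": 47, "zip": "10009", "salary": 58000},
    {"age": 23, "zip": "94105", "salary": 70000},
]

released = k_anonymize(records, k=3)
for row in released:
    print(row)
```

With k set to 3, the three records that generalize to the same age band and zip prefix are released, and the single record that remains unique is suppressed. Raising k to 4 would suppress everything, which shows how quickly analytical value can fall as k rises.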

In order to protect consumer privacy and retain analytical value, it is important to choose the proper privacy technique for your desired application.



Why Protecting Sensitive Data is Important

This is the second blog in our Crash Course in Privacy series.


Privacy risk is the probability of extracting information about a specific individual in a data set. Organizations must protect the significant personal information they have from exposure.

Governments around the world have been very active in making sure that consumer privacy is protected by publishing regulations that dictate how the data must be handled and used. These regulations include HIPAA, GDPR, CCPA, and PIPEDA. The consequences of not complying with these regulations include fines, lawsuits, and reputational damage.

Organizations find themselves trying to answer this question:

How can I comply with privacy regulations & protect consumer privacy while leveraging my data assets for business purposes?

The answer is contained in the regulations:

  • HIPAA: The Health Insurance Portability and Accountability Act (HIPAA) is U.S. legislation that requires the protection of 18 specific identifiers, including names, Social Security numbers, and health insurance numbers. Once the dataset has been protected by anonymizing or de-identifying, it can be used for analysis. (Source)
  • GDPR: The General Data Protection Regulation is a privacy regulation that has to be observed by any organization that holds personal data about individuals in the European Union. GDPR contemplates two ways in which privacy can be protected: pseudonymization and anonymization. When a dataset is anonymized, GDPR no longer applies to it. (Source)
  • CCPA: The California Consumer Privacy Act regulates what each person’s rights are regarding their data. Specifically, CCPA is concerned with information that could reasonably be linked, directly or indirectly, with a particular consumer or household. Data that has been aggregated or de-identified is excluded from the CCPA. (Source)

In light of these regulations and consumer expectations for privacy protection, it is clear that organizations must enact privacy policies. Organizations need to embrace privacy and find a way to embed it into their analytic process if they want to extract value from sensitive data without facing any consequences.



Understanding the Difference Between Data Privacy and Data Security

This is the first blog in our Crash Course in Privacy series.


Privacy is all over the news these days, from Facebook scandals to European fines for failing to comply with GDPR. This is, in part, because protecting the privacy of your customers’ data is a complex issue that requires an understanding of two very important terms that are often used interchangeably: privacy and security.

“Data security refers to the protection of data from unauthorized access, use, change, disclosure, and destruction.” (Carnegie Mellon University) It encompasses network security, physical security, and file security. Some standard techniques to secure data are encryption, multi-factor authentication, and access controls. Encryption encodes data so that only authorized users can decrypt it with an encryption key. Multi-factor authentication requires users to provide two or more pieces of evidence that prove they have permission to access the data. Access controls restrict users’ ability to access data until they have provided the correct credentials. Creating a comprehensive data security policy is critical, but it is not sufficient because:

  • Breaches can occur when the standard techniques fail. For example, if the encryption key was obtained or if unauthorized access occurred, as was the case in the Marriott data breach.
  • The standard techniques for securing data make it difficult, and in some cases impossible, to extract analytical value from the data.
  • Analysis of encrypted data is not practical. Therefore, organizations decrypt, and the data becomes exposed during analysis.

Data Privacy involves protecting consumer data by eliminating or reducing the possibility of re-identifying an individual whose information is present in the data. This is done by either removing specific information or by transforming the data with random “noise” or generalization. Privacy regulations, like GDPR, refer to two different privacy measures that can be used to protect privacy:

  • Pseudonymization – a data management procedure by which personally identifiable information (PII) fields within a consumer’s data record are replaced by one or more artificial identifiers, or pseudonyms, which can be used at a later date to re-identify the record.
  • Anonymization – the process of removing any identifiable information from consumer data, such that individuals are no longer re-identifiable.
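
The difference between the two measures can be sketched as follows. Here pseudonymization replaces a field with a keyed hash and keeps a lookup table so the record can be re-identified later, while the anonymization helper simply drops the field and keeps no mapping. The use of HMAC-SHA256, the 16-character truncation, the key, and the field names are all assumptions for the example. Note that dropping a single field does not by itself make records anonymous, because quasi-identifiers may remain.

```python
import hashlib
import hmac


def pseudonymize(records, field, secret_key):
    """Replace `field` with a keyed-hash pseudonym and return a lookup table
    that maps each pseudonym back to the original value."""
    lookup = {}
    output = []
    for r in records:
        original = r[field]
        digest = hmac.new(secret_key, original.encode("utf-8"), hashlib.sha256)
        pseudonym = digest.hexdigest()[:16]
        lookup[pseudonym] = original
        output.append({**r, field: pseudonym})
    return output, lookup


def drop_field(records, field):
    """Remove `field` entirely; no lookup table is kept."""
    return [{k: v for k, v in r.items() if k != field} for r in records]


records = [{"email": "joe@example.com", "age": 45}]  # hypothetical record
pseudonymized, lookup = pseudonymize(records, "email", b"demo-key")
stripped = drop_field(records, "email")

print(pseudonymized[0]["email"])            # artificial identifier
print(lookup[pseudonymized[0]["email"]])    # original value, via the lookup table
print(stripped[0])                          # field removed, nothing to map back
```

Whoever holds the lookup table (or the key) can reverse the pseudonyms, which is why pseudonymized data is still treated as personal data while properly anonymized data is not.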

The key to managing data privacy is understanding the trade-off between protecting privacy and retaining analytical value. The techniques to protect privacy transform the original data by making it more general. The more general the data becomes, the less useful it is for analysis, but the more protected it is from re-identification. It is important to have a quantifiable measure of how these techniques impact the analytical value of your data.
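
One way to put a number on this trade-off, for the random noise approach mentioned earlier, is to measure how far a noisy result drifts from the true one. The sketch below adds Laplace noise to a count, a common choice in differential privacy, and reports the average absolute error for two settings of a privacy parameter. The parameter name `epsilon`, the noise scale of 1/epsilon, the true count of 100, and the helper names are assumptions for the example rather than anything prescribed by the text.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    exponential samples with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Counting query with Laplace noise of scale 1/epsilon (assumes that
    adding or removing one person changes the count by at most 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_absolute_error(true_count, epsilon, trials=10000, seed=0):
    """Average distance between the noisy and true counts over many trials."""
    rng = random.Random(seed)
    errors = [
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


error_strict = mean_absolute_error(100, epsilon=0.1)  # more noise, more privacy
error_loose = mean_absolute_error(100, epsilon=1.0)   # less noise, less privacy
print(error_strict, error_loose)
```

A smaller epsilon means more noise and stronger protection, but a larger average error in the reported count. Tracking a metric like this is one way to quantify the analytical cost of a given privacy setting.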

Traditionally, organizations have focused more on security than privacy, locking data behind passwords and access control. However, to fully protect the data, organizations need to consider a combination of privacy and security techniques that help them comply with regulations, protect privacy, reduce the risk of consumer exposure, and increase ROI on their digital strategies.



Announcing CN-Protect Free Downloadable Software for Privacy-Protection

We are pleased to announce the launch of CN-Protect as free, downloadable software to create privacy-protected data sets. We believe:

  • Protecting consumer privacy is paramount.
  • Satisfying privacy regulations such as HIPAA, GDPR, and CCPA should not sacrifice analytical value.
  • Data scientists, privacy officers, and legal teams should have the ability to easily ensure privacy.

Today’s businesses are faced with data breaches or misuse of consumer information on a regular basis. In response, governments have moved to protect their citizens through regulations like GDPR in Europe and CCPA in California. Organizations are scrambling to comply with these regulations without adversely impacting their business. However, there is no doubt that people’s privacy should not be compromised.

Current approaches to de-identify data such as masking, tokenization, and aggregation can leave data unprotected or without analytical value.

  • Data masking has no analytical use once applied to all values and, if not applied to all values, does not protect against re-identification. Data masking works by replacing existing sensitive information with information that looks real, but is of no use to anyone who might misuse it and is not reversible.
  • Tokenization removes all data utility of the tokenized fields, but re-identification is still possible through untokenized fields. Tokenization replaces sensitive information with a non-sensitive equivalent or a token which can be used to map back to the original data, but without access to the tokenization system, it is impossible to reverse.
  • Aggregation severely reduces the analytical value and if not done correctly can lead to re-identification. Data aggregation summarizes the data in a cumulative fashion such that any one individual is not re-identifiable. However, if the data does not contain enough samples, re-identification is still possible.

CN-Protect leverages AI and the most advanced anonymization techniques, such as optimal k-Anonymity and Differential Privacy, to protect your data and maintain analytical value. Furthermore, CN-Protect is easy to adopt, as a downloadable application or as a plug-in for your favourite data science platform.

With CN-Protect you can:

  • Comply with privacy regulations such as HIPAA, GDPR, and CCPA;
  • Create privacy-protected datasets while maintaining analytical value.

There are a variety of privacy models and data quality metrics available that you can choose from depending on your desired application. These privacy models use anonymization techniques to protect private information, while data quality metrics are used to balance those techniques against the analytical value of the data.

The following privacy models are available in CN-Protect:

  • Optimal k-Anonymity;
  • t-Closeness;
  • Differential Privacy;
  • and more.

You will be able to:

  • Specify parameters for the various privacy models that can be applied across your organization and fine-tuned for your many applications;
  • Define acceptable levels of privacy risk for your organization and the intended use of your data;
  • Get quantifiable metrics that you can use for compliance;
  • Understand the impact of privacy protection on your statistical and machine learning models.

Stay ahead of regulations and protect your data. Download CN-Protect now for a free trial!
