Transactional data: A privacy nightmare and what to do about it


Our everyday actions, like buying a morning coffee or taking the train, all create a digital trail of our lives. We as humans tend to fall into individual habits, taking the same routes every day, eating at the same restaurants on certain nights. We create a unique fingerprint through our routine actions. These ‘fingerprints’ make it very easy to predict our next moves.

And with machine learning technologies advancing rapidly, companies are increasingly able to do exactly that.

Here is where transactional data comes in. This data relates to the transactions of an organization and includes information that is captured, for example, when a product is sold or purchased. It is collected across a wide variety of industries, including financial services, transportation, and retail, to name a few.

These collections of information paint a picture of your entire life, online and offline.


Transactional data is everywhere, in everything


Transactional data provides a constant flow of information and is necessary for maintaining a company’s competitive edge, deepening client insight, and improving customer experience.

Each purchase, click, and online movement falls under the umbrella of transactional data. These records, together with the time and location attached to them, provide access to our real-life behaviors and movements.

Thanks to transactional data, companies can provide customers with a personalized experience. This can be a good thing. For example, banks are significant participants in building client profiles from transactional data. Each purchase we make with our bank cards establishes spending patterns. Having AI detect and learn from our purchase habits can help detect fraud or credit card theft.

However, transactional data constitutes a colossal privacy exposure that is exceptionally difficult to control. For example, perhaps you take an Uber to work and back home. If this happens only once, it does not represent a significant risk; however, doing it every day creates a pattern that can reveal several aspects of who you are, where you live, or where you spend your time.

Because of this, a company that removes a personal identifier such as a name may appear to be safeguarding user data. However, these patterns of information can be combined to re-identify a person without using personal identifiers. An attacker could discover a place of work, a stop for coffee, and a house address without having to know the person’s name.
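To make the point concrete, here is a minimal sketch in Python. The ride records, the rider id, and the "morning commute" rule are all hypothetical, invented for illustration; no name appears anywhere in the data, yet a likely home and workplace fall out of a simple frequency count.

```python
from collections import Counter

# Hypothetical ride records: (pseudonymous rider id, pickup hour, pickup area, drop-off area).
# No name or other direct identifier is present.
TRIPS = [
    ("rider_17", 8, "Maple St", "Bay Tower"),
    ("rider_17", 18, "Bay Tower", "Maple St"),
    ("rider_17", 8, "Maple St", "Bay Tower"),
    ("rider_17", 18, "Bay Tower", "Maple St"),
    ("rider_17", 21, "Maple St", "Cafe Row"),
    ("rider_17", 8, "Maple St", "Bay Tower"),
]


def infer_home_and_work(trips, rider):
    """Guess home and work as the most common morning pickup and drop-off."""
    morning = [t for t in trips if t[0] == rider and 6 <= t[1] < 10]
    if not morning:
        return None, None
    home = Counter(t[2] for t in morning).most_common(1)[0][0]
    work = Counter(t[3] for t in morning).most_common(1)[0][0]
    return home, work


home, work = infer_home_and_work(TRIPS, "rider_17")
print(f"likely home: {home}, likely work: {work}")
```

A real attacker would work with far messier data, but the principle is the same: the repetition itself is the identifier.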

These extensive collections of our information are not protected to the extent they should be. If companies know and are using such detailed information, how is it not protected to the point of no risk? 


Protecting Transactional data


Transactional data will keep growing as IoT becomes more prevalent. As mentioned before, reducing the privacy risk of a dataset that contains transactional data is challenging. It is not just about applying different privacy protection techniques but also about understanding how the rows relate to one another, because the most crucial aspect is to preserve the analytical value.

At CryptoNumerics, we have developed a way to solve this problem. By leveraging CN-Protect and our technical expertise, we are helping telematics companies, as well as companies in the finance sector, reduce the risk of re-identification in their transactional datasets. 



Masking is killing data science


When it comes to data science, the trade-off for protecting data while keeping its value appears near impossible. And with the introduction of privacy legislation like the California Consumer Privacy Act (CCPA), this trade-off makes the job even harder.

Methods such as data masking appear to be the standard option, with privacy risk landing at almost 0%. However, with information loss potentially exceeding 50%, the opportunity for data analytics vanishes.

Data Masking is a lost battle

Data masking is a de-identification technique that focuses on the redaction or transformation of information within a dataset to prevent exposure. The information in the resulting dataset is of low quality. This technique is not enough to move a company forward in innovation.

Companies need to protect the privacy of their consumer data. However, they also need to preserve the value of the data for analytical uses.

Masking fails to address how data works today and how a business benefits from it. Consumer data is beneficial to all aspects of an organization and creates a better experience for the customer. Failing to utilize and protect the datasets leaves your company behind in innovation and consumer satisfaction.

Privacy-protection that preserves analytical value

Data scientists need to be able to control the trade-off, and the only way to do it is by using “smart” optimization solutions.

A “smart” optimization solution is one that can modify the data in different ways using privacy risk and analytical value as its optimization functions. With a solution like this, a data scientist gets a dataset that is both optimized for analytics and privacy compliant: the best of both worlds.
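As a rough illustration of what treating privacy risk and analytical value as optimization functions can look like, the Python sketch below searches a small grid of generalization settings and keeps the least lossy one whose risk stays under a threshold. Everything here is an assumption made for the example: the records, the candidate settings, the risk measure (1 divided by the size of the smallest group), and the crude loss score. It is not a description of how any particular product works.

```python
from collections import Counter
from itertools import product

# Hypothetical records: (age, zipcode). Both act as quasi-identifiers.
RECORDS = [
    (23, "94107"), (27, "94105"), (25, "94103"), (29, "94109"),
    (41, "94301"), (44, "94305"), (47, "94303"), (49, "94306"),
]

AGE_WIDTHS = [1, 5, 10, 20]   # candidate age bucket sizes (1 = unchanged)
ZIP_DIGITS = [5, 4, 3, 2]     # candidate number of leading zipcode digits kept


def generalize(record, age_width, zip_digits):
    """Coarsen one record to the given bucket width and zipcode prefix."""
    age, zipcode = record
    return (age // age_width * age_width, zipcode[:zip_digits])


def risk(records, age_width, zip_digits):
    """Worst-case re-identification risk: 1 / size of the smallest group."""
    groups = Counter(generalize(r, age_width, zip_digits) for r in records)
    return 1 / min(groups.values())


def info_loss(age_width, zip_digits):
    """Crude loss score in [0, 1]: how far each setting is from 'unchanged'."""
    age_loss = AGE_WIDTHS.index(age_width) / (len(AGE_WIDTHS) - 1)
    zip_loss = ZIP_DIGITS.index(zip_digits) / (len(ZIP_DIGITS) - 1)
    return (age_loss + zip_loss) / 2


def best_setting(records, max_risk):
    """Least lossy (loss, age_width, zip_digits) with risk <= max_risk, or None."""
    feasible = [
        (info_loss(w, d), w, d)
        for w, d in product(AGE_WIDTHS, ZIP_DIGITS)
        if risk(records, w, d) <= max_risk
    ]
    return min(feasible) if feasible else None


print(best_setting(RECORDS, max_risk=0.25))
```

A rule-based masking approach would pick one fixed transformation up front; the point of the search is that the least destructive transformation depends on the data and on the risk target.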

Smart Optimization vs Masking

Let’s look at the impact that both privacy-protection solutions have on a machine learning algorithm.

For this example, we want to predict loan default risk using a random forest model. The model is going to be run on three datasets:

  • In the clear: The original dataset without any privacy transformations.
  • Masked dataset: Transformation of the original dataset using standard rule-based masking techniques.
  • Optimized dataset: Transformation of the original dataset using a smart optimization solution.


The dataset has 11 variables:

  • Age
  • Sex
  • Job
  • Housing
  • Saving Account Balance
  • Checking Account Balance
  • Credit Account Balance
  • Duration
  • Purpose
  • Zipcode
  • Risk

Let’s compare the results.

Running the model with the original dataset gave us an accuracy of 93%; however, the risk of re-identification is 100%. When we used the masked data, the model accuracy dropped to 28%. Since there were 5 risk levels, this is barely better than random guessing (20%). On the positive side, the risk of re-identification is 0%. Lastly, the accuracy with the optimized dataset was 87%, a drop of only 6 points vs. the original data. Additionally, the risk of re-identification was only 3%.

While having a 0% privacy risk is appealing, the loss in accuracy makes masking worthless for analytic purposes.
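The "barely better than random" comparison can be checked with a few lines of Python. The accuracy figures are the ones reported above; the baseline assumes the 5 risk levels are equally likely, which the example does not state, so a majority-class baseline on the real data could be higher.

```python
RISK_LEVELS = 5              # number of classes, from the example above
CHANCE = 1 / RISK_LEVELS     # accuracy of a uniform random guess (assumes balanced classes)

# Accuracy figures as reported in the example above.
REPORTED = {"in the clear": 0.93, "masked": 0.28, "optimized": 0.87}


def lift_over_chance(accuracy, baseline=CHANCE):
    """Percentage points gained over random guessing."""
    return round((accuracy - baseline) * 100, 1)


for name, accuracy in REPORTED.items():
    print(f"{name}: {lift_over_chance(accuracy)} points above chance")
```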

This example highlights why masking is killing data science, and why organizations need to implement smart optimization solutions, like CryptoNumerics’ CN-Protect, that reduce the risk of re-identification while preserving the analytical value of the data.

Gaining a competitive edge in your industry means utilizing consumer data. By adequately protecting that data without massive information loss, you keep its value high, and that can take your company far.




Processing personal data through anonymization methods


Companies are becoming increasingly reliant on user data to understand consumers better and improve performance. But with the rise of new privacy legislation and the growing concerns for personal data security, ensuring that your company is checking all the boxes in privacy protection is more critical than ever.

By utilizing different privacy-protecting techniques, organizations can protect consumer information while extracting value at the same time. These techniques include masking, k-anonymity, and differential privacy.

By understanding the potential and the challenges of these techniques, companies can process personal data so that it is not re-identifiable.

Let’s look at the three privacy-protection techniques mentioned before.

Masking is the process of replacing the values in a dataset with different values that, in many cases, resemble the structure of the original value. Unfortunately, masking tends to destroy the analytical value of data, since the relationships between values are affected by the replacement.

The ideal use case for masking is in DevOps environments where there is a need for data, but the analytical value is irrelevant. 

The objective of k-anonymity is to reduce privacy risk by grouping individual records into “cohorts.” Grouping is achieved by using generalization (substitution of a specific value with a more general value) and suppression (removal of values) so that records sharing the same quasi-identifiers (QIDs) become indistinguishable from one another. The k value defines the minimum number of records in each group; the higher the value, the higher the level of data protection.

While k-anonymity reduces the analytical value of the data, it still preserves enough value for data scientists to perform analytics and machine learning using the dataset.
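A minimal Python sketch of generalization and suppression follows. The records, the choice of age decade and three-digit zipcode prefix as generalization rules, and the function names are all hypothetical; real tools search for the generalization that costs the least information.

```python
from collections import Counter

# Hypothetical records: (age, zipcode, diagnosis). Age and zipcode are the QIDs.
PATIENTS = [
    (34, "10001", "flu"),
    (36, "10002", "asthma"),
    (31, "10009", "flu"),
    (52, "10451", "diabetes"),
    (58, "10453", "flu"),
    (67, "11201", "asthma"),
]


def generalize(record):
    """Generalization: replace age with its decade and zipcode with a 3-digit prefix."""
    age, zipcode, sensitive = record
    decade = age // 10 * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**", sensitive)


def k_anonymize(records, k):
    """Generalize every record, then suppress records in cohorts smaller than k."""
    generalized = [generalize(r) for r in records]
    sizes = Counter((age, zipcode) for age, zipcode, _ in generalized)
    return [r for r in generalized if sizes[(r[0], r[1])] >= k]


def smallest_cohort(records):
    """Size of the smallest group of records sharing the same QID values."""
    sizes = Counter((age, zipcode) for age, zipcode, _ in records)
    return min(sizes.values()) if sizes else 0


protected = k_anonymize(PATIENTS, k=2)
print(len(protected), "records kept, smallest cohort:", smallest_cohort(protected))
```

Raising k to 3 in this toy dataset suppresses the two-person cohort as well, which shows the trade-off directly: stronger protection, fewer usable records.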

Differential privacy is a privacy technique that provides a mathematical guarantee on how much information can be extracted about an individual.

Differential privacy uses a technique called perturbation, which adds random noise to the point where it becomes incredibly difficult to know with certainty whether a specific individual is present in a dataset.

Differential privacy is one of the most promising privacy techniques; however, it is best suited to large datasets, because the noise that perturbation adds to a small dataset can destroy its analytical value.
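The dataset-size effect can be seen with the Laplace mechanism, a standard way to perturb a counting query. The sketch below is illustrative only: the epsilon value, the two counts, and the function names are assumptions chosen for the example. The noise has the same absolute size in both cases, so it swamps a small count and barely moves a large one.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng):
    """Noisy answer to a counting query (sensitivity 1, so scale = 1 / epsilon)."""
    return true_count + laplace_noise(1 / epsilon, rng)


def mean_relative_error(true_count, epsilon, trials, rng):
    """Average |noisy - true| / true over repeated releases."""
    errors = [
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials / true_count


rng = random.Random(0)  # seeded only to make the demo repeatable
for n in (20, 20_000):
    err = mean_relative_error(n, epsilon=0.1, trials=2_000, rng=rng)
    print(f"true count {n}: mean relative error {err:.2%}")
```

A smaller epsilon gives a stronger guarantee and more noise, so the choice of epsilon is itself a privacy versus utility decision.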

With these privacy techniques, privacy and analytics no longer have to be at odds. Companies that dare to ignore them are exposing themselves to unnecessary risks.

Contact us today to learn how you can use CN-Protect to apply any of these techniques to protect your data while preserving its analytical value.



CCPA is here. Are you compliant?


On January 1, 2020, the California Consumer Privacy Act (CCPA) came into effect, and it has already altered the ways companies can make use of user data.

Before the CCPA, Big Data companies were free to harvest user data and use it for data science, analytics, AI, and ML projects. Through this process, consumer data was monetized without privacy protection. With the official introduction of the CCPA, companies now have no choice but to comply or pay the price. This raises the question: is your company compliant?

CCPA Is Proving That Privacy Is Not a Commodity: It’s a Right

This legislation protects consumers from companies selling their data for secondary purposes. Without explicit permission to use data, companies are unable to utilize said data.

User data is highly valuable for companies’ analytics or monetization initiatives. Thus, risking user opt-outs can be detrimental to a company’s continued success. By de-identifying consumer data, companies can follow CCPA guidelines while maintaining high data quality.

The CCPA comes with a highly standardized ruleset that companies must satisfy for de-identification. The law includes specific definitions and detailed explanations of how to achieve its ideals. Despite these guidelines, and with the legislation only just in effect, studies have found that only 8% of US businesses are CCPA compliant.

For companies that are not CCPA compliant as of yet, the time to act is now. By thoroughly understanding the regulations put out by the CCPA, companies can protect their users while still benefiting from their data. 

To do so, companies must understand the significance of maintaining analytical value and the importance of adequately de-identified data. By not complying with the CCPA, an organization is vulnerable to fines of up to $7,500 per intentional violation, as well as individual consumer damages of up to $750 per incident.

For perspective, GDPR, which came into effect in May 2018, allows fines of up to 4% of a company’s annual global revenue.

To ensure a CCPA fine is not coming your way, assess your current data privacy protection efforts to ensure that consumers:

  • are asked for direct consent to use their data
  • can opt out or have their data removed from analytical use
  • cannot be re-identified from their data

In essence, the CCPA is not impeding a company’s ability to use, analyze, or monetize data. The CCPA requires that data be de-identified or aggregated, and that this be done to the standards the legislation sets.

Our research found that 60% of datasets that companies believed to be de-identified had a high re-identification risk. There are three methods to reduce the possibility of re-identification:

  • Use state-of-the-art de-identification methods
  • Assess for the likelihood of re-identification
  • Implement controls, so data required for secondary purposes is CCPA compliant

Read more about these effective privacy automation methods in our blog, The business Incentives to Automate Privacy Compliance under CCPA.

Manual Methods of De-Identification Are Tools of The Past

One standard of compliance within the CCPA involves identifying which methods of de-identification leave consumer data susceptible to re-identification. The manual approach, which is extremely common, can leave room for re-identification. Companies that rely on it make themselves vulnerable to CCPA penalties.

Protecting data to the best of a company’s abilities is achievable through techniques such as k-anonymity and differential privacy. However, manual methods are impractical for meeting the 30-day grace period the CCPA provides or for achieving high-quality data protection.

Understanding the CCPA helps ensure that data is adequately de-identified and its risk reduced, all while meeting the legal specifications.

Meeting CCPA requirements means ditching first-generation approaches to de-identification and adopting privacy automation, which reduces the possibility of re-identification. Using privacy automation to protect and utilize consumers’ data is necessary for successfully maneuvering the new CCPA era.

The solution of privacy automation ensures not only that user data is correctly de-identified, but that it maintains a high data quality. 

CryptoNumerics as the Privacy Automation Solution

Despite CCPA’s strict guidelines, the benefits of using analytics for data science and monetization are incredibly high. Therefore, reducing efforts to utilize data is a disservice to a company’s success.

Complying with CCPA legislation means determining which methods of de-identification leave consumer data susceptible to re-identification. Manual de-identification methods, including masking and tokenization, leave room for improper anonymization.

Here, Privacy Automation becomes necessary for an organization’s analytical tactics. 

Privacy automation abides by the CCPA while still supporting data science and analytics. If a user’s data is de-identified to the CCPA’s standards, conducting data analysis remains possible.

Privacy automation revolves around the assessment, quantification, and assurance of data. A privacy automation tool simultaneously measures the risk of re-identification, applies data privacy protection techniques, and provides audit reports.

A study by PossibleNow indicated that 45% of companies were in the process of preparing but did not expect to be compliant by the CCPA’s implementation date. Putting a privacy automation tool in place to better process data and prepare for the new legislation is critical to a company’s success with the CCPA. Privacy automation products such as CN-Protect allow companies to succeed in data protection while benefiting from the data’s analytics. (Learn more about CN-Protect)


The top 4 privacy solutions destroy data value and fail to meet regulatory standards.


Businesses are becoming increasingly reliant on data to make decisions and learn about the market. Yet, due to an increase in regulations, the information they have collected is becoming less and less useful. While people have been quick to blame privacy laws, in reality, the biggest impediment to analytics and data science is insufficient data privacy solutions.

From our market research, the top four things people are doing are (1) access controls, (2) masking, (3) encryption, and (4) tokenization. While these solutions are a step in the right direction, they wipe the data of its value and leave businesses open to regulatory penalties and reputational damage.

Your data privacy solutions are insufficient

Access controls: Access controls limit who can access data. While important, they are not an effective privacy-preserving strategy, because the controls do not protect the identity of the individuals or prevent their data from being used for purposes they have not consented to. It is an all-or-nothing approach: either someone has access to the data, and privacy is not protected, or they do not, in which case no insights can be gleaned at all.

Masking: This is a process by which sensitive information is replaced with synthetic data. In doing so, the analytical value is wiped. While this solution works for testing, it is not an advantageous solution if you are planning to provide the data to data scientists. After all, you are sending them this data to unlock valuable insights!

Encryption: Encryption is a security mechanism that protects data until it is used, at which point the data is decrypted, exposing the private data to the user. An additional concern with encryption is that anyone who obtains the key can reverse the entire process (decryption), putting the data at risk.

Tokenization: Tokenization, also known as pseudonymization, is the process of encoding direct identifiers, like email addresses, into another value (a token) and storing the mapping between token and original value somewhere for relinking in the future. When businesses employ this technique, they leave the indirect identifiers (quasi-identifiers) as they are. Yet, combinations of quasi-identifiers are a proven method to re-identify individuals in a dataset.

Such a risk emphasizes the importance of understanding the re-identification risk of a dataset when comparing the effects of your organization’s privacy protection actions. Moreover, this process is often reversed to perform analysis, violating the very principle of the process. The most important question to ask yourself is: how do I know my datasets have been anonymized? If you only implement tokenization, the answer is that you don’t.
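The weakness of tokenization on its own can be sketched in a few lines of Python. The customer table, the public dataset, and the key are hypothetical, and a keyed hash from the standard `hmac` module stands in for whatever token vault a real system would use. The email is replaced, but birth year and zipcode are released untouched, and that is enough to link rows back to names.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def tokenize(value):
    """Replace a direct identifier with a keyed hash (a pseudonym)."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]


# Hypothetical customer table: (email, birth year, zipcode, purchase).
CUSTOMERS = [
    ("ana@example.com", 1985, "94107", "insulin"),
    ("bo@example.com", 1990, "94107", "coffee"),
    ("cy@example.com", 1985, "10001", "books"),
]

# Tokenized release: the email is replaced, the quasi-identifiers are untouched.
RELEASED = [(tokenize(email), year, zc, item) for email, year, zc, item in CUSTOMERS]

# Hypothetical public data an attacker already holds: (name, birth year, zipcode).
PUBLIC = [("Ana Diaz", 1985, "94107"), ("Cy Young", 1985, "10001")]


def link(released, public):
    """Match released rows to names whenever the quasi-identifiers are unique."""
    matches = {}
    for name, year, zc in public:
        hits = [row for row in released if row[1] == year and row[2] == zc]
        if len(hits) == 1:
            matches[name] = hits[0][3]
    return matches


print(link(RELEASED, PUBLIC))
```

Measuring how many rows are unique on their quasi-identifiers, before release, is what tells you whether this kind of linkage is possible.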


Risk-aware anonymization will unlock the value of your data.

To unlock the value of your datasets in the regulatory era, businesses should implement privacy techniques. And many have! However, as we’ve discussed, the commonly used techniques are insufficient to preserve analytical value and protect your organization. The only way data will be useful to your data scientists is if you transform the data in such a way that the privacy elements enabling re-identification are removed while degrading the data as little as possible.

Consequently, businesses must prioritize risk-aware anonymization in order to optimize the reduction of re-identification risk and protect the value of data.

CN-Protect is the ideal solution to achieve your goals. It utilizes AI and advanced privacy protection methods, like differential privacy and k-anonymization, to assess, quantify, and assure privacy, so that privacy and insights are produced in unison.

The process is as follows:

  1. Classify metadata: identify the direct, indirect, and sensitive data in an automated manner, to help businesses understand what kind of data they have.
  2. Quantify risk: calculate the risk of re-identification of individuals and provide a privacy risk score.
  3. Protect data: apply advanced privacy techniques, such as k-anonymization and differential privacy, to tables, text, images, video, and audio. This involves optimizing the tradeoff between privacy protection (removing elements that constitute privacy risk) and analytical value (retaining elements that constitute data fidelity).
  4. Audit-ready reporting: keep track of what the dataset is, what kind of privacy-protecting transformations were applied, changes in the risk score (before and after privacy actions have been applied), who applied the transformation and at what time, and where the data went. This is the key piece to proving data has been defensibly anonymized to regulatory authorities.

In doing so, businesses are able to establish the privacy-protection of datasets to a standard that fulfills data protection regulations, protects you from privacy risk, and most importantly, preserves the value of the data. In essence, it will unlock data that was previously restricted, and help you achieve improved data-driven outcomes by protecting data in an optimized manner.

By measuring the risk of identification, applying privacy-protection techniques, and providing audit reports throughout the whole process, CN-Protect is the only data privacy solution that will comprehensively unlock the value of your data.
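The quantify, protect, and report steps can be pictured with a small, generic Python sketch. This is not CN-Protect’s API; the rows, the risk measure (the average of 1 divided by cohort size), the generalization rule, and the field names in the audit record are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical rows: (age, zipcode). Both columns are treated as quasi-identifiers.
ROWS = [(34, "10001"), (36, "10002"), (31, "10009"), (52, "10451")]


def risk_score(rows, qid_indexes):
    """Average re-identification risk: mean of 1 / cohort size over all rows."""
    keys = [tuple(row[i] for i in qid_indexes) for row in rows]
    sizes = Counter(keys)
    return sum(1 / sizes[key] for key in keys) / len(keys)


def protect(rows):
    """Illustrative transformation: age to decade, zipcode to 3-digit prefix."""
    return [(age // 10 * 10, zipcode[:3]) for age, zipcode in rows]


def audit_record(dataset, transformation, before, after, user):
    """Capture what was done, by whom, when, and how the risk score changed."""
    return {
        "dataset": dataset,
        "transformation": transformation,
        "risk_before": round(before, 3),
        "risk_after": round(after, 3),
        "applied_by": user,
        "applied_at": datetime.now(timezone.utc).isoformat(),
    }


PROTECTED = protect(ROWS)
record = audit_record(
    "customers.csv",                      # hypothetical dataset name
    "generalize(age, zipcode)",
    risk_score(ROWS, (0, 1)),
    risk_score(PROTECTED, (0, 1)),
    "analyst@example.com",                # hypothetical user
)
print(record)
```

Keeping the before and after scores next to the transformation is what lets someone else verify, later, that the protection actually reduced the risk.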


Forget Third-party Datasets – the Future is Data Partnerships that Balance Compliance and Analytical Value


Organizations are constantly gathering information from their customers. However, they are always driven to acquire extra data on top of this. Why? Because more data equals better insights into customers, and better ability to identify potential leads and cross-sell products. Historically, to acquire more data, organizations would purchase third-party datasets. Though these come with unique problems, such as occasionally poor data quality, the benefits used to outweigh the problems. 

But not anymore. Unfortunately for organizations, since the introduction of the EU General Data Protection Regulation (GDPR), buying third-party data has become extremely risky. 

GDPR has changed the way in which data is used and managed, by requiring customer consent in all scenarios other than those in which the intended use falls under a legitimate business interest. Since third-party data is acquired by the aggregator from other sources, in most cases, the aggregators don’t have the required consent from the customers. This puts any third-party data purchaser in a non-compliant situation that could expose them to fines, reputational damage, and additional overhead compliance costs.

If organizations can no longer rely on third-party data, how can they maximize the value of the data they already have? 

By changing their focus. 

The importance of data partnerships and second-party data

Instead of acquiring third-party data, organizations should establish data partnerships and access second-party data. This new approach has two main advantages. One, second-party data is the first-party data of another organization, so it is of high quality. Two, there are no concerns about customer consent, as the organization that owns this data has direct consent from the customer.

That said, to establish a successful data partnership, there are three things that have to be taken into consideration: privacy protection, IP protection, and data analytical value.   

Privacy Protection

Even when customer consent is present, the data that is going to be shared should be privacy-protected in order to comply with GDPR, safeguard customer information, and prevent any risk. Privacy protection should be understood as a reduction in the probability of re-identifying a specific individual in a dataset. GDPR, as well as other privacy regulations, refer to anonymization as the maximum level of privacy protection, wherein an individual can no longer be re-identified. 

Privacy protection can be achieved with different techniques. Common approaches include differential privacy, encryption, the adding of “noise,” and suppression. Regardless of which privacy technique is applied, it is important to always measure the risk of re-identification of the data.

IP (Intellectual Property) Protection

Some organizations are okay with selling their data. Others, however, are very reluctant, because they understand that once the data is sold, all of its value and IP is lost, since they can’t control it anymore. IP control is a big barrier when trying to establish data partnerships.

Fortunately, there is a way to establish data partnerships and ensure that IP remains protected.

Recent advances in cryptographic techniques have made it possible to collaborate with data partners and extract insights without having to expose the raw data. The first of these techniques is called Secure Multiparty Computation.

As its name implies, with Secure Multiparty Computation, multiple parties can perform computations on their datasets as if they were collocated but without revealing any of the original data to any of the parties. The second technique is Fully Homomorphic Encryption. With this technique, data is encrypted in a way in which computations can be performed without the need for decrypting the data. 

Because the original raw data is never exposed across partners, both of these advanced techniques allow organizations to augment their data, extract insights and protect IP safely and securely.
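To give a feel for how Secure Multiparty Computation can produce a result without exposing inputs, here is a toy Python sketch of additive secret sharing, one of its basic building blocks. The modulus, the example values, and the function names are assumptions for illustration; the sketch simulates all parties in one process and assumes every party follows the protocol honestly.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo a large prime


def share(value, n_parties):
    """Split value into n random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Compute the total of all private values from shares alone."""
    n = len(private_values)
    # Each party splits its own value and sends one share to every party.
    distributed = [share(value, n) for value in private_values]
    # Party j adds up the shares it received; on its own this reveals nothing.
    partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    # Only the partial sums are published and combined.
    return sum(partial_sums) % MODULUS


# Three partners each hold a private figure, e.g. a customer count.
print(secure_sum([120, 75, 310]))
```

Each individual share is a uniformly random number, so no partner learns another partner’s figure; only the combined total is revealed. Production protocols add protections against parties that deviate from the rules.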

Analytical Value

The objective of any data partnership is to acquire more insights into customers and prospects. For this reason, any additional data that is acquired needs to add analytical value. But maintaining this value becomes difficult when organizations need to preserve privacy and IP protection. 

Fortunately, there is a solution. Firstly, organizations should identify common individuals in both datasets. This is extremely important, because you want to acquire data that adds value. By using Secure Multiparty Computation, the data can be matched and common individuals identified, without exposing any of the sensitive original data. 

Secondly, organizations must use software that balances privacy and information loss. Without this, the resulting data will be high on privacy protection and extremely low on analytical value, making it useless for extracting insights.
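The matching step described above can be illustrated with a toy private set intersection in the Diffie-Hellman style, one well-known way to find common individuals without exchanging raw identifiers. This Python sketch is an assumption-laden illustration, not a description of any product: the modulus is a toy choice, both parties are simulated inside one function, and in a real deployment each party would keep its key on its own systems and use a vetted cryptographic library.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # toy prime modulus; real protocols use vetted groups


def to_group(identifier):
    """Hash an identifier (e.g. an email) to a number modulo P."""
    digest = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(digest, "big") % P


def random_key():
    """Secret exponent coprime to P - 1, so blinding never merges distinct values."""
    while True:
        key = secrets.randbelow(P - 2) + 1
        if math.gcd(key, P - 1) == 1:
            return key


def blind(values, key):
    """Raise every value to the secret key modulo P."""
    return [pow(v, key, P) for v in values]


def common_individuals(ids_a, ids_b):
    """Return the entries of ids_a that also appear in ids_b."""
    key_a, key_b = random_key(), random_key()  # each party would keep its own key private
    # A blinds its identifiers and sends them to B, who blinds them again.
    a_double = blind(blind([to_group(i) for i in ids_a], key_a), key_b)
    # B blinds its identifiers and sends them to A, who blinds them again.
    b_double = set(blind(blind([to_group(i) for i in ids_b], key_b), key_a))
    # Double-blinded values are equal exactly when the underlying identifiers are equal.
    return [i for i, tag in zip(ids_a, a_double) if tag in b_double]


print(common_individuals(
    ["ana@example.com", "bo@example.com", "cy@example.com"],
    ["cy@example.com", "dee@example.com", "ana@example.com"],
))
```

The trick is that blinding with two keys gives the same result in either order, so matches line up even though neither side ever sees the other’s unblinded identifiers.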

Thanks to the new privacy regulations sweeping the world, acquiring third-party datasets has become extremely risky and costly. Organizations should change their strategy and engage in data partnerships that will provide them with higher quality data. However, for these partnerships to add real value, privacy and IP have to be protected, and data has to maintain its analytical value.

For more about CryptoNumerics’ privacy automation solutions, read our blog here.
