The Privacy Risk Most Data Scientists Are Missing

Data breaches are becoming increasingly common, and the costs of being involved in one are going up. A report by the Ponemon Institute (an IBM-backed research organization) found that the average cost of a data breach in 2018 was $148 per record, up nearly 5% from 2017.

Privacy regulations are pushing compliance teams toward methods like masking and tokenization to protect their data, but these methods come at a cost. Businesses often find that these solutions prevent data from being leveraged for analytics, and on top of that, they still leave data exposed.

Many data scientists and compliance departments secure only direct identifiers. They hide an individual’s name or social security number and move on. The assumption is that by removing values unique to a user, the dataset has been de-identified. Unfortunately, that is not the case.

In 2006, Netflix announced a $1 million competition for whoever could build it the best movie-recommendation engine. To facilitate this, the company released large volumes of subscriber data with direct identifiers redacted, so engineers could work with real Netflix data without, supposedly, compromising consumer privacy. But the released records still contained indirect identifiers (also known as quasi-identifiers) such as age, gender, and zip code, and researchers showed that, taken in combination, these could re-identify users with high accuracy. The result was the exposure of millions of Netflix’s consumers. A planned sequel competition was called off within months, and a lawsuit was filed against Netflix.
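The danger of combined quasi-identifiers is easy to demonstrate. The sketch below, using a tiny invented dataset, measures how many records become unique once seemingly harmless attributes are combined:

```python
from collections import Counter

# Invented records: each row is (age, gender, zip_code); no names at all.
people = [
    (34, "F", "90210"), (34, "F", "90210"), (34, "M", "90210"),
    (52, "M", "10001"), (52, "M", "10002"), (29, "F", "60614"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose combination of values appears exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    singletons = sum(
        1 for row in rows if counts[tuple(row[c] for c in columns)] == 1
    )
    return singletons / len(rows)

# One attribute at a time, few rows are unique...
print(unique_fraction(people, [0]))        # age only
# ...but the full combination pins down most individuals.
print(unique_fraction(people, [0, 1, 2]))  # age + gender + zip
```

In this toy data, age alone makes one of six people unique, while the full combination makes four of six unique; on real datasets with thousands of zip codes, the effect is far stronger.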

When it comes to the risk posed by indirect identifiers, it’s not a question of if, but when. That’s a lesson companies keep learning the hard way. Marriott, the hotel chain, suffered a breach of 500 million consumer records and faced $72 million in damages after failing to protect indirect identifiers.

Businesses are faced with a dilemma. Do they redact all their data and leave it barren for analysis? Or do they leave indirect identifiers unprotected and create an avenue of exposure that will eventually leak their customers’ private data?

Either option causes problems.

That’s why we founded CryptoNumerics. Our software uses AI to autonomously classify your datasets into direct, indirect, sensitive, and insensitive identifiers. It then applies cutting-edge data science techniques like differential privacy, k-anonymization, and secure multi-party computation to anonymize your data while preserving its analytical value. Your datasets are comprehensively protected and de-identified while maintaining the integrity needed for machine learning and data analysis.

Data is the new oil. Artificial intelligence and machine learning represent the future of technological value, and any company that does not keep up will be left behind and disrupted. Businesses cannot afford to leave data siloed or uncollected.

Likewise, data privacy is no longer an issue that can be ignored. Scandals like Cambridge Analytica and policies like GDPR prove that, yet the industry is still not knowledgeable about key risks like indirect identifiers. Companies that use their data irresponsibly will feel the repercussions, but those that don’t use their data at all will be left behind. Choose not to fall into either category.

The Three P’s of Retail Success

As a retailer, you have a limited view of your customers: POS data and social media tell you what happens in your store, but not how customers spend their money outside of it. This gap can be closed by acquiring access to one very useful asset: financial data.

By combining financial data from millions of customers with your POS data, you can achieve a solid 360-degree view of your customer based on their preferences and habits, and grow your ROI by running more targeted marketing strategies. You can also outperform your competition by spotting trends and offering better deals.

Adding Financial Data to the Mix: The Benefits

With access to customers’ financial data, you will not only make more informed business decisions but also gain efficiencies from deeper customer knowledge and optimized marketing expenditure.



The amount of personalization possible with all this added financial data allows for stronger customer experience and retention. Talk about a mutual benefit!

There are two advantages when it comes to pairing financial data with POS data to boost personalization: increased customer intimacy and increased customer loyalty. With customer intimacy, we are talking about being able to better anticipate customer needs by analyzing buying patterns and understanding shopper behaviour. On the other hand, with customer loyalty, you can customize your offerings and deals according to a target group’s needs, or even an individual’s needs, to ensure the customers feel heard and important.

Thus, personalization adds value to both the customer and the company by boosting relevance and customer retention.



With financial data in the mix, you are further able to maximize the quality of your marketing spend.

You could optimize your marketing expenditure by combining your sales data (for example, what was purchased at your store, when it was purchased, and how much it was purchased for) and financial data (for example, how much was spent at your store versus a competitor’s). Seeing these customer preferences yields valuable insights that will help you make smarter product, pricing, and promotional decisions.

Another important benefit of leveraging financial data is knowing what percentage of each customer’s wallet goes to you as opposed to your competition. But wait, that’s not all! Unlocking these data insights will also help your organization build more powerful models, making it easier to forecast future sales and buying preferences.
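The share-of-wallet idea can be sketched in a few lines: join invented per-customer POS totals with invented financial totals by customer id (all names and numbers here are hypothetical):

```python
# Invented per-customer totals: in-store spend from POS data, and overall
# retail spend from (privacy-protected) financial data.
store_spend = {"c1": 200.0, "c2": 50.0, "c3": 200.0}
total_retail_spend = {"c1": 500.0, "c2": 100.0, "c3": 1000.0}

def share_of_wallet(store, total):
    """Fraction of each customer's retail spending captured by your store."""
    return {cid: store[cid] / total[cid] for cid in store if cid in total}

print(share_of_wallet(store_spend, total_retail_spend))
# A value of 0.4 means 40% of that customer's retail budget is spent with you.
```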

Financial data can also make your promotional efforts more efficient and better informed, saving you both time and money.



Using privacy-protected financial data that is secure and compliant with legislative regulations keeps you worry-free and helps you avoid problems and PR nightmares.

Luckily, there are companies that combine security and privacy into a solution that complies with regulation, ensures privacy and IP protection, and secures the best possible ROI for your company. Their privacy and security methods remain in place throughout the data pipeline, from acquisition to publishing, using access controls and cryptography.

Combining financial data with your existing pool of information will help you (1) increase local demand, (2) optimize media spending and promotional activities, (3) focus on customer experience, and (4) complement your privacy compliance. Modelling all these functions will also help you forecast future sales and growth, thus increasing performance.

Still not convinced? Let’s check out a large corporation that stands by this…

See How Walmart is Implementing this Solution

“Walmart uses big data to make the company’s operations more efficient and improve the lives of customers”.

To power its goal of providing the best shopping experience possible, Walmart is maximizing its use of big data to reveal consumer patterns. Transactional, online, and mobile data all combine to help the company serve customers better so that they keep coming back.

They use data mining to extrapolate trends from their POS data, to see what the customer buys, when they buy it, how they buy it (online or in-store), and what they buy before or after a certain product. POS data allows the organization to see shopping patterns to determine how to display merchandise and stock shelves. Furthermore, they can send out personalized rollback deals and vouchers based on consumer spending habits. Not only do they use this data to create customer value, they also use it for staffing purposes. For example, to help lower the amount of time it takes to fill a prescription, Walmart looks at how many prescriptions are filled each day to determine staff scheduling and inventory.

Additionally, Walmart has created its own credit card, which gives the company firsthand knowledge of its customers. The spending patterns in this financial data give Walmart a solid understanding of consumer habits and preferences, enabling it to anticipate demand for each product or service.

Outcomes of using this big data include improved store checkout procedures, better supply chain management, and optimized product assortment.

To Sum it Up

Without data, companies cannot grow and digitally enhance their business model according to the needs of their target market. Being able to leverage data to its full potential is a competitive advantage on its own, especially with data being such a huge commodity today. Unlock greater potential for customer value by expanding your access to the data available around you.

Top 10 Challenges Data Scientists Face at Work

We have all heard that “data is the new oil”. As with oil, data has to be transformed to be of real value to society. The people in charge of this transformation are data professionals.

Data professionals are constantly trying to make sense of data by building models that can provide the insights necessary for organizations to grow and generate more value. However, these professionals face many challenges that prevent them from building powerful models.

In 2017, Kaggle published its “State of Data Science and Machine Learning” study. One of the survey questions was, “At work, which barriers or challenges have you faced this past year? (Select all that apply)”. The table below lists the top 10 answers and how often respondents encountered each problem:

Challenge | Most of the time | Often | Sometimes | Rarely
Dirty data | 43% | 40% | 16% | 1%
Lack of data science talent in the organization | 31% | 40% | 27% | 2%
Company politics / lack of management or financial support for a data science team | 26% | 40% | 30% | 4%
Unavailability of / difficult access to data | 28% | 42% | 27% | 2%
Lack of a clear question to answer or a clear direction to go in with the available data | 29% | 43% | 27% | 2%
Data science results not used by business decision makers | 16% | 44% | 37% | 3%
Explaining data science to others | 19% | 41% | 36% | 3%
Privacy issues | 25% | 36% | 34% | 5%
Lack of significant domain expert input | 22% | 46% | 29% | 3%
Organization is small and cannot afford a data science team | 37% | 36% | 24% | 3%

Dirty data is clearly a big issue; data scientists reportedly spend up to 80% of their time cleaning data. Other challenges, like a lack of talent and expertise, company politics that keep results from being used, and data inaccessibility, are harder to solve because they require systemic changes within the organization.

To see how data professionals answered the other questions, refer to Kaggle’s 2017 “State of Data Science and Machine Learning” study.

Six Things to Look for in Privacy Protection Software

This is the fourth blog in our Crash Course in Privacy series.


Enterprises want to:

  • Leverage their data assets
  • Comply with privacy regulations
  • Reduce the risk exposure of consumer information

To maintain data utility while protecting privacy, here is a list of six key things you should consider in data privacy software:

1) Allows you to understand the privacy risk of your data set

It is easy to think that removing information like names and IDs eliminates privacy risk. However, as the Netflix case showed, a dataset contains plenty of additional information that can be used to re-identify someone even when those fields have been removed. It is therefore important to know the probability of re-identification for individuals in your dataset after privacy protection has been applied. There are also lesser-known types of privacy risk that may matter to you, such as membership disclosure and attribute disclosure.

The software you use should help you understand and manage these risks.

2) Enables you to understand information loss and maintain the analytical value

Every time you apply anonymization techniques to your dataset, the information is transformed. This transformation either redacts, generalizes, or replaces the original data, causing some information loss. Depending on what the data will be used for, you need to be able to understand the impact on your data quality. Your data quality could vary widely even with the same privacy risk, so knowing this makes a huge difference when using privacy-protected data for analytics.

Software that helps you understand the information loss and maintain analytical value after de-identification is critical.
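As an illustration, one crude way to score information loss after generalizing a quasi-identifier is to compare how many distinct values survive the transformation. The banding scheme and metric below are invented for this sketch; real tools use richer measures such as normalized certainty penalty:

```python
def generalize_age(age, width=10):
    """Generalize an exact age into a band, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def distinct_value_loss(values, generalized):
    """Crude information-loss score: fraction of distinct values lost."""
    before, after = len(set(values)), len(set(generalized))
    return 1 - after / before

# Invented ages; banding halves the number of distinct values here.
ages = [23, 27, 31, 34, 38, 41, 45, 52, 58, 63]
bands = [generalize_age(a) for a in ages]
print(distinct_value_loss(ages, bands))
```

Widening the bands (say, `width=20`) drives the loss score higher while making re-identification harder, which is exactly the trade-off the software should let you see.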

3) Protects all attribute types

To achieve optimal privacy protection while balancing data quality, all data elements need to be classified appropriately. Incorrectly classifying a data element as an identifier, quasi-identifier, sensitive, or insensitive attribute could lead to insufficient privacy protection or excessive data quality loss.

The right privacy-protection software should support all four attribute types (identifier, quasi-identifier, sensitive, insensitive) and allow you to customize the classification of your data elements based on your needs.

To learn more about the data attributes read “Why privacy is important.”

4) Supports a range of privacy techniques and is tunable

Each privacy technique has pros and cons depending on what the data will be used for. For example, masking is good for protection but removes analytical value completely. Look for software that supports a range of privacy-protection techniques, with tunable parameters for each, so you can find the right balance for your needs.

5) Applies consistent privacy policies

Satisfying privacy regulations is a cumbersome and manual process. Being able to create privacy frameworks and share them across the organization is key. As a result, you should look for software that allows you and your team to apply consistent privacy policies.

6) Your data stays where you can protect it

You are looking to privacy-protect your data, so the software you use should work in the environment where your data is already protected. Software that runs locally in your environment removes an additional layer of risk.


Why Masking and Tokenization Are Not Enough

This is the third blog in our Crash Course in Privacy series.


Protecting consumer privacy is much more complex than just removing personally identifiable information (PII). Other types of information, such as quasi-identifiers, can re-identify individuals or expose sensitive information when combined. There are four types of information, called attributes, that are frequently referred to when applying privacy techniques:

  • Identifiers: Unique information that identifies a specific individual in a dataset, such as a name, social security number, or bank account number. More generally, any field whose value is unique to each row is an identifier.
  • Quasi-identifiers: Information that on its own is not sufficient to identify a specific individual but when combined with other quasi-identifiers makes it possible to re-identify an individual. Examples of quasi-identifiers are zip code, age, nationality, and gender.
  • Sensitive: Information that is more general among the population, making it difficult to identify an individual with it. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data. Let’s say we have a set of quasi-identifiers that form a group of men aged 40-50. A sensitive attribute could be “diagnosed with heart disease”. Without the quasi-identifiers, the probability of identifying who has heart disease is low, but once combined with the quasi-identifiers, the probability is high.
  • Insensitive: Information that is not identifying, quasi-identifying, or sensitive, and that you do not want to be transformed.
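In practice, this classification is often written down as a simple column-to-attribute-type mapping before any anonymization runs. A minimal sketch with hypothetical column names:

```python
# Hypothetical column classification for a customer table.
ATTRIBUTE_TYPES = {
    "name": "identifier",
    "ssn": "identifier",
    "zip_code": "quasi-identifier",
    "age": "quasi-identifier",
    "gender": "quasi-identifier",
    "salary": "sensitive",
    "favorite_color": "insensitive",
}

def columns_of_type(kind):
    """List the columns tagged with a given attribute type."""
    return [col for col, t in ATTRIBUTE_TYPES.items() if t == kind]

print(columns_of_type("quasi-identifier"))
```

A downstream anonymization step would then mask the identifier columns, generalize the quasi-identifiers, and leave the insensitive columns untouched.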

Apart from knowing the types of information that needs to be protected, it is also important to know how privacy techniques affect data quality. There is always a trade-off between protecting privacy and retaining analytical value. The following is a review of some common privacy techniques:

  • Masking: Replaces existing information with information that looks real but is of no use to anyone who might misuse it, and is not reversible. This approach is typically applied to identifying fields, such as name, credit card number, and social security number. The only masking techniques that sufficiently distort the identifying fields are suppression, randomization, and coding; these techniques render the data useless for analysis.
  • Tokenization: The process of encoding direct identifiers, like email addresses, into another value (a token) while storing the mapping from each token to its original value somewhere for relinking in the future. When businesses employ this technique, they leave the indirect identifiers (quasi-identifiers) as they are. Yet combinations of quasi-identifiers are a proven avenue for re-identifying individuals in a dataset.
  • k-anonymity: This technique transforms quasi-identifiers so that each group of records sharing the same quasi-identifier values (also called an equivalence class) contains at least k records, each indistinguishable from the other k-1 on those attributes. Transformation occurs by generalizing and/or suppressing the quasi-identifiers. For example, if k is set to 5, then any group must contain at least 5 individuals. As k increases, the data becomes more general and the risk of re-identification is reduced, but analytical value is reduced as well. By balancing privacy risk with data quality, the resulting data can still be used for analysis.
  • Differential Privacy: This technique uses randomness to reduce the probability of determining whether a particular individual is in a dataset. One approach is to add random noise to aggregate results. For example, suppose two professors publish reports, with data from different months, about the number of students with international parents. A smart student notices a difference of 1 and deduces that Joe, who dropped out last month, is the missing student; now that student knows Joe had international parents. If both professors had used differential privacy, each would have added random noise to the published count, so a difference of 1 between the reports would no longer reliably reveal the missing student. There are many approaches to applying differential privacy; the most promising provide strong privacy guarantees while maintaining the analytical value of the data.
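The k-anonymity and differential-privacy ideas above can be sketched in a few lines of Python. Both functions are simplified illustrations rather than production implementations, and the records and epsilon value are invented for the example:

```python
import random
from collections import Counter

def is_k_anonymous(rows, quasi_cols, k):
    """True if every equivalence class (combination of quasi-identifier
    values) contains at least k records."""
    classes = Counter(tuple(row[c] for c in quasi_cols) for row in rows)
    return min(classes.values()) >= k

# Ages already generalized to bands, zip codes truncated: 3-anonymous.
rows = [
    ("30-39", "902"), ("30-39", "902"), ("30-39", "902"),
    ("50-59", "100"), ("50-59", "100"), ("50-59", "100"),
]
print(is_k_anonymous(rows, [0, 1], k=3))

def laplace_count(true_count, epsilon=1.0):
    """Differentially private count: add Laplace noise with scale 1/epsilon
    (a counting query has sensitivity 1). The difference of two exponential
    draws with rate epsilon is Laplace-distributed with scale 1/epsilon."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Each professor would publish a noisy count, so a difference of 1 between
# two reports no longer reveals whether Joe is in the data.
print(laplace_count(42))
```

Raising k forces coarser generalization, and lowering epsilon adds more noise; both knobs trade analytical value for stronger privacy, mirroring the trade-off described above.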

In order to protect consumer privacy and retain analytical value, it is important to choose the proper privacy technique for your desired application.


Why Protecting Sensitive Data is Important

This is the second blog in our Crash Course in Privacy series.


Privacy risk is the probability of extracting information about a specific individual from a dataset. Organizations hold significant amounts of personal information and must protect it from exposure.

Governments around the world have been very active in making sure that consumer privacy is protected, publishing regulations that dictate how data must be handled and used, including HIPAA, GDPR, CCPA, and PIPEDA. The consequences of not complying with these regulations are fines, lawsuits, and reputational damage.

Organizations find themselves trying to answer this question:

How can I comply with privacy regulations & protect consumer privacy while leveraging my data assets for business purposes?

The answer is contained in the regulations:

  • HIPAA: The Health Insurance Portability and Accountability Act (HIPAA) is American legislation that requires the protection of 18 specific identifiers: name, Social Security Number, Health Insurance Numbers, and others. Once the dataset has been protected by anonymizing or de-identifying, it can be used for analysis. (Source)
  • GDPR: The General Data Protection Regulation is a privacy regulation that must be observed by any organization that holds information about European citizens. GDPR contemplates two ways in which privacy can be protected: pseudonymization and anonymization. Once a dataset is anonymized, GDPR no longer applies to it. (Source)
  • CCPA: The California Consumer Privacy Act regulates what each person’s rights are regarding their data. Specifically, CCPA is concerned with information that could reasonably be linked, directly or indirectly, with a particular consumer or household. Data that has been aggregated or de-identified is excluded from the CCPA. (Source)

In light of these regulations and consumer expectations for privacy protection, it is clear that organizations must enact privacy policies. Organizations need to embrace privacy and find a way to embed it into their analytic process if they want to extract value from sensitive data without facing any consequences.

