Datasets contain an inherent privacy risk. By holding customer data, you create the potential for exposing your organization to legal action and a loss of consumer trust. To manage this, businesses have begun to de-identify their data. However, without privacy risk scoring, enterprises cannot ensure that privacy-protection actions have actually de-identified the data.
In recent years, new privacy regulations have emerged that restrict how data can be used to produce valuable insights. This has led more businesses to adopt privacy-preservation techniques that anonymize their data and take it out of the scope of overhead-heavy legislation like GDPR and CCPA.
However, businesses today are unable to measure the effectiveness of their de-identification strategies because they do not evaluate their data with a privacy risk score.
Under the latest privacy regulations, using data for most forms of analytics without consent is against the law unless the data has been de-identified. This means that wrongly assuming that data is anonymized could cost your business as much as 4% of your annual global revenue under GDPR.
Fines of this nature could rock the bottom line of any business. Fortunately, they are entirely avoidable thanks to privacy risk scoring.
Privacy risk scoring quantifies the risk of analytics to your business.
When a dataset undergoes an automated risk assessment, the privacy risk is measured based on metadata classification. A quantifiable score is then produced that assesses the likelihood of re-identification of individuals in the dataset.
Through this process, direct and indirect identifiers are used to assess the privacy risk of the data that a company holds. This is essential, because de-identification is much more complex than merely masking direct identifiers such as name and social insurance number. Yet this is the point at which most organizations believe they have properly de-identified their data.
This means that the approach your business is taking is likely ineffective, and you don’t even know it. Taking this risk is unnecessary and naive. It is like locking your door but not checking your windows.
A privacy risk score of 100% means you have identifiers in your data. A score below 100% corresponds to the probability that an average record can be re-identified using only quasi-identifiers.
For example, suppose you have a dataset with 2 features and 2 values each: sex (M, F) and political affiliation (R, D).
This could create 4 possible groups, also known as equivalence classes: M+R, M+D, F+R, and F+D.
- Suppose the input database has 40 people with an even spread across each equivalence class (10 people each). Risk is then calculated as 1 over the average number of people in the equivalence classes, in this case, 1 over 10, or 10%.
- If all 40 people were in the same equivalence class, say M+D, the risk would be 1 over 40 or 2.5%.
- If each of the 40 people were in a different equivalence class (40 classes in total), the risk would be 1 over 1 or 100%.
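The three cases above can be reproduced with a short Python sketch. This is an illustration of the "1 over the average equivalence-class size" calculation only, not a production scoring tool. The function name `privacy_risk`, the field names, and the extra `zip` column used to force 40 distinct classes are assumptions made for this example.

```python
from collections import Counter


def privacy_risk(records, quasi_identifiers):
    """Return 1 / (average size of the non-empty equivalence classes)."""
    if not records:
        raise ValueError("records must not be empty")
    # Each distinct combination of quasi-identifier values is one equivalence class.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    average_class_size = len(records) / len(classes)
    return 1 / average_class_size


# Case 1: 40 people spread evenly over M+R, M+D, F+R, F+D (10 each).
even = [{"sex": s, "party": p} for s in "MF" for p in "RD" for _ in range(10)]

# Case 2: all 40 people in the same class (M+D).
same = [{"sex": "M", "party": "D"} for _ in range(40)]

# Case 3: every person in their own class (hypothetical extra column "zip").
unique = [{"sex": "M", "party": "D", "zip": i} for i in range(40)]

print(privacy_risk(even, ["sex", "party"]))           # 0.1
print(privacy_risk(same, ["sex", "party"]))           # 0.025
print(privacy_risk(unique, ["sex", "party", "zip"]))  # 1.0
```

Note that the average is taken over the classes that actually occur in the data, which is why the second case yields 2.5% rather than 10%.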
Automating the risk assessment process is the only way to manage the volume of data.
Businesses use data to understand and influence their decision-making process every day. But when it comes to privacy, they often rely on traditional methods to apply privacy-protection and manage risk. Why would you use AI to clean your floor, but manual checks to determine that a dataset is considered de-identified?
Data lakes contain an enormous quantity of data that expands at a rapid rate every single day. It is impractical to accurately quantify the re-identification risk associated with each dataset by hand, which means manual checks cannot confirm that all datasets being used for analytics are genuinely de-identified.
Privacy risk scoring is an automated process that can occur throughout the privacy protection cycle so that businesses can quantify their risk and make informed decisions. A system of this nature removes the guesswork that accompanies traditional methods of anonymization and empowers enterprises to define acceptable risk thresholds.
Businesses must customize risk thresholds based on their data use case.
Businesses do not use all of their data to undertake the same activities, nor do they all manage the same level of sensitive information. As a consequence, privacy-preservation is not a uniform process. In general, we suggest following these guidelines when assessing your privacy risk score:
- Greater than 33% implies that your data is identifiable.
- 33% is an acceptable level if you are releasing to a highly trusted source.
- 20% is the most commonly accepted level of privacy risk.
- 11% is used for highly sensitive data.
- 5% is used for releasing to an untrusted source.
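As a rough sketch, the guideline levels above could be encoded as a lookup table and checked against a computed score. The dictionary keys and the function name `acceptable_use_cases` are hypothetical labels chosen for this example. The percentages themselves are the ones listed above.

```python
# Guideline thresholds from the list above, expressed as fractions.
# The key names are illustrative labels, not an established taxonomy.
RISK_THRESHOLDS = {
    "highly_trusted_recipient": 0.33,
    "common_default": 0.20,
    "highly_sensitive_data": 0.11,
    "untrusted_recipient": 0.05,
}


def acceptable_use_cases(risk_score):
    """List the use cases whose threshold the given risk score satisfies."""
    return [
        use_case
        for use_case, threshold in RISK_THRESHOLDS.items()
        if risk_score <= threshold
    ]


print(acceptable_use_cases(0.10))
# ['highly_trusted_recipient', 'common_default', 'highly_sensitive_data']
print(acceptable_use_cases(0.40))
# [] (greater than 33%, so the data is treated as identifiable)
```

An empty result corresponds to the first guideline: a score above 33% fits none of the release scenarios.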
With a privacy risk score, businesses can continue to adjust their privacy protection techniques until an acceptable score is returned. Businesses can then act with certainty that their data has been properly anonymized and is safe to perform analytics on. This also gives consumers and regulatory authorities the peace of mind that your business has incorporated privacy values into your analytics process. Privacy risk scoring is the new standard for privacy compliance.
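The "adjust until acceptable" loop can be sketched as follows. This example assumes one particular protection technique, generalizing exact ages into progressively wider bands, which the article does not prescribe. The function names, field names, and band widths are all hypothetical, and the risk formula is the 1 over average equivalence-class size calculation described earlier.

```python
from collections import Counter


def privacy_risk(records, quasi_identifiers):
    """1 / (average equivalence-class size) = classes / records."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return len(classes) / len(records)


def generalize_age(records, band_width):
    """Replace each exact age with the lower bound of its age band."""
    return [{**r, "age": (r["age"] // band_width) * band_width} for r in records]


def protect_until_acceptable(records, threshold, band_widths=(5, 10, 20, 40)):
    """Try wider and wider age bands until the risk score meets the threshold."""
    for width in band_widths:
        candidate = generalize_age(records, width)
        risk = privacy_risk(candidate, ["sex", "age"])
        if risk <= threshold:
            return candidate, width, risk
    raise ValueError("no band width met the threshold")


# 80 records, each with a unique (sex, age) pair, so the raw risk is 100%.
people = [{"sex": s, "age": a} for s in "MF" for a in range(20, 60)]

protected, width, risk = protect_until_acceptable(people, threshold=0.11)
print(width, risk)  # 10 0.1
```

Here 5-year bands leave 16 classes of 5 people (20% risk), which misses the 11% target, while 10-year bands leave 8 classes of 10 people (10% risk), which meets it.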