Why masking and tokenization are not enough
This is the third blog in our Crash course in Privacy series.
Protecting consumer privacy is much more complex than just removing personally identifiable information (PII). Other types of information, such as quasi-identifiers, can re-identify individuals or expose sensitive information when combined. Privacy techniques commonly classify the fields in a data set, called attributes, into four types:
- Identifiers: Unique information that identifies a specific individual in a data set. Examples of identifiers are names, social security numbers, and bank account numbers. Any field that has a unique value for each row is also an identifier.
- Quasi-identifiers: Information that on its own is not sufficient to identify a specific individual but when combined with other quasi-identifiers makes it possible to re-identify an individual. Examples of quasi-identifiers are zip code, age, nationality, and gender.
- Sensitive: Information that is shared by many people in the population, so it is difficult to identify an individual from it alone. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data. Suppose a set of quasi-identifiers forms a group of men aged 40-50, and the sensitive attribute is “diagnosed with heart disease”. Without the quasi-identifiers, the probability of identifying who has heart disease is low, but once the sensitive attribute is combined with the quasi-identifiers, the probability is high.
- Insensitive: Information that is not identifying, quasi-identifying, or sensitive, and that should be left untransformed.
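To make the role of quasi-identifiers concrete, here is a minimal sketch that measures how many rows of a table are singled out by a combination of fields. The records, field names, and the `unique_fraction` helper are hypothetical and invented purely for illustration.

```python
from collections import Counter

# Hypothetical toy records: direct identifiers are already removed,
# but the quasi-identifiers (zip, age, gender) are still present.
records = [
    {"zip": "10001", "age": 34, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10001", "age": 34, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "age": 47, "gender": "M", "diagnosis": "heart disease"},
    {"zip": "10002", "age": 47, "gender": "M", "diagnosis": "heart disease"},
    {"zip": "10003", "age": 52, "gender": "F", "diagnosis": "flu"},
]


def unique_fraction(rows, columns):
    """Return the fraction of rows whose values in `columns` are unique."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[c] for c in columns)] == 1)
    return unique / len(rows)


for cols in (["gender"], ["zip"], ["zip", "age", "gender"]):
    print(cols, unique_fraction(records, cols))
```

In this toy table, gender alone singles out nobody, while the combination of zip code, age, and gender singles out three of the five rows. The two remaining rows share both their quasi-identifiers and their diagnosis, so anyone known to be in that group has their diagnosis disclosed even though they are not singled out.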
Apart from knowing the types of information that need to be protected, it is also important to know how privacy techniques affect data quality. There is always a trade-off between protecting privacy and retaining analytical value. The following is a review of some common privacy techniques:
- Masking: Replaces existing information with other information that looks real but is of no use to anyone who might misuse it, and that cannot be reversed. This approach is typically applied to identifying fields, such as name, credit card number, and social security number. The only masking techniques that sufficiently distort the identifying fields are suppression, randomization, and coding. However, these techniques cannot be used on privacy attributes other than identifying fields because they render the data useless for analysis.
- Tokenization: This technique replaces sensitive information with a non-sensitive equivalent, called a token. The token can be used to map back to the original data, but without access to the tokenization system, it cannot be reversed. This requires that the tokenization system be separated from the data processing systems. However, any fields replaced by tokens are useless for analysis.
- k-anonymity: This technique transforms quasi-identifiers such that each group (also called an equivalence class) has at least k members, so every individual is indistinguishable from at least k-1 others. Transformation occurs by generalizing and/or suppressing the quasi-identifiers. For example, if k is set equal to 5, then any group must contain at least 5 individuals. As k increases, the data becomes more general and the risk of re-identification is reduced, but at the same time analytical value is reduced as well. By balancing privacy risk with data quality, the resulting data can still be used for analysis.
- Differential Privacy: This technique uses randomness to limit how much anyone can learn about whether a particular individual is in a dataset or not. One approach is the use of random noise to alter aggregate results. For example, two professors each publish a report, using data from different months, on the number of students with international parents. A smart student notices that the two numbers differ by 1 and deduces that Joe, who dropped out last month, is the missing student. That student now knows that Joe had international parents. If both professors used differential privacy, each reported number would include random noise, so the difference between the two reports would no longer reliably reveal the missing student, making re-identification very difficult. There are many approaches to utilizing differential privacy. The most promising approaches provide significant privacy guarantees while the data can still be used for analysis.
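As an illustration of tokenization, the sketch below keeps the token-to-value mapping inside a vault object. The `TokenVault` class is hypothetical and holds everything in memory; in practice the mapping would live in a separate, access-controlled system, as described above.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._to_token = {}  # original value -> token
        self._to_value = {}  # token -> original value

    def tokenize(self, value):
        """Return a random token for `value`, reusing it on repeat calls."""
        if value not in self._to_token:
            token = secrets.token_hex(8)  # random, not derived from the value
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token):
        """Map a token back to its original value (requires vault access)."""
        return self._to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token)                    # a random hex string, different on every run
print(vault.detokenize(token))  # the original value, recoverable only via the vault
```

Because the token is generated randomly rather than computed from the value, it cannot be reversed without the vault's mapping.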
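The k-anonymity idea can be sketched as follows: generalize the quasi-identifiers, then check the size of the smallest equivalence class. The rows, the 10-year age bands, the 3-digit zip prefix, and the helper names `generalize`, `raw_key`, and `smallest_class` are all assumptions chosen for illustration, not a prescribed generalization scheme.

```python
from collections import Counter

# Hypothetical rows containing only quasi-identifiers.
rows = [
    {"age": 41, "zip": "10001", "gender": "M"},
    {"age": 45, "zip": "10002", "gender": "M"},
    {"age": 48, "zip": "10003", "gender": "M"},
    {"age": 33, "zip": "10104", "gender": "F"},
    {"age": 36, "zip": "10105", "gender": "F"},
    {"age": 39, "zip": "10101", "gender": "F"},
]


def raw_key(row):
    """Quasi-identifiers exactly as recorded."""
    return (row["age"], row["zip"], row["gender"])


def generalize(row):
    """Coarsen quasi-identifiers: 10-year age band and 3-digit zip prefix."""
    low = (row["age"] // 10) * 10
    return (f"{low}-{low + 9}", row["zip"][:3] + "**", row["gender"])


def smallest_class(rows, key):
    """Size of the smallest equivalence class, i.e. the k the table satisfies."""
    return min(Counter(key(row) for row in rows).values())


print(smallest_class(rows, raw_key))     # every row is unique before generalizing
print(smallest_class(rows, generalize))  # classes grow once values are coarsened
```

Before generalization every row forms its own class (k = 1). After generalization the six rows fall into two classes of three, so the table satisfies k = 3, at the cost of losing exact ages and zip codes.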
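For the differential privacy example, one common way to add random noise to a count is the Laplace mechanism. The sketch below is a minimal version under stated assumptions: the privacy parameter `epsilon`, the counts 12 and 11, and the function names are hypothetical, and a production system would rely on a vetted library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random()
epsilon = 0.5  # assumed privacy parameter; smaller means more noise

# Two reports from different months whose true counts differ by one student.
report_before = noisy_count(12, epsilon, rng)
report_after = noisy_count(11, epsilon, rng)
print(round(report_before), round(report_after))
```

Because each report carries independent noise, the gap between the two published numbers no longer reliably indicates that exactly one student left, while averages over many such counts remain close to the true values.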
In order to protect consumer privacy and retain analytical value, it is important to choose the proper privacy technique for your desired application.