Create a privacy-protected dataset where privacy risk has been balanced with analytical value. Available as a downloadable app, as a stand-alone product, or as a plug-in to your preferred data science platform.
Your data contains personal information that can re-identify individuals in your dataset.
Unique information that identifies a specific individual in a data set. Examples of personal identifiers are names, social security numbers, and bank account numbers.
Information that on its own is not sufficient to identify a specific individual but, when combined with other quasi-identifiers, can be used to re-identify an individual. Examples of quasi-identifiers are zipcode, age, nationality, and gender.
Information that is more general across the population, making it difficult to identify an individual from it alone. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data.
Protecting privacy comes at a price: as privacy increases, analytical value decreases.
Depending on your desired application, CN-Protect offers a variety of privacy models and data quality metrics. Each privacy model interacts with the dataset differently to produce a privacy-protected dataset and is configured through a set of parameters that you choose. You can also combine different privacy models to achieve the desired level of privacy protection for your application. These models work to reduce risk while maximizing analytical value. Different data quality metrics are also available to help you determine the best trade-off between risk and quality.
k-Anonymity protects against re-identification of an individual and against disclosure of whether an individual is a member of a dataset.
- Transforms quasi-identifiers (like age and zip code) such that each group (also called an equivalence class) has at least k members, which are indistinguishable from each other.
- Transformation occurs by generalizing or suppressing the quasi-identifiers.
- As k increases, risk decreases and data quality decreases.
- For example, if k is set to 5, then any group must contain at least 5 individuals.
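As a minimal sketch (not CN-Protect's implementation), the k-anonymity property can be checked by grouping rows on their generalized quasi-identifier values and verifying that every equivalence class has at least k members:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every equivalence class (rows sharing the same
    quasi-identifier values) contains at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical records whose quasi-identifiers are already generalized:
# ages binned into decades, zip codes truncated to three digits.
records = [
    {"age": "30-39", "zipcode": "902**"},
    {"age": "30-39", "zipcode": "902**"},
    {"age": "30-39", "zipcode": "902**"},
    {"age": "40-49", "zipcode": "331**"},
]
print(is_k_anonymous(records, ["age", "zipcode"], 3))  # False: one class has a single member
```

A real anonymizer would search over generalization levels (and suppress outlier rows) until this check passes for the chosen k.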
t-Closeness protects against linking an individual with a sensitive attribute.
- Transforms quasi-identifiers such that the distribution of sensitive values within each group is within a distance t of the distribution of sensitive values for the entire dataset.
- The transformations used are generalization and suppression.
- As t decreases, the risk of sensitive attribute disclosure decreases and data quality decreases.
- For example, if the sensitive attribute is salary, then each group’s frequency distribution of salary will be within a distance t of the salary frequency distribution for the entire dataset. The distance is measured as the cumulative absolute difference of the distributions.
Differential Privacy protects against re-identification of an individual, against linking an individual to a sensitive attribute, and against disclosure of whether an individual is a member of the dataset.
- Characterizes an algorithm rather than a dataset. Differential Privacy and its variants (epsilon, epsilon-delta) describe the distance between the outputs of randomized algorithms when the datasets input to those algorithms differ by a single element.
- Randomized means that the output of the algorithm is not deterministic; output distance is defined by the probabilities of landing in a particular part of the space of possible outputs, conditional on the input.
- For example, an Epsilon Differentially Private algorithm is a randomized algorithm whose probabilistic outputs on two input datasets that differ by a single item are indistinguishable up to a multiplicative function of a constant Epsilon.
- If using Epsilon-Delta, then 1-Delta gives the probability that Epsilon Differential Privacy holds.
- As Epsilon or Delta decreases, the algorithm becomes more Differentially Private and data quality decreases.
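A classic way to achieve Epsilon Differential Privacy for a numeric query is the Laplace mechanism: add noise scaled to the query's sensitivity divided by Epsilon. The sketch below (illustrative, not CN-Protect's implementation) releases a count query, which has sensitivity 1 because adding or removing one record changes the true count by at most 1:

```python
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse transform."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(data, predicate, epsilon):
    """Release the number of records matching `predicate` under
    epsilon-differential privacy. Smaller epsilon -> more noise."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

# Noisy count of records under 50 in a toy dataset of 0..99.
noisy = private_count(range(100), lambda x: x < 50, epsilon=1.0)
```

Repeating the query averages out the noise, which is why real deployments track a cumulative "privacy budget" across queries.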
Data Quality Metrics
Information loss measures the average amount of information removed by making the data more general.
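As a toy illustration of this idea (one of many possible loss measures, not necessarily the one CN-Protect uses), the loss for a truncated zip code can be taken as the fraction of digits masked, averaged over the column:

```python
def digit_loss(generalized, total_digits=5):
    """Fraction of a zip code's digits removed by generalization,
    e.g. "902**" keeps 3 of 5 digits, losing 2/5 = 0.4."""
    kept = len(generalized.rstrip("*"))
    return (total_digits - kept) / total_digits

zipcodes = ["902**", "902**", "331**"]
avg_loss = sum(digit_loss(z) for z in zipcodes) / len(zipcodes)  # 0.4
```

An untouched value contributes zero loss, while a fully suppressed value ("*****") contributes 1.0.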
Impact on Machine Learning
The measure of data quality may depend on how the data is being used. This metric optimizes how generalization is applied to the data by applying a penalty if a row is suppressed or if the class label (for classification models, for example) is not the same for all members of an equivalence class.
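A minimal sketch of such a classification-aware penalty (illustrative only; the field names and weighting are assumptions): count one penalty unit for each suppressed row and one for each row whose class label disagrees with the majority label of its equivalence class.

```python
from collections import defaultdict

def classification_penalty(rows, quasi_identifiers, label_col):
    """Penalize suppressed rows and rows whose label differs from the
    majority label of their equivalence class."""
    groups = defaultdict(list)
    penalty = 0
    for row in rows:
        if row.get("suppressed"):
            penalty += 1  # a suppressed row carries no training signal
            continue
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[label_col])
    for labels in groups.values():
        majority = max(set(labels), key=labels.count)
        penalty += sum(1 for label in labels if label != majority)
    return penalty

rows = [
    {"age": "30-39", "label": "yes"},
    {"age": "30-39", "label": "yes"},
    {"age": "30-39", "label": "no"},   # minority label in its class
    {"age": "40-49", "label": "no", "suppressed": True},
]
penalty = classification_penalty(rows, ["age"], "label")  # 2
```

Choosing generalizations that minimize this penalty keeps equivalence classes label-homogeneous, which tends to preserve classifier accuracy on the anonymized data.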