The Key to Anonymizing Datasets Without Destroying Their Analytical Value

by | Oct 3, 2019 | Anonymization, Data privacy, Data Privacy Solutions, Data Privacy Technology, Privacy blog

Enterprise need for “anonymised” data lies at the core of everything from modern medical research, to personalised recommendations, to modern data science, to ML and AI techniques for profiling your customers for upselling and market segmentation. At the same time, anonymised data forms the legal foundation for demonstrating compliance with privacy regimes such as GDPR, CCPA, HIPPA, and all other established and emerging data residency and privacy laws from around the world.

For example, the GDPR Recital 26 defines anonymous information as “information which does not relate to an identified or identifiable natural person” or “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” Under GDPR law, only properly anonymized information can be handled or utilized by enterprises.

The perils of poorly or partially anonymised data

Why is anonymised data such a central part of demonstrating legal and regulatory privacy compliance? And why does failing to comply expose organisations to the risk of significant fines, and brand and reputational damage?

Because if the individuals in a dataset can be re-identified, then their promised privacy protections evaporate. Hence “anonymisation” is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified. An individual may be directly identified from their name, address, postcode, telephone number, photograph or image, or some other unique personal characteristics. An individual may also be indirectly identifiable when certain information is combined or linked together with other sources of information, including their place of work, job title, salary, gender, age, their postcode or even the fact that they have a particular medical diagnosis or condition.

Anonymization is so relevant to legislation such as GDPR because recent research has now conclusively shown that poorly or partially anonymised data can lead to an individual being identified simply by combining that data with another dataset. In 2008, individuals were re-identified from an anonymised Netflix dataset of film ratings by comparing the ratings information with public scores on the IMDb film website. In 2014. the home addresses of New York taxi drivers were identified from an anonymous datasets of individual taxi trips in the city.  

In 2018, The University of Chicago Medical team shared with Google anonymised patient records which included appointment date and time stamps and medical notes. A 2019 pending class action lawsuit brought against Google and the University claims that Google can combine the appointment date and time stamps with other records its holds from Waze, Android phones and other location records to re-identify these individuals.

And data compliance isn’t the only reason that organizations need to be smart with how they anonymize data. An equally major issue is that fully anonymised techniques tend to devalue the data or render it less useful for purposes such as data science, AI and ML, and other applications looking to gain insights and extract value. This is particularly true with indirect identifying information.     

The challenges of anonymization present businesses with a dilemma: Fully anonymising directly and indirectly identifying customer data keeps them compliant, but it renders that data less valuable and useful. But partially anonymising and the increased risks of individuals being identified.

How to anonymise datasets without wiping out their analytical value

The good news is that it is possible to create fully complaint anonymised datasets and still retain the analytical value of data for data science, and AI and ML applications. You just need the right software.

The first challenge is to understand the risk of re-identification of an individual or individuals from a dataset. This cannot be done manually or by scanning a dataset. A systematic and automated approach has to be applied to assess the risk of re-identification. This risk assessment forms a key part of demonstrating your Privacy Impact Assessment (PIA), especially in a data science and data lake environments. How many unique individuals or identifying attributes exist is a dataset that can identify an individual directly or indirectly?  For example, say there are three twenty-eight-year-old males living in a certain neighbourhood in Toronto. As there are only three individuals, if this information was combined with one other piece of information – such as employer, or car driven, or medical condition – then you have a high probability of being able to identify the individual. 

Once we’re armed with this risk assessment information, modern systems-based approaches to anonymisation can be applied. In the first example, using an anonymisation generalisation technique, we can generalise the indirect identifiers in such a manner that the analytical value of the data is still retained but we can also meet our privacy compliance objectives to fully anonymise the dataset.  So with the twenty-eight-year-old males living in a certain neighbourhood in Toronto, we can generalise gender to show that there are nine twenty-eight-year-old individuals living there, thereby reducing the risk of an individual being identified.  

Another example is age binning, where the analytical value of the data is preserved by generalising the age attribute. By binning the age “28” to a range such as “25 to 30,” we now show that there are 15 individuals aged 25 to 30 living in the Toronto neighbourhood, further reducing the risk of identification of an individual.

In the above examples, two key technologies enable us to fully anonymize datasets while retaining the analytical value: 

  1. An automated risk assessment feature which identifies the risk of re-identification in each and every dataset in a consistent and defensible manner across the enterprise is the first step. 
  2. The application of anonymisation protection using privacy protection actions such as generalisation, hierarchies, and differential privacy techniques.

Using these two techniques, enterprises can start to overcome the anonymisation dilemma.


Subscribe to our newsletter