The Privacy Risk Most Data Scientists Are Missing
Data breaches are becoming increasingly common, and the cost of being involved in one is going up. A report from the Ponemon Institute (an IBM-backed think tank) found that the average cost of a data breach in 2018 was $148 per record, up nearly 5% from 2017.
To satisfy privacy regulations, compliance teams are using methods like masking and tokenization to protect their data, but these methods come at a cost. Businesses often find that these solutions prevent data from being leveraged for analytics and, on top of that, still leave the data exposed.
Many data scientists and compliance departments protect and secure only direct identifiers. They hide an individual’s name or social security number and move on. The assumption is that once the values unique to a user are removed, the dataset has been de-identified. Unfortunately, that is not the case.
In 2006, Netflix announced a $1 million competition for whoever could build the best movie-recommendation engine. To facilitate this, it released a large volume of subscriber data with direct identifiers redacted, so engineers could use Netflix’s actual data without compromising consumer privacy. However, researchers showed that the remaining indirect identifiers (also known as quasi-identifiers), such as movie ratings and the dates they were submitted, could be combined with publicly available information to re-identify individual subscribers. A lawsuit was filed against Netflix in 2009, and in 2010 the company called off a planned follow-up competition that would have released demographic details such as users’ age, gender, and zip code.
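To make the idea of quasi-identifiers concrete, here is a minimal sketch in Python that measures how many records in a dataset are unique on a given combination of fields. The toy records and the helper name unique_fraction are hypothetical and exist only for illustration; they are not Netflix data or any particular tool's API.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records) if records else 0.0

# Hypothetical toy data: names and other direct identifiers already removed.
records = [
    {"age": 34, "gender": "F", "zip": "10001"},
    {"age": 34, "gender": "F", "zip": "10002"},
    {"age": 34, "gender": "M", "zip": "10001"},
    {"age": 51, "gender": "M", "zip": "10001"},
    {"age": 51, "gender": "M", "zip": "10001"},
]

# No record is unique on gender alone.
print(unique_fraction(records, ["gender"]))                 # 0.0
# Three of the five records are unique once the fields are combined.
print(unique_fraction(records, ["age", "gender", "zip"]))   # 0.6
```

Any record that is unique on its quasi-identifiers can be matched against an outside source that carries the same fields alongside a name, which is why removing direct identifiers alone does not de-identify a dataset.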
When it comes to the risk exposure of indirect identifiers, it’s not a question of if, but when. That’s a lesson companies have repeatedly learned the hard way. Marriott, the hotel chain, suffered a data breach of 500 million consumer records and faced $72 million in damages due to a failure to protect indirect identifiers.
Businesses are faced with a dilemma. Do they redact all their data and leave it barren for analysis? Or do they leave indirect identifiers unprotected and create an avenue for exposure that will eventually leak their customers’ private data?
Either option causes problems.
That’s why we founded CryptoNumerics. Our software uses AI to autonomously classify the fields in your datasets as direct, indirect, sensitive, or insensitive identifiers. We then use cutting-edge data science technologies like differential privacy, k-anonymization, and secure multi-party computation to anonymize your data while preserving its analytical value. Your datasets are comprehensively protected and de-identified while maintaining the integrity needed for machine learning and data analysis.
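As a rough illustration of one of the techniques named above, the following Python sketch shows k-anonymization by generalization: quasi-identifiers are coarsened until every combination is shared by at least k records. This is a simplified, assumed example, not CryptoNumerics’ implementation; the functions generalize and k_of, the 10-year age bands, and the 3-digit zip prefix are all hypothetical choices.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: 10-year age band and 3-digit zip prefix."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "gender": record["gender"],
        "zip": record["zip"][:3] + "**",
    }

def k_of(records, quasi_identifiers):
    """Size of the smallest group of records sharing one quasi-identifier combination."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0

# Hypothetical toy data with direct identifiers already removed.
records = [
    {"age": 31, "gender": "F", "zip": "10001"},
    {"age": 34, "gender": "F", "zip": "10002"},
    {"age": 38, "gender": "F", "zip": "10003"},
    {"age": 52, "gender": "M", "zip": "10011"},
    {"age": 57, "gender": "M", "zip": "10025"},
]
qi = ["age", "gender", "zip"]

print(k_of(records, qi))        # 1: every raw record is unique
generalized = [generalize(r) for r in records]
print(k_of(generalized, qi))    # 2: each record now shares its combination with at least one other
```

The trade-off is visible even in this toy case: the coarser the bands, the larger k becomes, but the less detail remains for analysis. Choosing how far to generalize each field is where the balance between privacy and analytical value is struck.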
Data is the new oil. Artificial intelligence and machine learning represent the future of technological value, and any company that does not keep up will be left behind and disrupted. Businesses cannot afford to leave data siloed or uncollected.
Likewise, data privacy is no longer an issue that can be ignored. Scandals like Cambridge Analytica and regulations like GDPR prove that, yet the industry still lacks awareness of key risks such as indirect identifiers. Companies that use their data irresponsibly will feel the repercussions, but those that don’t use their data at all will be left behind. Choose not to fall into either category.