Google and the University of Chicago’s Medical Center have made headlines for the wrong reasons. According to a June 26th New York Times report, a lawsuit filed in the US District Court for Northern Illinois alleged that a data-sharing partnership between the University of Chicago’s Medical Center and Google had “shared too much personal information,” without appropriate consent. Though the data sets had ostensibly been anonymized, the potential for re-identification was too high. Therefore, they had compromised the privacy rights of the individual named in the lawsuit.
The project was touted as a way to improve predictions in medicine and realize the utility of electronic health records through data science. Its coverage today instead focuses on risks to patients and invasions of privacy. Across industries like finance, retail, telecom, and more, the same potential for positive impact through data science exists, as does the potential for exposure-risk to consumers. The potential value created through data science is such that institutions must figure out how to address privacy concerns.
No one wants their medical records and sensitive information to be exposed. Yet, they do want research to progress and to benefit from innovation. That is the dilemma faced by individuals today. People are okay with their data being used in medical research, so long as their data is protected and cannot be used to re-identify them. So where did the University of Chicago go wrong in sharing data with Google — and was it a case of negligence, ignorance, or a lack of investment?
The basis of the lawsuit claims that the data shared between the two parties were still susceptible to re-identification through inference attacks and mosaic effects. Though the data sets had been stripped of direct identifiers and anonymized, they still contained date stamps of when patients checked in and out of the hospital. When combined with other data that Google held separately, like location data from phones and mapping apps, the university’s data could be used to re-identify individuals in the data set. Free text medical notes from doctors, though de-identified in some fashion, were also contained in the data set, further compounding the exposure of private information.
Inference attacks and mosaic effect methods combine information from different data sets to re-identify individuals. They are now well-documented realities that institutions cannot be excused for being ignorant of. Indirect identifiers must also be assessed for the risk of re-identification of an individual and included when considering privacy-protection.
Significant advancements in data science have led to improvements in data privacy technologies, and controls for data collaboration. Autonomous, systematic, meta-data classification, and re-identification risk assessment and scoring, are two processes that would have made an immediate difference in this case. Differential privacy and Secure Multiparty-Computation are two others.
Privacy automation systems encompassing these technologies are a reality today. Privacy management is often seen as an additional overhead cost to data science projects. That is a mistake. Tactical use of data security solutions, like encryption and hashing, to privacy-protect data sets are also not enough, as can be attested to by the victims of this case.
As we saw with Cybersecurity over the last decade, it took several years and continued data theft and hacks making headlines before organizations implemented advanced Cybersecurity and intrusion detection systems. Cybersecurity solutions are now seen as an essential component of an enterprise’s infrastructure and have a commitment at the board level to keep company data safe and their brand untarnished. Boards must reflect on the negative outcomes of lawsuits like this one, where the identity of its customers are being compromised, and their trust damaged.
Today, data science projects without advanced automated privacy protection solutions should not pass internal privacy governance and data compliance. Additionally, these projects should not use customer data, even if the data is anonymized, until automated privacy risk assessments solutions can accurately reveal the level of re-identification risk (inclusive of inference attacks, and the mosaic effect).
With the sensitivity around privacy in data science projects in our public discourse today, any enterprise not investing and implementing advanced privacy management systems only exposes itself as having no regard for the ethical use of customer data. The potential for harm is not a matter of if, but when.