Protect Your Data Throughout the Pipeline

Organizations all over the world have embraced the opportunities that data and data analysis present. Millions of dollars are spent every year on designing and implementing data pipelines that allow organizations to extract value from their data. Unfortunately, data misuse and data breaches have led government bodies to introduce regulations such as GDPR, CCPA, and HIPAA, bestowing privacy rights upon consumers and placing responsibilities upon businesses.

Maximizing data value is essential; however, privacy regulations must be satisfied in the process. This is achievable by implementing privacy-protecting techniques throughout the data pipeline to avoid compliance risks.

Before introducing the privacy-protecting techniques, it is important to understand the four stages of the data pipeline:

  1. Data Acquisition: first, the data must be acquired; it can be generated internally or obtained externally from third parties.
  2. Data Organization: the data is stored for future use and needs to be protected along the pipeline to avoid misuse and breaches. This can be achieved using access controls.
  3. Data Analysis: the data is opened up and mobilized for analysis, which allows for a better understanding of an organization’s operations and customers, as well as improved forecasting.
  4. Data Publishing: lastly, analysis results are published and/or internal data is shared with another party.

Now that we have covered the four stages of the data pipeline, let’s go over the privacy-protecting techniques that can be implemented throughout the pipeline to make it privacy-protected. They fall into four groups: randomizing, sanitizing, output, and distributed computing.

Within the randomizing group, there are two techniques: additive and multiplicative noise, where random noise is added to or multiplied with each individual’s record to transform the data. These techniques can be used in the Data Acquisition stage of the data pipeline.
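As a rough illustration, here is a minimal Python sketch of both kinds of noise applied to a small, made-up numeric attribute. The column values and noise scales are assumptions chosen for the example, not figures from any real pipeline:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical numeric attribute (e.g., individual salaries) to be randomized.
salaries = np.array([52_000, 61_500, 48_250, 75_000, 39_900], dtype=float)

# Additive noise: perturb each record with zero-mean Gaussian noise.
additive_noise = rng.normal(loc=0.0, scale=2_000.0, size=salaries.shape)
salaries_additive = salaries + additive_noise

# Multiplicative noise: scale each record by a random factor centred on 1.
multiplicative_noise = rng.normal(loc=1.0, scale=0.05, size=salaries.shape)
salaries_multiplicative = salaries * multiplicative_noise

print(salaries_additive)
print(salaries_multiplicative)
```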

The sanitizing group has five privacy techniques in it. The first is k-anonymity, where the identifying attributes of any record in a database are indistinguishable from those of at least k−1 other records. Next comes l-diversity, an extension of k-anonymity that addresses one of its shortfalls by ensuring there is a diversity of sensitive values within each group of indistinguishable records. Another technique in this group is t-closeness, which ensures that the distribution of sensitive values within each group stays close to their distribution in the whole dataset, keeping the distance between the two within a threshold t to prevent attribute disclosure. Additionally, there is the personalized privacy technique, in which privacy levels are defined and customized by the data owners themselves. The last technique in this group is ε-differential privacy, which ensures that the presence or absence of any single record does not significantly affect the outcome of the data’s analysis. These techniques can be used in the Data Acquisition, Data Organization, and Data Publishing stages of the data pipeline.
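To make two of these concrete, the sketch below checks k-anonymity over a tiny, hypothetical table of quasi-identifiers and then answers a count query under ε-differential privacy using the standard Laplace mechanism. The table, k, and ε are illustrative assumptions, not values from the article:

```python
import numpy as np
import pandas as pd

# Hypothetical table with two quasi-identifiers and one sensitive attribute.
df = pd.DataFrame({
    "age_band":   ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip_prefix": ["100",   "100",   "100",   "212",   "212"],
    "diagnosis":  ["flu",   "asthma", "flu",  "flu",   "diabetes"],
})

# k-anonymity: every combination of quasi-identifier values must occur at least k
# times, so no record is distinguishable from fewer than k-1 others.
k = 2
group_sizes = df.groupby(["age_band", "zip_prefix"]).size()
print(f"k-anonymous for k={k}:", bool((group_sizes >= k).all()))

# ε-differential privacy via the Laplace mechanism: a counting query has sensitivity 1,
# so adding Laplace noise with scale 1/ε hides the contribution of any single record.
rng = np.random.default_rng(seed=0)
epsilon = 0.5
true_count = int((df["diagnosis"] == "flu").sum())
noisy_count = true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
print("noisy flu count:", round(noisy_count, 2))
```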

The output group has three techniques, which are used to reduce the inference of sensitive information from the output of any algorithm. The first technique is association rule hiding, which suppresses mined rules that could be used to expose private information about individuals in the data set. Next, there is the downgrading classifier effectiveness technique, where data is sanitized to reduce a classifier’s effectiveness enough to prevent sensitive information from being leaked. Finally, there is query auditing and inference control, which monitors and restricts queries whose outputs could be used to infer sensitive information. These techniques can be applied to the Data Publishing stage of the data pipeline.
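A minimal sketch of the query auditing idea, assuming a hypothetical patient table and a simple minimum-group-size rule; the threshold and column names are illustrative, not from the article:

```python
from typing import Optional

import pandas as pd

# Hypothetical patient table used to illustrate a minimum-group-size rule.
patients = pd.DataFrame({
    "zip_prefix": ["100", "100", "100", "212"],
    "age_band":   ["30-39", "30-39", "40-49", "40-49"],
    "diagnosis":  ["flu", "asthma", "flu", "diabetes"],
})

MIN_GROUP_SIZE = 3  # queries matching fewer records than this are refused

def audited_count(df: pd.DataFrame, **filters) -> Optional[int]:
    """Answer a count query only if it covers enough records to avoid singling anyone out."""
    matches = df
    for column, value in filters.items():
        matches = matches[matches[column] == value]
    if len(matches) < MIN_GROUP_SIZE:
        return None  # refused: the answer could reveal information about individuals
    return len(matches)

print(audited_count(patients, zip_prefix="100"))   # 3 -> answered
print(audited_count(patients, zip_prefix="212"))   # 1 -> refused (None)
```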

Last but not least, there is the distributed computing group, made up of seven privacy-protecting techniques. In 1-out-of-2 oblivious transfer, a sender transmits two encrypted messages, the receiver learns exactly one of them, and the sender never learns which one was received. Another technique in this group is homomorphic encryption, a method of performing a calculation on encrypted information (ciphertext) without decrypting it (to plaintext) first. Secure sum computes the sum of the parties’ inputs without revealing the individual inputs to anyone. Secure set union creates the union of several parties’ sets without revealing which party contributed which elements. Secure size of intersection determines how large the intersection of the parties’ data sets is without revealing the data itself. The scalar product technique computes the scalar product of two vectors without either party revealing its input vector to the other. Finally, the private set intersection technique computes the intersection of two parties’ sets without revealing anything else; this technique can be used in the Data Acquisition stage as well. All of the techniques in the distributed computing group prevent access to the original, raw data while still allowing analysis to be performed, and all of them can be applied to the Data Analysis and Data Publishing stages of the data pipeline. Homomorphic encryption can also be used in the Data Organization stage.
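As one concrete example from this group, here is a minimal secure-sum sketch based on additive secret sharing. The parties, values, and modulus are assumptions for illustration; a production protocol would add authenticated channels and protections against collusion:

```python
import secrets

# Minimal secure-sum sketch using additive secret sharing over a prime field.
# Each party splits its private input into random shares, one per party, so that
# the shares sum to the input modulo P. No single party ever sees another's input.
P = 2_147_483_647  # public prime modulus, larger than any possible total

def make_shares(value: int, n_parties: int) -> list:
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

private_inputs = [120, 340, 75]                 # each party's private value
n = len(private_inputs)

# Every party i distributes one share of its input to each party j.
all_shares = [make_shares(v, n) for v in private_inputs]

# Each party j adds up the shares it received and publishes only that partial sum.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# The published partial sums reveal only the total, not any individual input.
print(sum(partial_sums) % P)                    # 535
```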

These techniques help protect data privacy throughout the data pipeline. For a visual comparison of the privacy-exposed pipeline versus the privacy-protected pipeline, download the ‘Data Pipeline infographic’.

For more information, or to find out how to privacy-protect your data, contact us today at [email protected].

How Google Can Solve its Privacy Problems

Google and the University of Chicago’s Medical Center have made headlines for the wrong reasons.  According to a June 26th New York Times report, a lawsuit filed in the US District Court for Northern Illinois alleged that a data-sharing partnership between the University of Chicago’s Medical Center and Google had “shared too much personal information,” without appropriate consent. Though the datasets had ostensibly been anonymized, the potential for re-identification was too high. Therefore, they had compromised the privacy rights of the individual named in the lawsuit.

The project was touted as a way to improve prediction in medicine and realize the utility of electronic health records through data science. Coverage today instead focuses on risks to patients and invasions of privacy. Across industries like finance, retail, and telecom, the same potential for positive impact through data science exists, as does the risk of exposing consumers. The potential value created through data science is such that institutions must figure out how to address privacy concerns.

No one wants their medical records and sensitive information to be exposed. Yet they do want research to progress, and to benefit from innovation. That is the dilemma faced by individuals today. People are okay with their data being used in medical research, so long as their data is protected, and cannot be used to re-identify them. So where did the University of Chicago go wrong in sharing data with Google — and was it a case of negligence, ignorance, or a lack of investment?

The lawsuit claims that the data shared between the two parties was still susceptible to re-identification through inference attacks and the mosaic effect. Though the datasets had been stripped of direct identifiers and anonymized, they still contained date stamps of when patients checked in and out of the hospital. When combined with other data that Google held separately, such as location data from phones and mapping apps, those date stamps could be used to re-identify individuals in the data set. Free-text medical notes from doctors, though de-identified in some fashion, were also contained in the data set, further compounding the exposure of private information.

Inference attacks and mosaic-effect methods combine information from different datasets to re-identify individuals. They are now well-documented realities that institutions cannot be excused for ignoring. Indirect identifiers must therefore also be assessed for re-identification risk and included when considering privacy protection. What most are unaware of is that this can be done without decimating the analytical value of the data required for data science, analytics, and ML.
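To see why timestamps alone can defeat anonymization, here is a minimal sketch of a linkage (mosaic-effect) attack on entirely hypothetical data; the tables, names, and timestamps are invented for illustration and do not describe the actual datasets in the case:

```python
import pandas as pd

# Hypothetical de-identified hospital records: direct identifiers removed,
# but admission timestamps (indirect identifiers) retained.
hospital = pd.DataFrame({
    "record_id": ["r1", "r2"],
    "admitted":  ["2018-03-02 09:15", "2018-03-05 14:40"],
    "diagnosis": ["cardiac arrhythmia", "type 2 diabetes"],
})

# Hypothetical location history held separately (e.g., from a mapping app),
# showing when a named user was at the hospital.
locations = pd.DataFrame({
    "user":      ["alice", "bob"],
    "place":     ["hospital", "hospital"],
    "timestamp": ["2018-03-02 09:15", "2018-03-05 14:40"],
})

# A simple join on the timestamp is enough to re-attach identities to the
# "anonymized" medical records: the mosaic effect in miniature.
linked = hospital.merge(locations, left_on="admitted", right_on="timestamp")
print(linked[["user", "diagnosis"]])
```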

Significant advancements in data science have led to improvements in data privacy technologies and controls for data collaboration. Autonomous, systematic metadata classification and re-identification risk assessment and scoring are two that would have made an immediate difference in this case. Differential privacy and secure multiparty computation are two others.

Privacy automation systems encompassing these technologies are a reality today. Privacy management is often seen as an additional overhead cost to data science projects; that is a mistake. Tactical use of data security solutions like encryption and hashing to privacy-protect datasets is also not enough, as attested to by this case involving Google and the University of Chicago Medical Center.

As we saw with cybersecurity over the last decade, it took several years of continued data theft and hacks making headlines before organizations implemented advanced cybersecurity and intrusion detection systems. Cybersecurity solutions are now seen as an essential component of an enterprise’s infrastructure, with board-level commitment to keeping company data safe and the brand untarnished. Boards must reflect on the negative outcomes of lawsuits like this one, where the identities of customers are compromised and their trust damaged.

Today’s data science projects, without advanced automated privacy protection solutions, should not pass internal privacy governance and data compliance. Additionally, these projects should not use customer data, even if the data is anonymized, until automated privacy risk assessment solutions can accurately reveal the level of re-identification risk (inclusive of inference attacks and the mosaic effect).

With the sensitivity around privacy in data science projects in public discourse today, any enterprise not investing in and implementing advanced privacy management systems exposes itself as having no regard for the ethical use of customer data. The potential for harm is not a matter of if, but when.

Do You Know What Your Data is Worth?

Your data is more than your name, age, gender, and address. Your Google searches, tweets, comments, time spent on videos and posts, purchase behaviours, smart home assistant commands, and much more are also your data.

There is a new bill in the U.S. Senate that would require technology companies to disclose to each of their users the actual value of their data. While this proposed law seeks to further protect individuals’ privacy, evaluating the exact value of someone’s data is more difficult than it seems. Currently, evaluations range from $1.00 USD for an average person’s data to $100.00 USD for someone with an active social media presence.

Data sensitivity doesn’t just come from the data itself, but also from how companies and agencies can use the data to exert influence. Shoshana Zuboff, author of The Support Economy: Why Corporations Are Failing Individuals, expands on this, claiming that tech giants like Google and Facebook are practicing surveillance capitalism with the intention of shaping consumer behaviour toward a more profitable future.

The truth is, datafication, which refers to the processes and tools that transform a business into a data-driven enterprise, doesn’t affect everyone equally. Women, minorities, and people with low incomes are affected much more than the rest. The new proposed bill aims to address these concerns.

Three Tips to Maintain Data Privacy for Marketers

A large chunk of a digital marketer’s time is spent understanding and working with consumer data. Marketers analyze consumer data daily, from click-through rates to unsubscribe rates. The more data they have, the more powerful their personalization efforts become, from relevant product recommendations to a consumer’s preferred communication method.

Additionally, marketers must know and comply with privacy regulations, such as GDPR, HIPAA, and CCPA. Here are three tips you can use to prepare for new privacy regulations without sacrificing your digital marketing efforts:

  • Conduct regular data reviews to make sure company policies are up to date
  • Know how the data is collected, used, analyzed and shared
  • Use the right technology to gain insights from data while protecting people’s privacy

Data Sharing in the Healthcare Field

When it comes to the use of healthcare data, many ethical questions arise. Who is responsible for the safety of health data? Who owns co-produced clinical trial data?  

“We owe it to patients participating in research to make the data they help generate widely and responsibly available. The majority desire data sharing, viewing it as a natural extension of their commitment to advance scientific research” (Source).

Researchers can develop new cures and speed up the innovation process through data sharing. However, data is not easily shared, especially in the healthcare field. To address this problem, researchers from universities such as Yale and Stanford are creating a set of good data-sharing practices for both healthcare facilities and pharmaceutical companies. They have also partnered with key stakeholders and end users to ensure a well-rounded approach to their guidelines.

The Safety of Healthcare Data is a Top Priority

There is no doubt that medical data and healthcare records are highly sensitive. However, recent events have shone a light on the fact that this data is not secure enough. How can we reduce privacy risk while still allowing researchers to use our medical data to benefit us all?

Pressure builds to secure health care data

Due to recent healthcare data breaches, there has been a strong push for the US federal government to increase protections for personal medical information. This concern is growing as more healthcare processes shift from paper to online and as more of the data is turned into analytics for better patient care.

For example, reporter Maggie Miller states that “one major recent data breach led to the personal information of 20 million customers of blood testing groups Quest Diagnostics, LabCorp and Opko Health being exposed”.

Currently, much of the momentum has been in urging lawmakers to focus on the sale and use of data in the social media space; however, in light of recent breaches, much more attention is now being directed toward securing health records and medical data.

Evidence That Consumers Are Now Putting Privacy Ahead Of Convenience: Gartner

Gartner researchers have found that a considerable number of consumers and employees are unwilling to trade their data’s security, safety, and peace of mind for more convenience.

With that in mind, many companies and organizations are redefining their internal views of customer data.

Chris Howard, a distinguished research vice president at Gartner, states that “As a CIO, you have a mandate to maintain data protections on sensitive data about consumers, citizens and employees. This typically means putting someone in charge of a privacy management program, detecting and promptly reporting breaches, and ensuring that individuals have control of their data. This is a board-level issue, yet barely half of organizations have adequate controls in place” (Source).

Recently, at the Gartner IT Symposium in Toronto, he argued that companies must be able to change their practices and become more adaptive to privacy-related demands. Gartner calls this the ‘ContinuousNext’ approach, and they hope it will build momentum through digital transformation and beyond. 

The steady erosion of privacy at home 

Most public areas are under the watch of AI cameras, cellphone companies, and advertisers that watch your every move. 

All these internet-connected gadgets, smart assistants, connected light bulbs, video doorbells, Wi-Fi thermostats, you name it, are watching you.

The problem: these devices learn to pick up your voice, interests, habits, TV preferences, meals, times home and away, and all other types of sensitive data. The gadgets then relay this information back to the companies that manufactured them.

However, can people switch back to their old ways? Can people go back to regular temperature control systems, TVs that aren’t smart, and human assistants rather than robotic ones?

Nevertheless, the Supreme Court has indeed placed new boundaries on digital snooping, especially snooping without warrants and consent. What does the future look like for a world that cannot live without tech?

The Privacy Risk Most Data Scientists Are Missing

Data breaches are becoming increasingly common, and the risks of being involved in one are going up. A report from the Ponemon Institute, an IBM-backed think tank, found that the average cost of a data breach in 2018 was $148 per record, up nearly 5% from 2017.

Privacy and compliance teams are using methods like masking and tokenization to protect their data, but these methods come at a cost. Businesses often find that these solutions prevent data from being leveraged for analytics and, on top of that, still leave the data exposed.

Many data scientists and compliance departments protect and secure direct identifiers. They hide an individual’s name, or their social security number, and move on. The assumption is that by removing unique values from a user, the dataset has been de-identified. Unfortunately, that is not the case.

In 2010, Netflix announced a $1 million competition for whoever could build it the best movie-recommendation engine. To facilitate this, it released large volumes of subscriber data with redacted direct identifiers, so engineers could use Netflix’s actual data without compromising consumer privacy. The available information included users’ age, gender, and zip code. However, when these indirect identifiers (also known as quasi-identifiers) were taken in combination, they could re-identify a user with over 90% accuracy. That is exactly what happened, resulting in the exposure of millions of Netflix’s consumers. Within a few months, the competition had been called off, and a lawsuit was filed against Netflix.
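The underlying problem is easy to demonstrate: attributes that are common on their own become nearly unique in combination. The sketch below counts how many rows of a hypothetical subscriber extract are unique on (age, gender, zip); all values are invented for illustration, not drawn from the Netflix data:

```python
import pandas as pd

# Hypothetical subscriber extract with direct identifiers removed but
# quasi-identifiers (age, gender, zip) retained.
subscribers = pd.DataFrame({
    "age":    [34, 34, 27, 51, 27],
    "gender": ["F", "F", "M", "F", "M"],
    "zip":    ["10027", "94105", "60614", "10027", "60614"],
})

# Individually, each attribute is shared by many people; in combination,
# most rows become unique and therefore re-identifiable by linkage.
combo_sizes = subscribers.groupby(["age", "gender", "zip"]).size()
unique_rows = int((combo_sizes == 1).sum())
print(f"{unique_rows} of {len(subscribers)} rows are unique on (age, gender, zip)")
```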

When it comes to the risk exposure of indirect identifiers, it’s not a question of if, but a question of when. That’s a lesson companies have continuously learned the hard way. Marriott, the hotel chain, suffered a data breach of 500 million consumer records and faced $72 million in damages due to a failure to protect indirect identifiers.

Businesses are faced with a dilemma. Do they redact all their data and leave it barren for analysis? Or do they leave indirect identifiers unprotected and create an avenue for exposure that will lead to an eventual leak of their customers’ private data?

Either option causes problems. That can be changed!

That’s why we founded CryptoNumerics. Our software is able to autonomously classify your datasets into direct, indirect, sensitive, and insensitive identifiers, using AI. We then use cutting-edge data science technologies like differential privacy, k-anonymization, and secure multi-party computation to anonymize your data while preserving its analytical value. Your datasets are comprehensively protected and de-identified, while still being enabled for machine learning, and data analysis.

Data is the new oil. Artificial intelligence and machine learning represent the future of technology value, and any company that does not keep up will be left behind and disrupted. Businesses cannot afford to leave data siloed or uncollected.

Likewise, data privacy is no longer an issue that can be ignored. Scandals like Cambridge Analytica and policies like GDPR prove that, but the industry is still not knowledgeable about key risks like indirect identifiers. Companies that use their data irresponsibly will feel the damage, but those that don’t use their data at all will be left behind. Choose, instead, not to fall into either category.

Announcing CN-Protect for Data Science

We are pleased to announce the launch of CN-Protect for Data Science

CryptoNumerics announces CN-Protect for Data Science, a Python library that applies insight-preserving data privacy protection, enabling data scientists to build better quality models on sensitive data.  

Toronto – April 24, 2019 – CryptoNumerics, a Toronto-based enterprise software company, announced the launch of CN-Protect for Data Science, which enables data scientists to implement state-of-the-art privacy protection, such as differential privacy, directly into their data science stack while maintaining analytical value.

According to a 2017 Kaggle study, two of the top 10 challenges that data scientists face at work are data inaccessibility and privacy regulations, such as GDPR, HIPAA, and CCPA. Additionally, common privacy protection techniques, such as data masking, often decimate the analytical value of the data. CN-Protect for Data Science solves these issues by allowing data scientists to seamlessly privacy-protect datasets that retain their analytical value and can subsequently be used for statistical analysis and machine learning.

“Private information that is contained in data is preventing data scientists from obtaining insights that can help meet business goals. They either cannot access the data at all or receive a low-quality version which has had the private information removed,” said Monica Holboke, Co-founder & CEO of CryptoNumerics. “With CN-Protect for Data Science, data scientists can incorporate privacy protection in their workflow with ease and deliver more powerful models to their organization.”

CN-Protect for Data Science is a privacy-protection Python library that works with Anaconda, scikit-learn, and Jupyter Notebooks, integrating smoothly into the data scientist’s workflow. Data scientists will be able to:

  • Create and apply customized privacy protection schemes, streamlining the compliance process.
  • Preserve analytical value for model building while ensuring privacy protection.
  • Implement differential privacy and other state-of-the-art privacy protection techniques using only a few lines of code.

CN-Protect for Data Science follows the successful launch of CN-Protect Desktop App in March. It is part of CryptoNumerics’ efforts to bring insight-preserving data privacy protection to data science platforms and data engineering pipelines while complying with GDPR, HIPAA, and CCPA. CN-Protect editions for SAS, R Studio, Amazon AWS, Microsoft Azure, and Google GCP are coming soon.  
