Protect Your Data Throughout the Pipeline


Organizations all over the world have embraced the opportunity that data and analytics present. Millions of dollars are spent every year designing and implementing data pipelines that allow organizations to extract value from their data. However, data misuse and data breaches have led governments to introduce regulations such as GDPR, CCPA, and HIPAA, granting privacy rights to consumers and placing responsibilities on businesses.

Maximizing data value is essential, but privacy regulations must be satisfied in the process. This is achievable by implementing privacy-protecting techniques throughout the data pipeline, which keeps compliance risk in check.

Before introducing the privacy-protecting techniques, it is important to understand the four stages of the data pipeline:

  1. Data Acquisition: the data is acquired, either generated internally or obtained externally from third parties.
  2. Data Organization: the data is now stored for future use, and needs to be protected along the pipeline to avoid misuse and breaches. This can be achieved using access controls.
  3. Data Analysis: the data must now be opened up and mobilized in order to analyze it, which allows for a better understanding of an organization’s operations and customers, as well as improved forecasting.
  4. Data Publishing: analysis results are published, and/or internal data is shared with another party. 

Now that we have covered the four stages of the data pipeline, let's go over the seventeen privacy-protecting techniques that can be implemented throughout it to make it privacy-protected.

These techniques can be categorized based on their function into four groups: randomizing, sanitizing, output, and distributed computing.

Within the randomizing group, there are two techniques: additive and multiplicative noise. In applying these techniques, random noise is added to, or multiplied with, each individual's record to transform the data. These techniques can be used in the Data Acquisition stage of the data pipeline.
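
To make this concrete, here is a minimal sketch of how additive and multiplicative noise might be applied to a single numeric attribute at acquisition time. The column of ages, the noise distributions, and their scales are illustrative assumptions, not values prescribed by any particular standard.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative attribute collected at acquisition time (hypothetical ages).
ages = np.array([34, 51, 27, 45, 62], dtype=float)

# Additive noise: perturb each value with zero-mean Gaussian noise.
ages_additive = ages + rng.normal(loc=0.0, scale=2.0, size=ages.shape)

# Multiplicative noise: scale each value by a random factor centred on 1.0.
ages_multiplicative = ages * rng.normal(loc=1.0, scale=0.05, size=ages.shape)

print(ages_additive)
print(ages_multiplicative)
```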

The sanitizing group contains five techniques. The first is k-anonymity, which generalizes or suppresses identifying attributes so that every record in the database is indistinguishable from at least k-1 other records. Next comes l-diversity, an extension of k-anonymity that addresses its main shortfall: a group of indistinguishable records can still share the same sensitive value. l-diversity ensures each group contains at least l distinct sensitive values. Another technique is t-closeness, which requires that the distribution of sensitive values within each group stays within a threshold t of their distribution in the whole table, preventing attribute disclosure. There is also personalized privacy, in which privacy levels are defined and customized by the data owners themselves. The last technique in this group is ε-differential privacy, which ensures that the presence or absence of any single record does not significantly affect the outcome of the data's analysis. These techniques can be used in the Data Acquisition, Data Organization, and Data Publishing stages of the data pipeline.
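
As a rough illustration of the first two sanitizing techniques, the sketch below groups a toy table by its quasi-identifiers and reports the smallest group size (k) and the smallest number of distinct sensitive values in any group (l). The column names and records are hypothetical, chosen only to show how the checks work.

```python
from collections import defaultdict

# Toy records: age band and ZIP prefix act as quasi-identifiers,
# diagnosis is the sensitive attribute (all values are hypothetical).
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
]

def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l): the smallest group size over the quasi-identifiers and
    the smallest number of distinct sensitive values within any group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l

k, l = k_and_l(records, ["age_band", "zip3"], "diagnosis")
print(f"k = {k}, l = {l}")  # k = 2, l = 1: 2-anonymous but not 2-diverse
```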

The output group has three techniques, all of which reduce the risk of sensitive information being inferred from the output of an algorithm. The first is association rule hiding, which suppresses mined rules that could be exploited to reveal private information about individuals. Next is downgrading classifier effectiveness, where training data is sanitized to blunt a classifier's accuracy just enough to prevent it from leaking sensitive information. Finally, query auditing and inference control restricts or perturbs queries whose results could otherwise be combined to infer sensitive values. These techniques can be applied in the Data Publishing stage of the data pipeline.
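
For example, a very simple form of query auditing is to refuse aggregate queries over groups that are too small to hide any individual. The sketch below uses a hypothetical salary table and an arbitrary minimum group size of five; both are assumptions for illustration only.

```python
# A naive query auditor: refuse aggregates over groups smaller than a
# threshold, since tiny groups let a querier infer individual values.
MIN_GROUP_SIZE = 5  # illustrative threshold, not a regulatory value

salaries_by_department = {
    "engineering": [92000, 88000, 105000, 97000, 93000, 110000],
    "legal": [120000, 135000],  # too few records to release safely
}

def audited_average(department):
    values = salaries_by_department.get(department, [])
    if len(values) < MIN_GROUP_SIZE:
        raise PermissionError(
            f"Query refused: only {len(values)} matching records; "
            "an aggregate could reveal individual values."
        )
    return sum(values) / len(values)

print(audited_average("engineering"))  # allowed
# audited_average("legal") raises PermissionError
```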

Last but not least is the distributed computing group, made up of seven privacy-protecting techniques. In 1-out-of-2 oblivious transfer, a sender holds two messages and the receiver obtains exactly one of them, without the sender learning which message was chosen or the receiver learning anything about the other. Homomorphic encryption is a method of performing calculations on encrypted information (ciphertext) without first decrypting it to plaintext. Secure sum computes the sum of the parties' inputs without revealing any individual input. Secure set union creates the union of several parties' sets without revealing which party contributed which elements. Secure size of set intersection determines how large the intersection of the parties' data sets is without revealing the data itself. The scalar product technique computes the scalar product of two vectors without either party revealing its input vector to the other. Finally, private set intersection computes the intersection of two parties' sets without revealing anything else; this technique can also be used in the Data Acquisition stage. All of the techniques in the distributed computing group prevent access to the original, raw data while still allowing analysis to be performed. They can be applied in the Data Analysis and Data Publishing stages of the data pipeline, and homomorphic encryption can also be used in the Data Organization stage.
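
To give a flavour of how these protocols avoid exposing raw inputs, here is a minimal, single-process simulation of secure sum using additive secret sharing modulo a large prime. Real deployments run each party on its own machine over authenticated channels; the inputs, party count, and modulus below are illustrative assumptions.

```python
import random

MODULUS = 2**61 - 1  # a large prime; all arithmetic is done modulo this value

def make_shares(secret, n_parties):
    """Split a secret into n additive shares that sum to it modulo MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

# Each party holds a private input it does not want to reveal.
private_inputs = [12, 45, 7]
n = len(private_inputs)

# Every party splits its input and sends one share to each other party.
all_shares = [make_shares(x, n) for x in private_inputs]

# Each party sums the shares it received (one column) and publishes only that.
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]

# The published partial sums combine into the total without exposing any input.
print(sum(partial_sums) % MODULUS)  # 64, the sum of the private inputs
```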

These seventeen techniques help protect data privacy throughout the data pipeline. For a visual comparison of the privacy-exposed pipeline versus the privacy-protected pipeline, download our Data Pipeline infographic.

For more information, or to find out how to privacy-protect your data, contact us today at [email protected].


Your Employer Is Watching You


This Week: Your employer has too much information about you. Lessons to be learned from Facebook’s $5 billion USD settlement. Why Snapchat is different.

Think Facebook and Amazon have too much of your personal data? Think again.

The truth is, your employer holds far more of your private data than your social media, banking, or e-commerce accounts do.

The majority of employees feel uncomfortable with their employers tracking their activity in the workplace, on the company network, or on their devices. However, this is slowly changing: a Gartner study found that as employers become more transparent about monitoring, employees become more willing to accept being watched.

Regardless, there is still a significant power imbalance. Unfortunately, employers have broad scope to install monitoring tools and tracking systems on employee devices and internet connections, and there is not yet any federal regulation preventing workplace surveillance.

There are three key ways employers monitor their employees: 

  1. Location tracking via an employee ID badge or a company device.
  2. Communication tracking by monitoring email, Slack messaging, and keystroke logging. 
  3. Health monitoring, such as sleeping patterns and fitness through wellness programs. 

Here are some steps you can take to protect yourself from your employer's surveillance systems:

  1. Assume you are always being watched. Anything you do on the company’s devices, Wi-Fi, email, messaging platform, etc. could be tracked. 
  2. Keep it professional. Keep your work and personal devices separate. Anything on the company Wi-Fi can be scanned.
  3. Understand what information you are giving to your employer. Carefully read over documents and contracts, like the company's privacy policy, union rules, and your employment contract.

Three takeaways from the $5 Billion USD FTC and Facebook settlement

Known as the largest fine ever imposed by the FTC, the settlement reached with Facebook has three key takeaways:

  1. The impact of the fine amount itself. Keep in mind that Facebook agreed to this amount, but is $5 billion USD significant enough to push Facebook to change its policies?
  2. The structural remedies made necessary in addition to the fine. For example, the company will need to create a committee that deals exclusively with privacy. 
  3. The definitions that appear at the beginning of the settlement order. Some of these may show the FTC’s approach to how they interpret current laws and regulations. For example, the settlement order defines and clarifies the meaning of “covered information,” and “personally identifiable information” (PII), which is understood differently across the world.

For more information on this settlement, click here to watch the IAPP video segment. 

Snap has risen above tech giants

Snap cares about your privacy way more than you can imagine. Their most used app, Snapchat, was originally designed for private conversations. It has unique features such as automatic content deletion, private posts, increased user privacy control and much more. “We’ve invested a lot in privacy, and we care a lot about the safety of our community,” CEO Evan Spiegel said in a quarterly earnings call. 

Several brand-safety-conscious companies, such as Procter & Gamble, have boycotted Google and YouTube after inappropriate videos were posted openly on those platforms, and they now prioritize partners that take brand safety seriously. Currently, Snap is hoping to secure a venture with P&G, as the two companies' values on privacy and user safety are aligned.



FaceApp and Facebook Under the Magnifying Glass


FaceApp is Under Heavy Scrutiny After Making a Comeback

The U.S. government has aired its concerns about the privacy risks of the trending face-editing photo app FaceApp. With the 2020 presidential election campaigns underway, the FBI and the Federal Trade Commission are conducting a national security and privacy investigation into the app.

According to information security expert Nick Tella, the fine print of the app's terms of use and privacy policy is rather shocking. It states that, as a user, you “grant FaceApp a perpetual, irrevocable, non-exclusive, royalty-free, worldwide, fully-paid, transferable sub-licensable license to use, reproduce, modify, adapt, publish, translate, create derivative works from, distribute, publicly perform and display your User Content and any name, username or likeness provided in connection with your User Content in all media formats and channels now known or later developed, without compensation to you”.

Social media experts and journalists don’t deny that users who download the app are willingly handing over their data under these terms of use. However, government bodies and other institutions are pushing to strengthen regulations and ensure data protection is effectively enforced.

For its part, FaceApp has denied any accusation of selling or misusing user data. In a statement cited by TechCrunch, the company said that “99% of users don’t log in; therefore, we don’t have access to any data that could identify a person”. It also assured the public that it deletes most images from its servers within 48 hours of upload, and added that its research and development team is its only team based in Russia and that its servers are in the U.S.

With everything going on around privacy and the misuse of user data, we must ask ourselves: should we think twice before trusting apps like FaceApp?

Facebook to Pay $5 Billion USD in Fines

On Friday, July 12th, the FTC and Facebook finalized a settlement to resolve last year’s Cambridge Analytica data misuse, with a fine of $5 billion USD. Concerns remain, however, over whether Facebook will change its privacy policies or data practices after paying the fine. “None of the conditions in the settlement will impose strict limitations on Facebook’s ability to collect and share data with third parties,” according to the New York Times.

Although the FTC has approved this settlement, it still needs to be approved by the Justice Department, which rarely rejects agreements reached by the FTC.



How Google Can Solve its Privacy Problems


Google and the University of Chicago’s Medical Center have made headlines for the wrong reasons. According to a June 26th New York Times report, a lawsuit filed in the US District Court for Northern Illinois alleges that a data-sharing partnership between the University of Chicago’s Medical Center and Google “shared too much personal information” without appropriate consent. Though the data sets had ostensibly been anonymized, the potential for re-identification was too high, compromising the privacy rights of the individual named in the lawsuit.

The project was touted as a way to improve predictions in medicine and realize the utility of electronic health records through data science. Its coverage today instead focuses on risks to patients and invasions of privacy. Across industries like finance, retail, telecom, and more, the same potential for positive impact through data science exists, as does the potential for exposure-risk to consumers. The potential value created through data science is such that institutions must figure out how to address privacy concerns.

No one wants their medical records and sensitive information to be exposed. Yet, they do want research to progress and to benefit from innovation. That is the dilemma faced by individuals today. People are okay with their data being used in medical research, so long as their data is protected and cannot be used to re-identify them. So where did the University of Chicago go wrong in sharing data with Google — and was it a case of negligence, ignorance, or a lack of investment?

The basis of the lawsuit claims that the data shared between the two parties were still susceptible to re-identification through inference attacks and mosaic effects. Though the data sets had been stripped of direct identifiers and anonymized, they still contained date stamps of when patients checked in and out of the hospital. When combined with other data that Google held separately, like location data from phones and mapping apps, the university’s data could be used to re-identify individuals in the data set. Free text medical notes from doctors, though de-identified in some fashion, were also contained in the data set, further compounding the exposure of private information.

Inference attacks and mosaic-effect methods combine information from different data sets to re-identify individuals. They are now well-documented realities that institutions cannot be excused for ignoring. Indirect identifiers must therefore also be assessed for re-identification risk and included in any privacy-protection effort.
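
One way to see why indirect identifiers matter is to score how many records in an “anonymized” extract are unique on those fields alone; any unique record can be re-identified by a party that holds the same attributes elsewhere. The sketch below is a hypothetical illustration of such a uniqueness check, not the method used by either party in the lawsuit.

```python
from collections import Counter

# Hypothetical "anonymized" records: direct identifiers removed, but
# check-in date and ZIP prefix remain as indirect identifiers.
records = [
    {"check_in": "2018-03-02", "zip3": "606"},
    {"check_in": "2018-03-02", "zip3": "606"},
    {"check_in": "2018-03-05", "zip3": "607"},
    {"check_in": "2018-04-11", "zip3": "606"},
]

def uniqueness_risk(rows, quasi_identifiers):
    """Fraction of records that are unique on the quasi-identifiers; unique
    records are prime targets for linkage (mosaic-effect) re-identification."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)

print(uniqueness_risk(records, ["check_in", "zip3"]))  # 0.5: half are unique
```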

Significant advances in data science have led to better data privacy technologies and controls for data collaboration. Automated, systematic metadata classification and re-identification risk assessment and scoring are two processes that would have made an immediate difference in this case. Differential privacy and secure multiparty computation are two others.
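
As one concrete example of the differential privacy mentioned above, the Laplace mechanism releases a counting query with noise calibrated to the query’s sensitivity. The data, the predicate, and the ε value below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon):
    """Release a count with Laplace noise scaled to 1/epsilon (a counting
    query has sensitivity 1), satisfying epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 51, 27, 45, 62, 38, 29]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy count near 3
```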

Privacy automation systems encompassing these technologies are a reality today. Privacy management is often seen as additional overhead for data science projects; that is a mistake. Tactical use of data security measures such as encryption and hashing to privacy-protect data sets is also not enough, as the plaintiffs in this case can attest.

As we saw with cybersecurity over the last decade, it took years of continued data theft and hacks making headlines before organizations implemented advanced security and intrusion-detection systems. Cybersecurity solutions are now seen as an essential component of enterprise infrastructure, with board-level commitment to keeping company data safe and the brand untarnished. Boards must reflect on the negative outcomes of lawsuits like this one, where customers’ identities are compromised and their trust damaged.

Today, data science projects without advanced, automated privacy-protection solutions should not pass internal privacy governance and data compliance reviews. Nor should these projects use customer data, even anonymized data, until automated privacy risk assessment solutions can accurately reveal the level of re-identification risk (inclusive of inference attacks and the mosaic effect).

With the sensitivity around privacy in data science projects in today’s public discourse, any enterprise that fails to invest in and implement advanced privacy management systems exposes itself as having no regard for the ethical use of customer data. The potential for harm is not a matter of if, but when.



Do You Know What Your Data is Worth?


Your data is more than your name, age, gender, and address. It is your Google searches, tweets, comments, time spent on videos and posts, purchasing behaviours, smart home assistant commands, and much more.

There is a new bill in the U.S. Senate that would require technology companies to disclose the actual value of users’ data to those users. While the proposed law seeks to further protect individuals’ privacy, putting a dollar figure on someone’s data is more difficult than it seems. Current estimates range from $1.00 USD for an average person’s data to $100.00 USD for someone with an active social media presence.

Data sensitivity doesn’t come only from the data itself, but also from how companies and agencies can use the data to exert influence. Shoshana Zuboff, author of The Support Economy: Why Corporations Are Failing Individuals, expands on this, arguing that tech giants like Google and Facebook practice surveillance capitalism, with the intention of shaping consumer behaviour toward a more profitable future.

The truth is that datafication, the set of processes and tools that transform a business into a data-driven enterprise, doesn’t affect everyone equally. Women, minorities, and people with low incomes are affected far more than the rest. The newly proposed bill aims to address these concerns as well.

Three Tips to Maintain Data Privacy for Marketers

A large chunk of a digital marketer’s time is spent understanding and working with consumer data. Marketers analyze consumer data daily, from click-through rates to unsubscribe rates. The more data they have, the more powerful their personalization efforts become, from relevant product recommendations to a consumer’s preferred communication method.

Additionally, marketers must know and comply with privacy regulations, such as GDPR, HIPAA, and CCPA. Here are three tips you can use to prepare for new privacy regulations without sacrificing your digital marketing efforts:

  • Conduct regular data reviews to make sure company policies are up to date
  • Know how the data is collected, used, analyzed and shared
  • Use the right technology to gain insights from data while protecting privacy

Data Sharing in the Healthcare Field

When it comes to the use of healthcare data, many ethical questions arise. Who is responsible for the safety of health data? Who owns co-produced clinical trial data?  

“We owe it to patients participating in research to make the data they help generate widely and responsibly available. The majority desire data sharing, viewing it as a natural extension of their commitment to advance scientific research” (Source).

Researchers can develop new cures and speed up innovation through data sharing. However, data is not easily shared, especially in healthcare. To address this problem, researchers from universities such as Yale and Stanford are creating a set of good data-sharing practices for both healthcare facilities and pharmaceutical companies. They have also partnered with key stakeholders and end users to ensure their guidelines reflect a well-rounded perspective.
