How Safely Opening Data Silos Facilitates Cutting-edge Data Science

In their day-to-day work, data scientists face a range of challenges. (We’ve covered the ten biggest challenges on the blog before.) One of the biggest of all? Siloed data: data inaccessible to anyone other than its owner.

The problem with data silos

Typically, the problem of data silos presents itself like this. 

A data scientist notices that their model is not performing as well as they hoped. The data scientist has a hypothesis as to why this might be the case, and wants to test their hypothesis. The data currently available does not adequately measure the potentially explanatory variable. The data scientist begins a long expedition, searching for a dataset to test the hypothesis. The hunt is slow and arduous, including many red herrings and wild geese. 

At long last, the data is found, and lo and behold, it’s been inside the organization this entire time! 

The scientist sends off an email requesting access, and heads home, content the search is over. They arrive at work the next day to an email from the data owner rebuffing their request. Back to square one, foiled by siloed data. 

While this problem may be solved with a simple email between managers, the cost is already apparent. Time was spent seeking out data that was internally available. More time passes waiting for clearance of the data. Even the time spent hypothesizing about model performance likely could have been reduced had the data been accessible from the outset. 

Further, these problems don’t always get solved. Sometimes siloed data is never found. Sometimes it’s never cleared. In these cases, the data scientist is unable to test their hypothesis. At best, siloed data inhibits productivity. At worst, it limits fundamental understanding of the problem by obfuscating relationships between data.

Why do data silos appear?

Siloed data can crop up within an organization for a wide variety of reasons, ranging from the malicious (teams wanting to maintain a competitive advantage) to the innocuous (too many layers of hierarchy/bureaucracy to traverse). As data privacy concerns and a more nuanced understanding of identifying information emerge, limiting access to sensitive data is an increasingly pressing motivation for the creation of data silos. 

Unfortunately, limiting data access also limits data utility. Luckily, there are a couple of techniques available to regain data utility while maintaining acceptable privacy standards.

How to break open data silos

One technique is to anonymize siloed data. The goal of anonymization is to limit the risk of any individual in the dataset being identified. Simple anonymization, such as the removal of direct identifiers like name and ID, has long been commonplace. However, these approaches are insufficient: indirect identifiers remain, leaving the data susceptible to inference attacks.

Luckily, there are more effective ways to anonymize data. By utilizing concepts such as k-anonymity and t-closeness, data owners can gain a clear understanding of their data’s re-identification risk. Applying advanced, practical privacy-preserving protection to indirect identifiers to reach a desired re-identification risk is one way to open data silos.
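
As a rough sketch of how such a risk measurement works (the records, columns, and values below are illustrative, not from any real dataset or product), k-anonymity can be computed by grouping records on their quasi-identifiers and taking the size of the smallest group:

```python
from collections import Counter

# Toy records: (age, postcode, diagnosis). Age and postcode are
# quasi-identifiers; diagnosis is the sensitive attribute.
records = [
    (28, "M5V", "flu"),
    (28, "M5V", "cold"),
    (35, "M4C", "flu"),
    (35, "M4C", "flu"),
    (35, "M4C", "cold"),
]

def k_anonymity(rows, quasi_ids):
    """Return k: the size of the smallest equivalence class induced
    by the quasi-identifier columns. Higher k means lower risk."""
    classes = Counter(tuple(row[i] for i in quasi_ids) for row in rows)
    return min(classes.values())

k = k_anonymity(records, quasi_ids=(0, 1))
print(k)  # → 2: the riskiest group contains only two individuals
```

A data owner would then generalise or suppress values until k reaches an agreed threshold; t-closeness additionally constrains how far each group’s sensitive-attribute distribution may drift from the overall distribution.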

Another solution is to implement Secure Multi-Party Computation (SMC). SMC enables a number of parties to jointly compute a function over a set of inputs that they wish to keep private (head here for a deeper explanation). This allows training a machine learning model across datasets held by multiple parties as if they were a single dataset, but without actually moving, centralizing, or disclosing the data between the parties. This approach increases data utility without actually opening the silo.
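
As a toy illustration of the idea (not a production protocol: real SMC implementations add secure channels, malicious-security checks, and much more), additive secret sharing lets several parties compute a sum without any of them revealing their input:

```python
import random

P = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod P.
    Any n-1 shares together reveal nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three organizations each hold a private count they will not disclose.
private_inputs = [120, 340, 75]
all_shares = [share(x, 3) for x in private_inputs]

# Each party locally sums the one share it received from every input...
partial_sums = [sum(col) % P for col in zip(*all_shares)]

# ...and combining the partial sums reveals only the total.
total = sum(partial_sums) % P
print(total)  # → 535, with no party ever seeing another's input
```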

Data privacy concerns are likely to only increase moving forward. Because of this, data silos are likely to continue to be created. Being able to safely open or connect these silos will be key to unlocking the analytical value of the data within.

For more about CryptoNumerics’ privacy automation solutions, read our blog here.

Subscribe to our newsletter

How Third Parties who Act as Brokers of Data will Struggle as the Future of Data Collaboration Changes

Today, everyone understands that, as The Economist put it, “data is the new oil.”

And few understand this better than data aggregators. Data aggregators can loosely be defined as third parties who act as brokers of data to other businesses. Verisk Analytics is perhaps the largest and best-known example, but there are many others: Yodlee, Plaid, MX, and more.

These data aggregators understand the importance of data, and how the right data can be leveraged to create value through data science for consumers and companies alike. But the future of data collaboration is starting to look very different. Their businesses may well start to struggle.

Why data aggregators face a tricky future

As the power of data has become more widely recognized, so too has the importance of privacy. In 2018, the European Union implemented the General Data Protection Regulation (GDPR), the most comprehensive data privacy regulation of its kind, with broad-sweeping jurisdiction. Enforcement began almost immediately, against a backdrop of privacy leaks across multiple industries that drew highly negative media coverage; Facebook suffered a $5-billion fine.

Where once many were skeptical, today, few people deny the importance of data privacy. Privacy itself has become a separate dimension, distinct from security. The data scientist community has come to understand that datasets must not only be secure from hackers, but de-identified, to ensure no individual can have their information stolen as the data is shared.

In the new era of privacy controls, third party data aggregators will face two problems: 

  1. Privacy Protection Requirements
    Using a third party to perform data collaboration is a flawed approach. No matter what regulations or protections you enforce, you are still moving your data out of your data centers and exposing your raw information (which contains both PII and IP-sensitive items) to someone else. Ultimately, third-party consortiums do not maintain a “privacy-by-design” framework, which is the standard required for GDPR compliance.

  2. Consumers Don’t Consent to Have their Data Used
    The GDPR requires that collectors of data also collect the consent of their consumers for its use. If I have information that I’ve collected, I can only use it for the specific purpose the consumer has allowed for. I cannot just share it with anyone, or use it however I like.

These challenges are serious obstacles to data collaboration, and will affect data aggregators the most due to their unique value proposition. Many see data aggregators as uniquely flawed in their dealings with these issues, and that has generated some negative traction against them. A recent Nevada state law required all who qualified to sign up for a public registry.

There is a need for these aggregators to come out ahead of this, in order to overcome challenges to their business model, and to avoid negative media attention.

How CryptoNumerics can help

At CryptoNumerics, we recognize the genuine ethical need for privacy. But we also recognize the vast good that data science can provide. In our opinion, no one should have to choose one over the other. Hence, we have developed new technology that enables both.

CN-Insight uses a concept we refer to as Virtual Data Collaboration. Using technologies like secure multi-party computation and secret share cryptography, CN-Insight enables companies to perform machine learning and data science across distributed datasets. Instead of succumbing to the deficits of the third-party consortium model, we enable companies to keep their data sets on-prem, without need of co-location or movement of any kind, and without needing to expose any raw information. The datasets are matched using feature engineering, and our technology enables enterprises to build the models as if the data sets were combined.

Data aggregators must give these challenges serious thought, and make use of these new technology innovations in order to stay ahead of a new inflection point in their industry. Privacy is here to stay, and as the data brokers that lead the industry, they have an opportunity to play a powerful role in leading the way forward, and improving their business future.


Forget Third-party Datasets – the Future is Data Partnerships that Balance Compliance and Analytical Value

Organizations are constantly gathering information from their customers. However, they are always driven to acquire extra data on top of this. Why? Because more data equals better insights into customers, and better ability to identify potential leads and cross-sell products. Historically, to acquire more data, organizations would purchase third-party datasets. Though these come with unique problems, such as occasionally poor data quality, the benefits used to outweigh the problems. 

But not anymore. Unfortunately for organizations, since the introduction of the EU General Data Protection Regulation (GDPR), buying third-party data has become extremely risky. 

GDPR has changed the way in which data is used and managed, by requiring customer consent in all scenarios other than those in which the intended use falls under a legitimate business interest. Since third-party data is acquired by the aggregator from other sources, in most cases, the aggregators don’t have the required consent from the customers. This puts any third-party data purchaser in a non-compliant situation that could expose them to fines, reputational damage, and additional overhead compliance costs.

If organizations can no longer rely on third-party data, how can they maximize the value of the data they already have? 

By changing their focus. 

The importance of data partnerships and second-party data

Instead of acquiring third-party data, organizations should establish data partnerships and access second-party data. This new approach has two main advantages. One, second-party data constitutes the first-party data of another organization, so it is of high quality. Two, there are no concerns about customer consent, as the organization that owns this data has direct consent from the customer.

That said, to establish a successful data partnership, there are three things that have to be taken into consideration: privacy protection, IP protection, and data analytical value.   

Privacy Protection

Even when customer consent is present, the data that is going to be shared should be privacy-protected in order to comply with GDPR, safeguard customer information, and prevent any risk. Privacy protection should be understood as a reduction in the probability of re-identifying a specific individual in a dataset. GDPR, as well as other privacy regulations, refer to anonymization as the maximum level of privacy protection, wherein an individual can no longer be re-identified. 

Privacy protection can be achieved with different techniques. Common approaches include differential privacy, encryption, the addition of “noise,” and suppression. Regardless of which privacy technique is applied, it is important to always measure the data’s risk of re-identification.

IP (Intellectual Property) Protection

There are some organizations that are okay with selling their data. However, there are others that are very reluctant, because they understand that once the data is sold, all of its value and IP are lost, since they can no longer control it. IP control is a big barrier when trying to establish data partnerships.

Fortunately, there is a way to establish data partnerships and ensure that IP remains protected.

Recent advances in cryptographic techniques have made it possible to collaborate with data partners and extract insights without having to expose the raw data. The first of these techniques is called Secure Multiparty Computation.

As its name implies, with Secure Multiparty Computation, multiple parties can perform computations on their datasets as if they were collocated but without revealing any of the original data to any of the parties. The second technique is Fully Homomorphic Encryption. With this technique, data is encrypted in a way in which computations can be performed without the need for decrypting the data. 

Because the original raw data is never exposed across partners, both of these advanced techniques allow organizations to augment their data, extract insights and protect IP safely and securely.

Analytical Value

The objective of any data partnership is to acquire more insights into customers and prospects. For this reason, any additional data that is acquired needs to add analytical value. But maintaining this value becomes difficult when organizations need to preserve privacy and IP protection. 

Fortunately, there is a solution. Firstly, organizations should identify common individuals in both datasets. This is extremely important, because you want to acquire data that adds value. By using Secure Multiparty Computation, the data can be matched and common individuals identified, without exposing any of the sensitive original data. 
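
One way to picture this matching step is with keyed hashing over a pre-agreed secret (a deliberate simplification: the identifiers and key below are made up, and genuine private set intersection protocols, often built on Secure Multiparty Computation, avoid even the shared key and resist dictionary attacks):

```python
import hashlib
import hmac

SHARED_KEY = b"agreed-out-of-band"  # hypothetical pre-shared secret

def blind(identifier):
    """Replace a raw identifier with a keyed hash before exchange."""
    return hmac.new(SHARED_KEY, identifier.encode(), hashlib.sha256).hexdigest()

party_a = {"alice@x.com", "bob@y.com", "carol@z.com"}
party_b = {"bob@y.com", "dave@w.com"}

blinded_a = {blind(i): i for i in party_a}
blinded_b = {blind(i) for i in party_b}

# Party A learns which of its own customers also appear in B's set,
# without B's non-matching identifiers ever crossing in the clear.
common = [blinded_a[h] for h in blinded_a if h in blinded_b]
print(common)  # → ['bob@y.com']
```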

Secondly, organizations must use software that balances privacy and information loss. Without this, the resulting data will be high on privacy protection and extremely low on analytical value, making it useless for extracting insights.

Thanks to the new privacy regulations sweeping the world, acquiring third-party datasets has become extremely risky and costly. Organizations should change their strategy and engage in data partnerships that will provide them with higher quality data. However, for these partnerships to add real value, privacy and IP have to be protected, and data has to maintain its analytical value.

For more about CryptoNumerics’ privacy automation solutions, read our blog here.


What do Trump, Google, and Facebook Have in Common?

This year, the Trump Administration declared the need for a national privacy law to supersede a patchwork of state laws. But, as the year comes to a close, and amidst the impeachment inquiry, time is running out. Meanwhile, Google plans to roll out encrypted web addresses, and Facebook stalls research into social media’s effect on democracy. Do these three seek privacy or power?

The Trump Administration, Google, and Facebook claim that privacy is a priority, and… well… we’re still waiting for the proof. Over the last year, the news has been awash with privacy scandals and data breaches. Every day we hear promises that privacy is a priority and that a national privacy law is coming, but so far, the evidence of action is lacking. This raises the question: are politicians and businesses using the guise of “privacy” to manipulate people? Let’s take a closer look.

Congress and the Trump Administration: National Privacy Law

Earlier this year, Congress and the Trump Administration agreed they wanted a new federal privacy law to protect individuals online. This rare occurrence was even supported and campaigned for by major tech firms (read our blog “What is your data worth” to learn more). However, despite months of talks, “a national privacy law is nowhere in sight [and] [t]he window to pass a law this year is now quickly closing.” (Source)

Disagreements over enforcement and state-level power are said to be holding back progress. Thus, while senators, including Roger Wicker, who chairs the Senate Commerce Committee, insist they are working hard, there are no public results; and with the impeachment inquiry, it is possible we will not see any for some time (Source). This means the White House will likely miss its self-appointed deadline of January 2020, when the CCPA goes into effect.

Originally, this plan was designed to avoid a patchwork of state-level legislation that can make it challenging for businesses to comply and can weaken privacy protections. It is not a simple process: since “Congress has never set an overarching national standard for how most companies gather and use data,” much work is needed to develop a framework to govern privacy on a national level (Source). However, GDPR in Europe is evidence that a large governing structure can successfully hold organizations accountable to privacy standards. But how much longer will US residents need to wait?

Google Encryption: Privacy or Power

Google has been trying to get an edge above the competition for years by leveraging the mass troves of user data it acquires. Undoubtedly, their work has led to innovation that has redefined the way our world works, but our privacy has paid the price. Like never before, our data has become the new global currency, and Google has had a central part to play in the matter. 

Google has publicly made privacy a priority and is currently working to enhance user privacy and security with encrypted web addresses.

Unencrypted web addresses are a major security risk, as they make it simple for malicious actors to intercept web traffic and use fake sites to gather data. However, in denying hackers this ability, power is handed to companies like Google, who will be able to collect more user data than ever before. The risk is “that control of encrypted systems sits with Google and its competitors.” (Source)

This is because encryption cuts out the middle layer of ISPs, and can change the mechanisms through which we access specific web pages. This could enable Google to become the centralized encryption DNS provider (Source).

Thus, while DoH is certainly a privacy and security upgrade over the current DNS system, shifting from local middle layers to major browser enterprises centralizes user data, raising anti-competitive and child-protection concerns. Further, it diminishes law enforcement’s ability to blacklist dangerous sites and monitor those who visit them, and it hampers defenders’ ability to gather cybersecurity intelligence from malware activity, an integral part of fulfilling government-mandated regulation (Source). This can open new opportunities for hackers.

Nonetheless, this feature will roll out in a few weeks as the new default, despite calls from those with DoH concerns to wait until more is known about the potential fallout.

Facebook and the Disinformation Fact Checkers

Over the last few years, Facebook has developed a terrible reputation as one of the least privacy-centric companies in the world. But is it accurate? After the Cambridge Analytica scandal, followed by endless data privacy ethical debacles, Facebook has stalled its “disinformation fact-checkers” on the grounds of privacy problems.

In April of 2018, Mark Zuckerberg announced that the company would develop machine learning to detect and manage misinformation on Facebook (Source). It then promised to share this information with non-profit researchers, who would flag disinformation campaigns as part of an academic study on how social media is influencing democracies (Source).

To ensure that the data being shared could not be traced back to individuals, Facebook applied differential privacy techniques.

However, upon receiving this information, researchers complained that the data did not include enough detail about the disinformation campaigns to allow them to derive meaningful results. Some even insisted that Facebook was going against the original agreement (Source). As a result, some of the people funding this initiative are considering backing out.

Initially, Facebook was given a deadline of September 30 to provide the full data sets, or the entire research grants program would be shut down. While they have begun offering more data in response, the full data sets have not been provided.

A spokesperson from Facebook says, “This is one of the largest sets of links ever to be created for academic research on this topic. We are working hard to deliver on additional demographic fields while safeguarding individual people’s privacy.” (Source). 

While Facebook may be limiting academic research on democracies, perhaps it is finally prioritizing privacy. And at the end of the day, with an ethical framework in place, the impact of social media on democracy can still be measured through technological advancement and academic research, without compromising privacy.

In the end, it is clear that privacy promises hold the potential to manipulate people. The US government may not have a national privacy law anywhere in sight, the motives behind Google’s encrypted links may be questionable, and Facebook’s sudden prioritization of privacy may be cutting off democratic research. But at least privacy is becoming a hot topic, and that holds promise for a privacy-centric future for the public.

The Key to Anonymizing Datasets Without Destroying Their Analytical Value

Enterprise need for “anonymised” data lies at the core of everything from modern medical research, to personalised recommendations, to modern data science, to ML and AI techniques for profiling customers for upselling and market segmentation. At the same time, anonymised data forms the legal foundation for demonstrating compliance with privacy regimes such as GDPR, CCPA, HIPAA, and other established and emerging data residency and privacy laws around the world.

For example, GDPR Recital 26 defines anonymous information as “information which does not relate to an identified or identifiable natural person” or “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” Under GDPR, properly anonymised information falls outside the regulation’s scope, so enterprises can handle and utilise it freely.

The perils of poorly or partially anonymised data

Why is anonymised data such a central part of demonstrating legal and regulatory privacy compliance? And why does failing to comply expose organisations to the risk of significant fines, and brand and reputational damage?

Because if the individuals in a dataset can be re-identified, then their promised privacy protections evaporate. Hence “anonymisation” is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified. An individual may be directly identified from their name, address, postcode, telephone number, photograph or image, or some other unique personal characteristics. An individual may also be indirectly identifiable when certain information is combined or linked together with other sources of information, including their place of work, job title, salary, gender, age, their postcode or even the fact that they have a particular medical diagnosis or condition.

Anonymisation is so relevant to legislation such as GDPR because recent research has conclusively shown that poorly or partially anonymised data can lead to an individual being identified simply by combining that data with another dataset. In 2008, individuals were re-identified from an anonymised Netflix dataset of film ratings by comparing the ratings information with public scores on the IMDb film website. In 2014, the home addresses of New York taxi drivers were identified from an anonymised dataset of individual taxi trips in the city.

In 2018, the University of Chicago medical team shared with Google anonymised patient records which included appointment date and time stamps and medical notes. A 2019 pending class-action lawsuit brought against Google and the University claims that Google can combine the appointment date and time stamps with other records it holds from Waze, Android phones, and other location records to re-identify these individuals.

And data compliance isn’t the only reason that organizations need to be smart about how they anonymise data. An equally major issue is that full anonymisation tends to devalue the data, rendering it less useful for purposes such as data science, AI and ML, and other applications looking to gain insights and extract value. This is particularly true of indirect identifying information.

The challenges of anonymisation present businesses with a dilemma: fully anonymising directly and indirectly identifying customer data keeps them compliant, but renders that data less valuable and useful, while partially anonymising it preserves value but increases the risk of individuals being identified.

How to anonymise datasets without wiping out their analytical value

The good news is that it is possible to create fully compliant anonymised datasets and still retain the analytical value of data for data science, AI, and ML applications. You just need the right software.

The first challenge is to understand the risk of re-identification of an individual or individuals in a dataset. This cannot be done manually or by simply scanning the data; a systematic, automated approach has to be applied to assess the risk of re-identification. This risk assessment forms a key part of demonstrating your Privacy Impact Assessment (PIA), especially in data science and data lake environments. How many unique individuals or identifying attributes exist in a dataset that could identify an individual directly or indirectly? For example, say there are three twenty-eight-year-old males living in a certain neighbourhood in Toronto. As there are only three such individuals, if this information were combined with one other piece of information – such as employer, or car driven, or medical condition – then there is a high probability of identifying the individual.
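
A minimal sketch of such an automated check (the records and the risk threshold are hypothetical) flags every record whose quasi-identifier combination falls in a small equivalence class, like the three twenty-eight-year-olds above:

```python
from collections import Counter

# Toy records: (age, gender, neighbourhood) - all values illustrative.
records = [
    (28, "M", "Riverdale"), (28, "M", "Riverdale"), (28, "M", "Riverdale"),
    (45, "F", "Leslieville"), (45, "F", "Leslieville"),
    (45, "F", "Leslieville"), (45, "F", "Leslieville"), (45, "F", "Leslieville"),
]

K_THRESHOLD = 5  # classes smaller than this are treated as high-risk

class_sizes = Counter(records)
risky = [r for r in records if class_sizes[r] < K_THRESHOLD]
print(len(risky))  # → 3: the three twenty-eight-year-olds need protection
```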

Once armed with this risk-assessment information, modern systems-based approaches to anonymisation can be applied. Using a generalisation technique, we can generalise the indirect identifiers in such a manner that the analytical value of the data is retained while still meeting our privacy compliance objective of fully anonymising the dataset. So with the twenty-eight-year-old males living in a certain neighbourhood in Toronto, we can generalise away gender to show only that there are nine twenty-eight-year-old individuals living there, thereby reducing the risk of an individual being identified.

Another example is age binning, where the analytical value of the data is preserved by generalising the age attribute. By binning the age “28” to a range such as “25 to 30,” we now show that there are 15 individuals aged 25 to 30 living in the Toronto neighbourhood, further reducing the risk of identification of an individual.
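
The age-binning step can be sketched as follows (bin width and edges are illustrative; real tools choose them to balance privacy against information loss):

```python
from collections import Counter

# Exact ages of 15 hypothetical individuals in the neighbourhood.
ages = [25, 26, 27, 28, 28, 28, 29, 29, 30, 27, 26, 25, 28, 29, 30]

def bin_age(age, width=6, start=25):
    """Generalise an exact age into a range label, e.g. 28 -> '25-30'."""
    lo = start + ((age - start) // width) * width
    return f"{lo}-{lo + width - 1}"

binned = Counter(bin_age(a) for a in ages)
print(binned)  # → Counter({'25-30': 15}): no exact age survives
```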

In the above examples, two key technologies enable us to fully anonymise datasets while retaining their analytical value:

  1. An automated risk-assessment feature that identifies the risk of re-identification in each and every dataset, in a consistent and defensible manner across the enterprise.
  2. The application of anonymisation protection, using privacy-protection actions such as generalisation, hierarchies, and differential privacy techniques.

Using these two techniques, enterprises can start to overcome the anonymisation dilemma.



How can I Maximally Monetize my Data without Exchanging & Exposing Sensitive Information?

How can I Maximally Monetize my Data without Exchanging & Exposing Sensitive Information?

In 2017, The Economist published a report titled “The world’s most valuable resource is no longer oil, but data.” The report explained how digitally-native organizations like Facebook, Google, and Amazon were leveraging data to achieve high-growth and market disruption.

In the two years since The Economist’s article was published, organizations of all sizes have jumped on the digitalization bandwagon, investing millions of dollars in technology and processes to gather and analyze data. The main goal of going digital is data monetization.

What is data monetization?

Data monetization is the process of extracting insights from data in order to create value for the customer and the business. In a survey by McKinsey & Company, organizations mentioned that they monetize data by adding new services, developing new business models, and joining with similar companies to create a data utility.

In a recent survey, Gartner went deeper in understanding how organizations are monetizing their data, exploring both their present and future approaches.

Data monetization is no longer a nice-to-have strategy. It is a requirement for any enterprise that wants to compete. In the same McKinsey & Company survey, respondents described how incumbents are being disrupted by new entrants who are better at monetizing their data, or by traditional competitors who use data to improve their businesses. 

Unfortunately, data misuse and data breaches have made governments and customers react negatively to the way organizations use their data. Facebook, whose unethical practices were revealed during the Cambridge Analytica scandal, is the clearest example of a company with enormous data power that has abused customer trust. In response, governments around the world are pushing privacy regulations that control how organizations use customer data. Data is no longer flowing inside and among organizations; instead, to prevent risk, restrictions and silos are being created. As a result, large digitalization investments are lagging and ROIs are shrinking.

Data Monetization 2.0

Recent advancements in cryptography and privacy techniques offer organizations new tools for monetizing data – tools that don’t risk exposing or exchanging sensitive information. Three of the most promising are Differential Privacy, Secure Multi-Party Computation, and Fully Homomorphic Encryption.

Differential Privacy

Differential privacy, as The Conversation outlines, “makes it possible for tech companies to collect and share aggregate information about user habits while maintaining the privacy of individual users.”

Differential privacy is a technique that injects “noise” into a dataset, with the objective of reducing the risk of re-identification of an individual. With differential privacy, the results of an analysis are almost identical whether or not any one individual is present in the dataset – thus protecting the privacy of the subjects in the data. After applying differential privacy, organizations can monetize their data with a high degree of certainty that no sensitive information will be exposed or exchanged.
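
A minimal sketch of the classic Laplace mechanism for a counting query (the epsilon value and data are illustrative; production systems also track a cumulative privacy budget across queries):

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(values, predicate, epsilon=0.5):
    """Noisy answer to 'how many records satisfy predicate?'.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(scale=1 / epsilon)

ages = [28, 34, 28, 51, 28, 45, 28]
noisy = private_count(ages, lambda a: a == 28)
print(noisy)  # close to the true count of 4, randomised for privacy
```

Smaller epsilon means more noise and stronger privacy; each released answer spends part of the overall privacy budget.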

Differential privacy has been adopted by large tech companies like Google and Apple as well as by the US Census Bureau. 

Secure Multi-Party Computation (MPC)

As described by Boston University researchers, “Secure Multi-Party Computation allows us to collaboratively analyze data for the public good without revealing private information.” 

MPC is a cryptographic technique that allows various parties to compute a function using everyone’s inputs without disclosing what those inputs are. Since the inputs are never disclosed, the sensitive information is always protected.

MPC gives organizations two major data monetization benefits:

  • Better Insights: Organizations can now expand their internal data with data from partners in a way in which everyone’s sensitive data is never exchanged, but everyone derives more and better insights from the augmented data sets. 
  • Renting Data: Data assets are extremely valuable. However, the value is lost once they are sold. With MPC, organizations can allow partners to benefit from the data without having to sell it, increasing the value of the data asset.

Fully Homomorphic Encryption (FHE)

Fully Homomorphic Encryption (FHE) is a cryptographic technique that allows computations to be performed directly on encrypted data. Because the data is never decrypted during the computation, no sensitive information is exposed to the party performing it.
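
Production FHE libraries are complex, but the underlying homomorphic idea can be shown with a toy Paillier scheme, which is additively (not fully) homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny primes here are purely for readability and are wholly insecure:

```python
import math
import random

# Toy Paillier cryptosystem. Real deployments use 2048-bit moduli
# and vetted libraries; these parameters are for illustration only.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)  # simple decryption constant, valid since g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Homomorphic property: a third party can total encrypted values
# it cannot read by multiplying the ciphertexts.
c1, c2 = encrypt(120), encrypt(340)
print(decrypt((c1 * c2) % n2))  # → 460, computed on ciphertexts
```

Fully homomorphic schemes extend this to multiplication of plaintexts as well, which is what makes arbitrary computation on encrypted data possible.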

In theory, FHE is the best way to monetize data without exposing or sharing sensitive information. In practice, however, the technique is still extremely slow and computationally expensive, which keeps it from being a viable option for most real commercial applications today.
