Location data and your privacy


As technology comes to surround every part of our lives, it is no surprise that our every move is tracked and stored by the very apps we trust with our information. With the current COVID-19 pandemic, the consequences of inviting big tech companies into our every movement are being revealed.

At this point, most technology users understand what information they knowingly give to companies, such as their birthdays, access to pictures, or other sensitive information. However, some may be unaware of how much location data companies collect and how that affects their data privacy.

Location data volume expected to grow

More than 90% of the world’s data has been created since 2017. As wearable technology grows in popularity, the amount of data a person creates each day is steadily rising.

One study reported that by 2025, the number of installed IoT-enabled devices worldwide is expected to hit 75 billion. This astronomical number highlights how intertwined technology is with our lives, but also how welcoming we are to that technology, even when we may be unaware of the ways it collects our data.

Marketers, companies and advertisers will increasingly look to using location-based information as its volume grows. A recent study found that more than 84% of marketers use location data for their 

The last few years have seen a boost in big tech companies giving their users more control over how their data is used. One example is in 2019 when Apple introduced pop-ups to remind users when apps are using their location data.

Location data is saved and stored so that companies can easily direct personalized ads and products to you. Understanding what your devices collect from you, and how to eliminate data sharing on your devices, is crucial as we move forward in the technological age.

Click here to read our past article on location data in the form of wearable devices. 

COVID-19 threatens location privacy

Risking the privacy of thousands of people or saving thousands of lives seems to be the question throughout this pandemic, and time for debating it is running out. Companies across the big 100 have stepped up to volunteer their anonymized data, including SAS, Google and Apple.

One of the largest concerns is not how this data is being used in this pandemic, but how it could be abused in the future. 

One Forbes article brought up a comparison of the regret many are faced with after sharing DNA with sites like 23andMe, leading to health insurance issues or run-ins with criminal activity. 

As companies like Google, Apple and Facebook step up to the COVID-19 technology race, many are expressing concern, as these companies have not proven reliable at anonymizing user data.

In addition to the data-collection concern, governments and big tech companies are looking into contact-tracing applications. Using civilian location data for surveillance purposes, even when presented as serving the greater good of health and safety, raises multiple red flags about how our phones can be used to surveil our every movement. To read more about this involvement in contact tracing apps, read our latest article.

Each company has stated that it anonymizes the data it collects. However, in this pandemic age, anonymized information can still be exploited, especially through government intervention.

With all this said, big tech holds power over our information and is playing a vital role in the COVID-19 response. Paying close attention to how user data is managed post-pandemic will be valuable in exposing how these companies handle user information.

 

4 techniques for data science


With growing tension between privacy and analytics, the job of data scientists and data architects has become more complicated. The responsibility of data professionals is not just to maximize the value of the data, but to find ways in which data can be privacy protected while preserving its analytical value.

The reality today is that regulations like GDPR and CCPA have disrupted the way in which data flows through organizations. Now data is being siloed and protected using techniques that are not suited for the data-driven enterprise. Data professionals are left with long processes to access the information they need and, in many cases, the data they receive has no analytical value after it has been protected. 

This emphasizes the importance of using adequate privacy protection tactics to ensure that personally identifiable information (PII) is accessible in a privacy-protected manner and that it can be used for analytics.

To satisfy GDPR and CCPA, organizations can choose between three options: pseudonymization, anonymization, and consent.

Pseudonymization replaces direct identifiers, like names or emails, with pseudonyms to protect the privacy of the individual. However, pseudonymized data is still in the scope of the privacy regulations, and the risk of re-identification remains very high.

Anonymization, on the other hand, looks at both direct identifiers and quasi-identifiers and transforms the data so that it falls out of scope of privacy regulations while remaining usable for analytics.

Consent requires organizations to ask customers for permission to use their data, which opens up the opportunity for opt-outs. If the usage of the data changes, as it often does in an analytics environment, then consent may very well be required each time.
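
To make the pseudonymization option more concrete, here is a minimal Python sketch. The record, the field names and the secret key are all made up for illustration, and using a keyed hash (HMAC-SHA256) to generate pseudonyms is an assumption of this example, not something GDPR or CCPA prescribes.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"example-key-kept-elsewhere"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a stable keyed-hash pseudonym."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

def pseudonymize_record(record: dict, direct_identifiers: tuple = ("name", "email")) -> dict:
    """Return a copy of the record with direct identifiers replaced by pseudonyms."""
    return {
        field: pseudonymize(value) if field in direct_identifiers else value
        for field, value in record.items()
    }

# Made-up example record.
record = {"name": "Jane Doe", "email": "jane@example.com", "zip": "94107", "age": 34}
protected = pseudonymize_record(record)
print(protected)
```

Note that the zip code and age pass through untouched. Those quasi-identifiers are one reason the risk of re-identification stays high after pseudonymization alone.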

There are four main techniques that can help data professionals with privacy protection. All of them have different impacts on both privacy protection and data quality. These are: 

Masking: A de-identification technique that focuses on the redaction or transformation of information within a dataset to prevent exposure. 

K-anonymity: This privacy model ensures that each individual is indistinguishable from at least k-1 other individuals based on their attributes in a dataset.

Differential Privacy: A technique applied to an algorithm that mathematically guarantees that the output of the algorithm is nearly the same whether or not any one individual is in the dataset. It is achieved by adding calibrated noise to the algorithm.

Secure Multi-Party Computation: This is a cryptographic technique where a group of parties can compute a function over their inputs while keeping their inputs private.
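
As a rough illustration of the last technique, the sketch below simulates additive secret sharing, one common building block of secure multi-party computation, to add up private inputs. Everything runs in a single process here; the modulus, the function names and the salary figures are assumptions made for the example, and a real deployment would exchange shares over a network and defend against misbehaving parties.

```python
import secrets

PRIME = 2**61 - 1  # modulus for the shares; an arbitrary choice for this sketch

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]

def secure_sum(private_inputs: list[int]) -> int:
    """Simulate parties jointly computing a sum without revealing their inputs."""
    n = len(private_inputs)
    # Each party splits its input and hands one share to every party.
    all_shares = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received (column j); it never sees a raw input.
    partial_sums = [sum(row[j] for row in all_shares) % PRIME for j in range(n)]
    # Only the partial sums are combined, which reveals the total and nothing else.
    return sum(partial_sums) % PRIME

salaries = [52_000, 61_500, 47_250]  # made-up private inputs
total = secure_sum(salaries)
print(total)  # 160750
```

Each individual share is a uniformly random number, so a party that sees only its own column of shares learns nothing about any single input.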

Keep your eyes peeled in the next few weeks for our whitepaper, which will explore these four techniques in further detail.

Key terms to know to navigate data privacy

Key terms to know to navigate data privacy

As the data privacy discourse continues to grow, it’s crucial that the terms used to explain data science, data privacy and data protection are accessible to everyone. That’s why we at CryptoNumerics have compiled a continuously growing Privacy Glossary, to help people learn and better understand what’s happening to their data. 

Below are 25 terms covering privacy legislation, personal data, and other privacy or data science terminology to help you better understand what our company does, what other privacy companies do, and what is being done for your data.

Privacy regulations

    • General Data Protection Regulation (GDPR) is a privacy regulation implemented in May 2018 that has inspired more regulations worldwide. The law determined data controllers must establish a specific legal basis for each and every purpose where personal data is used. If a business intends to use customer data for an additional purpose, then it must first obtain explicit consent from the individual. As a result, all data in data lakes can only be made available for use after processes have been implemented to notify and request permission from every subject for every use case.
    • California Consumer Privacy Act (CCPA) is a sweeping piece of legislation aimed at protecting the personal information of California residents. It gives consumers the right to learn about the personal information that businesses collect, sell, or disclose about them, and to prevent the sale or disclosure of their personal information. It includes the Right to Know, Right of Access, Right to Portability, Right to Deletion, Right to be Informed, Right to Opt-Out, and Non-Discrimination Based on Exercise of Rights. This means that if consumers do not like the way businesses are using their data, they can request that it be deleted, which is a risk for business insights.
    • Health Insurance Portability and Accountability Act (HIPAA) is a health protection regulation signed into law in 1996 by President Clinton. This act gives patients the right to privacy and covers 18 personal identifiers that are required to be de-identified. This Act is applicable not only in hospitals but in places of work, schooling, etc.

Legislative Definitions of Personal Information

  • Personal Data (GDPR): “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person” (source)
  • Personal Information (PI) (CCPA): “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” (source)
  • Protected Health Information (PHI) (HIPAA): any identifiable health information that is used, maintained, stored, or transmitted by a HIPAA-covered entity (a healthcare provider, health plan or health insurer, or a healthcare clearinghouse) or a business associate of a HIPAA-covered entity, in relation to the provision of healthcare or payment for healthcare services. PHI is made up of 18 identifiers, including names, social security numbers, and medical record numbers (source)

Privacy terms

 

  • Anonymization is a process where personally identifiable information (whether direct or indirect) from data sets is removed or manipulated to prevent re-identification. This process must be made irreversible. 
  • Data controller is a person, an authority or a body that determines the purposes for which and the means by which personal data is collected.
  • Data lake is a collection point for the data a business collects. 
  • Data processor is a person, an authority or a body that processes personal data on behalf of the controller. 
  • De-identified data is the result of removing or manipulating direct and indirect identifiers to break any links so that re-identification is impossible. 
  • Differential privacy is a privacy framework that characterizes a data analysis or transformation algorithm rather than a dataset. It specifies a property that the algorithm must satisfy to protect the privacy of its inputs, whereby the outputs of the algorithm are statistically indistinguishable when any one particular record is removed in the input dataset.
  • Direct identifiers are pieces of data that identify an individual without the need for more data, e.g. name, SSN, etc.
  • Homomorphic encryption is a method of performing a calculation on encrypted information (ciphertext) without decrypting it (to plaintext) first.
  • Identifier: Unique information that identifies a specific individual in a dataset. Examples of identifiers are names, social security numbers, and bank account numbers. Also, any field that is unique for each row. 
  • Indirect identifiers are pieces of data that can be used to identify an individual indirectly, or in combination with other pieces of information, e.g. date of birth, gender, etc.
  • Insensitive: Information that is not identifying or quasi-identifying and that you do not want to be transformed.
  • k-anonymity is a property of a dataset in which the quasi-identifying attributes of every record are indistinguishable from those of at least k-1 other records.
  • Perturbation: Data can be perturbed by using additive noise, multiplicative noise, data swapping (changing the order of the data to prevent linkage) or generating synthetic data.
  • Pseudonymization is the processing of personal data in a way that the personal data can no longer be attributed to a specific data subject without the use of additional information. This is provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.
  • Quasi-identifiers (also known as indirect identifiers) are pieces of information that on their own are not sufficient to identify a specific individual but that, when combined with other quasi-identifiers, make it possible to re-identify an individual. Examples of quasi-identifiers are zip code, age, nationality, and gender.
  • Re-identification, or de-anonymization, is when anonymized data (de-identified data) is matched with publicly available information, or auxiliary data, in order to discover the individual to whom the data belongs.
  • Secure multi-party computation (SMC), or Multi-Party Computation (MPC), is an approach to jointly compute a function over inputs held by multiple parties while keeping those inputs private. MPC is used across a network of computers while ensuring that no data leaks during computation. Each computer in the network only sees bits of secret shares — but never anything meaningful.
  • Sensitive: Information that is more general among the population, making it difficult to identify an individual with it. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data. For example, if a set of quasi-identifiers forms a group of women aged 40-50, a sensitive attribute could be “diagnosed with breast cancer.” Without the quasi-identifiers, the probability of identifying who has breast cancer is low, but once combined with the quasi-identifiers, the probability is high.
  • Siloed data is data stored away in silos with limited access, to protect it against the risk of exposing private information. While these silos protect the data to a certain extent, they also lock the value of the data.
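
Several of the terms above (quasi-identifiers, sensitive attributes, k-anonymity) fit together in a short Python sketch. The records and the 10-year age bands are made up for illustration; the function simply measures the size of the smallest group of records sharing the same quasi-identifier values, which is the largest k for which the data is k-anonymous.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous.

    This is the size of the smallest group of records that share the same
    combination of quasi-identifier values.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize_age(record, width=10):
    """Replace an exact age with a range such as '40-49'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Made-up records: age and gender are quasi-identifiers, diagnosis is sensitive.
records = [
    {"age": 41, "gender": "F", "diagnosis": "breast cancer"},
    {"age": 47, "gender": "F", "diagnosis": "none"},
    {"age": 44, "gender": "F", "diagnosis": "none"},
    {"age": 52, "gender": "M", "diagnosis": "none"},
    {"age": 58, "gender": "M", "diagnosis": "diabetes"},
]
qi = ["age", "gender"]
k_before = k_anonymity(records, qi)
k_after = k_anonymity([generalize_age(r) for r in records], qi)
print(k_before, k_after)  # 1 2
```

With exact ages every record is unique (k = 1). After generalizing ages into bands, each record shares its quasi-identifiers with at least one other record (k = 2), at the cost of less precise age data.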

Differential Privacy in the Decennial U.S. Census

On April 1st, 2020, people across the United States will receive the decennial census to complete. There are minimal changes to the census itself, but large-scale changes in how each person’s privacy is protected and managed.

Since the United States’ first census in 1790, public attitudes towards privacy have changed drastically. As the world shifts further into a technological future, determining how to protect the data of 327 million individuals is the U.S. Census Bureau’s most important decision.

What is the census?

The U.S. Census is a decennial survey sent out to every U.S. resident. Its primary purpose is to determine the number of assigned congressional seats per state. The census helps in determining the proper distribution of federal funds, as well as disaster preparation, housing development, job markets, and community needs. Questions asked of residents include the number of people in one household, their ages, and their gender.

Census data has many use cases outside Congress. The information it provides helps determine the introduction of specific protocols within a city, town or state. This includes deciding how to prepare for disasters based on population density and what type of care an area’s demographic needs (e.g., an influx of new mothers may call for more daycares, while an ageing population would require more senior living centers). This type of information can be critical to how these areas function.

How has privacy been dealt with previously? 

Privacy has remained an essential discourse in the census’s history since 1920. In 1952, the U.S. Census Bureau instituted the agreement that personally identifying information is to be kept private for 72 years. From 1970 to 1990, the Bureau implemented full data table suppression in order to protect access to data.

Since 2000, the Bureau has applied a privacy technique called ‘data swapping.’ This technique swaps quasi-identifiers, like a person’s race, with those of another person in a different dataset. It’s unknown how many profiles are masked using data swapping in these datasets.
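
The general idea of data swapping can be sketched in a few lines of Python. This is only an illustration, not the Bureau’s actual procedure: for simplicity it swaps values between randomly paired records inside a single made-up list, and the function name, swap fraction and seed are assumptions of the example.

```python
import random
from collections import Counter

def swap_attribute(records, attribute, swap_fraction=0.2, seed=0):
    """Swap one attribute's values between randomly chosen pairs of records."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy so the input is left untouched
    n_pairs = int(len(swapped) * swap_fraction / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i][attribute], swapped[j][attribute] = (
            swapped[j][attribute],
            swapped[i][attribute],
        )
    return swapped

# Made-up records with a placeholder category standing in for a quasi-identifier.
values = ["A", "A", "B", "C", "B", "A", "C", "B", "A", "C"]
records = [{"id": i, "race": v} for i, v in enumerate(values)]
swapped = swap_attribute(records, "race", swap_fraction=0.4, seed=1)

# The overall counts per category are unchanged, even though individual rows may differ.
print(Counter(r["race"] for r in records) == Counter(r["race"] for r in swapped))  # True
```

Because values are only exchanged, never altered, aggregate counts survive the swap, while the link between a specific person and a specific value becomes uncertain.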

There has been no previous evidence of individuals being re-identified from the U.S. Census, or of any other privacy attack, but the possibility remains. In 2010, the Bureau performed a reconstruction attack on its data that was able to re-identify 46% of the U.S. population.

Previously, the Bureau typically released aggregate-level data and implemented various disclosure avoidance techniques, including collapsing data or variable suppression.

Click here for an infographic released by the U.S. Census Bureau, highlighting its privacy history.

What is Differential Privacy and how will it be used?

In 2018, the Bureau released its plans to utilize differential privacy as its privacy-protecting tactic. 

Differential privacy is a privacy model that mathematically limits how much any released result can reveal about one individual, to the point that it is statistically infeasible to tell whether they are in a dataset or not. This technique works through noise injection or synthetic data creation. The Bureau will apply differential privacy in a way that balances privacy loss against the accuracy of the published statistics.
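
To give a feel for noise injection, here is a minimal Python sketch of the Laplace mechanism applied to a count. It is a textbook illustration, not the Bureau’s algorithm: the population figure and epsilon values are made up, and a production system would need far more care (for example around floating-point behavior and tracking the total privacy budget).

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
town_population = 1_250  # made-up figure
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_count(town_population, epsilon, rng), 1))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means results closer to the true count. Choosing that value is exactly the balance between privacy loss and accuracy described above.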

The Bureau has said that differential privacy will not change the total population statistics per state. However, smaller towns or counties will have noise injected, which may alter their populations in the released dataset. Other numbers that will not change include the number of people above and below voting age, the number of vacant houses, and the number of householders.

What are the concerns?

There have been many concerns expressed by citizens and professionals alike. Many stem from the data being altered such that information used in critical situations, like disaster relief, is considerably changed, affecting how citizens can be reached when it is most necessary.

The U.S. Census Bureau released a paper highlighting the main concerns in deploying differential privacy on the dataset. These concerns include:

  • Obtaining qualified personnel and a suitable computing environment
  • The difficulty of accounting for all uses of the confidential data
  • Lack of release mechanisms that align with data user needs
  • Expectations on the part of data users that they will have access to microdata
  • Difficulty in setting privacy loss parameter (epsilon) 
  • Lack of tools and trained individuals to verify the correctness of differential privacy implementations

The Bureau is continuing to work through the issues raised in the points above. Many people are showing concern about the data being altered. However, one website says, “there’s been inaccuracies in the data forever. Differential privacy just lets the Bureau be transparent about how much it’s fiddled with it.”

Despite the many circulating concerns about differential privacy, the Bureau has stated that this census is the easiest for it to make differentially private.



Facial recognition, data marketplaces and AI changing the future of data privacy


With the emerging Artificial Intelligence (AI) market comes the ever-popular privacy discourse. The data regulations being introduced left and right, while effective, do not yet account for growing technologies like facial recognition or data marketplaces.

Companies like Clearview AI are once again making headlines after receiving cease-and-desist letters from big tech, despite there being no current facial recognition laws they are violating. As well, Nature released an article calling for an international code of conduct for genomic research aggregation. Bridging AI and healthcare, Microsoft has announced a $40 million AI for Health initiative.

Facial recognition company hit with cease-and-desist  

A few weeks ago, we released a blog introducing the facial recognition start-up, Clearview AI, as a threat to privacy.

Since then, Clearview AI has continued to make headlines and, most recently, has received cease-and-desist letters from Big Tech companies like Google, Facebook and Twitter.

To recap, Clearview AI is a facial recognition company that has created a database of over 3 billion searchable faces, scraped from different social media platforms. The company has introduced its software in more than 600 police departments across Canada and the US.

The company’s CEO, Hoan Ton-That, has repeatedly defended the company, telling CBS:

“Google can pull in information from all different websites, so if it’s public, you know, and it’s out there, it could be inside Google search engine it can be inside ours as well.”

Google then responded, saying this was ‘inaccurate.’ Google says it is a public search engine that gives sites choices in what they put out, as well as opportunities to withdraw images. Clearview provides none of these options, and goes as far as holding images in its database after they have been deleted from their source.

While Google and Facebook have both sent Clearview cease-and-desist letters, Clearview has maintained that it is within its First Amendment rights to use the information. One privacy attorney told CNET, “I don’t really buy it. It’s really frightening if we get into a world where someone can say, ‘The first amendment allows me to violate everyone’s privacy.’”

While cities like San Francisco have started banning facial recognition, there are currently no federal laws addressing it as an issue, thus allowing more leeway for companies like Clearview AI to create potentially dangerous software.  

Opening up genomic data for researchers across the world

With these introductions to new health care initiatives, privacy becomes more relevant than ever. Healthcare data contains some of the most sensitive information for an individual. Thus the idea of big tech buying and selling such personal data is scary.

Last week, Nature, an international journal of science, reported that over 800 terabytes of genomic data are available to investigators all over the world. The eight authors worked explicitly to protect the privacy of the thousands of patients and volunteers who consented to have their data used in this research.

The article reports that the six-year collection of 2,658 cancer genomes across 468 institutions in 34 different countries is creating an open market of genome data. This project, called the Pan-Cancer Analysis of Whole Genomes (PCAWG), was the first attempt to aggregate a variety of subprojects and release a dataset globally.

A significant emphasis of this article was on the lack of clarity within the healthcare research community on how to protect data in compliance with the ongoing changes to privacy legislation.

Some of the issues in these genomic marketplaces lie not only in complying with the variety of privacy legislation but also in ensuring that no individual can be re-identified using this information. Protecting patient data is not just a legislative issue but a moral one.

Most of the uncertainty around privacy came from questions of what vetting should occur before gaining access to information, or what checks should be made before the data is shared internationally.

As the article says, “Genomic researchers urgently need clear data-sharing rules that are harmonized across jurisdictions.” The report calls for an international code of conduct to overcome the current hurdles that come with the different emerging privacy regulations.

The article also said that the Biobanking and BioMolecular Resources Research Infrastructure (BBMRI-ERIC), had announced back in 2017 that it would develop an EU Code of Conduct on Health-Related Data. Once completed and approved, 

Microsoft to add another installment to AI for Good

The ability to collect patient data and share it in an open market for researchers or doctors is helping diagnose and cure patients faster than ever before. In addition to this, AI is seen as another vital tool for the growing healthcare industry.

Last week, Microsoft announced the fifth installment of its ‘AI for Good’ project, ‘AI for Health.’ This project, like its predecessors, will support healthcare initiatives by providing access to cash grants, AI tools, cloud computing, and Microsoft researchers.

The project will focus on three different AI strategies:

  • Accelerating medical research
  • Increasing the understanding of mortality to guard against global health crises
  • Reducing health injustices

The program will emphasize supporting individual non-profits and under-served communities. As well, Microsoft announced in a video its focus on addressing Sudden Infant Death Syndrome and eliminating leprosy and diabetic retinopathy-driven blindness, in partnership with different not-for-profits.

AI is essential to healthcare, and healthcare generates lots of data that companies like Microsoft are utilizing. But with this, privacy has to remain at the forefront of the action.

As with the data described in Nature, protecting user information is both extremely important and complicated when looking to utilize the data’s analytical value while complying with privacy regulations. Microsoft announced that it would be using Differential Privacy as its privacy solution.

Like Microsoft, we at CryptoNumerics use differential privacy as a method of anonymization that preserves data value. Learn more about differential privacy and CryptoNumerics solutions.

 



Masking is killing data science


When it comes to data science, protecting data while keeping its value can appear nearly impossible. And with the introduction of privacy legislation like the California Consumer Privacy Act (CCPA), managing this trade-off becomes even harder.

Methods such as data masking appear to be the standard option, with privacy risk landing at almost 0%. However, with information loss potentially exceeding 50%, the opportunity for data analytics vanishes.

Data Masking is a lost battle

Data Masking is a de-identification technique that focuses on the redaction or transformation of information within a dataset to prevent exposure. The information in the resulting dataset is of low quality. This technique is not enough to move a company forward in innovation.
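
As a rough illustration of rule-based masking, the Python sketch below redacts or transforms a few fields of a made-up record. The field names and the rules themselves are hypothetical, chosen only to show the kind of transformation involved and how much detail it removes.

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain; redact the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def mask_digits(value: str, keep_last: int = 4) -> str:
    """Replace every digit except the last `keep_last` digits with 'X'."""
    total_digits = sum(ch.isdigit() for ch in value)
    seen = 0
    masked = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen > total_digits - keep_last else "X")
        else:
            masked.append(ch)
    return "".join(masked)

# Hypothetical masking rules: one function per field name.
RULES = {
    "name": lambda value: "REDACTED",
    "email": mask_email,
    "card": mask_digits,
    "zipcode": lambda value: value[:3] + "**",
}

def mask_record(record: dict) -> dict:
    """Apply the rule for each field; fields without a rule pass through unchanged."""
    return {field: RULES.get(field, lambda v: v)(value) for field, value in record.items()}

record = {
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
    "card": "4111-1111-1111-1234",
    "zipcode": "94107",
    "age": 34,
}
masked = mask_record(record)
print(masked)
```

The masked record no longer exposes the name, email or card number, but it also no longer supports any analysis that depends on those fields or on the full zip code.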

Companies need to privacy protect their consumer data. However, they also need to preserve the value of the data for analytical uses.

Masking fails to address how data works today and how a business benefits from it. Consumer data is beneficial to all aspects of an organization and creates a better experience for the customer. Failing to utilize and protect the datasets leaves your company behind in innovation and consumer satisfaction.

Privacy-protection that preserves analytical value

Data scientists need to be able to control the trade-off, and the only way to do it is by using “smart” optimization solutions.

A “smart” optimization solution is one that can modify the data in different ways using privacy risk and analytical value as its optimization functions. With a solution like this, a data scientist would get a dataset that is both optimized for analytics and privacy compliant: the best of both worlds.
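
The idea of treating privacy risk and analytical value as optimization functions can be sketched with a toy brute-force search in Python. This is not how CN-Protect or any particular product works: the records, the candidate generalization levels, and both scoring functions (share of unique records as a risk proxy, average generalization level as an information-loss proxy) are assumptions made for the example.

```python
from collections import Counter
from itertools import product

AGE_WIDTHS = [1, 5, 10, 20]  # candidate age bin widths (1 = exact age)
ZIP_DIGITS = [5, 3, 2, 0]    # candidate numbers of leading zip digits to keep

def transform(record, age_width, zip_digits):
    """Generalize the quasi-identifiers of one record."""
    age_bin = (record["age"] // age_width) * age_width
    return (age_bin, record["zip"][:zip_digits])

def risk(records, age_width, zip_digits):
    """Proxy for re-identification risk: share of records unique on (age, zip)."""
    keys = [transform(r, age_width, zip_digits) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

def information_loss(age_width, zip_digits):
    """Proxy for lost analytical value: average generalization level, from 0 to 1."""
    age_level = AGE_WIDTHS.index(age_width) / (len(AGE_WIDTHS) - 1)
    zip_level = ZIP_DIGITS.index(zip_digits) / (len(ZIP_DIGITS) - 1)
    return (age_level + zip_level) / 2

def optimize(records, max_risk):
    """Pick the least lossy transformation whose risk stays within max_risk."""
    feasible = [
        (information_loss(a, z), a, z)
        for a, z in product(AGE_WIDTHS, ZIP_DIGITS)
        if risk(records, a, z) <= max_risk
    ]
    return min(feasible) if feasible else None

# Made-up records.
records = [
    {"age": 23, "zip": "94107"}, {"age": 24, "zip": "94109"},
    {"age": 37, "zip": "94110"}, {"age": 38, "zip": "94117"},
    {"age": 52, "zip": "10001"}, {"age": 54, "zip": "10002"},
]
loss, age_width, zip_digits = optimize(records, max_risk=0.0)
print(age_width, zip_digits, round(loss, 2))  # 5 3 0.33
```

Rather than masking everything, the search keeps as much detail as the risk limit allows: here 5-year age bands and 3-digit zip codes are enough to leave no record unique.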

Smart Optimization vs Masking

Let’s look at the impact that both privacy-protection solutions have on a machine learning algorithm.

For this example, we want to predict loan default risk using a random forest model. The model is going to be run on three datasets:

  • In the clear: The original dataset without any privacy transformations.
  • Masked dataset: Transformation of the original dataset using standard rule-based masking techniques.
  • Optimized dataset: Transformation of the original dataset using a smart optimization solution.

 

The dataset has 11 variables:

  • Age
  • Sex
  • Job
  • Housing
  • Saving Account Balance
  • Checking Account Balance
  • Credit Account Balance
  • Duration
  • Purpose
  • Zipcode
  • Risk

Let’s compare the results.

Running the model with the original dataset gave us an accuracy of 93%; however, the risk of re-identification is 100%. When we used the masked data, the model accuracy dropped to 28%. Since there are 5 risk levels, a uniform random guess would be right about 20% of the time, so the accuracy of this model is barely better than random. On the positive side, the risk of re-identification is 0%. Lastly, the accuracy with the optimized dataset was 87%, a drop of only 6 points vs the original data. Additionally, the risk of re-identification was only 3%.

While having a 0% privacy risk is appealing, the loss in accuracy makes masking worthless for analytic purposes.

This example highlights why masking is killing data science, and why organizations need to implement smart optimization solutions, like CryptoNumerics’ CN-Protect, that reduce the risk of re-identification while preserving the analytical value of the data.

Gaining a competitive edge in your industry means utilizing consumer data. By adequately protecting that data without mass data loss, you keep its value high, and that can take your company far.

 

 
