Location data and your privacy

As technology grows to surround every part of our lives, it comes as no surprise that our every move is tracked and stored by the very apps we trust with our information. With the current COVID-19 pandemic, the consequences of inviting big tech companies into our every movement are being revealed.

At this point, most technology users understand what information they knowingly give to companies, such as their birthdays, access to pictures, or other sensitive details. However, some may be unaware of how much location data companies collect and how that affects their data privacy.

Location data volume expected to grow

We have created over 90% of the world’s data since 2017. As wearable technology grows in popularity, the amount of data a person creates each day is rising steadily.

One study reported that the number of IoT-enabled devices installed worldwide is expected to hit 75 billion by 2025. This astronomical number highlights how intertwined technology is with our lives, but also how welcoming we are to that technology, even though many people may be unaware of the ways it collects their data.

Marketers, companies and advertisers will increasingly look to use location-based information as its volume grows. A recent study found that more than 84% of marketers use location data for their 

The last few years have seen a rise in big tech companies giving their users more control over how their data is used. One example came in 2019, when Apple introduced pop-ups to remind users when apps are using their location data.

Location data is saved and stored so that companies can easily direct personalized ads and products to you. Understanding what your devices collect from you, and how to stop data sharing on those devices, is crucial as we move forward in the technological age.

Click here to read our past article on location data in the form of wearable devices. 

COVID-19 threatens location privacy

Risking the privacy of thousands of people or saving thousands of lives seems to be the question throughout this pandemic, and time for debating it is running out. Companies across the big 100 have stepped up to volunteer their anonymized data, including SAS, Google and Apple.

One of the largest concerns is not how this data is being used in this pandemic, but how it could be abused in the future. 

One Forbes article drew a comparison to the regret many people face after sharing their DNA with sites like 23andMe, which has led to health insurance issues or run-ins with criminal investigations.

As companies like Google, Apple and Facebook step up to the COVID-19 technology race, many are expressing concern, as these companies have not proven reliable at anonymizing user data.

In addition to the data collection concern, governments and big tech companies are looking into contact-tracing applications. Using civilian location data for surveillance purposes, even when presented as serving the greater good of health and safety, raises multiple red flags about how our phones can be used to surveil our every movement. To read more about this involvement in contact tracing apps, read our latest article.

Each company has stated that it anonymizes the data it collects. However, in this pandemic age, anonymized information can still be exploited, especially in the hands of governments.

With all this said, big tech holds power over our information and is playing a vital role in the COVID-19 response. Paying close attention to how user data is managed post-pandemic will be valuable in exposing how these companies handle user information.

4 techniques for data science

With growing tension between privacy and analytics, the job of data scientists and data architects has become more complicated. The responsibility of data professionals is not just to maximize the value of the data, but to find ways in which data can be privacy protected while preserving its analytical value.

The reality today is that regulations like GDPR and CCPA have disrupted the way in which data flows through organizations. Now data is being siloed and protected using techniques that are not suited for the data-driven enterprise. Data professionals are left with long processes to access the information they need and, in many cases, the data they receive has no analytical value after it has been protected. 

This emphasizes the importance of using adequate privacy protection tactics to ensure that personally identifiable information (PII) is accessible in a privacy-protected manner and that it can be used for analytics.

To satisfy GDPR and CCPA, organizations can choose between three options: pseudonymization, anonymization, and consent.

Pseudonymization is replacing direct identifiers, like names or emails, with pseudonyms to protect the privacy of the individual. However, this process is still in the scope of the privacy regulations, and the risk for re-identification remains very high.

Anonymization, on the other hand, looks at direct identifiers and quasi-identifiers and transforms the data in a way that’s now out-of-scope for privacy regulations and can be used for analytics. 

Consent requires organizations to ask customers for permission to use their data, which opens up the opportunity for opt-outs. If the usage of the data changes, as it often does in an analytics environment, then consent may very well be required each time.
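
To make the pseudonymization option above concrete, here is a minimal sketch that replaces an email address with a keyed hash. The function name `pseudonymize`, the sample records and the demo key are all assumptions made for illustration, not a description of any particular product. Note that the quasi-identifiers (zip code, age) are left untouched, which is why the re-identification risk remains.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a direct identifier (keyed HMAC-SHA256)."""
    # Normalize first so "Alice@Example.com " and "alice@example.com" match.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in this sketch


# Hypothetical sample data and key; a real key would be stored separately
# from the data, since anyone holding it can recompute the pseudonyms.
key = b"demo-key-keep-this-separate"
records = [
    {"email": "Alice@example.com", "zip": "94107", "age": 34},
    {"email": "bob@example.com", "zip": "94107", "age": 41},
]

# Only the direct identifier is replaced; zip and age pass through unchanged.
pseudonymized = [{**r, "email": pseudonymize(r["email"], key)} for r in records]
```

Because the same input always maps to the same pseudonym, records can still be joined across tables, but the data stays in scope for privacy regulations for exactly that reason.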

There are four main techniques that can help data professionals with privacy protection. All of them have different impacts on both privacy protection and data quality. These are: 

Masking: A de-identification technique that focuses on the redaction or transformation of information within a dataset to prevent exposure. 

K-anonymity: This privacy model ensures that each individual is indistinguishable from at least k-1 other individuals based on their attributes in a dataset.

Differential Privacy: A technique applied to an algorithm that mathematically guarantees that the output of the algorithm changes very little whether or not any one individual is in the dataset. It is achieved by adding calibrated noise to the algorithm.

Secure Multi-Party Computation: This is a cryptographic technique where a group of parties can compute a function over their inputs while keeping their inputs private.
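
As an illustration of the k-anonymity model described above, the sketch below measures k for a small table and shows how generalizing a quasi-identifier raises it. The helper names `k_anonymity` and `generalize_age`, the ten-year age bands and the sample rows are assumptions made for this example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalize_age(row, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


# Hypothetical sample table.
rows = [
    {"zip": "94107", "age": 34, "diagnosis": "flu"},
    {"zip": "94107", "age": 37, "diagnosis": "asthma"},
    {"zip": "94107", "age": 41, "diagnosis": "flu"},
    {"zip": "94107", "age": 45, "diagnosis": "diabetes"},
]

# Every exact age is unique, so each person stands alone.
raw_k = k_anonymity(rows, ["zip", "age"])

# After banding, each person shares a group with one other person.
generalized = [generalize_age(r) for r in rows]
generalized_k = k_anonymity(generalized, ["zip", "age"])
```

The trade-off between privacy protection and data quality shows up directly: wider bands raise k but leave analysts with coarser ages.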

Keep your eyes peeled in the next few weeks for our whitepaper, which will explore these four techniques in further detail.

Key terms to know to navigate data privacy

As the data privacy discourse continues to grow, it’s crucial that the terms used to explain data science, data privacy and data protection are accessible to everyone. That’s why we at CryptoNumerics have compiled a continuously growing Privacy Glossary, to help people learn and better understand what’s happening to their data. 

Below are 25 terms surrounding privacy legislations, personal data, and other privacy or data science terminology to help you better understand what our company does, what other privacy companies do, and what is being done for your data.

Privacy regulations

    • General Data Protection Regulation (GDPR) is a privacy regulation implemented in May 2018 that has inspired more regulations worldwide. The law requires data controllers to establish a specific legal basis for each and every purpose for which personal data is used. If a business intends to use customer data for an additional purpose, then it must first obtain explicit consent from the individual. As a result, all data in data lakes can only be made available for use after processes have been implemented to notify and request permission from every subject for every use case.
    • California Consumer Privacy Act (CCPA) is a sweeping piece of legislation that is aimed at protecting the personal information of California residents. It gives consumers the right to learn about the personal information that businesses collect, sell, or disclose about them, and to prevent the sale or disclosure of their personal information. It includes the Right to Know, Right of Access, Right to Portability, Right to Deletion, Right to be Informed, Right to Opt-Out, and Non-Discrimination Based on Exercise of Rights. This means that if consumers do not like the way businesses are using their data, they can request that it be deleted, which is a risk for business insights.
    • Health Insurance Portability and Accountability Act (HIPAA) is a health protection regulation signed into law in 1996 by President Clinton. This act gives patients the right to privacy and covers 18 personal identifiers that are required to be de-identified. This Act is applicable not only in hospitals but in places of work, schooling, etc.

Legislative Definitions of Personal Information

  • Personal Data (GDPR): “Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” (source)
  • Personal Information (PI) (CCPA): “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” (source)
  • Protected Health Information (PHI) (HIPAA): considered to be any identifiable health information that is used, maintained, stored, or transmitted by a HIPAA-covered entity (a healthcare provider, health plan or health insurer, or a healthcare clearinghouse) or a business associate of a HIPAA-covered entity, in relation to the provision of healthcare or payment for healthcare services. PHI is made up of 18 identifiers, including names, social security numbers, and medical record numbers. (source)

Privacy terms

  • Anonymization is a process where personally identifiable information (whether direct or indirect) from data sets is removed or manipulated to prevent re-identification. This process must be made irreversible. 
  • Data controller is a person, an authority or a body that determines the purposes for which and the means by which personal data is collected.
  • Data lake is a collection point for the data a business collects. 
  • Data processor is a person, an authority or a body that processes personal data on behalf of the controller. 
  • De-identified data is the result of removing or manipulating direct and indirect identifiers to break any links so that re-identification is impossible. 
  • Differential privacy is a privacy framework that characterizes a data analysis or transformation algorithm rather than a dataset. It specifies a property that the algorithm must satisfy to protect the privacy of its inputs, whereby the outputs of the algorithm are statistically indistinguishable when any one particular record is removed in the input dataset.
  • Direct identifiers are pieces of data that identify an individual without the need for more data, e.g. name, SSN, etc.
  • Homomorphic encryption is a method of performing a calculation on encrypted information (ciphertext) without decrypting it (to plaintext) first.
  • Identifier: Unique information that identifies a specific individual in a dataset. Examples of identifiers are names, social security numbers, and bank account numbers. Also, any field that is unique for each row. 
  • Indirect identifiers are pieces of data that can be used to identify an individual indirectly, or in combination with other pieces of information, e.g. date of birth, gender, etc.
  • Insensitive: Information that is not identifying or quasi-identifying and that you do not want to be transformed.
  • k-anonymity is where the identifiable attributes of any record in a particular database are indistinguishable from those of at least k-1 other records.
  • Perturbation: Data can be perturbed by using additive noise, multiplicative noise, data swapping (changing the order of the data to prevent linkage) or generating synthetic data.
  • Pseudonymization is the processing of personal data in a way that the personal data can no longer be attributed to a specific data subject without the use of additional information. This is provided that such additional information is kept separately and is subject to technical and organizational
  • Quasi-identifiers (also known as indirect identifiers) are pieces of information that on their own are not sufficient to identify a specific individual, but that can re-identify an individual when combined with other quasi-identifiers. Examples of quasi-identifiers are zip code, age, nationality, and gender.
  • Re-identification, or de-anonymization, is when anonymized data (de-identified data) is matched with publicly available information, or auxiliary data, in order to discover the individual to whom the data belongs.
  • Secure multi-party computation (SMC), or Multi-Party Computation (MPC), is an approach to jointly compute a function over inputs held by multiple parties while keeping those inputs private. MPC is used across a network of computers while ensuring that no data leaks during computation. Each computer in the network only sees bits of secret shares — but never anything meaningful.
  • Sensitive: Information that is more general among the population, making it difficult to identify an individual with it. However, when combined with quasi-identifiers, sensitive information can be used for attribute disclosure. Examples of sensitive information are salary and medical data. Let’s say we have a set of quasi-identifiers that form a group of women aged 40-50, a sensitive attribute could be “diagnosed with breast cancer.” Without the quasi-identifiers, the probability of identifying who has breast cancer is low, but once combined with the quasi-identifiers, the probability is high.
  • Siloed data is data stored away in silos with limited access, to protect it against the risk of exposing private information. While these silos protect the data to a certain extent, they also lock the value of the data.
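
The secure multi-party computation entry above mentions that each computer only ever sees secret shares. A minimal sketch of that idea, using additive secret sharing to compute a joint sum, might look like the following. The modulus, the function names `share` and `reconstruct`, and the salary figures are assumptions made for illustration; production MPC protocols involve considerably more than this.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value (assumed for the sketch)


def share(value, n_parties):
    """Split a value into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; any incomplete subset reveals nothing about the value."""
    return sum(shares) % MODULUS


# Three hypothetical parties each hold a private salary figure.
inputs = [52000, 61000, 58000]
n = len(inputs)

# Party i splits its input and sends share j to party j.
distributed = [share(value, n) for value in inputs]

# Party j adds up the shares it received; it never sees anyone's raw input.
partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]

# Publishing only the partial sums reveals the total and nothing else.
total = reconstruct(partial_sums)
```

Each individual share is a uniformly random number, so a party learns nothing from the shares it receives; only the combined result is meaningful.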

The data access bottleneck

We create an influx of information each day, minute, and second. In the United States alone, 4,416,720 gigabytes of data were used every minute in 2019. This figure is reported to be 41% higher than in the 2018 report.

As we move further into the fast-paced era of technology, the world has been bombarded with hordes of user information without the resources ready to manage it. The role of Data Scientist, a career that didn’t exist ten years ago, has topped Glassdoor’s list of the best roles in America for the last five years.

The responsibility of a data scientist includes collecting and cleaning data, performing analysis, applying data science techniques and measuring analytic results. This vital process helps businesses by providing customer insights to help manage innovation. However, the process of receiving and analyzing data loses precedence as cleaning and organizing the data takes time.

Data scientists seek out the data they need through other departments or data lakes, and often wait hours to receive it. When the necessary information finally arrives, it may contain severe data quality issues. This takes a considerable amount of time away from actually analyzing the data.

There is a typical time division for this very scenario, known as the 80/20 rule. 80% of data scientists’ work time is spent finding data and cleaning it, while 20% of their time is spent providing analysis on the data. 

This bottleneck of information leads to an increase in potential error and dries up analytical resources. 

One survey conducted by TMMData and the Digital Analytics Association sheds light on the difficulty a data scientist faces before getting the opportunity to implement analytic techniques. 56.9% of the 800 surveyed said it takes a few days to a few weeks before they are granted access to all the data they need.

The study also said that only one-third are able to immediately access all the data they need or receive the required data in less than one day.

On top of this, 43 respondents to the survey named gaining data access as one of their top two analytics challenges.

On top of the difficulty of gaining access to the data, this influx of information stored in data lakes is of poor quality. 48% of data scientists questioned the accuracy of the data they received. This incomplete or bad data can lead a data scientist in the wrong direction of their analytic process.

In 2017, IBM reported that the two previous years had created 90% of the world’s data. As technology grows, the ability to consume and organize data must expand as well. To reverse the 80/20 time statistic for data science, companies’ abilities to harness and manage data as it’s collected must improve.

Flipping 80/20 for data science

Based on the statistics presented, the most significant issues for data scientists are the wait time for access and cleaning the data once it is received.

It’s understandable why data is so disorganized right now. No one could predict the pace the internet and technology took just ten years ago. Knowing how and when to prepare and store data is still a relatively new issue. 

To address this issue, data prep must become more efficient, and more people should be involved with the data. By making data available across the organization and cutting prep time with less manual methods, companies will see a faster turnaround of data.

Now is the time to play catch-up, and organize the incoming data so that analytics can be prepped and ready to move your company forward as fast as possible. 

What does COVID-19 mean for patient privacy?

The rapid spread of the Coronavirus (COVID-19) has sent the world into mass shock, bringing the economy, companies, schools and regular life to a halt.

In situations of mass panic such as this, maintaining privacy and legislation compliance is the last thing on the public’s mind. However, for companies and hospitals, this should not be the case. In this week’s news, we will go through how proper data sharing is beneficial, how governments are reacting to privacy concerns, and how employers should be handling their employees’ information.

Data Sharing and COVID-19

According to a Wired article released last week, genomic data and data marketplaces across countries are being used to better understand the virus and how it spreads.

NextStrain, an open-source application that tracks pathogen evolution, is helping researchers release and share viral strains within as little as 48 hours after a strain is identified.

The article explains that NextStrain is an open-source application, and therefore allows research facilities to create their versions or use the application as a starting ground for other models of open research. 

This cross-platform data sharing “creates new opportunities to bridge the gap between public health and academia, and to enable novice users to explore the data as well.”

While this data sharing is proving helpful in moving quickly to understand and stop the growth of this virus, there are issues presented with sharing data. 

An issue with open-source data sharing, as one researcher shared with Wired, is that non-professionals can misinterpret the information, as happened when one Twitter user published false information last week. This Twitter thread not only stresses the danger of incorrect information but also shows how data can spread across platforms, which emphasizes the importance of anonymizing the influx of COVID-19 patient data.

Last month, we released a short article involving genomic data and marketplaces, as well as the process of de-identifying its information. Click here to read more about what that entails. 

Crisis Communication 

Last week, we released an article about the lack of privacy in South Korea, where every detail of patients’ lives is disclosed to the public out of fear that ordinary people came into contact with an infected individual.

As the virus moves toward Western countries, this handling of privacy must be prevented. However, in unprecedented situations such as this, the “every-man-for-himself” mindset takes over for much of the public as worry about contact with an infected person spreads.

One senior risk manager told Modern HealthCare, “It’s a slippery slope—if you let people know where the cases are, they may be more cautious and stay away from certain events,” she said. “If you say nothing, they get a false sense of security.” 

When looking to release information to the public or between researchers, hospitals need to ensure their data is de-identified and compliant with legislation like the Health Insurance Portability and Accountability Act (HIPAA). Not doing so leaves organizations liable to penalties ranging from $100 to $50,000 per violation.

In a newly released Advis survey, only 39% of surveyed U.S. hospitals reported that they were prepared for an outbreak like COVID-19. This level of unpreparedness is where cracks in patient privacy can open up, putting sensitive data at risk of exposure to the general public.

COVID-19 and personal privacy 

Last month, the U.S. Department of Health and Human Services released a bulletin outlining HIPAA and privacy factors in response to the outbreak.

The bulletin highlights the minimum disclosures required of employers and workplaces, as well as the implications of sharing patient data and when sharing is necessary. It serves as a reminder to the general public of the importance of privacy protection, especially in scenarios as drastic as the current situation.

The mass fear this virus creates has to be handled properly by those in positions of authority. Employers and companies must ensure they approach this pandemic with patient privacy and legislation compliance in mind.

One U.S. law firm, Sidley, created and released an extensive list of questions companies should reflect on while dealing with the COVID-19 virus. In terms of privacy, some items include:

  • What information can companies collect from third parties and open sources about employees’ and others’ health and risk of exposure?
  • Are there statutory, regulatory or contractual restrictions on any data collection, processing or dissemination contemplated to address COVID-19 risks? What are the risks of these activities?
  • Are existing privacy disclosures and international data transfer mechanisms adequate to address any new data collection and analyses?
  • Is a privacy impact assessment, or a security risk assessment, required or advisable for any new data-related activities?

(Source)

The main struggle for companies right now is ensuring that their employee information is dealt with in compliance with privacy legislation, while still keeping in mind the safety of the other workers.

Differential Privacy in the Decennial U.S. Census

On April 1st, 2020, people across the United States will receive the decennial census to complete. There are minimal changes to the census itself, but large-scale changes in how each person’s privacy is protected and managed.

Since the United States’ first census in 1790, the public attitude towards privacy has changed drastically. And as the world shifts further into a technological future, determining how to protect the data of 327 million individuals is the U.S. Census Bureau’s most important decision.

What is the census?

The U.S. Census is a decennial survey sent out to every U.S. resident. Its primary purpose is to determine the number of congressional seats assigned to each state. The census also helps determine the proper distribution of federal funds, and informs disaster preparation, housing development, job markets, and community needs. Questions asked of residents include the number of people in the household, their ages, and their gender.

Census data has many use cases outside Congress. The information it provides helps determine the introduction of specific protocols within a city, town or state. This includes deciding how to prepare for disasters based on population density, and the type of care needed for an area’s demographic (e.g. an influx of new mothers may call for more daycares, while an ageing population would require more senior living centers). This type of information can be instrumental to how these areas function.

How has privacy been dealt with previously? 

Privacy has been an essential part of the discourse around the census since 1920. In 1952, the U.S. Census Bureau instituted an agreement that personally identifying information is to be kept private for 72 years. From 1970 to 1990, the Bureau implemented full data table suppression in order to protect the data.

Since 2000, the Bureau has applied a privacy technique called ‘data swapping.’ This technique swaps quasi-identifiers, like a person’s race, with those of another person in a different dataset. It’s unknown how many profiles are masked using data swapping in these datasets.
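
To show the general idea behind data swapping, here is a minimal sketch that exchanges one attribute between randomly paired records. It is an illustration only and not the Bureau's actual procedure, whose details are not described here; the function name `swap_attribute`, the `swap_rate` parameter and the sample records are all assumptions.

```python
import random


def swap_attribute(records, attribute, swap_rate, seed=None):
    """Return a copy of records with `attribute` exchanged between random pairs."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # leave the originals untouched
    # Pick an even number of records so that they can be paired up.
    n_to_swap = int(len(swapped) * swap_rate) // 2 * 2
    chosen = rng.sample(range(len(swapped)), n_to_swap)
    for a, b in zip(chosen[::2], chosen[1::2]):
        swapped[a][attribute], swapped[b][attribute] = (
            swapped[b][attribute],
            swapped[a][attribute],
        )
    return swapped


# Hypothetical household records; "block" stands in for a geographic identifier.
households = [
    {"id": 1, "block": "A", "size": 2},
    {"id": 2, "block": "A", "size": 4},
    {"id": 3, "block": "B", "size": 1},
    {"id": 4, "block": "B", "size": 3},
    {"id": 5, "block": "C", "size": 5},
    {"id": 6, "block": "C", "size": 2},
]

swapped = swap_attribute(households, "block", swap_rate=0.5, seed=7)
```

Swapping keeps the overall count of each attribute value intact, so aggregate totals are preserved while the link between a specific record and its value becomes uncertain.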

There has been no previous evidence of individuals being re-identified from the U.S. Census, or of any other privacy attacks. However, the possibility remains. In 2010, the Bureau performed a reconstruction attack on its data that was able to re-identify 46% of the U.S. population.

Previously, the Bureau typically released aggregate-level data and implemented various disclosure avoidance techniques, including collapsing data or variable suppression.

Click here for an infographic released by the U.S. Census Bureau, highlighting its privacy history.

What is Differential Privacy and how will it be used?

In 2018, the Bureau released its plans to utilize differential privacy as its privacy-protecting tactic. 

Differential privacy is a privacy model that mathematically limits how much any released statistic can reveal about an individual, to the point that it is practically impossible to tell whether they are in a dataset or not. This technique works through noise injection or synthetic data creation. The Bureau will apply differential privacy in a way that balances privacy loss against accuracy.
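
The noise injection described above can be sketched with the Laplace mechanism, a standard way to release a count under differential privacy. This is an illustration only and not the Bureau's implementation; the function names, the town population of 1,200 and the epsilon values are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity defaults to 1. A smaller epsilon means more noise and
    stronger privacy; a larger epsilon means less noise and more accuracy.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2020)  # seeded only so the sketch is repeatable
town_population = 1200     # hypothetical small-town count

strong_privacy = private_count(town_population, epsilon=0.1, rng=rng)
weak_privacy = private_count(town_population, epsilon=1.0, rng=rng)
```

The noise has the same magnitude regardless of the size of the count, which is why small towns and counties are affected proportionally more than state totals. Real deployments also need to account for floating-point subtleties that this sketch ignores.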

The Bureau has said that differential privacy will not change the total population statistics per state. However, smaller towns or counties will have noise injected, which may alter their populations in the released dataset. Other numbers that will not change include the number of people above and below voting age, the number of vacant houses, and the number of householders.

What are the concerns?

Citizens and professionals alike have expressed many concerns. Many of them stem from the fear that the data will be altered so much that information used in critical situations, like disaster relief, changes considerably and therefore affects how citizens can be reached when it is most necessary.

The U.S. Census Bureau released a paper highlighting the main concerns for deploying differential privacy on the dataset. These concerns include:

  • Obtaining qualified personnel and a suitable computing environment
  • The difficulty of accounting for all uses of the confidential data
  • Lack of release mechanisms that align with data user needs
  • Expectations on the part of data users that they will have access to microdata
  • Difficulty in setting privacy loss parameter (epsilon) 
  • Lack of tools and trained individuals to verify the correctness of differential privacy implementations

The Bureau is continuing to work through the issues raised in the points above. Many people are showing concern about the data being altered. However, one website says, “there’s been inaccuracies in the data forever. Differential privacy just lets the Bureau be transparent about how much it’s fiddled with it.”

Despite the many circulating concerns about differential privacy, the Bureau has said that this census is the easiest for it to make differentially private.

To read more privacy articles, click here. 
