4 techniques for data science

With growing tension between privacy and analytics, the job of data scientists and data architects has become more complicated. The responsibility of data professionals is not just to maximize the value of the data, but to find ways in which data can be privacy protected while preserving its analytical value.

The reality today is that regulations like GDPR and CCPA have disrupted the way in which data flows through organizations. Now data is being siloed and protected using techniques that are not suited for the data-driven enterprise. Data professionals are left with long processes to access the information they need and, in many cases, the data they receive has no analytical value after it has been protected. 

This emphasizes the importance of using adequate privacy protection tactics to ensure that personally identifiable information (PII) is accessible in a privacy-protected manner and that it can be used for analytics.

To satisfy GDPR and CCPA, organizations can choose among three options: pseudonymization, anonymization, and consent.

Pseudonymization replaces direct identifiers, like names or emails, with pseudonyms to protect the privacy of the individual. However, this process is still within the scope of privacy regulations, and the risk of re-identification remains very high.

Anonymization, on the other hand, looks at both direct identifiers and quasi-identifiers and transforms the data so that it falls out of scope of privacy regulations while remaining usable for analytics.

Consent requires organizations to ask customers for permission to use their data, which opens the door to opt-outs. If the use of the data changes, as it often does in an analytics environment, consent may very well have to be obtained again.

There are four main techniques that can help data professionals with privacy protection. All of them have different impacts on both privacy protection and data quality. These are: 

Masking: A de-identification technique that focuses on the redaction or transformation of information within a dataset to prevent exposure. 

K-anonymity: This privacy model ensures that each individual is indistinguishable from at least k-1 other individuals based on their attributes in a dataset.

Differential Privacy: A technique applied to an algorithm that provides a mathematical guarantee that the algorithm’s output changes very little whether or not any individual is in the dataset. It is achieved through the addition of noise to the algorithm’s computations.

Secure Multi-Party Computation: A cryptographic technique that allows a group of parties to jointly compute a function over their inputs while keeping those inputs private.
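
To make the idea of secure multi-party computation more concrete, here is a minimal Python sketch of additive secret sharing, one of the building blocks behind the technique. The parties, values, and modulus are illustrative assumptions, not a description of any particular product.

    # Additive secret sharing: each party splits its private input into random
    # shares, so no single share reveals anything, yet the shares sum to the input.
    import random

    PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

    def share(secret, n_parties):
        """Split a secret into n random shares that sum to the secret mod PRIME."""
        shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
        shares.append((secret - sum(shares)) % PRIME)
        return shares

    # Three parties each hold a private salary and want the total payroll
    # without revealing their individual numbers.
    salaries = [61000, 74500, 58200]
    all_shares = [share(s, 3) for s in salaries]

    # Party i adds up the i-th share it receives from every participant;
    # only these partial sums are exchanged, and only the aggregate is rebuilt.
    partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
    total = sum(partial_sums) % PRIME
    print(total)  # 193700, computed without exposing any single salary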

Keep your eyes peeled in the next few weeks for our whitepaper, which will explore these four techniques in further detail.

IoT and everyday life; how interconnected are we?

The Internet of Things (IoT) is a term spanning a variety of ‘smart’ applications, from smart fridges to smart cities. At its core, ‘smart’ or IoT refers to the connectedness between everyday objects and the internet.

It’s hard to grasp the amount of data one person creates each day, let alone where IoT fits into it. And in this new era of ‘smart’ everything, that understanding is pushed even further out of reach.

To understand just how much our smart technologies follow our everyday behaviours, let’s focus on only one person’s use of a smartwatch. 

But first, what are the implications of a smartwatch? This wearable technology began gaining popularity in 2012, giving users the ability to track their health and set fitness goals at the tap of their wrist. Since then, smartwatches have infiltrated all sorts of markets, adding the ability to pay with the watch, take phone calls, or update a Facebook status.

The technology in our lives has become so interconnected that de-identifying our data, while achievable, is complicated at scale. Take the smartwatch: our unique footprints, recreated each day, are logged and monitored through the small screen on our wrist. While the data created is anonymized to an extent, that anonymization is not sufficient.

But why not? After all, technology has moved mountains in the last decade. To better understand this connectedness of our data, let’s follow one person’s day through the point of view of just their smartwatch. 

Imagine Tom is a 30-year-old man in excellent health who, like the rest of us, follows a pretty general routine during his workweek. Outside of the many technologies that collect Tom’s data, what might just his smartwatch collect? 

Let’s take a look. 

Every morning, Tom’s smartwatch alerts him at 7:30 am to wake up and start his day. After a few days of logging Tom’s breathing patterns and heart rate, and monitoring his previous alarm settings, Tom’s smartwatch has learned the average time Tom should be awake and alerts Tom to set a 7:30 alarm each night before bed. 

Before ever having to tell his watch which time he gets up in the morning, his watch already knows. 

Similar to its alarm system, Tom’s smartwatch knows and labels the six places where Tom spends most of his time during the week. Tom didn’t have to tell his watch where he was and why; based on the hours of the day he spends at each location, combined with his sleeping patterns and other movements, his watch already knows.

These places are determined not only from his geographic location but also from the other information his watch creates.

When Tom is at the gym, his elevated heart rate and calories burned are logged. When Tom goes to his local grocery store or coffee shop, he uses his smartwatch to pay. At his workplace, Tom’s watch records the amount of time spent at the location and is able to determine that the two main places Tom spends his time are his home and his work.

Based on a collection of spatio-temporal data, transactional data, health data and repeated behaviour, it is easy to create a very accurate picture of who Tom is.

Let’s keep in mind that this is all created without Tom having to explicitly tell his smartwatch where he is or what he is doing at each minute. Tom’s smartwatch operates on learned behaviours based on the unique pattern Tom creates each day.

This small peek into Tom’s life, according to his watch, isn’t even much of a “peek” at all. We could analyze the data retained by his smartwatch with each purchase, each change of location, or only by the data pertaining to his health.

This technology is seen in our cars, fridges, phones and TVs. Thus, understanding how just one device collects and understands so much about your person is critical to how we interact with these technologies. What’s essential to understand next is how this data is dealt with, protected and shared. 

The more advanced our technology gets, the easier it is to identify a person based on the data the technology collects. It’s more important than ever to understand the impacts of our technology use, what data of ours is being collected, and where it is going.

At CryptoNumerics we have been developing a solution that can de-identify this data without destroying its analytical value. 

If your company has transactional and/or spatio-temporal data that needs to be privacy-protected, contact us to learn more about our solution.

Banking and fraud detection; what is the solution?

As the year comes to a close, we must reflect on the most historic events in the world of privacy and data science, so that we can learn from the challenges, and improve moving forward.

In the past year, General Data Protection Regulation (GDPR) has had the most significant impact on data-driven businesses. The privacy law has transformed data analytics capacities and inspired a series of sweeping legislation worldwide: CCPA in the United States, LGPD in Brazil, and PDPB in India. Not only has this regulation moved the needle on privacy management and prioritization, but it has knocked major companies to the ground with harsh fines. 

Since its implementation in 2018, €405,871,210 in fines have been levied against violators, signalling that DPA supervisory authorities have no mercy in their fervent search for unethical and illegal business practices. This is only the beginning: the longer the privacy law is in force, the stricter regulatory authorities will become. With the next wave of laws taking effect on January 1, 2020, businesses can expect to feel pressure from all directions, not just the European Union.

 

The two most breached GDPR requirements are Article 5 and Article 32.

These articles place importance on maintaining data for only as long as is necessary and seek to ensure that businesses implement advanced measures to secure data. They also signal the business value of anonymization and pseudonymization. After all, once data has been anonymized (de-identified), it is no longer considered personal, and GDPR no longer applies.

Article 5 affirms that data shall be “kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed.”

Article 32 references the importance of “the pseudonymization and encryption of personal data.”

The frequency with which these articles are breached signals the need for risk-aware anonymization to ensure compliance. Businesses urgently need to implement a data anonymization solution that optimizes privacy risk reduction and data value preservation. This will allow them to measure the risk of their datasets, apply advanced anonymization techniques, and minimize the analytical value lost in the process.

If this is implemented, data collection on EU citizens will remain possible in the GDPR era, and businesses can continue to obtain business insights without risking their reputation and revenue. However, these actions can now be done in a way that respects privacy.

Sadly, not everyone has gotten the message, as nearly 130 fines have been actioned so far.

The top five regulatory fines

GDPR carries a weighty fine: 4% of a business’s annual global turnover or €20M, whichever is greater. A fine of this size could significantly derail a business, and paired with brand and reputational damage, it is evident that GDPR penalties should encourage businesses to rethink the way they handle data.
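
To put that ceiling into perspective with a simple worked example: for a business with €10 billion in annual global turnover, 4% works out to a potential fine of €400 million, far above the €20M floor.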

1. €204.6M: British Airways

Article 32: Insufficient technical and organizational measures to ensure information security

User traffic was directed to a fraudulent site because of improper security measures, compromising 500,000 customers’ personal data. 

2. €110.3M: Marriott International

Article 32: Insufficient technical and organizational measures to ensure information security

The records of 339 million guests were exposed in a data breach due to insufficient due diligence and a lack of adequate security measures.

3. €50M: Google

Article 13, 14, 6, 5: Insufficient legal basis for data processing

Google was found to have breached articles 13, 14, 6, and 5 because it created user accounts during the configuration stage of Android phones without obtaining meaningful consent. They then processed this information without a legal basis while lacking transparency and providing insufficient information.

4. €18M: Austrian Post

Article 5, 6: Insufficient legal basis for data processing

Austrian Post created more than three million profiles on Austrians and resold their personal information to third parties, such as political parties. The data included home addresses, personal preferences, habits, and party affinity.

5. €14.5M: Deutsche Wohnen SE

Article 5, 25: Non-compliance with general data processing principles

Deutsche Wohnen stored tenant data in an archive system that was not equipped to delete information that was no longer necessary. This made it possible to have unauthorized access to years-old sensitive information, like tax records and health insurance, for purposes beyond those described at the original point of collection.

Privacy laws like GDPR seek to restrict data controllers from gaining access to personally identifiable information without consent and prevent data from being handled in manners that a subject is unaware of. If these fines teach us anything, it is that investing in technical and organizational measures is a must today. Many of these fines could have been avoided had businesses implemented Privacy by Design. Privacy must be considered throughout the business cycle, from conception to consumer use. 

Businesses cannot risk violations for the sake of it. With risk-aware privacy software, they can continue to analyze data while protecting privacy, backed by the guarantee of a privacy risk score.

Resolution idea for next year: Avoid ending up on this list in 2020 by adopting risk-aware anonymization.

Data sharing is an issue across industries

Privacy, as many of our previous blogs have emphasized, is essential not only to the business-customer relationship but also on a moral level. The recent Fitbit acquisition by Google has created big waves in the privacy sphere, as customers’ health data is at risk given Google’s past dealings with personal information. On the topic of healthcare data, the recent Coronavirus panic has thrown patient privacy out the window as fear of the spreading virus rises. Finally, data sharing continues to raise eyebrows as the popular social media app TikTok scrambles to protect its privacy reputation.

Fitbit acquisition causing major privacy concerns

From its in-home command system to the world’s most used search engine, Google has infiltrated most aspects of everyday life. There are seemingly no corners left untouched by the search giant.

In 2014, Google released Wear OS, a watch technology for monitoring health that also pairs with phones. While wearable technology has soared to the top of the tech charts as a popular way to track and manage health and lifestyle, Google’s Wear OS has not gained the popularity needed to make it a strong competitor.

In November of last year, Google announced its acquisition of Fitbit for $2.1 billion. Fitbit has sold over 100 million devices and is worn by over 28 million people, 24 hours a day, 7 days a week. Many are calling this Google’s attempt to recover from its failing project.

But there is more to this acquisition than staying on top of the market: personal data.

Google’s poor privacy reputation is now falling onto Fitbit, amid fears that the personal information Fitbit holds, like sleep patterns or heart rate, will fall into the hands of third parties and advertisers.

Healthcare is a large market, one that Google has been quietly buying into for years. Accessing personal health information gives Google an edge in the healthcare partnerships it has been looking for.

Fitbit has come under immense scrutiny after its announced partnership with Google, seeing sales drop 5% in 2019. Many are urging Fitbit consumers to ditch their products amidst the acquisition.

However, Fitbit still maintains that users will be in full control of their data and that the company will not sell personal information to Google.

The partnership will be followed closely going forward, as government authorities such as the Australian Competition and Consumer Commission open inquiries into the companies’ intentions.

TikTok scrambling to fix privacy reputation

TikTok is a social media app that has taken over video streaming services. With over 37 million users in the U.S. last year, TikTok has been downloaded over 1 billion times, and that number is expected to rise 22% this year.

While the app reports these dramatic download numbers, it has been continually reprimanded for its poor privacy policy and its inability to protect users’ information. After the app was already banned at companies across the U.S., Republican Senator Josh Hawley is introducing legislation to prohibit federal workers from using it. This follows several security flaws reported in January concerning user location and access to user information.

The CEO of Reddit recently criticized TikTok, saying he tells people, “don’t install that spyware on your phone.”

These privacy concerns stem from the app’s connection with the Chinese government. In 2017, the viral app Musical.ly was acquired for $1 billion by Beijing-based company ByteDance and merged with TikTok. Chinese law requires companies to comply with government intelligence operations if asked, meaning apps like TikTok would have no authority to decline government access to their data.

In response to the privacy backlash, the company stated last year that all of its data centers are located entirely outside of China. However, its privacy policy does state that it shares a variety of user data with third parties.

In a new attempt to address these privacy concerns, former ADP security chief Roland Cloutier has been hired as Chief Information Security Officer to oversee privacy and information security issues within the popular app.

With Cloutier’s long history in cybersecurity, there is hope that the wildly popular app will soon earn a better privacy reputation.

Coronavirus raising concerns over personal information

The Coronavirus is a deadly, fast-spreading respiratory illness that has moved quickly throughout China and has now been reported in 33 countries across the world.

Because of this, China has been thrown into an understandable panic and has gone to great lengths to combat the virus’s spread. However, in working to contain it, many are saying that patient privacy is being thrown out the window.

Last month, China put out a ‘close contact’ app that tells people whether they have been near someone who has contracted the virus. The app assigns a colour code to users: green for safe, yellow for a required 7-day quarantine, and red for a 14-day quarantine.

Not only is the app required to enter public places like subways or malls, but the data is also shared with police. 

The New York Times reported that the app sends a person’s location, city name and an identifying code number to the authorities. China’s already high-tech surveillance has reached new limits, as the Times reports that surveillance cameras placed around neighbourhoods are being strictly monitored, watching residents who present yellow or red codes.

South Korea has also thrown patient privacy to the wind, as text messages are sent out, highlighting every movement of individuals who contracted the virus. One individual’s extra-marital affair was exposed through the string of messages, revealing his every move before contracting the virus, according to the Guardian.

The question on everyone’s mind now is, what happens to privacy when the greater good is at risk?

For more privacy blogs, click here

Facial recognition, data marketplaces and AI changing the future of data privacy

With the emerging artificial intelligence (AI) market comes the ever-popular privacy discourse. Data regulations are being introduced left and right, but while effective, they do not yet account for growing technologies like facial recognition or data marketplaces.

Companies like Clearview AI are once again making headlines after receiving cease-and-desist letters from Big Tech, despite there being no current facial recognition laws they are violating. As well, Nature released an article calling for an international code of conduct for genomic research aggregation. And at the intersection of AI and healthcare, Microsoft has announced a $40 million AI for Health initiative.

Facial recognition company hit with cease-and-desist  

A few weeks ago, we released a blog introducing the facial recognition start-up, Clearview AI, as a threat to privacy.

Since then, Clearview AI has continued to make headlines, and most recently has received cease-and-desist letters from Big Tech companies like Google, Facebook and Twitter.

To recap, Clearview AI is a facial recognition company that has created a database of over 3 billion searchable faces, scraped from different social media platforms. The company has introduced its software in more than 600 police departments across Canada and the US.

The company’s CEO, Hoan Ton-That, has repeatedly defended the company, telling CBS:

“Google can pull in information from all different websites, so if it’s public, you know, and it’s out there, it could be inside Google search engine it can be inside ours as well.”

Google responded that this comparison was ‘inaccurate.’ Google says it is a public search engine that gives sites choices about what they make public, as well as opportunities to withdraw images. Clearview offers none of these options, and goes as far as retaining images in its database after they have been deleted from the source.

While Google and Facebook have both served Clearview with cease-and-desist letters, Clearview maintains that it is within its First Amendment rights to use the information. One privacy attorney told CNET, “I don’t really buy it. It’s really frightening if we get into a world where someone can say, ‘The First Amendment allows me to violate everyone’s privacy.’”

While cities like San Francisco have started banning facial recognition, there are currently no federal laws addressing it as an issue, thus allowing more leeway for companies like Clearview AI to create potentially dangerous software.  

Opening up genomic data for researchers across the world

With these introductions of new healthcare initiatives, privacy becomes more relevant than ever. Healthcare data contains some of the most sensitive information about an individual, so the idea of Big Tech buying and selling such personal data is scary.

Last week, Nature, an international journal of science, reported that over 800 terabytes of genomic data are now available to investigators all over the world. The eight authors worked explicitly to protect the privacy of the thousands of patients and volunteers who consented to have their data used in this research.

The article reports that the six-year collection of 2,658 cancer genomes across 468 institutions in 34 different countries is creating an open market of genome data. This project, called the Pan-Cancer Analysis of Whole Genomes (PCAWG), was the first attempt to aggregate a variety of subprojects and release a dataset globally.

A significant emphasis of this article was on the lack of clarity within the healthcare research community on how to protect data in compliance with the ongoing changes to privacy legislation.

Some of the challenges in these genomic marketplaces lie not only in complying with the variety of privacy legislation but also in ensuring that no individual can be re-identified using this information. Protecting patient data is not just a legislative issue but a moral one.

Most of the uncertainty around privacy came from questions about what vetting should occur before access to the information is granted, and what checks should be made before the data is shared internationally.

As the article says, “Genomic researchers urgently need clear data-sharing rules that are harmonized across jurisdictions.” The report calls for an international code of conduct to overcome the current hurdles that come with the different emerging privacy regulations.

The article also noted that the Biobanking and BioMolecular Resources Research Infrastructure (BBMRI-ERIC) had announced back in 2017 that it would develop an EU Code of Conduct on Health-Related Data, which has yet to be completed and approved.

Microsoft to add another installment to AI for Good

The ability to collect patient data and share it in an open market for researchers or doctors is helping cure and diagnose patients faster than ever before. In addition, AI is seen as another vital tool for the growing healthcare industry.

Last week, Microsoft announced the fifth installment of its ‘AI for Good’ project: ‘AI for Health.’ This project, like the earlier installments, will support healthcare initiatives by providing access to cash grants, AI tools, cloud computing, and Microsoft researchers.

The project will focus on three AI strategies:

  • Accelerating medical research
  • Increasing the understanding of mortality to guard against various global health crises
  • Reducing health inequities

The program will emphasize supporting individual non-profits and under-served communities. Microsoft also released a video outlining its focus on addressing Sudden Infant Death Syndrome and eliminating leprosy and diabetic retinopathy-driven blindness, in partnership with different not-for-profits.

AI is essential to healthcare, and the industry generates vast amounts of data that companies like Microsoft are utilizing. But with this, privacy has to remain at the forefront.

As with Nature’s genomic data, protecting user information is extremely important, and it is complicated to do so while making use of the data’s analytical value and complying with privacy regulations. Microsoft announced that it would be using differential privacy as its privacy solution.

Like Microsoft, we at CryptoNumerics use differential privacy as a method of anonymization that preserves data value. Learn more about differential privacy and CryptoNumeric solutions.

 

Processing personal data through anonymization methods

Companies are becoming increasingly reliant on user data to understand consumers better and improve performance. But with the rise of new privacy legislation and the growing concerns for personal data security, ensuring that your company is checking all the boxes in privacy protection is more critical than ever.

By utilizing different privacy-protecting techniques, organizations can protect consumer information while extracting value at the same time. These techniques include masking, k-anonymity, and differential privacy.

By understanding the potential and challenges of these techniques, organizations can process personal data in such a way that users are not re-identifiable.

Let’s look at the three privacy-protection techniques mentioned before.

Masking is the process of replacing the values in a dataset with different values that, in many cases, resemble the structure of the original value. Unfortunately, masking tends to destroy the analytical value of data, since the relationships between values are disturbed by the replacement.

The ideal use case for masking is in DevOps environments where there is a need for data, but the analytical value is irrelevant. 
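
As a rough illustration, here is a minimal Python sketch of masking applied to a single record. The field names and masking rules are made up for the example; the point is that the masked values keep the original structure while losing any analytical relationship to it.

    import random
    import string

    def mask_email(email):
        """Replace an email address with a random one of similar shape."""
        local, _domain = email.split("@")
        fake_local = "".join(random.choices(string.ascii_lowercase, k=len(local)))
        return fake_local + "@example.com"

    def mask_phone(phone):
        """Replace every digit with a random digit, keeping the formatting."""
        return "".join(random.choice(string.digits) if c.isdigit() else c for c in phone)

    record = {"name": "Tom Smith", "email": "tom.smith@mail.com", "phone": "416-555-0137"}
    masked = {
        "name": "REDACTED",
        "email": mask_email(record["email"]),
        "phone": mask_phone(record["phone"]),
    }
    print(masked)  # same structure, but the values no longer identify anyone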

The objective of k-anonymity is to reduce privacy risk by grouping individual records into “cohorts.” Grouping is achieved by using generalization (substitution of a specific value with a more general one) and suppression (removal of values) to group the quasi-identifiers (QIDs) in ways that make records indistinguishable from one another. The k value defines the minimum number of records in one group; the higher the value, the higher the level of data protection.

While k-anonymity reduces the analytical value of the data, it still preserves enough value for data scientists to perform analytics and Machine Learning using the dataset.
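
The following pandas sketch shows what generalization toward k-anonymity can look like on a toy dataset. The column names, age bands, and postal-code truncation are assumptions chosen for illustration, not a prescribed recipe.

    import pandas as pd

    df = pd.DataFrame({
        "age":       [31, 33, 34, 36, 52, 55, 57, 58],
        "zip_code":  ["M5V2T6", "M5V1K4", "M5V3L9", "M5V2H1",
                      "M4C1B5", "M4C5J2", "M4C2K8", "M4C7P3"],
        "diagnosis": ["A", "B", "A", "C", "B", "B", "A", "C"],
    })

    K = 4  # minimum cohort size

    # Generalization: replace exact ages with 10-year bands and truncate postal codes.
    decade = df["age"] // 10 * 10
    df["age"] = decade.astype(str) + "-" + (decade + 9).astype(str)
    df["zip_code"] = df["zip_code"].str[:3] + "***"

    # Every combination of quasi-identifiers must now appear at least K times;
    # cohorts smaller than K would be generalized further or suppressed.
    cohort_sizes = df.groupby(["age", "zip_code"]).size()
    print(cohort_sizes)
    print("k-anonymous:", bool((cohort_sizes >= K).all()))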

Differential privacy is a technique that provides a mathematical guarantee on how much information can be extracted about any individual.

Differential privacy uses a technique called perturbation, which adds random noise until it becomes incredibly difficult to know with certainty whether a specific individual is present in a dataset.

Differential privacy is one of the most promising privacy techniques; however, it can only be used with large data sets because applying perturbation to a small data set would destroy its analytical value.
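
To illustrate the idea of perturbation, here is a minimal sketch of the Laplace mechanism applied to a simple counting query. The dataset, the query, and the epsilon value are illustrative assumptions; production systems track the privacy budget far more carefully.

    import numpy as np

    rng = np.random.default_rng()
    ages = np.array([23, 35, 45, 52, 29, 41, 63, 38, 47, 55])

    def private_count(condition, epsilon):
        """Noisy count: a counting query has sensitivity 1, so Laplace noise
        with scale 1/epsilon yields epsilon-differential privacy."""
        true_count = int(np.sum(condition))
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    # "How many people are over 40?" The published answer is perturbed, so it
    # barely changes whether or not any one individual is in the dataset.
    print(private_count(ages > 40, epsilon=0.5))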

With these privacy techniques, privacy and analytics no longer have to be at odds. Companies that dare to ignore them are exposing themselves to unnecessary risks.

Contact us today to learn how you can use CN-Protect to apply any of these techniques to protect your data while preserving its analytical value.

To read more privacy blogs, click here
