Protect Your Data Throughout the Pipeline
Maximizing data value is essential, but privacy regulations must still be satisfied. Implementing privacy-protecting techniques throughout the data pipeline makes this achievable and reduces compliance risk.
Before introducing the privacy-protecting techniques, it is important to understand the four stages of the data pipeline:
- Data Acquisition: first, the data must be acquired, either generated internally or obtained externally from third parties.
- Data Organization: the data is stored for future use and must be protected against misuse and breaches, for example with access controls.
- Data Analysis: the data is opened up and mobilized for analysis, which allows for a better understanding of an organization’s operations and customers, as well as improved forecasting.
- Data Publishing: analysis results are published, and/or internal data is shared with another party.
Now that we have covered the four stages of the data pipeline, let’s go over the privacy-protecting techniques that can be implemented throughout the pipeline.
These techniques can be categorized based on their function into four groups: randomizing, sanitizing, output, and distributed computing.
Within the randomizing group, there are two techniques: additive and multiplicative noise. In applying these techniques, each individual’s record is transformed by adding random noise to it or multiplying it by random noise. These techniques can be used in the Data Acquisition stage of the data pipeline.
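As a rough illustration of the two randomizing techniques, here is a minimal Python sketch. The function names, the Gaussian noise distribution, the noise levels, and the salary figures are all assumptions made for the example; a real deployment would choose the distribution and scale to match its own privacy and accuracy requirements.

```python
import random

def add_noise(values, sigma, rng):
    """Additive noise: shift each value by zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def multiply_noise(values, sigma, rng):
    """Multiplicative noise: scale each value by a random factor centred on 1."""
    return [v * rng.gauss(1.0, sigma) for v in values]

# Fixed seed only so the sketch is repeatable; made-up salary records.
rng = random.Random(42)
salaries = [52000.0, 61500.0, 48200.0, 75000.0]

noisy_add = add_noise(salaries, sigma=1000.0, rng=rng)
noisy_mul = multiply_noise(salaries, sigma=0.02, rng=rng)
```

Individual values are distorted, while aggregates such as the mean stay approximately intact when the noise is zero-mean (or centred on 1 for the multiplicative case).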
The sanitizing group has five privacy techniques. The first is k-anonymity, where the identifying attributes of any record in a database are indistinguishable from those of at least k-1 other records. Next comes l-diversity, an extension of k-anonymity that addresses its main shortfall by making sure each group of indistinguishable records contains at least l different sensitive values. Another technique is t-closeness, which makes sure that the distribution of sensitive values in each group stays within a threshold ‘t’ of the distribution in the whole data set; this is used to prevent attribute disclosure. Additionally, there is the personalized privacy technique, in which privacy levels are defined and customized by the data owners. The last technique in this group is ε-differential privacy, which ensures that adding or removing any single record changes the outcome of the data’s analysis by only a small amount bounded by ε. These techniques can be used in the Data Acquisition, Data Organization, and Data Publishing stages of the data pipeline.
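To make the relationship between k-anonymity and l-diversity concrete, the following Python sketch measures both on a small, already-generalized table. The helper names, the column names, and the records are hypothetical, and the l-diversity check uses the simplest "distinct values" definition.

```python
from collections import defaultdict

def group_records(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups

def k_of(records, quasi_identifiers):
    """Largest k for which the table is k-anonymous: the smallest group size."""
    groups = group_records(records, quasi_identifiers)
    return min(len(g) for g in groups.values())

def l_of(records, quasi_identifiers, sensitive):
    """Distinct l-diversity: the fewest distinct sensitive values in any group."""
    groups = group_records(records, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())

# Made-up records with generalized age and ZIP code.
records = [
    {"age_range": "30-39", "zip_prefix": "021**", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "021**", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip_prefix": "021**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "diabetes"},
    {"age_range": "40-49", "zip_prefix": "100**", "diagnosis": "diabetes"},
]
quasi_ids = ["age_range", "zip_prefix"]

print(k_of(records, quasi_ids))               # 2
print(l_of(records, quasi_ids, "diagnosis"))  # 1
```

The table is 2-anonymous, yet anyone known to be in the 40-49 group can be linked to a diabetes diagnosis. That gap is what l-diversity is meant to close.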
The output group has three techniques, which are used to reduce the inference of sensitive information from the output of any algorithm. The first is association rule hiding, where the data is modified so that sensitive rules, which could otherwise be identified in the data set and used to compromise privacy, can no longer be discovered. Next, there is the downgrading classifier effectiveness technique, where data is sanitized to reduce a classifier’s effectiveness and so prevent information from being leaked. Finally, there is the query auditing and inference control technique, which monitors and restricts data queries whose outputs could be used to infer sensitive information. These techniques can be applied to the Data Publishing stage of the data pipeline.
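One simple form of query auditing is query-set-size control, sketched below in Python. The `QueryAuditor` class, its threshold, and the staff records are hypothetical and only illustrate the idea: an aggregate query is refused when it matches too few records, or so many that its complement is too small.

```python
class QueryAuditor:
    """Answers average queries only when the matching set is safely sized."""

    def __init__(self, records, min_set_size):
        self.records = list(records)
        self.min_set_size = min_set_size

    def average(self, field, predicate):
        matches = [r for r in self.records if predicate(r)]
        n, total = len(matches), len(self.records)
        # Refuse tiny result sets, and near-complete ones whose complement is tiny.
        if n < self.min_set_size or n > total - self.min_set_size:
            raise PermissionError("query refused: result set would expose individuals")
        return sum(r[field] for r in matches) / n

# Made-up staff records.
staff = [
    {"dept": "eng", "salary": 100},
    {"dept": "eng", "salary": 110},
    {"dept": "eng", "salary": 120},
    {"dept": "sales", "salary": 90},
    {"dept": "sales", "salary": 95},
    {"dept": "legal", "salary": 130},
]
auditor = QueryAuditor(staff, min_set_size=2)
eng_average = auditor.average("salary", lambda r: r["dept"] == "eng")
```

A query for the single legal employee is refused, as is a query for everyone except that employee. Size control alone does not stop an attacker from combining several permitted, overlapping queries, so practical auditing also needs to track query history.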
Last but not least is the distributed computing group, made up of seven privacy-protecting techniques. In 1-out-of-2 oblivious transfer, a sender holds two messages and the receiver obtains exactly one of them; the sender does not learn which one was chosen, and the receiver learns nothing about the other. Another technique in this group is homomorphic encryption, a method of performing a calculation on encrypted information (ciphertext) without decrypting it (to plaintext) first. Secure sum computes the sum of the parties’ inputs without revealing any individual input to the others. Secure set union creates a union of the parties’ sets without revealing which party contributed which element. Secure size of intersection determines the size of the intersection of the parties’ data sets without revealing the data itself. The scalar product technique computes the scalar product of two vectors without revealing either party’s input vector to the other. Finally, the private set intersection technique computes the intersection of two parties’ sets without revealing anything else; it can be used in the Data Acquisition stage as well.
All of the techniques in the distributed computing group prevent access to the original, raw data while still allowing analysis to be performed. All of them can be applied to the Data Analysis and Data Publishing stages of the data pipeline, and homomorphic encryption can also be used in the Data Organization stage.
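The secure sum technique can be sketched as a ring protocol simulated in a single Python process. This is one common textbook construction, not the only one. It assumes non-negative integer inputs, a public modulus larger than any possible total, and parties that follow the protocol and do not collude; the function name and modulus are choices made for the example.

```python
import secrets

# Assumed public bound, larger than any possible total.
MODULUS = 2**61 - 1

def secure_sum(private_values, modulus=MODULUS):
    """Simulate a ring-based secure sum over each party's private value."""
    # Party 0 starts the ring with a random mask that only it knows.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus
    # Each later party sees only the masked running total,
    # adds its own value, and passes the result on.
    for value in private_values[1:]:
        running = (running + value) % modulus
    # Back at party 0, removing the mask reveals the total and nothing else.
    return (running - mask) % modulus

print(secure_sum([12, 7, 30]))  # 49
```

Because the mask is uniformly random, the running total seen by any single party gives no information about the values added before it.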
These techniques help protect data privacy throughout the data pipeline. For a visual comparison of the privacy-exposed pipeline and the privacy-protected pipeline, download our Data Pipeline infographic.