How Safely Opening Data Silos Facilitates Cutting-edge Data Science
The problem with data silos
Typically, the problem of data silos presents itself like this.
A data scientist notices that their model is not performing as well as they hoped. The data scientist has a hypothesis as to why this might be the case, and wants to test their hypothesis. The data currently available does not adequately measure the potentially explanatory variable. The data scientist begins a long expedition, searching for a dataset to test the hypothesis. The hunt is slow and arduous, including many red herrings and wild geese.
At long last, the data is found, and lo and behold, it’s been inside the organization this entire time!
The scientist sends off an email requesting access and heads home, content that the search is over. They arrive at work the next day to an email from the data owner rebuffing their request. Back to square one, foiled by siloed data.
While this problem may be solved with a simple email between managers, the cost is already apparent. Time was spent seeking out data that was internally available. More time passes waiting for clearance of the data. Even the time spent hypothesizing about model performance likely could have been reduced had the data been accessible from the outset.
Further, these problems don’t always get solved. Sometimes siloed data is never found. Sometimes it’s never cleared. In these cases, the data scientist is unable to test their hypothesis. At best, siloed data inhibits productivity. At worst, it limits fundamental understanding of the problem by obfuscating relationships between data.
Why do data silos appear?
Siloed data can crop up within an organization for a wide variety of reasons, ranging from the malicious (teams wanting to maintain a competitive advantage) to the innocuous (too many layers of hierarchy/bureaucracy to traverse). As data privacy concerns and a more nuanced understanding of identifying information emerge, limiting access to sensitive data is an increasingly pressing motivation for the creation of data silos.
Unfortunately, limiting data access also limits data utility. Luckily, there are a couple of techniques available to regain data utility while maintaining acceptable privacy standards.
How to break open data silos
One technique is to anonymize siloed data. The goal of anonymization is to limit the risk of any individual in the dataset being identified. Simple anonymization techniques, such as removing direct identifiers like name and ID, have long been commonplace. However, these approaches are insufficient. Indirect identifiers (for example, age or postal code) remain, and in combination they can single out a person, leaving the data susceptible to inference attacks.
Luckily, there are more effective ways to anonymize data. Using privacy models such as k-anonymity and t-closeness, data owners can quantify their data’s risk of reidentification. Applying privacy-preserving transformations, such as generalization or suppression, to indirect identifiers until the data reaches a desired reidentification risk is one way to open data silos.
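To make the idea of k-anonymity concrete, here is a minimal sketch in Python. A dataset is k-anonymous when every combination of indirect identifiers is shared by at least k records, so the worst-case reidentification risk is commonly taken to be 1/k. The records, the column names, and the helper functions k_anonymity and generalize_age are hypothetical, made up for illustration; this is not how any particular product measures or reduces risk.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for all the given indirect (quasi-) identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize_age(record, bucket=10):
    """Replace an exact age with a range such as '30-39' (hypothetical helper)."""
    low = (record["age"] // bucket) * bucket
    return {**record, "age": f"{low}-{low + bucket - 1}"}

# Made-up records: direct identifiers (name, ID) are already removed,
# but age and postal prefix are still indirect identifiers.
records = [
    {"age": 34, "zip": "M5V", "diagnosis": "flu"},
    {"age": 36, "zip": "M5V", "diagnosis": "asthma"},
    {"age": 41, "zip": "M4K", "diagnosis": "flu"},
    {"age": 47, "zip": "M4K", "diagnosis": "diabetes"},
]

# Every (age, zip) pair is unique, so each person stands alone.
raw_k = k_anonymity(records, ["age", "zip"])

# After generalizing age into ten-year ranges, records fall into groups.
generalized = [generalize_age(r) for r in records]
generalized_k = k_anonymity(generalized, ["age", "zip"])

print("k before generalization:", raw_k)
print("k after generalization:", generalized_k)
```

In this toy example, generalizing age trades a little precision for a larger k. A real workflow would also need to look at the distribution of the sensitive attribute within each group, which is what t-closeness addresses.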
Another solution is to implement Secure Multi-Party Computation (SMC). SMC enables a number of parties to jointly compute a function over a set of inputs that they wish to keep private. This allows training a machine learning model across datasets held by multiple parties as if they were a single dataset, but without actually moving, centralizing, or disclosing the data between the parties. This approach increases data utility without actually opening the silo.
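One simple building block behind many SMC protocols is additive secret sharing, sketched below in Python. Each party splits its private number into random shares that individually reveal nothing, and only the combined total is ever reconstructed. This is an illustrative sketch under simplifying assumptions: all parties are simulated in one process, everyone follows the protocol honestly, and the names PRIME, share, and secure_sum are hypothetical. Training a model across silos requires considerably more machinery than a sum.

```python
import secrets

# All arithmetic is done modulo a large prime (here a Mersenne prime),
# so that individual shares look uniformly random.
PRIME = 2**61 - 1

def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def secure_sum(private_inputs):
    """Simulate parties jointly computing the sum of their private inputs."""
    n = len(private_inputs)
    # all_shares[i][j] is the share of party i's input that is sent to party j.
    all_shares = [share(v, n) for v in private_inputs]
    # Each party j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Combining the published partial sums yields the overall sum,
    # without any party having seen another party's raw input.
    return sum(partial_sums) % PRIME

# Made-up private values held by three separate silos.
inputs = [120, 75, 310]
total = secure_sum(inputs)
print("joint total:", total)
```

The result is only correct while the true sum stays below PRIME, and a production system would run the parties on separate machines with a vetted SMC library rather than hand-rolled code.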
Data privacy concerns are likely to only increase moving forward. Because of this, data silos are likely to continue to be created. Being able to safely open or connect these silos will be key to unlocking the analytical value of the data within.
For more about CryptoNumerics’ privacy automation solutions, read our blog here.