Data Privacy in Machine Learning

Machine Learning is a subfield of Artificial Intelligence (AI) that
lets a computer learn patterns from data and use them to make
predictions about new situations. Many of the innovative products
and services we enjoy today are built on Machine Learning.

To reach reliable levels of accuracy, these models require
enormous datasets to ‘learn’ from. But the data we feed them is
often sensitive and personal, so it is crucial that we find ways to
unlock the power of Artificial Intelligence while protecting data
privacy. The objective of this article is to introduce the reader to
data privacy in Machine Learning.

Introduction

Let’s first get a brief idea of the term ‘Privacy’, which covers a
vast range of topics in the technology sector. Current approaches
to privacy in Machine Learning fall broadly into two areas: ‘user
control’ and ‘data protection’. ‘User control’ is about the user’s
rights: understanding what is being collected, by whom, for what
purpose and for how long. ‘Data protection’ typically means
encrypting data when it is at rest and removing identifiable
information to make the data anonymous.

With respect to Machine Learning, however, both of these areas
have gaps that we need to address. In particular, Machine Learning
needs to operate on the data, which means the data has to be
decrypted, and that creates a new vulnerability. It is therefore
important to have additional protections on both fronts. This ties
in with the idea of trust, which can be summarized as follows.

Idea of Trust in Machine Learning

The data we are dealing with and the models are all digital assets.
Sharing a data asset with someone amounts to giving it to them
and trusting them to keep the data secure and to use it only
within the agreed scope of work.

Moreover, Machine Learning is fundamentally a multi-stakeholder
computation. In a typical Machine Learning situation, multiple
entities want to work together. One entity may own the training
data, while another may own the inference data and provide a
Machine Learning server. The inference itself may be performed
by a model that is owned by someone else. In such complex
situations, trust issues play a significant role.

So what if untrusted parties could do Machine Learning together?
Three examples show what this would make possible.

  • Finance/Insurance: Rival banks could build joint anti-money
    laundering services.
  • Healthcare: Hospitals could use remote, third-party analytics
    on patients’ data.
  • Retail: Retailers could monetize their purchase data while
    protecting user privacy.

Now let’s move on to the primary topic of this article: Privacy
Preserving Machine Learning.

Introducing Privacy Preserving Machine Learning

To shield individual privacy in the context of big data, various
anonymization techniques have conventionally been used. Privacy
preserving learning techniques usually rely on methods from
cryptography and statistics, each of which is discussed
separately below.

Federated Learning / Multi-Party Computation

A group of separate entities can collectively train a Machine Learning model on their combined data without explicitly sharing that data with each other. In Federated Learning, each party trains on its own data locally and only the resulting model updates are combined. Multi-party computation, also known as secure computation, is a subfield of cryptography that aims to create ways for parties to jointly compute a function over their inputs while keeping those inputs private.
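To make the federated idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not a real federated learning library: the model has a single weight, and the names `local_update`, `federated_round` and `parties` are made up for this illustration. Each party takes a few gradient steps on its own data, and only the resulting weights are averaged.

```python
# Toy federated averaging for a one-parameter model y = w * x.
# All names are hypothetical; no real federated learning library is used.

def local_update(w, data, lr=0.01, steps=20):
    """Run a few gradient-descent steps on one party's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w, parties):
    """Each party trains locally; only the updated weights are averaged."""
    total = sum(len(data) for data in parties)
    return sum(local_update(w, data) * len(data) for data in parties) / total


# Three parties hold private samples of the same relationship y = 3x.
parties = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, parties)

print(w)  # close to 3.0, although no party ever saw another party's samples
```

Note that sharing weights is not automatically private on its own, which is one reason the techniques in the following sections are often combined with it.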

Homomorphic Encryption

With this technique, Machine Learning can be done on data that is encrypted and stays encrypted throughout. It allows computation on ciphertexts, generating an encrypted result which, when decrypted, equals the result of the same operations performed on the plaintext.
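The property can be seen in a few lines with a toy version of the Paillier cryptosystem, which supports addition on encrypted numbers. This is only a sketch: the primes are tiny and hard-coded, so it offers no security at all, and the helper names `encrypt`, `decrypt` and `add_encrypted` are chosen for this example. It needs Python 3.8 or later for the modular inverse.

```python
import math
import random

# Toy Paillier cryptosystem with tiny primes (insecure, illustration only).
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # modular inverse of lam modulo n


def encrypt(m):
    """Encrypt an integer 0 <= m < n with fresh randomness r."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(c):
    """Recover the plaintext from a ciphertext using the private values."""
    x = pow(c, lam, n_sq)
    return ((x - 1) // n * mu) % n


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % n_sq


c_sum = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(c_sum))  # 42, computed without decrypting either input
```

Schemes that support both addition and multiplication on ciphertexts are what make encrypted Machine Learning possible, at a considerable cost in computation.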

Differential Privacy

One of the most promising approaches within privacy-preserving Machine Learning is Differential Privacy (DP). The concept is not new within the debate around privacy protection.

This technique allows you to collect personal data with quantifiable privacy protections in such a way that the output cannot be tied back to the presence or absence of any individual in that dataset. In general terms, an algorithm is differentially private if an observer examining the output is not able to determine whether a specific individual’s information was used in the computation.

To protect individual privacy, random noise is drawn from a carefully chosen distribution and added to the true answer, so that what the user receives is the true answer plus noise.
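One common choice of distribution is the Laplace distribution. The sketch below is a hedged illustration of that idea for a simple counting query. The function names and the sample `ages` list are hypothetical, and `epsilon` is the privacy parameter: the smaller it is, the more noise is added.

```python
import random


def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon):
    """Return a noisy count of the records that satisfy predicate."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon)


ages = [23, 35, 41, 29, 52, 67, 38]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(noisy)  # the true count is 3; each run returns 3 plus fresh noise
```

A single answer is deliberately blurry, yet averages over large populations remain useful, which is why the analysis stays meaningful.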

A striking property of differential privacy is that, despite its protective strength, it is mostly compatible with, or even beneficial to, meaningful data analysis. It also offers some protection against overfitting, so its benefits go beyond data security.

Furthermore, all of these techniques can be strengthened by hardware- and software-based techniques known as trusted execution environments.

Getting access to private data is really tough because a lot of friction is involved: gaining access to even one particular dataset in order to work with it can be a time-consuming process. As a consequence, people tend to work on tasks whose datasets are publicly accessible, such as:

  • ImageNet
  • MNIST
  • CIFAR-10
  • LibriSpeech
  • WikiText-103
  • WMT

But what if there were a much simpler way to work on real human problems, which involve private data?

The issue is that addressing these kinds of real human problems requires real data from people going through them, and getting access to that data is the most critical step. As a result, infrastructure is being built to make access to private data simpler.

To understand this better, let us look at a tool built by a community called ‘OpenMined’, where over 5000 volunteers who care about this issue are working to make privacy preserving AI as convenient as possible.

The ‘PySyft’ tool extends ‘PyTorch’ with tools for privacy preserving Machine Learning. Let’s look at some of the features this tool offers for working with private data.

Remote Execution

Remote execution is the ability to use ‘PyTorch’ on machines that you don’t have direct access to. First you import ‘syft’ and ‘torch’, and then use the ‘TorchHook’, which augments ‘Torch’ with tools for privacy preserving Machine Learning. With that in place, we can initialize a reference to a remote machine, which lets us interact with ‘PyTorch’ operations and ‘PyTorch’ tensors that live on that machine. When a tensor is sent to a particular data center, what is returned to us is a ‘Pointer’. This ‘Pointer’ offers the functionality that a ‘PyTorch’ tensor would normally have. The implication of these ‘PointerTensors’ is that we can use the normal ‘PyTorch’ APIs, which we already know how to handle, to orchestrate complex operations across multiple remote machines. Finally, there is one more important command, ‘.get’, which requests that information be sent back from the remote machine to the user.
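The pointer pattern itself is easy to imitate in plain Python. The sketch below is a toy written for this article and is not the ‘PySyft’ API: `Worker`, `Pointer`, `send` and `get` are hypothetical names, and the "remote machine" is just an object in the same process. It only shows the shape of the workflow: send data away, operate through a pointer, and call `get` when the result should come back.

```python
# Toy imitation of the pointer pattern in plain Python; NOT the PySyft API.
# Worker, Pointer, send and get are hypothetical names for illustration.

class Worker:
    """Stands in for a remote machine that stores values by id."""

    def __init__(self, name):
        self.name = name
        self._store = {}
        self._next_id = 0

    def store(self, value):
        obj_id = self._next_id
        self._next_id += 1
        self._store[obj_id] = value
        return obj_id

    def load(self, obj_id):
        return self._store[obj_id]

    def pop(self, obj_id):
        return self._store.pop(obj_id)


class Pointer:
    """Local handle to a value that lives on a worker."""

    def __init__(self, worker, obj_id):
        self.worker = worker
        self.obj_id = obj_id

    def __add__(self, other):
        # The addition runs on the worker; only a new pointer comes back.
        assert other.worker is self.worker, "both values must be on one worker"
        left = self.worker.load(self.obj_id)
        right = self.worker.load(other.obj_id)
        result = [a + b for a, b in zip(left, right)]
        return Pointer(self.worker, self.worker.store(result))

    def get(self):
        # Ask the worker to hand the value over and forget it.
        return self.worker.pop(self.obj_id)


def send(values, worker):
    """Place a copy of values on the worker and return a pointer to it."""
    return Pointer(worker, worker.store(list(values)))


bob = Worker("bob")
x = send([1, 2, 3], bob)
y = x + x          # y is a Pointer; the data has not left bob
result = y.get()   # only now does the value travel back
print(result)      # [2, 4, 6]
```

The point of the design is that ordinary-looking code, here the `+` operator, can be executed where the data lives.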

Search and Example Data

The next feature makes it possible to do good data science on data you cannot see. Let’s say we have what is called the ‘GridClient’, an interface to a collection of workers that offers features such as ‘Search’. To do some analysis, the user can first search for a given dataset. What comes back are pointers to the remote data along with metadata that describes the schema, how the data was collected, its distribution, and so on.

This helps the user assemble the data relevant to their problem, even when it is distributed across multiple locations. And when the user needs to do, say, feature engineering or put a model together, sample data can be made available, or data that was synthetically generated to resemble the real distribution.

Before moving on to the next feature, a brief remark. As we have seen, ‘.get’ is a critical function in this process. How can we ensure that, when we ask for a tensor back from a remote machine, we are not accidentally getting back private information? This question brings us to the third tool.

Differential Privacy

Differential privacy is a field made up of mathematical algorithms that try to ensure that statistical analysis does not compromise privacy. Suppose we have a database of people with a single column, one row per person. If we are going to query this database, it is important to understand the maximum amount by which the output of the query can change if one person’s data is removed from the dataset.

If that amount is zero, the output of the function is not conditioned on the removed person, who therefore did not contribute to the output at all. If the same procedure is repeated for every person in the database, removing each one or swapping them for someone else, and the output still never changes, then the output of the function clearly does not depend on any specific individual.

In other words, the output of the query is the same for this database and for an identical database with one row removed or replaced, which amounts to perfect privacy. The same measurement also tells us how far a function is from being perfectly private.
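The remove-one-row experiment described above can be written down directly. The following is a small sketch with made-up data. Note that it measures the change on this one database only, whereas a formal analysis considers every possible database.

```python
def sensitivity(query, db):
    """Largest change in query(db) when any single row is removed."""
    full = query(db)
    return max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))


db = [1, 0, 1, 1, 0, 1]  # one binary value per person (hypothetical data)

sum_sens = sensitivity(sum, db)            # removing a 1 changes the sum by 1
const_sens = sensitivity(lambda d: 0, db)  # a constant query ignores everyone

print(sum_sens, const_sens)  # 1 0
```

The constant query is perfectly private but useless, while the sum is useful but depends on individuals. The amount by which it depends on them is what determines how much noise is needed.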

Epsilon

This additional functionality on ‘.get’ accepts a parameter called ‘Epsilon’, with which we choose how much of our privacy budget we want to spend. Any given data science project has a certain privacy budget, which depends on the level of trust and the kind of relationship you have with the data owner.

The budget could be zero, in which case you can only run algorithms that do not leak any information, or there could be a higher degree of trust, in which case more complex queries become possible. This mechanism also tracks private data all the way through to a trained model or some output function, and automatically adds the appropriate amount of noise to make sure that you stay under your privacy budget.
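A privacy budget can be pictured as a simple counter. The sketch below is an assumption-laden toy, not how ‘PySyft’ implements it: `PrivacyBudget` and `noisy_sum` are hypothetical names, the noise is Laplace noise, and the accounting uses basic composition, in which the epsilons of successive queries simply add up.

```python
import random


class PrivacyBudget:
    """Tracks how much epsilon is left, assuming epsilons simply add up."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exceeded")
        self.remaining -= epsilon


def noisy_sum(values, epsilon, budget, sensitivity=1.0):
    """Charge the budget, then release the sum with Laplace noise."""
    budget.spend(epsilon)
    scale = sensitivity / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return sum(values) + noise


budget = PrivacyBudget(total_epsilon=1.0)
data = [1, 0, 1, 1, 0]  # hypothetical binary column

first = noisy_sum(data, 0.5, budget)
second = noisy_sum(data, 0.5, budget)
try:
    noisy_sum(data, 0.5, budget)
    third_allowed = True
except RuntimeError:
    third_allowed = False  # the budget is used up, so the query is refused

print(budget.remaining, third_allowed)  # 0.0 False
```

With a budget of zero, every call to `noisy_sum` would be refused, which matches the case where only algorithms that leak nothing are allowed.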

Before moving on to our last tool, consider one more problem: what if our model is really valuable and, once exposed, someone could use it without authorization? This brings us to the last algorithm, discussed below.

Secure Multi-Party Computation

Secure multi-party computation means that multiple people can combine their private inputs to compute a function without revealing those inputs to each other. It implies that multiple people can share ownership of a number. In this procedure we encrypt a value and distribute it among shareholders, so that no single shareholder knows the original value; it stays hidden because of the encryption.

In addition, this procedure provides ‘Shared Governance’: the number can only be used or decrypted if everyone agrees. So this is more than just encryption; it is shared control over a digital asset, which is really impressive. The most remarkable part is that we can still perform computation on the value while it is encrypted.
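One simple way to realize this is additive secret sharing, sketched below in plain Python. This is an illustrative assumption about the scheme, since the article does not say which one is used. The modulus `Q` and the function names are chosen for this example.

```python
import random

Q = 2**61 - 1  # large public modulus; all arithmetic is done modulo Q


def share(secret, n_parties=3):
    """Split secret into n_parties random shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def reconstruct(shares):
    """Recover the secret; this needs every shareholder's share."""
    return sum(shares) % Q


def add_shared(a, b):
    """Each shareholder adds its own two shares locally."""
    return [(x + y) % Q for x, y in zip(a, b)]


a = share(25)
b = share(17)
c = add_shared(a, b)    # computed share by share, without decrypting
total = reconstruct(c)  # only possible if all shareholders cooperate
print(total)            # 42
```

Any single share looks like a random number, so the value can only be recovered when all shareholders contribute, which is the shared governance property. The addition, meanwhile, was carried out entirely on shares.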

In the end, models and datasets are just large collections of numbers, which we can encrypt and place under shared governance.

We conclude by looking at how this feature shows up in ‘PyTorch’. Let’s say you have several clients and a tensor, on which you call a method named ‘.share’ with a list of shareholders. What is returned to you is a ‘pointer’ with the normal ‘PyTorch’ API, which you can use as usual and which automatically implements the cryptography, so you can do encrypted training, encrypted prediction with models, and so on. This brings many desirable properties for privacy preserving Machine Learning.

Considering everything discussed above, no single technology completely solves ‘Privacy’, and advancing both AI and privacy takes a combination of them. Moreover, this is not a zero-sum game: progress in AI does not have to come at the expense of privacy.

CTO @ ZorroSign | Seasoned Software Architect | Expertise in AI/ML , Blockchain , Distributed Systems and IoT | Lecturer | Speaker | Blogger