
Differential Privacy – My Takeaway from the Book "The Ethical Algorithm" by Michael Kearns and Aaron Roth

Technology controls and drives almost every aspect of business, and data has become the backbone that organizations rely on to gain visibility, stay competitive, and make more agile business decisions. This is why businesses collect as much data as possible from every available source. That data does not come free: protecting it from breaches and complying with data privacy regulations is very challenging. For details, refer to the Verizon Data Breach Investigations Report for 2020. Also, according to DLA Piper's latest GDPR Data Breach Survey, data protection regulators have imposed EUR 114 million (approximately USD 126 million / GBP 97 million) in fines under the GDPR regime for a wide range of GDPR infringements, not just for data breaches.

This makes data privacy a priority for organizations. Protecting data privacy comes with many challenges, and resolving them is a problem organizations need to solve before it is too late. Differential privacy is one of the best options for resolving these privacy issues.

While researching this topic I came across the book "The Ethical Algorithm: The Science of Socially Aware Algorithm Design" by the author duo Michael Kearns and Aaron Roth. The book is written in such a way that a person with little or no understanding of algorithms and differential privacy will be able to easily grasp the content.

The book covers the differential privacy and fairness aspects of algorithms; this article is about my takeaway from the book regarding differential privacy.

This article is mainly my comprehension of the book, along with some concepts that help in understanding the subject better.

What is Differential Privacy?

The definition of Differential Privacy as per Wikipedia is “Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose information is in the database.”

As per Cynthia Dwork, one of the inventors of differential privacy: "The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset."

Differential privacy is a mathematical definition of what it means to have privacy. It is not a specific process like de-identification, but a property that a process can have. For example, it is possible to prove that a specific algorithm “satisfies” differential privacy.

Differential privacy is a standard that can be used for analyzing sensitive personal information. It provides a mathematically provable guarantee of privacy protection against a wide range of privacy attacks, that is, attempts to learn private information specific to individuals from a data release. Types of privacy attacks include re-identification, record linkage, and differencing attacks, as well as other attacks currently unknown or unforeseen.
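For readers who want the formal statement behind this guarantee, the standard epsilon-differential privacy definition (written out here by me, not quoted from the book) says that a randomized algorithm M is epsilon-differentially private if, for every pair of datasets D and D' that differ in a single individual's record, and for every set S of possible outputs:

\Pr[\, M(D) \in S \,] \le e^{\epsilon} \cdot \Pr[\, M(D') \in S \,]

The smaller the privacy parameter epsilon, the less any single person's data can change the distribution of the published output.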

Why is Differential Privacy Required?

Following are five important reasons why differential privacy is required:

  1. Anonymization does not protect against de-anonymization, also known as re-identification. Anonymization has two main flaws: first, it protects against only a very narrow view of what a privacy violation is, and second, it is brittle under multiple data releases.
  2. When a data curator and/or administrator is considering releasing an "anonymized" data source, he can try to make an informed guess about how difficult it would be for an attacker to re-identify individuals in the dataset. But it is difficult for him to anticipate other data sources. When people use the internet and web-based applications, whether for convenience or pleasure, they create lots of digital footprints that can reveal information users are not aware of. Hackers and people with malicious intent can gather and correlate this information to re-identify a person.
  3. If your dataset has a unique record, there is a problem: unique records are akin to a fingerprint that can be reconnected with an identity by anyone who knows enough about someone to identify that person's record, and then she or he can learn everything else contained in that record.
  4. In the domain of data privacy, there have been many cases in which sensitive information about specific people, including medical records, web search behavior, and financial data, has been inferred by "de-anonymizing" data that was allegedly made "anonymous" by algorithmic processing.
  5. Recent research has shown that other aggregated statistics can leak private data. For example, given only input-output access to a trained machine learning model, it is often possible to identify the data points that were used in its training set.

As Cynthia Dwork likes to say, "anonymized data isn't": either it isn't really anonymous or so much of it has been removed that it is no longer data.

Differential Privacy Models:

Following are the two models used for algorithmic differential privacy:

  1. Local Model of Differential Privacy: In this model, the noise or randomness is added at the user level while the data is being collected, which makes it difficult to separate the real data from the noise. The advantages of this model are, first, that you do not need a trusted data curator, and second, that if the dataset falls into the hands of people with malicious intent or is made available to legal authorities, it still maintains privacy. Of course, data security requirements like encryption, role-based access control, and multifactor authentication still need to be implemented. The limitation of this model is that it needs a very large number of data records for the statistical analysis to work accurately. A minimal code sketch of this model follows the Google and Apple example below.

Google and Apple are examples of organizations using this model to collect users' data. This has allowed these organizations to collect data from users that was not being collected before. Google adds the randomness at the Chrome browser level and Apple adds it at the iPhone level, so the data that reaches these organizations is already randomized.
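A minimal sketch of the local model is the classic randomized response technique. This is my own illustrative code, not the actual mechanism used by Google (RAPPOR) or Apple, which are considerably more elaborate:

# Minimal sketch of local differential privacy via randomized response.
# Illustrative only; the real Google/Apple mechanisms are more elaborate.
import random

def randomized_response(true_answer: bool) -> bool:
    # Each user flips a coin before reporting a sensitive yes/no attribute:
    # with probability 1/2 report the truth, otherwise report a random bit.
    # For a single bit this satisfies (ln 3)-differential privacy.
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

def estimate_true_fraction(reports):
    # Debias the aggregate: E[reported yes] = 0.25 + 0.5 * true fraction.
    observed = sum(reports) / len(reports)
    return (observed - 0.25) / 0.5

# Hypothetical example: 100,000 users, 30% of whom truly have the attribute.
truths = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
print(round(estimate_true_fraction(reports), 3))  # close to 0.30

Notice that no individual report can be trusted, yet the debiased aggregate is accurate because the sample is large, which is exactly the limitation of the local model mentioned above.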

  2. Central Model of Differential Privacy: In this model, the data that is collected is the real data, with its sensitive private inputs; no noise or randomness is added at the user level. Since the data contains all the private attributes, it needs to be collected, stored, and handled with utmost care. There should be a legal mandate for data protection and data leak prevention for any organization collecting these kinds of data records, and the data curators and handlers must be trustworthy and legally liable for data mishandling. The data security requirements that need to be implemented are more stringent, as sensitive information is collected and stored. A code sketch follows the Census example below.

In September 2017, the US Census Bureau announced that all statistical analyses published as part of the 2020 Census will be protected with differential privacy. The Census Bureau will operate using the centralized model of differential privacy, collecting the data exactly (as it always has) and adding privacy protections to its publicly released aggregate statistics.
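As a contrast to randomized response, here is a minimal sketch of the central model (again my own illustration, not the Census Bureau's actual implementation): the trusted curator keeps the raw records and adds calibrated Laplace noise only to the published statistic.

# Minimal sketch of central differential privacy: the trusted curator sees
# the raw records and adds Laplace noise to the released aggregate count.
import random

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1 (one person changes the count by at
    # most 1), so Laplace noise with scale 1/epsilon gives epsilon-DP.
    true_count = sum(1 for r in records if predicate(r))
    # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical example: count households with more than three members.
households = [{"members": random.randint(1, 6)} for _ in range(50_000)]
print(private_count(households, lambda h: h["members"] > 3, epsilon=0.1))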

Limitations of Differential Privacy

It is possible to conduct essentially any kind of statistical analysis subject to differential privacy. But there is a cost to pay for differential privacy to work: it needs a larger amount of data to achieve the same accuracy of statistical results with privacy applied as without it.

There is a big trade-off between the accuracy of results and stringent data privacy requirements. A large volume of data is required for the error added by randomness or noise to become negligible relative to the results.
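A rough numeric illustration of this trade-off (my own sketch, continuing the count example above): the added noise has a fixed expected magnitude of about 1/epsilon regardless of dataset size, so its relative impact shrinks as the data grows.

# The Laplace noise scale for a count is ~1/epsilon no matter how large the
# dataset is, so the relative error shrinks as the data grows.
epsilon = 0.1
noise_scale = 1 / epsilon  # roughly 10 records' worth of noise
for n in (1_000, 100_000, 10_000_000):
    true_count = n // 2  # suppose half the records match the query
    print(f"n={n:>10,}  relative error ~ {noise_scale / true_count:.6f}")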

Differential privacy cannot protect you from inferences drawn from seemingly useless bits of data: the digital exhaust that everyone constantly produces as they use the internet, purchase items, click on links, and like posts. As machine learning becomes more powerful, increasingly complicated and non-obvious correlations can be discovered in this exhaust.

A very simple example, as given in the book, is the following:

Consider a scenario in which person X, a smoker, decides to be included in a survey, and the analysis of the survey data then reveals that smoking causes cancer. Will this survey analysis harm person X as a smoker? Maybe: the fact that person X is a smoker is already known to the people who know him. Based on the survey analysis correlated with this generally available information, a health insurance company may raise the premium for person X, which is a direct long-term financial impact. Is this a case of a personal data leak? According to differential privacy it is not. The rationale is that the impact on the smoker is the same regardless of whether or not person X participated in the survey; the inclusion or omission of person X's data does not change the analysis results.

My Takeaway From the Book

Differential privacy provides a guarantee that, for every individual in the dataset, and for any observer, no matter what their initial beliefs about the world were, after observing the output of a differentially private computation their posterior belief about anything is close to what it would have been had they observed the output of the same computation run without the individual's data.

Differential privacy controls the difference between what can be learned about you when you make the individual decision to allow your data to be used and what can be learned when you don't. Because this difference is small, you have little incentive to withhold your data.

Differential privacy does not protect you from inferences based on data that is already available from other sources on the internet; people with malicious intent can still make use of such data, because the correlations they rely on would exist whether or not your data was available. For clarity, refer to the example of person X above.
