Differential Privacy

Posted on Fri 18 December 2020 in Data Science • 4 min read

It's quite clear that in today's age the biggest companies in the world make most of their profits from harvesting and productionising their users' data. As we become more connected, privacy is becoming more and more of a concern in everyday life, and having it protected is almost becoming a human right, especially by those who profit from our data.

Differential privacy in particular is a model that aims to protect the people behind the data against bad actors reverse-engineering aggregated data to find details of individual users.

For the example in this post, we will use a dataset that includes each person's name, age, email and annual income. Say we want to determine how many people in our dataset make over $50,000 annually. Instead of sharing all of the specific people and their incomes, we would rather share the aggregate data. Someone viewing our results might learn that 30% of our set make over that threshold, but not which people.

However, let's say someone viewing the results wants to know the income of one specific person. To do this, they've gone and collected background information on every person in our 5000-person dataset except the one of interest to them. If they know which of the other 4999 people make over the threshold, they can compare that count with our published total and determine whether the person in question makes over or under the threshold.
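To make the attack above concrete, here is a minimal sketch on a toy dataset. The names and values are made up for illustration, and `exact_count` is a hypothetical helper rather than anything from a library.

```python
# Hypothetical toy data: True means the person earns more than $50,000.
over_threshold = {
    "alice": True,
    "bob": False,
    "carol": True,
    "dave": False,
    "target": True,
}

def exact_count(records):
    """Exact (non-private) count of people over the threshold."""
    return sum(records.values())

# The published aggregate covers everyone, including the target.
published_count = exact_count(over_threshold)

# The attacker's background knowledge: everyone except the target.
known = {name: flag for name, flag in over_threshold.items() if name != "target"}
known_count = exact_count(known)

# With exact counts, the difference is exactly the target's private bit.
target_is_over = (published_count - known_count) == 1
print(f"Target earns over the threshold: {target_is_over}")
```

Because both counts are exact, subtracting one from the other isolates a single person, no matter how large the dataset is.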

This type of attack is known as a differencing attack. It is very difficult to protect against, and it is what differential privacy aims to defend against. The primary method of achieving privacy is adding random noise to the aggregate data before it is released. In our example from above, our results might say 27% - 32% of people make over the threshold rather than the specific number. People can still understand the results, but the privacy of the users within them is protected.
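As a rough illustration of adding noise to an aggregate, the sketch below releases a count with Laplace noise using only the standard library. The helpers `laplace_noise` and `noisy_count` are hypothetical names, the 1500-of-5000 figure is invented, and this is the textbook mechanism rather than what any particular library does internally.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1 / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of the query is 1.
    """
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)  # seeded only so the sketch is repeatable

# Hypothetical numbers: 1500 of 5000 people are over the threshold.
release = noisy_count(1500, epsilon=0.5, rng=rng)
print(f"Published share: {release / 5000:.1%}")
```

Since every release carries fresh noise, the difference between two published counts no longer pins down one person's answer with certainty.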

Now let's use the following packages:

  1. mimesis to generate the user data
  2. pandas to calculate the exact values (statistical analysis)
  3. pydp to calculate the same values but maintaining privacy of the users

In [1]:
import random

import pandas as pd
import pydp as dp
from pydp.algorithms.laplacian import BoundedMean, Count
from mimesis import Person

person = Person('en')

def create_rows_mimesis(number_of_rows=1):
    # Build one synthetic record per row; income is drawn uniformly at random
    output = [{"name": person.full_name(),
               "age": person.age(),
               "email": person.email(),
               "income": random.randint(10000, 100000)} for _ in range(number_of_rows)]
    return output

income_data = pd.DataFrame(create_rows_mimesis(5000))
income_data

      name                age  email                         income
0     Fidel Burnett       36   aquaria1999@gmail.com         53435
1     Felipe Norman       56   labradoodle1933@outlook.com   72310
2     Jeremiah Valentine  49   arcanist2068@yahoo.com        20600
3     Stuart Fitzpatrick  39   uniembryonate1938@live.com    45444
4     Kim Hubbard         55   dutiful2019@protonmail.com    95098
...   ...                 ...  ...                           ...
4995  Heath Barrera       35   booby1892@live.com            71771
4996  Nadene Flynn        63   mitten1812@yahoo.com          59993
4997  Francisco Phillips  18   vaucheriaceous1908@gmail.com  86981
4998  Claud Puckett       64   prefavorably1893@gmail.com    97025
4999  Claris Foster       60   dhava1876@outlook.com         33015

5000 rows × 4 columns

In [2]:
# Calculate count with no differential privacy
def typical_count_above(column_name, limit):
    return income_data[income_data[column_name] > limit].count()[column_name]

number_over_threshold = typical_count_above('income', 50000)
print(f"Number of users with income over $50,000: {number_over_threshold} or {(number_over_threshold / 5000) * 100:.1f}%")
Number of users with income over $50,000: 2810 or 56.2%

As we can see from the calculations above, typical_count_above counts the number of users over a limit for a specified column with no preservation of privacy whatsoever.

In [3]:
# Calculate count with differential privacy
def private_count_above(column_name, privacy_budget, limit):
    x = Count(privacy_budget, dtype='int')
    return x.quick_result(list(income_data[income_data[column_name] > limit][column_name]))

private_number_over_threshold = private_count_above('income', 0.8, 50000)
print(f"PRIVATE: Number of users with income over $50,000: {private_number_over_threshold} or {(private_number_over_threshold / 5000) * 100:.1f}%")
PRIVATE: Number of users with income over $50,000: 2809 or 56.2%

The private_count_above function works very similarly to typical_count_above, but counts the users above the limit with PyDP, a Python wrapper around Google's Differential Privacy library, which preserves privacy by using the Laplace mechanism to add noise to the result. Also note that we are able to tune the privacy budget (often written as epsilon), which is the acceptable loss of privacy: the smaller the budget, the more noise is added, with 0 denoting that no loss whatsoever is acceptable.
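To get a feel for how the privacy budget trades off against accuracy, the sketch below computes how large the noise on a count typically is for a few budgets. It assumes the textbook Laplace mechanism with sensitivity 1, where the noise scale is 1 / epsilon; `laplace_scale` and `error_bound` are hypothetical helpers, and the library's real implementation may differ in its details.

```python
import math

def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale b of the textbook Laplace mechanism."""
    return sensitivity / epsilon

def error_bound(epsilon, confidence=0.95, sensitivity=1.0):
    """Magnitude the noise stays within with the given probability."""
    # For Laplace(0, b): P(|X| > t) = exp(-t / b), so t = b * ln(1 / (1 - confidence))
    return laplace_scale(epsilon, sensitivity) * math.log(1 / (1 - confidence))

for epsilon in (0.1, 0.5, 0.8, 1.0):
    print(
        f"epsilon={epsilon}: noise scale {laplace_scale(epsilon):.2f}, "
        f"95% of noise within +/-{error_bound(epsilon):.1f}"
    )
```

Under these assumptions a budget of 0.8 keeps a count within a few units of the truth most of the time, which matches how close the private count above is to the exact one.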

Mean Example

Now let's repeat the exercise, this time determining the average income across all users in the dataset.

In [4]:
# Calculate mean with no differential privacy
def typical_mean(column_name):
    return income_data[column_name].mean()

# Calculate mean with differential privacy
def private_mean(column_name, privacy_budget):
    x = BoundedMean(privacy_budget, income_data[column_name].min(), income_data[column_name].max())
    return x.quick_result(list(income_data[column_name]))

true_mean_income = typical_mean('income')
private_mean_income = private_mean('income', 0.5)
print(f"True mean income: {true_mean_income}")
print(f"Private mean income: {private_mean_income}, 0.5 privacy budget")
print(f"Private mean income: {private_mean('income', 0.1)}, 0.1 privacy budget")
True mean income: 55467.1134
Private mean income: 55470.37580603853, 0.5 privacy budget
Private mean income: 55589.33064063336, 0.1 privacy budget

As we can see from the above, the private results stay close to the true mean, with the smaller 0.1 budget drifting further from it, and we have protected the users' privacy!
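For intuition about what a bounded mean does, here is a simplified sketch: clamp each value to the bounds, average, and add Laplace noise scaled to how much one record can move the mean. It assumes the number of records is public and is not the algorithm PyDP actually runs; `private_mean_sketch` is a hypothetical name. It also passes fixed bounds rather than the data's own minimum and maximum, since bounds derived from the data can themselves reveal something about individuals.

```python
import random

def private_mean_sketch(values, epsilon, lower, upper, rng):
    """Simplified bounded mean: clamp, average, add Laplace noise.

    Assumes the record count n is public, so changing one record
    moves the mean by at most (upper - lower) / n.
    """
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Laplace(0, scale) as the difference of two exponential draws
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clamped) / n + noise

rng = random.Random(42)  # seeded only so the sketch is repeatable
# Synthetic incomes drawn the same way as in the post
incomes = [rng.randint(10000, 100000) for _ in range(5000)]
true_mean = sum(incomes) / len(incomes)

for epsilon in (0.8, 0.1):
    estimate = private_mean_sketch(incomes, epsilon, 10000, 100000, rng)
    print(f"epsilon={epsilon}: true {true_mean:.1f}, private {estimate:.1f}")
```

With 5000 records and these bounds the noise scale is 22.5 for a budget of 0.8 and 180 for a budget of 0.1, which is in line with the size of the gaps seen in the results above.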