Differential Privacy

Posted on Fri 18 December 2020 in Data Science • 4 min read

It's quite clear today that the biggest companies in the world make most of their profits from harvesting and productionising their users' data. As we become more connected and privacy becomes more of a concern in everyday life, having our privacy protected is almost becoming a human right, especially by those who profit from our data.

Differential privacy in particular is a model that aims to protect the data's owners against bad actors reversing the aggregated data to find details of individual users.

For the example in this post, we will use a dataset that includes each person's name, age, email and annual income. Say we want to determine how many people in our dataset make over $50,000 annually. Instead of sharing every person and their income, we would rather share the aggregate data. Someone viewing our results might know that 30% of our set make over that threshold, but not which people.

However, let's say someone viewing the results wants to know the income of one specific person. To do this, they have collected background information on every person in the dataset except the person of interest. If the dataset holds 5,000 people and they know which of the other 4,999 make over the threshold, they can subtract that from the published total to determine whether the person in question makes over or under the threshold.

This type of attack is known as a differencing attack. It is very difficult to protect against, and it is what differential privacy aims to defend against. The primary method of achieving privacy is to add random noise to the aggregate data. In our example above, our results might say that 27% - 32% of people make over the threshold rather than giving the exact number. People can still understand the results, but the privacy of the users behind them is protected.
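To make the attack concrete, here is a minimal sketch using only the Python standard library. The toy data, the `laplace_noise` helper and the budget of 0.5 are assumptions made up for this illustration; they are not part of the dataset or libraries used later in the post.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


rng = random.Random(0)

# Hypothetical toy data: True means the person earns over the threshold.
others = [rng.random() < 0.3 for _ in range(4999)]  # known to the attacker
target = True  # the one value the attacker does not know

# Exact release: the attacker subtracts what they already know.
exact_count = sum(others) + int(target)
recovered = exact_count - sum(others)  # equals 1, so the target is exposed

# Noisy release: the same subtraction now returns the target's bit plus noise.
epsilon = 0.5  # assumed privacy budget for this illustration
noisy_count = exact_count + laplace_noise(1 / epsilon, rng)
noisy_difference = noisy_count - sum(others)

print(f"Exact release reveals: {recovered}")
print(f"Noisy release reveals: {noisy_difference:.2f}")
```

With the exact count, the subtraction always returns the target's true value. With the noisy count, the attacker only sees that value blurred by random noise, so a single release no longer tells them for certain which side of the threshold the person falls on.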

Now let's use the following packages:

mimesis to generate the user data
pandas to calculate the exact values (statistical analysis)
pydp to calculate the same values but maintaining privacy of the users

In [1]:

import random

import pandas as pd
from mimesis import Person
from pydp.algorithms.laplacian import BoundedMean, Count

# Generator for fake personal details
person = Person()


def create_rows_mimesis(number_of_rows=1):
    # Build one record per fake user; income is drawn uniformly at random
    output = [{"name": person.full_name(),
               "age": person.age(),
               "email": person.email(),
               "income": random.randint(10000, 100000)} for _ in range(number_of_rows)]
    return output


income_data = pd.DataFrame(create_rows_mimesis(5000))

income_data

Out[1]:

	name	age	email	income
0	Fidel Burnett	36	aquaria1999@gmail.com	53435
1	Felipe Norman	56	labradoodle1933@outlook.com	72310
2	Jeremiah Valentine	49	arcanist2068@yahoo.com	20600
3	Stuart Fitzpatrick	39	uniembryonate1938@live.com	45444
4	Kim Hubbard	55	dutiful2019@protonmail.com	95098
...	...	...	...	...
4995	Heath Barrera	35	booby1892@live.com	71771
4996	Nadene Flynn	63	mitten1812@yahoo.com	59993
4997	Francisco Phillips	18	vaucheriaceous1908@gmail.com	86981
4998	Claud Puckett	64	prefavorably1893@gmail.com	97025
4999	Claris Foster	60	dhava1876@outlook.com	33015

5000 rows × 4 columns

In [2]:

# Calculate count with no differential privacy
def typical_count_above(column_name, limit):
    return income_data[income_data[column_name] > limit].count()[column_name]

number_over_threshold = typical_count_above('income', 50000)
print(f"Number of users with income over $50,000: {number_over_threshold} or {(number_over_threshold / 5000) * 100:.1f}%")

Number of users with income over $50,000: 2810 or 56.2%

As we can see from the calculations above, typical_count_above counts the number of users over a limit for a specified column with no preservation of privacy whatsoever.

In [3]:

# Calculate count with differential privacy
def private_count_above(column_name, privacy_budget, limit):
    x = Count(privacy_budget, dtype='int')
    return x.quick_result(list(income_data[income_data[column_name] > limit][column_name]))

private_number_over_threshold = private_count_above('income', 0.8, 50000)
print(f"PRIVATE: Number of users with income over $50,000: {private_number_over_threshold} or {(private_number_over_threshold / 5000) * 100:.1f}%")

PRIVATE: Number of users with income over $50,000: 2809 or 56.2%

The private_count_above function works much like typical_count_above, but it uses PyDP, a Python wrapper around Google's Differential Privacy library, to count the number of users above the limit. It preserves privacy by using the Laplace mechanism to add noise to the result. Also note that we can tune the privacy budget to set the acceptable loss of privacy: the smaller the budget, the more noise is added, with a budget approaching 0 meaning almost no loss of privacy is acceptable.
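To get a feel for how the privacy budget changes the amount of noise, here is a rough standard-library sketch of the Laplace mechanism for a count. It is a simplified illustration, not PyDP's actual implementation; `private_count_sketch` and `average_error` are hypothetical helpers written for this post.

```python
import random


def private_count_sketch(true_count, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


def average_error(true_count, epsilon, rng, trials=2000):
    # Average absolute distance between the noisy and the exact count.
    errors = [abs(private_count_sketch(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


rng = random.Random(42)
true_count = 2810  # the exact count from the example above

error_high_budget = average_error(true_count, 0.8, rng)
error_low_budget = average_error(true_count, 0.1, rng)

print(f"Average error with a 0.8 budget: {error_high_budget:.2f}")
print(f"Average error with a 0.1 budget: {error_low_budget:.2f}")
```

The smaller budget gives a noticeably larger average error, which is the trade-off we are tuning: less privacy loss in exchange for a less accurate answer.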

Mean Example

Now let's repeat the same example, but by determining the average income across all users in the dataset.

In [4]:

# Calculate mean with no differential privacy
def typical_mean(column_name):
    return income_data[column_name].mean()

# Calculate mean with differential privacy
def private_mean(column_name, privacy_budget):
    x = BoundedMean(privacy_budget, income_data[column_name].min(), income_data[column_name].max())
    return x.quick_result(list(income_data[column_name]))

true_mean_income = typical_mean('income')
private_mean_income = private_mean('income', 0.8)
print(f"True mean income: {true_mean_income}")
print(f"Private mean income: {private_mean_income}, 0.8 privacy budget")
print(f"Private mean income: {private_mean('income', 0.1)}, 0.1 privacy budget")

True mean income: 55467.1134
Private mean income: 55470.37580603853, 0.8 privacy budget
Private mean income: 55589.33064063336, 0.1 privacy budget

As we can see from the above, the private results stay close to the true mean, with the smaller privacy budget of 0.1 producing a noisier estimate, and we have protected the users' privacy!
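For the curious, here is a simplified sketch of how a bounded mean can be made private with Laplace noise. It is not the algorithm PyDP uses internally. It assumes the bounds are picked in advance from what we know about incomes rather than read from the data, and `private_mean_sketch` is a hypothetical helper written for this post.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two exponential draws with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean_sketch(values, epsilon, lower, upper, rng):
    # Clamp every value so that one person can shift the sum
    # by at most max(|lower|, |upper|).
    clamped = [min(max(v, lower), upper) for v in values]
    sum_sensitivity = max(abs(lower), abs(upper))

    # Split the budget evenly between the noisy sum and the noisy count.
    half = epsilon / 2
    noisy_sum = sum(clamped) + laplace_noise(sum_sensitivity / half, rng)
    noisy_count = len(clamped) + laplace_noise(1 / half, rng)

    # Guard against a tiny or negative noisy count, then keep the
    # estimate inside the stated bounds.
    estimate = noisy_sum / max(noisy_count, 1.0)
    return min(max(estimate, lower), upper)


rng = random.Random(7)
incomes = [rng.randint(10000, 100000) for _ in range(5000)]  # toy data

true_mean = sum(incomes) / len(incomes)
estimate = private_mean_sketch(incomes, 0.8, 10000, 100000, rng)

print(f"True mean: {true_mean:.2f}")
print(f"Private mean sketch: {estimate:.2f}")
```

The bounds matter because they cap how much any one person can influence the sum, which in turn sets how much noise is needed to hide that person.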
