Differential Privacy
Posted on Fri 18 December 2020 in Data Science • 4 min read
It's quite clear in today's age that the biggest companies in the world make most of their profits from harvesting and productionising their users' data. With privacy becoming more and more of a concern as we become more connected, protecting our privacy is fast becoming a human right, especially against those who profit from it.
Differential privacy in particular is a model that aims to protect data owners against bad actors reverse-engineering aggregated data to uncover details about individual users.
For the example in this post, we will use a dataset that includes each person's name, age, email and annual income. Say we want to determine how many people in our dataset make over $50,000 annually. Instead of sharing the specific people and their incomes, we would rather share only the aggregate. Someone viewing our results might learn that 30% of our set make over that threshold, but not which people.
However, let's say someone viewing the results wants to know the specific income of one person. To do this, they've gone and collected background information on every person except the person of interest. If they know which of the other 4,999 people make over the threshold, they can determine whether the person in question makes over or under it.
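To make this concrete, here is a minimal sketch of such an attack, assuming a pandas DataFrame shaped like the one we build below and a hypothetical target_name the attacker cares about:

import pandas as pd

def differencing_attack(df, target_name, limit):
    # The aggregate count released over the full dataset
    full_count = (df['income'] > limit).sum()
    # The attacker's background knowledge: everyone except the target
    background_count = (df[df['name'] != target_name]['income'] > limit).sum()
    # If the two counts differ, the target must make over the limit
    return full_count > background_count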
This type of attack is known as a differencing attack. It is very difficult to protect against, and it is exactly what differential privacy aims to defend against. The primary method of achieving this is adding random noise to the aggregate data (playing a role loosely analogous to a private key in cryptography). In our example from above, the results might report that 27-32% of people make over the threshold rather than the exact number. This still lets people understand the results, but protects the privacy of the users within them.
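At its core, the noise is drawn from a Laplace distribution whose scale is set by the query's sensitivity and the privacy budget (epsilon). A minimal sketch of that idea, where noisy_count is a hypothetical helper rather than part of any library used below:

import numpy as np

def noisy_count(true_count, epsilon):
    # A counting query has sensitivity 1: adding or removing one
    # person changes the true count by at most 1, so the Laplace
    # noise scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> stronger privacy
print(noisy_count(1500, 0.8))
print(noisy_count(1500, 0.1))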
Now let's use the following packages:
- mimesis to generate the user data
- pandas to calculate the exact values (statistical analysis)
- pydp to calculate the same values while maintaining the privacy of the users
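All three are available from PyPI; note that, at the time of writing, pydp is published under the name python-dp, so the install would look something like:

pip install python-dp mimesis pandas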
import random

import pandas as pd

import pydp as dp
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max

from mimesis import Person, Address, Datetime
from mimesis.enums import Gender

person = Person('en')
address = Address()
datetime = Datetime()
def create_rows_mimesis(number_of_rows=1):
    # Build fake user records: name, age, email and a random annual income
    output = [{"name": person.full_name(),
               "age": person.age(),
               "email": person.email(),
               "income": random.randint(10000, 100000)}
              for x in range(number_of_rows)]
    return output
income_data = pd.DataFrame(create_rows_mimesis(5000))
income_data
# Calculate count with no differential privacy
def typical_count_above(column_name, limit):
    return income_data[income_data[column_name] > limit].count()[column_name]
number_over_threshold = typical_count_above('income', 50000)
print(f"Number of users with income over $50,000: {number_over_threshold} or {(number_over_threshold / 5000) * 100:.1f}%")
As we can see from the calculations above, typical_count_above counts the number of users over a limit for a specified column, with no preservation of privacy whatsoever.
# Calculate count with differential privacy
def private_count_above(column_name, privacy_budget, limit):
    # Count spends the given privacy budget (epsilon) and adds
    # Laplacian noise to the true count
    x = Count(privacy_budget, dtype='int')
    return x.quick_result(list(income_data[income_data[column_name] > limit][column_name]))
private_number_over_threshold = private_count_above('income', 0.8, 50000)
print(f"PRIVATE: Number of users with income over $50,000: {private_number_over_threshold} or {(private_number_over_threshold / 5000) * 100:.1f}%")
The private_count_above function works very similarly to typical_count_above, but uses Google's Differential Privacy library (via PyDP) to count the number of users above the limit, preserving privacy with the Laplacian mechanism for adding noise. Also note that we are able to tune the privacy budget, which sets the acceptable loss of privacy, with 0 denoting that no loss whatsoever is acceptable (and therefore the most noise).
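Because the noise is sampled fresh on every run, repeated private queries return slightly different counts, and lower budgets return noisier ones. A quick, purely illustrative way to see this (in a real deployment, every extra query would spend additional budget):

for epsilon in (0.1, 0.5, 0.8):
    results = [private_count_above('income', epsilon, 50000) for _ in range(3)]
    print(f"epsilon={epsilon}: {results}")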
Mean Example
Now let's repeat the same example, but by determining the average income across all users in the dataset.
# Calculate mean with no differential privacy
def typical_mean(column_name):
    return income_data[column_name].mean()

# Calculate mean with differential privacy
def private_mean(column_name, privacy_budget):
    # BoundedMean clamps values to the given [min, max] bounds before
    # averaging and adding Laplacian noise
    x = BoundedMean(privacy_budget, income_data[column_name].min(), income_data[column_name].max())
    return x.quick_result(list(income_data[column_name]))
true_mean_income = typical_mean('income')
private_mean_income = private_mean('income', 0.8)
print(f"True mean income: {true_mean_income}")
print(f"Private mean income: {private_mean_income}, 0.8 privacy budget")
print(f"Private mean income: {private_mean('income', 0.1)}, 0.1 privacy budget")
As we can see from the above, while the private outcome lands very close to the true result, we have protected the users' privacy!
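To quantify that trade-off, one rough sketch is to compare the absolute error of the private mean against the true mean across a few budgets (the exact numbers will vary from run to run, since the noise is random):

for epsilon in (0.1, 0.5, 0.8):
    error = abs(true_mean_income - private_mean('income', epsilon))
    print(f"epsilon={epsilon}: absolute error {error:.2f}")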