This repository contains an implementation of joinable private sketches and
experiments evaluating their utility.

Joinable private sketches allow two parties holding sensitive datasets to
perform joint computations while protecting the privacy of the individuals
present in the datasets. A joinable private sketch is a differentially private
representation of a dataset that has an "id" column and a "value" column (e.g.
name and employment status). The sketch can be combined with another dataset
having an "id" column and any number of other columns (e.g. name, age,
nationality, ...) to compute approximate answers to questions about the inner
join of both datasets on the "id" column.

Two kinds of query are supported:
- Summing a function over the join. For example, the number of unemployed people
  between 30 and 50 years old.
- Choosing the optimal function from a class. For example, training a machine
  learning model to predict employment status from the other columns.

Since the sketch is differentially private, it can be published so that other
parties can later join it with their own sensitive datasets.
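
To make the first kind of query concrete, here is a minimal sketch of the exact,
non-private quantity that a sketch-based answer approximates: the sum of a
function over the inner join on "id". It does not use the dp_join API and adds
no noise. The toy data and the names sum_over_join and unemployed_30_to_50 are
invented for illustration.

```python
# Toy data, invented for illustration only.
# Party A: "id" -> "value" (here: name -> employment status).
employment = {
    "alice": "unemployed",
    "bob": "employed",
    "carol": "unemployed",
    "dave": "unemployed",
}
# Party B: "id" -> other columns (here: name -> age).
people = {
    "alice": {"age": 34},
    "bob": {"age": 41},
    "carol": {"age": 62},
    "erin": {"age": 45},
}


def sum_over_join(values, others, f):
    """Exact sum of f(value, other_columns) over the inner join on id."""
    shared_ids = values.keys() & others.keys()
    return sum(f(values[i], others[i]) for i in shared_ids)


def unemployed_30_to_50(status, row):
    """1 if the person is unemployed and between 30 and 50 years old, else 0."""
    return int(status == "unemployed" and 30 <= row["age"] <= 50)


exact_count = sum_over_join(employment, people, unemployed_30_to_50)
print(exact_count)
```

Only ids present in both datasets contribute, so "dave" and "erin" are ignored
here. With a joinable private sketch, the party holding the ages would work
from a published sketch of the employment data rather than the raw data, and
would obtain an approximate answer.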


Running the experiments
-----------------------

1. Download the UCI "Adult" dataset.

Go to
	https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
and download two files: adult.data and adult.test. Save them in the directory
datasets/adult.
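
If you prefer to script this step, the following is a minimal sketch using only
the Python standard library. It assumes that the two files are still served
directly under the URL above and that datasets/adult is relative to the current
working directory. The helper names adult_urls and download_adult are
hypothetical and are not part of this repository.

```python
import urllib.request
from pathlib import Path

# Assumption: the files are served directly under the directory URL above.
BASE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/"
FILES = ("adult.data", "adult.test")


def adult_urls():
    """Return (url, filename) pairs for the two files named above."""
    return [(BASE_URL + name, name) for name in FILES]


def download_adult(dest_dir="datasets/adult"):
    """Download both files into dest_dir, skipping any that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url, name in adult_urls():
        target = dest / name
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))
    return dest
```

Calling download_adult() from the directory that should contain datasets/adult
fetches both files. Nothing is downloaded until the function is called.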

2. Experiment: Measuring a joint distribution.

Run
	python3 -m dp_join.experiment.joint_counts
The script will print a path to an HTML report which can be opened in a web
browser.

3. Experiment: Logistic regression on the UCI Adult dataset.

NOTE: this is not the experiment that was used for the paper! For that, see the
README file in the parent directory, which will direct you to the Jupyter
notebooks in the "adult" directory. A significant difference is that this
version doesn't use a bounded loss function, but instead uses a "difference
sketch" (Appendix E of the paper). Also, this code includes some
non-categorical features.

Run
	python3 -m dp_join.experiment.adult_logistic_regression
The script will print a path to an HTML report which can be opened in a web
browser.


Exploring the code
------------------

All of the code lives in the dp_join directory.

To see how sketches work, look at the Sketcher interface in
dp_join/sketch/__init__.py and the implementation, OneHotSketcher, in
dp_join/sketch/one_hot.py.

Another way to understand the sketching code is to see how it is used. You can
look at the experiments under dp_join/experiment, or look at the unit tests.
To run all unit tests and doctests, make sure pytest is installed and run:
	pytest
