Challenge Overview

Problem Statement

NIST: Differential Privacy Synthetic Data Challenge

Prizes

1-st Place: $10,000

2-nd Place: $7,000

3-rd Place: $5,000

4-th place: $2,000

5-th place: $1,000

Progressive prizes: 4 x $1,000 awarded to the top four placements midway through the challenge (23:55 EDT, November 15, 2018).

For the full breakdown of each match period see the information provided on Challenge.gov

By registering for and competing in this challenge you consent to sharing your contact information with NIST, and agree to be contacted by NIST, in the event you are entitled to a payment on this challenge as payments will be issued by NIST and not Topcoder.

This match is TCO19 eligible. Everybody who scores above 250k in the system test will get TCO19 points according to these rules.

Introduction

Detailed and accurate data builds the foundation for academic research, informed decision making, and a large diversity of smart software systems. Collection and sharing of relevant datasets are often complicated by various restrictions, among them the need to protect private and sensitive information contained in the dataset from 3rd parties, who still may have valid reasons to use the data. Consider a dataset listing all law enforcement incidents in a metropolitan area, where each row is an incident, and the columns contain related information such as incident location, date and time of 911 call, time of police arrival at the site, etc. There is significant benefit for independent researchers to analyze this type of public safety data for different city neighborhoods to compare police presence and public protection for different areas, etc. However, it is not acceptable to share sensitive information about specific incidents, that might allow the identification of victims, suspects, and other sensitive information about these events.

One way to secure this data is to censor all sensitive information from the dataset before making it publicly available. For example, removing exact locations of accidents, exact times, names of involved people, etc. However, this will significantly reduce the value of the remaining data for analysis. Another approach, called Differential Privacy, allows for a better balance between data privacy and the loss of useful information. The overall idea of synthetic data is the following: using the original dataset, one may generate a synthetic (also called mock, artificial) dataset that describes fictional criminal events, such that collective statistical properties of the synthetic dataset are very similar to those of the original one, and research conclusions from the synthetic dataset hold true for the original one. Differential Privacy is a formal guarantee of privacy that considers the output probability distributions of randomized, privacy-preserving algorithms. Synthetic data generators that satisfy the Differential Privacy guarantee produce data that can be safely released to the public in place of the original sensitive data. We encourage you to read specialized literature and original research on differential privacy.

NIST��s (National Institute of Standards and Technology) PSCR (Public Safety Communications Research) division and Topcoder are running a series of three similar marathon matches, each two-months long, in which you will work to create new, or improve existing, differentially private synthetic data generation algorithms. This is the first marathon match in the series, with a $29k prize pool. The total prize pool of the series is up to $150k which includes: $39k for the second match, $62k for the last match and incentive prizes throughout all three matches. In each marathon match, you will be given an original dataset, for which you will generate a synthetic dataset. We will run various statistical tests on the submitted synthetic datasets, and score them based on the proximity of their statistical properties to those of the original data.

Problem

The competitor pack provides you with an extract of 2017 data from The San Francisco Fire Department��s Call for Service dataset: 305k records and 32 columns of data in CSV format, plus additional JSON files with data dictionary. For convenience of handling, all categorical data values were converted to numerical form. Details are explained in the README.md document, included into the pack.

Your goal is to develop an (��, ��)-differential privacy algorithm. You will run that algorithm on the provided dataset with an arbitrary �� ≤ 1/n^², where n is the number of dataset records, and three values of ��: 10.0; 1.0; and 0.1, to produce three output CSV files. You will call these output files 10_0.csv, 1_0.csv, and 0_1.csv, pack them together into a ZIP archive, and submit the resulting ZIP to the Topcoder system for preliminary scoring, as explained below. For your convenience, the competitor pack includes Python scripts for submission generation, and a naive implementation of a simple differential privacy synthetic data algorithm, as an example.

To ensure that submitted solutions were generated by valid (��, ��)-differential privacy algorithms, and with the specified values of �� and ��, you will be requested to submit a clear and complete write up of your algorithm, along with a clear and correct proof that the algorithm satisfies differential privacy, for a manual pre-screening by a team of experts in differential privacy. Competitors approved during pre-screening will receive ��1000 boost to their further provisional scores, displayed on the challenge leaderboard; i.e. submissions without such approval will have scores within the [0; 1,000] range, and submissions from pre-screened competitors, having ��1000 factor, will score within the [0; 1,000,000] range.

The pre-screening approval should not be considered as an official certification of your algorithm��s correctness of using differential privacy, although the pre-screening team will do their best to ensure it. Algorithms with the potential of winning prizes will be reviewed with greater care at the conclusion of the match. To keep the leaderboard accurate, and ensure fair play during the competition, you may be requested to submit a revised write-up and pass the pre-screening procedure again, to spot check the current revision of your algorithm.

Pre-Screening Procedure

To apply for the pre-screening, fill out this Google Form. The form requests your Topcoder username, your contact email, an archive with the current revision of your algorithm, and a write-up containing a complete and clear algorithm specification and the theoretic proof that the algorithm satisfies differential privacy. Within approximately one week, you will be notified via email about the pre-screening outcome.

Submission of Generated Privatized Datasets into Topcoder System

As explained earlier, your submission into the Topcoder system will be a ZIP archive, containing three files, named 10_0.csv, 1_0.csv, and 0_1.csv, generated with the corresponding values of ��, and �� ≤ 1/n^². Each of these CSV files must be smaller than 100 Mb in size. An auxiliary Python script is included in the competitor pack, to help you with generation and check the submission archive. Once ready, you will upload this ZIP to a file sharing service that supports direct download URLs (for example, you may upload it to Google.Drive, get the sharing URL, and replace its /open segment by /uc to get the direct download URL). The resulting URL should be submitted into the Topcoder system wrapped in a simple Java code, such as this example:

class NistDp1 {
  public String getAnswerUrl() {
    return "https://drive.google.com/uc?id=1MUMH0IZ3-zgfLYloFXvOrhA4EpKgEBrl";
  }
}

You can submit your solutions a maximum of once every four hours.

Provisional Scoring

The intent behind our scoring approach is to quantify the ability of your differential privacy algorithm to conserve clustering characteristics of the dataset being privatized. For that purpose, our scoring method compares 3-columns marginal density distributions between the original and synthetic datasets. Thus, the maximal score would be achieved by submission of the original dataset; of course that would violate differential privacy, and won��t be approved.

Here are the details of the scoring method. A single test works on three dataset columns, picked up randomly. The domain of possible values in those columns is split into buckets. For example, say two columns are categorical with 50 and 200 possible values, and the third column is numerical, with integer values between a and b, and also permits empty values. Our scoring algorithm divides the domain of the numerical column into 100 equal ranges, of size (a - b)/100 each. As the column allows empty values, it adds a ��virtual range�� for its empty values. Then, it creates 50 �� 101 �� 200 buckets, plus one special bucket for records outside of the valid value range; and for both the original and submitted datasets, it counts the number of records falling into each bucket. Resulting counts are divided by the total number of records in each dataset to get the density distribution of records. Then, our scoring algorithm calculates the absolute difference of density distributions for the original and submitted datasets, taking the sum of absolute differences of density values in corresponding pairs of buckets. Due to normalization of density distributions to 1.0, the resulting difference will be a number s, belonging to the range between 0.0 (perfect match of density distributions) and 2.0 (density distributions for the original and synthetic dataset do not overlap at all). The resulting single test score is defined as S = 1 000 000 (1 - s/2).

To calculate the score shown in the provisional leaderboard, we created a set of 30 tests, described above, with randomly picked columns for each test. It is guaranteed, however, that each dataset column is included in at least one of the tests. The scores from separate tests are averaged; and if the submitter has not been approved in the pre-screening procedure, the score is divided by 1 000.

The Competitor pack includes the code for generation of such tests, and subsequent scoring of the synthetic datasets, along with instructions on how to use them for local scoring.

NOTE: It is allowed to process and submit a subset of the dataset, consisting of the first N < 32 consecutive columns. In this case, you will get a zero score for any test relying on the columns outside of the subset you have submitted.

Final Scoring

The first round of the final scoring will consist of testing your synthetic datasets on an independent ground truth, consisting of another set of randomly generated tests that work the same way as described above. The top performing solutions will undergo an additional thorough review of theory and algorithm implementation by the team of differential privacy experts. We will also run those algorithms on the original dataset to verify that their scores were achieved with the correct �� and �� values. In case of a tie between close-performing solutions, the solutions will be tested further with the same methodology, but with more values of (��, ��), and data from other years from the original dataset.

We reserve the right to adjust provisional and final scoring methodologies during the contest, in a way that��s fair for all competitors, in case any flaws are found in the original protocol.

Legal Rules

This marathon match is subject to the NIST official rules. In the event of any discrepancy or inconsistency between the terms and conditions of the contest rules, the terms and conditions of the NIST Official Rules on Challenge.gov as specified within the Challenge Specific Agreement shall control. Notice the following points: teaming is allowed, participation of organizations is allowed. Participants non-eligible to prizes for legal reasons, still may participate if they relinquish from the prizes.

Eligibility

To be eligible for a prize:

Get a prize placement, with a score above 250k.
Provide a write-up including a clear, complete and correct proof that your algorithm satisfies differential privacy.
Meet the eligibility requirements in the NIST PSCR Official Challenge Rules posted on Challenge.gov.
Per the NIST official challenge rule 3B, the official representative (individual or team lead) must be age 18 or older and a U.S. Citizen or permanent resident of the United States or its territories
Satisfy compliance review based on completion of the Certification Process (Match contest rule #8) at each match, and if you are a top scorer, submit a report for Final Review (Match contest rule #9).
Score higher than your competitors!

References

The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth.

Definition

Class:	NistDp1
Method:	getAnswerUrl
Parameters:
Returns:	String
Method signature:	String getAnswerUrl()
(be sure your method is public)

Examples

"E"

Returns: "Seed: E"

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.

NIST: Differential Privacy - NIST Differential Privacy