Challenge Overview
Problem Statement  
NIST: Differential Privacy Synthetic Data ChallengePrizes
1st Place: $10,000 Progressive prizes: 4 x $1,000 awarded to the top four placements midway through the challenge (23:55 EDT, November 15, 2018). For the full breakdown of each match period see the information provided on Challenge.gov By registering for and competing in this challenge you consent to sharing your contact information with NIST, and agree to be contacted by NIST, in the event you are entitled to a payment on this challenge as payments will be issued by NIST and not Topcoder. This match is TCO19 eligible. Everybody who scores above 250k in the system test will get TCO19 points according to these rules. IntroductionDetailed and accurate data builds the foundation for academic research, informed decision making, and a large diversity of smart software systems. Collection and sharing of relevant datasets are often complicated by various restrictions, among them the need to protect private and sensitive information contained in the dataset from 3rd parties, who still may have valid reasons to use the data. Consider a dataset listing all law enforcement incidents in a metropolitan area, where each row is an incident, and the columns contain related information such as incident location, date and time of 911 call, time of police arrival at the site, etc. There is significant benefit for independent researchers to analyze this type of public safety data for different city neighborhoods to compare police presence and public protection for different areas, etc. However, it is not acceptable to share sensitive information about specific incidents, that might allow the identification of victims, suspects, and other sensitive information about these events. One way to secure this data is to censor all sensitive information from the dataset before making it publicly available. For example, removing exact locations of accidents, exact times, names of involved people, etc. However, this will significantly reduce the value of the remaining data for analysis. Another approach, called Differential Privacy, allows for a better balance between data privacy and the loss of useful information. The overall idea of synthetic data is the following: using the original dataset, one may generate a synthetic (also called mock, artificial) dataset that describes fictional criminal events, such that collective statistical properties of the synthetic dataset are very similar to those of the original one, and research conclusions from the synthetic dataset hold true for the original one. Differential Privacy is a formal guarantee of privacy that considers the output probability distributions of randomized, privacypreserving algorithms. Synthetic data generators that satisfy the Differential Privacy guarantee produce data that can be safely released to the public in place of the original sensitive data. We encourage you to read specialized literature and original research on differential privacy. NIST���s (National Institute of Standards and Technology) PSCR (Public Safety Communications Research) division and Topcoder are running a series of three similar marathon matches, each twomonths long, in which you will work to create new, or improve existing, differentially private synthetic data generation algorithms. This is the first marathon match in the series, with a $29k prize pool. The total prize pool of the series is up to $150k which includes: $39k for the second match, $62k for the last match and incentive prizes throughout all three matches. In each marathon match, you will be given an original dataset, for which you will generate a synthetic dataset. We will run various statistical tests on the submitted synthetic datasets, and score them based on the proximity of their statistical properties to those of the original data. ProblemThe competitor pack provides you with an extract of 2017 data from The San Francisco Fire Department���s Call for Service dataset: 305k records and 32 columns of data in CSV format, plus additional JSON files with data dictionary. For convenience of handling, all categorical data values were converted to numerical form. Details are explained in the README.md document, included into the pack. Your goal is to develop an (��, ��)differential privacy algorithm. You will run that algorithm on the provided dataset with an arbitrary �� ≤ 1/n^^{2}, where n is the number of dataset records, and three values of ��: 10.0; 1.0; and 0.1, to produce three output CSV files. You will call these output files 10_0.csv, 1_0.csv, and 0_1.csv, pack them together into a ZIP archive, and submit the resulting ZIP to the Topcoder system for preliminary scoring, as explained below. For your convenience, the competitor pack includes Python scripts for submission generation, and a naive implementation of a simple differential privacy synthetic data algorithm, as an example. To ensure that submitted solutions were generated by valid (��, ��)differential privacy algorithms, and with the specified values of �� and ��, you will be requested to submit a clear and complete write up of your algorithm, along with a clear and correct proof that the algorithm satisfies differential privacy, for a manual prescreening by a team of experts in differential privacy. Competitors approved during prescreening will receive ���1000 boost to their further provisional scores, displayed on the challenge leaderboard; i.e. submissions without such approval will have scores within the [0; 1,000] range, and submissions from prescreened competitors, having ���1000 factor, will score within the [0; 1,000,000] range. The prescreening approval should not be considered as an official certification of your algorithm���s correctness of using differential privacy, although the prescreening team will do their best to ensure it. Algorithms with the potential of winning prizes will be reviewed with greater care at the conclusion of the match. To keep the leaderboard accurate, and ensure fair play during the competition, you may be requested to submit a revised writeup and pass the prescreening procedure again, to spot check the current revision of your algorithm. PreScreening ProcedureTo apply for the prescreening, fill out this Google Form. The form requests your Topcoder username, your contact email, an archive with the current revision of your algorithm, and a writeup containing a complete and clear algorithm specification and the theoretic proof that the algorithm satisfies differential privacy. Within approximately one week, you will be notified via email about the prescreening outcome. Submission of Generated Privatized Datasets into Topcoder SystemAs explained earlier, your submission into the Topcoder system will be a ZIP archive, containing three files, named 10_0.csv, 1_0.csv, and 0_1.csv, generated with the corresponding values of ��, and �� ≤ 1/n^^{2}. Each of these CSV files must be smaller than 100 Mb in size. An auxiliary Python script is included in the competitor pack, to help you with generation and check the submission archive. Once ready, you will upload this ZIP to a file sharing service that supports direct download URLs (for example, you may upload it to Google.Drive, get the sharing URL, and replace its /open segment by /uc to get the direct download URL). The resulting URL should be submitted into the Topcoder system wrapped in a simple Java code, such as this example: class NistDp1 { public String getAnswerUrl() { return "https://drive.google.com/uc?id=1MUMH0IZ3zgfLYloFXvOrhA4EpKgEBrl"; } } You can submit your solutions a maximum of once every four hours. Provisional ScoringThe intent behind our scoring approach is to quantify the ability of your differential privacy algorithm to conserve clustering characteristics of the dataset being privatized. For that purpose, our scoring method compares 3columns marginal density distributions between the original and synthetic datasets. Thus, the maximal score would be achieved by submission of the original dataset; of course that would violate differential privacy, and won���t be approved. Here are the details of the scoring method. A single test works on three dataset columns, picked up randomly. The domain of possible values in those columns is split into buckets. For example, say two columns are categorical with 50 and 200 possible values, and the third column is numerical, with integer values between a and b, and also permits empty values. Our scoring algorithm divides the domain of the numerical column into 100 equal ranges, of size (a  b)/100 each. As the column allows empty values, it adds a ���virtual range��� for its empty values. Then, it creates 50 ��� 101 ��� 200 buckets, plus one special bucket for records outside of the valid value range; and for both the original and submitted datasets, it counts the number of records falling into each bucket. Resulting counts are divided by the total number of records in each dataset to get the density distribution of records. Then, our scoring algorithm calculates the absolute difference of density distributions for the original and submitted datasets, taking the sum of absolute differences of density values in corresponding pairs of buckets. Due to normalization of density distributions to 1.0, the resulting difference will be a number s, belonging to the range between 0.0 (perfect match of density distributions) and 2.0 (density distributions for the original and synthetic dataset do not overlap at all). The resulting single test score is defined as S = 1 000 000 (1  s/2). To calculate the score shown in the provisional leaderboard, we created a set of 30 tests, described above, with randomly picked columns for each test. It is guaranteed, however, that each dataset column is included in at least one of the tests. The scores from separate tests are averaged; and if the submitter has not been approved in the prescreening procedure, the score is divided by 1 000. The Competitor pack includes the code for generation of such tests, and subsequent scoring of the synthetic datasets, along with instructions on how to use them for local scoring. NOTE: It is allowed to process and submit a subset of the dataset, consisting of the first N < 32 consecutive columns. In this case, you will get a zero score for any test relying on the columns outside of the subset you have submitted. Final ScoringThe first round of the final scoring will consist of testing your synthetic datasets on an independent ground truth, consisting of another set of randomly generated tests that work the same way as described above. The top performing solutions will undergo an additional thorough review of theory and algorithm implementation by the team of differential privacy experts. We will also run those algorithms on the original dataset to verify that their scores were achieved with the correct �� and �� values. In case of a tie between closeperforming solutions, the solutions will be tested further with the same methodology, but with more values of (��, ��), and data from other years from the original dataset. We reserve the right to adjust provisional and final scoring methodologies during the contest, in a way that���s fair for all competitors, in case any flaws are found in the original protocol. Legal RulesThis marathon match is subject to the NIST official rules. In the event of any discrepancy or inconsistency between the terms and conditions of the contest rules, the terms and conditions of the NIST Official Rules on Challenge.gov as specified within the Challenge Specific Agreement shall control. Notice the following points: teaming is allowed, participation of organizations is allowed. Participants noneligible to prizes for legal reasons, still may participate if they relinquish from the prizes. EligibilityTo be eligible for a prize:
References
 
Definition  
 
Examples  
0)  

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.