Differential Privacy Synthetic Data Challenge
Propose a mechanism to enable the protection of personally identifiable information while maintaining a dataset's utility for analysis.
If you’re not a differential privacy expert, and you’d like to learn, we’ll have tutorials to help you catch up and compete! Join, learn, and compete for $150,000 in prizes! Check back soon for more details about the challenge.
Why is this match interesting and important?
Are you a mathematician or data scientist interested in a new challenge? Then join this exciting data privacy competition with up to $150,000 in prizes, where participants will create new or improved differentially private synthetic data generation tools. When a data set has important public value, but contains sensitive personal information and can’t be directly shared with the public, privacy-preserving synthetic data tools solve the problem by producing new, artificial data that can serve as a practical replacement for the original sensitive data, with respect to common analytics tasks such as clustering, classification and regression. By mathematically proving that a synthetic data generator satisfies the rigorous Differential Privacy guarantee, we can be confident that the synthetic data it produces won’t contain any information that can be traced back to specific individuals in the original data. The “Differential Privacy Synthetic Data Challenge” will entail a sequence of three marathon matches run on the Topcoder platform, asking contestants to design and implement their own synthetic data generation algorithms, mathematically prove their algorithm satisfies differential privacy, and then enter it to compete against others’ algorithms on empirical accuracy over real data, with the prospect of advancing research in the field of Differential Privacy.
Why Does This Challenge Matter?
The digital revolution has radically changed the way we interact with data. In a pre-digital age, personal data was something that had to be deliberately asked for, stored, and analyzed. The inefficiency of poring over printed or even hand-written data made it difficult and expensive to conduct research. It also acted as a natural barrier that protected personally identifiable information (PII): it was extremely difficult to use a multitude of sources to identify particular individuals included in shared data.
Our increasingly digital world turns almost all our daily activities into data collection opportunities, from the more obvious entry into a webform to connected cars, cell phones, and wearables. Dramatic increases in computing power and innovation over the last decade along with both public and private organizations increasingly automating data collection make it possible to combine and utilize the data from all of these sources to complete valuable research and data analysis.
At the same time, these same increases in computing power and innovation can also be used to the detriment of individuals through linkage attacks: auxiliary, possibly completely unrelated datasets can be combined with records in a dataset that contains sensitive information to uniquely identify individuals.
This valid privacy concern is unfortunately limiting the use of data for research, including datasets within the Public Safety sector that might otherwise be used to improve protection of people and communities. Due to the sensitive nature of information contained in these types of datasets and the risk of linkage attacks, these datasets can’t easily be made available to analysts and researchers. In order to make the best use of data that contains PII, it is important to disassociate the data from PII. There is a utility vs. privacy tradeoff, however: the more a dataset is altered, the less useful the de-identified dataset is likely to be for analysis and research purposes.
Currently popular de-identification techniques are not sufficient. Either PII is not sufficiently protected, or the resulting data no longer represents the original data. Additionally, it is difficult or even impossible to quantify the amount of privacy that is lost with current techniques.
The competition will use a data set of emergency response events occurring in San Francisco and a sub-sample of the IPUMS USA data for the 2016 American Community Survey.
What is this match about?
This competition is about creating new methods, or improving existing methods of data de-identification, in a way that makes de-identification of privacy-sensitive datasets practical. A first phase hosted on HeroX will ask for ideas and concepts, while later phases executed on Topcoder will focus on the performance of developed algorithms.
Learn the Math Behind Differential Privacy
Oct 8 - Oct 15
This week's resource is a video tutorial that provides a basic introduction to epsilon-differential privacy. This is a great starting place if you've never worked with differential privacy before. A couple important caveats, though: This tutorial does not talk about synthetic data generation yet, and it does not cover epsilon-delta differential privacy. We'll get to those topics in future weeks!
Resource: Video Resource
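To make epsilon-differential privacy for a single query concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is not taken from the video tutorial; the function names (`laplace_noise`, `private_count`) and the toy records are hypothetical. It assumes neighboring datasets differ by one record, so a count has sensitivity 1 and the noise scale is 1/epsilon.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return an epsilon-DP estimate of how many records satisfy `predicate`.

    Assumption: adding or removing one record changes the true count by at
    most 1 (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: (age, has_condition).
records = [(34, True), (51, False), (29, True), (62, True), (45, False)]

# NOTE: random.Random is used only for illustration. It is not a
# cryptographically secure source of randomness.
rng = random.Random(0)
noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"noisy count: {noisy:.2f} (true count is 3)")
```

A smaller epsilon means a larger noise scale and stronger privacy. This sketch is for building intuition only: naive floating-point noise sampling is known to be vulnerable, so it should not be used as-is in a real release.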
Oct 16 - Oct 23
Ok! If you followed our newsletter entry last week, you should have some basic idea how epsilon-differential privacy works when you want to privatize query responses. But in a lot of practical applications, we don’t want to just privatize the result of a specific query; we want to privatize a whole data-set. Synthetic data generation involves taking a real data-set, computing a set of statistics or learning a model that describes the data-set, and then using those statistics or that model to generate an entirely new data-set consisting of completely fake people that still preserves the important patterns in the original data-set. If we want to produce synthetic data that satisfies differential privacy, then we need to do our statistics computation or model learning using only differentially private queries. That will ensure our synthetic data set is safe to share with the public. Here is a paper describing one system for generating differentially private synthetic data. Make sure you read carefully through the privacy proof in Section 2.3 and fully understand it.
Resource: Reference Paper
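The pipeline described above (compute statistics with differentially private queries, then generate fake records from them) can be sketched with the simplest possible statistic, a noisy histogram over one categorical attribute. This is not the system from the reference paper. The function name `dp_histogram_synthesizer` and the toy data are hypothetical, and the sketch assumes the category domain is public and that neighboring datasets differ by adding or removing one record, so the histogram has L1 sensitivity 1.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram_synthesizer(records, domain, epsilon, n_synthetic, rng):
    """Generate synthetic categorical records from an epsilon-DP histogram.

    `domain` must be a fixed, public list of categories; deriving it from
    the data would itself leak information.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    # Step 1: true counts over the public domain (out-of-domain values ignored).
    counts = {value: 0 for value in domain}
    for r in records:
        if r in counts:
            counts[r] += 1

    # Step 2: one Laplace draw per bin. Adding or removing a record changes
    # exactly one bin by 1, so scale 1/epsilon covers the whole histogram.
    noisy = [counts[v] + laplace_noise(1.0 / epsilon, rng) for v in domain]

    # Step 3: post-processing (no extra privacy cost): clamp negatives.
    weights = [max(0.0, c) for c in noisy]
    if sum(weights) == 0:
        weights = [1.0] * len(domain)  # fall back to uniform sampling

    # Step 4: sample fake records using only the noisy statistics.
    return rng.choices(domain, weights=weights, k=n_synthetic)


# Hypothetical toy data: one categorical attribute per record.
real = ["fire"] * 60 + ["medical"] * 130 + ["alarm"] * 10
rng = random.Random(0)  # illustration only, not a secure randomness source
synthetic = dp_histogram_synthesizer(
    real, ["fire", "medical", "alarm", "other"], epsilon=1.0, n_synthetic=200, rng=rng
)
print({v: synthetic.count(v) for v in ["fire", "medical", "alarm", "other"]})
```

The sampling step touches only the noisy counts, never the real records, which is why the differential privacy guarantee of the histogram carries over to the synthetic output. Real generators model many attributes jointly, which this one-attribute sketch does not attempt.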
Oct 24 - Oct 31
Welcome back to the NIST Differentially Private Synthetic Data Corner of the Topcoder newsletter! If you’ve been following our previous entries, you’ve learned about writing basic epsilon differentially private queries, and how to build a synthetic data generator using those queries. In this segment, we’ll introduce a slightly more complex privacy standard called epsilon-delta differential privacy. This introduces a new parameter, delta, that relaxes the definition of differential privacy a bit and makes it easier to achieve with less noise and greater accuracy; it lets us ‘cheat’ a bit. This resource describes a set of differentially private learning algorithms that make use of that cheat factor, and satisfy epsilon-delta differential privacy instead of strict epsilon differential privacy. Be sure to read carefully through each of the theorems establishing privacy guarantees, and make sure you can understand them and convince yourself of their validity. You’ll need to be able to write your own correct differential privacy proof to be eligible for a prize in the upcoming contest. Also note that this resource doesn’t explicitly cover synthetic data generation with epsilon-delta differential privacy; we’ll do that next week!
Resource: Reference Paper
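As one illustration of how the delta parameter is used, here is a minimal sketch of the classic Gaussian mechanism, which satisfies epsilon-delta differential privacy when the noise standard deviation is sqrt(2 ln(1.25/delta)) times the L2 sensitivity divided by epsilon, for epsilon strictly between 0 and 1. It is not taken from the referenced resource. The function names (`gaussian_sigma`, `private_mean`) and the toy ages are hypothetical, and the sketch assumes the dataset size is public and neighboring datasets differ by replacing one record.

```python
import math
import random


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale for the classic Gaussian mechanism (valid for 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("this bound requires 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def private_mean(values, lower, upper, epsilon, delta, rng: random.Random) -> float:
    """(epsilon, delta)-DP mean of values clamped to [lower, upper].

    Assumption: len(values) is public, and replacing one record moves the
    clamped mean by at most (upper - lower) / n.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return sum(clamped) / n + rng.gauss(0.0, sigma)


# Hypothetical toy data: ages, clamped to a public range of [0, 100].
ages = [34, 51, 29, 62, 45, 38, 70, 23, 56, 41]
rng = random.Random(0)  # illustration only, not a secure randomness source
estimate = private_mean(ages, 0, 100, epsilon=0.5, delta=1e-5, rng=rng)
print(f"noisy mean age: {estimate:.2f}")
```

Clamping to a public range is what bounds the sensitivity; without it, a single extreme record could shift the mean arbitrarily and no finite noise scale would be enough.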
Oct 31 - Nov 7
This is the fourth and final NIST Differentially Private Synthetic Data newsletter entry! In the previous segments, we showed you how to write basic epsilon differentially private queries, how to build a synthetic data generator using those queries, and how to work with epsilon-delta differential privacy. If you’ve followed all of those, then you’re ready for this last piece: general differentially private synthetic data generation, including epsilon-delta differential privacy. Rather than give you one specific solution to look at, below are links to a couple of resources that will help you compare many different possible solutions. This contest will help us tell which synthetic data generation algorithms do the best job of preserving data patterns needed for analysis tasks like clustering, classification and regression while still provably satisfying either epsilon differential privacy or epsilon-delta differential privacy. As always, while you’re investigating the solutions that are referenced in these links, remember to pay careful attention to any privacy proofs and be sure you have a clear understanding of them. Welcome to the NIST Differentially Private Synthetic Data Challenge!
Resources: Claire Bowen’s Comparative Study; The DPComp Website