The National Institute of Standards and Technology Public Safety Communications Research Division recently joined forces with Topcoder and HeroX to develop new methods and improve existing methods of data de-identification for the security, safety, and improved analysis of datasets. During this competition, Topcoder was able to leverage an incredible pool of talent to deliver profound results to NIST.
In particular, a Topcoder Copilot, birdofpreyru (aka Sergey), defined new techniques that furthered research and development into sophisticated data de-identification methods. He also helped NIST measure the value of their Topcoder solution.
Today, we’re going to talk about how Topcoder was able to connect NIST with critical talent like birdofpreyru, who helped innovate for both the customer and Topcoder. This incredible community dynamic helps us provide continued success for clients across industry verticals and research groups.
Crowdsourcing is more than just an anonymous pool of talent. There are intelligent, driven, inspired individuals who make these projects work. And Topcoder brings them at scale (+1 million members and counting!). This is a story of how a member of Topcoder helped redefine differential privacy for both NIST and the entire differential privacy community.
Understanding Differential Privacy
The world runs on data. Data fuels critical business decisions, drives marketing campaigns, and is utilized in the development of protocols and products. And, data helps us declutter the world and create meaning out of the chaos. But, data is only useful when you can analyze it.
Here’s the problem: large data lakes filled with datasets from various sources can create security risks. Linkage attacks (attacks that combine these anonymous datasets with existing records to identify individuals) are a growing problem in the security space. For example, Netflix published movie ratings from roughly 500,000 customers in 2006. Researchers found that they could re-identify some of those individuals by cross-referencing public reviews on IMDb.
This problem is magnified when we talk about public safety. Researchers working with 1990 U.S. census data estimated that 87% of the U.S. population could be uniquely identified from just ZIP code, gender, and date of birth. This creates barriers for researchers trying to identify critical trends in public safety. If data can be de-anonymized, threat vectors are opened, which puts every individual with data in the pool at risk.
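To make the idea of a linkage attack concrete, here is a minimal Python sketch. Everything in it is hypothetical: the records, the names, and the choice of ZIP code, birth year, and sex as the linking fields are invented for illustration and do not come from any real dataset.

```python
# Toy linkage attack: join an "anonymized" dataset with a public one
# on shared quasi-identifiers. All records below are invented.

anonymized = [
    {"zip": "94103", "birth_year": 1985, "sex": "F", "rating": "5 stars"},
    {"zip": "94110", "birth_year": 1972, "sex": "M", "rating": "2 stars"},
    {"zip": "94103", "birth_year": 1990, "sex": "M", "rating": "4 stars"},
]

public = [
    {"name": "Alice Example", "zip": "94103", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1972, "sex": "M"},
    {"name": "Carol Example", "zip": "94117", "birth_year": 1985, "sex": "F"},
]

# Assumed linking fields; a real attacker uses whatever columns overlap.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymized_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, rating) pairs where the key combination is unique."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anonymized_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate means re-identification
            matches.append((names[0], row["rating"]))
    return matches


reidentified = link(anonymized, public)
print(reidentified)
```

No names appear in the "anonymized" records, yet two of the three rows are tied back to a named person because their combination of ordinary attributes is unique.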
Want to learn more about differential privacy? Check out this Wired story.
The NIST Differential Privacy Synthetic Data Challenge
For NIST, differential privacy is a big deal. The ability to work with large public safety datasets to garner actionable insights that guide policy and protection is a core function of NIST. Unfortunately, those capabilities are blocked by this de-identification conundrum.
To help solve this issue, Topcoder, NIST, and the Laboratory for Innovation Science at Harvard joined forces to create algorithms that address this privacy problem. These synthetic data generation algorithms produce synthetic data that can replace the original data for the purposes of data analysis, thereby getting around the de-identification barrier.
To do this, Topcoder members were given a data set of emergency response events that occurred in San Francisco, as well as a sub-sample of past U.S. census data. The goal: develop an algorithm that could replace data within these sets with synthetic data that would still provide value for analysis without subjecting the datasets to possible linkage attacks.
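For a rough feel of what such an algorithm can look like, here is a minimal Python sketch of one textbook approach: add Laplace noise to a histogram of the original values, then sample synthetic records from the noisy histogram. This is not one of the challenge solutions. The function name `dp_synthetic`, the single-column setup, and the toy call types are all assumptions made for illustration.

```python
import random
from collections import Counter


def dp_synthetic(records, domain, epsilon, n_out, rng):
    """Sample n_out synthetic values from a Laplace-noised histogram.

    domain must be fixed in advance (not derived from the private data),
    and n_out is chosen by the caller rather than read from the data.
    """
    counts = Counter(records)
    scale = 1.0 / epsilon  # adding or removing one record changes one count by 1

    noisy = []
    for value in domain:
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(max(counts[value] + noise, 0.0))  # clamp negative counts

    if sum(noisy) == 0:  # degenerate case: fall back to a uniform distribution
        noisy = [1.0] * len(domain)

    # Sampling only touches the noisy counts, never the original records.
    return rng.choices(domain, weights=noisy, k=n_out)


# Hypothetical emergency call types, invented for this example.
domain = ["fire", "medical", "alarm"]
original = ["fire"] * 50 + ["medical"] * 30
synthetic = dp_synthetic(original, domain, epsilon=1.0, n_out=100,
                         rng=random.Random(0))
print(Counter(synthetic))
```

A smaller `epsilon` means more noise and stronger privacy, at the cost of synthetic data that tracks the original less closely. Real multi-column datasets need far more care than this single-column sketch suggests.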
How Topcoder Helped NIST Find the Talent They Needed
When governments or private/public entities need a set of advanced algorithms that can help solve a real-world problem at scale, finding the appropriate talent to create, test, and produce these algorithms can be difficult. These are cutting-edge security algorithms that attempt to deconstruct a complex privacy and security problem using large sets of data, so finding global support is a critical component of creating a compelling and functional algorithm.
To connect NIST to a global set of intelligent coders, Topcoder leveraged a crowdsourcing challenge — which provides financial rewards for achievements gained towards the ultimate goal of creating a functional synthetic data generation algorithm.
The Topcoder Copilots
At Topcoder, we leverage crowdsourcing dynamically. One way that we provide ongoing value is through our Copilots. These are the people that assist in setting up and running challenges, which often involves creating complex measurements, assisting in research and development, and defining the scope and problem definitions.
In our Data Science practice, before complex problems are solved by the rest of our community, these Copilots get busy developing unique solutions that ensure that each client that comes to Topcoder can accurately measure the value of the solution being delivered.
One of the Topcoder Copilots, birdofpreyru (or Sergey), helped solve a fundamental problem in the construction of synthetic data generation algorithms: ensuring that the synthetic data conserved the clustering characteristics of the original data, and that challenge submissions could be measured accurately against the parameters NIST needed.
At Topcoder, we have a wealth of hyper-talented coders, QA testers, and data scientists that can be leveraged through our advanced crowdsourcing system to provide value across a wealth of app development, design, and data science needs. But, there’s a face behind every one of our community members. Each person on our platform has been heavily vetted, and they each have a unique combination of skills and capabilities that make them valuable to both us and our clients.
While our community is responsible for generating success through challenges and cutting-edge initiatives, we have another group of individuals, Copilots, who help define tasks and goals and create the systems under which these challenges can run successfully.
Birdofpreyru is one of those Topcoder Copilots, and he helps set up and run complex data challenges like this one. While birdofpreyru didn’t actually compete in the synthetic data generation challenge, he is responsible for creating the circumstances under which the challenge was able to run successfully.
Before the NIST Differential Privacy Synthetic Data Challenge, birdofpreyru combined cutting-edge research with intelligent coding and analysis to provide something incredible: a way to check that synthetic data conserves its clustering characteristics. A principal problem with synthetic data algorithms is ensuring that clustering characteristics are the same as in the original data set. In other words, how can you tell if the synthetic data is valuable for analysis?
Birdofpreyru’s approach was to randomly pick 3 columns of the synthetic data and compare them to the same columns of the original data set to check that clustering characteristics were conserved.
This is a huge breakthrough for two reasons.
- Emerging research has suggested that 3 columns are sufficient for checking the clustering characteristics of synthetic data, but case studies are virtually non-existent. This means that birdofpreyru not only delved into recently emerging research in a niche field, but also showed that 3 columns were sufficient for this purpose and applied that to a real-world algorithm and scenario.
- He increased the speed at which data set characteristics were verified by checking the minimum number of columns needed to ensure the quality of the data. This helped Topcoder provide faster resolutions for NIST, and it provided NIST (and the entire differential privacy community) with a credible and valuable case study for synthetic data clustering conservation.
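The 3-column comparison can be sketched in a few lines of Python. The exact metric used in the challenge is not spelled out here, so treat this as an assumption-laden illustration: it picks 3 random columns, computes how often each combination of values occurs in the original and in the synthetic data, and sums the absolute differences. The function names, the column names, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def joint_density(rows, cols):
    """Fraction of rows (must be non-empty) holding each value combination."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return {combo: n / len(rows) for combo, n in counts.items()}


def three_column_difference(original, synthetic, columns, rng):
    """Pick 3 random columns and compare their joint densities.

    Returns 0.0 when the densities are identical and 2.0 when the two
    data sets share no value combination at all.
    """
    cols = rng.sample(columns, 3)
    p = joint_density(original, cols)
    q = joint_density(synthetic, cols)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def average_difference(original, synthetic, columns, trials=20, seed=0):
    """Average the 3-column difference over several random column picks."""
    rng = random.Random(seed)
    total = sum(three_column_difference(original, synthetic, columns, rng)
                for _ in range(trials))
    return total / trials


# Invented toy rows; column names are placeholders.
COLUMNS = ["call_type", "district", "hour", "priority"]
original = [
    {"call_type": "fire", "district": "A", "hour": 9, "priority": 1},
    {"call_type": "medical", "district": "B", "hour": 14, "priority": 2},
]
synthetic = [
    {"call_type": "fire", "district": "A", "hour": 9, "priority": 1},
    {"call_type": "alarm", "district": "C", "hour": 22, "priority": 3},
]

print(average_difference(original, original, COLUMNS))   # identical data scores 0.0
print(average_difference(original, synthetic, COLUMNS))
```

Because each trial looks at only 3 columns, the check stays cheap even for wide data sets, which is the speed benefit described above. Lower scores mean the synthetic data tracks the original more closely.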
Birdofpreyru highlights the value of crowdsourcing. He’s talented, smart, and willing to go the extra mile for each client. Crowdsourcing may not include on-site teams, but that doesn’t mean the crowdsourcing community isn’t delving into cutting-edge techniques and problem-solving methods to ensure that clients receive incredible value at scale.
Each Topcoder experience provides layers of value. Whether you need scope definition, solution creation, R&D assistance, or problem definition, Topcoder provides remarkably deep crowdsourcing to meet every client’s needs.
The value in crowdsourcing goes beyond the global reach and scale. There’s virtually no other method of development, testing, QA, etc. that gives you access to the breadth of talent that crowdsourcing does. Finding and hiring a person who understands emerging research in differential privacy like birdofpreyru is nearly impossible for the majority of businesses or research centers.
When you need a complex problem solved fast, Topcoder’s pool of talent will be here waiting. And, each of them has a particular set of talents that makes them invaluable. With +1.5 million members and counting, Topcoder can help your business find incredibly talented coders, testers, and data scientists. And, we pair them with dedicated top-level support, so that you can take your project to the next level.
Talent, scale, speed, and value: those are the core tenets of Topcoder. Is your business ready to scale its next project? Contact us! We can help.
Don’t believe us? Check out The Forrester Total Economic Impact™ of Topcoder — which shows how Topcoder can boost your ROI by 113% on large enterprise crowdsourcing programs.