October 30, 2018
Layers of Data Protection at Topcoder
Universal internet access and the rise of the gig economy are delivering on the promise of as-needed, when-needed expert workforces. The benefits of these workforces are increasingly compelling; customers can expect their production capacity to flex beyond core teams with their real-time demand for it, and can access hard-to-hire skills instantly and as needed, instead of grappling with a job market that is increasingly difficult to access. Talented people now enjoy the ability to cast their skills out anywhere in the world and earn from them, without feeling compelled to leave their family and community. Accordingly, Gartner, Forrester, and McKinsey have all recently tracked a crowdsourcing workforce on the rise.
But with the power of a thousand minds on tap comes the risk of sharing your data and work with countless strangers. Each of these workers may see a slice of your data or strategic intentions. Concerns over IP sharing, IP contamination, disclosure, and privacy naturally follow. Our crowdsourcing platform was founded in 2001 and has dealt with these concerns every day since. Both the tools that we use and the methods we employ to control these risks change year by year as new tools emerge and ways of doing business change. We answer questions about these methods in every Q&A, and every deal cycle.
As our Global Director of Crowd Analytics & AI, I thought it would be helpful to share the latest basics on how Topcoder mitigates these concerns today with seven distinct layers of security.
Layer #1: Agreements
It begins with agreements. When you’re a customer of Wipro and Topcoder, we sign an agreement with you that sets the rules for what we can and can’t disclose, as well as the process for disclosing it — exactly like any other prudent commercial transaction. These terms are typically handled in the MSA, and more stringent requirements can be layered on top when needed, SOW by SOW. For projects that require them, our contestants digitally sign NDAs as a condition of access to the challenge. There’s sometimes a misconception that crowdsourcing is unique in this regard. In reality, customers experience a commercial relationship with us, complete with standard NDAs and contract terms.
But this is only the beginning. From here, we have a number of practical steps we take to reduce risk and protect privacy — some of which may be considerably more stringent than protections you have in place today with traditional vendors.
Layer #2: Atomization
Topcoder handles projects according to the skill types required, through a process called atomization. We take the project you’d like to build and break it down into bite-sized segments, which become separate challenges (e.g., app design or coding) that we run through our community. This is the approach that makes our version of crowdsourcing lasting and efficient. We don’t require Oracle DBAs to become front-end design/build unicorns. It is iterative and serves our customers well in both the waterfall and DevOps worlds.
While this process was designed to allow us to control time and delivery, atomization also adds obvious protection. Think of it like this: members of our global crowd don’t get to work on Voltron as a whole; they work on a single robot lion (or limb) at a time. Workers won’t know there are other lion robots that assemble into Voltron unless you want them to. Atomization drastically and naturally reduces the number of people who see your entire project, which is already more protection than a traditional contractor engagement typically provides.
Layer #3: Pseudonym
We don’t disclose the identity of our customer to the Topcoder Community. We assign a pseudonym instead. Generally, it’s the same pseudonym across all projects for any one customer, but we will also assign them project by project, or even component by component, as required for the project’s security goals. So our members may not realize that two projects they’re working on are even for the same customer.
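As a rough illustration of how pseudonyms at different granularities could be derived, here is a minimal sketch. It is not a description of Topcoder's actual system: the `SECRET_KEY`, the `CODE_NAMES` list, and the `pseudonym` function are all hypothetical names invented for this example. The idea is that a keyed hash of the customer (optionally plus project and component) yields a stable code name that cannot be reversed without the key.

```python
import hashlib
import hmac

# Hypothetical secret held only by the platform operator (assumption).
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical word list used to build readable code names (assumption).
CODE_NAMES = ["Aurora", "Basalt", "Cobalt", "Delta",
              "Ember", "Falcon", "Granite", "Harbor"]


def pseudonym(customer, project=None, component=None):
    """Derive a stable code name at the requested granularity.

    Passing only `customer` gives one pseudonym per customer; adding
    `project` and `component` gives finer-grained pseudonyms.
    """
    scope = "/".join(part for part in (customer, project, component) if part)
    digest = hmac.new(SECRET_KEY, scope.encode("utf-8"), hashlib.sha256).hexdigest()
    name = CODE_NAMES[int(digest[:8], 16) % len(CODE_NAMES)]
    return f"{name}-{digest[8:14]}"


per_customer = pseudonym("Acme Corp")
per_project_a = pseudonym("Acme Corp", "billing")
per_project_b = pseudonym("Acme Corp", "mobile-app")
```

With per-project pseudonyms, `per_project_a` and `per_project_b` give a member no visible hint that both projects belong to the same customer, while the operator can always recompute the mapping from the key.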
Layer #4: Obfuscation of data
Obfuscation is an important, very complex topic. It is a best-practice-driven scrubbing of personally identifiable information (PII) and other sensitive information, done to mask that data and reduce or eliminate the likelihood that a worker can correlate it with anything else in the field, or even work out who it’s for. Here are a couple of illustrative examples:
- In healthcare, this is all about patient privacy; don’t disclose a name, birthdate, patient identifier, etc. It also includes attention paid to preventing “triangulation,” or an ability for a worker to derive the identity or disease of a patient by comparing the scrubbed data to public data sources (e.g., questions on Quora).
- In finance, PII still plays a role, but the issue is more about disclosing information about trades, holdings, intentions, how to gauge risk, etc.
Obfuscation is always a partnership exercise: either the data is treated before it’s handed to us, or Wipro and Topcoder work with the customer to prepare it. We have adopted and developed several approaches for obfuscation. They range from simple scrambling of PII or key identifiers (e.g., product codes or warehouse IDs) to statistically rigorous replication of reference data to create a fabricated though relevant data set. (We recently completed a project for the Department of Veterans Affairs specifically to produce a tool to generate such data.)
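The simplest end of that range, scrambling key identifiers, might look something like the sketch below. This is an assumption-laden illustration rather than our production tooling: the `Scrambler` class and the field names (`name`, `birthdate`, `patient_id`, `warehouse_id`) are hypothetical. Direct identifiers are dropped, and each key identifier is swapped for a random token that stays consistent across records so the data can still be joined and grouped.

```python
import secrets

# Hypothetical field lists; a real project would agree these with the customer.
DROP_FIELDS = {"name", "birthdate"}
SCRAMBLE_FIELDS = {"patient_id", "warehouse_id"}


class Scrambler:
    """Replace identifiers with random tokens, consistently per (field, value)."""

    def __init__(self):
        # Lookup table from original values to tokens. Whoever holds this
        # table (for example, the customer) can map results back afterwards.
        self._table = {}

    def token(self, field, value):
        key = (field, value)
        if key not in self._table:
            self._table[key] = secrets.token_hex(6)
        return self._table[key]

    def obfuscate(self, record):
        out = {}
        for field, value in record.items():
            if field in DROP_FIELDS:
                continue  # direct identifiers are removed outright
            if field in SCRAMBLE_FIELDS:
                out[field] = self.token(field, value)
            else:
                out[field] = value
        return out


records = [
    {"name": "Jane Doe", "birthdate": "1980-02-03", "patient_id": "P-1001", "visit_cost": 120.0},
    {"name": "Jane Doe", "birthdate": "1980-02-03", "patient_id": "P-1001", "visit_cost": 80.0},
]
scrambler = Scrambler()
scrubbed = [scrambler.obfuscate(r) for r in records]
```

Note that scrambling alone does nothing about triangulation: the remaining columns can still be compared against public sources, which is why the more rigorous approaches exist.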
Layer #5: Metaphors
A metaphor transposes the domain. Metaphors have long played a role in gamification (see FoldIT and Play To Cure for examples) and in abstracting the problem domain from the solution in order to find new approaches. They also help in protection. We’ll apply metaphors when even the basic project domain or purpose shouldn’t be exposed.
Say a mining company doesn’t want to disclose the locations of their mines, or that they’re specifically excavating for gold. To the extent that any position data is needed, Topcoder preserves relative but not exact spatial relationships while moving the scene to another continent or even planet, and might present the problem as a widget manufacturer instead. This way, we further distance the data and topic from its presentation to competitors on our platform. Together, Wipro and Topcoder then go on to unwind those metaphors when we return results to clients.
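One way to read "preserve relative but not exact spatial relationships" is a secret rotation plus translation of the coordinates, which keeps every distance between sites intact while hiding where they actually are. The sketch below assumes flat planar coordinates in kilometres, and the `make_transform` helper, the angle, and the offsets are all hypothetical. A real project might also rescale or use a different transformation entirely.

```python
import math


def make_transform(angle_rad, dx, dy):
    """Build a rotate-then-translate mapping and its inverse.

    The parameters act as the secret: without them the original
    locations cannot be recovered from the disguised ones.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)

    def forward(point):
        x, y = point
        return (cos_a * x - sin_a * y + dx, sin_a * x + cos_a * y + dy)

    def inverse(point):
        # Undo the translation, then rotate by the opposite angle.
        x, y = point[0] - dx, point[1] - dy
        return (cos_a * x + sin_a * y, -sin_a * x + cos_a * y)

    return forward, inverse


# Hypothetical site positions in a local planar grid (km).
sites = [(0.0, 0.0), (3.0, 4.0), (10.0, -2.0)]

forward, inverse = make_transform(angle_rad=1.1, dx=5000.0, dy=-1200.0)
disguised = [forward(p) for p in sites]   # what the crowd would see
restored = [inverse(q) for q in disguised]  # the "unwinding" step
```

Solutions computed on `disguised` (routes, clusters, distances) can be mapped back with `inverse` when results are returned to the client.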
Layer #6: Direct reviews and direct testing
Our review process uses a two-pronged approach. One prong is direct, manual review performed by no fewer than two expert reviewers in our community — members who’ve proven to be not only technical masters, but also trustworthy on our platform. For critical code reviews, they inspect code line by line and complete lengthy scorecards, searching for best practices and security flaws. (Reviewers are unable to see the identity of the submitters.) A contestant must first get past those sentinels if they want a chance at victory.
The second prong is technology. When necessary, we also run the code through static application security testing (SAST) tools, as well as IP detection platforms. Mike Morris, our CEO, has written on this subject, making the case that crowdsourcing is more secure than traditional means of development.
Layer #7: Ring-fenced crowds
There are times when the project or data simply cannot be shared in any form with the public crowd. Fortunately, this is quite rare. But in those cases, we are able to develop a sub-crowd to work on projects. We will first qualify workers who are interested, available, and capable of working on the given solution. These workers are then asked to complete additional paperwork; past examples include data use agreements, network use agreements, even background checks. The worker pool may even include consultants from Wipro or the client’s other trusted vendors.
Only after these paperwork conditions (and, if necessary, location-specific ones) are met are project details shared. When necessary, we can also set up virtual private clouds with I/O restrictions for their use. But it’s worth noting that this is a last-resort step; reducing the size of the addressable workforce always has an impact.
Up next at Topcoder: Differential Privacy
If concerns about the risks of data sharing are on the rise, so, fortunately, are methods for dealing with them. One promising obfuscation technique on the ascent is called Differential Privacy (“DP”). DP seeks to replicate important data in a manner that breaks the ability to triangulate data back to reality while preserving key relationships.
To illustrate the point: imagine being able to replicate a data set of disease patients in an entire state in a manner where hundreds of data scientists can perform tests to seek precursor signals, without the risk that some bad actor can figure out patient or provider identities. Through our innovation contract with NASA and in partnership with NIST, Topcoder will be hosting a Differential Privacy challenge this November and is exploring methods to fold these techniques into our standard practice. If you’re a data scientist and would like a chance to contribute to the solution, click here to stay in the loop and join the contest!
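For readers new to the idea, the classic building block of differential privacy is the Laplace mechanism: answer a query with calibrated random noise so that any one person's presence in the data barely changes the result. The sketch below shows it for a simple count. It is not the synthetic-data approach the challenge is aimed at, and the `private_count` function, the toy `patients` list, and the choice of `epsilon` are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy data, invented for this example.
patients = [{"age": a} for a in (34, 51, 67, 72, 45, 80)]

# Seeded generator for repeatability here; a real deployment would need a
# cryptographically secure noise source and careful floating-point handling.
rng = random.Random(0)
noisy = private_count(patients, lambda r: r["age"] >= 65, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means more accurate answers. Each query spends some of the privacy budget, which is part of what makes applying the technique at scale a research problem.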
Keep up with security advancements from Topcoder
As any security professional will tell you, privacy and security protection is a dynamic field that requires constant diligence. Rest assured this isn’t the last post on this topic. Our methods of protecting our clients, our members, and ourselves are always evolving. (So be on the lookout for a post from our VP of Security on platform and community security.)
Click here to subscribe to the weekly Topcoder data science newsletter to learn more about differential privacy in advance of our upcoming challenge — launching in November.
Global Director, Data Science, Analytics & AI