@singhns – Narinder Singh
The [topcoder] community has a long track record of solving interesting problems, ranging from energy optimization of the solar collectors on the International Space Station to models that predict atrocities and algorithms that compress the time needed to run genetic simulations, among much else. Recently I sat down with Harvard Business School professor Karim R. Lakhani to discuss how this work relates to the new world being described as data science.
Lakhani also serves as the Principal Investigator of the NASA Tournament Lab (NTL) at Harvard University’s Institute for Quantitative Social Science. In this context, he has worked closely with business and technology leaders on how communities and data can be used in fundamentally new ways in the future. Professor Lakhani and his lab have worked with and researched the [topcoder] community for the last five-plus years.
Narinder Singh: We’re seeing a lot of hype around data science. Can you explain what it is and whether it’s even important?
Karim Lakhani: The amount of data produced in the world continues to increase at an exponential rate as more and more analog transactions are digitized. This dramatically increases the need to sort through this data efficiently, and it allows new categories of questions to be asked and answered. It has taken what used to be specialized, niche areas of problem solving in computer science, statistics, econometrics and visualization and increased the demand for them by orders of magnitude. This has given rise to the term and concept of data science.
An analogy that helps separate the reality from the overused marketing hype is the rise of materials science as a scientific field. Materials science really emerged around the Second World War as a combination of chemistry, physics and metallurgy married with mechanical engineering. These were very different academic disciplines, each with its own approach to dealing with materials. Materials science emerged as a new field with an entirely new set of professionals who weren’t chemists, physicists, metallurgists or mechanical engineers, but a combination of those. It wasn’t simply a regurgitation of these fields but a novel combination of them, applied to new categories of problems.
We can see this throughout history when innovative disciplines are defined in new ways, which then requires a new category to recognize these professions together. That’s where data science is currently. We’re at the beginning stages of defining this field, and many people around the world are trying to figure out what it means. From my point of view, the [topcoder] community plays an important role in this because many of the different disciplines have existed within [topcoder] for a long time. People within the community have been doing algorithmic and statistical work, as well as digitization and computer science, for many years.
NS: From your perspective, what are the constituent parts that make up this recombined field?
KL: The first component is being able to deal with large volumes of both structured and unstructured data. Next, there is an entirely new and evolving range of analytical tools needed to make sense of all of the data, which is often the more computationally driven work. There is also a need for statistical analysis on top of this in order to draw causal inferences from the data, not just correlations and patterns, and the various ways of doing statistical analysis are becoming important in this endeavor. It’s also important to think about how you visualize complex data to make it interactive and to guide people through the process of understanding the data available to them.
NS: Aside from algorithms, how should we think of data science? How is it different?
KL: We are vastly underappreciating the strength and viability of the [topcoder] community if we say that all they do is algorithms. On the one hand, yes, algorithms are what single round matches (SRMs) are geared toward, as a means of providing a repeatable rating of core skills. And this ability to break SRMs down into their component parts and rate various aspects of skills will be important as the field of data science emerges.
But much of the algorithm development in the community revolves around questions of data analysis and where and how it is put to use. For example, in our work with the medical school and [topcoder], we’ve taken really difficult data problems that exist across a range of medical disciplines and have begun creating the algorithms, but the context in which they come together is critical. That’s really the data science work being done. The same goes for NASA and some of the other US federal agencies that we have helped shepherd through the NASA Tournament Lab. There was a desire to take complex, data-intensive problems and have the [topcoder] community create the algorithms and the accompanying software to do the analysis. We were basically prototyping data science work with the community. We never pitched it within NASA or to our NASA colleagues as doing just algorithms for them. Instead, it was solving difficult, data-intensive, computationally challenging problems.
Even what we typically consider an algorithm problem – compress the time it takes to find X – is now critical, because the explosion of data volume makes it more important for every part to operate efficiently. Just as there’s more to development than coding and more to design than wireframing, there is more to data science than just algorithms. It’s the broader context for that work.
NS: It feels like we’ve had a lot of fine-grained definition around what we do in development (in [topcoder] categories). We’ve historically used algorithms as a generic name, but how would you define a better vocabulary around the challenges we run in this area? Should we have additional categorizations (besides algorithms) for data science?
KL: For the algorithms and SRMs, there is room for creating multiple categories within structures to aid in communicating skills and in better understanding what is important and needed. For example, nothing in the external labor market says “I need a marathon match specialist.” Even with the existing information from SRMs and marathon matches that have already run, additional structure and categories can be applied. This could allow finer-grained development and advancement of the skills that constitute various parts of this emerging field, e.g. machine learning and optimization. In addition, there are new categories that could help relate the context of the problems we are trying to solve to the techniques used to solve them, like data visualization.
[topcoder] is successful in helping developers provide information on their skill sets because participation on the platform creates a signal within the external labor market about coders’ skills. That kind of signal will be badly needed here because of the noise around data science. Finer-grained categories will help [topcoder] members explain their skill sets, and they will help the market overall by creating rigor around what data science is and which skills relate to it.
NS: We are seeing a ton of technologies for running large data problems and even the emergence of cognitive learning models like Watson from IBM. What role could [topcoder] play in the advancement of these areas?
KL: Large firms, like IBM with Watson, SAP with HANA and many others, are making core investments in technology. But it is still very early in terms of what the use cases will be and who has the skills to find and execute on them. For example, IBM Watson is a general-purpose tool that can be deployed in many ways. What makes a difference is a labor market that knows how to deploy it and can create use cases that demonstrate its value proposition. The kinds of skills [topcoder] has highlighted are at the core of unleashing that potential.
NS: Could [topcoder] SRMs be the next Jeopardy challenge for Watson?
KL: It’s certainly a natural progression, but the competition for Watson will be tougher at [topcoder]! But seriously, the best outcome will be when the skills of the [topcoder] community can be used to make Watson and technologies like it live up to their full potential.
NS: Do you have any other thoughts as we wrap up?
KL: One final point: it’s clear that the world is undergoing a secular shift, with more and more analog transactions being converted to digital ones. The skill sets that [topcoder] has across the board are, and will continue to be, in high demand. This is not just in Silicon Valley; it’s in all industries. There is a need for people who can help organizations make sense of the data generated by their transactions, and it’s very important to realize that there is a talent base within [topcoder] that can help solve anyone’s problems.