Daniel Tunkelang is the Chief Data Scientist at LinkedIn and disagrees with my definition of a data scientist, presumably the one from my blog post, “What Is a Data Scientist (and What Isn’t)?”, where I said that “the term ‘Data Scientist’ is an over-reaching title.”
There are two sides to this discussion about data scientists. On the one hand, from a formal standpoint, a scientist must engage in original, reproducible research and publish the results in peer-reviewed journals. By that standard, most of those who are referred to as data scientists are not truly scientists. They may actually be, or have been, scientists in their chosen profession; quite a few of these positions are filled by PhDs in all sorts of areas (including, as LinkedIn’s Monica Rogati points out in “What Is A Data Scientist?”, even a neurosurgeon), but that doesn’t mean that the work they do can be called science. Ms. Rogati mentions that in addition to wrangling data and doing analytics, these professionals conceive things. But engineers and jewelry designers conceive things too, and they aren’t scientists.
In “LinkedIn’s Daniel Tunkelang On ‘What Is a Data Scientist?’”, Tunkelang writes: “I’m a big fan of Hilary Mason, chief scientist at bit.ly, so I’ll cite her definition: a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product. At LinkedIn, products pioneered by data scientists, such as People You May Know, harness the power of data to create value for users.”
This is most likely the emerging definition that others will use, but I see no science here. However, and this is where I’m willing to concede the point, my published definition of data scientist is almost exactly the same as Mason’s, the one Tunkelang quotes. In a research report I recently released at Constellationrg.com, Trends: Analytic Types, Roles and Skills, I defined four types of analytics. Those I would prefer to call Data Scientists fall under Type I. What Daniel and Hilary describe clearly belongs in my category of Type II, but I deferred to the emerging industry conventions:
Type I Analytics: Quantitative Research
The creation of theory and the development of algorithms for all forms of quantitative analysis deserve the title Type I. Quantitative Research analytics are performed by mathematicians, statisticians and other pure quantitative scientists. They discover new ideas and concepts in mathematical terms and develop new algorithms with names like Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions (though not exclusively). Commercial, governmental and other organizations (Wall Street, for example) employ staff with these very advanced skills, but in general, most organizations are able to conduct their necessary analytics without them, or simply employ the results of their research. An obvious example is the FICO score, developed by Quantitative Research experts but employed widely in credit-granting institutions and even human resource organizations.
Type II Analytics: “Data Scientists”
More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, enterprises whose business essentially is data (such as Google, Facebook, LinkedIn) and even the heavy “quants” in industry who apply these methods specifically to the work they do, such as fraud detection, failure analysis and propensity-to-consume models, among hundreds of other examples. They operate in much the same way as commercial software companies, but for just one customer (though they often start their own software companies too). The popular term for this role is “data scientist.”
- “Heavy” Data Scientists. The Type II category could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function – providing guidance and expertise in the application of quantitative analysis – they are differentiated by the sophistication of the techniques applied. II-A practitioners understand the mathematics behind the analytics and may apply very complex tools such as a Lucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by only a small number of practitioners. What differentiates Type II-A from Type I is not necessarily the depth of their knowledge of formal analytical methods (it is not uncommon for Type IIs to have a PhD, for example); it is that they also possess the business domain knowledge they apply, and their goal is to develop specific models for the enterprise, not for the general case as Type Is usually do.
- “Light” Data Scientists. Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, CHAID and various forms of linear regression. They approach the problems they deal with using more conventional best practices and/or packaged analytical solutions from third parties.
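To make the II-A side of the contrast concrete: one of the techniques named above, a Naïve Bayes classifier, can be sketched in a few lines of plain Python. This is purely illustrative; the toy weather data and function names are mine, not from the report.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    """rows: list of categorical feature tuples; labels: class per row."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, position) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(y, i)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, row):
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y, cy in class_counts.items():
        lp = math.log(cy / total)  # class prior
        for i, v in enumerate(row):
            counts = feat_counts[(y, i)]
            # Laplace smoothing so unseen values don't zero out the score
            lp += math.log((counts[v] + 1) / (cy + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Invented toy data: weather features vs. a play/stay decision.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cold")]
labels = ["play", "play", "stay", "stay"]
model = train_nb(rows, labels)
```

The point is not the code itself, but that a II-A practitioner is expected to understand the probabilistic assumptions baked into every line of it, and to know when they break down.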
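By way of contrast for Type II-B, here is a minimal sketch of logistic regression, one of the well-understood techniques listed above, fitted by plain gradient descent. In practice a II-B practitioner would reach for a packaged implementation; the one-feature toy data here is invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(xs, ys, lr=0.1, steps=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction error for this point
            gw += err * x / n
            gb += err / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Invented toy data: outcomes flip from 0 to 1 somewhere near x = 2.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logreg(xs, ys)
```

The technique is conventional and thoroughly documented, which is exactly the distinction being drawn: II-B work applies well-understood methods well, rather than inventing or adapting exotic ones.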
In summary, to keep the terminology straight, I agree with the common definition of data scientist, but maintain the position that the title is a stretch as most of the work done by today’s data scientists is more heavily slanted to the “D” in R&D.