Understanding Analytics Types and Needs

By Neil Raden, January 2013

Purpose and Intent

“Analytics” is a critical component of enterprise architecture capabilities, though most organizations have only recently begun to develop experience using quantitative methods. As Information Technology emerges from a scarcity-based mentality of constrained and costly resources to a commodity consumption model of data, processors and tools, analytics is quickly becoming table stakes for competition.

This report is the first of a two-part series. (Part II will cover analytic functionality and matching the right technology to the proper analytic tools and best practices.) It discusses the importance of understanding the role of analytics, why it is a difficult topic for many, and what actions you should take. It explores the various meanings of analytics and provides a framework for aligning the various types of analytics with the roles and skill sets they require.

Executive Summary

Using quantitative methods is rapidly becoming not an optional source of competitive advantage but, at the very least, the minimum needed to keep up. Everyone needs to understand what analytics involves, what your particular organization needs and how to do it.

Few people are comfortable with the concepts of advanced analytic methods. In fact, most people cannot explain the difference between a mean, a median and a sample mean. The misapplication of statistics is widespread, but today’s explosion of data sources, and the intriguing technologies for dealing with them, has changed the calculus. Embedded quantitative methods may relieve analysts of the actual construction of predictive models, but applying those models correctly requires understanding the different analytical types, roles and skills.

Analytics in the Enterprise

The emphasis of analytics is changing from long-range planning based on historical data to dynamic, adaptive response based on timely information from multiple contexts, augmented and interpreted through various degrees of quantitative analysis. Analytics now permeates every aspect of leading organizations’ operations. Competitive, technological and economic factors combine to require more precision and less lag time in discovery and decision-making.

For example, operational processing, the orchestration of business processes and secure capture of transactional data is merging with analytical processing, the gathering and processing of data for reporting and analysis. Analytics in commercial organizations has historically been limited to special groups working more or less off-line. Platforms for transaction processing were separated for performance and security reasons, an effect of “managing from scarcity.” But scarcity is not the issue anymore as the relative cost of computing has plummeted. Driven equally by technology and competition, operational systems are either absorbing or at least cooperating with analytical processes. This convergence elevates the visibility of all forms of analytics.

Confusion and mistakes in deploying analytics are common due to imprecise understanding of its various forms and types. Uncertainty about the staff and skills needed for the various “types” of analytics is equally common. Messaging from technology vendors, service providers and analysts is murky and misleading, sometimes deliberately so.

The urgency behind implementing an analytics program, however, can be driven not by getting a leg up, but rather by not falling behind.

Analytics and the Red Queen Effect

Analytics are crucial because the barriers to getting started are lower than ever. Everyone can engage in analytics now, of one type or another. As analytic capabilities increase across competitors, everyone must step up – it’s a Red Queen[i] effect. When everyone was shooting from the hip, efficiency was a matter of degree. If everyone used crude models and unreliable data, then everyone worked, more or less, within the same margin of error. What separated competitors was good strategy and good execution. Now that everyone can employ quantitative methods and techniques like Naïve Bayes, C4.5 and support vector machines, it will still be strategy and execution that count. Companies must improve just to stay in place. Each new level of analytics becomes the “table stakes” for the next.

Can You Compete on Analytics? Analytics Are Necessary – but Not Sufficient

Statistical methods using software have been shown to be useful in many parts of an organization, such as fraud detection, demand forecasting and inventory management, but using analytics has not been shown to necessarily improve the fortunes or effectiveness of the organization as a whole. In 2007, Davenport and Harris released their influential book[ii], Competing on Analytics, which described how a dozen or so companies used “analytics” not only to advise decision-makers, but to play a major role in the development of strategy and the implementation of business initiatives. The book found a huge following and was a bestseller on the business book lists. It certainly placed the word “analytics” at the top of mind for many decision makers. However, the stock market performance of the twelve companies highlighted in the book has been less than spectacular, as illustrated in Figure 2.

This scenario is often repeated – good work is performed inside an organization, but the benefits of the discipline do not permeate other parts of the business and, hence, have little effect on the organization as a whole. In another example, statistical methods have been used in the U.S. in agriculture for decades, and yields have improved dramatically, but the quality of the food supply has clearly degraded along with the fortunes of individual farmers.

Too many organizations, despite good intentions, do not see dramatic improvement in their fortunes after adopting wider-based analytical methods because:

First, rarely does one thing change a company. Analytics are a powerful tool, but it takes execution to realize the benefits. Perhaps if good analytical technique had been applied across the board, along with a clear strategy to drive decisions based on quantitative models, better results might have followed. Instead, as is often the case, a visible project shows great promise and early results, but the follow-through is wanting.

Second, data mining tools can indeed be predictive, showing what is likely to happen or not happen. What is often misunderstood is that they are usually poor at specifying when things will happen. Too much faith is placed in the models, imbuing them with fortune-telling capabilities they simply lack. The correct approach is to test, run proofs of concept and, once in production, engage in continuous improvement through mechanisms like champion/challenger and A/B testing (a minimal sketch follows this list of reasons).

Third, most companies try to understand customer behavior – which you can do with data mining – but models rarely capture the randomness of people’s behavior, which leads to overconfidence in them. Given that a customer is likely to purchase a car, when is the correct time to reach out? Perhaps right away, perhaps not. Data mining tools are not very good at individual propensities derived from behavior because of that randomness, and it is common for inexperienced modelers to put too much faith in model results. The solution is to engage experienced talent to get a program started on the right track.

Finally, return on investment in analytics is difficult to measure because there is often no straight line from the model to results. Other parts of the organization contribute. An analytical process can inform decisions, whether human or machine-driven, but the execution of those decisions is beyond the reach of an analytical system; people and process have to perform too. In addition, a successful analytical program is usually the result of a well-defined strategy, and positive results from analytics would not have been possible without that strategy.
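To make the champion/challenger and A/B testing point above concrete, here is a minimal sketch – invented for illustration, not taken from any vendor tool – of how production traffic might be split between an incumbent model and a candidate, with outcomes tracked for later comparison. The model functions, the 90/10 split and the offer threshold are all hypothetical.

```python
# Illustrative champion/challenger sketch; the models, split and threshold are hypothetical.
import random
from collections import defaultdict

def champion_score(customer):
    return 0.30  # placeholder for the incumbent (champion) model's propensity score

def challenger_score(customer):
    return 0.35  # placeholder for the candidate (challenger) model's propensity score

outcomes = defaultdict(list)  # raw material for the A/B comparison

def decide(customer):
    """Route a small share of traffic to the challenger and record which arm decided."""
    arm = "challenger" if random.random() < 0.10 else "champion"   # 90/10 split
    score = challenger_score(customer) if arm == "challenger" else champion_score(customer)
    make_offer = score > 0.25                                      # hypothetical offer policy
    return arm, make_offer

def record(arm, make_offer, responded):
    """Later, when the real-world outcome is known, log it against the deciding arm."""
    outcomes[arm].append((make_offer, responded))

def response_rate(arm):
    """Compare arms periodically; promote the challenger only if it wins consistently."""
    offered = [responded for made, responded in outcomes[arm] if made]
    return sum(offered) / len(offered) if offered else 0.0
```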

Professionals skilled in statistics, data mining, predictive modeling and optimization have been part of many organizations for some time, but their contribution – and even an awareness of what they do – is often poorly understood, and their path to success is filled with impediments. By categorizing analytics by the quantitative techniques used and the level of skill of the practitioners who use them (the business applications they support are detailed in Part II of the series), companies can begin to understand when and how to use analytics effectively and deploy their analytic resources to achieve better results.

The Four Types of Analytics

There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. What follows is a way to characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.

Figure 4: The Four Types of Analytics

| Type | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles |
| --- | --- | --- | --- |
| Type I | Quantitative Research | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles. |
| Type II | Data Scientist or Quantitative Analyst | Advanced math/statistics, not necessarily PhD | Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge. |
| Type III | Operational Analytics | Good business domain knowledge; background in statistics optional | Running and managing analytical models. Strong skills in, and/or project management of, analytical systems implementation. |
| Type IV | Business Intelligence/Discovery | Data- and numbers-oriented, but no special advanced statistical skills | Reporting, dashboard, OLAP and visualization use, and possibly design. Performing post hoc analysis of results driven by quantitative methods. |

Type I Analytics: Quantitative Research

The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. Quantitative Research analytics are performed by mathematicians, statisticians and other pure quantitative scientists. They discover new ideas and concepts in mathematical terms and develop new algorithms with names like Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, Machine Learning and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions (though not exclusively).  Commercial, governmental and other organizations (Google or Wall Street for example) employ staff with these very advanced skills; but in general, most organizations are able to conduct their necessary analytics without them, or employ the results of their research. An obvious example is the FICO score, developed by Quantitative Research experts at FICO (Formerly Fair Isaac) but employed widely in credit-granting institutions and even human resource organizations.

Type II Analytics: “Data Scientists”

More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, and even the heavy “quants” in industry who apply these methods specifically to the work they do like fraud detection, failure analysis, propensity to consume models, among hundreds of other examples. They operate in much the same way as commercial software companies but for just one customer (though they often start their own software companies too). The popular term for this role is “data scientist.”

“Heavy” Data Scientists. The Type II category can be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function – providing guidance and expertise in the application of quantitative analysis – they are differentiated by the sophistication of the techniques they apply. Type II-A practitioners understand the mathematics behind the analytics and may apply very complex tools – Lucene wrappers, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives – that are understood by a small number of practitioners. What differentiates a Type II-A from a Type I is not necessarily the depth of their knowledge of formal analytical methods (it is not uncommon for Type IIs to have a PhD, for example); it is that they also possess business domain knowledge, and their goal is to develop specific models for the enterprise, not for the general case as Type Is usually do.

“Light” Data Scientists. Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, CHAID and various forms of linear regression. They approach the problems they deal with using more conventional best practices and/or packaged analytical solutions from third parties.
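As an illustration of that Type II-B toolkit, here is a minimal logistic regression propensity sketch. It assumes scikit-learn is available and uses synthetic data with invented feature names; it is not drawn from the report itself.

```python
# Minimal propensity model sketch using logistic regression (assumes scikit-learn is installed).
# The features (recency, frequency, monetary value) and the data are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                          # recency, frequency, monetary
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)   # synthetic "purchased" flag

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

propensities = model.predict_proba(X_test)[:, 1]   # scores a downstream Type III process could act on
print("holdout accuracy:", round(model.score(X_test, y_test), 3))
```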

Data Scientist Confusion. “Data scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools for classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems likely that most so-called data scientists lean more toward quantitative and data-oriented subjects than business planning and strategy. The reason is that the term emerged from businesses like Google or Facebook, where the data is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business that, say, actuaries have in the insurance business, whose extensive training should be a model for the newly designated data scientists (see our blog post “What is a Data Scientist and What Isn’t”).

Though not universally accepted as part of the role, data scientists must be able to communicate their work effectively to non-technical people. This is a major discriminator between a data scientist and a statistician. It is essential that someone in the analytics process play the role of chief communicator: someone who is comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand. Companies often fail to see that there is almost never anything to be gained by putting a PhD statistician in charge of a group of analysts and developers. It is safe to say that this role is filled more often by a collaborative group of professionals than by a single individual.

Type III Analytics: Operational Analytics

Historically, this is the part of analytics we’re most familiar with. For example, a data scientist may develop a scoring model for his or her company. In Type III activity, the operational analyst chooses the parameters that are input to that model, and the scores calculated by the Type II model are embedded in an operational system that, say, generates offers for credit cards. Models developed by data scientists can be applied and embedded in an almost infinite number of ways today; applying Type II models to real work is the realm of operational analysts. In very complex applications, real-time data can be streamed into applications based on Type II models, with outcomes instantaneously derived through decision-making tools such as rules engines.
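A minimal sketch of that hand-off, with invented coefficients, attributes and thresholds: the Type II work is reduced to a fixed scoring function, and the Type III layer applies operational rules to the score.

```python
# Illustrative only: a pre-built scoring model (fixed, invented coefficients) embedded in an
# operational decision flow; a real deployment would load a vetted model and a managed rule set.
import math

COEFFICIENTS = {"intercept": -2.0, "income_k": 0.01, "utilization": -1.5, "tenure_years": 0.08}

def score(applicant: dict) -> float:
    """Logistic score produced by the (pre-built) Type II model."""
    z = COEFFICIENTS["intercept"] + sum(COEFFICIENTS[k] * v for k, v in applicant.items())
    return 1.0 / (1.0 + math.exp(-z))

def decide(applicant: dict) -> str:
    """The Type III layer: business rules applied to the score."""
    s = score(applicant)
    if s >= 0.7:
        return "send pre-approved credit card offer"
    if s >= 0.4:
        return "route to manual review"
    return "no offer"

print(decide({"income_k": 85, "utilization": 0.3, "tenure_years": 6}))
```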

Packaged applications that embed quantitative methods such as predictive modeling or optimizations are also Type III in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of “black box.” As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.

Decision-making systems that rely on quantitative methods not well understood by their operators can lead to trouble. They must be carefully designed (and improved) to avoid burdening recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining: generating “interesting” results without understanding what was relevant usually led to flagging interest in the technology. In today’s business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine its messages to those that are the most relevant and useful.

False negatives are quite a bit more problematic than false positives, as they can let transactions pass through that should not have. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.

Type IV Analytics: Business Intelligence & Discovery

Type III analytics aren’t of much value if their application in real business situations cannot be evaluated for their effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. This includes almost any activity that reviews information to understand what happened or how something performed, or to scan and free associate what patterns appear from analysis. The mathematics involved is simple. But pulling the right information – and understanding what information means – is still an art and requires both business sense and knowledge about sources and uses of the data.

Know Your Needs First

The scope of analytics is vast, ranging from the familiar features of business intelligence to the arcane and mysterious world of applied mathematics. Organizations need to be clear on their objectives and capabilities before funding and staffing an analytic program. Predictive modeling to dramatically improve your results makes for good reading, but the reality is quite different. The four types are meant to help you understand where you can begin or advance.

These categories are not hard and fast. Some activities are clearly a blend of various types. But the point is to add some clarity to the term “analytics” in order to understand its various use cases. Tom Davenport, for example, advocated creating a cadre of “PhDs with personality” in order to become an analytically competitive organization. That is one approach. Implementing analytics as part of other enterprise software you already have – or purchasing a specialized application that is already used and vetted in your industry – is a better place to start.

Recommendations

Clear terminology can avoid confusion, not just internally but in communication with vendors and service providers. To get the most out of analytics:

  • Be clear about what you need. Having clarity on the meaning of analytics has clear benefits. Because the nature of analytics is a little mysterious to most people, a vendor statement that it provides “embedded predictive analytics” cannot be taken at face value. Look closely to see whether those capabilities line up with your needs.
  • Don’t assume high value means high resource costs. In the same vein, you needn’t hesitate to begin analytical projects because you believe you need to source a dozen PhDs, when in fact, your needs are in the Type II category.
  • Formulate specific vendor questions based on what level of sophistication and resources you need. By more clearly specifying what type of analytics you need, it becomes very easy to ask: Is this tool designed to discover and create predictive models, or to deploy them from other sources? Do you offer training in quantitative methods or only in the use of your product? Is the tool designed for authoring scoring models or just using scored values?
  • Use analytic knowledge to start to prepare for Big Data.  Understanding what type of analytics – and results – you need will even help you in your soon-to-be-serious consideration of Big Data solutions, including Hadoop, its variants and its competitors, all of which use variants of the above techniques to process large quantities of information.

Analytics is a catchall phrase, but understanding the various uses and types should help in implementing the right approach for accomplishing the tasks at hand.  It should also help in discerning what is meant when the term is used, as almost anything can be called analytics.

Next Steps

Part II of this series will examine in depth the forms that analytics take in the organization and the business purposes it serves, and demonstrate through examples and case studies how analytics of all types are successfully employed. But analytics are a step in the process. Without effective decision-making practices the value in analytics is lost. Part III of this series will deal with decision making and decision management.

Author Bio: Neil Raden

Analyst, Consultant and Author in Analytics and Decision Science

Neil Raden, nraden@hiredbrains.com, is the founder and Principal Analyst at Hired Brains Research, a provider of consulting and implementation services in business intelligence, analytics and decision management. Hired Brains focuses on the needs of organizations and the capabilities of technology. He began his career as a property and casualty actuary with AIG in New York before moving into predictive analytics services, software engineering and systems integration, with experience delivering environments for decision making in fields as diverse as health care, nuclear waste management and cosmetics marketing.

 

[i] The Red Queen is a concept from evolutionary biology popularized in Matt Ridley, The Red Queen: Sex and the Evolution of Human Nature (New York: Macmillan Publishing Co., 1994). The allusion is to the Red Queen in Lewis Carroll’s Through the Looking-Glass, who had to keep running just to stay in place.

[ii] Thomas H. Davenport and Jeanne G. Harris, Competing on Analytics: The New Science of Winning (Boston: Harvard Business School Press, 2007).


When Are Decisions Driven by Analytics, or Merely Informed by Them?

Boy, did Julie Hunt ever hit the nail on the head in this post: http://www.dataintegrationblog.com/data-quality/from-tactical-to-strategic-action-operational-decision-management-2/

“But – real-time decision-making also has to be vetted with domain knowledge, human experience and common sense, to validate the viability of analytics results. Decisions make a positive difference for the enterprise only if they are based on accurate intelligence. While many things are possible with predictive analytics, there is always the danger of trying to force ‘reality’ to fit the model. This can be deadly to real-time operational decision-making.”

When it comes to decisions that can be made via models, you have to separate them into two categories: those that do not require 100% precision, and those that are too important to get wrong. 

For example, routing a call center call, approving a credit line increase, rating a car insurance premium – these are all “decisions” made in high volume, where getting some of them wrong, in the aggregate, causes little harm. Obviously, the closer you get to perfect performance the better, but you can allow these decisions to be made without human intervention, provided you track the results and continuously improve the models.

On the other hand, many decisions in an enterprise are too important to turn over to some algorithms. In these cases, the quantitative analysis can be a part of the decision process, but ultimately the decision vests with the person or persons who take responsibility for it. In point of fact, very few managers are comfortable with answers based on probability. The difference between 80% probability and 95% probability simply doesn’t resonate. For important decisions, managers want one answer, and that requires discussion and consensus. 

We have to be very careful not to over-promise on analytics. 


Personalized Medicine World Conference

This is the fourth or fifth year for this conference, and each year there are some surprises. The first couple of years it was a diverse collection of researchers, entrepreneurs and vendors (Oracle, Deloitte, etc.). The number of exhibitors seems about the same as last year, but there was a small booth for SAP HANA, which was something of a surprise, though I learned they are aggressively going after the life sciences sector and that Hasso Plattner was a keynote speaker. That’s a good sign this conference is getting pretty commercial.

As at any conference, a few stars emerge and become familiar, repeat presenters. Atul Butte is one example. Here is his bio:

Atul Butte, MD, PhD is Chief of the Division of Systems Medicine and Associate Professor of Pediatrics, Medicine, and by courtesy, Computer Science, at Stanford University and Lucile Packard Children’s Hospital. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft, received his MD at Brown University, trained in Pediatrics and Pediatric Endocrinology at Children’s Hospital Boston, then received his PhD in Health Sciences and Technology from Harvard Medical School and MIT. Dr. Butte has authored more than 100 publications and delivered more than 120 invited presentations in personalized and systems medicine, biomedical informatics, and molecular diabetes, including 20 at the National Institutes of Health or NIH-related meetings.

I did find Dr. Butte’s presentation worthwhile – it described how bioinformatics tools applied to big public data have yielded new uses for drugs and new prototype drugs and diagnostics for type 2 diabetes. It was an interesting discussion of what we call big data analytics, but in the end, it just came back to making more drugs.

When you attend medical conferences, speakers always have these extensive pedigrees, but what I wonder is, with all of the esteem, what sort of doctors are they? Are they too distanced from day-to-day clinical work to see the problems and possibilities? Are their decisions made within a bubble that excludes consideration of alternatives? That is the sense I get listening to them. 

A common term used by many of the speakers was “omics.” First we had genomics, then epigenomics, followed by proteomics and metabolomics. All of these areas combine bench science and informatics on a huge scale. The hope is that the digital examination of these minute measurements can lead to cures for diabetes, cancer, heart disease and Alzheimer’s.

Michael Snyder, Ph.D., Professor & Chair, Stanford Center of Genomics & Personalized Medicine, gave a notable and introspective presentation about the use of a combination of omics methods to assess health states in a single individual – himself – over the course of almost three years. Genome sequencing was used to determine disease risk. Longitudinal personal profiling of the transcriptome, proteome and metabolome was used to monitor disease, including viral infections and the onset of diabetes. His premise is that these approaches can transform personalized medicine. It was discovered that he carried genes for diabetes and in fact developed it during the period, but, in my opinion, he failed to see the causal effect of poor sleep from repeated respiratory infections that corresponded with the spike in blood sugar. In some ways, it seems these brilliant scientists just don’t see the forest for the trees, and that hurts us.

 

Steven C. Quay, M.D., Ph.D., FCAP, Founder, Atossa Genetics, Inc., pitched his own company, which is devoted to obtaining routine, repeated, “painless” breast biopsy samples non-invasively for cytopathology, NGS, proteome and transcriptome analysis of precursors to breast cancer; the use of breast specimens obtained non-invasively for biomarker discovery, clinical trial support, patient selection and informing personalized medical therapy; and cancer prevention using intraductal treatment of reversible hyperplastic lesions.

Two problems with his presentation: NO ONE KNOWS HOW TO PREVENT BREAST CANCER. Also, the “painless” techniques are almost medieval. If you don’t believe me, look up “ductal lavage” and let me know if you’d want to submit to that repeatedly.

The problem is no one ever seems to use the word cure, or to speculate why these diseases exist at all. All of the research presented seems to end with the following refrain: “Hopefully leading to the development of new drugs…” Well, follow the money. 

I don’t know if I’ll go next year. 

 

 


Forked SQL: Informatica Gets It

By Neil Raden

About fifteen years ago, Microstrategy cofounder Sanju Bansal told me, “SQL is the best hope for leveraging the latent value from databases.” Fifteen years later, it’s extraordinary how correct Bansal was. Microstrategy is still a robust, free-standing Business Intelligence company while most of the proprietary multidimensional databases have disappeared. But what about the next fifteen years?

At least for business analytics, SQL is under attack. So much so, that there is an entire emergent market segment called NoSQL. If SQL itself is under siege, what about the myriad technologies that in one way or another are part of the SQL ecosystem like Informatica? Are they obsolete? Will we need to throw away the baby with the bathwater?

This whole dustup has been brewing for a decade or more. Since we started using computers in business 60+ years ago, the big machines were managed by a separate group of people with specialized skills, now generically referred to as IT. Though separate and decidedly non-businesslike, IT eventually became bureaucratic, fixed in its mission to control everything data. Even SQL was adopted very slowly, but it is solidly the tool of choice for most applications.

About ten years ago, though, a renegade group of people I called “the pony-tail-haired guys” (PTH for short) appeared on the scene with their externally focused web sites and, gradually, their own development tools, methods and monitoring software. At first, IT paid no attention to them because they didn’t interfere with the inwardly focused enterprise computing environment, but as the perceived value of “e-business” grew, a great deal of friction and turf warfare erupted. Computing bifurcated inside the firewall.

The PTH preferred web-oriented tools, open source software and search. That’s why Big Data/Hadoop and NoSQL are so divergent from enterprise computing. Different people, different applications, different brains.

But when it comes to business analytics, the lines are not so clear. The PTH found, just as enterprise people did (reluctantly), that analytics is key to everything else. But while enterprise apps measure sales and revenue, the PTH guys are looking at really strange things like sentiment analysis. Today I can download a Fortune 500 general ledger to my watch, but things like sentiment analysis look at hundreds of thousands of times more data. Loading it into a relational database and analyzing it with a language designed for set operations and transactions just doesn’t work. So is there a justification for something other than SQL for this? Of course there is.

But the PTH guys, now that they’ve grown up a little, are starting to show the same sclerotic tendencies as their IT colleagues, assuming that their tools and methodologies are the ONLY tools for analytics and that SQL should be put in the dustbin. That’s just silly. Existing data warehouse, data integration, data analysis and presentation tools are well-suited to lots of tasks that aren’t going away any time soon, though methodologies and implementations are in dire need of renovation – something I call a Surround Strategy, as opposed to the ridiculous, outdated idea of the Single Version of the Truth.

Here is where Informatica enters the picture. With all the breathless enthusiasm over Big Data, one thing is often lost in the rush: no fundamental laws of physics concerning data integration have been altered. This places Informatica squarely in the middle of every trend grabbing headlines today: big data, social analytics, virtualization and cloud.

Any type of reporting or analysis, whether through traditional ETL and data warehousing, Hadoop, Complex Event Processing or even Master Data Management, deals with “used data,” meaning data created through some original process. All used data has one attribute in common – it doesn’t like to play with other used data. Data extracted from a primary source typically has semantics and rules hidden in the application logic, so that, on its own, it often doesn’t make sense.

This has been the most difficult part of assembling useful data for analysis historically, and with the entrance of mountains of non-enterprise data, the problem has only grown larger.

Luckily for the fortunes of Informatica, this is a big opportunity and they have stepped up.

For in-depth descriptions of the following Big Data, cloud, Hadoop and SaaS innovations coming from Informatica, refer to their materials at http://www.informatica.com. Briefly, they include:

HParser, a facility to visually prepare very large datasets for processing in Hadoop, saving developers and analysts a substantial amount of what is mostly hand coding.

Complex Event Processing (CEP), which isn’t new for Informatica, but given the expanded complexity of data integration in the Big Data era, they have incorporated CEP into their own platform to detect and inform of events that can affect the performance, accuracy and management of data coming from heterogeneous processes.

Informatica pioneered the GUI diagram for building integration mappings, and 9.5 implements an integration optimizer that determines the optimal mapping rather than processing the diagram literally.

With each new release of a complex software product comes the time-consuming and nerve-wracking routine of upgrades. 9.5 now provides automated regression testing to dramatically reduce the time and pain of upgrades.

In Information Lifecycle Management, 9.5 provides “Intelligent Partitions” to distribute data across hot, warm and cold storage devices.

Add to this facilities for virtualization, replication and a slew of offerings for cloud-based applications, and there is an inescapable logic to Informatica exploiting the new opportunities of Big Data.


BI Is Dead! Long Live BI!

 

Executive Summary

We suggest a dozen best practices needed to move Business Intelligence (BI) software products into the next decade. While five “elephants” occupy the lion’s share of the market, the real innovation in BI appears to be coming from smaller companies. What is missing from BI today is the ability for business analysts to create their own models in an expressive way. Spreadsheet tools exposed this deficiency in BI a long time ago, but their inherent weakness in data quality, governance and collaboration makes them a poor candidate to fill this need. BI is well-positioned to add these features, but must first shed its reliance on fixed-schema data warehouses and read-only reporting modes. Instead, it must provide businesspeople with the tools to quickly and fully develop their models for decision-making.

 

Why BI Must Transition From The Past Century

In this avant-garde era of Big Data, cloud, mobile and social, the whole topic of BI is a little “derriere.” BI is a phenomenon of the previous two decades, but the worldwide market for BI tools (not including services or surrounding technologies such as data warehousing and data integration) is greater than $10 billion per year. It’s still a very significant market and will, for some time, dwarf spending on the Big Data top gun, Hadoop, which is an open source distribution.  The lion’s share of revenue in the Big Data market will continue to be hardware and services, not software, unless you consider the application of existing technologies, especially database and data integration software, as part of Big Data.

Because BI is still alive, it’s worth revisiting some concepts I wrote about a few years ago — abstraction and model-driven design in BI. In the proto-BI days, when it was still known as Decision Support Systems (DSS), high-level declarative languages were used to create business models that could handle user-entered (or background-loaded) data. These systems were interactive, allowing for what-if analysis, testing sensitivity of the models and even the application of advanced statistical techniques.

Data warehousing changed all of that. Performance of relational databases for interactive reporting from static schemas was a challenge. Adding data to the warehouse interactively through iterative analysis was out of the question. Adding entities to the schema to accommodate new models required data modeling and schema changes, often reloading data and associated testing before implementation.

As data warehouses became the preferred way to provide data for reporting and analysis, BI vendors that focused on the read-only nature of data warehousing prospered and dominated the field. Business modeling fell from favor or, more precisely, fell to Excel. The only exception was found in budgeting and planning packages, mostly based on Multidimensional Online Analytical Processing (MOLAP) databases, predominantly Essbase and Microsoft BI.

 

The Era Of Big Data Means Business Analytics

The major BI vendors of the late 1990s through today are BusinessObjects (SAP), Cognos (IBM), SAS, Hyperion (Oracle), Microsoft and Microstrategy. Among these six vendors (who together comprise more than two-thirds of the BI market), only about 20 percent of combined revenue came from tools that provided modeling capabilities. The rest came from strictly read-only data warehouses and marts[i].

Now that the era of Big Data is here, modeling has to move up a notch. While newer technologies are in play for capturing and massaging Big Data, the need for business analytics is greater than ever. In the past, the BI calculus more or less ended with informing people. Today, BI must enable actions and decisions that are supported by deep insight. Excel may be a good container for interacting with models, but its internal capabilities aren’t sufficient. The need for BI tools will not disappear, but it’s time to break the read-only mold.

 

Moving From Data To Decisions Through Business Modeling

Business modeling can be an imprecise term. But in general, it means creating descriptive replicas of a part of a business — such as assets, processes or optimizations — in terms that are consonant with the people (and processes) that use them. Usually, the goal is to render these models into computer-based applications. In general, a business model is created by someone who has certain knowledge about a process or function in the business. This can range from a single fact, such as how operating cash flow is calculated, to something as broad as how the manufacturing plants operate.

To be effective, businesspeople need more than access to data: They require a seamless process that lets them interact with the data and drive their own models and processes. In today’s environment, these steps are typically disconnected and, therefore, expensive and slow to maintain without the data quality controls from the IT department. The solution: a better approach to business modeling coupled with an effective architecture that separates physical data models from semantic ones. In other words, businesspeople need tools to address physical data through an abstraction layer that allows them to address only the meaning of data — not its structure, location or format.
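As a toy illustration of such an abstraction layer – the mapping, table names and query shape are all invented, not drawn from any product – business terms are bound to physical columns once, and businesspeople then ask questions by meaning rather than by schema:

```python
# Toy semantic layer: business terms are mapped to physical columns once, so a question can be
# posed by meaning ("revenue by region") without knowing structure, location or format.
SEMANTIC_LAYER = {
    "revenue": "sales_fact.net_amount",    # invented physical mapping
    "region": "store_dim.region_name",
}

def build_query(measure: str, by: str) -> str:
    """Translate a business question into physical SQL using the mapping above."""
    return (
        f"SELECT {SEMANTIC_LAYER[by]} AS {by}, SUM({SEMANTIC_LAYER[measure]}) AS {measure} "
        "FROM sales_fact JOIN store_dim ON sales_fact.store_id = store_dim.store_id "
        f"GROUP BY {SEMANTIC_LAYER[by]}"
    )

print(build_query("revenue", by="region"))
```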

Abstraction is applied routinely to systems that are complex and especially to systems that frequently change. A 2012 model car contains more processing power than most computers of only a decade ago. Driving the car, even under extreme conditions, is a perfect example of abstraction. Stepping on the gas doesn’t really pump gas to the engine; it alerts the engine management system to increase speed by sampling and alerting dozens of circuits, relays and microprocessors to achieve the desired effect. These actions are subject to many constraints, such as limiting engine speed and watching the fuel-air mixture for maximum economy or minimum emissions. If the driver needed to attend to all of these things directly, he would never get out of the driveway.

Data warehouses and BI tools still rely on at least some of the business users’ understanding of the data models and semantics, and sometimes the intricacies of the crafting of queries. This is a huge barrier to progress. Businesspeople need to define their work in their own terms. A business modeling environment is needed for designing and maintaining structures, such as data warehouses and all of the other structures associated with it. It is especially important to have business modeling for the inevitable changes in those structures. It is likewise important for leveraging the latent value of those structures through analytical work. This analysis is enhanced by understandable models that are relevant and useful to businesspeople.

BI isn’t standalone anymore. In order to close the loop, it has to be implemented in an architecture that is standards-based. And it has to be extensible and resilient, since the boundaries of BI are more fuzzy and porous than other software. While most BI applications are fairly static, the ones of most value to companies are the flexible and adaptable ones. In the past, it was acceptable for individual “power users” to build BI applications for themselves or their group without regard to any other technology or architecture in the organization. Over time, these tools became brittle and difficult to maintain because the initial design was not robust enough to adapt to the continuous refinements and changes needed by the organization. Because change is a constant, BI tools need to provide the same adaptability at the definitional level that they currently do at the informational level. The answer is to provide tools that allow businesspeople to model.

 

Business Modeling Emerges As a Critical Skill

All businesspeople use models, though most of them are tacit or implied, not described explicitly. Evidence of these models can be found in the way workers go about their jobs (tacit models) or how they have authored models in a spreadsheet. The problem with tacit models is that they can’t be communicated or shared easily. The problem with making them explicit is that it just isn’t convenient enough yet. Most businesspeople can conceptualize models. Any person with an incentive compensation plan can explain a very complicated model. But most people will not make the effort to learn how to build a model if the technology is not accessible.

There are certain models that almost every business employs in one form or another. Pricing is a good example and, in a much more sophisticated form, yield management, such as the way airlines price seats. Most organizations look at risk and contingencies, a hope-for-the-best-prepare-for-the-worst exercise. Allocation of capital spending or, in general, allocation of any scarce resource is a form of trade-off analysis that has to be modeled. Decisions about partnering and alliances as well as merger or acquisition analysis are also common modeling problems. Models are also characterized by their structure. Simple models are built from data inputs and arithmetic. More complicated models use formulas and even multi-pass calculations, such as allocations. The formulas themselves can be statistical functions and can perform projections or smoothing of data. Beyond this, probabilistic modeling is used to model uncertainty, such as calculating reserves for claims or bad loans.

When logic is introduced to a model, it becomes procedural. Mixing calculations and logic yields a very potent approach. The downside is that procedural models are difficult to develop with most tools today and are even more difficult to maintain and modify because they require the modeler to interact with the system at a coding or scripting level; most business people lack the temperament or training, or both, to do so. It is not reasonable to assume that businesspeople, even the power users, will employ good software engineering technique, nor should they be expected to. Instead, the onus is on the vendors of BI software to provide robust tools that facilitate good design technique through wizards, robots and agents.

 

Learn From A Dozen Best Practices In Modeling

For any kind of modeling tool to be useful to businesspeople, supportable by the IT organization and durable enough over time to be economically justifiable, it must provide or allow the following capabilities:  

  1. Level of expressiveness. This must be sufficient for the specification, assembly and modification of common and complex business models without code; it should accommodate all but the most esoteric kinds of modeling.
  2. Declarative method. Such a method means that each “statement” is incorporated into the model without regard to its order, sequence or dependencies. The software handles issues of calculation optimization, freeing modelers to design whatever they can conceive.
  3. Model visibility.  This enables the inspection, operation and communication of models without extra effort or resources. Models are collaborative and unless they can be published and understood, no collaboration is possible.
  4. Abstraction from data sources. This allows models to be made and shared in language and terms unconnected to the physical characteristics of data; it gives managers of the physical data much greater freedom to pursue and implement optimization and to improve performance efforts.
  5. Extensibility. Extensibility means that the native capabilities of the modeling tool are robust enough to extend to virtually any business vertical, industry or function. Most of the leading BI tools are owned by much larger corporate parents, potentially limiting or directing the development roadmap of the BI offering in alignment with the wider vision of the parent. Smaller BI and analytics pure-plays tend to have a truer vision about BI (until they are acquired). Because analysts who think out of the box gain valuable insight, the BI tool cannot impose vendor-specific semantics and functions, or lock analysts in and limit distribution of insights with expensive licenses.
  6. Visualization. Early BI tools relied on scarce computing resources, but with today’s abundance of processing power, an effective BI platform should include visualization. It is a proven fact that single-click visualization of models and results aids in understanding and communicating complicated models.
  7. Closed-loop processing. This is essential because business modeling is not an end-game exercise, or at least it shouldn’t be. It is part of a continuous execute-track-measure-analyze-refine-execute loop. A modeling tool must be able to operate cooperatively in a distributed environment, consuming and providing information and services through a standards-based protocol. The closed-loop aspect may be punctuated by steps managed by people, or it may operate as an unattended agent, or both.
  8. Continuous enhancement. This requirement is born of two factors. First, with the emerging standards of service-oriented architectures, web services and XML, the often talked-about phenomenon of organizations linked in a continuous value chain with suppliers and customers will become a reality soon and will put great pressure on organizations to be more nimble. Second, it has finally crept into the collective consciousness that development projects involving computers are consistently under-budgeted for maintenance and enhancement. The mindset of “phases” or “releases” is already beginning to fray, and forward-looking organizations are beginning to differentiate tool vendors by their ability to enable enhancement without extensive development and test phases.
  9. Zero code. In addition to the fact that most businesspeople are not capable of and/or interested in writing code, there is sufficient computing power at reasonable costs to allow for more and more sophisticated layers of abstraction between modelers and computers. Code implies labor, error and maintenance. Abstraction and declarative modeling implies flexibility and sustainability. Most software “bugs” are iatrogenic; that is, they are introduced by the programming process itself. When code is generated by another program, the range of programmatic errors is limited to the latent errors in the code generator, not the errors introduced by programmers.

  10. Core semantic information model (ontology). Abstraction between data and the people or programs that access the data isn’t very useful unless the meaning of the data and its relationships to everything else are available in a repository.

  11. Collaboration and workflow. These capabilities are essential to connecting analytics to every other process within and beyond the enterprise. A complete set of collaboration and workflow capabilities supplied natively within a BI tool is not necessary, though. Instead, the ability to integrate (this does not mean “be integrated,” which implies lots of time and money) with collaboration and workflow services across the network, without latency or conversion problems, is preferable.

  12. Policy. This may be the most difficult requirement of them all. Developing software to model business policy is tricky. For example, “Do not allow contractor hours in the budget to exceed 10 percent of non-exempt hours” (a minimal sketch of such a rule follows this list). Simple calculations through statistical and probabilistic functions have been around for over three decades. Logic models that can make decisions and branch are more difficult to develop, but still not beyond the reach of today’s tools. But a software tool that allows businesspeople to develop models in a declarative way to actually implement policies is on a different plane. Today’s rules engines are barely capable enough, and they require expert programmers to set them up. Policy in modeling tools is in the future, but it will depend on all of the above requirements.
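Here is a minimal sketch of the contractor-hours policy above expressed declaratively. The rule structure and the sample budget figures are invented and stand in for what a real rules engine would provide.

```python
# Illustrative only: the contractor-hours policy from item 12 as a declarative rule.
# The rule structure and sample budget are invented; real rules engines are far richer.
RULES = [
    {
        "name": "contractor hours cap",
        "check": lambda b: b["contractor_hours"] <= 0.10 * b["non_exempt_hours"],
        "message": "Contractor hours may not exceed 10 percent of non-exempt hours.",
    },
]

def validate(budget: dict) -> list:
    """Return the message of every policy rule the proposed budget violates."""
    return [rule["message"] for rule in RULES if not rule["check"](budget)]

print(validate({"contractor_hours": 220, "non_exempt_hours": 2000}))
# -> ['Contractor hours may not exceed 10 percent of non-exempt hours.']
```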

 

Today’s BI Will Not Be Tomorrow’s BI

It is an open question whether BI has been, in the long run, successful or not. The take-up of BI in large organizations has stalled at 10 to 20 percent, depending on which survey you believe. I believe that expectations of broad acceptance of BI were overly optimistic and that the degree to which it has been adopted is probably at the right level for the functionality it delivered.

Will BI survive? Yes, but we may not recognize it. The need to analyze and use data that are produced in other systems will never go away, but BI will be wrapped in new technologies that provide a more complete set of tools. Instead of managing from scarcity of computing resources, BI will be part of a “decision management” continuum — the amalgam of predictive modeling, machine learning, natural language processing, business rules, traditional BI and visualization and collaboration capabilities.

Pieces of this “new” BI are already here, but within two to three years, it will be in full deployment.


[i] These numbers are representative estimates only and are based on our discussions with vendors and other published information. It is difficult to be precise with BI because the market leaders offer a wide variety of software and services, some for licensing and some embedded in non-BI products. In addition, BI itself has a number of fuzzy definitions. For the purposes of this discussion, we use a definition of query and reporting tools, including online analytical processing (OLAP), but exclusive of data warehousing, extract/transform/load (ETL), analytic databases and advanced analytics/statistics packages.

 Copyright 2012 Neil Raden and Hired Brains Inc


New World Order: Hadoop and Relational Databases

By Neil Raden Hired Brains Research  nraden@hiredbrains.com

Hadoop “data warehouses” do not resemble the data warehouse/analytics that are common in organizations today. They exist in businesses like Google and Amazon for web log parsing, indexing, and other batch data processing, as well as for storing enormous amounts of unfiltered data. Petabyte-size data warehouses in Hadoop are not data warehouses as we know them; they are a collection of files on a distributed file system designed for parallel processing. To call these file systems “a data warehouse” is misleading because a data warehouse exists to serve a broad swath of uses and people, particularly in business intelligence, which is both interactive and iterative.

MapReduce is a programming paradigm with a single data flow type that takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, quite different from the operations of a relational database. To put it in layman’s terms, there are things that Hadoop is exceptionally well designed for that relational databases would struggle to do. Conversely, a relational database data warehouse performs a multitude of useful functions that Hadoop does not yet possess.

Hadoop is described as a solution to a myriad of applications: web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, research in natural language processing and machine learning, scientific applications in physics, biology and genomics, and all forms of data mining. While it is demonstrable that Hadoop has been applied to all of these domains and more, it is important to distinguish between supporting these applications and actually performing them. Hadoop comes out of the box with no facilities at all to do most of this analysis. Instead, it requires the application of libraries available either through the open source community at forge.com or from the commercial distributions of Hadoop, or built through custom development by scarce programmers. In no case can these be considered a seamless bundle of software that is easy to deploy in the enterprise. A more accurate description is that Hadoop facilitates these applications by grinding through data sources that were previously too expensive to mine. In many cases, the end result of a MapReduce job is the creation of a new data set that is either loaded into a data warehouse or used directly by programs such as SAS or Tableau.

The MapReduce architecture provides automatic parallelization and distribution, fault recovery, I/O scheduling, monitoring and status updates. It is both a programming model and a framework for massively parallel batch processing of large datasets across many low-end nodes. Its ability to spread very large jobs across a cluster of ordinary servers is perhaps its best feature, certainly its most distinctive one. In addition, it has excellent retry/failure semantics. At the programming level, MapReduce is simple and easy to use: programmers code only Map() and Reduce() functions and are not involved with how the job is distributed. There is no data model and there is no schema; the subject of a MapReduce job can be any irregular data. Because MapReduce clusters are assumed to be composed of commodity hardware, and there are so many nodes, it is normal for faults to occur during a job, and Hadoop handles these faults automatically, shifting the work to other resources.

But there are drawbacks. Because MapReduce has a single fixed data flow and lacks a schema, indexes and a high-level language, one could consider it a hammer rather than a precision machine tool. It requires data parsing and full scans in its operation; it sacrifices disk I/O to avoid schemas, indexes and optimizers; and intermediate results are materialized on local disks. Runtime scheduling is based on speculative execution, considerably less sophisticated than that of today’s relational analytical platforms. Even though Hadoop is evolving, and the community is adding capabilities rapidly, it lacks most of the security, resource management, concurrency, reliability and interactive capabilities of a data warehouse. Hadoop’s most basic components – the Hadoop Distributed File System (HDFS) and the MapReduce framework – are purpose-built for understanding and processing multi-structured data. The file system is crude in comparison to a mature relational database system, and the lack of the universally used SQL is a limiting factor. However, its capabilities, which have just begun to be appreciated, override these limitations, and tremendous energy is apparent in the community that continues to enhance and expand Hadoop.
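To illustrate that division of labor, here is a minimal word-count job written as Hadoop Streaming-style scripts in Python. The example is generic rather than taken from the text: the framework handles distribution, shuffling and retries, while the programmer supplies only the map and reduce logic.

```python
# mapper.py - emits one (word, 1) pair per word; Hadoop groups and sorts the pairs by key.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - input arrives sorted by key, so a running total per word is enough.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Submitted through the Hadoop streaming facility, these two small scripts run in parallel across however many nodes the cluster provides, which is exactly the batch-oriented strength described above.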

Hadoop MapReduce with the HDFS is not an integrated data management system. In fact, though it processes data across multiple nodes in parallel, it is not a complete massively parallel processing (MPP) system. It lacks almost every characteristic of an MPP system, with the exception of scalability and reliability. Hadoop stores multiple copies of the data it is processing, and work from a failed node can roll over to another node that holds the same data, though there is also a single point of failure at the HDFS Name Node, which the Hadoop community is looking to address in the long term (today, NetApp provides a hardware-centric failover solution for the Name Node). It lacks security, load balancing and an optimizer. Data warehouse operators today will find Hadoop primitive and brittle to set up and operate, and users will find its performance lacking. In fact, its interactive features are limited to a pseudo-relational database, Hive, whose performance would be unacceptable to those accustomed to today’s data warehouse standards. In fairness, MapReduce was never conceived as an interactive knowledge worker tool, and the Hadoop community is making progress, but HDFS, the core data management feature of Hadoop, is simply not architected to provide the services that relational databases do today. And those relational database platforms for analytics are innovating just as rapidly with:

• Hybrid row and columnar orientation.

• Temporal and spatial data types.

• Dynamic workload management.

• Large memory and solid-state drives.

• Hot/warm/cold storage.

• Almost limitless scalability.

The ability to provide almost endless scalability and parallelism for batch jobs is a unique distinction for Hadoop. The only platforms previously able to provide this sort of massive parallelism were relational databases, and they are not limited to batch operation. So what happens next? My guess is that Hadoop survives and flourishes as the first responder to incoming data, making sense of it and handing it off to other processes, including data warehouses, in whatever form they take. Unless petabytes of historical data are needed for interactive analysis, Hadoop will be the favored location for storing history. The Hadoop community, and its imitators and competitors, will play an important role in analytics, but not the only role.


Decision Management on Steroids: Will Big Data Tools Trump Rules?

Can the ability to extract meaning and sentiment from previously unconventional data sources reorient the role of business rules?

In a typical customer application, scoring models are created by finding patterns and relationships among attributes using various statistical techniques, and the customer records are scored for propensity or eligibility. Rules then apply policy – what to do with the scored records.
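As a hedged sketch of that division of labor (the coefficients, thresholds and field names are invented, not any vendor’s model), the following Python fragment scores a customer record for propensity and then lets policy rules decide what to do with the scored record.

```python
# Sketch of "model scores, rules apply policy": a statistical model scores
# each customer record, then business rules decide the action.
import math

# Hypothetical coefficients, as if produced by a logistic regression.
WEIGHTS = {"intercept": -2.0, "tenure_years": 0.15, "recent_purchases": 0.6}

def propensity(record):
    z = (WEIGHTS["intercept"]
         + WEIGHTS["tenure_years"] * record["tenure_years"]
         + WEIGHTS["recent_purchases"] * record["recent_purchases"])
    return 1.0 / (1.0 + math.exp(-z))          # logistic score in (0, 1)

def apply_policy(record, score):
    # Policy rules: what to do with a scored record.
    if record["delinquent"]:
        return "decline"                        # policy overrides the model
    if score >= 0.7:
        return "offer_credit_line_increase"
    if score >= 0.4:
        return "send_retention_offer"
    return "no_action"

customer = {"tenure_years": 8, "recent_purchases": 3, "delinquent": False}
score = propensity(customer)
print(round(score, 2), apply_policy(customer, score))
```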

But the promise of Big Data is to deliver insight not possible with the tools of even five years ago. Does newer technology that can, to some extent, detect sentiment and propensity, examine relationships among hundreds of millions of IDs and construct path analysis in real time eliminate the need for rules? In other words, does the data speak for itself?

On the other hand, can quantitative methods really implement policy, or are we just in the early stages of the hype cycle? Is attended or unattended quantitative analysis of Big Data a sufficient model for implementing policy?

New information usually comes from unexpected places. Big leaps in understanding arise from unanticipated discoveries—but “unanticipated” does not imply a sloppy or accidental process. On the contrary, usable discoveries have to be verifiable, but the desire for knowledge requires a drive for innovation and the exploration of new sources of information that can alter our perceptions and outlooks. Unraveling the content of “big data” that lacks obvious structure begs for some new approaches. Big data is positioned to provide that insight.

But data doesn’t speak for itself. At some point, there will be expert failure: solutions require data but may degrade with too much of it. The largest annoyance is the overblown concept of the data scientist. Data scientists, in the traditional sense, are academic researchers. In the Big Data industry, they apply existing algorithms and techniques to data from traditional and new sources. Unfortunately, they usually report to people who have no idea what they are talking about.

In subsequent research I will describe the changes in “predictive” modeling brought about by Big Data and draw some conclusions about how it affects the construction, delivery and uses of decision management.

 


NoSQL: What’s the Buzz About Graph Databases?

I attended the NoSQLNow conference in San Jose and had the opportunity to speak one-on-one with a number of principals of NoSQL database companies, including Emil Eifrém of Neo Technology. For those who aren’t familiar with the concept, graph databases store nodes and the edges between them, each of which can carry properties, rather than rows and columns tied together by primary and foreign keys. In practice this allows them to traverse graphs of information more efficiently than reading pages of data and finding the rows that match a query.

Interestingly, Graph Theory (Euler) predates Set Theory (Cantor/Dedekind), on which the relational model is based, by well over a century. Of historical interest, the relational database at IBM was conceived as a method to get data out of databases, not to get data in. This turned out, in the early 70’s, to be a problem for IBM, so they redirected Ted Codd’s efforts toward making relational databases fast transaction processors. Enter the concept of “normal form,” a horribly misleading term that has sidetracked a zillion projects when data modelers with a thin understanding of the concept insisted on “normal” purity no matter the cost. The rest is history. The whole DSS/BI/Analytics movement grew out of the fact that relational databases were poor performers at non-transaction processing.

According to the NoSQL movement, and I’m not entirely convinced of this but I’m listening, the rigidity of a physical schema needed in relational databases is their undoing in an era of agility, speed and volume.  Here is a quote from Wikipedia:

Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are more suitable to manage ad-hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements.

The key characteristic of graph databases is the notion of index-free adjacency, meaning each node knows the location of its adjacent nodes, so an index is unnecessary. A semantic interpretation of this is that the graph itself is a representation of relationships. Paradoxically, there are no relationships stored in a “relational” database; they are constructed at run time by the query.
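A toy Python sketch, under the obvious simplification that the “database” is an in-memory dictionary, illustrates what index-free adjacency buys you: traversal follows direct references from node to node rather than probing indexes or joining tables. The social-graph data is made up.

```python
# Toy illustration of index-free adjacency: each node carries direct
# references to its neighbors, so traversal follows pointers instead of
# probing an index or joining tables.
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}

def friends_of_friends(graph, start):
    # Two-hop traversal: walk adjacency lists, no index lookups required.
    result = set()
    for friend in graph.get(start, []):
        for fof in graph.get(friend, []):
            if fof != start:
                result.add(fof)
    return result

print(friends_of_friends(friends, "alice"))   # {'dave'}
```

The relational equivalent would self-join a friendships table; the graph version just walks its adjacency lists, which is why traversal-heavy queries are where graph databases shine.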

Emil seems to think that graph databases are superior to relational databases in every way and will eventually supplant them. The argument that relational databases are based on sound and proven mathematical principles is interesting, but relational theory is only about 50 years old; graph theory goes back to the 18th century.

This all sounds good, but there are only a billion applications out there that rely on things NOT changing and for which the relational model is well-suited. As William McKnight said in his keynote, look to NoSQL as additive not replacement technology.

For those old enough to remember, lots of database systems in the 80’s tried to get on the relational bandwagon by claiming to be relational, and we’re already seeing the same thing with graph databases. To name a couple: Twitter’s FlockDB is only a thin graph layer on top of MySQL and therefore lacks index-free adjacency, and Microsoft’s Trinity does not store graphs natively.

Most of the NoSQL vendors are pushing the notion that their products are much less complex than current relational databases. This is undoubtedly true, but they also lack much of the functionality that has been built into the relational model over the decades. In fact, relational databases were pretty simple in the beginning, too.

To sum it up, most of the NoSQL products I’ve seen are clearly aimed at high-speed, low-complexity transaction or streaming processing, usually with unconventional data. They are not analytical tools. But they could play a very useful, even indispensable, role in analytics: getting meaning into the process.

There has been a schism between semantic technology and graph databases, probably because the former still can’t figure out how to market its technologies while simultaneously trying to prove how smart its practitioners are. Their message is muddled and their most visible promoters are not, shall we say, enterprise ready. Oddly, the notion of a triple is fundamental to graphs, yet graph database vendors are steering clear of the whole ontology/RDF/OWL thing and finding their customers in other pursuits. Good move.


Decision Services: IBM Tackles the Full Spectrum of Decision Management

(Originally posted August 31, 2012)

When James Taylor and I wrote the book “Smart (Enough) Systems: How to Gain Competitive Advantage by Automating Hidden Decisions,” we focused mainly on the kinds of decisions that are managed by a “decision service.” A decision service is an embedded applet that fires decisions in stream, typically by employing a ruleset (a set of declarative statements) and a business rules engine. We dealt at length with both predictive modeling and optimization, but the goal was to automate “little decisions that add up,” like credit line increases or call center responses, not “Should I buy Yammer?”
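For readers who have not seen one, here is a minimal sketch of the decision-service idea in Python: a ruleset expressed as declarative condition/action pairs, evaluated in stream by a tiny engine. It is illustrative only; the rule names, fields and thresholds are assumptions, not the book’s or any vendor’s rule language.

```python
# Minimal sketch of a decision service: a declarative ruleset plus a tiny
# engine that fires the first matching rule for each incoming request.
RULESET = [
    # (rule name, condition over the request, action to return)
    ("vip_fast_track", lambda r: r["segment"] == "VIP", "approve"),
    ("low_risk_small", lambda r: r["score"] > 0.8 and r["amount"] < 500, "approve"),
    ("high_risk",      lambda r: r["score"] < 0.3, "decline"),
]

def decide(request, default="refer_to_agent"):
    """Fire the first matching rule; otherwise fall back to the default."""
    for name, condition, action in RULESET:
        if condition(request):
            return name, action
    return "default", default

# One "little decision" arriving in stream, e.g. a credit line request.
print(decide({"segment": "retail", "score": 0.85, "amount": 250}))
```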

You develop a decision service like the ones we proposed with predictive modeling, since the models inform the rules engine about what to do, but that wasn’t the focus of the book. At the IBM analyst conference on “Smart Analytics” this week, IBM launched a much more comprehensive suite of tools for its own brand of decision management. Combining predictive modeling, business rules, entity analytics, optimization and cognitive computing, IBM’s goal is to provide decision services addressing both horizontal applications (customer, risk, pricing and performance management, for example) and platforms (big data and decision management analytics). Their advanced visualization approach, still under NDA, was very compelling.

The various tools in this initiative derive both from internally developed software (Watson, for example) and from the integrated offerings of acquired companies such as Netezza, Cognos, SPSS, iLog, Applix and Algorithmics, among many others (a partial list). Where IBM’s decision management proposition expands on James’ and my decision services proposition is in its ability to provide advice and recommendations across the full spectrum of strategic, operational and tactical decisions, automated or not. As a technology company, IBM presented lots of material for us geeks to enjoy, but there was a clear message that this initiative is about outcomes, not features.

One question I asked a number of times was, “How much does this cost and how long does it take?” I even tried to bucket the answer with, “Like ERP, like a data warehouse, or just flip a switch?” No answer was forthcoming. It will take some time for me to develop guidelines on this from buyers and clients, but that will follow in due course.

This mega-approach may sound daunting, but in the demos I saw very useful and intuitive amalgams of these components that are clearly ready for the enterprise. I especially liked the presentation by Brenda Dietrich (IBM Fellow, VP & CTO, Business Analytics) on her perspective on the future of analytics, including analytics at scale, analytics with uncertain data, visualization, and the future of Watson and decision management. Her messaging was clear and compelling, which will go a long way in convincing IBM’s customers to implement this.

There was, of course, a lot of talk about Watson. For my money, Watson was a cute name for a system built to compete on Jeopardy!, but I have some reservations about the name given T.J. Watson’s association with the Third Reich, which IBM has never apologized for or even acknowledged (see “IBM and the Holocaust” by Edwin Black). Watson, composed of 41 interoperating sub-systems including natural language processing, voice recognition, text analytics, knowledge representation (including ontologies) and a host of other features, needs a more serious name in my opinion. I don’t know if IBM will stick with the name, but they are offering the capability through the cloud under the title “Cognitive Computing.” The first commercial customer was WellPoint, a decision I’ve been unimpressed with and have written about previously, but IBM has also announced what seems to be an oncology assistant at Memorial Sloan-Kettering. There are others, but IBM is not disclosing them yet.

Watson’s ability to gather, represent and use information is pretty stunning. However, Watson seems to learn partly from mistakes (don’t we all?), and some mistakes, as in oncology, can have disastrous results that don’t appear for a long time, during which presumably the same mistakes are made repeatedly. IBM counters that Watson is an advisor, not a doctor, but I wonder what he’s doing at WellPoint, the second-largest for-profit health insurer in the US. A report in Reuters alleged that its Anthem Blue Cross subsidiary improperly singled out women with breast cancer for cancellation of their policies shortly after they were diagnosed. Let’s hope Watson didn’t “learn” to do that.

My guess is that Watson is offered as a “service” to providers (for a fee? I don’t know) to advise on the best treatment approach for patients. Providers are already irritated enough about how their practices are constrained by insurance companies, Medicare/Medicaid and their malpractice insurers; Watson may make things worse (for them). Personally, if Watson’s efforts can make a dent in the poor quality of healthcare in this country (the WHO ranks the US 1st in spending per capita and 26th in quality, just above Bosnia), I’ll be delighted.

Also, I found a glaring contradiction in some presentations. While encouraging companies to use social media data to gain insight, Stephen Gold (Director, Worldwide Marketing, Watson Solutions) was openly disparaging of sites like WebMD or PatientsLikeMe as sources of information for Watson’s forays into medicine. If serious sites like these are not acceptable, how can frivolous sites like Twitter or Facebook be useful? Afterwards I spoke with Gold, a really smart and engaging guy, and he has a pretty open mind after all. I enjoyed that conversation, and it will continue.

One thing we learned long ago in data warehousing is that you can increase the value of your existing data by integrating other sources, even small ones that add to your ability to gain insight. How in the world we’re going to curate that with the mass of data now accessible is a question that will keep us busy for a while.

I hope to dig deeper into the various tools that were alluded to from SPSS, especially things like optimization; in the latter case, I got the impression they meant predictive models that deliver “next best offer” sorts of advice as opposed to scheduling a fleet of trucks or airplanes. We’ll see.

Anyway, it was a good day and a half, though it could have been done in less time. The breakouts sort of rehashed the general sessions and there was never enough time for questions. When you have a roomful of analysts, you should make better provision for that.

A lingering feeling I have, though, is that as these systems are given more and more leeway to make and direct decisions, there has to be agreement that the models underlying those decisions are adequate. James and I were pretty clear that the decisions we wrote about were negligible individually but very important in the aggregate. Much could be gained by being consistent and timely in these decisions, even if a few stakeholders were mistreated. But when decision services are pointed at more important decisions, like your health, your employment or your kids’ placement in programs, you really have to wonder how you can ever determine whether the modelers considered everything, or at least the important things. As humans, we tend to excel at simplified models, not comprehensive ones. It’s a little worrisome. George Box said it best: “All models are wrong. Some are useful.” Too bad Dr. Box didn’t give us guidance on figuring out which ones.


The Fallacy of the Data Scientist Shortage

There is no question that the USA (in fact, most of the world) would be well-served with more quantitatively capable people to work in business and government. However, the current hysteria over the shortage of data scientists is overblown. To illustrate why, I am going to use an example from air travel.

On a recent trip from Santa Fe, NM to Phoenix, AZ, I tracked the various times:

Segment                               Duration (min)   Cumulative (min)
Drive from Santa Fe to ABQ Airport          65               65
Park                                        15               80
Security                                    25              105
Wait to board                               20              125
Boarding process                            30              155
Taxiing                                     15              170
In flight                                   60              230
Taxiing                                     12              242
Deplane                                      9              251
Wait for valet bag                           7              258
Travel to rental car                        21              279
Arrive at destination in Tempe              32              311

As you can see, the actual flying time of 60 minutes represents only 19% of the travel time. Because everything but the flight itself is more or less constant for any domestic trip (disregarding common delays, connections and cancellations, which would skew this analysis even further), this low percentage of time in the air is a reality. For example, if the flight took two hours and fifteen minutes, it would still work out to 135/386 = 35%. The most recent data I have, from 2005, shows the average non-stop distance flown per departure was 607 miles, so we can add about 25 minutes to the first calculation and arrive at 85/336 = 25%.

Keep in mind, again, that these calculations do not account for late departures and arrivals, cancelled and re-booked flights, connections, flight attendants and pilots having nervous breakdowns, etc. It’s safe to say that at most 25% of your travel time is spent in the air. Just for fun, let’s see how this would work out if we could take the (unfortunately retired) Concorde. Flying at Mach 2 would reduce our travel time by about 40 minutes, trimming our journey from five hours and eleven minutes to four hours and 31 minutes, about a 13% improvement.
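For anyone who wants to check the arithmetic, a short Python sketch reproduces the percentages above; the segment durations come straight from the table.

```python
# The arithmetic above, spelled out; figures are taken from the table.
segments_minutes = [65, 15, 25, 20, 30, 15, 60, 12, 9, 7, 21, 32]
total = sum(segments_minutes)                      # 311 minutes door to door
flight = 60
print(round(100 * flight / total))                 # ~19% of the trip in the air

# Longer flight: 135 minutes instead of 60.
print(round(100 * 135 / (total - flight + 135)))   # ~35%

# Average-length flight: add roughly 25 minutes of flying.
print(round(100 * (flight + 25) / (total + 25)))   # ~25%

# Concorde thought experiment: shave about 40 minutes off the trip.
print(round(100 * 40 / total))                     # ~13% improvement
```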

What’s the point of all of this and what does it have to do with the so-called data scientist shortage?

Based on our research at Hired Brains, we find that analysts who work with Hadoop or other big data technologies spend a significant amount of time on work that does not require any knowledge of advanced quantitative methods: configuring and maintaining clusters; writing programs to gather, move, cleanse and otherwise organize data for analysis; and many other common tasks in data analysis. In fact, even those who employ advanced quantitative techniques spend 50-80% of their time gathering, cleansing and preparing data, a percentage that has not budged in decades. Keep in mind that advanced analytics is not a new phenomenon; what is new is the volume (to some extent) and variety of the source data, along with new techniques to deal with it, especially, but not limited to, Hadoop.

The interest in analytics has risen dramatically in the past two or three years; that is not in dispute. But the adoption of enterprise-scale analytics with big data is not guaranteed in most organizations beyond some isolated areas of expertise. Most of the activity is in predictable (commercial) industries – net-based businesses, financial services and telecommunications, for example – but these businesses have employed very large-scale analytics at the bleeding edge of technology for decades. For most organizations, analytics will be provided by algorithms embedded in applications not developed in-house, and by third-party vendors of tools, services and consulting.

The good news is that 80% of the expertise you need for big data is readily available, and the balance can be sourced and developed. The crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government.

Related and unrelated disciplines are all combined under the term analytics: advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. Here is how I characterize the types of analytics by the quantitative techniques used and the level of skill of the practitioners who use them.

 

Type I: Quantitative Research (True Data Scientist)
Quantitative sophistication/numeracy: PhD or equivalent.
Sample roles: Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles.

Type II: (Current definition of) Data Scientist or Quantitative Analyst
Quantitative sophistication/numeracy: Advanced math/statistics, not necessarily a PhD.
Sample roles: Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge.

Type III: Operational Analytics
Quantitative sophistication/numeracy: Good business domain knowledge; background in statistics optional.
Sample roles: Running and managing analytical models. Strong skills in, and/or project management of, analytical systems implementation.

Type IV: Business Intelligence/Discovery
Quantitative sophistication/numeracy: Data- and numbers-oriented, but no special advanced statistical skills.
Sample roles: Reporting, dashboard, OLAP and visualization use, possibly design. Performing posterior analysis of results driven by quantitative methods.

“Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools for classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more toward the quantitative and data-oriented subjects than toward business planning and strategy. The reason is that the term data scientist emerged from businesses like Google or Facebook, where the data is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business that, say, actuaries in the insurance industry possess; their extensive training should be a model for the newly designated data scientists (see my other posts here).
