The Informed Data Lake: Beyond Metadata

The Informed Data Lake Strategy
Executive Summary

Historically, the volume and extent of data available to an enterprise exceeded what its computing resources could store, assemble, analyze and act upon at an affordable cost. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.

But, the economics of data management today allow for the gathering of practically any data, in any form, skipping the engineered schema with the presumption that understanding the data can happen on an as-needed basis.

This newer approach of putting data in one place for later use is now described as a Data Lake. But sooner or later, one has to pay the piper, as this Data Lake approach involves manual, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientist’s time and provides no reference to the content or meaning of the data.

The alternative is the Informed Data Lake. What separates an Informed Data Lake from the static, “dumb” data neatly arranged in a Data Lake is a comprehensive set of capabilities: a graph-based, linked and contextualized information fabric (semantic metadata and linked datasets) into which NLP (Natural Language Processing), Sentiment Analysis, Rules Engines, Connectors, Canonical Models for common domains and cognitive tools can be plugged to turn “dumb” data into information assets with speed, agility, reuse and value.

Today, companies in Pharmaceutical, Life Sciences, Financial Services, Retail, Government Agencies, and many other industries, are seeking ways to make the full extent of their data more insightful, valuable and actionable. Informed Data Lakes are leading the way through graph based data discovery and investigative analytics tools and techniques that uncover hidden relationships in your data, while enabling iterative question and answering of that linked data.

The Problem with a Traditional Data Lake Approach

Anyone tasked with analyzing data for understanding past events and/or predicting the future knows that the data assembled is always “used.” It’s secondhand. Data is captured digitally almost exclusively for purposes other than analysis. Operational systems automate processes and capture data to record the transactions. Document data is stored in formats for presentation and written in flowing prose without obvious structure; it’s written to be read (or just recorded), not mined for later analysis. Clickstream data in web applications is captured and stored in a verbose stream but has to be reassembled for sense making.

Organizations implementing or just contemplating a data lake often operate under the misconception that having the data in one place is an enabler to broader and more useful analytics leading to better decision-making and better outcomes. There is a large hurdle facing these kinds of approaches: while the data may be in one place physically (typically a Hadoop cluster), in essence all that is created is a collection of data silos, unlinked and not useful in a broader context, reducing the data lake to nothing more than a collection of disparate data sources. Here are the issues:

·      Data quality issues: data is not curated. Users have to deal with repeated data, old data and contextually wrong data

·      Data can be in different formats or different languages

·      Governance issues

·      Security issues

·      Companies being misinformed as to who can process and use the data

·      The ever-changing nature of diverse data requires that the processing and analysis of the data be dynamic and evolve as the data changes or as the needs change

The effort of adding meaning and context to the data falls on the shoulders of the analysts, a true time-sink. Data Scientists and Business Analysts are spending 50-80% of their time preparing and organizing their data, leaving as little as 20% of their time for analyzing it.

·      But what if data could describe itself?

·      What if analysts could link and contextualize data from different domains without having to go through the effort of curating it for themselves?

·      What if you had an Informed Data Lake to address all those issues and more?

The Argument for an Informed Data Lake

Existing approaches to curating, managing and analyzing data (and metadata) are mostly based on relational technology, which, for performance reasons, usually simplifies data and strips it of meaningful relationships while locking it into a rigid schema. The traditional approach is to predefine what you want to do with the data, then define the model and the subsequent physical optimizations and subsets for special uses. Context is like poured concrete: fluid at first, but needing a jackhammer to change it once it sets. You use the data as you originally designed it. If changes are needed, going back to redesign and modify is complicated.

The rapid rise of interest in “big data” has spawned a variety of technology approaches to solve, or at least, ease this problem such as text analytics and bespoke applications of AI algorithms. They work. They perform functions that are too time-consuming to do manually but they are incomplete because each one is too narrow, aimed at only a single domain or document type, or too specific in its operation. They mostly defy the practice of agile reuse because each new source, or even each new special extraction for a new purpose, has to start from scratch.

Given these limitations, where does one turn for help?  An Informed Data Lake is the answer.

At the heart of the Informed Data Lake approach is the linking and contextualizing of all forms of data using semantic-based technology. Though descriptions of semantic technology are often complicated, the concept itself is actually very simple:

–       It supplies meaning to data that travels with the data

–       The model of the data is updated on the fly as new data enters

–       The model also captures and understands the relationship between things from which it can actually do a certain level of reasoning without programming

–       Information from many sources can be linked, not through views or indexes, but through explicit and implicit relationships that are native to the model
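
To make the points above concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the triple format, the predicate names (`is_a`, `tests`, `label`) and the sample facts are all assumptions made up for illustration. It shows facts from two sources living in one set, and a relationship that was never stated explicitly being derived by a simple rule rather than by application code.

```python
# Hypothetical mini triple store: every fact is (subject, predicate, object).
# All names and facts here are illustrative assumptions.
triples = {
    ("aspirin", "is_a", "nsaid"),                   # from a vocabulary source
    ("nsaid", "is_a", "drug"),                      # from a vocabulary source
    ("aspirin", "label", "acetylsalicylic acid"),   # meaning travels with the data
    ("trial_17", "tests", "aspirin"),               # from a clinical-trial source
}


def infer_is_a(facts):
    """Return facts plus every implied is_a fact (transitive closure)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        is_a = [(s, o) for s, p, o in facts if p == "is_a"]
        for s1, o1 in is_a:
            for s2, o2 in is_a:
                if o1 == s2 and (s1, "is_a", o2) not in facts:
                    facts.add((s1, "is_a", o2))
                    changed = True
    return facts


def trials_testing(facts, category):
    """Find trials that test anything classified under the given category."""
    members = {s for s, p, o in facts if p == "is_a" and o == category}
    return sorted(s for s, p, o in facts if p == "tests" and o in members)


closed = infer_is_a(triples)
print(trials_testing(triples, "drug"))  # [] because aspirin is not yet known to be a drug
print(trials_testing(closed, "drug"))   # ['trial_17'] via the inferred aspirin is_a drug
```

Adding a new source here means adding more triples to the set; nothing about the query or the rule has to be redesigned.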

Conceptually, the Informed Data Lake is a departure from the earliest principle of IT: Parsimonious Development derived from a mindset of managing from scarcity[1] and deploying only simplified models. Instead of limiting the amount of data available, or even the access to it, Informed Data Lakes are driven by the abundance of resources and data. Semantic metadata provides the ability to find, link and contextualize information in a vast pool of data.

The Informed Data Lake works because it is based on a dynamic semantic model built on graph-driven ontologies. In technical terms, an ontology represents the meaning and relationships of data in a graph, an extremely compact and efficient way to define and use disparate data sources via semantic definitions based on business usage, including terminology and rules that can be managed by business users:

•       Source data, application interfaces, operational data, and model metadata are all described in a consistent ontology framework supporting detailed semantics of the underlying objects. This means constraints on types, relations, and description logic, for example, are handled uniformly for all underlying elements.

•       The ontology represents both schema and data in the same way. This means that the description of metadata about the sources also represents a machine-readable way of representing the data itself for translation, transmission, query, and storage.

•       An ontology can richly describe the behavior of services and composite applications in a way that a relational model can only do by being tightly bound to the application’s logic.

•       The ontology is a run-time model, not just a design-time model. The ontology is used to generate rules, mappings, transforms, queries, and UI because all of the elements are combined under a single structure.

•       There is no reliance on indexes, keys, or positional notation to describe the elements of the ontology. Implementations do not break when local changes are made.

•       An ontological representation encourages both top-down, conceptual description and bottom-up, source- or silo-based representation of existing data. In fact, these can be in separate ontologies and easily brought together.

•       The ontology is designed to scale across users, applications, and organizations. Ontologies can easily share elements in an open and standard way, and ontology tools (for design, query, data exchange, etc.) don’t have to change in any way to reference information across ontologies.
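
The list above can be illustrated with a small Python sketch, under stated assumptions: the predicate names (`type`, `domain`, `same_as`), the CRM field name and the sample record are hypothetical, and a real ontology language offers far more than this. The sketch shows schema statements and instance data stored in the same form, a top-down ontology and a bottom-up source ontology combined by a plain set union, and a class being inferred for a record without any keys or indexes.

```python
# Top-down conceptual ontology: schema-level statements (hypothetical names).
conceptual = {
    ("Customer", "type", "Class"),
    ("hasEmail", "domain", "Customer"),   # anything with hasEmail is a Customer
}

# Bottom-up ontology derived from one source system (a made-up CRM).
crm_source = {
    ("crm.contact_email", "same_as", "hasEmail"),      # mapping statement
    ("c42", "crm.contact_email", "ann@example.com"),   # instance data
}

# Bringing the two ontologies together is a set union of statements.
merged = conceptual | crm_source


def classify(facts):
    """Infer each subject's class from the domain of the properties it uses."""
    same_as = {s: o for s, p, o in facts if p == "same_as"}
    domain = {s: o for s, p, o in facts if p == "domain"}
    classes = {}
    for s, p, o in facts:
        prop = same_as.get(p, p)   # translate source vocabulary if mapped
        if prop in domain:
            classes[s] = domain[prop]
    return classes


print(classify(crm_source))  # {} because the source alone carries no concept of Customer
print(classify(merged))      # {'c42': 'Customer'}
```

Because schema and data share one representation, the same function reads both; that is the property the bullets describe, shown here only in miniature.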

Assuming a data lake is built for a broad audience, it is likely that no one party will have the complete set of data they think is of interest. Instead, it will be a union of all of those ideas, plus many more that arise as things are discovered, situations evolve and new sources of data become available. Thinking in the existing mode of database schema design, inadequate metadata features of Hadoop and just managing from scarcity in general, will fail under the magnitude of this effort. What the Informed Data Lake does is take the guesswork out of what the data means, what it’s related to and how it can be dynamically linked together without endless data modeling and remodeling.

All of the features and capabilities below are needed to keep a data lake from turning into a data swamp, where no one quite knows what it contains or if it is reliable.

Informed Data Lake Features:

·      Connectors to practically any source

·      Graph based, linked and contextualized data

·      Dynamic Ontology Mapping

·      Auto-generated conceptual models

·      Advanced Text Analytics

·      Annotation, Harmonization and Canonicalization

·       “Canonical” models to simplify ingest and classifying of new sources

·      Semantics querying and data enrichment

·      Fully customizable dashboards

·      Full data provenance adhering to IT standards

Sample Informed Data Lake Capabilities:

·      Manage business vocabulary along with technical syntax

·      Actively resolve differences in vocabulary across different departments, business units and external data

·      Support consistent assignment of business policies and constraints across various applications, users and sources

·      Accurately reflect all logical consequences of changes and dynamically reflect change in affected areas

·      Unify access to content and data

·      Assure and manage reuse of linked and contextualized data

Any vendor providing metadata based on semantic technology is in a unique position to provide the capabilities required to build and deploy the Informed Data Lake. Such an offering is based on open standards and takes a semantic approach from the beginning. In addition, it incorporates a very rich tool set that includes dozens of third-party applications that operate seamlessly within the Informed Data Platform. This is central to the ability to move from data integration and data extraction to more advanced knowledge integration and knowledge extraction, without which it is impossible to fuel solutions in the areas of competitive intelligence, insider trading surveillance, investigatory analytics, Customer 360, and risk and compliance, as well as feeding existing BI applications (a requirement that is not going away anytime soon).

An Informed Data Lake Solution

The specific design pattern of the Informed Data Lake enables data science because analytics does not end with a single hypothesis test. Simple examples of “Data Scientists” building models on the data lake and saving the organization vast sums of money make good copy, but they do not represent what happens in the real world.

Often, the first dozen hypotheses are either obvious or non-demonstrable. When the model characterization comes back it presents additional components to validate and cross correlate. It is this discovery process that the data lake somehow needs to facilitate, and it needs to facilitate it well, otherwise the cost of the analytics is too high and the process is too slow to realize business value.

Enabling that continuous improvement process of deep analytics requires more than a data strategy; it needs a tool chain for model refinement, and the best-known method to date is the Informed Data Lake. The significant pain point for deep analytics is refinement. The lower the refinement costs are, the more business value can be extracted.

At some point you may have heard the criticism of BI and OLAP tools that you were constrained to the questions that were implicit in their models. In fact, the same criticism has been leveled at data warehouses.  The fact remains that both data warehouses and BI tools limit your questions to those that can be answered, not just with the available data, but how it is arranged physically and how well the query optimizer can resolve the query.

Now imagine what is possible if you could ask any question of the data in a massive data lake? This is where the Informed Data Lake comes into play.

Catalog capabilities allow for massive amounts of metadata and instantaneous access to it. Thus any user (or process) can “go shopping” for a dataset that interests them. Because the metadata is constructed in the form of an in-memory graph, linking and joining data that is of far different structures and perhaps never linked before, can be done instantaneously.

On a browser-like interface, the graph can show you not only the typical ways different data sets can be linked and joined, it can even recommend other datasets that you haven’t considered.

Once data is selected, the in-memory graph processing analyzes and traverses its structure to provide the instantaneous joins that would be impossible in a relational database. The net result is that arbitrarily complex models and tools can ask any question with unlimited joins as a result of processing optimized for multi-core CPUs, very large memory models and fast interconnect across processing nodes.
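
As a rough illustration of the catalog idea described above, the following Python sketch builds a tiny in-memory metadata graph and walks it. Everything in it is an assumption for the sake of the example: the dataset names, the field names and the rule that two datasets are "linkable" when they share a field name. A production catalog would use semantic metadata rather than name matching.

```python
from collections import deque

# Hypothetical catalog: dataset name -> set of field names it exposes.
catalog = {
    "trials":    {"patient_id", "trial_id", "drug_code"},
    "adverse":   {"patient_id", "event_code"},
    "drugs":     {"drug_code", "compound"},
    "suppliers": {"compound", "vendor_id"},
}


def build_graph(cat):
    """Link two datasets whenever they share at least one field name."""
    graph = {name: {} for name in cat}
    names = sorted(cat)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = cat[a] & cat[b]
            if shared:
                graph[a][b] = shared
                graph[b][a] = shared
    return graph


def join_path(graph, start, goal):
    """Breadth-first search for the shortest chain of linkable datasets."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def recommend(graph, chosen):
    """Suggest datasets directly linkable to the chosen ones but not yet selected."""
    return sorted({n for c in chosen for n in graph[c]} - set(chosen))


graph = build_graph(catalog)
print(join_path(graph, "adverse", "suppliers"))  # ['adverse', 'trials', 'drugs', 'suppliers']
print(recommend(graph, ["trials"]))              # ['adverse', 'drugs']
```

The join path between adverse events and suppliers was never declared anywhere; it falls out of traversing the metadata graph, which is the "go shopping" experience in miniature.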

Informed Data Lake in Action

Pharma R&D Intelligence:

Clinical trials involve great quantities of data from many sources, a perfect problem for an Informed Data Lake. The Informed Data Lake allows the loading, unification and ingestion of the data without knowing a priori what analytics would be needed. In particular, evaluating drug response would link many sources of data following participants with severity and occurrence of adverse drug reaction, across multiple trials, as well as unknown other classes of data.

Clinical trial data investigators and analysts can see the value of the graph based approach with the linking and contextualization they could not do otherwise.  They see many benefits including:

·      Identifying patients for enrollment based on more substantive criteria

·      Monitoring in real time to identify safety or operational signals

·      Blending data from physicians and CROs (contract research organizations)


Insider Trading and Compliance Surveillance:

In the financial services space, the combination of deep analysis of large datasets with targeted queries of specific events and people gives analysts and regulators an opportunity to catch wrongdoing early.

·      Identify an employee who has an unusually high level of suspicious trading activity.

·      Spot patterns in which certain employees have histories of making the exact same trades at the exact same times.

·      Compare employees’ behaviors to their past histories, and spot situations where employees’ trading patterns make sudden, drastic changes
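
The second pattern in the list above, employees repeatedly making the same trades at the same times, can be sketched in a few lines of Python. This is only an illustration: the record layout, the names, the sample trades and the `min_matches` threshold are hypothetical, and real surveillance would use time windows and far richer context.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical trade records: (employee, symbol, side, timestamp).
trades = [
    ("alice", "XYZ", "BUY",  "2015-03-02T09:31"),
    ("bob",   "XYZ", "BUY",  "2015-03-02T09:31"),
    ("alice", "QRS", "SELL", "2015-03-09T14:05"),
    ("bob",   "QRS", "SELL", "2015-03-09T14:05"),
    ("carol", "XYZ", "BUY",  "2015-03-02T11:00"),
    ("alice", "LMN", "BUY",  "2015-03-10T10:00"),
]


def matching_pairs(records, min_matches=2):
    """Count, per pair of employees, how often they made an identical trade."""
    by_trade = defaultdict(set)
    for employee, symbol, side, timestamp in records:
        by_trade[(symbol, side, timestamp)].add(employee)

    counts = defaultdict(int)
    for employees in by_trade.values():
        for pair in combinations(sorted(employees), 2):
            counts[pair] += 1

    # Keep only pairs whose identical trades recur often enough to be a pattern.
    return {pair: n for pair, n in counts.items() if n >= min_matches}


print(matching_pairs(trades))  # {('alice', 'bob'): 2}
```

In graph terms, each identical trade is an edge between two employees, and the count is the edge weight; heavily weighted edges are the ones worth an analyst's attention.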


Making sense of data lakes takes discipline because a one-off approach will drain your best resources of time and patience. The Informed Data Lake approach, complete with a suite of NLP, AI, graph-based models and semantic technology, is the sensible approach. Your two most expensive assets are staff and time. The Informed Data Lake allows you to do your work faster and cheaper, with more flexibility and greater accuracy, which has a major impact on your business. Without the Informed Data Lake, the data is a bewildering collection of pieces that analysts and data scientists can only understand in small pieces, diluting the value of the data lake.

The whole extended fabric of an ontology solution and its ability to plug in third-party abilities collapses many layers of logical and physical models in traditional data warehousing/business intelligence architectures into a single model. With the Informed Data Lake approach, tangible benefits accrue:

·      Widespread understanding of the model across many domains in the organization

·      Rapid implementation of new studies and applications by expanding the model, not re-designing it (even small adjustments to relational databases involve development at the logical, physical and downstream models, with time-consuming testing).

·      Application of Solution Accelerators that provide bundled models by industry/application type that can be modified for your specific need

·      “Data Democratization” making data available to users across the organization for their own data discovery and analytic needs, extracting greater value from the data

·      Discovering hidden patterns in relationships, something not possible with the rotational and drill-down capabilities of BI tools

·      The ability for iterative question and answering, continuous data discovery and run time analytics across huge amounts of data and, more importantly, linked data from sources not typically associated previously

In conclusion, the Informed Data Lake turns a disparate collection of data sources of unknown origin, quality and currency into a facility for almost limitless exploration and analysis.

[1] Managing from scarcity has historically driven IT to develop and deploy using the least amount of computing resources under the assumption that these resources were precious and expensive. In the current computing economy, the emphasis has shifted away from scarcity of hardware to scarcity of time and attention of knowledge workers.

Karl Popper versus Data Science

I’m sure you’ve heard of Big Data and IoT (Internet of Things) by now. There is a current in computing now that is based on the economics of nearly unlimited resources for computational complexity, including Cognitive Computing (AI + Machine Learning). From this, many are seeing the “end of science,” meaning the truth is in the data and the scientific method is dead. Previously, a scientist might observe certain phenomena, come up with a theory and test it. He is a counter example.

Using algorithms from Topology (yeah, I studied topology in the 70’s) investigators can apply TDA (Topological Data Analysis) to investigate the SHAPE of very complex, very high-volume, very high-dimensional data (1000’s of variables), deform it in various ways to see what its true nature is and find out what’s really going on. Traditional quantitative methods can only sample or reduce the variables using techniques like Principal Component Analysis (these variables don’t seem very important).

In one case, an organization did a retrospective analysis of every single trial and study on spinal cord injuries. What they found with TDA was that one and only one variable had a measurable effect on outcomes with patients presenting with SCI: maintaining normal blood pressure as soon as they hit the ambulance. No one had either seen or even contemplated this before.

Karl Popper was one of the most important and controversial philosophers of science of the 20th century. In “All Life is Problem Solving,” Popper claimed that “Science begins with problems. It attempts to solve them through bold, inventive theories. The great majority of theories are false and/or untestable. Valuable, testable theories will search for errors. We try to find errors and to eliminate them. This is science. It consists of wild, often irresponsible ideas that it places under the strict control of error correction.”

In other words, hypothesis precedes data. We decide what we want to test, and assemble the data to test it. This is the polar opposite of the data science emerging from big data.

So here’s my premise. Is Karl Popper over? Has computing killed the scientific method?


Miscellaneous Ramblings Today on Data and Analytics

Here are some ideas off the top of my head:

1. The Big Data Analytics industry (vendors, journalists, industry analysts) has flooded the market with messages as if no one ever used quantitative methods before

2. Because most of the content you see is generated by people who don’t actually use quantitative methods, it is:
– focused on technology
– full of the same use cases such as up-sell/cross-sell, churn, fraud, etc.

3. The real opportunity with Big Data and its attendant technologies is to get a richer understanding of those phenomena that are important to you

4. The rise of Data Science and Scientists is the invention of practitioners from the digital giants and not terribly relevant to most companies

5. Ultimately the benefit of Big Data Analytics will be better decisions born of better decision-making processes, not just informing people of findings. This was the weak point of BI: it was too passive. Operational Intelligence and Decision Automation are key

6. All of this is possible because of the radically different analytical architectures and open source tools that are available in a variety of cloud-based topologies

7. Many business analysts have the background to use advanced analytical tools, provided the tools get better at guiding and advising.

8. The industry can’t continue without better tools. Big Data is a giant time sink. We’re seeing lots of interesting products emerge, many are open-source, to lubricate the whole data management and analytic spectrum

9. As always, finding a way for business units and IT to cooperate and work productively is still a problem.

10. Existing operational systems are either based on relational database technology or even older systems written in COBOL and other 2nd-generation languages. Capturing information in these systems is like fitting a square peg in a round hole. New database systems, the so-called NoSQL tools, offer abundant opportunities to capture and use rich information. One example, graph databases, is brilliant at finding hidden relationships to expose concentration risk or fraud.

11. I’ve built a few Bayesian Belief Networks recently. What I learned is that they can get computationally expensive, can perform poorly on high-dimensional data and can produce models that are hard to interpret. On the other hand, they offer the ability to get to causation, not just correlation. Better to build from data and/or simulation.

Pervasive Analytics: Needs Organizational Change, Better Software and Training

By Neil Raden

Principal Analyst, Hired Brains Research, LLC

May, 2015

The hunt for data scientists has reached its logical conclusion: There are not enough qualified ones to go around. The pull for analytics as a result of a number of factors, including big data and the march of Moore’s Law, is irresistible. As a result, industry analysts, software providers and other influencers are turning to the idea of the “democratization of analytics” as a solution. At Hired Brains, we believe this is not only a good idea (and have been writing and speaking about it for four years), but that it is inevitable. Unfortunately, simply turning business analysts loose on quantitative methods is an unworkable solution. As the title says, three things that are not currently in place need to be: organizational change, better software and training/mentoring for sustained periods.

Some Background

From the middle of the twentieth century until nearly its end, computers in business were mostly consumed with the process of capturing operational transactions for audit and regulatory purposes. Reporting for decision-making was repetitive and non-interactive. Some interactivity with the computer began to emerge in the eighties, but it was applied mostly to data input forms. By the end of the century, mostly as a result of the push from personal computers, tools for interacting with data, such as Decision Support Systems, reporting tools and Business Intelligence, allowed business analysts to finally use the computing power for analytical, as opposed to operational, purposes.

Nevertheless, these tools were under constant stress because of the cost and scarcity of computing power. The repository of the data, mostly data warehouses, dwarfed the size of the operational systems that fed them. As BI software providers pressed for “pervasive BI,” so that a much broader group of people in the organization would actively use the tools (and the vendors would sell more licenses of course), the movement met resistance from three areas: 1) physical resources (CPU, RAM, Disk), 2) IT concerns that a much broader user community would wreak havoc with the established security and control and 3) people themselves who, beyond the existing users, showed little interest in “self-service” so long as there were others willing to do it for them.

In 2007, Tom Davenport published his landmark book, “Competing on Analytics,” and suddenly, every CEO wanted to find out how to compete on analytics. Beyond the more or less thin advice about why this was a good idea, the book was actually anemic when it came to providing any kind of specific, prescriptive advice on transforming an organization to an “analytically-driven” one.

Analytics Mania

Fast forward to 2015 and analytics has morphed from a meme to a mania. Pervasive BI is a relic not even discussed, but pervasive analytics or, more recently, the “democratization of analytics” is widely held to be the salvation of every organization. Granted, two of the three reasons pervasive BI failed to ignite are no longer an issue in this era of big data and Hadoop, but the third, the people, looms even larger: 1) people still are not motivated to do work that was previously done by others and 2) an even greater problem, the academic prerequisites to do the work are absent in the vast majority of workers. Pulling a Naïve Bayes or C4.5 icon over some data and getting a really pretty diagram or chart is dangerous. Software providers are making it terrifyingly easy for people to DO advanced quantitative analysis without knowing what they are doing.

Pervasive analytics? It can happen. It will happen, it’s inevitable and even a good idea, but most of the messaging about it has been perilously thin, from Gartner’s “Citizen Data Scientists” to Davenport’s “Light Quants” (who would ever want to be a “light” anything?). What is lacking is some formality about what kind of training organizations need to commit to, what analytical software vendors need to do to provide extraordinarily better software for neophytes to use productively and how organizations need to restructure for all of this to be worthwhile and effective.

How to Move to Pervasive Analytics

For “pervasive analytics” or “the democratization of analytics” to be successful, it requires much more than just technology. Most prominent is a lack of training and skills on the part of the wide audience that is expected to be “pervaded” if you will. The shortage of “data scientists” is well documented, which is the motivation for pushing advanced analytics down in the organization to business analysts. The availability of new forms of data provides an opportunity to gain a better understanding of your customers and business environment (among a multitude of other opportunities), which implies a need to analyze data at a level of complexity beyond current skills, and beyond the capabilities of your current BI tools.

Much work is needed to develop realistic game plans for this. In particular, our research at Hired Brains shows that there are three critical areas that need to be addressed:

  • Skills and training: A three-day course is not sufficient and organizations need to make a long-term commitment to the guiding of analysts
  • Organizing for pervasive analytics: Existing IT relationships with business analysts need reconstruction and senior analysts and data scientists need to supervise the roles of governance, mentoring and vetting
  • Vastly upgraded software from the analytics vendors: In reaction to this rapidly unfolding situation, software vendors are beginning to provide packaged predictive capabilities. This raises a whole host of concerns about casually dragging statistical and predictive icons onto a palette and almost randomly generating plausible output that is completely wrong.

Skills and Training

Of course it’s unrealistic to think that existing analysts who can build reports and dashboards will learn to integrate moment generating functions and understand the underlying math behind probability distributions and quantitative algorithms. However, with a little help (a lot actually) from software providers, a good man-machine mix is possible where analysts can explore data and use quantitative techniques while being guided, warned and corrected.

A more long-term problem is training people to be able to build models and make decisions based on probability, not a “single version of the truth.” This process will take longer and require more assistance from those with the training and experience to recognize what makes sense and what doesn’t. Here is an example:

[Chart: a stock market index plotted alongside the number of media mentions of Jennifer Lawrence over time]

The chart shows a correlation between a stock market index and the number of times Jennifer Lawrence was mentioned in the media. Not shown, but the correlation coefficient is a robust 0.80, which means the variables are tightly correlated. Be honest with yourself and think about what could explain this? After you’ve thought about a few confounding variables, did you consider that they are both slightly increasing time series, which is actually the basis of the correlation, not the phenomena themselves? Remove the time element and the correlation drops to almost zero.

The point here is one doesn’t need to understand the algorithms that create this spurious correlation, they just need enough experience to know that you have to filter out the effect of the time series. But how would they know that?
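
The effect is easy to reproduce. The Python sketch below uses synthetic data, not the data behind the chart: two series that share nothing except an upward drift over time. The slopes, the noise levels and the seed are arbitrary assumptions, and first differencing (looking at period-to-period changes) is just one common way to take the time element out.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def differences(xs):
    """Period-to-period changes, which removes a steady trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 200

# Two unrelated series: each is a rising trend plus its own independent noise.
index = [0.5 * t + rng.gauss(0, 5) for t in range(n)]
mentions = [0.3 * t + rng.gauss(0, 5) for t in range(n)]

raw = pearson(index, mentions)
detrended = pearson(differences(index), differences(mentions))

print(f"correlation of the raw series:     {raw:.2f}")
print(f"correlation after differencing:    {detrended:.2f}")
```

The raw correlation comes out high because both series rise with time; once the changes are compared instead of the levels, little or nothing is left. A tool could run exactly this kind of check before presenting a correlation to an analyst.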

The fact is that making statistical errors is far more insidious than spreadsheet or BI errors when underlying concepts are hidden. Turning business analysts into analytical analysts is possible, but not automatic.

Consider how actuaries learn their craft. Organizations hire people with an aptitude for math, demonstrated by doing well in things like Calculus and Linear Algebra, but not requiring a PhD. As they join an insurance or reinsurance or consulting organization, they are given study time at work to prepare for the exams, a process that takes years, and have ample access to mentors to help them along because the firm has a vested interest in them succeeding. Being an analyst in a firm is a less extensive learning process, but the model still makes sense.

Organizational: How organizations should deal with DIY analytics

We’re just beginning our research in this area, but one thing is certain: the BI user pyramid has got to go. In many BI implementations, the work fell onto the shoulders of BI Competency Centers to create datasets, while a handful of “power users” worked with the most useful features of the toolsets. The remainder of the users, dependent on the two tiers above them, generated simple reports or dashboards for themselves or departments (an amusing anecdote from a client of ours was, “The most used feature of our BI tool was ‘Export to Excel.’”) Creating “Pervasive BI” would have entailed doing a dead lift of the “business users” into the “power user” class, but no feasible approach was ever put forward.

Pervasive analytics cannot depend on the efforts of a few “go-to guys.” It has to evolve into an analytically centered organization where a combination of training and better software can be effective. That involves a continuing commitment to longer-term training and learning; governance of models, so that models developed by professional business analysts can be monitored and vetted before finding their way into production; and a wholesale effort to change the analytics workflow: where do these analyses go beyond the analyst?

Expectations from Software Providers

Packaged analytical tools are sorely lacking in advice and error catching. It is very easy to take an icon and drop it on some data, and the tools may offer some cryptic error message or, at worst, the “help” system displays 500 words from a statistics textbook to describe the workings of the tool. But this is 2015 and computers are a jillion times more powerful than they were a few years ago. It will take some hard work for the engineers, but there is no reason why a tool should not be able to respond to its use with:

  • Those parameters are not likely to work in this model; why don’t you try these
  • Hey, “Texas Sharpshooter,” you drew the boundaries around the data to fit the category model
  • I see you’re using a p-value but haven’t verified that the distribution is normal. Shall I check for you?

We will be continuing our research in the areas of skills/training, organization and software for Pervasive Analytics. Please feel free to comment at


Relational Technologies Under Siege: Will Handsome Newcomers Displace the Stalwart Incumbents?

Published: October 16, 2014
Analyst: Neil Raden
After three decades of prominence, Relational Database Management Systems (RDBMS) are being challenged by a raft of new technologies. While RDBMS enjoy a position of incumbency, newer data management approaches are benefiting from a vibrancy powered by the effects of Moore’s Law and Big Data. Hadoop and NoSQL offerings were designed for the cloud, but are finding a place in enterprise architecture. In fact, Hadoop has already made a dent in the burgeoning field of analytics, previously the realm of data warehouses and analytical (relational) platforms.

• RDBMS are overwhelmed by new forms of data (so-called “big data”), including text, documents, machine-generated streams, graphs and others, but are counter-attacking with new development and features as well as acquisitions and partnerships
• Non-relational platform vendors assert that the relational model itself is too rigid and expensive for the explosion of information
• A fundamental drawback in RDBMS technology is the tight coupling of the storage, metadata and parser/optimizer layers that cannot take advantage of the separate storage and compute capabilities of Hadoop
• Advances in technology are not the key differentiators between RDBMS tools and Hadoop/Big Data NoSQL offerings. Requirements are. The continuing enterprise need for quality, integrated information and a “single version of the truth” argues for existing and enhanced relational data warehouses over the “good enough” mentality of cloud-based and Hadoop efforts that were developed for large internet companies. These requirements are the key identifying differences between the analytical approaches
• The “new-new” is pretty exciting, but there is a rush to provide true SQL access to many of these platforms, an admission that the relational calculus will endure
• Desirable features of RDBMS will migrate to the distributed processing of Hadoop, but only once Hadoop solves its shortcomings in security, workload management and operability. Born-in-the-cloud SaaS applications built on NoSQL databases (even some yet to emerge) will operate seamlessly on this platform, but not for 3-5 years
• Surveys of “revenue intention” for new technology spending are misleading; only 15% of companies surveyed are using Hadoop, and many are experiments.

• Recognize that RDBMS, Hadoop and NoSQL databases have vastly different purposes, capabilities, features and maturity
• When contemplating a move from an Enterprise Data Warehouse and/or on-premise ETL, take the long view of the effort, cost and disruption
• Determine exactly what your RDBMS vendor is planning for supporting “hybrid” environments because, for the time being, it will have an effect on the downstream activities of analytics
• There are many use cases for NoSQL/Big Data that are compelling and you should carefully consider them. In general, they go beyond your existing Data Warehouse/BI but are not necessarily a suitable replacement. In two years this will likely change.
• Go slow and do not throw away the baby with the bath water. The best approach is to experiment with a “skunk works” project or two to get a feel for whether the approach is right for your organization. Beyond that, design a careful Proof of Concept (PoC) that can actually “prove” your “concept.” Vendors tend to insert requirements and features that favor their product, which can derail the validity of the PoC.

Relational database technology was adopted by the enterprise for its ability to host transactional/operational applications. By the late 1980s, vendors posted benchmarks of transactions/second that exceeded those of the purely proprietary databases, with the added benefit of an abstracted language, SQL, that allowed different flavors of databases to be designed, queried and maintained without the effort of learning a new proprietary language for each one.
Later, as the need grew for more careful data management for reporting and analytics, RDBMS were pressed into service as data warehouses, a role for which they were not well-suited in terms of scale and especially speed of complex queries and large table joins. This need was met in a number of ways, to some degree, but it took time.
This is precisely where we see Hadoop today, a tool that was built primarily to support search and indexing of unruly data on the Internet. However, its advantages in terms of cost and scale are so compelling that it is quickly being pressed into service as an enterprise analytics platform, even though it is sorely lacking in some features that data warehouses and analytical platforms (like Vertica, Netezza, Teradata etc.) already possess.
The trend for distributors of Hadoop is to claim that relational data warehouses are obsolete, or at best artifacts that have some enduring value. Curiously, with all of the attendant deficiencies of RDBMS in their view, they are mostly mute about RDBMS for transactional purposes, but that is likely to change.
Relational vendors are at work to put in place reference architectures (and products to support them) that are hybrid in nature. A term emerging is “polyglot persistence,” the ability of the first mover in an analytical query to parse and distribute pieces of the query to the logical location of the data and, preferably, the compute engine for that data without having to bulk-load data and persist it to answer a question. The concept is similar to federating queries, but much more powerful as a federation scheme usually involves design of a reference schema and assembling and transforming the data into a single place to satisfy the query. In a hybrid architecture, there are actually multiple storage locations (even in-memory) and compute resources working in a cooperative fashion. This arrangement preserves the RDBMS as the origin of analytical queries and provider of the answer set and simplifies the maintenance and orchestration of downstream processes, especially analytical, visualization and data discovery.
RDBMS were mostly row-oriented, given their OLTP orientation, but some adopted a column-orientation, the most visible being Sybase IQ. In the past few years, it became obvious that analytical applications would be better served by a columnar orientation, and products like Vertica emerged that combined it with a highly scalable MPP architecture. But today, there is an explosion of new databases of many types, such as (a sampling, not comprehensive):
• Column: Accumulo, Cassandra, HBase
• Wide Table: MapR-DB, Google BigTable
• Document: MongoDB, Apache CouchDB, Couchbase
• Key Value: Dynamo, FoundationDB, MapR-DB
• Graph: Neo4J, InfiniteGraph and Virtuoso
Keep in mind that none of these database systems are “general purpose”; most require programming interfaces and lack the kind of management and administrative features that IT departments demand.

The explosion in database technology was inevitable as the effects of Moore’s Law caused a discontinuous jump in the flow and processing of information. Technology, however, is always a step ahead of business. The implementation of enterprise applications, information management and processing platforms is a carefully woven fabric that does not bear rapid disruption (unless, of course, that is the enterprise’s strategy). “Big data” can provide enormous benefits to organizations, but not all of them. Many will find it preferable to rely on third parties to prepare and even interpret big data for them. For those that see a clear requirement, it is wise to consider the whole playing field and how the insights gained will find purchase and value. As Peter Drucker said, “Information is data that has meaning and purpose.”



Miscellaneous Ramblings about Decision Making

By Neil Raden

Decision-making is not, strictly speaking, a business process. Attacking the speed problem for decision-making, which is mostly a collaborative and iterative effort, requires looking at the problem as a team phenomenon. This is especially true where decision-making requires analysis of data. Numeracy, a facility for working with numbers and programs that manipulate numbers, exists at varying levels in an organization. Domain expertise similarly exists at multiple levels, and most interesting problems require contributions and input from more than one domain. Pricing, for example, is a joint exercise of marketing, sales, engineering, production, finance and overall strategy. If there are partners involved, their input is needed as well. The killers of speed are handoffs, uncertainty and lack of consensus. In today’s world, an assembly line process of incremental analysis and input cannot provide the throughput to be competitive. Team speed requires that organizations break down the barriers between functions and enable information to be re-purposed for multiple uses and users. Engineers want to make financially informed technical decisions and financial analysts want to make technically informed economic decisions.

That requires analytical software and an organizational approach that is designed for collaboration between people of different backgrounds and abilities.

All participants need to see the answer and the path to the answer in the context of their particular roles. Most analytical tools in the market cannot support this kind of problem-solving. The urgency, complexity and volume of data needed overwhelm them, but more importantly, they cannot provide the collaborative and iterative environment that is needed. Useful, interactive and shareable analytics can, with some management assistance, directly affect decision-making cycle times.

When analysis can be shared, especially through software agents called guides that allow others to view and interact with a stream of analysis instead of a static report or spreadsheet, time-eating meetings and conferences can be shortened or eliminated. Questions and doubts can be resolved without the latency of scheduling meetings. In fact, guides can even eliminate some of the presentation time in meetings, as everyone can satisfy themselves beforehand by evaluating the analysis in context, not just poring over results and summarizations.

Decision making is iterative. Problems or opportunities that require decisions often aren’t resolved completely, but return, often slightly reframed. Karl Popper taught that in all matters of knowledge, truth cannot be verified by testing, it can only be falsified. As a result, “science,” which we can broadly interpret to include the subject of organizational decision-making, is an evolutionary process without a distinct end point. He uses the simple model below:

PS(1) -> TT(1) -> EE(1) -> PS(2)

Popper’s premise was that ideas pass through a constant set of manipulations that yield solutions with better fit, but not necessarily final solutions. The initial problem specification PS(1) yields a number of Tentative Theories TT(1); Error Elimination EE(1) then produces a reformulated problem specification, PS(2), and the process repeats. The TT and EE steps are clearly collaborative.

The overly-simplified model that is prevalent in the Business Intelligence industry is that getting better information to people will yield better decisions. Popper’s simple formulation highlights that this is inadequate – every step from problem formulation, to posing tentative theories to error elimination in assumptions and, finally, reformulated problem specifications requires sharing of information and ideas, revision and testing. One-way report writers and dashboards cannot provide this needed functionality. Alternatively, building a one-off solution to solve a single problem, typically with spreadsheets, is a recurring cost each time it comes around.


I’m Getting Convinced About Hadoop, sort of

As I sometimes do, I went to Boulder last week to soak up some of Claudia and Dave Imhoff’s hospitality and to sit in on a BBBT (Boulder BI Brain Trust) briefing in person instead of remotely like most of us do. The company this particular week was Cloudera and I wanted to not only listen to their presentations, but participate in the Q&A give and take as well as have more intimate conversations at dinner the night before. Despite the fact it took eight hours to drive there from Santa Fe (but only six back), it was clearly worth the effort. I certainly enjoyed meeting all the Cloudera people who came, but since this article is about Hadoop, not Cloudera, I’ll skip the introductions.

A common refrain from any Hadoop vendor (the term vendor is a little misleading because the open source Hadoop is actually free) is that Hadoop, almost without qualification, is a superior architecture for analytics over its predecessor, the relational database management system (RDBMS), and its attendant tools, especially ETL (Extract/Transform/Load, more on that below). Their reasoning for this is that it is undeniably cheaper to load Hadoop clusters with gobs of data than it is to expand the size of a licensed enterprise relational database. This is across the board: server costs, RAM and disk storage. The economics are there, but only compelling when you overlook a few variables. Hadoop stores three copies of everything, data can’t be overwritten, only appended to, and most of the data coming into Hadoop is extremely pared down before it is actually used in analysis. A good analogy would be that I could spend 30 nights in a flophouse in the Tenderloin for what it would cost for one night at the Four Seasons.

But I did say I was getting convinced about Hadoop, so be patient.

A constant refrain from the Hadoop world is that it is difficult and time-consuming to change a schema in an RDBMS, but Hadoop, with its “schema on read” concept, allows for instantaneous change as needed. Maybe not intentionally, but this is very misleading. What is hard in an RDBMS is making changes to an application such as a DW with upstream and downstream dependencies. I can make a change to a database in two seconds. I can add non-key attributes to a Data Warehouse dimension table in an instant. But changing a shared, vetted, secure application is, reasonably, not an instantaneous thing, which illustrates something about the nature of Hadoop applications: they are not typically shared applications. Often, they are not even applications, so this comparison makes no sense. Instead, it illustrates two very important qualities of RDBMSs and Hadoop.
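
As a rough illustration of the “two seconds” claim, the sketch below uses Python’s built-in sqlite3 module with an in-memory database. The table `dim_customer` and its columns are hypothetical, and SQLite merely stands in for an enterprise RDBMS; the point is only that the schema change itself is a single statement.

```python
import sqlite3

# In-memory database standing in for a warehouse (an assumption for the sketch).
conn = sqlite3.connect(":memory:")

# Hypothetical dimension table with one existing row.
conn.execute(
    "CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute("INSERT INTO dim_customer (customer_key, name) VALUES (1, 'Acme')")

# Adding a non-key attribute is one statement; existing rows get NULL for it.
conn.execute("ALTER TABLE dim_customer ADD COLUMN region TEXT")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(dim_customer)")]
rows = conn.execute(
    "SELECT customer_key, name, region FROM dim_customer"
).fetchall()

print(columns)
print(rows)
conn.close()
```

Nothing in that statement touches the ETL jobs, reports and access rules that depend on the table, which is where the real time goes.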

One more item about this “hard to change” charge. Hadoop is composed of the file system, HDFS, and the programming framework, MapReduce. When Hadoop vendors talk about the flexibility and scalability of Hadoop, they are talking about this core. But today, the Hadoop ecosystem (and this is just the Apache open source stuff; there is an expanding soup of add-ons appearing every day) has more than 20 other modules in the Hadoop stack that make it useful. While I can do whatever I want with the core, once I build applications with these other modules there are just as many dependencies up and down the stack that need to be attended to when changing things as in a standard Data Warehouse environment.

But wait. Now we have the Stinger Initiative for Hive, Hadoop’s SQL-ish database, to make Hive 100x faster. This is accomplished by jettisoning MapReduce and replacing it with Tez, the next-generation MapReduce. According to Hortonworks, Tez is “better suited to SQL.” The Stinger initiative also includes the ORCFile file format for better compression, and vectorizing Tez so that, unlike MapReduce, it can grab lots of records at once. And on top of it all, the crown jewel in any relational database, a Cost-Based Optimizer (CBO), which can only work with a, wait for this, schema! In fact, in the demo I saw today from Hortonworks, they were actually showing iterative SQL queries against, again, wait for it…a STAR SCHEMA! So what happened to schema on read? What happened to how awful RDBMS was compared to Hadoop? See where this is going? In order to sell Hadoop to the enterprise, they are making it work like an RDBMS.

There are four kinds of RDBMSs in the market today (and this is my market definition, no one else’s): 1) Enterprise Data Warehouse database systems designed from the ground up for data warehousing. As far as I’m concerned, there is only one that can handle massive volumes, huge mixed workloads, broad functionality and tens of thousands of users a day – Teradata 6xxx series; 2) RDBMS designed for transactional processing, but positioned for data warehousing too, just not as good at it, such as Oracle, DB2 and MSSQL; 3) Analytical databases, either sold as software-only or as appliances – IBM Netezza, H-P Vertica, Teradata 2xxx series; 4) In-memory databases such as SAP HANA, Oracle TimesTen and a passel of others. Now we have a fifth – SQL-compliant (not completely) databases running on top of HDFS. There are more versions of these, too, such as Splice Machine, now in public beta, as well as Drill, Impala, Presto, Stinger, Hadapt and Spark/Shark, to name a few (although Daniel Abadi of Hadapt has argued that “Structured” query language misses the point of Hadoop entirely – flexibility). Now Hadoop is sort of five.

So where are we going with this? Like Clinton in the 90’s, it’s clear Hadoop is moving to the center. Purist Hadoop will continue to exist, but market forces are driving it to a more palatable enterprise offering. Governance, security, managed workloads, interactive analysis. All of the things we have now except cheap platforms for greater volumes of data and massive concurrency.

I do wonder about one thing, though. The whole notion of just throwing more cheap resources at it has to have a point of diminishing returns. When will we get to the point that Hadoop is using 100x or 1000x more resources than would be needed in a careful architecture? Think about this. If we morph Hadoop into just a newer analytical database platform, sooner or later someone is going to wonder why we have 3 petabytes of drives and only 800 terabytes of data. In fact, how much duplication is in that data? How much wasted space? Drives may be cheap, but even a thousand cheap drives cost something, especially when they’re only 20% utilized.
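
The arithmetic behind that question can be sketched in a few lines of Python. The figures are the ones used above (3 petabytes of drives, 800 terabytes of data) plus the three copies of everything that Hadoop stores, as noted earlier; decimal units (1 PB = 1,000 TB) and the variable names are assumptions made for illustration.

```python
# Figures from the text; decimal units assumed (1 PB = 1,000 TB).
raw_capacity_tb = 3000      # 3 petabytes of drives
logical_data_tb = 800       # 800 terabytes of actual data
replication_factor = 3      # three copies of everything

# Disk consumed once every block is stored three times.
physical_used_tb = logical_data_tb * replication_factor

# Share of raw disk holding any copy, and share holding unique data.
physical_share = physical_used_tb / raw_capacity_tb
unique_share = logical_data_tb / raw_capacity_tb

print(f"disk consumed by all copies: {physical_used_tb} TB ({physical_share:.0%})")
print(f"unique data as a share of raw disk: {unique_share:.0%}")
```

Under these assumptions the three copies account for most of the raw disk, while the unique data is roughly a quarter of it, in the same ballpark as the 20% figure above.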

Hadoop was invented for indexing search and other internet-related activities, not enterprise software. Its promotion to all forms of analytics is curious. Where did anyone prove that its architecture was right for everything, or did the hype just get sold on being cheap? And what is the TCO over time vs a DW?

And when Hadoop vendors say, “Most of our customers are building Enterprise Data Hubs or (a terrible term) Data Lakes next to their EDW because they are complementary,” it begs the question: for analytics in typical organizations, what exactly is complementary? That’s when we hear about sensors and machine-generated data and the social networks. How universal are those needs?

Then there is ETL. Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop? They need to be reminded that writing code is not quite the same as an ETL tool with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It’s also a little contradictory. If Hadoop is for completely flexible and novel analysis, who is going to write ETL code for every project? Now there is a real latency: only five minutes to crunch the data and 30 days to write the ETL code.

They talk about using Hadoop as an archive to get old data out of a data warehouse, but they fail to mention that the data is unusable without the context that still remains in the DW; nor will it be usable in the DW later, after the schema evolves. So what they really mean is use Hadoop as a dump for data you’ll never use but can’t stand to delete, because if you don’t need it in the DW, why do you need it at all?

Despite all this, it’s a tsunami. The horse has left the stable. The train has left the station. Hadoop will grow and expand and probably not even be recognizable as the original Hadoop in a few years, and it will replace the RDBMS as the platform of choice for enterprise applications (even if the bulk of its applications will be SQL-based). I guarantee it. So get on top of it or get out of the way.


Metrics Can Lead in the Wrong Direction

Is it really possible to use measurement — or “metrics,” in the current parlance — to drive an organization? There are two points of view, one widely accepted and current, the other opposing and more abstract.

The conventional wisdom on performance management is that our technology is perfectly capable of providing detailed, current and relevant performance information to stakeholders in an enterprise, including executives, managers, functional people, customers, vendors and regulators. Because we are blessed with abundant computing resources, connectivity, bandwidth and even standards, it is possible to present this information in cognitively effective ways (dashboards and visualization, for example). Recipients are able to receive the information in the manner in which they choose, and the whole process pays dividends by supporting the notion that “If you can’t measure it, you can’t manage it.” It is hard to imagine how anyone could manage a large undertaking without measurement, isn’t it? And most presentations I’ve heard quickly stress that measurement is only part of the solution.

The first step is knowing what to measure; then measuring it accurately; then finding a way to disseminate the information for maximum impact (figuring out how to keep it current and relevant); and then being able to actually do something about the results. A different way of saying this is that technology is never a solution to social problems, and interactions between human beings are inherently social. This is why performance management is a very complex discipline, not just the implementation of dashboard or scorecard technology. Luckily, the business community seems to be plugged into this concept in a way they never were in the old context of business intelligence. In this new context, organizations understand that measurement tools only imply remediation and that business intelligence is most often applied merely to inform people, not to catalyze change. In practice, such undertakings almost always lack a change-management methodology or portfolio.

But there is an argument against measurement, too. Unlike machines or chemical reactions in a beaker, human beings are aware that they are being measured. In the realm of physics, Heisenberg’s Uncertainty Principle demonstrates that the act of measurement itself can very often distort the phenomena one is attempting to measure. When it comes to sub-atomic particles, we can pretty much assume it is a physical law that underlies this behavior. With people, the unseen subtext is clearly conscious. People find the most ingenious ways to distort measurement systems to generate the numbers that are desired. Thus, the effort to measure can not only discourage desired behavior; it can promote dysfunctional behavior. There are excellent, documented examples of this phenomenon in Measuring and Managing Performance in Organizations by Robert D. Austin. The author’s contention is that measurement of people always introduces distortion and often brings dysfunction because measurement is never more than a proxy or an approximation of the real phenomena.

In a particularly colorful analogy, Austin writes:

“Kaplan and Norton’s cockpit analogy would be accurate if it included a multitude of tiny gremlins controlling wing flaps, fuel flow, and so on of a plane being buffeted by winds and generally struggling against nature, but with the gremlins always controlling information flow back to the cockpit instruments, for fear that the pilot might find gremlin replacements. It would not be surprising if airplanes guided this way occasionally flew into mountains when they seemed to be progressing smoothly toward their destinations.”

We all know that incomplete proxies are too easy to exploit, in the same way that inadequate software with programming gaps beckons unscrupulous hackers. However, one doesn’t have to be malicious to subvert a measurement system. After all, voluntary compliance with the tax code encourages a national obsession with “loopholes.” And what salesperson hasn’t “sandbagged” a few deals for the next quarter after meeting the quota for the current one?

The solution is not to discard measurement but rather to be conscious of this tendency and to be vigilant and thorough in the design of measurement systems. We all have a tendency toward simplifying things; but in some cases, it appears better to not measure at all than to produce something inadequate. Performance management, to achieve its goals, has to be applied effectively, which is to say, with superior execution of technology, implementation and management. It has to be designed to be responsive to both incremental and unpredicted changes in the organization and the environment. There are no road maps for this. This is truly the first time that analytical and measurement technology can be embedded in day-to-day, instantaneous decision-making and tracking; and the industry is sorely lacking in skills and experience to pull it off. Those organizations that have been successful so far have relied on existing methodologies (activity-based costing or balanced scorecard, for example) to guide them through the more uncertain steps of metric formulation and change management to close the loop.

The question of whether you can ever adequately measure an organization is still open. To the extent that there are statutory and regulatory requirements, such as taxation, SEC or specific industry regulations, the answer is clearly yes. But those measurements are dictated. To measure performance after the fact, at aggregated levels, is only useful to a point. The closer and closer a measurement system gets to the actual events and actions that drive the higher-level numbers, the less reliable the cause-effect relationship becomes, just like Heisenberg found so long ago. There are many examples in the management literature of everyone “doing the right thing” while the wheels are coming off the organization.

Recommended Reading:

Measuring and Managing Performance in Organizations, by Robert D. Austin (New York: Dorset House Publishing, 1996)
In the Age of the Smart Machine: The Future of Work and Power, by Shoshana Zuboff (New York: Basic Books, 1988)
“The New Productivity Challenge,” by Peter Drucker, Harvard Business Review (Nov.-Dec. 1991): p. 70.


A Bit About Storytelling

My take on storytelling

1. Must be a “story” with a beginning, middle and end that is relevant to the listeners.
2. Must be highly compressed.
3. Must have a hero – the story must be about a person who accomplished something notable or noteworthy.
4. Must include a surprising element – the story should shock the listener out of their complacency. It should shake up their model of reality.
5. Must stimulate an “of course!” reaction – once the surprise is delivered, the listener should see the obvious path to the future.
6. Must embody the change process desired, be relatively recent and “pretty much” true.
7. Must have a happy ending.

In Stephen Denning’s words, “When a springboard story does its job, the listeners’ minds race ahead, to imagine the further implications of elaborating the same idea in different contexts, more intimately known to the listeners. In this way, through extrapolation from the narrative, the re-creation of the change idea can be successfully brought to birth, with the concept of it planted in listeners’ minds, not as a vague, abstract inert thing, but an idea that is pulsing, kicking, breathing, exciting – and alive.”

That may be a little too much excitement on a daily basis, something you save for the really important things, but turning data into a story is nonetheless a valid and necessary skill. But is it for everyone?

Not really. Actual storytelling is a craft. Not everyone knows how to do it or can even learn it. But everyone can tell a story. It just may not be of the caliber of storytelling. But to get a point across and have it stick (even if it’s just in your own mind, not to an audience), learn to apply metaphor.

More on metaphor later


Understanding Analytical Types and Needs

By Neil Raden, January, 2013

Purpose and Intent

“Analytics” is a critical component of enterprise architecture capabilities, though most organizations have only recently begun to develop experience using quantitative methods. As Information Technology emerges from a scarcity-based mentality of constrained and costly resources to a commodity consumption model of data, processors and tools, analytics is quickly becoming table stakes for competition.

This report is the first of a two-part series. (Part II will cover analytic functionality and matching the right technology to the proper analytic tools and best practices.) It discusses the importance of understanding the role of analytics, why it is a difficult topic for many, and what actions you should take. It will explore the various meanings of analytics and provide a framework for aligning the various types of analytics with the associated roles and skill sets needed.

Executive Summary

Using quantitative methods is rapidly becoming, not an option for competitive advantage, but rather, at the very least, barely enough to keep up. Everyone needs to understand what’s involved in analytics, what your particular organization needs and how to do it.

Few people are comfortable with the concepts of advanced analytic methods. In fact, most people cannot explain the difference between a mean, a median and a sample mean. The misapplication of statistics is widespread, but today’s explosion of data sources and intriguing technologies to deal with them have changed the calculus. Embedded quantitative methods may relieve analysts of the actual construction of predictive models, but applying those models correctly requires understanding the different analytical types, roles and skills.
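
To make the distinction between a mean, a median and a sample mean concrete, here is a small Python sketch using only the standard library. The population of 10,000 right-skewed “order values” is invented for illustration, as are the sample size of 50 and the seed.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 right-skewed values (a few very large ones).
population = [rng.lognormvariate(3.0, 1.0) for _ in range(10_000)]

# Mean: the arithmetic average of every value in the population.
population_mean = statistics.fmean(population)

# Median: the middle value once the population is sorted.
population_median = statistics.median(population)

# Sample mean: the average of a random subset, used to estimate the mean.
sample = rng.sample(population, 50)
sample_mean = statistics.fmean(sample)

print(f"population mean:   {population_mean:.1f}")
print(f"population median: {population_median:.1f}")
print(f"sample mean (50):  {sample_mean:.1f}")
```

In skewed data like this the mean sits above the median because a few large values pull it up, and the sample mean is only an estimate of the population mean that changes from one sample to the next.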

Analytics in the Enterprise

The emphasis of analytics is changing from one of long-range planning based on historical data, to dynamic and adaptive response based on timely information from multiple contexts, augmented and interpreted through various degrees of quantitative analysis. Analytics now permeates every aspect of leading organizations’ operations. Competitive, technological and economic factors combine to require more precision and less lag time in discovery and decision-making.

For example, operational processing (the orchestration of business processes and secure capture of transactional data) is merging with analytical processing (the gathering and processing of data for reporting and analysis). Analytics in commercial organizations has historically been limited to special groups working more or less off-line. Platforms for transaction processing were kept separate for performance and security reasons, an effect of “managing from scarcity.” But scarcity is not the issue anymore as the relative cost of computing has plummeted. Driven equally by technology and competition, operational systems are either absorbing or at least cooperating with analytical processes. This convergence elevates the visibility of all forms of analytics.

Confusion and mistakes in deploying analytics are common due to imprecise understanding of the various forms and types. Uncertainty about the staff and skills needed for various “types” of analytics is also common. Messaging from technology vendors, service providers and analysts is murky and misleading, sometimes deliberately so.

The urgency behind implementing an analytics program, however, can be driven not by getting a leg up, but by the need to avoid falling behind.

Analytics and the Red Queen Effect

Analytics are crucial because the barriers to getting started are lower than ever. Everyone can engage in analytics now, of one type or another. As analytic capabilities increase across competitors, everyone must step up – it’s a Red Queen[i] effect.  When everyone was shooting from the hip, efficiency was a matter of degree. If everyone used crude models and unreliable data, then everyone should, more or less, work within the same margin of error. What separated competitors was good strategy and good execution. But now that everyone can employ quantitative methods and techniques like Naive Bayes, C4.5 and support vector machines, it will still be the strategy and execution that count. Companies must improve just to stay in place.  Each new level of analytics becomes the “table stakes” for the next.

Can You Compete on Analytics? Analytics Are Necessary – but Not Sufficient

Statistical methods using software have been shown to be useful in many aspects of an organization, such as fraud detection, demand forecasting and inventory management, but just using analytics has not been shown to necessarily improve the fortunes or effectiveness of the overall organization. In 2007, Davenport and Harris released their influential book[ii], Competing on Analytics, which described how a dozen or so companies used “analytics” to not only advise decision-makers, but to play a major role in the development of strategy and implementation of business initiatives. The book found a huge following and was a bestseller on the business book lists. It certainly placed the word “analytics” in the top of the mind of many decision makers. However, when comparing the fortunes of the twelve companies highlighted in the book, their performance in the stock market is less than spectacular as illustrated in Figure 2:

This scenario is often repeated – good work is performed inside an organization, but the benefits of the discipline do not permeate other parts of the business and, hence, have little effect on the organization as a whole. In another example, statistical methods have been used in the U.S. in agriculture for decades, and yields have improved dramatically, but the quality of the food supply has clearly degraded along with the fortunes of individual farmers.

Too many organizations, despite good intentions, do not see dramatic improvement in their fortunes after adopting wider-based analytical methods because:

First, rarely does one thing change a company. Analytics are a powerful tool, but it takes execution to realize the benefits. Perhaps if good analytical technique had been applied across the board along with a clear strategy to drive decisions based on quantitative models, better results might have followed. Instead, as is often the case, a visible project shows great promise and early results, but the follow-through is wanting.

Data mining tools can actually be predictive, showing what is likely to happen or not happen. But what is often misunderstood is that data mining tools are usually poor at specifying when things will happen. In this case, too much faith is placed in the models, imbuing them with fortune-telling capabilities they simply lack. The correct approach is to test, run proofs of concept, and once in production engage in continuous improvement through mechanisms like champion/challenger and A/B testing.
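The A/B and champion/challenger testing mentioned above can be sketched as a two-proportion z-test. This is a minimal illustration under stated assumptions, not a method taken from this report: the function name `ab_test`, the conversion counts and the idea of comparing against a 5% significance level are all hypothetical, and the normal approximation only holds for reasonably large samples.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for a champion (A) vs. challenger (B).

    conv_*: number of successes (e.g. accepted offers); n_*: number of trials.
    Returns (z, p_value). Assumes independent samples and large n.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled success rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: champion converts 200 of 10,000, challenger 260 of 10,000.
z, p = ab_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the challenger's lift is unlikely to be chance alone; whether it justifies replacing the champion remains a business decision.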

Most companies try to understand customer behavior – which you can do with data mining – but data mining rarely captures the randomness of people’s behavior, and this leads to overconfidence in the models. Given that a customer is likely to purchase a car, when is the correct time to reach out? Perhaps right away, perhaps not. Data mining tools are not very good at deriving individual propensities from behavior, precisely because that behavior is so random. It is pretty common for inexperienced modelers to put too much faith in model results. The solution is to engage experienced talent to get a program started on the right track.

Return on investment in analytics is difficult to measure because there isn’t often a straight line from the model to results. Other parts of the organization contribute. An analytical process can inform decisions, either human or machine-driven, but the execution of those decisions is beyond the reach of an analytical system. People and process have to perform too. In addition, a successful analytical program can be the result of a well-defined strategy. Positive results from analytics would not have been possible without the formation of that strategy.

Professionals skilled in statistics, data mining, predictive modeling and optimization have been a part of many organizations for some time, but their contribution, and even what they do, is sometimes poorly understood, and their work faces many impediments to success. By categorizing analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques (the business applications that they support are detailed in Part II of the series), companies can begin to understand when and how to use analytics effectively and deploy their analytic resources to achieve better results.

The Four Types of Analytics

There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. What follows is a way to characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.

Figure 4: The Four Types of Analytics

Type | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles
Type I | Quantitative Research | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles
Type II | Data Scientist or Quantitative | Advanced Math/Stat, not necessarily PhD | Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge
Type III | Operational Analytics | Good business domain, background in statistics optional | Running and managing analytical models. Strong skills in, and/or project management of, analytical systems implementation
Type IV | Business Intelligence/Discovery | Data and numbers oriented, but no special advanced statistical skills | Reporting, dashboard, OLAP and visualization use, possibly design. Performing posterior analysis of results driven by quantitative methods

Type I Analytics: Quantitative Research

The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. Quantitative Research analytics are performed by mathematicians, statisticians and other pure quantitative scientists. They discover new ideas and concepts in mathematical terms and develop new algorithms with names like Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, Machine Learning and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions (though not exclusively). Commercial, governmental and other organizations (Google or Wall Street, for example) employ staff with these very advanced skills; but in general, most organizations are able to conduct their necessary analytics without them, or by employing the results of their research. An obvious example is the FICO score, developed by Quantitative Research experts at FICO (formerly Fair Isaac) but employed widely in credit-granting institutions and even human resource organizations.

Type II Analytics: “Data Scientists”

More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, and even the heavy “quants” in industry who apply these methods specifically to the work they do, like fraud detection, failure analysis and propensity-to-consume models, among hundreds of other examples. They operate in much the same way as commercial software companies but for just one customer (though they often start their own software companies too). The popular term for this role is “data scientist.”

“Heavy” Data Scientists. The Type II category could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function – providing guidance and expertise in the application of quantitative analysis – they are differentiated by the sophistication of the techniques applied. II-A practitioners understand the mathematics behind the analytics and may apply very complex tools such as Kucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by a small number of practitioners. What differentiates the Type II-A from Type I is not necessarily the depth of knowledge they have about the formal methods of analytics (it is not uncommon for Type II’s to have a PhD, for example); it is that they also possess business domain knowledge and that their goal is to develop specific models for the enterprise, not for the general case as Type I’s usually do.

“Light” Data Scientists. Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, CHAID and various forms of linear regression. They approach the problems they deal with using more conventional best practices and/or packaged analytical solutions from third parties.

Data Scientist Confusion. “Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from businesses like Google or Facebook where the data is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business that, say, actuaries in the insurance business have; their extensive training should be a model for the newly designated data scientists (see our blog at: “What is a Data Scientist and What Isn’t”).

Though the view is not universally accepted, data scientists must be able to effectively communicate their work to non-technical people. This is a major discriminator between a data scientist and a statistician. It is absolutely essential that someone in the analytics process have the role of chief communicator, someone who is comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand. Companies often fail to see that there is almost never anything to be gained by trying to put a PhD statistician into the role of managing a group of analysts and developers. It is safe to say that this role is represented more by a collaborative group of professionals than by a single individual.

Type III Analytics: Operational Analytics

Historically, this is the part of analytics we’re most familiar with. For example, a data scientist may develop a scoring model for his/her company. In Type III activity, the operational analyst chooses parameters and inputs them into the model; the scores the Type II model generates are then embedded into an operational system that, say, generates offers for credit cards. Models developed by data scientists can be applied and embedded in an almost infinite number of ways today. Applying Type II models to real work is the realm of operational analysts. In very complex applications, real-time data can be streamed into applications based on Type II models with outcomes instantaneously derived through decision-making tools such as rules engines.
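The division of labor described above, a Type II model producing scores and a Type III operational layer acting on them, can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the coefficients, field names, thresholds and offer names are hypothetical, and the logistic form merely stands in for whatever model a data scientist would actually supply.

```python
import math
from dataclasses import dataclass

@dataclass
class Applicant:
    income_k: float      # annual income in thousands (hypothetical feature)
    utilization: float   # share of existing credit in use, 0.0 to 1.0
    late_payments: int   # late payments in the past 12 months

# Hypothetical coefficients standing in for a model fitted by a Type II analyst.
COEFFS = {"intercept": -1.0, "income_k": 0.02,
          "utilization": -2.0, "late_payments": -0.8}

def score(a: Applicant) -> float:
    """Logistic score in (0, 1); higher means lower estimated risk."""
    z = (COEFFS["intercept"]
         + COEFFS["income_k"] * a.income_k
         + COEFFS["utilization"] * a.utilization
         + COEFFS["late_payments"] * a.late_payments)
    return 1.0 / (1.0 + math.exp(-z))

# Operational parameters chosen by the Type III analyst:
# (minimum score, offer), checked from the top down.
RULES = [(0.80, "premium card"), (0.50, "standard card"), (0.00, "no offer")]

def decide(a: Applicant) -> str:
    """A toy rules step: return the first offer whose threshold is met."""
    s = score(a)
    for min_score, offer in RULES:
        if s >= min_score:
            return offer
    return "no offer"

print(decide(Applicant(income_k=150, utilization=0.1, late_payments=0)))
```

Keeping the thresholds in `RULES` separate from the model lets the operational analyst tune offers without touching the model itself.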

Packaged applications that embed quantitative methods such as predictive modeling or optimizations are also Type III in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of “black box.” As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.

Decision-making systems that rely on quantitative methods that are not well understood by the operators can lead to trouble. They must be carefully designed (and improved) to avoid burdening the recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining: generating “interesting” results without understanding what was relevant usually led to flagging interest in the technology. In today’s business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine those messages to those that are the most relevant and useful.

Useless or irrelevant messages of this kind are false positives. False negatives are quite a bit more problematic, as they can let through transactions that should have been stopped. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.
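The tension between flooding recipients with irrelevant alerts and letting bad transactions slip through can be made concrete by counting outcomes at different alert thresholds. The sketch below is illustrative only: the risk scores, the true labels and the thresholds are invented, and `confusion_counts` is a hypothetical helper rather than part of any particular product.

```python
def confusion_counts(scores, labels, threshold):
    """Count alert outcomes when every score >= threshold raises an alert.

    labels[i] is True where transaction i really was bad.
    Returns a dict with true/false positives and negatives.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for s, bad in zip(scores, labels):
        flagged = s >= threshold
        if flagged and bad:
            counts["tp"] += 1      # bad transaction caught
        elif flagged:
            counts["fp"] += 1      # needless alert
        elif bad:
            counts["fn"] += 1      # bad transaction missed
        else:
            counts["tn"] += 1      # good transaction passed quietly
    return counts

# Invented risk scores and ground truth for eight transactions.
SCORES = [0.95, 0.80, 0.65, 0.60, 0.40, 0.35, 0.20, 0.10]
LABELS = [True, True, False, True, False, True, False, False]

for t in (0.3, 0.5, 0.9):
    print(t, confusion_counts(SCORES, LABELS, t))
```

In this toy data, raising the threshold reduces needless alerts but increases the number of bad transactions missed, which is the trade-off a decision-making system has to be designed around.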

Type IV Analytics: Business Intelligence & Discovery

Type III analytics aren’t of much value if their application in real business situations cannot be evaluated for their effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. This includes almost any activity that reviews information to understand what happened or how something performed, or to scan and free associate what patterns appear from analysis. The mathematics involved is simple. But pulling the right information – and understanding what information means – is still an art and requires both business sense and knowledge about sources and uses of the data.

Know Your Needs First

The scope of analytics is vast, ranging from the familiar features of business intelligence to the arcane and mysterious world of applied mathematics. Organizations need to be clear on their objectives and capabilities before funding and staffing an analytic program. Predictive modeling to dramatically improve your results makes for good reading, but the reality is quite different. The four types are meant to help you understand where you can begin or advance.

These categories are not hard and fast. Some activities are clearly a blend of various types. But the point is to add some clarity to the term “analytics” in order to understand its various use cases. Tom Davenport, for example, advocated creating a cadre of “PhDs with personality” in order to become an analytically competitive organization. That is one approach. Implementing analytics as part of other enterprise software you already have – or purchasing a specialized application that is already used and vetted in your industry – is a better place to start.


Clear terminology can avoid confusion, not just within your organization but also in communication with vendors and service providers. To get the most out of analytics:

  • Be clear about what you need.  Having clarity on the meaning of analytics has clear benefits. Because the nature of analytics is a little mysterious to most people, a vendor statement that they provide “embedded predictive analytics” can no longer be taken at face value. You should look closely to see if those capabilities line up with your needs.
  • Don’t assume high value means high resource costs. In the same vein, you needn’t hesitate to begin analytical projects because you believe you need to source a dozen PhDs, when in fact, your needs are in the Type II category.
  • Formulate specific vendor questions based on what level of sophistication and resources you need. By more clearly specifying what type of analytics you need, it becomes very easy to ask: Is this tool designed to discover and create predictive models, or to deploy them from other sources? Do you offer training in quantitative methods or only in the use of your product? Is the tool designed for authoring scoring models or just using scored values?
  • Use analytic knowledge to start to prepare for Big Data.  Understanding what type of analytics – and results – you need will even help you in your soon-to-be-serious consideration of Big Data solutions, including Hadoop, its variants and its competitors, all of which use variants of the above techniques to process large quantities of information.

Analytics is a catchall phrase, but understanding the various uses and types should help in implementing the right approach for accomplishing the tasks at hand.  It should also help in discerning what is meant when the term is used, as almost anything can be called analytics.

Next Steps

Part II of this series will examine in depth the forms that analytics take in the organization and the business purposes they serve, and demonstrate through examples and case studies how analytics of all types are successfully employed. But analytics are only a step in the process. Without effective decision-making practices, the value in analytics is lost. Part III of this series will deal with decision making and decision management.

Author Bio: Neil Raden

Analyst, Consultant and Author in Analytics and Decision Science

Neil Raden is the founder and Principal Analyst at Hired Brains Research, a provider of consulting and implementation services in business intelligence, analytics and decision management. Hired Brains focuses on the needs of organizations and capabilities of technology. He began his career as a property and casualty actuary with AIG in New York before moving into predictive analytics services, software engineering, and systems integration, with experience in delivering environments for decision making in fields as diverse as health care, nuclear waste management and cosmetics marketing, and many others in between.


[i] The Red Queen is a concept from evolutionary biology, popularized in Matt Ridley, The Red Queen: Sex and the Evolution of Human Nature, (New York: Macmillan Publishing Co, 1994). The allusion is to the Red Queen in Lewis Carroll’s Through the Looking-Glass, who had to keep running just to stay in place.

[ii] Thomas H. Davenport and Jeanne G. Harris, Competing on Analytics: The New Science of Winning, (Boston: Harvard Business School Press, 2007).
