Karl Popper versus Data Science

I’m sure you’ve heard of Big Data and IoT (Internet of Things) by now. There is a current in computing now that is based on the economics of nearly unlimited resources for computational complexity, including Cognitive Computing (AI + Machine Learning). From this, many are seeing the “end of science,” meaning that the truth is in the data and the scientific method is dead. Previously, a scientist might observe certain phenomena, come up with a theory and test it. Here is a counterexample.

Using algorithms from Topology (yeah, I studied topology in the 70’s), investigators can apply TDA (Topological Data Analysis) to investigate the SHAPE of very complex, very high-volume, very high-dimensional data (1000’s of variables), deform it in various ways to see what its true nature is and find out what’s really going on. Traditional quantitative methods can only sample or reduce the variables using techniques like Principal Component Analysis (these variables don’t seem very important).

In one case, an organization did a retrospective analysis of every single trial and study on spinal cord injuries. What they found with TDA was that one and only one variable had a measurable effect on outcomes with patients presenting with SCI – maintaining normal blood pressure as soon as they hit the ambulance. No one had either seen or even contemplated this before.

Karl Popper was one of the most important and controversial philosophers of science of the 20th century. In “All Life is Problem Solving,” Popper claimed that “Science begins with problems. It attempts to solve them through bold, inventive theories. The great majority of theories are false and/or untestable. Valuable, testable theories will search for errors. We try to find errors and to eliminate them. This is science. It consists of wild, often irresponsible ideas that it places under the strict control of error correction.”

In other words, hypothesis precedes data. We decide what we want to test, and assemble the data to test it. This is the polar opposite of the data science emerging from big data.

So here’s my premise. Is Karl Popper over? Has computing killed the scientific method?


Posted in Big Data, Decision Management, Genomics, Medicine, Research, Uncategorized | 5 Comments

Miscellaneous Ramblings Today on Data and Analytics

Here are some ideas off the top of my head:

1. The Big Data Analytics industry – vendors, journalists, industry analysts – has flooded the market with messages as if no one had ever used quantitative methods before

2. Because most of the content you see is generated by people who don’t actually use quantitative methods, it is:
– focused on technology
– full of the same use cases such as up-sell/cross-sell, churn, fraud, etc.

3. The real opportunity with Big Data and its attendant technologies is to get a richer understanding of those phenomena that are important to you

4. The rise of Data Science and Scientists is the invention of practitioners from the digital giants and not terribly relevant to most companies

5. Ultimately the benefit of Big Data Analytics will be better decisions born of better decision-making processes, not just informing people of findings. This was the weak point of BI: it was too passive. Operational Intelligence and Decision Automation are key

6. All of this is possible because of the radically different analytical architectures and open source tools that are available in a variety of cloud-based topologies

7. Many business analysts have the background to use advanced analytical tools, provided the tools get better at guiding and advising.

8. The industry can’t continue without better tools. Big Data is a giant time sink. We’re seeing lots of interesting products emerge, many of them open source, to lubricate the whole data management and analytic spectrum

9. As always, finding a way for business units and IT to cooperate and work productively is still a problem.

10. Existing operational systems are based either on relational database technology or on even older systems written in COBOL and other 3rd-generation languages. Capturing information in these systems is like fitting a square peg in a round hole. New database systems, the so-called NoSQL tools, offer abundant opportunities to capture and use rich information. Graph databases, for example, are brilliant at finding hidden relationships to expose concentration risk or fraud.

11. I’ve built a few Bayesian Belief Networks recently. What I learned is that they can get computationally expensive, perform poorly on high-dimensional data, and produce models that can be hard to interpret. On the other hand, they offer the ability to get to causation, not just correlation. Better to build from data and/or simulation

Posted in Uncategorized | 2 Comments

Pervasive Analytics: Needs Organizational Change, Better Software and Training

By Neil Raden nraden@hiredbrains.com

Principal Analyst, Hired Brains Research, LLC

May, 2015

The hunt for data scientists has reached its logical conclusion: There are not enough qualified ones to go around. The pull for analytics as a result of a number of factors, including big data and the march of Moore’s Law, is irresistible. As a result, industry analysts, software providers and other influencers are turning to the idea of the “democratization of analytics” as a solution. At Hired Brains, we believe this is not only a good idea (and have been writing and speaking about it for four years), but that it is inevitable. Unfortunately, simply turning business analysts loose on quantitative methods is an unworkable solution. As the title says, three things that are not currently in place need to be: organizational change, better software and training/mentoring for sustained periods.

Some Background

From the middle of the twentieth century until nearly its end, computers in business were mostly consumed with the process of capturing operational transactions for audit and regulatory purposes. Reporting for decision-making was repetitive and inactive. Some interactivity with the computer began to emerge in the eighties, but it was applied mostly to data input forms. By the end of the century, mostly as a result of the push from personal computers, tools for interacting with data, such as Decision Support Systems, reporting tools and Business Intelligence allowed business analysts to finally use the computing power for their analytical, as opposed to operational purposes.

Nevertheless, these tools were under constant stress because of the cost and scarcity of computing power. The repository of the data, mostly data warehouses, dwarfed the size of the operational systems that fed them. As BI software providers pressed for “pervasive BI,” so that a much broader group of people in the organization would actively use the tools (and the vendors would sell more licenses of course), the movement met resistance from three areas: 1) physical resources (CPU, RAM, Disk), 2) IT concerns that a much broader user community would wreak havoc with the established security and control and 3) people themselves who, beyond the existing users, showed little interest in “self-service” so long as there were others willing to do it for them.

In 2007, Tom Davenport published his landmark book, “Competing on Analytics,” and suddenly, every CEO wanted to find out how to compete on analytics. Beyond the more or less thin advice about why this was a good idea, the book was actually anemic when it came to providing any kind of specific, prescriptive advice on transforming an organization to an “analytically-driven” one.

Analytics Mania

Fast forward to 2015 and analytics has morphed from a meme to a mania. Pervasive BI is a relic not even discussed, but pervasive analytics or, more recently, the “democratization of analytics” is widely held to be the salvation of every organization. Granted, two of the three reasons pervasive BI failed to ignite are no longer an issue in this era of big data and Hadoop, but the third, the people, looms even larger: 1) people still are not motivated to do work that was previously done by others and 2) an even greater problem, the academic prerequisites to do the work are absent in the vast majority of workers. Pulling a Naïve Bayes or C4.5 icon over some data and getting a really pretty diagram or chart is dangerous. Software providers are making it terrifyingly easy for people to DO advanced quantitative analysis without knowing what they are doing.

Pervasive analytics? It can happen. It will happen; it’s inevitable and even a good idea, but most of the messaging about it has been perilously thin, from Gartner’s “Citizen Data Scientists” to Davenport’s “Light Quants” (who would ever want to be a “light” anything?). What is lacking is some formality about what kind of training organizations need to commit to, what analytical software vendors need to do to provide extraordinarily better software for neophytes to use productively, and how organizations need to restructure for all of this to be worthwhile and effective.

How to Move to Pervasive Analytics

For “pervasive analytics” or “the democratization of analytics” to be successful, it requires much more than just technology. Most prominent is a lack of training and skills on the part of the wide audience that is expected to be “pervaded” if you will. The shortage of “data scientists” is well documented, which is the motivation for pushing advanced analytics down in the organization to business analysts. The availability of new forms of data provides an opportunity to gain a better understanding of your customers and business environment (among a multitude of other opportunities), which implies a need to analyze data at a level of complexity beyond current skills, and beyond the capabilities of your current BI tools.

Much work is needed to develop realistic game plans for this. In particular, our research at Hired Brains shows that there are three critical areas that need to be addressed:

  • Skills and training: A three-day course is not sufficient and organizations need to make a long-term commitment to the guiding of analysts
  • Organizing for pervasive analytics: Existing IT relationships with business analysts need reconstruction and senior analysts and data scientists need to supervise the roles of governance, mentoring and vetting
  • Vastly upgraded software from the analytics vendors: In reaction to this rapidly unfolding situation, software vendors are beginning to provide packaged predictive capabilities. This raises a whole host of concerns about casually dragging statistical and predictive icons onto a palette and almost randomly generating plausible output that is completely wrong.

Skills and Training

Of course it’s unrealistic to think that existing analysts who can build reports and dashboards will learn to integrate moment generating functions and understand the underlying math behind probability distributions and quantitative algorithms. However, with a little help (a lot actually) from software providers, a good man-machine mix is possible where analysts can explore data and use quantitative techniques while being guided, warned and corrected.

A more long-term problem is training people to be able to build models and make decisions based on probability, not a “single version of the truth.” This process will take longer and require more assistance from those with the training and experience to recognize what makes sense and what doesn’t. Here is an example:

[Chart: a stock market index plotted against the number of media mentions of Jennifer Lawrence]

The chart shows a correlation between a stock market index and the number of times Jennifer Lawrence was mentioned in the media. Not shown, but the correlation coefficient is a robust 0.80, which means the variables are tightly correlated. Be honest with yourself and think about what could explain this. After you’ve thought about a few confounding variables, did you consider that they are both slightly increasing time series, which is actually the basis of the correlation, not the phenomena themselves? Remove the time element and the correlation drops to almost zero.

The point here is one doesn’t need to understand the algorithms that create this spurious correlation, they just need enough experience to know that you have to filter out the effect of the time series. But how would they know that?
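The effect is easy to reproduce. The sketch below does not use the data behind the chart; it builds two made-up, unrelated series that both drift upward (the slope, noise level and seeds are assumptions chosen for illustration) and compares the correlation of the raw levels with the correlation of the period-to-period changes. First differencing is only one of several ways to filter out a time trend.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(seed, n=200, slope=0.5, noise=5.0):
    """Synthetic series: a steady upward drift plus independent noise."""
    rng = random.Random(seed)
    return [slope * t + rng.gauss(0.0, noise) for t in range(n)]


def first_differences(xs):
    """Period-to-period changes, which removes a linear time trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Two unrelated series (hypothetical stand-ins for the index and the mentions).
index_level = trending_series(seed=1)
media_mentions = trending_series(seed=2)

corr_levels = pearson(index_level, media_mentions)
corr_changes = pearson(first_differences(index_level),
                       first_differences(media_mentions))

print(f"correlation of levels:  {corr_levels:.2f}")
print(f"correlation of changes: {corr_changes:.2f}")
```

The levels correlate strongly only because both series share the passage of time; once the trend is differenced away, what remains is independent noise and the correlation should fall close to zero.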

The fact is that making statistical errors is far more insidious than spreadsheet or BI errors when underlying concepts are hidden. Turning business analysts into analytical analysts is possible, but not automatic.

Consider how actuaries learn their craft. Organizations hire people with an aptitude for math, demonstrated by doing well in things like Calculus and Linear Algebra, but not requiring a PhD. As they join an insurance or reinsurance or consulting organization, they are given study time at work to prepare for the exams, a process that takes years, and have ample access to mentors to help them along because the firm has a vested interest in them succeeding. Being an analyst in a firm is a less extensive learning process, but the model still makes sense.

Organizational: How organizations should deal with DIY analytics

We’re just beginning our research in this area, but one thing is certain: the BI user pyramid has got to go. In many BI implementations, the work fell onto the shoulders of BI Competency Centers to create datasets, while a handful of “power users” worked with the most useful features of the toolsets. The remainder of the users, dependent on the two tiers above them, generated simple reports or dashboards for themselves or departments (an amusing anecdote from a client of ours was, “The most used feature of our BI tool was ‘Export to Excel.’”) Creating “Pervasive BI” would have entailed doing a dead lift of the “business users” into the “power user” class, but no feasible approach was ever put forward.

Pervasive analytics cannot depend on the efforts of a few “go-to guys.” It has to evolve into an analytically centered organization where a combination of training and better software can be effective. That involves a continuing commitment to longer-term training and learning, governance of models so that models developed by professional business analysts can be monitored and vetted before finding their way into production, and a wholesale effort to change the analytics workflow: where do these analyses go beyond the analyst?

Expectations from Software Providers

Packaged analytical tools are sorely lacking in advice and error catching. It is very easy to take an icon and drop it on some data, and the tools may offer some cryptic error message or, at worst, the “help” system displays 500 words from a statistics textbook to describe the workings of the tool. But this is 2015 and computers are a jillion times more powerful than they were a few years ago. It will take some hard work for the engineers, but there is no reason why a tool should not be able to respond to its use with:

  • Those parameters are not likely to work in this model; why don’t you try these
  • Hey, “Texas Sharpshooter”-you drew the boundaries around the data to fit the category model
  • I see you’re using a p-value but haven’t verified that the distribution is normal. Shall I check for you?
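As a rough illustration of the last kind of response, here is a minimal sketch of a guardrail that checks a sample for normality and returns plain-language advice. It is not taken from any vendor’s product. The function names, the 0.05 threshold and the choice of the Jarque-Bera test are all assumptions, and the test’s chi-square approximation is only reliable for reasonably large samples.

```python
import math
from statistics import NormalDist


def jarque_bera_p_value(sample):
    """Asymptotic p-value of the Jarque-Bera normality test."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is approximately chi-square with 2 degrees of
    # freedom, whose upper-tail probability is exactly exp(-jb / 2).
    return math.exp(-jb / 2.0)


def normality_advice(sample, alpha=0.05):
    """Return a plain-language warning, or None if no problem is detected."""
    p = jarque_bera_p_value(sample)
    if p < alpha:
        return ("This data does not look normally distributed "
                f"(Jarque-Bera p = {p:.3g}). A test that assumes normality "
                "may mislead you; consider a transformation or a "
                "non-parametric alternative.")
    return None


# Deterministic demo data: evenly spaced quantiles of two known distributions.
n = 200
bell_shaped = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]  # exponential shape

print(normality_advice(bell_shaped))
print(normality_advice(skewed))
```

The point of the design is that the tool volunteers the check and phrases the result as advice, instead of leaving the analyst to know that the check was needed.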

We will be continuing our research in the areas of skills/training, organization and software for Pervasive Analytics. Please feel free to comment at nraden@hiredbrains.com

Posted in Uncategorized | 1 Comment

Relational Technologies Under Siege: Will Handsome Newcomers Displace the Stalwart Incumbents?

Published: October 16, 2014
Analyst: Neil Raden
After three decades of prominence, Relational Database Management Systems (RDBMS) are being challenged by a raft of new technologies. While RDBMS enjoy a position of incumbency, newer data management approaches are benefitting from a vibrancy powered by the effects of Moore’s Law and Big Data. Hadoop and NoSQL offerings were designed for the cloud, but are finding a place in enterprise architecture. In fact, Hadoop has already made a dent in the burgeoning field of analytics, previously the realm of data warehouses and analytical (relational) platforms.

• RDBMS are overwhelmed by new forms of data (so-called “big data”), including text, documents, machine-generated streams, graphs and others, but are counter-attacking with new development and features as well as acquisitions and partnerships
• Non-relational platform vendors assert that the relational model itself is too rigid and expensive for the explosion of information
• A fundamental drawback in RDBMS technology is the tight coupling of the storage, metadata and parser/optimizer layers that cannot take advantage of the separate storage and compute capabilities of Hadoop
• Advances in technology are not the key differentiators between RDBMS tools and Hadoop/Big Data NoSQL offerings. Requirements are. The continuing enterprise need for quality, integrated information and a “single version of the truth” argues for existing and enhanced relational data warehouses; the “good enough” mentality of cloud-based and Hadoop efforts, which were developed for large internet companies, is the key difference between the analytical approaches
• The “new-new” is pretty exciting, but there is a rush to provide true SQL access to many of these platforms, an admission that the relational calculus will endure
• Desirable features of RDBMS will migrate to the distributed processing of Hadoop, but only once Hadoop solves its shortcomings in security, workload management and operability. Born-in-the-cloud SaaS applications built on NoSQL databases (even some to emerge) will operate seamlessly on this platform, but not for 3-5 years
• Surveys of “revenue intention” for new technology spending are misleading; only 15% of companies surveyed are using Hadoop, and many are experiments.

• Recognize that RDBMS, Hadoop and NoSQL databases have vastly different purposes, capabilities, features and maturity
• When contemplating a move from an Enterprise Data Warehouse and/or on-premise ETL, take the long view of the effort, cost and disruption
• Determine exactly what your RDBMS vendor is planning for supporting “hybrid” environments because, for the time being, it will have an effect on the downstream activities of analytics
• There are many use cases for NoSQL/Big Data that are compelling and you should carefully consider them. In general, they go beyond your existing Data Warehouse/BI but are not necessarily a suitable replacement. In two years this will likely change.
• Go slow and do not throw away the baby with the bath water. The best approach is to experiment with a “skunk works” project or two to get a feel if the approach is right for your organization. Beyond that, design a careful Proof of Concept (PoC) that can actually “prove” your “concept.” Vendors tend to insert requirements and features that favor their product, which can derail the validity of the PoC.

Relational database technology was adopted by the enterprise for its ability to host transactional/operational applications. By the late 80’s vendors posted benchmarks of transactions/second that exceeded those of the purely proprietary databases, with the added benefit of an abstracted language, SQL, that allowed different flavors of databases to be designed, queried and maintained without the effort of learning a new proprietary language for each one.
Later, as the need grew for more careful data management for reporting and analytics, RDBMS were pressed into service as data warehouses, a role for which they were not well-suited in terms of scale and especially speed of complex queries and large table joins. This need was met in a number of ways, to some degree, but it took time.
This is precisely where we see Hadoop today, a tool that was built primarily to support search and indexing of unruly data on the Internet. However, its advantages in terms of cost and scale are so compelling that it is quickly being pressed into service as an enterprise analytics platform, even though it is sorely lacking in some features that data warehouses and analytical platforms (like Vertica, Netezza, Teradata etc.) already possess.
The trend for distributors of Hadoop is to claim that relational data warehouses are obsolete, or at best artifacts that have some enduring value. Curiously, with all of the attendant deficiencies of RDBMS in their view, they are mostly mute about RDBMS for transactional purposes, but that is likely to change.
Relational vendors are at work to put in place reference architectures (and products to support them) that are hybrid in nature. A term emerging is “polyglot persistence,” the ability of the first mover in an analytical query to parse and distribute pieces of the query to the logical location of the data and, preferably, the compute engine for that data without having to bulk-load data and persist it to answer a question. The concept is similar to federating queries, but much more powerful as a federation scheme usually involves design of a reference schema and assembling and transforming the data into a single place to satisfy the query. In a hybrid architecture, there are actually multiple storage locations (even in-memory) and compute resources working in a cooperative fashion. This arrangement preserves the RDBMS as the origin of analytical queries and provider of the answer set and simplifies the maintenance and orchestration of downstream processes, especially analytical, visualization and data discovery.
RDBMS were mostly row-oriented, given their OLTP orientation, but some adopted a column orientation, the most visible being Sybase IQ. In the past few years, it became obvious that analytical applications would be better served by a columnar orientation, and products like Vertica emerged that combined it with a highly scalable MPP architecture. But today, there is an explosion of new databases of many types such as (a sampling, not comprehensive):
• Column: Accumulo, Cassandra, HBase
• Wide Table: MapR-DB, Google BigTable
• Document: MongoDB, Apache CouchDB, Couchbase
• Key Value: Dynamo, FoundationDB, MapR-DB
• Graph: Neo4J, InfiniteGraph and Virtuoso
Keep in mind that none of these database systems are “general purpose.” Most require programming interfaces and lack the kind of management and administrative features that IT departments demand.
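To make the row versus column distinction mentioned above concrete, here is a toy sketch in plain Python. It does not model any of the products named; the table, field names and values are invented for illustration. The same four orders are held once record by record and once attribute by attribute.

```python
# Row-oriented layout: each record is kept together, which suits OLTP work
# that reads or writes one whole order at a time.
row_store = [
    {"order_id": 1, "customer": "A", "amount": 120.0},
    {"order_id": 2, "customer": "B", "amount": 75.5},
    {"order_id": 3, "customer": "A", "amount": 210.0},
    {"order_id": 4, "customer": "C", "amount": 42.25},
]

# Column-oriented layout: each attribute is kept as its own list, which
# suits analytical queries that scan a few columns across many records.
column_store = {
    "order_id": [1, 2, 3, 4],
    "customer": ["A", "B", "A", "C"],
    "amount": [120.0, 75.5, 210.0, 42.25],
}


def total_amount_rows(rows):
    """Aggregate by visiting every record and picking out one field."""
    return sum(row["amount"] for row in rows)


def total_amount_columns(columns):
    """Aggregate by reading only the one column the query needs."""
    return sum(columns["amount"])


def fetch_order_rows(rows, order_id):
    """Point lookup: the whole record is already in one place."""
    return next(row for row in rows if row["order_id"] == order_id)


def fetch_order_columns(columns, order_id):
    """Point lookup: the record must be reassembled from every column."""
    position = columns["order_id"].index(order_id)
    return {name: values[position] for name, values in columns.items()}


print(total_amount_rows(row_store), total_amount_columns(column_store))
print(fetch_order_rows(row_store, 3))
print(fetch_order_columns(column_store, 3))
```

Both layouts return the same answers. The difference is in how much data each operation has to touch, which is why the aggregate favors columns and the single-order lookup favors rows.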

The explosion in database technology was inevitable as the effects of Moore’s Law caused a discontinuous jump in the flow and processing of information. Technology, however, is always a step ahead of business. The implementation of enterprise applications, information management and processing platforms is a carefully woven fabric that does not bear rapid disruption (unless, of course, that is the enterprise’s strategy). “Big data” can provide enormous benefits to organizations, but not all of them. Many will find it preferable to rely on third parties to prepare and even interpret big data for them. For those that see a clear requirement, it is wise to consider the whole playing field and how the insights gained will find purchase and value. As Peter Drucker said, “Information is data endowed with relevance and purpose.”


Posted in Big Data, Business Intelligence, Decision Management, White Paper | Leave a comment

What Is Speed?

By Neil Raden

There was a time when computers were too slow to do more than bookkeeping and other back office chores. Without machines to interfere in their interactions, people performed office work like a ritual. There were set hours, dress codes, rigid hierarchies, predictable tasks and very little emphasis on change from year to year. Nothing moved very quickly. Good companies were stable and planned thoroughly. That was then. As computers slowly became more useful, necessity and market forces applied them to increasingly critical tasks. By the 90’s, Total Quality Management, Process Reengineering and headcount reductions driven by the brutal pressure of corporate raiders forced organizations to look at the efficiencies that could be gained by streamlining business processes. After more than two decades of tireless cost-cutting and pursuit of efficiency, the relaxed and personal office life depicted on television in the Sixties was gone forever. Today, we operate Netflix-style, on-demand.

The critical mass for an on-demand world is composed of Information Technology elements such as ubiquitous communications (the Internet), open standards and easy access. In an information business, speed is king. For an organization conducting business, managing a battlefield or monitoring the world’s financial markets, going faster means a shift from a reliance on prediction, foresight and planning to building in flexibility, courage and faster reflexes, catching the curls as they come and getting smarter with each thing you do (and making your partners smarter, too), ranking the contingencies instead of sticking to the plan no matter what. But what exactly is speed?

Speed implies more than just doing something quickly. For example, being able to load and index 10 billion records into a data warehouse in an hour is one measure of speed, but if the process has to wait until the middle of the night, or it takes another day to aggregate and spin out the data to data marts before it can be used or if the results have to be interpreted by multiple people in different domains, then the relevant, useful measure of speed is the full cycle time. In Six Sigma terms, cycle time is the total elapsed time to move a unit of work from the beginning to the end of a physical process. It does not include lead time. Measuring speed can be relative or absolute. Closing the books in three working days is absolute. Being first to market is relative.

There is a paradox of efficiency – investing in efforts to pare the time it takes to complete work steps can often lead to even longer cycle times. Consider scheduling aircraft. When a step is delayed or fails, there are people to consider – passengers and crew, for example. The efficiency of the solution vanishes when something doesn’t work or when the means loses sight of the desired result. Perhaps the process is very efficient, but brittle: when it breaks, all prior gains are lost. The lesson is that speed can’t be measured by the speed of steps or by the speed of a sequence of steps. People are always involved, often people tangential to the process. The Concorde cut Paris-to-New-York flying time in half, a savings of three and a half hours, but in today’s congested surface traffic and extreme security, a 3.5 hour flight could still take 8 or 9 hours door-to-door, a savings of only about 30%, possibly not worth the 300% fare increase, except for the most extremely time-conscious.
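The Concorde arithmetic can be checked in a few lines. This sketch assumes that the time spent on the ground (traffic, security, waiting) is the same for both flights, so the slower trip is simply the faster trip plus the 3.5 hours of extra flying; the function name is made up for illustration.

```python
def door_to_door_saving(flight_hours_saved, fast_total_hours):
    """Share of the slower door-to-door trip that the faster flight saves.

    Assumes ground time is identical for both trips, so the slower trip
    lasts fast_total_hours + flight_hours_saved.
    """
    slow_total_hours = fast_total_hours + flight_hours_saved
    return flight_hours_saved / slow_total_hours


# Door-to-door times of 8 and 9 hours for the 3.5 hour flight.
for fast_total in (8.0, 9.0):
    share = door_to_door_saving(3.5, fast_total)
    print(f"{fast_total:.0f} h door-to-door: saves {share:.0%} of the full trip")
```

Under these assumptions the saving comes out near 30 percent of the full cycle time, well short of the 50 percent cut in flying time. That is the general pattern: speeding up one step matters less and less as the steps around it grow.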

In the workings of decision-making in organizations, gaining speed can not be limited to the automating of single tasks or optimizing individual productivity. Speed has to be an organizational concept. The actors need to understand the priorities and not waste time trying to optimize things that aren’t important. But one area that is in desperate need of work is organizational decision-making.

Knowing the Enemies of Speed
What are the enemies of speed? Today, much of analytics is a solitary effort with highly skilled and trained workers expending a significant amount of their time re-configuring data, or waiting for others to do it as aspects of the problem space change. While one group expends a considerable amount of time developing reports, another group pauses to re-format the reports or, more typically, either re-key or import some of the information into their own spreadsheets. Spreadsheets, though an effective tool for individual problem-solving, cause delays when people have to interpret or proofread spreadsheets authored by others. The problem isn’t limited to spreadsheets – it applies in varying degrees to all types of analytical software, but spreadsheets account for the overwhelming proportion of the problem.

And Big Data trickle-down will only make the problem worse. The gap between the first analysts (data scientists) and decision-makers is getting wider.

When information from analytic work is communicated to others, the results are often difficult to explain because they are conveyed in summary form, and usually in aggregated levels, statically. There is no explanation or explicit model to describe the rationale behind the results. These additional presentations about methodology, narratives about the steps involved, alternatives that were considered and rejected (or perhaps just not recommended) and a host of other background material, usually presented in a sequence of time-consuming, serial meetings that have to be scheduled days or weeks in advance, are the greatest enemies of speed today. They turn cycle time into cycle epochs. The reason for all of this posterior explanation is the cognitive dissonance between the various actors. The result is that well-researched and reasonable conclusions are often not actionable because management is not willing to buy in due to their lack of insight into the process by which the conclusions were reached.

The solution to this problem is an environment where complex decisions that have to be made with confidence and consensus can gather recommendations to be presented unambiguously and compellingly across multiple actors in the decision making process.

Gaining Confidence
The late Peter Drucker said that information was “data endowed with relevance and purpose,” but it takes a human being to do that. Unfortunately, one person’s relevance is not necessarily another’s. The process of demonstrating to others what you’ve discovered and/or convinced yourself of can add latency and frustration to the process. The generation of mountains of fixed reports and even beautiful presentations of static displays such as dashboards cannot solve the problem. Henry Mintzberg wrote repeatedly that strategy was never predictable; it was “emergent,” and based on all sorts of imperfect perceptions and conflicting points of view. The lack of confidence that each actor has in every other actor’s methods and conclusions is a serious enemy of speed. It is the cause of endless rounds of meetings, delays and subterfuge.

Cost-cutting will always be a useful effort in organizations because inefficiency will always find a way to creep back in, but the dramatic improvements are largely over. The real battlefield today is differentiation, distancing your enterprise from your competitors. And in an era when every company potentially has access to the same level of best practices and efficiency, the key to leaping ahead is speed. Finding a new insight, a disruption or a discontinuity before anyone else, and being able to act on it, is the ticket to the show. Organizations need to be constantly on the lookout for new ways to streamline, to enhance revenue opportunities, to improve in a multitude of ways, to go faster. Faster decisions, faster to market, faster to understand the environment, faster to go faster. Making things go faster or better is rewarding, but giving time back to people is crazy fast, it is supercharged. The one resource that is in shortest supply is the time and attention of your best people. Give some time back to them and you can change their world.

Posted in Uncategorized | 1 Comment

Miscellaneous Ramblings about Decision Making

By Neil Raden

Decision-making is not, strictly speaking, a business process. Attacking the speed problem for decision-making, which is mostly a collaborative and iterative effort, requires looking at the problem as a team phenomenon. This is especially true where decision-making requires analysis of data. Numeracy, a facility for working with numbers and programs that manipulate numbers, exists at varying levels in an organization. Domain expertise similarly exists at multiple levels, and most interesting problems require contributions and input from more than one domain. Pricing, for example, is a joint exercise of marketing, sales, engineering, production, finance and overall strategy. If there are partners involved, their input is needed as well. The killers of speed are handoffs, uncertainty and lack of consensus. In today’s world, an assembly line process of incremental analysis and input cannot provide the throughput to be competitive. Team speed requires that organizations break down the barriers between functions and enable information to be re-purposed for multiple uses and users. Engineers want to make financially informed technical decisions and financial analysts want to make technically informed economic decisions.

That requires analytical software and an organizational approach that is designed for collaboration between people of different backgrounds and abilities.

All participants need to see the answer and the path to the answer in the context of their particular roles. Most analytical tools in the market cannot support this kind of problem-solving. The urgency, complexity and volume of data needed overwhelms them, but more importantly, they cannot provide the collaborative and iterative environment that is needed. Useful, interactive and shareable analytics can, with some management assistance, directly affect decision-making cycle times.

When analysis can be shared, especially through software agents called guides that allow others to view and interact with a stream of analysis instead of a static report or spreadsheet, time-eating meetings and conferences can be shortened or eliminated. Questions and doubts can be resolved without the latency of scheduling meetings. In fact, guides can even eliminate some of the presentation time in meetings as everyone can satisfy themselves beforehand by evaluating the analysis in context, not just poring over results and summarizations.

Decision making is iterative. Problems or opportunities that require decisions often aren’t resolved completely, but return, often slightly reframed. Karl Popper taught that in all matters of knowledge, truth cannot be verified by testing, it can only be falsified. As a result, “science,” which we can broadly interpret to include the subject of organizational decision-making, is an evolutionary process without a distinct end point. He uses the simple model below:

PS(1) -> TT(1) -> EE(1) -> PS(2)

Popper’s premise was that ideas pass through a constant set of manipulations that yield solutions with better fit but not necessarily final solutions. The initial problem specification PS(1) yields a number of Tentative Theories TT(1); Error Elimination EE(1) tests them and produces a reformulated problem specification, PS(2); and the process repeats. The TT and EE steps are clearly collaborative.

The overly-simplified model that is prevalent in the Business Intelligence industry is that getting better information to people will yield better decisions. Popper’s simple formulation highlights that this is inadequate – every step from problem formulation, to posing tentative theories to error elimination in assumptions and, finally, reformulated problem specifications requires sharing of information and ideas, revision and testing. One-way report writers and dashboards cannot provide this needed functionality. Alternatively, building a one-off solution to solve a single problem, typically with spreadsheets, is a recurring cost each time it comes around.

I’m Getting Convinced About Hadoop, sort of

As I sometimes do, I went to Boulder last week to soak up some of Claudia and Dave Imhoff’s hospitality and to sit in on a BBBT (Boulder BI Brain Trust) briefing in person instead of remotely like most of us do. The company this particular week was Cloudera and I wanted to not only listen to their presentations, but participate in the Q&A give and take as well as have more intimate conversations at dinner the night before. Despite the fact it took eight hours to drive there from Santa Fe (but only six back), it was clearly worth the effort. I certainly enjoyed meeting all the Cloudera people who came, but since this article is about Hadoop, not Cloudera, I’ll skip the introductions.

A common refrain from any Hadoop vendor (the term vendor is a little misleading because the open source Hadoop is actually free) is that Hadoop, almost without qualification, is a superior architecture for analytics over its predecessor, the relational database management system (RDBMS), and its attendant tools, especially ETL (Extract/Transform/Load, more on that below). Their reasoning for this is that it is undeniably cheaper to load Hadoop clusters with gobs of data than it is to expand the size of a licensed enterprise relational database. This is across the board – server costs, RAM and disk storage. The economics are there, but only compelling when you overlook a few variables. Hadoop stores three copies of everything, data can’t be overwritten, only appended to, and most of the data coming into Hadoop is extremely pared down before it is actually used in analysis. A good analogy would be that I could spend 30 nights in a flophouse in the Tenderloin for what it would cost for one night at the Four Seasons.

But I did say I was getting convinced about Hadoop, so be patient.

A constant refrain from the Hadoop world is that it is difficult and time-consuming to change a schema in an RDBMS, but that Hadoop, with its “schema on read” concept, allows for instantaneous change as needed. Maybe not intentionally, but this is very misleading. What is hard in an RDBMS is changing an application, such as a DW, with upstream and downstream dependencies. I can make a change to a database in two seconds. I can add non-key attributes to a Data Warehouse dimension table in an instant. But changing a shared, vetted, secure application is, reasonably, not an instantaneous thing, which illustrates something about the nature of Hadoop applications – they are not typically shared applications. Often, they are not even applications, so this comparison makes no sense. Instead, it illustrates two very important qualities of RDBMSs and Hadoop.
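To make the two ideas concrete, here is a minimal Python sketch. It is not Hadoop or any vendor's code; the sample records, the customer_dim table and the read_with_schema helper are all made up for illustration. The first half imposes structure on raw records only when they are read, and the second half adds a non-key column to a dimension table with a single statement, using SQLite as a stand-in for an RDBMS.

```python
import json
import sqlite3

# Schema on read: raw records are stored untouched (hypothetical sample data),
# and structure is imposed only by whoever reads them.
RAW_EVENTS = [
    '{"customer": "a17", "amount": "19.99", "channel": "web"}',
    '{"customer": "b42", "amount": "5.00"}',
]

def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields at read time."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name]) if name in record else None
               for name, cast in fields.items()}

# Two "schemas" over the same stored bytes; nothing is rewritten on disk.
v1 = list(read_with_schema(RAW_EVENTS, {"customer": str, "amount": float}))
v2 = list(read_with_schema(
    RAW_EVENTS, {"customer": str, "amount": float, "channel": str}))

# Schema on write: adding a non-key attribute to a dimension table
# is one statement (SQLite stands in for the RDBMS here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (customer TEXT PRIMARY KEY, segment TEXT)")
conn.execute("INSERT INTO customer_dim VALUES ('a17', 'retail')")
conn.execute("ALTER TABLE customer_dim ADD COLUMN region TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_dim)")]
```

In both halves the change itself is trivial. What the sketch leaves out is the reports, loads and downstream users that depend on the structure, which is where the real cost of change lives.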

One more item about this “hard to change” charge. Hadoop is composed of the file system, HDFS, and the programming framework, MapReduce. When Hadoop vendors talk about the flexibility and scalability of Hadoop, they are talking about this core. But today, the Hadoop ecosystem (and this is just the Apache open source stuff; there is an expanding soup of add-ons appearing every day) has more than 20 other modules in the Hadoop stack that make it useful. While I can do whatever I want with the core, once I build applications with these other modules there are just as many dependencies up and down the stack that need to be attended to when changing things as in a standard Data Warehouse environment.

But wait. Now we have the Stinger Initiative for Hive, Hadoop’s SQL-ish database, to make Hive 100x faster. This is accomplished by jettisoning MapReduce and replacing it with Tez, the next-generation MapReduce. According to Hortonworks, Tez is “better suited to SQL.” The Stinger Initiative also includes the ORCFile format for better compression, and it vectorizes Tez so that, unlike MapReduce, it can grab lots of records at once. And on top of it all, the crown jewel in any relational database, a Cost-Based Optimizer (CBO), which can only work with a, wait for this, schema! In fact, in the demo I saw today from Hortonworks, they were actually showing iterative SQL queries against, again, wait for it…a STAR SCHEMA! So what happened to schema on read? What happened to how awful RDBMS was compared to Hadoop? See where this is going? In order to sell Hadoop to the enterprise, they are making it work like an RDBMS.

There are four kinds of RDBMS’s in the market today (and this is my market definition, no one else’s): 1) Enterprise Data Warehouse database systems designed from the ground up for data warehousing. As far as I’m concerned, there is only one that can handle massive volumes, huge mixed workloads, broad functionality and tens of thousands of users a day – Teradata 6xxx series; 2) RDBMS designed for transactional processing but positioned for data warehousing too, just not as good at it, such as Oracle, DB2 and MSSQL; 3) Analytical databases, either sold as software-only or as appliances – IBM Netezza, HP Vertica, Teradata 2xxx series; 4) In-memory databases such as SAP HANA, Oracle TimesTen and a passel of others. Now we have a fifth – SQL-compliant (not completely) databases running on top of HDFS. There are more versions of these, too, such as Splice Machine, now in public beta, as well as Drill, Impala, Presto, Stinger, Hadapt and Spark/Shark, to name a few (although Daniel Abadi of Hadapt has argued that “Structured” query language misses the point of Hadoop entirely – flexibility). Now Hadoop is sort of five.

So where are we going with this? Like Clinton in the 90’s, it’s clear Hadoop is moving to the center. Purist Hadoop will continue to exist, but market forces are driving it to a more palatable enterprise offering. Governance, security, managed workloads, interactive analysis. All of the things we have now except cheap platforms for greater volumes of data and massive concurrency.

I do wonder about one thing, though. The whole notion of just throwing more cheap resources at it has to have a point of diminishing returns. When will we get to the point that Hadoop is using 100x or 1000x more resources than would be needed in a careful architecture? Think about this. If we morph Hadoop into just a newer analytical database platform, sooner or later someone is going to wonder why we have 3 petabytes of drives and only 800 terabytes of data. In fact, how much duplication is in that data? How much wasted space? Drives may be cheap, but even a thousand cheap drives cost something, especially when they’re only 20% utilized.
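The arithmetic behind that question is easy to sketch. The Python below uses the 3 petabyte and 800 terabyte figures from the paragraph above and the three-copies behavior mentioned earlier; the function name is made up, and taking 1 PB as 1,000 TB is an assumption for round numbers.

```python
def raw_storage_needed(logical_tb, replication_factor=3):
    """Raw disk consumed when every block is kept replication_factor times."""
    return logical_tb * replication_factor

raw_capacity_tb = 3000   # 3 PB of drives, assuming 1 PB = 1000 TB
logical_tb = 800         # 800 TB of actual data

# Share of the purchased disk that holds unique data.
share_unique = logical_tb / raw_capacity_tb

# Disk actually occupied once the three copies are counted.
raw_used_tb = raw_storage_needed(logical_tb)
share_occupied = raw_used_tb / raw_capacity_tb

print(f"unique data: {share_unique:.0%} of raw capacity")
print(f"occupied with replicas: {raw_used_tb} TB, {share_occupied:.0%} of raw capacity")
```

Under these assumptions the drives are about 80% full, but only about 27% of what was bought holds data that exists nowhere else. That is the number a cost-conscious buyer will eventually ask about.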

Hadoop was invented for indexing search and other internet-related activities, not enterprise software. Its promotion to all forms of analytics is curious. Where did anyone prove that its architecture was right for everything, or did the hype just get sold on being cheap? And what is the TCO over time vs a DW?

And when Hadoop vendors say, “Most of our customers are building Enterprise Data Hubs or (a terrible term) Data Lakes next to their EDW because they are complementary,” it raises the question: for analytics in typical organizations, what exactly is complementary? That’s when we hear about sensors and machine-generated data and the social networks. How universal are those needs?

Then there is ETL. Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop? They need to be reminded that writing code is not quite the same as an ETL tool with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It’s also a little contradictory. If Hadoop is for completely flexible and novel analysis, who is going to write ETL code for every project? Now there is a real latency: only five minutes to crunch the data and 30 days to write the ETL code.

They talk about using Hadoop as an archive to get old data out of a data warehouse, but they fail to mention that that data is unusable without the context that still remains in the DW; nor will it be usable in the DW later after the schema evolves. So what they really mean is use Hadoop as a dump for data you’ll never use but can’t stand to delete, because if you don’t need it in the DW, why do you need it at all?

Despite all this, it’s a tsunami. The horse has left the stable. The train has left the station. Hadoop will grow and expand and probably not even be recognizable as the original Hadoop in a few years, and it will replace the RDBMS as the platform of choice for enterprise applications (even if the bulk of its applications will be SQL-based). I guarantee it. So get on top of it or get out of the way.

Metrics Can Lead in the Wrong Direction

Is it really possible to use measurement — or “metrics,” in the current parlance — to drive an organization? There are two points of view, one widely accepted and current, the other opposing and more abstract.

The conventional wisdom on performance management is that our technology is perfectly capable of providing detailed, current and relevant performance information to stakeholders in an enterprise, including executives, managers, functional people, customers, vendors and regulators. Because we are blessed with abundant computing resources, connectivity, bandwidth and even standards, it is possible to present this information in cognitively effective ways (dashboards and visualization, for example). Recipients are able to receive the information in the manner in which they choose, and the whole process pays dividends by supporting the notion that “If you can’t measure it, you can’t manage it.” It is hard to imagine how anyone could manage a large undertaking without measurement, isn’t it? And most presentations I’ve heard quickly stress that measurement is only part of the solution.

The first step is knowing what to measure; then measuring it accurately; then finding a way to disseminate the information for maximum impact (figuring out how to keep it current and relevant); and then being able to actually do something about the results. A different way of saying this is that technology is never a solution to social problems, and interactions between human beings are inherently social. This is why performance management is a very complex discipline, not just the implementation of dashboard or scorecard technology. Luckily, the business community seems to be plugged into this concept in a way they never were in the old context of business intelligence. In this new context, organizations understand that measurement tools only imply remediation and that business intelligence is most often applied merely to inform people, not to catalyze change. In practice, such undertakings almost always lack a change-management methodology or portfolio.

But there is an argument against measurement, too. Unlike machines or chemical reactions in a beaker, human beings are aware that they are being measured. In the realm of physics, Heisenberg’s Uncertainty Principle demonstrates that the act of measurement itself can very often distort the phenomena one is attempting to measure. When it comes to sub-atomic particles, we can pretty much assume it is a physical law that underlies this behavior. With people, the unseen subtext is clearly conscious. People find the most ingenious ways to distort measurement systems to generate the numbers that are desired. Thus, the effort to measure can not only discourage desired behavior; it can promote dysfunctional behavior. There are excellent, documented examples of this phenomenon in Measuring and Managing Performance in Organizations by Robert D. Austin. The author’s contention is that measurement of people always introduces distortion and often brings dysfunction because measurement is never more than a proxy or an approximation of the real phenomena.

In a particularly colorful analogy, Austin writes:

“Kaplan and Norton’s cockpit analogy would be accurate if it included a multitude of tiny gremlins controlling wing flaps, fuel flow, and so on of a plane being buffeted by winds and generally struggling against nature, but with the gremlins always controlling information flow back to the cockpit instruments, for fear that the pilot might find gremlin replacements. It would not be surprising if airplanes guided this way occasionally flew into mountains when they seemed to be progressing smoothly toward their destinations.”

We all know that incomplete proxies are too easy to exploit in the same way that inadequate software with programming gaps beckons unscrupulous hackers. However, one doesn’t have to be malicious to subvert a measurement system. After all, voluntary compliance with the tax code encourages a national obsession with “loopholes.” And what salesperson hasn’t “sandbagged” a few deals for the next quarter after meeting the quota for the current one?

The solution is not to discard measurement but rather to be conscious of this tendency and to be vigilant and thorough in the design of measurement systems. We all have a tendency toward simplifying things; but in some cases, it appears better to not measure at all than to produce something inadequate. Performance management, to achieve its goals, has to be applied effectively, which is to say, with superior execution of technology, implementation and management. It has to be designed to be responsive to both incremental and unpredicted changes in the organization and the environment. There are no road maps for this. This is truly the first time that analytical and measurement technology can be embedded in day-to-day, instantaneous decision-making and tracking; and the industry is sorely lacking in skills and experience to pull it off. Those organizations that have been successful so far have relied on existing methodologies (activity-based costing or balanced scorecard, for example) to guide them through the more uncertain steps of metric formulation and change management to close the loop.

The question of whether you can ever adequately measure an organization is still open. To the extent that there are statutory and regulatory requirements, such as taxation, SEC or specific industry regulations, the answer is clearly yes. But those measurements are dictated. To measure performance after the fact, at aggregated levels, is only useful to a point. The closer a measurement system gets to the actual events and actions that drive the higher-level numbers, the less reliable the cause-effect relationship becomes, just as Heisenberg found so long ago. There are many examples in the management literature of everyone “doing the right thing” while the wheels are coming off the organization.

Recommended Reading:

Measuring and Managing Performance in Organizations, by Robert D. Austin (New York: Dorset House Publishing, 1996)
In the Age of the Smart Machine: The Future of Work and Power, by Shoshana Zuboff (New York: Basic Books, 1988)
“The New Productivity Challenge,” by Peter Drucker, Harvard Business Review (Nov.-Dec. 1991): p. 70.

A Bit About Storytelling

My take on storytelling

1. Must be a “story” with a beginning, middle and end that is relevant to the listeners.
2. Must be highly compressed.
3. Must have a hero – the story must be about a person who accomplished something notable or noteworthy.
4. Must include a surprising element – the story should shock the listener out of their complacency. It should shake up their model of reality.
5. Must stimulate an “of course!” reaction – once the surprise is delivered, the listener should see the obvious path to the future.
6. Must embody the change process desired, be relatively recent and “pretty much” true.
7. Must have a happy ending.

In Stephen Denning’s words, “When a springboard story does its job, the listeners’ minds race ahead, to imagine the further implications of elaborating the same idea in different contexts, more intimately known to the listeners. In this way, through extrapolation from the narrative, the re-creation of the change idea can be successfully brought to birth, with the concept of it planted in listeners’ minds, not as a vague, abstract inert thing, but an idea that is pulsing, kicking, breathing, exciting – and alive.”

That may be a little too much excitement on a daily basis, something you save for the really important things, but it matters nonetheless that turning data into a story is a valid and necessary skill. But is it for everyone?

Not really. Actual storytelling is a craft. Not everyone knows how to do it or can even learn it. But everyone can tell a story. It just may not be of the caliber of storytelling. But to get a point across and have it stick (even if it’s just in your own mind, not to an audience), learn to apply metaphor.

More on metaphor later

Understanding Analytical Types and Needs

By Neil Raden, January, 2013

Purpose and Intent

“Analytics” is a critical component of enterprise architecture capabilities, though most organizations have only recently begun to develop experience using quantitative methods. As Information Technology emerges from a scarcity-based mentality of constrained and costly resources to a commodity consumption model of data, processors and tools, analytics is quickly becoming table stakes for competition.

This report is the first of a two-part series. (Part II will cover analytic functionality and matching the right technology to the proper analytic tools and best practices.) It discusses the importance of understanding the role of analytics, why it is a difficult topic for many, and what actions you should take. It will explore the various meanings of analytics and provide a framework for aligning the various types of analytics with the associated roles and skill sets needed.

Executive Summary

Using quantitative methods is rapidly becoming not an option for competitive advantage but, at the very least, barely enough to keep up. Everyone needs to understand what’s involved in analytics, what your particular organization needs and how to do it.

Few people are comfortable with the concepts of advanced analytic methods. In fact, most people cannot explain the difference between a mean, a median and a sample mean. The misapplication of statistics is widespread, but today’s explosion of data sources and intriguing technologies to deal with them have changed the calculus. Embedded quantitative methods may relieve analysts of the actual construction of predictive models, but applying those models correctly requires understanding the different analytical types, roles and skills.
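For readers who want that distinction pinned down, here is a small Python sketch using only the standard library. The order values are made up; one very large order is included on purpose to skew the data.

```python
import random
from statistics import mean, median

# Hypothetical order values in dollars; the 400 is a deliberate outlier.
population = [30, 32, 35, 38, 40, 45, 400]

# Mean: the arithmetic average of every value, pulled upward by the outlier.
population_mean = mean(population)

# Median: the middle value once sorted, barely affected by the outlier.
population_median = median(population)

# Sample mean: the average of a random subset. It estimates the population
# mean and changes from sample to sample (the seed only makes this run repeatable).
rng = random.Random(0)
sample = rng.sample(population, 3)
sample_mean = mean(sample)

print(f"mean={population_mean:.2f} median={population_median} sample_mean={sample_mean:.2f}")
```

The mean here is about 88.57 while the median is 38, so the two "averages" tell very different stories about the same data. The sample mean will land somewhere between the smallest and largest values depending on which three orders happen to be drawn.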

Analytics in the Enterprise

The emphasis of analytics is changing from one of long-range planning based on historical data, to dynamic and adaptive response based on timely information from multiple contexts, augmented and interpreted through various degrees of quantitative analysis. Analytics now permeates every aspect of leading organizations’ operations. Competitive, technological and economic factors combine to require more precision and less lag time in discovery and decision-making.

For example, operational processing, the orchestration of business processes and secure capture of transactional data is merging with analytical processing, the gathering and processing of data for reporting and analysis. Analytics in commercial organizations has historically been limited to special groups working more or less off-line. Platforms for transaction processing were separated for performance and security reasons, an effect of “managing from scarcity.” But scarcity is not the issue anymore as the relative cost of computing has plummeted. Driven equally by technology and competition, operational systems are either absorbing or at least cooperating with analytical processes. This convergence elevates the visibility of all forms of analytics.

Confusion and mistakes in deploying analytics are common due to imprecise understanding of the various forms and types. Uncertainty about the staff and skills needed for various “types” of analytics is common. Messaging from technology vendors, service providers and analysts is murky and misleading, sometimes deliberately so.

The urgency behind implementing an analytics program, however, can be driven not by getting a leg up, but rather by not falling behind.

Analytics and the Red Queen Effect

Analytics are crucial because the barriers to getting started are lower than ever. Everyone can engage in analytics now, of one type or another. As analytic capabilities increase across competitors, everyone must step up – it’s a Red Queen[i] effect. When everyone was shooting from the hip, efficiency was a matter of degree. If everyone used crude models and unreliable data, then everyone should, more or less, work within the same margin of error. What separated competitors was good strategy and good execution. But now that everyone can employ quantitative methods and techniques like Naive Bayes, C4.5 and support vector machines, it will still be the strategy and execution that count. Companies must improve just to stay in place. Each new level of analytics becomes the “table stakes” for the next.

Can You Compete on Analytics? Analytics Are Necessary – but Not Sufficient

Statistical methods using software have been shown to be useful in many aspects of an organization, such as fraud detection, demand forecasting and inventory management, but just using analytics has not been shown to necessarily improve the fortunes or effectiveness of the overall organization. In 2007, Davenport and Harris released their influential book[ii], Competing on Analytics, which described how a dozen or so companies used “analytics” to not only advise decision-makers, but to play a major role in the development of strategy and implementation of business initiatives. The book found a huge following and was a bestseller on the business book lists. It certainly placed the word “analytics” at the top of mind for many decision makers. However, when comparing the fortunes of the twelve companies highlighted in the book, their performance in the stock market is less than spectacular, as illustrated in Figure 2:

This scenario is often repeated – good work is performed inside an organization, but the benefits of the discipline do not permeate other parts of the business and, hence, have little effect on the organization as a whole. In another example, statistical methods have been used in the U.S. in agriculture for decades, and yields have improved dramatically, but the quality of the food supply has clearly degraded along with the fortunes of individual farmers.

Too many organizations, despite good intentions, do not see dramatic improvement in their fortunes after adopting wider-based analytical methods because:

First, rarely does one thing change a company. Analytics are a powerful tool, but it takes execution to realize the benefits. Perhaps if good analytical technique had been applied across the board along with a clear strategy to drive decisions based on quantitative models, better results might have followed. Instead, as is often the case, a visible project shows great promise and early results, but the follow-through is wanting.

Data mining tools can actually be predictive, showing what is likely to happen or not happen. But what is often misunderstood is that data mining tools are usually poor at specifying when things will happen. In this case, too much faith is placed in the models, imbuing them with fortune-telling capabilities they simply lack. The correct approach is to test, run proofs of concept, and once in production engage in continuous improvement through mechanisms like champion/challenger and A/B testing.
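As one illustration of what champion/challenger or A/B testing can look like in practice, here is a minimal Python sketch of a two-proportion z-test. This is a standard statistical check, not a method taken from any particular product, and the response counts are hypothetical. The champion is the model in production and the challenger is the candidate replacement.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare two response rates; returns (z, two-sided p-value).

    Uses the pooled-proportion standard error, which is reasonable for
    large samples. Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical campaign: champion converts 200 of 5,000 offers,
# challenger converts 260 of 5,000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)

# Promote the challenger only if it is better and the gap is unlikely to be noise.
promote_challenger = z > 0 and p_value < 0.05
print(f"z={z:.2f} p={p_value:.4f} promote={promote_challenger}")
```

The 0.05 threshold is a convention, not a rule. The point is that the decision to replace a model rests on a measured comparison in production rather than on faith in the model.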

Most companies try to understand customer behavior, which you can do with data mining, but a model rarely captures the randomness of people’s behavior, and that leads to overconfidence in the models. Given that this customer is likely to purchase a car, when is the correct time to reach out? Perhaps right away, perhaps not. Data mining tools are not very good at individual propensities derived from behavior due to the randomness of human behavior. It is pretty common for inexperienced modelers to put too much faith in model results. The solution is to engage experienced talent to get a program started on the right track.

Return on investment in analytics is difficult to measure because there isn’t often a straight line from the model to results. Other parts of the organization contribute. An analytical process can inform decisions, either human or machine-driven, but the execution of those decisions is beyond the reach of an analytical system. People and process have to perform too. In addition, a successful analytical program can be the result of a well-defined strategy. Positive results from analytics would not have been possible without the formation of that strategy.

Professionals skilled in statistics, data mining, predictive modeling and optimization have been a part of many organizations for some time, but their contribution, and even an awareness of what they do, is sometimes poorly understood, and their work faces many impediments to success. By categorizing analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques (the business applications that they support are detailed in Part II of the series), companies can begin to understand when and how to use analytics effectively and deploy their analytic resources to achieve better results.

The Four Types of Analytics

There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. What follows is a way to characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.

Figure 4: The Four Types of Analytics

Type | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles
Type I | Quantitative Research | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles
Type II | Data Scientist or Quantitative | Advanced Math/Stat, not necessarily PhD | Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge
Type III | Operational Analytics | Good business domain, background in statistics optional | Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation
Type IV | Business Intelligence/Discovery | Data and numbers oriented, but no special advanced statistical skills | Reporting, dashboard, OLAP and visualization use, possibly design. Performing posterior analysis of results driven by quantitative methods

Type I Analytics: Quantitative Research

The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. Quantitative Research analytics are performed by mathematicians, statisticians and other pure quantitative scientists. They discover new ideas and concepts in mathematical terms and develop new algorithms with names like Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, Machine Learning and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions (though not exclusively). Commercial, governmental and other organizations (Google or Wall Street for example) employ staff with these very advanced skills; but in general, most organizations are able to conduct their necessary analytics without them, or simply employ the results of their research. An obvious example is the FICO score, developed by Quantitative Research experts at FICO (formerly Fair Isaac) but employed widely in credit-granting institutions and even human resource organizations.

Type II Analytics: “Data Scientists”

More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, and even the heavy “quants” in industry who apply these methods specifically to the work they do like fraud detection, failure analysis, propensity to consume models, among hundreds of other examples. They operate in much the same way as commercial software companies but for just one customer (though they often start their own software companies too). The popular term for this role is “data scientist.”

“Heavy” Data Scientists. The Type II category could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function – providing guidance and expertise in the application of quantitative analysis – they are differentiated by the sophistication of the techniques applied. II-A practitioners understand the mathematics behind the analytics and may apply very complex tools such as Kucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by a small number of practitioners. What differentiates the Type II-A from Type I is not necessarily the depth of knowledge they have about the formal methods of analytics (it is not uncommon for Type II’s to have a PhD for example), it is that they also possess the business domain knowledge they apply and their goal is to develop specific models for the enterprise, not for the general case as Type I’s usually do.

“Light” Data Scientists. Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, CHAID and various forms of linear regression. They approach the problems they deal with using more conventional best practices and/or packaged analytical solutions from third parties.

Data Scientist Confusion. “Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from businesses like Google or Facebook where the data is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the kind of in-depth knowledge of the whole business that, say, actuaries have in the insurance business, whose extensive training should be a model for the newly designated data scientists (see our blog at: “What is a Data Scientist and What Isn’t”).

Though the point is not universally accepted, data scientists must be able to effectively communicate their work to non-technical people. This is a major discriminator between a data scientist and a statistician. It is absolutely essential that someone in the analytics process have the role of chief communicator, someone who is comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand. Companies often fail to see that there is almost never anything to be gained by trying to put a PhD statistician into the role of managing a group of analysts and developers. It is safe to say that this role is represented more by a collaborative group of professionals than by a single individual.

Type III Analytics: Operational Analytics

Historically, this is the part of analytics we’re most familiar with. For example, a data scientist may develop a scoring model for his/her company. In Type III activity, the operational analyst chooses parameters and inputs them into the model; the scores calculated by the Type II model are then embedded into an operational system that, say, generates offers for credit cards. Models developed by data scientists can be applied and embedded in an almost infinite number of ways today. The application of Type II models to real work is the realm of operational analysts. In very complex applications, real-time data can be streamed into applications based on Type II models, with outcomes instantaneously derived through decision-making tools such as rules engines.
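To make the hand-off concrete, here is a minimal sketch of a Type II scoring model embedded in a Type III decision step. Everything in it is an assumption for illustration: the coefficients, the field names (income_k, late_payments), the threshold and the single hard rule are hypothetical, and a real rules engine would be considerably richer.

```python
import math

# Hypothetical coefficients standing in for a model delivered by a data scientist.
MODEL = {"intercept": -2.0, "income_k": 0.03, "late_payments": -0.8}


def score(applicant: dict) -> float:
    """Return a score between 0 and 1 from a simple logistic model."""
    z = MODEL["intercept"] + sum(
        MODEL[name] * applicant[name] for name in MODEL if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-z))


def decide_offer(applicant: dict, threshold: float = 0.5) -> str:
    """Toy rules step: the operational analyst picks the threshold."""
    # Hard business rule applied before the model score is consulted.
    if applicant["late_payments"] > 3:
        return "no_offer"
    return "offer" if score(applicant) >= threshold else "no_offer"


if __name__ == "__main__":
    applicant = {"income_k": 100, "late_payments": 0}
    print(decide_offer(applicant))
```

The operational analyst never touches the coefficients; the only knobs exposed here are the threshold and the business rule, which is roughly the division of labor between Type II and Type III described above.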

Packaged applications that embed quantitative methods such as predictive modeling or optimizations are also Type III in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of “black box.” As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.

Decision-making systems that rely on quantitative methods not well understood by their operators can lead to trouble. They must be carefully designed (and improved) to avoid burdening the recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining: generating “interesting” results without understanding what was relevant usually led to flagging interest in the technology. In today’s business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine those messages to those that are the most relevant and useful.

False negatives are quite a bit more problematic than irrelevant alerts (false positives), as they can let through transactions that should have been stopped. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.
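The two kinds of error can be counted separately once outcomes are known. The sketch below assumes two parallel lists of booleans, one recording whether the system flagged each transaction and one recording whether the transaction later proved bad; the function name and data layout are hypothetical.

```python
def alert_errors(flagged, actually_bad):
    """Count false positives and false negatives for parallel lists of booleans."""
    pairs = list(zip(flagged, actually_bad))
    # Flagged but harmless: wasted attention for the recipient.
    false_positives = sum(1 for f, b in pairs if f and not b)
    # Harmful but not flagged: the transaction that slipped through.
    false_negatives = sum(1 for f, b in pairs if b and not f)
    return false_positives, false_negatives


if __name__ == "__main__":
    flagged = [True, True, False, False]
    actually_bad = [True, False, True, False]
    print(alert_errors(flagged, actually_bad))
```

Tracking the two counts separately matters because their costs differ: a false positive wastes someone's time, while a false negative is the trade nobody caught.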

Type IV Analytics: Business Intelligence & Discovery

Type III analytics aren’t of much value if their application in real business situations cannot be evaluated for effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. It includes almost any activity that reviews information to understand what happened or how something performed, or that scans the data and free-associates on whatever patterns appear. The mathematics involved is simple. But pulling the right information – and understanding what that information means – is still an art and requires both business sense and knowledge about sources and uses of the data.
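As an illustration of how simple the mathematics usually is, the following sketch computes an acceptance rate per customer segment for a batch of credit card offers. The segment names and the (segment, accepted) record layout are assumptions made up for the example.

```python
from collections import defaultdict


def acceptance_rate_by_segment(offers):
    """offers: iterable of (segment, accepted) pairs; returns {segment: rate}."""
    sent = defaultdict(int)
    accepted = defaultdict(int)
    for segment, was_accepted in offers:
        sent[segment] += 1
        if was_accepted:
            accepted[segment] += 1
    # Only segments that received at least one offer appear in `sent`.
    return {segment: accepted[segment] / sent[segment] for segment in sent}


if __name__ == "__main__":
    offers = [("students", True), ("students", False), ("retirees", True)]
    print(acceptance_rate_by_segment(offers))
```

The arithmetic is counting and division; the hard part, as noted above, is knowing which segments and which outcomes are worth counting.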

Know Your Needs First

The scope of analytics is vast, ranging from the familiar features of business intelligence to the arcane and mysterious world of applied mathematics. Organizations need to be clear on their objectives and capabilities before funding and staffing an analytic program. Predictive modeling to dramatically improve your results makes for good reading, but the reality is quite different. The four types are meant to help you understand where you can begin or advance.

These categories are not hard and fast. Some activities are clearly a blend of various types. But the point is to add some clarity to the term “analytics” in order to understand its various use cases. Tom Davenport, for example, advocated creating a cadre of “PhDs with personality” in order to become an analytically competitive organization. That is one approach. Implementing analytics as part of other enterprise software you already have – or purchasing a specialized application that is already used and vetted in your industry – is a better place to start.


Clear terminology can avoid confusion, not just within your organization but also in communication with vendors and service providers. To get the most out of analytics:

  • Be clear about what you need.  Having clarity on the meaning of analytics has clear benefits. Because the nature of analytics is a little mysterious to most people, a vendor statement that they provide “embedded predictive analytics” can no longer be taken at face value. You should look closely to see if those capabilities line up with your needs.
  • Don’t assume high value means high resource costs. In the same vein, you needn’t hesitate to begin analytical projects because you believe you need to source a dozen PhDs, when in fact, your needs are in the Type II category.
  • Formulate specific vendor questions based on what level of sophistication and resources you need. Once you have specified what type of analytics you need, it becomes easy to ask: Is this tool designed to discover and create predictive models, or to deploy them from other sources? Do you offer training in quantitative methods or only in the use of your product? Is the tool designed for authoring scoring models or just using scored values?
  • Use analytic knowledge to start to prepare for Big Data.  Understanding what type of analytics – and results – you need will even help you in your soon-to-be-serious consideration of Big Data solutions, including Hadoop, its variants and its competitors, all of which use variants of the above techniques to process large quantities of information.

Analytics is a catchall phrase, but understanding the various uses and types should help in implementing the right approach for accomplishing the tasks at hand.  It should also help in discerning what is meant when the term is used, as almost anything can be called analytics.

Next Steps

Part II of this series will examine in depth the forms that analytics take in the organization and the business purposes they serve, and demonstrate through examples and case studies how analytics of all types are successfully employed. But analytics are only a step in the process. Without effective decision-making practices the value in analytics is lost. Part III of this series will deal with decision making and decision management.

Author Bio: Neil Raden

Analyst, Consultant and Author in Analytics and Decision Science

Neil Raden (nraden@hiredbrains.com) is the founder and Principal Analyst at Hired Brains Research, a provider of consulting and implementation services in business intelligence, analytics and decision management. Hired Brains focuses on the needs of organizations and capabilities of technology. Neil began his career as a property and casualty actuary with AIG in New York before moving into predictive analytics services, software engineering, and systems integration, with experience in delivering environments for decision making in fields as diverse as health care, nuclear waste management, cosmetics marketing and many others in between.


[i] The Red Queen is a concept from evolutionary biology, popularized in Matt Ridley, The Red Queen: Sex and the Evolution of Human Nature, (New York: Macmillan Publishing Co, 1994). The allusion is to the Red Queen in Lewis Carroll’s Through the Looking-Glass, who had to keep running just to stay in place.

[ii] Thomas H. Davenport and Jeanne G. Harris, Competing on Analytics: The New Science of Winning, (Boston: Harvard Business School Press, 2007).
