When Are Decisions Driven by Analytics, or Merely Informed by Them?

In http://www.dataintegrationblog.com/data-quality/from-tactical-to-strategic-action-operational-decision-management-2/

Boy did Julie Hunt ever hit the nail on the head: 

“But – real-time decision-making also has to be vetted with domain knowledge, human experience and common sense, to validate the viability of analytics results. Decisions make a positive difference for the enterprise only if they are based on accurate intelligence. While many things are possible with predictive analytics, there is always the danger of trying to force ‘reality’ to fit the model. This can be deadly to real-time operational decision-making.”

When it comes to decisions that can be made via models, you have to separate them into two categories: those that do not require 100% precision, and those that are too important to get wrong. 

For example, routing a call center call, approving a credit line increase, rating a car insurance premium – these are all “decisions” that are made in high volume, but getting some of them wrong, in the aggregate, causes little harm. Obviously, the closer you get to perfect performance the better, but you can allow these decisions to be made without human interference. Obviously you track the result and continuously improve the models.

On the other hand, many decisions in an enterprise are too important to turn over to some algorithms. In these cases, the quantitative analysis can be a part of the decision process, but ultimately the decision vests with the person or persons who take responsibility for it. In point of fact, very few managers are comfortable with answers based on probability. The difference between 80% probability and 95% probability simply doesn’t resonate. For important decisions, managers want one answer, and that requires discussion and consensus. 

We have to be very careful not to over-promise on analytics. 

Posted in Big Data, Business Intelligence, Decision Management | Tagged , , , , | 1 Comment

Personalized Medicine World Conference

This is the fourth or fifth year for this conference, and each year there are some surprises. The first couple of years it was a diverse collection of researchers, entrepreneurs and vendors (Oracle, Deloitte, etc.). The number of exhibitors seems about the same as last year, but there was a small booth for SAP HANA, which was sort of a surprise, but I learned they are aggressively going after the life sciences sector and Hasso Plattner was a keynote speaker. That’s a pretty good sign this conference is getting pretty commercial.

Like any conference, a few stars emerge and become familiar and repeat presenters. Atul Butte is one example. Here is his bio:

Atul Butte, MD, PhD is Chief of the Division of Systems Medicine and Associate Professor of Pediatrics, Medicine, and by courtesy, Computer Science, at Stanford University and Lucile Packard Children’s Hospital. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft, received his MD at Brown University, trained in Pediatrics and Pediatric Endocrinology at Children’s Hospital Boston, then received his PhD in Health Sciences and Technology from Harvard Medical School and MIT. Dr. Butte has authored more than 100 publications and delivered more than 120 invited presentations in personalized and systems medicine, biomedical informatics, and molecular diabetes, including 20 at the National Institutes of Health or NIH-related meetings.

I did find Dr. Butte’s presentation about how bioinformatics tools applied to big public data have yielded new uses for drugs and new prototype drugs and diagnostics for type 2 diabetes. It was an interesting discussion of what we call big data analytics, but in the end, it just came back to making more drugs. 

When you attend medical conferences, speakers always have these extensive pedigrees, but what I wonder is, with all of the esteem, what sort of doctors are they? Are they too distanced from day-to-day clinical work to see the problems and possibilities? Are their decisions made within a bubble that excludes consideration of alternatives? That is the sense I get listening to them. 

A common term used by many of the speakers was “omics.” First we had genomics, then epigenomics followed by proteomics or metabolomics. All of these areas combine both bench science and informatics on a huge scale. The hope is that the digital examination of these minute measurements can lead to cures for diabetes, cancer, heat disease and Alzheimers.

Michael Snyder, Ph.D., Professor & Chair, Stanford Center of Genomics & Personalized Medicine gave a notable and introspective presentation about the use of a combination of omics methods to assess health states in a single individual over the course of almost three years (himself). Genome sequencing was used to determine disease risk. Longitudinal personal profiling of transcriptome, proteome and metabolome was used to monitor disease, including viral infections and the onset of diabetes. His premise is that these aproaches can transform personalized medicine. It was discovered that he carried genes for diabetes and in fact developed it during the period, but, in my opinion, failed to see the causal effect of poor sleep from repeated respiratory infections that corresponded with the spike in blood sugar. In some ways, it seems these brilliant scientists just don’t see the forest from the trees, and that hurts us.


Steven C Quay, M.D., Ph.D., FCAP, Founder, Atossa Genetics, Inc. pitched his own company devoted to obtaining routine, repeated, “painless” breast biopsy samples non-invasively for cytopathology, NGS, proteome, and transcriptome analysis of precursors to breast cancer; The use of breast specimens obtained non-invasively for biomarker discovery, clinical trial support, and patient selection, and to inform personalized medical therapy; Cancer prevention using intraductal treatment of reversible hyperplastic lesions.

Two problems with his presentation. NO ONE KNOWS HOW TO PREVENT BREAST CANCER. Also, the “painless” techniques are almost medieval. If you don’t believe me, look up “ductal lavage” and let me know if you’d want to submit to that repeatedly.

The problem is no one ever seems to use the word cure, or to speculate why these diseases exist at all. All of the research presented seems to end with the following refrain: “Hopefully leading to the development of new drugs…” Well, follow the money. 

I don’t know if I’ll go next year. 



Posted in Big Data, Decision Management, Genomics, Medicine, Research | Tagged , , , , , , , , , , | Leave a comment

Forked SQL: Informatica Gets It

By Neil Raden

About fifteen years ago, Microstrategy cofounder Sanju Bansal told me, “SQL is the best hope for leveraging the latent value from databases.” Fifteen years later, it’s extraordinary how correct Bansal was. Microstrategy is still a robust, free-standing Business Intelligence company while most of the proprietary multidimensional databases have disappeared. But what about the next fifteen years?

At least for business analytics, SQL is under attack. So much so, that there is an entire emergent market segment called NoSQL. If SQL itself is under siege, what about the myriad technologies that in one way or another are part of the SQL ecosystem like Informatica? Are they obsolete? Will we need to throw away the baby with the bathwater?

This whole dustup has been brewing for a decade or more. Since we started using computers in business 60+ years ago, the big machines were managed by a separate group of people with specialized skills, now generically referred to as IT. Though separate and decidedly non-businesslike, IT eventually became bureaucratic, fixed in its mission to control everything data. Even SQL was adopted very slowly, but it is solidly the tool of choice for most applications.

About ten years ago, though, a renegade group of people I called “the pony-tail-haired guys” (PTH for short) appeared on the scene with their externally-focused web sites and gradually, development tools, methods and monitoring software. At first, IT paid no attention to them because they didn’t interfere with the inwardly-focused enterprise computing environment, but as the perceived value of “e-business” developed, a great deal of friction and turf war fighting erupted. Computing bifurcated inside the firewall.

The PTH preferred web-oriented tools, open source software and search. That’s why Big Data/Hadoop and NoSQL are so divergent from enterprise computing. Different people, different applications, different brains.

But when it comes to Business Analytics, the lines are not so clear. The PTH found, just as enterprise people did (reluctantly), analytics is key to everything else. But while enterprise apps measure sales and revenue, the PTH guys are looking at really strange things like sentiment analysis. Today, I can download a Fortune 500 general ledger to my watch, but things like sentiment analysis look at 100’s of 1000’s times more data. Loading into a relational database and analyzing the data with a language designed for set operations and transactions just doesn’t work. So is there a justification for something other than SQL for this? Of course there is.

But the PTH guys, now that they’ve grown up a little, are starting to show the same sclerotic tendencies as their IT colleagues, assuming that their tools and methodologies are the ONLY tool for analytics and SQL needs to be put in the dustbin. That’s just silly. Existing data warehouse, data integration and data analysis and presentation tools are well-suited to lots of tasks that aren’t going away any time soon, though methodologies and implementations are in dire need for renovation, something I call a Surround Strategy as opposed to the ridiculous outdated idea of the Single Version of the Truth.

Here is where Informatica enters the picture. With all the breathless enthusiasm over Big Data, one thing is often lost in the rush: no fundamental laws of physics concerning data integration have been altered. This places Informatica squarely in the middle of every trend grabbing headlines today: big data, social analytics, virtualization and cloud.

Any type of reporting or analysis, whether through traditional ETL and data warehousing, Hadoop, Complex Event Processing or even Master Data Management deals with “used data,” meaning, data created through some original process. All used data has one attribute in common – it doesn’t like to play with other used data. Data extracted from a primary source typically has hidden semantics and rules in the application logic so that, on it’s own, it often doesn’t make sense.

This has been the most difficult part of assembling useful data for analysis historically, and with the entrance of mountains of non-enterprise data, the problem has only grown larger.

Luckily for the fortunes of Informatica, this is a big opportunity and they have stepped up.

For in depth descriptions of the following Big Data, cloud, Hadoop SaaS innovations coming from Informatica, refer to their materials at http://www.informatica.com. Briefly, they include:

Hparser, at facility to visually prepare very large datasets for processing in Hadoop, saving developer/analyst a substantial amount of time from what is mostly hand coding.

Complex Event Processing (CEP) which isn’t new for Informatica, but given the expanded complexity of data integration in the Big Data era, they have incorporated CEP into their own platform to detect and inform of events that can affect the performance, accuracy and management of the data from teratogenic processes.

Informatica pioneered the GUI diagram for building integration mappings, but 9.5 implements an integration optimizer which detects the optimal mapping rather than processing the diagram literally.

With each new release of a complex software product is the time-consuming and nerve-wracking routine of upgrades. 9,5 now provides automated regression testing to dramatically reduce the time and pain of upgrades.

In Information Lifecycle Management, 9.5 provides “Intelligent Partitions” to distribute data across devices hot, warm and cold storage.

Add to this, facilities for virtualization, replication and a slew of offerings for cloud-based applications – there is an inescapable logic for Informatica exploiting the new opportunities of Big Data.

Posted in Big Data, Research | Tagged , , , | 5 Comments

BI Is Dead! Long Live BI!


Executive Summary

We suggest a dozen best practices needed to move Business Intelligence (BI) software products into the next decade. While five “elephants” occupy the lion’s share of the market, the real innovation in BI appears to be coming from smaller companies. What is missing from BI today is the ability for business analysts to create their own models in an expressive way. Spreadsheet tools exposed this deficiency in BI a long time ago, but their inherent weakness in data quality, governance and collaboration make them a poor candidate to fill this need.  BI is well-positioned to add these features, but must first shed its reliance on fixed-schema data warehouses and read-only reporting modes. Instead, it must provide businesspeople with the tools to quickly and fully develop their models for decision-making.


Why BI Must Transition From The Past Century

In this avant-garde era of Big Data, cloud, mobile and social, the whole topic of BI is a little “derriere.” BI is a phenomenon of the previous two decades, but the worldwide market for BI tools (not including services or surrounding technologies such as data warehousing and data integration) is greater than $10 billion per year. It’s still a very significant market and will, for some time, dwarf spending on the Big Data top gun, Hadoop, which is an open source distribution.  The lion’s share of revenue in the Big Data market will continue to be hardware and services, not software, unless you consider the application of existing technologies, especially database and data integration software, as part of Big Data.

Because BI is still alive, it’s worth revisiting some concepts I wrote about a few years ago — abstraction and model-driven design in BI. In the proto-BI days, when it was still known as Decision Support Systems (DSS), high-level declarative languages were used to create business models that could handle user-entered (or background-loaded) data. These systems were interactive, allowing for what-if analysis, testing sensitivity of the models and even the application of advanced statistical techniques.

Data warehousing changed all of that. Performance of relational databases for interactive   reporting from static schema was a challenge. Adding data to the warehouse interactively through iterative analysis was out of the question. Adding entities to the schema to accommodate new models required data modeling and schema changes, often reloading data and associated testing before implementation.

As data warehouses became the preferred way to provide data for reporting and analysis, BI vendors that focused on the read-only nature of data warehousing prospered and dominated the field. Business modeling fell from favor, or, more precisely, fell to Excel. The only exception was found in budgeting and planning packages, mostly based on Multi-Dimensional Online Processing (MOLAP) databases, predominantly Essbase and Microsoft BI.


The Era Of Big Data Means Business Analytics

The major BI vendors of the late 1990s through today are BusinessObjects (SAP), Cognos (IBM), SAS, Hyperion (Oracle), Microsoft and Microstrategy. At these six vendors (who together comprise more than two-thirds of the BI market), only about 20 percent of combined revenue came from tools that provided modeling capabilities. The rest came from strictly read-only data warehouses and marts[i].

Now that the era of Big Data is here, modeling has to move up a notch. While newer technologies are in play for capturing and massaging Big Data, the need for business analytics is greater than ever. In the past, the BI calculus more or less ended with informing people. Today, BI must enable actions and decisions that are supported by deep insight. Excel may be a good container for interacting with models, but its internal capabilities aren’t sufficient. The need for BI tools will not disappear, but it’s time to break the read-only mold.


Moving From Data To Decisions Through Business Modeling

Business modeling can be an imprecise term. But in general, it means creating descriptive replicas of a part of a business — such as assets, processes or optimizations — in terms that are consonant with the people (and processes) that use them. Usually, the goal is to render these models into computer-based applications. In general, a business model is created by someone who has certain knowledge about a process or function in the business. This can range from a single fact, such as how operating cash flow is calculated, to something as broad as how the manufacturing plants operate.

To be effective, businesspeople need more than access to data: They require a seamless process that lets them interact with the data and drive their own models and processes. In today’s environment, these steps are typically disconnected and, therefore, expensive and slow to maintain without the data quality controls from the IT department. The solution: a better approach to business modeling coupled with an effective architecture that separates physical data models from semantic ones. In other words, businesspeople need tools to address physical data through an abstraction layer that allows them to address only the meaning of data — not its structure, location or format.

Abstraction is applied routinely to systems that are somewhat complex and especially to systems that frequently change. A 2012 model car contains more processing power than most computers only a decade ago. Driving the car, even under extreme conditions, is a perfect example of abstraction. Stepping on the gas doesn’t really pump gas to the engine, it alerts the engine management system to increase speed by sampling and alerting dozens of circuits, relays and microprocessors to achieve the desired effect. These actions are subject to many constraints, such as limiting engine speed and watching the fuel-air mixture for maximum economy or minimum emissions. If the driver needs to attend to all of these things directly, he would not get out of the driveway.

Data warehouses and BI tools still rely on at least some of the business users’ understanding of the data models and semantics, and sometimes the intricacies of the crafting of queries. This is a huge barrier to progress. Businesspeople need to define their work in their own terms. A business modeling environment is needed for designing and maintaining structures, such as data warehouses and all of the other structures associated with it. It is especially important to have business modeling for the inevitable changes in those structures. It is likewise important for leveraging the latent value of those structures through analytical work. This analysis is enhanced by understandable models that are relevant and useful to businesspeople.

BI isn’t standalone anymore. In order to close the loop, it has to be implemented in an architecture that is standards-based. And it has to be extensible and resilient, since the boundaries of BI are more fuzzy and porous than other software. While most BI applications are fairly static, the ones of most value to companies are the flexible and adaptable ones. In the past, it was acceptable for individual “power users” to build BI applications for themselves or their group without regard to any other technology or architecture in the organization. Over time, these tools became brittle and difficult to maintain because the initial design was not robust enough to adapt to the continuous refinements and changes needed by the organization. Because change is a constant, BI tools need to provide the same adaptability at the definitional level that they currently do at the informational level. The answer is to provide tools that allow businesspeople to model.


Business Modeling Emerges As a Critical Skill

All businesspeople use models, though most of them are tacit or implied, not described explicitly. Evidence of these models can be found in the way workers go about their jobs (tacit models) or how they have authored models in a spreadsheet. The problem with tacit models is that they can’t be communicated or shared easily. The problem with making them explicit is that it just isn’t convenient enough yet. Most businesspeople can conceptualize models. Any person with an incentive compensation plan can explain a very complicated model. But most people will not make the effort to learn how to build a model if the technology is not accessible.

There are certain models that almost every business employs in one form or another. Pricing is a good example and, in a much more sophisticated form, yield management, such as the way airlines price seats. Most organizations look at risk and contingencies, a hope-for-the-best-prepare-for-the-worst exercise. Allocation of capital spending or, in general, allocation of any scarce resource is a form of trade-off analysis that has to be modeled. Decisions about partnering and alliances as well as merger or acquisition analysis are also common modeling problems. Models are also characterized by their structure. Simple models are built from data inputs and arithmetic. More complicated models use formulas and even multi-pass calculations, such as allocations. The formulas themselves can be statistical functions and can perform projections or smoothing of data. Beyond this, probabilistic modeling is used to model uncertainty, such as calculating reserves for claims or bad loans.

When logic is introduced to a model, it becomes procedural. Mixing calculations and logic yields a very potent approach. The downside is that procedural models are difficult to develop with most tools today and are even more difficult to maintain and modify because they require the modeler to interact with the system at a coding or scripting level; most business people lack the temperament or training, or both, to do so. It is not reasonable to assume that businesspeople, even the power users, will employ good software engineering technique, nor should they be expected to. Instead, the onus is on the vendors of BI software to provide robust tools that facilitate good design technique through wizards, robots and agents.


Learn From A Dozen Best Practices In Modeling

For any kind of modeling tool to be useful to businesspeople, supportable by the IT organization and durable enough over time to be economically justifiable, it must provide or allow the following capabilities:  

  1. Level of expressiveness. This must be sufficient for the specification, assembly and modification of common and complex business models without code; it should accommodate all but the most esoteric kinds of modeling.
  2. Declarative method. Such a method means that each “statement” is incorporated into the model without regard to its order, sequence or dependencies. The software, freeing modelers to design whatever they can conceive, handles issues of calculation optimization.
  3. Model visibility.  This enables the inspection, operation and communication of models without extra effort or resources. Models are collaborative and unless they can be published and understood, no collaboration is possible.
  4. Abstraction from data sources. This allows models to be made and shared in language and terms unconnected to the physical characteristics of data; it gives managers of the physical data much greater freedom to pursue and implement optimization and to improve performance efforts.
  5. Extensibility.   Extensibility means that the native capabilities of the modeling tool are robust enough to extend to virtually any business vertical, industry or function. Most of the leading BI tools are owned by much larger corporate parents, potentially limiting or directing the development roadmap of the BI offering, in alignment with the wider vision of the parent. Smaller BI and analytics pure-plays tend to have a truer vision about BI (until they are acquired). Because analysts who are thinking out of the box gain valuable insight, the BI tool cannot imposes vendor-specific semantics and functions, or lock analysts in and limit distribution of insights with expensive licenses.
  6. Visualization. Early BI tools relied on scarce computing resources, but with today’s abundance of processing power, an effective BI platform should include visualization. It is a proven fact that single-click visualization of models and results aids in understanding and communicating complicated models.
  7. Closed-loop processing. This is essential because business modeling is not an end-game exercise, or at least it shouldn’t be. It is part of a continuous execute-track-measure-analyze-refine-execute loop. A modeling tool must be able to operate cooperatively in a distributed environment, consuming and providing information and services through a standards-based protocol. The closed-loop aspect may be punctuated by steps managed by people, or it may operate as an unattended agent, or both.
  8. Continuous enhancement. This requirement is borne of two factors. First, with the emerging standards of service-oriented architectures, web services and XML, the often talked-about phenomenon of organizations linked in a continuous value chain with suppliers and customers will become a reality soon and will put great pressure on organizations to be more nimble. Second, it has finally crept into the collective consciousness that development projects involving computers are consistently under-budgeted for maintenance and enhancement. The mindset of “phases” or “releases” is already beginning to fray and forward-looking organizations are beginning to differentiate tool vendors by their ability to enable enhancement with extensive development and test phases.
  9. Zero code. In addition to the fact that most businesspeople are not capable of and/or interested in writing code, there is sufficient computing power at reasonable costs to allow for more and more sophisticated layers of abstraction between modelers and computers. Code implies labor, error and maintenance. Abstraction and declarative modeling implies flexibility and sustainability. Most software “bugs” are iatrogenic; that is, they are introduced by the programming process itself. When code is generated by another program, the range of programmatic errors is limited to the latent errors in the code generator, not the errors introduced by programmers.

10.Core semantic information model (ontology). Abstraction between data and the people or programs that access the data isn’t very useful unless the meaning of the data and its relationships to everything else are available in a repository.

11.Collaboration and workflow.  These capabilities are essential to connecting analytics to every other process within and beyond the enterprise. A complete set of collaboration and workflow capabilities supplied natively within a BI tool is not necessary, though. Instead, the ability to integrate (this does not mean “be integrated,” which implies lots of time and money) with collaboration and workflow services across the network, without latency or conversion problems, is preferable.

12.Policy. This may be the most difficult requirement of them all. Developing software to model business policy is tricky. For example, “Do not allow contractor hours in budget to exceed 10 percent of non-exempt hours.” Simple calculations through statistical and probabilistic functions have been around for over three decades. Logic models that can make decisions and branch are more difficult to develop, but still not beyond the reach of today’s tools. But a software tool that allows business people to develop models in a declarative way to actually implement policies is on a different plane. Today’s rules engines are barely capable enough and they require expert programmers to set them up. Policy in modeling tools is in the future, but it will depend on all of the above requirements.


Today’s BI Will Not Be Tomorrow’s BI

It is an open question whether BI has been, in the long run, successful or not. The take-up of BI in large organizations has stalled at 10 to 20 percent, depending on which survey you believe. I believe that expectations of broad acceptance of BI were overly optimistic and that the degree to which it has been adopted is probably at the right level for the functionality it delivered.

Will BI survive? Yes, but we may not recognize it. The need to analyze and use data that are produced in other systems will never go away, but BI will be wrapped in new technologies that provide a more complete set of tools. Instead of managing from scarcity of computing resources, BI will be part of a “decision management” continuum — the amalgam of predictive modeling, machine learning, natural language processing, business rules, traditional BI and visualization and collaboration capabilities.

Pieces of this “new” BI are already here, but within two to three years, it will be in full deployment.

[i] These numbers are representative estimates only and are based on our discussions with vendors and other published information. It is difficult to be precise with BI because the market leaders offer a wide variety of software and services, some for licensing and some embedded in non-BI products. In addition, BI itself has a number of fuzzy definitions. For the purposes of this discussion, we use a definition of query and reporting tools, including online analytical processing (OLAP), but exclusive of data warehousing, extract/transform/load (ETL), analytic databases and advanced analytics/statistics packages.

 Copyright 2012 Neil Raden and Hired Brains Inc

Posted in Uncategorized | 1 Comment

New World Order: Hadoop and Relational Databases

By Neil Raden Hired Brains Research  nraden@hiredbrains.com

Hadoop “data warehouses” do not resemble the data warehouse/analytics that are common in organizations today. They exist in businesses like Google and Amazon for web log parsing, indexing, and other batch data processing, as well as for storing enormous amounts of unfiltered data. Petabyte-size data warehouses in Hadoop are not data warehouses as we know them; they are a collection of files on a distributed file system designed for parallel processing. To call these file systems “a data warehouse” is misleading because a data warehouse exists to serve a broad swath of uses and people, particularly in business intelligence, which is both interactive and iterative.

MapReduce is a programming paradigm with a single data flow type that takes the form of directed acyclic graph of operators. These platforms lack built-in support for iterative programs, quite different from the operations of a relational database. To put it in layman’s terms, there are things that Hadoop is exceptionally well designed for that relational databases would struggle to do. Conversely, a relational database data warehouse performs a multitude of useful functions that Hadoop does not yet possess. Hadoop is described as a solution to a myriad of applications in web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, for research in natural language processing and machine learning, scientific applications in physics, biology and genomics and all forms of data mining. While it is demonstrable that Hadoop has been applied to all of the domains and more, it is important to distinguish between supporting these applications and actually performing them. Hadoop comes out of the box with no facilities at all to do most of this analysis. Instead, it requires the application of libraries available either through the open source community at forge.com or from the commercial distributions of Hadoop, or by custom development by scarce programmers. In no case can these be considered a seamless bundle of software that is easy to deploy in the enterprise. A more accurate description is that Hadoop facilitates these applications by grinding through data sources that were previously too expensive to mine. In many cases, the end result of a MapReduce job is the creation of a new data set that is either loaded into a data warehouse or used directly by programs such as SAS or Tableau.

The MapReduce architecture provides automatic parallelization and distribution, fault recovery, I/O scheduling, monitoring, and status updates. It is both a programming model and a framework for massively parallel processing of large datasets in batch across many low-end nodes. Its ability to spread very large jobs across a cluster of ordinary servers is perhaps its best feature, certainly its most unique feature. In addition, it has excellent retry/failure semantics. MapReduce at the programming level is simple and easy to use. Programmers code only Map() and Reduce() functions and are not involved with how the job is distributed. There is no data model, and there is no schema. The subject of a MapReduce job can be any irregular data. Because the assumption is that MapReduce clusters are composed of commodity hardware, and there are so many of them, it is normal for faults to occur during a job, and Hadoop handles a few faults automatically, shifting the work to other resources. But there are some drawbacks. Because MapReduce is a single fixed data flow, has a lack of schema, index and high-level language, one could consider it a hammer, not a precision machine tool. It requires data parsing and fullscan in its operation; it sacrifices disk I/O to avoid schemas, indexes, and optimizers; intermediate results are materialized on local disks. Runtime scheduling is based on speculative execution, considerably less sophisticated than today’s relational analytical platforms. Even though Hadoop is evolving, and the community is adding capabilities rapidly, it lacks most of the security, resource management, concurrency, reliability and interactive capabilities of a data warehouse. Hadoop’s most basic components – the Hadoop Distributed File System (HDFS) and MapReduce framework – are purpose built for understanding and processing multi-structured data. The file system is crude in comparison to a mature relational database system which when compared to the universal use of SQL is a limiting factor. However, its capabilities, which have just begun to be appreciated, override these limitations and tremendous energy is apparent in the community that continues to enhance and expand Hadoop.

Hadoop MapReduce with the HDFS is not an integrated data management system. In fact, though it processes data across multiple nodes in parallel, it is not a complete massively parallel processing (MPP) system. It lacks almost every characteristic of an MPP system, with the exception of scalability and reliability. Hadoop stores multiple copies of the data it is processing, and the failure of a node can rollover to another node with the same data, though there is also a single point of failure at the HDFS Name Node, which the Hadoop community is looking to address in the long term (Today, NetApp provides a hardware-centric fail-over solution for the Name Node). It lacks security, load balancing and an optimizer. Data warehouse operators today will find Hadoop to be primitive and brittle to set up and operate, and users will find its performance lacking. In fact, its interactive features are limited to a pseudo-relational database, Hive, whose performance would be unacceptable to those accustomed to today’s data warehouse standards. In fairness, MapReduce was never conceived as an interactive knowledge worker tool, and the Hadoop community is making progress, but HDFS, which is the core data management feature of Hadoop, is simply not architected to provide the services that relational databases do today. And those relational database platforms for analytics are innovating just as rapidly with:

• Hybrid row and columnar orientation.

• Temporal and spatial data types.

• Dynamic workload management.

• Large memory and solid-state drives.

• Hot/warm/cold storage.

• Almost limitless scalability.

The ability to provide almost endless scalabilty and parallelism for batch jobs is a unique distinction for Hadoop. The only platforms that were previously able to provide this sort of massive parallelism were relational databases, and they are not limited to batch operation. So what happens next? My guess is that Hadoop survives and flourishes as the first responder to incoming data, making sense of it and handing it off to other proceses, including data warehouses, in whatever form they take. However, unless petabytes of historical data are needed for interactive analysis, Hadoop will be the favored location for storing history. The Hadoop community, and its imitators and competitors will play an important role in analytics, but not the only role.

Posted in Big Data, Decision Management, Research | Tagged , , , , , , , , , , , | 3 Comments

Decision Management on Steroids: Will Big Data Tools Trump Rules?

Decision Management on Steroids:Will Big Data Tools Trump Rules?

Can the ability to extract meaning and sentiment from previously unconventional data sources reorient the role of business rules?

In a typical customer application, scoring models are created by finding patterns and relationships from attributes using various statistical techniques and the customer records are scored for propensity or eligibility. Rules then apply policy – what to do with the scored records.

But the promise of Big Data is to deliver insight not possible with tools of even five years ago. Does the newer technology that can, to some extent, detect sentiment and propensity, examine relationships of 100’s of millions of ID’s, construct path analysis in real-time, eliminate the need for rules? In other words, does the data speak for itself?

On the other hand, can quantitative methods really implement policy or are we just in the early stages of the hype cycle. Is attended or unattended quantitative analysis of Big Data a sufficient model for implementing policy?

New information usually comes from unexpected places. Big leaps in understanding arise from unanticipated discoveries—but “unanticipated” does not imply a sloppy or accidental process. On the contrary, usable discoveries have to be verifiable, but the desire for knowledge requires a drive for innovation and the exploration of new sources of information that can alter our perceptions and outlooks. Unraveling the content of “big data” that lacks obvious structure begs for some new approaches. Big data is positioned to provide that insight.

But data doesn’t speak for itself. At some point, there will be expert failure: Solutions require data, but may degrade with too much. The largest annoyance is the overblown concept of the data scientist. Data scientists, in the traditional sense, are academic researchers. In the Big Data industry they apply existing algorithms and techniques to data from traditional and new sources. Unfortunately, they usually report to people who have no idea what they are talking about.

In subsequent research I will describe the changes in “predictive” modeling brought about by Big Data and draw some conclusions about how it affects the construction, delivery and uses of decision management.


Posted in Big Data, Decision Management, Research | Tagged , , , , , , , , , , , , , , , , , , , , , , , | Leave a comment

NoSQL: What’s the Buzz About Graph Databases?

I attended the NoSQLNow conference in San Jose and had the opportunity to  speak one-on-one with a number of principals of NoSQL database concerns, including Emil Eifrém of Neo Technology. For those of you who aren’t familiar with the concept, graph databases are based on an arrangement of edges, properties and nodes with relationships between them, not rows and columns with primary and foreign key relationships. In practice this allows them to traverse graphs of information more efficiently than reading pages of data and finding the rows that match the query.

Interestingly, Graph Theory (Euler) predates Set Theory (Cantor/Dedekind) on which the relational model is based by over 150 years. Of historical interest, the development of the relational database at IBM was conceived as a method to get data out of databases, not get data in. This turned out, in the early 70’s to be a problem for IBM so they redirected Ted Codd’s efforts to making relational databases fast transaction processors. Enter the concept of “normal form,” a horribly misleading term that has side-railed a zillion projects by data modelers with a thin understanding of the concept insisting on “normal” purity no matter the cost. The rest is history. The whole DSS/BI/Analytics movement grew out of the fact that the relational databases were poor performers at non-transaction processing.

According to the NoSQL movement, and I’m not entirely convinced of this but I’m listening, the rigidity of a physical schema needed in relational databases is their undoing in an era of agility, speed and volume.  Here is a quote from Wikipedia:

Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are more suitable to manage ad-hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements.

The key characteristic of graph databases is this notion if index-free adjacency, meaning, each node knows the location of its adjacent nodes so an index is unnecessary. Obviously, a semantic interpretation of this is that the graph is a representation of relationship. Paradoxically, there are no relationships in a “relational” database, they are applied at run time from the query.

Emil seems to think that graph databases are superior to RDB in every way and will eventually supplant them. The concept that RDB are based on sound and proven mathematical principles is interesting, but relational theory is only 50 years old. Graph theory goes back to the 17th century!

This all sounds good, but there are only a billion applications out there that rely on things NOT changing and for which the relational model is well-suited. As William McKnight said in his keynote, look to NoSQL as additive not replacement technology.

For those old enough to remember, lots of database systems in the 80’s tried to get on the database bandwagon by getting certified as relational, and we’re already seeing this with graph databases. Just to name a few, the FlockDB for Twitter is only a thin graph database on top of MySQL and therefore lacks index-free adjacency. Microsoft’s Trinity does not store graphs natively.

Most of the NoSQL vendors are pushing the notion that their products are so much less complex than the current RDB’s. This is undoubtedly true, but they probably lack so much functionality that’s been built on the RDB model over the decades. In fact, RDB’s were pretty simple in the beginning too.

To sum it up, most of the NoSQL products I’ve seen are clearly aimed at high-speed, low-complexity transaction or streaming processing, usually with unconventional data. They are not analytical tools. But they could provide a very useful, even indispensible role in analytics: getting meaning into the process.

There has been a schism between semantic technology and graph databases, probably because the former still can’t figure out how to market their technologies while simultaneously trying to prove how smart they are. Their message in muddled and their most visible promoters are not, shall we say, enterprise ready. Oddly, the notion of a triple is fundamental to graphs, but graph database vendors are steering clear of the whole ontology/RDF/OWL thing and finding their customers in other pursuits. Good move.

Posted in Big Data | Tagged , , , , , , , , , , , , , , | Leave a comment

Decision Services: IBM Tackles the Full Spectrum of Decision Management

Originally posted August 31, 2012)

When James Taylor and I wrote the book “Smart (Enough) Systems: How to Gain Competitive Advantage by Automating Hidden Decisions,” we focused mainly on those kinds of decisions that are managed by a “decision service.” A decision service is an embedded applet that fires decisions in stream, typically by employing a rulesset (a set of declarative statements) and a business rules engine. We dealt at length with both predictive modeling and optimization, but the goal was to automate “little decisions that add up,” like credit line increases or call center responses, not, “Should I buy Yammer?”

You develop a decision service like the ones we proposed with predictive modeling, as the models inform the rules engine with what to do, but that wasn’t the focus of the book. At the IBM Analyst conference this week on “Smart Analytics,” IBM launched a full suite of tools for their own brand of decision management that was much more comprehensive. Combining predictive modeling, business rules, entity analytics, optimization and cognitive computing, their goal is to provide decision services addressing both horizontal applications (customer, risk, pricing and performance management, for example) and platforms (big data and decision management analytics). Still under NDA, their advanced visualization approach was very compelling.

The various tools in this initiative derive from both internally developed software (Watson, for example) but also the integrated offerings of companies acquired such as Netezza, Cognos, SPSS, iLog , Applix, Algorithmics and many others. That’s a partial list. Where IBM’s decision management proposition expands on on James’ and my decision services proposition is the ability to provide advice and recommendations to the full spectrum of strategic, operational and tactical decisions, automated or not. As a technology company, IBM presented lots of material for us geeks to enjoy, but there was a clear message that this initiative is about outcomes, not features.

One question I asked a number times was, “How much does this cost and how long does it take?” I even tried to bucket the answer with, “like ERP or a data warehouse or just flip a switch?” No answer was forthcoming. It’s will take some time for me to develop some guidelines on this from the buyer/clients, but that will follow in due course.

This mega-approach may sound daunting, but in the demos, I saw very useful and intuitive amalgams of these components that are clearly ready for the enterprise. I especially liked Brenda Dietrich’s (IBM Fellow, VP & CTO, Business Analytics) presentation on her perspective on the future of analytics including analytics at scale, analytics with uncertain data, visualization, and the future of Watson and decision management. Her messaging was clear and compelling, which will go a long way in convincing their customers implement this.

There was, of course, a lot of talk about Watson. For my money, Watson was a cute name for a system to compete on Jeopardy! but I have some reservation about the name Watson given T.J. Watson’s association with the Third Reich, for which IBM has never apologized or even acknowledged (see “IBM and the Holocaust,” by Edwin Black). Watson, composed 41 interoperating sub-systems including Natural Language Processing, Voice Recognition, Text Analytics, knowledge representation including ontologies and a host of other features, needs a more serious name in y opinion. I don’t know if IBM will stick with the name, but they are offering the capability through the cloud under the title “Cognitive Computing.” The first commercial customer was WellPoint, a decision I’ve been unimpressed with and written about previously, but they have also announced what seems to be an oncology assistant at Memorial Sloan-Kettering. There are others, but IBM does not disclose that yet.

Watson’s ability to gather, represent and use information is pretty stunning. However, Watson seems to learn partially by mistakes (don’t we all), but some mistakes, like oncology, can have disastrous results that don’t appear for a long time, during which presumably the same mistakes are made repeatedly. IBM counters that Watson is an advisor, not a doctor, but I wonder what he’s doing at WellPoint, which is the 2nd largest for-profit health insurer in the US. A report in Reuters alleged that the Anthem Blue Cross subsidiary improperly singled out women with breast cancer for cancellation of their policies shortly after they were diagnosed with breast cancer. Lets hope Watson didn’t “learn” to do that.

My guess is that Watson is offered as a “service” to providers (at a fee? I don’t know) to advise on the best treatment approach for patients. Actually, providers are irritated enough about how their practices are constrained by insurance companies, Medicare/Medicaid and their malpractice insurers. Watson may make things worse (for them). Personally, if Watson’s efforts can make a dent in the poor quality of healthcare in this country (WHO ranks the US 1st in spending per capita and 26th in quality, just above Bosnia) I’ll be delighted.

Also, I found a glaring contradiction in some presentations. While encouraging companies to use social media data to gain insight, Stephen Gold (Director Worldwide Marketing, Watson Solutions) was openly disparaging about sites like WebMB or PatientsLikeMe as a source of information for Watson’s forays into medicine. If serious sites like these are not acceptable, how can frivolous sites like Twiiter or Facebook be useful? Afterwards I spoke with Gold, a really smart and engaging guy, and he has a pretty open mind after all. I enjoyed that conversation. That discussion will continue.

One thing we learned long ago in data warehousing is that you can increase the value of your existing data by integrating other sources, even small ones that add to your ability to gain insight. How in the world we’re going to curate that with the mass of data now accessible is a question that will keep us busy for a while.

I hope to dig in more to the various tools that were alluded to from SPSS, especially things like optimization, but in the latter case, I got the impression they meant predictive models that deliver “next best offer” sort of advice as opposed scheduling a fleet of trucks or airplanes. We’ll see.

Anyway, it was a good day and a half, though it could have been done in less time. The breakouts sort of rehashed the general sessions and there was never enough time for questions. When you have a roomful of analysts, you should make better provision for that.

A lingering feeling I have, though, is that as these systems are given more and more leeway to make and direct decisions, there has to be an agreement that the models underlying those decisions are adequate. James and I were pretty clear that the decisions we wrote about were negligible singularly, but very important in the aggregate. Much could be gained by being consistent and timely in these decisions, even if a few stakeholders got mistreated. But when the decision services are pointed towards more important decisions, like your health or your employment or your kids placement in programs, you really have to wonder how you can ever determine whether the modelers really considered everything, or at least the important things. As humans, we tend to excel at simplified models, not comprehensive one. It’s a little worrisome. George Box said it best: “All models are wrong. Some are useful.” Too bad Dr. Box didn’t give us guidance figure out which ones.

Posted in Big Data, Business Intelligence, Decision Management, Medicine, Research | Tagged , , , , , , , , , , , , , , , , , | Leave a comment

The Fallacy of the Data Scientist Shortage

There is no question that the USA (in fact, most of the world) would be well-served with more quantitatively capable people to work in business and government. However, the current hysteria over the shortage of data scientists is overblown. To illustrate why, I am going to use an example from air travel.

On a recent trip from Santa Fe, NM to Phoenix, AZ, I tracked the various times:

  Duration (min) Cumulative (min)
Drive from Santa Fe to ABQ Airport 65 65
Park 15 80
Security 25 105
Wait to board 20 125
Boarding process 30 155
Taxiing 15 170
In flight 60 230
Taxiing 12 242
Deplane 9 251
Wait for valet bag 7 258
Travel to rental car 21 279
Arrive at destination in Tempe 32 311

As you can see, the actual flying time of 60 minutes represents only 19% of the travel time.  Because everything but the actual flight time is more or less constant for any domestic trip (disregarding common delays, connections and cancellations which would skew this analysis even farther), this low percentage of time in the air is a reality. For example, if the flight took 2 hours and fifteen minutes, it would still work out to 135/386 = 35%. The most recent data I have, from 2005, shows the average non stop distance flown per departure was 607 miles, so we can add about 25 minutes to the first calculation and arrive at 85/336 =  25%.

Keep in mind, again, these calculations do not account for late departures/arrivals, cancelled and re-booked flights, connections, flight attendants and pilots having nervous breakdowns, etc. It’s safe to say that at most 25% of your travel time is spent in the air. Just for fun, let’s see how this would work out if we could take the (unfortunately retired) Concorde.  We would reduce our travel time by flying at Mach 2.5 by 40 minutes, trimming out journey from five hours and eleven minutes to four hours and 31 minutes, about a 13% improvement.

What’s the point of all of this and what does it have to do with the so-called data scientist shortage?

Based on our research at Hired Brains, we find that analysts that work with Hadoop or other big data technologies spend a significant amount of time NOT requiring any knowledge of advanced quantitative methods – configuring and maintaining clusters, writing programs to gather, move, cleanse and otherwise organize data for analysis and many other common tasks in data analysis. In fact, even those who employ advanced quantitative techniques spend from 50-80% of their time gathering, cleansing and preparing data. This percentage has not budged in decades. Keep in mind that advanced analytics is not a new phenomenon; what is new is the volume (to some extent) and variety of the source data with new techniques to deal with it, especially, but not limited to, Hadoop.

The interest in analytics has risen dramatically in the past two or three years,  that is not in dispute. But the adoption of enterprise-scale analytics with big data is not guaranteed in most organizations beyond some isolated areas of expertise. Most of the activity is in predictable (commercial) industries – net-based businesses, financial services, and telecommunications, for example, but these businesses have employed very large-scale analytics, at the bleeding edge of technology for decades.  For most organizations, analytics will be provided by embedded algorithms in applications not developed in-house and third-party vendors of tools and services and consultants.

The good news is that 80% of the expertise you need for big data is readily available. The balance can be sourced and developed.  “The crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government.

There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. Here is how I characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.


Descriptive Title

Quantitative Sophistication/Numeracy

Sample Roles

Type I

Quantitative Research (True Data Scientist)

PhD or equivalent

Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles

Type II

(Current definition of) Data Scientist or Quantitative Analyst

Advanced Math/Stat, not necessarily PhD

Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge

Type III

Operational Analytics

Good business domain, background in statistics optional

Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation

Type IV

Business Intelligence/ Discovery

Data and numbers oriented, but no special advanced statistical skills

Reporting, dashboard, OLAP and visualization use, possibly design, Performing posterior analysis of results driven by quantitative methods

“Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from those businesses like Google or Facebook where the data is the business; so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business as, say, actuaries in the insurance business, whose extensive training should be a model for the newly designated data scientists (see my other posts here).

Posted in Big Data | Tagged , , , , , , , , , , , , , , , , , , , , , | 1 Comment

YABADaS: Yet Another Blog About Data Science

In a comment to Dan Woods’ article “Amazon’s John Rauser on What Is a Data Scientist?,” (http://www.forbes.com/sites/danwoods/2011/10/07/amazons-john-rauser-on-w… ) “nyhacker” commented:

“So far there is a standard obstacle to such work: Business is still organized as it was in Henry Ford’s day where the supervisor knows more and the subordinate is to apply labor to the ideas of the supervisor. So, unless the CEO has such a background, any such expert will have to report to someone who doesn’t understand the work, and that mostly can’t work. Net, for a person to get a career from such material, they have to start their own company to do such work and sell the results. For me, I need to get back to it!”

This commenter hit the nail on the head. I’d say the problem is less severe in net-savvy companies, but even in those companies where “data science” has been a core competency all along, (financial services, telco, oil and gas E&P, for example) this is still an issue. Consider Wall Street, for example. Here the crème de la crème are recruited from the best schools, with exquisite skills and credentials to become “quants” or, more accurately, “financial engineers,” inventing new types of securities out of thin air like credit default swaps that nearly destroyed the world financial system. Was this a failing of data science? I think not. Instead, these models were passed to portfolio managers, in my analytics types taxonomy, (http://www.constellationrg.com/research/2012/02/trends-analytic-types-ro… Type II-A’s to Type II-B’s) who were free to run the models with their own assumptions, such as, “home values will continue to rise forever.” As “nyhacker” said, the recipient of the data scientist’s work is often passed on to people who don’t really understand it. When that happens, one of two things occurs. Either the receiving party doesn’t act, which was pretty typical in the days when we called this data mining, or they act with either fantastic or disasterous consequences.

Tom Davenport advises that adopting an analytic culture requires a goodly number of these “data scientists” gathered centrally reporting to a C-person, preferably the CEO, who has an analytical background and does understand this stuff, but unless you’re on Wall Street or Silicon Valley, you can swing a dead cat around by the tail and not hit one of these characters.

So what’s the answer? I like to say that data doesn’t speak for itself. Don’t put too much stock in the ability of data scientists to change the world. It takes the efforts of lots of people working together to make really good (or really evil) things happen. They play a role, but they aren’t prime movers.

I believe the danger here is that the current fascination with the terms “big data” and “data scientist” is counterproductive as we waste so much time trying to define them. Even worse, the data scientist thing has started a rush of programs and a sense of urgency to find, recruit, train, hire, seduce and cherish these guys. In point of fact, they are right in front of you, and have been all along.

Here is an example. One of the three largest credit card companies in 1999 has over one thousand analysts developing SQL queries against one of the largest (at the time) data warehouses in world, over 50 terabytes. The data was extracted and separated into test datasets and production runs to develop and continuously refine quantitative models for fraud detection, churn action and other programs. The terms “big data” and “data scientist” didn’t exist yet, but that’s exactly what this was. And remember the number, one thousand. They didn’t grow on trees, they were recruited and learned on the job over time.

The good news, for an old quant like me, is that quantitative methods are suddenly fashionable and I get to have a little fun before it gets benched in favor of the handsome newcomer, whatever that will be.

Posted in Uncategorized | Tagged , , , , , , | 2 Comments