Digital Transformation in the Insurance Industry

Neil Raden
Hired Brains Research LLC

November 2018

<Premise> Few businesses stand to gain as much from digital transformation as insurance companies, but few businesses encounter as many challenges when they try to implement it.

<Background> The US insurance industry recorded $1.1 trillion of “written premium” in 2016, roughly equivalent to sales in other industries. The premiums are split 53% / 47% between Life and Annuity companies and Property and Casualty companies. Health insurers are not included in these totals; their premiums were about $700B in the US, but actual numbers are difficult to pinpoint and much of that total is not exactly insurance (PBMs, for example). At $1.1T, insurance contributes about $500 billion to GDP (about 2.7%). To put this in perspective, this places insurance just below the federal government's contribution to GDP and just above the information technology sector. The industry as a whole has cash and invested assets over $5.5 trillion and employs 2.5 million people.

Worldwide, US insurance business represents about 1/3 of the total, split more or less evenly with Europe and Asia.

Insurance is a complex industry, with many layers, disciplines, markets and regulators. It is also, with a few exceptions, more reliant on data and analytics than most industries, yet it has struggled to keep up. Of all the industries highlighted in case studies and sales pitches for big data analytics, insurance is often conspicuous by its absence, but it is not only desperately in need of digital transformation, it can also benefit from a wide assortment of technologies and applications.

<Issues> Insurance companies need to implement digital transformation, including big data analytics, IoT, machine learning and AI. Some of the technologies that are relevant are:

  • Back Office Automation
  • Personalized User Experience
  • Voice Biometrics
  • Actuarial
  • IoT
  • Telematics
  • Blockchain

Back Office Automation – Insurance companies seem to employ more office workers than other industries to keep outdated processes flowing. As in any other company, applying automated workflow and decision-making can reduce costs, improve performance and lay the foundation for more sophisticated innovations, especially hastening the introduction of new products, which is essential in a very competitive market.

Personalized User Experience – With the exception of commercial lines, most insurance products are sold to individuals. The process of researching and securing insurance coverage online is still quite tedious with many steps, plus terms and conditions that are not often understood. In the past decade, aggregator websites provided a search and compare service, but typically, the insured chooses an option and is funneled back to the existing underwriting and policy issuance process.

The process of reporting a claim is a complex multi-step ritual that could benefit by new technology as well. Some auto insurers are experimenting with “touchless” claims where the driver takes a picture of the damage and sends it to the company with nearly instant authorization for the damage to be repaired. In some cases, the whole process is conducted without human intervention, but the technology is in the early stages.

Voice Biometrics – Voice biometrics can shorten the time-consuming ritual of repeating policy information on a customer call by identifying the client within the first seconds, and can also sense emotion, directing the caller to the proper responses.

Actuarial – Are actuaries trained to handle 21st century data and tools? There are at least 25,000 actuaries employed in the US, each with academic, professional and work experience that is mathematical, quantitative and ready to leverage the benefits of big data analytics including machine learning and AI. They will need training and to acquire new skills, but they represent a pool of professionals with the temperament and background to add data science to their work. Actuarial departments are, however, still reliant on manual data preparation and coding in low-scale languages and tools, especially desktop tools.

IoT – Some providers of home-owner’s insurance have already participated with third-party providers of “smart homes” using the sensor streams to spot events, and improve security. Insurers of other physical assets such as structures, vehicles and even ships in port and at sea are using drones and satellites to absorb telemetry for risk analysis and aversion.  Some personal lines insurers are using drones to inspect properties during the underwriting process.

Telematics – Telematics is one area where big data analytics are showcased by insurers, particularly in personal auto insurance and fleet management. The insurer (or a third-party data aggregator) supplies the driver with a device that plugs directly into the vehicle's OBD-II port (all cars and light trucks built and sold in the United States after January 1, 1996 were required to be OBD-II equipped). In principle, the device records the driving habits of the vehicle, rewarding good drivers, though it is unlikely that is all it does. Surely, penalizing drivers for poor habits is also part of the program, as is aggregating the data and selling it to third parties. Some State Insurance Commissioners (in the US, insurance is regulated by the states) are evaluating whether these practices are acceptable. Holding and using big data about people and their lives is a privilege. Access to this data for legitimate underwriting and claims adjudication purposes is reasonable, but abusing the privilege needs to be prevented with careful governance.
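The scoring logic inside these devices and programs is proprietary, but a toy sketch makes the idea concrete. The trip fields, weights and thresholds below are all invented for illustration; a real telematics model would use far richer signals (speeding, cornering, phone use, location) and actuarial calibration.

```python
# Hypothetical driving-score sketch over made-up OBD-II trip summaries.
def driving_score(trips):
    """trips: list of dicts with 'miles', 'hard_brakes', 'night_miles' (invented fields)."""
    miles = sum(t["miles"] for t in trips)
    if miles == 0:
        return 100.0
    hard_brakes_per_100mi = 100.0 * sum(t["hard_brakes"] for t in trips) / miles
    night_share = sum(t["night_miles"] for t in trips) / miles
    # Invented weights: penalize hard braking and night driving.
    score = 100.0 - 5.0 * hard_brakes_per_100mi - 20.0 * night_share
    return max(0.0, min(100.0, score))

print(driving_score([
    {"miles": 30, "hard_brakes": 1, "night_miles": 5},
    {"miles": 12, "hard_brakes": 0, "night_miles": 0},
]))  # ~85.7 on this made-up scale
```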

Blockchain – Blockchain can support innovative business processes and be the foundation for new products. One example is peer-to-peer insurance, with the blockchain hosting quoting, claims and other tasks. Blockchain also provides transparency, accuracy and currency of contracts to all parties in a contract. Faster, more secure payment models and enhanced security reduce fraud and the risk of duplication.
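To make the tamper-evidence claim concrete, here is a minimal hash-chain sketch in Python; it illustrates the principle only, not how any production insurance blockchain is built, and the records are invented.

```python
import hashlib
import json

def block_hash(block):
    # Deterministic hash of a block's contents.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = []
prev = "0" * 64
for record in [{"policy": "P-1", "claim": 2500}, {"policy": "P-2", "claim": 900}]:
    block = {"prev_hash": prev, "record": record}
    prev = block_hash(block)
    chain.append(block)

# Tampering with an earlier record breaks the link to every later block.
chain[0]["record"]["claim"] = 99999
print(block_hash(chain[0]) == chain[1]["prev_hash"])  # False
```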

Figure 1: Courtesy of Data Centric LLC

<Conclusion> Our estimate is that most current back-office work processes will disappear in the next 5-7 years, of necessity, as the burden of manual processes on costs (and product pricing) and product innovation latency will be too much to bear. On the other hand, staffing of actuaries and actuaries performing as data scientists will likely increase 25% in the same period.

With the exception of some large insurers, the industry as a whole has lagged behind others in exploiting big data analytics, machine learning and AI. There are a multitude of benefits for insurers to embrace digital transformation in every aspect of their business, and there is great risk if they do not because as a business, insurance is almost completely driven by the processing and analysis of data.

<Action Item> One can only generalize about insurance companies, because there are so many types and so many models. However, most need to revise their processes to reduce cost and add agility in product development and cadence. Insurance companies often get poor grades for customer engagement, an area that is ready for great leaps in improvement with the application of big data analytics, machine learning and AI.


© Hired Brains Research LLC and Neil Raden, 2018


Key Issues for Generating Value from AI

Key Issues for Generating Business Value from AI

By Neil Raden, Hired Brains Research, November 2018

 

Premise: AI technology is far beyond the hype phase, but adoption of AI in organizations in the short term primarily will be through third-party software. There are, however, excellent opportunities for organizations to retrofit AI functions into their own applications to boost speed, accuracy and productivity. Caution: AI cuts to the core of human contribution and will need constant leadership to prevent disorganization, distortion and dysfunction. It is just as likely that human experts in select fields will be co-opted by AI as those performing manual processes.

 

Artificial intelligence (AI) is an old technology, with new implementations. However, the advent of increasingly parallel programming models and unprecedentedly scalable hardware, coupled with the opportunity to pursue significant new business value by using these technologies to mine huge volumes of customer and other data, has brought AI into vogue. As executives consider adding AI to their business system portfolio over the next 24 months, they must understand the following:

 

  • Not everything called AI is real. Psychologists and neuroscientists are still trying to understand what human intelligence is, so “intelligence” in the context of “artificial” and “human” is the same word to describe two different things, like Paris, France and Paris, Texas. Distinguishing between core AI disciplines and technologies and AI applications that are built from those technologies is important to keep track of AI investments and expected business outcomes (see Figure 1).
  • In 2017, AI can stand for “additive intelligence.” Organizations will find that their existing applications can be enhanced with the application of AI “wrappers,” particularly replacing manual data ingestion, human expert forecasting and data discovery. It is becoming easier for in-house developers to use AI technology, especially since Amazon AWS, IBM Watson and Microsoft Azure, among others, provide useful APIs to AI algorithms. However, enterprise software providers have far more resources to implement AI capabilities, and most AI will be added to business systems through software packages.
  • AI can lead to organizational distortion and dysfunction. AI implementation has a direct effect on the nature of work in organizations. Adjusting to this is never simple. Employees see AI coming and they will push back, either purposely or not.


 

Figure 1. AI Disciplines and AI Applications Are Not the Same

 

Key AI Adoption Issues During 2018-2019

 

Key Issue 1: Building or Buying Greenfield AI Apps

No one will “implement” AGI, but the organizations that successfully deploy AI will treat it as the software/data combination that it is, which will make it less mysterious – and threatening. Software tools like Theano and TensorFlow, cloud data centers for model training, inexpensive GPUs for deployment, and API-based AI services in the cloud, are allowing small teams of engineers to build state-of-the-art AI systems.
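As a deliberately trivial illustration of how accessible these tools have become, here is a minimal TensorFlow/Keras sketch that trains a small classifier on synthetic data; everything in it is invented for the example and none of it comes from the post.

```python
import numpy as np
import tensorflow as tf

# Synthetic data standing in for whatever the business problem is.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")  # toy target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy] on the training data
```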

In the short term, in-house AI development will focus on enacting simpler customer engagement activities in services, marketing, and sales. By 2019, most enterprise software vendors will have added AI capabilities to their flagship products, including chatbots to CRM, IoT Edge Analytics to ERP, and other more complex domains.

Over the next few years, AI will show up in a greater range of consumer applications, which will significantly increase expectations on corporate investments in AI. Why? Just as the ease-of-use of consumer web applications drove change in internally and externally facing enterprise applications, so will autonomous vehicles, robot appliances, and other AI-infused devices drive expectations in the enterprise.

Key Issue 2: Retrofitting AI into Existing Apps

Information technology can be used to both cut costs and increase revenue. Because inefficiencies always emerge as missions, environments, and people evolve, cost-cutting – such as optimizing workforce needs – will always be a useful effort in organizations. However, the real AI battlefield is differentiation: distancing your enterprise from your competitors. The key to leaping ahead isn’t a wholesale replacement, but applying new technology to existing systems whenever possible. Speed is essential. Organizations need to be constantly on the lookout for new ways to rapidly retrofit existing systems to streamline and enhance revenue opportunities.

To illustrate the idea of retrofitting, consider a workforce planning application to forecast and track labor cost for a utility company with outdoor transmission facilities. Utilities must balance the need for maintenance and repair with labor, especially overtime and contractor headcount. To begin, experts forecast the incidents based on the loads (miles of wire, for example), the types of incidents and the time to clear the problem. This results in a forecast of hours. Hours are then broken down into required workgroups and each workgroup forecast is transposed into a series of models that factor rates, overtime, headcount, union contracts, and more.
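To make the chain from incident forecast to labor cost concrete, here is a minimal sketch with invented numbers; a real model would also factor union contracts, headcount limits, seasonality and more, as described above.

```python
# Hypothetical numbers throughout: expert incident forecasts become hours,
# then a labor cost by workgroup.
incidents = {"line_down": 120, "transformer": 40}            # forecast incident counts
hours_to_clear = {"line_down": 6.0, "transformer": 10.0}     # hours per incident

forecast_hours = sum(incidents[k] * hours_to_clear[k] for k in incidents)  # 1,120 hrs

workgroup_share = {"line_crew": 0.7, "substation_crew": 0.3}  # split of the hours
rates = {"line_crew": 55.0, "substation_crew": 65.0}          # $/hr straight time
overtime_share, ot_multiplier = 0.15, 1.5

cost = 0.0
for wg, share in workgroup_share.items():
    hrs = forecast_hours * share
    cost += hrs * rates[wg] * ((1 - overtime_share) + overtime_share * ot_multiplier)

print(forecast_hours, round(cost, 2))
```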

 

This planning cycle happens annually, and actual results are fed back monthly (two weeks late), a ritual annual rain dance that is out of date even before it is finalized.

 

Streaming sensors with machine learning algorithms could easily be attached to the top of the application to replace human “forecasts” and guesswork by analyzing actual incidents and productivities, and even possibly accessing weather information, which plays a large role in incidents and response times.
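A minimal sketch of that kind of retrofit, assuming a synthetic feed of exposure, weather and actual incident counts in place of real sensor streams; a real implementation would train on the utility's own outage and weather history and feed its predictions into the existing planning module.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
wind = rng.uniform(0, 60, n)             # mph, stand-in for a weather feed
line_miles = rng.uniform(500, 900, n)    # exposure ("load")
# Synthetic "actual incidents" with a weather effect baked in.
incidents = 0.02 * line_miles + 0.8 * np.maximum(wind - 30, 0) + rng.normal(0, 5, n)

X = np.column_stack([wind, line_miles])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:400], incidents[:400])
print(round(model.score(X[400:], incidents[400:]), 3))  # holdout R^2 on the synthetic data
```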

 

Since the workforce planning module is essentially an optimization problem worked out by analysts through many iterations, AI can dramatically accelerate the data flow process, improve accuracy, and operate in real-time.


Figure 2. Schematic of a Workload Planning Application

 

Key Issue 3: Implementation of AI Will Cause Distortion and Dysfunction

People in organizations are not singular actors. They have relationships, are part of groups and have mentors and followers. When someone is disrupted by a new technology, they aren't simply removed like a knight taking a bishop; all of those relationships are affected. This isn't a new phenomenon. All previous major shifts in technology that affect the nature of work have been impeded by resistance and dysfunction, and AI is likely to be the most stunning shift of all.

When information from analytic work is communicated to others, the results are often difficult to explain. Context, explanation, or explicit models must be included to describe the rationale behind the results. This includes detail about methodology, narratives about the steps involved, alternatives that were considered and rejected (or perhaps just not recommended) and a host of other background material. However, communicating meta-information usually requires its own complex process. Typically, meta-information is conveyed in a sequence of time-consuming, serial meetings which must be scheduled days or weeks in advance. The combination of information communication activities turns cycle time into cycle epochs. The knowledge mismatch between the various actors is the reason this after-the-fact explanation is needed. Well-researched and reasonable conclusions do not translate into action because management is unwilling to buy in, owing to its lack of insight into the process by which the conclusions were reached.

 

AI-driven decisions cannot come from a black box. In previous generations of AI, this was a factor in its decline. Unless the AI tools themselves contain the mechanisms to explain how conclusions are drawn, progress will be slow. The solution to this problem is an environment where complex decisions that must be made with confidence and consensus can gather recommendations to be presented unambiguously and compellingly across multiple actors in the decision-making process, with the assistance of the AI models.
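One inexpensive way to open the box a little is to report which inputs actually drive a model's predictions. The sketch below uses scikit-learn's permutation importance on synthetic data; it illustrates the idea and is not a complete explainability solution.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 5))
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)  # only features 0 and 2 matter

clf = GradientBoostingClassifier().fit(X[:600], y[:600])
result = permutation_importance(clf, X[600:], y[600:], n_repeats=20, random_state=0)

# Report how much held-out accuracy drops when each feature is shuffled.
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```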

 

Action Item

 

AI cannot be inserted into an organization in a vacuum. Before starting, you will need to assess your existing capabilities. The late architect Eero Saarinen advised that when designing something, always consider it in its next larger context: a chair in a room, a room in a building. AI algorithms need lots of data to produce valid and useful conclusions. AI also implies real-time. Make sure you have the infrastructure to provide good data in sufficient volume and speed for AI to operate. In the search for opportunities for AI, consider augmenting the capabilities of people, allowing them to be more productive. This is especially true in wrangling data for analytics, a subject ripe for AI because business analysts and data scientists alike spend too much time finding and prepping data, limiting the time they have for creative analysis.

 

About the Author

Neil's career spans roles as a P&C actuary with AIG, founder of a consulting/systems integration firm, software developer, healthcare CTO and industry analyst, always focused on data, analytics, quantitative methods and decision processes with a realistic approach to technology infusion in organizations. He is a prolific writer, having published over 70 white papers, hundreds of articles, books, blogs, keynote addresses and research reports. He lives in Santa Fe, NM.

 

 

 

 

 


Integrated, conformed clinical data is a crucial factor in better patient outcomes, but genomics still belongs with researchers, not clinicians

Clearly, integrating and harmonizing data in a hospital setting is crucial to providing better care for patients. But there is a limit to the scope of this. Keeping track of clinicians' actions and their timing, medications administered, vital signs and a wealth of other measures that can be brought together to give an accurate and real-time picture of the patient is a giant leap forward, and eminently doable with today's technology.

But there are limits. Digital Twin technology for modeling individual people, “personalized medicine,” is a concept for simulating the whole human through genomics, physiology and environments/lifestyle over time. However, it is infinitely more complicated to develop a working Digital Twin of a human being than that of an engineered object, such as a jet engine or an oil well. Also, sensors can provide streaming information to a Digital Twin of an engineered object, but to provide the same for an individual, such as blood tests and scans, is prohibitively expensive, intrusive and time-consuming. In that way the analogy of an engineering Digital Twin breaks down.

While Digital Twins paint a scenario of data-driven healthcare, they can be a double-edged sword. For example, there is the potential of providing a social equalizer through medical enhancements. On the other hand, they can just as well drive inequality, leading to discrimination by identifying traits that are not “normal” based on patterns found in a collection of digital twins. This calls for measures that ensure transparency of data usage and derived benefits, and data privacy. Privacy is already identified as a concern with the rapid evolution of inexpensive genomics, but patient digital twins will add in more aspects of biological and behavioral data that will be much more identifiable as an individual than a mapped genome alone.

Most importantly of all, it is my opinion that our understanding of genomics, even at the most rudimentary level, is not only insufficient as a diagnostic or treatment mechanism, it is dangerous. Consider how little we knew about the workings of genomics while still leaping ahead with solutions:

1953: Watson and Crick (and Rosalind Franklin) discover the structure of DNA

1980s: Several studies demonstrate that DNA methylation is involved in gene regulation

2003: The human genome is mapped

2008: Epigenetic effects are documented

Recent years: CRISPR nightmares

And just this month, scientists have confirmed new DNA structures inside human cells that are not the familiar double helix: i-motif, A-DNA, Z-DNA, triplex DNA, and cruciform DNA. We don't yet know what exactly they do.

Our recommendation is to allow geneticists to learn more before using, for example, the sequenced genome of a patient as a diagnostic or treatment modality. Patient outcomes can be improved considerably with the careful capture and recording of the ordinary workings of the hospital.


The Informed Data Lake: Beyond Metadata

The Informed Data Lake Strategy
Executive Summary

Historically, the volume and extent of data that an enterprise could store, assemble, analyze and act upon exceeded the capacity of its computing resources and was too expensive to manage. The solution was to model some extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.

But, the economics of data management today allow for the gathering of practically any data, in any form, skipping the engineered schema with the presumption that understanding the data can happen on an as-needed basis.

This newer approach of putting data in one place for later use is now described as a Data Lake. But sooner or later, one has to pay the piper, as this Data Lake approach involves manual, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientist’s time and provides no reference to the content or meaning of the data.

The alternative is the Informed Data Lake. The difference between an Informed Data Lake and the static “dumb” data neatly arranged in a conventional Data Lake is a comprehensive set of capabilities providing a graph-based, linked and contextualized information fabric (semantic metadata and linked datasets), into which NLP (Natural Language Processing), sentiment analysis, rules engines, connectors, canonical models for common domains and cognitive tools can be plugged to turn “dumb” data into information assets with speed, agility, reuse and value.

Today, companies in Pharmaceutical, Life Sciences, Financial Services, Retail, Government Agencies, and many other industries are seeking ways to make the full extent of their data more insightful, valuable and actionable. Informed Data Lakes are leading the way through graph-based data discovery and investigative analytics tools and techniques that uncover hidden relationships in your data, while enabling iterative questioning and answering of that linked data.

The Problem with a Traditional Data Lake Approach

Anyone tasked with analyzing data to understand past events and/or predict the future knows that the data assembled is always “used.” It's secondhand. Data is captured digitally almost exclusively for purposes other than analysis. Operational systems automate processes and capture data to record the transactions. Document data is stored in formats for presentation and written in flowing prose without obvious structure; it's written to be read (or simply to record), not to be mined for later analysis. Clickstream data in web applications is captured and stored in a verbose stream but has to be reassembled for sense making.

Organizations implementing or just contemplating a data lake often operate under the misconception that having the data in one place is an enabler of broader and more useful analytics leading to better decision-making and better outcomes. There is a large hurdle facing these kinds of approaches: while the data may be in one place physically (typically a Hadoop cluster), in essence all that is created is a collection of data silos, unlinked and not useful in a broader context, reducing the data lake to nothing more than a collection of disparate data sources. Here are the issues:

·      Data quality issues: the data is not curated, so users have to deal with duplicate data, stale data and contextually wrong data

·      Data can be in different formats or different languages

·      Governance issues

·      Security issues

·      Companies are often misinformed as to who can process and use the data

·      The ever-changing nature of diverse data requires that the processing and analysis of the data be dynamic and evolve as the data changes or as needs change

The effort of adding meaning and context to the data falls on the shoulders of the analysts, a true time-sink. Data Scientists and Business Analysts are spending 50-80% of their time preparing and organizing their data and only 20% of their time analyzing it.

·      But what if data could describe itself?

·      What if analysts could link and contextualize data from different domains without having to go through the effort of curating it for themselves?

·      What if you had an Informed Data Lake to address all those issues and more?

The Argument for an Informed Data Lake

Existing approaches to curating, managing and analyzing data (and metadata) are mostly based on relational technology, which, for performance reasons, usually simplifies data and strips it of meaningful relationships while locking it into a rigid schema. The traditional approach is to predefine what you want to do with the data, define the model, and then build successive versions of physical optimizations and subsets for special uses. Context is like poured concrete: fluid at first, but requiring a jackhammer to change once it sets. You use the data as you originally designed it. If changes are needed, going back to redesign and modify is complicated.

The rapid rise of interest in “big data” has spawned a variety of technology approaches to solve, or at least ease, this problem, such as text analytics and bespoke applications of AI algorithms. They work. They perform functions that are too time-consuming to do manually, but they are incomplete because each one is too narrow, aimed at only a single domain or document type, or too specific in its operation. They mostly defy the practice of agile reuse because each new source, or even each new special extraction for a new purpose, has to start from scratch.

Given these limitations, where does one turn for help?  An Informed Data Lake is the answer.

At the heart of the Informed Data Lake approach is the linking and contextualizing of all forms of data using semantic-based technology. Though descriptions of semantic technology are oftentimes complicated, the concept itself is actually very simple:

–       It supplies meaning to data that travels with the data

–       The model of the data is updated on the fly as new data enters

–       The model also captures and understands the relationship between things from which it can actually do a certain level of reasoning without programming

–       Information from many sources can be linked, not through views or indexes, but through explicit and implicit relationships that are native to the model

Conceptually, the Informed Data Lake is a departure from the earliest principle of IT: Parsimonious Development derived from a mindset of managing from scarcity[1] and deploying only simplified models. Instead of limiting the amount of data available, or even the access to it, Informed Data Lakes are driven by the abundance of resources and data. Semantic metadata provides the ability to find, link and contextualize information in a vast pool of data.

The Informed Data Lake works because it is based on a dynamic semantic model approach built on graph-driven ontologies. In technical terms, an ontology represents the meaning and relationships of data in a graph, an extremely compact and efficient way to define and use disparate data sources via semantic definitions based on business usage, including terminology and rules that can be managed by business users:

•       Source data, application interfaces, operational data, and model metadata are all described in a consistent ontology framework supporting detailed semantics of the underlying objects. This means constraints on types, relations, and description logic, for example, are handled uniformly for all underlying elements.

•       The ontology represents both schema and data in the same way. This means that the description of metadata about the sources also represents a machine-readable way of representing the data itself for translation, transmission, query, and storage.

•       Ontology can richly describe behavior of services and composite applications in a way that a relational model can only do by being tightly bound to the applications logic.

•       The ontology is a run-time model, not just a design-time model. The ontology is used to generate rules, mappings, transforms, queries, and UI because all of the elements are combined under a single structure.

•       There is no reliance on indexes, keys, or positional notation to describe the elements of the ontology. Implementations do not break when local changes are made.

•       An ontological representation encourages both top-down, conceptual description and bottom-up, source- or silo-based representation of existing data. In fact, these can be in separate ontologies and easily brought together.

•       The ontology is designed to scale across users, applications, and organizations. Ontologies can easily share elements in an open and standard way, and ontology tools (for design, query, data exchange, etc.) don’t have to change in any way to reference information across ontologies.

Assuming a data lake is built for a broad audience, it is likely that no one party will have the complete set of data they think is of interest. Instead, it will be a union of all of those ideas, plus many more that arise as things are discovered, situations evolve and new sources of data become available. Thinking in the existing mode of database schema design, inadequate metadata features of Hadoop and just managing from scarcity in general, will fail under the magnitude of this effort. What the Informed Data Lake does is take the guesswork out of what the data means, what it’s related to and how it can be dynamically linked together without endless data modeling and remodeling.
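As a rough, hypothetical illustration of data that describes itself and links without a predefined schema, here is a minimal sketch using the rdflib library (my choice for the example; this post does not name a specific toolkit). The entities, predicates and the small SPARQL query are all invented.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/trials#")
g = Graph()

# Facts carry their meaning with them as subject-predicate-object triples.
g.add((EX.patient42, RDF.type, EX.Patient))
g.add((EX.patient42, EX.enrolledIn, EX.trial7))
g.add((EX.event9, EX.reportedBy, EX.patient42))
g.add((EX.event9, EX.severity, Literal("grade 3")))

# Link and query across "sources" without designing a schema first.
results = g.query(
    """SELECT ?patient ?sev WHERE {
           ?patient ex:enrolledIn ex:trial7 .
           ?event ex:reportedBy ?patient .
           ?event ex:severity ?sev .
       }""",
    initNs={"ex": EX},
)
for row in results:
    print(row.patient, row.sev)
```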

All of the features and capabilities below are needed to keep a data lake from turning into a data swamp, where no one quite knows what it contains or if it is reliable.

Informed Data Lake Features:

·      Connectors to practically any source

·      Graph based, linked and contextualized data

·      Dynamic Ontology Mapping

·      Auto-generated conceptual models

·      Advanced Text Analytics

·      Annotation, Harmonization and Canonicalization

·       “Canonical” models to simplify ingest and classifying of new sources

·      Semantic querying and data enrichment

·      Fully customizable dashboards

·      Full data provenance adhering to IT standards

Sample Informed Data Lake Capabilities:

·      Manage business vocabulary along with technical syntax

·      Actively resolve differences in vocabulary across different departments, business units and external data

·      Support consistent assignment of business policies and constraints across various applications, users and sources

·      Accurately reflect all logical consequences of changes and dynamically reflect change in affected areas

·      Unify access to content and data

·      Assure and manage reuse of linked and contextualized data

Any vendor providing metadata based on semantic technology is in a unique position to deliver the capabilities required to build and deploy the Informed Data Lake. The approach is based on open standards and is semantic from the beginning. In addition, it can incorporate a very rich tool set, including dozens of third-party applications that operate seamlessly within the Informed Data Platform. This is central to the ability to move from data integration and data extraction to more advanced knowledge integration and knowledge extraction, without which it is impossible to fuel solutions in areas such as competitive intelligence, insider-trading surveillance, investigative analytics, Customer 360, risk and compliance, as well as feeding existing BI applications (a requirement that is not going away anytime soon).

An Informed Data Lake Solution

The specific design pattern of the Informed Data Lake enables data science because analytics does not end with a single hypothesis test. Simple examples of “Data Scientists” building models on the data lake and saving the organization vast sums of money make good copy, but they do not represent what happens in the real world.

Often, the first dozen hypotheses are either obvious or non-demonstrable. When the model characterization comes back it presents additional components to validate and cross correlate. It is this discovery process that the data lake somehow needs to facilitate, and it needs to facilitate it well, otherwise the cost of the analytics is too high and the process is too slow to realize business value.

To enable that continuous improvement process of deep analytics requires more than a data strategy, it needs a tool chain to solve model refinement, and the best-known method to date is the Informed Data Lake. The significant pain point for deep analytics is refinement. And the lower the refinement costs are, the more business value can be extracted.

At some point you may have heard the criticism of BI and OLAP tools that you were constrained to the questions that were implicit in their models. In fact, the same criticism has been leveled at data warehouses.  The fact remains that both data warehouses and BI tools limit your questions to those that can be answered, not just with the available data, but how it is arranged physically and how well the query optimizer can resolve the query.

Now imagine what would be possible if you could ask any question of the data in a massive data lake. This is where the Informed Data Lake comes into play.

Catalog capabilities allow for massive amounts of metadata and instantaneous access to it. Thus any user (or process) can “go shopping” for a dataset that interests them. Because the metadata is constructed in the form of an in-memory graph, linking and joining data that is of far different structures and perhaps never linked before, can be done instantaneously.

In a browser-like interface, the graph can show you not only the typical ways different data sets can be linked and joined, it can even recommend other datasets that you haven't considered.

Once data is selected, the in-memory graph processing analyzes and traverses its structure to provide the instantaneous joins that would be impossible in a relational database. The net result is that arbitrarily complex models and tools can ask any question with unlimited joins, as a result of processing optimized for multi-core CPUs, very large memory models and fast interconnects across processing nodes.
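A toy sketch of the “go shopping” idea: if the catalog's metadata graph links datasets to the business terms they expose, then recommending joinable datasets reduces to counting shared neighbors. The datasets and terms below are invented, and a real catalog graph would be far richer and held in memory as described above.

```python
import networkx as nx

# Bipartite metadata graph: dataset nodes on one side, business terms on the other.
g = nx.Graph()
datasets = ["claims_2018", "trial_enrollment", "adverse_events"]
g.add_edges_from([
    ("claims_2018", "policy_id"), ("claims_2018", "patient_id"),
    ("trial_enrollment", "patient_id"), ("trial_enrollment", "site_id"),
    ("adverse_events", "patient_id"), ("adverse_events", "drug_code"),
])

def recommend(dataset):
    """Rank other datasets by how many linked business terms they share."""
    terms = set(g.neighbors(dataset))
    scores = {
        other: len(terms & set(g.neighbors(other)))
        for other in datasets if other != dataset
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("claims_2018"))
# [('trial_enrollment', 1), ('adverse_events', 1)] -- both joinable on patient_id
```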

Informed Data Lake in Action

Pharma R&D Intelligence:

Clinical trials involve great quantities of data from many sources, a perfect problem for an Informed Data Lake. The Informed Data Lake allows the loading, unification and ingestion of the data without knowing a priori what analytics will be needed. In particular, evaluating drug response would link many sources of data following participants, with the severity and occurrence of adverse drug reactions, across multiple trials, as well as other, as-yet-unknown classes of data.

Clinical trial data investigators and analysts can see the value of the graph based approach with the linking and contextualization they could not do otherwise.  They see many benefits including:

·      Identifying patients for enrollment based on more substantive criteria

·      Monitoring in real time to identify safety or operational signals

·      Blending data from physicians and CROs (contract research organizations)

 

Insider Trading and Compliance Surveillance:

In the financial services space, the combination of deep analysis of large datasets with targeted queries of specific events and people give analysts and regulators an opportunity to catch wrongdoing early.

·      Identify an employee who has an unusually high level of suspicious trading activity.

·      Spot patterns in which certain employees have histories of making the exact same trades at the exact same times.

·      Compare employees’ behaviors to their past histories, and spot situations where employees’ trading patterns make sudden, drastic changes

Conclusion

Making sense of data lakes takes discipline, because a one-off approach will drain your best resources of time and patience. The Informed Data Lake approach, complete with a suite of NLP, AI, graph-based models and semantic technology, is the sensible approach. Your two most expensive assets are staff and time. The Informed Data Lake allows you to do your work faster and cheaper, with more flexibility and greater accuracy, which has a major impact on your business. Without the Informed Data Lake, the data is a bewildering collection that analysts and data scientists can only understand in small pieces, diluting the value of the data lake.

The whole extended fabric of an ontology solution and its ability to plug in third-party abilities collapses many layers of logical and physical models in traditional data warehousing/business intelligence architectures into a single model. With the Informed Data Lake approach, tangible benefits accrue:

·      Widespread understanding of the model across many domains in the organization

·      Rapid implementation of new studies and applications by expanding the model, not re-designing it (even small adjustments to relational databases involve development at the logical, physical and downstream models, with time-consuming testing).

·      Application of Solution Accelerators that provide bundled models by industry/application type that can be modified for your specific need

·      “Data Democratization” making data available to users across the organization for their own data discovery and analytic needs, extracting greater value from the data

·      Discovering hidden patterns in relationships, something not possible with the rotational and drill-down capabilities of BI tools

·      The ability for iterative question and answering, continuous data discovery and run time analytics across huge amounts of data and, more importantly, linked data from sources not typically associated previously

In conclusion, the Informed Data Lake turns a disparate collection of data sources of unknown origin, quality and currency into a facility for almost limitless exploration and analysis.
[1] Managing from scarcity has historically driven IT to develop and deploy using the least amount of computing resources under the assumption that these resources were precious and expensive. In the current computing economy, the emphasis has shifted away from scarcity of hardware to scarcity of time and attention of knowledge workers


Karl Popper versus Data Science

I'm sure you've heard of Big Data and IoT (Internet of Things) by now. There is a current in computing now that is based on the economics of nearly unlimited resources for computational complexity, including Cognitive Computing (AI + Machine Learning). From this, many are seeing the “end of science,” meaning the truth is in the data and the scientific method is dead. Previously, a scientist would observe certain phenomena, come up with a theory and test it. Here is a counter example.

Using algorithms from Topology (yeah, I studied topology in the 70's), investigators can apply TDA (Topological Data Analysis) to investigate the SHAPE of very complex, very high-volume, very high-dimensional data (thousands of variables), deform it in various ways to see what its true nature is and find out what's really going on. Traditional quantitative methods can only sample, or reduce the variables using techniques like Principal Component Analysis (“these variables don't seem very important”).
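For contrast, here is the traditional reduction step that last sentence refers to: a minimal Principal Component Analysis sketch on synthetic wide data, which keeps only the directions of largest variance and throws the rest away (TDA itself needs specialized libraries and is not shown here).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))   # 500 rows, 1,000 variables of synthetic data

pca = PCA(n_components=10)         # collapse 1,000 variables down to 10 components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                                   # (500, 10)
print(round(pca.explained_variance_ratio_.sum(), 3))     # variance retained by those 10
```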

In one case, an organization did a retrospective analysis of every single trial and study on spinal cord injuries. What they found with TDA was that one and only one variable had a measurable effect on outcomes for patients presenting with SCI: maintaining normal blood pressure as soon as they hit the ambulance. No one had either seen or even contemplated this before.

Karl Popper was one of the most important and controversial philosophers of science of the 20th century. In “All Life is Problem Solving,” Popper claimed that “Science begins with problems. It attempts to solve them through bold, inventive theories. The great majority of theories are false and/or untestable. Valuable, testable theories will search for errors. We try to find errors and to eliminate them. This is science. It consists of wild, often irresponsible ideas that it places under the strict control of error correction.”

In other words, hypothesis precedes data. We decide what we want to test, and assemble the data to test it. This is the polar opposite of the data science emerging from big data.

So here’s my premise. Is Karl Popper over? Has computing killed the scientific method?

Thoughts?


Miscellaneous Ramblings Today on Data and Analytics

Here are some ideas off the top of my head:

1. The Big Data Analytics industry – vendors, journalists, industry analysts – have flooded the market with messages as if no one ever used quantitative methods before

2. Because most of the content you see is generated by people who don’t actually use quantitative methods, it is:
– focused on technology
– full of the same use cases such as up-sell/cross-sell, churn, fraud, etc.

3. The real opportunity with Big Data and its attendant technologies is to get a richer understanding of those phenomena that are important to you

4. The rise of Data Science and Scientists is the invention of practitioners from the digital giants and not terribly relevant to most companies

5. Ultimately the benefit of Big Data Analytics will be better decisions born of better decision-making processes, not just informing people of findings. This was the weak point of BI; it was too passive. Operational Intelligence and Decision Automation are key

6. All of this is possible because of the radically different analytical architectures and open source tools that are available in a variety of cloud-based topologies

7. Many business analysts have the background to use advanced analytical tools, provided the tools get better at guiding and advising.

8. The industry can’t continue without better tools. Big Data is a giant time sink. We’re seeing lots of interesting products emerge, many are open-source, to lubricate the whole data management and analytic spectrum

9. As always, finding a way for business units and IT to cooperate and work productively is still a problem.

10. Existing operational systems are either based on relational database technology or even older systems written in COBOL and other 2nd-generation languages. Capturing information in these systems is like fitting a square peg in a round hole. New database systems, the so-called NoSQL tools, offer abundant opportunities to capture and use rich information. Graph databases, for example, are brilliant at finding hidden relationships that expose concentration risk or fraud.

11. I've built a few Bayesian Belief Networks recently. What I learned is that they can get computationally expensive, perform poorly on high-dimensional data, and the models can be hard to interpret. On the other hand is the ability to get to causation, not just correlation. It is better to build them from data and/or simulation (a small worked example of the underlying arithmetic follows below).
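For readers who haven't worked with Bayesian networks, the worked example below shows the underlying arithmetic on a single, invented two-node network (weather causing an outage); real networks chain many such conditional tables together, which is where the computational cost and interpretability problems come from.

```python
# Tiny two-node "network": Weather -> Outage. All probabilities are invented.
p_storm = 0.10
p_outage_given_storm = 0.60
p_outage_given_clear = 0.05

# Marginal probability of an outage (sum over the parent node's states).
p_outage = p_outage_given_storm * p_storm + p_outage_given_clear * (1 - p_storm)

# Bayes' rule: what we learn about the weather after observing an outage.
p_storm_given_outage = p_outage_given_storm * p_storm / p_outage
print(round(p_storm_given_outage, 3))  # ~0.571, up from the 0.10 prior
```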


Pervasive Analytics: Needs Organizational Change, Better Software and Training

By Neil Raden nraden@hiredbrains.com

Principal Analyst, Hired Brains Research, LLC

May, 2015

The hunt for data scientists has reached its logical conclusion: there are not enough qualified ones to go around. The pull for analytics, as a result of a number of factors including big data and the march of Moore's Law, is irresistible. As a result, industry analysts, software providers and other influencers are turning to the idea of the “democratization of analytics” as a solution. At Hired Brains, we believe this is not only a good idea (and have been writing and speaking about it for four years), but that it is inevitable. Unfortunately, simply turning business analysts loose on quantitative methods is an unworkable solution. As the title says, three things that are not currently in place need to be: organizational change, better software, and training/mentoring for sustained periods.

Some Background

From the middle of the twentieth century until nearly its end, computers in business were mostly consumed with the process of capturing operational transactions for audit and regulatory purposes. Reporting for decision-making was repetitive and non-interactive. Some interactivity with the computer began to emerge in the eighties, but it was applied mostly to data input forms. By the end of the century, mostly as a result of the push from personal computers, tools for interacting with data, such as Decision Support Systems, reporting tools and Business Intelligence, allowed business analysts to finally use computing power for analytical, as opposed to operational, purposes.

Nevertheless, these tools were under constant stress because of the cost and scarcity of computing power. The repository of the data, mostly data warehouses, dwarfed the size of the operational systems that fed them. As BI software providers pressed for “pervasive BI,” so that a much broader group of people in the organization would actively use the tools (and the vendors would sell more licenses of course), the movement met resistance from three areas: 1) physical resources (CPU, RAM, Disk), 2) IT concerns that a much broader user community would wreak havoc with the established security and control and 3) people themselves who, beyond the existing users, showed little interest in “self-service” so long as there were others willing to do it for them.

In 2007, Tom Davenport published his landmark book, “Competing on Analytics,” and suddenly, every CEO wanted to find out how to compete on analytics. Beyond the more or less thin advice about why this was a good idea, the book was actually anemic when it came to providing any kind of specific, prescriptive advice on transforming an organization to an “analytically-driven” one.

Analytics Mania

Fast forward to 2015 and analytics has morphed from a meme to a mania. Pervasive BI is a relic not even discussed, but pervasive analytics or, more recently, the “democratization of analytics” is widely held to be the salvation of every organization. Granted, two of the three reasons pervasive BI failed to ignite are no longer an issue in this era of big data and Hadoop, but the third, the people, looms even larger: 1) people still are not motivated to do work that was previously done by others and 2) an even greater problem, the academic prerequisites to do the work are absent in the vast majority of workers. Pulling a Naïve Bayes or C4.5 icon over some data and getting a really pretty diagram or chart is dangerous. Software providers are making it terrifyingly easy for people to DO advanced quantitative analysis without knowing what they are doing.

Pervasive analytics? It can happen. It will happen, it's inevitable and even a good idea, but most of the messaging about it has been perilously thin, from Gartner's “Citizen Data Scientists” to Davenport's “Light Quants” (who would ever want to be a “light” anything?). What is lacking is some formality about what kind of training organizations need to commit to, what analytical software vendors need to do to provide extraordinarily better software for neophytes to use productively, and how organizations need to restructure for all of this to be worthwhile and effective.

How to Move to Pervasive Analytics

For “pervasive analytics” or “the democratization of analytics” to be successful, it requires much more than just technology. Most prominent is a lack of training and skills on the part of the wide audience that is expected to be “pervaded” if you will. The shortage of “data scientists” is well documented, which is the motivation for pushing advanced analytics down in the organization to business analysts. The availability of new forms of data provides an opportunity to gain a better understanding of your customers and business environment (among a multitude of other opportunities), which implies a need to analyze data at a level of complexity beyond current skills, and beyond the capabilities of your current BI tools.

Much work is needed to develop realistic game plans for this. In particular, our research at Hired Brains shows that there are three critical areas that need to be addressed:

  • Skills and training: A three-day course is not sufficient and organizations need to make a long-term commitment to the guiding of analysts
  • Organizing for pervasive analytics: Existing IT relationships with business analysts need reconstruction and senior analysts and data scientists need to supervise the roles of governance, mentoring and vetting
  • Vastly upgraded software from the analytics vendors: In reaction to this rapidly unfolding situation, software vendors are beginning to provide packaged predictive capabilities. This raises a whole host of concerns about casually dragging statistical and predictive icons onto a palette and almost randomly generating plausible output that is completely wrong.

Skills and Training

Of course it’s unrealistic to think that existing analysts who can build reports and dashboards will learn to integrate moment generating functions and understand the underlying math behind probability distributions and quantitative algorithms. However, with a little help (a lot actually) from software providers, a good man-machine mix is possible where analysts can explore data and use quantitative techniques while being guided, warned and corrected.

A more long-term problem is training people to be able to build models and make decisions based on probability, not a “single version of the truth.” This process will take longer and require more assistance from those with the training and experience to recognize what makes sense and what doesn’t. Here is an example:

[Figure: a stock market index plotted against media mentions of Jennifer Lawrence]

The chart shows a correlation between a stock market index and the number of times Jennifer Lawrence was mentioned in the media. Not shown, but the correlation coefficient is a robust 0.80, which means the variables are tightly correlated. Be honest with yourself and think about what could explain this? After you’ve thought about a few confounding variables, did you consider that they are both slightly increasing time series, which is actually the basis of the correlation, not the phenomena themselves? Remove the time element and the correlation drops to almost zero.

The point here is that one doesn't need to understand the algorithms that create this spurious correlation; they just need enough experience to know that you have to filter out the effect of the time series. But how would they know that?
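A minimal sketch of the trap and the fix, using synthetic series in place of the real index and mention counts: two independently generated but trending series correlate strongly until the shared trend is removed, here by first-differencing.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(250)
index = 100 + 0.3 * t + rng.normal(0, 5, size=t.size)     # trending "market index"
mentions = 20 + 0.1 * t + rng.normal(0, 2, size=t.size)   # trending "mentions" series

print(round(np.corrcoef(index, mentions)[0, 1], 2))  # high, driven by the shared trend
print(round(np.corrcoef(np.diff(index), np.diff(mentions))[0, 1], 2))  # near zero
```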

The fact is that making statistical errors is far more insidious than spreadsheet or BI errors when underlying concepts are hidden. Turning business analysts into analytical analysts is possible, but not automatic.

Consider how actuaries learn their craft. Organizations hire people with an aptitude for math, demonstrated by doing well in things like Calculus and Linear Algebra, but not requiring a PhD. As they join an insurance or reinsurance or consulting organization, they are given study time at work to prepare for the exams, a process that takes years, and have ample access to mentors to help them along because the firm has a vested interest in them succeeding. Being an analyst in a firm is a less extensive learning process, but the model still makes sense.

Organizational: How organizations should deal with DIY analytics

We’re just beginning our research in this area, but one thing is certain: the BI user pyramid has got to go. In many BI implementations, the work fell onto the shoulders of BI Competency Centers to create datasets, while a handful of “power users” worked with the most useful features of the toolsets. The remainder of the users, dependent on the two tiers above them, generated simple reports or dashboards for themselves or departments (an amusing anecdote from a client of ours was, “The most used feature of our BI tool was ‘Export to Excel.’”) Creating “Pervasive BI” would have entailed doing a dead lift of the “business users” into the “power user” class, but no feasible approach was ever put forward.

Pervasive analytics cannot depend on the efforts of a few “go-to guys,” it has to evolve into an analytically centered organization where a combination of training and better software can be effective. That involves a continuing commitment to longer-term training and learning, governance of models so that models developed by professional business analysts can be monitored and vetted before finding their way into production and just a wholesale effort to change the analytics workflow: where do these analyses go beyond the analyst?

Expectations from Software Providers

Packaged analytical tools are sorely lacking in advice and error catching. It is very easy to take an icon and drop it on some data, and the tools may offer some cryptic error message or, at worst, the “help” system displays 500 words from a statistics textbook to describe the workings of the tool. But this is 2015 and computers are a jillion times more powerful than they were a few years ago. It will take some hard work for the engineers, but there is no reason why a tool should not be able to respond to its use with:

  • Those parameters are not likely to work in this model; why don’t you try these
  • Hey, “Texas Sharpshooter”: you drew the boundaries around the data to fit the category model
  • I see you're using a p-value but haven't verified that the distribution is normal. Shall I check for you? (A minimal sketch of such a check follows this list.)
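As a sketch of the kind of guardrail the last bullet asks for, the snippet below runs a Shapiro-Wilk normality check and warns the analyst; a real tool would do far more, but even this much is cheap to build.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)  # clearly non-normal data

stat, p = stats.shapiro(sample)
if p < 0.05:
    print(f"Warning: the sample looks non-normal (Shapiro-Wilk p={p:.4f}); "
          "a test that assumes normality may not mean what you think.")
```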

We will be continuing our research in the areas of skills/training, organization and software for Pervasive Analytics. Please feel free to comment at nraden@hiredbrains.com


Relational Technologies Under Siege: Will Handsome Newcomers Displace the Stalwart Incumbents?


Published: October 16, 2014
Analyst: Neil Raden
After three decades of prominence, Relational Database Management Systems (RDBMS) are being challenged by a raft of new technologies. While RDBMS enjoy a position of incumbency, newer data management approaches are benefitting from a vibrancy powered by the effects of Moore's Law and Big Data. Hadoop and NoSQL offerings were designed for the cloud, but are finding a place in enterprise architecture. In fact, Hadoop has already made a dent in the burgeoning field of analytics, previously the realm of data warehouses and analytical (relational) platforms.

KEY FINDINGS
• RDBMS are overwhelmed by new forms of data (so-called “big data”), including text, documents, machine-generated streams, graphs and others, but are counter-attacking with new development and features as well as acquisitions and partnerships
• Non-relational platform vendors assert that the relational model itself is too rigid and expensive for the explosion of information
• A fundamental drawback in RDBMS technology is the tight coupling of the storage, metadata and parser/optimizer layers that cannot take advantage of the separate storage and compute capabilities of Hadoop
• Advances in technology are not the key differentiators between RDBMS tools and Hadoop/Big Data NoSQL offerings; requirements are. The continuing enterprise need for quality, integrated information and a “single version of the truth” argues for existing and enhanced relational data warehouses, versus the “good enough” mentality of cloud-based and Hadoop efforts that were developed for large internet companies. This is the key difference between the analytical approaches
• The “new-new” is pretty exciting, but there is a rush to provide true SQL access to many of these platforms, an admission that the relational calculus will endure
• Desirable features of RDBMS will migrate to the distributed processing of Hadoop, but only once Hadoop solves its shortcomings in security, workload management and operability. Born-in-the-cloud SaaS applications built on NoSQL databases (even some yet to emerge) will operate seamlessly on this platform, but not for 3-5 years
• Surveys of “revenue intention” for new technology spending are misleading; only 15% of companies surveyed are using Hadoop, and many of those deployments are experiments

RECOMMENDATIONS
• Recognize that RDBMS, Hadoop and NoSQL databases have vastly different purposes, capabilities, features and maturity
• When contemplating a move from an Enterprise Data Warehouse and/or on-premises ETL, take the long view of the effort, cost and disruption
• Determine exactly what your RDBMS vendor is planning for supporting “hybrid” environments because, for the time being, it will affect the downstream activities of analytics
• There are many use cases for NoSQL/Big Data that are compelling and you should carefully consider them. In general, they go beyond your existing Data Warehouse/BI but are not necessarily a suitable replacement. In two years this will likely change.
• Go slow and do not throw the baby out with the bathwater. The best approach is to experiment with a “skunk works” project or two to get a feel for whether the approach is right for your organization. Beyond that, design a careful Proof of Concept (PoC) that can actually “prove” your “concept.” Vendors tend to insert requirements and features that favor their product, which can derail the validity of the PoC.

ANALYSIS
Relational database technology was adopted by the enterprise for its ability to host transactional/operational applications. By the late ’80s, vendors posted benchmarks of transactions per second that exceeded those of purely proprietary databases, with the added benefit of an abstracted language, SQL, that allowed different flavors of databases to be designed, queried and maintained without the effort of learning a new proprietary language for each one.
Later, as the need grew for more careful data management for reporting and analytics, RDBMS were pressed into service as data warehouses, a role for which they were not well-suited in terms of scale and especially speed of complex queries and large table joins. This need was met in a number of ways, to some degree, but it took time.
This is precisely where we see Hadoop today, a tool that was built primarily to support search and indexing of unruly data on the Internet. However, its advantages in terms of cost and scale are so compelling that it is quickly being pressed into service as an enterprise analytics platform, even though it is sorely lacking in some features that data warehouses and analytical platforms (like Vertica, Netezza, Teradata, etc.) already possess.
The trend for distributors of Hadoop is to claim that relational data warehouses are obsolete, or at best artifacts that have some enduring value. Curiously, with all of the attendant deficiencies of RDBMS in their view, they are mostly mute about RDBMS for transactional purposes, but that is likely to change.
Relational vendors are at work to put in place reference architectures (and products to support them) that are hybrid in nature. A term emerging is “polyglot persistence,” the ability of the first mover in an analytical query to parse and distribute pieces of the query to the logical location of the data and, preferably, the compute engine for that data without having to bulk-load data and persist it to answer a question. The concept is similar to federating queries, but much more powerful as a federation scheme usually involves design of a reference schema and assembling and transforming the data into a single place to satisfy the query. In a hybrid architecture, there are actually multiple storage locations (even in-memory) and compute resources working in a cooperative fashion. This arrangement preserves the RDBMS as the origin of analytical queries and provider of the answer set and simplifies the maintenance and orchestration of downstream processes, especially analytical, visualization and data discovery.
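As a thought experiment, here is a deliberately tiny Python sketch of the routing idea behind polyglot persistence: parse a request, push each piece down to the engine that holds that data, and assemble the answer set without bulk-loading everything into one store first. The two in-memory “engines,” the data and the merge step are hypothetical stand-ins, not any vendor’s actual hybrid architecture.

```python
# Toy "engines": a relational-style store of policies and a document-style
# store of claim notes, each answering only the sub-question it is suited for.
POLICY_ROWS = [
    {"policy_id": 1, "state": "NM", "premium": 1200.0},
    {"policy_id": 2, "state": "TX", "premium": 950.0},
]
CLAIM_DOCS = [
    {"policy_id": 1, "note": "hail damage, roof"},
    {"policy_id": 2, "note": "rear-end collision"},
]

def relational_engine(state):
    # Pushdown: the filter runs where the structured data lives.
    return [r for r in POLICY_ROWS if r["state"] == state]

def document_engine(policy_ids):
    # Pushdown: only notes for the requested policies come back.
    return {d["policy_id"]: d["note"] for d in CLAIM_DOCS if d["policy_id"] in policy_ids}

def hybrid_query(state):
    """Route sub-queries to the right engine, then join the partial results."""
    policies = relational_engine(state)
    notes = document_engine({p["policy_id"] for p in policies})
    return [{**p, "note": notes.get(p["policy_id"], "")} for p in policies]

if __name__ == "__main__":
    print(hybrid_query("NM"))
```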
RDBMS were mostly row-oriented, given their OLTP orientation, but some adopted a column orientation, the most visible being Sybase IQ. In the past few years, it became obvious that analytical applications would be better served by a columnar orientation, and products like Vertica emerged, combining column storage with a highly scalable MPP architecture. But today, there is an explosion of new databases of many types, such as (a sampling, not comprehensive):
• Column: Accumulo, Cassandra, HBase
• Wide Table: MapR-DB, Google BigTable
• Document: MongoDB, Apache CouchDB, Couchbase
• Key Value: Dynamo, FoundationDB, MapR-DB
• Graph: Neo4J, InfiniteGraph and Virtuoso
Keep in mind that none of these database systems is “general purpose”; most require programming interfaces and lack the kind of management and administrative features that IT departments demand.
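To illustrate the “programming interfaces” point, here is a small sketch of what access to a document store looks like through a driver rather than SQL. It assumes a MongoDB instance running locally and the pymongo driver installed; the database, collection and field names are invented for the example.

```python
# Document stores are driven through a programming API rather than declarative SQL.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_warehouse"]

# Inserting a record is a method call on a collection, not an INSERT statement.
db.policies.insert_one({"policy_id": 1, "state": "NM", "coverages": ["auto", "umbrella"]})

# Retrieval is a query document passed to a driver call
# (roughly the equivalent of SELECT * FROM policies WHERE state = 'NM').
for doc in db.policies.find({"state": "NM"}):
    print(doc["policy_id"], doc["coverages"])
```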

IN CONCLUSION
The explosion in database technology was inevitable as the effects of Moore’s Law caused a discontinuous jump in the flow and processing of information. Technology, however, is always a step ahead of business. The implementation of enterprise applications, information management and processing platforms is a carefully woven fabric that does not bear rapid disruption (unless, of course, that is the enterprise’s strategy). “Big data” can provide enormous benefits to organizations, but not all of them. Many will find it preferable to rely on third parties to prepare and even interpret big data for them. For those that see a clear requirement, it is wise to consider the whole playing field and how the insights gained will find purchase and value. As Peter Drucker said, “Information is data that has meaning and purpose.”

 


Miscellaneous Ramblings about Decision Making

By Neil Raden

Decision-making is not, strictly speaking, a business process. Attacking the speed problem for decision-making, which is mostly a collaborative and iterative effort, requires looking at the problem as a team phenomenon. This is especially true where decision-making requires analysis of data. Numeracy, a facility for working with numbers and programs that manipulate numbers, exists at varying levels in an organization. Domain expertise similarly exists at multiple levels, and most interesting problems require contributions and input from more than one domain. Pricing, for example, is a joint exercise of marketing, sales, engineering, production, finance and overall strategy. If there are partners involved, their input is needed as well. The killers of speed are handoffs, uncertainty and lack of consensus. In today’s world, an assembly line process of incremental analysis and input cannot provide the throughput to be competitive. Team speed requires that organizations break down the barriers between functions and enable information to be re-purposed for multiple uses and users. Engineers want to make financially informed technical decisions and financial analysts want to make technically informed economic decisions.

That requires analytical software and an organizational approach that is designed for collaboration between people of different backgrounds and abilities.

All participants need to see the answer and the path to the answer in the context of their particular roles. Most analytical tools in the market cannot support this kind of problem-solving. The urgency, complexity and volume of data needed overwhelms them, but more importantly, they cannot provide the collaborative and iterative environment that is needed. Useful, interactive and shareable analytics can, with some management assistance, directly affect decision-making cycle times.

When analysis can be shared, especially through software agents called guides that allow others to view and interact with a stream of analysis instead of a static report or spreadsheet, time-eating meetings and conferences can be shortened or eliminated. Questions and doubts can be resolved without the latency of scheduling meetings. In fact, guides can even eliminate some of the presentation time in meetings as everyone can satisfy themselves beforehand by evaluating the analysis in context, not just poring over results and summarizations.

Decision making is iterative. Problems or opportunities that require decisions often aren’t resolved completely, but return, often slightly reframed. Karl Popper taught that in all matters of knowledge, truth cannot be verified by testing, it can only be falsified. As a result, “science,” which we can broadly interpret to include the subject of organizational decision-making, is an evolutionary process without a distinct end point. He uses the simple model below:

PS(1) -> TT(1) -> EE(1) -> PS(2)

Popper’s premise was that ideas pass through a constant set of manipulations that yield solutions with better fit but not necessarily final solutions. The initial problem specification PS(1) yields a number of Tentative Theories TT(1); Error Elimination EE(1) then produces a revised problem specification, PS(2), and the process repeats. The TT and EE steps are clearly collaborative.
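Purely to underline that the model is a loop and not a single pass, here is a toy Python rendering of it; the functions are placeholders for human work, not an algorithm anyone would actually run, and the sample “problem” is invented.

```python
# A toy rendering of Popper's PS(1) -> TT(1) -> EE(1) -> PS(2) cycle.
def decision_cycle(problem, propose_theories, eliminate_errors, rounds=3):
    """Each pass refines the problem specification; there is no final answer."""
    for i in range(1, rounds + 1):
        theories = propose_theories(problem)           # TT(i): tentative theories
        problem = eliminate_errors(problem, theories)  # EE(i) yields PS(i+1)
        print(f"PS({i + 1}): {problem}")
    return problem

if __name__ == "__main__":
    # Placeholder "work": each round narrows the problem statement a little.
    decision_cycle(
        "Why did Q3 margins slip?",
        propose_theories=lambda ps: [ps + " (pricing?)", ps + " (mix?)"],
        eliminate_errors=lambda ps, tt: tt[0],  # pretend review falsified the rest
    )
```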

The overly-simplified model that is prevalent in the Business Intelligence industry is that getting better information to people will yield better decisions. Popper’s simple formulation highlights that this is inadequate – every step from problem formulation, to posing tentative theories to error elimination in assumptions and, finally, reformulated problem specifications requires sharing of information and ideas, revision and testing. One-way report writers and dashboards cannot provide this needed functionality. Alternatively, building a one-off solution to solve a single problem, typically with spreadsheets, is a recurring cost each time it comes around.


I’m Getting Convinced About Hadoop, sort of

As I sometimes do, I went to Boulder last week to soak up some of Claudia and Dave Imhoff’s hospitality and to sit in on a BBBT (Boulder BI Brain Trust) briefing in person instead of remotely like most of us do. The company this particular week was Cloudera and I wanted to not only listen to their presentations, but participate in the Q&A give and take as well as have more intimate conversations at dinner the night before. Despite the fact it took eight hours to drive there from Santa Fe (but only six back), it was clearly worth the effort. I certainly enjoyed meeting all the Cloudera people who came, but since this article is about Hadoop, not Cloudera, I’ll skip the introductions.

A common refrain from any Hadoop vendor (the term vendor is a little misleading because open source Hadoop is actually free) is that Hadoop, almost without qualification, is a superior architecture for analytics to its predecessor, the relational database management system (RDBMS) and its attendant tools, especially ETL (Extract/Transform/Load; more on that below). Their reasoning is that it is undeniably cheaper to load Hadoop clusters with gobs of data than it is to expand the size of a licensed enterprise relational database. This is across the board – server costs, RAM and disk storage. The economics are there, but only compelling when you overlook a few variables. Hadoop stores three copies of everything, data can’t be overwritten, only appended to, and most of the data coming into Hadoop is extremely pared down before it is actually used in analysis. A good analogy would be that I could spend 30 nights in a flophouse in the Tenderloin for what it would cost for one night at the Four Seasons.

But I did say I was getting convinced about Hadoop, so be patient.

A constant refrain from the Hadoop world is that it is difficult and time-consuming to change a schema in an RDBMS, but Hadoop, with its “schema on read” concept, allows for instantaneous change as needed. Maybe not intentionally, but this is very misleading. What is hard to change in an RDBMS is making changes to an application such as a DW with upstream and downstream dependencies. I can make a change to a database in two seconds. I can add non-key attributes to a Data Warehouse dimension table in an instant. But changing a shared, vetted, secure application is, reasonably, not an instantaneous thing, which illustrates something about the nature of Hadoop applications – they are not typically shared applications. Often, they are not even applications, so this comparison makes no sense. Instead, it illustrates two very important qualities of RDBMS’s and Hadoop.
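For readers who have not seen the two models side by side, here is a minimal sketch of the contrast, with sqlite3 standing in for an RDBMS and raw JSON lines standing in for files landed in HDFS; nothing here reflects any particular vendor’s implementation, and the records are invented.

```python
# "Schema on write" vs. "schema on read", reduced to a toy example.
import json
import sqlite3

raw_events = [
    '{"policy_id": 1, "event": "quote", "premium": 1200}',
    '{"policy_id": 2, "event": "claim", "severity": "minor"}',  # a new field appears
]

# Schema on write: the structure is declared up front; fields that were not
# anticipated (premium, severity) are simply not captured until the schema changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (policy_id INTEGER, event TEXT)")
for line in raw_events:
    rec = json.loads(line)
    conn.execute("INSERT INTO events VALUES (?, ?)", (rec["policy_id"], rec["event"]))

# Schema on read: the raw lines are kept as-is, and each query imposes whatever
# structure it needs at the moment it runs, including brand-new fields.
severities = [json.loads(line).get("severity") for line in raw_events]
print(severities)  # [None, 'minor']
```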

One more item about this “hard to change” charge. Hadoop is composed of the file system, HDFS, and the programming framework, MapReduce. When Hadoop vendors talk about the flexibility and scalability of Hadoop, they are talking about this core. But today, the Hadoop ecosystem (and this is just the Apache open source stuff; there is an expanding soup of add-ons appearing every day) has more than 20 other modules in the Hadoop stack that make it useful. While I can do whatever I want with the core, once I build applications with these other modules there are just as many dependencies up and down the stack that need to be attended to when changing things as in a standard Data Warehouse environment.

But wait. Now we have the Stinger Initiative for Hive, Hadoop’s SQL-ish database, to make Hive 100x faster. This is accomplished by jettisoning MapReduce and replacing it with Tez, the next-generation MapReduce. According to Hortonworks, Tez is “better suited to SQL.” The Stinger Initiative also includes the ORCFile format for better compression and vectorizes Tez so that, unlike MapReduce, it can grab lots of records at once. And on top of it all, the crown jewel in any relational database, a Cost-Based Optimizer (CBO), which can only work with a, wait for this, schema! In fact, in the demo I saw today from Hortonworks, they were actually showing iterative SQL queries against, again, wait for it…a STAR SCHEMA! So what happened to schema on read? What happened to how awful RDBMS was compared to Hadoop? See where this is going? In order to sell Hadoop to the enterprise, they are making it work like an RDBMS.
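For anyone unfamiliar with the term, the sketch below shows the shape of the star-schema query that demo implied: one fact table joined to dimension tables, exactly the pattern a cost-based optimizer needs a schema to reason about. It runs against sqlite3 purely as a stand-in, and the table and column names are invented.

```python
# A star-schema aggregation: fact table joined to two dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_agent  (agent_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, agent_key INTEGER, premium REAL);

    INSERT INTO dim_date   VALUES (1, 2014), (2, 2015);
    INSERT INTO dim_agent  VALUES (10, 'West'), (11, 'East');
    INSERT INTO fact_sales VALUES (1, 10, 1200.0), (2, 10, 950.0), (2, 11, 400.0);
""")

rows = conn.execute("""
    SELECT d.year, a.region, SUM(f.premium)
    FROM fact_sales f
    JOIN dim_date  d ON f.date_key  = d.date_key
    JOIN dim_agent a ON f.agent_key = a.agent_key
    GROUP BY d.year, a.region
""").fetchall()
print(rows)
```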

There are four kinds of RDBMS’s in the market today (and this is my market definition, no one else’s): 1) Enterprise Data Warehouse database systems designed from the ground up for data warehousing. As far as I’m concerned, there is only one that can handle massive volumes, huge mixed workloads, broad functionality and tens of thousands of users a day – the Teradata 6xxx series; 2) RDBMS designed for transactional processing but positioned for data warehousing too, just not as good at it, such as Oracle, DB2 and MSSQL; 3) Analytical databases, either sold as software-only or as appliances – IBM Netezza, HP Vertica, Teradata 2xxx series; 4) In-memory databases such as SAP HANA, Oracle TimesTen and a passel of others. Now we have a fifth – SQL-compliant (not completely) databases running on top of HDFS. There are many versions of these, too, such as SpliceMachine, now in public beta, as well as Drill, Impala, Presto, Stinger, Hadapt and Spark/Shark to name a few (although Daniel Abadi of Hadapt has argued that a “structured” query language misses the point of Hadoop entirely – flexibility). So Hadoop, sort of, makes five.

So where are we going with this? Like Clinton in the ’90s, it’s clear Hadoop is moving to the center. Purist Hadoop will continue to exist, but market forces are driving it toward a more palatable enterprise offering: governance, security, managed workloads, interactive analysis. All of the things we have now, but on cheap platforms for greater volumes of data and massive concurrency.

I do wonder about one thing, though. The whole notion of just throwing more cheap resources at it has to have a point of diminishing returns. When will we get to the point that Hadoop is using 100x or 1,000x more resources than would be needed in a careful architecture? Think about this. If we morph Hadoop into just a newer analytical database platform, sooner or later someone is going to wonder why we have 3 petabytes of drives and only 800 terabytes of data. In fact, how much duplication is in that data? How much wasted space? Drives may be cheap, but even a thousand cheap drives cost something, especially when they’re only 20% utilized.
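A quick back-of-the-envelope on those numbers, using Hadoop’s traditional default replication factor of three; the figures are the ones used in the text, nothing more.

```python
# Replication overhead behind "3 petabytes of drives, 800 terabytes of data".
raw_capacity_tb = 3_000        # 3 PB of physical drives
data_tb = 800                  # unique data actually landed
replication_factor = 3         # Hadoop's traditional default

stored_tb = data_tb * replication_factor   # bytes actually written across replicas
print(f"Raw space consumed by replicas: {stored_tb} TB")
print(f"Share of purchased capacity holding unique data: {data_tb / raw_capacity_tb:.0%}")
```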

Hadoop was invented for search indexing and other internet-related activities, not enterprise software. Its promotion to all forms of analytics is curious. Where did anyone prove that its architecture was right for everything, or did the hype just get sold on being cheap? And what is the TCO over time versus a DW?

And when Hadoop vendors say, “Most of our customers are building an Enterprise Data Hub or (a terrible term) a Data Lake next to their EDW because they are complementary,” it raises the question: for analytics in typical organizations, what exactly is complementary? That’s when we hear about sensors and machine-generated data and the social networks. How universal are those needs?

Then there is ETL. Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop? They need to be reminded that writing code is not quite the same as using an ETL tool with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It’s also a little contradictory: if Hadoop is for completely flexible and novel analysis, who is going to write ETL code for every project? Now there is a real latency: only five minutes to crunch the data and 30 days to write the ETL code.
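To see why “just write the ETL” undersells the problem, here is a bare-bones sketch of a hand-coded transform; the file layout and field names are invented, and everything an ETL tool normally provides around such code (versioning, lineage, reuse, a library of prebuilt transforms) is conspicuously absent.

```python
# A hand-coded transform over raw delimited records: it works, but every rule
# and every change has to be written, tested and maintained by hand.
import csv
import io

RAW = "POL-001|NM |1200.00\nPOL-002|tx|950.00\n"

def transform(raw_text):
    """Parse pipe-delimited policy records, trimming and standardizing fields."""
    rows = []
    for rec in csv.reader(io.StringIO(raw_text), delimiter="|"):
        policy_id, state, premium = rec
        rows.append({
            "policy_id": policy_id.strip(),
            "state": state.strip().upper(),   # one of dozens of rules a tool ships with
            "premium": float(premium),
        })
    return rows

if __name__ == "__main__":
    for row in transform(RAW):
        print(row)
```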

They talk about using Hadoop as an archive to get old data out of a data warehouse, but they fail to mention that that data is unusable without the context that still remains in the DW; nor will it be usable in the DW later, after the schema evolves. So what they really mean is: use Hadoop as a dump for data you’ll never use but can’t stand to delete, because if you don’t need it in the DW, why do you need it at all?

Despite all this, it’s a tsunami. The horse has left the stable. The train has left the station. Hadoop will grow and expand and probably not even be recognizable as the original Hadoop in a few years, and it will replace the RDBMS as the platform of choice for enterprise applications (even if the bulk of its application will be SQL-based). I guarantee it. So get on top of it or get out of the way.
