Relational Technologies Under Siege: Will Handsome Newcomers Displace the Stalwart Incumbents?


Published: October 16, 2014
Analyst: Neil Raden
After three decades of prominence, Relational Database Management Systems (RDBMS) are being challenged by a raft of new technologies. While RDBMS enjoy a position of incumbency, newer data management approaches are benefiting from a vibrancy powered by the effects of Moore’s Law and Big Data. Hadoop and NoSQL offerings were designed for the cloud, but are finding a place in enterprise architecture. In fact, Hadoop has already made a dent in the burgeoning field of analytics, previously the realm of data warehouses and analytical (relational) platforms.

• RDBMS are overwhelmed by new forms of data (so-called “big data”), including text, documents, machine-generated streams, graphs and others, but are counter-attacking with new development and features as well as acquisitions and partnerships
• Non-relational platform vendors assert that the relational model itself is too rigid and expensive for the explosion of information
• A fundamental drawback in RDBMS technology is the tight coupling of the storage, metadata and parser/optimizer layers that cannot take advantage of the separate storage and compute capabilities of Hadoop
• Advances in technology are not the key differentiators between RDBMS tools and Hadoop/Big Data NoSQL offerings. Requirements are. The continuing enterprise need for quality, integrated information and a “single version of the truth” argues for existing and enhanced relational data warehouses, versus the “good enough” mentality of cloud-based and Hadoop efforts that were developed for large internet companies. These differing requirements are the key distinctions between the analytical approaches
• The “new-new” is pretty exciting, but there is a rush to provide true SQL access to many of these platforms, an admission that the relational calculus will endure
• Desirable features of RDBMS will migrate to the distributed processing of Hadoop, but only once Hadoop solves its shortcomings in security, workload management and operability. Born-in-the-cloud SaaS applications built on NoSQL databases (including some yet to emerge) will operate seamlessly on this platform, but not for 3-5 years
• Surveys of “revenue intention” for new technology spending are misleading; only 15% of companies surveyed are using Hadoop, and many are experiments.

• Recognize that RDBMS, Hadoop and NoSQL databases have vastly different purposes, capabilities, features and maturity
• When contemplating a move from an Enterprise Data Warehouse and/or on-premise ETL, take the long view of the effort, cost and disruption
• Determine exactly what your RDBMS vendor is planning for supporting “hybrid” environments because, for the time being, it will have an effect on the downstream activities of analytics
• There are many use cases for NoSQL/Big Data that are compelling and you should carefully consider them. In general, they go beyond your existing Data Warehouse/BI but are not necessarily a suitable replacement. In two years this will likely change.
• Go slow and do not throw away the baby with the bath water. The best approach is to experiment with a “skunk works” project or two to get a feel for whether the approach is right for your organization. Beyond that, design a careful Proof of Concept (PoC) that can actually “prove” your “concept.” Vendors tend to insert requirements and features that favor their product, which can derail the validity of the PoC.

Relational database technology was adopted by the enterprise for its ability to host transactional/operational applications. By the late 80’s, vendors posted benchmarks of transactions/second that exceeded those of the purely proprietary databases, with the added benefit of an abstracted language, SQL, that allowed different flavors of databases to be designed, queried and maintained without the effort of learning a new proprietary language for each one.
Later, as the need grew for more careful data management for reporting and analytics, RDBMS were pressed into service as data warehouses, a role for which they were not well-suited in terms of scale and especially speed of complex queries and large table joins. This need was met in a number of ways, to some degree, but it took time.
This is precisely where we see Hadoop today: a tool that was built primarily to support search and indexing of unruly data on the Internet. However, its advantages in terms of cost and scale are so compelling that it is quickly being pressed into service as an enterprise analytics platform, even though it is sorely lacking in some features that data warehouses and analytical platforms (like Vertica, Netezza, Teradata etc.) already possess.
The trend for distributors of Hadoop is to claim that relational data warehouses are obsolete, or at best artifacts that have some enduring value. Curiously, with all of the attendant deficiencies of RDBMS in their view, they are mostly mute about RDBMS for transactional purposes, but that is likely to change.
Relational vendors are at work to put in place reference architectures (and products to support them) that are hybrid in nature. A term emerging is “polyglot persistence,” the ability of the first mover in an analytical query to parse and distribute pieces of the query to the logical location of the data and, preferably, the compute engine for that data without having to bulk-load data and persist it to answer a question. The concept is similar to federating queries, but much more powerful as a federation scheme usually involves design of a reference schema and assembling and transforming the data into a single place to satisfy the query. In a hybrid architecture, there are actually multiple storage locations (even in-memory) and compute resources working in a cooperative fashion. This arrangement preserves the RDBMS as the origin of analytical queries and provider of the answer set and simplifies the maintenance and orchestration of downstream processes, especially analytical, visualization and data discovery.
RDBMS were mostly row-oriented, given their OLTP orientation, but some adopted a column orientation, the most visible being Sybase IQ. In the past few years, it became obvious that analytical applications would be better served by a columnar orientation, and products like Vertica emerged that combined it with a highly scalable MPP architecture. But today, there is an explosion of new databases of many types, such as (a sampling, not comprehensive):
• Column: Accumulo, Cassandra, HBase
• Wide Table: MapR-DB, Google BigTable
• Document: MongoDB, Apache CouchDB, Couchbase
• Key Value: Dynamo, FoundationDB, MapR-DB
• Graph: Neo4J, InfiniteGraph and Virtuoso
Keep in mind that none of these database systems are “general purpose.” Most require programming interfaces and lack the kind of management and administrative features that IT departments demand.
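To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It does not reflect how any of the products named above store data internally; the table, column names and values are invented purely for illustration.

```python
# Toy illustration only: the table, column names and values are invented.
rows = [
    {"order_id": 1, "region": "West", "amount": 120.0},
    {"order_id": 2, "region": "East", "amount": 75.5},
    {"order_id": 3, "region": "West", "amount": 42.25},
]

def get_order(row_store, order_id):
    """Row orientation: one record is kept together, which suits
    OLTP-style access such as 'fetch everything about order 2'."""
    return next(r for r in row_store if r["order_id"] == order_id)

# Column orientation: each attribute is kept together in its own list.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def total_amount(column_store):
    """An aggregate reads only the one column it needs."""
    return sum(column_store["amount"])

print(get_order(rows, 2))
print(total_amount(columns))
```

The point of the sketch is the access pattern: a transactional lookup wants the whole row, while an analytical aggregate wants one column across all rows.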

The explosion in database technology was inevitable as the effects of Moore’s Law caused a discontinuous jump in the flow and processing of information. Technology, however, is always a step ahead of business. The implementation of enterprise applications, information management and processing platforms is a carefully woven fabric that does not bear rapid disruption (unless, of course, that is the enterprise’s strategy). “Big data” can provide enormous benefits to organizations, but not all of them. Many will find it preferable to rely on third parties to prepare and even interpret big data for them. For those that see a clear requirement, it is wise to consider the whole playing field and how the insights gained will find purchase and value. As Peter Drucker said, “Information is data that has meaning and purpose.”



What Is Speed?

By Neil Raden

There was a time when computers were too slow to do more than bookkeeping and other back office chores. Without machines to interfere in their interactions, people performed office work like a ritual. There were set hours, dress codes, rigid hierarchies, predictable tasks and very little emphasis on change from year to year. Nothing moved very quickly. Good companies were stable and planned thoroughly. That was then. As computers slowly became more useful, necessity and market forces applied them to increasingly more critical tasks. By the 90’s, Total Quality Management, Process Reengineering and headcount reductions driven by the brutal pressure of corporate raiders forced organizations to look at the efficiencies that could be gained by streamlining business processes. After more than two decades of tireless cost-cutting and pursuit of efficiency, the relaxed and personal office life depicted on television of the Sixties was gone forever. Today, we operate Netflix-style, on-demand.

The critical mass for an on-demand world is composed of Information Technology elements such as ubiquitous communications (the Internet), open standards and easy access. In an information business, speed is king. For an organization conducting business, managing a battlefield or monitoring the world’s financial markets, going faster means a shift from a reliance on prediction, foresight and planning to building in flexibility, courage and faster reflexes, catching the curls as they come and getting smarter with each thing you do (and making your partners smarter, too), ranking the contingencies instead of sticking to the plan no matter what. But what exactly is speed?

Speed implies more than just doing something quickly. For example, being able to load and index 10 billion records into a data warehouse in an hour is one measure of speed, but if the process has to wait until the middle of the night, or it takes another day to aggregate and spin out the data to data marts before it can be used or if the results have to be interpreted by multiple people in different domains, then the relevant, useful measure of speed is the full cycle time. In Six Sigma terms, cycle time is the total elapsed time to move a unit of work from the beginning to the end of a physical process. It does not include lead time. Measuring speed can be relative or absolute. Closing the books in three working days is absolute. Being first to market is relative.
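A small sketch can make the difference between step speed and full cycle time concrete. Every step name and timestamp below is hypothetical; the only idea taken from the text is that cycle time runs from the start of the first step to the end of the last one, waiting included.

```python
from datetime import datetime

# Hypothetical timeline for one batch of data; all timestamps are made up.
steps = [
    ("load and index", datetime(2014, 10, 16, 1, 0), datetime(2014, 10, 16, 2, 0)),
    ("aggregate to data marts", datetime(2014, 10, 17, 1, 0), datetime(2014, 10, 17, 3, 0)),
    ("interpret results", datetime(2014, 10, 17, 9, 0), datetime(2014, 10, 17, 17, 0)),
]

def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600

# Time spent actually working inside the steps.
work_hours = sum(hours(end - start) for _, start, end in steps)

# Full cycle time: start of the first step to the end of the last step.
cycle_hours = hours(steps[-1][2] - steps[0][1])

# Whatever is left over is waiting between steps.
waiting_hours = cycle_hours - work_hours

print(f"work: {work_hours} h, cycle: {cycle_hours} h, waiting: {waiting_hours} h")
```

In this invented timeline the one-hour load looks fast, yet most of the cycle is spent waiting between steps, which is the measure that matters to the person who needs the answer.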

There is a paradox of efficiency – investing in efforts to pare the time it takes to complete work steps can often lead to even longer cycle times. Consider scheduling aircraft. When a step is delayed or fails, there are people to consider – passengers and crew, for example. The efficiency of the solution vanishes when something doesn’t work or when the means loses sight of the desired result. Perhaps the process is very efficient but brittle: when it breaks, all prior gains are lost. The lesson is that speed can’t be measured by the speed of steps or by the speed of a sequence of steps. People are always involved, often people tangential to the process. The Concorde cut Paris-to-New-York flying time in half, a savings of three and a half hours, but with today’s congested surface traffic and extreme security, a 3.5 hour flight could still take 8 or 9 hours door-to-door, a savings of only about 30%, possibly not worth the 300% fare increase, except for the most extremely time-conscious.
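The Concorde arithmetic can be worked through in a few lines. The flight times come from the paragraph above (a 3.5 hour flight that halves the flying time implies roughly 7 hours subsonic). The assumption that ground overhead is the same for both aircraft is added here for illustration.

```python
# Flight times from the text; the equal-ground-overhead assumption is ours.
CONCORDE_FLIGHT_H = 3.5
SUBSONIC_FLIGHT_H = 7.0  # implied by "cut flying time in half"

def door_to_door_saving(concorde_total_h):
    """Fraction of the subsonic door-to-door time saved by Concorde,
    assuming both trips carry the same ground overhead."""
    overhead = concorde_total_h - CONCORDE_FLIGHT_H
    subsonic_total_h = SUBSONIC_FLIGHT_H + overhead
    return (subsonic_total_h - concorde_total_h) / subsonic_total_h

for total in (8.0, 9.0):
    print(f"{total:.0f} h door-to-door: {door_to_door_saving(total):.0%} saved")
```

Under these assumptions the flying time is halved, but the door-to-door saving is less than a third, because the fixed ground time dilutes the faster step.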

In the workings of decision-making in organizations, gaining speed cannot be limited to the automating of single tasks or optimizing individual productivity. Speed has to be an organizational concept. The actors need to understand the priorities and not waste time trying to optimize things that aren’t important. But one area that is in desperate need of work is organizational decision-making.

Knowing the Enemies of Speed
What are the enemies of speed? Today, much of analytics is a solitary effort with highly skilled and trained workers expending a significant amount of their time re-configuring data, or waiting for others to do it as aspects of the problem space change. While one group expends a considerable amount of time developing reports, another group pauses to re-format the reports or, more typically, either re-key or import some of the information into their own spreadsheets. Spreadsheets, though an effective tool for individual problem-solving, cause delays when others have to interpret or proofread them. The problem isn’t limited to spreadsheets – it applies in varying degrees to all types of analytical software, but spreadsheets account for the overwhelming proportion of the problem.

And Big Data trickle-down will only make the problem worse. The gap between the first analysts (data scientists) and decision-makers is getting wider.

When information from analytic work is communicated to others, the results are often difficult to explain because they are conveyed in summary form, and usually in aggregated levels, statically. There is no explanation or explicit model to describe the rationale behind the results. Filling that gap takes additional presentations about methodology, narratives about the steps involved, alternatives that were considered and rejected (or perhaps just not recommended) and a host of other background material, usually delivered in a sequence of time-consuming, serial meetings that have to be scheduled days or weeks in advance. These are the greatest enemies of speed today. They turn cycle time into cycle epochs. The reason for all of this posterior explanation is the cognitive dissonance between the various actors. The result is that well-researched and reasonable conclusions are often not actionable because management is not willing to buy in due to their lack of insight into the process by which the conclusions were reached.

The solution to this problem is an environment where complex decisions that have to be made with confidence and consensus can gather recommendations to be presented unambiguously and compellingly across multiple actors in the decision making process.

Gaining Confidence
The late Peter Drucker said that information was “data endowed with relevance and purpose,” but it takes a human being to do that. Unfortunately, one person’s relevance is not necessarily another’s. The process of demonstrating to others what you’ve discovered and/or convinced yourself of can add latency and frustration to the process. The generation of mountains of fixed reports and even beautiful presentations of static displays such as dashboards cannot solve the problem. Henry Mintzberg wrote repeatedly that strategy was never predictable; it was “emergent,” and based on all sorts of imperfect perceptions and conflicting points of view. The lack of confidence that each actor has in every other actor’s methods and conclusions is a serious enemy of speed. It is the cause of endless rounds of meetings, delays and subterfuge.

Cost-cutting will always be a useful effort in organizations because inefficiency will always find a way to creep back in, but the dramatic improvements are largely over. The real battlefield today is differentiation, distancing your enterprise from your competitors. And in an era when every company potentially has access to the same level of best practices and efficiency, the key to leaping ahead is speed. Finding a new insight, a disruption or a discontinuity before anyone else, and being able to act on it, is the ticket to the show. Organizations need to be constantly on the lookout for new ways to streamline, to enhance revenue opportunities, to improve in a multitude of ways, to go faster. Faster decisions, faster to market, faster to understand the environment, faster to go faster. Making things go faster or better is rewarding, but giving time back to people is crazy fast, it is supercharged. The one resource that is in shortest supply is the time and attention of your best people. Give some time back to them and you can change their world.


Miscellaneous Ramblings about Decision Making

By Neil Raden

Decision-making is not, strictly speaking, a business process. Attacking the speed problem for decision-making, which is mostly a collaborative and iterative effort, requires looking at the problem as a team phenomenon. This is especially true where decision-making requires analysis of data. Numeracy, a facility for working with numbers and programs that manipulate numbers, exists at varying levels in an organization. Domain expertise similarly exists at multiple levels, and most interesting problems require contributions and input from more than one domain. Pricing, for example, is a joint exercise of marketing, sales, engineering, production, finance and overall strategy. If there are partners involved, their input is needed as well. The killers of speed are handoffs, uncertainty and lack of consensus. In today’s world, an assembly line process of incremental analysis and input cannot provide the throughput to be competitive. Team speed requires that organizations break down the barriers between functions and enable information to be re-purposed for multiple uses and users. Engineers want to make financially informed technical decisions and financial analysts want to make technically informed economic decisions.

That requires analytical software and an organizational approach that is designed for collaboration between people of different backgrounds and abilities.

All participants need to see the answer and the path to the answer in the context of their particular roles. Most analytical tools in the market cannot support this kind of problem-solving. The urgency, complexity and volume of data needed overwhelm them, but more importantly, they cannot provide the collaborative and iterative environment that is needed. Useful, interactive and shareable analytics can, with some management assistance, directly affect decision-making cycle times.

When analysis can be shared, especially through software agents called guides that allow others to view and interact with a stream of analysis instead of a static report or spreadsheet, time-eating meetings and conferences can be shortened or eliminated. Questions and doubts can be resolved without the latency of scheduling meetings. In fact, guides can even eliminate some of the presentation time in meetings as everyone can satisfy themselves beforehand by evaluating the analysis in context, not just poring over results and summarizations.

Decision making is iterative. Problems or opportunities that require decisions often aren’t resolved completely, but return, often slightly reframed. Karl Popper taught that in all matters of knowledge, truth cannot be verified by testing, it can only be falsified. As a result, “science,” which we can broadly interpret to include the subject of organizational decision-making, is an evolutionary process without a distinct end point. He uses the simple model below:

PS(1) -> TT(1) -> EE(1) -> PS(2)

Popper’s premise was that ideas pass through a constant set of manipulations that yield solutions with better fit but not necessarily final solutions. The initial problem specification PS(1) yields a number of Tentative Theories TT(1); Error Elimination EE(1) then produces a revised problem specification, PS(2), and the process repeats. The TT and EE steps are clearly collaborative.
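The PS -> TT -> EE -> PS cycle can be sketched as a simple loop. This is only a toy rendering of the idea, not anything Popper wrote: the pricing problem, unit cost and demand curve below are all hypothetical, chosen to echo the pricing example earlier in the post.

```python
# Toy rendering of PS(n) -> TT(n) -> EE(n) -> PS(n+1).
# The pricing problem, unit cost and demand curve are all hypothetical.
UNIT_COST = 10.0

def profit(price):
    """Invented demand model: units sold fall as price rises."""
    units = max(0.0, 100.0 - 2.0 * price)
    return (price - UNIT_COST) * units

def tentative_theories(problem):
    """TT: propose five evenly spaced candidate prices in the current range."""
    low, high = problem
    step = (high - low) / 4
    return [low + i * step for i in range(5)]

def error_elimination(theories):
    """EE: test the candidates and keep the one that survives best."""
    return max(theories, key=profit)

def next_problem(problem, survivor):
    """PS(n+1): restate the problem as a narrower range around the survivor."""
    low, high = problem
    step = (high - low) / 4
    return (max(low, survivor - step), min(high, survivor + step))

def iterate(problem, rounds):
    """Run the cycle a fixed number of times and record each problem statement."""
    history = [problem]
    for _ in range(rounds):
        survivor = error_elimination(tentative_theories(problem))
        problem = next_problem(problem, survivor)
        history.append(problem)
    return history

history = iterate((10.0, 50.0), rounds=3)
for n, (low, high) in enumerate(history, start=1):
    print(f"PS({n}): price between {low} and {high}")
```

Note that the loop never declares a final answer. Each round only hands back a better-framed problem, which is the property the model is meant to highlight.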

The overly-simplified model that is prevalent in the Business Intelligence industry is that getting better information to people will yield better decisions. Popper’s simple formulation highlights that this is inadequate – every step from problem formulation, to posing tentative theories to error elimination in assumptions and, finally, reformulated problem specifications requires sharing of information and ideas, revision and testing. One-way report writers and dashboards cannot provide this needed functionality. Alternatively, building a one-off solution to solve a single problem, typically with spreadsheets, is a recurring cost each time it comes around.


I’m Getting Convinced About Hadoop, sort of

As I sometimes do, I went to Boulder last week to soak up some of Claudia and Dave Imhoff’s hospitality and to sit in on a BBBT (Boulder BI Brain Trust) briefing in person instead of remotely like most of us do. The company this particular week was Cloudera and I wanted to not only listen to their presentations, but participate in the Q&A give and take as well as have more intimate conversations at dinner the night before. Despite the fact it took eight hours to drive there from Santa Fe (but only six back), it was clearly worth the effort. I certainly enjoyed meeting all the Cloudera people who came, but since this article is about Hadoop, not Cloudera, I’ll skip the introductions.

A common refrain from any Hadoop vendor (the term vendor is a little misleading because the open source Hadoop is actually free) is that Hadoop, almost without qualification, is a superior architecture for analytics over its predecessor, the relational database management system (RDBMS) and its attendant tools, especially ETL (Extract/Transform/Load, more on that below). Their reasoning for this is that it is undeniably cheaper to load Hadoop clusters with gobs of data than it is to expand the size of a licensed enterprise relational database. This is across the board – server costs, RAM and disk storage. The economics are there, but only compelling when you overlook a few variables. Hadoop stores three copies of everything, data can’t be overwritten, only appended to, and most of the data coming into Hadoop is extremely pared down before it is actually used in analysis. A good analogy would be that I could spend 30 nights in a flophouse in the Tenderloin for what it would cost for one night at the Four Seasons.
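The "three copies of everything" point changes the cost arithmetic, and a short sketch shows how. The replication factor of three comes from the text; the data volume and the price per raw terabyte are invented placeholders, not quotes from any vendor.

```python
def raw_capacity_needed(logical_tb, replication_factor=3):
    """Raw disk needed when every block is stored `replication_factor` times."""
    return logical_tb * replication_factor

def cost_per_logical_tb(cost_per_raw_tb, replication_factor=3):
    """Effective cost of one terabyte of actual data after replication."""
    return cost_per_raw_tb * replication_factor

# Hypothetical inputs for illustration only.
logical_data_tb = 800
price_per_raw_tb = 1000.0  # made-up currency units per raw terabyte

print(raw_capacity_needed(logical_data_tb))   # raw terabytes to provision
print(cost_per_logical_tb(price_per_raw_tb))  # effective cost per terabyte of data
```

The cheap raw price still matters, but any comparison with a warehouse should be made on the effective cost per terabyte of data actually held, not on the sticker price of the disks.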

But I did say I was getting convinced about Hadoop, so be patient.

A constant refrain from the Hadoop world is that it is difficult and time-consuming to change a schema in a RDBMS, but Hadoop, with its “schema on read” concept, allows for instantaneous change as needed. Maybe not intentionally, but this is very misleading. What is hard to change in a RDBMS is making changes to an application such as a DW with upstream and downstream dependencies. I can make a change to a database in two seconds. I can add non-key attributes to a Data Warehouse dimension table in an instant. But changing a shared, vetted, secure application is, reasonably, not an instantaneous thing, which illustrates something about the nature of Hadoop applications – they are not typically shared applications. Often, they are not even applications, so this comparison makes no sense. Instead, it illustrates two very important qualities of RDBMS and Hadoop.
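Both halves of this argument can be shown in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse database and plain JSON lines as a stand-in for files read with a schema applied at query time. The table, field names and records are hypothetical.

```python
import json
import sqlite3

# Schema on write: the table definition is fixed before data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Acme')")

# Adding a non-key attribute to the dimension is a one-line change.
conn.execute("ALTER TABLE customer_dim ADD COLUMN segment TEXT")
dim_rows = conn.execute("SELECT * FROM customer_dim").fetchall()
print(dim_rows)  # existing rows simply carry NULL in the new column

# Schema on read: raw records are stored as-is and interpreted when queried.
raw_lines = [
    '{"customer": "Acme", "segment": "Enterprise"}',
    '{"customer": "Bolt"}',  # older record written before the field existed
]

def read_with_schema(lines, fields):
    """Apply a list of expected fields to raw records at read time."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(field) for field in fields)

read_rows = list(read_with_schema(raw_lines, ["customer", "segment"]))
print(read_rows)
```

In both cases the mechanical change is trivial. What takes time is everything the sketch leaves out: the reports, load jobs and downstream consumers that depend on the structure, and those exist in either world once the data is shared.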

One more item about this “hard to change” charge. Hadoop is composed of the file system, HDFS, and the programming framework, MapReduce. When Hadoop vendors talk about the flexibility and scalability of Hadoop, they are talking about this core. But today, the Hadoop ecosystem (and this is just the Apache open source stuff, there is an expanding soup of add-ons appearing every day) has more than 20 other modules in the Hadoop stack that make it useful. While I can do whatever I want with the core, once I build applications with these other modules there are just as many dependencies up and down the stack that need to be attended to when changing things as in a standard Data Warehouse environment.

But wait. Now we have the Stinger Initiative for Hive, Hadoop’s SQL-ish database, to make Hive 100x faster. This is accomplished by jettisoning MapReduce and replacing it with Tez, the next-generation MapReduce. According to Hortonworks, Tez is “better suited to SQL.” The Stinger initiative also includes the ORCFile format for better compression, and vectorizing Tez so that, unlike MapReduce, it can grab lots of records at once. And on top of it all, the crown jewel in any relational database, a Cost-Based Optimizer (CBO) which can only work with a, wait for this, schema! In fact, in the demo I saw today from Hortonworks, they were actually showing iterative SQL queries against, again, wait for it…a STAR SCHEMA! So what happened to schema on read? What happened to how awful RDBMS was compared to Hadoop? See where this is going? In order to sell Hadoop to the enterprise, they are making it work like a RDBMS.
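For readers who have not met one, a star schema is a central fact table joined to surrounding dimension tables. The sketch below uses Python's sqlite3 module purely as a stand-in (it is not Hive, Tez or anything shown in the demo), and the tables and figures are invented.

```python
import sqlite3

# A tiny star schema: one fact table, two dimension tables. All data invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date VALUES (20130101, 2013), (20140101, 2014);
INSERT INTO fact_sales VALUES
    (1, 20130101, 10.0),
    (1, 20140101, 15.0),
    (2, 20140101, 20.0),
    (1, 20140101, 5.0);
""")

# The classic star-join: aggregate facts, grouped by dimension attributes.
star_query = """
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.year
ORDER BY p.category, d.year
"""
star_result = conn.execute(star_query).fetchall()
print(star_result)
```

A query like this only makes sense when the tables, keys and column types are declared up front, which is exactly the kind of schema the "schema on read" pitch was supposed to make unnecessary.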

There are four kinds of RDBMS’s in the market today (and this is my market definition, no one else’s): 1) Enterprise Data Warehouse database systems designed from the ground up for data warehousing. As far as I’m concerned, there is only one that can handle massive volumes, huge mixed workloads, broad functionality and tens of thousands of users a day – Teradata 6xxx series; 2) RDBMS designed for transactional processing, but positioned for data warehousing too, just not as good at it, such as Oracle, DB2 and MSSQL; 3) Analytical databases, either sold as software-only or as appliances – IBM Netezza, HP Vertica, Teradata 2xxx series; 4) In-memory databases such as SAP HANA, Oracle TimesTen and a passel of others. Now we have a fifth – SQL-compliant (not completely) databases running on top of HDFS. There are more versions of these, too, such as Splice Machine, now in public beta, as well as Drill, Impala, Presto, Stinger, Hadapt and Spark/Shark to name a few (although Daniel Abadi of Hadapt has argued that “Structured” query language misses the point of Hadoop entirely – flexibility). Now Hadoop is sort of five.

So where are we going with this? Like Clinton in the 90’s, it’s clear Hadoop is moving to the center. Purist Hadoop will continue to exist, but market forces are driving it to a more palatable enterprise offering. Governance, security, managed workloads, interactive analysis. All of the things we have now except cheap platforms for greater volumes of data and massive concurrency.

I do wonder about one thing, though. The whole notion of just throwing more cheap resources at it has to have a point of diminishing returns. When will we get to the point that Hadoop is working 100x or 1000x more resources than would be needed in a careful architecture? Think about this. If we morph Hadoop into just a newer analytical database platform, sooner or later someone is going to wonder why we have 3 petabytes of drives and only 800 terabytes of data. In fact, how much duplication is in that data? How much wasted space? Drives may be cheap, but even a thousand cheap drives cost something, especially when they’re only 20% utilized.

Hadoop was invented for indexing search and other internet-related activities, not enterprise software. Its promotion to all forms of analytics is curious. Where did anyone prove that its architecture was right for everything, or did the hype just get sold on being cheap? And what is the TCO over time vs a DW?

And when Hadoop vendors say, “Most of our customers are building Enterprise Data Hubs or (a terrible term) Data Lakes next to their EDW because they are complementary,” it begs the question: for analytics in typical organizations, what exactly is complementary? That’s when we hear about sensors and machine-generated data and the social networks. How universal are those needs?

Then there is ETL. Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop? They need to be reminded that writing code is not quite the same as an ETL tool with versioning, collaboration, reuse, metadata and lots of existing transforms built in. It’s also a little contradictory. If Hadoop is for completely flexible and novel analysis, who is going to write ETL code for every project? Now there is a real latency: only five minutes to crunch the data and 30 days to write the ETL code.

They talk about using Hadoop as an archive to get old data out of a data warehouse, but they fail to mention that that data is unusable without the context that still remains in the DW; nor will it be usable in the DW later after the schema evolves. So what they really mean is use Hadoop as a dump for data you’ll never use but can’t stand to delete, because if you don’t need it in the DW, why do you need it at all?

Despite all this, it’s a tsunami. The horse has left the stable. The train has left the station. Hadoop will grow and expand and probably not even be recognizable as the original Hadoop in a few years, and it will replace the RDBMS as the platform of choice for enterprise applications (even if the bulk of its applications will be SQL-based). I guarantee it. So get on top of it or get out of the way.


Metrics Can Lead in the Wrong Direction

Is it really possible to use measurement — or “metrics,” in the current parlance — to drive an organization? There are two points of view, one widely accepted and current, the other opposing and more abstract.

The conventional wisdom on performance management is that our technology is perfectly capable of providing detailed, current and relevant performance information to stakeholders in an enterprise, including executives, managers, functional people, customers, vendors and regulators. Because we are blessed with abundant computing resources, connectivity, bandwidth and even standards, it is possible to present this information in cognitively effective ways (dashboards and visualization, for example). Recipients are able to receive the information in the manner in which they choose, and the whole process pays dividends by supporting the notion that “If you can’t measure it, you can’t manage it.” It is hard to imagine how anyone could manage a large undertaking without measurement, isn’t it? And most presentations I’ve heard quickly stress that measurement is only part of the solution.

The first step is knowing what to measure; then measuring it accurately; then finding a way to disseminate the information for maximum impact (figuring out how to keep it current and relevant); and then being able to actually do something about the results. A different way of saying this is that technology is never a solution to social problems, and interactions between human beings are inherently social. This is why performance management is a very complex discipline, not just the implementation of dashboard or scorecard technology. Luckily, the business community seems to be plugged into this concept in a way they never were in the old context of business intelligence. In this new context, organizations understand that measurement tools only imply remediation and that business intelligence is most often applied merely to inform people, not to catalyze change. In practice, such undertakings almost always lack a change-management methodology or portfolio.

But there is an argument against measurement, too. Unlike machines or chemical reactions in a beaker, human beings are aware that they are being measured. In the realm of physics, Heisenberg’s Uncertainty Principle demonstrates that the act of measurement itself can very often distort the phenomena one is attempting to measure. When it comes to sub-atomic particles, we can pretty much assume it is a physical law that underlies this behavior. With people, the unseen subtext is clearly conscious. People find the most ingenious ways to distort measurement systems to generate the numbers that are desired. Thus, the effort to measure can not only discourage desired behavior; it can promote dysfunctional behavior. There are excellent, documented examples of this phenomenon in Measuring and Managing Performance in Organizations by Robert D. Austin. The author’s contention is that measurement of people always introduces distortion and often brings dysfunction because measurement is never more than a proxy or an approximation of the real phenomena.

In a particularly colorful analogy, Austin writes:

“Kaplan and Norton’s cockpit analogy would be accurate if it included a multitude of tiny gremlins controlling wing flaps, fuel flow, and so on of a plane being buffeted by winds and generally struggling against nature, but with the gremlins always controlling information flow back to the cockpit instruments, for fear that the pilot might find gremlin replacements. It would not be surprising if airplanes guided this way occasionally flew into mountains when they seemed to be progressing smoothly toward their destinations.”

We all know that incomplete proxies are too easy to exploit in the same way that inadequate software with programming gaps beckons unscrupulous hackers. However, one doesn’t have to be malicious to subvert a measurement system. After all, voluntary compliance with the tax code encourages a national obsession with “loopholes.” And what salesperson hasn’t “sandbagged” a few deals for the next quarter after meeting the quota for the current one?

The solution is not to discard measurement but rather to be conscious of this tendency and to be vigilant and thorough in the design of measurement systems. We all have a tendency toward simplifying things; but in some cases, it appears better to not measure at all than to produce something inadequate. Performance management, to achieve its goals, has to be applied effectively, which is to say, with superior execution of technology, implementation and management. It has to be designed to be responsive to both incremental and unpredicted changes in the organization and the environment. There are no road maps for this. This is truly the first time that analytical and measurement technology can be embedded in day-to-day, instantaneous decision-making and tracking; and the industry is sorely lacking in skills and experience to pull it off. Those organizations that have been successful so far have relied on existing methodologies (activity-based costing or balanced scorecard, for example) to guide them through the more uncertain steps of metric formulation and change management to close the loop.

The question of whether you can ever adequately measure an organization is still open. To the extent that there are statutory and regulatory requirements, such as taxation, SEC or specific industry regulations, the answer is clearly yes. But those measurements are dictated. To measure performance after the fact, at aggregated levels, is only useful to a point. The closer and closer a measurement system gets to the actual events and actions that drive the higher-level numbers, the less reliable the cause-effect relationship becomes, just like Heisenberg found so long ago. There are many examples in the management literature of everyone “doing the right thing” while the wheels are coming off the organization.

Recommended Reading:

Measuring and Managing Performance in Organizations, by Robert D. Austin (New York: Dorset House Publishing, 1996)
In the Age of the Smart Machine: The Future of Work and Power, by Shoshana Zuboff (New York: Basic Books, 1988)
“The New Productivity Challenge,” by Peter Drucker, Harvard Business Review (Nov.-Dec. 1991): p. 70.


A Bit About Storytelling

My take on storytelling

1. Must be a “story” with a beginning, middle and end that is relevant to the listeners.
2. Must be highly compressed.
3. Must have a hero – the story must be about a person who accomplished something notable or noteworthy.
4. Must include a surprising element – the story should shock the listener out of their complacency. It should shake up their model of reality.
5. Must stimulate an “of course!” reaction – once the surprise is delivered, the listener should see the obvious path to the future.
6. Must embody the change process desired, be relatively recent and “pretty much” true.
7. Must have a happy ending.

In Stephen Denning’s words, “When a springboard story does its job, the listeners’ minds race ahead, to imagine the further implications of elaborating the same idea in different contexts, more intimately known to the listeners. In this way, through extrapolation from the narrative, the re-creation of the change idea can be successfully brought to birth, with the concept of it planted in listeners’ minds, not as a vague, abstract inert thing, but an idea that is pulsing, kicking, breathing, exciting – and alive.”

That may be a little too much excitement on a daily basis, something you save for the really important things, but it matters nonetheless that turning data into a story is a valid and necessary skill. But is it for everyone?

Not really. Actual storytelling is a craft. Not everyone knows how to do it or can even learn it. But everyone can tell a story; it just may not rise to the caliber of crafted storytelling. To get a point across and have it stick (even if it’s just in your own mind, not with an audience), learn to apply metaphor.

More on metaphor later.


Understanding Analytical Types and Needs

By Neil Raden, January, 2013

Purpose and Intent

“Analytics” is a critical component of enterprise architecture capabilities, though most organizations have only recently begun to develop experience using quantitative methods. As Information Technology emerges from a scarcity-based mentality of constrained and costly resources to a commodity consumption model of data, processors and tools, analytics is quickly becoming table stakes for competition.

This report is the first of a two-part series. (Part II will cover analytic functionality and matching the right technology to the proper analytic tools and best practices.) It discusses the importance of understanding the role of analytics, why it is a difficult topic for many, and what actions you should take. It will explore the various meanings of analytics and provide a framework for aligning the various types of analytics with the associated roles and skill sets needed.

Executive Summary

Using quantitative methods is rapidly becoming not an option for competitive advantage but, at the very least, barely enough to keep up. Everyone needs to understand what’s involved in analytics, what your particular organization needs and how to do it.

Few people are comfortable with the concepts of advanced analytic methods. In fact, most people cannot explain the difference between a mean, a median and a sample mean. The misapplication of statistics is widespread, but today’s explosion of data sources and intriguing technologies to deal with them have changed the calculus. Embedded quantitative methods may relieve analysts of the actual construction of predictive models, but applying those models correctly requires understanding the different analytical types, roles and skills.
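To make the distinction between a mean, a median and a sample mean concrete, here is a minimal Python sketch. The order values are made up for illustration (they are an assumption, not data from this report), and the fixed random seed is only there so the run is repeatable.

```python
import random
import statistics

# Hypothetical population: ten order values, one of them a large outlier.
population = [20, 22, 25, 25, 27, 30, 31, 33, 35, 500]

# Mean of the whole population: pulled sharply upward by the outlier.
population_mean = statistics.mean(population)

# Median: the middle of the sorted values, barely affected by the outlier.
population_median = statistics.median(population)

# Sample mean: the mean of a random subset. It changes from sample to
# sample and is only an estimate of the population mean.
rng = random.Random(42)  # fixed seed (an arbitrary choice) for repeatability
sample = rng.sample(population, 4)
sample_mean = statistics.mean(sample)

print("population mean:  ", population_mean)
print("population median:", population_median)
print("sample mean:      ", sample_mean)
```

The gap between the population mean and the median is the kind of thing that gets lost when a single “average” is reported without saying which one it is.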

Analytics in the Enterprise

The emphasis of analytics is changing from one of long-range planning based on historical data, to dynamic and adaptive response based on timely information from multiple contexts, augmented and interpreted through various degrees of quantitative analysis. Analytics now permeates every aspect of leading organizations’ operations. Competitive, technological and economic factors combine to require more precision and less lag time in discovery and decision-making.

For example, operational processing, the orchestration of business processes and secure capture of transactional data is merging with analytical processing, the gathering and processing of data for reporting and analysis. Analytics in commercial organizations has historically been limited to special groups working more or less off-line. Platforms for transaction processing were separated for performance and security reasons, an effect of “managing from scarcity.” But scarcity is not the issue anymore as the relative cost of computing has plummeted. Driven equally by technology and competition, operational systems are either absorbing or at least cooperating with analytical processes. This convergence elevates the visibility of all forms of analytics.

Confusion and mistakes in deploying analytics are common due to imprecise understanding of the various forms and types. Uncertainty about the staff and skills needed for various “types” of analytics is common. Messaging from technology vendors, service providers and analysts is murky and misleading, sometimes deliberately so.

The urgency behind implementing an analytics program, however, may be driven less by getting a leg up than by not falling behind.

Analytics and the Red Queen Effect

Analytics are crucial because the barriers to getting started are lower than ever. Everyone can engage in analytics now, of one type or another. As analytic capabilities increase across competitors, everyone must step up – it’s a Red Queen[i] effect.  When everyone was shooting from the hip, efficiency was a matter of degree. If everyone used crude models and unreliable data, then everyone should, more or less, work within the same margin of error. What separated competitors was good strategy and good execution. But now that everyone can employ quantitative methods and techniques like Naive Bayes, C4.5 and support vector machines, it will still be the strategy and execution that count. Companies must improve just to stay in place.  Each new level of analytics becomes the “table stakes” for the next.

Can You Compete on Analytics? Analytics Are Necessary – but Not Sufficient

Statistical methods using software have been shown to be useful in many aspects of an organization, such as fraud detection, demand forecasting and inventory management, but just using analytics has not been shown to necessarily improve the fortunes or effectiveness of the overall organization. In 2007, Davenport and Harris released their influential book[ii], Competing on Analytics, which described how a dozen or so companies used “analytics” to not only advise decision-makers, but to play a major role in the development of strategy and implementation of business initiatives. The book found a huge following and was a bestseller on the business book lists. It certainly placed the word “analytics” in the top of the mind of many decision makers. However, when comparing the fortunes of the twelve companies highlighted in the book, their performance in the stock market is less than spectacular as illustrated in Figure 2:

This scenario is often repeated – good work is performed inside an organization, but the benefits of the discipline do not permeate other parts of the business and, hence, have little effect on the organization as a whole. In another example, statistical methods have been used in the U.S. in agriculture for decades, and yields have improved dramatically, but the quality of the food supply has clearly degraded along with the fortunes of individual farmers.

Too many organizations, despite good intentions, do not see dramatic improvement in their fortunes after adopting wider-based analytical methods because:

First, rarely does one thing change a company. Analytics are a powerful tool, but it takes execution to realize the benefits. Perhaps if good analytical technique had been applied across the board along with a clear strategy to drive decisions based on quantitative models, better results may have followed. Instead, as is often the case, a visible project shows great promise and early results, but the follow through is wanting.

Data mining tools can actually be predictive, showing what is likely to happen or not happen. But what is often misunderstood is that data mining tools are usually poor at specifying when things will happen. In this case, too much faith is placed in the models, imbuing them with fortune-telling capabilities they simply lack. The correct approach is to test, run proofs of concept, and once in production engage in continuous improvement through mechanisms like champion/challenger and A/B testing.
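As a rough illustration of the champion/challenger idea, the sketch below compares the conversion rates of a current model (the champion) and a candidate model (the challenger) with a standard two-proportion z-test. The counts, the 0.05 significance level and the function name are assumptions chosen for the example; a real program would also decide sample sizes in advance and guard against the degenerate case where the pooled rate is 0 or 1.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both models perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: A is the champion, B is the challenger.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)

# Promote the challenger only if it is better and the gap is unlikely
# to be noise at the (assumed) 5% significance level.
promote_challenger = z > 0 and p < 0.05
print(f"z={z:.2f}, p={p:.4f}, promote challenger: {promote_challenger}")
```

Running the comparison continuously, rather than once, is what turns a one-off test into the continuous improvement described above.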

Most of the companies try to understand customer behavior – which you can do with data mining – but data mining rarely captures the randomness of people’s behavior, and that leads to overconfidence in the models. Given that a customer is likely to purchase a car, when is the correct time to reach out? Perhaps right away, perhaps not. Data mining tools are not very good at deriving individual propensities from behavior, because human behavior is random. It is pretty common for inexperienced modelers to put too much faith in model results. The solution is to engage experienced talent to get a program started on the right track.

Return on investment in analytics is difficult to measure because there isn’t often a straight line from the model to results. Other parts of the organization contribute. An analytical process can inform decisions, either human or machine-driven, but the execution of those decisions is beyond the reach of an analytical system. People and process have to perform too. In addition, a successful analytical program can be the result of a well-defined strategy. Positive results from analytics would not have been possible without the formation of that strategy.

Professionals skilled in statistics, data mining, predictive modeling and optimization have been a part of many organizations for some time, but their contribution, and even an awareness of what they do, is sometimes poorly understood – and filled with many impediments to success.  By categorizing analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques (the business applications that they support are detailed in Part II of the series), companies can begin to understand when and how to use analytics effectively and deploy their analytic resources to achieve better results.

The Four Types of Analytics

There are related and unrelated disciplines that are all combined under the term analytics. There is advanced analytics, descriptive analytics, predictive analytics and business analytics, all defined in a pretty murky way. It cries out for some precision. What follows is a way to characterize the many types of analytics by the quantitative techniques used and the level of skill of the practitioners who use these techniques.

Figure 4: The Four Types of Analytics

| Type | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles |
|---|---|---|---|
| Type I | Quantitative Research | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Often employed in business or government for very specialized roles |
| Type II | Data Scientist or Quantitative | Advanced Math/Stat, not necessarily PhD | Internal expert in statistical and mathematical modeling and development, with solid business domain knowledge |
| Type III | Operational Analytics | Good business domain, background in statistics optional | Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation |
| Type IV | Business Intelligence/Discovery | Data and numbers oriented, but no special advanced statistical skills | Reporting, dashboard, OLAP and visualization use, possibly design. Performing posterior analysis of results driven by quantitative methods |

Type I Analytics: Quantitative Research

The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. Quantitative Research analytics are performed by mathematicians, statisticians and other pure quantitative scientists. They discover new ideas and concepts in mathematical terms and develop new algorithms with names like Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, Machine Learning and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions (though not exclusively).  Commercial, governmental and other organizations (Google or Wall Street for example) employ staff with these very advanced skills; but in general, most organizations are able to conduct their necessary analytics without them, or employ the results of their research. An obvious example is the FICO score, developed by Quantitative Research experts at FICO (Formerly Fair Isaac) but employed widely in credit-granting institutions and even human resource organizations.

Type II Analytics: “Data Scientists”

More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, and even the heavy “quants” in industry who apply these methods specifically to the work they do like fraud detection, failure analysis, propensity to consume models, among hundreds of other examples. They operate in much the same way as commercial software companies but for just one customer (though they often start their own software companies too). The popular term for this role is “data scientist.”

“Heavy” Data Scientists. The Type II category could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function – providing guidance and expertise in the application of quantitative analysis – they are differentiated by the sophistication of the techniques applied. II-A practitioners understand the mathematics behind the analytics and may apply very complex tools such as Kucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by a small number of practitioners. What differentiates the Type II-A from Type I is not necessarily the depth of knowledge they have about the formal methods of analytics (it is not uncommon for Type II’s to have a PhD for example), it is that they also possess the business domain knowledge they apply and their goal is to develop specific models for the enterprise, not for the general case as Type I’s usually do.

“Light” Data Scientists. Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, CHAID and various forms of linear regression. They approach the problems they deal with using more conventional best practices and/or packaged analytical solutions from third parties.

Data Scientist Confusion. “Data Scientist” is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from those businesses like Google or Facebook where the data is the business; so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business as, say, actuaries in the insurance business, whose extensive training should be a model for the newly designated data scientists (see our blog at: “What is a Data Scientist and What Isn’t”)

Though the point is not universally accepted, data scientists must be able to communicate their work effectively to non-technical people. This is a major discriminator between a data scientist and a statistician. It is absolutely essential that someone in the analytics process have the role of chief communicator, someone who is comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand. Companies often fail to see that there is almost never anything to be gained by trying to put a PhD statistician into the role of managing a group of analysts and developers. It is safe to say that this role is represented more by a collaborative group of professionals than by a single individual.

Type III Analytics: Operational Analytics

Historically, this is the part of analytics we’re most familiar with. For example, a data scientist may develop a scoring model for his/her company. In Type III activity, the operational analytics expert chooses parameters and inputs them into the model, generating the scores calculated by the Type II models, which are embedded into an operational system that, say, generates offers for credit cards. Models developed by data scientists can be applied and embedded in an almost infinite number of ways today. Applying Type II models to real work is the realm of operational analysts. In very complex applications, real-time data can be streamed into applications based on Type II models, with outcomes derived instantaneously through decision-making tools such as rules engines.
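The division of labor in the credit card example can be sketched in a few lines of Python. Everything here is hypothetical: the two applicant features, the coefficients standing in for a model built by a Type II data scientist, and the cutoff chosen by the Type III analyst are invented for illustration and do not describe any real scoring model.

```python
import math
from dataclasses import dataclass

@dataclass
class Applicant:
    income: float        # annual income in dollars (hypothetical feature)
    utilization: float   # share of existing credit in use, 0.0 to 1.0

# Coefficients stand in for a model delivered by a Type II data scientist.
# The values are invented for illustration.
INTERCEPT = -1.0
W_INCOME = 0.00003
W_UTILIZATION = -2.5

def score(applicant: Applicant) -> float:
    """Logistic score between 0 and 1; higher means a better prospect."""
    linear = (INTERCEPT
              + W_INCOME * applicant.income
              + W_UTILIZATION * applicant.utilization)
    return 1.0 / (1.0 + math.exp(-linear))

def decide_offer(applicant: Applicant, cutoff: float = 0.5) -> str:
    """Operational rule: the Type III analyst owns the cutoff parameter."""
    return "send_offer" if score(applicant) >= cutoff else "no_offer"

print(decide_offer(Applicant(income=90000, utilization=0.2)))
print(decide_offer(Applicant(income=30000, utilization=0.9)))
```

The point of the split is that the cutoff can be tuned by operations staff without anyone reopening the mathematics inside `score`.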

Packaged applications that embed quantitative methods such as predictive modeling or optimizations are also Type III in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of “black box.” As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.

Decision-making systems that rely on quantitative methods their operators do not understand well can lead to trouble. They must be carefully designed (and improved) to avoid burdening recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining, that generating “interesting” results without understanding what was relevant usually led to flagging interest in the technology. In today’s business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine those messages to those that are the most relevant and useful.

False negatives are quite a bit more problematic than such irrelevant alerts, as they can let transactions pass through that should not have. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.
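A small sketch shows how the two kinds of error can be counted separately, so that missed problems are not hidden behind an overall accuracy figure. The eight labeled transactions are hypothetical, and the function name is chosen only for this example.

```python
def confusion_counts(actual, flagged):
    """Count outcomes for paired lists of booleans (True = problem transaction)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for is_bad, was_flagged in zip(actual, flagged):
        if is_bad and was_flagged:
            counts["tp"] += 1   # problem caught
        elif not is_bad and was_flagged:
            counts["fp"] += 1   # false alarm: wastes a reviewer's time
        elif is_bad and not was_flagged:
            counts["fn"] += 1   # missed problem: passes through unchecked
        else:
            counts["tn"] += 1   # clean transaction left alone
    return counts

# Hypothetical labels for eight transactions.
actual = [True, True, True, False, False, False, False, False]
flagged = [True, False, False, True, False, False, False, False]

counts = confusion_counts(actual, flagged)
# False negative rate: share of real problems the system failed to flag.
miss_rate = counts["fn"] / (counts["fn"] + counts["tp"])
print(counts, f"miss rate: {miss_rate:.2f}")
```

In this made-up case the system is right on five of eight transactions, yet it misses two of the three real problems, which is the number that matters most.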

Type IV Analytics: Business Intelligence & Discovery

Type III analytics aren’t of much value if their application in real business situations cannot be evaluated for their effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. This includes almost any activity that reviews information to understand what happened or how something performed, or to scan and free associate what patterns appear from analysis. The mathematics involved is simple. But pulling the right information – and understanding what information means – is still an art and requires both business sense and knowledge about sources and uses of the data.

Know Your Needs First

The scope of analytics is vast, ranging from the familiar features of business intelligence to the arcane and mysterious world of applied mathematics. Organizations need to be clear on their objectives and capabilities before funding and staffing an analytic program. Predictive modeling to dramatically improve your results makes for good reading, but the reality is quite different. The four types are meant to help you understand where you can begin or advance.

These categories are not hard and fast. Some activities are clearly a blend of various types. But the point is to add some clarity to the term “analytics” in order to understand its various use cases. Tom Davenport, for example, advocated creating a cadre of “PhDs with personality” in order to become an analytically competitive organization. That is one approach. Implementing analytics as part of other enterprise software you already have – or purchasing a specialized application that is already used and vetted in your industry – is a better place to start.


Clear terminology can avoid confusion, not just within your organization but also in communication with vendors and service providers. To get the most out of analytics:

  • Be clear about what you need.  Having clarity on the meaning of analytics has clear benefits. Because the nature of analytics is a little mysterious to most people, a vendor statement that they provide “embedded predictive analytics” can no longer be taken at face value. You should look closely to see if those capabilities line up with your needs.
  • Don’t assume high value means high resource costs. In the same vein, you needn’t hesitate to begin analytical projects because you believe you need to source a dozen PhDs, when in fact, your needs are in the Type II category.
  • Formulate specific vendor questions based on what level of sophistication and resources you need. By more clearly specifying what type of analytics you need, it becomes very easy to ask: Is this tool designed to discover and create predictive models, or to deploy them from other sources? Do you offer training in quantitative methods or only in the use of your product? Is the tool designed for authoring scoring models or just using scored values?
  • Use analytic knowledge to start to prepare for Big Data.  Understanding what type of analytics – and results – you need will even help you in your soon-to-be-serious consideration of Big Data solutions, including Hadoop, its variants and its competitors, all of which use variants of the above techniques to process large quantities of information.

Analytics is a catchall phrase, but understanding the various uses and types should help in implementing the right approach for accomplishing the tasks at hand.  It should also help in discerning what is meant when the term is used, as almost anything can be called analytics.

Next Steps

Part II of this series will examine in depth the forms that analytics take in the organization and the business purposes it serves, and demonstrate through examples and case studies how analytics of all types are successfully employed. But analytics are a step in the process. Without effective decision-making practices the value in analytics is lost. Part III of this series will deal with decision making and decision management.

Author Bio: Neil Raden

Analyst, Consultant and Author in Analytics and Decision Science

Neil Raden is the founder and Principal Analyst at Hired Brains Research, a provider of consulting and implementation services in business intelligence, analytics and decision management. Hired Brains focuses on the needs of organizations and capabilities of technology. He began his career as a property and casualty actuary with AIG in New York before moving into predictive analytics services, software engineering and systems integration, with experience in delivering environments for decision making in fields as diverse as health care, nuclear waste management and cosmetics marketing, and many others in between.


[i] The Red Queen is a concept from evolutionary biology first used in Matt Ridley, The Red Queen: Sex and the Evolution of Human Nature, (New York: Macmillan Publishing Co, 1994).  The allusion is to the Red Queen in Lewis Carroll’s Through the Looking-Glass, who had to keep running just to stay in place.

[ii] Davenport, Harris, et al, “Competing on Analytics: The New Science of Winning,” New York, Harvard Business Press, 2007.


When Are Decisions Driven by Analytics, or Merely Informed by Them?


Boy did Julie Hunt ever hit the nail on the head: 

“But – real-time decision-making also has to be vetted with domain knowledge, human experience and common sense, to validate the viability of analytics results. Decisions make a positive difference for the enterprise only if they are based on accurate intelligence. While many things are possible with predictive analytics, there is always the danger of trying to force ‘reality’ to fit the model. This can be deadly to real-time operational decision-making.”

When it comes to decisions that can be made via models, you have to separate them into two categories: those that do not require 100% precision, and those that are too important to get wrong. 

For example, routing a call center call, approving a credit line increase, rating a car insurance premium – these are all “decisions” that are made in high volume, but getting some of them wrong, in the aggregate, causes little harm. Obviously, the closer you get to perfect performance the better, but you can allow these decisions to be made without human interference. Obviously you track the result and continuously improve the models.

On the other hand, many decisions in an enterprise are too important to turn over to some algorithms. In these cases, the quantitative analysis can be a part of the decision process, but ultimately the decision vests with the person or persons who take responsibility for it. In point of fact, very few managers are comfortable with answers based on probability. The difference between 80% probability and 95% probability simply doesn’t resonate. For important decisions, managers want one answer, and that requires discussion and consensus. 
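One way to picture the two categories is a simple routing rule that sends each kind of decision either to the model or to people. This is only a sketch: the dollar figures, the ceiling and the third example decision are hypothetical, and in practice the line between the categories is a management judgment rather than a single number.

```python
from dataclasses import dataclass

@dataclass
class DecisionType:
    name: str
    cost_of_one_error: float   # hypothetical dollar cost of one wrong call

# Hypothetical ceiling on the harm the organization will accept from a
# single automated mistake.
AUTOMATION_CEILING = 1_000.0

def route(decision: DecisionType) -> str:
    """Return who decides: the model alone, or people informed by the model."""
    if decision.cost_of_one_error <= AUTOMATION_CEILING:
        return "model decides; track results and improve the model"
    return "people decide; model output is one input"

for d in (DecisionType("route a call-center call", 5.0),
          DecisionType("approve a credit line increase", 400.0),
          DecisionType("enter a new market", 5_000_000.0)):
    print(f"{d.name}: {route(d)}")
```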

We have to be very careful not to over-promise on analytics. 


Personalized Medicine World Conference

This is the fourth or fifth year for this conference, and each year there are some surprises. The first couple of years it was a diverse collection of researchers, entrepreneurs and vendors (Oracle, Deloitte, etc.). The number of exhibitors seems about the same as last year, but there was a small booth for SAP HANA, which was sort of a surprise, but I learned they are aggressively going after the life sciences sector and Hasso Plattner was a keynote speaker. That’s a pretty good sign this conference is getting pretty commercial.

Like any conference, a few stars emerge and become familiar and repeat presenters. Atul Butte is one example. Here is his bio:

Atul Butte, MD, PhD is Chief of the Division of Systems Medicine and Associate Professor of Pediatrics, Medicine, and by courtesy, Computer Science, at Stanford University and Lucile Packard Children’s Hospital. Dr. Butte trained in Computer Science at Brown University, worked as a software engineer at Apple and Microsoft, received his MD at Brown University, trained in Pediatrics and Pediatric Endocrinology at Children’s Hospital Boston, then received his PhD in Health Sciences and Technology from Harvard Medical School and MIT. Dr. Butte has authored more than 100 publications and delivered more than 120 invited presentations in personalized and systems medicine, biomedical informatics, and molecular diabetes, including 20 at the National Institutes of Health or NIH-related meetings.

Dr. Butte’s presentation covered how bioinformatics tools applied to big public data have yielded new uses for drugs and new prototype drugs and diagnostics for type 2 diabetes. It was an interesting discussion of what we call big data analytics, but in the end, it just came back to making more drugs.

When you attend medical conferences, speakers always have these extensive pedigrees, but what I wonder is, with all of the esteem, what sort of doctors are they? Are they too distanced from day-to-day clinical work to see the problems and possibilities? Are their decisions made within a bubble that excludes consideration of alternatives? That is the sense I get listening to them. 

A common term used by many of the speakers was “omics.” First we had genomics, then epigenomics followed by proteomics or metabolomics. All of these areas combine both bench science and informatics on a huge scale. The hope is that the digital examination of these minute measurements can lead to cures for diabetes, cancer, heart disease and Alzheimer’s.

Michael Snyder, Ph.D., Professor & Chair, Stanford Center of Genomics & Personalized Medicine, gave a notable and introspective presentation about the use of a combination of omics methods to assess health states in a single individual (himself) over the course of almost three years. Genome sequencing was used to determine disease risk. Longitudinal personal profiling of transcriptome, proteome and metabolome was used to monitor disease, including viral infections and the onset of diabetes. His premise is that these approaches can transform personalized medicine. It was discovered that he carried genes for diabetes, and he in fact developed it during the period, but, in my opinion, he failed to see the causal effect of poor sleep from repeated respiratory infections that corresponded with the spike in blood sugar. In some ways, it seems these brilliant scientists just don’t see the forest for the trees, and that hurts us.


Steven C Quay, M.D., Ph.D., FCAP, Founder, Atossa Genetics, Inc., pitched his own company, which is devoted to obtaining routine, repeated, “painless” breast biopsy samples non-invasively for cytopathology, NGS, proteome, and transcriptome analysis of precursors to breast cancer; to the use of breast specimens obtained non-invasively for biomarker discovery, clinical trial support, and patient selection, and to inform personalized medical therapy; and to cancer prevention using intraductal treatment of reversible hyperplastic lesions.

Two problems with his presentation. NO ONE KNOWS HOW TO PREVENT BREAST CANCER. Also, the “painless” techniques are almost medieval. If you don’t believe me, look up “ductal lavage” and let me know if you’d want to submit to that repeatedly.

The problem is no one ever seems to use the word cure, or to speculate why these diseases exist at all. All of the research presented seems to end with the following refrain: “Hopefully leading to the development of new drugs…” Well, follow the money. 

I don’t know if I’ll go next year. 



Posted in Big Data, Decision Management, Genomics, Medicine, Research

Forked SQL: Informatica Gets It

By Neil Raden

About fifteen years ago, MicroStrategy cofounder Sanju Bansal told me, “SQL is the best hope for leveraging the latent value from databases.” Fifteen years later, it’s extraordinary how correct Bansal was. MicroStrategy is still a robust, free-standing Business Intelligence company while most of the proprietary multidimensional databases have disappeared. But what about the next fifteen years?

At least for business analytics, SQL is under attack. So much so that there is an entire emergent market segment called NoSQL. If SQL itself is under siege, what about the myriad technologies that in one way or another are part of the SQL ecosystem, like Informatica? Are they obsolete? Will we need to throw the baby out with the bathwater?

This whole dustup has been brewing for a decade or more. Ever since we started using computers in business 60+ years ago, the big machines have been managed by a separate group of people with specialized skills, now generically referred to as IT. Though separate and decidedly non-businesslike, IT eventually became bureaucratic, fixed in its mission to control everything data. Even SQL was adopted very slowly, but it is now solidly the tool of choice for most applications.

About ten years ago, though, a renegade group of people I called “the pony-tail-haired guys” (PTH for short) appeared on the scene with their externally-focused web sites and, gradually, their own development tools, methods and monitoring software. At first, IT paid no attention to them because they didn’t interfere with the inwardly-focused enterprise computing environment, but as the perceived value of “e-business” developed, a great deal of friction and turf warfare erupted. Computing bifurcated inside the firewall.

The PTH preferred web-oriented tools, open source software and search. That’s why Big Data/Hadoop and NoSQL are so divergent from enterprise computing. Different people, different applications, different brains.

But when it comes to Business Analytics, the lines are not so clear. The PTH found, just as enterprise people did (reluctantly), that analytics is key to everything else. But while enterprise apps measure sales and revenue, the PTH guys are looking at really strange things like sentiment analysis. Today, I can download a Fortune 500 general ledger to my watch, but things like sentiment analysis look at hundreds of thousands of times more data. Loading that data into a relational database and analyzing it with a language designed for set operations and transactions just doesn’t work. So is there a justification for something other than SQL for this? Of course there is.

But the PTH guys, now that they’ve grown up a little, are starting to show the same sclerotic tendencies as their IT colleagues, assuming that their tools and methodologies are the ONLY ones for analytics and that SQL needs to be put in the dustbin. That’s just silly. Existing data warehouse, data integration, and data analysis and presentation tools are well-suited to lots of tasks that aren’t going away any time soon, though methodologies and implementations are in dire need of renovation, something I call a Surround Strategy as opposed to the ridiculous, outdated idea of the Single Version of the Truth.

Here is where Informatica enters the picture. With all the breathless enthusiasm over Big Data, one thing is often lost in the rush: no fundamental laws of physics concerning data integration have been altered. This places Informatica squarely in the middle of every trend grabbing headlines today: big data, social analytics, virtualization and cloud.

Any type of reporting or analysis, whether through traditional ETL and data warehousing, Hadoop, Complex Event Processing or even Master Data Management, deals with “used data,” meaning data created through some original process. All used data has one attribute in common – it doesn’t like to play with other used data. Data extracted from a primary source typically has hidden semantics and rules in the application logic so that, on its own, it often doesn’t make sense.

This has been the most difficult part of assembling useful data for analysis historically, and with the entrance of mountains of non-enterprise data, the problem has only grown larger.
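To make the “hidden semantics” point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the two source systems, the status codes and the field names are assumptions, not anything taken from Informatica or a real application. It only shows why two extracts that describe the same thing can’t be combined until someone recovers the rules that lived in the application logic.

```python
# Hypothetical example: the same "status" field extracted from two source
# systems. Neither code set means anything without the rules that lived in
# each application's logic. All names and codes here are invented.

ORDERS_FROM_ERP = [
    {"order_id": 1001, "status": "3", "amount": 250.0},
    {"order_id": 1002, "status": "9", "amount": 75.5},
]
ORDERS_FROM_WEB = [
    {"order_id": "W-17", "status": "SHP", "amount": "120.00"},
    {"order_id": "W-18", "status": "CXL", "amount": "40.00"},
]

# The "hidden semantics": mappings that originally existed only in code.
ERP_STATUS = {"3": "shipped", "9": "cancelled"}
WEB_STATUS = {"SHP": "shipped", "CXL": "cancelled"}


def normalize(rows, status_map, source):
    """Translate source-specific codes and types into one shared shape."""
    return [
        {
            "source": source,
            "order_id": str(row["order_id"]),
            "status": status_map[row["status"]],
            "amount": float(row["amount"]),
        }
        for row in rows
    ]


def shipped_revenue(rows):
    """Sum the amounts of shipped orders across all sources."""
    return sum(r["amount"] for r in rows if r["status"] == "shipped")


combined = normalize(ORDERS_FROM_ERP, ERP_STATUS, "erp") + normalize(
    ORDERS_FROM_WEB, WEB_STATUS, "web"
)
total = shipped_revenue(combined)
```

Without the two mapping tables, a status of “3” and a status of “SHP” are just unrelated strings, and the amounts don’t even share a type. The integration work is mostly the recovery of those mappings, not the arithmetic at the end.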

Luckily for the fortunes of Informatica, this is a big opportunity and they have stepped up.

For in-depth descriptions of the following Big Data, cloud, Hadoop and SaaS innovations coming from Informatica, refer to their materials. Briefly, they include:

HParser, a facility to visually prepare very large datasets for processing in Hadoop, saving developers and analysts a substantial amount of time on what is otherwise mostly hand coding.

Complex Event Processing (CEP), which isn’t new for Informatica. Given the expanded complexity of data integration in the Big Data era, however, they have incorporated CEP into their own platform to detect and report events that can affect the performance, accuracy and management of the data from teratogenic processes.

Informatica pioneered the GUI diagram for building integration mappings, but release 9.5 implements an integration optimizer that determines the optimal mapping rather than processing the diagram literally.

With each new release of a complex software product comes the time-consuming and nerve-wracking routine of upgrades. 9.5 now provides automated regression testing to dramatically reduce the time and pain of upgrades.

In Information Lifecycle Management, 9.5 provides “Intelligent Partitions” to distribute data across hot, warm and cold storage devices.

Add to this, facilities for virtualization, replication and a slew of offerings for cloud-based applications – there is an inescapable logic for Informatica exploiting the new opportunities of Big Data.

Posted in Big Data, Research