The Informed Data Lake: Beyond Metadata

The Informed Data Lake Strategy
Executive Summary

Historically, the volume and extent of data that an enterprise wanted to store, assemble, analyze and act upon exceeded the capacity of its computing resources, or was simply too expensive to handle. The solution was to model an extract of a portion of the available data into a data model or schema, presupposing what was “important,” and then fit the incoming data into that structure.

But the economics of data management today allow for the gathering of practically any data, in any form, skipping the engineered schema on the presumption that understanding the data can happen on an as-needed basis.

This newer approach of putting data in one place for later use is now described as a Data Lake. But sooner or later, one has to pay the piper, as this Data Lake approach involves manual, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientist’s time and provides no reference to the content or meaning of the data.

The alternative is the Informed Data Lake. The difference between an Informed Data Lake and the static “dumb” data neatly arranged in a Data Lake is a comprehensive set of capabilities that provides a graph-based, linked and contextualized information fabric (semantic metadata and linked datasets). NLP (Natural Language Processing), Sentiment Analysis, Rules Engines, Connectors, Canonical Models for common domains and cognitive tools can be plugged into this fabric to turn “dumb” data into information assets with speed, agility, reuse and value.

Today, companies in Pharmaceutical, Life Sciences, Financial Services, Retail, Government Agencies and many other industries are seeking ways to make the full extent of their data more insightful, valuable and actionable. Informed Data Lakes are leading the way through graph-based data discovery and investigative analytics tools and techniques that uncover hidden relationships in your data, while enabling iterative questioning and answering of that linked data.

The Problem with a Traditional Data Lake Approach

Anyone tasked with analyzing data to understand past events and/or predict the future knows that the data assembled is always “used.” It’s secondhand. Data is captured digitally almost exclusively for purposes other than analysis. Operational systems automate processes and capture data to record the transactions. Document data is stored in formats for presentation and written in flowing prose without obvious structure; it’s written to be read (or just to record), not to be mined for later analysis. Clickstream data in web applications is captured and stored in a verbose stream but has to be reassembled for sense-making.

Organizations implementing or just contemplating a data lake often operate under the misconception that having the data in one place is an enabler of broader and more useful analytics, leading to better decision-making and better outcomes. There is a large hurdle facing these kinds of approaches: while the data may be in one place physically (typically a Hadoop cluster), in essence all that is created is a collection of data silos, unlinked and not useful in a broader context, reducing the data lake to nothing more than a collection of disparate data sources. Here are the issues:

·      Data quality issues: data is not curated. Users have to deal with duplicate data, old data and contextually wrong data

·      Data can be in different formats or different languages

·      Governance issues

·      Security issues

·      Companies being misinformed as to who can process and use the data

·      The ever-changing nature of diverse data requires that the processing and analysis of the data be dynamic, evolving as the data changes or as needs change

The effort of adding meaning and context to the data falls on the shoulders of the analysts, a true time-sink. Data Scientists and Business Analysts spend 50-80% of their time preparing and organizing their data, leaving as little as 20% of their time for analyzing it.

·      But what if data could describe itself?

·      What if analysts could link and contextualize data from different domains without having to go through the effort of curating it for themselves?

·      What if you had an Informed Data Lake to address all those issues and more?

The Argument for an Informed Data Lake

Existing approaches to curating, managing and analyzing data (and metadata) are mostly based on relational technology, which, for performance reasons, usually simplifies data and strips it of meaningful relationships while locking it into a rigid schema. The traditional approach is to predefine what you want to do with the data, then define the model and the subsequent versions of physical optimization and subsets for special uses. Context is like poured concrete: fluid at first, but needing a jackhammer to change once it sets. You use the data as you originally designed it. If changes are needed, going back to redesign and modify is complicated.

The rapid rise of interest in “big data” has spawned a variety of technology approaches to solve, or at least ease, this problem, such as text analytics and bespoke applications of AI algorithms. They work. They perform functions that are too time-consuming to do manually, but they are incomplete because each one is too narrow, aimed at only a single domain or document type, or too specific in its operation. They mostly defy the practice of agile reuse because each new source, or even each new special extraction for a new purpose, has to start from scratch.

Given these limitations, where does one turn for help?  An Informed Data Lake is the answer.

At the heart of the Informed Data Lake approach is the linking and contextualizing of all forms of data using semantic-based technology. Though descriptions of semantic technology are often complicated, the concept itself is actually very simple:

–       It supplies meaning to data that travels with the data

–       The model of the data is updated on the fly as new data enters

–       The model also captures the relationships between things, from which it can do a certain level of reasoning without programming

–       Information from many sources can be linked, not through views or indexes, but through explicit and implicit relationships that are native to the model
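
The four properties above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular product's API: facts are stored as subject-predicate-object triples, every name used here (TripleStore, is_a, subclass_of, studies and the sample entities) is hypothetical, and the "reasoning" is limited to following subclass links.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory store of (subject, predicate, object) facts (hypothetical)."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        # New facts, including new kinds of relationships, can be added at
        # any time; there is no schema to migrate first.
        self.triples.add((subject, predicate, obj))
        self.by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return every object linked to subject by predicate."""
        return {o for p, o in self.by_subject[subject] if p == predicate}

    def types_of(self, subject):
        """Return asserted types plus those implied by subclass_of links."""
        found = set()
        pending = list(self.objects(subject, "is_a"))
        while pending:
            current = pending.pop()
            if current not in found:
                found.add(current)
                pending.extend(self.objects(current, "subclass_of"))
        return found


# Sample facts, invented for illustration only.
store = TripleStore()
store.add("aspirin", "is_a", "NSAID")
store.add("NSAID", "subclass_of", "Analgesic")
store.add("Analgesic", "subclass_of", "Drug")
store.add("trial_42", "studies", "aspirin")

# Nobody stated that aspirin is a Drug; it follows from the links.
inferred = store.types_of("aspirin")
print(sorted(inferred))
```

The meaning (the predicate) travels with each fact, and the "studies" relationship was introduced simply by adding a triple that uses it.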

Conceptually, the Informed Data Lake is a departure from the earliest principle of IT: Parsimonious Development, derived from a mindset of managing from scarcity[1] and deploying only simplified models. Instead of limiting the amount of data available, or even the access to it, Informed Data Lakes are driven by the abundance of resources and data. Semantic metadata provides the ability to find, link and contextualize information in a vast pool of data.

The Informed Data Lake works because it is based on a dynamic semantic-model approach built on graph-driven ontologies. In technical terms, an ontology represents the meaning and relationships of data in a graph, an extremely compact and efficient way to define and use disparate data sources via semantic definitions based on business usage, including terminology and rules that can be managed by business users:

•       Source data, application interfaces, operational data, and model metadata are all described in a consistent ontology framework supporting detailed semantics of the underlying objects. This means constraints on types, relations, and description logic, for example, are handled uniformly for all underlying elements.

•       The ontology represents both schema and data in the same way. This means that the description of metadata about the sources also represents a machine-readable way of representing the data itself for translation, transmission, query, and storage.

•       An ontology can richly describe the behavior of services and composite applications in a way that a relational model can only do by being tightly bound to the application’s logic.

•       The ontology is a run-time model, not just a design-time model. The ontology is used to generate rules, mappings, transforms, queries, and UI because all of the elements are combined under a single structure.

•       There is no reliance on indexes, keys, or positional notation to describe the elements of the ontology. Implementations do not break when local changes are made.

•       An ontological representation encourages both top-down, conceptual description and bottom-up, source- or silo-based representation of existing data. In fact, these can be in separate ontologies and easily brought together.

•       The ontology is designed to scale across users, applications, and organizations. Ontologies can easily share elements in an open and standard way, and ontology tools (for design, query, data exchange, etc.) don’t have to change in any way to reference information across ontologies.
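
Two of the points above, that schema and data share one representation and that separate ontologies can be brought together easily, can be illustrated with a small Python sketch. It assumes the same plain triple representation as a stand-in for a real ontology language; the match helper and all of the sample names are hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Top-down, conceptual ontology: schema statements are ordinary triples.
conceptual = {
    ("Customer", "is_a", "Class"),
    ("places", "domain", "Customer"),
    ("places", "range", "Order"),
}

# Bottom-up ontology derived from one source system: instance data uses
# exactly the same triple shape as the schema above.
source_crm = {
    ("cust_17", "is_a", "Customer"),
    ("cust_17", "places", "order_900"),
}

# Bringing the two together is a set union; nothing is re-keyed or re-indexed.
merged = conceptual | source_crm

# The same query function answers questions about schema and about data.
schema_about_places = match(merged, subject="places")
data_about_cust = match(merged, subject="cust_17")
print(schema_about_places)
print(data_about_cust)
```

Because there are no positional keys or indexes in this representation, adding a third ontology later would be another union rather than a redesign.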

Assuming a data lake is built for a broad audience, it is likely that no one party will have the complete set of data they think is of interest. Instead, it will be a union of all of those ideas, plus many more that arise as things are discovered, situations evolve and new sources of data become available. Thinking in the existing mode of database schema design, relying on the inadequate metadata features of Hadoop, and managing from scarcity in general will fail under the magnitude of this effort. What the Informed Data Lake does is take the guesswork out of what the data means, what it’s related to and how it can be dynamically linked together without endless data modeling and remodeling.

All of the features and capabilities below are needed to keep a data lake from turning into a data swamp, where no one quite knows what it contains or if it is reliable.

Informed Data Lake Features:

·      Connectors to practically any source

·      Graph based, linked and contextualized data

·      Dynamic Ontology Mapping

·      Auto-generated conceptual models

·      Advanced Text Analytics

·      Annotation, Harmonization and Canonicalization

·       “Canonical” models to simplify ingest and classifying of new sources

·      Semantics querying and data enrichment

·      Fully customizable dashboards

·      Full data provenance adhering to IT standards

Sample Informed Data Lake Capabilities:

·      Manage business vocabulary along with technical syntax

·      Actively resolve differences in vocabulary across different departments, business units and external data

·      Support consistent assignment of business policies and constraints across various applications, users and sources

·      Accurately reflect all logical consequences of changes and dynamically reflect change in affected areas

·      Unify access to content and data

·      Assure and manage reuse of linked and contextualized data
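
As one concrete illustration of resolving vocabulary differences across departments, the sketch below maps each department's field names onto a single canonical term. It is deliberately simplistic: the synonym table, the harmonize function and the sample records are all hypothetical, and a real system would draw its mappings from the ontology rather than from a hard-coded dictionary.

```python
# Hypothetical synonym table: each department's term maps to one canonical term.
CANONICAL_TERMS = {
    "client": "customer",
    "account holder": "customer",
    "customer": "customer",
    "sku": "product",
    "item": "product",
    "product": "product",
}


def harmonize(record):
    """Rename a record's fields to canonical terms, keeping unknown fields."""
    result = {}
    for field, value in record.items():
        key = field.strip().lower()
        # Unknown fields pass through under their normalized name.
        result[CANONICAL_TERMS.get(key, key)] = value
    return result


# The same fact, as two departments might record it (invented sample data).
sales_row = {"Client": "Acme Corp", "SKU": "A-100"}
support_row = {"Account Holder": "Acme Corp", "Item": "A-100"}

print(harmonize(sales_row))
print(harmonize(sales_row) == harmonize(support_row))
```

Once both records use the same vocabulary, they can be compared, linked and reused without each analyst repeating the mapping by hand.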

Any vendor providing metadata based on semantic technology is in a unique position to provide the capabilities required to build and deploy the Informed Data Lake. Such a platform is based on open standards and takes a semantic approach from the beginning. In addition, it incorporates a very rich tool set that includes dozens of third-party applications that operate seamlessly within the Informed Data Platform. This is central to the ability to move from data integration and data extraction to more advanced knowledge integration and knowledge extraction, without which it is impossible to fuel solutions in the areas of competitive intelligence, insider trading surveillance, investigatory analytics, Customer 360, and risk and compliance, as well as feeding existing BI applications (a requirement that is not going away anytime soon).

An Informed Data Lake Solution

The specific design pattern of the Informed Data Lake enables data science because analytics does not end with a single hypothesis test. Simple examples of “Data Scientists” building models on the data lake and saving the organization vast sums of money make good copy, but they do not represent what happens in the real world.

Often, the first dozen hypotheses are either obvious or non-demonstrable. When the model characterization comes back it presents additional components to validate and cross correlate. It is this discovery process that the data lake somehow needs to facilitate, and it needs to facilitate it well, otherwise the cost of the analytics is too high and the process is too slow to realize business value.

Enabling that continuous improvement process of deep analytics requires more than a data strategy; it needs a tool chain to solve model refinement, and the best-known method to date is the Informed Data Lake. The significant pain point for deep analytics is refinement, and the lower the refinement costs are, the more business value can be extracted.

At some point you may have heard the criticism of BI and OLAP tools that you were constrained to the questions that were implicit in their models. In fact, the same criticism has been leveled at data warehouses. The fact remains that both data warehouses and BI tools limit your questions to those that can be answered, depending not just on the available data, but on how it is arranged physically and how well the query optimizer can resolve the query.

Now imagine what would be possible if you could ask any question of the data in a massive data lake. This is where the Informed Data Lake comes into play.

Catalog capabilities allow for massive amounts of metadata and instantaneous access to it. Thus any user (or process) can “go shopping” for a dataset that interests them. Because the metadata is constructed in the form of an in-memory graph, linking and joining data of very different structures, perhaps never linked before, can be done instantaneously.

In a browser-like interface, the graph can show you not only the typical ways different data sets can be linked and joined; it can even recommend other datasets that you haven’t considered.

Once data is selected, the in-memory graph processing analyzes and traverses its structure to provide the instantaneous joins that would be impossible in a relational database. The net result is that arbitrarily complex models and tools can ask any question with unlimited joins, as a result of processing optimized for multi-core CPUs, very large memory models and fast interconnect across processing nodes.
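
The catalog behavior described above, finding how two datasets can be linked and recommending datasets the user has not considered, can be sketched as a graph traversal. The example below is a toy under stated assumptions: the catalog contents and the link_path and recommend functions are hypothetical, and a plain breadth-first search stands in for whatever optimized in-memory graph engine a real platform would use.

```python
from collections import deque

# Hypothetical catalog: each dataset lists the datasets it shares a linkable
# concept with. In a real catalog these edges would come from the metadata.
CATALOG = {
    "trades": {"employees", "securities"},
    "employees": {"trades", "hr_records"},
    "securities": {"trades", "market_news"},
    "hr_records": {"employees"},
    "market_news": {"securities"},
}


def link_path(catalog, start, goal):
    """Breadth-first search for the shortest chain of links between datasets."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(catalog.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # The two datasets are not connected.


def recommend(catalog, selected):
    """Suggest datasets one link away from the selection but not yet in it."""
    nearby = set()
    for name in selected:
        nearby |= catalog.get(name, set())
    return sorted(nearby - set(selected))


path = link_path(CATALOG, "hr_records", "market_news")
suggestions = recommend(CATALOG, ["trades"])
print(path)
print(suggestions)
```

The path returned is the chain of joins needed to relate two datasets that have no direct link, which is the kind of answer a predefined relational schema would only give if someone had anticipated the question.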

Informed Data Lake in Action

Pharma R&D Intelligence:

Clinical trials involve great quantities of data from many sources, a perfect problem for an Informed Data Lake. The Informed Data Lake allows the loading, unification and ingestion of the data without knowing a priori what analytics would be needed. In particular, evaluating drug response would link many sources of data that follow participants, including the severity and occurrence of adverse drug reactions, across multiple trials, as well as other, as yet unknown, classes of data.

Clinical trial data investigators and analysts can see the value of the graph-based approach, with linking and contextualization they could not achieve otherwise. They see many benefits, including:

·      Identifying patients for enrollment based on more substantive criteria

·      Monitoring in real time to identify safety or operational signals

·      Blending data from physicians and CROs (contract research organizations)


Insider Trading and Compliance Surveillance:

In the financial services space, the combination of deep analysis of large datasets with targeted queries of specific events and people gives analysts and regulators an opportunity to catch wrongdoing early.

·      Identify an employee who has an unusually high level of suspicious trading activity.

·      Spot patterns in which certain employees have histories of making the exact same trades at the exact same times.

·      Compare employees’ behaviors to their past histories, and spot situations where employees’ trading patterns make sudden, drastic changes
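
The second pattern in the list above, employees repeatedly making the exact same trades at the exact same times, lends itself to a short sketch. The trade log, the employee names and the matching_pairs function below are invented for illustration; real surveillance would also need tolerance windows, volume thresholds and far richer context.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical trade log: (employee, symbol, side, timestamp).
TRADES = [
    ("alice", "XYZ", "buy", "2016-03-01T09:30"),
    ("bob", "XYZ", "buy", "2016-03-01T09:30"),
    ("alice", "QRS", "sell", "2016-03-04T15:55"),
    ("bob", "QRS", "sell", "2016-03-04T15:55"),
    ("carol", "XYZ", "buy", "2016-03-02T11:00"),
]


def matching_pairs(trades, min_matches=2):
    """Count, per pair of employees, trades identical in symbol, side and time."""
    by_event = defaultdict(set)
    for employee, symbol, side, timestamp in trades:
        by_event[(symbol, side, timestamp)].add(employee)

    pair_counts = defaultdict(int)
    for employees in by_event.values():
        for pair in combinations(sorted(employees), 2):
            pair_counts[pair] += 1

    # Keep only pairs whose trades coincide repeatedly, not just once.
    return {pair: n for pair, n in pair_counts.items() if n >= min_matches}


flagged = matching_pairs(TRADES)
print(flagged)
```

Grouping by the trade event first keeps the comparison cheap: only employees who share an identical event are ever paired.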


Making sense of data lakes takes discipline because a one-off approach will drain your best resources of time and patience. The Informed Data Lake approach, complete with a suite of NLP, AI, graph-based models and semantic technology, is the sensible approach. Your two most expensive assets are staff and time. The Informed Data Lake allows you to do your work faster and more cheaply, with more flexibility and greater accuracy, which has a major impact on your business. Without the Informed Data Lake, the data is a bewildering collection of pieces that analysts and data scientists can only understand in small portions, diluting the value of the data lake.

The whole extended fabric of an ontology solution and its ability to plug in third-party abilities collapses many layers of logical and physical models in traditional data warehousing/business intelligence architectures into a single model. With the Informed Data Lake approach, tangible benefits accrue:

·      Widespread understanding of the model across many domains in the organization

·      Rapid implementation of new studies and applications by expanding the model, not re-designing it (even small adjustments to relational databases involve development at the logical, physical and downstream models, with time-consuming testing).

·      Application of Solution Accelerators that provide bundled models by industry/application type that can be modified for your specific need

·      “Data Democratization” making data available to users across the organization for their own data discovery and analytic needs, extracting greater value from the data

·      Discovering hidden patterns in relationships, something not possible with the rotational and drill-down capabilities of BI tools

·      The ability for iterative question and answering, continuous data discovery and run time analytics across huge amounts of data and, more importantly, linked data from sources not typically associated previously

In conclusion, the Informed Data Lake turns a disparate collection of data sources of unknown origin, quality and currency into a facility for almost limitless exploration and analysis.
[1] Managing from scarcity has historically driven IT to develop and deploy using the least amount of computing resources under the assumption that these resources were precious and expensive. In the current computing economy, the emphasis has shifted away from scarcity of hardware to scarcity of time and attention of knowledge workers.
