December 30, 2016

Pentaho World 2015 - Model Driven Data Integration presentation

During Pentaho World, Matt Casters and I promised to publish the slide deck and demo materials we showed during our breakout session. For those who missed Pentaho World, the original agenda is on the Pentaho World website. For those who were present in Orlando, you can access all of the presentations and, this year, even video recordings of the talks.

Anyhow, Matt and I promised to publish the materials, and a promise being a promise, here is a transcript of the presentation AND the zip file with Matt's demo. Sorry if it took over a year to get this out but the end of the year is a good moment to remember old promises.

The intro

Matt and I started working on metadata driven ETL somewhere back in 2008 when doing a migration project, moving a current Pentaho Customer from their existing ETL tool to Pentaho Data Integration. 

During that project we noticed that various ETL patterns were repeated throughout the whole implementation. In particular, the "staging" logic that pulls data from the sources and lands it in the data warehouse repeated the same pattern about 120 times. When we discovered this potential for re-use of the same logic, Matt helped me write a transformation that wrote 120 transformations based on the list of tables that needed to be staged. That was one of those rare real "aha erlebnis" moments. We learned that ETL transformations didn't have to be written by a developer, but could be written by a machine. In this case, Matt exposed an API for PDI to write PDI transformations.
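To make the idea concrete, here is a minimal sketch of "one pattern, many generated jobs". It is plain Python, not the PDI API Matt exposed: the template, the `generate_staging_jobs` function and the table names are all made up for illustration, and a SQL string stands in for a full transformation definition.

```python
from string import Template

# Hypothetical staging pattern: copy a source table into its staging twin.
# A real generator would emit a full transformation definition instead of SQL.
STAGING_TEMPLATE = Template(
    "INSERT INTO staging.${table} SELECT * FROM source.${table};"
)


def generate_staging_jobs(tables):
    """Return one job definition per table, all following the same pattern."""
    return {f"stage_{t}": STAGING_TEMPLATE.substitute(table=t) for t in tables}


# Made-up table list; in the project described above it held about 120 entries.
tables = ["customers", "orders", "invoices"]
jobs = generate_staging_jobs(tables)

for name, definition in jobs.items():
    print(name, "->", definition)
```

The point is that the list of tables is the only thing that changes. Adding table number 121 means adding one entry to the list, not building one more transformation by hand.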

The whole project finally led to the creation of the KFF project, a framework for managing PDI projects, but that is another story (and one that has become largely obsolete due to the evolution of PDI as a product).

Metadata driven ETL, through the metadata injection step, made it into the product in 2010, as you can see here in the original demo by Matt. It has since been documented extensively by various community members such as David Fombella, Matt, Dan Keeley, and many others.

But back to model driven ETL and the Pentaho World 2015 preso.

What it is

We believe that exposing PDI's transformation engine to allow it to be controlled by a stream of metadata is key to delivering data integration capabilities at scale. And when we talk about scale, we are not talking about data volumes but about the sheer amount of data integration coding that the foreseeable data explosion will require.

Metadata driven data integration doesn't just offer scalability from a development point of view. It also offers more maintainable solutions through better standardisation and massive adaptability capabilities. Having machines write your code on the fly ensures the code remains in perfect shape and does not degrade over time under the hands of a series of different developers, with different backgrounds, different coding principles and different naming standards.

We see various use cases in the market where model driven data integration can be of value. Here are some examples:

  • DI tool migration: Migrating away from your existing ETL tool to a cheaper and more modern alternative isn't an easy endeavour. Years of investment in any tool will cause an effective technology lock-in. However, as mentioned above, if your ETL logic follows specific patterns, and usually it does or can, model driven ETL can cut down the re-engineering cost of your ETL to the point where the business case becomes positive.
  • Data lake data ingestion: Landing data in a data lake is something many of our customers focus heavily on. It is the first hurdle to overcome to make the data available to data scientists who can crunch the data into useable information. The ability to ingest data across 100s of systems by scanning their metadata and then using it for data extraction and ingestion into Hadoop speeds up the creation of a true enterprise data hub.
  • IoT: Internet of Things use cases are all about machines talking to each other. The value of the above described capabilities should be evident in this use case, which is underlined in the short demo described below.

The demo

The attached demo files show a simple use case of PDI receiving a JSON file with metadata information to read out a CSV file and load it into a database. The demo consists of 4 folders.

  • Data: The data set for the demo
  • Step 1: The basic transformation, no metadata involved
  • Step 2: Same pattern, metadata driven
  • Step 3: Extended from just metadata driven data loading all the way up to publishing
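For readers who want the gist before opening the zip file, the pattern of step 2 can be sketched outside of PDI in a few lines of Python. This is not the demo itself: the metadata layout (`target_table`, `delimiter`, `fields`), the sample data and the `load_csv` function are assumptions made up for this sketch, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import json
import sqlite3

# Hypothetical metadata document; the demo's real JSON layout may differ.
METADATA_JSON = """
{
  "target_table": "sales",
  "delimiter": ";",
  "fields": [
    {"name": "id", "type": "INTEGER"},
    {"name": "product", "type": "TEXT"},
    {"name": "amount", "type": "REAL"}
  ]
}
"""

# Made-up CSV content matching the metadata above.
CSV_DATA = "id;product;amount\n1;widget;9.5\n2;gadget;12.0\n"


def load_csv(metadata, csv_text, conn):
    """Create the target table and load the CSV rows as the metadata describes."""
    table = metadata["target_table"]
    names = [f["name"] for f in metadata["fields"]]
    columns = ", ".join(f'"{f["name"]}" {f["type"]}' for f in metadata["fields"])
    conn.execute(f'CREATE TABLE "{table}" ({columns})')

    reader = csv.DictReader(io.StringIO(csv_text), delimiter=metadata["delimiter"])
    rows = [tuple(row[n] for n in names) for row in reader]
    placeholders = ", ".join("?" for _ in names)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv(json.loads(METADATA_JSON), CSV_DATA, conn)
print(loaded, "rows loaded")
```

Nothing in `load_csv` knows about sales, products or amounts. Feed it another metadata document and another file, and the same code loads a different table, which is exactly what metadata injection does for a PDI transformation.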

The caveats

Before handing you the goodies to play around with, here are some caveats for you to consider.
  • Continuous Integration: Due to its complexity, model driven data integration needs rigorous testing. The smallest change can create havoc at scale. Hence we suggest setting up a proper development environment with nightly builds of your solution and continuous testing.
  • Version management: With the introduction of metadata, not only your code but also your metadata needs to be version managed. And since the metadata can come from external sources, it needs proper validation before it is used.
  • 80/20: While a model driven approach can be applied to a great many ETL patterns, there are still many situations where manual coding cannot be avoided. Model driven ETL is not a magic bullet. Do not try to solve every data integration challenge with this approach.
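The validation of externally supplied metadata mentioned above could look something like the following sketch. The metadata layout, the `ALLOWED_TYPES` whitelist and the `validate_metadata` function are assumptions for illustration only; a real project would validate against whatever metadata format it actually uses.

```python
# Assumed whitelist of column types the target database accepts.
ALLOWED_TYPES = {"INTEGER", "TEXT", "REAL"}


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the metadata is usable."""
    problems = []

    table = metadata.get("target_table")
    if not isinstance(table, str) or not table.isidentifier():
        problems.append("target_table must be a plain identifier")

    fields = metadata.get("fields")
    if not isinstance(fields, list) or not fields:
        problems.append("fields must be a non-empty list")
        return problems

    seen = set()
    for i, field in enumerate(fields):
        name = field.get("name") if isinstance(field, dict) else None
        ftype = field.get("type") if isinstance(field, dict) else None
        if not isinstance(name, str) or not name.isidentifier():
            problems.append(f"field {i}: name must be a plain identifier")
        elif name in seen:
            problems.append(f"field {i}: duplicate name {name!r}")
        else:
            seen.add(name)
        if not isinstance(ftype, str) or ftype not in ALLOWED_TYPES:
            problems.append(f"field {i}: unsupported type {ftype!r}")
    return problems


# Made-up examples: one clean document and one that should be rejected.
good = {"target_table": "sales", "fields": [{"name": "id", "type": "INTEGER"}]}
bad = {
    "target_table": "sales; DROP TABLE x",
    "fields": [{"name": "id", "type": "BLOB"}, {"name": "id", "type": "TEXT"}],
}

print(validate_metadata(good))
for problem in validate_metadata(bad):
    print(problem)
```

Because generated code is only as good as the metadata behind it, a check like this belongs at the very start of the pipeline, and its test cases belong under version control next to the metadata.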

The goodies

Attached are the zip file with the demo code and the PDF file with the slides presented at Pentaho World. The presentation is also embedded below.


Since ...

... we gave this talk, metadata driven data ingest and metadata driven ETL overall have become major topics at Pentaho and in the wider big data analytics industry. Many marketing posts, blog posts and write-ups have appeared. Here is a list of resources I know of.

Likely there are many more materials out there. I'm glad the topic was so well received.