December 30, 2016

Pentaho World 2015 - Model Driven Data Integration presentation

During Pentaho World, Matt Casters and I promised to publish the slide deck and demo materials we showed during our breakout session. For those who missed Pentaho World, the original agenda is on the Pentaho World website. For those who were present in Orlando, you can access all of this year's presentations and even video recordings of the talks.

Anyhow, Matt and I promised to publish the materials, and a promise being a promise, here is a transcript of the presentation AND the zip file with Matt's demo. Sorry it took over a year to get this out, but the end of the year is a good moment to remember old promises.

The intro

Matt and I started working on metadata driven ETL somewhere back in 2008 during a migration project, helping an existing Pentaho customer move from their previous ETL tool to Pentaho Data Integration.

During that project we noticed that various ETL patterns were repeated throughout the whole implementation. In particular, the "staging" logic that pulls data from the sources and lands it in the data warehouse repeated the same pattern about 120 times. When we discovered this potential for re-use of the same logic, Matt helped me write a transformation that generated those 120 transformations from the list of tables that needed to be staged. That was one of those rare, real "aha" moments. We learned that ETL transformations didn't have to be written by a developer, but could be written by a machine. In this case, Matt exposed an API for PDI to write PDI transformations.
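
To give a feel for the idea, here is a minimal Python sketch of a "transformation that writes transformations". This is purely illustrative, not the actual PDI API Matt exposed (which emits .ktr files from Java), and the table list is invented for the example:

```python
def make_staging_transform(table, columns):
    """Render the staging logic for one table; real PDI would emit a .ktr file."""
    select_list = ", ".join(columns)
    return (f"-- staging transformation for {table}\n"
            f"INSERT INTO stg_{table} SELECT {select_list} FROM {table};")

# Hypothetical table list; in the real project this was 120 tables long.
tables = {
    "customers": ["customer_id", "name", "country"],
    "orders": ["order_id", "customer_id", "order_date"],
}

# One generator run replaces 120 hand-written transformations.
transforms = [make_staging_transform(t, cols) for t, cols in tables.items()]
```

Add a table to the list and the corresponding staging logic exists; no developer ever touches the generated code by hand.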

The whole project eventually led to the creation of the KFF project, a framework for managing PDI projects, but that is another story (and one that has become largely obsolete due to the evolution of PDI as a product).

Metadata driven ETL made it into the product in 2010 through the metadata injection step, as you can see here in the original demo by Matt. It has since been documented extensively by various community members such as David Fombella, Matt, Dan Keeley, and many others.

But back to model driven ETL and the Pentaho World 2015 preso.

What it is

We believe that exposing PDI's transformation engine to allow it to be controlled by a stream of metadata is key to delivering data integration capabilities at scale. And when we talk about scale, we are not talking about data volumes but about the sheer amount of data integration coding that the foreseeable data explosion will require.

Metadata driven data integration doesn't just offer scalability from a development point of view. It also offers more maintainable solutions through better standardisation and massive adaptability. Having machines write your code on the fly ensures the code remains in perfect shape and does not degrade over time at the hands of a series of different developers with different backgrounds, different coding principles and different naming standards.

We see various use cases in the market where model driven data integration can be of value. Here are some examples:

  • DI tool migration: Migrating away from your existing ETL tool to a cheaper and more modern alternative isn't an easy endeavour. Years of investment in any tool create an effective technology lock-in. However, as mentioned above, if your ETL logic follows specific patterns, and usually it does or can, model driven ETL can cut down the re-engineering cost to the point where the business case becomes positive.
  • Data lake data ingestion: Landing data in a data lake is something many of our customers focus heavily on. It is the first hurdle to overcome to make the data available to data scientists who can crunch the data into usable information. The ability to ingest data across hundreds of systems by scanning their metadata and then using it for data extraction and ingestion into Hadoop speeds up the creation of a true enterprise data hub.
  • IoT: Internet of Things use cases are all about machines talking to each other. The value of the above described capabilities should be evident in this use case, which is underlined in the short demo described below.

The demo

The attached demo files show a simple use case of PDI receiving a JSON file with metadata describing how to read a CSV file and load it into a database. The demo consists of 4 folders.

  • Data: The data set for the demo
  • Step 1: The basic transformation, no metadata involved
  • Step 2: Same pattern, metadata driven
  • Step 3: Extended from just metadata driven data loading all the way up to publishing
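
The pattern in Step 2 can be sketched in a few lines of generic code: a JSON document describes the file and the target table, and the loading logic itself knows nothing about either. A minimal Python illustration (the metadata field names and sample data are invented for the example, and sqlite3 stands in for the target database):

```python
import csv, io, json, sqlite3

# Hypothetical metadata document, standing in for the JSON file the demo
# feeds to PDI; the field names here are assumptions for the example.
metadata = json.loads("""
{
  "source_file": "sales.csv",
  "target_table": "sales",
  "columns": ["region", "amount"]
}
""")

# In-memory stand-ins for the CSV file and the target database.
csv_file = io.StringIO("region,amount\nEMEA,100\nAPAC,250\n")
conn = sqlite3.connect(":memory:")

# Generic loading code: everything specific comes from the metadata.
conn.execute("CREATE TABLE {} ({})".format(
    metadata["target_table"], ", ".join(metadata["columns"])))
rows = [tuple(row[c] for c in metadata["columns"])
        for row in csv.DictReader(csv_file)]
conn.executemany("INSERT INTO {} VALUES ({})".format(
    metadata["target_table"], ", ".join("?" * len(metadata["columns"]))), rows)

loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Point the same code at a different JSON document and it loads a different file into a different table; that is the whole trick.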

The caveats

Before handing you the goodies to play around with, here are some caveats for you to consider.
  • Continuous integration: Due to its complexity, model driven data integration needs rigorous testing. The smallest change can create havoc at scale. Hence we suggest setting up a proper development environment with nightly builds of your solution and continuous testing.
  • Version management: With the introduction of metadata, not only your code but also your metadata needs to be version managed. And since the metadata can come from external sources, proper validation before using it is needed.
  • 80/20: While a model driven approach can be applied to a great many ETL patterns, there are still many situations where manual coding cannot be avoided. Model driven ETL is not a magic bullet. Do not try to solve every data integration challenge with this approach.
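
On the version management point: since the metadata may arrive from external systems, a defensive check before injecting it is cheap insurance. A minimal validation sketch in Python (the required field names are assumptions for the example, not a PDI convention):

```python
# Fields a metadata document must carry before we generate anything from it.
REQUIRED_FIELDS = {"source_file", "target_table", "columns"}

def validate_metadata(meta):
    """Reject metadata documents that would break the generated transformation."""
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    if not isinstance(meta["columns"], list) or not meta["columns"]:
        raise ValueError("'columns' must be a non-empty list")
    return True

# A well-formed document passes; an incomplete one is rejected loudly.
ok = validate_metadata(
    {"source_file": "a.csv", "target_table": "t", "columns": ["x"]})
try:
    validate_metadata({"target_table": "t"})
    rejected = False
except ValueError:
    rejected = True
```

Failing fast on bad metadata is far cheaper than debugging 120 generated transformations after the fact.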

The goodies

Attached are the zip file with the demo code and the PDF with the slides presented at Pentaho World, as well as the embedded presentation below.


Since ...

... we gave this talk, metadata driven data ingestion and metadata driven ETL overall have become major topics at Pentaho and in the wider big data analytics industry. Many marketing posts, blog posts and write-ups have appeared. Here is a list of resources I know of.

Likely there are many more materials out there. I'm glad the topic was so well received.



At the end of the year, looking back, not just on 2016 but also further back, I collected a couple of images from the Pentaho past. Most of this comes from old blog posts I happened to revisit (yes, a lot of end-of-year nostalgia is happening here). And the internet has an excellent memory, so revisiting the past is easy.

As always, before you start reading, turn on the soundtrack please.

So here come the visuals. For those with a visual memory: can you figure out which blog/author the images below came from? I have not given away the names; you need to follow a hyperlink to see the answer. But Pentaho community members should be able to score 5 out of 5 easily.

1) To start with, below are some screenshots from early Kettle versions, version 1.1 or so. Pre-Pentaho material. In 2007 not a lot of people were writing about Kettle, but some were! And look at those icons. Since then we have gone through at least two other versions.

The above images are taken from the blog of the following gentleman (you have to click to know the answer), an early PDI contributor and experimenter in the field, as well as somebody who executed a lot of early-day consulting engagements for Pentaho. Never heard of him? Well, these were early days ...

2) In the next series, I've added one non-Pentaho-specific image from the author's blog, just because it was so charming, and to make it a little easier for you to recognize this amazing Pentaho community contributor.

The images come from the blog of this gentleman, a very active contributor to a great many Pentaho projects, including the Cookbook, KREX, Pendular, the Pentaho Analysis Editor, the Pentaho database inspection service, and likely a bunch of other projects that I totally forgot.

3) But there is more. Next one should be easy. These images ...

... were taken from the blog of this gentleman, no further introductions required.

4) And there is more. Try the following set, which is very PDI focussed. The images themselves should betray who posted these screenshots.

They came from the blog of this gentleman.

5) Finally, one more to conclude the series, because all good things come in fives. Some screenshots might look totally unknown to today's Pentaho user, but hey, the first ones are from a community project that has transformed into something else, and some others are from Pentaho Server version 3. Can you believe what our admin console used to look like?

The above were from the blog of this gentleman, a magical trout who ran the Pentaho IRC channel in the early days and kicked off many a community project.

I could go on, ripping images from the internet and blogs, but I'm going to stop here. If you want to read about Pentaho, apart from the great online documentation that is available these days, there is an enormous number of people blogging about Pentaho. Go and google them.

The above are all screenshots from blog posts of at least 5, if not 10, years back, things I went to revisit out of sheer nostalgia.

Jumping forward, and to conclude this blog post, look at where we are now.

Let's see where we'll get in 2017!

December 29, 2016

(SW01) Actually?

As mentioned in my previous post, I believe "Steel Wheels can be improved". Here is the first trick, and though it is extremely simple, it is nonetheless a "must do" to ensure your demo is credible.

As you pull open Steel Wheels, you'll notice that the dates for the orders in the Steel Wheels order database range between 2003 and 2005. (Look at the top row in the sample report "Leading Product Lines (pivot table)" or the nostalgic screenshot below that I grabbed from the net. More nostalgia available here, here and here if you are in the mood.)

While this shows that Pentaho has been around for some years and earned its stripes as a company :-), from a demo perspective it doesn't really give a polished view. At the speed analytics products evolve, you really cannot afford to give a demo with data from over 10 years ago. Can you?

Clearly this can be adjusted. And to save you from having to think up update scripts to correct the dates yourself, here is the logic I have used in the past to actualize the dates in Steel Wheels.

The following columns in the database contain date information (regardless of whether they are actually typed as date fields).
  • dim_time.time_id
  • dim_time.year_id
  • orderfact.time_id
  • orderfact.order_date
  • orderfact.required_date
  • orderfact.shipped_date
  • orderfact.payment_date
  • orders.order_date
  • orders.required_date
  • orders.shipped_date
  • orders.payment_date

You can easily update these by running some basic SQL. I've come up with simple logic that uses the basic date functions of the hibernate (HyperSonic) sample database: take the most recent payment date in the sample database as a reference, and shift all dates forward by the difference with the current year, year(today).

-- The offset, in months, is (year(today) - year(max(paymentdate))) * 12.
-- payments.paymentdate is the reference for the offset, so the payments
-- table must be updated last.
-- Note: time_id is a formatted date key; the format masks used below are an
-- assumption, adapt them to how time_id is stored in your schema.

update dim_time
set time_id = to_char(add_months(to_date(time_id, 'YYYYMMDD'),
    (select (year(today) - year(max(paymentdate))) * 12 from payments)), 'YYYYMMDD');

update dim_time
set year_id = year_id + (select year(today) - year(max(paymentdate)) from payments);

update orderfact
set time_id = to_char(add_months(to_date(time_id, 'YYYYMMDD'),
    (select (year(today) - year(max(paymentdate))) * 12 from payments)), 'YYYYMMDD');

update orderfact
set orderdate = add_months(orderdate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update orderfact
set requireddate = add_months(requireddate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update orderfact
set shippeddate = add_months(shippeddate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update orders
set orderdate = add_months(orderdate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update orders
set requireddate = add_months(requireddate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update orders
set shippeddate = add_months(shippeddate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));

update payments
set paymentdate = add_months(paymentdate,
    (select (year(today) - year(max(paymentdate))) * 12 from payments));
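
For readers who want to prototype the offset logic before touching the sample database, the same idea can be sketched in Python (sqlite3 stands in for the HyperSonic database here, and the single-column payments table is a simplification):

```python
import sqlite3
from datetime import date

# Tiny stand-in for the sample database's payments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (paymentdate TEXT)")
conn.executemany("INSERT INTO payments VALUES (?)",
                 [("2005-05-20",), ("2004-11-03",)])

# (year(today) - year(max(paymentdate))) * 12: the shift in months.
max_year = int(conn.execute(
    "SELECT MAX(paymentdate) FROM payments").fetchone()[0][:4])
offset_months = (date.today().year - max_year) * 12

# Shift every payment date forward by that many months.
conn.execute("UPDATE payments SET paymentdate = date(paymentdate, ?)",
             (f"+{offset_months} months",))
new_max = conn.execute("SELECT MAX(paymentdate) FROM payments").fetchone()[0]
```

Because the offset is a whole number of years expressed in months, the day and month of each record are preserved and the most recent payment always lands in the current year.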

If you have used Pentaho before, you'll know the various ways to apply the updates, but for those who are new to Pentaho and want the easy path, you can also simply execute the following PDI job.

Once the logic has executed, refresh your Mondrian cache (execute the following URL),

and you should see the data in your report actualized.

Please note that while your data is now actualized, some of the reports still contain hard-coded references to the old years, which might need some manual intervention, namely the following (URLs only valid if you are running your EE trial on localhost:8080):
If anybody wants to modify those reports and submit a non-hard-coded version, feel free to contact me :-). I'll gladly publish the new report versions here, or link to wherever you posted them. In the meantime you can download the above SQL update script and PDI transformation here.


Marketplace on 7

I installed our latest software release, Pentaho 7.0, a couple of weeks back. Having been an active Pentaho community user, the first thing I wanted to do was install the Marketplace, to test out all the cool existing gadgets, including the new toys that were touted at #PCM16 and the recently promoted MQTT steps.

For some awkward reason, the Pentaho EE server still does not come with the Pentaho Marketplace enabled as standard. It is not in the standard menu!

While I was getting ready to go through the old manual download to put the plugin into the right folder, it dawned on me that the procedure to get Marketplace onto your EE install had been made a lot easier. And indeed, when navigating to the Marketplace website, it wasn't hard to spot the "Find Marketplace" instructions.

Once I found the instructions, I did notice that for some reason Pentaho has still opted to hide the enabling of the Marketplace in an obscure config file. Not the level of user friendliness I would have hoped for, but better than pre-6.

A quick "./ restart" later, the Marketplace was available on my box.

Easy enough.

December 28, 2016

Steel Wheels ?

All business analytics software platforms come with a sample data set and sample functionality. Usually this involves a fictitious company with fictitious products and fictitious users, a scenario easy to understand for any end-user or IT developer.

The objectives of such a set of sample data and sample analytics content are multiple:
1) offer a real-world context that both technical and non-technical people can grasp,
2) show within that context what the software has to offer (for evaluation purposes),
3) offer a demo environment for sales purposes,
4) offer a tool-set for customer support to communicate and test on specific issues,
5) offer a common piece of starter code for developers to collaborate against.

Microsoft has its Northwind and AdventureWorks databases, and the derived FoodMart for OLAP. Oracle has the employee data in the Scott/Tiger schema. Business Objects used to ship with eFashion.mdb and eFashion.unv, Crystal Reports used the Xtreme data, and Cognos comes with the Great Outdoors Company data. The newer players on the market seem to offer more. Tableau offers several data sets. Qlikview has made sample applications part of its marketing strategy and has released hundreds of them in its demo gallery.

So let's take a moment and see what Pentaho has to offer in Pentaho EE 7. Pentaho ships the Steel Wheels demo with a standard installation. As you log in to the Pentaho User Console (EE Trial Edition), the Welcome screen walks you through some of the samples.

So how good is this demo really?

Once you open up the Browse Files perspective, you can see that the set of reports and dashboards offered by Steel Wheels isn't very elaborate: 17 reports/dashboards in total. In all fairness, not a very impressive amount.

And if you look under the covers, the available content does not nearly cover the functionality and actual strengths of the Pentaho platform.

Also, the data set itself is pretty limited. If you open up the sample database, a HyperSonic database located in the install directory/server/hsql-sample-database folder, you'll see there is only a handful of tables, and the amount of data in them is not what you would expect from a "big data company".

But obviously a lot of the above has a simple explanation, or even needs no explaining to those who have followed Pentaho over the years. The demo carries a lot of heritage from when Pentaho was the affordable open source business analytics alternative for the SMB market. The company has come a long way since. And a business analytics demo for the SMB market is a totally different thing from a big data analytics demo for the enterprise space. In essence, Steel Wheels has not evolved a lot, while Pentaho has, and fast.

Where am I going with this blog post?

Simple. Over the years of working with Pentaho, I've learned to extend and adapt Steel Wheels to my needs. The demo is what it is. And notwithstanding its current limitations, provided you know it, you can rapidly turn it into the tool you need for whatever demo or presentation you are giving.

Knowing that Steel Wheels is the standard demo instrument that Pentaho makes available to all customers, resellers, partners, developers and community members, I wanted to share some of the lessons I learned over the years. I hope some of these tips and tricks can help you turn the available demo into a platform that lets you credibly demo the full potential of Pentaho.

The end goal? I hope that by sharing my lessons learned, others will start to submit their contributions. Maybe we'll collect sufficient materials to release Steel Wheels v4.0 and move from Rusty Wheels to Stainless or Reinforced Steel Wheels. (My puns are awful!)

Watch this space!

For those who used to read my blog posts years ago, when I was a more active Pentaho Community blogger, I used to add soundtracks to my blog posts. Here is the appropriate (and predictable) soundtrack for this post. Enjoy.
