August 13, 2010

The Data Grid Step

Half a year ago I was happy to be involved in helping the folks from kJube get a project on the road. As the head of data integration at Pentaho I'm very happy to very occasionally be involved in the die-hard work that goes along with the implementation of BI projects because it allows me to touch base with the real world beyond the world of case tracking systems and software architectural dreams. If you can help a friend out at the same time it makes the occasion even better.

At one point it became clear that defining small data sets in Pentaho Data Integration (Kettle) was not as easy as it should be. Obviously it's nice to sit in an ivory tower and say that all information should be defined externally in a file or database table. However, setting up an XML file for a few lines of data is in a lot of cases over the top.

The question also is this: do you want to have ETL related information close to the ETL user or inside some external data source? Is it easier to manage the few lines of data in a user interface like Spoon or in an XML file somewhere? Answering those questions reveals your preference. In any case, as a result of the discussions and idea exchanges with kJube we decided that there was a need for a new "Data Grid" step in Kettle that would allow the ETL user to enter a few lines of constant data in an easy and clear way. Since ideas are often more valuable than a few lines of code, the step emerged and was thrown into the project quickly thereafter.

The use-case is the following: you have a number of company subsidiaries that all use the same database structure on different systems. Because all database structures are the same for these companies we want to create a loop in Kettle. Here is a screen shot from the actual project:

The transformation under "SMORDERCUSLOOP" generates a number of result rows. These rows are used to loop with in the sub-sequent jobs that are executed in parallel.

So we need a semi-fixed list of companies and systems and we can choose to define these in an external file or database or, as is the case here, in a "Data Grid" step:

As you can see, defining the set of constant rows is very easy with the new "Data Grid" step. In fact, everyone involved liked this step so much that it was contributed to the Kettle project. It is available in version 4 and upward in the "Input" category. For version 3.2 projects we have made the source code available under the Kettle Franchise Factory (KFF) project umbrella.

Jan & I are going to release more software that resulted from our adventures under the KFF umbrella. Follow this blog to learn more about KFF and/or more plugins in the near future. We'll also be more than happy to share project experiences at the upcoming Pentaho Community Event in Lisbon, Portugal.