August 25, 2010

KFF environment configurator

While running the same (parametrized) data integration code for different customers or environments, you need to carefully keep track of which kettle.properties file you are using. The environment configurator solves that problem.

About kettle variables
The kettle.properties file is well known to all kettle developers. For those of you who don't fit into that category (and for some strange reason still read this post): the file contains all environment variables that you want to use in the jobs/transformations you execute through either spoon or kitchen. The file sits in the .kettle directory under your home directory, that is ${HOME}/.kettle, and you can add whatever variables you want.

How would you typically use these environment variables? A simple example will illustrate this.

Suppose you use a standard directory for incoming flat file data you need to process. In that case you can create a variable in the kettle.properties file for this directory.
IN_FILE_DIR=/kff/projects/my_customer/my_application/data/in/
The variable is then available in all kettle jobs and transformations you develop, in this case for instance in the Text file input step.
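For example, instead of hard-coding the directory in that step's file selection grid, you reference the variable (the field names below are as they appear in the step dialog; the wildcard pattern is just an illustration):

File/Directory:     ${IN_FILE_DIR}
Wildcard (RegExp):  sales_orders_.*\.txt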



Now if you are running projects for multiple customers, or you are running the same code on different environments (as in development, unit test, acceptance test, production ...), the input directory for your flat files will most likely be different. In order to keep your code generic you would obviously continue to work with the same variable IN_FILE_DIR, but there is room for only one kettle.properties file on your system.

Some might say that you can solve this issue by using named parameters, which you can pass from the kitchen command line to your jobs/transformations. However, if the number of variables starts to grow, that becomes quite cumbersome to manage.
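To see why, imagine passing even a handful of variables as named parameters on every kitchen call (the job name and parameter values below are made up):

sh kitchen.sh -file:/kff/projects/my_customer/my_application/main.kjb -param:"IN_FILE_DIR"=/kff/projects/my_customer/my_application/data/in/ -param:"DWH_SERVER"=DWH_PRD -param:"DWH_DATABASE"=MySQL_DWH -param:"ERROR_DIR"=/kff/projects/my_customer/my_application/data/error/

Multiply this by the number of jobs and environments you run and the command lines quickly become unmanageable.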

Environment configurator
In order to avoid manual switching between kettle.properties files for the above scenarios - as I assume many of you are doing - we implemented a little step which we have called the environment configurator.


Basically this step reads a kettle.properties file (which you can place in a location of your choice), parses it and sets the corresponding environment variables. Any values that were already loaded from your $HOME/.kettle/kettle.properties will be overwritten.

Now if you insert the environment configurator as the first step in your data integration job, as we've done in KFF, you can precisely control which variables will be used to execute your code.



The environment configurator step expects three input variables which are used throughout the kettle franchising factory, our framework for the management and rapid deployment of (multiple) data integration solutions across customers and lifecycles (see the project on Google Code for more on this framework):
KFF_CUSTOMER      The customer for which you are running
KFF_APPLICATION   The application code you are running
KFF_LIFECYCLE     The lifecycle environment (DEV, TST, UAT, PRD) you are running in
Alternatively, you can specify the path to your kettle.properties file directly. (Full details in the documentation.)
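For those who want a feel for what happens under the hood, here is a minimal sketch of the idea in plain Java rather than the actual plugin source. The directory convention, the class and the method names are assumptions for the example; the real step pushes the entries into Kettle's variable space instead of returning a Map.

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class EnvironmentConfiguratorSketch {

    // Build the path to the environment-specific properties file from the
    // three KFF variables. This directory layout is hypothetical.
    static String propertiesPath(String customer, String application, String lifecycle) {
        return "/kff/projects/" + customer + "/" + application
                + "/config/kettle_" + lifecycle.toLowerCase() + ".properties";
    }

    // Load the file and return its entries. In the real step these entries
    // override whatever was read from $HOME/.kettle/kettle.properties when
    // spoon or kitchen started up.
    static Map<String, String> loadEnvironment(String customer, String application,
            String lifecycle) throws Exception {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(propertiesPath(customer, application, lifecycle));
        try {
            props.load(in);
        } finally {
            in.close();
        }
        Map<String, String> variables = new HashMap<String, String>();
        for (String name : props.stringPropertyNames()) {
            variables.put(name, props.getProperty(name));
        }
        return variables;
    }
}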

Vision
For the moment we've released this kettle plugin as a quick hack for our own projects, to simplify our multi-customer/multi-environment set-up. However, a lot of improvements and extensions are imaginable. We would love to get the discussion going.

Some food for thought:
  • How many environments do you need? And what naming conventions could we use? Off the top of my head I can think of the following, but we've only "implemented" the first four.
    • DEV: development
    • TST:  unit test
    • UAI: user acceptance and integration test
    • PRD: production
    • DRP: mirror of production for performance testing
    • ...
  • Shouldn't kettle slowly evolve in such a way that the kettle.properties file becomes part of your project code? Or maybe two levels of kettle.properties could exist: the original kettle.properties that contains variables for all your projects, and a project-specific one that contains project-specific variables.
  • If variables become project specific, wouldn't it make sense to be able to edit them inside kettle in a grid that shows the different lifecycle environments and the value you are using in each environment, as illustrated below:

Variable       DEV             TST             UAI            PRD
IN_FILE_DIR    /DEV/data/in/   /TST/data/in/   /UAI/data/in/  /PRD/data/in/
DWH_SERVER     localhost       localhost       DWH_UAI        DWH_PRD
DWH_DATABASE   MySQL_DWH_DEV   MySQL_DWH_TST   MySQL_DWH      MySQL_DWH
...
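Behind the scenes such a grid could simply generate one properties file per lifecycle. For the illustrative values above, the DEV and PRD files would look something like this (the file names are an assumption, not an existing KFF convention):

kettle_dev.properties
IN_FILE_DIR=/DEV/data/in/
DWH_SERVER=localhost
DWH_DATABASE=MySQL_DWH_DEV

kettle_prd.properties
IN_FILE_DIR=/PRD/data/in/
DWH_SERVER=DWH_PRD
DWH_DATABASE=MySQL_DWH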

Get it
You can get your copy of the environment configurator on Google Code, in the downloads section.  All feedback is appreciated. We are very curious to see which use cases you will find for this plug-in.

What's cooking backwards compatible?

Roland Bouman just released the "kettle-cookbook" on Google Code. At first sight, if you look at the name and are a bit used to the kettle kitchen terminology (kettle, kitchen, pan, spoon, chef, ...), you might expect this to be some manual on how to cook up the best data integration jobs and transformations using kettle. But it is something different: the cookbook is an auto-documentation tool for kettle jobs and transformations.

Since we have just released a new project with over 300 jobs and transformations, we had been looking into documentation ourselves, but it seems Roland beat us to it. Time to give the cookbook a try (and see whether we can contribute).

Step 1: Installing the cookbook
... equals unzipping the code into a directory. Since our standard set-up is to have all the re-usables under the same directory, namely /kff/reusable, I unzipped the cookbook under /kff/reusable/cookbook. No pain here.

Step 2: Running the cookbook against a directory of transformations/jobs
I have a data warehousing project template which uses the following root folder:
/kff/projects/templates/datawarehouse/code/     
with the following subdirectory structure and dummy jobs.

/kff/p../code/pre   pre-processing 
/kff/p../code/stg   jobs to load staging area of the DWH
/kff/p../code/ods   jobs to load ODS area of the DWH
/kff/p../code/dwh   jobs to load multidimensional DWH
/kff/p../code/pst   post-processing

So I pointed the cookbook at the "/kff/p../code/" directory to get a full documentation of my template project.
jaertsen@Jaybox:/kff/software/pdi/3.2.3$ sh kitchen.sh -file:/kff/reusable/cookbook/pdi/document-all.kjb -param:"INPUT_DIR"=/kff/projects/templates/datawarehouse/code/ -param:"OUTPUT_DIR"=/kff/projects/templates/datawarehouse/doc/
What did I get? Only documentation for the jobs and transformations in the root folder. What happened here?

After some digging it seems that the first step in the "get-kettle-job-and-trans-files-from-directory.ktr" transformation is a "Get File Names" step. That step differs between the 3.2.x version - which most of our customers are still on - and the recently released 4.0.0 version. Basically, there is an "include subdirectories" flag in the 4.0.0 version which wasn't there before. So much for backwards compatibility, but that explains my issue.


So, even though I have no intention of upgrading any of my customers yet, I ran the cookbook with version 4.0.0.
jaertsen@Jaybox:/kff/software/pdi/4.0.0$ sh kitchen.sh -file:/kff/reusable/cookbook/pdi/document-all.kjb -param:"INPUT_DIR"=/kff/projects/templates/datawarehouse/code/ -param:"OUTPUT_DIR"=/kff/projects/templates/datawarehouse/doc/
... and problem solved.

With great thanks to Roland for a wonderful documentation tool. I'll suggest a modification on Google Code for backwards compatibility.

August 13, 2010

The Data Grid Step

Half a year ago I was happy to be involved in helping the folks from kJube get a project on the road. As the head of data integration at Pentaho I'm very happy to very occasionally be involved in the die-hard work that goes along with the implementation of BI projects because it allows me to touch base with the real world beyond the world of case tracking systems and software architectural dreams. If you can help a friend out at the same time it makes the occasion even better.

At one point it became clear that defining small data sets in Pentaho Data Integration (Kettle) was not as easy as it should be. Obviously it's nice to sit in an ivory tower and say that all information should be defined externally in a file or database table. However, setting up an XML file for a few lines of data is in a lot of cases over the top.

The question also is this: do you want to have ETL related information close to the ETL user or inside some external data source? Is it easier to manage the few lines of data in a user interface like Spoon or in an XML file somewhere? Answering those questions reveals your preference. In any case, as a result of the discussions and idea exchanges with kJube we decided that there was a need for a new "Data Grid" step in Kettle that would allow the ETL user to enter a few lines of constant data in an easy and clear way. Since ideas are often more valuable than a few lines of code, the step emerged and was thrown into the project quickly thereafter.

The use-case is the following: you have a number of company subsidiaries that all use the same database structure on different systems. Because all database structures are the same for these companies we want to create a loop in Kettle. Here is a screen shot from the actual project:

The transformation under "SMORDERCUSLOOP" generates a number of result rows. These rows are used to loop over in the subsequent jobs, which are executed in parallel.

So we need a semi-fixed list of companies and systems and we can choose to define these in an external file or database or, as is the case here, in a "Data Grid" step:


As you can see, defining the set of constant rows is very easy with the new "Data Grid" step. In fact, everyone involved liked this step so much that it was contributed to the Kettle project. It is available in version 4 and upward in the "Input" category. For version 3.2 projects we have made the source code available under the Kettle Franchise Factory (KFF) project umbrella.
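To give an idea of what the grid contains in the subsidiary loop above, the constant rows might look something like this (the company codes and connection names are invented for the example, not taken from the actual project):

COMPANY_CODE   DB_HOST          DB_NAME
BE01           subsidiary_be    ORDERS_BE
NL01           subsidiary_nl    ORDERS_NL
FR01           subsidiary_fr    ORDERS_FR

Each row becomes one iteration of the parallel jobs that follow.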

Jan & I are going to release more software that resulted from our adventures under the KFF umbrella. Follow this blog to learn more about KFF and/or more plugins in the near future. We'll also be more than happy to share project experiences at the upcoming Pentaho Community Event in Lisbon, Portugal.

Matt


August 12, 2010

Time for a time-out

When you open kettle (sorry, Pentaho Data Integration), it is so polite to re-open the transformations and jobs you had open the last time you were working. A nice feature, however... when kettle tries to open jobs/transformations with connections to a server that isn't available anymore (because I shut down my VPN or switched location), it will keep trying to find that server. And that can take a long time.

Example: Below I opened kettle with 3 jobs 'remembered'. Those jobs each contain 4 Oracle connections and 1 AS/400 connection. All connections are shared. The time that passes between launching kettle and getting the three error boxes is about 54 Mississippi :-). In the meantime you cannot do anything in the kettle interface; it doesn't react to anything until all error messages are on the screen. Knowing that I sometimes work on over 10 jobs/transformations, that sometimes makes for a very long Mississippi.

In case you are confronted with this issue from time to time, the trick to avoid this kind of behaviour is to ensure that you have no network connection at all when you start up PDI. Unplug the cable or disable your wifi for a second, and that'll solve the issue. No more frustrated waiting.


In the meantime I have posted a Jira issue to ask the folks at Pentaho to make sure the time-out happens a little bit sooner. A minor bug in one of the major data integration tools [I tend to disagree with the Gartner boyz] :-)

August 10, 2010

Release 4.666 (aka bitter pain edition)

As software grows, software grows old. Time has an impact on everything. It is inevitable, even for intangible things such as software. Sooner or later the beautiful piece of software you are working with today will start to fall apart. Maybe the aging will start to show in the user interface: little wrinkles will show that the GUI isn't able to keep up with 'fashion'. Maybe you'll notice the product getting a little slow in understanding, having a hard time grasping the latest standards, trends and concepts.

But the real aging of software lies under the skin, in the code that gets clogged up with more and more lines of extra code where maybe a rewrite was in order. Quick fixes, not exactly written as they should have been, but that no one dares to touch again afterwards. Little modifications in the code to close that sales deal with customer x. Some extra lines to make everything go smoothly on platform y. An extra call to avoid a crash if z occurs. If you have been involved in software development, you know this is all inevitable.

So sooner or later any software needs to be replaced. There is no way around that.

I believe it is clear that many of the existing (proprietary) BI/DWH tools on the market have reached the point where their product needs (or needed) a rewrite. Some examples include the following.
  • Business Objects have rewritten their immensely popular reporting product from the ground up. Business Objects 6.5 was the last "old release", Business Objects XI is the new release.
  • IBM have (finally) rewritten Ascential's Datastage (after their acquisition of the product at the beginning of this century). They have integrated it into the IBM Websphere family. Datastage 7.5 was the last release of the old breed.  Infosphere Datastage 8.0 is the new product release.
  • Oracle is coming up soon with a new version of their data integration tool. Oracle Warehouse Builder will cease to exist after 11g and will be replaced with Oracle Data Integrator (a cross-over between OWB and Sunopsis).

Probably there are a few more examples out there, but these three alone share some striking similarities:
  • All of the existing products have been around since before the year 2000, became very popular, even market leaders in the BI or data integration market segment, and gained a large customer base;
  • In each case the vendor has done a complete or significant product rewrite in order to assure they have a product that can keep up with market demands;
  • Marketing-wise, none of the vendors really chose to position the new product as such on the market; they all positioned the new software as an upgrade of the existing product. IBM chose to keep the product name and even the version numbering, Business Objects kept the name and went from version 6.5 to XI, and Oracle positions the rewrite of OWB as a merger of the best features of OWB and Sunopsis, while in reality not much of OWB will be left;
  • Last but not least, in all cases the 'upgrade' scenario doesn't boil down to an actual upgrade, but rather results in a costly (and painful) migration to new software;
  • In all too many cases customers blindly accepted the upgrade and paid the costs.

Indeed, many customers have invested so much in these technologies that they cannot or dare not imagine a life without this software. But are these customers truly locked in by the years and years of investment in this software, IF the upgrade really is a migration? If you need to migrate anyway, why not migrate to another platform?

Almost 2 years ago we published a white paper (that is still available on our website). This paper showed the cost differences between proprietary (Oracle, Microsoft) and open source (Pentaho) data integration software. (( I believe we were slightly ahead of Mark Madsen, although our study was only a three-pager of course ;) )). Anyhow, in this study we also discussed the migration scenario. (Read the white paper if you want the details.)

Basically we discussed three scenarios:
  • Scenario 1: Remain closed source, in other words, stay with your proprietary vendor and pay for the extra/new licenses you might need in the future. That means no need to rework anything. Pay license costs where needed, and build new functionalities as you need them.
  • Scenario 2: Go open source. Rip and replace! Remove your proprietary solution and replace it completely with open source. Throw away your old licenses (kind of a cost saver) but rewrite everything you have (or still need).
  • Scenario 3: Have both. Put open source next to the proprietary solution for the extra/new functionality and go for a slow migration. You are stuck with your license cost, but you don't buy new functionality (you remain on the old version of the software). In the meantime you deploy open source next to it.

The conclusion back then was that the best scenario was often to run open source alongside closed source, because rip and replace was too expensive. The migration cost of rewriting your ETL jobs (or reports) - that is, the effort of your IT resources (or even off-shore resources) - makes the business case for rip and replace very expensive in the short term. In the long run rip and replace is the cheapest, but where's the CIO that will go for a project whose ROI only materializes after 5 years?

I believe it is clear to everyone that in the case of the "painful upgrade", the above charts change a bit. Because the upgrade really is a migration cost, we'll see the red curve as well as the purple one shifting up in the first two years. Remaining with your proprietary vendor will cost a big effort from your IT resources, just as migrating to new software would. In that case, clearly option number two, rip and replace, becomes the best scenario.

So please, if you are one of those customers that feels cornered by a vendor proposing a product upgrade which actually is the release from hell: consider a real migration as a serious alternative.

August 9, 2010

The one billion dollar query

... or the cost of bad data warehouse design.
(Last week an IT resource at one of our customers told me this story.)

For some time now we have been working with a customer to improve their data warehouse architecture. The top management of this company has been ignoring the need for data warehouse solutions for at least 10 years. Sure, investments have been made, but always on a departmental or project basis. No one ever bothered looking at the big picture. Different data warehouse architects were present (or not) over time, each using different design techniques. All initiatives have resulted in so-called "information silos". The consequence is a heterogeneous mix of copies of operational tables, a thing called the 'operational data store', data warehouses of different types and OLAP cubes. The whole of this runs on an IBM mainframe combined with a Windows server running SAS 9.1.

The result of this data warehouse back-end fiasco is that end users within the company have taken things into their own hands and started building their own solutions. Some use Excel to elaborate complex analytical constructions, others have gained near-admin access to the SAS reporting server(s) so they can actually build their own OLAP cubes and reports. IT has lost control of what is going on and currently IT hardly has any feel for what the real business intelligence needs are.

For some time now we've been trying to take inventory of all the problems and issues there are today, and what they cost the company. A hard task, as many of the costs are hidden. The following "cost example" was a little gem that showed up only because one of the IT guys (a girl actually) - who has put a lot of effort into listing out all the issues and costs - bothered to manually monitor user activity for a while.

Apparently what happened is that a user had been running a query, through QMF, on a part of the data warehouse with a particularly bad data model design. The result was a query that used up around 30,000 CPU seconds, running in a job class with top priority and no limitations (in other words, the mainframe administrator believed that the job actually had the right to consume all those resources). If she hadn't stopped it, it would have continued to run (until the user bothered to end it).

JOBNAME   STEPNAME  PROCSTEP  JOBID     OWNER    C  SIO    CPU%    CPU-TIME
D003359Q  DB2CQMFB  DB2CQMFB  JOB12332  D003359  R  0.00   15.71   29112.05

I guess most of you, even the ones that never work on a mainframe, realize that mainframe computing costs are high. So does the customer. They already moved from an "individual mainframe" to a 'hosted mode' in order to save costs. And still their infrastructure costs are high.
 
Anyhow, to make a long story short, I called the hosting company to inquire what the approximate cost of this particular query might be - a query which, ironically, never returned any result because we had it killed. I'm not naming any figures, but the number I was given was as high as the net monthly salary of the IT analyst who discovered the issue.
 
Some inquiry with the users taught us that they actually launch this (type of) query regularly at the end of the month. They estimated somewhere between 5 and 10 of these queries per month. I leave the maths up to you! Anyhow, I believe this to be good material to go and talk to the management of this company and tell them a story about the benefits of good data modelling and architecture design.

August 6, 2010

Visiting customers ...

... should not exclude fun.