October 19, 2010

Table compare - test automation


Some time ago we released a series of kettle plugins as part of the Kettle Francising Factory. The plugins (kff-plugins3.jar for PDI version 3.2.3 and above and kff-plugins4.jar for PDI 4.0.0 and above) are available on code.kjube.be but sofar not much documentation has been provided.

One of the kettle steps that is available in those jar files is the Table Compare. This step does what its name says. It compares the data from two tables (provided they have the same lay-out. It'll find differences between the data in the two tables and log it. We developed this plugin for acceptance tests scenarios in large projects. Consider (as a hypothetical example :-) ) the following use case of an data integration tool migration.

Suppose you have a data warehouse which is being loaded using, well, let's say, Oracle Warehouse Builder. Now Oracle has bought Sunopsis in 2006. Since then the development of OWB has somewhat stalled :-). Sometime after Oracle has anounced the launch of a new product called ODI - Oracle Data Integrator (very appropriate name, btw). This product combines the best of both worlds (or so Oracle sales reps state) but most Oracle Warehouse Builder customers know since a long time that migrating their code from OWB to ODI is not easy. And here, out of necessity, an opportunity arises. If you are faced with a painful and costly migration, which you can only post-phone as long as your support contract allows, why not move to a cheaper data integration tool, as let's say, kettle

The above migration scenario is exactly the type of projects we do at kJube. I'm not going to go into detail on this type of projects, but one element I do want to underline here is the following: If you cannot automate user acceptance testing you can forget about doing this type of projects. The Table Compare step does exactly this.

So what does the thing do?

Conceptually the Table Compare does the following for each pair of tables you hand it.

  1. It will count the records in each table and make the result of that count available.
  2. It will do a left, right and inner join between the two tables (the counts of those statistics aren't available
  3. All the records that appear only in the right or left join are logged as 'Errors'
  4. All the records that are common between the Reference and Compare table are put through a detailed compare on attribute level. All attributes that don't match are logged as 'Errors'. 

How to use it?

Now you have gotten the conceptual explanation on the Table Compare, I guess it is time for the technical stuff. As you can see, the Table Compare step contains quite a few fields that require input.

  • The "Reference connection" and "Compare connection" are the database connections from which the reference/compare table data will come.
  • The "Reference schema field" and "Compare schema field" contain the schema names for the reference/compare table.
  • The "Reference table field" and "Compare table field" contain the actual table names. This means that you could compare two tables with a different name, as long as they have the same column names.
  • The "Key fields field" should contain a comma separated list of they fields that make up the 'primary' key of the table(s) you are comparing. The primary key is needed because without this information the two tables cannot be correctly joined.
  • The "Exclude fields field" contains a comma separated list of columns that you want to exclude from the comparison. E.g. because they exist in the first table, but not in the second.
  • The "Number of errors field" allows you to specify the name of the output column that will contain the total number of errors found for the comparison of your tables.
  • The "Number of reference/compare table records field" allows you to specify the name of the field that will contain the actual number of records found in each table.
  • The "Number of left/inner/right join errors field" allows you to specify the name of the field(s) that will contain the number of errors found for each join type.
  • The "Error handling key description input field" allows you specify the name of the output field for the 'where clause" of the record that gave an error.
  • The "Error handling reference/compare value input field" allows you to specify the output field names for the actual values that differ.

If you find all of the above pretty confusing that is understandable. There is a lot of fields but most of them have little importance. They just allow you to choose how your field will be name, but have few functional importance. Still, in order to improve your understanding of the subject, we thought an example was in place.

In order to show you the example we needed some tables in an online database that we can compare. We found the information contained in the Ensembl project very suitable for this purpose. What is the project about?
The Ensembl project produces genome databases
for vertebrates and other eukaryotic species,
and makes this information freely available online.
Basically this project has a large amount of databases (one per species) that all have a similar structure. Perfect for our purpose. There are plenty of species available for comparison, but we picked:

We just picked two tables from each database and put them through the Compare Table step for demonstration purposes. The transformation is shown below (and is also available for download).
As the first step we used the data grid step to decide on which tables to run through the Compare step.

Previewing the first output ('Comparison Statistics') delivers the following:
It shows that both the 'analysis' and 'attrib_typ' table have a different number of records for the human vs chimp database. (Luckily?)

Previewing the second output ('Comparison Error Details') shows some details on the actual differences (in this case the inner join details).
Clearly the record with analysis_id=2 has different values for ALL columns in the table.

Hopefully this sample helps to understand what the Table Compare can do. The best way to see it is to download the .ktr and give it a spin. We'll also add the .ktr to the KFF project as a project template. So you'll also find the code in the next KFF release.


We know already that the following improvements would be handy:
  • Allow the connections to be field names that are accepted from the previous step. That would allow to do testing across more than two connections.
  • Save the following statistics:
    • nrRecordsInnerJoin
    • nrRecordsLeftJoin
    • nrRecordsRightJoin
  • The 3 error fields are currently expected as input fields in the step, this should be corrected. Also their name might be more appropriately be output field :-)
... but if you have further suggestions to improve this step, please let us know.

October 18, 2010

KFF & Cookbook

When we (Matt and myself) kicked off the KFF project, one of the things we wanted to include was auto-documentation. Matt made a small proof of concept to generate PDF documentation but that didn't reach finalization. At the same time, while Matt, Roland and Jos were busy writing Pentaho Kettle Solutions (Sorry, what were you thinking? You do not own a copy of this book?), the subject of auto-documentation came up too. Roland picked it up, and turned it into the kettle cookbook, a great auto-documentation tool for kettle.

Why did we want auto-documentation in the first place?
The reason we wanted auto documentation is that we are lazy. We know this. Actually we  have known this since a long time. So we needed a solution that would minimize effort on our side.

Also it often turns out that customers do not really want to pay for documentation. They just see that as part of the development process and certainly don't want to pay for any time you put into documentation. So minimizing effort is also keeping costs low, which seems to please customers.

Another reason for wanting auto documentation is that over the years we witnessed projects where the documentation of data integration  code was something in the line of a gigantic word document filled with screenshots - those very bulky screenshots made using MS (com)P(l)AINT. Obviously that kind of documentation stays-up-to date until 5 minutes after the next error in your data integration run. And stale documenation is even worse than no documentation.

So what was the way to go?
We quickly concluded that something or some one had to document the code for us - continuously. Since you cannot outsource everything to India; and since mind-reading solutions aren't just there yet, we thought along the lines of generating documentation from the code itself. What we didn't know is that it could even get better, and that Roland would write the auto-documenation tool and in doing so really minimized effort for every one.

About kettle documentation possibilities 
Now before zooming in on the cookbook, I would light to high-light some nice documentation features that are in kettle since quite some time. The examples below are taken from KFF, namely from the batch_launcher.kjb.

1) Job/Transformation properties
Ever kettle job or transformation has a serie of meta-data tags on the properties tab, accessible by right clicking on the canvas of spoon (or through the menu).

The available tags are the following:
  • Name: The name of the transformation/job. This name doesn't need to be equal to the physical name of the XML file in which you want to save the code, although not aligning the physical
  • Description: A short description. Short as in: fits on one line.
  • Extended description: A full text description
  • Status: Draft or Production
  • Version: A free text version number
  • Created by/at: ID of creator and timestamp of creation
  • Modified by/at: ID of modifier and timestamp of modification

To my experience this gives quite a few fields to stick in some elementary descriptions of functionality.

2) Canvas notes
First of all the fact that there are really no lay-out restrictions in how you organize a data integration job or transformation is a strong documentation feature by itself. Many ETL tools will oblige you to always work left to right, or oblige you to always see every step on attribute level. Often that makes the view a developer has of the canvas, well, not much of an overview. In kettle you do not run into that issue. 

Because of the fact that you can design jobs in a modular way (using sub-jobs), you can also ssure that you never need to design a job/transformation that looks like  the one below .  (For the record: I didn't design the below transformation myself.)  Obviously now I'm stating that a good data integration design, makes documentation readable, which is a bit beyond pure documentation functionality, but still, it is an important thing to consider when thinking about auto-documenting your solution.

On top of the great lay-out possibilities, you can insert notes on the canvas of any job/transformation . They allow for free text comments (without lay-out possibilities). This is good to document things that still need finalizing, to highlighte certain elements of your job/transformation, important remarks like 'don't ever change this setting', etc.

Although the notes aren't actually linked to any steps, the vicinity of a note to a step is good enough to show what step the comment actually belongs to. And in case you really want to link your comments to specific steps there also are 'Step descriptions'.

3) Step descriptions
Step description are available through a simple right click on the step you want to document.
A step description dialog opens up and you can take down any comments related to the step you clicked in free text format (no formatting).

All in all, kettle as a tool, has great lay-out possiblities and sufficient documentation 'place holders' to stuff your comments in. The next thing is to get that information back out.

The Cookbook
As I wrote in the intro of this post, Roland Bouman put together an auto-documentation tool for kettle during the writing of Pentaho Kettle Solutions. He presented this to the Pentaho Community even before the release of the book, both in a webcast as well as on the Pentaho Community Gathering 2010 in Cascais (presentation here).

What does the cookbook do? Well, basically it will read all kettle jobs and transformations in a specific directory (INPUT_DIR) and generate html documentation for this code in another directory (OUTPUT_DIR) using the software you already have installed, namely kettle. In other words, if you are a kettle user, you just need to tell the cookbook code where our code is and where you want to documentation to go. I'm not sure if it could get more simple than that. Yet, as far as I know this is the only data integration tool that actually is capable of auto-documenting itself

Cookbook features
The feature I'd like to show is that all your code is transformed into html pages which maintain the folder structure that you might have given to your project. In my example I've auto-documented the /kff/reusable folder, which looks like this:
So basically per job/transformation you have 1 html page, which is located in a directory structure that matches perfectly your original directory structure. Plain and simple.

Obviously the tree view shown here is clickable and allows you to navigate directly to any job/transformation you might want to explore.

On each page (for each job/transformation) quite an extensive amount of information is listed out. First you find the  meta-data tags from the properties tab. The below screenshot matches the batch_launcher.kjb properties as shown above. Note that the fields "version" and "status" aren't exported for some reason but apart from that all the fields are there.

After the meta-data elements, the parameters a job might expect are listed out. In case of our batch_launcher.kjb these are the following. Since the named parameters are quite important for the understanding of a transformation, it is appropriate they are listed on top of the page.

Next you'll find an export of the actual canvas you see in spoon in your documentation, including all the notes. Now this is true magic. The screenshots in the documentation are exactly like what you see on the canvas in spoon. And the steps are clickable. The'll bring you right to the job or transformation that the step refers to, or to the description of the step. In other words, you can drill down from jobs to sub-jobs to transformations to steps as you would in spoon. That is no less than amazing!
The step descriptions themselves are listed lower on the page. In the below screenshot you'll see the step descriptions we entered for the step 'kff_logging_init' before. (Note that page breaks are lost.)

However if you look at the step descriptions that do not just launch another job or transformation you even get some of the actual code. Look at this table input step where you actually get the SQL code that is executed.

All in all, the cookbook generates amazingly detailed documentation. In case you aren't convinced by the screenshots and explanation above, please check for yourself below (or in a full browser window).

KFF & Kettle-Cookbook
After the above explanation it doesn't need much clarification that integrating KFF and the cookbook was peanuts. The KFF directory structure is clear.
/kff/projects/my_customer/my_project/code  -->contains your code
/kff/projects/my_customer/my_project/doc  --> contains the documentation
So the INPUT_DIR and OUTPUT_DIR for connecting the cookbook to KFF are clear. The only thing needed was to add a step to the batch_launcher.kjb which called the top level job of the Cookbook and pass it two variables.

As I said, it was extremely simple to connect KFF to Cookbook

So from our next release on, if you download and install KFF, you'll automatically have a download of the Kettle-Cookbook in there, and whether you want it or not, all your projects will be auto-documented. You just need to figure out how to share the /kff/projects/my_customer/my_project/doc directory with people who would actually like to read the manual.

A big thanks to Roland!

October 14, 2010


Thank you Google people for throwing this message in our faces several times a day!!  Unfortunately some of our clients are large organisations (banks, government, etc) who really believe that Internet Explorer 6 was the end of web browsing evolution. So please Google people, stop developing for the web, there really is no future in that.  )-:

October 12, 2010

PCG10 in figures

PCG10, the Pentaho Community Meeting 2010 in Lisbon is over (safe to say after over 2 weeks), and attention on the live blog is slowly dying. So I thought, full attention on KFF now.

But Paul Stoellberger twittered the following statistics yesterday:
which made me think of looking up some of the Google Analytics statistics to share. After all, we are into business intelligence, aren't we?

So what has been the interest in the PCG10 presentations I added to the live blog post?

Here's the general statistics for the first 2 weeks:

Obviously the first days was highest, dropping after the 5th day. However, even though we had only 15 page views or so per day after that initial period, people still spend some time reading the pages. I'm curious how long that attention will last.

So where are the visitors from. Well, mainly it would seem that the Pentaho Community Event is a American-European party. Attention from other continents is extremely limited.

Top 20 countries (out of 37) are listed below. I was mostly amazed by Brazil. I would have expected more attention from that country, given the push to organize PCG next year :-)  And I was pleasanty surprised to see that adding the Dutch and Belgian contribution together (no contributions from Luxemburg) would put the Benelux in third position. Small countries, ahoi!

And further, well, no more comments, let the figures speak for themselves I would say !

October 8, 2010

Pentaho Kettle Solutions Overview

Dear Kettle friends,
Great news! Copies of our new book Pentaho Kettle Solutions are finally shipping. Roland, Jos and myself worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.
Book front
So let's take a look at what's in this book, what the concept behind it was and give you an overview of the content...
The concept
Given the fact that Maria's book called Pentaho Data Integration 3.2 was due when we started, we knew that a beginners guide would be ready by the time that this book was going to be ready. As such we opted to look at what the data warehouse professional might need when he or she would start to work with Kettle. Fortunately there is already a good and well known check-list out there to see if you covered everything ETL related and it's called The 34 subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit. And so we asked Mr Kimballs permission to use his list which he kindly provided. He was also gracious enough to review the related chapter of our book.
By using this approach we allow the users to flip to a certain chapter in our book and directly get the information they want on the problem they are facing at that time. For example, Change Data Capturing (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.
In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point. In fact, there are plenty of steps we did not cover in this book. However, everywhere a step or feature needed to be explained while covering all the sub-systems we did so as clearly as we could. Rest assured though; since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)
The content
After a gentle introduction into how ETL tools came about and more importantly how and why Kettle came into existence, the book covers 5 main parts:
1. Getting started
This part starts with the a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained. This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that the installation and configuration of Kettle is covered. Since the installation is a simple unzip, that includes a detailed description of all the available tools and configuration files.
Finally, you'll get hands-on experience in the last chapter of the first part titled "An example ETL Solution - Sakila". This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.

2. ETL
In this part you'll first encounter a detailed overview of the 34 sub-systems of ETL after which the art of Data Extraction is covered in detail. That includes extracting information from all sorts of file types, databases, working with ERP and CRM systems, Data profilng and CDC.
This is followed by chapter 7 "Cleansing and Conforming" in which the various data cleansing and validation steps are covered as well as error handling, auditing, deduplication and last but not least scripting and regular expressions.
Finally this second part of the book will cover everything related to star schemas including the handling of dimension tables (chapter 8), loading of fact tables (chapter 9) and working with OLAP data (chapter 10).

3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution. First you'll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13) and lineage and auditing (chapter 14). As you can guess from the titles of the chapters, a lot of best practices, do's-and-don'ts are covered in this part.

4. Performance and scalability
The 4th part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17) and real-time data integration (chapter 18).
It's personally hope that the book will lead to more performance related JIRA cases since chapter 15 explains how you can detect bottlenecks :-)

5. Advanced topics
The last part conveniently titled "Advanced topics" deals with things we thought were interesting to a data warehouse engineer or ETL developer that is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20) or web services (chapter 21). Indispensable in case you want to embed Kettle into your own software is chapter 22 : Kettle integration. It contains many Java code samples that explain to you how you can execute jobs and transformations or even assemble them dynamically.
Last but certainly not least since it's probably one of the most interesting chapters for a Java developer is chapter 23: Extending Kettle. This chapter explains to you how you can develop step, job-entry, partitioning or database type plugins for Kettle in great detail so that you can get started with your own components in no time.
I hope that this overview of our new brain-child gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time and money I'm sure there will be the occasional typo or lack of precision but rest assured that we did our utmost best on this one. After all, we did each spend over 6 months on it...
Feel free to ask about specific topics you might be interested in to see if they are covered ;-)
Until next time,

October 4, 2010

Twitter real-time BI

Something which a lot of you people already knew, but which I just discovered, is that my (and your) Tweets are being watched. Not in 'big brother is watching you', but rather as 'big marketeer is studying you'.

What happened:

This week I made a quick twitter reply to the following post of @magicaltrout in which he was complaining his employer didn't pay him (yet).
I made a quick reply to this that went as follows:
Next thing that happened - within minutes - I had a new follower on Twitter. Check this.
Clearly some companies scan twitter to figure out whether you are a potential customer, then add you to their followers list, probably with the idea of following up even closer, or to start spamming you.

When I looked into the Twitter account, it actually belonged to a Fastmkn.com, a company that actually sets up this kind of services for their customers. Their website was clear about their services:

So, when you tweet, know your tweets are being 'mined', watch out what your write, and make sure you have your hand on the anti-spam button.

October 3, 2010


September 30, 2010

PCG10 KFF presentation

Unfortunately I wasn't able to blog my own presentation live during the Pentaho Community Gathering in Cascais, Portugal (PCG10), last saturday. The Live Blog page drew a lot of attention, during and after the event - statistics will follow - and I even got a few times the question whether I would still write a summary of the KFF presentation to go with the slides. Well, I will do no such thing! Instead however ...

Since the whole objective of our presentation was to somehow "launch" KFF - except for a handfull of insiders, no one within the Pentaho Community heard about KFF before - I taught it would be worthwile to write a full walk-through of the presentation for all the persons that might visit the 'PCG10 Live Blog'. So here it goes, no summary, but the full presentation in blog format, plus some little extra's at the end. Enjoy.

KFF, as presented at the Pentaho Community Gathering 2010

As the title slide of my presentation suggests, KFF is all about Pentaho Data Integration, often better known as kettle. KFF has ambitions to be an exciting addition to the existing toolset kettle, spoon, kitchen (all clearly visible in the picture) and at the same time be a stimulator for improvement of these tools.

Why oh why?

Any, and I mean any, consultant that has worked at least once with a data integration tool, be it Informatica, Datastage, MS Integration Services, Business Objects Data Integration, Talend (somehow forgot to name this one at PCG10), Sunopsis - Oracle Warehouse Builder - Oracle Data Integration, has been confronted with the fact that some elementary things are not available out of the box in any of these tools. I think about:
  • A job/transformation logging  without set-up or configuration
  • Integrated alerting mechanisms (for when things go wrong)
  • Integrated reporting 
    • as part of the alerting or 
    • just to understand the health of your data integration server
  • Guideliness for a multi-environment (DEV, TST, UAI, PRD) set-up 
  • Easy code versioning and migration between environments
  • Automated archiving of your code
  • ... etc
After some years - too many I would say - I came to the conclusion that whatever the data integration technology, I was always rewriting the same concept over and over again. And all ustomers have seemed more than happy with the "frameworks" I've built. So I started wondering, how it was possible that data integration vendors were not covering the above requirements with a standard solution, if the requirements are the same across all customers. I took the discussion up with Matt, and he felt the same.

Once we realized this, re-implementing the same concepts again and again became hard to bare.

Luckily Matt and myself had the chance to do a lot of projects together, using kettle, and we started building something we could re-use on our projects back in 2005. With every new project we did with kJube, our 'solution' grew, and we got more and more conviced that we needed to share this.

So in June 2010 we listed out all we had and decided to clean the code and package a first version of what we had to show and share on the Pentaho Community Gathering.

We noticed soon that the first version couldn't include nearly all we had ready. What we present at PCG10, is just a basic version to show you all what were doing. The whole release schedule willt ake until january 2011, if new additions or change requests interfere.

So what is KFF?

We decided to call our solution the Kettle Franchising Factory.

Franchising seemed a nice term because that remained nicely within the existing kettle, spoon, kitchen, chef, carte, etc metaphor. It indicates that the KFF objective is to scale up your data integration  restaurant to multiple 'locations' where you cook the same food. That's basically what we want. Make kettle deployments multi environment, multi customer, whilst keeping the set-up standard.

The term Factory refers to the fact that we want every part of the process to go as  speedy and automatic as possible. This factory contains all the tools to deploy kettle solutions as swift as possible.

The tools through which we reach those goals are several:
  • Some of the requirements we meet through proposing set-up standards. We try to make as few things dependend on standards or guideliness, everything should be configurable, but large data integration deployments stay neat and clean only if some clear set-up standards are respected. Also, standards on parametrization need to be imposed if you want to make your code flexible enough to run on multiple environments without further modifications.
  • A lot of functionality is implemented using reusable kettle jobs and transformations, often using named variables.
  • Quite a few kettle plugins have been written too. We believe that when certain actions can be simplified by providing a kettle plugin, that we should provide that plugin.
  • Up to now we have 4 project templates we want to include with the KFF. Some "projects" always have the same structure if one follows best practices, so why should we rewrite things.
  • Scripting. Although limited, there is also some scripting involved in KFF.

So let's go into details

A first element of the KFF we want to show is the 'batch_launcher.kjb'. This kettle job is designed to be a wrapper around your existing ETL code or one of the templates we'll ship with KFF. The objective is make all calls to re-usable logic as logging, archiving etc in this wrapper without the need to modify your code.

What does this job do (as of today):
  1. The first step of this job will read the right configuration file(s) for your current project/environment. For this we've developped a step called the 'environment configurator'. So based upon some input parameters, the environment configurator will override any variables that (might) have been see in kettle.properties to ensure that the right variables are used.
  2. The job 'kff_logging_init' will
    1. create logging tables (in case they didn't exist yet), currently on MySQL or Oracle,
    2. clean up logging tables in case there should be data in there
    3. check whether the previous run for this project finished (succesfully)
    4. creates a 'batch run'
  3. The next job calls one of our project templates currently the datawarehouse template but can easily be replaced by the top level job of your data integration project
  4. After the data integration code has finished, 'kff_logging_reports' generates standard reports on top of the logging tables . The reports are kept with the kitchen logs.
  5. 'kff_logging_archive' 
    1. closes the 'batch_run' based on results in the logging tables and
    2. archives the logging tables (more on that later)
  6. 'kff_backup_code' makes a zip file of the data integration code which is tagged with the same batch_run_id as the kitchen log file and the generated reports.

How does the environment configuration work, and why is it necessary? Well, the fact kettle standard only provides one kettle.properties file in which to put all your parameters is kind of limiting to setting up a multi-environment kettle project. The way you actually switch between environments in a flexible way is actually by changing the content of variables. So we created the environment configurator. I'm not gonna elaborate on this again, since I've blogged about this plug-in in august when we first released it. I believe that blog-post elaborates more than enough on the usage of this step.

Obviously the environment configurator is something that works when you execute code through kitchen, that is in batch mode. However whenever you fire up spoon, it will just read the properties files in your $KETTLE_HOME directory. In order to overcome the problem also in the development interface.

Consequently, if you have correctly set up your configuration files, the kff_spoon_launcher.sh [No windows script available yet. We do accept contributions from people running Windows as OS.] will automatically set the right configuration files at run time and fire up spoon on the environment you want. As a little addition, nothing more than a little hack, we also change the background of your kettle canvas. That way you see whether you are logged on in DEV, TST, UAI or PRD, which is good to know when you want to launch some code from the kettle interface.

So how about that system to create logging tables? Well, the logging tables we use are the standard job, transformation and step logging tables. We tried to stick as much to the existing PDI logging and just add on top of that. 

What did we add:

  • We implemented the concept of a batch logging table. For every time you launch a batch process, in this table a record will be logged that covers your whole batch run. In casu it will log the execution of the top level job. So yes, this is nothing but job logging, but since the top level job has a specific meaning within a batch process, isolating it's logging opens up possibilities.
  • We also implemented the concept of a rejects logging table. Kettle has great error handling, however one feature we felt was missing is to standardize that error handling. Our reject plug-in merges all records that have been rejected by an output step into a common format and inserts them into our reject logging table. The full records is preserved, so information could theoretically be reprocessed later. [Question Pedro Alvez: "Is the reprocessing part of KFF? Answer: No, since we don't believe automation of that is straight forward enough.]
  • Logging tables are created on the fly. Why? Well, whenever you are running your jobs/transformations on a new environment you get that nasty errors that your logging tables don't exist. Why should you be bothered with that. If they don't exist, we create them. 

  • Creating the logging tables on the fly wasn't just done because we like everything to go automatically. Suppose you would want to run two batch processes in parallel. In a set-up with a single set of logging tables your logging information would get mixed up. Not in our set-up. You can simply define a differrent set of logging tables for the second batch run and your logging stays nicely separated. 
  • Obviously to implement the above, you need to be able to change your logging settings in your jobs and transformations at run-time. For this Matt has written some nifty logging parameters injection code that actually injects the log table information and log connection into the jobs and transformations. More about that on the next slide. 
  • At the end of the batch run we also archive logging information. Even if you have been using different sets of logging tables, all information is merged back together, allowing historical reporting on your data integration processes. Also, the archive tables avoid that your logging tables fill up and make the kettle development interface become sluggish when visualizing the logging.

The rejects step isn't the only plug-in we have written over the last few years. The next slide illustrates some other steps that have been developed.

  • Trim strings
  • Date-time calculator
  • Table compare 

I have blogged about these steps before, so again, I will not write this out again. Also, one step that isn't mentioned here, but which we developed too, and has been contributed back to kettle 4.0 is the data grid step.

Another aspect of KFF are project templates. For the moment, what we have is rather meager - only the datawarehouse template is available -, but we do have quite some stuff in the pipeline that we want to deploy.

  • The datawarehouse template should grow out to be a 'sample' datawarehouse project containing lots of best practices and possibly a lot of reusable dimensions (as in date dimension, time dimension, currency dimension, country dimension, ...) 
  • The data vault generator is a contribution from Edwin Weber which came to us through Jos van Dongen. We are still looking into how we can add it. But it seems promissing.
  • The campaign manager is a mailing application, also know as norman-mailer, which we use internally at kJube. It allows you to easily read out a number of email addresses, send mails, and capture reponses from POP3.
  • The db-compare template does an automatic compare of the data in a list of tables in two databases. It will log all differences between the data in the two tables. It is something we've used for UAI testing when we need to prove our customer that UAI and PRD are alligned.

  • After the presentation Roland Bouman came to me with a great idea for another template. I will not reveal anything as he has his hands full with the cookbook for the time being, and we are busy with KFF. When the time is ripe, you'll hear about this template too.

So to sum it all up: KFF pretends to be a big box, with all of the below contents.

We don't expect all of this to be in there from day one. Actually the day we presented KFF at PCG10 was day 1, so have some patience and let us add what we have over the next months.

How will KFF move forward?

Well we believe the first step was releasing something to the community. We'll keep on doing that. The code for the project is fully open source (GPL) and available no Google Code. Check code.kjube.be or go to the kettle-franchising project on google code. We'll listen to your feedback and adapt where possible!

Also we'll follow these basic guidelines:

  • Make KFF deployment as simple as possible. As is simple as a kettle deploy is impossible since kettle is deployed within KFF, but if you know kettle, you know what we mean.
  • We also believe that some of the functionality we have built doesn't belong in KFF but rather in kettle itself. We'll push those things back to Pentaho. (We'll try to find time to discuss with the kettle architect :-) )
  • If and when something should be automated/simplified in a plug-in we'll do so. 
  • We believe we should integrate with other project around kettle, as the cookbook. 

Who's in on this?

For the moment, Matt and myself are the drives behind this "project". Older version have been in production with some of kJube's customers for years. Sometimes they contribute, sometimes they are just happy it does what it should.

We hope to welcome a lot of users and contributers over next months.


is always welcome!


September 29, 2010

PCG10 on PCG11

Since I've been blogging so much on PCG10, I thought it was worth to quickly gather all the ideas that I heard on PCG11 at PCG10. Probably I was able to catch only a fraction of them, but I count on the community to add the rest of the ideas as comments (or on the Pentaho Forums and Wiki, where they belong in order to get a good discussion going).

Improvement idea nbr 1: shorter twitter tag
I'm not sure if this idea needs much explanation or even discussion? The hash tag PentahoMeetup10 was annoyingly long for people tweeting from the room. If the name of the Pentaho Community Gathering doesn't change (see improvement idea 2) I vote for #PCG11.

Improvement idea nbr 2: rename the Pentaho Community Gathering
I've heard rumours (@julianhyde) that the name Pentaho Open World would become the new name. I'm not sure if this is a strategic hint to L. Ellison to make Oracle buy Pentaho. But then again, Julian  knows more about the company strategy than I do.

Improvement idea nbr 3: location
THE QUESTION at PCG10 was: "How will we ever top this location?". Indeed Cascais and surroundings were an amazing location! The hotel where the event was held, with a view over the sunny beach was fantastic! A tower of cupcakes and sweets in the room! Private espresso machines for PCG10! Nice little restaurants with fresh fish dishes right outside the door for a great lunch break! Webdetails and Xpand-IT even managed to organize a cloudless sky and +25°C temperatures. 

The question to ask might well be: "Will any one dare to organize PCG11?"

So how go about setting up PCG11? Do a poll in the community on prefered locations? Asking different contenders for a proposal and let the best win? I don't have the answer, but the organizer of PCG11 will have a hard time topping PCG10, that is for sure.

A name that came up quite a lot for PCG11, was Brazil. Quite a few Brazilians have been following the live blog, and it seems that there's a very active Pentaho Community out there. But Summer in Brazil is coming real soon, which brings us to improvement idea nbr 4.

Improvement idea nbr 4: timing

Another interesting idea mentioned was to go from a yearly to a half-yearly event. That would perfectly fit with PCG switches between the Northern and Southern hemisphere. We could have a PCG11-South and a PCG11-North. Or Up and Down? Or ... well whatever.

Improvement idea nbr 5: presentation formats
On PCG10, all 15 presentations were either slideshows and/or demo's with exception of Dan's (codek1) "presentation". He just went up front with a short prepared speech actually interviewing the audience on methodology. That resulted into a very interesting groups discussion. 

So this might actually lead to an idea for varying the format of presentations. Some idea's I heard:
  • groups discussions on a specific topic (who's interested can participate)
  • a "what sucks session" (proposed by Grumpy)
  • architecture sessions around a white board
There are many possibilities, and with 15 presentations in one day (a heavy schedule) a change of format is most welcome.

Improvement idea nbr 6: extending PCG to a full OSBI event
Based on Aaron's presentation at PCG10, where the idea was discussed that Pentaho BI server is more and more becoming a Pentaho BI APPLICATION server, which might/will also support JasperReports, BIRT, etc the discussion to invite also these open source projects to the community table. PCG would then become a true open source BI event. Personally I find that a very challenging idea, however it does raise some practical questions. PCG is growing quickly as it is, adding even more momentum would make organisation of the event just too tough for any of the partners to take on? How to balance the agenda between Pentaho / Non-Pentaho stuf, after all it's a Pentaho sponsored event?  

So, that's it, I've posted whatever I remembered from the nice talks with great people, if a faboulous surrounding, accompanied wit great food, wine and beer. It's the best way these ideas won't fade. Most of these ideas come from great minds and fine people in the community. I hope posting them helps to stimulate the discussion. One thing is sure: PCG will keep on getting better.

    PCG10 participants

    Three days after the event, some mails are going around, trying to reconstruct the "who was who". Indeed the Pentaho Community Event is growing, and I believe many discovered only when seeing the group picture that they didn't get round to meet quite a few of the participants.

    I discovered too that I missed to opportunity to get to know some people. So based on the mails that have been circulating and a 'tagged' group picture (thank you Jens Bleuel) I'm trying to put together the PCG10 Participant list.

    It is work in progress, so please people, help me sticking the right name to the right person. Also, should any one rather remain anonymous, drop me a mail, I'll rename you to Mr X(action).

    Nbr Name (First name - Last name) Twitter from
    1 Roland Bouman @rolandbouman .nl
    2 Matt Casters @mattcasters .be
    3 Håkon Torjus Bommen .no
    4 Marco Gomes .pt
    5 Jos van Dongen aka Grumpy @josvandongen .nl
    6 Carlos Amorim .es
    7 Jens Bleuel .de
    8 Pedro Alves @pmalves .pt
    9 Nikolai Sandved @NikolaiSandved .no
    10 Gunter Rombauts .be
    11 Jochen Olejnik .de
    12 Tom Barber @magicaltrout .uk
    13 Nuno Brites .pt
    14 David Duque .pt
    15 Slawomir Chodnicki @slawo_ch .de
    16 Pedro Pinheiro .pt
    17 Nelson Sousa .pt
    18 Nuno Severo .pt
    19 Paula Clemente .pt
    20 Pedro Martins .pt
    21 Samatar Hassan .fr
    22 André Simões @ITXpander .pt
    23 Rui Gonçalves .es
    24 Dan Keeley @codek1 .uk
    25 Anthony Carter .ir
    26 Julian Hyde @julianhyde .us
    27 Rob van Winden .nl
    28 Pompei Popescu .ro
    29 Jan Aertsen @jan_aertsen .be
    30 Ingo Klose @i_klose .de
    31 Sergio Ramazzina @serasoftitaly .it
    32 Martin Stangeland .no
    33 Dragos Matea .ro
    34 Juan José Ortilles .es
    35 Paul Stoellberger @pstoellberger .at
    36 Doug Moran aka Caveman @doug_moran .us
    37 Thomas Morgner .us
    38 Kees Romijn .nl
    39 Aaron Phillips @phytodata .us

    ... and the following are people that were present, but somehow dropped out of the group picture. Maybe they were on the beach?

    Picture Name (First name - Last name) Twitter from
    Nuno Moreira @webdetails .pt
    Bart Maertens @bartmaer .be
    Juliana Alves .pt