October 19, 2010

Table compare - test automation

Background

Some time ago we released a series of kettle plugins as part of the Kettle Franchising Factory. The plugins (kff-plugins3.jar for PDI version 3.2.3 and above and kff-plugins4.jar for PDI 4.0.0 and above) are available on code.kjube.be, but so far not much documentation has been provided.

One of the kettle steps available in those jar files is the Table Compare. This step does what its name says: it compares the data from two tables (provided they have the same lay-out), finds the differences between the data in the two tables and logs them. We developed this plugin for acceptance test scenarios in large projects. Consider (as a hypothetical example :-) ) the following use case of a data integration tool migration.

Suppose you have a data warehouse which is being loaded using, well, let's say, Oracle Warehouse Builder. Oracle bought Sunopsis in 2006, and since then the development of OWB has somewhat stalled :-). Some time later Oracle announced the launch of a new product called ODI - Oracle Data Integrator (very appropriate name, btw). This product combines the best of both worlds (or so Oracle sales reps state), but most Oracle Warehouse Builder customers have known for a long time that migrating their code from OWB to ODI is not easy. And here, out of necessity, an opportunity arises. If you are faced with a painful and costly migration, which you can only postpone as long as your support contract allows, why not move to a cheaper data integration tool, such as, let's say, Kettle?

The above migration scenario is exactly the type of project we do at kJube. I'm not going to go into detail on this type of project, but one element I do want to underline here is the following: if you cannot automate user acceptance testing, you can forget about doing this type of project. The Table Compare step does exactly that.

So what does the thing do?

Conceptually the Table Compare does the following for each pair of tables you hand it (a rough code sketch follows the list).

  1. It will count the records in each table and make the result of that count available.
  2. It will do a left, right and inner join between the two tables (the record counts for those joins aren't available yet; see the improvements list below).
  3. All the records that appear only in the left or right join are logged as 'Errors'.
  4. All the records that are common between the Reference and Compare table are put through a detailed compare on attribute level. All attributes that don't match are logged as 'Errors'.
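
For readers who prefer code over prose, here is a minimal Java sketch of that logic, using two small in-memory maps instead of real database tables. The table contents and column names are made up purely for illustration; the actual step of course reads both tables from the configured database connections.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Objects;

    // Minimal sketch of the Table Compare logic on two in-memory "tables"
    // (primary key -> column values). All names and values are illustrative.
    public class TableCompareSketch {
        public static void main(String[] args) {
            Map<String, Map<String, String>> reference = new LinkedHashMap<>();
            Map<String, Map<String, String>> compare = new LinkedHashMap<>();
            reference.put("1", Map.of("logic_name", "ensembl", "db", "core"));
            reference.put("2", Map.of("logic_name", "genscan", "db", "core"));
            compare.put("2", Map.of("logic_name", "genscan_b", "db", "otherfeatures"));
            compare.put("3", Map.of("logic_name", "cpg", "db", "core"));

            // 1. Count the records in each table.
            System.out.println("reference count: " + reference.size());
            System.out.println("compare count:   " + compare.size());

            long errors = 0;

            // 2/3. Keys that only exist on one side (the left/right join errors).
            for (String key : reference.keySet()) {
                if (!compare.containsKey(key)) {
                    System.out.println("ERROR: key " + key + " is missing in the compare table");
                    errors++;
                }
            }
            for (String key : compare.keySet()) {
                if (!reference.containsKey(key)) {
                    System.out.println("ERROR: key " + key + " is missing in the reference table");
                    errors++;
                }
            }

            // 4. Attribute-level compare for keys present in both tables (inner join errors).
            for (Map.Entry<String, Map<String, String>> entry : reference.entrySet()) {
                Map<String, String> cmpRow = compare.get(entry.getKey());
                if (cmpRow == null) {
                    continue;
                }
                for (Map.Entry<String, String> column : entry.getValue().entrySet()) {
                    String refValue = column.getValue();
                    String cmpValue = cmpRow.get(column.getKey());
                    if (!Objects.equals(refValue, cmpValue)) {
                        System.out.println("ERROR: key " + entry.getKey() + ", column " + column.getKey()
                                + ": reference='" + refValue + "' vs compare='" + cmpValue + "'");
                        errors++;
                    }
                }
            }

            System.out.println("total number of errors: " + errors);
        }
    }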

How to use it?

Now that you have the conceptual explanation of the Table Compare, I guess it is time for the technical stuff. As you can see in the step dialog, the Table Compare step contains quite a few fields that require input (an illustration of the incoming fields follows the list).


  • The "Reference connection" and "Compare connection" are the database connections from which the reference/compare table data will come.
  • The "Reference schema field" and "Compare schema field" contain the schema names for the reference/compare table.
  • The "Reference table field" and "Compare table field" contain the actual table names. This means that you could compare two tables with a different name, as long as they have the same column names.
  • The "Key fields field" should contain a comma separated list of they fields that make up the 'primary' key of the table(s) you are comparing. The primary key is needed because without this information the two tables cannot be correctly joined.
  • The "Exclude fields field" contains a comma separated list of columns that you want to exclude from the comparison. E.g. because they exist in the first table, but not in the second.
  • The "Number of errors field" allows you to specify the name of the output column that will contain the total number of errors found for the comparison of your tables.
  • The "Number of reference/compare table records field" allows you to specify the name of the field that will contain the actual number of records found in each table.
  • The "Number of left/inner/right join errors field" allows you to specify the name of the field(s) that will contain the number of errors found for each join type.
  • The "Error handling key description input field" allows you specify the name of the output field for the 'where clause" of the record that gave an error.
  • The "Error handling reference/compare value input field" allows you to specify the output field names for the actual values that differ.
Example

If you find all of the above pretty confusing, that is understandable. There are a lot of fields, but most of them have little functional importance: they just allow you to choose how the output fields will be named. Still, in order to improve your understanding of the subject, we thought an example was in order.

In order to show you the example we needed some tables in an online database that we could compare. We found the information contained in the Ensembl project very suitable for this purpose. What is the project about?
The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
Basically this project has a large number of databases (one per species) that all have a similar structure: perfect for our purpose. There are plenty of species available for comparison, but we picked the human and chimpanzee databases.

We just picked two tables from each database and put them through the Table Compare step for demonstration purposes. The transformation is shown below (and is also available for download).
As the first step we used a Data Grid step to decide which tables to run through the Table Compare step.

Previewing the first output ('Comparison Statistics') delivers the following:
It shows that both the 'analysis' and 'attrib_type' tables have a different number of records for the human vs the chimp database. (Luckily?)


Previewing the second output ('Comparison Error Details') shows some details on the actual differences (in this case the inner join details).
Clearly the record with analysis_id=2 has different values for ALL columns in the table.

Hopefully this sample helps you understand what the Table Compare can do. The best way to see it in action is to download the .ktr and give it a spin. We'll also add the .ktr to the KFF project as a project template, so you'll find the code in the next KFF release as well.

Improvements

We know already that the following improvements would be handy:
  • Allow the connections to be specified through field names coming from the previous step. That would make it possible to do testing across more than two connections.
  • Save the following statistics:
    • nrRecordsInnerJoin
    • nrRecordsLeftJoin
    • nrRecordsRightJoin
  • The 3 error fields are currently expected as input fields in the step; this should be corrected. Also, they might more appropriately be called output fields :-)
... but if you have further suggestions to improve this step, please let us know.

October 18, 2010

KFF & Cookbook

When we (Matt and myself) kicked off the KFF project, one of the things we wanted to include was auto-documentation. Matt made a small proof of concept to generate PDF documentation, but it never got finished. At the same time, while Matt, Roland and Jos were busy writing Pentaho Kettle Solutions (sorry, what were you thinking? You do not own a copy of this book?), the subject of auto-documentation came up too. Roland picked it up and turned it into the Kettle Cookbook, a great auto-documentation tool for kettle.

Why did we want auto-documentation in the first place?
The reason we wanted auto-documentation is that we are lazy. We know this. Actually, we have known this for a long time. So we needed a solution that would minimize the effort on our side.

Also it often turns out that customers do not really want to pay for documentation. They just see that as part of the development process and certainly don't want to pay for any time you put into documentation. So minimizing effort is also keeping costs low, which seems to please customers.

Another reason for wanting auto-documentation is that over the years we witnessed projects where the documentation of data integration code was something in the line of a gigantic Word document filled with screenshots - those very bulky screenshots made using MS (com)P(l)AINT. Obviously that kind of documentation stays up to date until 5 minutes after the next error in your data integration run. And stale documentation is even worse than no documentation.

So what was the way to go?
We quickly concluded that something or someone had to document the code for us - continuously. Since you cannot outsource everything to India, and since mind-reading solutions aren't quite there yet, we thought along the lines of generating documentation from the code itself. What we didn't know is that it could get even better, and that Roland would write the auto-documentation tool and in doing so really minimize the effort for everyone.

About kettle documentation possibilities 
Now before zooming in on the Cookbook, I would like to highlight some nice documentation features that have been in kettle for quite some time. The examples below are taken from KFF, namely from the batch_launcher.kjb.

1) Job/Transformation properties
Every kettle job or transformation has a series of meta-data tags on the properties tab, accessible by right-clicking on the canvas in Spoon (or through the menu).


The available tags are the following:
  • Name: The name of the transformation/job. This name doesn't need to be equal to the physical name of the XML file in which you save the code, although not aligning the physical file name and this name quickly gets confusing.
  • Description: A short description. Short as in: fits on one line.
  • Extended description: A full text description
  • Status: Draft or Production
  • Version: A free text version number
  • Created by/at: ID of creator and timestamp of creation
  • Modified by/at: ID of modifier and timestamp of modification

In my experience this gives quite a few fields to stick in some elementary descriptions of functionality.
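
Since all of these properties end up as plain text in the job's XML file, they are easy to harvest for documentation purposes. The small Java sketch below reads them with standard DOM parsing; the tag names used are assumptions based on how Kettle serializes jobs, so verify them against your own .kjb files.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    // Sketch: extract the documentation meta-data tags from a .kjb file.
    // The tag names are assumptions; check a real .kjb file for the exact spelling.
    public class JobMetadataReader {
        private static final String[] TAGS = {
                "name", "description", "extended_description", "job_version",
                "job_status", "created_user", "created_date", "modified_user", "modified_date"
        };

        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("batch_launcher.kjb"));
            for (String tag : TAGS) {
                NodeList nodes = doc.getElementsByTagName(tag);
                if (nodes.getLength() > 0) {
                    // The job-level tag appears first in the file, before any job entries.
                    System.out.println(tag + ": " + nodes.item(0).getTextContent());
                }
            }
        }
    }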

2) Canvas notes
First of all, the fact that there are really no lay-out restrictions in how you organize a data integration job or transformation is a strong documentation feature by itself. Many ETL tools oblige you to always work left to right, or to always see every step at attribute level. Often that makes the view a developer has of the canvas, well, not much of an overview. In kettle you do not run into that issue.

Because you can design jobs in a modular way (using sub-jobs), you can also ensure that you never need to design a job/transformation that looks like the one below. (For the record: I didn't design the below transformation myself.) Obviously I'm now stating that a good data integration design makes documentation readable, which is a bit beyond pure documentation functionality, but still, it is an important thing to consider when thinking about auto-documenting your solution.



On top of the great lay-out possibilities, you can insert notes on the canvas of any job/transformation. They allow for free text comments (without lay-out possibilities). This is good for documenting things that still need finalizing, highlighting certain elements of your job/transformation, important remarks like 'don't ever change this setting', etc.



Although the notes aren't actually linked to any steps, the vicinity of a note to a step is usually enough to show which step the comment belongs to. And in case you really want to link your comments to specific steps, there are also 'Step descriptions'.

3) Step descriptions
Step descriptions are available through a simple right click on the step you want to document.
A step description dialog opens up and you can write down any comments related to the step you clicked, in free text format (no formatting).


All in all, kettle as a tool has great lay-out possibilities and sufficient documentation 'place holders' to stuff your comments in. The next thing is to get that information back out.

The Cookbook
As I wrote in the intro of this post, Roland Bouman put together an auto-documentation tool for kettle during the writing of Pentaho Kettle Solutions. He presented it to the Pentaho Community even before the release of the book, both in a webcast and at the Pentaho Community Gathering 2010 in Cascais (presentation here).

What does the Cookbook do? Well, basically it reads all kettle jobs and transformations in a specific directory (INPUT_DIR) and generates HTML documentation for this code in another directory (OUTPUT_DIR), using the software you already have installed, namely kettle. In other words, if you are a kettle user, you just need to tell the Cookbook where your code is and where you want the documentation to go. I'm not sure it could get any simpler than that. Yet, as far as I know, this is the only data integration tool that is actually capable of auto-documenting itself.

Cookbook features
The feature I'd like to show first is that all your code is transformed into HTML pages which maintain the folder structure you have given to your project. In my example I've auto-documented the /kff/reusable folder, which looks like this:
So basically you get one HTML page per job/transformation, located in a directory structure that perfectly matches your original directory structure. Plain and simple.

Obviously the tree view shown here is clickable and allows you to navigate directly to any job/transformation you might want to explore.

On each page (for each job/transformation) quite an extensive amount of information is listed. First you find the meta-data tags from the properties tab. The screenshot below matches the batch_launcher.kjb properties as shown above. Note that the fields "version" and "status" aren't exported for some reason, but apart from that all the fields are there.

After the meta-data elements, the parameters a job might expect are listed. In the case of our batch_launcher.kjb these are the following. Since the named parameters are quite important for understanding a job or transformation, it is appropriate that they are listed near the top of the page.

Next you'll find an export of the actual canvas you see in Spoon, including all the notes. Now this is true magic. The screenshots in the documentation are exactly like what you see on the canvas in Spoon. And the steps are clickable: they'll bring you right to the job or transformation that the step refers to, or to the description of the step. In other words, you can drill down from jobs to sub-jobs to transformations to steps as you would in Spoon. That is no less than amazing!
The step descriptions themselves are listed lower on the page. In the screenshot below you'll see the step descriptions we entered for the step 'kff_logging_init' before. (Note that page breaks are lost.)

However, if you look at steps that do not just launch another job or transformation, you even get some of the actual code. Look at this Table Input step, where you actually get the SQL that is executed.

All in all, the cookbook generates amazingly detailed documentation. In case you aren't convinced by the screenshots and explanation above, please check for yourself below (or in a full browser window).



KFF & Kettle-Cookbook
After the above explanation it doesn't need much clarification that integrating KFF and the Cookbook was peanuts. The KFF directory structure is clear:
/kff/projects/my_customer/my_project/code --> contains your code
/kff/projects/my_customer/my_project/doc --> contains the documentation
So the INPUT_DIR and OUTPUT_DIR for connecting the Cookbook to KFF are clear. The only thing needed was to add a step to the batch_launcher.kjb which calls the top level job of the Cookbook and passes it those two variables.
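
If you want to do something similar outside KFF, the sketch below shows how a documentation job could be launched from Java and handed the two directories as named parameters, in the spirit of Kettle's Java API. The job file name and the parameter names INPUT_DIR/OUTPUT_DIR are assumptions here; use whatever the Cookbook version you downloaded actually expects.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    // Sketch: run a documentation job from Java and pass it the input/output
    // directories as named parameters. The job file name and parameter names
    // are assumptions; check the Cookbook release you actually use.
    public class RunCookbook {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            JobMeta jobMeta = new JobMeta("kettle-cookbook/pdi/document-all.kjb", null);
            jobMeta.setParameterValue("INPUT_DIR", "/kff/projects/my_customer/my_project/code");
            jobMeta.setParameterValue("OUTPUT_DIR", "/kff/projects/my_customer/my_project/doc");

            Job job = new Job(null, jobMeta);
            job.copyParametersFrom(jobMeta);
            job.activateParameters();
            job.start();
            job.waitUntilFinished();

            System.out.println("Documentation job finished with "
                    + job.getResult().getNrErrors() + " error(s).");
        }
    }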


As I said, it was extremely simple to connect KFF to the Cookbook.

So from our next release on, if you download and install KFF, you'll automatically have a download of the Kettle-Cookbook in there, and whether you want it or not, all your projects will be auto-documented. You just need to figure out how to share the /kff/projects/my_customer/my_project/doc directory with people who would actually like to read the manual.

A big thanks to Roland!

October 14, 2010

IE6


Thank you, Google people, for throwing this message in our faces several times a day!! Unfortunately some of our clients are large organisations (banks, government, etc.) who really believe that Internet Explorer 6 was the end of web browsing evolution. So please, Google people, stop developing for the web, there really is no future in that. )-:

October 12, 2010

PCG10 in figures

PCG10, the Pentaho Community Meeting 2010 in Lisbon, is over (safe to say, after more than 2 weeks), and attention on the live blog is slowly dying down. So I thought: full attention on KFF now.

But Paul Stoellberger tweeted the following statistics yesterday:
which made me think of looking up some of the Google Analytics statistics to share. After all, we are into business intelligence, aren't we?

So what has been the interest in the PCG10 presentations I added to the live blog post?

Here's the general statistics for the first 2 weeks:


Obviously attention was highest in the first days, dropping off after the 5th day. However, even though we had only 15 page views or so per day after that initial period, people still spent some time reading the pages. I'm curious how long that attention will last.


So where are the visitors from? Well, it would seem that the Pentaho Community Event is mainly an American-European party. Attention from other continents is extremely limited.


The top 20 countries (out of 37) are listed below. I was mostly amazed by Brazil: I would have expected more attention from that country, given the push to organize PCG next year :-)  And I was pleasantly surprised to see that adding the Dutch and Belgian contributions together (no contributions from Luxemburg) would put the Benelux in third position. Small countries, ahoy!


And further, well, no more comments; let the figures speak for themselves, I would say!

October 8, 2010

Pentaho Kettle Solutions Overview

Dear Kettle friends,

Great news! Copies of our new book Pentaho Kettle Solutions are finally shipping. Roland, Jos and myself worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.

Book front

So let's take a look at what's in this book, what the concept behind it was and give you an overview of the content...

The concept

Given the fact that Maria's book Pentaho Data Integration 3.2 was due when we started, we knew that a beginner's guide would be available by the time this book was going to be ready. As such we opted to look at what the data warehouse professional might need when he or she starts to work with Kettle. Fortunately there is already a good and well-known checklist out there to see if you have covered everything ETL related: The 34 Subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit. And so we asked Mr. Kimball's permission to use his list, which he kindly granted. He was also gracious enough to review the related chapter of our book.

By using this approach we allow readers to flip to a certain chapter in our book and directly get the information they want on the problem they are facing at that time. For example, Change Data Capture (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.

In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point. In fact, there are plenty of steps we did not cover in this book. However, wherever a step or feature needed to be explained while covering all the sub-systems, we did so as clearly as we could. Rest assured though: since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)

The content

After a gentle introduction into how ETL tools came about and more importantly how and why Kettle came into existence, the book covers 5 main parts:

1. Getting started

This part starts with a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained. This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that the installation and configuration of Kettle are covered. Since the installation is a simple unzip, this includes a detailed description of all the available tools and configuration files.
Finally, you'll get hands-on experience in the last chapter of the first part titled "An example ETL Solution - Sakila". This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.

2. ETL
In this part you'll first encounter a detailed overview of the 34 sub-systems of ETL, after which the art of data extraction is covered in detail. That includes extracting information from all sorts of file types and databases, working with ERP and CRM systems, data profiling and CDC.
This is followed by chapter 7 "Cleansing and Conforming" in which the various data cleansing and validation steps are covered as well as error handling, auditing, deduplication and last but not least scripting and regular expressions.
Finally this second part of the book will cover everything related to star schemas including the handling of dimension tables (chapter 8), loading of fact tables (chapter 9) and working with OLAP data (chapter 10).

3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution. First you'll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13) and lineage and auditing (chapter 14). As you can guess from the titles of the chapters, a lot of best practices and do's and don'ts are covered in this part.

4. Performance and scalability
The 4th part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17) and real-time data integration (chapter 18).
I personally hope that the book will lead to more performance-related JIRA cases, since chapter 15 explains how you can detect bottlenecks :-)

5. Advanced topics
The last part conveniently titled "Advanced topics" deals with things we thought were interesting to a data warehouse engineer or ETL developer that is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20) or web services (chapter 21). Indispensable in case you want to embed Kettle into your own software is chapter 22 : Kettle integration. It contains many Java code samples that explain to you how you can execute jobs and transformations or even assemble them dynamically.
Last but certainly not least, probably one of the most interesting chapters for a Java developer is chapter 23: Extending Kettle. This chapter explains in great detail how you can develop step, job-entry, partitioning or database type plugins for Kettle, so that you can get started with your own components in no time.

I hope that this overview of our new brain-child gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time and money, I'm sure there will be the occasional typo or lack of precision, but rest assured that we did our utmost on this one. After all, we each spent over 6 months on it...

Feel free to ask about specific topics you might be interested in to see if they are covered ;-)

Until next time,

Matt

October 4, 2010

Twitter real-time BI

Something which a lot of you already knew, but which I just discovered, is that my (and your) tweets are being watched. Not in the sense of 'big brother is watching you', but rather 'big marketeer is studying you'.

What happened:

This week I made a quick reply on Twitter to the following post of @magicaltrout, in which he was complaining that his employer hadn't paid him (yet).
My reply went as follows:
The next thing that happened - within minutes - was that I had a new follower on Twitter. Check this.
Clearly some companies scan Twitter to figure out whether you are a potential customer and then add you to their follower list, probably with the idea of following up even more closely, or of starting to spam you.

When I looked into the Twitter account, it actually belonged to Fastmkn.com, a company that sets up this kind of service for its customers. Their website was clear about their services:

So, when you tweet, know that your tweets are being 'mined', watch what you write, and make sure you have your hand on the anti-spam button.




October 3, 2010

Pi