October 18, 2010

KFF & Cookbook

When we (Matt and I) kicked off the KFF project, one of the things we wanted to include was auto-documentation. Matt made a small proof of concept to generate PDF documentation, but it was never finished. At the same time, while Matt, Roland and Jos were busy writing Pentaho Kettle Solutions (Sorry, what were you thinking? You do not own a copy of this book?), the subject of auto-documentation came up too. Roland picked it up and turned it into the kettle cookbook, a great auto-documentation tool for kettle.

Why did we want auto-documentation in the first place?
The reason we wanted auto-documentation is that we are lazy. We know this. Actually, we have known this for a long time. So we needed a solution that would minimize effort on our side.

Also, it often turns out that customers do not really want to pay for documentation. They just see it as part of the development process and certainly don't want to pay for any time you put into it. So minimizing effort also keeps costs low, which seems to please customers.

Another reason for wanting auto-documentation is that over the years we witnessed projects where the documentation of data integration code was something along the lines of a gigantic Word document filled with screenshots - those very bulky screenshots made using MS (com)P(l)AINT. Obviously that kind of documentation stays up to date until 5 minutes after the next error in your data integration run. And stale documentation is even worse than no documentation.

So what was the way to go?
We quickly concluded that something or someone had to document the code for us - continuously. Since you cannot outsource everything to India, and since mind-reading solutions aren't quite there yet, we thought along the lines of generating documentation from the code itself. What we didn't know is that it could get even better: Roland would write the auto-documentation tool, and in doing so really minimize the effort for everyone.

About kettle documentation possibilities 
Now, before zooming in on the cookbook, I would like to highlight some nice documentation features that have been in kettle for quite some time. The examples below are taken from KFF, namely from the batch_launcher.kjb.

1) Job/Transformation properties
Every kettle job or transformation has a series of meta-data tags on the properties tab, accessible by right-clicking on the canvas of spoon (or through the menu).

The available tags are the following:
  • Name: The name of the transformation/job. This name doesn't need to be equal to the physical name of the XML file in which you want to save the code, although not aligning the physical
  • Description: A short description. Short as in: fits on one line.
  • Extended description: A full text description
  • Status: Draft or Production
  • Version: A free text version number
  • Created by/at: ID of creator and timestamp of creation
  • Modified by/at: ID of modifier and timestamp of modification

In my experience this gives you quite a few fields to stick some elementary descriptions of functionality in.
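
Since jobs and transformations are saved as XML files, these tags can also be read outside of spoon. Below is a minimal Python sketch of that idea. The sample XML and the element names (job_version, created_user, and so on) are assumptions made for illustration, so check them against one of your own .kjb files before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down job file. The element names are assumptions
# about how the properties tab is stored, not a confirmed schema.
SAMPLE_KJB = """<job>
  <name>example_job</name>
  <description>Short one-line description</description>
  <extended_description>A longer, free text description.</extended_description>
  <job_version>1.0</job_version>
  <created_user>some_user</created_user>
  <modified_user>another_user</modified_user>
</job>"""

# Label on the properties tab -> assumed element name in the XML.
TAGS = {
    "Name": "name",
    "Description": "description",
    "Extended description": "extended_description",
    "Version": "job_version",
    "Created by": "created_user",
    "Modified by": "modified_user",
}


def read_job_properties(xml_text):
    """Return the properties-tab fields found in a job's XML as a dict."""
    root = ET.fromstring(xml_text)
    # findtext returns None for a missing element; fall back to "".
    return {label: (root.findtext(tag) or "").strip() for label, tag in TAGS.items()}


if __name__ == "__main__":
    for label, value in read_job_properties(SAMPLE_KJB).items():
        print(f"{label}: {value}")
```

Fields that are missing from the file simply come back as empty strings, which makes it easy to spot jobs nobody bothered to describe.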

2) Canvas notes
First of all, the fact that there are really no layout restrictions on how you organize a data integration job or transformation is a strong documentation feature by itself. Many ETL tools will oblige you to always work left to right, or to always see every step at attribute level. Often that makes the view a developer has of the canvas, well, not much of an overview. In kettle you do not run into that issue.

Because you can design jobs in a modular way (using sub-jobs), you can also ensure that you never need to design a job/transformation that looks like the one below. (For the record: I didn't design the transformation below myself.) Obviously I'm now stating that a good data integration design makes documentation readable, which is a bit beyond pure documentation functionality, but still, it is an important thing to consider when thinking about auto-documenting your solution.

On top of the great layout possibilities, you can insert notes on the canvas of any job/transformation. They allow for free text comments (without formatting possibilities). This is good for documenting things that still need finalizing, for highlighting certain elements of your job/transformation, for important remarks like 'don't ever change this setting', etc.

Although the notes aren't actually linked to any steps, the vicinity of a note to a step is good enough to show which step the comment belongs to. And in case you really want to link your comments to specific steps, there are also 'Step descriptions'.

3) Step descriptions
Step descriptions are available through a simple right click on the step you want to document.
A step description dialog opens up and you can write down any comments related to the step you clicked, in free text format (no formatting).

All in all, kettle as a tool has great layout possibilities and sufficient documentation 'place holders' to stuff your comments in. The next thing is to get that information back out.

The Cookbook
As I wrote in the intro of this post, Roland Bouman put together an auto-documentation tool for kettle during the writing of Pentaho Kettle Solutions. He presented this to the Pentaho Community even before the release of the book, both in a webcast and at the Pentaho Community Gathering 2010 in Cascais (presentation here).

What does the cookbook do? Well, basically it will read all kettle jobs and transformations in a specific directory (INPUT_DIR) and generate html documentation for this code in another directory (OUTPUT_DIR) using the software you already have installed, namely kettle. In other words, if you are a kettle user, you just need to tell the cookbook where your code is and where you want the documentation to go. I'm not sure it could get any simpler than that. Yet, as far as I know, this is the only data integration tool that is actually capable of auto-documenting itself.

Cookbook features
The feature I'd like to show is that all your code is transformed into html pages which maintain the folder structure that you might have given to your project. In my example I've auto-documented the /kff/reusable folder, which looks like this:
So basically you have one html page per job/transformation, located in a directory structure that perfectly matches your original directory structure. Plain and simple.
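
To make the "one page per file, same folder structure" idea concrete, here is a small Python sketch of how such a mapping could work. It is only an illustration: the function name is made up, and naming the page after the full file name plus ".html" is an assumption, not necessarily what the cookbook itself does.

```python
from pathlib import PurePosixPath

# File extensions of kettle jobs and transformations.
KETTLE_SUFFIXES = {".kjb", ".ktr"}


def doc_page_for(source_file, input_dir, output_dir):
    """Map a job/transformation under input_dir to an html page under output_dir.

    The relative folder structure is preserved. Appending ".html" to the full
    file name is an assumption for this sketch; it keeps foo.kjb and foo.ktr
    from ending up on the same page.
    """
    source = PurePosixPath(source_file)
    if source.suffix not in KETTLE_SUFFIXES:
        raise ValueError(f"not a kettle job or transformation: {source}")
    # relative_to raises ValueError if the file is not under input_dir.
    relative = source.relative_to(PurePosixPath(input_dir))
    return PurePosixPath(output_dir) / relative.with_name(relative.name + ".html")


if __name__ == "__main__":
    # Hypothetical file and output locations, for illustration only.
    print(doc_page_for("/kff/reusable/sub/example.kjb", "/kff/reusable", "/tmp/doc"))
```

The point of the design is that nobody has to learn a second folder layout: whoever knows where the code lives also knows where its documentation lives.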

Obviously the tree view shown here is clickable and allows you to navigate directly to any job/transformation you might want to explore.

On each page (for each job/transformation) quite an extensive amount of information is listed. First you find the meta-data tags from the properties tab. The screenshot below matches the batch_launcher.kjb properties as shown above. Note that the fields "version" and "status" aren't exported for some reason, but apart from that all the fields are there.

After the meta-data elements, the parameters a job might expect are listed out. In case of our batch_launcher.kjb these are the following. Since the named parameters are quite important for the understanding of a transformation, it is appropriate they are listed on top of the page.

Next you'll find an export of the actual canvas you see in spoon, including all the notes. Now this is true magic. The screenshots in the documentation are exactly like what you see on the canvas in spoon. And the steps are clickable. They'll bring you right to the job or transformation that the step refers to, or to the description of the step. In other words, you can drill down from jobs to sub-jobs to transformations to steps as you would in spoon. That is no less than amazing!
The step descriptions themselves are listed lower on the page. In the below screenshot you'll see the step descriptions we entered for the step 'kff_logging_init' before. (Note that page breaks are lost.)

However, if you look at the descriptions of steps that do not just launch another job or transformation, you even get some of the actual code. Look at this table input step, where you actually get the SQL code that is executed.

All in all, the cookbook generates amazingly detailed documentation. In case you aren't convinced by the screenshots and explanation above, please check for yourself below (or in a full browser window).

KFF & Kettle-Cookbook
After the above explanation it doesn't need much clarification that integrating KFF and the cookbook was peanuts. The KFF directory structure is clear.
/kff/projects/my_customer/my_project/code  --> contains your code
/kff/projects/my_customer/my_project/doc  --> contains the documentation
So the INPUT_DIR and OUTPUT_DIR for connecting the cookbook to KFF are clear. The only thing needed was to add a step to the batch_launcher.kjb that calls the top level job of the Cookbook and passes it two variables.
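
KFF does this with a job step inside batch_launcher.kjb, but the same idea can be sketched from the command line with kitchen, kettle's job runner. The Python snippet below only builds the command. The helper name, the default kitchen.sh location and the path to the cookbook's top level job (kettle-cookbook/pdi/document-all.kjb) are assumptions for illustration, so adjust them to your own installation.

```python
from pathlib import PurePosixPath


def cookbook_command(project_dir,
                     kitchen="kitchen.sh",
                     cookbook_job="kettle-cookbook/pdi/document-all.kjb"):
    """Build a kitchen command line that documents one KFF project.

    kitchen and cookbook_job are assumed locations, not values taken from KFF.
    INPUT_DIR and OUTPUT_DIR follow the KFF layout: <project>/code and
    <project>/doc.
    """
    project = PurePosixPath(project_dir)
    return [
        kitchen,
        f"-file={cookbook_job}",
        f"-param:INPUT_DIR={project / 'code'}",
        f"-param:OUTPUT_DIR={project / 'doc'}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(cookbook_command("/kff/projects/my_customer/my_project")))
```

Returning the command as a list means it can be handed straight to subprocess.run once the paths have been checked.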

As I said, it was extremely simple to connect KFF to the Cookbook.

So from our next release on, if you download and install KFF, you'll automatically have a download of the Kettle-Cookbook in there, and whether you want it or not, all your projects will be auto-documented. You just need to figure out how to share the /kff/projects/my_customer/my_project/doc directory with people who would actually like to read the manual.

A big thanks to Roland!