Table compare - test automation

Background

Some time ago we released a series of kettle plugins as part of the Kettle Franchising Factory. The plugins (kff-plugins3.jar for PDI version 3.2.3 and above, and kff-plugins4.jar for PDI 4.0.0 and above) are available on code.kjube.be, but so far not much documentation has been provided.

One of the kettle steps available in those jar files is the Table Compare. This step does what its name says: it compares the data from two tables (provided they have the same layout). It finds the differences between the data in the two tables and logs them. We developed this plugin for acceptance test scenarios in large projects. Consider (as a hypothetical example :-) ) the following use case of a data integration tool migration.

Suppose you have a data warehouse which is being loaded using, well, let's say, Oracle Warehouse Builder. Oracle bought Sunopsis in 2006, and since then the development of OWB has somewhat stalled :-). Some time afterwards Oracle announced the launch of a new product called ODI - Oracle Data Integrator (very appropriate name, btw). This product combines the best of both worlds (or so Oracle sales reps state), but most Oracle Warehouse Builder customers have known for a long time that migrating their code from OWB to ODI is not easy. And here, out of necessity, an opportunity arises. If you are faced with a painful and costly migration, which you can only postpone as long as your support contract allows, why not move to a cheaper data integration tool such as, let's say, kettle?

The above migration scenario is exactly the type of project we do at kJube. I'm not going to go into detail on this type of project, but one element I do want to underline here is the following: if you cannot automate user acceptance testing, you can forget about doing this type of project. The Table Compare step lets you automate exactly that.

So what does the thing do?

Conceptually the Table Compare does the following for each pair of tables you hand it.

It will count the records in each table and make the result of that count available.
It will do a left, right and inner join between the two tables (the record counts of those joins aren't made available).
All the records that appear only in the right or left join are logged as 'Errors'.
All the records that are common between the Reference and Compare table are put through a detailed compare on attribute level. All attributes that don't match are logged as 'Errors'.
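The four steps above can be sketched in a few lines of Python. This is only an illustration of the concept, not the plugin's actual code: the function name `compare_tables`, the result keys and the sample rows are all made up for the example, and the tables are plain lists of dictionaries instead of database tables.

```python
def compare_tables(reference_rows, compare_rows, key_fields, exclude_fields=()):
    """Compare two lists of dict rows that share the same column names.

    Hypothetical helper for illustration only, not the Table Compare code.
    """
    def key_of(row):
        return tuple(row[k] for k in key_fields)

    ref_by_key = {key_of(r): r for r in reference_rows}
    cmp_by_key = {key_of(r): r for r in compare_rows}

    errors = []
    # Keys found on one side only: the "left" and "right" join errors.
    for key in sorted(ref_by_key.keys() - cmp_by_key.keys()):
        errors.append((key, "only in reference", None, None))
    for key in sorted(cmp_by_key.keys() - ref_by_key.keys()):
        errors.append((key, "only in compare", None, None))

    # Keys found on both sides: compare attribute by attribute.
    skipped = set(key_fields) | set(exclude_fields)
    for key in sorted(ref_by_key.keys() & cmp_by_key.keys()):
        ref_row, cmp_row = ref_by_key[key], cmp_by_key[key]
        for column in ref_row:
            if column in skipped:
                continue
            if ref_row[column] != cmp_row.get(column):
                errors.append((key, column, ref_row[column], cmp_row.get(column)))

    return {
        "nr_reference_records": len(reference_rows),
        "nr_compare_records": len(compare_rows),
        "nr_errors": len(errors),
        "errors": errors,
    }


# Made-up sample data: id 1 is missing on the compare side, id 4 is missing
# on the reference side, and id 2 has a different score.
reference = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]
compare = [
    {"id": 2, "name": "b", "score": 25},
    {"id": 3, "name": "c", "score": 30},
    {"id": 4, "name": "d", "score": 40},
]

result = compare_tables(reference, compare, key_fields=["id"])
for error in result["errors"]:
    print(error)
```

With this sample data the sketch reports three errors: one record per one-sided key and one attribute mismatch for the record that exists in both tables.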

How to use it?

Now that you have the conceptual explanation of the Table Compare, I guess it is time for the technical stuff. As you can see, the Table Compare step contains quite a few fields that require input.

The "Reference connection" and "Compare connection" are the database connections from which the reference/compare table data will come.
The "Reference schema field" and "Compare schema field" contain the schema names for the reference/compare table.
The "Reference table field" and "Compare table field" contain the actual table names. This means that you could compare two tables with different names, as long as they have the same column names.
The "Key fields field" should contain a comma separated list of the fields that make up the 'primary' key of the table(s) you are comparing. The primary key is needed because without this information the two tables cannot be correctly joined.
The "Exclude fields field" contains a comma separated list of columns that you want to exclude from the comparison, e.g. because they exist in the first table, but not in the second.
The "Number of errors field" allows you to specify the name of the output column that will contain the total number of errors found for the comparison of your tables.
The "Number of reference/compare table records field" allows you to specify the name of the field that will contain the actual number of records found in each table.
The "Number of left/inner/right join errors field" allows you to specify the name of the field(s) that will contain the number of errors found for each join type.
The "Error handling key description input field" allows you to specify the name of the output field for the 'where clause' of the record that gave an error.
The "Error handling reference/compare value input field" allows you to specify the output field names for the actual values that differ.
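To make the role of the key, exclude and key description fields a bit more tangible, here is a minimal Python sketch. It is an illustration under assumptions, not the step's implementation: the helper names (`split_field_list`, `columns_to_compare`, `key_description`), the column names and the exact format of the generated 'where clause' are all hypothetical.

```python
def split_field_list(value):
    """Turn a comma separated string like 'a, b ,c' into ['a', 'b', 'c']."""
    return [name.strip() for name in (value or "").split(",") if name.strip()]


def columns_to_compare(all_columns, key_fields, exclude_fields):
    """Columns that get a detailed compare: everything except the key and
    excluded columns (assumption: key columns are equal by construction)."""
    skipped = set(key_fields) | set(exclude_fields)
    return [column for column in all_columns if column not in skipped]


def key_description(row, key_fields):
    """Build a where-clause-like text that identifies one record.
    The format is an assumption made for this example."""
    return " and ".join(f"{name} = '{row[name]}'" for name in key_fields)


# Made-up table layout and field values.
all_columns = ["order_id", "line_nr", "amount", "load_date"]
key_fields = split_field_list("order_id, line_nr")
exclude_fields = split_field_list("load_date")

compared = columns_to_compare(all_columns, key_fields, exclude_fields)
description = key_description(
    {"order_id": 7, "line_nr": 2, "amount": 19.5, "load_date": "2011-01-01"},
    key_fields,
)
print(compared)
print(description)
```

In this sketch only `amount` is left for the detailed compare, and the key description identifies the record by its two key columns.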

Example

If you find all of the above pretty confusing, that is understandable. There are a lot of fields, but most of them have little importance: they just allow you to choose how your output fields will be named and have little functional impact. Still, to improve your understanding of the subject, we thought an example was in order.

In order to show you the example we needed some tables in an online database that we can compare. We found the information contained in the Ensembl project very suitable for this purpose. What is the project about?

The Ensembl project produces genome databases
for vertebrates and other eukaryotic species,
and makes this information freely available online.

Basically this project has a large number of databases (one per species) that all have a similar structure. Perfect for our purpose. There are plenty of species available for comparison, but we picked human and chimp.

We just picked two tables from each database and put them through the Table Compare step for demonstration purposes. The transformation is shown below (and is also available for download).

As the first step we used the Data Grid step to decide which tables to run through the Table Compare step.

Previewing the first output ('Comparison Statistics') delivers the following:

It shows that both the 'analysis' and 'attrib_typ' tables have a different number of records for the human vs chimp database. (Luckily?)

Previewing the second output ('Comparison Error Details') shows some details on the actual differences (in this case the inner join details).

Clearly the record with analysis_id=2 has different values for ALL columns in the table.

Hopefully this sample helps you understand what the Table Compare can do. The best way to see it is to download the .ktr and give it a spin. We'll also add the .ktr to the KFF project as a project template, so you'll also find the code in the next KFF release.

Improvements

We know already that the following improvements would be handy:

Allow the connections to be field names that are accepted from the previous step. That would allow testing across more than two connections.
Save the following statistics:

nrRecordsInnerJoin
nrRecordsLeftJoin
nrRecordsRightJoin

The 3 error fields are currently expected as input fields in the step; this should be corrected. Their names might also more appropriately say 'output field' :-)

... but if you have further suggestions to improve this step, please let us know.
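As an illustration of the three join statistics proposed above, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table names, the sample rows and the interpretation of the counters are assumptions for the example: nrRecordsInnerJoin is read as the number of records present in both tables, and nrRecordsLeftJoin / nrRecordsRightJoin as the number of records present only in the reference or only in the compare table.

```python
import sqlite3

# In-memory database with two made-up tables that share the same layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reference_table (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE compare_table   (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO reference_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO compare_table   VALUES (2, 'b'), (3, 'x'), (4, 'd');
""")


def count(sql):
    """Run a SELECT COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]


# Records whose key exists in both tables.
nr_records_inner_join = count("""
    SELECT COUNT(*) FROM reference_table r
    JOIN compare_table c ON c.id = r.id
""")

# Records whose key exists only in the reference table.
nr_records_left_join = count("""
    SELECT COUNT(*) FROM reference_table r
    LEFT JOIN compare_table c ON c.id = r.id
    WHERE c.id IS NULL
""")

# Records whose key exists only in the compare table. Written as a LEFT JOIN
# with the tables swapped, because older SQLite versions lack RIGHT JOIN.
nr_records_right_join = count("""
    SELECT COUNT(*) FROM compare_table c
    LEFT JOIN reference_table r ON r.id = c.id
    WHERE r.id IS NULL
""")

print(nr_records_inner_join, nr_records_left_join, nr_records_right_join)
```

For the sample rows this gives 2 records in the inner join and 1 record on each side that has no counterpart in the other table.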