- It will count the records in each table and make the result of that count available.
- It will do a left, right and inner join between the two tables (the counts of those statistics aren't available
- All the records that appear only in the right or left join are logged as 'Errors'
- All the records that are common between the Reference and Compare table are put through a detailed compare on attribute level. All attributes that don't match are logged as 'Errors'.
- The "Reference connection" and "Compare connection" are the database connections from which the reference/compare table data will come.
- The "Reference schema field" and "Compare schema field" contain the schema names for the reference/compare table.
- The "Reference table field" and "Compare table field" contain the actual table names. This means that you could compare two tables with a different name, as long as they have the same column names.
- The "Key fields field" should contain a comma separated list of they fields that make up the 'primary' key of the table(s) you are comparing. The primary key is needed because without this information the two tables cannot be correctly joined.
- The "Exclude fields field" contains a comma separated list of columns that you want to exclude from the comparison. E.g. because they exist in the first table, but not in the second.
- The "Number of errors field" allows you to specify the name of the output column that will contain the total number of errors found for the comparison of your tables.
- The "Number of reference/compare table records field" allows you to specify the name of the field that will contain the actual number of records found in each table.
- The "Number of left/inner/right join errors field" allows you to specify the name of the field(s) that will contain the number of errors found for each join type.
- The "Error handling key description input field" allows you specify the name of the output field for the 'where clause" of the record that gave an error.
- The "Error handling reference/compare value input field" allows you to specify the output field names for the actual values that differ.
If you find all of the above pretty confusing that is understandable. There is a lot of fields but most of them have little importance. They just allow you to choose how your field will be name, but have few functional importance. Still, in order to improve your understanding of the subject, we thought an example was in place.
In order to show you the example we needed some tables in an online database that we can compare. We found the information contained in the Ensembl project very suitable for this purpose. What is the project about?
The Ensembl project produces genome databases
for vertebrates and other eukaryotic species,
and makes this information freely available online.
We just picked two tables from each database and put them through the Compare Table step for demonstration purposes. The transformation is shown below (and is also available for download).
As the first step we used the data grid step to decide on which tables to run through the Compare step.
Previewing the first output ('Comparison Statistics') delivers the following:
It shows that both the 'analysis' and 'attrib_typ' table have a different number of records for the human vs chimp database. (Luckily?)
Previewing the second output ('Comparison Error Details') shows some details on the actual differences (in this case the inner join details).
Clearly the record with analysis_id=2 has different values for ALL columns in the table.
Hopefully this sample helps to understand what the Table Compare can do. The best way to see it is to download the .ktr and give it a spin. We'll also add the .ktr to the KFF project as a project template. So you'll also find the code in the next KFF release.
Improvements
We know already that the following improvements would be handy:
- Allow the connections to be field names that are accepted from the previous step. That would allow to do testing across more than two connections.
- Save the following statistics:
- nrRecordsInnerJoin
- nrRecordsLeftJoin
- nrRecordsRightJoin
- The 3 error fields are currently expected as input fields in the step, this should be corrected. Also their name might be more appropriately be output field :-)
































