March 9, 2011

I feel rejected!

On September 21st, 2010, I wrote the following lines in this blog post:
  • Rejects:  A generic component for error handling. It will converge all error records into one common format so all your rejects fit in one and the same output file or table. We'll elaborate on this one soon in an extra blogpost. Documentation will be added to the KFF pages.
Today, while on the phone with "a fan of the art of data integration through the means of Pentaho Data Integration", I realised that I never wrote that extra blogpost, and that no one ever updated the KFF pages with regard to the "Rejects step". So it is time to make amends.

Error handling in kettle
The standard error handling in kettle is probably well known to most of you. For those who aren't up to speed, a quick intro follows here.

A common scenario is shown below.

You have stored information in the staging area of your data warehouse, which you need to load into an ODS. The ODS area of your data warehouse, however, needs to contain validated, clean data. Therefore your ODS table(s) may have foreign key or other constraints to ensure that your data is correct. Consequently, any record you try to write to the ODSTable that isn't allowed by the data model for whatever reason (value larger than the field, incorrect precision, foreign key not found, ...) will throw an error, and your transformation will end with errors.

This will make you, as a professional data integration designer/developer, very unhappy. So you will adjust your transformation and put a lot of logic between the data input and data output steps, to ensure that everything can be loaded correctly into the target model.

But no matter how precise and clairvoyant you are, at some point in time a nasty record will pass by for which you didn't write the right code, and bang, your transformation fails. Then you discover kettle has error handling functionality for exactly these nasty records. Yes, kettle can divert records that the data output steps cannot process to another step. That looks as follows.
Basically, what is done very often is simply diverting the records that cannot be inserted into the database to a flat file, and leaving them there so someone can have a look at why they weren't inserted into the ODS or data warehouse. At the same time this permits your data flow to end without throwing errors and aborting in the middle of the night, which would otherwise mean getting up to restart that data warehouse load, or finding 10 angry managers at your desk in the morning.
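The divert-on-error idea can be sketched outside Kettle as well: records that fail validation go to a rejects collection instead of aborting the whole load. This is a minimal illustration, not the PDI mechanism itself; all names here are made up for the example.

```python
# Minimal sketch of the divert-on-error pattern: failing records are
# collected as rejects instead of stopping the load. Illustrative only.

def load_with_rejects(records, validate, write):
    rejects = []
    for record in records:
        try:
            validate(record)   # e.g. foreign key / field length checks
            write(record)      # insert into the target (ODS) table
        except ValueError as err:
            rejects.append({"record": record, "error": str(err)})
    return rejects

# Example: reject any record whose 'country' is not a known key.
known_countries = {"BE", "NL", "FR"}

def validate(record):
    if record["country"] not in known_countries:
        raise ValueError("foreign key not found: " + record["country"])

loaded = []
rejects = load_with_rejects(
    [{"id": 1, "country": "BE"}, {"id": 2, "country": "XX"}],
    validate,
    loaded.append,
)
# loaded holds the clean record; rejects holds the diverted one
```

The important point, as in Kettle, is that the bad record is kept for later analysis while the run itself finishes cleanly.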

In the data output step where you enable the error handling (in this case the ODSTable step), you can configure some extra settings for the error handling.


For those who want to read up a bit on this feature, go and surf the net. The documentation isn't abundant, but the information is out there. My little scenario ends here, because I really want to talk about the 'Rejects step' and not do a write-up on error handling in PDI. So here I go.

Why a rejects step, if kettle already does it all?
As you might guess from the scenario above, one of the problems you often end up with if you push all the rejects to flat files is a large number of files, with a lot of records to analyze for data quality. If you are lucky, you are working with top-shelf data and will have few rejected records and few files to look at, but most people will not be that lucky.
So as your data warehouse grows (not only in size, but also in data integration logic and code), the number of reject files will grow and you'll lose the overview of what is happening in terms of data quality.

To avoid this problem, we came up with the idea to merge all rejects into one single database table. This has some advantages:
  • Having all your rejects in one single place, makes it easy to make some statistics on your rejected records.
  • If they are in a database table, they really are just part of your data integration logging, and therefore you can easily include them in the logging reports you already produce every day to see and show how the ETL runs have been going.
So what does it look like?

Basically nothing has been modified in the standard error handling mechanism of PDI. The only thing we added is the output step for the rejects. In this step you can define the following things:
  • Rejected records connection: database connection to which you want to write the rejects (this can be a parameter)
  • Rejected records schema: database schema to which you want to write the rejects (this can be a parameter)
  • Rejected records table: database table to which you want to write the rejects (this can be a parameter)  
  • Error count field: name that you want to give the column containing the error count (this can be a parameter)  
  • Error descriptions field: name that you want to give the column containing the error description (this can be a parameter)  
  • Error fields field: name that you want to give the column containing the error fields (this can be a parameter)  
  • Error codes field: name that you want to give the column containing the error code (this can be a parameter)  
Below those input fields you can specify which of the incoming fields make up the key of your record.
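Under the hood, that key specification boils down to splitting each rejected record into a comma-separated key part and a comma-separated remainder. A small sketch of how that could work, purely illustrative and not the actual KFF implementation:

```python
# Flatten a reject into two comma-separated strings: the key fields
# become the 'id' part, all remaining fields become the 'value' part.
# (Illustrative sketch only; not the real KFF code.)

def flatten_reject(record, key_fields):
    id_part = ",".join(str(record[f]) for f in key_fields)
    value_part = ",".join(
        str(v) for f, v in record.items() if f not in key_fields
    )
    return id_part, value_part

record = {"customer_id": 42, "country": "XX", "name": "Acme"}
rid, rvalue = flatten_reject(record, ["customer_id"])
# rid == "42", rvalue == "XX,Acme"
```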

The rejects table will have the following lay-out. 
  • id: This column contains all the key fields of your rejected record in comma separated format (provided you specified them in the rejects step) 
  • value: This column contains all the remaining (non-key) fields of your rejected record in comma separated format 
  • batch_id: Your batch ID (see KFF)
  • transname: name of the transformation that generated the reject
  • jobname: name of the job that generated the reject
  • stepname: name of the step that generated the reject
  • ${error_count_field}: number of errors
  • ${error_descriptions_field}: error description
  • ${error_fields}: field that generated the error
  • ${error_codes}: error code 
  • logdate: date/time of the reject
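To make the lay-out above concrete, here is one way such a rejects table could be created and filled, using SQLite for illustration. The concrete column names standing in for the ${...} placeholders (error_count, error_description, ...) are assumptions for the example.

```python
# Create a rejects table matching the lay-out described above and
# insert one sample reject. SQLite in-memory, for illustration only;
# placeholder column names are assumed, not prescribed by KFF.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rejected_records (
        id                TEXT,       -- key fields, comma separated
        value             TEXT,       -- non-key fields, comma separated
        batch_id          INTEGER,    -- batch ID (see KFF)
        transname         TEXT,
        jobname           TEXT,
        stepname          TEXT,
        error_count       INTEGER,
        error_description TEXT,
        error_fields      TEXT,
        error_codes       TEXT,
        logdate           TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO rejected_records VALUES (?,?,?,?,?,?,?,?,?,?,?)",
    ("42", "XX,Acme", 1, "stg_to_ods", "nightly_load", "ODSTable",
     1, "foreign key not found", "country", "KFF-001",
     "2011-03-09 02:00:00"),
)
row = conn.execute(
    "SELECT stepname, error_description FROM rejected_records"
).fetchone()
# row == ("ODSTable", "foreign key not found")
```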
Simple enough to write some statistics on. Should I want to know which steps generate the most errors, a simple query of the type:
select stepname, count(*)
from ${rejected_records_table}
group by stepname
would tell me enough.

Now if you add the element of time to this, storing your rejects for each nightly batch run, you can plot a nice evolution of how you have been handling data quality problems over time. Any data warehouse project will sooner or later require this kind of statistics, because data quality problems are always there; the only question is when they will surface.
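That evolution-over-time statistic is just a GROUP BY on the log date. A runnable sketch, again with SQLite and some made-up data; in a real setup the query would run against the ${rejected_records_table} in your logging database:

```python
# Count rejects per load date to see how data quality evolves over
# time. SQLite in-memory with illustrative data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rejected_records (stepname TEXT, logdate TEXT)"
)
conn.executemany(
    "INSERT INTO rejected_records VALUES (?, ?)",
    [("ODSTable", "2011-03-07"), ("ODSTable", "2011-03-08"),
     ("ODSTable", "2011-03-08"), ("LookupStep", "2011-03-08")],
)
evolution = conn.execute("""
    SELECT date(logdate) AS load_date, COUNT(*) AS nr_rejects
    FROM rejected_records
    GROUP BY date(logdate)
    ORDER BY load_date
""").fetchall()
# evolution == [("2011-03-07", 1), ("2011-03-08", 3)]
```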

Good luck with it!