March 9, 2011

I feel rejected!

On September 21st, 2010, I wrote the following lines in a blog post:
  • Rejects:  A generic component for error handling. It will converge all error records into one common format so all your rejects fit in one and the same output file or table. We'll elaborate on this one soon in an extra blogpost. Documentation will be added to the KFF pages.
Today, while on the phone with "a fan of the art of data integration through the means of Pentaho Data Integration", I realised that I never wrote that extra blog post, nor did anyone update the KFF pages with regard to the "Rejects Step". So it is time I make amends.

Error handling in kettle
The standard error handling in kettle is probably well known to all of you. For those who aren't up to speed, a quick intro follows here.

A common scenario is shown below.

 
You have stored information in the staging area of your data warehouse, which you need to load into an ODS. The ODS area of your data warehouse, however, needs to contain validated, clean data. Therefore your ODS table(s) may have foreign key or other constraints to ensure that your data is correct. Consequently, any record you try to write to the ODSTable step that isn't allowed by the data model for whatever reason (value too large for the field, incorrect precision, foreign key not found, ...) will throw an error, and your transformation will end with errors.

This will make you, as a professional data integration designer/developer, very unhappy. So you will adjust your transformation and put a lot of logic between the data input and data output steps, to ensure that everything can be loaded correctly into the target model.

But no matter how precise and clairvoyant you are, at some point in time some nasty record will pass by for which you didn't write the right code, and bang, your transformation goes into error. Then you discover kettle has error handling functionality for these nasty records. Yes, kettle can divert records that the data output steps cannot process to another step. That looks as follows.
Basically, what is done very often is simply to divert the records that cannot be inserted into the database to a flat file, and leave them there so someone can have a look at why they weren't inserted into the ODS or data warehouse. At the same time, this permits your data flow to end without throwing errors and aborting in the middle of the night, so that you don't need to get up and restart the data warehouse load, or find 10 angry managers at your desk in the morning.
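To make the diversion pattern concrete, here is a minimal sketch in plain Python, using sqlite3 as a stand-in for the ODS database. The table, file and record names are made up for illustration; this is the concept kettle implements for you, not kettle itself.

```python
import csv
import sqlite3

# Stand-in for the ODS table: it only accepts validated data,
# enforced here with NOT NULL and a length check.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ods_customer (
        customer_id INTEGER NOT NULL,
        name        TEXT    NOT NULL CHECK (length(name) <= 10)
    )""")

incoming = [
    (1, "Alice"),
    (2, None),                    # violates NOT NULL
    (3, "A-far-too-long-name"),   # violates the length check
    (4, "Bob"),
]

# Divert records the target table refuses to a rejects file,
# instead of letting the whole load abort on the first bad row.
with open("rejects.csv", "w", newline="") as f:
    rejects = csv.writer(f)
    for record in incoming:
        try:
            conn.execute("INSERT INTO ods_customer VALUES (?, ?)", record)
        except sqlite3.IntegrityError as err:
            rejects.writerow([*record, str(err)])
conn.commit()

loaded = conn.execute("SELECT count(*) FROM ods_customer").fetchone()[0]
print(loaded)  # 2 good records loaded, 2 diverted to rejects.csv
```

The load finishes cleanly, the good records land in the target table, and the bad ones sit in the rejects file with their error message, waiting for someone to look at them in the morning.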

In the data output step where you enable the error handling (in this case the ODSTable step), you can configure some extra settings for the error handling.

 

For those who want to read up a bit on this feature, go and surf the net. The documentation isn't abundant, but the information is out there. My little scenario ends here, because I really want to talk about the 'Rejects step' and not do a write-up on error handling in PDI. So here I go.
 

Why a rejects step, if kettle already does it all?
As you might guess from the above scenario, one of the problems you often end up with if you push all the rejects to flat files is that you end up with a large number of files, each with a lot of records to analyze for data quality. If you are lucky, you are working with top-shelf data and you'll have few rejected records and few files to look at, but most people will not be that lucky.
So as your data warehouse grows (not only in size, but also in data integration logic and code), the number of rejects files will grow and you'll lose the overview of what is happening in terms of data quality.

To avoid this problem, we came up with the idea to merge all rejects into one single database table. This has some advantages:
  • Having all your rejects in one single place makes it easy to produce some statistics on your rejected records.
  • If they are in a database table, they really are just part of your data integration logging, and therefore you can easily include them in the logging reports you already produce every day to see/show how the ETL runs have been going.
So what does it look like?

Basically, nothing has been modified in the standard error handling mechanism of PDI. The only thing we added is the output step for the rejects. In this step you can define the following things:
  • Rejected records connection: database connection to which you want to write the rejects (this can be a parameter)
  • Rejected records schema: database schema to which you want to write the rejects (this can be a parameter)
  • Rejected records table: database table to which you want to write the rejects (this can be a parameter)  
  • Error count field: name that you want to give the column containing the error count (this can be a parameter)  
  • Error descriptions field: name that you want to give the column containing the error description (this can be a parameter)  
  • Error fields field: name that you want to give the column containing the error fields (this can be a parameter)  
  • Error codes field: name that you want to give the column containing the error code (this can be a parameter)  
Below those input fields, you can specify the list of incoming fields that makes up the key of your record.
 


The rejects table will have the following layout.
  • id: This column contains all the key fields of your rejected record in comma separated format (provided you specified them in the rejects step) 
  • value: This column contains all the remaining (non-key) fields of your rejected record in comma separated format 
  • batch_id: Your batch ID (see KFF)
  • transname: name of the transformation that generated the reject
  • jobname: name of the job that generated the reject
  • stepname: name of the step that generated the reject
  • ${error_count_field}: number of errors
  • ${error_descriptions_field}: error description
  • ${error_fields}: field that generated the error
  • ${error_codes}: error code 
  • logdate: date/time of the reject
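In DDL terms, the table sketched above could look something like the following. This is a hypothetical sketch (exercised here through sqlite3 so it can run standalone); the last four column names are placeholders for whatever names you configured in the rejects step.

```python
import sqlite3

# Hypothetical DDL for the rejects table; the ${...} columns from the
# step configuration become whatever names you chose as parameters.
ddl = """
CREATE TABLE rejected_records (
    id                 TEXT,      -- key fields, comma separated
    value              TEXT,      -- non-key fields, comma separated
    batch_id           INTEGER,   -- KFF batch ID
    transname          TEXT,      -- transformation that generated the reject
    jobname            TEXT,      -- job that generated the reject
    stepname           TEXT,      -- step that generated the reject
    error_count        INTEGER,   -- ${error_count_field}
    error_descriptions TEXT,      -- ${error_descriptions_field}
    error_fields       TEXT,      -- ${error_fields}
    error_codes        TEXT,      -- ${error_codes}
    logdate            TIMESTAMP  -- date/time of the reject
)
"""
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
cols = [c[1] for c in conn.execute("PRAGMA table_info(rejected_records)")]
print(cols)
```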
Simple enough to write some statistics on. Should I want to know which steps generate the most errors, a simple query of the type:
select stepname, count(*)
from ${rejected_records_table}
group by stepname
would tell me enough.

Now if you add the element of time to this, storing your rejects for each nightly batch run, you can plot a nice evolution of how you have been handling data quality problems over time. Any data warehouse project will sooner or later require this kind of statistics, because data quality problems are there. The question is only when they will surface.
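Such a trend query could look like the sketch below, run here against sqlite3 with made-up sample data; in your warehouse you would point the same SQL at the real ${rejected_records_table}.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rejected_records (stepname TEXT, logdate TEXT)")
# Made-up rejects from three nightly batch runs.
conn.executemany(
    "INSERT INTO rejected_records VALUES (?, ?)",
    [("ODSTable", "2011-03-07"), ("ODSTable", "2011-03-07"),
     ("ODSTable", "2011-03-08"), ("DimLoad", "2011-03-08"),
     ("ODSTable", "2011-03-09")],
)

# Rejects per nightly run: plot this and you see the data quality trend.
trend = conn.execute("""
    SELECT date(logdate) AS run_date, count(*) AS rejects
    FROM rejected_records
    GROUP BY date(logdate)
    ORDER BY run_date
""").fetchall()
print(trend)  # [('2011-03-07', 2), ('2011-03-08', 2), ('2011-03-09', 1)]
```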


Good luck with it!

March 8, 2011

Another kettle book on its way ...

For the kettle aficionados out there, two excellent PDI books are already available.


And if that is not enough, another book is in the making. María Carina Roldán has teamed up with Adrián Pulvirenti to deliver the "Pentaho Data Integration 4: Cookbook" (again with publishing house PACKT). 

The book will deliver a series of 'recipes' to solve typical (and less typical) kettle riddles, divided over 9 chapters with the following titles:
  1. Working with Databases
  2. Reading and Writing Files
  3. Manipulating XML Structures
  4. File Management
  5. Looking up for Data
  6. Understanding How Data Flows
  7. Executing and Reusing Jobs and Transformations
  8. Integration with Pentaho Suite
  9. Some More Useful Recipes


I have had the pleasure of reviewing chapters 1 to 7 so far, and will soon read chapters 8 and 9, which are still being written.

My impression so far is that the analogy with a cookbook is actually well chosen. The recipes are organised in logical groups, just as a cookbook has starters, rice dishes, desserts, etc., but the recipes within the same category, though logically related, can still be very different, like vanilla pudding and chocolate mousse.

Maybe not all of them will be to your liking, as each recipe tackles a different problem. And some people just prefer chocolate over strawberry. Having said that, I'm not saying you shouldn't get the book because there might be some recipes you don't like or won't use. Actually, while reviewing the book, I found all the recipes interesting. And even if there are a few I won't use to literally cook the dish María and Adrián suggest, I will use the ideas I picked up in them to flavor my own dishes.

You can already get the book in RAW format from PACKT right now, or wait until the end of April when all chapters are complete and the book becomes available.

Blog for mobile devices

We noticed that about 5% of our blog visits are from mobile devices. This percentage has been increasing steadily over the last year.




Since November or December 2010, Google Blogger (finally) supports adapted layouts for visualization of blog pages on mobile devices. Given that you are using mobile devices to look at our site, we have enabled this feature on our blog. We hope it will improve the readability of our blog while you are on the move.

What you should see when you go to our blog on your mobile device is the following:


Enjoy the reading. And don't hesitate to leave your comments.

Connect as sysdba

Sometimes, when working with an Oracle database, you need to connect as sysdba or sysoper.
A user must connect AS SYSDBA or AS SYSOPER if he wants to perform one of the tasks that require sysdba or sysoper privileges (such as shutting down or starting up an instance). If he connects as SYSDBA, he becomes SYS; if he connects as SYSOPER, he becomes PUBLIC.
On the command line, using SQL*Plus, this is pretty simple. Just add "as sysdba" to your connect statement.
connect sys/my_secret_password as sysdba
However, when you are working with a graphical user interface, logging in as sysdba isn't that evident, basically because the login box you are presented with only allows you to enter a username and a password. So where does the "as sysdba" part go?

Using SQL Developer

Just add "as sysdba" after your username. I needed this not so long ago and luckily found it after some Googling. My SQL*Plus isn't up to par, so being able to use a GUI is a real time saver.



Using Quest Toad
If you are using Toad, the answer is fairly simple. They have provided a little drop down list called "connect as" which allows you to select between "normal", "sysdba" and "sysoper".

AD/AC


Mobile blogging

This blog post wasn't written entirely in the Mobile Blogger app by Google. Some editing was done afterwards using the standard Blogger web interface. All editing done afterwards is listed below.

Google finally released a mobile app for Blogger. Since I happen to be travelling a lot in the coming days, I thought I'd give the app a try, just to kill time on the road and get ready for the next PCG ;-)



First of all, let me say I'm a fan of the available Google mobile apps. Gmail on Android is excellent. Mobile Maps is beyond excellent. Skymaps is ultra cool, and though I hardly use YouTube on my computer, I started using it on Android. Needless to say, I have expectations for this Mobile Blogger app.

From the moment you fire up the app, you are connected to your Google account. Mobile single sign-on, nice. And the app will present you with your list of blogs. So you simply pick the right blog from the list and ... action.

Creating a new mobile post and editing it with Mobile Blogger is pretty intuitive. Just hit the pencil button on the main screen and you are ready to go. There is a first text box for the title and a second for your post. Easy enough to start a new post, for instance while you are standing in a queue to check in your luggage at Ciampino airport.


At the bottom of the screen there is a publish/save button, which allows you to publish your post or save a draft, same as in the web version of Blogger. I believe saving isn't done automatically (as it is in the web interface). A bit annoying, as you will have to remember to save your work regularly. Seems like MS Office 3 dot something all over.

Next to the pencil button, there is a button to see your list of mobile posts. It is important to know that this list will only show mobile posts (published and draft versions). Any drafts or posts you may have made on your PC will not appear in this list. So forget about quickly adding a few lines to a post you were working on from home; you just don't have access to those. And vice versa, you don't see your mobile posts in the web interface. So there is zero interoperability. A missed opportunity if you ask me.

Under the main text box, you find two buttons to add images. By now I can already tell you that you will probably use these more than actually writing text. An Android phone just isn't the right device for massive text input. Also, since you cannot lay out your text (as in using bold, italics, underline, different fonts, ...) in the Mobile Blogger app, creating a lot of text doesn't make sense. You have too few means to control the readability. So adding images will be what you'll do most using this app.

The first button allows you to take a snap and insert it right away in the text. Here's the boring view from my train seat.




After you've taken the picture, the screen goes black for about a second or two. What the app does in that moment isn't clear; I would guess uploading the image. The second image button allows you to pick a picture from your Android's media gallery. Again, after selecting the image the app gets sluggish for a second or so.

The pictures you add will appear next to the buttons, and not in the text, so you don't know where your pictures will end up in your post. And since there is no preview feature, I have no clue while writing this where the picture of my metro ride will appear. Not really very user friendly.

Just below the picture bar there is a line to type your tags. Unfortunately you will have to remember your usual tags, as the app doesn't seem to know your tag list. Again, a bit annoying, as I'd rather keep my tag list nice and clean.

And finally, just above the publish/save buttons, there is a line which allows you to geotag your post. Since you can only add one single location, this seems to indicate you aren't supposed to write long posts. If you are writing in this Blogger app for over two hours on a high-speed train, your location changes regularly, and that isn't supported.

All in all, I must say I'm not overly impressed with this app. It is hardly a step up from the SMS/MMS and email posting that has been available for years. Sure, I can tag and geotag my posts now, but what extra functionality is there? In general it seems Blogger keeps lagging behind other blog software and services, also in the mobile field. Hopefully getting a first version on the market will be the beginning of some rapid improvements. The feedback form included in the app seems to indicate Google wants to listen to its users. I really hope they make something of this app.

PS: While walking past the tram stop close to my place, I noticed the light boards indicating the position and ETA of the trams finally became operational. Those boards have been hanging there for over a year without lighting up (except for a short period when they showed the hour). Let us pray there is no analogy with Mobile Blogger.



Editing done afterwards:
The following editing of the text has been done using the standard Blogger interface.
  • All pictures added using Mobile Blogger appear at the bottom of your post. I moved all images to the right place in the text, resized them and centered them.
  • I changed the alignment of all the text to 'justify' and added a blank line between paragraphs, as the text wasn't really readable otherwise.
  • I put in the wrong label, so I updated that.
  • I've added some links to the text. I didn't figure out whether that could be done using the app.
Conclusion:
As the app is today, I don't suggest using it for any serious mobile blogging. It is good to quickly upload a picture with some text and geo-tag it. Beyond that, the app will just let you down.