February 22, 2010

Parallell ETL execution

In Pentaho Data Integration it is extremely easy to execute jobs in parallel. The job below is an example of 7 jobs launched together right after the step which 'starts' the job.

It's very easy to configure. (You can listen to some Melody Gardot while doing it, as you can see !) . Just right click on the step previous to the parallel jobs, and select ''Launch next entries in parallel".

So kettle has made life easy for us again (as of version 3.0 if I recall correctly). However it doesn't end here. You want to run jobs in parallel in order to speed up your whole ETL process. In order to achieve the best possible results, there are a few things to study and consider.

Balance out short and long running jobs to get a good spread of system load
Your ETL batch process will contain some jobs that - running standalone - would last a very long time, while others will be relatively short.  Putting the long running jobs in parallel seems like a better choice than putting the 20 very short running jobs in parallel. The latter will probably  just create a 5 minute peak load on your system while after that 5 minute peak you still need to wait for the longest running job. So spreading jobs intelligently over the time span of the longest running job (or chain of jobs) seems like the way to go.

So rather than just launching all you've got in parallel without thinking, a good set-up would be something like the below. Obviously this means that you have an approximate knowledge of how long each job would last if run by itself. Make sure you get those statistics.

Next is to try. Trial and error is part of the game. Some configurations will work better than others. So playing around a bit is unfortunately part of the game.

Funtional dependencies
Needless to say, if jobs depend on each other, it's better not to parallelize them. 

Competing for resources
Similarly, jobs that are competing for the same resources - not CPU - e.g. that might require reading (or updating/inserting) the same tables, are best schedule one after the other, as they might drastically reduce performance.

As time goes by
Panta rei. Everything changes all the time. Therefore what is today your longest running job, may be be just an average job tomorrow. And a well performing job might become the worst pupil in the class. Data volumes and complexity of the processing will change over time. This will affect your parallelization. So re-evaluate your set-up from time to time.

In other words, scheduling jobs in parallel isn't really a technical issue. To start with, it all drills down to knowing the performance of your single jobs and have a good functional understanding of your code.


Remark: Obviously things can get more technical. Once you get to the point where you have one single job that still takes too much time, you might need to spread that job over different machines. That's a different kind of parallelization than what I describe above. How to do this has been described already.