May 24, 2010

kettle vs BODI - 'out of the box' performance comparison

I was cleaning up my laptop this weekend and ran into a forgotten file with a quick performance comparison between PDI (Pentaho Data Integration, a.k.a. Kettle) and BODI (Business Objects Data Integrator) that I did for a customer. Now when I say "performance comparison", please don't think of a laboratory-like test with fully documented results and full control over all variables. On the contrary, our approach was extremely lean, for the simple reason that we were doing a PDI POC on functionality, not on performance. So the performance test was something we were allowed to spend 2 hours on, max.

Anyway the set-up was the following:
1) PDI and BODI installed on the same machine
2) Reading/writing from/to the same database server
3) Take 3 existing (simple) BODI jobs and convert them (without thinking) into PDI jobs

I guess 1) and 2) don't need much comment. Running both tools on the same machine makes the test results reasonably comparable; if that doesn't, what does? And since both tools were reading/writing data from/to the same database server, I believe network and I/O differences were more or less excluded from the comparison. Point 3) deserves a quick word, though.

We wanted to work on simple everyday jobs without spending time on them, because that is what a real-world scenario looks like. Most ETL developers I know just grab the ETL tool and start bashing. Many of them don't really master all the tricks for performance tuning. So if you are looking for a tool that performs well, I guess what you really mean is a tool that performs well 'out of the box', in a scenario where no product expert is invited to spend 3 days fine-tuning your code and infrastructure. Depending on your needs you might agree or not, but that was our philosophy.

Although the executed code doesn't matter much, here is a bit of background on the type of jobs we ran.
  • Job/Transformation 1: Read 20 million rows, split the stream in 2, sort each stream on different fields, count the number of resulting records in both streams and write the output (+/- 20 lines) to an output table.
  • Job/Transformation 2: Read 20 million rows, perform an in-memory lookup for one of the columns against a table with approximately 10,000 rows and write the results to a table.
  • Job/Transformation 3: Read 20 million rows, denormalize them and write to disk.
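
For readers who don't build ETL jobs every day, the in-memory lookup of Job/Transformation 2 boils down to something like the Python sketch below. This is only an illustration of the idea, not the code either tool runs: the function name lookup_join and all field names are hypothetical.

```python
def lookup_join(rows, lookup_rows, key, lookup_key, value_field, default=None):
    """Enrich each row with a value from a small lookup table held in memory."""
    # Build the index once: cheap for a lookup table of ~10,000 rows.
    index = {r[lookup_key]: r[value_field] for r in lookup_rows}
    # Stream the large input row by row instead of loading it all at once.
    for row in rows:
        enriched = dict(row)
        enriched[value_field] = index.get(row[key], default)
        yield enriched


# Tiny usage example with made-up data.
countries = [{"code": "BE", "name": "Belgium"}, {"code": "NL", "name": "Netherlands"}]
orders = [{"id": 1, "country_code": "BE"}, {"id": 2, "country_code": "XX"}]
enriched_orders = list(lookup_join(orders, countries, "country_code", "code", "name"))
```

The small table is indexed once and the big input is only streamed past it, which is why this kind of job mostly measures read and write throughput.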
Anyway, these were the results.

Transformation      BODI (sec)   PDI (sec)   Difference
Transformation 1          4260        1501   184% faster
Transformation 2          1563        1035    51% faster
Transformation 3          5048        1054   379% faster

Or in other words, even in the "worst case" PDI was about 50% quicker than Business Objects Data Integrator. And that in an out-of-the-box scenario, without any tuning.
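
The percentages in the Difference column are consistent with taking the time saved relative to the PDI runtime, i.e. (BODI - PDI) / PDI. That formula is an assumption about how the column was derived, and the name percent_faster is purely illustrative, but a small Python check against the measured timings reproduces the published figures:

```python
def percent_faster(bodi_sec, pdi_sec):
    """How much faster PDI was, as a percentage of the PDI runtime (assumed formula)."""
    return (bodi_sec - pdi_sec) / pdi_sec * 100


# Timings in seconds, taken from the results above: (BODI, PDI).
timings = {
    "Transformation 1": (4260, 1501),
    "Transformation 2": (1563, 1035),
    "Transformation 3": (5048, 1054),
}

for name, (bodi, pdi) in timings.items():
    print(f"{name}: PDI {percent_faster(bodi, pdi):.0f}% faster")
```

Read the other way around, PDI needed roughly 35%, 66% and 21% of the BODI runtime for the three transformations.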

Want more information? Contact kJube.