The merge step in Kettle provides great functionality for identifying new, changed, identical and deleted records when comparing two different tables. This allows you to split your incoming data in such a way that you can handle inserts, updates and (logical) deletes in the most efficient way: you can filter out all 'identical' records, avoiding the extra round trips to the database that an insert or update would create.
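The compare logic behind the merge step can be sketched roughly as follows. This is a simplification for illustration, not Kettle's actual implementation, and the field names are made up; the key point is that both inputs must arrive sorted on the compare key for the single-pass merge to work.

```python
def merge_compare(reference, compare, key):
    """Single-pass compare of two key-sorted streams of dict rows.

    Yields (flag, row) pairs the way the merge step flags them:
    'new', 'deleted', 'changed' or 'identical'.
    """
    ref, cmp_ = iter(reference), iter(compare)
    r, c = next(ref, None), next(cmp_, None)
    while r is not None or c is not None:
        if r is None or (c is not None and c[key] < r[key]):
            yield ('new', c)        # key only in the incoming stream
            c = next(cmp_, None)
        elif c is None or r[key] < c[key]:
            yield ('deleted', r)    # key only in the target stream
            r = next(ref, None)
        else:                       # same key in both: compare the rows
            yield (('identical' if r == c else 'changed'), c)
            r, c = next(ref, None), next(cmp_, None)

# Hypothetical sample data: both lists sorted on 'id'.
target   = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
incoming = [{'id': 2, 'name': 'B'}, {'id': 3, 'name': 'c'}]
flags = [flag for flag, row in merge_compare(target, incoming, 'id')]
print(flags)  # ['deleted', 'changed', 'new']
```

If either input is sorted differently from the other, keys stop lining up and rows get flagged wrongly, which is exactly the failure described below.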
I used the step in an ETL flow where I needed to compare information from an Oracle database and a DB2/400 database. The DB2/400 holds the original data, while I keep a copy of that same data in our operational data store on Oracle. So in order to find what has changed since my last batch load, I compare both tables, cross-platform, using the merge step.
When testing the transformation, however, I noticed some funny behaviour. Even though both data flows mapped to the merge step were sorted on the same fields (a requirement for the merge step to work), the refresh of the target table still didn't run correctly.
After some digging I figured out that both databases used different sort settings, as shown below: the DB2 on AS/400 sorted character fields differently than the Oracle database. Now of course I could have done the sort in the transformation itself, which would have solved my sort problems, but then again sorting and aggregating are things better left to the database(s). They are just better at that.
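The difference is easy to reproduce: Oracle's default binary sort follows ASCII byte order (digits before uppercase before lowercase), while DB2/400 sorts by EBCDIC byte order (lowercase before uppercase before digits). A small illustration, using cp037 as a stand-in EBCDIC code page (the actual one depends on your AS/400's CCSID):

```python
import codecs

values = ['a1', 'A1', '1a']

# Oracle default: binary sort, i.e. plain ASCII byte order.
ascii_sorted = sorted(values)

# DB2/400: EBCDIC byte order, simulated by sorting on the cp037 encoding.
ebcdic_sorted = sorted(values, key=lambda s: codecs.encode(s, 'cp037'))

print(ascii_sorted)   # ['1a', 'A1', 'a1']
print(ebcdic_sorted)  # ['a1', 'A1', '1a']
```

Two sources that each claim to be "sorted on the key" can thus deliver rows in entirely different orders, which silently breaks the merge step's single-pass compare.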
Still, it was a little 'bug' that took me some time to figure out. Once I identified the issue, the solution was easy: I added the NLS_SORT parameter to the Oracle connection, which made the sort settings on Oracle match those on the AS/400, at least for the duration of that session. Enough to solve my problem.
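For reference, the session-level equivalent in SQL looks something like the following. EBCDIC is one of Oracle's built-in linguistic sort names, but verify that it actually matches your AS/400's collation before relying on it:

```sql
-- Make this Oracle session sort character data in EBCDIC order,
-- matching the DB2/400 side of the compare.
ALTER SESSION SET NLS_SORT = EBCDIC;
```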
This allowed me to keep the sort on the database (more efficient than having the ETL tool do it) and still get correct results. So for those of you who use this functionality across databases: check your sort results carefully before using the merge step!