Pentaho Kettle Solutions Overview


Dear Kettle friends,
Great news! Copies of our new book Pentaho Kettle Solutions are finally shipping. Roland, Jos and I worked really hard on it and, as you can probably imagine, we were really happy when we finally held the physical version of our book in our hands.
So let's take a look at what's in this book, what the concept behind it was and give you an overview of the content...
The concept
Given that Maria's book, Pentaho Data Integration 3.2, was due when we started, we knew that a beginner's guide would be available by the time our book was ready. As such we opted to look at what the data warehouse professional might need when he or she starts to work with Kettle. Fortunately there is already a good and well-known checklist out there to see if you covered everything ETL related: The 34 subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit. And so we asked Mr. Kimball's permission to use his list, which he kindly provided. He was also gracious enough to review the related chapter of our book.
By using this approach we allow the users to flip to a certain chapter in our book and directly get the information they want on the problem they are facing at that time. For example, Change Data Capturing (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.
In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point. In fact, there are plenty of steps we did not cover in this book. However, wherever a step or feature needed to be explained while covering the subsystems, we did so as clearly as we could. Rest assured though: since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)
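To give a feel for what a subsystem like Change Data Capture is about, here is a minimal sketch of one common flavour: comparing a previous snapshot of a table with the current one. This is only an illustration in plain Python, not code from the book and not how Kettle does it internally. The function name detect_changes, the customer_id key and the sample rows are all made up for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def detect_changes(previous: List[Row], current: List[Row], key: str) -> List[Tuple[str, Row]]:
    """Compare two snapshots and label each difference (hypothetical helper)."""
    # Index both snapshots by the business key for fast lookups.
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    changes: List[Tuple[str, Row]] = []

    # Rows in the current snapshot are either new, changed or identical.
    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            changes.append(("insert", row))
        elif row != prev_by_key[k]:
            changes.append(("update", row))

    # Rows that only exist in the previous snapshot were deleted at the source.
    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            changes.append(("delete", row))

    return changes


# Made-up sample data, purely for illustration.
previous = [
    {"customer_id": 1, "city": "Utrecht"},
    {"customer_id": 2, "city": "Gent"},
    {"customer_id": 3, "city": "Amsterdam"},
]
current = [
    {"customer_id": 1, "city": "Utrecht"},
    {"customer_id": 2, "city": "Brussel"},
    {"customer_id": 4, "city": "Antwerpen"},
]

changes = detect_changes(previous, current, key="customer_id")
for action, row in changes:
    print(action, row)
```

Snapshot comparison is only one of several CDC approaches; timestamp-based and log-based capture make different trade-offs, and which one fits depends on what the source system can offer.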
The content
After a gentle introduction into how ETL tools came about and more importantly how and why Kettle came into existence, the book covers 5 main parts:
1. Getting started
This part starts with a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained. This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that the installation and configuration of Kettle are covered. Since the installation is a simple unzip, that coverage includes a detailed description of all the available tools and configuration files.
Finally, you'll get hands-on experience in the last chapter of the first part titled "An example ETL Solution - Sakila". This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.

2. ETL
In this part you'll first encounter a detailed overview of the 34 subsystems of ETL, after which the art of data extraction is covered in detail. That includes extracting information from all sorts of file types and databases, working with ERP and CRM systems, data profiling and CDC.
This is followed by chapter 7 "Cleansing and Conforming" in which the various data cleansing and validation steps are covered as well as error handling, auditing, deduplication and last but not least scripting and regular expressions.
Finally this second part of the book will cover everything related to star schemas including the handling of dimension tables (chapter 8), loading of fact tables (chapter 9) and working with OLAP data (chapter 10).

3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution. You'll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13) and lineage and auditing (chapter 14). As you can guess from the titles of the chapters, a lot of best practices and dos and don'ts are covered in this part.

4. Performance and scalability
The 4th part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17) and real-time data integration (chapter 18).
I personally hope that the book will lead to more performance-related JIRA cases, since chapter 15 explains how you can detect bottlenecks :-)

5. Advanced topics
The last part, conveniently titled "Advanced topics", deals with things we thought were interesting to a data warehouse engineer or ETL developer who is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20) or web services (chapter 21). Indispensable in case you want to embed Kettle into your own software is chapter 22: Kettle integration. It contains many Java code samples that explain how you can execute jobs and transformations or even assemble them dynamically.
Last but certainly not least, since it's probably one of the most interesting chapters for a Java developer, is chapter 23: Extending Kettle. This chapter explains in great detail how you can develop step, job-entry, partitioning or database type plugins for Kettle, so that you can get started with your own components in no time.
I hope that this overview of our new brain-child gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time and money, I'm sure there will be the occasional typo or lack of precision, but rest assured that we did our utmost on this one. After all, we each spent over 6 months on it...
Feel free to ask about specific topics you might be interested in to see if they are covered ;-)
Until next time,
Matt