January 26, 2011

Excel writer fun

This post is about making some complex Excel reports using Pentaho data integration and the Excel Writer step.


The Excel Writer step
Some time ago Slawomir Chodnicki (aka Slawo) released the Excel Writer Step version 1.2 for Pentaho Data Integration. Not only did he deliver a wonderful piece of extra PDI functionality, but he also accompanied its release with a great blogpost as well as a sheer endless list of samples.




Since the Excel Writer is donation ware, and Slawo is a big kettle contributor, it should come a no surprise that the Excel Writer will be standard included in kettle 4.2. A great contribution, which have been eager to use in a real life situation.


The use case
Last week, I used the Excel Writer for an assignment. The reason why customer wanted the results in Excel were simple. They needed to manipulate both the data as well as the look and feel of the graphs, before using them in Word documents and or Powerpoint documents. Some questions weren't even needed in the final report. So a lot of editing was needed after the creation of the initial report.


The basic requirements were the following:
  • Generate Excel reports from the responses to (multiple choice) questionnaires.
  • Create one Excel sheet with the questionaire, and one Sheet per question showing the answers 
  • Each Excel report with answers should contain both the figures in a table as well as in a graph. 
  • Each questionnaire can consist of somewhere between 20 to a 100 questions. 
  • Each question can have somewhere between 2 (yes/no) to 10 answers.
The code
The basic code was pretty simple. 

  • One transformation to fetch all the questions of the questionnaire and write them to an Excel Sheet. 
  • One transformation to write all the answers to a sheet, to be executed per question
  • + some logic to clean up the previous version and zip the new version


The output - one Excel file?
My first idea was to generate one Excel book, containing 1 sheet with an overview of all the questions in the questionnaire, and next per question 1 sheet with the graphs/tables on the answer. It was based upon two sheet templates and the result was supposed to look like this.




However when I tried this I ran into the strange issue that the graphs on my Excel template disappeared. I took me some time to figure out that this was a limitation of the Excel writer. A limitation that was even documented, but unfortunately, I don't always read tool tips.




Anyhow, Slawo confirmed me that the POI library that is used, has some limitations:
You may have hit a limitation here. In order to use a sheet as template 
> it needs to be copied (only in memory, but a copy non the less). If POI 
> does not understand things (like charts) it ignores them as best it can. 
> But if you ask POI to copy stuff, it will ignore things it does not understand too. 
Basically it seems that when your sheet template gets copied, our friend the poi library will just randomly leave out stuff he (or she?) doesn't like. 


The output - many Excel files
So I started looking for a work around that was still elegant enough. The answer was pretty simple. Don't work with separate sheets, but with separate files. As soon as I disabled the use of the SheetTemplate, and I wrote answers to each question each time to a new Excel sheet, all graphs and lay-out stayed intact.
To make it a bit easier for the users, to navigate the large number of output files, I used the hyperlink features of the Excel Writer step. Actually putting a hyperlink from the questionnaire to the right file with answers made the whole set of reports easily browsable. And yes, on every sheet with answers, a link back to the overview was inserted.
I also added a zip step at the end to bind all the reports together into 1 zip file, making it easier for the user to receive the file. (More on that in my next post by the way). 


Some Excel tricks
Using Excel allowed me (or obliged me) to use some Excel report tricks as dynamically changing the range of values shown in your graph (based on the actual values that are filled in in your table). Or allowing the user to quickly change the numbers in the graph from absolute figures to percentages.
Even though I'm not such an Excel reporting fan, it was amazing to see that all of your "programming" in Excel actually continues to work nicely (as long as you don't use SheetTemplates that is :-) ).


Conclusion
Notwithstanding the issue with the 'disappearing' charts, I must say that I'm pretty pleased with the end result and the actual time to put the whole thing together. Development time was really minimal and given the nice set of samples that Slawo has put out there, using the Excel Writer writer in the correct way was really a piece of cake. And to round it up, I must agree with Trout, that Slawo is encredibly quick with problem analysis and answers. And I'm not expressing any opinion about Pentaho support with that :-)