September 30, 2010

PCG10 KFF presentation

Unfortunately I wasn't able to blog my own presentation live during the Pentaho Community Gathering in Cascais, Portugal (PCG10), last Saturday. The Live Blog page drew a lot of attention during and after the event - statistics will follow - and I was even asked a few times whether I would still write a summary of the KFF presentation to go with the slides. Well, I will do no such thing! Instead, however ...

Since the whole objective of our presentation was to somehow "launch" KFF - except for a handful of insiders, no one within the Pentaho Community had heard about KFF before - I thought it would be worthwhile to write a full walk-through of the presentation for everyone who might visit the 'PCG10 Live Blog'. So here it goes: no summary, but the full presentation in blog format, plus some little extras at the end. Enjoy.

KFF, as presented at the Pentaho Community Gathering 2010


As the title slide of my presentation suggests, KFF is all about Pentaho Data Integration, often better known as kettle. KFF aims to be an exciting addition to the existing toolset of kettle, spoon and kitchen (all clearly visible in the picture) and at the same time a stimulus for improving these tools.

Why oh why?

Any, and I mean any, consultant who has worked at least once with a data integration tool, be it Informatica, Datastage, MS Integration Services, Business Objects Data Integration, Talend (somehow forgot to name this one at PCG10), Sunopsis - Oracle Warehouse Builder - Oracle Data Integration, has been confronted with the fact that some elementary things are not available out of the box in any of these tools. Think of:
  • Job/transformation logging without set-up or configuration
  • Integrated alerting mechanisms (for when things go wrong)
  • Integrated reporting 
    • as part of the alerting or 
    • just to understand the health of your data integration server
  • Guidelines for a multi-environment (DEV, TST, UAI, PRD) set-up
  • Easy code versioning and migration between environments
  • Automated archiving of your code
  • ... etc
After some years - too many, I would say - I came to the conclusion that whatever the data integration technology, I was always rewriting the same concepts over and over again. And all customers seemed more than happy with the "frameworks" I built. So I started wondering how it was possible that data integration vendors were not covering the above requirements with a standard solution, if the requirements are the same across all customers. I took the discussion up with Matt, and he felt the same.

Once we realized this, re-implementing the same concepts again and again became hard to bear.


Luckily Matt and I had the chance to do a lot of projects together, using kettle, and we started building something we could re-use on our projects back in 2005. With every new project we did with kJube, our 'solution' grew, and we got more and more convinced that we needed to share this.

So in June 2010 we listed everything we had and decided to clean up the code and package a first version of it, to show and share at the Pentaho Community Gathering.


We soon noticed that the first version couldn't include nearly everything we had ready. What we presented at PCG10 is just a basic version to show you all what we're doing. The whole release schedule will take until January 2011, or longer if new additions or change requests interfere.

So what is KFF?


We decided to call our solution the Kettle Franchising Factory.

Franchising seemed a nice term because it stays nicely within the existing kettle, spoon, kitchen, chef, carte, etc. metaphor. It indicates that the KFF objective is to scale up your data integration restaurant to multiple 'locations' where you cook the same food. That's basically what we want: make kettle deployments multi-environment and multi-customer, whilst keeping the set-up standard.

The term Factory refers to the fact that we want every part of the process to go as speedily and automatically as possible. This factory contains all the tools to deploy kettle solutions as swiftly as possible.


The tools through which we reach those goals are of several kinds:
  • Some of the requirements we meet by proposing set-up standards. We try to make as few things as possible dependent on standards or guidelines - everything should be configurable - but large data integration deployments only stay neat and clean if some clear set-up standards are respected. Also, standards on parametrization need to be imposed if you want your code to be flexible enough to run on multiple environments without further modifications.
  • A lot of functionality is implemented using reusable kettle jobs and transformations, often using named variables.
  • Quite a few kettle plugins have been written too. We believe that when certain actions can be simplified by providing a kettle plugin, we should provide that plugin.
  • Up to now we have 4 project templates we want to include with KFF. Some "projects" always have the same structure if one follows best practices, so why rewrite things?
  • Scripting. Although limited, there is also some scripting involved in KFF.

So let's go into details

A first element of the KFF we want to show is the 'batch_launcher.kjb'. This kettle job is designed to be a wrapper around your existing ETL code or one of the templates we'll ship with KFF. The objective is to make all calls to re-usable logic such as logging, archiving etc. from this wrapper, without the need to modify your code. (A rough sketch of the flow follows the list below.)

What does this job do (as of today):
  1. The first step of this job will read the right configuration file(s) for your current project/environment. For this we've developed a step called the 'environment configurator'. Based upon some input parameters, the environment configurator will override any variables that (might) have been set in kettle.properties to ensure that the right variables are used.
  2. The job 'kff_logging_init' will
    1. create the logging tables (in case they don't exist yet), currently on MySQL or Oracle,
    2. clean up the logging tables in case there is still data in there,
    3. check whether the previous run for this project finished (successfully),
    4. create a 'batch run'.
  3. The next job calls one of our project templates - currently the datawarehouse template - but can easily be replaced by the top-level job of your data integration project.
  4. After the data integration code has finished, 'kff_logging_reports' generates standard reports on top of the logging tables. The reports are kept with the kitchen logs.
  5. 'kff_logging_archive' 
    1. closes the 'batch_run' based on results in the logging tables and
    2. archives the logging tables (more on that later)
  6. 'kff_backup_code' makes a zip file of the data integration code which is tagged with the same batch_run_id as the kitchen log file and the generated reports.
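To make that lifecycle a bit more tangible, here is a minimal sketch of the same flow written in plain Java, purely for illustration. The real batch_launcher.kjb is of course a kettle job, and the table name, columns, file names and JDBC URL below (kff_batch_run, main_job.kjb, ...) are assumptions, not the actual KFF artifacts.

```java
// Hypothetical sketch of the batch lifecycle that batch_launcher.kjb wraps around your code.
import java.sql.*;
import java.time.LocalDateTime;

public class BatchLauncherSketch {

    public static void main(String[] args) throws Exception {
        String env = args.length > 0 ? args[0] : "DEV";   // DEV / TST / UAI / PRD

        try (Connection log = DriverManager.getConnection(
                "jdbc:mysql://localhost/kff_logging", "kff", "secret")) {

            // steps 1-2: create the batch log table if needed and open a 'batch run'
            try (Statement st = log.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS kff_batch_run ("
                        + " batch_run_id BIGINT AUTO_INCREMENT PRIMARY KEY,"
                        + " environment VARCHAR(8), started TIMESTAMP NULL, ended TIMESTAMP NULL,"
                        + " status VARCHAR(16))");
            }
            long batchRunId = openBatchRun(log, env);

            // step 3: call the actual data integration code through kitchen
            int rc = new ProcessBuilder("kitchen.sh", "-file=main_job.kjb", "-param:ENV=" + env)
                    .inheritIO().start().waitFor();

            // steps 4-6 (reporting, archiving, code backup) would follow here
            closeBatchRun(log, batchRunId, rc == 0 ? "FINISHED" : "FAILED");
        }
    }

    static long openBatchRun(Connection c, String env) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(
                "INSERT INTO kff_batch_run (environment, started, status) VALUES (?, ?, 'RUNNING')",
                Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, env);
            ps.setTimestamp(2, Timestamp.valueOf(LocalDateTime.now()));
            ps.executeUpdate();
            try (ResultSet rs = ps.getGeneratedKeys()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }

    static void closeBatchRun(Connection c, long id, String status) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(
                "UPDATE kff_batch_run SET ended = NOW(), status = ? WHERE batch_run_id = ?")) {
            ps.setString(1, status);
            ps.setLong(2, id);
            ps.executeUpdate();
        }
    }
}
```

The point is simply that the wrapper owns all the batch_run bookkeeping, so your own top-level job never has to.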


How does the environment configuration work, and why is it necessary? Well, the fact that kettle by default only provides one kettle.properties file in which to put all your parameters is kind of limiting when setting up a multi-environment kettle project. The way you actually switch between environments in a flexible way is by changing the content of variables. So we created the environment configurator. I'm not going to elaborate on this again, since I blogged about this plug-in in August when we first released it. I believe that blog post elaborates more than enough on the usage of this step.
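For those who haven't read that post, the idea boils down to layering environment-specific values on top of the kettle.properties defaults, roughly like the Java sketch below. The file names and the project/environment naming convention shown here are assumptions for illustration only; the actual environment configurator is a kettle plugin step.

```java
// Conceptual illustration of the environment configurator: later loads override earlier ones.
import java.io.FileReader;
import java.util.Properties;

public class EnvironmentConfiguratorSketch {
    public static void main(String[] args) throws Exception {
        String project = "myproject";   // illustrative project name
        String env = "TST";             // DEV / TST / UAI / PRD

        // start from the generic kettle.properties defaults ...
        Properties vars = new Properties();
        try (FileReader in = new FileReader(
                System.getProperty("user.home") + "/.kettle/kettle.properties")) {
            vars.load(in);
        }

        // ... then override them with the project/environment specific values
        try (FileReader in = new FileReader(project + "_" + env + ".properties")) {
            vars.load(in);   // TST values now win over the defaults
        }

        vars.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```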

Obviously the environment configurator is something that works when you execute code through kitchen, that is, in batch mode. However, whenever you fire up spoon, it will just read the properties files in your $KETTLE_HOME directory. So we needed a way to overcome the problem in the development interface as well.

That is what the kff_spoon_launcher.sh does [no Windows script available yet; we do accept contributions from people running Windows as their OS]: if you have correctly set up your configuration files, it will automatically set the right configuration files at run time and fire up spoon on the environment you want. As a little addition - nothing more than a little hack - we also change the background of your kettle canvas. That way you can see whether you are logged on in DEV, TST, UAI or PRD, which is good to know when you want to launch some code from the kettle interface.
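Conceptually the launcher does little more than the following sketch, expressed here in Java rather than shell purely for illustration. The per-environment directory layout under /opt/kff/config is an assumption, not the actual KFF convention; the only grounded idea is that spoon reads kettle.properties from wherever $KETTLE_HOME points.

```java
// Rough sketch of the kff_spoon_launcher idea: start spoon with a per-environment KETTLE_HOME.
import java.util.Map;

public class SpoonLauncherSketch {
    public static void main(String[] args) throws Exception {
        String env = args.length > 0 ? args[0] : "DEV";

        // point KETTLE_HOME at a directory whose .kettle folder holds that
        // environment's kettle.properties, then start spoon with it
        ProcessBuilder pb = new ProcessBuilder("spoon.sh");
        Map<String, String> environment = pb.environment();
        environment.put("KETTLE_HOME", "/opt/kff/config/" + env);

        pb.inheritIO().start();   // spoon now runs with the DEV/TST/UAI/PRD settings
    }
}
```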

So how about that system to create logging tables? Well, the logging tables we use are the standard job, transformation and step logging tables. We tried to stick as closely as possible to the existing PDI logging and just add on top of that.


What did we add:

  • We implemented the concept of a batch logging table. Every time you launch a batch process, a record will be logged in this table that covers your whole batch run. In this case it logs the execution of the top level job. So yes, this is nothing but job logging, but since the top level job has a specific meaning within a batch process, isolating its logging opens up possibilities.
  • We also implemented the concept of a rejects logging table. Kettle has great error handling; however, one feature we felt was missing is a way to standardize that error handling. Our reject plug-in merges all records that have been rejected by an output step into a common format and inserts them into our reject logging table. The full record is preserved, so information could theoretically be reprocessed later. [Question from Pedro Alves: "Is the reprocessing part of KFF?" Answer: No, since we don't believe automating that is straightforward enough.]
  • Logging tables are created on the fly. Why? Well, whenever you are running your jobs/transformations on a new environment you get those nasty errors that your logging tables don't exist. Why should you be bothered with that? If they don't exist, we create them (see the sketch after this list).


  • Creating the logging tables on the fly wasn't just done because we like everything to go automatically. Suppose you would want to run two batch processes in parallel. In a set-up with a single set of logging tables your logging information would get mixed up. Not in our set-up: you can simply define a different set of logging tables for the second batch run and your logging stays nicely separated.
  • Obviously, to implement the above you need to be able to change the logging settings in your jobs and transformations at run time. For this Matt has written some nifty logging-parameter injection code that actually injects the log table information and log connection into the jobs and transformations. More about that on the next slide.
  • At the end of the batch run we also archive the logging information. Even if you have been using different sets of logging tables, all information is merged back together, allowing historical reporting on your data integration processes. Also, the archive tables prevent your logging tables from filling up and making the kettle development interface sluggish when visualizing the logging.
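As an illustration of those last points, the sketch below shows what on-the-fly, per-batch log tables plus archiving could look like. The table names, the suffix scheme and the archive table are hypothetical; the columns merely mimic the standard PDI job log fields, and the real KFF logic lives in kettle jobs, not Java.

```java
// Hypothetical sketch: per-batch job log tables created on the fly, then merged into an archive.
import java.sql.*;

public class LogTableSketch {

    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost/kff_logging", "kff", "secret")) {
            // each parallel batch run gets its own table, e.g. kff_job_log_b42
            String table = ensureJobLogTable(c, "b42");
            // ... the batch runs and kettle writes its job logging into 'table' ...
            archive(c, table);
        }
    }

    // create a job log table with a batch-specific suffix if it doesn't exist yet,
    // so two batches running in parallel never mix up their logging information
    static String ensureJobLogTable(Connection c, String suffix) throws SQLException {
        String table = "kff_job_log_" + suffix;
        try (Statement st = c.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS " + table + " ("
                    + " id_job BIGINT, jobname VARCHAR(255), status VARCHAR(15),"
                    + " lines_read BIGINT, lines_written BIGINT, errors BIGINT,"
                    + " startdate TIMESTAMP NULL, enddate TIMESTAMP NULL, log_field LONGTEXT)");
        }
        return table;
    }

    // at the end of the batch run, merge the per-batch table into one archive table
    // (assumed to exist with the same layout) so historical reporting stays possible
    static void archive(Connection c, String table) throws SQLException {
        try (Statement st = c.createStatement()) {
            st.execute("INSERT INTO kff_job_log_archive SELECT * FROM " + table);
            st.execute("DROP TABLE " + table);   // keep the live logging area small and snappy
        }
    }
}
```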


The rejects step isn't the only plug-in we have written over the last few years. The next slide illustrates some other steps that have been developed.

  • Trim strings
  • Date-time calculator
  • Table compare 

I have blogged about these steps before, so I will not write them out again here. Also, one step that isn't mentioned here, but which we developed too and contributed back to kettle 4.0, is the data grid step.


Another aspect of KFF are the project templates. For the moment what we have is rather meager - only the datawarehouse template is available - but we do have quite some stuff in the pipeline that we want to deploy.


  • The datawarehouse template should grow into a 'sample' datawarehouse project containing lots of best practices and possibly a lot of reusable dimensions (as in date dimension, time dimension, currency dimension, country dimension, ...).
  • The data vault generator is a contribution from Edwin Weber which came to us through Jos van Dongen. We are still looking into how we can add it, but it seems promising.
  • The campaign manager is a mailing application, also known as norman-mailer, which we use internally at kJube. It allows you to easily read out a number of email addresses, send mails, and capture responses from POP3.
  • The db-compare template does an automatic compare of the data in a list of tables in two databases. It will log all differences between the data in the two tables. It is something we've used for UAI testing when we need to prove to our customer that UAI and PRD are aligned (a minimal sketch of the idea follows below).




  • After the presentation Roland Bouman came to me with a great idea for another template. I will not reveal anything as he has his hands full with the cookbook for the time being, and we are busy with KFF. When the time is ripe, you'll hear about this template too.
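Coming back to the db-compare template mentioned a few bullets up, here is a minimal sketch of the underlying idea in plain Java. The connection URLs and table list are made up, and the comparison is deliberately naive (whole rows as strings, same column order assumed); the actual template is implemented as kettle jobs and logs the differences to a table rather than printing them.

```java
// Naive illustration of the db-compare idea: find rows present in one database but not the other.
import java.sql.*;
import java.util.*;

public class DbCompareSketch {

    public static void main(String[] args) throws Exception {
        List<String> tables = Arrays.asList("dim_customer", "fact_sales");   // illustrative list
        try (Connection uai = DriverManager.getConnection("jdbc:mysql://uai-host/dwh", "u", "p");
             Connection prd = DriverManager.getConnection("jdbc:mysql://prd-host/dwh", "u", "p")) {
            for (String table : tables) {
                Set<String> uaiRows = fetch(uai, table);
                Set<String> prdRows = fetch(prd, table);

                Set<String> onlyInUai = new HashSet<>(uaiRows);
                onlyInUai.removeAll(prdRows);
                Set<String> onlyInPrd = new HashSet<>(prdRows);
                onlyInPrd.removeAll(uaiRows);

                // in the KFF template these differences would be logged to a table
                System.out.printf("%s: %d rows only in UAI, %d rows only in PRD%n",
                        table, onlyInUai.size(), onlyInPrd.size());
            }
        }
    }

    // read a whole table and turn every row into a single pipe-separated string
    static Set<String> fetch(Connection c, String table) throws SQLException {
        Set<String> rows = new HashSet<>();
        try (Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    row.append(rs.getString(i)).append('|');
                }
                rows.add(row.toString());
            }
        }
        return rows;
    }
}
```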

So to sum it all up: KFF aims to be a big box with all of the below contents.


We don't expect all of this to be in there from day one. Actually the day we presented KFF at PCG10 was day 1, so have some patience and let us add what we have over the next months.

How will KFF move forward?

Well, we believe the first step was releasing something to the community. We'll keep on doing that. The code for the project is fully open source (GPL) and available on Google Code. Check code.kjube.be or go to the kettle-franchising project on Google Code. We'll listen to your feedback and adapt where possible!


Also we'll follow these basic guidelines:

  • Make KFF deployment as simple as possible. As simple as a plain kettle deployment is impossible, since kettle itself is deployed within KFF, but if you know kettle, you know what we mean.
  • We also believe that some of the functionality we have built doesn't belong in KFF but rather in kettle itself. We'll push those things back to Pentaho. (We'll try to find time to discuss with the kettle architect :-) )
  • If and when something should be automated/simplified in a plug-in we'll do so. 
  • We believe we should integrate with other projects around kettle, such as the cookbook.

Who's in on this?

For the moment, Matt and I are the driving forces behind this "project". Older versions have been in production with some of kJube's customers for years. Sometimes they contribute, sometimes they are just happy it does what it should.



We hope to welcome a lot of users and contributors over the next months.

Feedback

is always welcome!



Thanks!




September 29, 2010

PCG10 on PCG11

Since I've been blogging so much on PCG10, I thought it was worth quickly gathering all the ideas about PCG11 that I heard at PCG10. I probably caught only a fraction of them, but I count on the community to add the rest of the ideas as comments (or on the Pentaho Forums and Wiki, where they belong in order to get a good discussion going).

Improvement idea nbr 1: shorter twitter tag
I'm not sure if this idea needs much explanation or even discussion? The hash tag PentahoMeetup10 was annoyingly long for people tweeting from the room. If the name of the Pentaho Community Gathering doesn't change (see improvement idea 2) I vote for #PCG11.

Improvement idea nbr 2: rename the Pentaho Community Gathering
I've heard rumours (@julianhyde) that the name Pentaho Open World would become the new name. I'm not sure if this is a strategic hint to L. Ellison to make Oracle buy Pentaho. But then again, Julian  knows more about the company strategy than I do.

Improvement idea nbr 3: location
THE QUESTION at PCG10 was: "How will we ever top this location?". Indeed, Cascais and its surroundings were an amazing location! The hotel where the event was held, with a view over the sunny beach, was fantastic! A tower of cupcakes and sweets in the room! Private espresso machines for PCG10! Nice little restaurants with fresh fish dishes right outside the door for a great lunch break! Webdetails and Xpand-IT even managed to organize a cloudless sky and +25°C temperatures.

The question to ask might well be: "Will any one dare to organize PCG11?"

So how to go about setting up PCG11? Do a poll in the community on preferred locations? Ask different contenders for a proposal and let the best one win? I don't have the answer, but the organizer of PCG11 will have a hard time topping PCG10, that is for sure.

A name that came up quite a lot for PCG11, was Brazil. Quite a few Brazilians have been following the live blog, and it seems that there's a very active Pentaho Community out there. But Summer in Brazil is coming real soon, which brings us to improvement idea nbr 4.

Improvement idea nbr 4: timing

Another interesting idea mentioned was to go from a yearly to a half-yearly event. That would fit perfectly with PCG switching between the Northern and Southern hemispheres. We could have a PCG11-South and a PCG11-North. Or Up and Down? Or ... well, whatever.


Improvement idea nbr 5: presentation formats
At PCG10, all 15 presentations were either slideshows and/or demos, with the exception of Dan's (codek1) "presentation". He just went up front with a short prepared speech, actually interviewing the audience on methodology. That resulted in a very interesting group discussion.

So this might actually lead to an idea for varying the format of presentations. Some ideas I heard:
  • group discussions on a specific topic (whoever is interested can participate)
  • a "what sucks session" (proposed by Grumpy)
  • architecture sessions around a white board
There are many possibilities, and with 15 presentations in one day (a heavy schedule) a change of format is most welcome.

Improvement idea nbr 6: extending PCG to a full OSBI event
Aaron's presentation at PCG10 discussed the idea that the Pentaho BI server is more and more becoming a Pentaho BI APPLICATION server, which might/will also support JasperReports, BIRT, etc. That sparked the discussion of inviting these open source projects to the community table as well. PCG would then become a true open source BI event. Personally I find that a very challenging idea, however it does raise some practical questions. PCG is growing quickly as it is; wouldn't adding even more momentum make organisation of the event just too tough for any of the partners to take on? And how to balance the agenda between Pentaho and non-Pentaho stuff, after all it's a Pentaho sponsored event?


So, that's it. I've posted whatever I remembered from the nice talks with great people, in a fabulous setting, accompanied with great food, wine and beer. It's the best way to make sure these ideas won't fade. Most of these ideas come from great minds and fine people in the community. I hope posting them helps to stimulate the discussion. One thing is sure: PCG will keep on getting better.

    PCG10 participants

    Three days after the event, some mails are going around, trying to reconstruct the "who was who". Indeed, the Pentaho Community Event is growing, and I believe many discovered only when seeing the group picture that they didn't get round to meeting quite a few of the participants.

    I too discovered that I missed the opportunity to get to know some people. So based on the mails that have been circulating and a 'tagged' group picture (thank you Jens Bleuel), I'm trying to put together the PCG10 Participant list.

    It is a work in progress, so please people, help me stick the right name to the right person. Also, should anyone rather remain anonymous, drop me a mail and I'll rename you to Mr X(action).



    Nbr Name (First name - Last name) Twitter from
    1 Roland Bouman @rolandbouman .nl
    2 Matt Casters @mattcasters .be
    3 Håkon Torjus Bommen .no
    4 Marco Gomes .pt
    5 Jos van Dongen aka Grumpy @josvandongen .nl
    6 Carlos Amorim .es
    7 Jens Bleuel .de
    8 Pedro Alves @pmalves .pt
    9 Nikolai Sandved @NikolaiSandved .no
    10 Gunter Rombauts .be
    11 Jochen Olejnik .de
    12 Tom Barber @magicaltrout .uk
    13 Nuno Brites .pt
    14 David Duque .pt
    15 Slawomir Chodnicki @slawo_ch .de
    16 Pedro Pinheiro .pt
    17 Nelson Sousa .pt
    18 Nuno Severo .pt
    19 Paula Clemente .pt
    20 Pedro Martins .pt
    21 Samatar Hassan .fr
    22 André Simões @ITXpander .pt
    23 Rui Gonçalves .es
    24 Dan Keeley @codek1 .uk
    25 Anthony Carter .ir
    26 Julian Hyde @julianhyde .us
    27 Rob van Winden .nl
    28 Pompei Popescu .ro
    29 Jan Aertsen @jan_aertsen .be
    30 Ingo Klose @i_klose .de
    31 Sergio Ramazzina @serasoftitaly .it
    32 Martin Stangeland .no
    33 Dragos Matea .ro
    34 Juan José Ortilles .es
    35 Paul Stoellberger @pstoellberger .at
    36 Doug Moran aka Caveman @doug_moran .us
    37 Thomas Morgner .us
    38 Kees Romijn .nl
    39 Aaron Phillips @phytodata .us


    ... and the following are people that were present, but somehow dropped out of the group picture. Maybe they were on the beach?

    Picture Name (First name - Last name) Twitter from
    Nuno Moreira @webdetails .pt
    Bart Maertens @bartmaer .be
    Juliana Alves .pt

    September 28, 2010

    PCG10 in pictures

    After blogging PCG10, Doug kindly asked me whether I could also host the event pictures. So I've quickly added a picture gallery on the kJube website.

    There are many more participants out there who took pictures. May I ask everyone to mail me their pics, or preferably a link to an archive or so; I'll make sure it all ends up in the gallery.

    I've created the following categories: 

    1) The day before: Most people arrived the day or evening before PCG10 in Cascais. I've added those pictures in here.


    2) PCG10: All pictures from the Pentaho Community Gathering in Hotel Albatroz, including some pictures of the 2 hour lazy lunch break.


    3) Saturday night: After being exposed to a vast amount of presentations, PCG participants break loose.


    4) Bowling and later: Sunday morning a bowling event was planned (where large amounts of coffee were consumed). After that many people drifted off to Sintra or just hung out in Cascais.


    5) Cascais and surroundings: Scenery of Cascais. An amazing location for PCG10.


    With many thanks to the current contributors:
    • Kees Romijn
    • Jens Bleuel
    • Jan Aertsen

    September 25, 2010

    Pentaho Community Gathering (Live)

    It's September 25th, I'm sitting in Cascais, in hotel Albatroz, where the Pentaho Community Gathering 2010 is happening. I'll add some stuff 'live' to our blog as presentations happen.

    Remarks: 
    • September 26th, I revisited the post, cleaned up a bit, added the missing video and some slide presentations that came in late. I tried to keep the 'live' feeling though.
    • September 28th, after I sobered up, I fixed some dead links (presentations of Roland and André are now correctly linked), added Nuno's presentation, and added some links and thank-yous to the organizers of this great event.

    The agenda is pretty crammed so we'll have to see whether we'll manage to stay on track.


    Another thought that passes my mind is whether people will actually be able to refrain from running outside to catch some sun. The view from the meeting room says it all.



    10h15 - Doug Moran
    Doug Moran kicked off the meeting by introducing everyone to one another and thanking WebDetails and Xpand-IT for organizing the event.




    Oh yeah, T-shirts to be distributed later.


    With this, Jos van Dongen, industry analyst, informs the world the conference has kicked off.



    10h30 - Pedro Pinheiro - CDA  [presentation coming up]
    Pedro explains CDA, community data access, a server side solution for data access usable for dashboarding and reporting.



    Since it's a server side solution it's a bit hard to "show" what it "looks like", but more will be revealed during the later presentations (by WebDetails) where dashboarding/reporting tools will use CDA.


    10h54 - Julian Hyde - Mondrian stuff  [presentation]
    Hej, we are ahead of schedule? Can Julian keep it so? I'm not sure as he's calmly taking off with some slides from previous community meetings as well as his kid.



    Anyhow, Mondrian is undergoing a full rewrite. Some of the code goes back 9 years now, so rewriting all that involves A LOT of stuff. Currently Julian seems to wonder if the code will ever build again - see his previous blog post on that - but he's confident he'll get things running (if his youngest doesn't turn off his PC too often).

    So what can we expect?
    • Attribute oriented analysis
    • Physical models
    • Composite keys
    • Measure groups: cubes with multiple fact tables eliminating the need for virtual cubes
    • Improved schema validation


    As far as the transition to Mondrian 4.0 is concerned, Julian says it won't be easy, but Mondrian will remain backwards compatible with version 3. Workbench will need a rework due to the modifications in Mondrian, that is, if Pentaho wants to keep workbench. But there are other options. Agile BI or the Metadata Editor might be extended to serve the purpose. The decision hasn't been made yet. A long beta process is foreseen.

    Short coffee break now, next is Matt Casters

    11h39 - Matt Casters - Dynamic ETL / Metadata Injection [no presentation, check demo below]
    Matt goes over the history ETL tools have gone through: from quick hacks, over frameworks and code generators, to the real data integration engines as we know them. This presentation is about "what is next"?


    Matt shows the example of dynamically loading a csv file into a table. In this use case you don't know the .csv file name upfront, neither do you know the field names, data types etc. What the metadata injector does is pass all the right information to your transformation.

    Anyhow, to say it with Matt's words, cut the talk, just show us the demo.


    In order to enable this kind of metadata injection, a rework of the steps is needed, so it'll take some time before this functionality is available throughout PDI. Also, probably some kind of lightweight UI will be needed for the design of these dynamic ETL solutions.

    The call to the community is: please provide use cases for dynamic ETL.

    12h00 - Aaron Phillips (@phytodata) - Plug-ins and extension points [presentation]
    The BI server is becoming a business intelligence oriented application server rather than just a BI solution server. E.g. CDA (presented earlier on) has been developed as a plug-in that runs as an application on the BI server.

    (Aaron's presentation seems extremely well written out, so I guess it'll be self-explanatory when it will be published later on. We'll add links as soon as all presentations are added online.)


    A very interesting idea presented as an illustration of a BI server extension is an alternative to xactions (Yeah!), being a GroovyEngine plugin for the BI server. This triggers interesting remarks from the community though: we already have a job scheduling mechanism, namely PDI, so why aren't we using that?



    Doug's reply to the matter is that both options are open. The platform will offer possibilities for plugins and you can go one way or the other. Julian wonders why we need the same functionality twice. It seems the whole discussion revolves around whether Pentaho wants to offer a BI server or a BI application server. In the first case Pentaho would offer BI functionality, while in the second case they offer a platform to run BI applications on, even external ones like BIRT reports, Jaspersoft reports, ... Interesting discussions.

    A remarkable fact to add is that Aaron's presence at PCG10 was at the specific request of the community. Some months ago community members launched a poll to make clear that attendance of Pentaho developers who have an ear for the needs of the community is wanted. The results of the poll were clear: "Ship Aaron to PCG10". Will other Pentaho developers score better next year? Or will Aaron remain the uncrowned community hero? To be continued.

    Presentation finished at 12h32, so we are still on our challenging schedule.

    12h30 - Nelson Sousa - CDE (Community Dashboard Editor) [presentation coming up]
    Nelson kicks off wildly - but claims he has done wilder things - with a CDE demo. It shows clearly how you can click together your dashboard (row after column after row after column after row ...), based on CDF components, CDA elements, ...


    (The demo itself is pretty interesting. It's a dashboard showing Tweet statistics.)
    The dashboard editor generates the .html, .js and .css files which go into the BI server.

    For more information on CDE: http://webdetails.pt/

    Lunch break




    While all presenters have been respecting the time table, it seems that most of the community couldn't resist staying out a bit longer for lunch. So in the end we picked the agenda back up with a half-hour delay.

    14h30 - Tom Barber and Paul Stoellberger  - PAT (Pentaho Analysis Tool) [presentation / presentation]

    Paul Stoellberger kicked off the PAT presentation with two slides and dived immediately into the demo part, demoing all the slice & dice, drill down/across, filter etc. functionalities of PAT, both in stand-alone mode and as part of the Pentaho BI server.


    For the moment Paul and Tom aren't adding new features because they want to focus on getting a stable 1.0 out there. Obviously there are some interesting ideas for the PAT future, such as predictive analytics (including WEKA) or adding new charting options (using protovis). But for now, feedback from the community on stability and bug reports is highest on the request list.

    After Paul concluded, Tom presented PAT ideas on modular and collaborative BI with OSGI. The original idea was to work around collaborative BI only, but the ideas have expanded, ... and remain mostly just ideas for now. However, the baseline observation is that currently Pentaho doesn't support collaboration in any way. The CDF has the possibility to insert some comments, and of course you can mail report links etc, but that is about where it ends. So the idea is to build this in using OSGI, a module system for Java allowing you to install new modules without stopping or rebooting the server. Next, Tom starts off a demo of some of the basic features of making PAT 'RESTful'.

    14h30 - Jan Aertsen and Matt Casters - KFF (Kettle Franchising Factory) [presentation]

    I agree Julian, it's hard to blog your own presentation!
    (30/09: I finally added the full walk-through here)


    15h00 - Nuno Moreira - Pentaho Dashboards, breaking barriers [presentation]
    Nuno shows a visually stunning presentation (no images, whatever the text below might suggest) and is asking us to think of dashboards as:
    1. a sexy stripper
    2. a sexy stripper doing a lap dance
    3. a sexy stripper doing a lap dance which you can disassemble (?) [not my words]
    4. a sexy stripper doing a lap dance which you can disassemble and talk to also. 
    5. a sexy stripper doing a lap dance which you can disassemble and talk to also, and who allows you to squeeze her
    6. a sexy stripper doing a lap dance which you can disassemble and talk to also, and who allows you to squeeze her and who doesn't mind you sharing her with your friends.




    No need to say that this way of looking at dashboards really captured the attention of the Pentaho crowd. Obviously Nuno was using a metaphor to talk about different levels of customizability of and interaction with dashboards.

    Short coffee break
    ... with wonderful cupcakes, made by the organizer of the whole event (a big, big thank you for that!)


    15h30 - Jos van Dongen (aka Jos von Dongen, aka Grumpy) - Data mining [presentation]
    "Is data mining the newest piece of shit? ... I don't think so.", says Jos, and kicks of his presentation.


    Jos walks us through the different data mining tools and techniques: decision trees, neural networks, regression analysis, ..., explains the differences between supervised vs unsupervised learning, splitting your data sets, etc.

    He worked out the examples in Weka / Kettle and showed them in a short live demo. Personally I believe that the Weka / Kettle integration is an extremely powerful feature (which many commercial data mining / ETL tools don't even offer today). I really liked the demo and hope to start using this type of functionality soon.

    16h00 - Dan [Codek] - Approaches to implementations and methodology [no presentation]
    Dan is trying to get us back on schedule by keeping his presentation short. No slides prepared, just a short series of ideas sketched on paper as a guideline for his talk. Since he moved from consulting to being responsible for business intelligence in a real company, he's interested in how people manage their projects.

    Basically his talk started with the question who is using SCRUM and ... I filmed the rest.


    16h30 - André Simões - PDI job/transformation framework [presentation]
    André Simões, aka ITXpander, aka 'The useless guy on IRC', talks about an ETL framework including ETL chaining, ETL scheduling, building in check points and making self-contained ETL processes to ensure restartability, etc. Great stuff. A merger between this and KFF was decided on the spot - a clear indication that there is a need for this kind of utility.


    Another addition to the presentation: Pentaho Reporting in Confluence.

    (Sorry for the limited notes; as this was close to KFF, I was too interested in understanding the presentation.)


    In the meantime the heavy schedule starts to weigh on the participants. Having a sunny beach right outside the room doesn't make it easier.



    But the final presenters on the list are known for being able to keep their presentations juicy and spicy, so no doubt the audience will remain present.

    17h00 - Pedro Alves - CCC (Community Charting Components) [presentation]
    Pedro explored 20 charting libraries to see which one was the best to add to Pentaho, as the existing charting is crap. He toyed with the idea of writing a charting metadata layer allowing you to plug in any existing charting library, but that idea was quickly tossed aside as it would add too many layers of complexity.



    So he backed out and thought about what users want. Users don't care about the library you use for charting; they just want you to be able to create the visualizations they need. So he looked for a visualization library rather than a charting library, and settled on protovis. On top of this, Pedro started developing CCC (Community Charting Components), a charting library based on protovis. This allows you to always go back to the visualization library and make/adjust your chart as you want.

    Next Pedro did a demo of how CCC fits in with CDA, CDF and CDE.

    Group picture session



    17h30 - Roland Bouman - Kettle Cookbook [presentation]
    Roland elaborates on dominant users, positive eating experiences, having the guts, communism and Mao's manual which brings him straight to the kettle-cookbook.


    Since Roland is a huge fan of the RTFM theorem, he considered it was time to automatically create documentation based on the ETL code itself - actually the ultimate documentation, why can't the users just read that? - because who actually wants to write documentation; it's even more boring than reading it.


    The kettle-cookbook auto documentation tool is developed in kettle under the LGPL license. It will scan a directory of kettle code (.ktr / .kjb) and generate cross-linked .html pages with a TOC, including diagrams and an overview of all variables, connections, fields, ...

    18h00 - Jens Bleuel - Concept and realization for a PDI watchdog [presentation]
    Jens has the hard task of kicking some life back into a crowd that has been hit with tons of slides and demos throughout the day.


    Basically the watchdog checks whether .ktr/.kjb are alive. Jens walked us through the code he wrote for this.

    ... and this concludes the presentations. Over and out. I'm wasted from blogging all day.
    Maybe I'll add some more pictures later about the evening part of the event.





    I heard that quite a few people are actually reading the blog post as it grows ... which I didn't really expect (honestly). But while you guys are at it, please leave a few comments on this way of 'reporting' on the event (so we can continue to improve). Thank you !!!