February 25, 2010

The Excel Pyramid

Old blog post from 12/11/04, as it was on our website. I don't know if that blog will continue to exist, so I've posted a copy here. Unfortunately, 5 years later, I still keep on encountering Excel construction workers on a regular basis. It just won't go away.

Anyhow, here's the post.

"To my utmost frustration I keep on encountering them at several client sides. Excel pyramids. Just like their Egyptian counterparts it takes years to make them, it remains a mystery how exacly they have been built, they are built for the glory of the top-man and finally that same top-man gets burried under it (or at least that's what he deserves)!

To support the continuous demand for reports, companies often turn to Excel. Obvious, as it remains the most available tool. But as time passes by, more requests come in, time presses, and one Excel sheet gets built on top of another. Chains of Excel reporting spring to life. One linked to the other, (ab)using the vlookup function to fetch data from one sheet to the other, copy-pasting linked data from one sheet to the next, until a highly complex spiderweb of linked Excel sheets has been built that none dare to touch.

The question I ask myself is why nobody pulls the emergency brake in time. I've seen situations where for years (> 10 years) companies built one Excel on top of the other, developing monthly "procedures" to accurately update the collection of sheets that has grown to gigabytes of data each month, copying that "reporting-database-folder" over and over again.

The update procedures become more cumbersome every month, until in the end the company finds itself with a department of 20 people purely dedicated to relinking and recalculating Excel sheets.

It's nice to see when they finally see the light shining on the outside of the pyramid and realize they need to change. But finding your way through the maze inside the pyramid, to actually understand how to get from chaos to an automated solution, is far from evident. And as usual, the customers that are deepest in the sh** are the ones most eager to see results."

J

February 22, 2010

Parallel ETL execution

In Pentaho Data Integration it is extremely easy to execute jobs in parallel. The job below is an example of 7 jobs launched together, right after the entry that 'starts' the job.


It's very easy to configure. (You can listen to some Melody Gardot while doing it, as you can see!) Just right-click on the entry previous to the parallel jobs and select "Launch next entries in parallel".


So Kettle has made life easy for us again (as of version 3.0, if I recall correctly). However, it doesn't end here. You want to run jobs in parallel in order to speed up your whole ETL process, and to achieve the best possible results there are a few things to study and consider.

Balance out short and long running jobs to get a good spread of system load
Your ETL batch process will contain some jobs that - running standalone - would last a very long time, while others will be relatively short. Putting the long running jobs in parallel seems like a better choice than putting the 20 very short running jobs in parallel. The latter will probably just create a 5 minute peak load on your system, after which you still need to wait for the longest running job to finish. So spreading jobs intelligently over the time span of the longest running job (or chain of jobs) seems like the way to go.

So rather than just launching all you've got in parallel without thinking, a good set-up would be something like the one below. Obviously this means that you have an approximate knowledge of how long each job would last if run by itself. Make sure you get those statistics.
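To make that spread concrete, here's a toy sketch - plain Java, nothing Kettle-specific, with made-up job names and runtimes - of the greedy 'longest job first' approach: sort the jobs by runtime and always hand the next one to the chain with the least accumulated work.

import java.util.ArrayList;
import java.util.List;

public class SpreadJobs {
    public static void main(String[] args) {
        // Made-up runtimes in minutes, sorted longest first;
        // in practice these come from your own job statistics.
        String[] jobs = {"load_sales", "load_stock", "load_hr", "load_dims", "load_fx"};
        int[] minutes = {120, 45, 30, 10, 5};

        int chains = 2; // how many jobs you let run side by side
        int[] load = new int[chains];
        List<List<String>> plan = new ArrayList<List<String>>();
        for (int i = 0; i < chains; i++) plan.add(new ArrayList<String>());

        // Greedy: each job goes to the chain with the least accumulated runtime.
        for (int j = 0; j < jobs.length; j++) {
            int best = 0;
            for (int i = 1; i < chains; i++)
                if (load[i] < load[best]) best = i;
            plan.get(best).add(jobs[j]);
            load[best] += minutes[j];
        }

        for (int i = 0; i < chains; i++)
            System.out.println("chain " + i + " (~" + load[i] + " min): " + plan.get(i));
    }
}

With these numbers the two chains finish at roughly 120 and 90 minutes, instead of one short peak of parallel activity followed by a long wait for the big job.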


Next is to try. Some configurations will work better than others, so a bit of trial and error is unfortunately part of the game.

Functional dependencies
Needless to say, if jobs depend on each other, it's better not to parallelize them. 


Competing for resources
Similarly, jobs that compete for the same resources - other than CPU - e.g. jobs that read from (or update/insert into) the same tables, are best scheduled one after the other, as running them together might drastically reduce performance.
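If two such jobs do end up in the same parallel window, the classic fix is to serialize just those two. A toy illustration in plain Java - not Kettle code, names made up - using a one-permit semaphore per contended table:

import java.util.concurrent.Semaphore;

public class TableLockDemo {
    // One permit: at most one 'job' touches the sales table at a time.
    static final Semaphore SALES_TABLE = new Semaphore(1);

    static Runnable job(final String name, final long workMillis) {
        return new Runnable() {
            public void run() {
                try {
                    SALES_TABLE.acquire();
                    try {
                        System.out.println(name + " starts");
                        Thread.sleep(workMillis); // stands in for the real load
                        System.out.println(name + " done");
                    } finally {
                        SALES_TABLE.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
    }

    public static void main(String[] args) {
        // Launched 'in parallel', but they run strictly one after the other.
        new Thread(job("update_sales", 1000)).start();
        new Thread(job("insert_sales", 1000)).start();
    }
}

In Kettle itself the simpler route is usually to just chain the two job entries sequentially inside your parallel plan.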

As time goes by
Panta rhei. Everything changes all the time. Therefore what is today your longest running job may be just an average job tomorrow, and a well performing job might become the worst pupil in the class. Data volumes and the complexity of the processing will change over time. This will affect your parallelization, so re-evaluate your set-up from time to time.



In other words, scheduling jobs in parallel isn't really a technical issue. To start with, it all boils down to knowing the performance of your individual jobs and having a good functional understanding of your code.



J

Remark: Obviously things can get more technical. Once you get to the point where you have one single job that still takes too much time, you might need to spread that job over different machines. That's a different kind of parallelization than what I describe above. How to do this has been described already.

February 20, 2010

Collaboration made easy

Since last year, we at kJube have been regularly using Yugma for desktop sharing and online presentations. It's been working out well, as Matt wrote about a year ago.

However, you still need to install Yugma on your machine, and on 64-bit (which I've been running since Kubuntu 8.04) that has given me some headaches.

So I recently had a look at some other tools namely dim dim as well as vyew. Both are fully browser based.

Dim dim started up perfectly both on my Kubuntu 9.10 and on my Mac G5. However, I soon discovered that screen sharing was 'to be released for Mac and Linux in Q2'. And that is at least a month away. Even though the other functionality of dim dim looked OK, screen sharing is really the main thing I was looking for, so I skipped dim dim and moved to vyew.

And I must say that after a short test, vyew.com came out as a serious piece of software. I created an account and started a meeting on my Mac in less than 2 minutes, invited myself on my Linux machine, and started the session there as well. Fully browser based, no installation on either of the machines, and working out of the box.

Not only did the screen sharing work, VoIP and camera seemed to work as well. Wheeew. I must say, this came as an extremely pleasant surprise.

Below you can see a screenshot of me using vyew, sharing some Pentaho Data Integration work with myself during a late evening meeting. We really got a lot done :-)



Anyhow, if you are looking for a free (up to 20 persons in a meeting) online collaboration tool, try vyew! It also has some cool features for collaborative editing of documents. Only minor remark: screen rendering is sometimes slow. But how fast do you want to run over the screen when on the other side people are expecting you to go slow so they can follow?


J




Comment (22/02/2010):

When I tried to let Harris share his desktop today, I became painfully aware that desktop sharing from Mac to Linux works nicely. However, the other direction doesn't. As the people of Vyew say themselves: "pretty lame". Still, they are ahead of dim dim, as desktop sharing already works on Mac. Sorry for this post not being completely correct.

February 19, 2010

conditional no smoking

No smoking signs have become increasingly prominent over the last few years. So much so that the next sign I expected to see would be something like: 'If you catch someone smoking here you may burn them alive. Art. 52 §666 of the latest over-the-top law on smoking.'

But to the point. I just saw the following sign on an Italian train.





'No smoking, if this train is local or regional'

We don't want to be too harsh on smokers, now do we? Let's start with short detoxification rides. But for the longer travels we allow smoking, otherwise they'll get all shaky and nervous and might attack the conductor (that is, if he bothers to do his job and pass through the wagons).


J

just when I got used to things being open and free

Just recently I needed an ETL tool for a customer to do a quick proof of concept. And when I need an ETL tool at 0 cost, which I can install in less than 5 minutes, Pentaho Data Integration (PDI) is surely my choice. Honesty, of course, bids me to say that if I had half a million euro to spare, as well as some months of deployment time, Pentaho Data Integration would still be the ETL tool of my choice. I was converted over 5 years ago, and I'm still waiting for better tools to come around.

So I have gotten used to the nice fact that PDI ships with a load of JDBC drivers, allowing you to connect to whatever database you might or might not want to run into:
  • IBM DB2 AS400
  • Apache Derby
  • Borland Interbase
  • ExtenDB
  • Firebird SQL
  • Greenplum
  • Gupta SQL
  • H2
  • Hypersonic
  • IBM DB2
  • Infobright
  • Informix
  • Ingres
  • Intersystems Cache
  • Kingbase ES
  • Lucid DB
  • MS Access
  • MS SQL Server
  • Max DB
  • MonetDB
  • MySQL
  • Neoview
  • Netezza
  • Oracle
  • Oracle RDB
  • PALO MOLAP server
  • PostgreSQL
  • Remedy 
  • SAP R/3
  • SQLite
  • Sybase
  • SybaseIQ
  • Teradata
  • UniVerse DB
  • Vertica
  • dBase


For those of you not too familiar with the ETL market, I just want to point out that the commercial ETL vendors have made 'connectivity' their personal cash cow over the last decade. Looking at Informatica Standard Edition for instance, you might notice the little footnote reference next to the words 'Get access'. Yep, that cute little floating 1, which translates at the bottom of the page into 'When complemented by Informatica PowerExchange'. How many customers do they have that buy Informatica PowerCenter Standard Edition for a figure with five zeros, without the need to 'Get access' to data? Euh, does not compute...

But back to the topic. So, I'm a happy data warehouser, able to get all the connectivity I want, for free. That is, I was until now. I unzipped Kettle, wanted to connect to DB2 on z/OS, and the following horrible message showed up:




The version of the IBM Universal JDBC driver in use is not licensed for connectivity to QDB2/ databases. To connect to this DB2 server, please obtain a licensed copy of the IBM DB2 Universal Driver for JDBC and SQLJ.


Bummer. It seems that the JDBC drivers delivered by IBM require a license file.



As of DB2 UDB v8.1.2 the Universal JDBC driver requires a license JAR file to be in the CLASSPATH along with the db2jcc.jar file. Here are the names of the required license JAR files:
  • For CloudscapeTM Network Server V5.1: db2jcc_license_c.jar
  • For DB2 UDB V8 for Linux, UNIX, and Windows servers: db2jcc_license_cu.jar
  • For DB2 UDB for iSeries® and z/OS servers (provided with DB2 Connect and DB2 Enterprise Server Edition): db2jcc_license_cisuz.jar
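For what it's worth, here is a minimal sketch of the plain JDBC connection this boils down to - hostname, port, location name and credentials are made-up placeholders. The point is only that db2jcc_license_cisuz.jar has to sit on the classpath (for PDI: in libext/JDBC) next to db2jcc.jar, or you get exactly the error quoted above.

import java.sql.Connection;
import java.sql.DriverManager;

public class Db2ZosProbe {
    public static void main(String[] args) throws Exception {
        // Type 4 driver from db2jcc.jar; the license jar must be on the
        // classpath as well, or the driver refuses z/OS connections.
        Class.forName("com.ibm.db2.jcc.DB2Driver");
        String url = "jdbc:db2://mainframe.example.com:446/DB2LOC"; // placeholders
        Connection con = DriverManager.getConnection(url, "user", "password");
        try {
            System.out.println("Connected to "
                    + con.getMetaData().getDatabaseProductVersion());
        } finally {
            con.close();
        }
    }
}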



Of course I cannot blame PDI for the functioning of IBM's JDBC drivers. Obviously Big Blue has its cash cows too, in this case the expensive mainframe platform that many a customer cannot get rid of. And so these customers can cough up the $$$, and I can forget about using PDI for a quick proof of concept.

Just when I got used to everything - just everything - being available when I needed it.


J

February 18, 2010

ETL is IO bound

... which doesn't stop Pentaho Data Integration from optimally using all the CPU power it can get out of that old Windows 2000 hag we are developing on :-)

February 17, 2010

to trim or not to trim

While converting a series of Datastage ETL jobs to Pentaho Data Integration, I came across the following statement used in Datastage:

trim(TextField)



Funnily enough, the Datastage trim function doesn't just remove trailing or leading blanks. It also reduces "multiple occurrences of blanks to a single occurrence of a blank", as nicely documented here. So trim('__Hello_____World') would yield 'Hello_World' as result (where _ stands for a blank). Not exactly what you would expect when you see a simple trim function. Yet this is the default behavior of the trim function in Datastage. Not very intuitive, nor in line with how a trim function is implemented in most databases or ETL tools.

Obviously this type of trim has its uses, and imitating the functionality in Pentaho Data Integration wasn't an issue (see the sketch below). But you might wonder whether this is actually a desirable default behavior for a trim function.
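For the record, a minimal sketch in plain Java of what the Datastage trim does; in PDI the equivalent expression can go into e.g. a Modified Java Script Value step.

public class DatastageTrim {
    // Strip leading/trailing blanks and squeeze every internal run of
    // blanks down to a single blank, like the default Datastage trim.
    static String dsTrim(String s) {
        return s.trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        System.out.println(">" + dsTrim("  Hello     World") + "<"); // >Hello World<
    }
}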

J

February 1, 2010

AD/AC

meaning,
another day another customer
:-)



J

dirty apple keyboard meets Bosch

Lately my apple keyboard was getting a bit hard to work on. Most of the keys had become 'sticky keys' - even though I thought that to be a Windows concept - which seems only logical after 4 years of heavy duty without a single moment of personal hygiene. So I thought it was time it got a treatment.
After a short round of research on the net, I discovered that this - see the picture - was a possible solution. In two hours I'll know more about this experiment :-/