Seven years ago, at Pentaho, Matt Casters launched an initiative to build a Star Modeler inside Pentaho Data Integration (kettle). The idea behind the Modeler was to manage your data model from within the data integration tool and to link your data integration code directly to the data model definition. This post is not about the Star Modeler, but Diethard Steiner did a good writeup of how the modeler worked, so those interested can read up on some Pentaho history there.
The reason I bring up this bit of Pentaho & kettle history is that the Star Modeler initiative never took off. Matt put the experiment out in the open, but the idea never made it past a first release, and the functionality unfortunately never got included in the product. There were several reasons for this:
- The first (and least interesting) reason was political in nature. The team architecting the front-end part of Pentaho's technology claimed that modeling belonged to the reporting side of the product. The front-end team ended up owning that task, and nothing useful ever saw the light of day.
- The second reason a proper modeler never made it onto the kettle roadmap was that, by 2011, kettle was increasingly being used in big data contexts, integrating with Mongo, Hadoop, Cassandra, CouchDB and so on. The whole idea of a Star Modeler just didn't match the schema-less future the big data revolution had in mind for us.
- The last reason I can see for the Modeler not taking off is that, when writing to an RDBMS or columnar SQL database, once a schema is created in the database, a set of tables exists to which you can (read: have to) map your data integration at all times. In essence, schema creation is a one-time job, and once the schema exists you need to comply with what is there. That reduces the need for a Star Modeler.
Seven years later, the world of data analytics and our insights have evolved. The big data revolution has indeed kicked in: data volumes are exploding all around us, and the need for flexibility in the data model, beyond what traditional RDBMSs can offer, is recognized everywhere. At the same time, however, in the world of schema-free databases the challenge of data integration has become more significant.
I guess what I'm trying to say is that the reasons that may have made the Star Modeler fail might not exist anymore. And guess what: Matt has released a new Modeler for kettle, this time to manage graph models in Neo4j and to load data into Neo4j at the same time.
As you can see from the screenshots below, the Graph Modeler resembles its older brother somewhat ;-)
I strongly believe that in the schema-free Neo4j world, the ability to manage your graph model from within the same tool you use to load your graph gives you the level of control you need: whatever data you map into your graph stays correctly aligned with the model that should govern it, even with limited or no constraints in place to ensure model integrity.
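To make that "limited or no constraints" point concrete, here is a minimal sketch of my own (not the kettle step itself), assuming the official neo4j Python driver and placeholder connection details: Neo4j will happily accept any label or property you send it, so a single typo silently forks your model, and the closest thing to schema enforcement the database offers is a handful of constraints.

```python
# A minimal sketch illustrating why model governance matters in Neo4j.
# Assumes the official neo4j Python driver; the URL and credentials
# below are placeholders for a local instance.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Intended model: (:Person {id})
    session.run("MERGE (p:Person {id: $id})", id=1)

    # One misspelled label and Neo4j silently creates a second node type;
    # nothing in the database enforces the intended model.
    session.run("MERGE (p:Persn {id: $id})", id=1)

    # The closest thing to schema enforcement is a constraint, and it only
    # covers things like uniqueness, not the overall shape of the model
    # (Neo4j 3.x syntax shown here):
    session.run("CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE")

driver.close()
```

A Graph Modeler that sits inside the loading tool catches the `Persn` mistake at mapping time, before it ever reaches the database.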
In my next blog post, I will spend some time explaining how to use this Neo4j kettle step.