Testimonial: Moldable database analysis

Carlos Lopez is a developer. He started to work with Glamorous Toolkit recently. He used it for reasoning about a database he did not know. Here is the description of his experience in his own words.

Recently I had the task of analyzing a database system for a customer. This database is growing in functionality and the objective was to map the existing database model.

Databases are complex systems; they hold large amounts of data and in doing so, they also contain large amounts of metadata, which describes the data itself. Data is typically stored between multiple tables for different reasons: efficiency, security, etc. The metadata contains the description of the tables, columns and the relationships that connect those different pieces. To understand the database, the first step is to analyze its schema; there are many tools that can generate a schematic directly from the metadata and aid in the analysis.

Databases also hold procedures that describe the business rules which create or modify the data (via the well known INSERT, UPDATE, DELETE or SELECT SQL statements). A full analysis of a database must go over the data storage design and the procedures that manage the data. Understanding the procedures means reading the SQL code inside them. A typical database can contain hundreds of these procedures and it is an arduous job to analyze, group and contextualize each one of these.

What is Glamorous Toolkit (GT)?

Glamorous Toolkit is a platform for data visualization (among many other purposes). I discovered the tool by watching the presentation given by Tudor Girba at Curry On in 2019. In this presentation, one of the ideas that is presented is that objects can have many different representations, and that the proper representation can really accelerate our understanding of these objects while minimizing our metal effort in doing so. The proper representation depends on the current context and the current objective (and the current user), and so, the idea of moldable development is introduced, where the user tailors the environment to present what is needed.

I was immediately inspired to try it. GT is developed using Pharo Smalltalk which was completely new for me, so in the beginning I was able to look at the demos (and see the potential of GT) but was unable to do anything with it. To get a quick grasp of the language, I studied the videos created for the Pharo MOOC which are online. These are very well presented and provide a great introduction to the language. Having some fluency in Pharo is required to do anything with GT.

Getting data into GT

The first step was to get some data into GT. This step is either very easy or the hardest part; if GT contains a reader for the data that you wish to analyze, then it's easy. If there is no specialized reader, then it gets harder.

In my case, I wanted to analyze the T-SQL code generated directly from the database, where each database object, either table or procedure is stored in its own file. GT does not come with a parser for this language (even though it has many others) but comes with a couple of tools to build one. For my case, I decided that a simpler route (due to my lack of experience with the pharo tools) was to preprocess the sql code before bringing it to GT. I used an ANTLR grammar for T-SQL to produce an XML file with the corresponding abstract syntax tree. GT has xpath tools to work with XML, so once in that format, it was easy to load the data and work with it.

Building a basic table view

The XML files that contain a table definition will contain nodes labeled create_table which describe the table names and nodes labeled column_definition which contain the columns with their name and type. The general framework to follow is that we need to create an object to represent each one of these elements. I created a package in GT named DBSystem and created the first two objects: DBTable and DBTableColumn. After loading one of the XML files that described a table, and reading the data inside it, I had the first table and column objects in the model.

The table object has instance variables holding the name of the table, and an array holding the columns objects, each with a name and a type.

The magic of GT comes in when an object has a method that is annotated, in this case with the <gtView> tag. This method is discovered automatically by the tool, and it indicates that the object knows how to create a representation of itself.

The first view that I used was columnedList and it enabled showing the table columns in a familiar manner.

Columns and relationships

In the XML files, primary keys, as well as foreign keys are defined outside of the scope of the column definitions. I added a method to the table objects to collect that information once the table itself had been defined, and to attach that information to the corresponding column nodes.

At this moment, it was possible to read many tables at a time, so I created a DBModel object to hold the collection of tables. From DBModel we can access all tables, and from each table, we can access all columns. Foreign keys have a refererence to the table and column of the primary key they reference. From the DBModel's top level view, it's possible to collect all of these references. I created a DBTableConnection class and stored the table-column pairs for source and reference.

GT has a graph builder called GtMondrian, and a corresponding mondrian <gtView>. Using the array of DBTableConnection, it's possible to visualize the whole database diagram.

(database names have been obscured for this presentation).

Even though this diagram looks nice, it is not very useful as it is very crowded. What's interesting is that if I click on any of the nodes, then it takes me to the corresponding table that is linked in the definition of the graph. This subtle detail is what makes GT extremely powerful for analysis, as will be seen in later steps.

View explosion

So far, this tool can be used to look at the database schema, but it is very basic when compared against other existing tools used to analyze databases; but, here is where things begin to get really interesting.

Table and neighbors

I don't want to see the whole database at the same time, but I do want to see a table along with the other tables that interact with it. Using the collection of DBTableConnection within the database model, I can ask the DBModel to give a list of tables that are linked to a specific one, providing the table neighbors. I link the current DBModel instance inside the table objects, and create a new table view (in addition to the one I had before), where each table can draw itself within the context of its interacting neighbors.

Note that before, each table knew if it was referencing another table via a foreign key, but did not know if other tables were referencing any of its primary keys. This information is now available using the global DBModel.

Here's a view of a table and its neighbors, using again a Mondrian view.

Table categories

Initially, tables didn't have colors, and the diagram looked a little dull. I decided to add color, and for that purpose, I come up with a list of categories and assign each table to one. I subclass DBModel with a DBModelCustomer and use it to hold this DB's specific metadata. I add a list of categories that map to a list of colors, and a list of all the tables with a category assigned. Now the diagrams look more interesting but also, reveal something _unexpected_ in some cases. There are tables that are interacting with other tables in categories that don't make a lot of sense. I have either classified the table in a wrong category or discovered a flaw in the design. Either way, it's a discovery!

I don't have a lot of categories, but I really can't keep them in my head. I decide to build another view with the categories and add that to the DBModel object.

As soon as I see the the list of categories, I wonder, which tables belong to each category? I Create a DBTableGroup object, and fill it with the tables from each category. Then I attach another mondrian view to this table group.

At this point, this tool is now, in my opinion, superior to other existing tools that are tailor made for this purpose. Clicking on a table creates a context-specific diagram of the table. And I can now analyze tables in groups, according to any criteria that I define.

This is very powerful. For example: it is amazing and so easy to see missing links between tables.

Digging into procedures

Table information is only the beginning, there is also procedure information. For my analysis, I need to concentrate on data creation and modification, so that means looking at INSERT, UPDATE and DELETE operations. I follow the same procedure as before, create a DBProcedure object and use it to collect the data definition nodes within.

Any of these statements can be very complex, as they may be nested and may involve multiple columns. But each one, specifies a single table as a target of the operation. So I collect each node, and obtain the target table for each. This is straightforward but it's necessary to do some extra work to obtain the actual table name in cases where an alias is being used.

Now, each procedure has INSERT, UPDATE and DELETE targets. Next? create a new set of views of course.

First, a table view inside the database model, to see all of the procedures and very quickly find out what kinds of operations they perform. No need to read any actual code .

In the table diagrams, I place the number of procedures affecting each one, and add a list view of the actual procedures affecting each table.

This view shows 2 procedures deleting rows from the same table. It turns out that only one of the procedures is used, and the other is an old version. Another discovery!

There and back again

Finally, when inspecting a procedure, I decided to show the tables that it affects, and the operations performed at the base of the table.

This fortunate accident enables us to traverse the whole system. From tables to procedures to tables again. Clicking on the nodes enables following the data, and to get the context for each operation. I haven't come across another tool that would allow me to do anything close. I now have hundreds of custom made diagrams, hundreds of custom made groups, all hyperlinked for navigation and ready for custom searches.

Conclusion

Other tools let you create and use databases with efficiency and simplicity, but I really can't think of any tool, other than GT, that will enable diving into an existing database and that will allow finding answers to your own questions. This is only the tip of the iceberg, with many more possibilities to explore.

feenk blog