
How Messy is My Database?

On MPE, HowMessy is a Robelle tool that prints a report on the internal efficiency of IMAGE/SQL databases. Here are some tips on how to analyze the reports:

Master Datasets (Automatic or Manual)

If Secondaries are over 30 percent and Inefficient Pointers are over 50 percent, the dataset is either too full or hashing poorly. Increase the capacity to a higher number that is not a power of 2, or change the data type and format of the search field. Increasing the Block Factor should also reduce Inefficient Pointers. Watch for clustering problems as well, signalled by a Load Factor below 80 percent, Secondaries below 5 percent, and Max Blocks greater than 100; clustering usually results from a sequential binary key, so change it to type X or U.
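
As a rough illustration, these rules of thumb can be written down as a small check. This is only a sketch: the function and parameter names are invented here, and the values would be copied by hand from the HowMessy report columns.

    def check_master(load_factor, secondaries_pct, inefficient_ptrs_pct,
                     max_blocks, capacity):
        """Apply the rules of thumb above to one master dataset.
        All values come from the HowMessy report; the names are assumed."""
        problems = []
        if secondaries_pct > 30 and inefficient_ptrs_pct > 50:
            problems.append("too full or hashing poorly: raise the capacity to "
                            "a number that is not a power of 2, or change the "
                            "search-field type")
        if load_factor < 80 and secondaries_pct < 5 and max_blocks > 100:
            problems.append("clustering: probably a sequential binary key; "
                            "change it to type X or U")
        if capacity & (capacity - 1) == 0:
            problems.append("capacity is an exact power of 2")
        return problems

    # Example: a master that is 97% full with many secondaries
    print(check_master(load_factor=97, secondaries_pct=42,
                       inefficient_ptrs_pct=63, max_blocks=12, capacity=4096))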

Secondaries are created when a synonym occurs in IMAGE/SQL. Synonyms are two or more key values that hash to the same location in a master dataset. The first entry gets the coveted location and becomes a Primary, while the others look for "secondary" space in adjacent unused locations. Of course, if a new key value later comes along that belongs in a secondary's location, the secondary must be moved (these are the so-called "migrating secondaries").

Normally, synonyms are not a problem. But when you have a "binary" key (types I, J, K and R), strange things can happen because IMAGE/SQL does not attempt to randomize the locations. It simply divides the rightmost 32 bits of the key value by the dataset capacity and uses the remainder as the primary hash location. The weirdest case we have ever seen was a UK user with an R4 key field. R4 is a long floating-point type: 64 bits, with the exponent and the most significant portion of the mantissa in the leftmost bits. For his values, the rightmost four bytes that IMAGE/SQL selected were always zero, so every entry hashed to the same location. The user discovered this fact when he exceeded the limit on maximum synonyms per location!
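
The arithmetic is easy to demonstrate. The sketch below mimics the hashing rule described above for binary keys; an IEEE double stands in for IMAGE's R4 format (the real HP 3000 representation differs, but it also keeps the exponent and high-order mantissa on the left), and the capacity is made up.

    import struct

    def primary_location(key: bytes, capacity: int) -> int:
        """Hashing rule for binary keys as described above: divide the
        rightmost 32 bits of the key by the capacity and use the remainder."""
        return int.from_bytes(key[-4:], "big") % capacity

    capacity = 1009   # an assumed capacity, not a power of 2

    # Whole-number values in a 64-bit float leave the low-order mantissa
    # bits -- the rightmost four bytes -- all zero, so every key collides.
    for value in (1234.0, 56789.0, 250000.0):
        key = struct.pack(">d", value)        # 8-byte key, big-endian
        print(value, key.hex(), primary_location(key, capacity))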

Detail Datasets

Ignore the Load Factor unless dataset overflow is likely, and even that is not too serious with the new "automatic detail expansion." If the dataset has more than one path, check that the Primary Path (!) has a large Average Chain Length and is accessed often. Elongation tells how inefficiently packed the chains are, relative to their optimum packing. An Elongation of 8.0 on the Primary Path means that disc I/Os on that path would be cut by a factor of eight if you reloaded the dataset. Look for an Average Chain of 1.00, a Standard Deviation of about 0.01, and a Max Chain of 2 or 3; this is usually a dataset that should have exactly one entry per key value but actually contains duplicate key values. Look for paths with long chains (Average Chain plus Std Dev greater than 100), especially if the path is Sorted (S). Just remember that every search key is supposed to be there for an important on-line user need.
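
Here, too, the patterns can be expressed as a small, hedged check. The function and field names below are invented, and the thresholds are simply the ones quoted above.

    def check_detail_path(avg_chain, std_dev, max_chain,
                          elongation, is_primary, is_sorted):
        """Flag the suspicious patterns described above for one path
        of a detail dataset; values come from the HowMessy path lines."""
        notes = []
        if is_primary and elongation >= 8.0:
            notes.append("badly packed primary path: a reload or repack would "
                         "cut chained disc I/O by roughly the elongation factor")
        if abs(avg_chain - 1.0) < 0.05 and std_dev <= 0.05 and max_chain in (2, 3):
            notes.append("probably duplicate key values in a path that should "
                         "have one entry per key")
        if avg_chain + std_dev > 100:
            msg = "very long chains"
            if is_sorted:
                msg += " on a sorted path"
            notes.append(msg)
        return notes

    print(check_detail_path(avg_chain=1.00, std_dev=0.01, max_chain=2,
                            elongation=1.2, is_primary=True, is_sorted=False))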

A detail dataset with a blocking factor of 10 may have had so many puts and deletes over time that its free-entry chain points all over the place. When 10 new line items on the same chain are put into this detail, they end up in 10 different blocks. If the detail were repacked, all 10 entries could reside in the same block: instead of 10 disc I/Os to read the chain, it would then take just one. If there is more than one path, remember that a Dbunload followed by a Dbload repacks only the primary path, not all of the paths. Detpack from Adager gives you more options, including selecting which path to pack and how long to work on it.
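
A few lines of arithmetic show the payoff. Elongation is essentially the number of blocks actually touched to read a chain, divided by the fewest blocks the chain could occupy if perfectly packed; the helper below is an illustration of that idea, not HowMessy's exact formula.

    import math

    def elongation(chain_length, blocks_touched, blocking_factor):
        """Blocks actually read to walk one chain, relative to the
        minimum number of blocks the chain needs when packed."""
        optimal = math.ceil(chain_length / blocking_factor)
        return blocks_touched / optimal

    # Ten line items scattered one per block, blocking factor 10:
    print(elongation(10, 10, 10))   # 10.0 -- ten disc I/Os to read the chain
    # The same chain after a repack fits in a single block:
    print(elongation(10, 1, 10))    # 1.0  -- one disc I/O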

If you want to pursue this further, download Robelle's How Messy tutorial as a self-extracting archive.

