Repository database schema validation and comparison
Background and motivation
The schemacomp tool may be used when troubleshooting or examining the database schema for an Alfresco repository. The tool has two main functions:
Producing schema dumps as XML files.
Validating a database schema.
Schema dumps are not new to the Alfresco repository - prior to the introduction of this tool however, the only way to judge the validity of the schema was to examine the file manually and with simple text tools such as the Unix diff command. This tool performs a certain amount of automatic comparison that should remove much of the effort needed in making these comparisons.
If any changes are made to the database schema during server start-up (such as when installing Alfresco afresh) then the schemacomp tool performs both schema dumping and validation as described below. The dumps and validation are made both pre-upgrade (i.e. before the schema changes) and post-upgrade.
Definition of terms
The terms below are used throughout the rest of this document.
A schema, sequence, table, column, index, primary key or foreign key.
The definitive representation of an Alfresco repository schema for a given schema version on a vendor specific RDBMS. The reference schema is a model for what should be present in the database after installing or upgrading an Alfresco repository to particular version. A reference schema is presented in the same XML format as a schema dump. For example a schema reference may be produced for MySQL on version 5025 of the Alfresco repository schema.
The database schema that will be compared and validated with respect to a reference schema. For example, if installing an Alfresco repository from scratch, then the newly created schema will be a target schema for comparison against the appropriate reference schema.
Performing schema dumps
Schema dumps are XML representations of the RDBMS schema. They should conform to the XSD:
As mentioned above, a schema dump is performed automatically during repository server startup if there were changes made to the database schema. The Alfresco log will indicate if any dumps were performed - entries such as these will be present:
2012-01-30 17:46:58,517 INFO [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-alf_-5548956643327704619.xml. 2012-01-30 17:46:58,518 INFO [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-avm_-2166257481854030130.xml. 2012-01-30 17:46:58,518 INFO [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-jbpm_-2230905975269998715.xml. 2012-01-30 17:46:58,519 INFO [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-act_-8103448407472298481.xml.
Similar entries for the post-upgrade files will also be present.
Note: that the legacy tool is still included and will create dumps of its own - the log messages look similar but shouldn't be confused with the new-format dumps.
Triggering dumps by JMX
In addition to automatic dumping, dumps can be manually invoked by use of the JMX interface.
Please note: this is an enterprise only feature.
The JMX category Alfresco > DatabaseInformation > SchemaExport contains two operations:
java.util.List dumpSchemaToXML(String prefixList)
The first operation takes no parameters and when invoked will create four dump files one for each prefix 'alf_', 'act_', 'jbpm_' and 'avm_'. The prefix means that only tables and sequences whose names begin with the prefix will be included in the dump. Related items - such as the indexes belonging to a particular table - will be dumped regardless of name.
The second variation takes a single String parameter and is a comma-separated list of prefixes that you wish to dump. If this operation were invoked with the parameter 'alf_acl_, alf_node_' for example, then two files would be created (one for each prefix). The tables dumped in the first file would include alf_acl_change_set and alf_acl_member, tables in the second file would include alf_node_aspects and alf_node_assoc. Neither file would include alf_locale or alf_permission since they do not carry one of the supplied prefixes.
Both of these calls will result in the log showing the location of the dumped files, but they also return a List of path names. JConsole will helpfully display these lists in a copy/paste friendly manner.
Performing schema validation
As for schema dumps, schema validation can happen either due to a schema change during repository start -up, or triggered manually via JMX.
Schema validation is performed in two steps differencing and validation.
Step 1. Differencing
Differencing produces similar information to that obtained by using the Unix tool diff against a known 'good' reference schema dump and a potentially problematic target schema dump. However, since the tool is designed for performing a comparison between two database schemas, rather than arbitrary text, the output is more specific about the types of difference. The types of difference that can be reported are:
A database object appears in both the reference and target schemas, but has differences in its properties. For example if an index appears in both schemas but has a different name.
A database object appears in the reference schema but no corresponding object has been identified in the target database.
A database object appears in the target schema but no corresponding object has been identified in the reference database.
One advantage of schemacomp differencing over traditional diff tool comparisons is that an index is not recognised by the exact text appearing in a dump, but: by which table the index belongs to; which columns are indexed and in what order. If an index has the expected name and belongs to the correct table but has the wrong columns, or the correct columns in the wrong order - then differences will be reported. Or conversely, if the correct table has an index with the correct columns in the correct order but has the wrong index name - then this will be reported. The name can be ignored during comparisons (useful for auto-generated index names) or can be taken into account - part of the task of producing reference schema files is to specify this behaviour using DbValidator objects (explained later).
Index related example
Supposing we have the following index defined in the reference schema (which for schema version 5025, we do!)
permission_id, authority_id, allowed, applies
This index is specified in the schema reference file in this way (parts ommitted for brevity):
When the target schema's index is compared against this reference then firstly a list of candidate matches are produced. There may be more than one matching index in the target schema, in which case a redundant database object warning is issued.
Candidate matches are produced dependent on object type. For indexes:
If the parent table is the same and the index name is the same - then it is considered the same index.
If the name is different but the parent table is the same and the columns indexed are the same (and in the same order) - then it is is considered to be the same index.
Taking the first scenario for matching and using the permission_id index defined above, then if the permission_id index in the target database has the allowed and applies columns in the reverse order than is expected, the log file would notify us of validation problems:
2012-01-31 11:24:24,280 WARN [domain.schema.SchemaBootstrap] [RMI TCP Connection(11)-10.244.50.71] Schema validation found 2 potential problems, results written to: /tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-5903917616348258838.txt
The contents of the report file would look something like this:
Difference: expected index .alf_access_control_entry.permission_id.columnNames='allowed', but was .alf_access_control_entry.permission_id.columnNames='applies' Difference: expected index .alf_access_control_entry.permission_id.columnNames='applies', but was .alf_access_control_entry.permission_id.columnNames='allowed'
Each line shows a problem with a particular database property. Here it tells us that the property at the path .alf_access_control_entry.permission_id.columnNames has the value applies but according to the reference schema should be allowed. The leading dot of the path can be ignored (the schema name would be present before the leading dot in the case of Oracle for example), then we have the table name alf_access_control_entry, the index name permission_id within that and a zero-indexed list property within that - the third item (index 2) is the property at fault: columnNames.
Similarly, the next line tells us that the next item in the column name list (columnNames) has the value allowed but was expected to be 'applies'.
Step 2. Validation
Validation allows the application of more complex rules than whether there is a difference between two property values. Validation is performed by DbValidator objects. A chain of DbValidator objects is associated with each database object in the reference schema - each of these is executed in turn and given the chance to create validation errors based on the corresponding object in the target schema.
If an index has not been given a specific name then the RDBMS will auto-generate one at creation time. This means that the reference schema cannot specify the exact name that the index in the target database will have. This would lead to schema differences being reported if it were not for the use of validators. A NameValidator<code> may be specified for such an index:
This example is from a DB2 schema reference file (<code>Schema-Reference-ALF.xml) and tells us that although in the original reference schema the index was named SQL120116153558430 any index having the appropriate parent table, column names (and column order) is valid as long as the name matches the regular expressionSQL[0-9]+.
When the validator is invoked, it checks that the name property of the index matches the supplied regular expression. In addition to this, the validator reports - when asked - that it takes responsibility for the name property of the index. This stops the schemacomp tool from applying the differencing logic to the property. A DbValidator can choose to apply its validation in addition to the differencing logic by not taking sole responsibility for any properties. Conversely a validator can also take sole responsibility for an entire database object in which case no differencing logic is applied to any part of the object.
A similar problem to the auto-generated name problem is when a database object is created automatically. DB2 creates indexes on the fly under certain circumstances. It is not known whether these indexes will exist at the time the schemacomp tool will be run. Furthermore, the indexes are an implementation detail for DB2 rather than an explicit declaration on how the Alfresco schema should appear. To suppress such errors an IgnoreObjectValidator may be used - it takes responsibility for validation of the associated database object, but performs no actual validation.
Another index related example
Supposing an index is expected to be auto-generated and is defined in the schema reference file as:
Perhaps a specific unsupported upgrade path has introduced an unexpected schema change - it may not be a problem as such, but it is important that differences are highlighted so that a decision can be made on whether the difference represents a problem and whether a fix will need to be made. On running the schemacomp tool, the following might be observed in the log files:
2012-01-31 14:28:50,697 WARN [domain.schema.SchemaBootstrap] [main] Schema validation found 1 potential problems, results written to: /tomcat/temp/Alfresco/Alfresco-DB2Dialect-Validation-Post-Upgrade-alf_-4048062354335481885.txt 2012-01-31 14:28:51,440 INFO [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-AVM.xml] 2012-01-31 14:28:53,326 INFO [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-JBPM.xml] 2012-01-31 14:28:54,682 INFO [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-ACT.xml]
We can see that the AVM, JBPM and ACT database objects are all as expected, but there is a difference between the target schema and the ALF (alf_ prefixed database objects) schema reference. Looking at that file we can see that an index that is expected to have been auto-generated has been created with an explicit name:
Validation: index ALFUSER.ALF_ACCESS_CONTROL_ENTRY.SQL120131142718040.name='idx_alf_ace_auth' fails to match rule: name must match pattern 'SQL[0-9]+'
Specifically, the error report is stating that the index defined in the schema reference having the name SQL120131142718040 belonging to the table ALF_ACCESS_CONTROL_ENTRY is expected to be named in the same way: prefixed with SQL then a string of one or more digits.
Triggering validation by JMX
In addition to automatic validation, validation can be manually invoked by use of the JMX interface.
Please note: this is an enterprise only feature.
The JMX category Alfresco > DatabaseInformation > SchemaValidator contains one operation:
The operation takes no parameters and returns nothing, however if the operation is invoked then validation will be performed and the Alfresco log will show the results:
2012-01-31 14:51:46,770 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ALF.xml] 2012-01-31 14:51:47,360 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-AVM.xml] 2012-01-31 14:51:49,847 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-JBPM.xml] 2012-01-31 14:51:50,910 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ACT.xml]
In the example above there were no problems found in the target schema. Troubleshooting