Transactional Metadata Query (TMDQ)

cancel
Showing results for 
Search instead for 
Did you mean: 

Transactional Metadata Query (TMDQ)

andy1
Senior Member
5 4 5,243

Introduction

This blog compares query support provided by transactional metadata query (TMDQ) and the Index Engine. The two differ in a number of respects which are described here. This blog is an evolution of material previously presented at the Alfresco Summit in 2013.

TMDQ delivers support for transactional queries that used to be provided by Lucene in Alfresco Content Services (ACS) prior to version 5.0. In ACS 4.1 SOLR was introduced with an eventually consistency query model. In 5.0, Lucene support was removed in favour of TMDQ. As TMDQ replaced Lucene, some restrictions on its use are similar. For example, both post processes results for permissions in a similar way and there are restrictions on the scale of the result sets it can cope with as a result. The Index Engine has no such restrictions. It also seems from use that if a query can be run against the database the scope of the query is such that it will probably have no issue with the number of results returned.

Overview

Some queries can be executed both transactionally against the database or with eventual consistency against the Index Engine. Only queries using the AFTS or CMIS query languages can be executed against the database. The Lucene query language can not be used against the database while selectNodes  (XPATH) on the Java API always goes against the database, walking and fetching nodes as required.

In general, TMDQ does not support: structural queries, full text search, special fields like SITE that are derived from structure and long strings (> 1024 characters). Text fields support exact(ish) and pattern based matching subject to the database collation. Filter queries are rewritten along with the main query to create one large query. Ordering is fine, but again subject to database collation for text.

TMDQ does not support faceting. It does not support any aggregation: this includes counting the total number of matches for the query. FINGERPRINT support is only on the Index Server.

AFTS and CMIS queries are parsed to an abstract form. This is then sent to an execution engine. Today, there are two execution engines: the database and the Index Engine. The default is to try the DB first and fall back to the Index Engine - if the query is not supported against the DB. This is configurable for a search sub-system and per query using the Java API. Requesting consistency should appear in the public API "some time soon".

Migrations from Alfresco Content Service prior to 5.0 will require two optional patches to be applied to support TMDQ. Migrations to 5.0 require one patch: 5.0 to 5.1 a second. New installations will support TMDQ by default. The patches add supporting indexes that make the database ~25% larger.

Public API and TMDQ

From the public API, anything that is not a simple query, a filter query, an option that affects these, or an option that affects what is returned for each node in the results, is not supported by TMDQ. The next two sections consider what each query language supports.

Explicitly, TMDQ supports: 

  • query
  • paging
  • include
  • includeRequest
  • fields
  • sort
  • defaults
  • filterQueries
  • scope (single)
  • limits for permission evaluation

The default limits for permission evaluation will restrict the results returned from TMDQ based on both the number of results processed and time taken. These can be increased if required.

The public API does not support TMDQ for:

  • templates
  • localisation and timezone
  • facetQueries
  • facetFields
  • facetIntervals
  • pivots
  • stats
  • spellcheck
  • highlight
  • ranges facets
  • SOLR date math

Some of these will be ignored and produce transactional results; others will fail and be eventual.

The public API will ignore the SQL select part of a CMIS query and decorate the results as it would do for AFTS.

CMIS QL & TMDQ

For the CMIS query language all expressions except for CONTAINS(), SCORE() and IN_TREE() can now be executed against the database. Most data types are supported except for the CMIS uri and html types. Strings are supported but only if 1024 characters or less in length. In Alfresco Content Services 5.0, OR, decimal and boolean types were not supported; they are from 5.1 on. Primary and secondary types are supported and require inner joins to link them together - they can be somewhat tedious to write and use.

You can skip joins to secondary types from the fetch in CMIS using the public API. You would need an explicit SELECT list and supporting joins from a CMIS client. You still need joins to secondary types for predicates and ordering. As CMIS SQL supports ordering as part of the query language you have to do it there and not via the public API sort.  

Post 5.2, left outer join from primary and secondary types to secondary types will also be supported. This covers queries to find documents that do not have an aspect applied - which is currently best implemented using something like

CONTAINS('-ASPECT:hidden')

today.

For multi-valued properties, the CMIS query language supports ANY semantics from SQL 92. A query against a multi-lingual property like title or description is treated as multi-valued and may match in any language. In the results you will see the best value for your locale - which may not match the query. Ordering will consider any value.

UPPER() and LOWER()

UPPER() and LOWER() functions were in early drafts for the CMIS 1.0 specification and sunsequently dropped. These are not part of the CMIS 1.0 or 1.1 specifications. They are currently supported in the CMIS query language for TMDQ only as ways to address specific database collation issues and case sensitivity. Only equality is supported.  LIKE is not currently supported. For example:

{
   "query": {
       "language": "cmis",
       "query" : "select * from cmis:document where LOWER(cmis:name) = 'project contract.pdf'"
   }
}

Alfresco FTS QL & TMDQ

It is more difficult to write AFTS queries that use TMDQ as the default behaviour is to use full text queries for text: these can not go against the database. Again, special fields like SITE and TAG that are derived from structure will not go to the database. TYPE, ASPECT and the related exact matches are OK. All property data types are fine but strings again have to be less than 1024 characters in length. Text queries have to be prefixed with = to avoid full text search. PARENT is supported. OR is supported in 5.1 and later.

Ranges are not currently supported - there is no good reason for this - it needs to generate a composite constraint which we have not done. PATH is not supported nor is ANCESTOR.

Subtle differences

  1. The database has fixed collation as defined by the database schema. SOLR can use any collation. The two engines can produce different results for lexical comparison, case sensitivity, ordering, when using mltext properties, etc

  2. The database results include hidden nodes. You can exclude them in the query. The Index Engine will never include hidden nodes and respects the index control aspect.

  3. The database post filters the results to apply permissions. TMDQ is not intend to scale to more than 10s of thousands of nodes. It will not perform well for users who can read 1 node in a million. It can not and will not tell you how many results matched the query. To do this could require an inordinate number of permission checks. It does enough to give you the page requested and if there is more. The Index Engine can apply permissions at query and facet time to billions of nodes.
    For the same reason, do not expect any aggregation support in TMDQ: there is currently no plan to push access restriction into the database at query time.

  4. CONTAINS() support is actually more complicated. The pure CMIS part of the query and CONTAINS() part are melded together into a single abstract query representation. If the overall query, both parts, can go against the database that is fine. You have to follow the rules for AFTS & TMDQ. By default, in CMIS the CONTAINS() expression implies full text search so queries will go to the Index Server.

  5. The database does not score. It will return results in some order that depends on the query plan - unless you ask for specific ordering. For a three part OR query where some docs match more then one constraint they are treated equal. In the Index Engine - the more parts of an OR match the higher the score. The docs that match more optional parts of the query will come higher up.

  6. Queries from share will not use TMDQ as they will most likely have a full text part to the query and ask for facets.

Summary

Transactional Metadata Query and the Index Engine are intended to support different use case. They differ in queries and options that they support and subtly in the results with respect to collation and scoring. We default to trying transactional support first for historical reasons and it seems to be what most people prefer if they can have it.

4 Comments
janv
Alfresco Employee

Andy,

Thanks for the detailed write-up.

For those that are interested in even more thoughts on using TMDQ then please see also Axel's recent BeeCon 2017 presentation:

[BeeCon 2017] - Axel Faust - Making proper use of transactional metadata queries - YouTube 

http://beecon.buzz/2017/assets/files/E10/Making%20proper%20use%20of%20transactional%20metadata%20que... 

Regards,

Jan

andy1
Senior Member

Hi Jan

Thanks for that, I missed BeeCon and have not had time to catch up.Looks like some interesting stuff and ideas  from Axel Faust. I may clarify some stuff around multi-valued properties. The main thing is that we only support SQL 92 ANY semantics. There is also no technical reason why content.size etc should not be supported from AFS - it's in the CMIS QL  - I will check. (I may be inferring the wrong thing from the slides)

The indexes to support TMDQ are not applied by default so that large customers did not have to wait for a big patch if they did not want the functionality. They could also defer it and take two steps to get there. New installs have the appropriate indexes from the start.

Three have been a couple of bugs fixes around null recently. Using null and exists resolves most use case. The current null behaviour is what is expected when left outer join to aspects comes along on "5.2 next" - ie absence == null.

The use case Axel Faust‌ gives are spot on. I agree with this too

stop using “lucene” / lucene-related API

Cheers

Andy

afaust
Master

When I prepared the slides for BeeCon I did test queries using the constructs I listed as "not supported" to make sure I did not report anything I could not back up at least with a reference to my local 5.2 installation (I believe 201702 GA). A query for either

=cm:content.size:1024

=cm:content.mimetype:"application/pdf"

fails when executed via the NodeBrowser as a TMQ. So do the full-qualified variants

={http://www.alfresco.org/model/content/1.0}content.size:1024

={http://www.alfresco.org/model/content/1.0}content.mimetype:"application/pdf"

(in case anyone is wondering, no amount of trying to encode special characters helps)

During query parsing and query model handling, the field is recognized as cm:content (without the meta fields) and cm:content is of course not supported ([...]QueryModelException: [...] Unsupported property type {http://www.alfresco.org/model/dictionary/1.0}content)

Even if content size were supported, in AFTS it is impractical to use since you cannot do a range-query with TMQ whereas CMIS QL can use greater-than, less-than etc.

One thing I apparently did not check closely enough was the issue with TMQ indices supposedly not being added by default in a new install. I had not noticed that the bootstrap scripts and runtime check had changed sometime in the 5.x line.

andy1
Senior Member

Thanks Axel

Andy