The Alfresco Content Tracker

a_gazzarini · ‎10 Apr 2019

I was recently involved in an investigation of the Alfresco Tracker Subsystem, with a specific focus on the Content Tracker. This post is the output of that analysis, which can also be found in the SearchServices repository, under the documentation folder.

The ContentTracker is part of the Alfresco Tracker Subsystem, which is composed of the following members:

ModelTracker: for listening on model changes
ContentTracker: described in this post
MetadataTracker: for tracking changes in metadata nodes
AclTracker: for listening on ACL changes
CascadeTracker: which manages cascade updates (i.e. updates related to the node hierarchy)
CommitTracker: which provides commit and rollback capabilities

Each Solr instance in your search infrastructure (regardless of whether it holds a monolithic core or a shard) has a singleton instance of each tracker type, which is registered, configured and scheduled at SolrCore startup. The SolrCore holds a TrackerRegistry for maintaining a list of all "active" tracker instances.

The ContentTracker queries for documents with "unclean" content (i.e. data whose content has been modified in Alfresco), and then updates them. Periodically, at a configurable frequency, the ContentTracker checks for transactions containing nodes that have been marked as "Dirty" (changed) or "New". Then,

it retrieves the cached version of that data from the ContentStore
it retrieves the corresponding (text) content from Alfresco
it updates the ContentStore
it re-indexes the data in the hosting Solr instance

Later, the CommitTracker will persist those changes.

The Tracking Subsystem

The class diagram below provides a high-level overview of the main classes involved in the Tracker Subsystem.

As you can see from the diagram, there's an abstract interface definition (Tracker) which declares what is expected by a Tracker and a Layer Supertype (AbstractTracker) which adopts a TemplateMethod [1] approach. It provides the common behavior and features inherited by all trackers, mainly in terms of:

Configuration
State definition (e.g. isSlave, isMaster, isInRollbackMode, isShutdown)
Constraints (e.g. there must be only one running instance of a given tracker type in a given moment)
Locking: the two Semaphore instances depicted in the diagram are used for a) implementing the constraint described in the previous point and b) providing an inter-tracker synchronisation mechanism.

The Tracker behavior is defined in the track() method that each tracker must implement. As said above, the AbstractTracker enforces a common behaviour on all trackers by declaring a final version of that method; it then delegates the specific logic to the concrete trackers (subclasses) by requiring them to implement the doTrack() method.
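The final track() / abstract doTrack() split can be illustrated with a minimal sketch. The class names below are hypothetical, simplified stand-ins rather than the real SearchServices classes, and a single Semaphore is used here where the diagram shows two; skipping a cycle when the lock is busy is also an assumption made for the example.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-ins: NOT the real SearchServices classes.
interface SketchTracker {
    void track();
}

abstract class SketchAbstractTracker implements SketchTracker {
    // One permit: at most one run of this tracker at any given moment.
    private final Semaphore runLock = new Semaphore(1);

    // Template method: final, so subclasses cannot bypass the common logic.
    @Override
    public final void track() {
        if (!runLock.tryAcquire()) {
            return; // assumption: a cycle is skipped while a previous one is running
        }
        try {
            doTrack();
        } finally {
            runLock.release();
        }
    }

    // Hook that each concrete tracker must implement.
    protected abstract void doTrack();
}

final class SketchContentTracker extends SketchAbstractTracker {
    final AtomicInteger cycles = new AtomicInteger();

    @Override
    protected void doTrack() {
        cycles.incrementAndGet(); // placeholder for the real tracking work
    }
}
```

A scheduler only ever calls track(); the locking and the state handling stay in one place, and each subclass supplies just its own doTrack() logic.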

Each tracker is a stateful object which is initialized, registered in a TrackerRegistry and scheduled at startup in the SolrCoreLoadRegistration class. The other relevant classes depicted in the diagram are:

SolrCore: the dashed dependency relationship means that a Tracker doesn't hold a stable reference to the SolrCore: it obtains that reference each time it's needed.
ThreadHandler: The ThreadExecutionPool manager which holds a pool of threads needed for scheduling asynchronous tasks (i.e. unclean content reindexing)
TrackerState: since it is a single instance shared across all trackers, a name like TrackersState or TrackerSubsystemState would have been more accurate. It is used for holding the trackers' state (e.g. lastTxIdOnServer, trackerCycles, lastStartTime)
TrackerStats: maintains global statistics about all trackers. Like TrackerState, it is a shared instance, so the name is slightly misleading because it relates to all trackers
SOLRAPIClient: this is the HTTP proxy / facade towards Alfresco REST API: in the sequence diagrams these interactions are depicted in green
SolrInformationServer: The Solr binding for the InformationServer interface, which defines the abstract contract of the underlying search infrastructure

Startup and Shutdown

The Trackers startup and registration flow is depicted in the following sequence diagram:

Solr provides, through the SolrEventListener interface, a notification mechanism for registering custom plugins during the SolrCore lifecycle. The Tracker Subsystem is initialized, configured and scheduled in the SolrCoreLoadListener, which delegates the concrete work to SolrCoreLoadRegistration. Here, a new instance of each tracker is created, configured, registered and then scheduled by means of a Quartz Scheduler. Trackers can share a common frequency (as defined in the alfresco.cron property) or they can have a specific configuration (e.g. alfresco.content.tracker.cron).
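The shared-or-specific frequency lookup can be sketched as a simple property fallback. This is only an illustration: the TrackerCronSketch class and the cronFor helper are hypothetical, the generalisation of alfresco.content.tracker.cron to an alfresco.<name>.tracker.cron pattern is an assumption, and the cron expressions in the usage note are example values.

```java
import java.util.Properties;

// Hypothetical helper: NOT part of SearchServices.
final class TrackerCronSketch {

    // Returns the tracker-specific cron expression when one is configured,
    // otherwise the frequency shared by all trackers (alfresco.cron).
    static String cronFor(Properties props, String trackerName) {
        // Assumed naming pattern, modelled on alfresco.content.tracker.cron.
        String specific = props.getProperty("alfresco." + trackerName + ".tracker.cron");
        return specific != null ? specific : props.getProperty("alfresco.cron");
    }
}
```

With alfresco.cron set to "0/15 * * * * ? *" and alfresco.content.tracker.cron set to "0/30 * * * * ? *", cronFor(props, "content") resolves to the specific value, while cronFor(props, "metadata") falls back to the shared one.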

The SolrCoreLoadRegistration also registers a shutdown hook which makes sure all registered trackers follow the lifecycle of the hosting SolrCore.

Content Tracking

The sequence diagram below details what happens in a single tracking task executed by the ContentTracker:

At a given frequency (which, again, can be the same for each tracker or overridden per tracker type) the Quartz Scheduler invokes the doTrack() method of the ContentTracker. Prior to that, the logic in the AbstractTracker is executed following the TemplateMethod [1] described above; specifically, the "Running" lock is acquired and the tracker is put in a "Running" state.

Then the ContentTracker does the following:

it gets the documents with "unclean" content
if that list is not empty, each document is scheduled (asynchronously) to be updated, both in the content store and in the index

In order to do that, the ContentTracker never directly uses the proxy towards ACS (i.e. the SOLRAPIClient instance); instead, it delegates that logic to the SolrInformationServer class. The first step (getDocsWithUncleanContent) searches the local index for all transactions associated with documents that have been marked as "Dirty" or "New". This information is recorded in the FTSSTATUS field, which can have one of the following values:

Dirty: content has been updated / changed
New: content is new
Clean: content is up to date, there's no need to refresh it

The "Dirty" documents are returned as triples containing the tenant, the ACL identifier and the DB identifier.

NOTE: this first phase uses only the local Solr index, no remote call is involved.

If the list of Tenant/ACLID/DBID triples is not empty, that means we need to fetch and update the text content of the corresponding documents. In order to do that, each document is wrapped in a Runnable object and submitted to a thread pool executor. That makes each document content processing asynchronous.
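The fan-out described above can be sketched with a plain ExecutorService. The DocRef and ContentIndexingSketch names are hypothetical, and the per-document work is a placeholder; unlike this self-contained example, a long-lived tracker would keep its thread pool alive across cycles instead of shutting it down each time.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical value type for a Tenant/ACLID/DBID triple.
final class DocRef {
    final String tenant;
    final long aclId;
    final long dbId;

    DocRef(String tenant, long aclId, long dbId) {
        this.tenant = tenant;
        this.aclId = aclId;
        this.dbId = dbId;
    }
}

// Hypothetical sketch: NOT the real ContentTracker code.
final class ContentIndexingSketch {

    // Wraps each document in a Runnable, submits it to a pool and
    // returns the DBIDs that were processed.
    static Set<Long> reindexAll(List<DocRef> docs, int threads) {
        Set<Long> processed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (DocRef doc : docs) {
            // Placeholder for the real content update of a single document.
            Runnable worker = () -> processed.add(doc.dbId);
            pool.execute(worker);
        }
        pool.shutdown(); // only for this example: no more tasks are accepted
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

Because every document is an independent task, a slow text transformation for one node does not block the others.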

The ContentIndexWorkerRunnable, once executed, delegates the actual update to the SolrInformationServer which, as said above, contains the logic needed for dealing with the underlying Solr infrastructure; specifically:

the document that needs to be refreshed, uniquely identified by the tenant and the DB identifier, is retrieved from the local content store. If the cached document cannot be found in the content store, the /api/solr/metadata remote API is called in order to rebuild the document (metadata only) from scratch.
the /api/solr/textContent API is called in order to fetch the text content associated with the node, plus the transformation metadata (e.g. status, exception, elapsed time)
if the alfresco.fingerprint configuration property is set to true and the retrieved text is not empty, the fingerprint is computed and stored in the MINHASH field of the document
the content fields are set
the document is marked as clean (i.e. FTSSTATUS = "Clean") since its content is now up to date
the cached version is overwritten in the content store with the up to date definition
the document (which is a SolrInputDocument instance) is indexed in Solr
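The steps above can be sketched with in-memory maps standing in for the content store and the Solr index. Everything here is hypothetical: the ContentUpdateSketch class, the "content" field name and the two fetch functions (which stand in for the remote metadata and text content calls) are assumptions made for illustration, and the fingerprint step is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical sketch: NOT the real SolrInformationServer code.
final class ContentUpdateSketch {
    final Map<Long, Map<String, Object>> contentStore = new HashMap<>();
    final Map<Long, Map<String, Object>> index = new HashMap<>();

    void updateContent(long dbId,
                       LongFunction<Map<String, Object>> fetchMetadata,
                       LongFunction<String> fetchText) {
        // 1. Use the cached document, or rebuild it from metadata if missing.
        Map<String, Object> cached = contentStore.get(dbId);
        Map<String, Object> doc =
                new HashMap<>(cached != null ? cached : fetchMetadata.apply(dbId));

        // 2. Fetch the text content from the repository.
        String text = fetchText.apply(dbId);

        // 3. Set the content field ("content" is an assumed field name).
        doc.put("content", text);

        // 4. Mark the document as clean: its content is now up to date.
        doc.put("FTSSTATUS", "Clean");

        // 5. Overwrite the cached version, then index the document.
        contentStore.put(dbId, doc);
        index.put(dbId, new HashMap<>(doc));
    }
}
```

Passing the two remote calls in as functions keeps the sketch free of any HTTP code and makes the order of the steps easy to follow.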

Rollback

The Rollback sequence diagram illustrates how the rollback process works:

The commit/rollback process is a responsibility of the CommitTracker, so the ContentTracker is involved in these processes only indirectly.

When it is executed, the CommitTracker acquires the execution locks from the MetadataTracker and the AclTracker. Then it checks if one of them is in a rollback state. As we can imagine, that check will return true if some unhandled exception has occurred during indexing.

If one of the two trackers above reports an active rollback state, the CommitTracker lists all trackers, invalidates their state and issues a rollback command to Solr. That means any update sent to Solr by any tracker will be reverted.
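The decision described above can be sketched as follows. The RollbackSketchTracker and CommitSketch classes are hypothetical simplifications: each tracker carries one Semaphore as its execution lock, and the method returns the command that would be sent to Solr instead of actually sending it.

```java
import java.util.List;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified tracker: NOT the real SearchServices class.
final class RollbackSketchTracker {
    final String name;
    final Semaphore runLock = new Semaphore(1);
    volatile boolean rollback;          // set when indexing hit an unhandled exception
    volatile boolean stateValid = true; // cleared when the tracker state is invalidated

    RollbackSketchTracker(String name) {
        this.name = name;
    }
}

final class CommitSketch {

    // Returns "rollback" or "commit": the command that would be issued to Solr.
    static String commitOrRollback(RollbackSketchTracker metadata,
                                   RollbackSketchTracker acl,
                                   List<RollbackSketchTracker> allTrackers) {
        // Acquire the execution locks of the MetadataTracker and the AclTracker.
        metadata.runLock.acquireUninterruptibly();
        acl.runLock.acquireUninterruptibly();
        try {
            if (metadata.rollback || acl.rollback) {
                // Invalidate the state of every tracker before rolling back.
                for (RollbackSketchTracker tracker : allTrackers) {
                    tracker.stateValid = false;
                }
                return "rollback";
            }
            return "commit";
        } finally {
            acl.runLock.release();
            metadata.runLock.release();
        }
    }
}
```

Note that the ContentTracker never decides anything here: its state is simply invalidated together with all the others when one of the two checked trackers reports a rollback.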

How does the ContentTracker work in shard mode?

The only source that the ContentTracker checks in order to determine the "unclean" content that needs to be updated is the local index. As a consequence, the ContentTracker behavior is the same regardless of the shape of the search infrastructure and of the context where the hosting Solr instance lives. That is, if we are running a standalone Solr instance, there will be one ContentTracker for each core, watching the corresponding (monolithic) index. If instead we are in a sharded scenario, each shard will have a ContentTracker instance that uses the local shard index.

How does the ContentTracker work in Master/Slave mode?

In order to work properly in a Master/Slave infrastructure, the Tracker Subsystem (not only the ContentTracker) needs to be

enabled on Master(s)
disabled on Slaves

The only exceptions to that rule are:

The MetadataTracker: only if the search infrastructure uses dynamic sharding [2], the MetadataTracker is in charge of registering the Solr instance (the shard) with Alfresco so that it will be included in subsequent queries. The tracker itself, in this scenario, won't track anything.
The ModelTracker: each Solr instance pulls, by means of this tracker, the custom models from Alfresco, so it must be enabled in any case.

The document in the SearchServices repository includes an additional section listing the configuration attributes related to the Tracker Subsystem. I didn't put that long table in this post because it doesn't add any information: if you need to configure the trackers, just have a look at the end of that document.

What's next?

The Tracker Subsystem is one of the main areas where the Search Team is devoting analysis and investigation efforts: that will make room for introducing further improvements in the architecture.

-------

[1] https://en.wikipedia.org/wiki/Template_method_pattern

[2] http://docs.alfresco.com/5.1/concepts/solr-shard-config.html

aowian · ‎11 Apr 2019

Thanks for the article, Andrea Gazzarini. You did an excellent job displaying your in-depth knowledge!

Here is a related article: https://ahmedowian.wordpress.com/2016/04/05/alfresco-solr-trackers-showcase/

a_gazzarini · ‎11 Apr 2019

Hi Ahmed Owian many thanks! Just reading your blog: very interesting!

ranjeetsi · ‎12 Jul 2019

Thanks for the article, Andrea! It contains in-depth information!

The documentation folder link https://github.com/Alfresco/SearchServices/tree/master/alfresco-search/doc/architecture/trackers is not accessible.

angelborroy · ‎15 Jul 2019

The code was restructured.

Working link is now available at SearchServices/search-services/alfresco-search/doc/architecture/trackers at master · Alfresco/Search...

lcolorado · ‎18 Oct 2022

Hi!

We have a large repo that has finished indexing the metadata, but thread dumps show that cascade tracking is still running. Searches using the PATH keyword were very fast on Solr 4, but on ASS 2.0.2, they can be (sometimes) very slow. We think that cascade tracking could be slowing down the searches.

Is there a way to check the progress of cascade indexing?

Thank you,

Luis

angelborroy · ‎19 Oct 2022

When a node requires PATH re-indexing, a new property is added or modified in the DB with the number of the transaction that caused this change (sys:cascadeTx).

The CascadeTracker, on the Solr side, gets a chunk of transaction documents with status CASCADE_FLAG=1 (int@s_@cascade) and finds the nodes whose "sys:cascadeTx" equals that transaction. After that, PATH properties are recalculated for the node and for each of its children, requesting the information from the repository via the REST API. Once the work has been done, the transaction is set back to CASCADE_FLAG=0 so that the CascadeTracker does not process it again.

You may check the progress of CascadeTracker by finding all the pending transactions to be processed (CASCADE_FLAG=1) with a query similar to this one:

http://localhost:8983/solr/alfresco/select?fl=*,[cached]&indent=on&q={!term%20f=int@s_@cascade}1&sor...

This is the number of transactions still to be processed by the CascadeTracker.

When the CascadeTracker work is done, the number of results for this query should be 0.

heiko_robert · ‎7 Dec 2022

@lcolorado we observe the same as soon as secondary child assocs on folders/containers are involved in any way. I just had a case this week where the index of a repo was even bigger than the repo itself because several folders in other folders were "linked" as secondary children.
Alfresco might have a similar problem with its workflow folders. I have now created a ticket on GitHub since this issue breaks a lot of good use cases (e.g. case management). See https://github.com/Alfresco/SearchServices/issues/415
