How to track the progress of the indexing process in Alfresco SOLR

cancel
Showing results for 
Search instead for 
Did you mean: 

How to track the progress of the indexing process in Alfresco SOLR

angelborroy
Alfresco Employee
4 3 7,235

When dealing with large deployments, SOLR tracking process can take some time to index all the nodes in the repository.

Tracking the progress of the indexation becomes a key feature in these scenarios.

Below two different approaches to get this information are described.

Simple approach based in Transaction Count

Alfresco SOLR is indexing the repository from the initial transaction to the latest one. There are several REST API methods to get the latest indexed transaction in SOLR, but using the Summary Action should be enough.

As noted by my colleague Mark Tunmer, using Summary Action is better in terms of performance than using Report Action.

http://127.0.0.1:8983/solr/admin/cores?action=summary&core=alfresco&wt=json

{
    "responseHeader": {
... }, "Summary": { "alfresco": { ... "Alfresco Nodes in Index": 2825, ... "Id for last TX on server": 4096,
"Id for last TX in index": 3984,
... } } } }

From this sample data, latest indexed transaction is 3984 while the Repository is containing 4096 transactions..

In order to get the number of transactions pending to be indexed, following SQL Query can be run in the database.

select count(1) from alf_transaction where id > 3984;
68

Additionally, the number of transactions indexed can be obtained with this SQL sentence.

select count(1) from alf_transaction where commit_time_ms <= 3984;
3652

This metric can provide a simple indicator on the progress of the indexing process, but it's not that accurate as one transaction can contain 1 document or 1 million documents.

There is another relevant number in this report, the "Alfresco Nodes in Index". This value (2,825 in the sample) contains the number of Alfresco nodes indexed by SOLR.

Our first reaction could be to perform this SQL Query in the database.

select count(1) from alf_node;
4871

However, this number (4,871) is very different from the one observed in the report (2,825).

Additionally, if we look at the SOLR Web Console, a "numDocs" with a different number is provided.

http://127.0.0.1:8983/solr/alfresco/admin/luke?_=1598597864787&numTerms=0&show=index&wt=json

{
  "responseHeader": {
    ...
  },
  "index": {
    "numDocs": 5913,
    ...
  }
}

The detail of the SOLR Documents included in this 5,913 count is detailed in Deconstructing SOLR Indexes, but let's focus now on that difference from the count in DB and the "Index node count" value.

Detailed approach based in Node Count

SOLR is not indexing every Node in the database, as there are non-indexable types and aspects for an Alfresco Content Model. Indexed types in SOLR can be obtained running the following REST API call.

http://127.0.0.1:8983/solr/alfresco/select?q=*&facet.field=TYPE&facet=on&wt=json

{
  "responseHeader": {
    ...
  },
  "index": {
    "numDocs": 5913,
    "maxDoc": 5913,
    ...
  },
  "fields": {
    "TYPE": {
      "docs": 2825,
      "topTerms": [
        "{http://www.alfresco.org/model/content/1.0}content", 1170,
        "{http://www.alfresco.org/model/content/1.0}failedThumbnail", 1000,
        "{http://www.alfresco.org/model/content/1.0}category", 335,
        "{http://www.alfresco.org/model/action/1.0}actionparameter", 129,
        "{http://www.alfresco.org/model/content/1.0}folder", 77,
        "{http://www.alfresco.org/model/content/1.0}thumbnail", 35,
        "{http://www.alfresco.org/model/action/1.0}action", 10,
        "{http://www.alfresco.org/model/content/1.0}authorityContainer", 10,
        "{http://www.alfresco.org/model/system/1.0}container", 8,
        "{http://www.alfresco.org/model/forum/1.0}post", 7,
        "{http://www.alfresco.org/model/datalist/1.0}todoList", 5,
        "{http://www.alfresco.org/model/content/1.0}person", 4,
        "{http://www.alfresco.org/model/content/1.0}systemfolder", 4,
        "{http://www.alfresco.org/model/datalist/1.0}issue", 4,
        "{http://www.alfresco.org/model/action/1.0}actioncondition", 3,
        "{http://www.alfresco.org/model/rule/1.0}rule", 3,
        "{http://www.alfresco.org/model/forum/1.0}topic", 3,
        "{http://www.alfresco.org/model/action/1.0}compositeaction", 3,
        "{http://www.alfresco.org/model/content/1.0}zone", 3,
        "{http://www.alfresco.org/model/linksmodel/1.0}link", 2,
        "{http://www.alfresco.org/model/datalist/1.0}dataList", 2,
        "{http://www.alfresco.org/model/transfer/1.0}transferGroup", 1,
        "{http://www.alfresco.org/model/system/1.0}store_root", 1,
        "{http://www.alfresco.org/model/site/1.0}sites", 1,
        "{http://www.alfresco.org/model/site/1.0}site", 1,
        "{http://www.alfresco.org/model/forum/1.0}forum", 1,
        "{http://www.alfresco.org/model/content/smartfolder/1.0}smartFolderTemplate", 1,
        "{http://www.alfresco.org/model/content/1.0}mlRoot", 1,
        "{http://www.alfresco.org/model/content/1.0}category_root", 1
      ],
      "histogram": [
        ...
      ]
    }
  },
  "info": {
    ...
    },
  }

From these results, we have a list of every type indexed with the number of documents indexed for that type.

As we are working with the alfresco SOLR core, we need to limit the results from database to this store.

SELECT id FROM alf_store WHERE protocol='workspace' AND identifier='SpacesStore';
--
6

So we can build a SQL Query in order to count the number of indexable nodes in the repository.

SELECT count(1)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
AND n.store_id=6 AND q.local_name IN ('category', 'content', 'actionparameter', 'folder', 'thumbnail', 'action', 'authorityContainer', 'container', 'post', 'todoList', 'person', 'systemfolder', 'issue', 'zone', 'actioncondition', 'rule', 'compositeaction', 'topic', 'link', 'dataList', 'transferGroup', 'store_root', 'site', 'sites', 'smartFolderTemplate', 'forum', 'mlRoot', 'category_root', 'failedThumbnail') AND NOT EXISTS (SELECT alf_qname.local_name FROM alf_node, alf_node_aspects, alf_qname WHERE alf_node.id=n.id AND alf_node.id = alf_node_aspects.node_id AND alf_qname.local_name in ('version', 'thumbnailModification', 'renditioned') AND alf_qname.id = alf_node_aspects.qname_id); 2825

We are counting the nodes belonging to the types indexed by SOLR and excluding those having a non indexable aspect. Note that the list of types and aspects may be different in your deployment.

In this sample we have the same count (2,825) for SOLR documents and Repository nodes. So the index is storing the latest information from the database.

This is not a cheap SQL Query in terms of resources, so be careful before running it on a live system. You can improve the execution of the sentence by studying the explain plan and modifying the SQL for your database.

Sample EXPLAIN PLAN for PostgreSQLSample EXPLAIN PLAN for PostgreSQL

Finding the transactions that are being indexed by SOLR

When SOLR is indexing transactions containing a large list of nodes, the statistics may be frozen for a while. Metadata Tracker is processing a number of transactions in every cycle, that is specified in the alfresco.transactionDocsBatchSize solr core paramater (100 by default). And each of those transactions may contain some nodes to be indexed.

If SOLR Summary report includes the value 3984 for "Id for last TX in index" field and transactionDocsBatchSize core parameter is configured with default value (100), following SQL Query will provide a count of the documents that are being indexed in the current Metadata Tracker cycle.

SELECT count(1)
FROM alf_node
WHERE transaction_id IN 
    (SELECT id 
     FROM alf_transaction 
     WHERE id > 3984
ORDER BY 1 LIMIT 100);

 

Are you using any other method to track the progress of the SOLR Indexing process? Share that with the Community!

 

>> Experimental project to perform these validations available in https://github.com/AlfrescoLabs/index-checker 

About the Author
Angel Borroy is Hyland Developer Evangelist. Over the last 15 years, he has been working as a software architect on Java, BPM, document management and electronic signatures. He has been working with Alfresco during the last years to customize several implementations in large organizations and to provide add-ons to the Community based on Record Management and Electronic Signature. He writes (sometimes) on his personal blog http://angelborroy.wordpress.com. He is (proud) member of the Order of the Bee.
3 Comments