How to track the progress of the indexing process in Alfresco SOLR

cancel
Showing results for 
Search instead for 
Did you mean: 

How to track the progress of the indexing process in Alfresco SOLR

angelborroy
Alfresco Employee
2 3 1,937

When dealing with large deployments, SOLR tracking process can take some time to index all the nodes in the repository.

Tracking the progress of the indexation becomes a key feature in these scenarios.

Below two different approaches to get this information are described.

Simple approach based in Transaction Count

Alfresco SOLR is indexing the repository from the initial transaction to the latest one. There are several REST API methods to get the latest indexed transaction in SOLR, but using the Summary Action should be enough.

As noted by my colleague Mark Tunmer, using Summary Action is better in terms of performance than using Report Action.

http://127.0.0.1:8983/solr/admin/cores?action=summary&core=alfresco&wt=json

{
    "responseHeader": {
... }, "Summary": { "alfresco": { ... "Alfresco Nodes in Index": 2825, ... "Id for last TX on server": 4096,
"Id for last TX in index": 3984,
... } } } }

From this sample data, latest indexed transaction is 3984 while the Repository is containing 4096 transactions..

In order to get the number of transactions pending to be indexed, following SQL Query can be run in the database.

select count(1) from alf_transaction where id > 3984;
68

Additionally, the number of transactions indexed can be obtained with this SQL sentence.

select count(1) from alf_transaction where commit_time_ms <= 3984;
3652

This metric can provide a simple indicator on the progress of the indexing process, but it's not that accurate as one transaction can contain 1 document or 1 million documents.

There is another relevant number in this report, the "Alfresco Nodes in Index". This value (2,825 in the sample) contains the number of Alfresco nodes indexed by SOLR.

Our first reaction could be to perform this SQL Query in the database.

select count(1) from alf_node;
4871

However, this number (4,871) is very different from the one observed in the report (2,825).

Additionally, if we look at the SOLR Web Console, a "numDocs" with a different number is provided.

http://127.0.0.1:8983/solr/alfresco/admin/luke?_=1598597864787&numTerms=0&show=index&wt=json

{
  "responseHeader": {
    ...
  },
  "index": {
    "numDocs": 5913,
    ...
  }
}

The detail of the SOLR Documents included in this 5,913 count is detailed in Deconstructing SOLR Indexes, but let's focus now on that difference from the count in DB and the "Index node count" value.

Detailed approach based in Node Count

SOLR is not indexing every Node in the database, as there are non-indexable types and aspects for an Alfresco Content Model. Indexed types in SOLR can be obtained running the following REST API call.

http://127.0.0.1:8983/solr/alfresco/select?q=*&facet.field=TYPE&facet=on&wt=json

{
  "responseHeader": {
    ...
  },
  "index": {
    "numDocs": 5913,
    "maxDoc": 5913,
    ...
  },
  "fields": {
    "TYPE": {
      "docs": 2825,
      "topTerms": [
        "{http://www.alfresco.org/model/content/1.0}content", 1170,
        "{http://www.alfresco.org/model/content/1.0}failedThumbnail", 1000,
        "{http://www.alfresco.org/model/content/1.0}category", 335,
        "{http://www.alfresco.org/model/action/1.0}actionparameter", 129,
        "{http://www.alfresco.org/model/content/1.0}folder", 77,
        "{http://www.alfresco.org/model/content/1.0}thumbnail", 35,
        "{http://www.alfresco.org/model/action/1.0}action", 10,
        "{http://www.alfresco.org/model/content/1.0}authorityContainer", 10,
        "{http://www.alfresco.org/model/system/1.0}container", 8,
        "{http://www.alfresco.org/model/forum/1.0}post", 7,
        "{http://www.alfresco.org/model/datalist/1.0}todoList", 5,
        "{http://www.alfresco.org/model/content/1.0}person", 4,
        "{http://www.alfresco.org/model/content/1.0}systemfolder", 4,
        "{http://www.alfresco.org/model/datalist/1.0}issue", 4,
        "{http://www.alfresco.org/model/action/1.0}actioncondition", 3,
        "{http://www.alfresco.org/model/rule/1.0}rule", 3,
        "{http://www.alfresco.org/model/forum/1.0}topic", 3,
        "{http://www.alfresco.org/model/action/1.0}compositeaction", 3,
        "{http://www.alfresco.org/model/content/1.0}zone", 3,
        "{http://www.alfresco.org/model/linksmodel/1.0}link", 2,
        "{http://www.alfresco.org/model/datalist/1.0}dataList", 2,
        "{http://www.alfresco.org/model/transfer/1.0}transferGroup", 1,
        "{http://www.alfresco.org/model/system/1.0}store_root", 1,
        "{http://www.alfresco.org/model/site/1.0}sites", 1,
        "{http://www.alfresco.org/model/site/1.0}site", 1,
        "{http://www.alfresco.org/model/forum/1.0}forum", 1,
        "{http://www.alfresco.org/model/content/smartfolder/1.0}smartFolderTemplate", 1,
        "{http://www.alfresco.org/model/content/1.0}mlRoot", 1,
        "{http://www.alfresco.org/model/content/1.0}category_root", 1
      ],
      "histogram": [
        ...
      ]
    }
  },
  "info": {
    ...
    },
  }

From these results, we have a list of every type indexed with the number of documents indexed for that type.

As we are working with the alfresco SOLR core, we need to limit the results from database to this store.

SELECT id FROM alf_store WHERE protocol='workspace' AND identifier='SpacesStore';
--
6

So we can build a SQL Query in order to count the number of indexable nodes in the repository.

SELECT count(1)
FROM alf_node AS n,
     alf_qname AS q
WHERE n.type_qname_id=q.id
AND n.store_id=6 AND q.local_name IN ('category', 'content', 'actionparameter', 'folder', 'thumbnail', 'action', 'authorityContainer', 'container', 'post', 'todoList', 'person', 'systemfolder', 'issue', 'zone', 'actioncondition', 'rule', 'compositeaction', 'topic', 'link', 'dataList', 'transferGroup', 'store_root', 'site', 'sites', 'smartFolderTemplate', 'forum', 'mlRoot', 'category_root', 'failedThumbnail') AND NOT EXISTS (SELECT alf_qname.local_name FROM alf_node, alf_node_aspects, alf_qname WHERE alf_node.id=n.id AND alf_node.id = alf_node_aspects.node_id AND alf_qname.local_name in ('version', 'thumbnailModification', 'renditioned') AND alf_qname.id = alf_node_aspects.qname_id); 2825

We are counting the nodes belonging to the types indexed by SOLR and excluding those having a non indexable aspect. Note that the list of types and aspects may be different in your deployment.

In this sample we have the same count (2,825) for SOLR documents and Repository nodes. So the index is storing the latest information from the database.

This is not a cheap SQL Query in terms of resources, so be careful before running it on a live system. You can improve the execution of the sentence by studying the explain plan and modifying the SQL for your database.

Sample EXPLAIN PLAN for PostgreSQLSample EXPLAIN PLAN for PostgreSQL

Finding the transactions that are being indexed by SOLR

When SOLR is indexing transactions containing a large list of nodes, the statistics may be frozen for a while. Metadata Tracker is processing a number of transactions in every cycle, that is specified in the alfresco.transactionDocsBatchSize solr core paramater (100 by default). And each of those transactions may contain some nodes to be indexed.

If SOLR Summary report includes the value 3984 for "Id for last TX in index" field and transactionDocsBatchSize core parameter is configured with default value (100), following SQL Query will provide a count of the documents that are being indexed in the current Metadata Tracker cycle.

SELECT count(1)
FROM alf_node
WHERE transaction_id IN 
    (SELECT id 
     FROM alf_transaction 
     WHERE id > 3984
ORDER BY 1 LIMIT 100);

 

Are you using any other method to track the progress of the SOLR Indexing process? Share that with the Community!

 

>> Experimental project to perform these validations available in https://github.com/AlfrescoLabs/index-checker 

About the Author
Angel Borroy is Senior Software Engineer in Alfresco. Over the last 15 years, he has been working as a software architect on Java, BPM, document management and electronic signatures. He has been working with Alfresco during the last years to customize several implementations in large organizations and to provide add-ons to the Community based on Record Management and Electronic Signature. He writes (sometimes) on his personal blog http://angelborroy.wordpress.com. He is (proud) member of the Order of the Bee. Angel Borroy is Engineer at Alfresco.
3 Comments
mtunmer
Alfresco Employee

Adding to Angels comments about the transaction count approach, the Solr summary output can be misleading as the repository grows when it comes to the count and time estimates.

Using this example, the output shows 4.5 million Tx's in the repo, 2.9 million indexed so far, and 5 hours left to index the remaining 1.6 million at the current rate

"Id for last TX on server": 4524959,
"Id for last TX in index": 2931237,
"Approx transactions remaining": 1593722,
"Approx transaction indexing time remaining": "5 Hours",


A combination of the node service job cleaning out empty transactions, and the way that databases manage primary key sequences, means it's unlikely that the database has 4.5 million actual transactions. A count confirms this.

select count(id) from alf_transaction;
612724

 

With the "Id for last TX in index" value of 2931237 from the summary and this query, we can get a clearer idea of how far along the reindex is.

declare @idx_tx_curr_num int = 2931237
select
t1.tx_max_num,
t2.tx_total_count,
t3.tx_todo, t4.tx_done,
((t4.tx_done * 100)/ t2.tx_total_count) as pct_done
from
(select max(id) as tx_max_num from alf_transaction) as t1,
(select count(id) as tx_total_count from alf_transaction) as t2,
(select count(id) as tx_todo from alf_transaction where id > @idx_tx_curr_num) as t3,
(select count(id) as tx_done from alf_transaction where id < @idx_tx_curr_num) as t4;
tx_max_num   tx_total_count   tx_todo   tx_done   pct_done
4524959      612724           131037    481686    78

 

The index is 78% through the transactions, with 131k remaining, and it's been running for just over 60 minutes. If the indexing keeps moving at a similar pace, the rebuild should finish in about another 20-25 minutes.

 

Breaking down the nodes into transaction groups gives an initial picture of the distribution across the transactions

select t1.nodes_per_tx, count(*) num_of_txs from
(select count(*) nodes_per_tx from dbo.alf_node group by transaction_id) t1
group by t1.nodes_per_tx;
nodes_per_tx   num_of_txs
1              405877
2              109898
3              55484
4              5567
5              9498
...            ...
17853          1
27172          1
29772          1
72059          1

 

400k transactions have 1 node, 100k transactions have 2 nodes, etc, down to the single transaction with 72000 nodes. If that large transaction is in the remaining 22% , the rebuild will take longer than the predicted 25 minutes.

zhihailiu
Active Member

Thank you. Probably the best article I found on this topic.  

I am still puzzled by something in a full index rebuild with Alfresco search services 2.0.1. The repository has a document with id=960575 and transaction_id=683958. Solr summary shows that indexing is up-to-date and "Id for last TX in index"=684913. It seems that the document is indexed. However, Alfresco search can find the document by metadata such as title, but not keyword/full text. In addition, I think Solr would ask Alfresco for text content for full text index in a request like "api/solr/textContent?nodeId=960575". But Alfresco tomcat localhost_access_log does not have such an entry.

It seems the document is metadata indexed but not full text indexed. Are these two separate steps? If so, how would we track the full text indexing?

mtunmer
Alfresco Employee

FTS indexing of Content is a separate process (content tracker) that tracks behind the metadata indexing. The summary also shows some information about this.

"FTS": {
"Node count whose content is in sync": 1023,
"Node count whose content needs to be updated": 0
},
...
"Approx content indexing time remaining": "0 Seconds",
...
"Doc Transformation time (ms)": {
"Start": null,
"N": 0,
"Min": 0.0,
"Max": 0.0,
"Mean": 0.0,
"Varience": "NaN",
"StdDev": "NaN"
}