How much content remains to be indexed?

angelborroy
Alfresco Employee

Our colleague @aitseitz reached out to me to point out a missing feature related to content indexing. While we are still considering adding this feature to the product, this blog post describes alternative approaches to get the required information.

Alfresco Search Services indexes metadata and content using different trackers. Once the metadata has been indexed, some components (like the Admin Web Console in the Enterprise version) will report that no indexing is in progress.

 

However, even after all the metadata has been written to the SOLR index, the content of the documents may still be in the process of being indexed by the ContentTracker component. While we improve the feature, this blog post gives you some tools to check the status of content indexing.

 

Search Services 2.0.x

Since Search Services 2.0 there is no content store on the SOLR side, so two new fields are used to identify whether the content of a SOLR Document needs to be indexed or updated:

  • LATEST_APPLIED_CONTENT_VERSION_ID: the ID of the latest applied content property (it can be null)
  • LAST_INCOMING_CONTENT_VERSION_ID: if this field has the same value as the previous one (or is equal to SolrInformationServer.CONTENT_UPDATED_MARKER), the content is considered to be in sync. Otherwise, the content is considered outdated and will be selected by the ContentTracker.

Additional details can be found in:

https://github.com/Alfresco/SearchServices/blob/master/search-services/alfresco-search/doc/architect...

The following SOLR query returns the number of SOLR Documents whose two *_CONTENT_VERSION_ID fields hold different values:

http://localhost:8983/solr/alfresco/select
?q=*
&fq={!frange l=1 u=1 v=$equals}
&equals=if(not(eq(LATEST_APPLIED_CONTENT_VERSION_ID,LAST_INCOMING_CONTENT_VERSION_ID)),1,0)
&indent=on
&wt=json

{ "responseHeader":{ ... }, "_original_parameters_":{ ... }, "lastIndexedTx":574, "lastIndexedTxTime":1652429863621, "txRemaining":0, "response":{"numFound":76,"start":0,"docs":[ ... ]}, "processedDenies":false }

In this sample, the content of 76 documents is still pending to be indexed or updated, even though lastIndexedTx points to the latest transaction in the database and the txRemaining value of 0 indicates that no metadata is pending to be indexed.
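The same count can be retrieved from a script. The following is a minimal Python sketch, assuming the core is named alfresco and is reachable at http://localhost:8983/solr as in the URL above. The function names are hypothetical, the rows=0 parameter is an addition (it asks SOLR to return only the count), and any authentication your installation requires is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Function query from the post: 1 when the two version fields differ, else 0.
EQUALS = ("if(not(eq(LATEST_APPLIED_CONTENT_VERSION_ID,"
          "LAST_INCOMING_CONTENT_VERSION_ID)),1,0)")


def build_pending_content_url(base_url="http://localhost:8983/solr", core="alfresco"):
    """Build the select URL matching documents whose content is out of sync."""
    params = {
        "q": "*",
        "fq": "{!frange l=1 u=1 v=$equals}",
        "equals": EQUALS,
        "rows": 0,  # assumption: only the count is needed, not the documents
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


def pending_content_count(response_text):
    """Extract numFound from the JSON body returned by the query."""
    return json.loads(response_text)["response"]["numFound"]


def fetch_pending_content_count(url=None):
    """Run the query against a live Search Services instance (network call)."""
    with urlopen(url or build_pending_content_url()) as resp:
        return pending_content_count(resp.read().decode("utf-8"))
```

Calling fetch_pending_content_count() periodically gives a rough view of how quickly the ContentTracker is working through the backlog.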

If you want to know the status of the nodes pending to be indexed or updated, you can add those *_CONTENT_VERSION_ID fields to the "fl" parameter:

http://localhost:8983/solr/alfresco/select
?q=*
&fl=[cached]LATEST_APPLIED_CONTENT_VERSION_ID,LAST_INCOMING_CONTENT_VERSION_ID
&fq={!frange l=1 u=1 v=$equals}
&equals=if(not(eq(LATEST_APPLIED_CONTENT_VERSION_ID,LAST_INCOMING_CONTENT_VERSION_ID)),1,0)
&indent=on
&wt=json

{
  "responseHeader":{ ... },
  "_original_parameters_":{ ... },
  "lastIndexedTx":574,
  "lastIndexedTxTime":1652429863621,
  "txRemaining":0,
  "response":{"numFound":7,"start":0,"docs":[
    { "LATEST_APPLIED_CONTENT_VERSION_ID":591, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":393, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":573, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":579, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":606, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":582, "LAST_INCOMING_CONTENT_VERSION_ID":-10},
    { "LATEST_APPLIED_CONTENT_VERSION_ID":585, "LAST_INCOMING_CONTENT_VERSION_ID":-10}]
  },
  "processedDenies":false
}

In this sample, we can see that the content of 7 documents is still pending to be indexed, since their LAST_INCOMING_CONTENT_VERSION_ID value is set to SolrInformationServer.CONTENT_OUTDATED_MARKER (-10).
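To tally the returned documents programmatically, a short Python sketch such as the following could be used. It assumes the JSON response shape shown in the sample and takes the -10 marker value from that sample. The function name is hypothetical.

```python
import json

# Value shown in the sample for SolrInformationServer.CONTENT_OUTDATED_MARKER.
CONTENT_OUTDATED_MARKER = -10


def summarize_pending(response_text):
    """Return (returned, flagged): docs in the page and docs with the outdated marker."""
    docs = json.loads(response_text)["response"]["docs"]
    flagged = sum(
        1 for doc in docs
        if doc.get("LAST_INCOMING_CONTENT_VERSION_ID") == CONTENT_OUTDATED_MARKER
    )
    return len(docs), flagged
```

Note that SOLR returns only the first 10 documents by default, so the tally covers the returned page. The numFound value remains the authoritative total.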

Alternatively, the Search Services Admin REST API can be used to get this information:

http://localhost:8983/solr/admin/cores?action=summary&core=alfresco

<lst name="FTS">
  <long name="Node count whose content is in sync">170</long>
  <long name="Node count whose content needs to be updated">7</long>
</lst>
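The FTS block can be read from the summary response with Python's standard XML parser. This is a sketch only: the post shows just the FTS fragment, so the code searches the whole tree for an lst element named FTS rather than assuming where it sits in the response. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def fts_counts(xml_text):
    """Return {label: count} for every <long> inside the <lst name="FTS"> element."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root itself, so a bare FTS fragment works too.
    for lst in root.iter("lst"):
        if lst.get("name") == "FTS":
            return {item.get("name"): int(item.text) for item in lst.findall("long")}
    return {}
```

Passing the fragment above to fts_counts yields a dictionary keyed by the two labels, which is easy to feed into a monitoring check.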

 

Search Services 1.4.x

When using Search Services 1.3.x / 1.4.x the logic is quite different.

There is a field in the Solr schema called FTSSTATUS that can take one of the following values:

  • Clean: the text content of the document is in sync, no update is needed
  • New: the document has just been created, it has to be updated with the corresponding text content
  • Dirty: the text content of the document changed, the new content needs to be retrieved and the document updated

The following SOLR query returns the number of SOLR Documents faceted by the FTSSTATUS field:

http://localhost:8080/solr/alfresco/select?facet.field=FTSSTATUS&facet=on&indent=on&q=*&wt=json

{ "responseHeader":{ ... }, "_original_parameters_":{ ... }, "lastIndexedTx":136, "lastIndexedTxTime":1652430806500, "txRemaining":0, "response":{"numFound":874,"start":0,"docs":[ ... ]}, "facet_counts":{ "facet_fields":{ "FTSSTATUS":[ "New",152, "Clean",125, "Dirty",3]}}, "processedDenies":false }

In this sample, we can see that the content of 152 New documents and 3 Dirty documents still needs to be indexed. The content of 125 documents is in sync (Clean), so no update is required for them.
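SOLR returns facet counts as a flat list that alternates value and count, so a small amount of code is needed to turn it into a lookup. The following Python sketch, with hypothetical function names, pairs the entries and adds up the New and Dirty documents.

```python
import json


def ftsstatus_counts(response_text):
    """Turn the flat [value, count, value, count, ...] facet list into a dict."""
    flat = json.loads(response_text)["facet_counts"]["facet_fields"]["FTSSTATUS"]
    return dict(zip(flat[0::2], flat[1::2]))


def pending_content(counts):
    """New and Dirty documents are the ones whose content still has to be indexed."""
    return counts.get("New", 0) + counts.get("Dirty", 0)
```

With the sample counts above, pending_content gives 152 + 3 = 155 documents still waiting for content indexing.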

Alternatively, the Search Services Admin REST API can be used to get this information:

http://localhost:8983/solr/admin/cores?action=summary&core=alfresco

<lst name="FTS">
<long name="Node count with FTSStatus Clean">125</long>
<long name="Node count with FTSStatus Dirty">3</long>
<long name="Node count with FTSStatus New">152</long>
</lst>

 

About the Author
Angel Borroy is a Hyland Developer Evangelist. Over the last 15 years, he has been working as a software architect on Java, BPM, document management and electronic signatures. In recent years he has been working with Alfresco to customize several implementations in large organizations and to provide add-ons to the Community based on Record Management and Electronic Signature. He writes (sometimes) on his personal blog http://angelborroy.wordpress.com. He is a (proud) member of the Order of the Bee.