(Re)Indexing Large Repositories - Follow up questions - Part 2



In May we restarted our Tech Talk Live programme, led by architects, engineers and developers. In the first session @tpage and I presented on (Re)Indexing Large Repositories.

There were many questions and we only had time to answer a few during the session. Due to the quantity of questions we have batched them up and will try to answer them in a series of blog posts.

This is the second part of the follow-up questions series of blog posts.

You can find Part 1 in (Re)Indexing Large Repositories - Follow up questions.



What are some metrics that have to be monitored in Solr admin console during the reindexing process?

Besides the usual parameters for any running service (CPU, memory, disk storage, network…), you can also monitor the performance of the JVM and Jetty. A detailed reference for Solr Metrics is available at https://lucene.apache.org/solr/guide/6_6/metrics-reporting.html
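As a rough illustration, the Metrics API described in that reference can be polled over HTTP. The sketch below only builds the request URL; the host, port and the choice of the jvm and jetty groups are assumptions for the example, not values from the session.

```python
from urllib.parse import urlencode


def metrics_url(base_url, groups):
    """Build a Solr Metrics API URL restricted to the given metric groups."""
    # One "group" parameter per requested group, e.g. jvm and jetty.
    params = [("wt", "json")] + [("group", g) for g in groups]
    return f"{base_url.rstrip('/')}/admin/metrics?{urlencode(params)}"


# Hypothetical local Solr instance; adjust to your own deployment.
url = metrics_url("http://localhost:8983/solr", ["jvm", "jetty"])
```

The resulting URL can be fetched with any HTTP client (taking into account whatever authentication your SOLR endpoint requires) and sampled periodically while the reindex is running.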

Is there any tool to know if Solr indexes are corrupted? If so, how can only the corrupted indexes be repaired instead of doing a full reindex?

You can try the default Lucene tool CheckIndex (https://solr.pl/en/2011/01/17/checkindex-for-the-rescue/) or use some other external tool to inspect the SOLR Core, like Luke (https://github.com/DmitryKey/luke). If none of these tools provides helpful output, the recommended approach is to perform a full re-index.
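For reference, CheckIndex is run as a plain Java class from the lucene-core JAR. The helper below is a minimal sketch that only assembles the command line; the JAR location and index directory are hypothetical paths, and SOLR should be stopped before running the tool against a live index.

```python
def check_index_command(lucene_core_jar, index_dir, exorcise=False):
    """Assemble the command line for Lucene's CheckIndex tool."""
    cmd = [
        "java",
        "-ea:org.apache.lucene...",  # enable Lucene assertions, as the tool's usage suggests
        "-cp", lucene_core_jar,
        "org.apache.lucene.index.CheckIndex",
        index_dir,
    ]
    if exorcise:
        # -exorcise drops unreadable segments, losing the documents they contain.
        cmd.append("-exorcise")
    return cmd


# Hypothetical paths, for illustration only.
cmd = check_index_command("/opt/solr/lib/lucene-core-6.6.5.jar",
                          "/opt/solr-data/alfresco/index")
```

Running it without -exorcise first gives a read-only report. Since -exorcise removes the documents stored in broken segments, those nodes would still need to be reindexed afterwards.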

Is there any monitoring tool to notify about unindexed nodes and reason for failure?

You can get all the nodes that were not indexed by using a query like TYPE:’ErrorNode’, but the reason for each failure can only be found in the log files. You can retry indexing the node with the SOLR REST API (https://docs.alfresco.com/sie/concepts/solr-admin-asynchronous-actions.html), so that the error is written to the log again.
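As a sketch of what those REST calls can look like, the helper below builds admin URLs for the RETRY action (retry the nodes marked as errors) and the REINDEX action for a single node. The base URL, the core name alfresco and the node id are assumptions for the example; check the linked documentation for the parameters supported by your version.

```python
from urllib.parse import urlencode


def admin_action_url(base_url, action, core, **params):
    """Build a URL for an action of the Alfresco SOLR admin REST API."""
    query = {"action": action, "core": core}
    query.update(params)
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(query)}"


# Hypothetical values: local SOLR, core "alfresco", node with DB id 1234.
retry_all = admin_action_url("http://localhost:8983/solr", "RETRY", "alfresco")
reindex_one = admin_action_url("http://localhost:8983/solr", "REINDEX",
                               "alfresco", nodeid=1234)
```

After invoking either URL, watch the SOLR log for the stack trace of the nodes that fail again.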


Partial Re-Index

If I configure solr6 to disable content indexing, what should I use to check the current total index node count?

You can use a simple ‘*’, which will return a count of every document stored in the SOLR Index, as metadata is indexed before content. We were using TEXT:[* TO *] in the session, which matches documents having something other than NULL in the Content Field, because we wanted to be sure that both metadata and content were present in the new SOLR Server.
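To compare both counts from a script, something like the following sketch could be used. It builds a select URL with rows=0 (only the count is needed) and reads numFound from the JSON response; the base URL and core name are assumptions, and the sample response is a trimmed, made-up example of the standard SOLR response shape.

```python
import json
from urllib.parse import urlencode


def count_url(base_url, core, query):
    """Build a select URL that returns only the number of matching documents."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def num_found(response_text):
    """Extract the document count from a SOLR JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


# Hypothetical local SOLR and core name.
all_docs = count_url("http://localhost:8983/solr", "alfresco", "*")
with_content = count_url("http://localhost:8983/solr", "alfresco", "TEXT:[* TO *]")

# Made-up sample response, only to show the parsing step.
sample = '{"response": {"numFound": 42, "start": 0, "docs": []}}'
count = num_found(sample)
```

When content indexing is enabled, the two counts converging on the new server is a reasonable sign that content has caught up with metadata.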

How can I update or manipulate the lastIndexTX value? I mean, if I want to index the repository again without a full reindex, just by manipulating the transaction id.

I would say: “Don’t do that, never!”. 

But anyway, if you want to, just change the SOLR Document with type “State” to set the ACL and Transaction Id you want. You can find that document with a query such as {!term f=DOC_TYPE}State

Meaning, let’s say I have indexed last year’s data and I am now taking that index to another Solr server, where I want to index the current year’s data and merge it with the index that was already built for last year’s data.

You can use the standard Apache SOLR tool for merging indexes: https://lucene.apache.org/solr/guide/6_6/merging-indexes.html

You also need to consolidate the SOLR Content Stores into a single one, filtering by date or by any other criteria.
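One of the options described in that guide is the CoreAdmin MERGEINDEXES action, which merges one or more index directories into an existing core. The sketch below only builds the request URL; the target core name and the index directories are hypothetical, and the source indexes should not be receiving writes while the merge runs.

```python
from urllib.parse import urlencode


def merge_indexes_url(base_url, target_core, index_dirs):
    """Build a CoreAdmin MERGEINDEXES URL for the given source index directories."""
    params = [("action", "MERGEINDEXES"), ("core", target_core)]
    # One indexDir parameter per source index to merge into the target core.
    params += [("indexDir", d) for d in index_dirs]
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


# Hypothetical core name and index locations.
merge = merge_indexes_url("http://localhost:8983/solr", "alfresco",
                          ["/idx/last-year", "/idx/current-year"])
```

This covers the Lucene index only; the SOLR Content Store still has to be consolidated separately.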



Any suggestions for keeping memory usage down while performing a full reindex?  During regular operation, I can get away with 16GB of memory for 2.7 million nodes but while reindexing I hit out of memory errors even when bumping the server and heap past 32GB.

That looks like a memory leak, not related to the memory consumption of the reindexing process itself. You should take JMX dumps, gather log evidence and create a new ticket so we can inspect the problem you are experiencing. Use your Support Channel (if you are an Enterprise user) or create a new ticket in https://issues.alfresco.com (the more detailed the data and information, the better).

Are there recommended JVM parameters for Solr 6 (GC tuning, memory tuning) ?

We are using the default recommendations from Apache SOLR: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.6.5/solr/bin/solr#L1813

These settings are shipped in the ZIP Distribution file and the Docker Image, but you can use the GC_TUNE environment variable in case you need to change them.
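As a small illustration, GC_TUNE is read as a space-separated list of JVM flags. The sketch below prepares an environment with a custom value for a process that launches SOLR; the flags shown are real JVM options but are picked only as an example, not as a recommendation.

```python
import os


def solr_env(gc_flags, base_env=None):
    """Return a copy of the environment with GC_TUNE set to the given JVM flags."""
    env = dict(os.environ if base_env is None else base_env)
    # The SOLR start script expects the flags as one space-separated string.
    env["GC_TUNE"] = " ".join(gc_flags)
    return env


# Example flags only; measure before and after changing the defaults.
env = solr_env(["-XX:+UseG1GC", "-XX:MaxGCPauseMillis=250"], base_env={})
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting SOLR from a script, or the same value can be exported in the shell before calling the start script.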

Any special considerations for Systems with twice as many ACLs/ACEs as nodes ?

Alfresco SOLR does not deal with ACEs, so that is not a problem. If you have a lot of ACLs, you probably need to tune the AclTracker settings and review your queries, as filtering by large ACLs can hurt response times.

How would the database perform during the reindexing process on the separate search engine? Would it still respond properly to users' requests?

The database will receive more SELECT queries than before, so you need to increase the Connection Pool. Additionally, you need to increase the resources (CPU and memory), so that the additional read statements can be processed in parallel with the regular use of the system. Use a profiling or monitoring tool while testing a re-indexing process in order to get an idea of how much these resources need to grow.


Other Blog Posts in the Series

* (Re)Indexing Large Repositories - Follow up questions

If you have more questions then feel free to post them in the comments below!

About the Author
Angel Borroy is a Hyland Developer Evangelist. Over the last 15 years, he has been working as a software architect on Java, BPM, document management and electronic signatures. He has been working with Alfresco in recent years to customize several implementations in large organizations and to provide add-ons to the Community based on Record Management and Electronic Signature. He writes (sometimes) on his personal blog http://angelborroy.wordpress.com. He is a (proud) member of the Order of the Bee.