I'm running a custom scheduled job that processes the documents in the repository in batches of 9,000. The repository holds many millions of documents, but that's not a problem for me because Alfresco can still be used in the meantime. This question is related to my other post https://hub.alfresco.com/t5/alfresco-content-services-forum/java-heap-space-with-java-class-custom-a... . I can process about 36,000 documents per hour because the job is started every 15 minutes. Processing a batch takes between 6 and 11 minutes, so Alfresco has some time to "settle down". The job runs an AFTS query on Solr like this:
+TYPE:"cm:content" AND -cm:creator:"System" AND -TYPE:"cm:person" AND -TYPE:"lnk:link" AND -TYPE:"fm:post" AND -TYPE:"dl.issue" AND -TYPE:"dl.event" AND -TYPE:"dl.todolist" AND -TYPE:"cm.post" AND +cm:created:["1900-01-01" TO "2022-10-21T01:53:00.356281"]
During development and testing with the SDK I noticed that there were sometimes Solr timeouts, so I handled them in the code. The code tries to execute the query up to 10 times, waiting 45 seconds between tries. If it still can't get a reply after that, the entire batch is retried. Obviously this is a major drag on overall performance. Right after the job starts a timeout is very unlikely, though it does happen occasionally, but as more batches are processed the timeout problem gets worse and worse. Sometimes the timeout happens when the query is executed with the search parameters (searchService.query(sp)) and sometimes when the next set of 900 documents is fetched (rs.getNodeRefs()). Although Alfresco could be used during processing, right now nobody is using it, so the job is the only load.
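To make the retry behaviour described above concrete, here is a minimal sketch of a "try up to N times, wait between tries" helper. It is not my actual job code: the class name QueryRetry and the method withRetry are made up for illustration, and it assumes the failing call throws an unchecked exception.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper, not part of Alfresco: retries an action a fixed number of times.
final class QueryRetry {

    private QueryRetry() {}

    /**
     * Runs the action up to maxAttempts times, sleeping for `wait` after each
     * failed attempt except the last one. If every attempt fails, the last
     * exception is rethrown so the caller can decide to redo the whole batch.
     */
    static <T> T withRetry(Supplier<T> action, int maxAttempts, Duration wait) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Assumption: a Solr timeout surfaces as an unchecked exception.
                last = e;
                if (attempt < maxAttempts) {
                    sleep(wait);
                }
            }
        }
        throw last;
    }

    private static void sleep(Duration wait) {
        try {
            Thread.sleep(wait.toMillis());
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }
}
```

With the numbers from this post, the call would look roughly like QueryRetry.withRetry(() -> searchService.query(sp), 10, Duration.ofSeconds(45)). In the worst case that is 9 waits of 45 seconds, so almost 7 minutes of pure waiting before the batch is given up and retried.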
After about 12 hours of processing it's so bad that entire batches are redone a few times before they can finish. We then restart Alfresco and the job picks up where it stopped. In the testing environment this is OK, but I could not do this in production.
The processing logic reads a query offset (skipCount for the query) and does 10 queries of 900 documents. After the batch is processed, the offset is updated (in a Postgres table) and the job waits to be run again.
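The offset handling described above can be sketched like this. It is a simplified stand-in, not the real job: PageSource is a hypothetical interface that represents one Solr query with a skipCount and a page size, and advancing the offset by the number of items actually returned is an assumption of the sketch.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of one batch run: several pages, starting at a stored offset.
final class BatchPager {

    private BatchPager() {}

    /** Stand-in for a single query that skips `skipCount` hits and returns at most `maxItems`. */
    interface PageSource<T> {
        List<T> fetch(int skipCount, int maxItems);
    }

    /**
     * Processes up to `pages` pages of `pageSize` items, starting at startOffset.
     * Returns the offset to store for the next run (for example in a database table).
     */
    static <T> int runBatch(PageSource<T> source, Consumer<T> processor,
                            int startOffset, int pages, int pageSize) {
        int offset = startOffset;
        for (int page = 0; page < pages; page++) {
            List<T> items = source.fetch(offset, pageSize);
            for (T item : items) {
                processor.accept(item);
            }
            offset += items.size();
            if (items.size() < pageSize) {
                break; // short page: nothing left to read for now
            }
        }
        return offset;
    }
}
```

With pages = 10 and pageSize = 900 one run covers 9,000 documents, and four runs per hour give the 36,000 documents per hour mentioned at the top. Note that every run starts at a larger skipCount than the one before it.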
This is the part of docker-compose.yml relevant to Solr.
#Solr needs to know how to register itself with Alfresco
#Alfresco needs to know how to call solr
#Create the default alfresco and archive cores
SOLR_JAVA_MEM: "-Xms10240m -Xmx16384m"
(See SOLR_JAVA_MEM: "-Xms10240m -Xmx16384m" )
I increased the Solr container size but it did not help.
I also noticed this drop in performance a few weeks ago when trying an external shell script (/bin/bash) that used the REST API with curl, but back then I wasn't sure what caused the problem and switched to Java running in Alfresco. This makes me think it's not a Java problem. Solr appears to be too busy or stuck on something.
My question is: what causes Solr to time out more and more often, and how can this be tuned?
Thanks for the explanation of the limitation of skipCount. Before the skipCount approach I used to set a tag on the documents, but the Java heap problem got in the way when I did that from an action ;-) I'll now go back to that and mark the nodes with a tag.