I'm running a scheduled job in Alfresco 7.2 Community (Docker on Linux) every 5 minutes which processes nodes located with AFTS queries. The job takes less than 2 minutes to run, but I use a longer interval between runs so that the repo gets indexed properly before the next run. The reason for this is that during the job I tag the nodes being processed, and the AFTS query used needs to ignore those nodes. The wait time, and hence the performance, is not critical in this case because Alfresco can still be used during processing.
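For context, the exclusion works by negating the tag/aspect in the query. A minimal sketch of such an AFTS query, where `my:processed` is a hypothetical aspect standing in for whatever marker my job actually applies:

```
TYPE:"cm:content" AND NOT ASPECT:"my:processed"
```

Nodes only drop out of this result set once the tagging transaction has been indexed, which is why I leave slack between runs.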
The job gets a batch of 500 nodes from the AFTS query, deletes older versions from the nodes that appear in an external database table, and deletes the nodes that are not in the list. I need to process over 11 million nodes. I've tried several combinations of cron expressions and batch sizes, but every time, after a day or so, something happens with the cache and I see this type of message in alfresco.log:
2022-11-17 10:51:07,334 WARN [org.alfresco.repo.cache.TransactionalCache.org.alfresco.cache.node.nodesTransactionalCache] [http-nio-8080-exec-7] Transactional update cache 'org.alfresco.cache.node.nodesTransactionalCache' is full (5000000).
2022-11-17 10:57:52,814 WARN [org.apache.activemq.transport.failover.FailoverTransport] [ActiveMQ NIO Worker 7] Transport (nio://activemq:61616) failed, not attempting to automatically reconnect
Sometimes other caches are reported. After that the job runs with strange results, probably because the AFTS query cannot find the right nodes (perhaps because they're not indexed), until the errors get worse, ending with:
2022-11-17 14:37:40,955 ERROR [org.alfresco.repo.descriptor.RepositoryDescriptorDAOImpl] [DefaultScheduler_Worker-3] getDescriptor:
org.mybatis.spring.MyBatisSystemException: nested exception is org.apache.ibatis.exceptions.PersistenceException:
### Error querying database. Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is org.postgresql.util.PSQLException: The connection attempt failed.
### The error may exist in alfresco/ibatis/#resource.dialect#/node-common-SqlMap.xml
### The error may involve alfresco.node.select_NodeByNodeRef
### The error occurred while executing a query
### Cause: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is org.postgresql.util.PSQLException: The connection attempt failed.
During the last run, started yesterday, I noticed that cache problems appear only after the job has to delete nodes, but that might be a coincidence. When cache problems appear, the next few runs of the job show rapidly increasing heap space usage. In the DEV environment I can restart Alfresco and processing starts again, but in PROD I need to stabilize this.
My questions are:
1) Which parameters should I tune for the most impact, if tuning is the way to go?
1- You seem to have 5000000, which means the property "cache.node.nodesSharedCache.tx.maxItems" is overridden somewhere; the default value is only 125000. Check all the defaults in the link shared above. You need to tune it depending on your system's needs: set a value, test, and increase or decrease. Be mindful that as the number goes up, it will take more heap memory. Check the docs shared above. If you are processing a really large batch, try to reduce the batch size.
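To make that concrete, the override would typically live in alfresco-global.properties (or a matching caches.properties override). A minimal sketch, with the stock default value restored; verify the exact defaults against the caches.properties shipped with your version:

```properties
# alfresco-global.properties
# Transactional node cache limit; stock default is 125000,
# so a value of 5000000 means it has been overridden somewhere.
cache.node.nodesSharedCache.tx.maxItems=125000
```

Remember that this cache lives entirely on the heap, so any increase here should be matched against the JVM memory settings of the repository container.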
Thanks for your ideas. The cache value of 5000000 was set as a test, because in previous runs this error happened when it was at 125000. It looks like no matter how high I set this, the problem comes up anyway. Since my batch size is pretty small, only 500 nodes, I have the feeling something else is going on.
Right now the process has been running for 14 hours, doing 12 runs per hour, with the cache back at 125000 but with timeToLiveSeconds=30 and cluster.type=local. I am hoping this will evict cache entries sooner and avoid any messaging between caches (probably not relevant). However, my job log shows that so far many nodes have had versions removed but no nodes have been deleted. There seems to be a big difference between deleting versions and deleting nodes (which may themselves have versions).
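For reference, this is the configuration I'm currently testing, expressed as alfresco-global.properties overrides (property names assumed to follow the standard caches.properties naming convention for the node cache):

```properties
# Current test configuration for the node cache
cache.node.nodesSharedCache.tx.maxItems=125000
# Evict entries after 30 seconds instead of the much longer default
cache.node.nodesSharedCache.timeToLiveSeconds=30
# Keep the cache node-local; no cross-cache invalidation messaging
cache.node.nodesSharedCache.cluster.type=local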
If this runs into problems, I'll try to disable the cache completely as per your link, and maybe also reduce the batch size to 100 with the cron expression set to run every minute.
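That fallback would look roughly like this; the job property names are hypothetical placeholders for my custom job's configuration, but the cron value is standard Quartz syntax:

```properties
# Hypothetical property names for my custom scheduled job
myjob.cron=0 * * * * ?   # Quartz: fire at second 0 of every minute
myjob.batchSize=100      # down from 500 nodes per run
```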
So, to follow up with another question: is a batch size of 500 nodes considered big?