Most of our customers run Alfresco 5.1/5.2 with Solr 4, but in my experience versions 5 and 6 differ only marginally in terms of performance and scalability.
Solr is a memory beast and requires special attention. Solr 6 has more features and better tools to avoid OOM exceptions, but you still need to handle them. Some recommend splitting your index using sharding to overcome hardware limits.
90% of the unplanned downtimes we have seen in the last 10 years were related to an effect we call thread escalation: if the Alfresco repository waits on, or gets exceptions from, other components (database, transformation, Solr query), it creates new threads until the whole system has eaten all its resources. The retrying transaction concept even accelerates the thread escalation.
So the main rule for scaling your Alfresco is to avoid a growing number of threads. Monitor the number of threads and, if it is increasing, find the root cause. On heavily used production systems you have only ~30 minutes to stop a thread escalation and avoid system downtime.
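To make the "monitor the number of threads" rule concrete, here is a minimal sketch of a trend check. It assumes you already collect the JVM thread count at a fixed interval (how you sample it is up to you); the function name, the window size and the growth threshold are illustrative choices of mine, not Alfresco settings.

```python
from typing import Sequence


def is_escalating(samples: Sequence[int], window: int = 5, min_growth: int = 20) -> bool:
    """Flag a possible thread escalation.

    samples    -- thread counts taken at a fixed interval, oldest first
    window     -- how many of the most recent samples to look at (assumed value)
    min_growth -- minimum rise over the window that counts as a problem (assumed value)
    """
    if len(samples) < window:
        return False  # not enough data to judge a trend
    recent = list(samples[-window:])
    # The count never drops inside the window ...
    never_drops = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    # ... and it has grown by at least min_growth in total.
    return never_drops and (recent[-1] - recent[0]) >= min_growth
```

With one sample per minute and the defaults, this raises the flag after five minutes of steady growth, which still leaves time inside the ~30 minute window to look for the root cause.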
Things to check:
Are there enough free DB connections (expect 1 connection per active client + 20%), and is the DB fast enough?
If you give the JVM a lot of memory: check and monitor garbage collection.
Are response times and the number of threads from Share increasing?
Is the system waiting on Solr responses (mostly caused by OOM)? This can be addressed by log and JVM monitoring combined with automated kills/restarts. Check the new tools shipped with Solr 6 or write your own.
Is the system waiting on transformations (mostly requested by Share and Solr) and therefore creating more (waiting) threads?
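The connection rule of thumb from the checklist (1 connection per active client + 20%) can be written down as a small calculation. This is only a sketch of that arithmetic; the function names are mine, and the pool limit you compare against is whatever your setup defines (for example `db.pool.max` in alfresco-global.properties).

```python
def required_db_connections(active_clients: int, headroom_percent: int = 20) -> int:
    """Connections needed: one per active client plus headroom, rounded up."""
    if active_clients < 0:
        raise ValueError("active_clients must not be negative")
    # Integer arithmetic avoids floating-point rounding surprises.
    return (active_clients * (100 + headroom_percent) + 99) // 100


def pool_is_sufficient(active_clients: int, pool_max: int) -> bool:
    """True if the configured pool limit covers the rule of thumb."""
    return pool_max >= required_db_connections(active_clients)


print(required_db_connections(10))   # 10 clients + 20% -> 12
print(pool_is_sufficient(40, 40))    # 40 clients need 48 connections -> False
```

If the number of connections actually in use is far above what this gives for your real client count, that is a hint that requests are piling up somewhere rather than that the pool is simply too small.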
After many tests, it looks like my problem is a combination of the cluster feature and a high volume of documents. When I test a 2-node cluster (2.5 million docs) with 10 concurrent sessions, it takes Tomcat about 30 minutes to max out its memory usage and CPU. At that point Alfresco becomes very slow, almost unresponsive. Most importantly, we can have more than 40 active connections to Oracle. With the same test on the same server and the cluster feature disabled, the server runs fine and the number of database connections stays low (1-3). On the other hand, I have another setup (identical to the previous cluster) that is for dev, so with fewer documents (170,000). When I ran the same test there in cluster mode, Alfresco stayed very responsive, with no ballooning of Tomcat and few active connections to Oracle. So it looks like the volume has an influence in cluster mode. Has anyone experienced this kind of situation?
Hi Marc, volume always has a big influence (and not only in a cluster). This is what the term sizing is about. An Alfresco alf_node_properties table with thousands of records is not the same as one with several hundred million. A 10 GB database is not the same as one 100 times bigger. The Solr index size for a few thousand documents is not the same as for several million. The number of documents, the number of metadata properties and the ACLs have a big impact on the resources needed by Alfresco and its components, such as Solr or the relational database. For example, the JVM memory needed for a Solr instance depends directly on the number of documents in the repository.
I did the calculation for Solr and validated it with Alfresco, so I hope it is OK. My alf_node_properties table has about 50 million records. The Solr engine is deployed on both nodes. It really looks like there is a problem when I activate the cluster feature on my server with a high-volume database. Without the cluster activated, the same server behaves normally. What overhead does the cluster cause?
But I would look at Solr first. One of the recommendations/best practices for reducing performance problems (in a cluster or not) is to separate Solr from the repository. The memory calculation and CPU consumption figures cover Solr only. On a shared machine, Solr and the Alfresco repository compete for resources at every moment, not only for CPU and JVM memory but also for IO (Solr indexing/writing and searching/reading) while Alfresco is ingesting data into the contentstore. So you can try to:
Configure a third, dedicated machine with Alfresco and Solr, similar to your current nodes, but where Alfresco is used only for indexing. This Alfresco node should be outside the cluster and can even be in read-only mode.
Configure your current Alfresco nodes (let's say the user nodes) in the cluster to run without a local Solr and to point to the new dedicated machine for searching.