We're running Alfresco 5.2g CE / Solr 4 with fewer than 100K nodes in the repository, a 12 GB Java heap, and 16 GB of system RAM.
For some days now we have noticed huge memory consumption and OOM errors. However, each time we take a heap dump with VisualVM, all of the heap memory is released. We found that VisualVM performs an explicit GC before dumping, which is why the memory is released.
The question: why does the heap fill up again right away with unreferenced objects (immediately releasable by GC)?
For most garbage collection variants in Java it is perfectly normal (to some degree) that heap usage gradually increases even as objects become unreferenced, until a threshold is reached at which the collection algorithm cleans up a particular memory region. Dumping the heap is also not guaranteed to run an explicit GC; it depends on the tool and invocation used to initiate the dump. The simple dump action in VisualVM may perform a GC cycle, but with the low-level JMX HotSpot diagnostics bean or the jmap command-line utility you can choose whether to dump everything or only live (reachable) objects.
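As an illustration, the HotSpot diagnostics bean exposes this choice directly: its dumpHeap method takes a live flag, and only live = true forces a full GC before the dump so that unreachable objects are excluded. A minimal sketch (the output file names are arbitrary):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the HotSpot diagnostics bean of the current JVM via JMX.
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // live = false: dump ALL objects, reachable or not; no GC is forced.
        bean.dumpHeap("all-objects.hprof", false);

        // live = true: dump only reachable objects; this forces a full GC first,
        // which is the same effect you see when VisualVM collects before dumping.
        bean.dumpHeap("live-objects.hprof", true);
    }
}
```

Comparing the two files (e.g. in Eclipse MAT) shows how much of the heap was garbage that had simply not been collected yet.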
The strange thing is that memory rises very quickly and very high, ending in an OOM, yet if we take a heap dump during this rising period, GC releases all the memory (which would mean this memory was no longer in use...). Does that make any sense?
We finally found what causes this very high memory consumption: the high number of groups. We had already noticed that Alfresco behaves badly as the number of user groups grows. In this case we had about 20K groups, which caused the OOM. We reduced that to about 3K groups and everything is now OK. Does that seem normal to you?
A quick rise and fall around an OOM error makes sense to me. What likely happens is that one thread / request loads a lot of your groups, which suddenly takes a substantial amount of memory; when the OOM error occurs, the thread is killed and all the data it loaded becomes unreferenced, so it can be cleaned up, causing the drop you see during GC.
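That pattern is easy to reproduce in isolation. In this sketch the allocation loop stands in for the request loading group data (the chunk size is arbitrary, and this is a simulation, not Alfresco's actual code path): the moment the OutOfMemoryError is caught, the local list holding everything goes out of scope, so the very next GC can reclaim the whole hoard.

```java
public class OomReleaseDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        try {
            // Stand-in for a request loading lots of data: hoard ~64 MB chunks
            // until the heap is exhausted.
            java.util.List<long[]> hoard = new java.util.ArrayList<>();
            while (true) {
                hoard.add(new long[8 * 1024 * 1024]);
            }
        } catch (OutOfMemoryError e) {
            // The "thread" dies here; 'hoard' leaves scope, so everything it
            // held is now unreachable, exactly like the killed request.
        }
        System.gc(); // after collection the heap is nearly empty again
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("used after GC: " + usedMb + " MB");
    }
}
```

Watching this program in VisualVM shows the same sawtooth you describe: a steep climb, an OOM, then a drop back to baseline once GC runs.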
20K groups is in itself a bit much, but I don't see how that number alone can cause an OOM with the amount of heap space you have assigned. It must be something very specific done in the one thread / request that triggers the OOM, something that somehow multiplies the base memory cost of a group (which is either just a node or a name, depending on how it is loaded).
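A rough back-of-envelope supports that. The 10 KB per-group figure below is purely an assumption (a generous guess for a cached group node plus its name and bookkeeping), but even then 20K groups come nowhere near a 12 GB heap:

```java
public class GroupMemoryEstimate {
    public static void main(String[] args) {
        long groups = 20_000;
        long bytesPerGroup = 10 * 1024; // ASSUMPTION: generous 10 KB per group
        long totalMb = (groups * bytesPerGroup) / (1024 * 1024);
        long heapMb = 12 * 1024;        // the 12 GB heap from the question
        // 20K groups * 10 KB = ~195 MB, under 2% of the available heap
        System.out.println(totalMb + " MB of " + heapMb + " MB heap");
    }
}
```

So the raw group data cannot explain the OOM by itself; only an operation that materializes each group many times over (e.g. expanding memberships per group, per request) would get into the multi-gigabyte range.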