There is no difference between Enterprise and Community Edition when it comes to using a separate core (whether it runs on the same system or on a separate SOLR instance does not matter either). Actually, Community Edition is way more flexible here due to the SOLR licensing for Enterprise.
The conversion via JODConverter is not "better" per se, e.g. it is not faster in any way. The only improvement it brings is that JODConverter can be used to utilise parallel instances of LibreOffice and helps with LibreOffice process health by restarting the processes automatically.
Setting Alfresco to 100% read-only mode is impossible unless you use a DB user with only read-access privileges. Various pieces of code run during startup that override any read-only setting configured via alfresco-global.properties (e.g. the default transaction mode, which you can set). And I assume those functions will fail if you use a database user with read-only access. But it is possible to run an Alfresco instance in 98% read-only mode that is shielded from any user requests and serves only SOLR. A couple of my customers are doing that.
In Community Edition you'd either have to use a 3rd-party clustering module to ensure its caches are consistent or disable the core caches for nodes to make sure that you always read the consistent state from the database.
I thought the JODConverter was much faster because you can run parallel LibreOffice instances for conversions (as long as you have the CPU cores; I normally use 4 to 6 instances on different ports if there are many Office conversions to do). Since the indexer is no longer single-threaded, it can also use the parallel instances for the complex conversions to text, so I thought this would be better than the single LibreOffice thread in Community. Am I missing something, or did I misunderstand the whole thing?
Many thanks for all your advice and comments.
Axel, when you say "using a separate core on same system", do you mean running two separate SOLR cores in parallel, both connected to a single Alfresco instance? Is that possible? I have no clue how to do that...
The easiest way (but not the shortest one) for me would be to clone the full system as Martin says...
The link to your TMQ session looks very helpful, I'll check that!
I am just saying that the transformation via JODConverter is not faster when comparing single-process to single-process. If you have the resources to parallelize, JODConverter will of course be more efficient overall.
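To put rough numbers on that distinction, here is a deliberately idealized sketch. The function name and the figures are made up for illustration, and it assumes every conversion takes the same time and that there are enough CPU cores so the instances do not slow each other down.

```python
import math


def batch_time(num_docs, seconds_per_doc, instances=1):
    """Idealized wall-clock time to convert num_docs documents.

    Each instance converts one document at a time, so the time per document
    never changes; only the number of documents handled at once does.
    """
    if instances < 1:
        raise ValueError("need at least one converter instance")
    rounds = math.ceil(num_docs / instances)  # documents each instance must handle
    return rounds * seconds_per_doc


# Hypothetical workload: 100 documents at 3 seconds each.
single = batch_time(100, 3, instances=1)    # 300 seconds
parallel = batch_time(100, 3, instances=4)  # 75 seconds
```

A single document still takes 3 seconds in both cases; the gain only shows up across the whole batch.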
Yes, I do mean running separate cores in parallel. Since a core is simply a configuration folder in solrHome that contains a core.properties file, you can just duplicate one of the existing folders (e.g. workspace-SpacesStore or alfresco - depending on what they are called in your system), give it a distinct name and also configure its solrcore.properties to use a distinct storage location for its index. The next time you start SOLR, the new core config folder will be picked up and that core will start tracking Alfresco as per its configuration.
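The duplication step could be scripted roughly like this. It is only a sketch under a few assumptions that you should check against your own installation: that solrcore.properties sits in a conf subfolder of the core, that the index location is controlled by the data.dir.root property, and that core.properties carries a name entry. The function names and the example paths are hypothetical. Run it while SOLR is stopped.

```python
import shutil
from pathlib import Path


def set_property(path, key, value):
    """Replace 'key=...' in a .properties file, or append the key if it is absent."""
    lines = path.read_text().splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    path.write_text("\n".join(lines) + "\n")


def clone_core(solr_home, source_core, new_core, index_root):
    """Copy an existing core config folder and point the copy at its own index location."""
    src = Path(solr_home) / source_core
    dst = Path(solr_home) / new_core
    if dst.exists():
        raise FileExistsError(f"core folder already exists: {dst}")
    shutil.copytree(src, dst)
    # The copied core.properties still carries the old name; make it distinct.
    set_property(dst / "core.properties", "name", new_core)
    # Assumption: data.dir.root decides where this core stores its index.
    set_property(dst / "conf" / "solrcore.properties", "data.dir.root", str(index_root))
    return dst
```

For example, `clone_core("/opt/alfresco/solr4", "workspace-SpacesStore", "workspace-rebuild", "/opt/alfresco/alf_data/solr4/index-rebuild")` would leave the original core untouched and create a second folder for SOLR to pick up on the next start.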
Cool project - looks like you are involved :-)
Thank you for the link, I'll give a try in the nearest future. I was looking for something like that for a long time.
OK, I'll test the method you mentioned, and potentially put SOLR on a new server for better performance.
BTW, I can confirm that some duplicate index entries in SOLR are fixed automatically (a query that returns a duplicate DBID on day X returns a single node a day later). Does that make sense to you? (We're running massive bulk loads on this platform.)
The SOLR index can fix itself for many issues without reindexing everything.
It should fix any duplicates, stuff that is missing, etc.
You can also reindex nodes that match a query - or just do them one at a time.
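As a rough illustration, such requests go to the SOLR admin endpoint of the Alfresco SOLR web app. The sketch below only builds the URLs. The REINDEX action and its query and nodeid parameters are written from memory of the Alfresco SOLR admin API, and the base URL and core name are placeholders, so check all of them against the documentation for your version before relying on this.

```python
from urllib.parse import urlencode

# Selectors the REINDEX action is assumed to accept (verify for your version).
ALLOWED_SELECTORS = {"query", "nodeid", "txid", "acltxid", "aclid"}


def reindex_url(base_url, core=None, **selector):
    """Build an admin URL asking SOLR to reindex one node or all nodes matching a query."""
    unknown = set(selector) - ALLOWED_SELECTORS
    if unknown:
        raise ValueError(f"unsupported selector(s): {sorted(unknown)}")
    if not selector:
        raise ValueError("give at least one selector, e.g. nodeid=... or query=...")
    params = {"action": "REINDEX"}
    if core:
        params["core"] = core  # assumption: restricts the action to a single core
    params.update(selector)
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


# One node at a time, by DBID (hypothetical host, core and id):
one_node = reindex_url("http://localhost:8080/solr4", core="alfresco", nodeid=12345)
# Everything matching a query (the query string is only an example):
by_query = reindex_url("http://localhost:8080/solr4", core="alfresco", query="cm:name:report*")
```

The resulting URL can then be opened in a browser or fetched with whatever HTTP client and authentication your SOLR setup requires.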
With community you can define one index to use. The second one you are building will be ignored - it will add some extra load. Once you are done you just need to flip over the configuration and use the new index. There are no helpful admin screens to do this in community and you will have to stop and restart to pick up the property changes.
If we can nail the root cause of anything like this, it will be at the top of the fix list!
It really helps everyone if you can describe what you think the cause may be and raise it in ALF.
In general the fraction of deleted nodes in the index is not an issue. The background merge operations in Lucene consider this along with other stuff when they decide which segments to merge. Index optimisation is not required as it was years ago, and if you do run it you will in fact throw away some segment-level caches. Lucene improved support for lots of segments quite some time ago. Yes, a few things scale with doc count - not enough to worry about.
For index rebuild time it depends on what you measure. In SOLR 4 and 6 metadata is indexed ahead of content. SOLR caches the docs it adds to the repository for a number of reasons - one is to avoid content transformation at rebuild. Sharing the content cache is not good, as two indexes may both try to write to it - you would have to copy it - I will give this some more thought. It would be easy enough to have one of them use the cache read-only, for example.