Index Merging Performance


This page documents the relationship between how documents are distributed across index segments and merge performance. It applies to all versions of Alfresco that use Lucene (pre-5.0) and does not apply to newer versions.

For each Alfresco store, there is a corresponding live IndexInfo file in /path/to/repository/lucene-indexes/${protocol}/${identifier}, for example /opt/alfresco/repos/3.3.3/lucene-indexes/workspace/SpacesStore/IndexInfo.

The class contains a main method to read those files. As its name implies, the IndexInfo file contains summary information about the status of the underlying Lucene segments (committed, merge, merge target, ...) and the number of Lucene documents in each segment. The IndexInfoBackup file is a fallback and is used if the IndexInfo file is missing or corrupted.

Note that the IndexInfo file does not need to be read on the production server. It can be copied to another server and read there by passing the full path of the folder containing the IndexInfo file as a command-line parameter. The class can be run with this very simple script, running for example off the Alfresco SDK:

# $1 : path to the folder containing the IndexInfo file to read  
SDK_ROOT=/opt/alfresco/sdks/3.3.3  #replace as appropriate
THIRD_PARTY_JARS=$(find $SDK_ROOT/lib/server/dependencies -name '*.jar' | xargs | sed -e 's/ /:/g')
ALF_JARS=$(ls -1 $SDK_ROOT/lib/server/alfresco*.jar | xargs | sed -e 's/ /:/g') 



As an alternative, here is a PowerShell script for a machine that has Alfresco installed:

#directory where IndexInfo is placed in:
Set-Variable -Name iiDir -Value C:\temp\analyze
Set-Variable -Name jars -Value 'c:\alfresco\tomcat\webapps\alfresco\WEB-INF\lib\*'
Set-Variable -Name mainClass -Value
java -cp $jars $mainClass $iiDir > c:\temp\analyze\report.txt

Note: The repository, database, and Tomcat do NOT need to be installed on the machine where the script is run.

Let's analyze some sample outputs from test repositories:

'Good' sample output:

Entry List for /opt/alfresco/repos/3.3.0/lucene-indexes/workspace/SpacesStore
     Size = 10        
          0        Name=8e6be575-e1d3-489f-b4c7-20337b700515 Type=INDEX Status=COMMITTED Docs=8527 Deletions=0         
          1        Name=752eac1c-7f29-47fe-992b-6ab9e4c850ac Type=INDEX Status=COMMITTED Docs=3024 Deletions=0         
          2        Name=b2c507b9-f4a3-4974-800c-60def99a4aa0 Type=INDEX Status=COMMITTED Docs=1512 Deletions=0         
          3        Name=0b80e759-8e15-4e8b-a786-f89210681d53 Type=INDEX Status=COMMITTED Docs=523 Deletions=0         
          4        Name=a6b147fe-f39b-402f-8479-61dabf879d65 Type=INDEX Status=COMMITTED Docs=92 Deletions=0          
          5        Name=724cb73a-e455-4da1-be76-44c6222d1d93 Type=DELTA Status=COMMITTED Docs=1 Deletions=1         
          6        Name=dc1a47b7-2243-407e-9fe4-c782f62746e1 Type=DELTA Status=COMMITTED Docs=1 Deletions=1         
          7        Name=df930230-089e-4b9c-bf68-3e500f13e679 Type=DELTA Status=COMMITTED Docs=1 Deletions=1         
          8        Name=82487ce7-654a-48d6-899f-d5887dda76d9 Type=DELTA Status=COMMITTED Docs=1 Deletions=1         
          9        Name=fbc386a6-9cd8-401c-9330-f2b79f851059 Type=DELTA Status=COMMITTED Docs=1 Deletions=1


'Bad' sample output:

Entry List for /opt/alfresco/repos/3.3.0-bad/lucene-indexes/workspace/SpacesStore       
     Size = 9        
          0        Name=ca612da7-fa5b-4612-aa1e-14e47ff97eb6 Type=INDEX Status=COMMITTED Docs=271702 Deletions=0         
          1        Name=9c75c493-4588-47cf-b9dd-4b5a67cfc0dc Type=INDEX Status=COMMITTED Docs=239888 Deletions=0         
          2        Name=3d185854-a40d-4de6-bab6-61d2888ec67c Type=INDEX Status=COMMITTED Docs=162640 Deletions=0         
          3        Name=fde91947-fb1e-406a-a411-ad103b0deccb Type=INDEX Status=COMMITTED Docs=154118 Deletions=0         
          4        Name=019578d2-cf7a-4067-91a0-16f2ecb82ee3 Type=INDEX Status=COMMITTED Docs=81467 Deletions=0          
          5        Name=79cd965e-6cbe-4559-98b7-0ae337b315c3 Type=DELTA Status=COMMITTED Docs=1 Deletions=2         
          6        Name=5c67efc8-7454-4c7a-9694-ebc7d0e53b55 Type=DELTA Status=COMMITTED Docs=0 Deletions=2         
          7        Name=be3bf0ba-e0dc-450c-a864-01f08024fcf5 Type=DELTA Status=COMMITTED Docs=0 Deletions=2         
          8        Name=71ebb8b3-17e0-4606-92df-d4e5a3c471a8 Type=DELTA Status=COMMITTED Docs=0 Deletions=2


As a rule of thumb, the highest-numbered INDEX entry (the one containing the fewest documents, number 4 in the examples above) should not contain more than a few hundred documents. In the 'good' sample it holds 92 documents; in the 'bad' sample it holds 81467. If this is not the case, merge operations can put a massive amount of IO pressure on the index directories. This only applies to COMMITTED index segments. If you have a lot of MERGE segments, it may make more sense to look at the file when the indexes are not currently being heavily merged (i.e. when most statuses are COMMITTED), as the numbers may change a lot in a short amount of time.
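The rule of thumb above can be checked automatically against a saved report. The following is only a minimal sketch in Python, assuming the report uses exactly the line layout of the sample outputs on this page. The function names (`parse_entries`, `last_index_segment`, `check_report`) and the default threshold of 500 documents are hypothetical choices made for illustration and are not part of Alfresco.

```python
import re

# Matches one entry line of the report, for example:
#   4   Name=a6b147fe-... Type=INDEX Status=COMMITTED Docs=92 Deletions=0
ENTRY_RE = re.compile(
    r"^\s*(\d+)\s+Name=(\S+)\s+Type=(\w+)\s+Status=(\w+)"
    r"\s+Docs=(\d+)\s+Deletions=(\d+)"
)


def parse_entries(report_text):
    """Return one dict per entry line; other lines (headers, Size) are skipped."""
    entries = []
    for line in report_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append({
                "position": int(match.group(1)),
                "name": match.group(2),
                "type": match.group(3),
                "status": match.group(4),
                "docs": int(match.group(5)),
                "deletions": int(match.group(6)),
            })
    return entries


def last_index_segment(entries):
    """Return the highest-numbered COMMITTED INDEX entry, or None if there is none."""
    committed = [
        e for e in entries
        if e["type"] == "INDEX" and e["status"] == "COMMITTED"
    ]
    if not committed:
        return None
    return max(committed, key=lambda e: e["position"])


def check_report(report_text, max_docs=500):
    """Return True if the last INDEX segment holds at most max_docs documents.

    max_docs=500 is an assumed stand-in for "a few hundred"; adjust as needed.
    Returns None when the report has no COMMITTED INDEX entry.
    """
    segment = last_index_segment(parse_entries(report_text))
    if segment is None:
        return None
    return segment["docs"] <= max_docs
```

To use it, read the report file into a string and pass it to `check_report`. A result of `False` means the smallest INDEX segment is larger than the chosen threshold, as in the 'bad' sample.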

The number of on-disk segments is controlled by lucene.indexer.mergerTargetIndexCount for Alfresco versions >= 3.3.3 and lucene.indexer.mergerMergeFactor for Alfresco versions
