Deconstructing SOLR Indexes


angelborroy
Alfresco Employee

When you open the SOLR Web Admin Console, both the alfresco and archive cores provide an Overview section with summary statistics such as the number of documents and the index size.

 

solr-1.png

In this case, the alfresco core contains 5,928 documents and uses 104.82 MB of disk storage.

A listing of every document in the core can be obtained by using the /select handler with an asterisk as the search criterion.
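
As a minimal sketch, the same count can be read outside the Admin Console by requesting zero rows and looking only at the total number of hits. The base URL, the core name and the `*:*` match-all query are assumptions (the article only mentions an asterisk), and an Alfresco SOLR installation may also require authentication or a client certificate that is not shown here.

```python
import json
from urllib.parse import urlencode

# Assumed values, not taken from the article: adjust host, port and core name.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "alfresco"


def select_all_url(base=SOLR_BASE, core=CORE):
    """Build a /select URL that matches every document but returns no rows."""
    params = {"q": "*:*", "rows": 0, "wt": "json"}
    return f"{base}/{core}/select?{urlencode(params)}"


def num_found(response_text):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


# Sample body shaped like a Solr JSON response, using the count from the article.
sample = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 2825, "start": 0, "docs": []}}'
)

print(select_all_url())
print(num_found(sample))
```

Fetching the printed URL with any HTTP client and passing the body to `num_found` should give the figure that the query screen shows.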

 

solr-2.png

This query returns 2,825 documents, far fewer than the 5,928 documents reported in the Overview section.

Finding the missing documents

Alfresco SOLR indexes store Nodes from the Alfresco Repository (2,825 in this case), but they also include additional documents required for tracking and searching operations. Every document in an Alfresco SOLR index has a property named DOC_TYPE that describes the type of the document. Exploring this field in the Schema option gives the breakdown by type, which adds up to the total count of documents (5,928).

 

solr-3.png

  • 3,036 Tx documents to track Repository database transactions
  • 2,825 Node documents to track Alfresco Repository Nodes (folders, files and other types of nodes)
  • 58 Acl documents to track permissions
  • 7 AclTx documents to track ACL transactions
  • 2 State documents to track internal status of the SOLR Core
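
The breakdown above can be checked with a few lines of code. This sketch assumes the term counts have been exported as a flat list that alternates value and count, which is how SOLR JSON output usually serializes them; the helper name `pairs_to_counts` is made up for illustration.

```python
def pairs_to_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    return dict(zip(flat[0::2], flat[1::2]))


# DOC_TYPE counts as listed in the article.
doc_types = pairs_to_counts(
    ["Tx", 3036, "Node", 2825, "Acl", 58, "AclTx", 7, "State", 2]
)

total = sum(doc_types.values())
hidden = total - doc_types["Node"]

print(f"Total documents:    {total}")
print(f"Node documents:     {doc_types['Node']}")
print(f"Non-Node documents: {hidden}")
```

The total matches the 5,928 documents from the Overview section, and the difference from the 2,825 search results is exactly the number of tracking documents.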

Estimating the storage size for hidden documents

Using a tool like Luke, these documents (Tx, Acl, AclTx and State) can be removed from the index in order to measure the disk storage they require.

luke-delete.png

 

Once all these documents have been deleted, leaving only the 2,825 Node documents, the document count and the storage size reflect the Alfresco Repository Nodes alone.

 

solr-state-1.png

Around 4 MB have been removed from the original storage size (104.82 MB), so this is the storage used by the hidden documents in this SOLR index.
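
To put that figure in proportion, here is a rough calculation using only the numbers quoted in this post. The 4 MB value is the approximate one given above, so the results are estimates, and the per-document average ignores how Lucene actually lays out segments on disk.

```python
TOTAL_MB = 104.82                 # index size reported in the Overview section
HIDDEN_MB = 4.0                   # approximate size removed ("around 4 MB")
HIDDEN_DOCS = 3036 + 58 + 7 + 2   # Tx + Acl + AclTx + State

node_mb = TOTAL_MB - HIDDEN_MB
hidden_share = HIDDEN_MB / TOTAL_MB * 100
kb_per_hidden_doc = HIDDEN_MB * 1024 / HIDDEN_DOCS

print(f"Node documents:   ~{node_mb:.2f} MB")
print(f"Hidden documents: ~{hidden_share:.1f}% of the index")
print(f"Average per hidden document: ~{kb_per_hidden_doc:.2f} KB")
```

In this index the hidden documents outnumber the Node documents but account for only a small fraction of the storage.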

Recap

  • Don't use these techniques on a live Alfresco SOLR index: you can corrupt it, and it will no longer work with the Alfresco Repository
  • This information is partially replicated on every SOLR shard, so when splitting your SOLR index into shards you need to provide more storage than the expected Index Storage Size / # of Shards
  • When splitting into shards, there is also additional replicated information for term vectors and doc values, which depends on your custom content model
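
The sharding point can be illustrated with a small estimate. This sketch assumes the worst case, in which the hidden documents are copied in full to every shard (the post only says they are partially replicated), and it does not model term vectors or doc values. The function name and the shard count of 4 are hypothetical.

```python
def per_shard_estimate_mb(node_mb, hidden_mb, shards):
    """Pessimistic estimate: node data is split, hidden data is copied to each shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return node_mb / shards + hidden_mb


SHARDS = 4  # hypothetical shard count

naive = 104.82 / SHARDS                               # Index Storage Size / # of Shards
estimate = per_shard_estimate_mb(100.82, 4.0, SHARDS)

print(f"Naive per-shard size:       {naive:.2f} MB")
print(f"Pessimistic per-shard size: {estimate:.2f} MB")
```

The gap between the two figures grows with the number of shards, because the replicated part does not shrink when the index is split.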

About the Author

Angel Borroy is a Hyland Developer Evangelist. Over the last 15 years, he has worked as a software architect on Java, BPM, document management and electronic signatures. In recent years he has worked with Alfresco to customize several implementations in large organizations and to provide add-ons to the Community based on Record Management and Electronic Signature. He writes (sometimes) on his personal blog http://angelborroy.wordpress.com. He is a (proud) member of the Order of the Bee.