Scaling Search with DB_ID_RANGE

cancel
Showing results for 
Search instead for 
Did you mean: 

Scaling Search with DB_ID_RANGE

blong
Active Member
1 4 3,230

Alfresco Content Services v5.2.2 introduced a new sharding method called DB_ID_RANGE.  This is the first shard scalable sharding method available to Alfresco.  Before going into the details of what I mean by shard scalable, it might be good to go through a little Alfresco Search history.

Before v4.0, Alfresco used Apache Lucene as its search engine.  Each Alfresco platform server contained embedded Lucene libraries.  The platform used those libraries to index all properties and contents.  In clustered environments, each server performed the same indexing operations, effectively building their own index in parallel to other instances.  Since Lucene was embedded into the Alfresco platform, it naturally scaled horizontally with the platform.

By default, indexing was performed synchronously or in-transaction.  This means it provided consistency to searches.  In other words, if you searched Alfresco immediately after adding or updating a node, that node would be in the result set.  The downside to this approach is that the index operation is performed in series giving the user a slower experience.  There was an option to do asynchronous indexing, however, the architecture had its own set of problems we don't need to go into for this blob post.

Under the embedded Lucene architecture, it was possible to scale horizontally for search performance, but only with respect to the number of search requests.  A single search must still execute completely one instance and cover the full index.  On top of that, it is not possible to scale the indexing operations.  This solution was fine for smaller repositories, but when a repository grows into the millions, it becomes untenable.

Along comes Alfresco v4.0 and Apache Solr was integrated with Alfresco.  Apache Solr is a web application wrapped around the Apache Lucene search engine.  So it is not a major technological shift.  The key capability here is the independent scalability of the search engine.  This was implemented using a polling model.  Basically, the Solr application polls Alfresco for new transactions since the last index, Alfresco provides the data, and Solr indexes those transactions.  Since it utilizes polling, searches become inconsistent.  This means there is a lag between updates and search results.  It also means different Solr servers could return different results, depending on their current lag.  By default, this lag is 15 seconds, so users are rarely impacted.

Under the Solr1 architecture, scalability was similar to Lucene.  The only difference is the ability to scale independent of the platform/backend application.  A single search still needed to execute on one Solr instance and cover the full index and each Solr instance still must index the full repository.

None of these scalability issues change with Solr4 or Solr6 outside of the addition of sharding or index replication.  Index replication is for a different blog post, so we will stick with sharding in this post.

The concept of Solr sharding was first introduced in Alfresco v5.1.  Sharding allows for an index to be divided among instances.  This means that one complete index can be distributed across 2+ servers.  This means that the indexing is scalable by a factor of the number of shards.  5 shards across 5 servers will index 5 times faster.  On top of that, a single search is also distributed in a similar matter, making a search 5 times faster.  However, a search across shards must merge search results, possibly making the search slower in the end.  When you consider loads were there are more searches than shards, you actually lose some search performance in most cases.

Solr shards are defined by a hash algorithm called a sharding method.  The only sharding method supported in Alfresco v5.1 was ACL_ID.  This means permissions were hashed to determine shards that contained the index.  So when a search is performed, ACLs are determined, containing shards are selected, and the search is performed, merged, and returned to the user.  It is optimal for one shard to be selected, then a small index search is performed and no result set merge is performed.  This is only beneficial if permissions are diverse.  If millions of documents have the same ACL the sharding is unbalanced and effectively useless.

To support other use cases, especially those without diverse sets of permissions, several sharding methods were introduced in Alfresco Content Services v5.2.  This includes DB_ID, DATE, and any custom text field.  DB_ID allows for well balanced shards in all situations.  DATE allows for well balanced shards in most situations.  That would not be the case with heavy quarter-end or year-end processing.  A well balanced shard provides the best scalability.  There good and bad reasons to choose ACL_ID or DB_ID or DATE or your own custom property.  Those are for another blog post.

With sharding and all these sharding methods available, most scalability issues have a solution.  However, there is still another issue.  A sharding group must have a predefined a number of shards.  This means that each shard will grow indefinitely.  So an administrator must project our the maximum repository size and create an appropriate number of shards.  This can be difficult, especially with repositories without a good retention policy.  Also, since it is best to hold the full index in memory, scalability is best when you can limit the size of each shard to something feasible given your hardware.

Search Engine

ProsCons

Apache Lucene

(Alfresco v3.x to v4.x)

Consistent

Scale with search requests

Embedded: no HTTP layer

No scale independence from platform

No scale with single search request

No scale with index load

Indefinite index size

Apache/Alfresco Solr v1

(Alfresco v4.0 to v5.1)

Scale independence from platform

Scale with search requests

Eventually consistent

No scale with single search request

No scale with index load

Indefinite index size

Back-end Database

(Alfresco v4.2+)

Consistent

Used alongside Solr engines

Scale with back-end database

DBA skills needed to maintain

Only available for simple queries

Apache/Alfresco Solr v4

(Alfresco v5.0+)

Same as Solr v1

Sharding available

Same as Solr v1

Alfresco Search Service v1.x

(Alfresco v5.2+)

Same as Solr v4

Embedded web container

Same as Solr v4

Shard Method: ACL_ID

(Alfresco v5.1+)

(Solr v4 or SS v1.0+)

Embedded web container

Scale independence from platform

Scale with search requests

Scale with single search request

Scale with index load

Reduction of index size

No scale for number of shards

Likely search result merge across shards

Balance depends on node permission diversity

Indefinite index size

Shard Method: DATE

(Alfresco v5.2+)

(SS v1.0+)

Same as ACL_ID

Date search performance

Reduction of index size

No scale for number of shards

Likely search result merge across shards

Index load on one shard at a time

Indefinite index size

Shard Method: custom

(Alfresco v5.2+)

(SS v1.0+)

Same as ACL_ID

Custom field search performance

Reduction of index size

No scale for number of shards

Likely search result merge across shards

Balance depends on custom field

Indefinite index size

Shard Method: DB_ID

(Alfresco v5.2+)

(SS v1.0+)

Same as ACL_ID

Always balanced

Reduction of index size

No scale for number of shards

Always search result merge across shards

Indefinite index size

Shard Method: DB_ID_RANGE

(Alfresco v5.2.2+)

(SS v1.1+)

Same as DB_ID

Scale for number of shards

Full control of index size

Proactive administration required

Always search result merge across shards

You can see similar comparison information in Alfresco's documentation here: Solr 6 sharding methods | Alfresco Documentation.

In Alfresco Content Services v5.2.2 and Alfresco Search Services v1.1.0, the sharding method DB_ID_RANGE is now available.  This allows an administrator to define a set number of nodes indexed by each shard.  This allows additional shards to be added at any time.  Although it has always been possible to add additional shards at any time (theoretically), those shards would have a new hash which would inevitably perform duplicate indexing work already performed.

Let's start with a fresh index.  Follow the instructions provided here: Installing and configuring Solr 6 without SSL | Alfresco Documentation.  However, ignore the initialization of the alfresco/archive core.  If you did this anyway, stop the Alfresco SS server, remove the alfresco/archive directories, and start it back up.  We basically want to start it without any indexing cores.

To properly enable sharding, follow the instructions here: Dynamic shard registration | Alfresco Documentation.  Although that is under Solr 4 configuration, it holds for Alfresco SS too.  I also recommend you change the solrhome/templates/rerank/conf/solrcore.properties file to meet your environment.

To start using DB_ID_RANGE, we are going to define the shards using simple GET requests through the browser.  In this example, we are going to start with a shard size of 100,000 nodes each.  So the 1st shard will have the 1st 100,000 nodes, the 2nd will have the next 100,000.  We will define it with just 2 shards to start.  When we need to go beyond 200,000 nodes, it would be logical to create a new shard group, starting at 200,000.  However, that does not work yet in Alfresco v5.2.  You must define a maximum number of shards that is as large as feasibly possible for your environment.

We are going to start with 3 server instances and grow to use 5 instances.

Create your 1st shard group and the 1st shard on the 1st and 2nd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=al...

Create the 2nd shard on the 1st and 3rd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=al...

For about the first 200,000 nodes added to the system, this will work for your search engine.  In this configuration, twice as much load will be placed on instance1 than the other two instances, so it is not a particularly great setup, but this is just an example for learning purposes.

Now let's suppose we are at 150,000 nodes and we want to get ready for the future.  Let's add some more shards.

http://<instance2>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=al...

Now we are ready for 800,000 more nodes and room to add 3 more shards.  Let's suppose we are now approaching 1,000,000 nodes, so let's add another 1,000,000 node chunk.

http://<instance4>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=al...

Suppose the search is not performing as well as you would like and you scaled vertically as much as you can.  To better distribute the search load on the new shard, you want to add another instance to the latest shard.

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=al...

When you move beyond 2,000,000 nodes and you want to downscale the shard above, you can try the following command to remove the shard.  Notice the coreName combines the coreName used to create the shard, appended by a dash and the shardId.

http://<instance1>:8983/solr/admin/cores?action=removeCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7-3

It is recommended that you keep the commands you used to create the shards.  You should hold that in your documentation so you know which shards were defined for which DBID ranges.  The current administration console interface does not help you with that all that much.  I would expect to see that improve with future versions of Alfresco CS.

4 Comments