Achieving Higher Metadata Indexing Speed with Elasticsearch


karan97425
Partner

Elasticsearch can be tailored to the job at hand. The index can be built from the metadata of uploaded files alone, or from their content as well. Indexing content adds time and cost, so it is chosen only when necessary. At data volumes of up to 1 billion files, even metadata indexing can be a lengthy job. This post describes the approach we used for indexing and the speed we achieved with a deployment hosted on Amazon Web Services (AWS).
The Alfresco Elasticsearch Connector can run multiple threads in parallel, each indexing a range of node IDs taken from the database. With the right Elasticsearch configuration (number of primary and replica shards, refresh interval), we can achieve better indexing speed. The complete infrastructure and configuration we used is listed below.

Elasticsearch Data Node:

  • AWS Elasticsearch v7.10 with Availability Zone: 1-AZ
  • Number of Data Nodes: 3
  • Instance Type: r6g.2xlarge.search
  • Storage type: EBS
  • EBS volume type: Provisioned IOPS (SSD)
  • EBS volume size: 1000 GiB per node
  • Fielddata cache allocation: 20
  • Max clause count: 1024

Indexing Instance

  • EC2 Indexing Instance to run alfresco-elasticsearch-connector-distribution-3.1.1
  • Indexing Instance type: t2.2xlarge (8vCPUs, 32GiB RAM)
  • Number of EC2 Instances for Indexing: 3
  • Threads per instance: 7 each on instances 1 and 2, and 6 on instance 3, for a total of 20 threads running in parallel
  • Maximum heap allocated to each thread: 4GB (-Xmx4G)

Elasticsearch Master Node:

  • Number Of Master Nodes: 3
  • Master Node Instance type: m5.large.search
  • Dedicated master nodes are added for resilience and may be omitted without significant impact.

Elasticsearch Settings

  • Number of Primary shards: 32
  • Number of Replica Shards: 0
  • Refresh Time: Disabled
  • Translog flush threshold: 2GB

Other Components

  • Active MQ: mq.m4.large
  • RDS is used as Database with db.r5.2xlarge running PostgreSQL
  • EC2 Instance running ACS & Transform Service: m5a.xlarge
  • ACS Version used is 7.2.0

The deployment architecture of the setup looks like this:

[Figure: deployment architecture diagram (Capture.JPG)]

Setting Up Elasticsearch

The first step is to create the index with a suitable number of shards, which depends on the data volume and the target size of each shard. AWS recommends that a shard not exceed 50GB, so once the total size of the indexed data has been estimated, the number of shards can be calculated. For example, the metadata index for 1 billion files is estimated at approximately 1.3TB; at about 40GB per shard, that gives 32 shards. The command below can be run from within the same VPC to create such an index.

curl -XPUT 'https://<Elasticsearch DNS>:443/alfresco?pretty' -H 'Content-Type: application/json' -d'
{
  "settings" :{
        "number_of_shards":32,
        "number_of_replicas":0
  }
}'
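
The shard sizing arithmetic described above can be sketched as a small shell script. This is only an illustration: it assumes the 1.3TB estimate is taken as 1300GB, and the function names shard_count and shard_size are made up for this example.

```shell
#!/bin/sh
# Sketch of the shard sizing arithmetic; all sizes are whole gigabytes.
# Assumption: the estimated 1.3TB of index data is taken as 1300GB.

# shard_count TOTAL_GB TARGET_GB -> number of shards (integer division, rounded down)
shard_count() {
  echo $(( $1 / $2 ))
}

# shard_size TOTAL_GB SHARDS -> size of each shard in GB, rounded up
shard_size() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

shards=$(shard_count 1300 40)
size=$(shard_size 1300 "$shards")
echo "shards=$shards size_per_shard=${size}GB"

# Check the result against the 50GB per-shard ceiling mentioned above.
if [ "$size" -le 50 ]; then
  echo "within the 50GB limit"
else
  echo "too large: increase the shard count"
fi
```

Rounding the division down gives 32 shards of roughly 41GB each, which stays under the 50GB ceiling. If the estimate were larger, the same check would show when more shards are needed.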

Two other critical parameters are the refresh interval and the translog flush threshold. The refresh interval controls how quickly newly indexed data becomes searchable; during indexing it should be disabled (by setting it to -1) or raised to a higher value to avoid unnecessary resource usage. The translog flush threshold should be set to a larger size, e.g. 2GB, to avoid frequent flushes during indexing. Both can be set with the commands below.

# Disable refresh by setting the refresh interval to -1
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" : "-1" }}'

# Set the translog flush threshold to 2GB
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty" -H 'Content-Type: application/json' -d '{"index":{"translog.flush_threshold_size" : "2GB"}}'

# Verify the settings (a GET request needs no body)
curl -XGET "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty"
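
Because refresh is disabled only for the duration of the bulk load, the settings presumably need to be put back once indexing finishes. The sketch below builds the request body for both settings in one place so the same helper can be used before and after the run. The helper name bulk_settings_body is hypothetical, and the "after" values (1s and 512mb) are assumed here to be the stock Elasticsearch defaults; use whatever values the index needs in normal operation.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for an index settings update.
# $1 = refresh interval (e.g. -1 while bulk indexing, 1s afterwards)
# $2 = translog flush threshold (e.g. 2gb while bulk indexing)
bulk_settings_body() {
  printf '{"index":{"refresh_interval":"%s","translog.flush_threshold_size":"%s"}}' "$1" "$2"
}

# Print the two bodies so they can be inspected before being sent.
echo "before indexing: $(bulk_settings_body -1 2gb)"
echo "after indexing:  $(bulk_settings_body 1s 512mb)"

# To apply a body, pass it to the same _settings endpoint used above, e.g.:
# curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings" \
#   -H 'Content-Type: application/json' -d "$(bulk_settings_body 1s 512mb)"
```

Both settings are sent in a single request here; sending them separately, as in the commands above, has the same effect.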

Setting Up Re-Indexing Instance

  • Deploy 3 EC2 instances using the configuration listed above, in the same VPC as all other services
  • Attach these EC2 instances to a security group that allows incoming traffic from the other services
  • Install Java on all 3 instances
  • Copy alfresco-elasticsearch-connector-distribution-3.1.1 to each of the 3 EC2 instances
  • We ran 7 threads on each of two instances and 6 on the third, for a total of 20 threads
  • Browse to the folder where alfresco-elasticsearch-reindexing-3.1.1-app.jar is located
  • Run the indexing command below with the following modifications.
    • server.port: provide a unique port number for each thread running on an instance. For example, to run 7 threads on Instance 1, run the command 7 times, each time with a different port. Update the other parameters as described below.
    • alfresco.reindex.fromId and alfresco.reindex.toId: give each thread its own node ID range. The idea is to distribute the total file count equally among the threads; in this case, 1 billion across 20 threads gives each thread 50 million. For example,
      • For Thread 1: alfresco.reindex.fromId=0 alfresco.reindex.toId=50000000
      • For Thread 2: alfresco.reindex.fromId=50000001 alfresco.reindex.toId=100000000
      • For Thread 3: alfresco.reindex.fromId=100000001 alfresco.reindex.toId=150000000
      • ......
      • For Thread 20: alfresco.reindex.fromId=950000001 alfresco.reindex.toId=1000000000
      • Running more than 20 threads on this infrastructure did not improve indexing speed.
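
The node ID ranges and the 7/7/6 split across instances listed above can also be generated rather than typed by hand. The following is a minimal sketch that follows the same scheme as the examples (the first range starts at 0, every later range starts one past the previous end). The function name plan_ranges is hypothetical, and it assumes the total divides evenly by the thread count, as 1 billion does by 20.

```shell
#!/bin/sh
# plan_ranges TOTAL_IDS THREADS INSTANCES
# Prints one line per thread: "instance thread fromId toId".
# Assumes TOTAL_IDS divides evenly by THREADS.
plan_ranges() (
  total=$1; threads=$2; instances=$3
  chunk=$(( total / threads ))
  # Threads per instance, rounded up (20 threads on 3 instances -> 7, 7, 6).
  per_instance=$(( (threads + instances - 1) / instances ))
  thread=1
  while [ "$thread" -le "$threads" ]; do
    if [ "$thread" -eq 1 ]; then
      from=0
    else
      from=$(( (thread - 1) * chunk + 1 ))
    fi
    to=$(( thread * chunk ))
    instance=$(( (thread - 1) / per_instance + 1 ))
    echo "$instance $thread $from $to"
    thread=$(( thread + 1 ))
  done
)

# The case described in this article: 1 billion IDs, 20 threads, 3 instances.
plan_ranges 1000000000 20 3
```

Each output line gives the instance, the thread number, and the values for alfresco.reindex.fromId and alfresco.reindex.toId.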

Indexing Command

nohup java -Xmx4G -jar alfresco-elasticsearch-reindexing-3.1.1-app.jar \
--server.port=<unique port> \
--alfresco.reindex.jobName=reindexByIds \
--spring.elasticsearch.rest.uris=https://<Elasticsearch DNS>:443 \
--spring.datasource.url=jdbc:postgresql://<DB Writer URL>:5432/alfresco \
--spring.datasource.username=**** \
--spring.datasource.password=**** \
--alfresco.accepted-content-media-types-cache.enabled=false \
--spring.activemq.broker-url=failover:\(ssl://<Broker 1>:61617,ssl://<Broker 2>:61617\) \
--spring.activemq.user=alfresco \
--spring.activemq.password=***** \
--alfresco.reindex.fromId=0 \
--alfresco.reindex.toId=50000000 \
--alfresco.reindex.multithreadedStepEnabled=true \
--alfresco.reindex.concurrentProcessors=30 \
--alfresco.reindex.metadataIndexingEnabled=true \
--alfresco.reindex.contentIndexingEnabled=false \
--alfresco.reindex.pathIndexingEnabled=true \
--alfresco.reindex.pagesize=10000 \
--alfresco.reindex.batchSize=1000  &
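
Instead of copying the command once per thread, the per-thread variations (port and ID range) can be generated in a loop. The sketch below only prints the commands as a dry run. The port scheme (8080 plus the thread number) and the names BASE_PORT, COMMON_OPTS and emit_commands are assumptions made for this example; COMMON_OPTS stands in for the connection and tuning options of the full command above and must be completed before real use.

```shell
#!/bin/sh
# Assumptions (not from the article): ports are BASE_PORT + thread number,
# and COMMON_OPTS stands in for the remaining options of the full command.
BASE_PORT=8080
JAR=alfresco-elasticsearch-reindexing-3.1.1-app.jar
COMMON_OPTS="--alfresco.reindex.jobName=reindexByIds"

# emit_commands: reads "thread fromId toId" lines on stdin and prints one
# launch command per line. It only prints; nothing is started.
emit_commands() {
  while read -r thread from to; do
    echo "nohup java -Xmx4G -jar $JAR --server.port=$(( BASE_PORT + thread ))" \
      "$COMMON_OPTS --alfresco.reindex.fromId=$from --alfresco.reindex.toId=$to &"
  done
}

# Dry run for the first two threads of instance 1.
emit_commands <<'EOF'
1 0 50000000
2 50000001 100000000
EOF
```

Printing first makes it easy to check the ports and ranges for overlaps; once they look right, the output can be saved to a script and run on the instance.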

Indexing Speed

The table below summarizes the indexing speeds obtained with different data volumes on identical infrastructure and configuration.

Metadata Indexing         20 Million             200 Million            300 Million            1 Billion
Total Time Consumed       40 minutes             8 hours 9 minutes      13 hours 43 minutes    47 hours 30 minutes
Average Indexing Speed    30 million files/hour  24 million files/hour  21 million files/hour  21.2 million files/hour
Elasticsearch Operations  300k-400k per minute   200k-350k per minute   200k-350k per minute   200k-350k per minute
Data | Master Node CPU%   85% | 10%              95% | 15%              95% | 15%              96% | 15%
Data Node Memory %        80%                    97%                    98%                    98%
RDS Thread Count          411                    411                    411                    411
RDS CPU%                  5%                     5%                     5%                     5%
Re-Indexer CPU%           5%                     5%                     5%                     5%
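
The average indexing speed is simply the file count divided by the elapsed time. A minimal sketch of that arithmetic, using a hypothetical helper named files_per_hour and integer math (results are rounded down):

```shell
#!/bin/sh
# files_per_hour FILES HOURS MINUTES -> average files indexed per hour (rounded down)
files_per_hour() {
  echo $(( $1 * 60 / ($2 * 60 + $3) ))
}

files_per_hour 20000000 0 40    # 20 million files in 40 minutes
files_per_hour 200000000 8 9    # 200 million files in 8 hours 9 minutes
```

The figures reported for each run are approximate, so small differences from this arithmetic are to be expected.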

 

Conclusion

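Elasticsearch can be tuned to suit specific indexing needs. With the right infrastructure and configuration, such as no replicas, refresh disabled and a larger translog flush threshold during the load, and reindexing split across parallel threads by node ID range, one can achieve greater indexing performance. The setup described here is an optimum one for our deployment and can be scaled up further to achieve higher indexing speed.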