Running Alfresco in AWS - Migrate EBS based contentstore to S3


spoogegibbon
Active Member

Introduction



So you've installed Alfresco in Amazon AWS, and your contentstore is on either a local ephemeral disk or it's on EBS.

This guide is to help you migrate from these to S3 using the Alfresco S3 Connector.



The information in this guide was compiled during the contentstore migration from EBS to S3 for one of our large AWS Alfresco users.



There are a variety of reasons to migrate the contentstore to S3. The main one is to increase the resilience of the store - during most of the AWS outages it has been EBS that has been most affected, in some cases including data loss (search Google for 'EBS data loss').

With S3's SLA of 'Designed for 99.999999999% durability and 99.99% availability of objects over a given year', plus Amazon S3 Server Side Encryption (SSE), putting your content on S3 gives you a contentstore that is both durable and encrypted at rest.

Set up S3



First of all, create a new S3 bucket for you to use.

Make a note of the Bucket name. This will be used in all places tagged <s3_bucket>.



It is also a good idea to secure the bucket beyond the defaults, using IAM - see the AWS documentation for this.
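
As a rough illustration only - not a definitive policy - an IAM policy restricting an Alfresco user to just this bucket might look like the one below (the actions and resource names are assumptions to adapt from the AWS docs):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<s3_bucket>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<s3_bucket>/*"
    }
  ]
}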



Next, install a tool that will allow you to migrate your existing content to S3, such as s3cmd from the s3tools project.



If you are using RHEL 6, the instructions are as follows (for other operating systems, follow the instructions on the s3tools website).

As root:

cd /etc/yum.repos.d

wget http://s3tools.org/repo/RHEL_6/s3tools.repo

yum install s3cmd

s3cmd --configure


Follow the prompts and enter the credentials asked for so that s3cmd can connect to your bucket.



Once set up, check connectivity using:

s3cmd ls


This should list your buckets.
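
If you prefer the command line, s3cmd can also create the bucket for you (this assumes the bucket doesn't already exist):

s3cmd mb s3://<s3_bucket>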

Copy your content to S3



Navigate to your contentstore directory:

cd /<dir_path>/alf_data/contentstore


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore/
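
The sync can take a long time for a large store, but it is incremental - if it is interrupted you can simply re-run it and only the missing files will be uploaded. A rough sanity check afterwards is to compare file counts between the local store and the bucket:

find . -type f | wc -l

s3cmd ls --recursive s3://<s3_bucket>/contentstore/ | wc -l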


Navigate to your contentstore.deleted directory (this step is optional - only follow it if you want to keep your deleted files):

cd /<dir_path>/alf_data/contentstore.deleted


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore.deleted/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore.deleted/


If your contentstore is not massive and you have space on your ephemeral disks, you can copy your contentstore to 'cachedcontent' - this pre-populates the S3 connector's local cache. It is much better to have this cache on the local ephemeral disk than on EBS.
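
To check in advance that the copy will fit, compare the store size with the free space on the target disk:

du -sh contentstore

df -h .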

cp -r contentstore cachedcontent


Alfresco S3 Connector



Download the Alfresco S3 Connector.

Once downloaded, follow the steps in the S3 Connector documentation to install the module into Alfresco.

alfresco-global.properties



There are some changes you will need to make to your 'alfresco-global.properties'. These are all documented in the Alfresco S3 Connector documentation. The changes are:

s3.accessKey=<put your account access key or IAM key here>

s3.secretKey=<put your account secret key or IAM secret here>

s3.bucketName=<s3_bucket>

s3.bucketLocation=<see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region>

s3.flatRoot=false

s3.encryption=AES256

dir.contentstore=contentstore

dir.contentstore.deleted=contentstore.deleted





If you are using Lucene, set the following if it is not already set:

index.recovery.mode=AUTO


Make sure that Alfresco is stopped before you progress any further.

DB Update



Once your content is all in S3, and your Alfresco properties are all configured to use S3 as the contentstore location, there is one final step to perform - updating the database.

One of the tables Alfresco uses in the database ('alf_content_url') holds the URL that links each item of content to its location. Since we have moved the content to S3, we need to update all of these links in the DB. Luckily it's easy.



First, get the details of your database configuration from 'alfresco-global.properties'.

db.name=<db.name>

db.username=<db.username>

db.password=<db.password>

db.host=<db.host>


If the MySQL client tools are not already installed on your box, install them, e.g.

yum install mysql


Run mysqldump, connecting to your DB, and dump the table called 'alf_content_url'.

The command below does this (you will be prompted for the user's password):

mysqldump -u <db.username> -p -h <db.host> <db.name> alf_content_url > s3_migration.sql


Next, make a backup of this dump in case anything goes hideously wrong:

cp s3_migration.sql s3_migration.sql.bak


Then, we need to change every store location for each file to point to S3.

This involves changing the values of the 'content_url' column from 'store://...' to 's3://...'

Here's a command I made earlier to do this (if you are on Linux):

sed -i 's/store:\/\//s3:\/\//g' s3_migration.sql
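
To confirm the replacement worked, check that no 'store://' URLs remain in the dump (this should print 0):

grep -c 'store://' s3_migration.sql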


Once that completes successfully, you need to re-import this table data.

Connect to your MySQL DB (you will be prompted to enter the user's password):

mysql -u <db.username> -p -h <db.host>


Switch to use the database that Alfresco uses:

use <db.name>;


Import your modified sql file:

source s3_migration.sql;


Exit mysql.
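
Alternatively, the whole import can be done non-interactively in one command:

mysql -u <db.username> -p -h <db.host> <db.name> < s3_migration.sql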



So, to recap:

S3 bucket has been created.

An S3 command-line tool such as s3cmd has been installed.

Content has been copied to S3.

The 'Alfresco S3 Connector' module has been installed into your Alfresco instance.

alfresco-global.properties has been updated.

Alfresco has been stopped.

A dump of the 'alf_content_url' table has been made, and a backup of that taken.

The store location has been modified in the sql dump.

The modified dump file has been re-imported into your mysql db.



You are now ready to restart Alfresco...



There are a few methods to check that the S3 connector is all working:

1. Monitor the 'cachedcontent' directory - it is used as a cache for the S3 content so that Alfresco doesn't have to request frequently used content from S3 each time it is used (see the commands after this list).

2. Upload some new content and check the S3 bucket (also covered below).

3. Enable logging for jets3t as described in 'Pro tips' below and see what the logs say.
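
For the first two checks, something along these lines works (the cachedcontent path is an assumption - use wherever your cache actually lives; '-mmin -60' lists files cached in the last hour):

find /<dir_path>/alf_data/cachedcontent -type f -mmin -60

s3cmd ls --recursive s3://<s3_bucket>/contentstore/ | tail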



If things don't work, try re-syncing your content - 's3cmd sync' only uploads what is missing, so it is safe to re-run.

Pro tips



You can enable JMX Instrumentation on the S3 connector by adding the following JAVA_OPTS to your Alfresco start scripts:

'-Djets3t.mx -Djets3t.bucket.mx=true -Djets3t.object.mx=true'
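
For example, if you run Alfresco on Tomcat, these could be appended in tomcat/bin/setenv.sh (location assumed - adjust to however your start scripts set JAVA_OPTS):

JAVA_OPTS="${JAVA_OPTS} -Djets3t.mx -Djets3t.bucket.mx=true -Djets3t.object.mx=true"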



Logging - The S3 connector is based on jets3t, so follow the logging information for this tool:

http://jets3t.s3.amazonaws.com/toolkit/guide.html
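
Since jets3t logs through the same log4j setup as Alfresco, you can typically surface its output by adding a logger for the jets3t package to Alfresco's log4j configuration (the logger name is an assumption based on the package name - check the guide above for the authoritative settings):

log4j.logger.org.jets3t=DEBUG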



3 Comments
blog_commenter
Active Member
Is it necessary to update the DB? I migrated an environment to AWS using S3 as the contentstore. Instead of using s3cmd I used replication and afterwards switched it to S3/CachingContentStore.
spoogegibbon
Active Member
It depends on your methods of getting the content to S3.

The content_url fields in the DB need to point to the correct location.
berendnl
Member II

Note that you MUST update the content_url_crc after changing the alf_content_url path.

Otherwise, Alfresco will create duplicate short_urls, find orphaned nodes and in some cases actually delete files it shouldn't delete yet.