In Alfresco, the content binaries are stored separately from the metadata, which is always found in the database. The primary metadata that acts as a reference to the binaries takes the form contentUrl=store://.........|mimetype=...(etc). The abstraction that takes care of mapping the store://... part of the reference to a physical location is the ContentStore interface.
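The reference format above is a pipe-separated list of key=value pairs. As a minimal sketch of how such a string breaks down, the following splits it into its fields. The class name ContentReferenceSketch and the sample value are hypothetical, and this is not the parser Alfresco itself uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ContentReferenceSketch {

    /** Splits "key=value|key=value" into an ordered map of fields. */
    static Map<String, String> parse(String reference) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : reference.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Not a key=value pair: " + pair);
            }
            // Only the first '=' separates key from value, so URLs stay intact.
            fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample value; real content URLs are generated by the store.
        String sample = "contentUrl=store://2024/1/example.bin|mimetype=text/plain";
        Map<String, String> fields = parse(sample);
        System.out.println(fields.get("contentUrl"));
        System.out.println(fields.get("mimetype"));
    }
}
```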
Writing Content
A request is made to the ContentStore for a ContentWriter.
The ContentStore creates the appropriate storage for the new content. This will be either a new binary or a copy of an existing binary, depending on the type of access required by the client.
The ContentWriter hooks appropriate listeners to the content streams.
The ContentService wires the ContentWriter up to the current transaction.
The client writes to the stream using one of several convenience methods or the raw NIO Channel.
Upon stream closure, the metadata is written to the database in the context of the current transaction.
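The write sequence can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: WriteFlowSketch is a hypothetical class, a HashMap stands in for the database, the URL numbering scheme is invented, and transactions are not modelled at all. What it does show is the key ordering: the binary is written first, and the metadata only starts referring to it once the channel has been closed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

final class WriteFlowSketch {
    private final Path root;
    // Stand-in for the database: node name -> content URL.
    private final Map<String, String> metadata = new HashMap<>();
    private int counter = 0;

    WriteFlowSketch(Path root) {
        this.root = root;
    }

    /** Creates storage for a brand-new binary and returns a channel to it. */
    WritableByteChannel getWriter(final String nodeName) throws IOException {
        counter++;
        final String fileName = counter + ".bin";
        final String url = "store://" + fileName;
        final FileChannel channel = FileChannel.open(
                root.resolve(fileName),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        return new WritableByteChannel() {
            @Override
            public int write(ByteBuffer src) throws IOException {
                return channel.write(src);
            }

            @Override
            public boolean isOpen() {
                return channel.isOpen();
            }

            @Override
            public void close() throws IOException {
                if (!channel.isOpen()) {
                    return; // closing twice has no further effect
                }
                channel.close();
                // The "listener": metadata is recorded only on stream closure.
                metadata.put(nodeName, url);
            }
        };
    }

    String contentUrlOf(String nodeName) {
        return metadata.get(nodeName);
    }

    public static void main(String[] args) throws IOException {
        WriteFlowSketch store = new WriteFlowSketch(Files.createTempDirectory("contentstore"));
        WritableByteChannel writer = store.getWriter("report");
        writer.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(store.contentUrlOf("report")); // null while the stream is open
        writer.close();
        System.out.println(store.contentUrlOf("report")); // now references the new binary
    }
}
```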
Reading Content
A request is made to the ContentStore for a ContentReader.
The ContentStore opens the underlying NIO Channel.
The client reads the content using methods on the ContentReader.
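The read path can be sketched in the same spirit. The ReadFlowSketch class below is hypothetical and assumes a store that maps the part after store:// directly onto a path beneath a root directory; a real ContentStore is free to map URLs differently.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ReadFlowSketch {
    private static final String PROTOCOL = "store://";
    private final Path root;

    ReadFlowSketch(Path root) {
        this.root = root;
    }

    /** Maps the store://... part of a reference to a file under the root. */
    Path resolve(String contentUrl) {
        if (!contentUrl.startsWith(PROTOCOL)) {
            throw new IllegalArgumentException("Unsupported content URL: " + contentUrl);
        }
        return root.resolve(contentUrl.substring(PROTOCOL.length()));
    }

    /** Opens a read-only NIO channel and drains it into a UTF-8 string. */
    String readAsString(String contentUrl) throws IOException {
        try (FileChannel channel = FileChannel.open(resolve(contentUrl), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or the channel is exhausted
            }
            return new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("contentstore");
        Files.write(root.resolve("example.bin"), "hello".getBytes(StandardCharsets.UTF_8));
        ReadFlowSketch store = new ReadFlowSketch(root);
        System.out.println(store.readAsString("store://example.bin"));
    }
}
```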
Copying, Moving and Versioning Files
Once a write Channel has been closed, the content is never modified by any high-level processes. Moving, copying and versioning a file merely affects the content metadata. It is possible to end up with several references to the same underlying raw binary content.
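A small model makes the consequence concrete. In this hypothetical sketch (SharedBinarySketch and its method names are invented for illustration), a copy adds a second metadata entry that points at the same content URL, and no binary is touched.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedBinarySketch {
    // Stand-in for the metadata in the database: node name -> content URL.
    private final Map<String, String> urlByNode = new HashMap<>();

    void setContent(String node, String contentUrl) {
        urlByNode.put(node, contentUrl);
    }

    /** Copying only adds metadata: the target references the same binary. */
    void copy(String sourceNode, String targetNode) {
        String url = urlByNode.get(sourceNode);
        if (url == null) {
            throw new IllegalArgumentException("No content for node: " + sourceNode);
        }
        urlByNode.put(targetNode, url);
    }

    /** Counts how many nodes reference the given binary. */
    long referenceCount(String contentUrl) {
        return urlByNode.values().stream().filter(contentUrl::equals).count();
    }

    public static void main(String[] args) {
        SharedBinarySketch repo = new SharedBinarySketch();
        repo.setContent("original", "store://example.bin"); // hypothetical URL
        repo.copy("original", "copy");
        System.out.println(repo.referenceCount("store://example.bin")); // prints 2
    }
}
```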
Content Binaries and Transactions
Because binaries are never modified in place, writes to the filesystem do not become visible until the metadata has been committed to the database. If the transaction fails or is rolled back, the metadata remains in its pre-transaction state, that is, referencing the older binary; the newer content binary is left orphaned for later cleanup.
When a file node (or anything containing a reference to raw content) is permanently deleted, there is simply one fewer reference to the raw content. When there are no more references to some raw content, it is called orphaned. If nothing further were done, the content stores would steadily fill up with content that nothing refers to.
Cleaning up Orphaned Content (Purge)
Once all references to a content binary have been removed from the metadata, the content is said to be orphaned. Orphaned content can be deleted or purged from the content store while the system is running. Identifying and either sequestering or deleting the orphaned content is the job of the contentStoreCleaner.
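Conceptually, identifying orphans is a set difference: the URLs present in the store minus the URLs still referenced by metadata. The sketch below shows only that idea; OrphanFinderSketch is a hypothetical name, and the real cleaner works against the database and the stores rather than in-memory sets.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class OrphanFinderSketch {

    /** Returns the URLs that exist in the store but are no longer referenced. */
    static Set<String> findOrphans(Set<String> urlsInStore, Set<String> urlsInMetadata) {
        Set<String> orphans = new TreeSet<>(urlsInStore);
        orphans.removeAll(urlsInMetadata);
        return orphans;
    }

    public static void main(String[] args) {
        // Hypothetical URLs for illustration.
        Set<String> inStore = new TreeSet<>(
                Arrays.asList("store://a.bin", "store://b.bin", "store://c.bin"));
        Set<String> referenced = new TreeSet<>(
                Arrays.asList("store://a.bin", "store://c.bin"));
        System.out.println(findOrphans(inStore, referenced)); // prints [store://b.bin]
    }
}
```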
In the default configuration, the contentStoreCleanerTrigger fires the contentStoreCleaner bean.
The protectDays property dictates the minimum time that content binaries are kept in the content store. With a value of 14, for example, a file that is created and immediately deleted will not be cleaned from the content store for at least 14 days. Adjust the value to account for backup strategies, average content size and available disk space. Setting it to zero results in a system warning, because doing so breaks the transaction model: content can be lost if the orphaned content cleaner runs while content is being loaded into the system. If the system backup strategy is simply to make regular copies, the value should also be greater than the number of days between successive backup runs.
The stores property is a list of ContentStore beans to scour for orphaned content.
The listeners property lists the listeners that are notified when orphaned content is located. In the default configuration, the deletedContentBackupListener copies the orphaned content to a separate deletedContentStore.
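The retention rule amounts to a simple age check. The following is a minimal sketch, assuming the age is measured from the time the binary was created; the class name is hypothetical and the parameter protectDays is this sketch's label for the minimum retention period in days.

```java
import java.time.Duration;
import java.time.Instant;

final class ProtectionWindowSketch {

    /** True once the binary has existed for at least protectDays days. */
    static boolean isEligibleForCleanup(Instant binaryCreated, Instant now, int protectDays) {
        Instant protectedUntil = binaryCreated.plus(Duration.ofDays(protectDays));
        return !protectedUntil.isAfter(now);
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-01T00:00:00Z");
        // Nine days old: still inside a 14-day protection window.
        System.out.println(isEligibleForCleanup(created, Instant.parse("2024-01-10T00:00:00Z"), 14));
        // Fifteen days old: the window has passed.
        System.out.println(isEligibleForCleanup(created, Instant.parse("2024-01-16T00:00:00Z"), 14));
    }
}
```

The check also suggests why a value of zero is risky: a binary whose metadata has not yet been committed looks unreferenced, and with no protection window it would be eligible for cleanup immediately.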
Note that this configuration does not actually remove the files from the file system; it moves them to the designated deletedContentStore, usually contentstore.deleted. The files can be removed from the deletedContentStore by a script or cron job once an appropriate backup has been performed.
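The net effect on disk can be sketched as a copy into the deleted store followed by removal from the live store. This is an illustration only: DeletedContentBackupSketch and sequester are hypothetical names, the directory layout is assumed, and the code is not the listener implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class DeletedContentBackupSketch {

    /** Copies an orphaned binary into the deleted store, then removes the original. */
    static Path sequester(Path liveRoot, Path deletedRoot, String relativePath) throws IOException {
        Path source = liveRoot.resolve(relativePath);
        Path backup = deletedRoot.resolve(relativePath);
        Files.createDirectories(backup.getParent()); // keep the same relative layout
        Files.copy(source, backup, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(source);
        return backup;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the live and deleted content stores.
        Path live = Files.createTempDirectory("contentstore");
        Path deleted = Files.createTempDirectory("contentstore.deleted");
        Files.createDirectories(live.resolve("2024"));
        Files.write(live.resolve("2024/orphan.bin"), new byte[] {1, 2, 3});

        Path backup = sequester(live, deleted, "2024/orphan.bin");
        System.out.println(Files.exists(backup));                           // true
        System.out.println(Files.exists(live.resolve("2024/orphan.bin"))); // false
    }
}
```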
Eager Content Cleanup
If you have an appropriate backup strategy, usually involving a ReplicatingContentStore, then the content can be removed after a day and need not be sent to a backup deletedContentStore. To achieve this, override the contentStoreCleaner bean in your custom configuration context with a one-day protection period and no backup listener.
The ContentService deals with a single ContentStore injected into the store property. To use an alternative ContentStore implementation (for example, com.x.y.MyDBStore), override the fileContentStore bean so that it uses that class.
We mentioned earlier that there are hooks put onto the content write stream, so that any number of tasks can be performed, in the same transaction, when the stream is closed. It is possible to replicate content between the primary fileContentStore and any number of secondary content stores upon stream closure. The component that handles this is the org.alfresco.repo.content.replication.ReplicatingContentStore.
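The replication idea can be modelled with plain directories. The sketch below is not the ReplicatingContentStore implementation: ReplicatingStoreSketch is a hypothetical class, the copy that would be triggered on stream closure is simplified to a step at the end of write, and the fallback read order (primary first, then secondaries) is an assumption made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReplicatingStoreSketch {
    private final Path primary;
    private final List<Path> secondaries;

    ReplicatingStoreSketch(Path primary, Path... secondaries) {
        this.primary = primary;
        this.secondaries = new ArrayList<>(Arrays.asList(secondaries));
    }

    /** Writes to the primary store, then replicates to every secondary store. */
    void write(String name, byte[] content) throws IOException {
        Path written = primary.resolve(name);
        Files.write(written, content);
        // Stands in for the hook that fires when the write stream is closed.
        for (Path secondary : secondaries) {
            Files.copy(written, secondary.resolve(name), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Reads from the primary store, falling back to the secondaries in order. */
    byte[] read(String name) throws IOException {
        List<Path> roots = new ArrayList<>();
        roots.add(primary);
        roots.addAll(secondaries);
        for (Path root : roots) {
            Path candidate = root.resolve(name);
            if (Files.exists(candidate)) {
                return Files.readAllBytes(candidate);
            }
        }
        throw new NoSuchFileException(name);
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for a local disk and a network share.
        Path local = Files.createTempDirectory("local-store");
        Path share = Files.createTempDirectory("share-store");
        ReplicatingStoreSketch store = new ReplicatingStoreSketch(local, share);

        store.write("example.bin", "hello".getBytes(StandardCharsets.UTF_8));
        Files.delete(local.resolve("example.bin")); // simulate losing the local copy
        System.out.println(new String(store.read("example.bin"), StandardCharsets.UTF_8));
    }
}
```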
For example, let us assume that your server has a fast, big, local disk to store content on /var/alfresco/content-store. However, for backup purposes, the content is best stored on a network filesystem accessible on /share/alfresco/content-store. In order to keep storage costs down, really old content is archived to a tape drive that is accessible on /tape/alfresco/content-store-archives.