Orphan files

unknown-user
Active Member

Orphan files

Hi community,

My question may seem a bit confusing, but after several days spent on it without any success I have to ask: is the File Content Storage implementation used in Alfresco reliable and efficient?
Some details of my problem.
A new Space was created without applying the versioning aspect. I added content to the space (text, images, PDFs, etc.). After that, some of the files were modified (updated) via the Web, CIFS and/or FTP interfaces (no check-in/check-out operations). What I found after these updates: new files are created in the storage and the Node references are updated correctly, but all the predecessors are still present in the storage. My storage keeps growing in size, and more and more orphan files appear there.
As far as I can tell, FileContentWriter doesn't allow writing to an existing file, so a new one is created on every file update. OK, that's related to the concurrent-access issue. ContentStoreCleanupJob might be used to clean up the storage periodically, but it is still not in operation (commented out in the configuration file), as it may remove all files from the storage (it analyzes just the last modification date), right? Maybe this is a misunderstanding on my part, or there are some undocumented configuration tricks to prevent such behaviour?
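To illustrate what I mean, here is a rough sketch of the copy-on-write behaviour I observe (hypothetical Java, not the actual Alfresco classes):

    // Hypothetical sketch: every update streams to a brand-new file and
    // returns its URL; the node property is repointed to the new file,
    // but the old file stays behind as an orphan.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.UUID;

    public class CopyOnWriteStore {
        private final File storeRoot;

        public CopyOnWriteStore(File storeRoot) {
            this.storeRoot = storeRoot;
        }

        public String write(String content) throws IOException {
            File file = new File(storeRoot, UUID.randomUUID() + ".bin");
            try (FileWriter out = new FileWriter(file)) {
                out.write(content);
            }
            return file.getName(); // the new content URL; the old file is now orphaned
        }
    }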
Over the last few days I looked through Slide's storage implementation. It's based on the Apache Commons Transaction library. I carried out some tests to check its stability in a multi-user environment. It looks quite good; at least I can concurrently modify/read the same files, although an intermediate storage is used.

Has anybody met the same problem before, or maybe I have to configure my server in a specific way?
My environment:
WinXP, Alfresco 1.2RC2, Oracle 10g.

Thanks in advance,
Valera.
3 Replies
derek
Established Member

Re: Orphan files

Hi,

The content store cleaner is active in the V1.2 release.  You are right that it uses the last modified date, but then again, content is never modified after the stream to it has been closed.  Additionally, the cleaner is set to only delete orphaned content older than 14 days.  As if that isn't enough, the cleaner listener will currently back up the orphaned content.

I assure you that the content store is robust and efficient - for the following reasons:

1. Content is referenced by a node property or properties of type d:content.  There may be several nodes or properties referencing the same content in the content store.  So, we have shared content between the version stores, content copies, etc.

2. We use the filesystem instead of the database.  This removes incompatibility issues across database vendors.  Additionally, the random file access support (as required by CIFS and other apps) cannot be provided by database persistence without first copying files to the filesystem.  This goes for both reads and writes.  Although we already have support to do the random-access spoofing, it is not required.  There are also size issues at play here.  Where would you rather store a 5GB file?

3. Backup is much simpler with the metadata split from the content.  The filestore can be backed up in whatever state it is in after the database has been backed up.  Since copying large files can take time, this means that the system doesn't have to lock the database against writes while taking a dump of the content.

4. Our metadata store can function with all content missing.  This means that supported customers can give us a copy of the metadata for diagnostics without giving away any of the actual content.

5. We have zero wait for content reads or writes.  Any system that persists to the filesystem and overwrites the original file will, at some time before the end of the transaction, spend time overwriting the content.  During this time, all read requests will have to wait.  Additionally, the write won't be able to start until all read requests have finished.  Then, during the write, there will be a small chance of corruption in the case of sudden VM termination.  We don't have that issue; a sketch illustrating this follows after point 6.

6. By not changing files once they have been closed, we can safely and quickly perform incremental backups.  We can mix content stores up and spread files around.  We can archive files to compressed storage whilst still having them available to the repo.
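To illustrate point 5, here is a minimal sketch (my own illustration, not repository code) of why readers never wait on writers when content files are immutable:

    // Writers always stream to a fresh file and publish it with an
    // atomic reference switch; readers holding a stream on the old file
    // are completely unaffected.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicReference;

    public class ImmutableContent {
        private final AtomicReference<File> current = new AtomicReference<>();

        // Readers take whatever file is current right now.
        public File openForRead() {
            return current.get();
        }

        // Writers never touch the current file: they write a new one and
        // publish it only once the stream is complete.
        public void update(File storeRoot, String newContent) throws IOException {
            File fresh = new File(storeRoot, UUID.randomUUID() + ".bin");
            try (FileWriter out = new FileWriter(fresh)) {
                out.write(newContent);
            }
            current.set(fresh); // atomic; no reader ever blocks on this
        }
    }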

You can be sure that:
    The cleaner will remove orphaned content only
    Content is shared
    Access is fast
    Content access is transactionally safe
Regards
unknown-user
Active Member

Re: Orphan files

Hi Derek,

Thank you for the explanation. I've just downloaded the V1.2 release and figured out how the store cleaner manages the orphaned content.

Maybe a few words about the system we are developing now, to clarify why it's so important for us to rely on a safe storage mechanism. The system will manage a huge number of large files (raster data 200-500 MB in size), and many users will need to update the images quite intensively. That's why I'm afraid we cannot wait until the cleaner does its job: our underlying SAN may exceed its limits during a single working day. Of course, buying new storage would be a good solution, but not in our case. :(

Could you please give me some hints on how to implement a custom handler that will delete the old version (not a real version in Alfresco terms) of a just-modified file? As I understand it, WriteStreamListener could be the proper place to apply our requirements; am I right?
Another issue I am concerned about is the Delete content operation. Can I safely delete the storage file as soon as the corresponding Node (on NodeService) is deleted? It's clear that the store cleaner will do that eventually, but could it be done immediately?

Regards,
Valera
derek
Established Member

Re: Orphan files

Hi,

You can set the cleaner to delete orphaned content that is older than one day at the minimum.  If you don't have enough space to handle two days' worth of generated content, then you'll need to delete as soon as the node is deleted.

You can delete the content when the node is deleted, as long as you first check that no other nodes are using the content.  This situation can arise if nodes are copied or versioned.  It's not too difficult to check, but you'll have to hook into the persistence layer to execute a query similar to the one done by the cleaner job.  There'll be a performance hit in this case, so ensure that you set up a good index on the node_properties.string_value column.  I've raised a task to ensure that the URL goes into the Lucene index (http://www.alfresco.org/jira/browse/AR-458); this will make the search much faster and make this approach more feasible.  I'd recommend doing the task in the background, i.e. farm the potential orphan URLs out to a list that gets dealt with post-transaction.
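To make that concrete, here is a rough sketch of the background approach (hypothetical helper methods; the real reference check would be a query against node_properties, as the cleaner job does):

    import java.io.File;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class OrphanContentReaper implements Runnable {
        private final Queue<String> candidateUrls = new ConcurrentLinkedQueue<String>();

        // Called during the transaction whenever content is replaced or a
        // node is deleted: just remember the old URL, do nothing yet.
        public void farmOut(String contentUrl) {
            candidateUrls.add(contentUrl);
        }

        // Run after the transaction commits, e.g. from a background thread.
        @Override
        public void run() {
            String url;
            while ((url = candidateUrls.poll()) != null) {
                if (countReferences(url) == 0) {
                    new File(toFilePath(url)).delete();
                }
            }
        }

        // Placeholder: in practice, a query on node_properties.string_value
        // matching the content URL, like the cleaner job performs.
        private int countReferences(String contentUrl) {
            return 0;
        }

        // Placeholder: map the store URL to a physical file path.
        private String toFilePath(String contentUrl) {
            return contentUrl;
        }
    }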

Perhaps we should build this functionality into the system as optional behaviour.  Most systems don't have such space limitations, but those that do …

How does the content compress?  It would be possible to move older content to a compressed store that will still be accessible to the server.  It can sit there for a while until the cleaner comes around and decides that it must go.  This multi-store support is only part of the Enterprise, though.
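Just to illustrate the idea (this is only a toy sketch, not the Enterprise multi-store code): older content can be gzipped in place while readers still get an ordinary stream back:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class CompressedStore {
        // Move a cold file into compressed storage and drop the original.
        public static File compress(File plain) throws IOException {
            File gz = new File(plain.getPath() + ".gz");
            InputStream in = new FileInputStream(plain);
            OutputStream out = new GZIPOutputStream(new FileOutputStream(gz));
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
                out.close();
            }
            plain.delete();
            return gz;
        }

        // Readers still get a plain InputStream; decompression is transparent.
        public static InputStream open(File gz) throws IOException {
            return new GZIPInputStream(new FileInputStream(gz));
        }
    }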

Regards