Bulk Import from a Filesystem


pmonks2
Member II

The Use Case



In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system.  That content may reside in a legacy CMS, on a shared network drive, on individual users' hard drives or in email, but the requirement is almost always there - to inventory the content that's out there and bring some or all of it into the CMS with a minimum of effort.



Alfresco provides several mechanisms that can be used to import content, including the CIFS, FTP and WebDAV protocols, ACP (Alfresco Content Package) import, and the various remote APIs (Web Services, CMIS and the like).


Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).



That said, most of these approaches suffer from one or more of the following limitations:



  • They require the content to be massaged into some other format prior to ingestion.


  • Orchestration of the ingestion process is performed externally (i.e. out-of-process) to Alfresco, resulting in excessive chattiness between the orchestrator and Alfresco.


  • They require development or configuration work.


  • They're more general in nature, and so aren't as performant as a specialised solution.



An Opinionated (but High Performance!) Alternative



For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: taking a set of folders and files on local disk and loading them into the repository as quickly and efficiently as possible.



The key assumption that allows this process to be efficient is that the source folders and files must be on a disk that is locally accessible to the Alfresco server - typically this will mean a filesystem located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to stream directly from disk into the repository - essentially disk-to-disk streaming, which is far more efficient than any mechanism that requires network I/O.



How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively, it's also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.



Another key differentiator of this solution is that all of the ingestion logic executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine-grained control over various expensive operations (such as transaction commits / rollbacks).



This leads into another advantage of this solution: as with most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:



  1. Break up large volumes of writes into multiple batches - long-running transactions are problematic for most transactional systems (including Alfresco).


  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder's modification timestamp.  [EDIT] In recent versions of Alfresco, the automatic update of a folder's modification timestamp (the cm:modified property) has been disabled by default.  It can be turned back on by setting the property 'system.enableTimestampPropagation' to true (see the snippet below), but the default is false, so this is likely to be less of an impact on bulk ingestion than I'd originally thought.
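
For reference, a minimal sketch of what re-enabling that behaviour looks like, assuming the property is set via alfresco-global.properties (only relevant if you actually want timestamp propagation back):

   # Re-enables propagation of cm:modified updates from children to their
   # parent folder (the default is false)
   system.enableTimestampPropagation=true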



The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
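
To make these strategies concrete, here is a minimal sketch of the batching approach using the public Alfresco foundation services.  This illustrates the technique rather than reproducing the tool's actual source: the class name is hypothetical, and MIME-type detection, metadata and error handling are omitted.

   import java.io.File;
   import java.util.List;

   import org.alfresco.model.ContentModel;
   import org.alfresco.repo.transaction.RetryingTransactionHelper;
   import org.alfresco.service.ServiceRegistry;
   import org.alfresco.service.cmr.model.FileFolderService;
   import org.alfresco.service.cmr.repository.ContentWriter;
   import org.alfresco.service.cmr.repository.NodeRef;

   public class FolderBatchImporter
   {
       private static final int BATCH_SIZE = 1000;   // matches the tool's default

       private final ServiceRegistry serviceRegistry;

       public FolderBatchImporter(final ServiceRegistry serviceRegistry)
       {
           this.serviceRegistry = serviceRegistry;
       }

       // Import the given files into the target space, using one transaction
       // per batch of at most BATCH_SIZE files, so no transaction runs too
       // long and all updates to the parent folder come from one transaction.
       public void importFolder(final NodeRef target, final List<File> files)
       {
           final RetryingTransactionHelper txnHelper =
               serviceRegistry.getTransactionService().getRetryingTransactionHelper();
           final FileFolderService fileFolderService = serviceRegistry.getFileFolderService();

           for (int start = 0; start < files.size(); start += BATCH_SIZE)
           {
               final List<File> batch =
                   files.subList(start, Math.min(start + BATCH_SIZE, files.size()));

               txnHelper.doInTransaction(() ->
               {
                   for (final File file : batch)
                   {
                       // Create the node, then stream its content straight off
                       // the local disk - disk-to-disk, no network I/O
                       final NodeRef node = fileFolderService
                           .create(target, file.getName(), ContentModel.TYPE_CONTENT)
                           .getNodeRef();
                       final ContentWriter writer = serviceRegistry.getContentService()
                           .getWriter(node, ContentModel.PROP_CONTENT, true);
                       writer.putContent(file);
                   }
                   return null;
               }, false, true);   // read-write, in a new transaction
           }
       }
   }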

But What Does this Mean in Real Life?



The benefit of this approach was demonstrated recently at an Alfresco implementation that had a bulk ingestion process regularly loading large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast, the bulk filesystem import tool took less than 5 minutes to ingest the same content set.



Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server's hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server's hard drive, providing a substantial (order-of-magnitude) overall saving.

What Doesn't it Do (Yet)?



Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that's currently missing; the most notable gap is the lack of support for metadata population (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The 'user experience' (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).



That said, the core logic is sound, and has been in production use for some time.  You may find that it's worth investigating even in its currently rough state.



[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community.  I'm chuffed that that's the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you've found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place.  Thanks!
87 Comments
pmonks2
Member II
Zoran, that error usually indicates that the content model containing the given namespace (almost certainly 'sensis' in this case) isn't registered with the repository.  That said, if you're able to attach the 'sensis:prod' aspect manually via the UI (Explorer or Share) then that pretty much rules that possibility out.

Would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly?  The above detail is good, but what would be even better would be the files you're using to register the content model with the repository (both the model file itself and the Spring application context that loads it), or a cut-down equivalent that also demonstrates the issue.  Thanks!
blog_commenter
Active Member
Hi



I just wanted to say that, thanks to this tool, we were able to upload 4.5 million documents in an Alfresco repository in only 4 days.

This would have taken weeks with WebDAV or FTP.



Thank you very much for this awesome tool
pmonks2
Member II
polgarine, that's great to hear - thanks for commenting!  Just out of interest, roughly how large (in MB / GB) were the documents in total?
blog_commenter
Active Member
Thanks a lot for this tool. Suppressing the blank in your readme's line

aspects=cm:versionable, custom:myAspect

or adapting your code near

((String)metadataProperties.get(key)).split(",")

might avoid some trouble.
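
A whitespace-tolerant variant of that split - a sketch of the suggested fix, not the project's actual change - would be:

   // Split on commas and discard surrounding whitespace, so that
   // "aspects=cm:versionable, custom:myAspect" parses cleanly
   String[] aspectNames = ((String) metadataProperties.get(key)).split("\\s*,\\s*");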
pmonks2
Member II
Leo, would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly?



I'd be particularly interested in knowing precisely what the behaviour is when the list of aspect names includes spaces (i.e. is an exception thrown, does the aspect fail to get applied, do incorrect aspects get applied, etc.).
blog_commenter
Active Member
Can you explain how the webscript you wrote gets access to the content that lives on the file system?  When I read the wiki regarding web scripts, an HTML input form is the only way shown to access file content; that is, it uploads the content via the form and then the web script has access to the form fields and the file content.  In a bulk file scenario where there isn't a UI, it's not obvious how to gain access to file content.  Can you enlighten us?  Thanks in advance.
pmonks2
Member II
Steve, the key is that the Web Script is reading the source content off the server's filesystem, not the client that initiated the import.  This is part of the reason that this is an administrator-only tool for now - it requires that the content be copied to a disk that's mounted to the server hosting Alfresco, prior to the tool being run (typically end-users wouldn't have direct filesystem access to the server(s) Alfresco is running on, so wouldn't be able to use this tool).



This isn't a problem for the tool's primary use case of course, which is around large scale content migration / ingestion.  It's unlikely that an end-user would be able to accomplish this unassisted anyway, even if the tool supported it.
blog_commenter
Active Member
Does this work on 64-bit Linux?  I ran apply_amps.sh and it appears the WARs (alfresco & share) were both corrupted; after replacing them with the backups, Tomcat boots up cleanly again.  Before that it crashed at startup  :-(
pmonks2
Member II
Diane, the tool is developed on 64-bit Mac OS X, which (from an Alfresco / Tomcat perspective) is basically the same as 64-bit Linux.  Did you try applying the AMP again?  My first suspicion would be that this issue was caused by a one-time glitch in the apply_amps process.
blog_commenter
Active Member
Is the tool compatible with the latest version of Alfresco (3.4d)?  I'm getting the following exception in the Tomcat logs:

Module 'org.alfresco.extension.alfresco-bulk-filesystem-import' version 0.11 is incompatible with the current repository version 3.4.0.

   The repository version required must be in range [3.3.0 : 3.3.99].

at org.alfresco.error.AlfrescoRuntimeException.create(AlfrescoRuntimeException.java:46)

at org.alfresco.repo.module.ModuleComponentHelper.startModule(ModuleComponentHelper.java:509)

at org.alfresco.repo.module.ModuleComponentHelper.access$400(ModuleComponentHelper.java:57)

at org.alfresco.repo.module.ModuleComponentHelper$1$1.execute(ModuleComponentHelper.java:239)

at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:381)

at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:272)

at org.alfresco.repo.module.ModuleComponentHelper$1.doWork(ModuleComponentHelper.java:260)

... 54 more
pmonks2
Member II
Fred, v0.11 of the bulk filesystem import tool was developed and tested against Alfresco v3.3 (the then latest release of Alfresco).  The module (AMP file) was therefore 'pinned' to version 3.3, resulting in the above error when installed on Alfresco v3.4 (or indeed any version other than 3.3.x).



You can manually override the supported version of the AMP by editing the module.properties within the AMP file, which will allow the tool to be installed on 3.4, but there's no guarantee it'll work.  I'm not aware of anything that would prevent it from working, but haven't verified it myself.
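
For example (a sketch - the exact values shipped in the AMP may differ), the version range is controlled by entries like these in module.properties; widening module.repo.version.max to 3.4.99 would permit installation on 3.4:

   # Repository versions the module will install against; the 0.11 AMP
   # was pinned to the 3.3.x range
   module.repo.version.min=3.3.0
   module.repo.version.max=3.3.99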



The next version of the tool will be built and tested against v3.4 but I don't have an ETA on it, unfortunately.
blog_commenter
Active Member
Hi Peter,

Thanks for getting back to me about the version issue.  I updated the configuration and it deployed just fine in Alfresco.  I was able to test out uploading a bunch of files successfully.  Right now I'm trying to figure out how to get the metadata import to work with our defined model.



I do have a question: is the purpose of this tool mainly for an individual to use to upload files?  Or could it be used by a job scheduler or called from within another application?
pmonks2
Member II
Fred, the import tool itself is currently exposed as two REST APIs (Web Scripts):

1. An 'initiate' API

2. A 'status' API





Both of these are invoked via HTTP GET† requests, which can be scripted from an external job scheduler (e.g. cron or at) or called from any external application that is capable of executing an HTTP GET request.  In addition, the status API can emit either HTML or XML, allowing external applications to poll the tool and obtain detailed status information.
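
For example, a minimal scripted invocation might look like this (the initiate URL and its parameters match those shown elsewhere in this thread; the status URL and its format parameter are assumptions, and the paths are hypothetical):

   # Kick off an import as an administrator
   curl -u admin:admin 'http://localhost:8080/alfresco/service/bulkfsimport/initiate?sourceDirectory=/data/staging&targetPath=/Company%20Home/Imported'

   # Poll for progress, requesting XML for machine consumption
   curl -u admin:admin 'http://localhost:8080/alfresco/service/bulkfsimport/status?format=xml'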



The UI Web Script that's used to manually initiate an import is little more than a convenience layer on top of these two REST APIs, and is not central to the operation of the tool itself.



† Technically I should have used HTTP POST or HTTP PUT for the 'initiate' API, in keeping with REST principles, but my pragmatic experience has been that HTTP GETs are far easier to call (particularly from within the browser, shell scripts etc.) provided minimal data is being passed in the request (as is the case here).  There's an enhancement request on this in the issue tracker.
blog_commenter
Active Member
Hey all,

Thanks for the tool - this adds another option for migration that, for larger jobs, will likely be much easier (as opposed to planning *days* of active migration with other approaches)!



A question, however, as I'm reviewing the options discussed here: http://forums.alfresco.com/en/viewtopic.php?f=9&t=38889 - does this bulk import from the filesystem work with Alfresco's content store such that the backlog, once put into Alfresco, is pre-separated in the contentstore, or is the entire backlog currently dumped into the current-year contentstore on the filesystem (alf_data/contentstore/2011/*** for example)?
blog_commenter
Active Member
Adding to my comment above: that was for Alfresco 3.4.d CE, so the next version above 0.11 would be helpful for myself as well as Fred Grafe.  I've added an issue in the tracker.
pmonks2
Member II
dhartford, currently the tool simply imports the content into Alfresco using whatever contentstore implementation (and therefore storage policy) that Alfresco instance is configured with.  So for example if the XAM connector is configured, binaries will be stored on the CAS device using a hashed id rather than the default 'timestamp hashbucket directory structure' approach.



For the use case described in the forum post, I'd suggest that using Content Storage Policies (see also this webinar) is a better approach, as it will provide the archival mechanism you require, independent of the underlying content store implementation.  Relying on the internal implementation details of a particular contentstore implementation (such as the timestamp hashbucket behaviour of the filesystem contentstore) is somewhat risky, as Alfresco reserves the right to change those implementation details at a later time.
blog_commenter
Active Member
Wow, http://wiki.alfresco.com/wiki/Content_Store_Selector is exactly what I was looking for -- thanks so much Peter! 



That will help address my specific requirement regardless of the backlog import tool used.  Just so you can have some numbers: my current review of using CMIS (admittedly, in a single-threaded/serial fashion) was only netting 2 transactions (images)/sec, so I do hope the bulk filesystem import will work on 3.4.d.  I haven't modified the module.properties to try 0.11 on it yet, but will let you know of any findings.
pmonks2
Member II
dhartford, good to hear!  Yeah 2 txns / sec is not great.  FWIW I see sustained throughput of 15 - 20 docs / sec using the tool on my 2009 MacBook Pro, using a vanilla Alfresco Enterprise 3.3 install, and in the past Alfresco has been demonstrated to handle sustained throughput of up to around 100 docs / sec (though that was on fairly beefy hardware).



I expect Alfresco 3.4 will be faster still (removing Hibernate gave the repository a noticeable performance bump across the board), and following the next (versioning) release, I intend to work on a couple of performance-focused enhancements that should also speed the tool up.  Issue #56 in particular has the potential to significantly improve the performance of imports.
blog_commenter
Active Member
Doh - the Content Store Selector approach complains about no 'storeSelectorContentStoreBase' being defined... it appears from the forums that this is an Enterprise-only feature, not available in the Community Edition.  Too bad, as this seems like a pretty common scenario that people would use if they knew more about it.
blog_commenter
Active Member
Hi Peter, we are analyzing your tool to import 260,000 records into a Records Management site.



Peter, have you ever used your tool for that purpose?



We started with a small sample of 1,000 Record Folders with some metadata and it took 24 minutes to ingest them.  That's not acceptable; we need to improve this speed.  When we tried to ingest those Record Folders as normal folders with their metadata into a normal Site, it took only 15 seconds.  We would like to get similar performance with the RM site.



We think that something is slowing down the process in the RM Site (auditing, maybe?).  We have already tried some tuning techniques described in the Alfresco documentation, without any luck.



Best regards,

Jordi
pmonks2
Member II
Jordi, I have not tested the tool against an RM site as the tool doesn't know (or care) what type of space the target is - it simply writes the source content to wherever you tell it to.  In other words, the performance discrepancy you're seeing is likely due to the repository, its configuration or the environment, rather than the tool itself.



Have you tried profiling / DB tracing while the import is in progress to try to find out what specifically is taking the extra time?
blog_commenter
Active Member
Hi Peter,



thank you for your reply.  We tried to use VisualVM to find the bottleneck, without luck.  However, we apparently solved the slowness with record ingestion in the RM Site.  As you know, the File Plan in an RM site has 4 levels (Series, Categories, Record Folders & Record Files).  In our first test we had created the shadow files ONLY for the Record Folders (skipping the Series and Categories).  Once the shadow files for the Series & Categories were created too, the tool worked much better: 5,000 records in 15 minutes (on my local deployment).



Another important piece of information that we discovered along the way, and that you could add to the documentation, is that you need to create the shadow files specifically in UTF-8.  The java.io.FileWriter class doesn't use UTF-8 by default (it uses the platform default encoding, ISO-8859-1 in our case), and this was generating a NullPointerException during execution.
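
A minimal sketch of writing a shadow file with an explicit UTF-8 encoding (the file name and the metadataXml string here are hypothetical):

   import java.io.File;
   import java.io.FileOutputStream;
   import java.io.OutputStreamWriter;
   import java.io.Writer;

   // Use an explicit UTF-8 OutputStreamWriter instead of java.io.FileWriter,
   // which silently uses the platform default encoding
   Writer out = new OutputStreamWriter(
           new FileOutputStream(new File("scan0001.tif.metadata.properties.xml")),
           "UTF-8");
   try
   {
       out.write(metadataXml);  // hypothetical String containing the shadow XML
   }
   finally
   {
       out.close();
   }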



Hope this information is useful,

Regards, Jordi
pmonks2
Member II
Jordi, I'd be very interested in seeing an example file that causes the NPE, along with the full stack trace for the UTF-8 vs ISO-8859 issue.  Would you mind raising an issue in the issue tracker?
blog_commenter
Active Member
Hi Peter,



I ran into an exception when trying an import with the 1.0 release.  The error seemed to happen on a multi-valued cm:taggable property in the metadata file.  I created a new issue as http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88.  Later on I found a related issue here

http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57. Do you mind taking a look?



Thanks.
pmonks2
Member II
Zhihai, I've updated issue #88 [1] with more information, and confirmed that it is indeed a duplicate of issue #57 [2].  Once you correct your metadata files, everything should work as expected.



[1] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88

[2] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57
pmonks2
Member II
I've just created a mailing list to help facilitate assistance with and discussion of the tool.  Please try to use that resource rather than commenting here, since blog comments aren't great for that kind of thing.  Thanks!
blog_commenter
Active Member
I'm trying to use the Alfresco Bulk Filesystem Import on Community Edition 3.4.d (RHEL platform).  After some trouble importing content - the source directory must be under /tomcat/bin (I'm sure this is not the standard way) - I didn't manage to import metadata.  I mixed the content files with their respective metadata files in the same hierarchical organization as in the repository.  As a result, the content is loaded but not the metadata; in fact, the metadata file is loaded into the repository like any other file.  Is this the correct way to load metadata?  The documentation is very succinct: it shows how to configure the tool but not how to use it.



Thanks for your help,





Additional information:



Custom content model file (it works well when used via Alfresco Explorer):

[The model XML was stripped by the blog software.  The surviving fragments describe a model ('Customizacao para o Laboratorio de Conversao de Midia', author Candido, dated 2011-09-29, version 1.0) defining a cm:content subtype titled 'Relacao de Lancamento por Lote' with a d:text property titled 'Microfilmagem'.]

A metadata file:

[Also stripped by the blog software; the surviving values are 'cm:digitalizacao01', '1234/1056-1078', 'Lote de Lancamento' and 'Roseli'.]
blog_commenter
Active Member
Peter, I'm sorry about the previous post: the XML samples I sent didn't render well in your blog.  If you want, I can send them as e-mail attachments.



Best regards,
pmonks2
Member II
Luiz, would you mind raising this in the mailing list?  That's a much better forum for discussing topics like this one.
blog_commenter
Active Member
Hi,

I was using your bulk import tool.  It works great.  I would like to know how we can declare the imported files as records.  In one of the books I read, you can specify cm:declareRecords in the aspects tag and then provide all the mandatory properties, and the file will be declared as a record.  But it doesn't seem to work for me.  Can someone help, please?



Pallavi
pmonks2
Member II
Pallavi, can I suggest you raise this on the project's mailing list?  That's a much better forum for discussing topics like this one.
blog_commenter
Active Member
Hi Peter,

I was using your bulk import tool.  It works great.

But I want to use the following command to automatically import metadata along with the folder contents:

curl -u admin:admin -d 'sourceDirectory=/Users/user/Documents/Nouveaudossier/metadata&targetPath=/Company%20Home/sites/test'  'http://localhost:8080/alfresco/service/bulkfsimport/initiate'



but every time the Web Script /alfresco/service/bulkfsimport/initiate responds with a status of 302.

Please, I need your help.



Best regards,
pmonks2
Member II
oubaid, please raise this on the project's mailing list.
blog_commenter
Active Member
Hi,



is there any limitation on the path length?  I have some paths that are longer than 256 characters.



Thanks.
pmonks2
Member II
Jean-Michel, please raise this on the project's mailing list.
blog_commenter
Active Member
When I tried to execute a bulk import, it failed because some files did not have valid names.  Is there any tool to normalize the names in an existing filesystem before importing?  Thanks
blog_commenter
Active Member
Please use the mailing list [1] for questions like this.  Thanks!

[1] http://groups.google.com/group/alfresco-bulk-filesystem-import