Bulk Import from a Filesystem


The Use Case



In any CMS implementation, an almost ubiquitous requirement is to load existing content into the new system.  That content may reside in a legacy CMS, on a shared network drive, on individual users' hard drives or in email, but the requirement is almost always there - to inventory the content that's out there and bring some or all of it into the CMS with a minimum of effort.



Alfresco provides several mechanisms that can be used to import content.


Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).



That said, most of these approaches suffer from one or more of the following limitations:



  • They require the content to be massaged into some other format prior to ingestion


  • Orchestration of the ingestion process is performed externally to Alfresco (i.e. out-of-process), resulting in excessive chattiness between the orchestrator and Alfresco


  • They require development or configuration work


  • They're more general in nature, and so aren't as performant as a specialised solution



An Opinionated (but High Performance!) Alternative



For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.



The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server - typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming - far more efficient than any kind of mechanism that requires network I/O.



How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it's also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.



Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).



Which leads into another advantage of this solution.  As with most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:



  1. Break up large volumes of writes into multiple batches - long running transactions are problematic for most transactional systems (including Alfresco).


  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder's modification timestamp.  [EDIT] In recent versions of Alfresco, the automatic update of a folder's modification timestamp (cm:modified property) has been disabled by default.  It can be turned back on (by setting the property 'system.enableTimestampPropagation' to true), but the default is false, so this is likely to have less of an impact on bulk ingestion than I'd originally thought.



The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
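
The batching strategy described above can be sketched roughly as follows.  This is a minimal illustration written in Python, not the tool's actual (Java) code: the function name plan_batches and the batch_size parameter are hypothetical, and each yielded batch simply stands in for one repository transaction.

```python
import os
from typing import Iterator, List, Tuple


def plan_batches(source_root: str, batch_size: int = 1000) -> Iterator[Tuple[str, List[str]]]:
    """Yield (folder, entries) pairs, one per intended transaction.

    Every folder gets its own batch(es); a folder with more than
    `batch_size` children is split into several consecutive batches.
    """
    for folder, subfolders, files in os.walk(source_root):
        # Sub-folders and files of a folder are planned together, so that
        # indirect updates to the parent come from as few transactions as possible.
        children = sorted(subfolders) + sorted(files)
        for start in range(0, len(children), batch_size):
            yield folder, children[start:start + batch_size]
```

With the default batch size, a folder holding 2,500 files would be planned as three batches (1,000 + 1,000 + 500), while each smaller folder would map onto a single batch.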

But What Does this Mean in Real Life?



The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.



Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server's hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server's hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn't it Do (Yet)?



Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that's currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The 'user experience' (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).



That said, the core logic is sound, and has been in production use for some time.  You may find that it's worth investigating even in its currently rough state.



[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community.  I'm chuffed that that's the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you've found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place.  Thanks!
87 Comments
Active Member
hi,

I just want to add that fme AG (an Alfresco Partner in Germany and my prior employer) offers a migration tool called migration-center.

This was developed for high speed & high load migration from filesystem to documentum. It is used by some well known companies.

migration-center will also be able to talk with alfresco, you'll simply have to implement a specific importer-Interface. If you want to import from another import source (e.g. another ecm repo) you can do the same by implementing a specific scanner-Interface.

cheers, jan
Active Member
Peter -



Thank you for your efforts bringing this key piece of functionality to the community.  As a vendor tasked with CMS implementation and regularly loading hundreds of thousands of files onto client installations, one of our challenges with Alfresco has been a reliable supported interface for importing pre-indexed scanned documents.  We look forward to working with your new tool.



Best,



John
Member II
John,



Good to hear!  I'm very keen to hear of your experiences with the tool.  If you have ideas for improvement or (heaven forbid! ;-) ) run into bugs, please don't hesitate to use the issue tracker in the Google Code project to track those.



Cheers,

Peter
Active Member
I developed a tool to upload documents to Alfresco along with their metadata, and tried to make the interface as simple as possible.  It generates an ACP file that can be imported into Alfresco automatically or manually.



http://forge.alfresco.com/projects/acpgenerator/
Member II
Rami, the issue with ACPs is that they're imported in a single transaction, so if the content set is large that approach will run afoul of the various issues with long running transactions.



The ACP approach also requires that the content is copied three times:



1. from disk into the ACP file

2. the ACP file itself is transferred (copied) from disk into the repository (which may occur over the network, introducing network I/O latencies into the process as well)

3. from the ACP file into the content store



The bulk filesystem import tool only incurs one of these copy costs - copying the files from disk into the content store (which is the bare minimum that Alfresco requires).



Still, for smaller content sets ACP files work just fine, and as you point out they have support for importing metadata today (which, at the time of writing, the bulk filesystem importer still lacks).
Active Member
If you bulk load into a space that has content rules applied, what happens?  Will the rules still fire?  If they do, is there a way to NOT use the metadata importers at all, and let the rules handle everything?



All in all, this is the solution to my prayers.  I've been trying to load 300,000 files to a repository via CIFS and Alfresco really hates that!
Member II
Keith, yes rules will still fire.  In fact that's the reason the original customer who sponsored this development didn't require metadata loading - they already had rules configured that synthesised their metadata.  It sounds like your use case (replacing CIFS + rules with metadata importer + rules) is identical to their case, so you should be in good shape.



The metadata loading functionality is optional - you can control via Spring configuration which (if any) of the metadata loader implementations are used.  As of v0.5 the 'basic' and 'properties file' metadata loaders are configured by default, but unless you create metadata files on disk alongside your original content, the property file metadata loading logic won't take effect.



The 'basic' metadata loader is required however, as it's responsible for correctly setting the type (cm:content vs cm:folder) of each node as it's created in the repository, as well as setting the cm:name and cm:title properties to the name of the file on disk.  CIFS is doing both of these things too, it's just that you don't really see it explicitly (CIFS makes the repository look like a filesystem, but under the covers it's actually populating some standard metadata properties, such as type, cm:name, cm:title, etc.).
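
As a rough illustration of what that 'basic' step amounts to, here is a minimal sketch in Python (the tool itself is Java, and the function name basic_metadata is hypothetical).  It only derives the three values mentioned above from a path on disk, and does not talk to a repository.

```python
import os
from typing import Dict


def basic_metadata(path: str) -> Dict[str, str]:
    """Derive the minimal metadata for a file or folder on disk."""
    # normpath strips any trailing separator so basename works for folders too.
    name = os.path.basename(os.path.normpath(path))
    return {
        "type": "cm:folder" if os.path.isdir(path) else "cm:content",
        "cm:name": name,
        "cm:title": name,
    }
```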



Anyway, I'm very keen to hear about your experiences with the tool - please keep us apprised of your progress!
Active Member
Hi Peter,



Thanks for the info.  I installed the tool and tested it out.  My initial test, leaving it configured as default, imported my file structure, along with filenames but all of my files were 0K length and were unable to be retrieved from the repository.



Any thoughts?  I turned on logging, but all I'm seeing are the 'Ingesting..' statements and the properties file metadata failures as I have no shadow files.



I changed the permissions on my source directory and files to 777, and no luck so far.  So close I can taste it!
Member II
Keith, I'd suggest raising this in the issue tracker in the Google Code project.  The more details you can provide (environment - OS, DB, Java; log file output; filesystem ownership information on the source content vs the user Alfresco is running as; the Alfresco user you're running the Web Scripts as, etc. etc.) the more likely it is that a possible explanation will suggest itself.
Active Member
Will do.  I've been unable to get this thing to work at all.
Member II
Yeah I suspect something is wrong with your installation or environment, given that it's in successful production use in at least one location and has been evaluated by a dozen or so other installations (that I'm aware of).



Once you create the issue, I'll take a look and see if anything obvious jumps out at me.
Member II
For anyone who's seeing similar issues, Keith reported his issue here.
Active Member
Peter,



Wanted to follow up and thank you for the quick turnaround on addressing the issue I discovered.  I'm using the tool to upload approximately 300K files to an instance of Alfresco Community.  It's been working great, and is helping out tremendously.  Up until I discovered your tool, I was dreading having to write one myself.  Thanks for making this available to the community, and I hope you have a prosperous and Happy New Year!
Active Member
Hi Peter,



We are in a project where the next urgent step is to bulk import documents into Alfresco, and we thought of using your solution.

Unfortunately I am not familiar with Maven and this is why I am not sure what to do at the first step of the installation process:

'1. Build the AMP file using Maven2 ('mvn clean package')'.



In the meantime I've installed apache-maven-2.2.1 on my computer; I hope this is the utility I need.

As for your explanations, I understood that I have to 'manually edit the pom.xml file in order to point Maven to either the Community Artifact repository (sponsored by SourceSense, one of Alfresco's European SI partners), or to a Maven repository I have to create that contains the Alfresco Enterprise artifacts'. Please explain this step in more detail, even if it is basic routine for you. Thank you in advance!
Member II
Mihaela, the Google Code page has a pre-built AMP file available for download that obviates the need to build the package yourself.
Active Member
I have installed the .amp file, but I am at a loss on how to use the actual functionality.  The readme file associated with the project states that the web script is available at /bulk/import/filesystem, yet I am unable to locate it.



Thanks!
Member II
Luke, Web Script URLs start with '/alfresco/service', so the fully qualified URL for the Web Script would be along the lines of:



http://myalfrescoserver:8080/alfresco/service/bulk/import/filesystem
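
Put another way, the full URL is just the server address, the '/alfresco/service' prefix and the Web Script path joined together.  A tiny Python sketch (the helper name web_script_url is hypothetical, and it assumes the webapp is deployed under the default 'alfresco' context):

```python
def web_script_url(server: str, script_path: str) -> str:
    """Join a server base URL, the Web Script prefix and a script path."""
    # Assumption: Alfresco is deployed under the default '/alfresco' context.
    return server.rstrip("/") + "/alfresco/service/" + script_path.lstrip("/")


print(web_script_url("http://myalfrescoserver:8080", "/bulk/import/filesystem"))
```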
Active Member
Hi Peter,



Thank you for the great tool. I am using version 0.6 on Alfresco Community 3.2r2.



I seem to have an issue with bulk importing to '/Company Home' - it gives a file not found error. However importing to '/Company Home/foldername' works as expected. Is this by design or is there a solution?



Also, are you planning (or is there already) the ability to maintain file dates?



Max
Member II
Max, I'd suggest raising an issue for the '/Company Home' problem in the issue tracker on the Google Code project - it sounds like a bug.



The issue regarding file dates is already tracked in the issue tracker as issue #4.  This is a rather more complex problem, as Java (at least until JSR-203 is implemented - currently slated for JDK 1.7) is unable to read filesystem metadata (including most file dates).
Active Member
I would love to use your tool as I need to load thousands of PDFs nightly and Alfresco Share hangs on both CIFS and FTP copies after about 600-1000 documents. I used MMT to install the supplied AMP file into alfresco.war.





I get the following error in alfresco.log:

14:32:53,670 ERROR [org.alfresco.web.scripts.AbstractRuntime] Exception from executeScript - redirecting to status template error: 02030011 Not implemented

org.alfresco.error.AlfrescoRuntimeException: 02030011 Not implemented

        at org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)







If I go through the webscripts, I get the following error.



Web Script Status 500 - Internal Error



The Web Script /alfresco/service/bulk/import/filesystem/initiate has responded with a status of 500 - Internal Error.



500 Description: An error inside the HTTP server which prevented it from fulfilling the request.



Message: 02030011 Not implemented

Exception: org.alfresco.error.AlfrescoRuntimeException - 02030011 Not implemented

org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.alfresco.repo.management.subsystems.ChainingSubsystemProxyFactory$1.invoke(ChainingSubsystemProxyFactory.java:95)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
$Proxy22.loadUserByUsername(Unknown Source)
net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.getUserFromBackend(DaoAuthenticationProvider.java:390)
net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.authenticate(DaoAuthenticationProvider.java:225)
net.sf.acegisecurity.providers.ProviderManager.doAuthentication(ProviderManager.java:159)
net.sf.acegisecurity.AbstractAuthenticationManager.authenticate(AbstractAuthenticationManager.java:49)
org.alfresco.repo.security.authentication.AuthenticationComponentImpl.authenticateImpl(AuthenticationComponentImpl.java:81)
org.alfresco.repo.security.authentication.AbstractAuthenticationComponent.authenticate(AbstractAuthenticationComponent.java:144)
org.alfresco.repo.security.authentication.AuthenticationServiceImpl.authenticate(AuthenticationServiceImpl.java:129)
org.alfresco.repo.security.authentication.AbstractChainingAuthenticationService.authenticate(AbstractChainingAuthenticationService.java:166)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
net.sf.acegisecurity.intercept.method.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:80)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:49)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.alfresco.repo.audit.AuditComponentImpl.audit(AuditComponentImpl.java:275)
org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:69)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
$Proxy26.authenticate(Unknown Source)
org.alfresco.repo.web.scripts.servlet.BasicHttpAuthenticatorFactory$BasicHttpAuthenticator.authenticate(BasicHttpAuthenticatorFactory.java:187)
org.alfresco.repo.web.scripts.RepositoryContainer.executeScript(RepositoryContainer.java:280)
org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:262)
org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:139)
org.alfresco.web.scripts.servlet.WebScriptServlet.service(WebScriptServlet.java:122)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
java.lang.Thread.run(Thread.java:619)



I'm not sure if I have not turned something on or if it is a security error. Any help would be appreciated.



Gary
Member II
Gary, can I suggest you raise this in the issue tracker in the Google Code project?  Thanks!
Active Member
I was able to run the script and import 100s of documents without any problem. However, as a next steps I have to associate some custom meta data. I tried to follow the instruction in the readme file but so far had no success. Can you please provide example of a metadata/properties file which also includes a custom content type and properties.



Thanks, Frank
Member II
Frank, the readme file goes into some detail on how and where to put the metadata properties files, and there's a simple metadata properties file example about halfway down.
Active Member
Hello Peter,

Many thanks. I got it to work. However, could it be that dates (e.g. 2012-05-23) are not supported?
Member II
Frank, currently the code relies on Alfresco to convert string values in the properties files into their correct data type in the repository.  I can't recall exactly which implicit data type conversions Alfresco supports natively, but there is a chance they are quite limited and don't extend to dates or date/times.



Regardless, this has been raised as a task in the issue tracker in the Google Code project - please feel free to look into this further if you have the time and interest, as I'm not sure when I will next have an opportunity to investigate it.
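
For anyone who wants to experiment, one possible approach is to convert date-like strings explicitly before they are handed over, rather than relying on implicit conversion.  The sketch below is in Python purely for illustration (the tool is Java, and the function name parse_date_value is hypothetical); it assumes values are written in the ISO 8601 style seen in this thread and leaves anything else untouched.

```python
import re
from datetime import date, datetime
from typing import Union

# Only strings shaped like YYYY-MM-DD (optionally followed by 'T' and a time)
# are treated as candidates, so plain numbers and text pass through unchanged.
_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
_DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}T.+")


def parse_date_value(raw: str) -> Union[date, datetime, str]:
    """Return a date or datetime for ISO-style input, else the original string."""
    text = raw.strip()
    try:
        if _DATE.fullmatch(text):
            return date.fromisoformat(text)
        if _DATETIME.fullmatch(text):
            return datetime.fromisoformat(text)
    except ValueError:
        pass  # looked like a date but was not a valid one (e.g. month 13)
    return raw
```

Falling back to the original string keeps the existing behaviour for every property that is not a date, so a conversion step like this could be tried without affecting other metadata.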
Active Member
[...] to further increase migration performance.  In the coming months, we will be looking to integrate concepts from Peter Monks’ work with the Alfresco Bulk Filesystem [...]
Active Member
Is it ready to bulk import into Alfresco 3.3? Can you mention somewhere which versions of Alfresco your tool can be used with? I did not find anything about this.
Member II
Bastiaan, the tested versions are mentioned in the readme file, although the tool is basically very simple and should work on all 3.x versions of Alfresco.  In fact it may even work on 2.x versions of Alfresco, but currently the AMP is configured to only allow installation on versions 3.0 and above as I've not tested on any 2.x release.
Active Member
Peter,

I've been testing the bulk importer, and it works well except that the 'update existing files' option doesn't seem to do anything. Has anyone else had trouble with this?
Member II
Susan, can I suggest you raise this in the issue tracker in the Google Code project?  It's far easier to manage / track in there.  Thanks!
Active Member
Thanks for this tool Peter!

I migrated 50 GB of data from a shared drive to Alfresco and it worked perfectly (the process took around 15 hours).

The only problem I had was with folders whose names ended with whitespace, so it might be a good idea to trim trailing spaces from names.

Great work anyway.
Member II
Arthur, good to hear!  Just out of interest, approximately how many files and folders were in the source content set?



I've raised an issue regarding the whitespace in the issue tracker - it's issue #33.
Active Member
Hi, I've tried this AMP and it works great! How can I add custom aspects in the metadata.properties file? Is this possible?
Member II
justin, the readme describes how to attach aspects (regardless of whether they're built-in or custom) to the ingested content - see line 81.
Active Member
Hi, I just used it and it works great. I personally like that when spaces and folders overlap and the spaces have rules on them, it works perfectly. Great job!
Member II
Thanks Savic!  Glad to hear you've had success with the tool!
Active Member
Hi Peter,

I have a fix for issue 4 'creation and modification dates', which I would like to share with you.

If you are interested, just contact me.

Regards,

Walter
Member II
Walter, if you could attach a patchfile to issue #4 in the issue tracker, I'll give it a review.



I should also point out that this isn't actually an issue with the bulk importer, but a bug in Alfresco that was recently fixed (see ALF-2565).
Active Member
Hi Peter,



Could you compile the latest head version of your source code? We really need the fix for issue #4, but we are still having difficulties compiling it with Maven.



Regards,



Tiur
Member II
Tiur, please review the existing issues in the issue tracker (there are several that are similar), and raise a new issue if appropriate.
Active Member
Hi Peter,



We've raised a new issue : http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=55



Could you kindly provide the AMP file for Alfresco 3.4? We are willing to make a donation to the project if you do this. We need this functionality ASAP, and it would be very helpful if you could help us.



Regards,



Tiur
Member II
Tiur, I've updated issue #55 with the current state of play.  Unfortunately the issue you're running into relates to the Community Maven repository (which I am in no way involved in supporting or maintaining) rather than the Bulk Filesystem Import Tool, so your best bet is to chase it up with them separately.



FWIW I've also reached out to them internally to try to find out what's going on, but you should chase it up directly as well as I rarely use that repository.
Active Member
Is it possible to use the bulk importer to load comments along with other metadata? Thanks.
Active Member
Hi Peter,



This is a fantastic tool. I was able to upload around 130GB of data in 2 - 3 hours...

I was surprised, since even over the intranet the transfer rate is 7Mbps... I have had no problems so far when using it with Alfresco and Share.

But when it comes to OpenOffice 3.2.0 with the Oracle connector, I was not able to access the data that resides in the folder.

Sorry for raising this issue here, but I want to know what the problem is.
Member II
Susan, it depends how comments are defined in the content model.



If they're a property of the node, then yes they can be loaded (although note that multi-valued properties are not yet supported - see issue #20).



If comments are stored as sub-nodes of the file then currently there's no way of loading that structure, since filesystems don't typically support files that are also folders (unlike Alfresco, which does support that structure).
Member II
chiru, can you describe the problem in more detail?  It doesn't sound like the issue you're seeing is related to the import tool, although it's a bit hard to tell from your description.
Active Member
Hi Peter,



I have been checking out your import tool and it looks good, but I have a question: is it possible at this moment to set a document's categories?



Thanks
Member II
Antonio, it's possible to set a single category (issue #19 in the issue tracker describes how this is done), but it's not yet possible to set multi-valued properties - that's issue #20.  Setting a single category isn't very useful, obviously, but I have not had time to look at issue #20.
Active Member
[...] to the source and target system.  It quickly became obvious (partly due to Peter Monks’ blog post on one approach) that there were a handful of options [...]
Active Member
Hi Peter,



We're in the process of upgrading Alfresco to 3.3.4 but they want to do a gradual upgrade (and decided to use Alfresco 3.2.* as an interim step). We had the bulk import working before with a custom aspect defined and when we tried to run it on 3.2 it failed with the following exception:



namespace prefix [prefix] is not mapped to a namespace uri



I should mention that applying the aspect manually to a file didn't produce any errors.



Any idea why the bulk import is complaining?



My metadata file is of the following structure:



type=cm:content

aspects=sensis:prod

cm\:title=09000001800699fe.pdf

cm\:description=Contract

sensis\:advertiserId=478283400

sensis\:campaignCode=N00Y

sensis\:generationDate=2005-05-31T12:00:00.000+10:00

sensis\:issue=26

cm\:storeName=storeA



Thanks in advance for any suggestions you might have.
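
A note on the file format used above: metadata files like this follow Java properties syntax, which is why the colons in the keys are written as '\:' while colons in the values are left alone.  The sketch below shows, in Python and for a simplified subset of that syntax only (no line continuations or unicode escapes), how such lines break down into qualified names and values.  The function name parse_metadata_properties is hypothetical and this is not the tool's own parsing code.

```python
import re
from typing import Dict

# The first '=' or ':' that is NOT preceded by a backslash ends the key.
_SEPARATOR = re.compile(r"(?<!\\)[=:]")


def parse_metadata_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key=value' lines, unescaping '\\:' style escapes in keys."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":  # skip blank lines and comments
            continue
        parts = _SEPARATOR.split(line, maxsplit=1)
        key = parts[0].strip()
        value = parts[1].strip() if len(parts) == 2 else ""
        # Drop the backslash from escapes such as '\:' so 'cm\:title' becomes 'cm:title'.
        props[re.sub(r"\\(.)", r"\1", key)] = value
    return props
```

Under these assumptions a line such as 'cm\:title=09000001800699fe.pdf' yields the key 'cm:title', and the prefix of every key (cm, sensis, and so on) must then resolve to a namespace registered in the repository's content model.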