The official documentation is at: http://docs.alfresco.com
A Bulk Importer has been added to Alfresco Community (since v4.0b) and Enterprise 4.0+ that provides a mechanism for bulk importing content and metadata into a repository from the Alfresco server's filesystem. It is a fork of an old version of the Bulk Filesystem Import Tool hosted on GitHub.
Please note that this fork was modified from the original edition of the tool and has never been rebased against it. As a result, it lags behind the original edition in both functionality and stability. For information on the original edition of the tool, please review the GitHub project.
The importer will (optionally) replace existing content items if they already exist in the repository, but does not perform deletes (i.e. it is not designed to fully synchronise the repository with the local filesystem). The basic on-disk file/folder structure is preserved verbatim in the repository. It is possible to load metadata for the files and spaces being ingested, as well as a version history for files (each version may consist of content, metadata or both).
There is support for two kinds of bulk import: streaming and in-place.
There are a number of restrictions:
Three assumptions are made when importing content 'in place':
In addition, the in-place bulk importer provides support for the content store selector (http://wiki.alfresco.com/wiki/Content_Store_Selector). This allows you to select the store in which the content to be imported resides.
In a streaming import, the source content is copied into the repository content store during the import. In all other respects, in-place and streaming bulk imports behave the same.
The batch processor is used to perform the import using a configurable number of multiple threads in batches of a configurable size. The batch size and number of threads can be set either as default properties for the repository in alfresco-global.properties (the properties are 'bulkImport.batch.numThreads' and 'bulkImport.batch.batchSize') or entered in the GUI when performing an import (thereby overriding the defaults).
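For example, the repository-wide defaults could be set in alfresco-global.properties like this (the values shown are illustrative, not recommendations):

```properties
# Default settings for the bulk importer's batch processor (illustrative values)
bulkImport.batch.numThreads=4
bulkImport.batch.batchSize=100
```

Values entered in the GUI take precedence over these defaults for a given import run.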
To make the bulk importer as fast as possible, it is recommended that the following properties be set:
The bulk importer has the ability to load metadata (types, aspects and their properties) into the repository. This is accomplished using 'shadow' Java property files in XML format (not the old key=value format used in earlier versions of the tool; the XML format has significantly better support for Unicode characters). A shadow properties file must have exactly the same name and extension as the file whose metadata it describes, with the suffix '.metadata.properties.xml' appended. For example, if there is a file called 'IMG_1967.jpg', its shadow metadata file would be called 'IMG_1967.jpg.metadata.properties.xml'.
These shadow files can also be used for directories - e.g. if you have a directory called 'MyDocuments', the shadow metadata file would be called 'MyDocuments.metadata.properties.xml'.
The metadata file itself follows the usual syntax for Java XML properties files:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE properties SYSTEM 'http://java.sun.com/dtd/properties.dtd'>
<properties>
<entry key='key1'>value1</entry>
<entry key='key2'>value2</entry>
...
</properties>
There are two special keys: 'type' (the qualified name of the content type) and 'aspects' (a comma-delimited list of qualified aspect names).
The remaining entries in the file are treated as metadata properties, with the key being the qualified name of the property and the value being the value of that property. Multi-valued properties are comma-delimited. Note that these values are not trimmed, so avoid placing a space before or after the comma unless you actually want it included in the property value.
Here's a fully worked example for IMG_1967.jpg.metadata.properties.xml:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE properties SYSTEM 'http://java.sun.com/dtd/properties.dtd'>
<properties>
<entry key='type'>cm:content</entry>
<entry key='aspects'>cm:versionable,cm:dublincore</entry>
<entry key='cm:title'>A photo of a flower.</entry>
<entry key='cm:description'>A photo I took of a flower while walking around Bantry Bay.</entry>
<entry key='cm:created'>1901-01-01T12:34:56.789+10:00</entry>
<!-- cm:dublincore properties -->
<entry key='cm:author'>Peter Monks</entry>
<entry key='cm:publisher'>Peter Monks</entry>
<entry key='cm:contributor'>Peter Monks</entry>
<entry key='cm:type'>Photograph</entry>
<entry key='cm:identifier'>IMG_1967.jpg</entry>
<entry key='cm:dcsource'>Canon Powershot G2</entry>
<entry key='cm:coverage'>Worldwide</entry>
<entry key='cm:rights'>Copyright (c) Peter Monks 2002, All Rights Reserved</entry>
<entry key='cm:subject'>A photo of a flower.</entry>
</properties>
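Because shadow files use the standard Java XML properties format, they can be read back with java.util.Properties. The following sketch (the class name and abbreviated sample are illustrative, not part of the tool) shows how the entries of the example above parse:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ShadowMetadataExample {
    // Abbreviated version of the shadow metadata file shown above.
    public static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">"
      + "<properties>"
      + "<entry key=\"type\">cm:content</entry>"
      + "<entry key=\"aspects\">cm:versionable,cm:dublincore</entry>"
      + "<entry key=\"cm:title\">A photo of a flower.</entry>"
      + "</properties>";

    // Shadow files are standard Java XML properties, so loadFromXML handles them.
    public static Properties loadShadow(byte[] bytes) throws IOException {
        Properties p = new Properties();
        p.loadFromXML(new ByteArrayInputStream(bytes));
        return p;
    }

    public static void main(String[] args) throws IOException {
        Properties p = loadShadow(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println(p.getProperty("type"));       // cm:content
        // Multi-valued entries (e.g. aspects) are comma-delimited:
        for (String aspect : p.getProperty("aspects").split(",")) {
            System.out.println(aspect);
        }
    }
}
```

This can be a convenient way to sanity-check shadow files before running a large import.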
Additional notes on metadata loading:
The import tool also supports loading a version history for each file (Alfresco doesn't support version histories for folders). To do this, create a file with the same name as the main file, but append a '.v#' extension. For example:
IMG_1967.jpg.v1 <- version 1 content
IMG_1967.jpg.v2 <- version 2 content
IMG_1967.jpg <- 'head' (latest) revision of the content
This also applies to metadata files, if you wish to capture metadata history as well. For example:
IMG_1967.jpg.metadata.properties.xml.v1 <- version 1 metadata
IMG_1967.jpg.metadata.properties.xml.v2 <- version 2 metadata
IMG_1967.jpg.metadata.properties.xml <- 'head' (latest) revision of the metadata
Additional notes on version history loading:
Here's a fully fleshed out example, showing all possible combinations of content, metadata and version files:
IMG_1967.jpg.v1 <- version 1 content
IMG_1967.jpg.metadata.properties.xml.v1 <- version 1 metadata
IMG_1967.jpg.v2 <- version 2 content
IMG_1967.jpg.metadata.properties.xml.v2 <- version 2 metadata
IMG_1967.jpg.v3 <- version 3 content (content only version)
IMG_1967.jpg.metadata.properties.xml.v4 <- version 4 metadata (metadata only version)
IMG_1967.jpg.metadata.properties.xml <- 'head' (latest) revision of the metadata
IMG_1967.jpg <- 'head' (latest) revision of the content
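The naming convention above is regular enough to be checked mechanically. The following sketch (class and method names are hypothetical, not part of the tool's API) extracts the version number from a filename, returning -1 for a 'head' revision:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionName {
    // Matches the '.v#' suffix convention described above; applies equally to
    // content files and '.metadata.properties.xml' files.
    private static final Pattern VERSION_SUFFIX = Pattern.compile("(.+)\\.v(\\d+)$");

    // Returns the version number encoded in the filename, or -1 for a head revision.
    public static int versionOf(String filename) {
        Matcher m = VERSION_SUFFIX.matcher(filename);
        return m.matches() ? Integer.parseInt(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(versionOf("IMG_1967.jpg.v2"));                          // 2
        System.out.println(versionOf("IMG_1967.jpg.metadata.properties.xml.v1")); // 1
        System.out.println(versionOf("IMG_1967.jpg"));                             // -1
    }
}
```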
The two types of bulk import (streaming and in-place) each have a GUI (see the individual sections below), implemented using Alfresco web scripts.
The streaming bulk import is exposed as two Web Scripts:
http://localhost:8080/alfresco/service/bulkfsimport
http://localhost:8080/alfresco/service/bulkfsimport/initiate
The 'UI' Web Script presents the following simplified HTML form:
The in-place bulk import is exposed as two Web Scripts:
http://localhost:8080/alfresco/service/bulkfsimport/inplace
http://localhost:8080/alfresco/service/bulkfsimport/initiate
The in-place 'UI' Web Script presents the following simplified HTML form:
The Bulk Import status Web Script displays status information on the current import (if one is in progress), or the status of the last import that was initiated. This Web Script has both HTML and XML views, allowing external programs to programmatically monitor the status of imports. This is an HTTP GET Web Script with a path of:
http://localhost:8080/alfresco/service/bulkfsimport/status
The status web page is the same for both streaming and in-place import. It looks like this when the bulk importer is idle:
The status is updated every five seconds when a bulk import has been initiated, as shown in these screenshots:
[Screenshots: BulkUploadInProgress1.png, BulkUploadInProgress2.png]
When the bulk import has finished, the following screen will be presented:
// Streaming import: run within a transaction, as a user with sufficient privileges.
UserTransaction txn = transactionService.getUserTransaction();
txn.begin();
AuthenticationUtil.setRunAsUser("admin");

// Obtain the streaming node importer factory bean and point it at a local directory.
StreamingNodeImporterFactory streamingNodeImporterFactory =
        (StreamingNodeImporterFactory) ctx.getBean("streamingNodeImporterFactory");
NodeImporter nodeImporter = streamingNodeImporterFactory.getNodeImporter(new File("importdirectory"));

BulkImportParameters bulkImportParameters = new BulkImportParameters();
bulkImportParameters.setTarget(folderNode);   // repository folder to import into
bulkImportParameters.setReplaceExisting(true);
bulkImportParameters.setBatchSize(40);
bulkImportParameters.setNumThreads(4);
bulkImporter.bulkImport(bulkImportParameters, nodeImporter);

txn.commit();
// In-place import: the content already resides in a registered content store.
txn = transactionService.getUserTransaction();
txn.begin();
AuthenticationUtil.setRunAsUser("admin");

// The arguments select the content store and the relative path containing the
// in-place content.
InPlaceNodeImporterFactory inPlaceNodeImporterFactory =
        (InPlaceNodeImporterFactory) ctx.getBean("inPlaceNodeImporterFactory");
NodeImporter nodeImporter = inPlaceNodeImporterFactory.getNodeImporter("default", "2011");

BulkImportParameters bulkImportParameters = new BulkImportParameters();
bulkImportParameters.setTarget(folderNode);
bulkImportParameters.setReplaceExisting(true);
bulkImportParameters.setBatchSize(150);
bulkImportParameters.setNumThreads(4);
bulkImporter.bulkImport(bulkImportParameters, nodeImporter);

txn.commit();
Logging: progress statements from the batch processor can be enabled with 'log4j.logger.org.alfresco.repo.batch.BatchProcessor=info'. Enabling logging for the retrying transaction handler can also be useful to pinpoint any transactional issues during an import; this can be done with 'log4j.logger.org.alfresco.repo.transaction.RetryingTransactionHelper=info'.
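For example, both loggers can be enabled together in log4j.properties:

```properties
# Progress logging from the batch processor
log4j.logger.org.alfresco.repo.batch.BatchProcessor=info
# Diagnose transaction retries during the import
log4j.logger.org.alfresco.repo.transaction.RetryingTransactionHelper=info
```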