Bulk Importer


pmonks1

The official documentation is at: http://docs.alfresco.com



4.0


Introduction


Alfresco Community (since v4.0b) and Enterprise 4.0+ include a Bulk Importer that provides a mechanism for bulk importing content and metadata into a repository from the Alfresco server's filesystem. It is a fork of an old version of the Bulk Filesystem Import Tool hosted on GitHub.

Please note that this fork was modified from the original edition of the tool, and has never been rebased against it. As a result, it lags the original edition both functionally and from a stability perspective. For information on the original edition of the tool, please review the GitHub project.


What it does


The importer will (optionally) replace existing content items if they already exist in the repository, but does not perform deletes (i.e. it is not designed to fully synchronise the repository with the local filesystem). The basic on-disk file/folder structure is preserved verbatim in the repository. It is possible to load metadata for the files and spaces being ingested, as well as a version history for files (each version may consist of content, metadata or both).

There is support for two kinds of bulk import:


  • Streaming import: files are streamed into the repository content store by copying them in during the import.
  • In-place import: files are assumed to already exist within the repository content store, so no copying is required. This can result in a significant improvement in performance.

There are a number of restrictions:


  • No support for AVM.
  • Only one bulk import can be running at a time. This is enforced by the JobLockService.
  • Access to the bulk importer is by default restricted to Alfresco administrators.

Types of Bulk Import


In-Place Bulk Import (Enterprise Only)


Three assumptions are made when importing content 'in place' :


  • the content is already at its initial repository location prior to import, as it will *not* be moved during the import.
  • the in-place content must be within the tree structure of a registered content store, as defined by either :
    • the default fileContentStore.
    • a filesystem-based store defined by the content store selector.
  • steps have already been taken prior to import to ensure the content structure is well distributed :
    • the default fileContentStore distributes content based on the import date (year/month/day/hour/minute). This avoids having thousands of files under the same root, which is inefficient both for the filesystem and for computing parent associations in Alfresco (among other things).
    • As a rule of thumb, keep the number of immediate children to a few thousand at most.
    • To choose an efficient distribution scheme, note that when m files are randomly distributed into n leaf folders and m >> n log n, the maximum load of a leaf is, with high probability, m/n + O(sqrt((m log n)/n)).
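
The load estimate in the last point can be turned into a quick calculation when sizing a folder layout. The following is a minimal sketch, not part of the importer: the class and method names are hypothetical, and the constant hidden by the O(...) term is assumed to be sqrt(2) with a natural logarithm, so treat the result as a rough guide only.

```java
// Rough estimate of the fullest leaf folder when files are spread uniformly
// at random over leaf folders. Only meaningful when m >> n log n.
final class LeafLoadEstimate {

    // ASSUMPTION: the constant hidden by O(...) is taken as sqrt(2),
    // and "log" is the natural logarithm.
    static double estimateMaxLoad(long files, long leafFolders) {
        if (files <= 0 || leafFolders <= 1) {
            throw new IllegalArgumentException("need files > 0 and leafFolders > 1");
        }
        double average = (double) files / leafFolders;
        double deviation = Math.sqrt(2.0 * files * Math.log(leafFolders) / leafFolders);
        return average + deviation;
    }

    public static void main(String[] args) {
        long files = 1_000_000L;
        for (long leaves : new long[] {100, 1_000, 10_000}) {
            System.out.printf("%d leaf folders -> about %.0f files in the fullest one%n",
                    leaves, estimateMaxLoad(files, leaves));
        }
    }
}
```

If the estimate for a candidate layout is well above a few thousand, the layout probably needs more leaf folders.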

In addition, the in-place bulk importer supports the content store selector (http://wiki.alfresco.com/wiki/Content_Store_Selector), which lets you select the store under which the content to import is located.


Streaming Bulk Import (available in Community and Enterprise)


The source content is copied into the repository content store during the import. In all other respects, in-place and streaming bulk import are the same.


Importing


The batch processor performs the import using a configurable number of threads, in batches of a configurable size. Both values can be set either as repository defaults in alfresco-global.properties (the properties are 'bulkImport.batch.numThreads' and 'bulkImport.batch.batchSize') or entered in the GUI when performing an import (thereby overriding the defaults).
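
To illustrate how the two settings interact, here is a generic sketch of batch-and-thread processing. It is not Alfresco's BatchProcessor: the class and method names are hypothetical, and the per-batch work is reduced to counting items where the real importer would run a transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Generic illustration only; NOT Alfresco's BatchProcessor.
final class BatchingSketch {

    // Splits the items into consecutive batches of at most batchSize entries.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Runs every batch on a fixed pool of numThreads workers and
    // returns the number of items handled.
    static int processAll(List<String> paths, int numThreads, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        AtomicInteger handled = new AtomicInteger();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> batch : partition(paths, batchSize)) {
                // Placeholder for the per-batch work (one transaction in the importer).
                pending.add(pool.submit(() -> { handled.addAndGet(batch.size()); }));
            }
            for (Future<?> future : pending) {
                try {
                    future.get(); // wait for the batch to finish
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 95; i++) {
            paths.add("file-" + i + ".txt");
        }
        System.out.println(processAll(paths, 4, 40) + " items handled");
    }
}
```

In this sketch a larger batch size means fewer units of work in total, while more threads means more batches in flight at once.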

In order to make the bulk importer as fast as possible, it is recommended that the following properties are set:


  • system.usages.enabled=false : turn off content usages/quotas
  • system.enableTimestampPropagation=false : turn off modification timestamp propagation from child to parent nodes
  • index.subsystem.name=solr : ensure you are not using in-transaction indexing (Lucene) - instead favour the use of Solr.
  • set the number of threads (depending on the number of processors) and batch size appropriately for your system and number of files you are importing.

Preparing the Filesystem


Metadata Files


The bulk importer can load metadata (types, aspects and their properties) into the repository. This is accomplished using 'shadow' Java property files in XML format (not the old key=value format that was used in earlier versions of the tool - the XML format has significantly better support for Unicode characters). Each shadow properties file must have exactly the same name and extension as the file whose metadata it describes, plus the suffix '.metadata.properties.xml'. For example, if there is a file called 'IMG_1967.jpg', its shadow metadata file would be called 'IMG_1967.jpg.metadata.properties.xml'.

These shadow files can also be used for directories - e.g. if you have a directory called 'MyDocuments', the shadow metadata file would be called 'MyDocuments.metadata.properties.xml'.

The metadata file itself follows the usual syntax for Java XML properties files:


  <?xml version='1.0' encoding='UTF-8'?>
  <!DOCTYPE properties SYSTEM 'http://java.sun.com/dtd/properties.dtd'>
  <properties>
    <entry key='key1'>value1</entry>
    <entry key='key2'>value2</entry>
    ...
  </properties>


There are two special keys:


  • type - contains the qualified name of the content type to use for the file or folder
  • aspects - contains a comma-delimited list of the qualified names of the aspect(s) to attach to the file or folder

The remaining entries in the file are treated as metadata properties, with the key being the qualified name of the property and the value being the value of that property. Multi-valued properties are comma-delimited, but please note that these values are not trimmed so it's advisable to not place a space character either before or after the comma, unless you actually want that in the value of the property.

Here's a fully worked example for IMG_1967.jpg.metadata.properties.xml:



  <?xml version='1.0' encoding='UTF-8'?>
  <!DOCTYPE properties SYSTEM 'http://java.sun.com/dtd/properties.dtd'>
  <properties>
    <entry key='type'>cm:content</entry>
    <entry key='aspects'>cm:versionable,cm:dublincore</entry>
    <entry key='cm:title'>A photo of a flower.</entry>
    <entry key='cm:description'>A photo I took of a flower while walking around Bantry Bay.</entry>
    <entry key='cm:created'>1901-01-01T12:34:56.789+10:00</entry>
    <!-- cm:dublincore properties -->
    <entry key='cm:author'>Peter Monks</entry>
    <entry key='cm:publisher'>Peter Monks</entry>
    <entry key='cm:contributor'>Peter Monks</entry>
    <entry key='cm:type'>Photograph</entry>
    <entry key='cm:identifier'>IMG_1967.jpg</entry>
    <entry key='cm:dcsource'>Canon Powershot G2</entry>
    <entry key='cm:coverage'>Worldwide</entry>
    <entry key='cm:rights'>Copyright (c) Peter Monks 2002, All Rights Reserved</entry>
    <entry key='cm:subject'>A photo of a flower.</entry>
  </properties>
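
Because the shadow files use the standard Java XML properties format, they can be generated with java.util.Properties. The following is a minimal sketch under that assumption; the class and method names are hypothetical and it is not part of the importer. Note that Properties does not guarantee the order of the entries in the output.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper that writes a shadow metadata file next to a content file.
final class ShadowMetadataWriter {

    // Shadow file name = full content file name + ".metadata.properties.xml".
    static Path shadowFileFor(Path contentFile) {
        return contentFile.resolveSibling(
                contentFile.getFileName() + ".metadata.properties.xml");
    }

    // Writes the metadata in Java XML properties format, encoded as UTF-8.
    static Path write(Path contentFile, Properties metadata) {
        Path shadow = shadowFileFor(contentFile);
        try (OutputStream out = Files.newOutputStream(shadow)) {
            metadata.storeToXML(out, null, "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return shadow;
    }

    public static void main(String[] args) {
        Properties metadata = new Properties();
        metadata.setProperty("type", "cm:content");
        // Multi-valued entries are joined with a bare comma (no spaces),
        // because the importer does not trim the values.
        metadata.setProperty("aspects", String.join(",", "cm:versionable", "cm:dublincore"));
        metadata.setProperty("cm:title", "A photo of a flower.");
        Path shadow = write(Paths.get("IMG_1967.jpg"), metadata);
        System.out.println("Wrote " + shadow.toAbsolutePath());
    }
}
```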



Additional notes on metadata loading:


  • you cannot create a new node based on metadata only - you must have a content file (even if zero bytes) for the metadata to be loaded. That said, you may 'replace' an existing node in the repository with nothing but metadata - despite the confusing name, this won't replace the content - it will simply be decorated with the new metadata.
  • the metadata must conform to the type and aspect definitions configured in Alfresco (including mandatory fields, constraints and data types). Any violations will terminate the bulk import process. 
  • associations between content items loaded by the tool are not yet well supported. Associations to objects that are already in the repository can be created using the NodeRef of the target object as the value of the property.
  • non-string data types (including numeric and date types) have not been exhaustively tested. Date values have been tested and do work when specified using ISO8601 format.
  • updating the aspects or metadata on existing content will not remove any existing aspects not listed in the new metadata file - this tool is not intended to provide a full filesystem synchronisation mechanism
  • the metadata loading facility can be used to decorate content that's already in the Alfresco repository, without having to upload that content again. To use this mechanism, create a 'naked' metadata file in the same path as the target content file - the tool will match it up with the file in the repository and decorate that existing file with the new aspect(s) and/or metadata

Version History Files


The import tool also supports loading a version history for each file (Alfresco doesn't support version histories for folders). To do this, create a file with the same name as the main file, but append a 'v#' extension. For example:


  IMG_1967.jpg.v1   <- version 1 content
  IMG_1967.jpg.v2   <- version 2 content
  IMG_1967.jpg      <- 'head' (latest) revision of the content


This also applies to metadata files, if you wish to capture metadata history as well. For example:


  IMG_1967.jpg.metadata.properties.xml.v1   <- version 1 metadata
  IMG_1967.jpg.metadata.properties.xml.v2   <- version 2 metadata
  IMG_1967.jpg.metadata.properties.xml      <- 'head' (latest) revision of the metadata


Additional notes on version history loading:


  • you cannot create a new node based on a version history only - you must have a head revision of the file.
  • version numbers don't have to be contiguous - you can number your version files however you wish, provided you use whole numbers (integers)
  • the version numbers in your version files will not be used in Alfresco - the version numbers in Alfresco will be contiguous, starting at 1.0 and increasing by 1.0 for every version (so 1.0, 2.0, 3.0, etc.). Alfresco doesn't allow version labels to be set to arbitrary values (see issue #85), and currently the bulk import doesn't provide any way to specify whether a given version should have a major or minor increment (see issue #84)
  • each version can contain a content update, a metadata update, or both - you are not limited to updating everything for every version. If not included in a version, the prior version's content or metadata will remain in place for the next version.

Here's a fully fleshed out example, showing all possible combinations of content, metadata and version files:


  IMG_1967.jpg.v1                           <- version 1 content
  IMG_1967.jpg.metadata.properties.xml.v1   <- version 1 metadata
  IMG_1967.jpg.v2                           <- version 2 content
  IMG_1967.jpg.metadata.properties.xml.v2   <- version 2 metadata
  IMG_1967.jpg.v3                           <- version 3 content (content only version)
  IMG_1967.jpg.metadata.properties.xml.v4   <- version 4 metadata (metadata only version)
  IMG_1967.jpg.metadata.properties.xml      <- 'head' (latest) revision of the metadata
  IMG_1967.jpg                              <- 'head' (latest) revision of the content
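
When preparing a large source directory, it can help to check that every file name follows the convention above before starting an import. The sketch below classifies a file name as content or metadata, head or numbered version. It is an illustration of the naming rules only, not the importer's own parsing code, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the naming convention; NOT the importer's implementation.
final class ImportFileName {

    // <name>[.metadata.properties.xml][.v<integer>]
    private static final Pattern NAME = Pattern.compile(
            "^(.+?)(\\.metadata\\.properties\\.xml)?(?:\\.v(\\d+))?$");

    final String target;     // name of the node the file belongs to
    final boolean metadata;  // true for shadow metadata files
    final Integer version;   // null for the head revision

    private ImportFileName(String target, boolean metadata, Integer version) {
        this.target = target;
        this.metadata = metadata;
        this.version = version;
    }

    static ImportFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a valid file name: " + fileName);
        }
        Integer version = m.group(3) == null ? null : Integer.valueOf(m.group(3));
        return new ImportFileName(m.group(1), m.group(2) != null, version);
    }

    @Override
    public String toString() {
        return target + " [" + (metadata ? "metadata" : "content") + ", "
                + (version == null ? "head" : "v" + version) + "]";
    }

    public static void main(String[] args) {
        String[] names = {
            "IMG_1967.jpg.v1",
            "IMG_1967.jpg.metadata.properties.xml.v4",
            "IMG_1967.jpg.metadata.properties.xml",
            "IMG_1967.jpg"
        };
        for (String name : names) {
            System.out.println(name + " -> " + parse(name));
        }
    }
}
```

Grouping the parsed names by target makes it easy to spot version files that have no head revision, which the importer cannot load.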


Importing Using the UI


The two types of bulk import (streaming and in-place) each have a GUI (see the individual sections below), implemented using Alfresco web scripts.


Streaming


The streaming bulk import is exposed as a set of 2 Web Scripts:


  • A simple 'UI' Web Script that can be used to manually initiate an import. This is an HTTP GET Web Script with a path of:
       http://localhost:8080/alfresco/service/bulkfsimport 

  • An 'initiate' Web Script that actually kicks off an import, using parameters that are passed to it (for the source directory, target space, etc.). If you wish to script or programmatically invoke the tool, this is the Web Script you would need to call. This is an HTTP GET Web Script with a path of:
       http://localhost:8080/alfresco/service/bulkfsimport/initiate  

The 'UI' Web Script presents the following simplified HTML form:

BulkUploadStreaming.png


  • The 'Import directory' field is required and indicates the absolute filesystem directory to load the content and spaces from, in an OS-specific format. Note that this directory must be locally accessible to the server the Alfresco instance is running on - it must either be a local filesystem or a locally mounted remote filesystem (e.g. mounted using NFS, GFS, CIFS or similar).
  • The 'Target space' field is also required and indicates the target space to load the content into, as a path starting with '/Company Home'. The separator character is Unix-style (i.e. '/'), regardless of the platform Alfresco is running on. This field includes an AJAX auto-suggest feature, so you may type any part of the target space name, and an AJAX search will be performed to find and display matching items.
  • The 'Replace existing files' checkbox indicates whether to replace nodes that already exist in the repository (checked) or skip them (unchecked). Note that if versioning is enabled for a node, the node's existing content & metadata will be preserved as the prior version and the new content and/or metadata will be written into the head revision.
  • The 'Number of Threads' text field allows you to override the default number of threads (defined by the property 'bulkImport.batch.numThreads') to use in the bulk import.
  • The 'Batch Size' text field allows you to override the default batch size (the number of directories and files to import at a time, per transaction; defined by the property 'bulkImport.batch.batchSize') to use in the bulk import.
  • The 'Disable rules' checkbox allows you to turn off rule processing during the bulk import.

In-Place


The in-place bulk import is exposed as a set of 2 Web Scripts:


  • A simple 'UI' Web Script that can be used to manually initiate an import. This is an HTTP GET Web Script with a path of:
       http://localhost:8080/alfresco/service/bulkfsimport/inplace

  • An 'initiate' Web Script that actually kicks off an import, using parameters that are passed to it (for the source directory, target space, etc.). If you wish to script or programmatically invoke the tool, this is the Web Script you would need to call. This is an HTTP GET Web Script with a path of:
       http://localhost:8080/alfresco/service/bulkfsimport/initiate 

The in-place 'UI' Web Script presents the following simplified HTML form:

BulkUploadInPlace.png


  • The 'Import directory' field is required and indicates the absolute filesystem directory to load the content and spaces from, in an OS-specific format. Note that this directory must be locally accessible to the server the Alfresco instance is running on - it must either be a local filesystem or a locally mounted remote filesystem (e.g. mounted using NFS, GFS, CIFS or similar). This directory must already be inside an existing contentstore.
  • The name of the content store that holds the content, as defined within the storage configuration (content store selector or direct fileContentStore). The default store is named 'default'. An autocomplete popup assists in selecting the name as the first characters are entered. The 'Up' and 'Down' keyboard keys can be used to navigate the list, in addition to the mouse.
  • The 'Target space' field is also required and indicates the target space to load the content into, as a path starting with '/Company Home'. The separator character is Unix-style (i.e. '/'), regardless of the platform Alfresco is running on. This field includes an AJAX auto-suggest feature, so you may type any part of the target space name, and an AJAX search will be performed to find and display matching items.
  • The 'Replace existing files' checkbox indicates whether to replace nodes that already exist in the repository (checked) or skip them (unchecked). Note that if versioning is enabled for a node, the node's existing content & metadata will be preserved as the prior version and the new content and/or metadata will be written into the head revision.
  • The 'Number of Threads' text field allows you to override the default number of threads (defined by the property 'bulkImport.batch.numThreads') to use in the bulk import.
  • The 'Batch Size' text field allows you to override the default batch size (the number of directories and files to import at a time, per transaction; defined by the property 'bulkImport.batch.batchSize') to use in the bulk import.
  • The 'Disable rules' checkbox allows you to turn off rule processing during the bulk import.

Bulk Import Status


The Bulk Import status Web Script displays status information on the current import (if one is in progress), or the status of the last import that was initiated. This Web Script has both HTML and XML views, allowing external programs to programmatically monitor the status of imports. This is an HTTP GET Web Script with a path of:

       http://localhost:8080/alfresco/service/bulkfsimport/status
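
An external program could poll this Web Script along the following lines. This is a sketch under stated assumptions rather than a tested client: it assumes the XML view can be selected with the 'format=xml' argument, that HTTP Basic authentication is accepted, and that the credentials shown are placeholders. The class and method names are hypothetical, and Java 11 or later is required for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client for the bulk import status Web Script.
final class BulkImportStatusClient {

    // ASSUMPTION: "?format=xml" selects the XML view and Basic auth is accepted.
    static HttpRequest statusRequest(String baseUrl, String user, String password) {
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/service/bulkfsimport/status?format=xml"))
                .header("Authorization", "Basic " + credentials)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder base URL and credentials; replace with real values.
        HttpRequest request = statusRequest("http://localhost:8080/alfresco", "admin", "admin");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```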

The status web page is the same for both streaming and in-place import. It looks like this when the bulk importer is idle:

BulkUploadIdleStatus.png

The status is updated every five seconds when a bulk import has been initiated, as shown in these screenshots:

BulkUploadInProgress1.png
BulkUploadInProgress2.png

When the bulk import has finished, the following screen will be presented:

BulkUploadFinished.png


Importing Programmatically


Streaming




  // Assumes ctx (the Spring application context), transactionService, bulkImporter
  // and folderNode (the target space) are already available in the calling code.
  UserTransaction txn = transactionService.getUserTransaction();
  txn.begin();

  // Run the import as an administrator.
  AuthenticationUtil.setRunAsUser("admin");

  // Create a streaming importer for the source directory on the server's filesystem.
  StreamingNodeImporterFactory streamingNodeImporterFactory = (StreamingNodeImporterFactory)ctx.getBean("streamingNodeImporterFactory");
  NodeImporter nodeImporter = streamingNodeImporterFactory.getNodeImporter(new File("importdirectory"));

  // Configure the import and start it.
  BulkImportParameters bulkImportParameters = new BulkImportParameters();
  bulkImportParameters.setTarget(folderNode);
  bulkImportParameters.setReplaceExisting(true);
  bulkImportParameters.setBatchSize(40);
  bulkImportParameters.setNumThreads(4);
  bulkImporter.bulkImport(bulkImportParameters, nodeImporter);

  txn.commit();

          

In-Place




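  // Assumes ctx (the Spring application context), transactionService, bulkImporter
  // and folderNode (the target space) are already available in the calling code.
  UserTransaction txn = transactionService.getUserTransaction();
  txn.begin();

  // Run the import as an administrator.
  AuthenticationUtil.setRunAsUser("admin");

  // Create an in-place importer; here "default" names the content store
  // and "2011" is the directory to import.
  InPlaceNodeImporterFactory inPlaceNodeImporterFactory = (InPlaceNodeImporterFactory)ctx.getBean("inPlaceNodeImporterFactory");
  NodeImporter nodeImporter = inPlaceNodeImporterFactory.getNodeImporter("default", "2011");

  // Configure the import and start it.
  BulkImportParameters bulkImportParameters = new BulkImportParameters();
  bulkImportParameters.setTarget(folderNode);
  bulkImportParameters.setReplaceExisting(true);
  bulkImportParameters.setBatchSize(150);
  bulkImportParameters.setNumThreads(4);
  bulkImporter.bulkImport(bulkImportParameters, nodeImporter);

  txn.commit();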


Diagnostics


Logging - log statements for the batch processor can be enabled with 'log4j.logger.org.alfresco.repo.batch.BatchProcessor=info'. Enabling logging for the retrying transaction helper can also help to pinpoint transactional issues during the import; this can be done with 'log4j.logger.org.alfresco.repo.transaction.RetryingTransactionHelper=info'.