Metadata Extraction

resplin · ‎6 Jun 2015

The official documentation is at: http://docs.alfresco.com

Content Modeling Core Repository Services
This document assumes knowledge of how to extend the repository configuration.

Introduction

Meta-data extractors offer server-side extraction of values from added or updated content.

System administrators can find definitions of the default set of extractors in

<WEB-INF>/classes/alfresco/content-services-context.xml.

Configuration options are detailed in the Javadocs for the org.alfresco.repo.content.metadata package.

Sample configurations are in

<extension-samples>/alfresco/extension/custom-metadata-extrators-context.xml.sample
<extension-samples>/alfresco/extension/wcm-xml-metadata-extracter-context.xml.sample.

Developers can study the default implementations contained in package:

org.alfresco.repo.content.metadata.

Note: The interface is named MetadataExtracter, although the correct spelling would be MetadataExtractor. Apart from class, interface, and bean names, this document uses the word extractor.

Functionality

One of the default actions that can be triggered in a space is Extract Common Metadata. This action will look at the mimetype of the document that triggered the rule and request an appropriate MetadataExtracter from the default MetadataExtracterRegistry. Each extractor is registered to handle a set of mimetypes.
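The lookup described above can be pictured with a small, self-contained sketch. It is not the Alfresco code: SketchExtracter and SketchExtracterRegistry are hypothetical stand-ins for MetadataExtracter and MetadataExtracterRegistry, and the real interfaces expose different methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for an extractor: it only declares which mimetypes it handles.
class SketchExtracter {
    final String name;
    final Set<String> mimetypes;

    SketchExtracter(String name, Set<String> mimetypes) {
        this.name = name;
        this.mimetypes = mimetypes;
    }

    boolean supports(String mimetype) {
        return mimetypes.contains(mimetype);
    }
}

// Hypothetical registry: extractors are registered once, then looked up by mimetype.
class SketchExtracterRegistry {
    private final List<SketchExtracter> extracters = new ArrayList<>();

    void register(SketchExtracter extracter) {
        extracters.add(extracter);
    }

    // Returns the first registered extractor that supports the mimetype, if any.
    Optional<SketchExtracter> getExtracter(String mimetype) {
        return extracters.stream()
                .filter(e -> e.supports(mimetype))
                .findFirst();
    }
}
```

In this sketch, a document whose mimetype matches no registered extractor simply yields an empty result, so the action has nothing to do.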

Before V2.1, the extractor pulled a set of values from the document and copied them directly into the document meta-data. If a property was declared as part of an aspect in the model, the aspect was also added to the document. When a property already existed, it was not overwritten by the extractor. Developers can look at org.alfresco.repo.content.metadata.AbstractMetadataExtracter.

V2.1 separated the extraction of values from the document and the setting of system properties. In V2.1, the extractor will pull a set of values from the document (the full list is declared in the javadocs of each class). The extractor uses a set of properties to map the extracted values to the document's meta-data. By default, the extractor will not overwrite any properties already present in the document's meta-data, but this can be changed by overriding the extractor's bean definition. Developers should look at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.
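To make the two steps concrete, here is a minimal sketch of the mapping stage, assuming plain string maps in place of Alfresco's typed properties. MappingSketch and its overwrite flag are hypothetical names used for illustration; they are not part of AbstractMappingMetadataExtracter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: applies already-extracted raw values to a node's properties.
class MappingSketch {
    // mapping: extracted key -> target property name (for example title -> cm:title).
    // Existing values are kept unless overwrite is true.
    static Map<String, String> apply(Map<String, String> rawValues,
                                     Map<String, String> mapping,
                                     Map<String, String> existing,
                                     boolean overwrite) {
        Map<String, String> result = new HashMap<>(existing);
        for (Map.Entry<String, String> entry : rawValues.entrySet()) {
            String target = mapping.get(entry.getKey());
            if (target == null) {
                continue; // extracted value has no mapping, so it is ignored
            }
            if (overwrite || !result.containsKey(target)) {
                result.put(target, entry.getValue());
            }
        }
        return result;
    }
}
```

With overwrite set to false, a cm:title already present on the node survives even if the document carries a different title; values with no mapping (generator, for instance) never reach the node.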

Configuration

Default Extractors

These are the extractors defined within <WEB-INF>/classes/alfresco/content-services-context.xml for V2.1:

  <bean id='extracter.PDFBox'        class='org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter'        parent='baseMetadataExtracter' />
  <bean id='extracter.Office'        class='org.alfresco.repo.content.metadata.OfficeMetadataExtracter'        parent='baseMetadataExtracter' />
  <bean id='extracter.Mail'          class='org.alfresco.repo.content.metadata.MailMetadataExtracter'          parent='baseMetadataExtracter' />
  <bean id='extracter.Html'          class='org.alfresco.repo.content.metadata.HtmlMetadataExtracter'          parent='baseMetadataExtracter' />
  <bean id='extracter.MP3'           class='org.alfresco.repo.content.metadata.MP3MetadataExtracter'           parent='baseMetadataExtracter' />
  <bean id='extracter.OpenDocument'  class='org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter'  parent='baseMetadataExtracter' />
  <bean id='extracter.OpenOffice'    class='org.alfresco.repo.content.metadata.OpenOfficeMetadataExtracter'    parent='baseMetadataExtracter' >
     <property name='connection'>
        <ref bean='openOfficeConnection' />
     </property>
  </bean>

Declaring a New Extractor

Assuming you have a new extractor written in class com.company.MyExtracter, you can declare the extractor:

<extension-config>/alfresco/extension/custom-repository-context.xml:

  <bean id='com.company.MyExtracter' class='com.company.MyExtracter' parent='baseMetadataExtracter' />

The parent bean (parent='baseMetadataExtracter') will register the extractor with the metadataExtracterRegistry bean. It will automatically be available for use by the Alfresco server to handle the mimetypes that your extractor declared.

Configuring the Extractor

Note: This applies to V2.1 only. Earlier versions were hardcoded.

We'll use the extracter.OpenDocument as an example of how to modify the configuration. This extractor handles all the OpenDocument formats using a connection to a headless OpenOffice process. Before reading more, open up the following:

The Javadocs for the class org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter


Meta-data extractor for the MIMETYPE_OPENDOCUMENT_XXX mimetypes. 

   creationDate:           --      cm:created
   creator:                --      cm:author
   date:
   description:            --      cm:description
   generator:
   initialCreator:
   keyword:
   language:
   printDate:
   printedBy:
   subject:
   title:                  --      cm:title
   All user properties

The sample file <extension-samples>/alfresco/extension/custom-metadata-extrators-context.xml.sample


    ...
    <bean id='extracter.OpenDocument'
          class='org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter'
          parent='baseMetadataExtracter' >
        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <props>
                <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
                <prop key='user1'>cm:description</prop>
            </props>
        </property>
    </bean>
    ...

Copy the extension sample to <extension-config>/alfresco/extension/custom-metadata-extrators-context.xml to activate the sample for your server. The Javadocs for the extractor give the list (on the left) of values extracted from the document. All these extracted values are put into a map, ready for conversion to model-specific properties. By default, the following will be populated by the extractor:

  creationDate:           --      cm:created
  creator:                --      cm:author
  description:            --      cm:description
  title:                  --      cm:title

Let's assume that a user property, user1, will be used by the Alfresco users to fill in the description of the documents they edit. The description field extracted by the extractor should be ignored and the user1 field used instead. We inherit all the other mappings and just modify how the user1 field is used.

Alternatively, you can put your changes in a properties file:

<extension-config>/alfresco/extension/custom-metadata-extrators-context.xml:


    <bean id='extracter.OpenDocument'
          class='org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter'
          parent='baseMetadataExtracter' >
        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <bean class='org.springframework.beans.factory.config.PropertiesFactoryBean'>
               <property name='location'>
                  <value>classpath:alfresco/extension/custom-opendocument-extractor-mappings.properties</value>
               </property>
            </bean>
        </property>
    </bean>

<extension-config>/alfresco/extension/custom-opendocument-extractor-mappings.properties:


namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
user1=cm:description
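As an illustration of how a mapping file of this shape can be read, the sketch below loads the text with java.util.Properties, collects the namespace.prefix.* entries, and expands each prefixed value into a {namespaceUri}localName string. MappingFileSketch is a hypothetical helper, and the expanded string form is an assumption made for the example; the extractor's own parsing code may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: resolves "prefix:localName" mappings against declared prefixes.
class MappingFileSketch {
    private static final String PREFIX_KEY = "namespace.prefix.";

    // Returns extracted key -> "{namespaceUri}localName".
    static Map<String, String> resolve(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // First pass: collect the prefix declarations.
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                namespaces.put(key.substring(PREFIX_KEY.length()), props.getProperty(key).trim());
            }
        }

        // Second pass: expand every remaining entry using those prefixes.
        Map<String, String> mapping = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                continue;
            }
            String value = props.getProperty(key).trim();
            int colon = value.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("No prefix in mapping: " + value);
            }
            String uri = namespaces.get(value.substring(0, colon));
            if (uri == null) {
                throw new IllegalArgumentException("Undeclared prefix in mapping: " + value);
            }
            mapping.put(key, "{" + uri + "}" + value.substring(colon + 1));
        }
        return mapping;
    }
}
```

Declaring the prefix in the same file keeps the mapping self-describing: a mapping that uses an undeclared prefix fails immediately instead of silently mapping nowhere.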

Inbound Data Type Conversions (V2.1.2E)

Document properties are generally extracted as Java String types, but this might not always be the case. When the properties are mapped to system properties, the extractor now explicitly performs a data type conversion to catch any failures at the point of extraction. If a property is defined in the data dictionary and its value cannot be converted to the required type, the value can either be discarded or cause the extraction to fail (the default is failure).


      ...
      <property name='failOnTypeConversion'>
         <value>false</value>
      </property>
      ...
   </bean>
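The effect of the flag can be sketched in a few lines. This is an illustration only: TypeConversionSketch is a hypothetical class, the requiredTypes map stands in for the data dictionary, and only Integer and Long targets are handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "discard or fail" when an extracted value cannot be converted.
class TypeConversionSketch {
    static Map<String, Object> convert(Map<String, String> extracted,
                                       Map<String, Class<?>> requiredTypes,
                                       boolean failOnTypeConversion) {
        Map<String, Object> converted = new HashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            Class<?> type = requiredTypes.get(entry.getKey());
            if (type == null || type == String.class) {
                // Not in the (stand-in) dictionary, or already a string: keep as-is.
                converted.put(entry.getKey(), entry.getValue());
                continue;
            }
            try {
                converted.put(entry.getKey(), convertOne(entry.getValue(), type));
            } catch (IllegalArgumentException e) {
                if (failOnTypeConversion) {
                    throw e; // the whole extraction fails
                }
                // Otherwise the unconvertible value is silently discarded.
            }
        }
        return converted;
    }

    private static Object convertOne(String value, Class<?> type) {
        // NumberFormatException is a subclass of IllegalArgumentException.
        if (type == Integer.class) {
            return Integer.valueOf(value.trim());
        }
        if (type == Long.class) {
            return Long.valueOf(value.trim());
        }
        throw new IllegalArgumentException("Unsupported target type: " + type);
    }
}
```

Failing early makes bad source data visible at extraction time; discarding trades that visibility for robustness when documents are known to carry unreliable values.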

Date Conversions (V2.1.2E)

Alfresco's default String to Date conversion uses the ISO 8601 format, i.e. sYYYY-MM-DDThh:mm:ss.sssTZD. Date strings found during meta-data extraction are seldom in this format. A list of alternative formats can be specified; these are used if the ISO 8601 conversion fails and the target system property is of type d:date or d:datetime.


   ...
   <bean id='extracter.xml.wsf.articleMetadataExtracter'
         class='org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter'
         parent='baseMetadataExtracter'
         init-method='init' >
      <property name='supportedDateFormats'>
         <list>
            <value>yyyy.MM.dd G 'at' HH:mm:ss z</value>
         </list>
      </property>
      ...

The formats are tried in order until one succeeds or all have failed. For the full list of options to describe the date formats, see the SimpleDateFormat Javadocs.
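The fallback order can be sketched with SimpleDateFormat directly. DateFallbackSketch is a hypothetical class, and ISO_PATTERN is only an approximation of Alfresco's ISO 8601 conversion, assumed here so that the example is self-contained.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: try an ISO 8601 style pattern first, then each configured format in order.
class DateFallbackSketch {
    // Assumption: stands in for the default ISO 8601 String to Date conversion.
    private static final String ISO_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX";

    static Date parse(String text, List<String> supportedDateFormats) {
        Date date = tryParse(text, ISO_PATTERN);
        if (date != null) {
            return date;
        }
        for (String pattern : supportedDateFormats) {
            date = tryParse(text, pattern);
            if (date != null) {
                return date; // first format that succeeds wins
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + text);
    }

    // Returns null unless the pattern matches the entire string.
    private static Date tryParse(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setLenient(false);
        ParsePosition position = new ParsePosition(0);
        Date date = format.parse(text, position);
        if (date == null || position.getIndex() != text.length()) {
            return null;
        }
        return date;
    }
}
```

Requiring the whole string to match avoids a short pattern near the top of the list accepting only a prefix of the value and masking a more specific format further down.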

Properties and Aspects

When an aspect-defined property is extracted and added to the document's metadata, the associated aspect is implicitly added. By default, any values already present in the metadata will remain, but it is possible to change this behaviour on a system-wide level by specifying that any properties not extracted should be removed from the target node. To do so, override the extract-metadata bean and set carryAspectProperties to false.

<configRoot>/alfresco/action-service-context.xml:


    <bean id='extract-metadata' class='org.alfresco.repo.action.executer.ContentMetadataExtracter' parent='action-executer'>
        ...
        <property name='carryAspectProperties'>
            <value>false</value>
        </property>
    </bean>

For example, if an aspect defines properties p:x and p:y but the document only contains p:x, then p:y will be removed from the target node.
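The p:x and p:y example can be traced with a small sketch. It assumes string-valued properties and a single aspect whose property names are passed in explicitly; CarryAspectPropertiesSketch is a hypothetical class, the overwrite policy is left out for brevity, and the real ContentMetadataExtracter works against the node service and data dictionary instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the carryAspectProperties switch for one aspect.
class CarryAspectPropertiesSketch {
    static Map<String, String> merge(Map<String, String> nodeProperties,
                                     Map<String, String> extracted,
                                     Set<String> aspectProperties,
                                     boolean carryAspectProperties) {
        Map<String, String> result = new HashMap<>(nodeProperties);
        result.putAll(extracted); // overwrite policy ignored in this sketch
        if (!carryAspectProperties) {
            // Remove aspect properties that the extraction did not supply.
            for (String property : aspectProperties) {
                if (!extracted.containsKey(property)) {
                    result.remove(property);
                }
            }
        }
        return result;
    }
}
```

Properties that do not belong to the aspect (cm:name, say) are untouched in either mode; only the aspect's own unextracted properties are affected.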

Usage Examples

JavaScript

The metadata extractor is not available as a root service in JavaScript, but it is available as an action.


var action = actions.create('extract-metadata');
action.execute(document);

Developing Your Own Metadata Extractor

Use the extracter.OpenDocument as a sample implementation:
- Source: org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter.java
- Default mapping: org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter.properties
Create your extractor com.my.company.MyExtractor:
- Source: com.my.company.MyExtractor.java
- Default mapping: com.my.company.MyExtractor.properties
Be sure to make the class's Javadocs reflect all the extracted values along with the default mappings.
