AVM Metadata Extraction

resplin · 6 Jun 2015

The official documentation is at: http://docs.alfresco.com

XML Meta-data Extractor Configuration for WCM

Most document mimetypes are application specific, and the meta-data that can be extracted from them is usually well known. On the other hand, formats such as XML, CSV and TXT are not unique to any application. Because an XML file can conform to any number of XSDs, no single extractor can handle every XML file. Some other functionality is needed to inspect the file and select the appropriate metadata extractor. ContentWorkerSelector, which extends the ContentSelector interface, finds the correct metadata extractor to use for a specific XML file.

A factory, such as XPathContentWorkerSelector, is responsible for XML introspection. It examines the physical content to determine which ContentWorker should be used to process the document. This mechanism can be used to implement transformers and injectors for generic formats such as XML.

The V2.1 ContentTransformer and MetadataExtracter implementations extend the ContentWorker interface.

This section shows how to

Activate meta-data extraction for WCM.

Configure an extractor to handle XML content.

Activate sample:

<extension-samples>/alfresco/extension/wcm-xml-metadata-extracter-context.xml.sample

Open Javadocs for:

org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter

org.alfresco.repo.content.selector.XPathContentWorkerSelector

org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter

Activating Meta-data Extraction for WCM

Out of the box, the Web Content Management framework does not have metadata extraction enabled. Additionally, V2.1 has no rules support in WCM. Because of this, meta-data extraction must be activated manually if it is needed in WCM. The first thing to decide is whether the set of registered extractors for WCM must be the same as those available to the Alfresco Document Management framework. In our sample, this has not been done; the WCM framework (called AVM) defines its own extractors, specifically only a single extractor:

<extension-samples>/alfresco/extension/wcm-xml-metadata-extracter-context.xml.sample



...

<bean id='avmMetadataExtracterRegistry'

class='org.alfresco.repo.content.metadata.MetadataExtracterRegistry' />

...

<property name='invokePolicies'>

<value>true</value>

</property>

...

<property name='metadataExtracterRegistry'>

<ref bean='avmMetadataExtracterRegistry' />

</property>

...

</bean>

...

The avmNodeService fires content creation and update policies that are listened for by the avmMetadataExtracter, which initiates meta-data extraction for the new or updated content.

XML Meta-data Extraction

There are many ways to extract meta-data from XML documents. A common way to do this is to issue XPath statements against the document, as implemented by the XPathMetadataExtracter. In this example, we create an unregistered XPathMetadataExtracter to pull values from an Alfresco Model XML file. Because this class can be used to extract values from any type of XML, the actual XPath statements that must be issued, and the keys their values must be stored against, are set in the configuration:



...

<property name='xpathMappingProperties'>

<bean class='org.springframework.beans.factory.config.PropertiesFactoryBean'>

<property name='properties'>

<props>

<prop key='namespace.prefix.fm'>http://www.alfresco.org/model/forum/1.0</prop>

<prop key='author'>/model/author/text()</prop>

<prop key='title'>/model/@name</prop>

<prop key='description'>/model/description/text()</prop>

<prop key='version'>/model/version/text()</prop>

</props>

</property>

</bean>

</property>

...

Note that the property namespace.prefix.fm introduces the 'fm' shortcut for the http://www.alfresco.org/model/forum/1.0 namespace. If your extractor uses multiple content models, you may need multiple properties that map the namespaces to their shorter equivalents.
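The split between namespace declarations and XPath mappings in the properties above can be pictured with a short Python sketch. It is an illustration only, not the Alfresco implementation: the function name split_mapping_properties and the constant NAMESPACE_KEY_PREFIX are hypothetical, and the sketch assumes that any key starting with 'namespace.prefix.' declares a prefix while every other key is an XPath mapping.

```python
# Hypothetical helper: separates prefix declarations from XPath mappings.
# Assumption: keys starting with "namespace.prefix." declare a namespace prefix.
NAMESPACE_KEY_PREFIX = "namespace.prefix."


def split_mapping_properties(props):
    """Return (namespaces, xpaths) from a flat properties dictionary."""
    namespaces = {}
    xpaths = {}
    for key, value in props.items():
        if key.startswith(NAMESPACE_KEY_PREFIX):
            # "namespace.prefix.fm" -> prefix "fm"
            namespaces[key[len(NAMESPACE_KEY_PREFIX):]] = value
        else:
            xpaths[key] = value
    return namespaces, xpaths


props = {
    "namespace.prefix.fm": "http://www.alfresco.org/model/forum/1.0",
    "author": "/model/author/text()",
    "title": "/model/@name",
}
namespaces, xpaths = split_mapping_properties(props)
```

With several content models, each one would contribute its own 'namespace.prefix.' entry to the same properties set.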

These properties can be set using a classpath lookup of a properties file. The XPathMetadataExtracter will, for example, extract a value using the XPath expression '/model/author/text()' and store it against the key 'author'. The normal mapping mechanism then kicks in to transfer the value from 'author' to 'cm:author':



...

<prop key='author'>cm:author</prop>

...
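The two-step flow described above (XPath value stored against a raw key, then the raw key mapped to a system property) can be sketched in Python. This is a simplified illustration and not the XPathMetadataExtracter code. The standard library has no full XPath engine, so the hypothetical evaluate_simple_xpath helper handles only absolute paths that end in text() or @attribute and ignores namespaces. The sample document, its values, and the choice to drop raw keys that have no mapping are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, un-namespaced stand-in for an Alfresco model file.
MODEL_XML = """<model name="my:sampleModel">
  <description>Sample model</description>
  <author>Jane Doe</author>
  <version>1.0</version>
</model>"""


def evaluate_simple_xpath(root, xpath):
    """Evaluate a tiny XPath subset: /root/child/text() and /root/@attr."""
    steps = xpath.strip("/").split("/")
    selector = steps.pop()              # "text()" or "@attr"
    if not steps or steps[0] != root.tag:
        return []                       # path does not start at this root
    child_path = "/".join(steps[1:])
    elements = root.findall(child_path) if child_path else [root]
    if selector == "text()":
        return [el.text for el in elements if el.text]
    if selector.startswith("@"):
        name = selector[1:]
        return [el.get(name) for el in elements if el.get(name) is not None]
    return []                           # anything else is unsupported here


def extract_raw(xml_text, xpath_mappings):
    """Step 1: run each XPath and store the result against its raw key."""
    root = ET.fromstring(xml_text)
    raw = {}
    for key, xpath in xpath_mappings.items():
        values = evaluate_simple_xpath(root, xpath)
        if values:
            raw[key] = values[0] if len(values) == 1 else values
    return raw


def map_to_system_properties(raw, property_mappings):
    """Step 2: move raw keys to system property names; unmapped keys are dropped."""
    return {property_mappings[k]: v for k, v in raw.items() if k in property_mappings}


xpath_mappings = {
    "author": "/model/author/text()",
    "title": "/model/@name",
    "description": "/model/description/text()",
    "version": "/model/version/text()",
}
raw = extract_raw(MODEL_XML, xpath_mappings)
system = map_to_system_properties(raw, {"author": "cm:author"})
```

Only 'author' is mapped here because it is the one mapping the sample shows; further entries would follow the same pattern.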

Selecting XML Extractor

Of course, this is an unregistered extractor and won't be called by anything. On top of this, we have to ensure that the XML documents that do get passed to it are valid or the XPath expressions won't work at all. When the system needs to extract meta-data from an XML document, it uses the mimetype text/xml to request an extractor. The only registered extractor in this case is the extracter.xml.sample.XMLMetadataExtracter:



<bean id='extracter.xml.sample.XMLMetadataExtracter'

class='org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter'

parent='baseMetadataExtracter'>

<property name='registry'>

<ref bean='avmMetadataExtracterRegistry' />

</property>

<property name='overwritePolicy'>

<value>EAGER</value>

</property>

<property name='selectors'>

<list>

<ref bean='extracter.xml.sample.selector.XPathSelector' />

</list>

</property>

</bean>

Note that the overwrite policy is applied by the extracter.xml.sample.XMLMetadataExtracter - in this case it is set to EAGER and any property coming from the document will overwrite the corresponding property in the system metadata.

The XMLMetadataExtracter passes the document to the extracter.xml.sample.selector.XPathSelector bean, which in turn executes a sequence of XPath statements to determine which specific extractor will do the work:



<bean id='extracter.xml.sample.selector.XPathSelector'

class='org.alfresco.repo.content.selector.XPathContentWorkerSelector'

init-method='init'>

<property name='workers'>

<map>

<entry key='/my:test'>

<null />

</entry>

<entry key='/model'>

<ref bean='extracter.xml.sample.AlfrescoModelMetadataExtracter' />

</entry>

</map>

</property>

</bean>

If the document has a <model> element, then the extracter.xml.sample.AlfrescoModelMetadataExtracter bean will be passed back to be used by the XMLMetadataExtracter.
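The selection step can be pictured with a short Python sketch. It is not the XPathContentWorkerSelector implementation: the sketch only compares the root element name against keys of the form /name, it does not resolve namespace prefixes, and workers are represented by their bean names as plain strings. The select_worker function and the decision to return None for unparseable input are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Mirrors the 'workers' map above; insertion order is the evaluation order.
WORKERS = {
    # Prefixes are not resolved in this sketch, so this entry never matches here.
    "/my:test": None,
    "/model": "extracter.xml.sample.AlfrescoModelMetadataExtracter",
}


def select_worker(xml_text, workers):
    """Return the worker for the first key that matches the root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Assumption: content that is not well-formed XML selects no worker.
        return None
    for xpath, worker in workers.items():
        if xpath.strip("/") == root.tag:
            return worker
    return None


selected = select_worker('<model name="sample"/>', WORKERS)
```

A document whose root element matches none of the keys selects no worker, so in this sketch no extraction would take place for it.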

Overwrite Policies

In the example above, the EAGER policy was used. There are a total of three overwrite policy options available:

CAUTIOUS - This policy only puts the extracted value if there is no value (null or otherwise) in the properties map.

EAGER - This policy puts the new value if the extracted property is not null.

PRAGMATIC - This policy puts the new value if the extracted property is not null and at least one of the following holds: there is no target key for the property, the target value is null, or the string representation of the target value is an empty string.
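The three policies can be compared side by side with a small Python sketch. It follows the descriptions above rather than the Alfresco source. The apply_policy function and the sample property values are hypothetical, and skipping null extracted values under every policy is an assumption.

```python
from enum import Enum


class OverwritePolicy(Enum):
    CAUTIOUS = "CAUTIOUS"
    EAGER = "EAGER"
    PRAGMATIC = "PRAGMATIC"


def apply_policy(policy, extracted, target):
    """Return a copy of target updated from extracted according to policy."""
    result = dict(target)
    for key, new_value in extracted.items():
        if new_value is None:
            continue  # assumption: a null extracted value never overwrites
        if policy is OverwritePolicy.EAGER:
            result[key] = new_value
        elif policy is OverwritePolicy.CAUTIOUS:
            # Only fill in keys that are entirely absent from the target.
            if key not in target:
                result[key] = new_value
        elif policy is OverwritePolicy.PRAGMATIC:
            current = target.get(key)
            # Fill in keys that are absent, null, or empty strings.
            if key not in target or current is None or str(current) == "":
                result[key] = new_value
    return result


# Hypothetical existing properties and extracted values.
target = {"cm:title": "Old title", "cm:description": "", "cm:author": None}
extracted = {
    "cm:title": "New title",
    "cm:description": "New description",
    "cm:author": "Jane Doe",
}

eager = apply_policy(OverwritePolicy.EAGER, extracted, target)
cautious = apply_policy(OverwritePolicy.CAUTIOUS, extracted, target)
pragmatic = apply_policy(OverwritePolicy.PRAGMATIC, extracted, target)
```

In this sketch EAGER replaces all three properties, CAUTIOUS changes nothing because every key already exists in the target, and PRAGMATIC keeps the existing title while filling in the empty description and the null author.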

Multivalued Properties

The XPathMetadataExtracter has some advantages over other, simpler extractors. The XPath statements can return multiple string values. For example, if an XML document looked as follows:



<article>

...

<keywords>KW1</keywords>

<keywords>KW2</keywords>

<keywords>KW3</keywords>

...

</article>

Then the following XPath statement:



<prop key='keywords'>/article/keywords/text()</prop>

will return multiple results. These are translated into an array of string values, which can then be mapped into a multi-valued system property.
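The multi-valued case can be reproduced with a short Python sketch using the article document above (with the elided parts left out). It is an illustration, not the extractor's code: the extract_values helper is hypothetical, it looks up direct children by tag name instead of evaluating real XPath, and returning a single match as a plain string rather than a one-element list is an assumption.

```python
import xml.etree.ElementTree as ET

ARTICLE_XML = """<article>
  <keywords>KW1</keywords>
  <keywords>KW2</keywords>
  <keywords>KW3</keywords>
</article>"""


def extract_values(xml_text, child_tag):
    """Collect the text of every matching child of the root element."""
    root = ET.fromstring(xml_text)
    values = [el.text for el in root.findall(child_tag) if el.text]
    if not values:
        return None                      # nothing matched
    # Assumption: one match stays a scalar, several become a list.
    return values[0] if len(values) == 1 else values


keywords = extract_values(ARTICLE_XML, "keywords")
```

The resulting list is the kind of value that would be stored against the 'keywords' key before being mapped to a multi-valued property.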

Testing

To test the sample, add one of the Alfresco model files to a web project.