Most document mimetypes are application specific, so the meta-data that can be extracted from them is usually well known. On the other hand, formats such as XML, CSV and TXT are not unique to any application. Since an XML file can conform to any number of XSDs, extracting metadata from it cannot be done by a single extractor. Some other functionality is needed to inspect the file and select the appropriate metadata extractor. ContentWorkerSelector, which extends the ContentSelector interface, finds the correct metadata extractor to use for a specific XML file.
A factory, such as XPathContentWorkerSelector, is responsible for XML introspection. It examines the physical content to determine which ContentWorker should be used to process the document. This mechanism can be used to implement transformers and injectors for generic formats such as XML.
The V2.1 ContentTransformer and MetadataExtracter implementations extend the ContentWorker interface.
Out of the box, the Web Content Management framework does not have metadata extraction enabled. Additionally, V2.1 has no rules support in WCM. Because of this, meta-data extraction must be activated manually if it is needed in WCM. The first thing to decide is whether the set of registered extractors for WCM must be the same as those available to the Alfresco Document Management framework. In our sample, this has not been done; the WCM framework (called AVM) defines its own set of extractors, which contains only a single extractor:
The avmNodeService fires content creation and update policies that are listened for by the avmMetadataExtracter, which initiates meta-data extraction for the new or updated content.
XML Meta-data Extraction
There are many ways to extract meta-data from XML documents. A common way to do this is to issue XPath statements against the document, as implemented by the XPathMetadataExtracter. In this example, we create an unregistered XPathMetadataExtracter to pull values from an Alfresco Model XML file. Because this class can be used to extract values from any type of XML, the actual XPath statements that must be issued, and the keys that the resulting values must be stored against, are set in the configuration:
Note that the property namespace.prefix.fm introduces the 'fm' shortcut for the http://www.alfresco.org/model/forum/1.0 namespace. If your extractor uses multiple content models, you may need multiple properties that map the namespaces to their shorter equivalents.
These properties can be set using a classpath lookup of a properties file. The XPathMetadataExtracter will, for example, extract a value using the XPath expression '/model/author/text()' and store it against the key 'author'. The normal mapping mechanism then kicks in to transfer the value from 'author' to 'cm:author':
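As a rough, standalone illustration of the two steps just described (evaluate an XPath expression and store the result against a key, then map that key to a property name), the sketch below uses only the standard javax.xml.xpath API. It is not the XPathMetadataExtracter implementation: the class XPathExtractionSketch and its methods extractRaw and mapToProperties are hypothetical names, and property names are kept as plain strings rather than namespace-resolved QNames.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of Alfresco.
final class XPathExtractionSketch {

    // Step 1: evaluate each configured XPath and store the string result under its key.
    static Map<String, String> extractRaw(String xml, Map<String, String> xpathsByKey) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> raw = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : xpathsByKey.entrySet()) {
            try {
                // The document is re-parsed for every expression; simple, but not efficient.
                String value = xpath.evaluate(entry.getValue(),
                        new InputSource(new StringReader(xml)));
                if (!value.isEmpty()) {
                    raw.put(entry.getKey(), value);
                }
            } catch (XPathExpressionException e) {
                throw new IllegalStateException("Cannot evaluate " + entry.getValue(), e);
            }
        }
        return raw;
    }

    // Step 2: transfer raw values to their target property names (e.g. author -> cm:author).
    static Map<String, String> mapToProperties(Map<String, String> raw,
                                               Map<String, String> mapping) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.get(entry.getKey());
            if (target != null) {
                properties.put(target, entry.getValue());
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        String xml = "<model><author>Alfresco</author></model>";

        Map<String, String> xpaths = new LinkedHashMap<>();
        xpaths.put("author", "/model/author/text()");

        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("author", "cm:author");

        System.out.println(mapToProperties(extractRaw(xml, xpaths), mapping));
    }
}
```

Keeping the XPath table and the key-to-property mapping as two separate maps mirrors the split described above: the extractor only knows about raw keys, and the mapping step decides which model property each key feeds.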
Selecting XML Extractor
Of course, this is an unregistered extractor and won't be called by anything. On top of this, we have to ensure that the XML documents that do get passed to it are valid or the XPath expressions won't work at all. When the system needs to extract meta-data from an XML document, it uses the mimetype text/xml to request an extractor. The only registered extractor in this case is the extracter.xml.sample.XMLMetadataExtracter:
Note that the overwrite policy is applied by the extracter.xml.sample.XMLMetadataExtracter - in this case it is set to EAGER and any property coming from the document will overwrite the corresponding property in the system metadata.
It passes the document to the extracter.xml.sample.selector.XPathSelector bean, which in turn executes a sequence of XPath statements to determine which specific extractor will do the work:
If the document has a <model> element, then the extracter.xml.sample.AlfrescoModelMetadataExtracter bean will be passed back to be used by the XMLMetadataExtracter.
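The selection step can be pictured with the following minimal sketch, which runs an ordered list of XPath statements against the document and returns the name of the first worker whose statement matches a node. It is an illustration only, not the XPathContentWorkerSelector code: the class XPathSelectorSketch, the method selectWorker, and the second table entry ('/project' and its bean name) are hypothetical, and workers are represented by bean-name strings rather than real extractor objects.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of Alfresco.
final class XPathSelectorSketch {

    // Try each XPath in insertion order and return the worker for the first one that matches.
    static Optional<String> selectWorker(String xml, Map<String, String> workersByXPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> entry : workersByXPath.entrySet()) {
            try {
                Object node = xpath.evaluate(entry.getKey(),
                        new InputSource(new StringReader(xml)), XPathConstants.NODE);
                if (node != null) {
                    return Optional.of(entry.getValue());
                }
            } catch (XPathExpressionException e) {
                // Assumption: a statement that cannot be evaluated counts as "no match".
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the statements in the order they were configured.
        Map<String, String> workers = new LinkedHashMap<>();
        workers.put("/model", "extracter.xml.sample.AlfrescoModelMetadataExtracter");
        workers.put("/project", "hypothetical.ProjectMetadataExtracter"); // made-up entry

        System.out.println(selectWorker("<model><author>Alfresco</author></model>", workers));
        System.out.println(selectWorker("<unknown/>", workers));
    }
}
```

Returning an empty Optional when nothing matches leaves it to the caller to decide what to do with XML documents that no configured extractor understands.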
In the example above, the EAGER policy was used. There are a total of three overwrite policy options available:
CAUTIOUS - This policy only puts the extracted value if there is no value (null or otherwise) in the properties map.
EAGER - This policy puts the new value if the extracted property is not null.
PRAGMATIC - This policy puts the new value if the extracted property is not null and at least one of the following holds: there is no target key for the property; the target value is null; or the string representation of the target value is an empty string.
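The three policies can be compared side by side with the small sketch below. It is an interpretation of the descriptions above, not Alfresco's own policy code: the enum OverwritePolicySketch and its methods are hypothetical, properties are keyed by plain strings, and it assumes that a null extracted value is never written under any policy (the text states this only for EAGER and PRAGMATIC).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three overwrite policies; not part of Alfresco.
enum OverwritePolicySketch {
    CAUTIOUS, EAGER, PRAGMATIC;

    // Decide whether the extracted value should replace whatever the target map holds.
    boolean shouldWrite(String key, Object extracted, Map<String, Object> target) {
        if (extracted == null) {
            return false; // assumption: null extracted values are never written
        }
        switch (this) {
            case CAUTIOUS:
                // Only write when the key is absent; a key holding null still counts as present.
                return !target.containsKey(key);
            case EAGER:
                // Any non-null extracted value wins.
                return true;
            case PRAGMATIC:
                // Write when there is no usable value yet: no key, null, or empty string.
                Object current = target.get(key);
                return !target.containsKey(key)
                        || current == null
                        || current.toString().isEmpty();
            default:
                throw new IllegalStateException("Unknown policy: " + this);
        }
    }

    // Apply the policy to the target map in place.
    void apply(String key, Object extracted, Map<String, Object> target) {
        if (shouldWrite(key, extracted, target)) {
            target.put(key, extracted);
        }
    }

    public static void main(String[] args) {
        for (OverwritePolicySketch policy : values()) {
            Map<String, Object> properties = new HashMap<>();
            properties.put("cm:author", ""); // existing but empty value
            policy.apply("cm:author", "Alfresco", properties);
            System.out.println(policy + " -> " + properties);
        }
    }
}
```

The empty-string case in main is where the policies differ most visibly: CAUTIOUS treats the existing key as occupied, while PRAGMATIC and EAGER both replace the empty value.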
The XPathMetadataExtracter has some advantages over other, simpler extractors. The XPath statements can return multiple string values. For example, if an XML document looked as follows: