Developers can study the default implementations contained in package:
Note: The interface is named MetadataExtracter, although the correct spelling would be MetadataExtractor. Apart from code identifiers, this document uses the spelling "extractor".
One of the default actions that can be triggered in a space is Extract Common Metadata. This action will look at the mimetype of the document that triggered the rule and request an appropriate MetadataExtracter from the default MetadataExtracterRegistry. Each extractor is registered to handle a set of mimetypes.
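The lookup described above can be pictured with a minimal sketch. The class names used here (SketchExtractor, SketchExtractorRegistry) are hypothetical and are not the Alfresco interfaces. The sketch only assumes that each extractor declares the mimetypes it supports and that the registry hands back the first registered match; the real registry may choose differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for an extractor: it only declares which mimetypes it handles.
final class SketchExtractor {
    final String name;
    final Set<String> supportedMimetypes;

    SketchExtractor(String name, Set<String> supportedMimetypes) {
        this.name = name;
        this.supportedMimetypes = supportedMimetypes;
    }
}

// Hypothetical registry: extractors are registered once and looked up by mimetype.
final class SketchExtractorRegistry {
    private final List<SketchExtractor> extractors = new ArrayList<>();

    void register(SketchExtractor extractor) {
        extractors.add(extractor);
    }

    // Returns the first registered extractor that supports the mimetype, if any.
    Optional<SketchExtractor> getExtractor(String mimetype) {
        return extractors.stream()
                .filter(e -> e.supportedMimetypes.contains(mimetype))
                .findFirst();
    }
}
```

A document whose mimetype has no registered extractor simply yields an empty result, so no metadata is extracted for it.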
Before V2.1, the extractor pulled a set of values from the document and copied them directly into the document's metadata. If a property was declared as part of an aspect in the model, the aspect was also added to the document. A property that already existed was not overwritten by the extractor. Developers can look at org.alfresco.repo.content.metadata.AbstractMetadataExtracter.
V2.1 separated the extraction of values from the document and the setting of system properties. In V2.1, the extractor pulls a set of values from the document (the full list is declared in the javadocs of each class). The extractor then uses a set of properties to map the extracted values to the document's metadata. By default, the extractor will not overwrite any properties already present in the document's metadata, but this can be changed by overriding the extractor's bean definition. Developers should look at org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.
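To make the two-step idea concrete, here is a minimal sketch of mapping extracted values onto document properties without overwriting existing ones. It is not the code in AbstractMappingMetadataExtracter. The class MappingSketch, the plain string maps and the overwrite flag are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Alfresco.
final class MappingSketch {
    // Copies extracted values into the document properties using the mapping
    // (extracted key -> property name). Existing properties are kept unless
    // overwrite is true.
    static Map<String, String> apply(Map<String, String> extracted,
                                     Map<String, String> mapping,
                                     Map<String, String> existing,
                                     boolean overwrite) {
        Map<String, String> result = new HashMap<>(existing);
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            String value = extracted.get(entry.getKey());
            if (value == null) {
                continue; // nothing was extracted for this key
            }
            String property = entry.getValue();
            if (overwrite || !result.containsKey(property)) {
                result.put(property, value);
            }
        }
        return result;
    }
}
```

With overwrite set to false, a title already present on the document survives while a missing author is filled in from the extracted values.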
These are the extractors defined within <WEB-INF>/classes/alfresco/content-services-context.xml for V2.1:
The parent bean (parent='baseMetadataExtracter') will register the extractor with the metadataExtracterRegistry bean. The extractor will then automatically be available to the Alfresco server for the mimetypes that it declares.
Configuring the Extractor
Note: This applies to V2.1 only. Earlier versions were hardcoded.
We'll use the extracter.OpenDocument as an example of how to modify the configuration. This extractor handles all the OpenDocument formats using a connection to a headless OpenOffice process. Before reading more, open up the following:
Copy the extension sample to <extension-config>/alfresco/extension/custom-metadata-extrators-context.xml to activate the sample for your server. The Javadocs for the extractor give the list (on the left) of values extracted from the document. All these extracted values are put into a map, ready for conversion to model-specific properties. By default, the following will be populated by the extractor:
Let's assume that a user property, user1, will be used by the Alfresco users to fill in the description of the documents they edit. The description field extracted by the extractor should be ignored and the user1 field used instead. We inherit all the other mappings and just modify how the user1 field is used.
Alternatively, you may wish to put your changes in a property file instead:
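As a rough illustration of the user1 scenario, the sketch below merges inherited mappings with overrides read from properties-style text. The class MappingOverrideSketch is hypothetical. The line format "extractedKey=property" and the property names cm:title and cm:description are assumptions for the example; check the extension sample for the format your version expects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Alfresco.
final class MappingOverrideSketch {
    // Loads mapping lines of the assumed form "extractedKey=property".
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Starts from the inherited mappings, drops the keys to ignore,
    // then applies the overrides on top.
    static Properties merge(Properties inherited, Properties overrides, String... ignoredKeys) {
        Properties merged = new Properties();
        merged.putAll(inherited);
        for (String key : ignoredKeys) {
            merged.remove(key);
        }
        merged.putAll(overrides);
        return merged;
    }
}
```

Merging the inherited mappings with the single line user1=cm:description, while ignoring description, leaves every other mapping untouched and routes user1 to the description property.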
Document properties are generally extracted as Java String types, but this might not always be the case. When the extracted values are mapped to system properties, the extractor now explicitly performs a data type conversion to catch any failures at the point of extraction. Where a property exists in the data dictionary and a value cannot be converted to the required type, the value can either be discarded or cause the extraction to fail (the default is failure).
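The two failure policies can be sketched as follows. This is an illustration only: ConversionSketch and its failOnError flag are hypothetical names, and integer parsing stands in for whatever conversion the data dictionary type actually requires.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Alfresco.
final class ConversionSketch {
    // Converts raw string values to integers. When failOnError is true the
    // first bad value aborts the extraction; otherwise bad values are discarded.
    static Map<String, Integer> convert(Map<String, String> raw, boolean failOnError) {
        Map<String, Integer> converted = new HashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            try {
                converted.put(entry.getKey(), Integer.valueOf(entry.getValue().trim()));
            } catch (NumberFormatException e) {
                if (failOnError) {
                    throw new IllegalStateException(
                            "Cannot convert property " + entry.getKey(), e);
                }
                // lenient mode: drop the value and carry on
            }
        }
        return converted;
    }
}
```

Failing by default surfaces bad source data early, whereas the lenient mode keeps whatever could be converted.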
Alfresco's default String to Date conversion uses the ISO 8601 format, i.e. sYYYY-MM-DDThh:mm:ss.sssTZD. During metadata extraction, the date strings are seldom in this format. A list of alternative formats can be specified; these are used if the ISO 8601 conversion fails and the target system property is d:date or d:datetime.
The formats are tried in order until one succeeds or all have failed. For the full list of options to describe the date formats, see the SimpleDateFormat Javadocs.
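The ordered fallback can be sketched with SimpleDateFormat directly. DateFallbackSketch is a hypothetical helper, not the extractor's own code. It assumes the ISO 8601 attempt has already failed and that the patterns are interpreted with an English locale.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Alfresco.
final class DateFallbackSketch {
    // Tries each pattern in order and returns the first successful parse,
    // or an empty result when every pattern has failed.
    static Optional<Date> parse(String value, List<String> patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false); // reject values that only loosely match
            try {
                return Optional.of(format.parse(value));
            } catch (ParseException e) {
                // fall through to the next pattern
            }
        }
        return Optional.empty();
    }
}
```

Because the first matching pattern wins, it makes sense to list the most specific formats first.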
Properties and Aspects
When an aspect-defined property is extracted and added to the document's metadata, the associated aspect is implicitly added. By default, any values already present in the metadata will remain, but it is possible to change this behaviour on a system-wide level by specifying that any properties not extracted should be removed from the target node. To do so, override the extract-metadata bean and set its carryAspectProperties property to false.