This page is obsolete.
The official documentation is at: http://docs.alfresco.com
This document assumes knowledge of how to extend the repository configuration.
From Swift onwards, Alfresco makes use of Apache Tika.
Tika is used for both metadata extraction and content transformation.
For metadata extraction, it allows easy extraction of the metadata of
documents and their translation into your content model. For content
transformation, it allows the production of plain text, HTML and XML (XHTML)
versions of content.
The exact list of supported formats varies with the version of Tika in
use. Project Swift uses Tika 0.8; the list of formats it supports will
shortly be available at http://tika.apache.org/0.8/formats.html . For now,
the list of formats for the previous release is available at
http://tika.apache.org/0.7/formats.html
All the Parsers which ship as standard with Tika are available under the Apache License (or similar), as are their dependencies. However, a few 3rd party parsers are available under different licenses (usually GPL); details are available from the Tika Wiki. See the details below for how to enable these plugins if required.
A number of Metadata Extractors are powered by
Apache Tika. Many of the existing extractors in Alfresco have been
converted to use Tika.
The Auto-Detect parser allows the extraction of metadata from any files which
are supported by Tika, but where no dedicated metadata extractor exists. It
provides a common set of mappings from Tika metadata to the Alfresco content
model, which will be used across all files that are handled by the auto-detect
parser fall-back.
The auto-detect parser is provided by
org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter, and its
property mappings are defined in /org/alfresco/repo/content/metadata/TikaAutoMetadataExtracter.properties.
To add extra mappings, follow the
Configuring an Extractor
guide and add them to the extracter.TikaAuto bean.
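To make the mapping format concrete, here is a minimal sketch of how a mapping properties file can be read and resolved. It uses only the JDK and is not Alfresco's implementation. It assumes the layout used in the bean definitions on this page: `namespace.prefix.<prefix>` entries declare namespaces, and every other entry maps a raw Tika metadata key to a single `prefix:localName` target. The class name `MappingPropertiesSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Sketch only (hypothetical class): resolves "rawKey=prefix:localName" mappings to {uri}localName. */
final class MappingPropertiesSketch {

    private static final String NAMESPACE_KEY = "namespace.prefix.";

    static Map<String, String> resolve(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Pass 1: collect the declared prefixes, e.g. cm -> http://www.alfresco.org/model/content/1.0
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NAMESPACE_KEY)) {
                namespaces.put(key.substring(NAMESPACE_KEY.length()), props.getProperty(key).trim());
            }
        }

        // Pass 2: expand every other entry from prefix:localName to {uri}localName
        Map<String, String> resolved = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NAMESPACE_KEY)) {
                continue;
            }
            String target = props.getProperty(key).trim();
            int colon = target.indexOf(':');
            String uri = colon < 0 ? null : namespaces.get(target.substring(0, colon));
            if (uri == null) {
                throw new IllegalArgumentException("No namespace declared for mapping: " + key + "=" + target);
            }
            resolved.put(key, "{" + uri + "}" + target.substring(colon + 1));
        }
        return resolved;
    }

    public static void main(String[] args) {
        String text = "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                + "user1=cm:description\n";
        System.out.println(resolve(text));
    }
}
```

With the two entries shown in `main`, the raw key `user1` resolves to `{http://www.alfresco.org/model/content/1.0}description`. The sketch handles one target per key; anything beyond that is outside its scope.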
The auto-detect parser can be disabled just like any other Alfresco supplied
metadata extractor. Simply comment out the bean definition for extracter.TikaAuto
inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart
the repository. For more details, see the main Metadata Extractors
page.
Whilst Tika ships with a large number of file format parsers, it won't
always cover all formats out of the box. All the parsers that ship with
Tika depend on Apache Licensed (or compatible) libraries, which means
that some parsers (typically those depending on GPL or proprietary
libraries) cannot be shipped.
If you have an additional Tika parser that you wish to use within Alfresco,
a small amount of coding and configuration is required. However, this is
generally much less work than adding in a whole new metadata extractor
from scratch.
Firstly, you should create a new Metadata Extractor class that extends
org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter .
Your class should register the mimetypes it handles via the constructor
of the superclass, and override the getParser method to return the
appropriate Tika Parser object for your file type. If needed, you can
also override the extractSpecific method to control the mapping.
For an example of how quick and simple this can be, see
org.alfresco.repo.content.metadata.DWGMetadataExtracter.
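The shape of such a subclass can be sketched as follows. The Alfresco and Tika classes are not available outside a repository build, so this sketch uses hypothetical stand-ins (`SketchParser`, `SketchTikaExtracterBase`) that only mimic the three extension points named above: the superclass constructor, getParser and extractSpecific. The real signatures in TikaPoweredMetadataExtracter may well differ, so check them against your Alfresco version. The mimetype `application/x-example` and the `key=value` input format are also invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a Tika Parser: turns content into raw metadata pairs. */
interface SketchParser {
    Map<String, String> parse(String content);
}

/** Hypothetical stand-in for the Tika-powered base class. This is NOT the Alfresco API. */
abstract class SketchTikaExtracterBase {
    private final List<String> supportedMimetypes;

    protected SketchTikaExtracterBase(List<String> supportedMimetypes) {
        this.supportedMimetypes = supportedMimetypes;
    }

    /** Subclasses return the parser for their file type. */
    protected abstract SketchParser getParser();

    /** Optional hook: adjust the extracted properties; the default changes nothing. */
    protected Map<String, String> extractSpecific(Map<String, String> raw,
                                                  Map<String, String> properties) {
        return properties;
    }

    public boolean supports(String mimetype) {
        return supportedMimetypes.contains(mimetype);
    }

    public final Map<String, String> extract(String content) {
        Map<String, String> raw = getParser().parse(content);
        Map<String, String> properties = new HashMap<>(raw); // start from the raw values
        return extractSpecific(raw, properties);
    }
}

/** Example subclass: registers its mimetype, supplies a parser, customises the mapping. */
final class MyCustomTikaMetadataExtractor extends SketchTikaExtracterBase {

    MyCustomTikaMetadataExtractor() {
        super(Arrays.asList("application/x-example")); // hypothetical mimetype
    }

    @Override
    protected SketchParser getParser() {
        // Toy parser: reads "key=value" lines
        return content -> {
            Map<String, String> raw = new HashMap<>();
            for (String line : content.split("\n")) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    raw.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
                }
            }
            return raw;
        };
    }

    @Override
    protected Map<String, String> extractSpecific(Map<String, String> raw,
                                                  Map<String, String> properties) {
        // Expose the raw "comment" value under the key "user1" so a mapping can pick it up
        String comment = raw.get("comment");
        if (comment != null) {
            properties.put("user1", comment);
        }
        return properties;
    }
}
```

The base class owns the extraction flow and the subclass fills in only the format-specific parts, which is why adding a Tika-backed extractor tends to be less work than writing one from scratch.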
Once you have written your extractor class, you need to register
it with the repository and configure the mappings. More details on
how to do this are provided in the
Configuring an Extractor
page. The quick answer is to install the class files for your
TikaPoweredMetadataExtracter subclass, the new Tika Parser and
its dependent libraries into the repository, then register a new
bean in an extension context file with a definition something like:
...
<bean id='extracter.MyCustomTika'
      class='com.example.mycompany.MyCustomTikaMetadataExtractor'
      parent='baseMetadataExtracter'>
   <property name='inheritDefaultMapping'>
      <value>true</value>
   </property>
   <property name='mappingProperties'>
      <props>
         <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
         <prop key='user1'>cm:description</prop>
      </props>
   </property>
</bean>
...
If your Tika parser doesn't need any special work to process its output, you can simply wire the parser into the metadata extractor service with a Spring bean definition. Note that this option isn't available in 3.4.a, but will be in 3.4 and 3.4.b.
...
<bean id='extracter.MyCustomTika'
      class='org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter'
      parent='baseMetadataExtracter'>
   <property name='tikaParserName'>
      <value>example.HelloWorldParser</value>
   </property>
   <property name='inheritDefaultMapping'>
      <value>true</value>
   </property>
   <property name='mappingProperties'>
      <props>
         <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
         <prop key='newTikaKey1'>cm:description</prop>
         <prop key='newTikaKey2'>cm:title</prop>
      </props>
   </property>
</bean>
...
Some 3rd party Tika plugins include the services file needed to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file then it is probably correctly configured, and will be used by the Auto-Detect parser if you don't define your own Spring bean for it.
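One way to check whether a plugin JAR on the classpath ships such a services file is to read the file directly. The sketch below uses only the JDK and assumes the standard Java service-provider file format (one class name per line, with `#` starting a comment). The class name `TikaServiceFileCheck` is hypothetical, and the sketch only lists what is declared; it does not load or validate the parsers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch only (hypothetical class): lists parser classes declared in services files. */
final class TikaServiceFileCheck {

    static final String SERVICE_FILE = "META-INF/services/org.apache.tika.parser.Parser";

    /** Extracts provider class names from services-file text, skipping blanks and '#' comments. */
    static List<String> parseProviders(String text) {
        List<String> names = new ArrayList<>();
        for (String line : text.split("\\R")) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Collects the names from every copy of the services file visible to the class loader. */
    static List<String> declaredParsers(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            for (URL url : Collections.list(loader.getResources(SERVICE_FILE))) {
                StringBuilder text = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        text.append(line).append('\n');
                    }
                }
                names.addAll(parseProviders(text.toString()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints an empty list when no JAR on the classpath declares a Tika parser
        System.out.println(declaredParsers(TikaServiceFileCheck.class.getClassLoader()));
    }
}
```

If the parser class you expect is missing from the printed list, the JAR probably lacks the services file, and a dedicated Spring bean for the parser is the likely route.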
A number of Content Transformers are powered
by Apache Tika. Several of the existing to-plain-text transformers in
Alfresco have been converted to use Tika.
The Auto-Detect parser allows conversion to plain text, HTML or
XML/XHTML for any files which are supported by Tika, but where no
dedicated content transformer exists. It generally provides a
transformed version containing most of the text, but which is light
on formatting and layout. This is normally good enough for
indexing and simple preview, but does not normally match the
quality seen with transformers such as Open Office to PDF.
However, it does easily and quickly allow some transformation
for a wide range of formats that previously were not handled
by Alfresco.
The auto-detect transformer is provided by
org.alfresco.repo.content.transform.TikaAutoContentTransformer.
The auto-detect transformer can be disabled just like any other Alfresco supplied
content transformer. Simply comment out the bean definition for transformer.TikaAuto
inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart
the repository. For more details, see the main
Content Transformer page.
Whilst Tika ships with a large number of file format parsers, it won't
always cover all formats out of the box. All the parsers that ship with
Tika depend on Apache Licensed (or compatible) libraries, which means
that some parsers (typically those depending on GPL or proprietary
libraries) cannot be shipped.
If you have an additional Tika parser that you wish to use within Alfresco,
a small amount of coding and configuration is required. However, this is
generally much less work than adding in a whole new content transformer
from scratch.
Firstly, you should create a new Content Transformer class that extends
org.alfresco.repo.content.transform.TikaPoweredContentTransformer .
Your class should register the mimetypes it handles via the constructor
of the superclass, and override the getParser method to return the
appropriate Tika Parser object for your file type. For an example of how
quick and simple this can be, see
org.alfresco.repo.content.transform.PdfBoxContentTransformer.
Once you have written your transformer class, you simply need to
install the classes and register the transformer with the
repository. Full details on how to add a new transformer are
provided on the Content Transformations page. The quick
answer is to install the class files for your
TikaPoweredContentTransformer subclass, the new Tika Parser and
its dependent libraries into the repository, then register a new
bean in an extension context file with a definition something like:
...
<bean id='transformer.MyCustomTika'
      class='com.example.mycompany.MyCustomTikaContentTransformer'
      parent='baseContentTransformer' />
...
If your Tika parser doesn't need any special work to process its output, you can simply wire the parser into the content transformer service with a Spring bean definition. Note that this option isn't available in 3.4.a, but will be in 3.4 and 3.4.b.
...
<bean id='transformer.MyCustomTika'
      class='org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer'
      parent='baseContentTransformer'>
   <property name='tikaParserName'>
      <value>example.HelloWorldParser</value>
   </property>
</bean>
...
Some 3rd party Tika plugins include the services file needed to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file then it is probably correctly configured, and will be used by the Auto-Detect parser if you don't define your own Spring bean for it.