Content Transformation and Metadata Extraction with Apache Tika

resplin

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Content Modeling, Core Repository Services, 3.4, 4.0
This document assumes knowledge of how to extend the repository configuration.





Introduction


From Swift onwards, Alfresco makes use of Apache Tika for both metadata
extraction and content transformation. For metadata extraction, it allows
easy extraction of document metadata and its translation into your content
model. For content transformation, it allows the production of plain text,
HTML and XML (XHTML) versions of content.

The exact list of supported formats varies with the version of Tika being
used. Project Swift uses Tika 0.8; the list of formats it supports will
shortly be available at http://tika.apache.org/0.8/formats.html . For now,
the list for the previous release is available at
http://tika.apache.org/0.7/formats.html

All the parsers which ship as standard with Tika are available under the Apache License (or similar), as are their dependencies. However, a few third-party parsers are available under different licenses (usually GPL); details are available from the Tika Wiki. See below for how to enable these plugins if required.


Tika and Metadata Extraction


A number of Metadata Extractors are powered by
Apache Tika. Many of the existing extractors in Alfresco have been
converted to use Tika.


Auto Detect


The Auto-Detect parser allows the extraction of metadata from any files which
are supported by Tika, but where no dedicated metadata extractor exists. It
provides a common set of mappings from Tika metadata to the Alfresco content
model, which will be used across all files that are handled by the auto-detect
parser fall-back.

The auto-detect parser is provided by
org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter, and its
property mappings are defined in
/org/alfresco/repo/content/metadata/TikaAutoMetadataExtracter.properties .
To add extra mappings, follow the Configuring an Extractor guide and apply
it to the extracter.TikaAuto bean.
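
As a rough sketch of such an extra mapping, the bean below redefines extracter.TikaAuto in an extension context file, following the same pattern as the custom extractor beans further down this page. The Tika key myTikaKey and the choice of cm:description as its target are placeholders, not real defaults. The stock bean definition in your Alfresco version may also carry extra settings (constructor arguments, for instance) that would need copying across, so check it before overriding.

```xml
<!-- Sketch only: overrides the stock extracter.TikaAuto bean to add one mapping -->
<bean id='extracter.TikaAuto'
      class='org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter'
      parent='baseMetadataExtracter'>
    <!-- Keep the mappings from TikaAutoMetadataExtracter.properties -->
    <property name='inheritDefaultMapping'>
        <value>true</value>
    </property>
    <property name='mappingProperties'>
        <props>
            <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
            <!-- 'myTikaKey' is a hypothetical Tika metadata key -->
            <prop key='myTikaKey'>cm:description</prop>
        </props>
    </property>
</bean>
```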

The auto-detect parser can be disabled just like any other Alfresco-supplied
metadata extractor. Simply comment out the bean definition for extracter.TikaAuto
inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart
the repository. For more details, see the main Metadata Extractors
page.
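
For illustration, a disabled definition might look roughly like the following. The attributes shown are assumed from the class and bean names given above; the actual definition in your copy of content-services-context.xml may contain more, so comment out whatever is really there rather than copying this.

```xml
<!-- Tika auto-detect metadata extractor, disabled
<bean id='extracter.TikaAuto'
      class='org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter'
      parent='baseMetadataExtracter' />
-->
```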


New Tika Parsers


Whilst Tika ships with a large number of file format parsers, it won't
always cover all formats out of the box. All the parsers that ship with
Tika depend on Apache-licensed (or compatible) libraries, which means
that some parsers (typically depending on GPL or proprietary libraries)
cannot be shipped.

If you have an additional Tika parser that you wish to use within Alfresco,
a small amount of coding and configuration is required. However, this is
generally much less work than adding in a whole new metadata extractor
from scratch.


For full control


Firstly, you should create a new Metadata Extractor class that extends
org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter.
Your class should register the mimetypes it handles via the constructor
of the superclass, and override the getParser method to return the
appropriate Tika Parser object for your file type. If needed, you can
also override the extractSpecific method to control the mapping.
For an example of how quick and simple this can be, see
org.alfresco.repo.content.metadata.DWGMetadataExtracter.

Once you have written your extractor class, you need to register
it with the repository and configure the mappings. More details on
how to do this are provided on the
Configuring an Extractor
page. The quick answer is to install the class files for your
TikaPoweredMetadataExtracter subclass, the new Tika Parser and
its dependent libraries into the repository, then register a new
bean in an extension context file with a definition something like:



    ...
    <bean id='extracter.MyCustomTika'
          class='com.example.mycompany.MyCustomTikaMetadataExtractor'
          parent='baseMetadataExtracter' >
        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <props>
                <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
                <prop key='user1'>cm:description</prop>
            </props>
        </property>
    </bean>
    ...
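
The fragment above has to sit inside a complete Spring context file. A minimal sketch of one follows, assuming the usual Alfresco convention of a file whose name ends in -context.xml placed in the extension directory on the classpath. The file name custom-tika-context.xml is made up for the example, and the class name and the user1 key are the same placeholders used above.

```xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Hypothetical location: alfresco/extension/custom-tika-context.xml -->
<beans>
    <bean id='extracter.MyCustomTika'
          class='com.example.mycompany.MyCustomTikaMetadataExtractor'
          parent='baseMetadataExtracter'>
        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <props>
                <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
                <!-- 'user1' is the placeholder Tika key from the fragment above -->
                <prop key='user1'>cm:description</prop>
            </props>
        </property>
    </bean>
</beans>
```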

Letting Spring handle it


Note that this isn't available in 3.4.a, but will be in 3.4 and 3.4.b.

If your Tika parser doesn't need any special work to process the output, you can simply wire the parser into the metadata extractor service with a Spring bean definition:



    ...
    <bean id='extracter.MyCustomTika'
          class='org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter'
          parent='baseMetadataExtracter' >

        <property name='tikaParserName'>
           <value>example.HelloWorldParser</value>
        </property>

        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <props>
                <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
                <prop key='newTikaKey1'>cm:description</prop>
                <prop key='newTikaKey2'>cm:title</prop>
            </props>
        </property>
    </bean>
    ...

Let Tika Handle It


Some third-party Tika plugins include the services files needed to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file, then it is probably correctly configured and will be used by the Auto-Detect parser, provided you don't define your own Spring bean for it.




Tika and Content Transformation


A number of Content Transformers are powered
by Apache Tika. Several of the existing to-plain-text transformers in
Alfresco have been converted to use Tika.


Auto Detect


The Auto-Detect parser allows conversion to plain text, HTML or
XML/XHTML for any files which are supported by Tika, but where no
dedicated content transformer exists. It generally produces a
transformed version containing most of the text, but which is light
on formatting and layout. This is normally good enough for
indexing and simple preview, but does not match the quality of
transformers such as Open Office to PDF.
However, it does easily and quickly allow some transformation
for a wide range of formats that previously were not handled
by Alfresco.

The auto-detect transformer is provided by
org.alfresco.repo.content.transform.TikaAutoContentTransformer.

The auto-detect transformer can be disabled just like any other Alfresco-supplied
content transformer. Simply comment out the bean definition for transformer.TikaAuto
inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart
the repository. For more details, see the main
Content Transformer page.


New Tika Parsers


Whilst Tika ships with a large number of file format parsers, it won't
always cover all formats out of the box. All the parsers that ship with
Tika depend on Apache-licensed (or compatible) libraries, which means
that some parsers (typically depending on GPL or proprietary libraries)
cannot be shipped.

If you have an additional Tika parser that you wish to use within Alfresco,
a small amount of coding and configuration is required. However, this is
generally much less work than adding in a whole new content transformer
from scratch.


For full control


Firstly, you should create a new Content Transformer class that extends
org.alfresco.repo.content.transform.TikaPoweredContentTransformer.
Your class should register the mimetypes it handles via the constructor
of the superclass, and override the getParser method to return the
appropriate Tika Parser object for your file type. For an example of how
quick and simple this can be, see
org.alfresco.repo.content.transform.PdfBoxContentTransformer.

Once you have written your transformer class, you simply need to
install the classes and register the transformer with the
repository. Full details on how to add a new transformer are
provided on the Content Transformations page. The quick
answer is to install the class files for your
TikaPoweredContentTransformer subclass, the new Tika Parser and
its dependent libraries into the repository, then register a new
bean in an extension context file with a definition something like:



    ...
    <bean id='transformer.MyCustomTika'
          class='com.example.mycompany.MyCustomTikaContentTransformer'
          parent='baseContentTransformer' />
    ...

Letting Spring handle it


Note that this isn't available in 3.4.a, but will be in 3.4 and 3.4.b.

If your Tika parser doesn't need any special work to process the output, you can simply wire the parser into the content transformer service with a Spring bean definition:



    ...
    <bean id='transformer.MyCustomTika'
          class='org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer'
          parent='baseContentTransformer'>

        <property name='tikaParserName'>
           <value>example.HelloWorldParser</value>
        </property>

    </bean>
    ...
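
Because this page uses the same parser class for both services, the two Spring-configured beans can presumably sit side by side in one extension context file. A sketch, reusing the placeholder parser class example.HelloWorldParser and the placeholder Tika key newTikaKey1 from the examples on this page; the enclosing beans element is assumed to need the same XML declaration and DOCTYPE as the other context files in your installation.

```xml
<beans>
    <!-- Metadata extraction backed by the placeholder parser -->
    <bean id='extracter.MyCustomTika'
          class='org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter'
          parent='baseMetadataExtracter'>
        <property name='tikaParserName'>
            <value>example.HelloWorldParser</value>
        </property>
        <property name='inheritDefaultMapping'>
            <value>true</value>
        </property>
        <property name='mappingProperties'>
            <props>
                <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
                <prop key='newTikaKey1'>cm:description</prop>
            </props>
        </property>
    </bean>

    <!-- Text transformation backed by the same placeholder parser -->
    <bean id='transformer.MyCustomTika'
          class='org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer'
          parent='baseContentTransformer'>
        <property name='tikaParserName'>
            <value>example.HelloWorldParser</value>
        </property>
    </bean>
</beans>
```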

Let Tika Handle It


Some third-party Tika plugins include the services files needed to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file, then it is probably correctly configured and will be used by the Auto-Detect parser, provided you don't define your own Spring bean for it.