Metadata Embedding

ragauss
Active Member II

What is Metadata Embedding?



Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.



Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry. Using the Apache Tika project to power many of those extractors lets it support a huge number of file formats and metadata standards.



We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.



In some cases it's important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.



In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters, and are responsible for writing properties into content.

How Does it Work?



The MetadataEmbedder interface has just two methods: isEmbeddingSupported and embed.
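
To make that two-method contract concrete, here is a minimal, self-contained sketch. It is not the Alfresco interface itself: EmbedderSketch, PlainTextHeaderEmbedder and embedToString are hypothetical names, and the parameter types (a plain string map and Java streams) are simplifying assumptions standing in for the repository types. The toy embedder just writes the properties as header lines ahead of plain-text content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the two-method contract; the types are simplified assumptions.
interface EmbedderSketch {
    boolean isEmbeddingSupported(String mimetype);

    void embed(Map<String, String> properties, InputStream in, OutputStream out) throws IOException;
}

// Toy embedder: writes each property as a "key: value" line, then copies the original content.
class PlainTextHeaderEmbedder implements EmbedderSketch {
    @Override
    public boolean isEmbeddingSupported(String mimetype) {
        return "text/plain".equals(mimetype);
    }

    @Override
    public void embed(Map<String, String> properties, InputStream in, OutputStream out) throws IOException {
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            header.append(entry.getKey()).append(": ").append(entry.getValue()).append('\n');
        }
        out.write(header.toString().getBytes(StandardCharsets.UTF_8));

        // Copy the untouched original content after the header.
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    // In-memory convenience wrapper so callers do not need to handle IOException.
    static String embedToString(Map<String, String> properties, String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            new PlainTextHeaderEmbedder().embed(
                properties,
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)),
                out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("author", "ragauss");
        System.out.print(embedToString(properties, "Hello, embedding!\n"));
    }
}
```

A caller would check isEmbeddingSupported for the content's mimetype first and only then call embed.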



Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You'd usually implement both in the same class anyway. Speaking of...



AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:



  • A supportedEmbedMimetypes collection that's used in the isEmbeddingSupported call


  • embedMapping that defines the mapping from Alfresco properties to metadata fields


  • An embedInternal method to be overridden by extending classes


Again, just the reverse of the extraction pattern.



For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, e.g. classpath:/x/y/z/MyExtracter.embed.properties (note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c, see ALF-17891). If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
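
The default reverse mapping is easy to picture with a small sketch. This is only an illustration of the idea, not the Alfresco code: EmbedMappingSketch and reverse are hypothetical names, and property names are plain strings here rather than qualified names. An extract mapping goes from a raw metadata field to one or more properties; reversing it gives, for each property, the field(s) it should be written to.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class EmbedMappingSketch {

    // Inverts "raw field -> properties" into "property -> raw fields".
    static Map<String, Set<String>> reverse(Map<String, Set<String>> extractMapping) {
        Map<String, Set<String>> embedMapping = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : extractMapping.entrySet()) {
            String rawField = entry.getKey();
            for (String property : entry.getValue()) {
                // A property fed by several raw fields is written back to all of them.
                embedMapping.computeIfAbsent(property, k -> new LinkedHashSet<>()).add(rawField);
            }
        }
        return embedMapping;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> extractMapping = new LinkedHashMap<>();
        extractMapping.put("author", new LinkedHashSet<>(Arrays.asList("cm:author")));
        extractMapping.put("title", new LinkedHashSet<>(Arrays.asList("cm:title", "cm:description")));

        System.out.println(reverse(extractMapping));
    }
}
```

An explicit embed properties file is still the way to go when the two directions should not be symmetric, for example when a field is read from several places but should only be written to one.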

What About Tika?



'But that's still sooooo... abstract. How are we going to leverage Tika? It doesn't support embedding, does it?'



Well, as a matter of fact, it does, as of version 1.3 (TIKA-775).



The same notion of writing metadata into a binary has been outlined with an interface and basic implementation in Tika, so of course our TikaPoweredMetadataExtracter builds on that. It overrides the embedInternal method defined in its parent AbstractMappingMetadataExtracter to convert Alfresco properties to Tika metadata fields, then passes those on to a Tika Embedder's embed method, which passes back the new binary with the metadata embedded.
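
The conversion step is the interesting part, so here is a rough sketch of what turning typed property values into metadata field values can look like. It uses plain Java collections in place of Tika's Metadata class so it runs on its own; TikaFieldConversionSketch and toFieldValues are hypothetical names, and the rules shown (skip nulls, expand multi-valued properties, stringify everything else) are assumptions about a reasonable conversion rather than a copy of the real method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TikaFieldConversionSketch {

    // Converts already-mapped "metadata field -> value" pairs into multi-valued string fields.
    static Map<String, List<String>> toFieldValues(Map<String, ?> mappedProperties) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : mappedProperties.entrySet()) {
            Object value = entry.getValue();
            if (value == null) {
                continue; // nothing to embed for this field
            }
            List<String> values = fields.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
            if (value instanceof Collection) {
                // Multi-valued property: each element becomes its own field value.
                for (Object item : (Collection<?>) value) {
                    if (item != null) {
                        values.add(String.valueOf(item));
                    }
                }
            } else {
                values.add(String.valueOf(value));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> mapped = new LinkedHashMap<>();
        mapped.put("Caption-Abstract", "Sunset over the bay");
        mapped.put("Keywords", Arrays.asList("beach", "dusk"));
        System.out.println(toFieldValues(mapped));
    }
}
```

In the real class the resulting values would be added to a Tika Metadata object and handed to the Embedder along with the original content stream.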



Tika embedding

How Can we Use Embedding?



Our shiny new Alfresco metadata embedder's embed method isn't very useful if we don't have an easy way to call it, so we've added a ContentMetadataEmbedder action executor. It shows up as a standard 'Embed properties as metadata in content' action that can be used in a rule on a folder or executed in a workflow. (After 4.2.c you can find this in alfresco/extension/metadata-embedding-context.xml.sample.)
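
Conceptually, the action only has to find an embedder for the content's mimetype and hand it the properties. The sketch below shows that flow with stand-in types so it can run on its own. SimpleEmbedder, EmbedderRegistrySketch and EmbedActionSketch are hypothetical names, content is a byte array for brevity, and leaving the content untouched when no embedder is found is an assumption about sensible behavior, not a statement about the real executor.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified embedder contract.
interface SimpleEmbedder {
    boolean isEmbeddingSupported(String mimetype);

    byte[] embed(Map<String, String> properties, byte[] content);
}

// Stand-in for a registry offering a getEmbedder(String sourceMimetype) lookup.
class EmbedderRegistrySketch {
    private final List<SimpleEmbedder> embedders = new ArrayList<>();

    void register(SimpleEmbedder embedder) {
        embedders.add(embedder);
    }

    // Returns the first registered embedder that supports the mimetype, or null if none does.
    SimpleEmbedder getEmbedder(String sourceMimetype) {
        for (SimpleEmbedder embedder : embedders) {
            if (embedder.isEmbeddingSupported(sourceMimetype)) {
                return embedder;
            }
        }
        return null;
    }
}

class EmbedActionSketch {

    // Looks up an embedder and applies it; returns the content unchanged when none is available.
    static byte[] execute(EmbedderRegistrySketch registry, String mimetype,
                          Map<String, String> properties, byte[] content) {
        SimpleEmbedder embedder = registry.getEmbedder(mimetype);
        if (embedder == null) {
            return content;
        }
        return embedder.embed(properties, content);
    }

    public static void main(String[] args) {
        EmbedderRegistrySketch registry = new EmbedderRegistrySketch();
        registry.register(new SimpleEmbedder() {
            @Override
            public boolean isEmbeddingSupported(String mimetype) {
                return "text/plain".equals(mimetype);
            }

            @Override
            public byte[] embed(Map<String, String> properties, byte[] content) {
                String header = "caption: " + properties.get("caption") + "\n";
                return (header + new String(content, StandardCharsets.UTF_8))
                    .getBytes(StandardCharsets.UTF_8);
            }
        });

        byte[] result = execute(registry, "text/plain",
            Collections.singletonMap("caption", "Sunset"), "body".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```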



So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!



tika-exiftool is a wrapper for calls to the ExifTool command-line tool, and it contains a Tika Parser and Embedder for image files.
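
For a feel of what 'wrapping the command line' means, here is a small sketch that builds an ExifTool write command from a set of tag values. It relies only on ExifTool's documented -TAG=VALUE syntax (a trailing # on the tag name writes the raw numeric value). ExifToolCommandSketch and buildWriteCommand are hypothetical names, and this is not how tika-exiftool itself assembles its command; the sketch only builds the argument list and does not launch a process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExifToolCommandSketch {

    // Builds: <exiftool> -TAG=VALUE ... <file>
    static List<String> buildWriteCommand(String exiftoolPath, Map<String, String> tags, String file) {
        List<String> command = new ArrayList<>();
        command.add(exiftoolPath);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            // Each tag assignment is a single argument, so values with spaces need no shell quoting
            // when the list is passed to ProcessBuilder.
            command.add("-" + tag.getKey() + "=" + tag.getValue());
        }
        command.add(file);
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("IPTC:Caption-Abstract", "Sunset over the bay");
        tags.put("Orientation#", "6"); // '#' suffix: write the numeric value as-is

        List<String> command = buildWriteCommand("exiftool", tags, "photo.jpg");
        System.out.println(command);
        // To actually run it (assuming exiftool is installed): new ProcessBuilder(command).start()
    }
}
```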



The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.



We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop's file info.



Embed flow



Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.

What's Next?



We'll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?
7 Comments
blog_commenter
Active Member
Embedding / transferring metadata into OOXML documents is a request that is relevant quite often. In the past, we've used custom integrations with DOCX4J (Apache License) to transfer metadata into customer content. Combined with metadata extractors, this could enhance the SharePoint protocol experience  (effectively providing the same Content Type Document Properties feature). As far as 'other formats' are concerned, I can't think of any that would be nearly as relevant / significant.

Concerning EXIF, I can't wait for a GeoTagger UI extension in Share to manage / correct those files where my camera just wouldn't get a clear GPS signal.
blog_commenter
Active Member
Question from a non-programmer: Is it possible to call exiftool from within say a rule and have it extract custom metadata from e.g. a pdf (i.e. XMP tags) and get those to populate the Alfresco metadata fields?

Cheers
ragauss
Active Member II
Hi nafets,



The Media Management module does something similar using exiftool, though you wouldn't normally have a rule interact directly with the command line exiftool.



Normally the way extraction works is that once the content is ingested the metadata extraction is automatically performed, without the need for an explicit rule, by executing an action which:



1. Finds the proper metadata extractor based on the content's mimetype and your repository's configuration and tells that extractor to extract the metadata.



2. That extractor extracts the 'raw' metadata fields, then maps them to Alfresco data model properties based on its mapping configuration.



Now, in the case of PDFs, Apache Tika (the primary underlying metadata extraction library) can already read custom XMP, so all you have to do is add your custom XMP fields to the mapping config for the PDF extractor!



See http://wiki.alfresco.com/wiki/Metadata_Extraction#Configuring_the_Extractor



Hope that helps,



Ray
blog_commenter
Active Member
This is a great feature indeed! It opens up a whole new field of opportunities to implement complex requirements within Alfresco that we were not able to meet earlier. Use cases that we could not implement (or could not even imagine) within Alfresco are possible now. It will encourage customers to go for Alfresco. Good work, Alfresco team. Keep it up :)
blog_commenter
Active Member
Hi Ray,



I am investigating how to embed EXIF tags from Alfresco metadata changes.

If I understand the exiftool code correctly, it only supports embedding of IPTC and XMP tags (in ExiftoolTikaIptcMapper.java) but ignores changes to EXIF tags, right?

Am I correct that, in order to embed EXIF tag changes, I would need to create a new ExiftoolTikaMapper class and use this as the mapper for the ExiftoolExternalEmbedder constructor?

Or is there an easier way?



kr



Stefan
ragauss
Active Member II
Hi Stefan,

You are correct.  You would extend ExiftoolTikaIptcMapper or create a new ExiftoolTikaMapper for EXIF exclusively and pass that into the ExiftoolExternalEmbedder constructor.

As you've probably seen, the main job of that mapper is to translate between Exiftool fields and Tika fields, in both directions.

Regards,

Ray
blog_commenter
Active Member
Thank you Ray. Just in case you may want to include this EXIF data embedding feature in future releases, here is the code I've added to ExiftoolTikaIptcMapper to make it work:



    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:ImageWidth"),
        Arrays.asList(Property.internalTextBag("ImageWidth")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:ImageLength"),
        Arrays.asList(Property.internalTextBag("ImageLength")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:Make"),
        Arrays.asList(Property.internalTextBag("Make")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:Model"),
        Arrays.asList(Property.internalTextBag("Model")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:Software"),
        Arrays.asList(Property.internalTextBag("Software")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:Orientation"),
        Arrays.asList(Property.internalTextBag("Orientation#"))); // # allows writing numbers
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:XResolution"),
        Arrays.asList(Property.internalTextBag("XResolution")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:YResolution"),
        Arrays.asList(Property.internalTextBag("YResolution")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("tiff:ResolutionUnit"),
        Arrays.asList(Property.internalTextBag("ResolutionUnit#"))); // # allows writing numbers
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:Flash"),
        Arrays.asList(Property.internalTextBag("Flash")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:ExposureTime"),
        Arrays.asList(Property.internalTextBag("ExposureTime")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:FNumber"),
        Arrays.asList(Property.internalTextBag("FNumber")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:FocalLength"),
        Arrays.asList(Property.internalTextBag("FocalLength")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:IsoSpeedRatings"),
        Arrays.asList(
            Property.internalTextBag("ISOSpeedRatings"),
            Property.internalTextBag("ISO")));
    _tikaToExiftoolMetadataMap.put(
        Property.internalTextBag("exif:DateTimeOriginal"),
        Arrays.asList(Property.internalTextBag("DateTimeOriginal")));