Metadata Extraction to Tags


ragauss
Active Member II

The Alfresco Tags You Know and Love



The tagging capabilities in Alfresco and Share provide an easy way for users to associate tags with a piece of content and filter content sets by those tags. It's an excellent way to add more context for other members of a team, and is particularly useful for visual content where there may not be any associated text to describe or enable searching for what's depicted.



Some file formats may contain metadata that can aid in that description and searching, and that metadata is likely something Alfresco can already extract, but we wouldn't get the slick user experience that tags afford us.



'Wait! What if we could map extracted metadata to standard tags?' Hey, that's a great idea!

Tag Mapping



We introduced the ability to map metadata extraction to tags in 4.2.c, but it's not enabled out of the box. Let's take a quick look at how things worked, what was changed, then how you might use it.

Metadata Extraction Mapping Refresher



ContentMetadataExtracter is the action executer which does the work of getting the proper MetadataExtracter from the MetadataExtracterRegistry, then calls its extract method to fill in a properties map.



AbstractMappingMetadataExtracter, which is what most metadata extractors extend from, allows you to map incoming metadata fields to Alfresco properties.
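As a rough illustration of such a mapping, an overriding extractor bean might look like the sketch below. The bean id extracter.PDFBox, the baseMetadataExtracter parent and the author field name are assumptions based on a stock install and may differ in your version. inheritDefaultMapping and mappingProperties are the properties AbstractMappingMetadataExtracter exposes for this purpose.

```xml
<!-- Sketch only: bean id, parent and field names are assumptions, check your install -->
<bean id='extracter.PDFBox' class='org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter' parent='baseMetadataExtracter'>
    <!-- Keep the extractor's built-in mappings and add to them -->
    <property name='inheritDefaultMapping'>
        <value>true</value>
    </property>
    <property name='mappingProperties'>
        <props>
            <!-- Declare each namespace prefix used on the right-hand side -->
            <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
            <!-- incoming metadata field -> Alfresco property -->
            <prop key='author'>cm:author</prop>
        </props>
    </property>
</bean>
```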

Tags Refresher



Alfresco's tags are stored and displayed using the cm:taggable property inside the cm:taggable aspect. A type of category node is created for each tag (or linked to if it already exists) and is associated with the tagged content as a property by the TaggingService.



In the past you couldn't just map your free-form string metadata fields to cm:taggable as it's expecting a nodeRef to perform that property linking.

What's Changed



We now catch MalformedNodeRefException for tag values in AbstractMappingMetadataExtracter, and if enableStringTagging=true the raw string values are passed on as is to the next step. In some cases you may actually have a tag's nodeRef as a metadata field in your binary file; then no MalformedNodeRefException is thrown and your content is linked to that existing tag.



Once we've returned to ContentMetadataExtracter the properties modified by the metadata extractor are iterated and set by the NodeService. It's during that process that we look for cm:taggable and use the TaggingService to create or link the raw string tags, provided enableStringTagging=true and the TaggingService is set.



Multi-valued metadata fields are supported of course, and a tag will be created or linked for each value.

How to Use it



Again, to make all this magic happen you must currently set the taggingService property on ContentMetadataExtracter and set enableStringTagging=true. Your overriding bean definition might look like this:

<bean id='extract-metadata' class='org.alfresco.repo.action.executer.ContentMetadataExtracter' parent='action-executer'>
    <property name='nodeService'>
        <ref bean='NodeService' />
    </property>
    <property name='contentService'>
        <ref bean='ContentService' />
    </property>
    <property name='dictionaryService'>
        <ref bean='dictionaryService' />
    </property>
    <property name='taggingService'>
        <ref bean='TaggingService' />
    </property>
    <property name='metadataExtracterRegistry'>
        <ref bean='metadataExtracterRegistry' />
    </property>
    <property name='applicableTypes'>
        <list>
            <value>{http://www.alfresco.org/model/content/1.0}content</value>
        </list>
    </property>
    <property name='carryAspectProperties'>
        <value>true</value>
    </property>
    <property name='enableStringTagging'>
        <value>true</value>
    </property>
</bean>


Then define your metadata extractor mapping, something like:

dc\:subject=cm:taggable
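
If you would rather keep the mapping inline than in a properties file, the same entry can probably be expressed as a mappingProperties fragment inside your extractor bean. This is a sketch under assumptions: the dc:subject key simply mirrors the line above (the backslash escape is only needed in a properties file), and the actual field name depends on what your extractor emits for the file format in question.

```xml
<!-- Sketch only: place inside your metadata extractor bean definition -->
<property name='mappingProperties'>
    <props>
        <!-- Namespace prefix used by the target property below -->
        <prop key='namespace.prefix.cm'>http://www.alfresco.org/model/content/1.0</prop>
        <!-- Assumed field name: map the extracted subject values to tags -->
        <prop key='dc:subject'>cm:taggable</prop>
    </props>
</property>
```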


IPTC Keywords Example



The Media Management module supports full IPTC extraction for images. IPTC is where the keywords used by so many photo editing and organization programs are stored, which makes them a perfect candidate for mapping to Alfresco tags:



Tag Mapping



What other metadata fields are you thinking of mapping to tags?
17 Comments
blog_commenter
Active Member
For a customer we've mapped the MS Office document property Keywords to tags via an intermediary property, where a policy would pick up the complete string and tokenise it according to a defined delimiter. This could be a field relevant to a lot of people, and combined with a reverse update via metadata embedders this could really help re-connect offline edited content or enhance the SharePoint protocol experience.
blog_commenter
Active Member
Hi,

Until now I used something like what Axel describes to extract the Keyword values (a comma-separated, multi-valued metadata field) from PDF and MS documents. When I try to do it the new way described in the article, it gives me the infamous, annoying cast exception:



java.lang.ClassCastException: java.lang.String cannot be cast to [Ljava.lang.Object;



I use the bean definition given above and this simple mapping bean:

true
http://www.alfresco.org/model/content/1.0
http://purl.org/dc/terms/
dcterms:creator
dcterms:title
dcterms:description
cm:taggable
dcterms:created
dcterms:modified

Maybe somebody already knows a solution for this.



Kind regards

Jochem
blog_commenter
Active Member
Ooh, sorry, wrong html tags used.
blog_commenter
Active Member
It is just this simple Mapping that I tried to use:



cm:taggable
ragauss
Active Member II
It's tough to tell what the problem might be without seeing your mapping properly or a little more context around the error.
blog_commenter
Active Member
Hello Ray,

Embarrassing. I hope I don't ruin it again; I'll put the mapping at the bottom.

I had an interesting, or rather annoying, experience with automatic tagging using the JavaScript API, and I guess it could be the reason for my mapping problem too.

There was an 'invisible' line break or HTML line-break tag in one metadata item that I tried to import into a tag, and that somehow corrupted the tagging data. It appeared as a simple line break in the JavaScript console but showed up as an HTML tag in the Share repository browser. The tag manager also stopped showing any tags after the import. Even removing all tags (also with JavaScript code) left the tagging mechanisms in Share in an unusable state.

true
http://www.alfresco.org/model/content/1.0
http://purl.org/dc/terms/
dcterms:creator
dcterms:title
dcterms:description
dcterms:keywords
<!-- dcterms:created
dcterms:modified -->
dcterms:created
dcterms:modified


blog_commenter
Active Member
Sorry again, the 'code' html-tags  don't work somehow. :-(
ragauss
Active Member II
Hi Jochem,



Would you mind filing a JIRA issue with a component of 'Tika, POI, and Metadata Extraction' describing the problem and perhaps attaching the mapping there so I can take a closer look?



Regards,



Ray
blog_commenter
Active Member
Hi, this is working fine, but it only seems to work if you create an inbound rule.

How can I keep extracting metadata into tags when a document is updated (new version, edited content, etc.)?
ragauss
Active Member II
Hi kassem,



You shouldn't need an inbound rule for the initial mapping to tags to work.



As far as updates, by default most metadata extractors have an OverwritePolicy of PRAGMATIC (http://dev.alfresco.com/resource/docs/java/org/alfresco/repo/content/metadata/MetadataExtracter.Over...) so the previously extracted metadata isn't overwritten.



You may be able to set the overwritePolicy property to something like EAGER on the extractor bean in question to achieve the desired result.
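
A minimal sketch of that suggestion, assuming the extractor in question extends AbstractMappingMetadataExtracter and that you are overriding its bean definition. The fragment goes inside that bean, and whether EAGER gives the result you want for tags is worth testing on a non-production system first.

```xml
<!-- Sketch only: add to the overriding definition of the extractor bean in question -->
<property name='overwritePolicy'>
    <!-- EAGER overwrites existing values with newly extracted ones; the default is PRAGMATIC -->
    <value>EAGER</value>
</property>
```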
blog_commenter
Active Member
I've noticed that at least with the latest Alfresco enterprise version (5.0.2.5), it doesn't parse out the keywords into separate tags, but concatenates them together with commas. Have you seen this behavior before? I see this for PDFs and jpg (the two mime types I tried). I didn't create my own extractor, but just used PdfBoxMetadataExtracter and TikaAutoMetadataExtracter and pointed them to property files where I set the mapping.
blog_commenter
Active Member
Hey Ray,

I don't know what might have happened recently, but I noticed in Alfresco 5.0.d and 5.0.2.5 that keywords are not getting extracted correctly. They are getting concatenated into a single comma separated tag. I turned TRACE on for ContentMetadataExtracter and noticed that the rawValue getting passed into addTags is a Collection with a single value which is that comma separated list of keywords. I don't know if there was a code change or Tika update or whatever, but noticed that problem. I posted to the forums https://forums.alfresco.com/forum/developer-discussions/repository-services/metadata-extraction-tags... about this as well, but haven't heard anything back.
ragauss
Active Member II
Hi Jeff,

The change in behavior is being investigated.

Regards,

Ray
blog_commenter
Active Member
Hi Ray,

Do you have any idea why I get the same result (tags = 'tag1,tag2') with 5.1.f, when the JIRA issue was closed as fixed months ago?

Thanks
ragauss
Active Member II
Hi Ivos,

It looks like the fix made it onto the community line fairly recently, so it should be in the next community release.

Regards,

Ray
blog_commenter
Active Member
Hi Ray,

Thanks a lot for such a quick reply. It looks promising on the one hand, but it's a pity that there are no patches even for bugs marked as critical or ...

Ivo
ragauss
Active Member II
Hi Ivos,

Defect escalation, long-term patching, service packs and prioritization of issues are a big part of Alfresco One, the supported product.

See the comparison here.

Regards,

Ray