Missing embedded metadata when uploading PDF

albertomartin · 18 Nov 2018

Hello, I'm trying to automate metadata extraction in Alfresco Community 5.2 so that my custom models get populated automatically when documents are uploaded. My PDFs have custom embedded metadata fields (see image 1). However, when I import these PDFs into Alfresco, the Alfresco debug output shows that not all the metadata tags are detected (for example, the prism: and jav tags are missing; see image 2). Do you know if this is the expected behaviour? If so, what do I have to change so that Alfresco detects these custom metadata fields? Thank you very much in advance for your help.

jpotts · 19 Nov 2018

Check the docs:

Metadata Extractors | Alfresco Documentation

By default, the metadata extraction grabs the author, title, subject, and created date. If you want anything else, you'll have to tweak the metadata extractor. Because there is already an extractor that knows how to pull fields from PDFs, you should not have to write your own from scratch, but you could if you needed to.

I think you'll just need to map the fields to actual properties in your model. The docs are pretty thorough on this topic and there are a number of other pages around the net that discuss customizing metadata extraction.

Jeff Potts
https://www.metaversant.com | https://ecmarchitect.com

albertomartin · 19 Nov 2018

Hello Jeff, first of all, thank you very much for your response.

I'm sorry, I see now that I didn't make myself clear. I read that page of the documentation carefully. I'm writing because I think that while following the instructions in the documentation, I am experiencing a behaviour that I haven't seen discussed in said documentation, or any other document on the web that I could find. I understand that by default, only some fields are mapped, so I wanted to map the fields I need. First, of course, I created a new model that contains a custom type with the fields I needed (for example: DOI, volume, issn), and created a rule in the folder so that any document added to that folder would be specialized to that type.
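A model of this kind might look roughly like the sketch below. It is an illustration only: the model name prism:articleModel and the type name prism:article are hypothetical placeholders, the d:text data types are an assumption, and the namespace URI and property names are the ones used in the mapping later in this post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical content model sketch; model and type names are placeholders -->
<model name="prism:articleModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
  <imports>
    <!-- Standard Alfresco dictionary and content model namespaces -->
    <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d"/>
    <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm"/>
  </imports>
  <namespaces>
    <!-- Same URI and prefix as in the extractor mapping -->
    <namespace uri="http://prismstandard.org/namespaces/basic/2.0" prefix="prism"/>
  </namespaces>
  <types>
    <type name="prism:article">
      <title>Journal article</title>
      <parent>cm:content</parent>
      <properties>
        <!-- d:text is assumed for all three fields -->
        <property name="prism:doi"><type>d:text</type></property>
        <property name="prism:volume"><type>d:text</type></property>
        <property name="prism:issn"><type>d:text</type></property>
      </properties>
    </type>
  </types>
</model>
```

With a model like this deployed, the folder rule described above would specialise uploaded documents to the placeholder type prism:article, so that the extractor has properties to write into.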

Then, I needed to create a new mapping, but for that, I first needed to know the names of the properties according to Alfresco. To do this, I set the following in log4j.properties:

log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=debug

With this, after uploading a document that contains the metadata I need, I could check the names of the properties I should use in the mapping.

This is where I found my problem. In the Alfresco log file, when I upload one of these documents, not all the metadata that is available in the PDF (see image 1 in first post of the thread) appears as a raw property. For example:

Raw Properties: {date=2018-08-13T08:56:21Z, pdfDFVersion=1.6, xmp:CreatorTool=Springer, Keywords=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, subject=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, pdfaDFVersion=A-2b, dc:creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, description=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, dcterms:created=2018-06-26T11:18:02Z, Last-Modified=2018-08-13T08:56:21Z, dcterms:modified=2018-08-13T08:56:21Z, dc:format=application/pdf; version=1.6, application/pdf; version="A-2b", title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, Last-Save-Date=2018-08-13T08:56:21Z, CrossMarkDomains[1]=springer.com, meta:save-date=2018-08-13T08:56:21Z, dc:title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, pdf:encrypted=false, modified=2018-08-13T08:56:21Z, cp:subject=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, robots=noindex, Content-Type=application/pdf, TIKA_PARSER_PARSE_SHAPES=false, creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, pdfaid:conformance=B, comments=null, meta:author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, dc:subject=[Ljava.lang.String;@91aba4, meta:creation-date=2018-06-26T11:18:02Z, created=2018-06-26T11:18:02Z, author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, xmpTPg:NPages=14, Creation-Date=2018-06-26T11:18:02Z, pdfaidart=2, CrossMarkDomains[2]=springerlink.com, meta:keyword=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, Author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, producer=Acrobat Distiller 10.1.8 (Windows), CrossmarkDomainExclusive=true, CrossmarkMajorVersionDate=2010-04-23, 
doi=10.1007/s11192-018-2820-9}

My main question is, why is Alfresco not detecting all available metadata in the PDF as raw properties?

I tried changing the mapping in the custom-repository-context.xml file anyway, guessing the names of the properties that don't appear in the list of raw properties. I tried mapping the DOI (which is available in the raw properties), as well as the volume and the ISSN (which are not available as raw properties):

<bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"
      parent="baseMetadataExtracter">
  <property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" />
  <!-- Keep the default mappings and add the custom ones below -->
  <property name="inheritDefaultMapping">
    <value>true</value>
  </property>
  <!-- Each key is a raw property name; each value is the model property it maps to -->
  <property name="mappingProperties">
    <props>
      <prop key="namespace.prefix.prism">http://prismstandard.org/namespaces/basic/2.0</prop>
      <!-- doi is listed in the raw properties; prism:volume and issn are guessed names -->
      <prop key="doi">prism:doi</prop>
      <prop key="prism:volume">prism:volume</prop>
      <prop key="issn">prism:issn</prop>
    </props>
  </property>
</bean>

After uploading another document with this configuration in place, as I expected and feared, only the DOI was correctly extracted.

Any ideas as to why some metadata from the PDF is not being detected by Alfresco?

Thank you very much for your help in advance.