Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

jpbuttet
Member II


I run a dockerized Alfresco with the following components:

 

  • Alfresco 5.2.f & api-explorer 5.2.0
  • Share 5.2.e
  • Nginx (reverse proxy on port 143)
  • Postgres 9.4
  • LibreOffice 5.1.2
  • Solr6 (alfresco-search-services-1.0.0)
A Fujitsu N7100 network scanner is attached to Alfresco via CIFS.
The CIFS interface is configured in the alfresco-global.properties file:
 
### CIFS configuration###
cifs.enabled=true
cifs.serverName=alfresco
cifs.domain=WORKGROUP
cifs.broadcast=255.255.255.255
cifs.ipv6.enabled=false
cifs.hostannounce=true
cifs.tcpipSMB.port=1445
cifs.netBIOSSMB.sessionPort=1139
cifs.netBIOSSMB.namePort=1137
cifs.netBIOSSMB.datagramPort=1138
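
A quick way to confirm that the TCP ports from this configuration are reachable is a plain TCP connect. This is only a minimal sketch: the function names are hypothetical, and it assumes the host name passed in resolves from wherever the script runs. In a Docker setup the container ports may be published under different host ports, in which case the port numbers would need adjusting.

```python
import socket

# TCP ports taken from the cifs.* settings above. The name port (1137) and the
# datagram port (1138) are UDP services, so a TCP connect does not apply to them.
CIFS_TCP_PORTS = {"tcpipSMB": 1445, "netBIOSSMB session": 1139}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or host name not resolvable.
        return False


def check_cifs_ports(host: str, ports: dict = CIFS_TCP_PORTS) -> dict:
    """Map each port label to whether a TCP connect to it succeeded."""
    return {label: port_open(host, port) for label, port in ports.items()}
```

Calling `check_cifs_ports("alfresco")` would return a dictionary with one True/False entry per port; `"alfresco"` here is just the `cifs.serverName` value and is assumed to resolve.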
 
The scanner has an integrated OCR engine and sends PDF files to Alfresco.
 
Unfortunately, the PDF files that Alfresco receives from the scanner are not searchable and don't fire the metadata extractor (according to the debug log).
When the same file is downloaded to a Windows 10 workstation (via the Share interface) and then uploaded again into Alfresco (via the Share interface), the upload fires the metadata extractor and the file becomes searchable within Alfresco.
 
Here is my question: why don't the PDF files received by Alfresco via the CIFS interface fire Alfresco's metadata extractor?
 
The Alfresco log below shows the metadata extractor firing when the PDF file is uploaded via the Share interface. There is no such log when the same file is received via the CIFS interface:
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@b83b1f3
alfresco_1 | 2019-10-21 09:11:34,854 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Concurrent extractions : 0
alfresco_1 | 2019-10-21 09:11:34,854 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] New extraction accepted. Concurrent extractions : 1
alfresco_1 | 2019-10-21 09:11:34,891 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Extraction finalized. Remaining concurrent extraction : 0
alfresco_1 | 2019-10-21 09:11:34,891 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Converted extracted raw values to system values:
alfresco_1 | 2019-10-21 09:11:34,900 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Completed metadata extraction:
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@b83b1f3
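
To check quickly whether a given transfer triggered extraction, the container log can be filtered for the extracter's DEBUG lines. The following is a minimal sketch that assumes log lines in the `alfresco_1 | ...` format shown above; the function names are hypothetical and the pattern only covers the `AbstractMappingMetadataExtracter` logger.

```python
import re

# Matches lines such as:
# alfresco_1 | 2019-10-21 09:11:34,854 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Concurrent extractions : 0
LINE_RE = re.compile(
    r"^\S+\s*\|\s*"                                               # container prefix
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "  # log timestamp
    r"DEBUG \[content\.metadata\.AbstractMappingMetadataExtracter\] "
    r"\[(?P<thread>[^\]]+)\] "                                    # thread name
    r"(?P<message>.*)$"
)


def extraction_events(lines):
    """Return (timestamp, thread, message) for each extracter DEBUG line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            events.append(
                (match["timestamp"], match["thread"], match["message"].strip())
            )
    return events


def extraction_completed(lines):
    """True if at least one 'Completed metadata extraction' line is present."""
    return any(
        message.startswith("Completed metadata extraction")
        for _, _, message in extraction_events(lines)
    )
```

Run against the log captured right after a CIFS transfer and again after a Share upload, this should return no events in the first case and the lines above in the second, if the behaviour described in this post holds.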
 
Could someone help me solve this issue?
 
Best regards.
 
Jean-Pierre Buttet
 

 

5 Replies
Community Manager
Community Manager

Re: Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

Have you confirmed that the PDFs are not image PDFs? In other words, are they PDFs wrapped around an image file, or did the scanner run OCR and put the text into the PDF file? Have you also confirmed that, if you take the scanner output and upload it manually, it goes through as expected?

jpbuttet
Member II

Re: Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

Hello,

Thank you for your answer.

Yes, the scanner performs OCR and puts the text into the PDF. This is confirmed because, as I wrote, the very same file becomes searchable after simply downloading it to a workstation and then immediately uploading it to Alfresco through the Share interface.

It looks like the PDF with text content doesn't fire the metadata extracter when it is sent over the CIFS interface by the scanner, but everything is fine when the same file is uploaded to Alfresco via Share by the user...
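
To rule out that the download/upload round trip alters the file, the two copies can be compared by checksum. This is a minimal sketch, assuming both copies are available as local files; the function names and the paths in the usage note are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical bytes (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_content("scan_via_cifs.pdf", "scan_downloaded_via_share.pdf")` returning True would mean the two files are byte-identical, so any difference in behaviour would come from the upload path rather than the file itself.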

jpbuttet
Member II

Re: Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

I'm not sure it's related to the strange extracter behavior I reported, but I see the following debug logs during Alfresco startup:

alfresco_1 | 2019-10-21 15:24:41,219 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [localhost-startStop-1] Loaded mapping properties from resource: alfresco/metadata/TikaAutoMetadataExtracter.properties
alfresco_1 | 2019-10-21 15:24:41,222 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [localhost-startStop-1] No explicit embed mapping properties found at: alfresco/metadata/TikaAutoMetadataExtracter.embed.properties, assuming reverse of extract mapping
alfresco_1

Any ideas?

 

 

Community Manager
Community Manager

Re: Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

I'm still looking into this. Just to make completely sure - you do the exact same thing manually (same folder, same repository, same user, etc.) and it works?

jpbuttet
Member II

Re: Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

Hello,

 

Thanks for your reply. Yes.

I just checked again:

First step:

 

The PDF file is sent to Alfresco (to the folder "Numerisation") via CIFS --> no DEBUG metadata extractor log, and the PDF is not searchable.

The Fujitsu N7100 scanner logs into Alfresco as user "xxx".

 

Second step:

User "xxx" (a real user this time, not the scanner) is connected to Alfresco and performs the following actions:

From the folder "Numerisation", the user downloads the file to a Windows 10 workstation via Share (Firefox browser),

then immediately afterwards, the user uploads the same file to Alfresco via Share (Firefox browser).

 

---> This fires the metadata extracter according to the DEBUG log, and the PDF becomes searchable:

 

alfresco_1 | 2019-10-22 14:08:30,582 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Starting metadata extraction:
alfresco_1 | reader: ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@3839fcb4
alfresco_1 | 2019-10-22 14:08:30,584 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Concurrent extractions : 0
alfresco_1 | 2019-10-22 14:08:30,585 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] New extraction accepted. Concurrent extractions : 1
alfresco_1 | 2019-10-22 14:08:30,601 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Extraction finalized. Remaining concurrent extraction : 0
alfresco_1 | 2019-10-22 14:08:30,602 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Converted extracted raw values to system values:
alfresco_1 | Raw Properties: {date=2019-10-22T15:06:30Z, pdf:PDFVersion=1.3, TIKA_PARSER_PARSE_SHAPES=false, xmp:CreatorTool=N7100 1.0, comments=null, dc:subject=null, meta:creation-date=2019-10-22T15:06:30Z, created=2019-10-22T15:06:30Z, author=null, MetadataDate=D:20191022160630+01'00', xmpTPg:NPages=1, Creation-Date=2019-10-22T15:06:30Z, dcterms:created=2019-10-22T15:06:30Z, Last-Modified=2019-10-22T15:06:30Z, dcterms:modified=2019-10-22T15:06:30Z, dc:format=application/pdf; version=1.3, title=null, Last-Save-Date=2019-10-22T15:06:30Z, meta:save-date=2019-10-22T15:06:30Z, pdf:encrypted=false, producer=PFU PDF Library 1.0, modified=2019-10-22T15:06:30Z, Content-Type=application/pdf}
alfresco_1 | System Properties: {{http://www.alfresco.org/model/content/1.0}created=2019-10-22T15:06:30Z, {http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
alfresco_1 | 2019-10-22 14:08:30,603 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Extracted Metadata from ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | Found: {date=2019-10-22T15:06:30Z, pdf:PDFVersion=1.3, TIKA_PARSER_PARSE_SHAPES=false, xmp:CreatorTool=N7100 1.0, comments=null, dc:subject=null, meta:creation-date=2019-10-22T15:06:30Z, created=2019-10-22T15:06:30Z, author=null, MetadataDate=D:20191022160630+01'00', xmpTPg:NPages=1, Creation-Date=2019-10-22T15:06:30Z, dcterms:created=2019-10-22T15:06:30Z, Last-Modified=2019-10-22T15:06:30Z, dcterms:modified=2019-10-22T15:06:30Z, dc:format=application/pdf; version=1.3, title=null, Last-Save-Date=2019-10-22T15:06:30Z, meta:save-date=2019-10-22T15:06:30Z, pdf:encrypted=false, producer=PFU PDF Library 1.0, modified=2019-10-22T15:06:30Z, Content-Type=application/pdf}
alfresco_1 | Mapped and Accepted: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
alfresco_1 | 2019-10-22 14:08:30,605 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Completed metadata extraction:
alfresco_1 | reader: ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@3839fcb4
alfresco_1 | changed: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}

 

 

 

Thank you for your support. :-)