Not able to index content of large pdfs in database mysql

jbrasil
Active Member II


Hey guys!
I am unable to index large pdf files.

Version: Alfresco Community 6.1.1
Ubuntu Linux 18.04


Here is the error message from catalina.out:

2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB

I read the documentation at:
https://docs.alfresco.com/6.1/references/dev-extension-points-content-transformer.html

I added the following to alfresco-global.properties:

content.transformer.PdfBox.priority = 110
content.transformer.PdfBox.extensions.pdf.txt.priority = 50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes = 25600

However, it still didn't work.
Can you help please?
With best regards,

3 Replies
angelborroy
Alfresco Employee

Re: Not able to index content of large pdfs in database mysql

jbrasil
Active Member II

Re: Not able to index content of large pdfs in database mysql

Hi angelborroy,
I had already seen that documentation.
I applied the parameters below in alfresco-global.properties
and restarted the Alfresco service.
It still didn't work.
Can you help?
Thanks a lot.

content.transformer.default.timeoutMs=180000
content.transformer.default.txt.*.maxSourceSizeKBytes=1048576
content.transformer.JodConverter.maxSourceSizeKBytes=102400

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG

content.metadataExtracter.pdf.maxDocumentSizeMB=1000
content.metadataExtracter.default.timeoutMs=3625000

content.transformer.PdfBox.priority=110
content.transformer.PdfBox.extensions.pdf.txt.priority=50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=25600

content.transformer.json2html.priority=30
content.transformer.json2html.extensions.json.html.supported=true
content.transformer.json2html.extensions.json.html.priority=30

afaust
Master

Re: Not able to index content of large pdfs in database mysql

Well, not really cross-posting as the OP is different. But the answer in the other thread is definitely spot on for a similar issue with transformers. What is not mentioned in the other thread is that the transformer config is also documented.

But in this case we are talking about metadata extractors, and these have separately configured limits. In fact, the PdfBox extractor is about the only one that has a configured limit, set via the global property content.metadataExtracter.pdf.maxDocumentSizeMB.
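
As a minimal sketch of what that looks like in alfresco-global.properties: the property name is the one mentioned above, while the value of 50 is only an illustrative assumption (the log in the first post suggests the effective limit was 10 MB). Choose a value that covers your largest PDFs.

```properties
# Raise the size limit for PDF metadata extraction.
# 50 is an assumed example value, not a recommendation.
content.metadataExtracter.pdf.maxDocumentSizeMB=50
```

Since alfresco-global.properties is read at startup, the change only takes effect after Alfresco is restarted. If the "Max doc size exceeded" warning still reports the old limit afterwards, check that the edited file is the one the running instance actually loads.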