Not able to index content of large pdfs in databas...

jbrasil · 2 Oct 2020

Hey guys!
I am unable to index large pdf files.

Version: Alfresco Community 6.1.1
Ubuntu Linux 18.04

Here is the error message from catalina.out:

2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB

I read the documentation at:
https://docs.alfresco.com/6.1/references/dev-extension-points-content-transformer.html

I added the following to alfresco-global.properties:

content.transformer.PdfBox.priority = 110
content.transformer.PdfBox.extensions.pdf.txt.priority = 50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes = 25600

However, it still didn't work.
Can you help please?
With best regards,

angelborroy · 2 Oct 2020

Cross-posting: https://hub.alfresco.com/t5/alfresco-content-services-forum/increase-max-file-size-that-solr-indexes...

Hyland Developer Evangelist

jbrasil · 2 Oct 2020

Hi angelborroy,
I had already seen that documentation.
I applied the parameters below in alfresco-global.properties and restarted the Alfresco service.
It still didn't work.
Can you help?
Thanks a lot.

content.transformer.default.timeoutMs=180000
content.transformer.default.txt.*.maxSourceSizeKBytes=1048576
content.transformer.JodConverter.maxSourceSizeKBytes=102400

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG

content.metadataExtracter.pdf.maxDocumentSizeMB=1000
content.metadataExtracter.default.timeoutMs=3625000

content.transformer.PdfBox.priority=110
content.transformer.PdfBox.extensions.pdf.txt.priority=50
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=25600

content.transformer.json2html.priority=30
content.transformer.json2html.extensions.json.html.supported=true
content.transformer.json2html.extensions.json.html.priority=30

afaust · 2 Oct 2020

Well, not really cross-posting as the OP is different. But the answer in the other thread is definitely spot on for a similar issue with transformers. What is not mentioned in the other thread is that the transformer config is also documented.

But in this case we are talking about metadata extractors, and these have separately configured limits. In fact, the PdfBox extractor is about the only one that has a configured limit via the global property content.metadataExtracter.pdf.maxDocumentSizeMB
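
As a rough sketch of what that looks like in alfresco-global.properties: the property name is the one mentioned above, while the value of 100 is only an illustrative assumption and should be chosen to cover your largest PDFs.

```properties
# Limit for the PdfBox metadata extractor, separate from the
# content.transformer.* limits. The value 100 is a hypothetical example.
content.metadataExtracter.pdf.maxDocumentSizeMB=100
```

The warning in catalina.out reports the limit in effect ("Max doc size exceeded 10.0 MB"). If it still shows 10.0 MB after a restart, the property is probably not being picked up from the file that was edited.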
