Not able to index content of large pdfs

hiten_rastogi1
Established Member

Hi All,

We are uploading PDF files of up to 200 MB to our DMS, but their content is not getting indexed.

After searching, we learned that by default only PDF files of up to 10 MB are indexed, so we overrode this property to roughly 1 GB with content.metadataExtracter.pdf.maxDocumentSizeMB=1000. We then deleted our old indexes and restarted the DMS, but it had no effect.

We also found out that the default timeout for the metadata extracter was 20 milliseconds, so we changed it to about 1 hour with content.metadataExtracter.default.timeoutMs=3625000, but there was still no change.

Please advise what else needs to be done to get the content indexed correctly.

Thanks

Hiten Rastogi

8 Replies
mehe
Senior Member II

Re: Not able to index content of large pdfs

Just a first question: are your documents PDFs containing extractable text, rather than scanned pages without OCR or files protected by restrictive PDF permissions?

hiten_rastogi1
Established Member

Re: Not able to index content of large pdfs

Hi Martin,

Yes, the PDFs contain readable text; they are not scanned ones.

Thanks

Hiten Rastogi

mehe
Senior Member II

Re: Not able to index content of large pdfs

Are there any errors in the Alfresco or Tomcat logs, e.g. Java heap space errors?

Maybe you can increase the transformation logging via log4j:

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG

log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG

hiten_rastogi1
Established Member

Re: Not able to index content of large pdfs

Hi Martin,

I enabled the logging and found the output below. Please help me interpret it.

log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG


2018-07-06 15:07:03,442 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Starting metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Concurrent extractions : 0
2018-07-06 15:07:03,443 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] New extraction accepted. Concurrent extractions : 1
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extraction finalized. Remaining concurrent extraction : 0
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Converted extracted raw values to system values:
Raw Properties: {pdf:PDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
System Properties: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,089 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Extracted Metadata from ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
Found: {pdf:PDFVersion=1.5, TIKA_PARSER_PARSE_SHAPES=false, comments=null, dc:subject=null, author=null, xmpTPg:NPages=84, dc:format=application/pdf; version=1.5, title=null, pdf:encrypted=false, Content-Type=application/pdf}
Mapped and Accepted: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
2018-07-06 15:07:05,090 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-apr-8080-exec-1] Completed metadata extraction:
reader: ContentAccessor[ contentUrl=store://2018/7/6/15/7/08761879-e49c-4fa8-95e3-c22f160074a5.bin, mimetype=application/pdf, size=41637989, encoding=UTF-8, locale=en_GB]
extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@7671e45b
changed: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG

2018-07-06 15:07:19,467 INFO [web.scripts.QuickShareStatus] [http-apr-8080-exec-1] Successfully retrieved quick share information from Alfresco.
2018-07-06 15:07:21,396 INFO [web.scripts.MimetypesQuery] [http-apr-8080-exec-8] Successfully retrieved mimetypes information from Alfresco.
2018-07-06 15:07:30,029 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 pdf txt Xerox Scan_19052018115315(1)-2.pdf 39.7 MB -- index -- SolrIndexer NO transformers
2018-07-06 15:07:30,037 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 workspace://SpacesStore/66aa186a-9dc9-44aa-8680-fad46a88105f
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called

Thanks

Hiten Rastogi

afaust
Master

Re: Not able to index content of large pdfs

You can see the problem in the log output. Indexing of the content has nothing to do with the metadata extracter, so increasing its limit did not have any impact on your problem. You need to increase the limits of the PDF => TXT transformers so they are not rejecting the PDF source document.

Check the documentation on content transformation limits and on content transformers (and renditions) for details on how to configure the Transformers subsystem.

The following lines in your log output show that transformers have a 25 MB source file limit and thus are not acting on a 200 MB PDF:

2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called
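
As a rough illustration of reading those lines, here is a minimal Python sketch (not part of Alfresco) that picks the rejecting transformers and their limits out of TransformerDebug output and prints candidate override lines. The function names are hypothetical, and the property key pattern content.transformer.<name>.extensions.pdf.txt.maxSourceSizeKBytes is an assumption recalled from Alfresco 5.x defaults; please verify it against the transformer documentation for your version before putting anything into alfresco-global.properties.

```python
import re

# Matches TransformerDebug rejection lines such as:
#   ... 33 --a) [50] PdfBox > 25 MB
LIMIT_LINE = re.compile(
    r"--[a-z]\)\s+\[\d+\]\s+(?P<name>\S+)\s+>\s+(?P<limit>[\d.]+)\s+(?P<unit>KB|MB|GB)"
)
KB_PER_UNIT = {"KB": 1, "MB": 1024, "GB": 1024 * 1024}


def rejected_transformers(log_text):
    """Return (transformer name, size limit in KB) for each rejection line."""
    return [
        (m["name"], int(float(m["limit"]) * KB_PER_UNIT[m["unit"]]))
        for m in LIMIT_LINE.finditer(log_text)
    ]


def suggested_overrides(log_text, new_limit_mb):
    """Build candidate property lines for every transformer that rejected the file.

    ASSUMPTION: the key pattern is recalled from Alfresco 5.x and may differ
    in other versions; check the documentation before using it.
    """
    new_limit_kb = new_limit_mb * 1024
    return [
        f"content.transformer.{name}.extensions.pdf.txt.maxSourceSizeKBytes={new_limit_kb}"
        for name, _ in rejected_transformers(log_text)
    ]


if __name__ == "__main__":
    sample = """\
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --a) [50] PdfBox > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 --b) [120] TikaAuto > 25 MB
2018-07-06 15:07:30,038 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-3] 33 Finished in 10 ms Transformer NOT called
"""
    # 250 MB is an arbitrary example value chosen to cover a 200 MB PDF.
    for line in suggested_overrides(sample, 250):
        print(line)
```

Both transformers listed in the log need a higher limit, since either one could perform the PDF to text conversion. A repository restart and a reindex of the affected documents are probably needed afterwards.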

hiten_rastogi1
Established Member

Re: Not able to index content of large pdfs

Thanks Axel,

It is working now.

mehe
Senior Member II

Re: Not able to index content of large pdfs

...don't forget to comment out the log4j debugging options again - this could be a bit noisy in production...

jbrasil
Active Member II

Re: Not able to index content of large pdfs

Hi hiten_rastogi1,
How are you?
What did you do to solve this problem?
I have the same situation.
See the catalina.out log:

2020-10-01 17:03:28,779 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-41] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB
2020-10-01 17:03:29,193 WARN [content.metadata.AbstractMappingMetadataExtracter] [http-nio-8080-exec-28] Metadata extraction rejected:
Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@758471b1
Reason: Max doc size exceeded 10.0 MB


Thanks a lot!