Alfresco 7.1 Does not extract text from PDF/A of large files

cancel
Showing results for 
Search instead for 
Did you mean: 
RodrigoGomes
Member II

Alfresco 7.1 Does not extract text from PDF/A of large files

Jump to solution

I've been trying to get Alfresco to extract texts from PDF/A files larger than 25 MB and I haven't been successful. I've read countless pages of documentation, installed different versions on different operating systems. I tested several recommended settings, removed all the limits I could find. All of this without success.

  • Alfresco version 7.1
  • Search Services 2.0.2
  • Ubuntu 20.04 (compatible version according to documentation).
  • Installation was done through Ansible. Following all the documentation.

 

Alfresco can extract text from PDF files smaller than 25MB, but none larger than that. The logs do not return any problems regarding this.
I know this from the logs, here 2 PDFs of different sizes were sent. But only one extracted the text:

pdf-test.png

 

Goal: Be able to search for terms that exist in PDF/A files larger than 25MB.

 

Some settings I've tried:

 

### Time out configured for all extractor and all mimetypes
content.metadataExtracter.default.timeoutMs=3600000

### Maximum size of a document to process - configured for PdfBoxMetadataExtracter , pdf files
content.metadataExtracter.pdf.maxDocumentSizeMB=900

### Maximum number of concurrent extractions - configured for PdfBoxMetadataExtracter , pdf files
content.metadataExtracter.pdf.maxConcurrentExtractionsCount=15


content.transformer.default.timeoutMs=3600000
content.transformer.default.txt.*.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes=1073741824
content.transformer.JodConverter.extensions.doc.pdf.maxSourceSizeKBytes.use.asyncRule=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.index=1073741824
content.transformer.TikaAuto.timeoutMs.use.index=3600000
content.transformer.default.extensions.doc.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.timeoutMs=3600000
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes=1073741824
content.transformer.default.extensions.pdf.swf.maxSourceSizeKBytes.use.webpreview=1073741824
content.transformer.PdfBox.extensions.pdf.txt.maxSourceSizeKBytes=1073741824
content.transformer.TikaAuto.extensions.pdf.txt.maxSourceSizeKBytes=1073741824

 

I followed the instructions on this page:

 

export TRANSFORMER_ROUTES_ADDITIONAL_custom="/etc/opt/alfresco/content-services/classpath/alfresco/extension/transform/pipelines/custom-pipeline-file.json"

And I created the file custom-pipeline-file.json with the most varied configurations, here are some that I tried:

{
  "overrideSupported": [
    {
      "maxSourceSizeBytes": 1073741824
    }
  ]
}
{
  "transformers": [
    {
      "transformerName": "tika",
      "supportedSourceAndTargetList": [
        {"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain" },
        {"sourceMediaType": "application/pdf", "priority": 40, "targetMediaType": "text/plain" }
      ]
    }
  ]
}

And after the changes, I have restarted the service using the commands below:

sudo service alfresco-content restart
sudo service alfresco-search restart
sudo service alfresco-tengine-aio restart

After digging deeper into this I got a configuration in the file I created custom-pipeline-file.json which gave a different result. Here's the configuration:

{
     "transformers": [
        {
            "transformerName": "PdfBox",
            "supportedSourceAndTargetList": [
                 {"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 1073741824, "targetMediaType": "text/plain"}
            ],
            "transformOptions": [
                "pdfboxOptions"
            ]
        }]
}

And I get the error below in the logs:

2022-10-25 02:44:40.172 ERROR 41834 --- [nio-8090-exec-7] o.a.transformer.TransformController      : No transforms were able to handle the request

org.alfresco.transform.exceptions.TransformException: No transforms were able to handle the request
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:444) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.getTransformerName(AbstractTransformerController.java:421) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at org.alfresco.transformer.AbstractTransformerController.transform(AbstractTransformerController.java:172) ~[alfresco-transformer-base-2.5.3.jar!/:2.5.3]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:141) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:106) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1064) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:681) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.9.jar!/:5.3.9]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:764) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:96) ~[spring-boot-actuator-2.5.4.jar!/:2.5.4]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) ~[spring-web-5.3.9.jar!/:5.3.9]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:542) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1726) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.52.jar!/:na]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]

I will really appreciate it if someone can give me a way to solve this, as I've been trying for weeks.

1 Solution

Accepted Solutions
angelborroy
Alfresco Employee

Re: Alfresco 7.1 Does not extract text from PDF/A of large files

Jump to solution

I wasn't able to apply that configuration to Transform Core AIO 2.5 / 2.6

If you upgrade Transform Core AIO to 3.0.0, you can follow this instructions:

https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md

I've tested that locally with Docker Compose and it works, this is the section for Transform Core AIO configuration:

    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:3.0.0
        mem_limit: 2048m
        environment:
            TRANSFORM_CONFIG_FILE_PDFUPDATE: "/0200-increase-pdf-max-source.json"
            JAVA_OPTS: "
              -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
              -Dserver.tomcat.threads.max=12
              -Dserver.tomcat.threads.min=4
              -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
            "
        ports:
          - 8090:8090
        volumes:
            - ./0200-increase-pdf-max-source.json:/0200-increase-pdf-max-source.json

And the configuration for 0200-increase-pdf-max-source.json file is:

{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    },
    {
      "transformerName": "TikaAuto",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    }
  ]
}

Hope this helps.

Hyland Developer Evangelist

View solution in original post

1 Reply
angelborroy
Alfresco Employee

Re: Alfresco 7.1 Does not extract text from PDF/A of large files

Jump to solution

I wasn't able to apply that configuration to Transform Core AIO 2.5 / 2.6

If you upgrade Transform Core AIO to 3.0.0, you can follow this instructions:

https://github.com/Alfresco/alfresco-transform-core/blob/master/docs/transform-config.md

I've tested that locally with Docker Compose and it works, this is the section for Transform Core AIO configuration:

    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:3.0.0
        mem_limit: 2048m
        environment:
            TRANSFORM_CONFIG_FILE_PDFUPDATE: "/0200-increase-pdf-max-source.json"
            JAVA_OPTS: "
              -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
              -Dserver.tomcat.threads.max=12
              -Dserver.tomcat.threads.min=4
              -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
            "
        ports:
          - 8090:8090
        volumes:
            - ./0200-increase-pdf-max-source.json:/0200-increase-pdf-max-source.json

And the configuration for 0200-increase-pdf-max-source.json file is:

{
  "overrideSupported": [
    {
      "transformerName": "PdfBox",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    },
    {
      "transformerName": "TikaAuto",
      "sourceMediaType": "application/pdf",
      "targetMediaType": "text/plain",
      "maxSourceSizeBytes": -1
    }
  ]
}

Hope this helps.

Hyland Developer Evangelist