ocrmypdf error in Alfresco 6.1 on Linux

jbrasil · ‎14 Sep 2020

Hey guys,
OCR is not being generated from within the Alfresco platform.

See the logs below:

tail -f /opt/alfresco/tomcat/logs/catalina.out

command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)

root@pmituiutaba:/opt/alfresco/logs# gs --version
9.26

root@pmituiutaba:/opt/alfresco/logs# pip3 --version
pip 20.2.3 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)

root@pmituiutaba:/opt/alfresco/logs# tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX
Found SSE

root@pmituiutaba:/opt/alfresco/logs# ocrmypdf --version
6.1.2

root@pmituiutaba:/opt/alfresco/logs# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

cat alfresco.log | grep -i "Current version"
2020-09-15 00:04:09,348 INFO [org.alfresco.service.descriptor.DescriptorService] [localhost-startStop-1] Alfresco Content Services started (Community). Current version: 6.1.1 (r9d03d2fd-b168) schema 12,001. Originally installed version: 6.1.1 (r9d03d2fd-b168) schema 12,001.

cat /etc/sudoers
#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults env_reset
Defaults mail_badpass
Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"

# Host alias specification

# User alias specification

# Cmnd alias specification

# User privilege specification
root ALL=(ALL:ALL) ALL
alfresco ALL=(ALL) NOPASSWD: ALL

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

# See sudoers(5) for more information on "#include" directives:

#includedir /etc/sudoers.d

cat /opt/alfresco/tomcat/shared/classes/alfresco-global.properties | grep -i "ocr"
#### OCR mit OCRmyPDF
ocr.command=/opt/alfresco/scripts/ocrmypdf.sh
ocr.output.verbose=false
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l por+eng
ocr.server.os=linux

/opt/alfresco/modules/share# l
total 12K
-rw-r--r-- 1 root root 12K Sep 14 18:48 simple-ocr-share-2.3.1.jar

/opt/alfresco/modules/platform# l
total 28K
-rw-r--r-- 1 root root 28K Sep 14 18:48 simple-ocr-repo-2.3.1.jar

Can you help please?
Thanks a lot!

kaynezhang · ‎15 Sep 2020

You can test the command directly in the shell using an example file and see what happens:

/opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /***/***src.pdf  /***/***target.pdf
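
A run from an interactive login shell may not reproduce what Tomcat does, since a service usually runs as a different user and with a much smaller environment. Below is a minimal sketch for narrowing that gap. It assumes the service account is `alfresco` (the name in the sudoers file above; whether Tomcat really runs as that user is an assumption) and that a bare PATH and HOME are a fair stand-in for the service environment. `run_like_service` and `RUN_AS` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: run a command with a nearly empty environment, optionally as
# another user. The PATH and HOME values are assumptions, not taken from
# the actual Tomcat configuration.
run_like_service() {
    if [ -n "${RUN_AS:-}" ]; then
        # env -i drops the inherited environment; only PATH and HOME are set.
        sudo -u "$RUN_AS" env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin HOME=/tmp "$@"
    else
        env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin HOME=/tmp "$@"
    fi
}

# Example (PDF paths are placeholders):
# RUN_AS=alfresco run_like_service /opt/alfresco/scripts/ocrmypdf.sh \
#     --verbose 1 --force-ocr -l por+eng /tmp/src.pdf /tmp/target.pdf
```

If the command fails this way but succeeds in a normal shell, the difference is likely in the user or the environment rather than in the PDF itself.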

jbrasil · ‎15 Sep 2020

Hi kaynezhang,
Run from the Linux shell, it worked perfectly.
See the log:

./ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /home/jbrasil/teste33.pdf /home/jbrasil/teste33-v2.pdf
DEBUG - ocrmypdf 6.1.2
DEBUG - tesseract 4.0.0-beta.1
DEBUG - qpdf 8.0.2
DEBUG - PyMuPDF not installed
DEBUG - os.symlink(/home/jbrasil/teste33.pdf, /tmp/com.github.ocrmypdf.l22048pv/origin)

________________________________________
Tasks which will be run:

Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/origin, /tmp/com.github.ocrmypdf.l22048pv/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
DEBUG - Beginning qpdf repair...
DEBUG - Repair OK; beginning parse...
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
INFO - 1: page already has text! – rasterizing text and running OCR anyway
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'

WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
DEBUG - 1: convert
DEBUG - ['tesseract', '-l', 'por+eng', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png', '/tmp/com.github.ocrmypdf.l22048pv/000001.text', 'pdf', 'txt']
DEBUG - 1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.l22048pv/000001.rendered.pdf
/tmp/com.github.ocrmypdf.l22048pv/pdfa.ps
DEBUG - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion.
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 3.38× larger than the input file.
Possible reasons for this include:
The optional dependency PyMuPDF is not installed.
The argument --force-ocr was issued.

DEBUG - <PdfInfo('...'), page count=1>

l /home/jbrasil/
total 116K
-rw-r--r-- 1 root root 26K Sep 14 18:56 teste33.pdf
-rw-r--r-- 1 root root 86K Sep 15 09:01 teste33-v2.pdf

It only fails when run through the Alfresco platform.

Can you help?
Thank you.

kaynezhang · ‎15 Sep 2020

How did you install Alfresco? Did you install it manually or with Docker?

jbrasil · ‎15 Sep 2020

Hi kaynezhang,
I installed it using the loftuxab script, alfinstall.sh:

https://github.com/loftuxab/alfresco-ubuntu-install

I have always installed with this script and never had a problem. This is the first time this type of error has occurred.
Is there anything else that needs to be investigated?
Thanks a lot.

kaynezhang · ‎16 Sep 2020

Your installation is OK. The error suggests that the Python script can't load the Tesseract library correctly. But you can run the command successfully directly in the shell, which is very strange.
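
One way to narrow down a "works in the shell, fails from the service" difference is to record the user, working directory and environment in both contexts and compare them. This is a rough sketch: `snapshot_env` and the file paths are hypothetical, and nothing in it is specific to Alfresco or ocrmypdf.

```shell
#!/usr/bin/env bash
# Write the current user, working directory and sorted environment to a file.
snapshot_env() {
    {
        echo "user=$(id -un)"
        echo "cwd=$(pwd)"
        env | sort
    } > "$1"
}

# Usage idea (paths are placeholders):
#   1. In the interactive shell:               snapshot_env /tmp/env-shell.txt
#   2. Temporarily inside the wrapper script:  snapshot_env /tmp/env-service.txt
#   3. Compare the two:                        diff /tmp/env-shell.txt /tmp/env-service.txt
```

Differences in the user, PATH, HOME or locale variables would be the first things to look at.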

jbrasil · ‎16 Sep 2020

Hi kaynezhang,
Very strange. We have other servers with Alfresco running the same version.
See the script:

/opt/alfresco/scripts

cat ocrmypdf.sh
#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging/troubleshooting
sudo ocrmypdf "$@"
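
The traceback in catalina.out is cut off (it ends at `load_entry_po`), so the actual Python error is not visible there. A possible variant of the wrapper that also appends stderr and the exit code to a log file is sketched below. The helper name `run_logged` and the log path are assumptions. Note that stderr goes into the log file instead of back to Alfresco, so this is meant for temporary troubleshooting only.

```shell
#!/usr/bin/env bash
# Run a command, appending its stderr and exit code to a log file.
# The exit code of the command is returned unchanged.
run_logged() {
    local log="$1"
    shift
    local status=0
    echo "--- $(date) user=$(id -un) cmd: $*" >> "$log"
    "$@" 2>> "$log" || status=$?
    echo "--- exit code: $status" >> "$log"
    return "$status"
}

# Possible last line of ocrmypdf.sh (log path is a placeholder and must be
# writable by the user the service runs as):
# run_logged /tmp/ocrmypdf-wrapper.log sudo ocrmypdf "$@"
```

After triggering the OCR action once from Alfresco, the log file should contain the complete traceback, including the final exception line that the Alfresco log drops.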

In theory, it is correct.
I do not know what happened...
Thanks.