Error ocrmypdf in Alfresco Linux version 6.1

Active Member

Hey guys,
OCR is not being generated from within the Alfresco platform.

See the logs below:

tail -f /opt/alfresco/tomcat/logs/catalina.out

command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)


root@pmituiutaba:/opt/alfresco/logs# gs --version
9.26

root@pmituiutaba:/opt/alfresco/logs# pip3 --version
pip 20.2.3 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)

root@pmituiutaba:/opt/alfresco/logs# tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX
Found SSE

root@pmituiutaba:/opt/alfresco/logs# ocrmypdf --version
6.1.2

root@pmituiutaba:/opt/alfresco/logs# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

cat alfresco.log | grep -i "Current version"
2020-09-15 00:04:09,348 INFO [org.alfresco.service.descriptor.DescriptorService] [localhost-startStop-1] Alfresco Content Services started (Community). Current version: 6.1.1 (r9d03d2fd-b168) schema 12,001. Originally installed version: 6.1.1 (r9d03d2fd-b168) schema 12,001.

cat /etc/sudoers
#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults env_reset
Defaults mail_badpass
Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"

# Host alias specification

# User alias specification

# Cmnd alias specification

# User privilege specification
root ALL=(ALL:ALL) ALL
alfresco ALL=(ALL) NOPASSWD: ALL

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL

# See sudoers(5) for more information on "#include" directives:

#includedir /etc/sudoers.d

cat /opt/alfresco/tomcat/shared/classes/alfresco-global.properties | grep -i "ocr"
#### OCR mit OCRmyPDF
ocr.command=/opt/alfresco/scripts/ocrmypdf.sh
ocr.output.verbose=false
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l por+eng
ocr.server.os=linux

/opt/alfresco/modules/share# l
total 12K
-rw-r--r-- 1 root root 12K Sep 14 18:48 simple-ocr-share-2.3.1.jar

/opt/alfresco/modules/platform# l
total 28K
-rw-r--r-- 1 root root 28K Sep 14 18:48 simple-ocr-repo-2.3.1.jar

Can you help please?
Thanks a lot!

6 Replies
Advanced

Re: Error ocrmypdf in Alfresco Linux version 6.1

You can test the command directly in the shell with an example file and see what happens:
/opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /***/***src.pdf  /***/***target.pdf 
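
A test from an interactive login shell may not behave like a call made by the Alfresco service, because the environment can differ. A minimal sketch of a closer reproduction follows; the helper name `run_clean`, the PATH value, and the idea that the service environment is this sparse are assumptions, not something confirmed for this installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with an almost empty environment
# (no HOME, no login-shell variables), to approximate a call made by a
# background service rather than an interactive shell.
# The PATH below is an assumption; adjust it to your system.
run_clean() {
  env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin "$@"
}

# Example, to be run as the account that owns Tomcat (assumed here to be
# "alfresco"); the PDF paths are placeholders:
#   run_clean /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr \
#     -l por+eng /tmp/src.pdf /tmp/target.pdf
#   echo "exit code: $?"
```

If the command succeeds in a normal shell but fails through `run_clean`, or fails only when started as the service account, the difference is probably in the environment or user rather than in ocrmypdf itself.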
Active Member

Re: Error ocrmypdf in Alfresco Linux version 6.1

Hi kaynezhang,
Running through the linux shell, it worked perfectly.
See the log:

./ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /home/jbrasil/teste33.pdf /home/jbrasil/teste33-v2.pdf
DEBUG - ocrmypdf 6.1.2
DEBUG - tesseract 4.0.0-beta.1
DEBUG - qpdf 8.0.2
DEBUG - PyMuPDF not installed
DEBUG - os.symlink(/home/jbrasil/teste33.pdf, /tmp/com.github.ocrmypdf.l22048pv/origin)

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/origin, /tmp/com.github.ocrmypdf.l22048pv/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
DEBUG - Beginning qpdf repair...
DEBUG - Repair OK; beginning parse...
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
INFO - 1: page already has text! – rasterizing text and running OCR anyway
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'


WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
DEBUG - 1: convert
DEBUG - ['tesseract', '-l', 'por+eng', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png', '/tmp/com.github.ocrmypdf.l22048pv/000001.text', 'pdf', 'txt']
DEBUG - 1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.l22048pv/000001.rendered.pdf
/tmp/com.github.ocrmypdf.l22048pv/pdfa.ps
DEBUG - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion.
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 3.38× larger than the input file.
Possible reasons for this include:
The optional dependency PyMuPDF is not installed.
The argument --force-ocr was issued.


DEBUG - <PdfInfo('...'), page count=1>

l /home/jbrasil/
total 116K
-rw-r--r-- 1 root root 26K Sep 14 18:56 teste33.pdf
-rw-r--r-- 1 root root 86K Sep 15 09:01 teste33-v2.pdf

It just doesn't generate through the Alfresco platform.

Can you help?
Thank you.

Advanced

Re: Error ocrmypdf in Alfresco Linux version 6.1

How did you install Alfresco? Did you install it manually or with Docker?

Active Member

Re: Error ocrmypdf in Alfresco Linux version 6.1

Hi kaynezhang,
I installed using the loftuxab script.
alfinstall.sh

https://github.com/loftuxab/alfresco-ubuntu-install

I have always installed with this script.
I never had a problem. This is the first time this type of error has occurred.
Anything else that needs to be investigated?

Thanks a lot.

Advanced

Re: Error ocrmypdf in Alfresco Linux version 6.1

Your installation looks OK. The error suggests that the Python script can't load the tesseract library correctly. But you can run the command successfully directly in the shell, which is very strange.

Active Member

Re: Error ocrmypdf in Alfresco Linux version 6.1

Hi kaynezhang,
Very strange. We have other servers with Alfresco running the same version.
See the script:

/opt/alfresco/scripts

cat ocrmypdf.sh
#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging / troubleshooting
sudo ocrmypdf "$@"

In theory, the script is correct.
I do not know what happened...
Thanks.
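
The Python traceback in catalina.out is cut off mid-line, so the actual exception is not visible there. One way to capture it might be a temporary debugging variant of the wrapper that writes the calling user, environment, and complete stderr to a file. This is only a sketch: `run_logged`, `OCR_CMD`, `OCR_LOG`, and the log path are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical debugging variant of ocrmypdf.sh.
# OCR_CMD mirrors the original wrapper (sudo ocrmypdf); OCR_LOG is an
# assumed path that the service account must be able to write to.
OCR_CMD=(sudo ocrmypdf)
OCR_LOG="${OCR_LOG:-/tmp/ocrmypdf-wrapper.log}"

run_logged() {
  {
    echo "=== $(date) user=$(id -un) HOME=${HOME:-unset}"
    echo "PATH=$PATH"
    echo "args: $*"
  } >> "$OCR_LOG"
  # stderr goes to the log file instead of back to Alfresco, so the
  # full traceback is kept while this variant is in use.
  "${OCR_CMD[@]}" "$@" 2>> "$OCR_LOG"
  local rc=$?
  echo "exit code: $rc" >> "$OCR_LOG"
  return "$rc"
}

# In the real wrapper, finish with:
#   run_logged "$@"
```

After triggering the OCR action once from Alfresco, the log file should contain the untruncated traceback alongside the user and PATH the command actually ran with, which can then be compared with the successful manual run.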