"OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

hisayo-s
Active Member

Hello,

I'm using Alfresco 5.2 Community Edition on CentOS 7.5, and Alfresco itself works well.

Now I'm trying to add an OCR function to Alfresco, so I installed alfresco-simple-ocr (simple-ocr-repo-2.3.1.jar) and pdfsandwich.

With pdfsandwich version 1.4 installed, the "Extract OCR" action triggered by the rule does run, and version 1.1 of the PDF file is created automatically. But all pages of the OCR'd PDF are blank: no images, no characters.

Next, I installed pdfsandwich version 1.6 instead of version 1.4 and tried again. This time the "Extract OCR" action does NOT seem to run, and version 1.1 of the PDF file is never created.

I also tried pdfsandwich versions 1.4, 1.5, 1.6 and 1.7 on the command line, and they all work well except version 1.7, which prints an error message. With versions 1.4, 1.5 and 1.6, the exit code is zero.
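
For anyone repeating this command-line check, here is a minimal sketch of a helper that runs a command and prints its exit code. The function name `run_and_report` and the file paths in the usage comment are assumptions for illustration; the pdfsandwich options are the ones from the configuration in this post.

```bash
#!/bin/bash

# Run the given command, print its exit code, and return that same code.
run_and_report() {
    local status=0
    "$@" || status=$?
    echo "exit code: $status"
    return "$status"
}

# Usage (hypothetical paths):
#   run_and_report /usr/local/bin/pdfsandwich -verbose -rgb -lang jpn \
#       -o /tmp/out.pdf /tmp/in.pdf
```

Running the same check as the OS user that owns the Alfresco process may be more representative than running it as a personal account.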

---------------------------------------------------------------------------

RULE DEFINITION

The attached file is a screenshot of the rule definition (in Japanese).

When items are created or enter this folder, OR when items are updated,

AND MIME-type is "Adobe PDF Document",

execute "Extract OCR".

- Continue on error: Checked

- Execute the rule in the background: Checked

---------------------------------------------------------------------------

/opt/alfresco-community/tomcat/shared/classes/alfresco-global.properties

 :

 :
### Alfresco Simple OCR ###
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -rgb -lang jpn
ocr.server.os=linux

---------------------------------------------------------------------------

Can anyone please help me?

14 Replies
douglascrp
Advanced II

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Hello.

Check the following link: FAQ · keensoft/alfresco-simple-ocr Wiki · GitHub

Maybe that can help you.

hisayo-s
Active Member

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Hi,

Thank you for your information.

I hadn't read the FAQ page, so I will read it carefully and try to improve my environment.

I hope for a good result.

Best regards,

hisayo-s
Active Member

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Hello,

My problem has been partly solved.

After reading the FAQ, I installed the two jar files (simple-ocr-repo-2.3.1.jar and simple-ocr-share-2.3.1.jar) instead of simple-ocr-repo.amp. (I had been using the amp file.)

After restarting Alfresco, the "Extract OCR" action sometimes works and sometimes does not.

When the action (conversion) succeeds, the "tesseract", ".convers.b+" and "unpaper" processes can be seen running in "top".

When the action (conversion) fails, these processes appear briefly and soon disappear.

File size and number of pages do not seem to be related to the failures.

I have no idea how to solve this problem.

If anyone knows the solution, please let me know!

------

- CentOS 7.5

- Alfresco 5.2 - community edition

- alfresco-simple-ocr 2.3.1

- pdfsandwich is 1.6 (*1)

- tesseract 3.04

(*1)

When I tested version 1.7 again, pdfsandwich printed the following message, as before.

> "Fatal error: exception Unix.Unix_error(Unix.ENOTEMPTY, "rmdir", "/tmp/pdfsandwich_tmp2d3ca3")"

Because of this, I run version 1.7 with the "-debug" option to avoid the error. (The temp files then have to be erased manually...)
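
Since "-debug" leaves the temporary directories behind, the manual cleanup could be scripted along these lines. This is only a sketch: the function name is hypothetical, and the `pdfsandwich_tmp*` pattern is taken from the directory name in the error message above. It should only be run while no conversion is in progress, because it cannot tell a finished run from a running one.

```bash
#!/bin/bash

# Remove leftover pdfsandwich working directories directly under a temp root.
# The name pattern comes from the error message (/tmp/pdfsandwich_tmp2d3ca3).
cleanup_pdfsandwich_tmp() {
    local tmp_root="${1:-/tmp}"
    find "$tmp_root" -mindepth 1 -maxdepth 1 -type d -name 'pdfsandwich_tmp*' \
        -exec rm -rf {} +
}

# Usage: cleanup_pdfsandwich_tmp        (defaults to /tmp)
```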

angelborroy
Expert

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Try using OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) instead of pdfsandwich.

Both pdfsandwich and OCRmyPDF have some issues on CentOS (they are developed for Ubuntu), but you can use the Docker Image for OCRmyPDF available at https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-the-docker-image

Software Engineer in Alfresco Search Team.
hisayo-s
Active Member

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Hello,

Thank you for your suggestion.

Unfortunately I'm not familiar with Docker.

I tried to install OCRmyPDF, but I couldn't.

So, I'm going to continue struggling to use pdfsandwich.

angelborroy
Expert

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

I don’t know if this still works, as I haven’t tested it recently, but you can find a reference for installing pdfsandwich on CentOS 7 at https://github.com/keensoft/alfresco-simple-ocr/blob/master/docker/pdfsandwich-1.6-centos-7/Dockerfi...

hisayo-s
Active Member

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

I tried to install "pdfsandwich" and "OCRmyPDF" (with Docker), but I couldn't set them up properly.

It is a pity, but I am giving up.

Thank you very much for your suggestions and information.

hisayo-s
Active Member

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Thank you for your information.

I tried to install "OCRmyPDF" using Docker and partly succeeded.

On the command line, in the directory where the input file exists, the conversion completes successfully.

However, from any other directory it does not work.

> ERROR - File not found - /home/hisayo-s/AAAAA.pdf

 

I'm giving up on this challenge.

Thanks a lot.
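
One possible explanation for the "File not found" error above (an assumption, not confirmed in this thread): a Docker container only sees host directories that are explicitly mounted into it, so a command that mounts the current directory works only when it is started from the directory holding the PDF. The sketch below prints a `docker run` command that mounts the directory of the input file instead. The `/data` mount point and the function name are hypothetical, the image tag is the one used later in this thread, and it is assumed that the image's entrypoint is `ocrmypdf` and that the container user may write to the mounted directory.

```bash
#!/bin/bash

# Print a docker command that mounts the input file's directory at /data,
# so the container sees the PDF regardless of the current directory.
# The output file is written next to the input file.
# ASSUMPTIONS: /data is an arbitrary mount point; the image entrypoint is ocrmypdf.
ocrmypdf_docker_cmd() {
    local input="$1" output="$2"
    local dir
    # Resolve the absolute directory that holds the input file.
    dir=$(cd "$(dirname "$input")" && pwd) || return 1
    printf 'docker run --rm -v %s:/data jbarlow83/ocrmypdf:v7.0.0 /data/%s /data/%s\n' \
        "$dir" "$(basename "$input")" "$(basename "$output")"
}

# Usage: ocrmypdf_docker_cmd /home/hisayo-s/AAAAA.pdf AAAAA-ocr.pdf
```

The function only prints the command so it can be inspected before being run; paths containing spaces would need extra quoting.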

angelborroy
Expert

Re: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

I'm currently using Docker Compose as the base for my installations, so I can only give you some tips on how to configure the whole thing with Docker.

OCRmyPDF Dockerfile

FROM jbarlow83/ocrmypdf:v7.0.0
USER root

RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:screencast' | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config

# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd

ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile

COPY assets/ssh/id_rsa.pub /root/.ssh/id_rsa.pub
COPY assets/ocr.sh /usr/bin/ocr.sh
RUN cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys \
&& chmod 0600 /root/.ssh/authorized_keys \
&& chmod +x /usr/bin/ocr.sh

EXPOSE 22
ENTRYPOINT ["/usr/sbin/sshd", "-D"]

assets/ocr.sh

#!/bin/bash

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

/usr/bin/ocrmypdf "$@"

Alfresco Dockerfile

FROM alfresco/alfresco-content-repository-community:6.0.7-ga

ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8

# Extra software
RUN set -x \
     && yum install -y \
     wget \
     unzip \
     && yum clean all

# Install api-explorer webapp for REST API
RUN set -x \
     && wget https://artifacts.alfresco.com/nexus/service/local/repositories/releases/content/org/alfresco/api-explorer/6.0.7-ga/api-explorer-6.0.7-ga.war -O /usr/local/tomcat/webapps/api-explorer.war

ARG TOMCAT_DIR=/usr/local/tomcat

RUN mkdir -p $TOMCAT_DIR/amps

# Install AOS
RUN set -x \
        && mkdir /tmp/aos \
        && wget --no-check-certificate https://download.alfresco.com/cloudfront/release/community/201806-GA-build-00113/alfresco-aos-module-distributionzip-1.2.0.zip \
        && unzip alfresco-aos-module-distributionzip-1.2.0.zip -d /tmp/aos \
        && mv /tmp/aos/extension/* /usr/local/tomcat/shared/classes/alfresco/extension \
        && mv /tmp/aos/alfresco-aos-module-1.2.0.amp amps \
        && mv /tmp/aos/aos-module-license.txt licenses \
        && mv /tmp/aos/_vti_bin.war /usr/local/tomcat/webapps \
        && rm -rf /tmp/aos alfresco-aos-module-distributionzip-1.2.0.zip

# SSH keys for ocrmypdf
COPY ssh/ /root/.ssh/

# Install OCR
COPY bin/ /opt/alfresco/bin/

# Configure SSH Client
RUN set -x && \
    chmod +x /opt/alfresco/bin/ocrmypdf.sh && \
    # Configure ssh   
    yum install -y openssh-clients && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config && \
    # Alfresco Image is using POSIX as Locale (!)
    sed -i '/^\s*SendEnv/ d' /etc/ssh/ssh_config && \
    chmod 600 /root/.ssh/id_rsa

# Install modules and addons
COPY modules/amps $TOMCAT_DIR/amps
COPY modules/jars $TOMCAT_DIR/webapps/alfresco/WEB-INF/lib

RUN java -jar $TOMCAT_DIR/alfresco-mmt/alfresco-mmt*.jar install \
            $TOMCAT_DIR/amps $TOMCAT_DIR/webapps/alfresco -directory -nobackup -force

# Add services configuration to alfresco-global.properties
COPY conf/alfresco-global.properties /usr/local/tomcat/shared/classes/alfresco-global.properties

EXPOSE 21 143 25 445 137/udp 138/udp 139

bin/ocrmypdf.sh

#!/bin/bash

INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output

# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"

# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
# everything except the last two arguments is passed through as options
ARGS="${array[*]:0:$len-2}"

# the last two arguments are the input and output files
INPUT_FILE_PARAM="${array[$len-2]}"
OUTPUT_FILE_PARAM="${array[$len-1]}"

# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")

# SSH parameters
# SCP is plain cp here, which presumes $INPUT_DIR and $OUTPUT_DIR are shared with the ocrmypdf container
SCP=cp
SSH=ssh
USER=root

# copy original pdf to ocrmypdf server
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"

# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"

# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"

# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
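
The wrapper treats the last two arguments as the input and output files and everything before them as options. Here is a minimal sketch of that splitting step on its own, written so that it also rejects calls with fewer than two arguments. The function and variable names (`split_ocr_args`, `OCR_OPTS`, `OCR_INPUT`, `OCR_OUTPUT`) are hypothetical and not part of the script above.

```bash
#!/bin/bash

# Split "[options...] input output" into three shell variables.
split_ocr_args() {
    if [ "$#" -lt 2 ]; then
        echo "usage: split_ocr_args [options...] input.pdf output.pdf" >&2
        return 1
    fi
    OCR_OPTS=""
    # Consume leading arguments until only input and output remain.
    while [ "$#" -gt 2 ]; do
        OCR_OPTS="${OCR_OPTS:+$OCR_OPTS }$1"
        shift
    done
    OCR_INPUT=$1
    OCR_OUTPUT=$2
}

# Usage: split_ocr_args -j1 --deskew /tmp/in.pdf /tmp/out.pdf
```

Shifting through the arguments keeps file names containing spaces intact, which splitting a joined string on spaces would not.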

conf/alfresco-global.properties

(Only OCRmyPDF section)

## simple-ocr
# https://github.com/keensoft/alfresco-simple-ocr
ocr.command=/opt/alfresco/bin/ocrmypdf.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=
# https://github.com/jbarlow83/OCRmyPDF/issues/124
ocr.extra.commands=-j1 --author keensoft --rotate-pages -l spa+eng+fra --deskew --clean --skip-text
ocr.server.os=linux
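
Judging from the wrapper script, which takes the last two arguments as input and output, the add-on appears to call the configured command in this order: command, extra options, input file, output prefix (if any), output file. That order is an inference and is not confirmed anywhere in this thread. Under that assumption, the following sketch prints the command line a given properties file would produce, which may help when reproducing a failing conversion by hand. The helper names `prop` and `preview_ocr_cmd` are hypothetical.

```bash
#!/bin/bash

# Print the value of one key from a simple key=value properties file.
# If the key appears several times, the last occurrence wins.
prop() {
    awk -F= -v k="$1" '$1 == k { sub(/^[^=]*=/, ""); v = $0 } END { print v }' "$2"
}

# Print the command line implied by the ocr.* settings.
# ASSUMPTION: the order is command, extra options, input, prefix, output.
preview_ocr_cmd() {
    local file="$1" input="$2" output="$3"
    local line extra prefix
    line=$(prop ocr.command "$file")
    extra=$(prop ocr.extra.commands "$file")
    prefix=$(prop ocr.output.file.prefix.command "$file")
    if [ -n "$extra" ]; then line="$line $extra"; fi
    line="$line $input"
    # An empty prefix (as in the OCRmyPDF setup) is simply skipped.
    if [ -n "$prefix" ]; then line="$line $prefix"; fi
    printf '%s\n' "$line $output"
}

# Usage: preview_ocr_cmd alfresco-global.properties /tmp/in.pdf /tmp/out.pdf
```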