"OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich)

Active Member

Thank you for your kindness.

However, my environment consists of CentOS 7, Alfresco 5.2, and OCRmyPDF (Docker).

The scripts you posted don't match my environment.

As I am very new to Docker, I don't know how to change the scripts.

Active Member

Comparing pdfsandwich with OCRmyPDF, pdfsandwich's character recognition quality for Japanese is better than OCRmyPDF's.

So I will focus on using pdfsandwich.

Thank you very much for your help.

Alfresco Employee

Did you test with these instructions?

https://github.com/keensoft/alfresco-simple-ocr/blob/master/docker/pdfsandwich-1.6-centos-7/Dockerfi...

I don't know if they still work with the latest CentOS releases, but they can be a starting point.

Software Engineer in Alfresco Search Team.
Senior Member

I tried this solution. My deployment:

Alfresco 6.1.2-ga / Share 6.1.0

jbarlow83/ocrmypdf:v8.2.3 or v7.0.0

api-explorer-6.1.0-ea.war or 6.0.7-ga

And I got "Failed to copy".

The file /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468.pdf exists, but /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf does not.

I think I should change

INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output

but I don't understand how. The "ocrmypdf" container doesn't contain these directories.

Log:

alfresco_1 | Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
alfresco_1 | at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
alfresco_1 | at java.base/java.lang.Thread.run(Thread.java:834)
alfresco_1 | Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
alfresco_1 | ... 10 more
alfresco_1 | Caused by: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:491)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:83)
alfresco_1 | ... 11 more
alfresco_1 | Caused by: java.io.FileNotFoundException: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf (No such file or directory)
alfresco_1 | at java.base/java.io.FileInputStream.open0(Native Method)
alfresco_1 | at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
alfresco_1 | at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:485)
alfresco_1 | ... 12 more
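
The FileNotFoundException at the bottom of the trace means the _ocr.pdf file was never written where Alfresco expects it. One way to test the guess about INPUT_DIR and OUTPUT_DIR is to check whether those directories exist inside each container. This is only a minimal sketch, assuming bash is available in the container; the function name report_missing_dirs is made up for illustration, and the paths are the ones from the script settings quoted above.

```shell
#!/bin/bash

# Hypothetical helper: print every given path that is not an existing directory.
# Returns 0 when all directories exist, 1 otherwise.
report_missing_dirs() {
    local status=0
    local dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            status=1
        fi
    done
    return "$status"
}

# Paths taken from INPUT_DIR and OUTPUT_DIR in the post above.
report_missing_dirs /ocr_input /ocr_output || echo "create or mount the directories listed above"
```

Running it in a shell inside each container (for example one opened with `docker-compose exec alfresco bash`, where the service name is assumed from the `alfresco_1` log prefix) should show on which side the directories are absent.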

Senior Member

So, to make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume in both containers.

The only problem is that asynchronous mode for the rule gives me an error, so I turned it off.

Angel, thanks!

docker-compose.yml

...
services:
   alfresco:
      ...
      volumes:
         - ocr:/ocr
      ...

   ocrmypdf:
      ...
      volumes:
         - ocr:/ocr
   ...
volumes:
   ...
   ocr:
      driver: local
...

bin/ocrmypdf.sh

(and remove the {} from $OUTPUT_FILE_PARAM in the command that copies the output file)

#!/bin/bash

# shared directory, mounted as a volume in both the alfresco and ocrmypdf containers
INPUT_DIR=/ocr
OUTPUT_DIR=/ocr

# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"

# identify parameters, input and output file:
# the last two arguments are the input and output paths, the rest are ocrmypdf options
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}

INPUT_FILE_PARAM="${array[$len-2]}"
OUTPUT_FILE_PARAM="${array[$len-1]}"

# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")

# SSH parameters (the copy is a plain cp because the directory is a shared volume)
SCP=cp
SSH=ssh
USER=root

# copy original pdf to the shared directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"

# execute ocrmypdf program
$SSH "$USER@$OCRMYPDF_SERVER" "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"

# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"

# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
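
The argument handling in the script is easy to get wrong, so it can help to try it in isolation. The sketch below assumes, as the script does, that the last two arguments are the input and output paths and that everything before them is an option for ocrmypdf. The function name split_ocr_args and the sample arguments are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: split "[options...] input output" the way the wrapper does.
split_ocr_args() {
    if [ "$#" -lt 2 ]; then
        echo "usage: split_ocr_args [options...] input output" >&2
        return 1
    fi
    local args=( "$@" )
    local len=${#args[@]}
    # Everything except the last two arguments is passed through as options.
    echo "options: ${args[*]:0:len-2}"
    # basename strips the directory, as the wrapper does before building the remote paths.
    echo "input: $(basename "${args[len-2]}")"
    echo "output: $(basename "${args[len-1]}")"
}

# Sample call; the option and the file names are illustrative only.
split_ocr_args -l eng /tmp/source.pdf /tmp/source_ocr.pdf
```

Reading the two paths by array index rather than by splitting a string on spaces keeps the split correct even if a path contains a space.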

Customer

With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any permission issues.

Thanks, Fedorow.

docker-compose.yml

...
services:
   alfresco:
      ...
      volumes:
         - ocr-input:/usr/local/tomcat/ocr_input
         - ocr-output:/usr/local/tomcat/ocr_output
      ...

   ocrmypdf:
      ...
      volumes:
         - ocr-input:/usr/local/tomcat/ocr_input
         - ocr-output:/usr/local/tomcat/ocr_output
   ...
volumes:
   ...
   ocr-input:
      external: true
   ocr-output:
      external: true
...

bin/ocrmypdf.sh

#!/bin/bash

# directories mounted as volumes in both the alfresco and ocrmypdf containers
INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output

# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"

# identify parameters, input and output file:
# the last two arguments are the input and output paths, the rest are ocrmypdf options
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}

INPUT_FILE_PARAM="${array[$len-2]}"
OUTPUT_FILE_PARAM="${array[$len-1]}"

# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")

# SSH parameters (the copy is a plain cp because the directories are shared volumes)
SCP=cp
SSH=ssh
USER=root

# copy original pdf to the shared input directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"

# execute ocrmypdf program
$SSH "$USER@$OCRMYPDF_SERVER" "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"

# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"

# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
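
If the OCR step fails, the script above still runs the final copy, and Alfresco only reports the FileNotFoundException seen earlier in the thread. A guard before the copy would make the failure visible in the script itself. This is a minimal sketch: copy_ocr_result is a hypothetical helper, not part of alfresco-simple-ocr, and the demo paths are illustrative.

```shell
#!/bin/bash

# Hypothetical helper: copy the OCR result back only if it exists and is not empty.
copy_ocr_result() {
    local result=$1
    local destination=$2
    if [ ! -s "$result" ]; then
        echo "OCR output not found or empty: $result" >&2
        return 1
    fi
    cp "$result" "$destination"
}

# Demo with temporary files.
workdir=$(mktemp -d)
printf 'dummy' > "$workdir/result.pdf"
copy_ocr_result "$workdir/result.pdf" "$workdir/copied.pdf" && echo "copied"
copy_ocr_result "$workdir/absent.pdf" "$workdir/copied2.pdf" || echo "skipped"
rm -rf "$workdir"
```

In the wrapper, the last copy could then be written as `copy_ocr_result "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM" || exit 1`. How Alfresco reports a non-zero exit code from the script is not confirmed anywhere in this thread.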

After the above changes, I was able to run OCR successfully with Alfresco 6.1.

As we run our Alfresco instance on Kubernetes with a Helm deployment, I need to configure the volumes in the values.yaml file, but I am not sure how. Does anyone have an idea how to make a similar configuration in Kubernetes?

Any help appreciated.

Community Manager

Hi @SriramG,

Thanks for updating us on how you resolved your issue - really helpful. 

Maybe start a new thread for your question about configuring volumes?

Cheers, 

Digital Community Manager, Alfresco Software.