OCR a scanned file and retrieve the metadata

imanez1
Active Member II

OCR a scanned file and retrieve the metadata

Hello,

I have thousands of invoices to scan and OCR (with near-100% recognition), and I need to retrieve the required metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this has to happen in Alfresco.

Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply workflows, ...).

As a first approach:

- For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).

- To retrieve the metadata, I convert the OCRed PDF to a plain-text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
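
To make the second step concrete, here is a minimal sketch of searching the plain text for a few fields with regular expressions. The pattern table, the function name extractInvoiceFields and the sample text are all hypothetical and only for illustration; real invoices would need patterns tuned to each supplier's layout. Inside Alfresco, the text would come from document.content instead of the hard-coded sample.

```javascript
// Hypothetical patterns, one per metadata field. None of these come from
// Alfresco or the OCR action; they are assumptions about the invoice wording.
var FIELD_PATTERNS = {
  // "Invoice No: INV-2023/001", "Invoice Number 12345", "Invoice # 77"
  invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-\/]*)/i,
  // "Total: 1,234.50" (the \b keeps "Subtotal" from matching)
  amount: /\btotal\b\s*(?:amount)?\s*[:\-]?\s*([0-9]+(?:[.,][0-9]+)*)/i,
  // A short, assumed list of ISO currency codes
  currency: /\b(EUR|USD|GBP)\b/
};

// Returns an object with one entry per pattern; a field is null when the
// pattern does not match, which can be used to flag the invoice for review.
function extractInvoiceFields(text) {
  var result = {};
  for (var name in FIELD_PATTERNS) {
    if (FIELD_PATTERNS.hasOwnProperty(name)) {
      var match = FIELD_PATTERNS[name].exec(text);
      result[name] = match ? match[1] : null;
    }
  }
  return result;
}

// Sample text standing in for the OCR output of one invoice.
var sample = "ACME Ltd\nInvoice No: INV-2023/001\nSubtotal: 1,000.00\nTotal: 1,234.50 EUR\n";
var fields = extractInvoiceFields(sample);
```

With this kind of approach, a null field is a useful signal: it usually means either the OCR misread the label or the invoice uses a different layout, so the document can be routed to manual review rather than filed with wrong metadata.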

So my questions are:

- How can I make the OCR results more accurate?

- How should I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too crude for this kind of processing?

 

I'm using pdfsandwich, and my alfresco-global.properties is:

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
3 Replies
angelborroy
Expert

Re: OCR a scanned file and retrieve the metadata

Switch from pdfsandwich to ocrmypdf.

ocr.command=/usr/local/bin/ocrmypdf
ocr.output.verbose=true
ocr.output.file.prefix.command=

ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra
ocr.server.os=linux

This will produce more accurate results.

Software Engineer in Alfresco Search Team.
imanez1
Active Member II

Re: OCR a scanned file and retrieve the metadata

Indeed, OCRmyPDF gives more accurate results.

Concerning my second question, do you have any idea how I can extract data from the OCRed PDF file based on its position in the document? For example, retrieving the invoice number, the price, ... I'm really stuck and don't know where to start. I've been googling a lot and couldn't come up with a free solution for doing this from Alfresco.

jpotts
Advanced II

Re: OCR a scanned file and retrieve the metadata