OCR a scanned file and retrieve the metadata

Active Member II



I have thousands of invoices that I need to scan, OCR (with close to 100% recognition), and extract metadata from (Partner, Invoice Number, Amount, Units, Currency, ...). All of this has to happen in Alfresco.

Based on the extracted metadata, I need to perform some operations on the invoices (move them to the appropriate folders, start workflows on them, ...).

As a first approach:

- For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).

- For retrieving the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
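
For reference, the text-search step described above could look roughly like the sketch below. The function name `extractInvoiceFields`, the field labels ("Invoice No", "Total amount") and the currency list are all assumptions made for illustration; real invoices will need their own patterns, probably one set per partner layout. It uses plain `var`/`function` syntax so that it should also run in older JavaScript engines.

```javascript
// Hypothetical helper: pull a few invoice fields out of OCRed plain text.
// The labels and patterns below are assumptions, not a general solution.
function extractInvoiceFields(text) {
  // Collapse runs of spaces/tabs, which OCR output often contains.
  var clean = String(text).replace(/[ \t]+/g, " ");

  var patterns = {
    // "Invoice No: X", "Invoice Number X", "Invoice # X"
    invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*[:.]?\s*([A-Z0-9][A-Z0-9\/-]*)/i,
    // "Total: 12", "Total amount: 1,250.00"
    amount: /total\s*(?:amount)?\s*[:.]?\s*([0-9][0-9.,]*[0-9]|[0-9])/i,
    // First ISO-style currency code found anywhere in the text
    currency: /\b(EUR|USD|GBP)\b/
  };

  var result = {};
  for (var key in patterns) {
    var match = patterns[key].exec(clean);
    // null marks a field that was not found, so the caller can flag it for review.
    result[key] = match ? match[1] : null;
  }
  return result;
}

// Example call with a made-up OCR result:
var fields = extractInvoiceFields(
  "ACME Ltd\nInvoice No: INV-2017/042\nTotal   amount: 1,250.00 EUR\n"
);
```

In a repository script the input would presumably be the `document.content` of the plain text file. Fields that come back as `null` are candidates for manual review rather than automatic filing.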

So my questions are:

- How can I make the OCR results more accurate?

- How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it a poor fit for this kind of processing?


I'm using pdfsandwich, and my alfresco-global.properties contains:


ocr.extra.commands=-verbose -lang eng
3 Replies
Alfresco Employee

Re: OCR a scanned file and retrieve the metadata

Switch from pdfsandwich to ocrmypdf.


ocr.extra.commands=--verbose 1 --force-ocr -l spa+eng+fra

This will produce more accurate results.

Hyland Developer Evangelist
Active Member II

Re: OCR a scanned file and retrieve the metadata

Indeed, OCRmyPDF gives more accurate results.

Concerning my second question, do you have any idea how I can extract data from the OCRed PDF file based on its position in the document? For example, retrieving the invoice number, the price, and so on. I'm really stuck and don't know where to start. I've been googling a lot and couldn't come up with a free solution for doing this from Alfresco.
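
To make the question concrete, here is a minimal sketch of what position-based extraction could look like once word coordinates are available. It assumes the OCR output has already been turned into a list of word boxes `{ text, x0, y0, x1, y1 }` (for example by parsing hOCR or `pdftotext -bbox` output; that parsing step is not shown). The function name `textInRegion` and the coordinates are made up for illustration, and the region is assumed to cover a single line of text.

```javascript
// Hypothetical helper: return the text of all words whose center lies
// inside a rectangular region of the page.
// `words`  : array of { text, x0, y0, x1, y1 } boxes (assumed input format)
// `region` : { x0, y0, x1, y1 } in the same coordinate system
function textInRegion(words, region) {
  var hits = words.filter(function (w) {
    var cx = (w.x0 + w.x1) / 2;
    var cy = (w.y0 + w.y1) / 2;
    return cx >= region.x0 && cx <= region.x1 &&
           cy >= region.y0 && cy <= region.y1;
  });
  // The region is assumed to span one line, so left-to-right order is enough.
  hits.sort(function (a, b) { return a.x0 - b.x0; });
  return hits.map(function (w) { return w.text; }).join(" ");
}

// Made-up word boxes for the top of an invoice page:
var sampleWords = [
  { text: "ACME",    x0: 40,  y0: 50, x1: 90,  y1: 62 },
  { text: "Invoice", x0: 400, y0: 50, x1: 450, y1: 62 },
  { text: "No:",     x0: 455, y0: 50, x1: 480, y1: 62 },
  { text: "INV-042", x0: 490, y0: 50, x1: 550, y1: 62 }
];
// Region where this (hypothetical) layout prints the invoice number:
var invoiceNo = textInRegion(sampleWords, { x0: 485, y0: 40, x1: 600, y1: 70 });
```

Fixed regions like this only work when every invoice from a given partner uses the same layout, so one region definition per partner template would probably be needed.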