OCR Evaluation

tpage · ‎24 Sep 2021

As part of a recent hackathon I teamed up with Angel Borroy to connect Alfresco Community Edition with a couple of open source OCR applications. The project is available on GitHub. The two OCR applications chosen are OCRmyPDF (version 9.6.0+dfsg) and Apache Tika (version 1.24-full) (both based on Tesseract). OCRmyPDF also contains a number of other tools for cleaning up the PDF (in particular unpaper) and producing a searchable PDF that looks like the original document. Both tools allow for configuring Tesseract itself, but for these tests I've pretty much stuck to their out-of-the-box settings.

I chose to use the start of Alice's Adventures in Wonderland as the test text and ImageMagick to convert this into an image embedded in pdf files. ImageMagick features a large number of options allowing for image manipulation, to test how well the OCR tools coped.

To measure the accuracy of the OCR then I measured the Levenshtein distance between the input and output text. In theory this would be zero distance if the conversion was 100% accurate, but it can be particularly hard to accurately identify certain characters (e.g. whitespace). To get the text output from OCRmyPDF I used the --sidecar option.

Font Size

I standardised on an A4 PDF at 300ppi with DejaVu Sans. This produced text that was unreadable at 1pt, but could easily be seen at 2pt. However the two OCR applications did not manage as well as a human, but got to their peak performance at about 4pt text. OCRmyPDF performed slightly better than Tika.

Graph showing OCR accuracy against font size

The Levenshtein distance for a 1pt font was the total number of characters in the source text, and pretty much acts as a limit in all the tests.

Font

There are an almost unlimited number of fonts which we could test, but I decided to just compare different faces from the DejaVu family. There was very little difference in performance between the two applications in this test, and only a few characters difference between the various fonts.

Graph showing OCR accuracy against font face

Rotation

It's fairly common for scanned documents or text in photos to be slightly skewed. Here I tested three configurations - the default OCRmyPDF settings, OCRmyPDF with --deskew and Apache Tika. OCRmyPDF also has a --rotate-pages option for when some pages in a pdf are landscape or upside down, but this only works for multiples of ninety degrees. We can see that the deskew option made a huge difference and allowed reading text up to six degrees from the verticle.

Graph showing OCR accuracy against page rotation

One nice feature of OCRmyPDF was that it produced an output pdf that had the page rotated so that the text was the correct orientation.

Noise

I tested this by adding noise using the +noise Gaussian option from ImageMagick. I found that a single application of noise made little difference to the results, so tested against repeated applications. Although the noise itself is random, I used the same set of documents for each test The document with ten applications of noise looked like this:

Example text after ten applications of noise

OCRmyPDF provides a few useful options to improve the extraction of text from an image. The first --clean calls through to a tool called "unpaper" which tries to remove artifacts from scanning. The second --remove-background is intended for removing a background image from a document. The third option that I tested was --threshold which is an experimental option to convert every pixel to black or white before sending the image to Tesseract.

Although Gausian noise is easy to generate, it's not the same noise that is typically seen on scanned documents, so these results should be taken with a grain of salt. Overall Tika came out with better results than any of the OCRmyPDF options I tested. The clean option didn't seem to make any difference at all - the results were almost exactly the same as default settings for OCRmyPDF. The remove background option actually made things worse - presumably because the noise was applied to the foreground (text) and to the background. The threshold option improved the results significantly and even provided better accuracy than Tika for certain noise levels.

Graph showing OCR accuracy against noise level

ImageMagick provides various methods of generating noise, but all of them seemed to give similar results.

Blur

Using iterations of the option -blur 100 from ImageMagick I produced various blurry documents. With ten applications of noise the text looked like this:

Example text after ten blur applications

I tested the OCR fidelity using the same set of options as for the noise test. Overall OCRmyPDF performed better than Tika, and generally the default options gave a more accurate scan than anything else. For a certain amount of blur then the remove background option gave the best results. The clean option again gave the same results as the default settings.

Graph showing OCR accuracy against blurring

Conclusions

OCRmyPDF and Tika are both based on Tesseract, but their default settings performed best under different circumstances. While both tools provide the option to tinker with Tesseract's configration, OCRmyPDF provides easy access to some simpler settings that can improve fidelity of OCR. Additionally OCRmyPDF provides the ability to easily generate an output PDF containing text in the appropriate places. Both projects are under active development, so it's worth testing the latest versions for any improvements.

Finally, if you're interested in using OCR with Alfresco Community Edition then have a look at the Alfresco OCR Transformer on GitHub.

angelborroy · ‎24 Sep 2021

Nice work, @tpage

We got some other tool contributed by Community:

https://pypi.org/project/pypdfocr/

It would be nice to add this third one to the comparison!

MichKats · ‎28 Oct 2021

Hi @tpage you mention that you measure accuracy by computing the Levenshtein distance between the text from the original pdf and the text from the OCRed pdf. To obtain the text from the latter is obvious using the --sidecar flag but for the former, how do you obtain that text since the original pdf is not searchable and its the reason we do OCR?

tpage · ‎8 Feb 2023

@MichKats Incredibly late reply here - sorry! It was simple to measure the Levenshtein distance because I generated the pdf test file from a text document ("Alice in Wonderland"). I guess I made the assumption that the text to pdf conversion was accurate, but I don't have any reason to assume otherwise. Thanks for your interest in the project!

afjaber · ‎8 Feb 2023

@tpage Thanks for the effort put into this. We have using OCRmyPDF for many years and agree it has given us better results and control vs. alternatives we've tried. You've hit the nail on the head that the background noise and blur is still a challenge to overcome with certain kinds of hard copy scanned documents. In these cases, use of the available options does not produce the level of OCR text accuracy that a human can decipher with their own eyes. --remove-background does not help, as mentioned, and use of --threshold only helps to a certain degree but we do use it.

It would be interesting to see if you can perform this analysis somehow against hard copy scanned documents that have lots of noise (e.g., shaded tables) and/or blur. On a side note, I've found that if you print a document with background color/shading and take it through the copy machine again and again and again (i.e. taking a copy of the copy of the copy), then you can often produce the kind of hard copy original that's in line with the problem files.

tpage · ‎9 Feb 2023

Thanks @afjaber ! This was a fun investigation but I stopped short of producing scans of physical documents. That would definitely be an interesting extension and I really like the idea of using background colour and successive photocopies to produce noisier versions of the original.

OCR Evaluation

OCR Evaluation

We use cookies on this site to enhance your user experience