Simple Optical Character Recognition Using Pytesseract
We may be in a hurry to recognize the characters in an image and need the fastest possible result even if accuracy is not very high. If these characters are "in the wild," it is a relatively complex problem; if the characters are on a document image, it is easier, and venerable models such as Tesseract OCR can be used out of the box for a quick character recognition task. Tesseract was originally developed at Hewlett-Packard between 1985 and 1994, well before deep neural networks were practical. It was later released as open source, and its development has been sponsored by Google. Pytesseract is a Python wrapper for the Tesseract engine that greatly simplifies its use, both from the command line and inside scripts.
This demonstration uses a Google Colab notebook; configuring Tesseract OCR may be complex depending on your operating system. Colab runs on Linux, and if we did not know which system sits behind a notebook environment, we could always query it from Python:
import platform
print(platform.system())
print(platform.release())
print(platform.version())
Linux
5.4.188+
#1 SMP Sun Apr 24 10:03:06 PDT 2022
Installing the Tesseract engine and the Pytesseract wrapper is relatively simple on Linux:
!sudo apt install tesseract-ocr
!pip install pytesseract
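Before going further, it can be worth confirming that the engine is reachable from Python. The following is a minimal sketch using only the standard library; the helper name `tesseract_available` is our own, and it assumes the executable is called `tesseract` and sits on the PATH, which is what the apt package normally arranges.

```python
import shutil
import subprocess


def tesseract_available(binary="tesseract"):
    """Return True if an executable with this name is found on the PATH."""
    return shutil.which(binary) is not None


if tesseract_available():
    # Ask the engine for its version; some releases write it to stderr.
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    output = (result.stdout or result.stderr).strip()
    print(output.splitlines()[0] if output else "tesseract found, no version output")
else:
    print("tesseract was not found on the PATH")
```

Once pytesseract is imported, `pytesseract.get_tesseract_version()` offers a similar check through the wrapper itself.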
The Pillow image processing module preinstalled in the environment may be older than the version pytesseract requires; if pip upgrades it, we may need to restart the session after the installation so that the new version is loaded. In addition, we will need to import the following modules:
import pytesseract
from pytesseract import Output
import cv2
from google.colab.patches import cv2_imshow
import numpy as np
Note that cv2_imshow is the patch required to view images in Colab and is not generally needed in other environments. We import NumPy because the images are represented as numerical arrays. Any photo with text is valid for this demonstration; we are using a purchase invoice that has been scanned using a phone camera; use this image or your own:
When the image is uploaded to the Colab environment, we can load it and detect the text in just a few lines:
source = r'/content/facturamdiamarkt.jpg'
original = cv2.imread(source)
text = pytesseract.image_to_string(original)
print(text)
The result is neither totally wrong nor totally satisfactory:
Fragments from the document are readable; the parts that coincide with the focus point of the camera, and where the original document is not crumpled, are correctly recognized. In the rest of the document, text is detected, but the output is not readable. The first step to improving the result is understanding what Tesseract is detecting. Next, we can use the data output from the text recognition to display the regions of interest that Tesseract uses to find and recognize text:
d = pytesseract.image_to_data(original, output_type=Output.DICT)
boxes = len(d['level'])
print("Number of boxes: " + str(boxes))
box_img = original.copy()
for i in range(boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(box_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2_imshow(box_img)
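Besides the box coordinates, the dictionary returned by image_to_data carries a 'level', a 'conf', and a 'text' entry for each box, which can help separate boxes that were merely detected from words that were recognized with some confidence. The following is a minimal sketch under a few assumptions: the helper name `confident_words` and the threshold of 60 are our own choices, level 5 is taken to mean a word-level box, and the `sample` dictionary is a hand-made stand-in with the same keys, not real OCR output.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, left, top, width, height) for word-level boxes
    whose confidence is at least min_conf."""
    words = []
    for i, text in enumerate(data["text"]):
        # Depending on the pytesseract version, conf may be a str, int, or float.
        conf = float(data["conf"][i])
        is_word = data["level"][i] == 5  # assumption: 5 marks word-level boxes
        if is_word and text.strip() and conf >= min_conf:
            words.append(
                (text, data["left"][i], data["top"][i],
                 data["width"][i], data["height"][i])
            )
    return words


# Illustrative, hand-made data only; it mimics the structure of the real output.
sample = {
    "level": [1, 4, 5, 5, 5],
    "left": [0, 10, 10, 80, 150],
    "top": [0, 20, 20, 20, 22],
    "width": [400, 220, 60, 60, 80],
    "height": [600, 30, 30, 30, 28],
    "conf": ["-1", "-1", "96", "91", "23"],
    "text": ["", "", "TOTAL", "EUR", "l2,9S"],
}

for word in confident_words(sample):
    print(word)
```

In the notebook, calling `confident_words(d)` on the real dictionary and drawing only those boxes would highlight the regions where recognition is likely to be reliable.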
The new image shows where the model is looking for text. In this manner, we can determine if Tesseract is missing the text or failing to recognize detected text blocks:
The text in the warped area is not recognized as such. The model is looking for straight lines. Even if inclined, lines on warped paper seem to confuse it. Before we start to modify the image, we can set up a better visualization using matplotlib:
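from matplotlib import pyplot as plt

# shape is (height, width, channels); only the height is needed for slicing
img_height = box_img.shape[0]
plt.figure(figsize=(16, 9))
# Keep the upper third of the rows, with all columns and channels
img_section = box_img[:img_height // 3, :, :]
# OpenCV stores pixels as BGR, while matplotlib expects RGB
plt.imshow(cv2.cvtColor(img_section, cv2.COLOR_BGR2RGB));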
We can plot the upper third of the image and obtain a faster, more controllable view:
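The cropping relies on the fact that an image loaded with OpenCV is a NumPy array indexed as rows, columns, and channels. The following is a small sketch on a synthetic array, not the invoice photo, assuming a 900 by 600 pixel image; it also shows the channel reversal that turns OpenCV's BGR order into the RGB order matplotlib expects.

```python
import numpy as np

# Synthetic stand-in for a loaded photo: 900 rows, 600 columns, 3 channels (B, G, R).
image = np.zeros((900, 600, 3), dtype=np.uint8)
image[:, :, 0] = 255  # fill the first channel, which is blue in BGR order

height = image.shape[0]
upper_third = image[: height // 3]  # slicing the first axis selects rows
rgb = upper_third[:, :, ::-1]       # reversing the last axis turns BGR into RGB

print(upper_third.shape)    # (300, 600, 3)
print(rgb[0, 0].tolist())   # [0, 0, 255]
```

Slicing like this returns a view rather than a copy, so cropping a region for display is cheap even for large photos.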
In our next publication, we will try to modify this image to improve text detection.
Do not hesitate to contact us if you require quantitative model development, deployment, verification, or validation. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, automation, or text recognition in the wild.
The demonstration notebook is here.