Optical Character Recognition (OCR) uses optics to extract readable text into machine-encoded text. A large number of companies that process paper-based forms use OCR to extract texts from documents.
Applying cutting-edge technologies to modern problems has enabled various problems to be solved in healthcare, thereby improving people’s lives.
Machine learning has been used in hospitals to identify patients faster in disease identification and diagnosis, robotic surgery, and medical imaging diagnosis.
One notable area of concern in healthcare that hasn’t been explored a lot is drug label extraction. Optical Character Recognition (OCR) uses optics to extract readable text into machine-encoded text.
OCR has been used on Google’s visual translation service. Recently, most companies offering PDF services, such as Adobe, also provide this feature on their app.
We will be using OCR to enable us to automatically extract drug labels/text from prescription medicine.
We’ll specifically use the PaddleOCR, previously referenced in a paper known as PP-OCR: A Practical Ultra Lightweight OCR System. The PaddleOCR is a model that the engineers at Baidu originally built.
To understand this tutorial, you will need the following:
- Have Google Colab or Jupyter Notebook installed.
- You will need to be familiar with machine learning.
- You will need to have Python and PaddleOCR installed.
Table of contents
- Installing PaddleOCR for Python
- Applying PaddleOCR model to images
- Extracting prescription medication labels using PaddleOCR
- Wrapping up
To go through with our drug label extraction, we need to perform three key things:
- Install and import our dependencies.
- Instantiate the model and detect.
- Visualize the results.
Installing PaddleOCR for Python
We begin with opening a Jupyter notebook either on Colab or locally and install the required libraries. The dependency we want to install is the PaddleOCR.
Head on to Paddle OCR’s Github page.
Tutorial section, navigate to the
installation sub-section. A thorough guide to install PaddleOCR is available here.
There are two parts to the installation process. First, you have to install
paddlepaddle, which is the underlying framework behind the PaddleOCR model. Second, we’ll go ahead and install the
The PaddleOCR model can run on both a GPU and CPU. GPUs are often preferred because of their greater memory bandwidth.
If you have a GPU-enabled machine, use the following command to install the PaddleOCR:
!python3 -m pip install paddlepaddle-gpu==2.0.0 -i https://mirror.baidu.com/pypi/simple
If you have a CPU-enabled machine, use the following command to install the PaddleOCR:
!python3 -m pip install paddlepaddle==2.0.0 -i https://mirror.baidu.com/pypi/simple
If you happen to get an error when installing, it’s because you haven’t installed the
pipcommand. To install it, use the command,
pip3 install --upgrade pip. Also, since we’re installing it on our notebook, we have to include the exclamation mark
!before the installation command.
Let’s install PaddleOCR through a quick pip install, using the following command:
pip install paddleocr
It would be best if you cloned the PaddleOCR repo on Github.
!git clone https://github.com/PaddlePaddle/PaddleOCR
This clone is crucial as it’ll enable the look-up of specific fonts when we visualize our results in the last step.
Now that we’ve installed our dependencies, let’s import them into our notebook/colab.
from paddleocr import PaddleOCR, draw_ocr from matplotlib import pyplot as plt import cv2 import os
paddleocris our main OCR dependency
matplotliballows us to plot images
cv2allows us to use the OpenCV library
osallows for folder directory navigation
Applying PaddleOCR model to images
Now, let us apply the PaddleOCR to our images.
These are the two images that we’ll be applying PaddleOCR on:
Let’s set it up using the following command:
ocr_model = PaddleOCR(lang='en')
This command specifies the language of choice. For our case, we set it to English. Since the PaddleOCR is made by engineers at Baidu, it can also recognize Chinese characters. You may select your language as Chinese or any other supported language.
The following command sets up the path to our images on the machine.
img_path = os.path.join('.', 'drug1.jpg')
Let’s run the OCR method on the
ocr_model to make some detection. We can store the results in a variable known as
result = ocr_model.ocr(img_path)
Running the command above gives the output:
[[[[114.0, 39.0], [265.0, 5.0], [273.0, 39.0], [121.0, 72.0]], ('Generio', 0.9429093)], [[[422.0, 99.0], [464.0, 89.0], [469.0, 113.0], [428.0, 122.0]], ('1', 0.5031448)], [[[135.0, 118.0], [457.0, 40.0], [467.0, 80.0], [144.0, 158.0]], ('TO APPOINTME', 0.9501941)], [[[302.0, 124.0], [433.0, 94.0], [442.0, 132.0], [311.0, 163.0]], ('THEN', 0.9947262)], [[[146.0, 166.0], [323.0, 122.0], [332.0, 160.0], [155.0, 203.0]], ('ONLY.', 0.9750903)], [[[154.0, 204.0], [455.0, 127.0], [468.0, 175.0], [167.0, 253.0]], ('TWICEDAILY', 0.99549305)]]
type(result) command, we find these results stored in a list. We can use standard list indexing to get the results out.
To grab all the texts, let’s loop through the result variable, picking the first index that is the text. The following command is going to print out the text only.
for res in result: print(res)
Generio 1 TO APPOINTME THEN ONLY. TWICEDAILY
The few lines of code above allow us to extract the drug labels in drug bottles/capsules quickly.
Please note that we are using
drug1.jpgin this example. You can play around with the code and use
drug2.jpgand see the outcome.
Extracting prescription medication labels using PaddleOCR
This step involves visualizing the results. The
matplotlib library help us achieve this.
Let’s extract the boxes, scores, and text coordinates into separate variables. We then pass these coordinates into the
This process feels similar to object detection if you’ve ever performed such a task. The
boxes variable loops through and grabs the first index inside our
result variable, the
texts variable grabs the second, and the
scores grab the third.
boxes = [res for res in result] texts = [res for res in result] scores = [res for res in result]
The next step involves specifying the font path for our
draw_ocr method. This is why I mentioned earlier in the tutorial that you need to clone the PaddleOCR Github repository.
To use the
draw_ocr method, we need a path defined to the specific fonts. Inside the downloaded repo, follow that path shown below, and you’ll find different fonts inside a folder. For our case,
ttf is the font format we use for
True Type Font.
We do this by issuing the following command:
font_path = os.path.join('PaddleOCR', 'doc', 'fonts', 'latin.ttf')
Let’s now load our image using OpenCV. We use OpenCV’s
imread() method to read in our image.
img = cv2.imread(img_path) img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
cv2.COLOR_BGR2RGB reorders the color channels. By default, OpenCV imports images as BGR.
The final step involves visualizing our images and detections.
Using matplotlib, let’s first resize our display area.
We then use the
draw_ocr method to draw annotations on the image.
annotated = draw_ocr(img, boxes, texts, scores, font_path=font_path)
Finally, we use matplotlib’s
imshow() method to visualize the image.
Let’s take a look at the result. This’s the same image we had earlier. We can see that it has extracted the text and drawn bounding boxes around it.
Please find the complete code to my code implementation here.
One key feature of the PaddleOCR model is that it’s swift. We also noted that with a few lines of code, we can extract labels from prescribed medicine. Amazing, right?
In a nutshell, this tutorial showed you how you could use OCR to extract text from drug labels. The sizes and quality of your image will significantly impact your results.
Remotely running this colab project may give you an error since you can’t use my image path remotely. You can import an image of your choice into your colab, or you can carry out the task locally to circumvent this error.
Try playing around with the code and see what you find.
Peer Review Contributions by: Collins Ayuya