Extracting Text with Machine Learning (OCR Technology)

Step-by-Step process of extracting specific text from multiple PDF files

Otema Yirenkyi
4 min read · Jun 6, 2022

Born out of necessity, I wrote some Python code to extract text out of multiple PDF Files and write the output to Excel files.

I reckon this is going to be a very short article about a Machine Learning application.

The Problem: My friend was tasked with reading multiple PDF files and writing out some info into an Excel sheet. Typically, this would mean he’d have to look through all 20+ files one by one and type the keywords he’d been asked to look out for into an Excel sheet or Google Sheets.

Proposed Solution:

The solution was simply to arrive at the optimal final result in the laziest way possible, haha.

So, I wrote a program that can be classified under Optical Character Recognition.

Photo by Claudia Ramírez on Unsplash

Ideas Employed:

The idea of extracting text from PDF files can be implemented by using Optical Character Recognition (OCR) technology.

“OCR is a technology that recognizes text within a digital image. It is commonly used to recognize text in scanned documents, but it serves many other purposes as well.” [source]

Did you know that the human eye and brain employ a kind of optical character recognition to let us read text and decipher what words and sentences convey? In the same vein, we have to train the computer, using Machine Learning, to identify characters and words in printed documents or images and convert them into editable form. Read more about the technology of OCR and its advantages here.

Lol, isn’t it weird that the same computing system that typed and printed a document needs help reading it? Welp, a computer is not naturally smart until you train it to be.

Step by Step Process:

I tried a sample code after my YouTube exploration, using the PyPDF2 Python library. This was to read a single PDF file and export the extracted words into a text (.txt) file. Find the full video here:

Code extract from YouTube video
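The snippet itself is only available as a screenshot, so here is a minimal sketch of the same idea, assuming the newer PyPDF2 PdfReader API and a hypothetical file named sample.pdf:

```python
from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

reader = PdfReader("sample.pdf")  # hypothetical input file

text = ""
for page in reader.pages:
    # extract_text() can come back empty for image-only pages
    text += (page.extract_text() or "") + "\n"

# Export the extracted words into a .txt file
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```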

Secondly, I used the pdfplumber library to read a single PDF file. Legend says it runs faster than PyPDF2.

pdfplumber library to read a single PDF file

In the above code, you need to indicate the file path of the PDF file you are extracting text from. Next, you loop through the pages in the file. For each page, extract the text and store it in the ‘text1’ variable. Open a new text file, write the extracted text into the .txt file, and close the file.
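Since the original code was shared as an image, here is a rough sketch of that flow (the file names are placeholders, not the ones from the screenshot):

```python
import pdfplumber

text1 = ""
# Point this at the file path of the PDF you are extracting from
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() may return None on pages with no extractable text
        text1 += (page.extract_text() or "") + "\n"

# Write the results of the extract into a .txt file;
# the with-block closes the file for us
with open("report.txt", "w", encoding="utf-8") as f:
    f.write(text1)
```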

The file will be stored on your computer.

I then tried pdfplumber with multiple files and got my results.

pdfplumber library to read multiple PDF files

To do this, you first need to put all PDF files whose text you will be extracting, into one folder.

Import the glob library. This library will allow you to read all files in a folder. The difference between extracting from one file and from multiple files is the use of the glob library. After reading each file, add a new line and move on to the next file. Open a new text file, write the results of the extract into the .txt file, and close the file.
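A sketch of the multi-file version, assuming the PDFs sit in a hypothetical pdfs/ folder:

```python
import glob

import pdfplumber

all_text = ""
# glob.glob returns every path in the folder matching the pattern
for file_path in glob.glob("pdfs/*.pdf"):
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            all_text += (page.extract_text() or "") + "\n"
    all_text += "\n"  # new line between files

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(all_text)
```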

Take note of the ‘*.file extension’ at the end of the file path. Read more on the different patterns you can use with the glob library.

Next, we’ll read multiple pdf files and search for one specific term. When we find this term, we’ll extract the row in which the search term exists.

pdfplumber library to read multiple PDF files — adding a search term

Be sure to import the necessary libraries (csv, glob, pandas). In this code, we’ll create a list in which the results will be placed. Define your search term; mine is “TOTAL”. A full sketch follows these steps.

Note that the search term is case sensitive. This means that the way in which the word is written is exactly how the program will search for the word.

Create a new CSV file that our results will be written into. Use the glob function to access the files in a specified directory. Call the csv.writer function on the newly created CSV file. It returns a writer object which converts the user’s data into delimited strings. Basically, it inserts data into the CSV file.

Use pdfplumber, as in the examples above, to open each file and go through every page in each PDF file. Use the extract_text method to read each page’s text and store it in the ‘page’ variable. Each page’s text is then split on new lines so that individual lines can be examined.

In each page, loop through the lines. If the search term is identified in a line, split the line on whitespace (the default). Use the writerow method to write the split words into the CSV file.

Close the csv file.
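Putting those steps together, a sketch might look like this (the folder and file names are placeholder assumptions, since the original code was shared as an image):

```python
import csv
import glob

import pdfplumber

search_term = "TOTAL"  # case sensitive
results = []  # list where the matching rows will be placed

# newline="" prevents blank rows in the CSV on Windows
with open("results.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)  # converts rows into delimited strings
    for file_path in glob.glob("pdfs/*.pdf"):
        with pdfplumber.open(file_path) as pdf:
            for pdf_page in pdf.pages:
                page = pdf_page.extract_text() or ""
                for line in page.split("\n"):
                    if search_term in line:
                        row = line.split()  # split on whitespace
                        results.append(row)
                        writer.writerow(row)
# The with-block closes the CSV file automatically
```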

CSV files can be opened with the Microsoft Excel application.
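The imports above include pandas, which the sketch doesn’t strictly need; one plausible (assumed) use is to load the CSV back in and preview the matches:

```python
import pandas as pd

# Assumes the matched rows have a consistent number of columns;
# ragged rows would need extra handling
df = pd.read_csv("results.csv", header=None)
print(df.head())
```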

Voila! You’ve saved yourself half the time you would have wasted going through multiple PDFs in search of keywords.

Note that pip installations are one-time commands. You only need to install a library once, unless it has been uninstalled since.

An alternative to this approach is to use already-built platforms that incorporate OCR technology. One is UiPath Studio, an automation platform for digitizing text. While I worked on my code, my colleague explored the UiPath option. Read on how to extract text with UiPath Studio.

Gratitude goes to a friend of mine who helped me along the way when I got to roadblocks with my code.

Limitations:

You may have to proofread the output of your extracted text: spell-check it, add headings, reorder it, and so on. But at least the process above saves you a lot more time than going through multiple files one by one.

Additional resources
