Parsing PDF Files Using Python: A Guide with Tesseract OCR
In this post, I’ll guide you through a practical use case: parsing text from PDF files using Python Functions. The code uses several libraries, including cv2, pytesseract, and pdf2image, to extract and process text from PDF attachments. Below, I’ll break down the code, explain its functionality, and outline the modules required for each step.
Overview
The goal of this code is to convert PDF pages into images, preprocess those images to correct distortions (like skew), and extract text using OCR with Tesseract.
Required Libraries and Modules
To accomplish PDF parsing with OCR in Python, you’ll need the following modules:
- pytesseract: A Python wrapper for Google’s Tesseract-OCR Engine.
- pdf2image: To convert PDF pages into images.
- OpenCV (cv2): For image processing, like converting to grayscale and deskewing.
- NumPy (np): To work with image arrays.
- pandas: For managing OCR data efficiently.
- logging: For logging errors and debugging.
Install these libraries in your code repository.
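Before running anything else, it can help to confirm that the modules above are importable in your environment. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the list uses import names, which can differ from package names (for example, cv2 is commonly distributed as opencv-python).

```python
from importlib.util import find_spec

# Import names (not necessarily the package names) used in this post.
REQUIRED_MODULES = ["pytesseract", "pdf2image", "cv2", "numpy", "pandas", "logging"]

def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Note that this only checks the Python packages. pytesseract and pdf2image also rely on the Tesseract and Poppler binaries being available, which this check does not cover.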
Code Breakdown
1. Setting up the Environment: tessdata Directory
Tesseract uses trained data for OCR. The _get_tessdata_directory_path() function retrieves the path where Tesseract stores its trained data (tessdata), which is essential for configuring the OCR engine.
def _get_tessdata_directory_path():
    import sys
    from pathlib import Path
    # The environment root sits two levels above the Python executable.
    env_root = Path(sys.executable).parent.parent
    share_dir = env_root / "share" / "tessdata"
    # Check the directory on disk; a non-empty path string is always truthy.
    assert share_dir.is_dir(), "tessdata directory does not exist in <envroot>/share/tessdata"
    return str(share_dir)
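To make the path arithmetic concrete, here is a small worked example. It assumes a hypothetical interpreter location, /opt/env/bin/python, and uses PurePosixPath so the result does not depend on the machine it runs on; the helper name tessdata_dir_for is illustrative only.

```python
from pathlib import PurePosixPath

def tessdata_dir_for(executable):
    """Derive <envroot>/share/tessdata from the path of a Python executable."""
    # <envroot>/bin/python -> parent is <envroot>/bin -> parent is <envroot>
    env_root = PurePosixPath(executable).parent.parent
    return str(env_root / "share" / "tessdata")

example = tessdata_dir_for("/opt/env/bin/python")
print(example)  # /opt/env/share/tessdata
```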
2. Processing PDF Pages into Images: convert_from_bytes
The function get_attachment_text() is the main entry point. It converts a PDF attachment into a list of pages, which are then handled as images. It should be called from Workshop.
attachment_bytes = <your_object_type>.<data_attachment_property>.read().read()
pages = convert_from_bytes(attachment_bytes)
Here, convert_from_bytes turns the PDF into image pages stored as a list.
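If only the first few pages are needed, the page range can be limited at conversion time so that later pages are never rendered. The sketch below assumes the dpi, first_page, and last_page keyword arguments of pdf2image's convert_from_bytes; the wrapper render_first_pages and its converter parameter are hypothetical and exist only so the logic can be exercised without a real PDF.

```python
def render_first_pages(pdf_bytes, max_pages=3, dpi=200, converter=None):
    """Render at most `max_pages` pages of a PDF into a list of images.

    `converter` is a hypothetical injection point for testing; when it is
    omitted, pdf2image.convert_from_bytes is imported and used.
    """
    if converter is None:
        from pdf2image import convert_from_bytes as converter
    # Page numbers are 1-based, so pages 1..max_pages are rendered.
    return converter(pdf_bytes, dpi=dpi, first_page=1, last_page=max_pages)
```

A higher dpi generally gives Tesseract more pixels per character at the cost of memory and time, so treat 200 as a starting point rather than a recommendation.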
3. Deskewing Images: deskew()
PDF scans often suffer from skewed text. To improve OCR accuracy, the deskew() function calculates and applies a rotation to align the text properly.
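def deskew(image):
    # Accept both color and already-grayscale arrays.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    # Invert so that dark text on a light background becomes the non-zero foreground.
    gray = cv2.bitwise_not(gray)
    # Collect foreground pixel coordinates; minAreaRect needs 32-bit points.
    coords = np.column_stack(np.where(gray > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Assumes the older minAreaRect angle range of [-90, 0); check your OpenCV version.
    angle = -(90 + angle) if angle < -45 else -angle
    center = (image.shape[1] // 2, image.shape[0] // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]), flags=cv2.INTER_CUBIC)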
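The angle correction is the least obvious line in deskew(), so here it is in isolation as plain arithmetic. This sketch assumes the rectangle angle lies in the range [-90, 0), which is what the condition in deskew() implies; the function name rotation_from_rect_angle is hypothetical.

```python
def rotation_from_rect_angle(angle):
    """Map a minAreaRect-style angle in [-90, 0) to the rotation deskew() applies."""
    # Angles below -45 describe a box measured from its other side,
    # so 90 degrees is added before the sign is flipped.
    return -(90 + angle) if angle < -45 else -angle

print(rotation_from_rect_angle(-80))  # -10
print(rotation_from_rect_angle(-10))  # 10
```

Either way, the resulting rotation stays within 45 degrees of zero, which matches the expectation that a scanned page is only slightly tilted.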
4. Extracting Text from the Image: extract_text_from_image()
After the image is preprocessed (deskewed), we can use Tesseract to extract the text.
def extract_text_from_image(image):
    text = pytesseract.image_to_string(image, config=tessdata_dir_config)
    return text
The tessdata_dir_config ensures that Tesseract looks for the trained data in the correct directory.
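The post does not show how tessdata_dir_config is defined. One plausible definition, following the --tessdata-dir option that pytesseract passes through to Tesseract, is sketched below; the helper name build_tessdata_config and the example path are assumptions, not part of the original code.

```python
def build_tessdata_config(tessdata_dir):
    """Build a Tesseract config string that points at a tessdata directory."""
    # The path is quoted so that directories containing spaces still work.
    return f'--tessdata-dir "{tessdata_dir}"'

# Hypothetical path; in the post this would come from _get_tessdata_directory_path().
tessdata_dir_config = build_tessdata_config("/opt/env/share/tessdata")
print(tessdata_dir_config)  # --tessdata-dir "/opt/env/share/tessdata"
```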
5. Main Text Extraction Logic: process_page()
The process_page() function is a helper for processing each page of the PDF. It handles image preprocessing, uses Tesseract to extract the text, and excludes header and footer data by dropping the first and last text blocks that Tesseract reports.
def process_page(page):
    try:
        # pdf2image yields PIL images in RGB order; deskew the color array, then convert to grayscale once.
        page_deskew = deskew(np.array(page))
        page_gray = cv2.cvtColor(page_deskew, cv2.COLOR_RGB2GRAY)
        d = pytesseract.image_to_data(page_gray, output_type=pytesseract.Output.DICT)
        d_df = pd.DataFrame.from_dict(d)
        # Treat the first and last text blocks as header and footer.
        header_index = d_df[d_df["block_num"] == 1].index.values
        footer_index = d_df[d_df["block_num"] == d_df["block_num"].max()].index.values
        # Level 5 rows are individual words.
        text = " ".join(d_df.loc[(d_df["level"] == 5) &
                                 (~d_df.index.isin(header_index) & ~d_df.index.isin(footer_index)), "text"].values)
        return text
    except Exception as e:
        # On failure the return value is a (-1, message) tuple rather than a string.
        return -1, str(e)
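To see the header and footer filter in action without running OCR, the sketch below applies the same selection to a small hand-written dictionary. The sample data is invented for illustration and only includes the three columns the filter uses; real image_to_data output has more. The helper name strip_header_footer is hypothetical.

```python
import pandas as pd

# Hand-written stand-in for pytesseract.image_to_data(..., output_type=Output.DICT).
sample = {
    "level":     [2, 5, 2, 5, 5, 2, 5],
    "block_num": [1, 1, 2, 2, 2, 3, 3],
    "text":      ["", "Header", "", "Body", "text", "", "Footer"],
}

def strip_header_footer(ocr_dict):
    """Join word-level text, skipping the first and last text blocks."""
    df = pd.DataFrame.from_dict(ocr_dict)
    first_block = df["block_num"] == 1
    last_block = df["block_num"] == df["block_num"].max()
    words = df["level"] == 5  # level 5 rows are individual words
    return " ".join(df.loc[words & ~first_block & ~last_block, "text"])

body = strip_header_footer(sample)
print(body)  # Body text
```

One consequence of this approach is that a page with a single text block loses all of its text, because that block is both the first and the last. It is worth checking how your documents are segmented before relying on it.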
Full Functionality
The code revolves around one main function:
Extracting Data from PDF
@function
def get_attachment_text(<your_object_type_instance>: <your_object_type>) -> str:
    attachment_bytes = <your_object_type_instance>.<data_attachment_property>.read().read()
    pages = convert_from_bytes(attachment_bytes)
    # Only the first three pages are deskewed and passed to OCR.
    extracted_text = [extract_text_from_image(deskew(np.array(page))) for page in pages[:3]]
    return ' '.join(extracted_text)
This function extracts and processes the first three pages from a data attachment, returning the concatenated text.
Conclusion
This code demonstrates how to extract text from PDF files using Python functions. It is an alternative if the TypeScript approach is giving you issues (see: Parsing PDF Blob with pdfjs-dist).