[Guide] Parsing PDF Files Using Python with Tesseract OCR

Parsing PDF Files Using Python: A Guide with Tesseract OCR

In this post, I’ll guide you through a practical use case of parsing text from PDF files using Python Functions. The code uses several libraries, including cv2, pytesseract, and pdf2image, to extract and process text from PDF attachments. Below, I’ll break down the code, explain its functionality, and outline the modules required for each step.

Overview

The goal of this code is to convert PDF pages into images, preprocess those images to correct distortions (like skew), and extract text using OCR with Tesseract.

Required Libraries and Modules

To accomplish PDF parsing with OCR in Python, you’ll need the following modules:

  • pytesseract: A Python wrapper for Google’s Tesseract-OCR Engine.
  • pdf2image: To convert PDF pages into images.
  • OpenCV (cv2): For image processing, like converting to grayscale and deskewing.
  • NumPy (np): To work with image arrays.
  • pandas: For managing OCR data efficiently.
  • logging: For logging errors and debugging.

Install these libraries in your code repository before running the code.
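Before going further, it can help to confirm that the modules are importable in your environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for illustration and is not part of any of the libraries above.

```python
from importlib.util import find_spec

# Import names used by the snippets in this guide
REQUIRED_MODULES = ["pytesseract", "pdf2image", "cv2", "numpy", "pandas", "logging"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", missing if missing else "none")
```

The snippets in the rest of this guide assume `import cv2`, `import numpy as np`, `import pandas as pd`, `import pytesseract`, and `from pdf2image import convert_from_bytes` at the top of the module.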

Code Breakdown

1. Setting up the Environment: tessdata Directory

Tesseract uses trained data for OCR. The _get_tessdata_directory_path() function retrieves the path where Tesseract stores its trained data (tessdata), which is essential for configuring the OCR engine.

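def _get_tessdata_directory_path():
    import sys
    from pathlib import Path
    # The environment root sits two levels above the Python executable
    env_root = Path(sys.executable).parent.parent
    share_dir = env_root / "share" / "tessdata"
    # Check that the directory really exists (a non-empty path string is always truthy)
    assert share_dir.is_dir(), "tessdata directory does not exist in <envroot>/share/tessdata"
    return str(share_dir)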

2. Processing PDF Pages into Images: convert_from_bytes

The function get_attachment_text() is the main function: it converts a PDF attachment into a list of pages, which are then handled as images. It should be called from Workshop.

attachment_bytes = <your_object_type>.<data_attachment_property>.read().read()
pages = convert_from_bytes(attachment_bytes)

Here, convert_from_bytes turns the PDF into image pages stored as a list.

3. Deskewing Images: deskew()

PDF scans often suffer from skewed text. To improve OCR accuracy, the deskew() function calculates and applies a rotation to align the text properly.

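def deskew(image):
    # Accept a color page or one that has already been converted to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    # Invert so dark text becomes the non-zero foreground
    gray = cv2.bitwise_not(gray)
    # (row, col) coordinates of every foreground pixel
    coords = np.column_stack(np.where(gray > 0))
    # Angle of the smallest rotated rectangle enclosing the text
    angle = cv2.minAreaRect(coords)[-1]
    # Assumes the angle is reported in [-90, 0)
    angle = -(90 + angle) if angle < -45 else -angle
    center = (image.shape[1] // 2, image.shape[0] // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]), flags=cv2.INTER_CUBIC)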
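One caveat: the angle returned by cv2.minAreaRect depends on the OpenCV version. As far as I can tell, older releases report it in [-90, 0), while releases from about 4.5 onward report it in (0, 90], in which case the `angle < -45` branch never fires. The sketch below is a hypothetical helper (`correction_angle` is not part of OpenCV or of the code above) that maps either convention to the same correction; treat the version boundary as an assumption and check it against your installed OpenCV.

```python
def correction_angle(rect_angle):
    """Map a minAreaRect angle to the rotation (in degrees) that deskew() would apply.

    Assumption: rect_angle lies in [-90, 0) (older OpenCV) or (0, 90] (newer OpenCV).
    """
    if rect_angle < -45:
        return -(90 + rect_angle)
    if rect_angle > 45:
        return 90 - rect_angle
    return -rect_angle


# The same physical skew, reported under either convention, gives the same correction
print(correction_angle(-88.0), correction_angle(2.0))   # -2.0 -2.0
print(correction_angle(-3.0), correction_angle(87.0))   # 3.0 3.0
```

Inside deskew(), the result would replace the one-line `angle = ...` expression.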

4. Extracting Text from the Image: extract_text_from_image()

After the image is preprocessed (deskewed), we can use Tesseract to extract the text.

def extract_text_from_image(image):
    text = pytesseract.image_to_string(image, config=tessdata_dir_config)
    return text

The tessdata_dir_config ensures that Tesseract looks for the trained data in the correct directory.
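The snippet above uses tessdata_dir_config without showing how it is defined. One plausible definition, assuming it is simply Tesseract's `--tessdata-dir` command-line option pointed at the directory from step 1, is sketched below. The helper name `build_tessdata_config` and the example path are illustrative only.

```python
def build_tessdata_config(tessdata_dir):
    """Build the extra-arguments string that pytesseract passes to the tesseract CLI."""
    # Quote the path so that directories containing spaces survive argument splitting
    return f'--tessdata-dir "{tessdata_dir}"'


# In this guide's code the definition would be roughly:
#   tessdata_dir_config = build_tessdata_config(_get_tessdata_directory_path())
print(build_tessdata_config("/opt/env/share/tessdata"))
# --tessdata-dir "/opt/env/share/tessdata"
```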

5. Main Text Extraction Logic: process_page()

The process_page() function is a helper for processing a single page of the PDF. It preprocesses the image, runs Tesseract to get word-level data, and drops the first and last text blocks, which are treated as header and footer.

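def process_page(page):
    try:
        # Deskew the RGB page first, then convert to grayscale for OCR
        # (deskew() does its own color-to-gray conversion internally)
        page_deskew = deskew(np.array(page))
        page_gray = cv2.cvtColor(page_deskew, cv2.COLOR_RGB2GRAY)
        d = pytesseract.image_to_data(page_gray, output_type=pytesseract.Output.DICT)
        d_df = pd.DataFrame.from_dict(d)
        # Treat the first and last text blocks as header and footer
        header_index = d_df[d_df["block_num"] == 1].index.values
        footer_index = d_df[d_df["block_num"] == d_df["block_num"].max()].index.values
        # Keep word-level rows (level 5) outside the header and footer
        text = " ".join(d_df.loc[(d_df["level"] == 5) &
                                 (~d_df.index.isin(header_index) & ~d_df.index.isin(footer_index)), "text"].values)
        return text
    except Exception as e:
        # On failure, return an error marker and the message instead of text
        return -1, str(e)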
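To see what the header and footer filter does, here is a small sketch that runs the same pandas logic on a hand-made dictionary. The data is invented for illustration and only mimics the shape of the image_to_data output; real output has more columns (left, top, conf, and so on).

```python
import pandas as pd

# Hand-made stand-in for pytesseract.image_to_data(..., output_type=Output.DICT)
ocr = {
    "level":     [2, 5, 5, 2, 5, 5, 5, 2, 5, 5],
    "block_num": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "text":      ["", "ACME", "Corp", "", "Invoice", "total:", "42", "", "Page", "1"],
}
df = pd.DataFrame.from_dict(ocr)

first_block = df["block_num"] == 1                      # treated as the header
last_block = df["block_num"] == df["block_num"].max()   # treated as the footer
is_word = df["level"] == 5                              # level 5 rows are single words

body = " ".join(df.loc[is_word & ~first_block & ~last_block, "text"])
print(body)  # Invoice total: 42
```

Note that on a page where Tesseract finds only one block, the first and last block are the same, so this filter would drop all of the text.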

Full Functionality

The code revolves around one main function:

Extracting Data from PDF

@function
def get_attachment_text(<your_object_type_instance>: <your_object_type>) -> str:
    attachment_bytes = <your_object_type_instance>.<data_attachment_property>.read().read()
    pages = convert_from_bytes(attachment_bytes)
    # Only the first three pages are deskewed and sent through OCR
    extracted_text = [extract_text_from_image(deskew(np.array(page))) for page in pages[:3]]
    return ' '.join(extracted_text)

This function extracts and processes the first three pages from a data attachment, returning the concatenated text.
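Since only three pages are used, it may be cheaper to stop the conversion early rather than rasterize the whole PDF. The sketch below relies on the `first_page` and `last_page` parameters of pdf2image's convert_from_bytes. The wrapper `convert_first_pages` and its `convert` argument are hypothetical, added so that the function can be exercised without pdf2image or Poppler installed.

```python
def convert_first_pages(pdf_bytes, max_pages=3, convert=None):
    """Rasterize at most max_pages pages from the start of a PDF.

    `convert` is a hypothetical hook for testing; by default pdf2image is used.
    """
    if convert is None:
        # Imported lazily so this module loads even where pdf2image is absent
        from pdf2image import convert_from_bytes as convert
    # first_page and last_page are 1-based and inclusive in pdf2image
    return convert(pdf_bytes, first_page=1, last_page=max_pages)


# Usage inside get_attachment_text() would be roughly:
#   pages = convert_first_pages(attachment_bytes)
```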

Conclusion

This code demonstrates how to extract text from PDF files using Python functions. It is an alternative if the TypeScript approach is giving you issues; see: Parsing PDF Blob with pdfjs-dist
