Extracting Input Form Fields from PDF

bwolz · February 19, 2025, 5:47pm

This is not an OCR/VLM question: I have a PDF with input fields. When I run the Pipeline Builder "extract text from PDF" block, I get all the text EXCEPT what’s in these input fields. With PyMuPDF, I can get these form fields out as widgets (see PyMuPDF docs, can’t add links). Is there any support for getting these input fields out of a PDF in PB at the moment, or would I need to write a UDF to get them out? Image from my PDF below. For reference, the PB board will pull out “6 City, state, and ZIP code” but it will not find the text in the widget (“Input Field”).

sperchanok · February 19, 2025, 6:05pm

Hi @bwolz. Unfortunately, the only way to use a custom model for PDF extraction is to create a UDF.

Out of curiosity, when you say the field is “missing”, what do you mean?

Similar post that might interest you

bwolz · February 19, 2025, 8:34pm

Here’s a small (notional) snippet of what the extract PDF board gets me. Note this is just pulling the text out of the PDF, not using OCR or a VLM:

\r\n6 City, state, and ZIP code\r\nRequester’s name and address (optional)\r\n7 List account number(s) here (optional)

You can see that “Input Field” does not show up in the extracted text. Using the script below (PyMuPDF), I can extract these widgets and find the following field:

Field name: topmostSubform[0].Page1[0].f1_09[0]
Field type: 7
Field value: Input Field

Whatever the PDF text extraction board is doing under the hood, it’s not pulling data from these form fields. Based on my experimentation, it appears that form fields (or widgets, as PyMuPDF calls them) are not part of the raw text of a PDF, so extracting them through the PDF text extraction board and inserting them at the correct position seems tricky.

==PyMuPDF function code for reference==
import fitz  # PyMuPDF


def extract_fields_from_pdf(pdf_path):
    # Open the PDF file
    document = fitz.open(pdf_path)

    # Walk every page and print each form field (widget) found on it
    for page_num in range(len(document)):
        page = document.load_page(page_num)

        # page.widgets() yields nothing for pages without form fields
        for widget in page.widgets():
            print(f"Field name: {widget.field_name}")
            print(f"Field type: {widget.field_type}")
            print(f"Field value: {widget.field_value}")
            print("---")

    document.close()
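
On the "correct position" problem: one possible approach, as a rough sketch only, is to collect bounding boxes for both the plain text and the widgets and then sort everything into reading order. The helper below is hypothetical (the name merge_by_position, the Box tuple, and the line_tolerance default are my own, not anything PB or PyMuPDF provides) and it works on plain tuples so it can be tried without a PDF. In PyMuPDF the boxes could presumably come from page.get_text("blocks") and widget.rect, as noted in the comments. The coordinates in the usage example are made up.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): bounding box in page coordinates plus its text.
# Assumption: with PyMuPDF these could be built from
#   page.get_text("blocks")  -> entries start with (x0, y0, x1, y1, text, ...)
#   widget.rect              -> (x0, y0, x1, y1), with str(widget.field_value) as text
Box = Tuple[float, float, float, float, str]


def merge_by_position(text_blocks: List[Box], fields: List[Box],
                      line_tolerance: float = 3.0) -> str:
    """Interleave text blocks and form-field values in rough reading order."""
    # Sort top-to-bottom first, then left-to-right.
    items = sorted(text_blocks + fields, key=lambda b: (b[1], b[0]))

    lines: List[List[Box]] = []
    for item in items:
        # Treat an item as part of the current line when its top edge is
        # within line_tolerance of the first item on that line.
        if lines and abs(item[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])

    # Within each line, order items left-to-right and join them with spaces.
    return "\n".join(
        " ".join(b[4].strip() for b in sorted(line, key=lambda b: b[0]))
        for line in lines
    )


# Made-up coordinates: the field sits just below its label.
example_text = [
    (36.0, 100.0, 200.0, 110.0, "6 City, state, and ZIP code"),
    (36.0, 130.0, 200.0, 140.0, "7 List account number(s) here (optional)"),
]
example_fields = [(36.0, 112.0, 300.0, 124.0, "Input Field")]
print(merge_by_position(example_text, example_fields))
```

This is a simple top-edge heuristic, so multi-column layouts or fields that sit to the right of a tall text block would likely need something smarter.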

system · April 20, 2025, 8:35pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
