What library is Foundry's Pipeline Builder "Extract text from PDF" feature built on?

Hi everyone,

I’m currently working on a project where document parsing accuracy is critical. I’m dealing with a mixed set of PDFs that include:

  • PDFs with extractable raw text
  • Scanned documents requiring OCR
  • Hybrid PDFs (e.g., first page is scanned for approval signatures, remaining pages are raw text)

To improve accuracy, I’d like to experiment with different approaches. As you know, Pipeline Builder’s “Extract text from PDF” function offers three options: raw text, OCR, and layout-aware.
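
For the hybrid case, one possible approach is to pick the method page by page: attempt raw text first, and send a page to OCR only when almost nothing comes back. The snippet below is a minimal sketch of that routing step only. The function name `choose_method_per_page` and the `min_chars` threshold of 20 are illustrative assumptions, not anything Pipeline Builder exposes.

```python
from typing import List, Optional


def choose_method_per_page(page_texts: List[Optional[str]], min_chars: int = 20) -> List[str]:
    """Return "raw_text" or "ocr" for each page.

    `page_texts` holds whatever a raw-text extractor returned per page
    (None or "" for pages with no embedded text). `min_chars` is an
    assumed threshold and should be tuned on real documents.
    """
    methods = []
    for text in page_texts:
        # Count only non-whitespace characters so blank lines do not mask a scanned page.
        visible = "".join((text or "").split())
        methods.append("raw_text" if len(visible) >= min_chars else "ocr")
    return methods


# Example: a scanned signature page followed by a page with embedded text.
print(choose_method_per_page(["", "This page has plenty of embedded text."]))
```

A pure character count is a crude signal; scanned pages with a thin embedded text layer (for example, a stamped header) may need a higher threshold or an additional check.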

My questions are:

  1. Which Python libraries underlie each of these three extraction methods? Understanding this would help me avoid redundant development and potentially create hybrid approaches.

  2. Is it acceptable to develop custom UDFs using alternative libraries (like pdfplumber, Tesseract, PaddleOCR, etc.) and import them as transforms in Pipeline Builder? If so, could someone provide an example of:

    • Input: PDF from a Media Set
    • Output: Extracted text using a library like pdfplumber
    • Implementation as a UDF/transform

Any guidance or examples would be greatly appreciated!

Thanks in advance.

Hello,

The libraries and models behind the text extraction methods are not publicly documented, because these methods are designed to be easy to use and the backing models may change at any time. We aim to stay in line with industry standards, and we are currently working to update the backing models over the coming months.

If the specific model or library matters for your use case, we recommend writing a custom pipeline instead, which gives you full flexibility.

Thanks,

Isy

As to your second question, you should be able to develop a custom UDF that either calls an external API or uses the model or library of your choice directly. I don’t have any specific examples, but for more information you can reference our documentation on UDFs: https://www.palantir.com/docs/foundry/transforms-java/user-defined-functions
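
In the absence of an official example, here is a rough sketch of what the library-specific part of such a function might look like in Python with pdfplumber, with an optional Tesseract fallback for scanned pages. It assumes the PDF is already available as raw bytes; the Foundry-side wiring (reading the PDF from a Media Set and registering the function as a UDF or transform) is deliberately left out, since it depends on the transform API in use. The function names and the `min_chars` threshold are hypothetical.

```python
import io
from typing import Callable, List, Optional


def extract_pages(pdf, ocr_page: Optional[Callable] = None, min_chars: int = 20) -> List[str]:
    """Extract text page by page from an opened pdfplumber PDF.

    Pages yielding fewer than `min_chars` non-whitespace characters are
    treated as scanned and handed to `ocr_page`, if one is supplied.
    """
    results = []
    for page in pdf.pages:
        # Some pdfplumber versions return None for pages without text.
        text = page.extract_text() or ""
        if ocr_page is not None and len("".join(text.split())) < min_chars:
            text = ocr_page(page)
        results.append(text)
    return results


def tesseract_ocr(page, resolution: int = 300) -> str:
    """OCR a single pdfplumber page (needs pytesseract and the tesseract binary)."""
    import pytesseract  # lazy import: the raw-text path works without it

    image = page.to_image(resolution=resolution).original  # rendered page as a PIL image
    return pytesseract.image_to_string(image)


def extract_pages_from_bytes(pdf_bytes: bytes, ocr_page: Optional[Callable] = None,
                             min_chars: int = 20) -> List[str]:
    """Open a PDF held in memory and return one text string per page."""
    import pdfplumber  # lazy import so this module loads even where pdfplumber is absent

    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        return extract_pages(pdf, ocr_page=ocr_page, min_chars=min_chars)
```

With the PDF bytes in hand, `extract_pages_from_bytes(pdf_bytes, ocr_page=tesseract_ocr)` would return a list with one string per page, using OCR only for pages where raw extraction found little or nothing. Passing the OCR step as a callable keeps it easy to swap Tesseract for PaddleOCR or an external API.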