Models used in the Extract text from PDF board in Pipeline Builder

crosati · May 27, 2025, 5:46pm

For the Extract text from PDF board in Pipeline Builder, what models are used under the hood for the Raw Text, OCR, and Layout aware extraction methods?

Maverick · May 27, 2025, 6:31pm

Hey there, crosati, welcome to the community!

In Palantir Foundry, you can extract text from PDFs with a range of models and tools, including:

Optical Character Recognition (OCR) Tools:

Tesseract: An open-source OCR engine that can be integrated into Foundry pipelines to extract text from scanned PDF documents.
Amazon Textract: An AWS service that extracts text and data from scanned documents, which can be used within Foundry.

Natural Language Processing (NLP) Models:

spaCy: An NLP library for processing and analyzing extracted text, offering features like tokenization and named entity recognition.
BERT: A transformer-based model that can be applied to the extracted text for tasks like entity recognition or text classification.

Custom Machine Learning Models:

Train custom models using Foundry’s machine learning capabilities for specific text extraction tasks.

Foundry’s Data Integration and Processing Tools:

Use Foundry’s tools to preprocess PDFs, apply OCR, and analyze text as part of a data pipeline.
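
As a rough illustration of the "extract, apply OCR, analyze" flow described above, here is a minimal Python sketch. It is not how the Extract text from PDF board works internally, and it does not use any Foundry API. The names `extract_pages`, `word_count`, `raw_text_fn`, `ocr_fn`, and `min_chars` are all hypothetical. The two extractor functions are passed in as parameters, so you could plug in whatever raw-text reader or OCR engine (for example Tesseract) you actually have available.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageResult:
    page_number: int
    text: str
    method: str  # "raw_text" or "ocr"


def extract_pages(
    pages: List[bytes],
    raw_text_fn: Callable[[bytes], str],
    ocr_fn: Callable[[bytes], str],
    min_chars: int = 20,
) -> List[PageResult]:
    """Try the embedded text layer first; fall back to OCR for sparse pages.

    raw_text_fn and ocr_fn are hypothetical callables supplied by the caller.
    min_chars is an assumed threshold for "this page is probably a scan".
    """
    results = []
    for number, page in enumerate(pages, start=1):
        text = raw_text_fn(page).strip()
        method = "raw_text"
        if len(text) < min_chars:
            # Little or no embedded text: treat the page as scanned.
            text = ocr_fn(page).strip()
            method = "ocr"
        results.append(PageResult(number, text, method))
    return results


def word_count(results: List[PageResult]) -> int:
    """A stand-in for the analysis step: count whitespace-separated words."""
    return sum(len(r.text.split()) for r in results)


if __name__ == "__main__":
    # Stub extractors so the sketch runs without any PDF or OCR library.
    def fake_raw(page: bytes) -> str:
        return "This page has an embedded text layer." if page == b"digital" else ""

    def fake_ocr(page: bytes) -> str:
        return "Text recovered by OCR from the scanned page."

    for result in extract_pages([b"digital", b"scanned"], fake_raw, fake_ocr):
        print(result.page_number, result.method, result.text)
```

Passing the extractors in as functions keeps the fallback logic separate from any particular OCR engine, so the same step works whichever tool you choose.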

Maverick · May 27, 2025, 9:43pm

With the Use LLM node in Pipeline Builder, you can choose which model to use.
See the Palantir documentation: Pipeline Builder > Transforms > Use LLM node.