Models used in the Extract text from PDF board in Pipeline Builder

For the Extract text from PDF board in Pipeline Builder, what models are used under the hood for the Raw Text, OCR, and Layout aware extraction methods?

Hey there, crosati, welcome to the community!

In Palantir Foundry, PDF text extraction can be achieved using various models and tools, including:

  1. Optical Character Recognition (OCR) Tools:
  • Tesseract: An open-source OCR engine that can be integrated into Foundry pipelines to extract text from scanned PDF documents.
  • Amazon Textract: An AWS service that extracts text and data from scanned documents, which can be used within Foundry.
  2. Natural Language Processing (NLP) Models:
  • spaCy: An NLP library for processing and analyzing extracted text, offering features like tokenization and named entity recognition.
  • BERT: A transformer-based model for tasks like entity recognition or text classification.
  3. Custom Machine Learning Models:
  • Train custom models using Foundry’s machine learning capabilities for specific text extraction tasks.
  4. Foundry’s Data Integration and Processing Tools:
  • Use Foundry’s tools to preprocess PDFs, apply OCR, and analyze text as part of a data pipeline.
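
To make the "apply OCR as part of a pipeline" idea a bit more concrete, here is a minimal Python sketch of one common pattern: try the PDF's embedded text layer first and fall back to OCR only when that comes back nearly empty. This is just an illustration, not a description of what the Extract text from PDF board does internally. The names `extract_text`, `raw_text`, `ocr`, and `min_chars` are all hypothetical, and the two extractors are passed in as plain functions so you can plug in whatever tool you actually use (for example, a wrapper around Tesseract).

```python
from typing import Callable, Tuple

# Hypothetical signature: takes the bytes of one PDF page and returns its text.
Extractor = Callable[[bytes], str]


def extract_text(
    page: bytes,
    raw_text: Extractor,
    ocr: Extractor,
    min_chars: int = 20,  # assumed threshold for "the page has a usable text layer"
) -> Tuple[str, str]:
    """Return (text, method), preferring the embedded text layer over OCR."""
    text = raw_text(page).strip()
    if len(text) >= min_chars:
        return text, "raw_text"
    # Little or no embedded text, so the page is probably a scan: fall back to OCR.
    return ocr(page).strip(), "ocr"


if __name__ == "__main__":
    # Stub extractors standing in for real tools, for demonstration only.
    def empty_text_layer(page: bytes) -> str:
        return ""

    def fake_ocr(page: bytes) -> str:
        return " text recognized from the scanned image "

    print(extract_text(b"%PDF-", empty_text_layer, fake_ocr))
```

The threshold is a judgment call: a scanned page sometimes carries a few stray characters in its text layer, so checking for "more than a handful of characters" tends to be more robust than checking for an empty string.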

You can choose the model you want to use with the LLM node.
Pipeline Builder • Transforms • Use LLM node • Palantir