For the Extract text from PDF board in Pipeline Builder, what models are used under the hood for the Raw Text, OCR, and Layout aware extraction methods?
Hey there, crosati, welcome to the community!
In Palantir Foundry, PDF text extraction can be done with a range of models and tools, including:
- Optical Character Recognition (OCR) tools:
  - Tesseract: An open-source OCR engine that can be integrated into Foundry pipelines to extract text from scanned PDF documents.
  - Amazon Textract: An AWS service that extracts text and data from scanned documents, which can be used within Foundry.
- Natural Language Processing (NLP) models:
  - spaCy: An NLP library for processing and analyzing extracted text, offering features like tokenization and named entity recognition.
  - BERT: A transformer-based model for tasks like entity recognition or text classification.
- Custom machine learning models:
  - Train custom models using Foundry’s machine learning capabilities for specific text extraction tasks.
- Foundry’s data integration and processing tools:
  - Use Foundry’s tools to preprocess PDFs, apply OCR, and analyze text as part of a data pipeline.
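To make the "apply OCR, then clean up the text" flow from the list above a bit more concrete, here is a minimal standalone Python sketch. It assumes the `pdf2image` and `pytesseract` packages (plus the Poppler and Tesseract binaries) are installed. The function names are hypothetical, it runs outside Pipeline Builder, and it is not a description of what the Extract text from PDF board does internally.

```python
import re
from typing import Callable, List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """OCR every page of a scanned PDF with Tesseract; one string per page.

    Assumption: pdf2image (needs Poppler) and pytesseract (needs the
    Tesseract binary) are installed. They are imported lazily so the
    rest of this module still works without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # Render each PDF page to a PIL image, then run OCR on it.
    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def clean_page_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", text)  # "extrac-\ntion" -> "extraction"
    text = re.sub(r"\s+", " ", text)       # newlines and runs of spaces -> one space
    return text.strip()


def extract_clean_text(
    pdf_path: str,
    ocr: Callable[[str], List[str]] = ocr_pdf_pages,
) -> List[str]:
    """Run the OCR step, then the cleanup step, and return cleaned pages.

    The OCR step is injectable so a different engine (hypothetically, a
    wrapper around a cloud OCR service) can be swapped in.
    """
    return [clean_page_text(page) for page in ocr(pdf_path)]
```

Calling `extract_clean_text("scan.pdf")` would return one cleaned string per page. Passing a different `ocr` callable lets you try another engine without touching the cleanup step.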
You can choose the model you want to use with the LLM node.
Pipeline Builder • Transforms • Use LLM node • Palantir