Extract text and tables from PDFs using a code repo

I have thousands of PDFs from which I need to extract both text and tables for downstream processing. The extracted content will be used for:

  • Text chunking

  • Embedding generation

  • Integration with a chatbot via Qdrant vector database

Current State: I’ve successfully implemented this workflow using Pipeline Builder, but I’m facing scalability and computational limitations, particularly with media set processing.

Goal: I want to migrate this pipeline to a Code Repository for better performance and scalability. I’m looking for:

  1. Sample code or templates to get started quickly

  2. Best practices for handling large-scale PDF processing in Foundry

  3. Efficient approaches for extracting both text and tabular data

Target Output: A structured DataFrame (sketched in PySpark after this list) containing:

  • context (extracted text/table content)

  • page_number

  • file_path

  • metadata (chunk number, entities, summary)
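For reference, here is roughly the schema I have in mind, expressed in PySpark. The column names come from the list above; the exact types and the nesting of `metadata` are just my working assumption:

```python
from pyspark.sql import types as T

# Target schema for the extracted rows. Column names match the list
# above; the types and the nested `metadata` struct are assumptions.
PDF_CHUNK_SCHEMA = T.StructType([
    T.StructField("context", T.StringType()),       # extracted text / table content
    T.StructField("page_number", T.IntegerType()),
    T.StructField("file_path", T.StringType()),
    T.StructField("metadata", T.StructType([
        T.StructField("chunk_number", T.IntegerType()),
        T.StructField("entities", T.ArrayType(T.StringType())),
        T.StructField("summary", T.StringType()),
    ])),
])
```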

Questions:

  • Are there existing code templates or reference implementations for large-scale PDF processing in Code Repo?

  • Any recommendations for optimizing performance when processing thousands of documents?

I’d appreciate any guidance, examples, or best practices you can share to help accelerate this migration.

Hello! Hoping that someone else can give you some best practices for achieving your use case in transforms. On the Pipeline Builder side, can you be a bit more specific about what scaling limitations you’re running into? Are builds failing? Are rows erroring out? If rows are erroring, which transforms seem to be causing the errors?

Foundry’s built-in transforms are the best approach for PDF document chunking and embeddings for now. Could you help us understand more about the PDFs? Do any of them require OCR (i.e., are they scanned images rather than digital text)?
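If you’re not sure whether the PDFs have a text layer, one quick heuristic (a minimal sketch, assuming pdfplumber is available in your environment) is to check whether pages return any embedded text; pages that don’t are very likely scans that will need OCR:

```python
import pdfplumber

def needs_ocr(pdf_path: str, min_chars: int = 20) -> bool:
    """Heuristic: if no page yields a meaningful text layer,
    the PDF is probably a scanned image and will need OCR."""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if len(text.strip()) >= min_chars:
                return False  # found a real embedded text layer
    return True
```

Running this over a sample of your documents should tell you which extraction method to pick.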

Hello. For your use case, please use AIP Document Intelligence. When working with media sets, AIP Document Intelligence supports all three extraction methods: OCR, raw text, and layout-aware. To further improve accuracy, use a hybrid format that incorporates an LLM. At the end of the workflow, the app will automatically generate a code repo for you. When using an LLM, I recommend parsing with Haiku 4.5 to save on tokens.
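If you’d rather hand-write the repo than rely on the generated one, a minimal sketch of a Foundry Python transform along these lines might look like the following. The dataset paths are placeholders, and it assumes the PDFs sit as raw files in a dataset and that pdfplumber is added as a repository dependency (the generated repo’s code will differ):

```python
import pdfplumber
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/datasets/pdf_extracted"),  # placeholder path
    raw=Input("/Project/datasets/raw_pdfs"),        # placeholder path
)
def compute(ctx, out, raw):
    rows = []
    fs = raw.filesystem()
    for status in fs.ls(glob="**/*.pdf"):
        with fs.open(status.path, "rb") as f:
            with pdfplumber.open(f) as pdf:
                for page_number, page in enumerate(pdf.pages, start=1):
                    # Plain text from the page's embedded text layer.
                    text = page.extract_text() or ""
                    # Tables come back as lists of rows of cells; flatten
                    # them to text so they fit the single `context` column.
                    for table in page.extract_tables():
                        text += "\n" + "\n".join(
                            "\t".join(cell or "" for cell in row) for row in table
                        )
                    rows.append((text, page_number, status.path))
    df = ctx.spark_session.createDataFrame(
        rows, ["context", "page_number", "file_path"]
    )
    out.write_dataframe(df)
```

Note this loops over files on the driver; at thousands of PDFs you would want to distribute the extraction across executors and lean on the incremental setup described below.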

The generated code repo is set to incremental mode by default, which prevents re-extracting text from PDFs that have already been processed. The output data includes page numbers, file paths, and other metadata.
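For reference, a rough sketch of that incremental setup if you were writing it yourself (paths are placeholders, and I’m assuming the default incremental read mode, where only newly added files are visible on the input, and the default append-style write mode on the output):

```python
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    out=Output("/Project/datasets/pdf_extracted"),  # placeholder path
    raw=Input("/Project/datasets/raw_pdfs"),        # placeholder path
)
def compute(ctx, out, raw):
    # Under @incremental, the input filesystem defaults to the 'added'
    # read mode, so only PDFs landed since the last build are listed,
    # and the output is appended to rather than rewritten -- together
    # that is what prevents duplicate PDF text extraction.
    new_pdfs = [f.path for f in raw.filesystem().ls(glob="**/*.pdf")]
    # Run the same per-page extraction as in the sketch above over
    # new_pdfs, then write the resulting rows:
    rows = []  # (context, page_number, file_path) tuples from extraction
    df = ctx.spark_session.createDataFrame(
        rows, "context STRING, page_number INT, file_path STRING"
    )
    out.write_dataframe(df)
```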