I have thousands of PDFs from which I need to extract both text and tables for downstream processing. The extracted content will be used for:
- Text chunking
- Embedding generation
- Integration with a chatbot via the Qdrant vector database (a rough sketch of this chunk-embed-upsert flow is below)
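For context, the downstream flow after extraction looks roughly like this. This is a minimal sketch, not my production code: sentence-transformers and qdrant-client are just the libraries I happen to use, and the model name, collection name, endpoint, and chunk sizes are all placeholders.

```python
# Sketch of the downstream flow: chunk -> embed -> upsert into Qdrant.
# Model name, collection name, URL, and chunk sizes are placeholders.
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer


def chunk_text(text, size=1000, overlap=200):
    """Naive fixed-size character chunking with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]


extracted_text = "...page text from the PDF extraction step..."  # stand-in input

model = SentenceTransformer("all-MiniLM-L6-v2")     # placeholder model, 384-dim
client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint

client.recreate_collection(
    collection_name="pdf_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

chunks = chunk_text(extracted_text)
vectors = model.encode(chunks)
client.upsert(
    collection_name="pdf_chunks",
    points=[
        PointStruct(
            id=str(uuid.uuid4()),
            vector=vector.tolist(),
            payload={"context": chunk, "chunk_number": n},
        )
        for n, (chunk, vector) in enumerate(zip(chunks, vectors))
    ],
)
```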
Current State: I’ve successfully implemented this workflow using Pipeline Builder, but I’m facing scalability and computational limitations, particularly with media set processing.
Goal: I want to migrate this pipeline to Code Repository for better performance and scalability. I’m looking for:
- Sample code or templates to get started quickly
- Best practices for handling large-scale PDF processing in Foundry
- Efficient approaches for extracting both text and tabular data (my current per-document approach is sketched below)
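To make the extraction piece concrete, here is roughly what I do per document today. pdfplumber is just one option among several; the helper name is my own.

```python
# Per-document extraction sketch using pdfplumber (one option among several).
# Returns one record per page: the page text plus any tables flattened to
# tab-separated lines so they can be chunked like prose.
import pdfplumber


def extract_pdf(source):
    """`source` can be a filesystem path or a binary file-like object."""
    records = []
    with pdfplumber.open(source) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = [
                "\n".join(
                    "\t".join(cell or "" for cell in row) for row in table
                )
                for table in page.extract_tables()
            ]
            records.append({
                "context": "\n\n".join([text, *tables]).strip(),
                "page_number": page_number,
            })
    return records
```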
Target Output: A structured DataFrame (explicit schema sketched below) containing:
- context (extracted text/table content)
- page_number
- file_path
- metadata (chunk number, entities, summary)
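And this is the shape of the Code Repository transform I’m imagining, reading PDFs from a raw-files dataset rather than a media set. The dataset paths are placeholders, extract_pdf is the helper sketched above, and the file loop still runs on the driver, which is exactly the bottleneck behind my performance question:

```python
# Foundry transform sketch: raw PDF files in -> structured DataFrame out.
# Dataset paths are placeholders; extract_pdf is the helper sketched above.
import json

from pyspark.sql import types as T
from transforms.api import Input, Output, transform

SCHEMA = T.StructType([
    T.StructField("context", T.StringType()),
    T.StructField("page_number", T.IntegerType()),
    T.StructField("file_path", T.StringType()),
    T.StructField("metadata", T.StringType()),  # JSON: chunk number, entities, summary
])


@transform(
    out=Output("/My/Project/datasets/pdf_pages"),  # placeholder path
    raw=Input("/My/Project/datasets/raw_pdfs"),    # placeholder path
)
def compute(ctx, out, raw):
    fs = raw.filesystem()

    rows = []
    # NOTE: this loops over every file on the driver -- workable for a few
    # hundred PDFs, but it does not scale to thousands.
    for f in fs.ls(glob="**/*.pdf"):
        with fs.open(f.path, "rb") as fh:
            for rec in extract_pdf(fh):
                rows.append((
                    rec["context"],
                    rec["page_number"],
                    f.path,
                    json.dumps({"chunk_number": None, "entities": [], "summary": None}),
                ))

    out.write_dataframe(ctx.spark_session.createDataFrame(rows, SCHEMA))
```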
Questions:
- Are there existing code templates or reference implementations for large-scale PDF processing in Code Repo?
- Any recommendations for optimizing performance when processing thousands of documents? (My current idea, sketched below, is to push the per-file work onto executors.)
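On the second question, the main optimization I’m considering is moving the per-file loop off the driver and onto executors via the files DataFrame, along the lines of the raw-file pattern in the Foundry docs. Treat this as an untested sketch: it reuses SCHEMA and extract_pdf from above and assumes the transforms FileSystem can be used inside RDD functions.

```python
# Distributed variant: fan per-file extraction out across Spark executors.
# Assumes the transforms FileSystem is usable inside RDD functions; SCHEMA
# and extract_pdf are the definitions from the sketches above.
import json

from transforms.api import Input, Output, transform


@transform(
    out=Output("/My/Project/datasets/pdf_pages"),  # placeholder path
    raw=Input("/My/Project/datasets/raw_pdfs"),    # placeholder path
)
def compute(ctx, out, raw):
    fs = raw.filesystem()

    def process(file_row):
        # Runs on an executor; handles one PDF per input row.
        with fs.open(file_row.path, "rb") as fh:
            for rec in extract_pdf(fh):
                yield (
                    rec["context"],
                    rec["page_number"],
                    file_row.path,
                    json.dumps({"chunk_number": None, "entities": [], "summary": None}),
                )

    files_df = fs.files(glob="**/*.pdf")  # one row per file, with a `path` column
    rows = files_df.rdd.flatMap(process)
    out.write_dataframe(ctx.spark_session.createDataFrame(rows, SCHEMA))
```

My assumption is that repartitioning files_df before the flatMap would help spread thousands of small files evenly across executors, but I’d welcome corrections on that.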
I’d appreciate any guidance, examples, or best practices you can share to help accelerate this migration.