Hi everyone,
I’m currently working on a project where document parsing accuracy is critical. I’m dealing with a mixed set of PDFs that include:
- PDFs with extractable raw text
- Scanned documents requiring OCR
- Hybrid PDFs (e.g., first page is scanned for approval signatures, remaining pages are raw text)
To improve accuracy, I’d like to experiment with different approaches. As you know, Pipeline Builder’s “Extract text from PDF” function offers three options: raw text, OCR, and layout-aware.
My questions are:
- Which Python libraries underlie each of these three extraction methods? Understanding this would help me avoid redundant development and potentially create hybrid approaches.
- Is it acceptable to develop custom UDFs using alternative libraries (like pdfplumber, Tesseract, PaddleOCR, etc.) and import them as transforms in Pipeline Builder? If so, could someone provide an example of:
  - Input: PDF from a Media Set
  - Output: Extracted text using a library like pdfplumber
  - Implementation as a UDF/transform
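To make the second question concrete, here is a rough sketch of the per-page logic I have in mind. It assumes the PDF is already available as raw bytes; how to read those bytes from a Media Set inside a transform is exactly the part I am unsure about, so it is not shown. The function names (`needs_ocr`, `extract_pages`) and the `min_chars` threshold of 20 are my own placeholders, not anything from Pipeline Builder.

```python
import io
from typing import Dict, List, Optional


def needs_ocr(text: Optional[str], min_chars: int = 20) -> bool:
    """Guess that a page is scanned when its text layer is empty or nearly empty.

    min_chars is an arbitrary placeholder threshold, not a tuned value.
    """
    return len((text or "").strip()) < min_chars


def extract_pages(pdf_bytes: bytes, min_chars: int = 20) -> List[Dict[str, object]]:
    """Extract the text layer page by page and flag pages that probably need OCR."""
    import pdfplumber  # third-party dependency, imported only when this runs

    rows: List[Dict[str, object]] = []
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_text() can come back empty for image-only pages
            text = page.extract_text() or ""
            rows.append(
                {
                    "page": number,
                    "text": text,
                    "needs_ocr": needs_ocr(text, min_chars),
                }
            )
    return rows
```

For the hybrid PDFs, the idea would be to send only the pages flagged with `needs_ocr` to an OCR step (Tesseract, PaddleOCR, or the built-in OCR option) and keep the pdfplumber text for the rest.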
Any guidance or examples would be greatly appreciated!
Thanks in advance.