PDF Preview vs. Pipeline Builder Text Extraction

rebeccanhan · June 17, 2025, 7:35pm

I am using the Pipeline Builder “Extract Text from PDF” board to extract text from my PDF.

I want to cut certain snippets of this text and then, in the PDF Preview widget in Workshop, use the “Customize initial search” option to feed each snippet back into the PDF and highlight that text.

This works sometimes. However, it seems that the text extracted by Pipeline Builder is not exactly the same as the text that is searchable in the PDF Preview widget. The differences are often minimal (headers/footers being present or absent, special symbols appearing in different locations or being read differently), but they are enough for the initial search to find no match, so nothing is highlighted in the PDF Preview.
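To illustrate the kind of mismatch I mean, here is a rough Python sketch, not something Pipeline Builder or Workshop provides. It assumes both texts are already available as plain strings (getting the widget's searchable text out is not covered here), and the function names are made up for illustration. It folds Unicode compatibility forms such as ligatures and non-breaking spaces, collapses whitespace, and then finds the longest run of the snippet that also appears in the other text, which might serve as a shorter, more robust search term.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms (ligatures, non-breaking spaces) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def longest_shared_run(snippet: str, page_text: str) -> str:
    """Return the longest substring of the normalized snippet that also occurs
    in the normalized page text (hypothetical helper, for illustration only)."""
    a, b = normalize(snippet), normalize(page_text)
    # autojunk=False so frequent characters such as spaces are not ignored on long texts
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size].strip()


if __name__ == "__main__":
    # Made-up sample strings: a ligature, a non-breaking space and an en dash on
    # one side; a page header, a plain hyphen and a footer on the other.
    snippet = "The \ufb01nal report\u00a0was approved \u2013 see Annex B."
    page_text = "Page 3 of 10 The final report was approved - see Annex B. Confidential"
    print(longest_shared_run(snippet, page_text))
```

Note that the returned string is taken from the normalized text, so it is only useful as a search term if the widget's own search tolerates the same normalization, which is an assumption I have not verified.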

I have tried every PDF text extraction algorithm that Pipeline Builder offers, and while the results differ in accuracy, none of them seem to match exactly what’s searchable in the PDF Preview.

I’m not sure how to proceed with this. Does anyone have any ideas on how to get this to work?

Isy · June 18, 2025, 4:00pm

In the ‘Extract Text from PDF’ board, are you using the ‘Raw text’ or ‘OCR’ option?

rebeccanhan · June 25, 2025, 5:43pm

Both have the same issue.