I'm looking to this community for your best practices and any hacky techniques you've developed along the way for using Pipeline Builder to “clean” text data extracted from PDFs.
For instance: removing header and footer text, ad text, page numbers, and so on.
One could also write custom code for this, but I'm looking for a way to do it in Pipeline Builder so that users can rapidly build their analysis workflows on PDF text!
Builder now has a feature that lets you use LLMs directly in the pipeline, which might help alongside the “extract from PDF” expression (https://www.palantir.com/docs/foundry/pb-functions-expression/pdfTextExtractionV1/).
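To illustrate the idea (this is a sketch of the pattern, not Foundry's actual API): after extraction, you can pass each page's text through an LLM with a cleaning prompt. `call_llm` below is a hypothetical stand-in for whatever LLM node your pipeline exposes.

```python
from typing import Callable

# Hypothetical sketch of LLM-based cleaning; `call_llm` stands in for
# whatever LLM node/board your pipeline actually provides.
CLEANING_PROMPT = (
    "The text below was extracted from a single PDF page. "
    "Remove headers, footers, page numbers, and advertising text, "
    "and return only the body text, otherwise unchanged:\n\n{page_text}"
)

def clean_page(page_text: str, call_llm: Callable[[str], str]) -> str:
    """Ask the model to strip boilerplate from one page of extracted text."""
    return call_llm(CLEANING_PROMPT.format(page_text=page_text))
```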
Given that properly parsing a PDF is almost an art, I would suggest splitting the work into two chunks: Data Science and Analytics. In the Data Science step, you can leverage the great work and research done on OCR/PDF parsing and a few excellent open-source Python libraries. Below you can find a link to a great post on how to use the PyMuPDF library, amongst others.
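To make the Data Science half concrete, here's a minimal sketch using PyMuPDF's block-level extraction to drop anything that falls in a fixed top/bottom band of each page — the band widths are assumptions you'd tune per template:

```python
import fitz  # PyMuPDF

HEADER_BAND = 0.08  # top 8% of page height — assumption, tune per template
FOOTER_BAND = 0.92  # bottom 8% of page height — assumption, tune per template

def extract_body_text(pdf_path: str) -> list[str]:
    """Return per-page text with header/footer blocks stripped."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            height = page.rect.height
            kept = []
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
            for x0, y0, x1, y1, text, *_ in page.get_text("blocks"):
                if y1 < height * HEADER_BAND:   # block sits entirely in header band
                    continue
                if y0 > height * FOOTER_BAND:   # block sits entirely in footer band
                    continue
                kept.append(text.strip())
            pages.append("\n".join(kept))
    return pages
```

A nice side effect: this already produces the per-page chunks you'd hand off downstream.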
Once that processing has happened, the result can be handed off as page chunks (or chunks of any other kind) to an Analytics user, who can leverage Builder to extract the true value of that data.
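For the deterministic cleanups on the Analytics side, plain regexes translate fairly directly into Builder's regex/string transforms. The pattern below is an assumption for a typical standalone page-number line:

```python
import re

# Assumed pattern for lines like "3", "Page 3", or "3 of 12" — adjust per template.
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_page_numbers(chunk: str) -> str:
    """Drop lines that contain nothing but a page number."""
    return PAGE_NUMBER_LINE.sub("", chunk)
```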
Of course, I am assuming here that the PDFs follow a handful of known templates.
The combination of Code Repositories + Pipeline Builder can be quite powerful in this type of workflow. I've used it successfully in the past and can attest to that.