How to extract data from tabular data, charts, and text from PDFs?

VincentF · November 6, 2024, 11:32am

What are the strategies to use to extract specific data (e.g. a specific number) from a very large PDF, containing text, but as well tabular data, charts, etc.

I believe some strategies like using multimodal models can work, but are there other strategies that are working well for those use-cases ?
https://community.palantir.com/t/how-to-extract-information-from-unstructured-excel-in-code-repos/1388

Extracting text from PDF can work, but is not very efficient with PDFs tables.
https://community.palantir.com/t/extracting-tables-from-pdfs-in-a-structured-format/309/2

nickk · November 13, 2024, 4:41pm

Extracting text (OCR text extraction, extracting unstructured text) expressions from Documents and Images in Logic should be coming very soon (i.e. extract text → pass text to a useLLM board). As soon as this week.

As for the PDF tables case, just to give the LLM as much context as possible, I would follow the strategy outlined in https://community.palantir.com/t/how-to-extract-information-from-unstructured-excel-in-code-repos/1388:

Extract text from PDF
Convert PDF to image
Pass both text and image to multi-modal modal (GPT, Gemini, etc)

We are currently working on an easy way for users to just pass in a PDF media reference into the useLLM block so that they don’t have to worry about converting to image first. But for now, users would have to manually convert to image before bringing the media into Logic in order for the above to work.

There are newer models like Claude 3.5 Sonnet which has been shown to be very efficient in your specific use case of text, data, and chart PDF analysis. And we are currently tracking adding this functionality as well. Unsure on the timeline of adding complete support as of right now. Link: https://docs.anthropic.com/en/docs/build-with-claude/pdf-support