LLM (vision model) transform on PDF to extract visual data

I’ve done the AIP tutorial, but I want to incorporate structured visual data that’s also in the PDFs: things like graphs and tables of values. My issue is that “Use LLM” transforms cannot read the PDFs; the vision models can only read image media sets.

Is there a no-code way to convert the PDFs into images within Pipeline Builder, or is this something I have to code manually?

My plan is to use an LLM to scan the PDF and identify the specific page(s) where structured visual data is located, outputting a nested array (or similar structure) to handle cases where the visual data spans multiple pages. Next, I’d apply hOCR to extract both the positional data and the text for each identified section. The LLM would then generate a text summary of the hOCR-extracted data and a summary of the structured visual content, in order to create the relevant entities. Finally, the summarized information gets embedded into a vector, similar to the approach in the AIP tutorial for text data in PDFs.
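For the hOCR step, here is a minimal sketch of pulling word text and bounding boxes out of an hOCR document. It assumes the OCR engine (e.g. Tesseract) emits standard `ocrx_word` spans with `bbox` coordinates in the `title` attribute; the function name and output shape are illustrative, not part of any AIP board:

```python
import re
from bs4 import BeautifulSoup

def parse_hocr(hocr_html: str) -> list[dict]:
    """Extract word text and pixel bounding boxes from an hOCR document."""
    soup = BeautifulSoup(hocr_html, "html.parser")
    words = []
    for span in soup.find_all("span", class_="ocrx_word"):
        # hOCR stores coordinates in the title attribute,
        # e.g. "bbox 100 200 150 230; x_wconf 96"
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", span.get("title", ""))
        if not match:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        words.append({"text": span.get_text(strip=True),
                      "bbox": (x0, y0, x1, y1)})
    return words
```

The resulting word/position list could then be passed into the LLM prompt that produces the text summary for each identified section.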

Our teams are currently working on a transform that will let you convert PDFs to images, and you’ll then be able to use those images in the Use LLM node. We’re also scoping support for using PDFs directly in the Use LLM node. We’ll post updates in this thread, so stay tuned!

A workaround for now is to either use the text extraction board for PDFs or use code repositories to convert your PDFs to images and then feed those images into the Use LLM node.
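If you go the code-repository route, a minimal sketch of the PDF-to-image conversion step (shown here without any Foundry-specific transform wiring) could use pdf2image, which wraps the poppler utilities; the library choice, DPI, and file naming are assumptions rather than an official approach:

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

def pdf_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to a PNG and return the written file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    paths = []
    for i, page in enumerate(pages, start=1):
        target = out / f"{Path(pdf_path).stem}_page_{i:03d}.png"
        page.save(target, "PNG")
        paths.append(str(target))
    return paths
```

The resulting PNGs could be written to an image media set and fed to the Use LLM node. If you already know which pages you need, `convert_from_path` also accepts `first_page` and `last_page` arguments so you don’t have to render the whole document.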


Love it, thank you!

Will it let me pick individual pages of the PDF after they’re converted to images? If so, thank you again; that will save me a ton of compute time.