In Pipeline Builder, I have a mediaset and a dataset that represents chunks of the documents stored in the mediaset.
I want to find the relevant chunks to answer a specific question (e.g. “Is there a chunk about ABC”) and then summarize/extract the particular information I’m interested in.
The extraction is trivial once the chunks have been identified.
Question: How can I find the “relevant chunks” in each document to answer this particular question?
I see there is a KNN join, but I would also need to match on document_id (since I want to find this information within each document), so I believe it doesn’t really fulfill this need. Maybe I’m missing something?
Hey Vincent,
I have used this flow in the past for retrieval:
Use the Document Layout Extraction model (it’s Experimental, so you may need to enable it) on each document; it returns chunks (bounding boxes of the content pieces in the PDF).
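If it helps, I think of each chunk as roughly a record like this; the field names and types here are my own sketch, not the model’s exact output schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Hypothetical shape of one layout-extraction row; the real
    # output schema of the Experimental model may differ.
    document_id: str  # ties the chunk back to its source PDF in the mediaset
    page: int         # page the bounding box sits on
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) of the content piece
    image: bytes      # cropped content region, e.g. PNG bytes (my assumption)
```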
Then run a GPT-4o query on each chunk to summarise or transcribe the words/tables it contains.
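Inside Pipeline Builder you’d do this with the built-in LLM node, but just to show the shape of the call, here’s a rough standalone sketch using the OpenAI Python client (the prompt wording is my own):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def transcribe_chunk(png_bytes: bytes) -> str:
    """Ask GPT-4o to transcribe one cropped chunk image (words and tables)."""
    b64 = base64.b64encode(png_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this document fragment; render tables as plain text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```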
Now embed the GPT-4o results using an embedding model.
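Same caveat as above: Pipeline Builder has nodes for this, but standalone it’s roughly the following (text-embedding-3-small is just my pick, any embedding model works the same way):

```python
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed the per-chunk transcriptions in one batched call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
```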
Then at query time, run a KNN search between the query’s embedding and the chunk embeddings to get the top-k chunks. Note: you might have to change this logic slightly based on your use case; for your per-document requirement, see the sketch just below.
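Since you want the top-k chunks within each document rather than globally, one option is to restrict the similarity search per document_id. A minimal numpy sketch, assuming you have the chunk embeddings and their document IDs in memory:

```python
import numpy as np

def top_k_per_document(query_emb, chunk_embs, doc_ids, k=5):
    """Cosine-similarity KNN restricted to each document, so every
    document contributes its own top-k chunks."""
    q = np.asarray(query_emb, dtype=float)
    X = np.asarray(chunk_embs, dtype=float)
    sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    doc_ids = np.asarray(doc_ids)
    results = {}
    for doc in np.unique(doc_ids):
        idx = np.flatnonzero(doc_ids == doc)
        best = idx[np.argsort(sims[idx])[::-1][:k]]  # highest similarity first
        results[doc] = best.tolist()  # indices into the original chunk list
    return results
```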
Finally, pass the top-k chunks’ content to a GPT model to summarise them and generate an answer to the initial query.
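The last step is just prompt stuffing; a minimal sketch along the same lines (prompt wording is my own):

```python
from openai import OpenAI

client = OpenAI()

def answer(query: str, top_chunk_texts: list[str]) -> str:
    """Stuff the retrieved chunk transcriptions into the prompt and
    ask for an answer grounded in them."""
    context = "\n\n".join(top_chunk_texts)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided excerpts."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```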