Create a pipeline to feed whole PDFs to an LLM to extract important details in a code repository

Hello,

Has anyone tried feeding a PDF directly to an LLM to extract details, instead of the usual approach of converting each PDF page into an image and passing those images to an LLM vision model?

An AIP Agent with document-context RAG currently does the same thing, but there is a limitation on the number/size of PDFs.

I want the same functionality in a code repository: loop through all the PDF files and extract the important fields.
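For reference, a minimal sketch of that loop in plain Python, assuming pypdf is available for text extraction; call_llm, the prompt, and the field names are hypothetical placeholders for whatever model client and schema you actually use:

from pathlib import Path

from pypdf import PdfReader

# Hypothetical prompt and field names; adjust to your documents.
PROMPT = "Extract the invoice number, date, and total amount from this document:\n\n{text}"

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your chat-completion client."""
    raise NotImplementedError

def extract_fields(pdf_dir: str) -> dict[str, str]:
    results: dict[str, str] = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(pdf_path)
        # Pull the embedded text layer of every page at once, no image conversion.
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        results[pdf_path.name] = call_llm(PROMPT.format(text=text))
    return results

This only covers machine-generated PDFs with an embedded text layer; scanned documents would still need an OCR pass first.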


One way to do this in a TypeScript Functions repository is to extract the document text from the media reference, then pass it to the model. The request shape below follows the @foundry/models-api chat-completion pattern:

import { Objects } from "@foundry/ontology-api";
import { Function, MediaItem } from "@foundry/functions-api";
// import { GPT_4o } from "@foundry/models-api/language-models";
import { AnthropicClaude_3_5_Sonnet_V2 } from "@foundry/models-api/language-models";

export class Aip_module {

    @Function()
    public async createChatCompletion(userInput: string): Promise<string | undefined> {
        try {
            // Fetch all objects of the backing object type (async search).
            const mediaObjects = await Objects.search().myObject().allAsync();

            if (!mediaObjects || mediaObjects.length === 0) {
                throw new Error("No data found");
            }

            const media = mediaObjects[0];
            if (!media.mediaReference) {
                throw new Error("No media reference found");
            }

            // Only documents (e.g. PDFs) support text extraction.
            if (!MediaItem.isDocument(media.mediaReference)) {
                throw new Error("Media is not a document type");
            }

            // For machine-generated PDFs, read the embedded text layer directly.
            const extractedText = await media.mediaReference.extractTextAsync({});

            let documentText: string;
            if (!extractedText || extractedText.length === 0) {
                // No embedded text (likely a scanned PDF): fall back to OCR.
                const ocrText = await media.mediaReference.ocrAsync({
                    languages: [],
                    scripts: [],
                    outputType: 'text'
                });
                documentText = ocrText.join(' ');
            } else {
                documentText = extractedText.join(' ');
            }

            // Send the document text plus the user's question to the model.
            const response = await AnthropicClaude_3_5_Sonnet_V2.createChatCompletion({
                messages: [
                    { role: "SYSTEM", contents: [{ text: "Extract the requested details from the document below.\n\n" + documentText }] },
                    { role: "USER", contents: [{ text: userInput }] }
                ],
                params: { temperature: 0 }
            });
            return response.choices[0]?.message.content;
        } catch (error) {
            console.error("Error processing document:", error);
            throw error;
        }
    }

} // end class

Hello,

Objects cannot be created from a media set directly; a TypeScript code repository requires object types as input rather than a media set.

I want functionality in a PySpark code repository that extracts the PDF data directly, instead of looping through each page to extract the text.

Has anyone tried extracting PDFs directly with an LLM in a PySpark code repo?
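In a PySpark repo, one option is to read the PDFs as binary files and run the extraction in a UDF, so each document's full text comes out in one step. A minimal sketch, assuming Spark's binaryFile reader and pypdf available as a repository library; the model call is left out, since the client depends on your setup:

import io

from pypdf import PdfReader
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=StringType())
def pdf_to_text(content: bytes) -> str:
    # Extract the whole PDF's text layer in one go, no per-page image step.
    reader = PdfReader(io.BytesIO(content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

pdfs = (
    spark.read.format("binaryFile")    # columns: path, modificationTime, length, content
    .option("pathGlobFilter", "*.pdf")
    .load("/path/to/pdf/files")        # placeholder path, swap in your files location
)

extracted = pdfs.withColumn("text", pdf_to_text(F.col("content")))
# extracted.select("path", "text") now holds the full document text per PDF,
# ready to be sent to the LLM for field extraction.

Calling an LLM from inside a UDF puts the model behind Spark's retry and parallelism semantics, so it is often cleaner to extract the text in the transform and make the model calls in a separate step.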