Hello,
Has anyone tried feeding a PDF directly to an LLM to extract details, instead of the usual approach of converting each PDF page into an image and passing the images to an LLM vision model?
An AIP Agent with document-context RAG currently does this, but there is a limit on the number and size of the PDFs.
I want the same functionality in a code repository: loop through all the PDF files and extract the important fields.
import { Objects } from "@foundry/ontology-api";
import { Function, MediaItem } from "@foundry/functions-api";
// import { GPT_4o } from "@foundry/models-api/language-models";
import { AnthropicClaude_3_5_Sonnet_V2 } from "@foundry/models-api/language-models";

export class Aip_module {
    @Function()
    public async createChatCompletion(userInput: string): Promise<string | undefined> {
        try {
            // Load the backing objects with a proper async operation
            const mediaObjects = await Objects.search().myObject().allAsync();
            if (!mediaObjects || mediaObjects.length === 0) {
                throw new Error("No data found");
            }
            const media = mediaObjects[0];
            if (!media.mediaReference) {
                throw new Error("No media reference found");
            }
            // Only document media types support text extraction
            if (!MediaItem.isDocument(media.mediaReference)) {
                throw new Error("Media is not a document type");
            }
            // For machine-generated PDFs, use extractTextAsync
            const extractedText = await media.mediaReference.extractTextAsync({});
            if (!extractedText || extractedText.length === 0) {
                // If no text was extracted, fall back to OCR (e.g. scanned PDFs)
                const ocrText = await media.mediaReference.ocrAsync({
                    languages: [],
                    scripts: [],
                    outputType: 'text'
                });
                return ocrText.join(' ');
            }
            return extractedText.join(' ');
        } catch (error) {
            console.error("Error processing document:", error);
            throw error;
        }
    }
}
Hello,
An object cannot be created from a media set directly; a TypeScript code repository requires object types as input rather than a media set.
I want functionality in a PySpark code repository to extract the PDF data directly, instead of looping through each page to extract the text.
Have you tried extracting PDF data directly with an LLM in a PySpark code repo?
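Not with an LLM reading the raw PDF, but as a starting point, one common pattern in a PySpark repo is to read each PDF's bytes, pull the full text out with a PDF library, and then send one extraction prompt per document to the LLM rather than one per page. A minimal sketch, assuming `pypdf` is (or can be added as) a dependency of the repository; `extract_pdf_text` and `build_extraction_prompt` are hypothetical helper names:

```python
from typing import List


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract the text of a whole PDF from its raw bytes."""
    # Assumption: pypdf is declared in the repo's dependencies.
    # Imported lazily so this module still loads where pypdf is absent.
    import io
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # Join every page's text into one document string
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_extraction_prompt(document_text: str, fields: List[str]) -> str:
    """Build a single LLM prompt asking for the important fields."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the document below: {field_list}.\n"
        "Return the result as JSON.\n\n"
        f"Document:\n{document_text}"
    )
```

In PySpark, `spark.read.format("binaryFile").load(path)` yields a `content` column of bytes, so `extract_pdf_text` can be wrapped in a UDF to process all PDFs in one pass, and the resulting prompts sent to the model of your choice.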