PDF Parsing with TypeScript

Jacob_SE · July 15, 2024, 1:06pm

I’ve attempted to parse PDFs using libraries like pdf-parse, pdf-lib, pdfjs-dist, and pdf2json. Although my code shows no errors, I get a compilation error with the following result.

Does anybody know how to extract text from a PDF file with TypeScript?

{
    "errorCode": "INVALID_ARGUMENT",
    "errorInstanceId": "64fd8be1-c243-4308-9a12-7bfcdd073cb8",
    "errorName": "Functions:CompileFailed",
    "parameters": {
        "stdout": "[object Object]\n[object Object]\n[object Object]\n[object Object]",
        "stderr": ""
    }
}
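The stdout above only shows "[object Object]", which suggests the logged values were turned into strings by default conversion, so the log says nothing about what they contained. As a rough sketch (toLogString is a hypothetical helper, not part of @foundry/functions-api or pdf-parse), values could be converted to readable strings before they are logged:

```typescript
// Hypothetical helper: turns any value into a readable string before logging.
function toLogString(value: unknown): string {
    if (typeof value === "string") {
        return value;
    }
    if (value instanceof Error) {
        // Keep the error name and message instead of a generic object dump.
        return `${value.name}: ${value.message}`;
    }
    if (value instanceof ArrayBuffer) {
        // Binary data: report the size rather than the raw contents.
        return `ArrayBuffer(${value.byteLength} bytes)`;
    }
    if (ArrayBuffer.isView(value)) {
        // Covers typed arrays such as Uint8Array (and Node's Buffer).
        return `binary view (${value.byteLength} bytes)`;
    }
    try {
        const json = JSON.stringify(value);
        // JSON.stringify returns undefined for values such as undefined or functions.
        return json === undefined ? String(value) : json;
    } catch {
        // Circular structures and similar cases fall back to String().
        return String(value);
    }
}

console.log(toLogString({ pages: 2 }));
console.log(toLogString(new ArrayBuffer(4)));
console.log(toLogString(new Error("boom")));
```

Calling console.log(toLogString(arrayBuffer)) in place of console.log(arrayBuffer) would then at least show how many bytes were read from the attachment.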

After logging, I found that the error occurs in the following code line:

const data = await pdfParse(buffer);

import { Attachment, Function, Integer } from "@foundry/functions-api";
import * as pdfParse from "pdf-parse";
import { isUint8Array } from "util/types";

export class PdfToTextConverter {
    @Function()
    public async convertPdfToText(attachment: Attachment): Promise<Integer> {
        try {
            // Read the PDF file from the attachment
            const blob = await attachment.readAsync();
            
            // Convert the Blob to an ArrayBuffer
            const arrayBuffer = await blob.arrayBuffer();
            console.log(arrayBuffer);
            const buffer = Buffer.from(arrayBuffer);
            console.log(buffer);
                        
            // Use pdf-parse to extract text from the PDF
            const data = await pdfParse(buffer);
            console.log(data);
            // The extracted text is in the 'text' property of the data object
            const extractedText = data.text;
            console.log(extractedText);

            return arrayBuffer.byteLength;
        } catch (error) {
            console.error("Error converting PDF to text:", error);
            throw new Error("Failed to convert PDF to text");
        }
    }
}
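One check that might help narrow this down is confirming that the bytes read from the attachment really are a PDF before they reach the parser. A well-formed PDF normally starts with the signature "%PDF-". The sketch below uses only standard typed arrays; looksLikePdf and describeBytes are hypothetical helper names, not part of pdf-parse or @foundry/functions-api:

```typescript
// Hypothetical helpers for inspecting attachment bytes before parsing.

/** Returns true when the bytes begin with the "%PDF-" signature. */
function looksLikePdf(bytes: Uint8Array): boolean {
    const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
    if (bytes.length < signature.length) {
        return false;
    }
    return signature.every((value, index) => bytes[index] === value);
}

/** Builds a short, string-only summary: total length plus the first bytes in hex. */
function describeBytes(bytes: Uint8Array, previewLength: number = 8): string {
    const preview = Array.from(bytes.subarray(0, previewLength))
        .map((value) => ("0" + value.toString(16)).slice(-2))
        .join(" ");
    return `${bytes.length} bytes, starts with: ${preview}`;
}

// Sample input: the ASCII bytes of "%PDF-1.7".
const sample = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
console.log(describeBytes(sample)); // 8 bytes, starts with: 25 50 44 46 2d 31 2e 37
console.log(looksLikePdf(sample)); // true
```

In the function above, this would amount to calling looksLikePdf(new Uint8Array(arrayBuffer)) before pdfParse(buffer). If the check passes and the failure still happens at the parse call, the problem is more likely in how the library loads or runs than in the input data.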

ivy · December 16, 2024, 6:56pm

Hi Jacob, did you ever figure out what the issue was? I’m running into something similar using pdf-parse.

Jacob_SE · December 23, 2024, 8:51am

I had extensive discussions with Palantir engineers earlier this year, and we suspect that the library may be crashing inside TypeScript functions in Palantir.

However, I believe that the Python function will work if the issue is indeed related to the TypeScript library.

arukavina · December 24, 2024, 12:42pm

Hi @Jacob_SE, please check out this post on Python Functions. It should show how to efficiently extract text from PDF files using Python functions if the TypeScript code is giving you issues: Parsing PDF Blob with pdfjs-dist

Let me know if it works.