Parsing PDF Blob with pdfjs-dist

The following TypeScript code defines a class, SemanticSearch, used as part of a Semantic Search workflow. The class includes several methods, one of which extracts the text from a PDF blob.

  1. Imports:
import { Function, Attachments, Attachment, MediaItem } from "@foundry/functions-api";
import * as pdfjsLib from 'pdfjs-dist';
  • The code imports the necessary modules and types from @foundry/functions-api; imports from @foundry/ontology-api are not shown in this snippet.
  • It also imports the pdfjs-dist library for handling PDF documents.
  2. Setting the Worker Source for PDF.js:
pdfjsLib.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjsLib.version}/pdf.worker.min.js`;
  3. Method: extractTextFromPDFBlob:
public async extractTextFromPDFBlob(pdfBlob: Blob): Promise<string> {
    // Convert Blob to ArrayBuffer
    const arrayBuffer = await pdfBlob.arrayBuffer();

    // Load the PDF document
    const pdfDoc = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;

    let extractedText = '';

    // Loop through all pages and extract text
    for (let i = 1; i <= pdfDoc.numPages; i++) {
        const page = await pdfDoc.getPage(i);
        const textContent = await page.getTextContent();

        // Join all the text items into a single string
        const pageText = textContent.items.map((item: any) => item.str).join(' ');

        extractedText += pageText + '\n';
    }

    return extractedText;
}
  • This asynchronous method takes a Blob object representing a PDF file and extracts text from it.
  • It converts the Blob to an ArrayBuffer, loads the PDF document, and iterates through all pages to extract text content.
  • The extracted text from each page is concatenated into a single string and returned.
  • Method: getOrgbusinessDataAttachment:

The issue I’m getting is the following:

Promise.withResolvers is not a function.
Error Parameters: {}
TypeError: Promise.withResolvers is not a function
    at new PDFDocumentLoadingTask (UserCode:28401:32)
    at Module.getDocument (UserCode:28214:16)
    at SemanticSearch.extractTextFromPDFBlob (UserCode:15587:39)
    at async SemanticSearch.getOrgbusinessDataAttachment (UserCode:15603:30)
    at async le.executeFunctionInternal (FunctionsIsolateRuntimePackage:2:1008381)
    at async Ne (FunctionsIsolateRuntimePackage:2:1007469)
    at async le.executeFunction (FunctionsIsolateRuntimePackage:2:1007756)
    at async userFunction (FunctionsInitialization:8:43)

It’s suggested that:

The build of PDF.js you are using does not support running in Node.js (i.e. only in the browser). The error comes from Promise.withResolvers being called, which is not supported by Node.js

As noted in https://github.com/mozilla/pdf.js/issues/18006, the recommended way to run it under Node.js is to use the legacy build (pdfjs-dist/legacy/build/pdf.js); see https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#faq-support.

Source: https://stackoverflow.com/questions/78415681/pdf-js-pdfjs-dist-promise-withresolvers-is-not-a-function
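
If switching to the legacy build is not an option, another workaround might be to polyfill Promise.withResolvers before getDocument is called. The block below is only a minimal sketch under that assumption: the helper name installWithResolversPolyfill and the Resolvers type are hypothetical (not part of pdfjs-dist or the Foundry API), and I have not confirmed that this is the only modern feature the non-legacy build expects from the runtime.

```typescript
// Shape of the object returned by Promise.withResolvers().
// "Resolvers" is a hypothetical local type name used only for this sketch.
type Resolvers<T> = {
    promise: Promise<T>;
    resolve: (value: T | PromiseLike<T>) => void;
    reject: (reason?: unknown) => void;
};

// Hypothetical helper: installs a simplified polyfill only when the
// runtime does not already provide Promise.withResolvers.
function installWithResolversPolyfill(): void {
    const P = Promise as unknown as { withResolvers?: <T>() => Resolvers<T> };
    if (typeof P.withResolvers === "function") {
        return; // native implementation present; leave it untouched
    }
    P.withResolvers = function <T>(): Resolvers<T> {
        let resolve!: (value: T | PromiseLike<T>) => void;
        let reject!: (reason?: unknown) => void;
        // Capture the executor's resolve/reject so callers can settle
        // the promise from outside the constructor.
        const promise = new Promise<T>((res, rej) => {
            resolve = res;
            reject = rej;
        });
        return { promise, resolve, reject };
    };
}

// Must run before pdfjsLib.getDocument(...) is invoked.
installWithResolversPolyfill();
```

This simplified version always builds a plain Promise rather than honoring subclasses the way the native method does, which should be enough for the call in the stack trace above. The legacy build quoted above remains the approach the PDF.js maintainers recommend.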

Is this happening to any of you?
