Convert DOCX, PPTX, XLSX to PDF in Foundry?

I have a bunch of Word documents, PowerPoint presentations, and Excel workbooks that I want to convert to PDFs, so that I can ingest them into a mediaset to display to end users and run processing on top (e.g. text extraction, a RAG pipeline).

How can I convert these different formats to PDF?
Can I load them directly into a mediaset for display, as is?

Hey, unfortunately we don’t yet support uploading Microsoft Word docs into mediasets.

One solution that people have used is https://gotenberg.dev/ but this would be in authoring transforms

Thanks for your guidance. Here is a full end-to-end tutorial!

Steps

Pre-requisite

Make sure container workflows are turned on for your Foundry enrollment, in Control Panel: https://www.palantir.com/docs/foundry/administration/container-governance/#enable-container-workflows

Prepare where to upload the container

The docs are here: https://www.palantir.com/docs/foundry/transforms-python/transforms-sidecar/index.html#push-an-image

  1. Create an artifacts repository

  2. Change the type to Docker

  3. Take note of the registry that was created, shown at the bottom of the page. You will need it in a later command.

We now have a store to upload containers on Foundry.

Container Creation

You will need Docker on your machine to create the container, and a terminal to run a few commands.

  1. Create a Dockerfile
    1. In a folder of your choice
      e.g. cd ./my_folder
    2. Create a Dockerfile
      e.g. via nano Dockerfile
    3. Populate the Dockerfile
FROM gotenberg/gotenberg:7
            
USER root
            
RUN usermod -u 5001 -g 1001 gotenberg
            
USER 5001
  2. Create the image locally
    1. From the same directory …
      e.g. my_folder
    2. [Optional] Make sure Docker is running on your machine
      e.g. manually launch the “Docker” application
    3. Build the image locally
      Note: replace <registry-name> in the commands below with the registry you noted when creating the artifacts repository.
      1. If you are on an Intel-based Mac or an x86_64 architecture
        docker build . --tag <registry-name>/foundry_gotenberg:0.0.1 --platform linux/amd64
      2. If you are on an M1 Mac or an ARM-based architecture
        docker buildx build --platform linux/amd64 --push -t <registry-name>/foundry_gotenberg:0.0.1 .

We now have a container on our local laptop.
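
To make the choice between the two build commands above concrete, here is a small Python sketch that assembles the same docker invocation depending on the host architecture. The helper name build_command and its machine parameter are hypothetical, added only for illustration; the flags, image name and tag are the ones from the commands above.

```python
import platform

IMAGE_NAME = "foundry_gotenberg"  # image name used in this tutorial
IMAGE_TAG = "0.0.1"               # tag used in this tutorial


def build_command(registry, machine=None):
    """Return the docker command (as an argument list) matching the host CPU.

    `registry` is the value noted from the artifacts repository page.
    `machine` defaults to platform.machine(); it is a parameter only so the
    function can be exercised without depending on the real host.
    """
    machine = (machine or platform.machine()).lower()
    image = f"{registry}/{IMAGE_NAME}:{IMAGE_TAG}"
    if machine in ("arm64", "aarch64"):
        # ARM hosts (e.g. M1 Macs): cross-build for amd64 with buildx and push.
        return ["docker", "buildx", "build", "--platform", "linux/amd64",
                "--push", "-t", image, "."]
    # x86_64 hosts: a plain build, still pinned to linux/amd64.
    return ["docker", "build", ".", "--tag", image, "--platform", "linux/amd64"]
```

Note that the ARM variant carries the --push flag, so it pushes to the registry as part of the build. It presumably needs the login steps from the Container Upload section to have been run first.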

Container Upload

We have a store to upload containers on Foundry and we have a container on our local laptop. We want to upload this container to Foundry.

  1. Go back to Foundry, to the artifacts repository you created

  2. Follow the top on-screen instructions to generate a token.

  3. Look at the instructions given to upload the container.

  4. Execute the first 3 instructions in your terminal, to be able to push to the Foundry repository

  5. Customize and execute the last command with the name of the container you just created
    docker push <registry-name>/foundry_gotenberg:0.0.1

At this point, you should see the docker image uploaded in Foundry

Container Usage

We now have our container in Foundry. We can use it in our transforms.

  1. Create a code repository next to your Artifact repository in Foundry
    Right click in the folder > New > Code Repository > Python Transforms

  2. Import the transforms-sidecar library on the left side.
    Also import the requests, polars libraries.

  3. Import the artifact repository

  4. In the example.py file in the repository, replace the default code with the code below. It uses the “sidecar” library to start the Docker image you uploaded earlier next to where the transform code runs, so you can query the container via its HTTP API on localhost.

        from transforms.api import transform, Input, Output
        from transforms.sidecar import sidecar
        import requests
        import tempfile
        import shutil

        # Start the uploaded Gotenberg image as a sidecar next to the transform.
        @sidecar(image='foundry_gotenberg', tag='0.0.1', volumes=[])
        @transform(
            output=Output("/path/dataset_with_pdfs_of_docx_files"),
            source=Input("/path/dataset_with_docx_files"),
        )
        def compute(output, source):
            def user_defined_function(file_status):
                # Copy the input file to a local temporary file.
                with source.filesystem().open(file_status.path, 'rb') as in_f:
                    with tempfile.NamedTemporaryFile() as tmp:
                        shutil.copyfileobj(in_f, tmp)
                        tmp.flush()

                        with open(tmp.name, 'rb') as tmp_f:
                            files = {
                                "file": (file_status.path, tmp_f)
                            }

                            # The sidecar listens on localhost; this is the LibreOffice conversion route.
                            url = 'http://localhost:3000/forms/libreoffice/convert'
                            response = requests.post(url, files=files)

                            # Only write an output file when the conversion succeeded.
                            if response.status_code == 200:
                                with output.filesystem().open(f"{file_status.path}.pdf", 'wb') as out_f:
                                    out_f.write(response.content)

            # Run the conversion for every file in the input dataset.
            source.filesystem().files().rdd.foreach(user_defined_function)
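
The transform above sends every file in the input dataset to the converter and names the result by appending .pdf to the original path, so report.docx becomes report.docx.pdf. If the input dataset may contain other file types, or if you would rather replace the extension, a small helper along these lines could be called at the top of user_defined_function. This is only a sketch: the names CONVERTIBLE_EXTENSIONS, is_convertible and pdf_output_path are hypothetical, and the extension list simply mirrors the formats from the original question.

```python
import posixpath

# Formats named in the original question; extend as needed (assumption).
CONVERTIBLE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def is_convertible(path):
    """True if the file extension is one we intend to send to the converter."""
    _, ext = posixpath.splitext(path)
    return ext.lower() in CONVERTIBLE_EXTENSIONS


def pdf_output_path(path):
    """Replace the extension with .pdf, keeping any sub-folder prefix."""
    root, _ = posixpath.splitext(path)
    return root + ".pdf"
```

Inside user_defined_function, one would return early when is_convertible(file_status.path) is false, and open pdf_output_path(file_status.path) on the output filesystem instead of the f-string path.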