Convert DOCX, PPTX, XLSX to PDF in Foundry?

VincentF · August 21, 2024, 10:32am

I have a bunch of Word documents, PowerPoint decks, and Excel workbooks that I want to convert to PDFs, so that I can ingest them into a mediaset to display to end users and run processing on top (e.g. text extraction, RAG pipeline, etc.).

How can I convert these different formats to PDF?
Can I load them directly into a mediaset for display, as is?

helenq · August 21, 2024, 7:06pm

Hey, unfortunately we don’t yet support uploading Microsoft Word docs into mediasets.

One solution that people have used is https://gotenberg.dev/, but you would have to use it from transforms that you author yourself.

VincentF · August 29, 2024, 5:13pm

Thanks for your guidance. Here is a full end-to-end tutorial!

Steps

Pre-requisite

Make sure container workflows are turned on for your Foundry enrollment, in Control Panel: https://www.palantir.com/docs/foundry/administration/container-governance/#enable-container-workflows

Prepare where to upload the container

The docs are here: https://www.palantir.com/docs/foundry/transforms-python/transforms-sidecar/index.html#push-an-image

Create an artifacts repository

Change the type to Docker

Take note of the registry that was created, shown at the bottom of the page; you will need it in later commands.


We now have a store to upload containers on Foundry.

Container Creation

You will need Docker on your machine to create the container, and a terminal to run a few commands.

Create a Dockerfile
1. In a folder of your choice
  e.g. cd ./my_folder
2. Create a Dockerfile
  e.g. via nano Dockerfile
3. Populate the Dockerfile

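FROM gotenberg/gotenberg:7

# Switch to root so the existing gotenberg user can be modified
USER root

# Give the gotenberg user the numeric UID 5001 (primary group 1001)
RUN usermod -u 5001 -g 1001 gotenberg

# Run the container as that numeric, non-root user
USER 5001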

Create the image locally
1. From the same directory …
  e.g. my_folder
2. [Optional] Make sure Docker is running on your machine
  e.g. manually launch the “Docker” application
3. Build the image locally
  Note: you need to replace <registry-name> in the commands below with the registry you noted when creating the artifacts repository
  1. If you are on an Intel-based Mac or an x86_64 architecture
    docker build . --tag <registry-name>/foundry_gotenberg:0.0.1 --platform linux/amd64
  2. If you are on an M1 Mac or an ARM-based architecture
    docker buildx build --platform linux/amd64 --push -t <registry-name>/foundry_gotenberg:0.0.1 .
    Note: --push uploads the image as part of the build, so run the login instructions from the Container Upload section first

We now have a container on our local laptop.
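
Before uploading, it can be worth checking that the container actually starts. One option, assuming the image is available to your local Docker daemon, is to run it with something like docker run --rm -p 3000:3000 <registry-name>/foundry_gotenberg:0.0.1 and then poll Gotenberg's health route. The sketch below is only an illustration: the helper names (http_probe, wait_until_ready) are made up for this example, and it assumes Gotenberg listens on port 3000 and exposes GET /health, which matches the port used in the transform further down.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumption: Gotenberg is reachable on its default port on this machine.
HEALTH_URL = "http://localhost:3000/health"


def http_probe(url: str = HEALTH_URL, timeout_s: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool] = http_probe,
                     attempts: int = 30,
                     delay_s: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling wait_until_ready() with the defaults waits up to roughly 30 seconds for the local container. The same helper could also be called at the start of the transform shown later, in case the sidecar needs a moment before it accepts requests.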

Container Upload

We have a store to upload containers on Foundry and we have a container on our local laptop. We want to upload this container to Foundry.

Go back to Foundry, to the artifact repository you created.
Follow the on-screen instructions at the top to generate a token.
Look at the instructions given for uploading the container.

Execute the first 3 instructions in your terminal, to be able to push to the Foundry repository
Customize and execute the last command with the name of the container you just created
docker push <registry-name>/foundry_gotenberg:0.0.1

At this point, you should see the Docker image uploaded in Foundry.
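
The reference you pushed has three parts: the registry, the image name, and the tag. The image name and tag are the values passed separately to the sidecar decorator in the transform below (image='foundry_gotenberg', tag='0.0.1'). As a rough sketch of how the parts relate, with a hypothetical helper name and a placeholder registry host:

```python
from typing import Tuple


def split_image_reference(reference: str) -> Tuple[str, str, str]:
    """Split '<registry>/<image>:<tag>' into (registry, image, tag)."""
    # Split on the last '/', so a registry containing a port or a path is kept whole.
    registry, _, name_and_tag = reference.rpartition("/")
    image, separator, tag = name_and_tag.partition(":")
    if not registry or not separator or not image or not tag:
        raise ValueError(f"expected '<registry>/<image>:<tag>', got {reference!r}")
    return registry, image, tag


# 'registry.example.com' is a placeholder, not a real Foundry registry name.
registry, image, tag = split_image_reference(
    "registry.example.com/foundry_gotenberg:0.0.1"
)
```

Here image and tag are the two strings that must match what you later write in the decorator.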

Container Usage

We now have our container in Foundry. We can use it in our transforms.

Create a code repository next to your Artifact repository in Foundry
Right click in the folder > New > Code Repository > Python Transforms
Import the transforms-sidecar library on the left side.
Also import the requests, polars libraries.
Import the artifact repository

In the example.py file in the repository, replace the default code with the code below. It uses the “sidecar” library to run the Docker image you uploaded earlier next to where the transform code is executed, so you can query the container via its HTTP API, as shown below.

from transforms.api import transform, Input, Output
from transforms.sidecar import sidecar
import requests
import tempfile
import shutil


# Start the Gotenberg image pushed earlier as a sidecar container next to the transform
@sidecar(image='foundry_gotenberg', tag='0.0.1', volumes=[])
@transform(
    output=Output("/path/dataset_with_pdfs_of_docx_files"),
    source=Input("/path/dataset_with_docx_files"),
)
def compute(output, source):
    def user_defined_function(file_status):
        # Copy the source file to a local temporary file
        with source.filesystem().open(file_status.path, 'rb') as in_f:
            with tempfile.NamedTemporaryFile() as tmp:
                shutil.copyfileobj(in_f, tmp)
                tmp.flush()

                # Send the file to the Gotenberg LibreOffice conversion route
                with open(tmp.name, 'rb') as tmp_f:
                    files = {
                        "file": (file_status.path, tmp_f)
                    }

                    url = 'http://localhost:3000/forms/libreoffice/convert'
                    response = requests.post(url, files=files)

                    # Write the PDF only when the conversion succeeded;
                    # files that fail to convert are skipped silently
                    if response.status_code == 200:
                        with output.filesystem().open(f"{file_status.path}.pdf", 'wb') as out_f:
                            out_f.write(response.content)

    # Run the conversion for every file in the input dataset
    source.filesystem().files().rdd.foreach(user_defined_function)
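
The transform above sends every file in the input dataset to Gotenberg and names each output by appending .pdf (so report.docx becomes report.docx.pdf). If the dataset mixes formats, a small filter plus a naming helper may be useful. This is a minimal sketch, not part of the original tutorial: the helper names and the suffix set are assumptions based on the three formats in the question.

```python
from pathlib import PurePosixPath

# Assumption: only the three Office formats from the original question are converted.
CONVERTIBLE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def is_convertible(path: str) -> bool:
    """Return True if the file extension is one we want to send to Gotenberg."""
    return PurePosixPath(path).suffix.lower() in CONVERTIBLE_SUFFIXES


def pdf_path_for(path: str) -> str:
    """Replace the last extension with .pdf, keeping any folder prefix."""
    return str(PurePosixPath(path).with_suffix(".pdf"))
```

Inside user_defined_function you could return early when is_convertible(file_status.path) is false, and write to pdf_path_for(file_status.path). One trade-off: swapping the extension means a.docx and a.xlsx would both map to a.pdf, whereas appending .pdf as in the transform above avoids that collision.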