I have a bunch of Word documents, PowerPoint presentations, and Excel spreadsheets that I want to convert to PDFs, so that I can ingest them into a mediaset to display to end users and run processing on top (e.g. text extraction, a RAG pipeline, etc.)
How can I convert these different formats to PDF?
Can I load them directly into a mediaset for display, as is?
In the example.py file in the repository, replace the default code with the code below. It uses a “sidecar” library to load the Docker image you uploaded earlier next to where the transform's code is executed, so you can query the Docker image over its HTTP API, as shown below.
from transforms.api import transform, Input, Output
from transforms.sidecar import sidecar
import requests
import tempfile
import shutil


# Run the Gotenberg image as a sidecar container next to the transform.
@sidecar(image='foundry_gotenberg', tag='0.0.1', volumes=[])
@transform(
    output=Output("/path/dataset_with_pdfs_of_docx_files"),
    source=Input("/path/dataset_with_docx_files"),
)
def compute(output, source):
    def user_defined_function(file_status):
        # Copy the input file to a local temporary file.
        with source.filesystem().open(file_status.path, 'rb') as in_f:
            with tempfile.NamedTemporaryFile() as tmp:
                shutil.copyfileobj(in_f, tmp)
                tmp.flush()

                # Post the file to the sidecar's LibreOffice conversion route.
                with open(tmp.name, 'rb') as tmp_f:
                    files = {
                        "file": (file_status.path, tmp_f)
                    }
                    url = 'http://localhost:3000/forms/libreoffice/convert'
                    response = requests.post(url, files=files)

                    # Write the PDF only when the conversion succeeded.
                    if response.status_code == 200:
                        with output.filesystem().open(f"{file_status.path}.pdf", 'wb') as out_f:
                            out_f.write(response.content)

    # Apply the conversion to every file in the input dataset.
    source.filesystem().files().rdd.foreach(user_defined_function)
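Since the question covers Word, PowerPoint, and Excel files, it may help to filter inputs by extension and to name outputs `report.pdf` rather than `report.docx.pdf`. Below is a minimal sketch of such a helper. The function name `pdf_output_path` and the `CONVERTIBLE_EXTENSIONS` set are hypothetical, and the list of extensions is an assumption about what the LibreOffice route in your image accepts, so check it against your own files.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: extensions the LibreOffice route is expected to convert.
# Adjust this set to match the files in your dataset.
CONVERTIBLE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


def pdf_output_path(input_path: str) -> Optional[str]:
    """Return the input path with a .pdf extension, or None if the file
    does not look like an office document we want to convert."""
    path = PurePosixPath(input_path)
    # Compare case-insensitively so "DECK.PPTX" is treated like "deck.pptx".
    if path.suffix.lower() not in CONVERTIBLE_EXTENSIONS:
        return None
    return str(path.with_suffix(".pdf"))
```

Inside `user_defined_function`, you could call `pdf_output_path(file_status.path)`, return early when it yields `None`, and otherwise pass the result to `output.filesystem().open(...)`. Note that `report.docx` and `report.xlsx` in the same folder would both map to `report.pdf`, so keep the original `.docx.pdf` style naming if such collisions are possible in your dataset.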