UDF for converting a dataset of PDFs to a media set

Is it possible to create a UDF for use in Pipeline Builder that takes in PDFs from a dataset and outputs either PDFs in a media set or media set references? I saw the related problem linked below, but I am still wondering whether this is possible at all: https://community.palantir.com/t/how-to-convert-dataset-with-binary-column-into-a-mediaset/564

For context, I do almost all of my transforming in Pipeline Builder, and I am revisiting a goal to extract text from PDFs and display those PDFs from a media set in Workshop. I have many old PDFs plus new PDFs arriving weekly, and my ultimate goal is to process the new PDFs incrementally. I do not have edit access to the connection that pulls the PDFs into a dataset.

Hey @Joel,

Media set outputs from Pipeline Builder are close to being released, so hang tight!

In the meantime, you can probably achieve this in a Python transform. You can read more about it here:
https://www.palantir.com/docs/foundry/transforms-python/media-sets
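
As a rough sketch of what that could look like (not a tested implementation): the function below loops over the files in the input dataset and uploads each PDF to the media set. It assumes the PDFs are stored as raw files in the dataset rather than in a binary column, and that the media set output exposes `put_media_item(file_like, path)` as described on the docs page above. The name `copy_pdfs_to_media_set` and the paths in the commented wiring are placeholders, so please check the exact calls against the docs before relying on it.

```python
# The Foundry imports only resolve inside a Python code repository, so the
# decorator wiring is shown as comments and the copy logic is a plain function.
#   from transforms.api import transform, Input
#   from transforms.mediasets import MediaSetOutput


def copy_pdfs_to_media_set(source, media_set):
    """Upload each PDF file found in `source` to `media_set`.

    `source` is assumed to be a transform input whose filesystem() offers
    ls() and open(); `media_set` is assumed to offer put_media_item().
    Returns the list of paths that were uploaded.
    """
    fs = source.filesystem()
    uploaded = []
    for status in fs.ls():
        # Skip anything that does not look like a PDF by file extension.
        if not status.path.lower().endswith(".pdf"):
            continue
        with fs.open(status.path, "rb") as handle:
            media_set.put_media_item(handle, status.path)
        uploaded.append(status.path)
    return uploaded


# Hypothetical wiring inside the repository (paths are placeholders):
# @transform(
#     source=Input("/path/to/pdf_dataset"),
#     media_set=MediaSetOutput("/path/to/pdf_media_set"),
# )
# def compute(source, media_set):
#     copy_pdfs_to_media_set(source, media_set)
```

For the weekly batches, it is worth checking in the docs whether this transform can be combined with the `@incremental` decorator so that only newly added files are read on each run.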

Let me know if you have any more questions about this!

All the best,
JG
