How to convert a dataset with a binary column into a MediaSet

Hello,

I have a table in an MS SQL database that stores various files in a varbinary column. When synced to Foundry, this produces a dataset in which each row holds the contents of one file in a Binary column.

I would like to convert this dataset into a MediaSet or, at the very least, write each of these rows out as a raw file in an output dataset. I originally looked at creating a UDF in a Python transform to process each row and write it to the output dataset, but it appears that I am unable to pass a reference to the output dataset to the UDF.

Does anyone have experience creating a similar pipeline?

Thanks!

Here’s how I would do it. You can probably infer the input schema from the code below: it comes from parsing email attachments, so the columns are bytes, name, and email_id, but you can adapt it to whatever your table looks like. One important point is that each Media Set supports specific file types. In this example I manually created a Media Set for PDFs. If your column holds several types of files, you need to direct each one to the right Media Set.

import io

from transforms.api import transform, Input
from transforms.mediasets import MediaSetOutput


@transform(
    attachments=Input("ri.foundry.main.dataset.cab42d31-4633-425b-9186-35203368ac1d"),
    output_pdfs=MediaSetOutput('ri.mio.main.media-set.e8643e76-b1af-457d-a965-51091d3e52de')
)
def compute(attachments, output_pdfs):
    def process_row(row):
        # Wrap the raw bytes from the Binary column in a file-like object.
        data = io.BytesIO(row['bytes'])
        filename = row['name']
        email_id = row['email_id']
        # Prefix the file name with the email id so that item paths stay unique.
        output_pdfs.put_media_item(data, email_id + "_" + filename)

    # foreach calls process_row once per row of the input dataset.
    attachments.dataframe().foreach(process_row)
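
For the case mentioned above where the column holds more than one type of file, here is a rough sketch of how each row could be classified before it is written. It is plain Python with no Foundry calls. The helper name detect_file_type and the set of types it recognises (PDF, PNG, JPEG) are my own assumptions for illustration; it checks the well-known leading bytes of each format and falls back to the file extension.

```python
import os

# Well-known leading bytes ("magic numbers") for a few common formats.
# The labels on the right are arbitrary names chosen for this sketch.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]

# Fallback used only when the leading bytes are not recognised.
EXTENSIONS = {".pdf": "pdf", ".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg"}


def detect_file_type(data, filename=""):
    """Return 'pdf', 'png', 'jpeg', or None if the type is not recognised.

    `data` may be bytes, bytearray, or None (for example a null Binary cell).
    """
    head = bytes(data[:8]) if data else b""
    for prefix, kind in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return kind
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)
```

Inside process_row you could then keep a dict such as {"pdf": output_pdfs, "png": output_pngs} (where output_pngs would be a second, hypothetical MediaSetOutput declared in the decorator), look up the result of detect_file_type(row['bytes'], row['name']), and skip or log rows for which it returns None.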