Convert media references in dataset to files within raw dataset

Hi. We are trying to package certain files (PDFs, PNGs, etc.) into a dataset in order to export the dataset to a SharePoint location.

In this process, we fetch files from an SFTP location, create a media set, assign a status to each row, and generate a new file name for each file. We would then like to export the renamed files to a SharePoint location.

The current dataset structure (#1) looks as follows:

From what we could see, a dataset of type “raw dataset” usually holds raw files in the correct format for export to a SharePoint location. Here’s an example of a “raw dataset” (#2) that was uploaded to SharePoint successfully:

The questions we have would be:

a) Is there a way to convert dataset #1 into a “file-only” dataset like dataset #2?
b) If it is not possible, is there a different approach that would allow converting the media references into raw files (pdf, png, etc) in order to export to the SharePoint location?

Thanking you in advance for the assistance.

Hi!

When you import the MediaSet, it is passed into a Transform-object at runtime. This gives you some advantages in terms of transforming the media items, and also lets you work with them as in any other transform.

media_item_list = media_input.list_media_items_by_path_with_media_reference(ctx)

The above code will list the media in your set as a dataframe, with the columns media_item_rid, path and media_reference. Make sure to add ctx as a parameter of your transform compute function.

You can then use these media_item_rids to get the items:

media_item = media_input.get_media_item("ri.mio.main.media-item.123")

You should then be able to grab each item as a file object and copy it using the code example above to the dataset. Note that when you create a new dataset in Foundry, it’s automatically a ‘raw dataset’ until you apply a schema.
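As a rough illustration of that copy step, here is a minimal sketch. It assumes each media item comes back as a readable binary file object (as described above) and that the output dataset can hand out a writable file object per file name. The helper name copy_media_items and its parameters are hypothetical, not part of any Foundry API; the wiring shown in the trailing comment is an assumption to adapt to your own transform.

```python
import shutil


def copy_media_items(rows, fetch_item, open_output):
    """Copy each media item to the output under its new file name.

    rows        -- iterable of (media_item_rid, new_file_name) pairs
    fetch_item  -- callable returning a readable binary file object for a RID
    open_output -- callable returning a writable binary file object
                   (usable as a context manager) for a file name
    Returns the list of file names written, in order.
    """
    written = []
    for media_item_rid, new_file_name in rows:
        src = fetch_item(media_item_rid)
        with open_output(new_file_name) as dst:
            # Stream the bytes across without loading the whole file in memory.
            shutil.copyfileobj(src, dst)
        written.append(new_file_name)
    return written


# Assumed wiring inside a transform (not verified here):
#   copy_media_items(
#       rows,
#       media_input.get_media_item,
#       lambda name: output.filesystem().open(name, "wb"),
#   )
```

Keeping the two callables as parameters means the copy loop can be tried locally with in-memory buffers before it is run against a real media set.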

Last point: make sure you figure out the shape of your use case. Are you looking to set up a file export via Data Connector, or are you looking to use an API approach?

Hi @jakehop.

Thanks a lot for the detailed explanation.

Got the first part going where we “list by path with media refs”; it’s the “get_media_item” portion and the subsequent steps that have been missing.

To answer your last point, the idea would be that each time the script runs we generate an export dataset containing only the new files, and use the file export via Data Connector to push the files to SharePoint. The next time the script runs, it overwrites the export dataset, which again gets pushed to SharePoint.
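To make the "only the new files" part concrete, here is a small sketch of the selection step in plain Python. It assumes we can obtain the set of file names that were already exported (for example from the status we keep per row); the function name select_new_files and the shape of the rows are hypothetical.

```python
def select_new_files(rows, already_exported):
    """Return the (media_item_rid, new_file_name) rows not exported before.

    rows             -- iterable of (media_item_rid, new_file_name) pairs
    already_exported -- iterable of file names pushed in earlier runs
    """
    seen = set(already_exported)
    return [row for row in rows if row[1] not in seen]
```

The rows that come back would then be the only ones written to the export dataset for that run.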

Happy to take any additional guidance you may have on my use-case. Thanks again!

For some reason part of my answer was removed (@michaelt ???)

If you open the documentation and search for “Copy raw files between datasets”, there’s a code snippet you can play around with, which allows copying raw files to a dataset.

OK, you should be able to create a pretty solid workflow then. You’d basically want your transform to extract the files from the mediasets and store them directly in the output dataset. The only change from the raw file example above is that your input will be objects from a mediaset.

Let me know how it works out for you.