Loading Images in Jupyter Code Workspace

adampsu · December 30, 2024, 4:25am

Hi everyone,

I’m facing an issue with downloading images in my Jupyter Code Workspace. According to the Foundry documentation, the following approach should work for downloading files:

from foundry.transforms import Dataset

# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]

However, I’m struggling to adapt this for downloading images. When I follow these steps, I end up with parquet files instead of the expected images:

{
    'spark/part-00000-dcb7197b-d051-4dea-90ed-c56fbfa64726-c000.snappy.parquet': '/foundry/0a697b6f969d726c/ri.foundry.main.dataset.eb2d6994-f832-429b-a23b-47f82d37d3b5/ri.foundry.main.transaction.00000003-13ab-71e1-aa6f-cb06d9026800/spark/part-00000-dcb7197b-d051-4dea-90ed-c56fbfa64726-c000.snappy.parquet',
    'spark/part-00001-dcb7197b-d051-4dea-90ed-c56fbfa64726-c000.snappy.parquet': '/foundry/0a697b6f969d726c/ri.foundry.main.dataset.eb2d6994-f832-429b-a23b-47f82d37d3b5/ri.foundry.main.transaction.00000003-13ab-71e1-aa6f-cb06d9026800/spark/part-00001-dcb7197b-d051-4dea-90ed-c56fbfa64726-c000.snappy.parquet'
}

My dataset contains roughly 30,000 images that look like this:

Does anyone have experience with downloading images from a dataset in Foundry? Is there a specific approach for handling non-tabular data that contains images?

Thanks in advance!

lmartini · December 30, 2024, 2:55pm

Hi!

Could this be because your dataset contains only references to images and not the actual image files? Datasets that hold images are usually unstructured, so they wouldn’t have a schema.

I tried something similar to your approach: first I uploaded images to an unstructured dataset, and when I imported that dataset into the Jupyter Code Workspace, I was offered the unstructured option and could see my files listed as PNGs.

This is what the dataset preview looks like when the dataset contains only an image:

Your preview, by contrast, shows tabular data, which is why the Code Workspace reads it as a tabular dataset and not as images. To operate on the actual images, you would need to make sure that your dataset is unstructured and contains the image files themselves.
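If it helps, here is a quick way to check which case you are in before going further. This is only a minimal sketch: it assumes the value returned by download() is a dict mapping dataset file paths to local paths, as in the output you posted. The helper name summarize_downloaded_files and the list of image suffixes are my own, not part of any Foundry API.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to whatever your dataset holds.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff", ".webp"}


def summarize_downloaded_files(downloaded_files):
    """Count downloaded files by kind: 'image', 'parquet', or 'other'.

    `downloaded_files` is assumed to be a mapping of dataset file path
    to local path (hypothetical helper, not a Foundry function).
    """
    counts = Counter()
    for name in downloaded_files:
        # Only the last suffix matters, so "x.snappy.parquet" -> ".parquet".
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            counts["image"] += 1
        elif suffix == ".parquet":
            counts["parquet"] += 1
        else:
            counts["other"] += 1
    return dict(counts)


# Made-up example entries, shaped like the mapping in the original post.
example = {
    "spark/part-00000.snappy.parquet": "/tmp/part-00000.snappy.parquet",
    "images/cat.png": "/tmp/cat.png",
}
print(summarize_downloaded_files(example))
```

If the summary shows only parquet entries, the dataset is tabular and the images themselves are not stored in it as files.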

If you don’t want to operate on datasets with raw files, you can use first-class Mediasets in Foundry (I see some mio RIDs in your dataset, so perhaps you are already aware of those?). However, Mediasets are not yet supported in Code Workspaces, so you would have to use Pipeline Builder or Code Repositories for those.

adampsu · January 2, 2025, 3:04pm

Sorry for the late response – thanks. This actually helped me figure out the problem.
