Hi all, hoping to get some input on recommended practices for working with Zarr data in xarray using Foundry (as well as a gut check on how I’m thinking about this).
For context, I’m working on atmospheric data processing pipelines using Zarr/xarray and need guidance on efficiently handling large n-dimensional arrays in Foundry while maintaining lazy loading capabilities. I’ve been able to set this up outside Foundry (using an EC2 instance with a colocated S3 bucket) with relative ease. A general pipeline step does something like this (roughly sketched in code below):
- Create Dask cluster
- Load in Zarr from S3 bucket (using xarray.open_zarr())
- Perform transform on dataset
- Write back to a new Zarr in S3 bucket
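For concreteness, here's a minimal sketch of what one of these steps looks like today outside Foundry; the bucket paths, variable name, and the "transform" itself are just placeholders:

```python
# Minimal sketch of the current EC2/S3 pipeline step (paths and the example
# transform are placeholders).
import xarray as xr
from dask.distributed import Client

client = Client(n_workers=8)  # Dask cluster (local here; distributed on EC2)

ds = xr.open_zarr("s3://my-bucket/input.zarr")  # lazy: only metadata is read up front
result = ds.assign(t_mean=ds["temperature"].mean(dim="time"))  # example transform

result.to_zarr("s3://my-bucket/output.zarr", mode="w")  # computation happens here
client.close()
```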
This setup works fine for one-off jobs/rapid development of the pipeline logic, but I’d like to eventually move this inside Foundry. When thinking about how to “translate” code like this into Foundry, I have a couple questions:
Question 1: I/O of Zarr data within a code repo transform
A major benefit of the Zarr format is that data is chunked and can be loaded lazily. It’s not clear to me how this works once the Zarr data no longer lives in an S3 bucket. We can treat the dataset like a filesystem, but as noted in the doc on working with unstructured files in Foundry (https://www.palantir.com/docs/foundry/transforms-python/unstructured-files), actually using the data inside a code repo transform would involve copying the Zarr files out of the dataset filesystem onto the transform’s local disk first – which would get messy and be a huge performance bottleneck for a large store (see the sketch below).
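To illustrate the concern, this is roughly what the copy-first approach from that doc would look like for a Zarr store, based on my reading of the transforms-python filesystem API; the store prefix and the helper function are made up:

```python
# Rough sketch of the copy-everything-first approach from the unstructured-files
# doc, to show why it becomes a bottleneck for a large Zarr store.
# `source` would be a transforms-python TransformInput; "data.zarr/" is a
# placeholder prefix for where the store sits inside the dataset.
import os
import shutil
import tempfile
import xarray as xr

def load_zarr_by_copying(source):
    fs = source.filesystem()
    local_root = tempfile.mkdtemp()
    for f in fs.ls():                                    # every file in the dataset
        if not f.path.startswith("data.zarr/"):
            continue
        local_path = os.path.join(local_root, f.path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with fs.open(f.path, "rb") as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)                 # full download before any compute
    # Only "lazy" against local disk, after paying the full copy cost up front.
    return xr.open_zarr(os.path.join(local_root, "data.zarr"))
```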
Another potential option is the Foundry S3-compatible API (https://www.palantir.com/docs/foundry/data-integration/foundry-s3-api). As I understand it, this lets us treat Foundry datasets as S3 buckets, so in theory we could use the same tooling as for a normal S3 bucket (for example, create an s3fs filesystem in Python and use it to load the Zarr into xarray – rough sketch below). Reading the doc, though, the typical use case seems to be a third-party application outside Foundry that wants to access data inside Foundry. What we would be doing is a bit different, since we would be accessing the data entirely within Foundry. Because of that, the setup (registering a third-party application, generating access keys, etc.) feels unnecessary, and I’m not sure I’m thinking about this the right way. I’m curious whether anyone has done this before and has any insights here.
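For reference, this is roughly what I imagine that would look like. The endpoint URL, bucket/key layout, and credential handling are all placeholders based on my reading of the doc, not something I’ve verified:

```python
# Hypothetical sketch: opening a Zarr store in a Foundry dataset through the
# S3-compatible API with s3fs. Endpoint URL, bucket/key layout, and credentials
# are placeholders -- the real values would come from the Foundry S3 API doc
# and a registered third-party application.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY_FROM_FOUNDRY",          # placeholder credentials
    secret="SECRET_KEY_FROM_FOUNDRY",
    client_kwargs={"endpoint_url": "https://<stack-hostname>/<s3-endpoint>"},  # placeholder
)

# Map the dataset "bucket"/prefix to a key-value store and open it lazily.
store = s3fs.S3Map(root="<dataset-bucket>/data.zarr", s3=fs, check=False)
ds = xr.open_zarr(store)  # same lazy, chunked reads as against a normal S3 bucket
```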
Question 2: Reprovisioning resources for Dask cluster
Spark is not optimized for operations on nd-arrays (which are non-tabular). When computing on EC2, it’s relatively straightforward to spin up a Dask cluster and run the xarray computation there. But within a Foundry transform, I’m not sure how much control we have over the underlying resources. If I wanted to “repurpose” the Spark executors as Dask workers, what would that even look like? Is it possible? The computations will consist almost entirely of operations on xarray objects rather than tabular operations on Spark dataframes. I’d appreciate any recommendations/best practices for getting the most out of the underlying resources here (a hypothetical fallback is sketched below).
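For what it’s worth, the only fallback I can picture right now is ignoring the executors entirely and running a local Dask cluster on a beefed-up driver inside the transform, something like the hypothetical sketch below. The profile name and dataset paths are made up, and whether this (or actually repurposing executors) is the recommended pattern is exactly what I’m asking:

```python
# Hypothetical sketch: running xarray/Dask on the driver only, inside a
# transforms-python transform. Profile name and dataset paths are placeholders.
from transforms.api import transform, Input, Output, configure
from dask.distributed import Client, LocalCluster
import xarray as xr

@configure(profile=["DRIVER_MEMORY_EXTRA_LARGE"])  # placeholder profile name
@transform(
    out=Output("/Project/processed_zarr"),   # placeholder paths
    source=Input("/Project/raw_zarr"),
)
def compute(ctx, out, source):
    # Driver-local Dask cluster; the Spark executors would sit idle in this model.
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    # ... open the Zarr (e.g. via the S3-compatible API above), apply the
    # xarray transforms, and write the result back ...

    client.close()
    cluster.close()
```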
In addition to these two questions, if anyone has worked with nd-arrays in Foundry, any insights here would be much appreciated. I’d really like to make Foundry work for my use case, but so much of the functionality seems to be built around tabular data and I’m not sure how much room I have to optimize for big non-tabular data here.
Thanks in advance.