I’ve got a dataset with MANY tabular files. They’re in Parquet, but they don’t all share the same schema. My workflow involves analysing specific files before the downstream processing pipeline runs.
We are talking MANY files, so splitting this dataset into one dataset per file is impractical to the point of being impossible, and it actually DDoSes Foundry.
Do you have any recommendations?
Would it be an option to isolate the data that needs to be analyzed in a separate transform, reading the files one by one instead of applying a schema directly to the input dataset? Something along the lines of the sketch below.
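A minimal sketch, assuming the standard transforms Python API plus pyarrow/pandas; the dataset paths and the file-selection logic are placeholders you’d swap for your own:

```python
# Sketch: isolate one raw file from a mixed-schema dataset without forcing
# a single schema on the whole input. Paths and the selector are placeholders.
import pyarrow.parquet as pq
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/analysis/isolated_file"),
    source=Input("/Project/raw/mixed_schema_dataset"),
)
def isolate_file(ctx, source, out):
    fs = source.filesystem()
    # Walk the raw files instead of reading the dataset as one table
    for status in fs.ls(glob="**/*.parquet"):
        if "target_file" not in status.path:  # placeholder selection logic
            continue
        with fs.open(status.path, "rb") as f:
            table = pq.read_table(f)
        # Convert just this one file and write it to the analysis output
        out.write_dataframe(ctx.spark_session.createDataFrame(table.to_pandas()))
        break
```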
Yeah, that’s the backup plan. But triggering a transform is a really frustrating speed bump for users… ideally I’d have a workflow that’s as responsive as opening a past transaction in Contour.
My hackier fallback is to trigger a compute module function that pushes just that single file to a new branch and kicks off the analysis there, roughly along these lines.
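A rough sketch of that flow. Everything below is hypothetical scaffolding: the three helper stubs stand in for whatever branch-creation, file-upload, and job-trigger calls your platform SDK or REST client actually exposes.

```python
# Hypothetical flow: one throwaway branch per requested file keeps runs isolated.
import uuid


def create_branch(dataset_rid: str, branch: str) -> None:
    """Placeholder: create a new branch on the dataset."""
    raise NotImplementedError


def upload_parquet_file(dataset_rid: str, branch: str, file_path: str) -> None:
    """Placeholder: copy the single requested file onto that branch."""
    raise NotImplementedError


def start_analysis_job(dataset_rid: str, branch: str) -> str:
    """Placeholder: trigger the analysis against the branch, return a job id."""
    raise NotImplementedError


def isolate_and_analyze(dataset_rid: str, file_path: str) -> str:
    branch = f"analysis-{uuid.uuid4().hex[:8]}"
    create_branch(dataset_rid, branch)
    upload_parquet_file(dataset_rid, branch, file_path)
    return start_analysis_job(dataset_rid, branch)
```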