I frequently run an incremental transform/append sync, and the number of files in my dataset is growing too large. How can I manage this?
The two most common ways to manage this are periodic snapshotting plus repartitioning of your data, or retention policies.
First, on the repartition strategy: the key is to have an input that re-snapshots and thereby triggers a repartition. This could be a dummy file (for example, a Fusion dataset that “rebuilds” every Sunday). That “rebuild” will trigger a snapshot, and you’ll need to add code like the pseudocode below to repartition the files:
if not ctx.is_incremental:
    get the previous version of the output
    check its file / row count
    if it's too large:
        repartition it and add in the new rows processed in this build run
if you added in old rows + new rows and repartitioned:
    set output mode to “replace”
otherwise:
    set output mode to “append”
write outputs
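The decision logic in the pseudocode above can be sketched in plain Python. This is only an illustration, not Foundry code: `WritePlan`, `plan_write`, and the `max_files` / `rows_per_file` thresholds are hypothetical names and values. In a real transform, `is_incremental` would come from `ctx.is_incremental`, and the counts would come from however you inspect the previous output.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class WritePlan:
    mode: str                         # "replace" or "append"
    include_previous: bool            # rewrite old rows together with the new ones?
    target_partitions: Optional[int]  # file count to repartition to when compacting


def plan_write(is_incremental, previous_file_count, previous_row_count,
               max_files=1000, rows_per_file=1_000_000):
    """Decide how to write this build's output (hypothetical helper).

    The thresholds are illustrative assumptions; tune them for your data.
    """
    # Normal incremental run: just append the newly processed rows.
    if is_incremental:
        return WritePlan("append", False, None)

    # Snapshot run, but the output is still small enough: keep appending.
    if previous_file_count <= max_files:
        return WritePlan("append", False, None)

    # Snapshot run with too many files: rewrite old + new rows into fewer files.
    # The target is sized from the previous row count only, as a rough estimate.
    target = max(1, math.ceil(previous_row_count / rows_per_file))
    return WritePlan("replace", True, target)


# Example: a weekly snapshot run that finds 5,000 small files holding 10M rows.
plan = plan_write(is_incremental=False,
                  previous_file_count=5000,
                  previous_row_count=10_000_000)
print(plan.mode, plan.target_partitions)
```

Keeping the decision separate from the read and write calls makes it easy to unit test the thresholds before wiring them into the transform.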
If you don’t care about old files and would be happy to delete them, you could instead use a retention policy: simply configure retention policies on the datasets you want. A policy can, for example, wipe out transactions older than 180 days. Be careful, though: a retention policy that applies to the current view can wipe out your production data, and that data will be unrecoverable. Only set these up if you’re confident you don’t need the data.
We also have a new (currently Beta) feature called pipeline “parameterization” that lets you schedule different logic on your dataset at different frequencies, e.g. a daily append and a weekly compaction.
See this example: https://www.palantir.com/docs/foundry/building-pipelines/parameterization/#example-use-case-periodic-snapshot-jobs-for-an-incremental-transform
Hi Sarah!
Very cool to see this parameterization functionality. It has always felt like having transforms be a bit more dynamic (allow scheduled authoring builds) would be helpful and this is a step in that direction.
Hopefully this functionality gets expanded in the future to give us something close to multiple job specs, for example:
- Ability to modify resources based on either previous errors (like oom) or other triggers.
- Ability to fall back to some logic or behavior similar to a pre/post hook.
- Job builds default to 3 retries today; it would be cool to set one of those to a different param.
- I would love to be able to gracefully switch between lightweight and Spark: lightweight for incremental, Spark for snapshots, etc.