I’m frequently running an incremental transform/append sync and the number of files in my dataset is growing too large. How can I manage this?
The two most common ways to manage this are periodic snapshotting plus repartitioning of your data, or retention policies.
First, on the repartition strategy: the key is to have an input that periodically re-snapshots, which in turn triggers a repartition. This could be a dummy input (for example, a Fusion dataset which “rebuilds” every Sunday). That “rebuild” will force a snapshot run of your transform, and you’ll need to add code like the pseudocode below to repartition the files:
if not ctx.is_incremental:
    get previous version of output
    check file / row count
    if it's too large:
        repartition it and add in new rows processed in this build run
if you added in old rows + new rows and repartitioned:
    set output mode to "replace"
otherwise:
    set output mode to "append"
write outputs
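The decision in the pseudocode above can be sketched in plain Python. This is only an illustration of the branching, not Foundry code: `plan_write`, `WritePlan`, and the `max_files` threshold are hypothetical names, rows are stood in for by simple lists, and the actual reading of the previous output and the repartitioning would have to go through your transform's own API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WritePlan:
    """Hypothetical container: which output mode to use and which rows to write."""
    mode: str          # "replace" or "append"
    rows: List[dict]


def plan_write(is_incremental: bool,
               previous_rows: List[dict],
               new_rows: List[dict],
               previous_file_count: int,
               max_files: int) -> WritePlan:
    """Decide between compacting (replace) and a normal incremental append.

    Compaction only happens on a snapshot run (is_incremental is False)
    and only when the previous output has grown past max_files.
    """
    if not is_incremental and previous_file_count > max_files:
        # Old rows + new rows are rewritten together; the real job would
        # repartition this combined data into fewer files before writing.
        return WritePlan(mode="replace", rows=previous_rows + new_rows)
    # Otherwise only the rows processed in this build are appended.
    return WritePlan(mode="append", rows=new_rows)


# Example: a Sunday snapshot run on an output that has grown to 5000 files.
plan = plan_write(False, [{"id": 1}], [{"id": 2}],
                  previous_file_count=5000, max_files=1000)
print(plan.mode, len(plan.rows))
```

Keeping the decision in one small function makes it easy to unit test the threshold without running a build.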
If you don’t care about old files and would be happy to delete them, you could instead use a retention policy. For retention, you simply need to configure retention policies on the datasets you want. A policy can, for example, wipe out transactions older than 180 days. Be careful when doing this: a retention policy defined on the current view can wipe out your production data, and the deletion is unrecoverable. Only set these up if you’re confident you don’t need the data.
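Because the deletion cannot be undone, it can help to do a dry run first and see what a 180-day cutoff would remove. The sketch below is a minimal plain-Python illustration, not a Foundry API: `expired_transactions` and the `committed_at` field are hypothetical, and it assumes you have already exported a list of transactions with their commit timestamps.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_transactions(transactions: List[Dict],
                         now: datetime,
                         max_age_days: int = 180) -> List[Dict]:
    """Return the transactions a max_age_days retention cutoff would delete.

    Each transaction is assumed to be a dict with a timezone-aware
    'committed_at' datetime (a hypothetical field name).
    """
    cutoff = now - timedelta(days=max_age_days)
    return [t for t in transactions if t["committed_at"] < cutoff]


# Example dry run with made-up transactions.
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
txns = [
    {"id": "a", "committed_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": "b", "committed_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
for t in expired_transactions(txns, now):
    print("would delete:", t["id"])
```

Reviewing this list before enabling the policy is a cheap way to confirm nothing in the current view is still needed.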