Dynamic configurations for Incremental pipelines

Hello,

We have several incremental pipelines with large amounts of historical data that we need to snapshot regularly (weekly).

The problem is that the config needed for an incremental run is at least 20x smaller in executor memory, number of executors, and partition size than what a snapshot run needs.

Dynamically changing the config after the job has started looks impossible.

Has anyone found a way to deal with this situation?


Dynamic allocation Spark profiles are the closest fit for this use case. You can’t modify executor memory or partition size dynamically that way, but in my personal experience it’s usually possible to achieve good results even with those constraints.
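
For reference, dynamic allocation at the Spark level is driven by a handful of standard properties. Here is a minimal sketch; the min/max executor counts are placeholders, and how these properties actually get applied (e.g. through platform-specific Spark profiles) depends on your setup:

```python
from pyspark.sql import SparkSession

# Minimal sketch: standard Spark properties behind dynamic allocation.
# Executor counts below are placeholders; tune them to your workloads.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # incremental-scale floor
    .config("spark.dynamicAllocation.maxExecutors", "64")   # snapshot-scale ceiling
    .config("spark.shuffle.service.enabled", "true")        # commonly required for dynamic allocation
    .getOrCreate()
)
```

With this, executors scale up for the heavy weekly snapshot and scale back down for incremental runs, even though per-executor memory stays fixed.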

It’s also worth noting that you can dynamically change the number of partitions you pass to Spark’s repartition function based on whether you are running incrementally or as a snapshot (or simply based on the total amount of data to be processed, which you can compute efficiently at runtime by summing the sizes of the files in the input filesystem). This technique helps reduce the amount of data per task as needed to get your job to run well with a static amount of per-executor memory.
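
A rough sketch of that idea, assuming a Parquet input and using the Hadoop FileSystem API to sum file sizes; the input path, target bytes per partition, and bounds are all placeholders you would replace with your own values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def estimate_num_partitions(input_path,
                            target_bytes_per_partition=128 * 1024 * 1024,
                            min_partitions=8,
                            max_partitions=4096):
    """Sum the bytes of all files under input_path and derive a partition count."""
    jvm = spark.sparkContext._jvm
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(input_path)
    fs = path.getFileSystem(hadoop_conf)
    total_bytes = fs.getContentSummary(path).getLength()
    n = max(1, total_bytes // target_bytes_per_partition)
    return int(min(max(n, min_partitions), max_partitions))

input_path = "/data/events"                      # placeholder input location
df = spark.read.parquet(input_path)
df = df.repartition(estimate_num_partitions(input_path))
```

An incremental run over a small delta then ends up with a small partition count, while the weekly snapshot over the full history gets proportionally more partitions, without changing the executor memory setting.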

@MoezBH it’s possible but not easy; you can find some details in an older post where I suggested a workaround.

Thank you, we might end up implementing something like this.