Curious if there is a particular reason why checkpoints are not set by default at forks, and why parallel execution paths are currently created by default instead? Intuitively one assumes that the computation before the fork is not done twice, but maybe I’m missing some context on why the current behavior would still be more optimal?
This is especially concerning with large pipelines. Concretely, I have a large dataset (433B rows • 112,551 files • 13.8 TB) that I first aggressively filter and inner join down to ~65M rows before my fork. Intuitively, I wouldn’t want to duplicate that work.
In this case, if you were to checkpoint useLlm, Spark would be forced to run it on the full datasetA. However, if you don’t checkpoint, the filters can get pushed down into the datasetA read, which will likely be a lot faster.
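To make that concrete, here is a minimal standalone PySpark sketch of the two behaviors. The path, column names, filter predicate, and the use of localCheckpoint() as a rough stand-in for a checkpoint node are illustrative assumptions, not the actual pipeline:

```python
# Minimal sketch of "checkpoint at the fork" vs. "stay lazy and let filters push down".
# Everything here (path, columns, predicate) is hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

dataset_a = spark.read.parquet("/path/to/datasetA")  # hypothetical source

# No checkpoint: the branch stays lazy, so its filter can be pushed down into the
# Parquet scan of datasetA (look for "PushedFilters" in the physical plan).
lazy_branch = dataset_a.filter(F.col("country") == "DK").select("id", "text")
lazy_branch.explain()

# Checkpoint-like materialization at the fork: localCheckpoint() eagerly computes
# the full datasetA and truncates the lineage, so the branch filter is applied to
# the materialized rows instead of being pushed into the source read.
forked = dataset_a.localCheckpoint()
checked_branch = forked.filter(F.col("country") == "DK").select("id", "text")
checked_branch.explain()
```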
That’s not to say we haven’t thought about it. And I do think we should be warning users & suggesting checkpoints for forks. That’s something we’re tracking.
Thanks @mtelling, that’s a valid point! Having said that, best practice would be to push the filters as far left (upstream) as possible. In that case, wouldn’t it be valid to assume best-practice development, set checkpoints as the default, and then warn users about the potential performance/cost overruns if those best practices aren’t followed?
Given the PB UI/UX, it feels a lot more intuitive to assume that when a fork happens, everything before the fork will be computed once. It would then be on the user to understand this and develop the pipeline with it in mind.
It would also be interesting to see how much extra compute cost has been incurred by people assuming that checkpointing at forks is the default behavior, compared to the current default.
I don’t think it’s always natural for users to push filters down themselves. In this case they would have to apply the filters twice, whereas today they get this for free from Spark.
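To illustrate one reading of “apply the filters twice”: with a checkpoint at the fork, a user who still wants the reduced scan would have to repeat the branch predicates above the checkpoint themselves, whereas the lazy default lets Catalyst push each branch’s filter into the scan on its own. A hedged sketch with made-up predicates:

```python
# Hypothetical sketch only; predicates, columns, and the dataset path are made up.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
dataset_a = spark.read.parquet("/path/to/datasetA")  # hypothetical source

# With a checkpoint at the fork, reducing the scan means manually pushing the
# union of the branch predicates above the checkpoint, then repeating each
# branch's predicate below it.
pre_fork = dataset_a.filter(F.col("country").isin("DK", "SE"))
checkpointed = pre_fork.localCheckpoint()
branch_dk = checkpointed.filter(F.col("country") == "DK")
branch_se = checkpointed.filter(F.col("country") == "SE")

# Without a checkpoint, each branch's filter is written once and Catalyst
# pushes it into the datasetA scan automatically.
branch_dk_lazy = dataset_a.filter(F.col("country") == "DK")
branch_se_lazy = dataset_a.filter(F.col("country") == "SE")
```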
We will start with a warning that makes it clear a checkpoint should be applied, and then see where things stand after that.