How to compute the max latency of a batch pipeline in Foundry?

I have a pipeline made of Spark transforms and other equivalent batch transforms (not streaming).

I’ve set up a schedule on this pipeline to make it run very often, as I want to meet some SLAs.

How should I compute the maximum latency of my pipeline? In other words, how long will my data take to propagate from the start of the pipeline to the end?

Remember when scheduling a pipeline:

If you schedule all transforms together as one build, the worst-case data freshness (or “data latency”) is 2X, where X is the runtime of the whole pipeline.

For example:
Let’s assume a pipeline that takes 10 minutes to run: when the pipeline kicks off at time T_0, fresh data becomes available at the output at T_10.
But new data from the source may arrive at T_0 + epsilon (say, one millisecond after the transform kicked off). This fresh data not only has to wait until the current pipeline run completes (at T_10), but also for its own pipeline run to complete (at T_20) before it is available to the consumer: 2 × 10 = 20 minutes in the worst case.
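The all-together worst case can be sketched in a few lines of Python. This is an illustrative helper, not a Foundry API; the runtimes are the ones from the example above.

```python
def worst_case_all_together(runtimes_minutes):
    """Worst-case data freshness when all transforms are scheduled as
    one build: data landing just after a run starts must wait for the
    current run to finish, then for its own run, i.e. 2x the total
    pipeline runtime."""
    total = sum(runtimes_minutes)
    return 2 * total

# The 10-minute pipeline from the example:
print(worst_case_all_together([10]))  # -> 20
```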

If you schedule your transforms like a “conveyor belt”, where each job kicks off as soon as one of its upstream datasets has fresh data (like instruction pipelining, for those who know it), the worst-case pipeline latency is only X + W, where X is the sum of the runtimes of all jobs and W is the runtime of the slowest job in your pipeline.

For example: you have a pipeline of 3 datasets: A (5m runtime) > B (15m runtime) > C (10m runtime). At best, the latency is simply the sum: 5 + 15 + 10 = 30m.
In a bad case, let’s assume some data lands at T_0 + epsilon, just after a build of A kicked off, as in the first scenario. It then waits 5 + 5 + 15 + 10 = 35m, because it has to wait for the “current build of A” plus the “next build of A that will contain this data”, followed by B and C.
In the worst case, the data propagates through A but its fresh output lands just after the slowest job (noted W, dataset B in our case) has kicked off a run on older data. It then also waits one extra run of B: 5 + 15 + 15 + 10 = 45m, which is exactly X + W.
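The conveyor-belt worst case from the A > B > C example can be sketched the same way. Again a hypothetical helper, with the runtimes taken from the example:

```python
def worst_case_conveyor_belt(runtimes_minutes):
    """Worst-case latency when each job kicks off as soon as its input
    is fresh: one full pass through the pipeline (X, the sum of all
    runtimes) plus one extra run of the slowest job (W), whose current
    run the data may just miss."""
    x = sum(runtimes_minutes)
    w = max(runtimes_minutes)
    return x + w

# A (5m) > B (15m) > C (10m), as above:
print(worst_case_conveyor_belt([5, 15, 10]))  # -> 45
```

Note that for a single-job pipeline this reduces to 2X, matching the all-together case.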

Important note: if you need much lower latency, consider Streaming pipelines! You can achieve O(seconds) latency!