What is the difference between Job Grouping and Scheduling?

In Pipeline Builder, jobs can be grouped together. How is that different from scheduling?

What is the impact of each? When should you use one vs. the other?

Job grouping is about managing how outputs are processed together (you can configure different compute profiles for different job groups), while scheduling focuses on the timing and conditions under which data transformations are executed.

In particular, job grouping allows you to bundle multiple outputs into a single job in batch pipelines, or to split each output into its own job in streaming pipelines. This gives you granular control over how outputs are built and helps manage compute resources effectively.

Scheduling is used to run dataset builds on a recurring basis to keep data up to date. A schedule can be configured to run at specific times, when data or logic has been updated, or on any combination of these conditions.

Basically: choose job grouping for resource management and parallel-processing needs, and scheduling for keeping your data pipelines up to date.


Also, some caveats with job grouping: moving outputs between job groups in streaming or incremental pipelines is considered a breaking change and will trigger mandatory replays.

If you’ve ever used the “Skip computing already processed rows” option on the Use LLM node and you have multiple downstream outputs, you also need to put those downstream output datasets into the same job group to be able to use the cache we maintain under the hood for that feature.

Job grouping is indeed available in Pipeline Builder, but is different from scheduling.

A bit of vocabulary (maybe a bit inexact, but necessary to explain the difference). See the docs here: https://www.palantir.com/docs/foundry/data-integration/builds/

  • Transform: The definition of some processing logic. It is a kind of “function” that takes some inputs (such as datasets) and produces a set of outputs (for example, another dataset).
    For example:
    - Transform 1: some code that takes Dataset A as input, adds a new column, and generates Dataset B as output.
    - Transform 2: some code that takes Dataset B as input, creates a new column with some logic, and generates Dataset C as output.
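
To make that vocabulary concrete, here is a minimal sketch of Transform 1 written with the Python transforms API used in Code Repositories (the dataset paths and column name are hypothetical placeholders):

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


# Transform 1: take Dataset A as input, add a new column,
# generate Dataset B as output. Paths are placeholders.
@transform_df(
    Output("/Project/datasets/dataset_b"),
    source_df=Input("/Project/datasets/dataset_a"),
)
def compute(source_df):
    # The transform's "logic": derive a new column from the input.
    return source_df.withColumn("processed_at", F.current_timestamp())
```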

  • Job: The execution of one transform on compute resources (a specific number of vCPUs, amount of memory, etc.).
    For example:
    - Job 1 will execute Transform 1 on 1 vCPU and 2 GB of memory
    - Job 2 will execute Transform 2 on 3 vCPUs and 4 GB of memory
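
In a Code Repository, the compute resources a job runs on are tuned with Spark profiles on the transform. A minimal sketch, assuming the EXECUTOR_MEMORY_MEDIUM and NUM_EXECUTORS_2 profiles are enabled for the repository (available profile names vary by enrollment, and paths are again placeholders):

```python
from pyspark.sql import functions as F
from transforms.api import configure, transform_df, Input, Output


# Request a specific resource profile for the job that executes this
# transform; the profile names below are assumptions and must be
# enabled for your repository.
@configure(profile=["EXECUTOR_MEMORY_MEDIUM", "NUM_EXECUTORS_2"])
@transform_df(
    Output("/Project/datasets/dataset_c"),
    source_df=Input("/Project/datasets/dataset_b"),
)
def compute(source_df):
    # Transform 2: create a new column with some logic.
    return source_df.withColumn("is_valid", F.lit(True))
```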

Jobs can be grouped in Pipeline Builder, which is equivalent to multi-output transforms in Code Repositories.
Note: grouped jobs (in Pipeline Builder) or multi-output transforms (in Code Repositories) will always update together, and it is not possible to build only some of the output datasets without also building the others.

For example: if Jobs 1 and 2 are grouped together, then when building Dataset B, Dataset C will be built along with it, as if the logic were “take Dataset A and produce Datasets B and C as outputs”.
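
In Code Repository terms, that grouped behavior is a multi-output transform: one piece of logic writes both Dataset B and Dataset C, so the two outputs can only ever build together. A sketch using the same hypothetical paths as above:

```python
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output


# One job, two outputs: Datasets B and C always update together,
# just like outputs grouped into one job in Pipeline Builder.
@transform(
    output_b=Output("/Project/datasets/dataset_b"),
    output_c=Output("/Project/datasets/dataset_c"),
    source=Input("/Project/datasets/dataset_a"),
)
def compute(source, output_b, output_c):
    df = source.dataframe()

    # Dataset B: Dataset A plus a new column.
    df_b = df.withColumn("processed_at", F.current_timestamp())
    output_b.write_dataframe(df_b)

    # Dataset C: further logic applied on top of B.
    df_c = df_b.withColumn("is_valid", F.lit(True))
    output_c.write_dataframe(df_c)
```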

  • Build: A build will execute one or multiple jobs together and organize their execution.
    For example: if you build Job 1 and Job 2 together, the Build will make sure that Job 1 triggers first, and once completed, Job 2 will be executed, as Job 2 uses the output of Job 1.
    If there were no dependency between them, both jobs would be executed in parallel.
    If the two jobs were grouped, they would be executed as one job.

And so, to answer the original question:

Job Grouping:
Job grouping bundles multiple outputs into a single job. This is relevant if the datasets share logic, should share resources, or should always update together (e.g., a data dataset and its metadata dataset).

Scheduling:
Scheduling is used to automate the execution of builds (that is, the execution of one or more jobs) at specified times or intervals.
