Snapshot input but still be able to use 'added'? Or always be able to read output's 'previous' view?

npateel · October 9, 2024, 3:25pm

Hey all – is there any way to do the following in an incremental transform:

Always have incremental outputs (we want to stop snapshots even if input datasets are snapshotted)
Have the input dataset /NOT/ be a snapshot input (we need to get the 'added' files, or just the files in the last transaction)
Always have a view of the previous output dataset (so calling output.dataframe('previous'))

So what we want is:
input.dataframe('added') to always be non-empty, returning either the added files or the entire dataframe during snapshots
output.dataframe('previous') to always be non-empty, even when the transform can't run incrementally. The docs say: "To read data from the previous output the transform must run in incremental mode (ctx.is_incremental is True), otherwise the dataframe will be empty."

evictor · October 10, 2024, 5:39am

Hi @npateel !

Always have incremental outputs (we want to stop snapshots even if input datasets are snapshotted)

The @incremental decorator has a parameter called require_incremental. If you set it to True, your build will fail instead of falling back to a snapshot, and it will keep failing until you fix whatever is breaking incrementality. This is usually coupled with a health check on the dataset to notify you if/when it goes wrong.

Have the input dataset /NOT/ be a snapshot input (we need to get the 'added' files, or just the files in the last transaction)

If you don't list a given input in the snapshot_inputs parameter of the @incremental decorator, and that input is itself incremental, your transform will only process the added rows/files for that input. If the input is a snapshot, you could try a kind of "merge and append": set one variable to the input's .dataframe('previous') and another to its .dataframe('current'), then do a subtract, or a left_anti join on a relevant column, to get the differences between builds. That said, given it's a snapshot input, this would need to be experimented with and validated.
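To make the "subtract / left_anti" idea concrete, here is a minimal sketch in plain Python that runs outside Foundry. It is not a Foundry transform: the rows are ordinary dicts standing in for the 'current' and 'previous' views, and the helper name left_anti and the key column "id" are made up for illustration. In a real transform the two views would be Spark DataFrames, and the equivalent PySpark call would be current.join(previous, on="id", how="left_anti").

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def left_anti(current: Iterable[Row], previous: Iterable[Row], key: str) -> List[Row]:
    """Return the rows of `current` whose `key` value does not appear in `previous`.

    Hypothetical helper: mirrors what a left_anti join on `key` does in Spark.
    """
    # Collect every key already present in the previous view.
    seen = {row[key] for row in previous}
    # Keep only the rows that are new relative to the previous view.
    return [row for row in current if row[key] not in seen]


# Toy stand-ins for input.dataframe('previous') and input.dataframe('current').
previous_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
current_rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

new_rows = left_anti(current_rows, previous_rows, key="id")
print(new_rows)  # [{'id': 3, 'name': 'c'}]
```

One thing to keep in mind: an anti join on a key column only finds rows with new keys. If existing rows can change in place, a whole-row subtract (current.subtract(previous) in PySpark) would catch those too, at the cost of comparing every column.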

Always have a view of the previous output dataset (so calling output.dataframe('previous'))

The .dataframe('previous') call works on inputs as well as outputs. Here is a link to the relevant documentation:

https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/#incrementaltransforminput