Incremental Transform When Adding Columns

CC-kakita · September 30, 2024, 6:51am

Because records in DataSet A are overwritten over time, each month we append DataSet A's records as of that point in time to a historical DataSet B.
We would like to modify the implementation to handle columns being added to DataSet A, with reference to the following:
merge-and-replace-with-schema-change

I implemented this based on the example above and verified it as follows.
(1) Create input DataSet A, import a CSV, and submit the records.
(2) Run a build to generate the historical DataSet B.
(3) Import another CSV into DataSet A and submit the records again (in this case, the CSV was replaced).
(4) Run a build to append the records to the historical DataSet B.
Here, we found that "ctx.is_incremental = False" is set during step (4), and an error occurs at "Output.dataframe('current')".

Why do I get "ctx.is_incremental = False" in step (4), given that step (2) should already have written the previous historical DataSet B?
I would appreciate it if you could tell me.

VincentF · September 30, 2024, 11:16am

In your build's details, under Spark Details > Incremental/Snapshot tab, you should find the reason why the build ran incrementally or not.

This will give you the details of why the incrementality was respected, or not.

Given your explanation, it seems that you replaced the CSV of your input dataset.
Because the previously processed data is no longer there, Foundry detects (as it is an “update” transaction or a “snapshot” transaction) that it needs to propagate the new data “in place” of the old data, and hence propagates a snapshot through the pipeline.
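
The rule described above can be sketched as a small toy model. This is plain Python, not the Foundry API: ToyInput, can_run_incrementally and the transaction-type strings are hypothetical names made up for illustration, and the rule is simplified to "only append-style changes on every input allow an incremental run". The real checks are more detailed; the Incremental/Snapshot tab shows what actually happened.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass
class ToyInput:
    """Hypothetical stand-in for an input dataset (not a Foundry class)."""
    name: str
    # Transaction types committed to this input since the last successful build.
    new_transactions: List[str] = field(default_factory=list)


def can_run_incrementally(
    inputs: Iterable[ToyInput],
    snapshot_inputs: FrozenSet[str] = frozenset(),
) -> bool:
    """Simplified assumption: an input blocks incrementality if it received
    anything other than APPEND transactions, unless it is listed as an
    expected-to-snapshot input."""
    for inp in inputs:
        if inp.name in snapshot_inputs:
            continue  # changes on this input are ignored for the decision
        if any(t != "APPEND" for t in inp.new_transactions):
            return False
    return True


# Step (3) in the question: the CSV was replaced, modelled here as a SNAPSHOT.
csv_input = ToyInput("my_csv_input_dataset", ["SNAPSHOT"])

print(can_run_incrementally([csv_input]))  # False
print(can_run_incrementally([csv_input], frozenset({"my_csv_input_dataset"})))  # True
```

In this model the replaced CSV is what flips the build to non-incremental, which matches the symptom in step (4).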

In order to override this behavior, you need to specify in your incremental transform that the input is “expected to snapshot” like so:

from transforms.api import transform, incremental, Input

@incremental(snapshot_inputs=["my_csv_input_dataset"])
@transform(
    my_csv_input_dataset=Input("/path/to/csv_dataset"),
    ...
)...

Foundry will then ignore snapshots or updates on this input dataset, and what you get depends only on how you read your data:

    students_df = students.dataframe('added') # Will read only the new rows if any
    students_df = students.dataframe('current') # Will read the whole dataset
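
To make the difference between the two read modes concrete, here is a toy model in plain Python. ToyIncrementalInput is a hypothetical class, not part of transforms.api, and it returns lists of dicts instead of Spark DataFrames. It only mirrors the two modes as described in the comments above; the actual behavior for inputs listed in snapshot_inputs should be checked against the reference linked below.

```python
class ToyIncrementalInput:
    """Hypothetical stand-in for an incremental input (not a Foundry class)."""

    def __init__(self, previous_rows, added_rows):
        # Rows that were already present at the last build.
        self._previous = list(previous_rows)
        # Rows that arrived since the last build.
        self._added = list(added_rows)

    def dataframe(self, mode="added"):
        if mode == "added":
            # Only the new rows, if any.
            return list(self._added)
        if mode == "current":
            # The whole dataset: old rows followed by new rows.
            return self._previous + self._added
        raise ValueError(f"unsupported mode in this toy model: {mode!r}")


students = ToyIncrementalInput(
    previous_rows=[{"id": 1, "name": "A"}],
    added_rows=[{"id": 2, "name": "B"}],
)

print(students.dataframe("added"))    # [{'id': 2, 'name': 'B'}]
print(students.dataframe("current"))  # [{'id': 1, 'name': 'A'}, {'id': 2, 'name': 'B'}]
```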

See https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/