Because records in DataSet A get overwritten, each month we append the records in DataSet A at that point in time to a historical DataSet B.
We would like to modify the implementation to handle columns being added to DataSet A, with reference to the following:
merge-and-replace-with-schema-change
I implemented and tested the operation based on the example above, as follows.
(1) Create Input DataSet A, import CSV and submit records.
(2) Execute build and generate history DataSet B.
(3) Import another CSV into DataSet A and submit the records again (in this case, the CSV was replaced).
(4) Execute build and add records to the historical DataSet B.
Here, we found that "ctx.is_incremental = False" is set during step (4), and an error occurs in "Output.dataframe('current')".
Why do I get "ctx.is_incremental = False" when (2) should have output the previous historical DataSet B?
I would appreciate it if you could tell me.
In the build details of your build, under Spark Details > Incremental/Snapshot, you should find the reason the build ran as incremental or not.
This will give you the details of why the incrementality was respected, or not.
Given your explanation, it seems that you replaced the CSV of your input dataset.
Because the previously processed data is no longer there, Foundry detects (since this is an "update" or "snapshot" transaction) that it needs to propagate the new data "in place" of the old data, and hence propagates a snapshot through the pipeline.
To override this behavior, specify in your incremental transform that the input is "expected to snapshot", like so:
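from transforms.api import incremental, transform, Input

# Inputs listed in snapshot_inputs do not break incrementality when they snapshot
@incremental(snapshot_inputs=["my_csv_input_dataset"])
@transform(
    my_csv_input_dataset=Input("/path/to/csv_dataset"),
    ...
)...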
Foundry then will ignore snapshots or updates on this input dataset and will only rely on how you read your data:
students_df = students.dataframe('added') # Will read only the new rows if any
students_df = students.dataframe('current') # Will read the whole dataset
See https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/
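To make the reasoning above concrete, here is a rough toy model in plain Python of the decision as described in this thread. It is only a sketch: `InputState` and `runs_incrementally` are hypothetical names invented for illustration and are not part of the Foundry API. The model also assumes that any non-APPEND transaction on a regular input forces a snapshot, which is likely stricter and simpler than what Foundry actually checks. The Incremental/Snapshot tab of the build remains the authoritative source.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class InputState:
    """Hypothetical stand-in for an input dataset (not a Foundry class)."""
    name: str
    new_transaction_types: List[str]  # e.g. ["APPEND"], ["SNAPSHOT"], ["UPDATE"]


def runs_incrementally(inputs: Iterable[InputState],
                       snapshot_inputs: Iterable[str] = ()) -> bool:
    """Toy approximation: would the build stay incremental?"""
    allowed_to_snapshot = set(snapshot_inputs)
    for inp in inputs:
        if inp.name in allowed_to_snapshot:
            # Declared as "expected to snapshot": its transactions are ignored.
            continue
        if any(t != "APPEND" for t in inp.new_transaction_types):
            # Previously processed data was replaced or modified.
            return False
    return True


# Replacing the CSV shows up as a non-append transaction on the input.
csv_input = InputState("my_csv_input_dataset", ["SNAPSHOT"])

print(runs_incrementally([csv_input]))
print(runs_incrementally([csv_input], snapshot_inputs=["my_csv_input_dataset"]))
```

In this model, the first call corresponds to step (4) in the question, where ctx.is_incremental ends up False, and the second corresponds to the same build once the input is listed in snapshot_inputs.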