I’m trying to add a column to an incremental dataset. Right now, the dataset is configured to append only new rows. When I add the new column (called state_workflow) and deploy the pipeline, I get a warning that the logic is out of date, and the state_workflow column is all null. I’ve tried redeploying, but with no success. Has anyone run into this?
Hi @kirusha, are you trying to snapshot this so that the state_workflow column is populated? If it’s incremental, I would expect that when you add a new column, only the new rows will have that column populated unless you snapshot.
Hey @helenq, what does it mean to snapshot? But yes, I’m trying to get all the current rows populated with the new state_workflow column (as well as any new rows!).
@helenq I realized I’ve been using the wrong terminology: the input dataset in the pipeline is a snapshot, but the output dataset (the one I’ve configured to append only new rows) is the one that keeps giving me the warning that the logic is out of date, despite redeploying.
Oh I see, so if you change the write mode to snapshot, you’ll lose the historic data? Because the input dataset only has the latest X rows.
With the way we have it currently set up, the (snapshot) input dataset gets updated once a day. Some of the rows may change; some stay the same. In a transform, we concat the unique ID of each row with the current date and make that the new ID for each row. The output dataset uses this concatenated ID as the PK for the append-only-new-rows write mode.
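For reference, here’s a minimal PySpark sketch of that ID logic, independent of the pipeline framework. The column names `id` and `pk` are placeholders for whatever your schema actually uses:

```python
from pyspark.sql import functions as F

def add_dated_pk(df):
    # Build a date-suffixed primary key so each day's version of a row
    # is treated as a brand-new row by the append-only write mode.
    # `id` is a placeholder for your actual unique-ID column.
    return df.withColumn(
        "pk",
        F.concat_ws("_", F.col("id"), F.current_date().cast("string")),
    )
```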
I want to add the state_workflow column to the output dataset for all existing rows, as well as all new rows. I’m realizing now that I don’t think I can modify previously created rows. Though I just checked, and the rows added to the output dataset today do have state_workflow populated.
Sorry for the delay here, but yes, that’s expected behavior: only the new rows will get this new field populated. Is this field calculated by doing some computation on already existing rows, or is it a new column that is now being brought in from the backing dataset?
If you still want to backfill your older rows with this value, and the answer to the above is the former, a workaround we’ve seen is to snapshot/save the historic dataset, calculate the values for it, and then union that dataset with the new rows you get from the incremental pipeline (rough sketch below).
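A rough PySpark sketch of that backfill, assuming state_workflow can be derived from columns that already exist on the historic rows. `compute_state_workflow`, the `status` column, and its values are all hypothetical placeholders; `historic_df` is the saved snapshot of your output dataset and `new_rows_df` is what the incremental pipeline produces going forward:

```python
from pyspark.sql import functions as F

def compute_state_workflow(df):
    # Hypothetical placeholder: derive state_workflow from columns
    # that already exist on the historic rows.
    return df.withColumn(
        "state_workflow",
        F.when(F.col("status") == "closed", F.lit("done")).otherwise(F.lit("open")),
    )

# 1. One-off backfill: calculate state_workflow for every existing row
#    in the saved snapshot of the historic output.
historic_backfilled = compute_state_workflow(historic_df)

# 2. Union the backfilled history with the new incremental rows
#    (unionByName matches columns by name, so schemas must align).
combined = historic_backfilled.unionByName(new_rows_df)
```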
Ah great, thanks very much! That makes a lot of sense, I’ll try that:)