I want to be able to view the history of a snapshot dataset that I get from a source system. The column I am interested in is the status column. Their is no status changed timestamp available in the source system.
The 2 approaches that I see are
Creating a changelog dataset that shows the pk, the status, and an artificial timestamp that shows when the status changed. This would essentially append the rows only when they change.
Appending the entire dataset on a daily basis to an output dataset. The pro of this would be I could easily tell all of the status’s on any given day. the con would be the scale, but the original dataset is only like 4000 rows.
Hey, creating an output dataset with a changelog write mode seems like it would be suitable for your use case. Let if us know if you have additional questions/issues!
The problem is, how can I add a timestamp to that output? If I add a timestamp column then the changelog will essentially just create a new row every time it builds, regardless if the status changed or not.
I see, you want to preserve the timestamp but only track changes to the status.
If you can guarantee that the status only changes one-way (i.e., for a given PK, you can’t go from status1 → status2 → status1), then you could use the “Append only new rows” write mode. The idea would be to set the primary keys as your regular PK + status columns. This way, if both the PK and status match, no rows are appended – however, if the status changes for the same PK, a row is appended to the output dataset.
The only problem is if a row reverts to a previous status as mentioned above, as that status will already exist in the output dataset and therefore no rows will be appended (even if it has indeed “changed” since the previous build).
If you can’t make any guarantees regarding reverting to a previous status, you would have to follow option 2 that you mentioned above (although it’s not the best, 4000 rows doesn’t seem too bad).