Adding new columns to incremental dataset schema

paulm · November 12, 2024, 9:54pm

Hello,

Is it possible to add new columns to a dataset that is being generated incrementally, without changing the semantic version and re-writing the existing dataset history? I would like to add new columns that are included in an incremental dataset moving forward, but it is OK if they are not present in previous, existing transactions.

Could this be accomplished by manually editing the schema of the dataset, or would that cause other issues?

Thanks!

joe · November 12, 2024, 10:04pm

From a Python transform, you can append a transaction with extra columns without breaking incrementality. If you later remove those columns, you will need to re-snapshot.

I’m not sure how Pipeline Builder handles it.

paulm · November 12, 2024, 10:29pm

Is there anything I should be checking here? I am getting the following error when trying to build: "The provided schema doesn't match the actual schema of a previous transaction."

The error message shows both the provided schema and the previous schema. I made sure that the existing columns match: not just their names, but also their data types and nullability. That seems to indicate that the error is caused by the new columns I am adding.
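For anyone doing the same comparison by hand, here is a rough sketch of how the two schemas from the error message could be diffed. It is plain Python, not a Foundry API: the `Column` tuple and `diff_schemas` helper are hypothetical names made up for illustration, and the column names and types in the example are invented.

```python
from typing import NamedTuple


class Column(NamedTuple):
    # Hypothetical stand-in for one field of a dataset schema.
    name: str
    dtype: str
    nullable: bool


def diff_schemas(previous, provided):
    """Compare two schemas by column name, data type, and nullability."""
    prev_by_name = {c.name: c for c in previous}
    prov_by_name = {c.name: c for c in provided}
    return {
        # Columns present only in the provided (new) schema.
        "added": [c.name for c in provided if c.name not in prev_by_name],
        # Columns present only in the previous schema.
        "removed": [c.name for c in previous if c.name not in prov_by_name],
        # Columns in both schemas whose type or nullability differs.
        "changed": [
            c.name
            for c in previous
            if c.name in prov_by_name and prov_by_name[c.name] != c
        ],
    }


# Invented example: one nullable column is added, nothing else differs.
previous_schema = [Column("id", "long", False), Column("name", "string", True)]
provided_schema = previous_schema + [Column("score", "double", True)]
example_diff = diff_schemas(previous_schema, provided_schema)
```

If `removed` and `changed` both come back empty, the only difference is the set of added columns, which matches the situation described above.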

paulm · November 12, 2024, 10:49pm

Looks like I got the transform to run. The solution in my case was to manually edit the schema of the incremental dataset to add the new columns I wanted. Worth noting: the new columns are nullable, which is what makes this work. I will keep an eye on the downstream pipeline that consumes this dataset to make sure it’s still happy.
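A toy illustration of why nullability matters here, as I understand it: rows written by older transactions have no value for the new columns, so a reader using the updated schema has to fill them with nulls. This is a plain Python model, not how Foundry actually reads files; `read_with_schema` and the sample rows are hypothetical.

```python
def read_with_schema(transactions, schema_columns):
    """Read rows from all transactions, projecting each onto schema_columns.

    Columns missing from an older row come back as None (null).
    """
    rows = []
    for txn in transactions:
        for row in txn:
            rows.append({col: row.get(col) for col in schema_columns})
    return rows


# Invented data: the first transaction predates the new "score" column.
old_txn = [{"id": 1, "name": "a"}]
new_txn = [{"id": 2, "name": "b", "score": 0.5}]

all_rows = read_with_schema([old_txn, new_txn], ["id", "name", "score"])
```

In this model the row from `old_txn` ends up with `score` set to `None`, which is only valid if the column is declared nullable.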