Add columns incrementally without recomputing previous outputs

ALAIZA · July 3, 2024, 8:03am

I'm currently working on an incremental transform that reads incrementally from an input and appends to the output. The data is huge, so recomputing or rewriting it as a “snapshot”/“replace” is not desirable. My question is:

Is there any way to add a column to the previous output just by modifying the schema (accepting that all existing values in it will be null) and then start populating the new column incrementally? (Supposedly this won't fail, since the schemas will match.)

Example:

Iteration 1:

Input: 10 columns
Transform selects 5 columns
Output: append mode, writes 5 columns

Iteration 2:

Input: 10 columns
Transform selects 6 columns
<somehow I make the previous output have those 6 columns, with the new one holding null values>
Output: append mode, writes 6 columns

The idea is to avoid reading the previous output, unioning it with the new incoming data, and writing the result back as a replace (unless that is the most efficient way of doing it).

redboyben · July 3, 2024, 8:47am

Hi ALAIZA, adding a column in an incremental transform will automatically fill it with nulls for the rows in existing transactions (no need to edit the schema manually). That being said, it’s not possible to remove columns once they’ve been added - you’d get a schema error.
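
To make the behaviour concrete, here is a small plain-Python sketch of the idea. It is only a toy model, not Foundry's implementation or API: the class `AppendOnlyOutput`, its methods, and the column names are all hypothetical. It assumes each append carries its own column list, that reads fill columns missing from older batches with nulls, and that dropping a column is rejected.

```python
from __future__ import annotations

from typing import Any


class AppendOnlyOutput:
    """Toy model of an append-only dataset (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.columns: list[str] = []                     # current output schema
        self.batches: list[list[dict[str, Any]]] = []    # one entry per appended transaction

    def append(self, rows: list[dict[str, Any]], columns: list[str]) -> None:
        # Removing a column that already exists is treated as a schema error.
        missing = [c for c in self.columns if c not in columns]
        if missing:
            raise ValueError(f"schema error: cannot drop columns {missing}")
        # New columns are added to the schema; old batches are left untouched.
        self.columns += [c for c in columns if c not in self.columns]
        self.batches.append(rows)

    def read(self) -> list[dict[str, Any]]:
        # Rows from older batches get None for columns they never had.
        return [
            {c: row.get(c) for c in self.columns}
            for batch in self.batches
            for row in batch
        ]


out = AppendOnlyOutput()
# Iteration 1: two columns written.
out.append([{"id": 1, "amount": 10.0}], ["id", "amount"])
# Iteration 2: a third column appears; nothing already written is recomputed.
out.append([{"id": 2, "amount": 5.0, "currency": "EUR"}], ["id", "amount", "currency"])
rows = out.read()
```

Reading back after the second append gives the first row a null `currency` while the second keeps its value, and an append that omits `currency` afterwards raises the error.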