How can I see the impact of changing a column name?

jkane · June 17, 2024, 8:50pm

I want to rename a column in my pipeline, but I am worried about how the change will affect everything downstream of where I make it.

How can I learn about how my rename will impact my pipeline?

jkane · June 17, 2024, 8:52pm

The data lineage application is useful for this.

If you select multiple nodes in the data lineage app, the second item in this menu changes to a histogram item.

You can then look at the “frequent columns” section and click on each row. Doing so highlights the nodes in the main graph that have that column. This makes it very easy to see the impact of changing a column name, as long as the column doesn’t get renamed along the way (otherwise you have to repeat the process with the new names).

Here’s an example of how the graph changes when you select a column:

It’s not foolproof though, since some columns might still reference the old column name. The only foolproof way to check is to build the pipeline on a branch and see where errors occur.
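As a complement to the lineage view, a plain text search over the transform code can surface lingering references to the old name before you build on a branch. The following is only a rough sketch: the function name `find_column_references`, the idea that your pipeline code is checked out locally, and the `.py`/`.sql` file suffixes are all assumptions for illustration, not anything provided by the platform.

```python
import re
from pathlib import Path


def find_column_references(repo_root, column_name, suffixes=(".py", ".sql")):
    """Return (path, line_number, line) for every line that mentions
    column_name as a whole word in files under repo_root.

    Hypothetical helper: repo_root is assumed to be a local checkout
    of the pipeline code.
    """
    # \b keeps "customer_id" from matching inside "old_customer_id".
    pattern = re.compile(r"\b" + re.escape(column_name) + r"\b")
    hits = []
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

A text search like this will miss column names that are built dynamically, so it narrows down where to look rather than replacing the branch build.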

cdesouza · June 18, 2024, 1:37pm

Another feature I find useful when making major schema changes like column renames, or even just logic changes to existing columns, is the Compare tab under the Dataset Preview UI. It makes it much easier to do a side-by-side comparison of different branches and catch significant changes, not just to the schema but also to the column statistics once those are calculated.

Like you said, building your changes on a branch is the recommended way to test them before merging back into the main branch and effecting changes to your production pipelines. You’ll also want to add more in-depth checks, like those using the Compare tab, to catch unforeseen downstream changes that do not produce errors when you run the pipeline on your branch. As an example, no error is thrown if you try to rename a column that doesn’t exist in your input. So if you change the name of a column that is itself renamed somewhere downstream, that downstream rename silently does nothing, and the downstream datasets will no longer contain the column under the name they expect.
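To make the silent failure concrete, here is a minimal sketch in plain Python that models a schema as a list of column names. It assumes the downstream rename behaves like PySpark's `DataFrame.withColumnRenamed`, which is a no-op when the named column is absent. The helpers `rename_column` and `missing_rename_sources` and the column names are made up for illustration.

```python
def rename_column(columns, existing, new):
    """Mimic a rename that silently does nothing when `existing` is absent."""
    return [new if name == existing else name for name in columns]


def missing_rename_sources(columns, renames):
    """Return the rename sources in `renames` that are not in `columns`."""
    present = set(columns)
    return [existing for existing in renames if existing not in present]


# Hypothetical downstream step: it renames "cust_name" to "name".
downstream_renames = {"cust_name": "name"}

# Before the upstream change, the downstream rename works as intended.
before = rename_column(["id", "cust_name"], "cust_name", "name")

# After the upstream column becomes "customer_name", the same downstream
# rename finds nothing to rename and raises no error.
upstream_after = rename_column(["id", "cust_name"], "cust_name", "customer_name")
after = rename_column(upstream_after, "cust_name", "name")

print(before)  # ['id', 'name']
print(after)   # ['id', 'customer_name']  ("name" is gone downstream)

# An explicit guard turns the silent no-op into something you can act on.
print(missing_rename_sources(upstream_after, downstream_renames))  # ['cust_name']
```

A guard like this, run against the input schema on your branch, flags the problem at the step where the rename stops applying instead of further downstream where the column turns out to be missing.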