SCD in Code repository

Srinivas · April 28, 2025, 3:40pm

Hello all,

I am new to this community. I have a requirement to create an SCD for a few datasets. I am trying to create an empty final dataset so that I can read it from a pipeline and calculate the delta against the new records from another dataset. Can anyone point me to how I can create an empty dataset and read it from a code repository? I tried creating a file with just column names and got an error. Then I tried reading from a dataset with no files and got a different error. Is there a way around this?

Here is an example of a slowly changing dimension. A slowly changing dimension keeps the history of each record, so the version that was valid on the date of a fact record can be picked.

An example would be
ID,Name,PostCode
1,Srinivas,London
2,Shankar,Chennai

This will be loaded as
ID,Name,PostCode,valid_from,valid_to,created_datetime
1,Srinivas,London,29/04/2025,31/12/9999,29/04/2025 00:00:00
2,Shankar,Chennai,29/04/2025,31/12/9999,29/04/2025 00:00:00

If one of the records is updated, for example:
ID,Name,PostCode
1,Srinivas,Birmingham
2,Shankar,Chennai

This will be loaded as
ID,Name,PostCode,valid_from,valid_to,created_datetime
1,Srinivas,London,29/04/2025,03/05/2025,29/04/2025 00:00:00
1,Srinivas,Birmingham,03/05/2025,31/12/9999,03/05/2025 00:00:00
2,Shankar,Chennai,29/04/2025,31/12/9999,29/04/2025 00:00:00

Deletes are handled the same way. Another question: how can I overcome the cyclical dependency? To calculate the delta, I need to read a copy of the final dimension.
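
The load described above can be sketched in plain Python. This is only an illustration under stated assumptions, not a Foundry transform: `apply_scd2`, `OPEN_END` and `TRACKED` are hypothetical names, rows are plain dicts, each load is a full snapshot, and a deleted record is assumed to have its current row closed without a new row being added. The previous dimension is passed in as an ordinary argument, so how it is read inside a pipeline is left open.

```python
from datetime import date, datetime, time

OPEN_END = date(9999, 12, 31)   # valid_to of the current version of a record
TRACKED = ("Name", "PostCode")  # assumption: a change in these columns creates a new version


def apply_scd2(dim, snapshot, load_date):
    """Apply a full snapshot to the dimension and return the new dimension.

    Hypothetical helper: dim and snapshot are lists of dicts keyed by column name.
    """
    incoming = {row["ID"]: row for row in snapshot}
    created = datetime.combine(load_date, time.min)
    result, still_current = [], set()

    for row in dim:
        new = incoming.get(row["ID"])
        if row["valid_to"] != OPEN_END:
            result.append(dict(row))  # already closed, keep the history
        elif new is not None and all(new[c] == row[c] for c in TRACKED):
            result.append(dict(row))  # unchanged, stays current
            still_current.add(row["ID"])
        else:
            result.append({**row, "valid_to": load_date})  # updated or deleted

    for key, new in incoming.items():
        if key not in still_current:  # brand new record or new version
            result.append({"ID": key, **{c: new[c] for c in TRACKED},
                           "valid_from": load_date, "valid_to": OPEN_END,
                           "created_datetime": created})
    return result


first = [{"ID": 1, "Name": "Srinivas", "PostCode": "London"},
         {"ID": 2, "Name": "Shankar", "PostCode": "Chennai"}]
second = [{"ID": 1, "Name": "Srinivas", "PostCode": "Birmingham"},
          {"ID": 2, "Name": "Shankar", "PostCode": "Chennai"}]

dim = apply_scd2([], first, date(2025, 4, 29))   # initial load into an empty dimension
dim = apply_scd2(dim, second, date(2025, 5, 3))  # ID 1 moves to Birmingham
for row in dim:
    print(row)
```

Starting from an empty list mirrors the empty final dataset: the first load simply inserts every record as a current row.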

green · April 29, 2025, 8:00am

Hello,

Sorry, in this context I don’t understand what is meant by “SCD”, could you perhaps explain the acronym?

To create an empty dataset you could use a lightweight Python transform like this:

import polars as pl
from transforms.api import LightweightOutput, Output, lightweight, transform


@lightweight
@transform(
    out=Output("ri.foundry.main.dataset.xxx"),
)
def compute(out: LightweightOutput):
    # Column names and types for the empty dataset
    schema = {
        "id": pl.Int64,
        "text": pl.String,
        "date": pl.Datetime,
    }

    # No data plus an explicit schema gives a typed DataFrame with zero rows
    df1 = pl.DataFrame({}, schema=schema)

    out.write_table(df1)

That would give you a dataset with a schema, but no rows.