Lightweight Polars Output Dataset Statistics

Hey there,
I enjoy lightweight transforms, especially with Polars.

Our output Parquet files are normally larger than those created by Spark, so we try different compression settings to produce smaller files, for example by calling sink_parquet on path_for_write_table with a different compression option.

from transforms.api import Input, Output, transform, lightweight
import polars as pl

@lightweight()
@transform(
    source=Input("ri.foundry.main.dataset.AAA"),
    out=Output("ri.foundry.main.dataset.BBB"),
)
def polars_lightweight_transform(
    source,
    out
):
    # The parameter names must match the keyword names used in @transform.
    source = source.polars(lazy=True)
    # Do Polars stuff
    source.sink_parquet(
        out.path_for_write_table,
        compression='snappy',
    )

    out.write_table(out.path_for_write_table)

Unfortunately, it seems that Polars in lightweight transforms does not generate dataset statistics, which we would like to use to compare Dataset Previews.

Does anyone have suggestions on how to obtain these statistics?
Thank you!
