Lightweight (Polars) partitioning of large datasets for reading

We are currently trying to migrate many of our processes to Polars. Most of the pipelines are optimised to work on one-day batches, and most of our reports use "date" as the main filter.

It is almost mandatory to keep the report (Contour) reading capability of prefiltering by date, but Polars has no native approach for this (there is no .partitionBy(…) equivalent).

Current status:

  • Dataset of many GB, partitioned by date
  • Contour reads and prefilters by date in the first stage
  • Parquet files

We want to achieve the same with Polars: keep computing just one day incrementally as we do now, but store it in a partitioned way without breaking the current incrementality.
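
A minimal sketch of what we have in mind, assuming the partition column is literally named `date` and a plain Hive-style directory layout (both assumptions made for illustration only):

```python
import polars as pl
from pathlib import Path

# Hypothetical one-day batch; "date" is the partition column mentioned above.
df = pl.DataFrame(
    {
        "date": ["2024-05-01", "2024-05-01"],
        "value": [1, 2],
    }
)

out_root = Path("dataset_root")  # assumed output location

# DataFrame.partition_by splits the frame into one DataFrame per key value;
# each piece is written under a Hive-style date=... folder so date filters
# can still prune whole directories at read time.
for part in df.partition_by("date"):
    date_key = part["date"][0]
    part_dir = out_root / f"date={date_key}"
    part_dir.mkdir(parents=True, exist_ok=True)
    part.write_parquet(part_dir / "part-0.parquet")

# Reading one day back then only touches that day's folder:
one_day = pl.scan_parquet("dataset_root/date=2024-05-01/*.parquet").collect()
```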


Did you try writing to a local temp directory using Polars' native partitioned Parquet write (write_parquet supports partition_by in recent versions), and then iterating over the files and uploading them to the output filesystem?
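
A rough sketch of that idea, assuming a recent Polars version where write_parquet accepts partition_by, and with `upload` as a hypothetical stand-in for whatever API the output filesystem actually exposes:

```python
import tempfile
from pathlib import Path

import polars as pl


def write_partitioned(df: pl.DataFrame, upload) -> None:
    """Write df partitioned by "date" to a local temp directory, then copy
    each Parquet file to the output filesystem. `upload` is a hypothetical
    callable (local_path, remote_path) standing in for the real upload API."""
    with tempfile.TemporaryDirectory() as tmp:
        # Recent Polars versions accept partition_by here and write
        # Hive-style date=... subdirectories under the given path.
        df.write_parquet(tmp, partition_by=["date"])

        # Mirror the partitioned layout on the output filesystem so the
        # existing date-based prefiltering keeps working.
        for local_file in Path(tmp).rglob("*.parquet"):
            remote_path = local_file.relative_to(tmp).as_posix()
            upload(local_file, remote_path)
```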