[Polars] Output limitation: write_table always writing a single file

Hello,

I am leveraging Lightweight Transforms with Polars for its performance.

However, I’m running into a challenge controlling the output file structure for large datasets with the write_table function.

Polars always outputs a single file. For heavy datasets (e.g. 10 GB), this can cause problems for Spark or Contour users.

I wasn’t able to find a way to split the output into several parts.

Any recommendations on that?

I guess it shouldn’t be too hard to set up from a platform point of view, but it seems no maximum-file-size parameter is currently integrated into the write_table function.
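For context, the behavior being asked for can be sketched in plain Python. This is illustrative only, not a Foundry or Polars API: given a row count and a size estimate, a max-file-size parameter would have to compute which row ranges go into each output part.

```python
# Illustrative sketch only -- not a Foundry or Polars API.
# Shows what a "max file size" parameter would have to do: given a
# row count and an estimated bytes-per-row, compute the half-open
# row ranges that would go into each output part.

def plan_parts(total_rows: int, total_bytes: int, max_bytes: int) -> list[tuple[int, int]]:
    """Return (start_row, end_row) ranges, each part at most max_bytes."""
    if total_rows == 0:
        return []
    bytes_per_row = total_bytes / total_rows
    rows_per_part = max(1, int(max_bytes / bytes_per_row))
    return [
        (start, min(start + rows_per_part, total_rows))
        for start in range(0, total_rows, rows_per_part)
    ]

# A 10 GB dataset of 100 million rows, capped at 512 MB per file,
# splits into 20 parts of 5 million rows each.
parts = plan_parts(100_000_000, 10 * 1024**3, 512 * 1024**2)
```

The writer would then stream each planned range into its own Parquet file instead of one monolithic output.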

I think this is a known limitation of the current write_table implementation.

To my knowledge, this should be resolved soon when the DuckDB bindings are released, which should unlock a lower-level API for writing to the output, giving you full control over write options.

I’ll reply again once this is released.


@Salim this has been working for me for the time being:

    # df is a polars LazyFrame (use df.lazy() on an eager DataFrame);
    # sink_parquet streams the result into size-capped parquet parts
    # under the output path.
    df.sink_parquet(
        pl.PartitionMaxSize(
            base_path=output.path_for_write_table,
            max_size=5_000_000,  # maximum number of rows per file
        ),
        compression="zstd",
        mkdir=True,  # create the partition directory if it doesn't exist
    )
    # register the written files with the Foundry output
    output.write_table(output.path_for_write_table)

I think this is a good solution. The files are staged in the container, which means there is a disk-size limit, and performance will be a bit slower since uploading the files takes time. Overall, this shouldn’t make a big difference in job runtime.

Hello @rfisk ,
Thank you very much for the proposed solution. This is exactly what I was looking for!

@nicornk You’re right, it is a little bit slower, but that is acceptable for us!
I will also monitor whether Foundry proposes a more “official” solution to this. I saw that Foundry released an API for DuckDB. Were you referring to that?


It’s worth bearing in mind that because Parquet is splittable, a single large Parquet file is very rarely a problem for Spark consumers. Have you actually seen performance degradation in downstream Spark jobs when trying to read a large Parquet file output from these Polars transforms?
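The splittability point can be quantified with some back-of-the-envelope arithmetic. The numbers below are illustrative; 128 MB is Spark’s default value for spark.sql.files.maxPartitionBytes, the knob that controls how large files are carved into input splits.

```python
# Back-of-the-envelope check: Spark plans one input split per chunk of
# a splittable file (default spark.sql.files.maxPartitionBytes = 128 MB),
# so a single large Parquet file still parallelizes across tasks.

import math

file_bytes = 10 * 1024**3            # one 10 GB Parquet file
max_partition_bytes = 128 * 1024**2  # Spark default split size

splits = math.ceil(file_bytes / max_partition_bytes)
# -> 80 input splits, i.e. up to 80 tasks can read the one file in parallel
```

So for Spark consumers, the single-file output is mostly a non-issue as long as the file contains multiple row groups.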


No, I actually haven’t seen any performance loss. My earlier statement may be incorrect; I took for granted what a developer told me. I haven’t benchmarked it myself yet. I’ll probably do that next week and share an update.

Anyway, thanks for the clarification about the Parquet file!

Hello,

Be aware that PartitionMaxSize no longer seems to be supported in newer versions of Polars; an upgrade in your repo configs may break your pipelines.

Additionally, we discovered that ontology object types backed by polars transformations partitioned in this way take a loooong time to index.

Is the .write_table method going to support first-class partitioning?

Polars is under active development, so pin your Polars version in meta.yaml.
For Polars 1.37+, use PartitionBy.
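For readers unfamiliar with key-based partitioning, here is a plain-Python sketch of the on-disk layout it produces; it is illustrative only (the function and names below are not the Polars API, and I haven’t verified the exact signature the PartitionBy mentioned above takes). Each distinct key value gets its own hive-style directory, and rows are routed to that directory’s part file.

```python
# Plain-Python sketch of what key-based partitioning produces on disk --
# illustrative only, not the Polars API. Each distinct key value gets a
# hive-style directory (key=value) holding that key's rows.

from collections import defaultdict

def plan_by_key(rows: list[dict], key: str, base_path: str) -> dict[str, list[dict]]:
    """Map an output path like base/key=value/0.parquet to the rows it would hold."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[f"{base_path}/{key}={row[key]}/0.parquet"].append(row)
    return dict(groups)

rows = [
    {"country": "FR", "v": 1},
    {"country": "DE", "v": 2},
    {"country": "FR", "v": 3},
]
layout = plan_by_key(rows, "country", "out")
# -> out/country=FR/0.parquet holds 2 rows, out/country=DE/0.parquet holds 1
```

Downstream readers that understand hive partitioning can then prune whole directories when filtering on the partition key.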

Additionally, we discovered that ontology object types backed by polars transformations partitioned in this way take a loooong time to index.

We had similar issues; enforce a sort on your list columns. Unlike PySpark, Polars applies no hidden sort.
