[Polars] Output limitation: write_table always writing a single file

Hello,

I am leveraging Lightweight Transforms with Polars for its performance.

However, I’m running into a challenge controlling the output file structure for large datasets with the write_table function.

Polars always outputs a single file. For heavy datasets (e.g. 10 GB), this can cause problems for Spark or Contour users.

I wasn’t able to find a way to split the output into several parts.

Any recommendations on that?

I guess it shouldn’t be too hard to set up from a platform point of view, but it seems no maximum-file-size parameter is currently integrated into the write_table function.
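For context, the behavior being asked for can be sketched in plain Python. This is illustrative only, not a Foundry or Polars API: given a row count and a size estimate, a max-file-size parameter would have to compute which row ranges go into each output part.

```python
# Illustrative sketch only -- not a Foundry or Polars API.
# Shows what a "max file size" parameter would have to do: given a
# row count and an estimated bytes-per-row, compute the half-open
# row ranges that would go into each output part.

def plan_parts(total_rows: int, total_bytes: int, max_bytes: int) -> list[tuple[int, int]]:
    """Return (start_row, end_row) ranges, each part at most max_bytes."""
    if total_rows == 0:
        return []
    bytes_per_row = total_bytes / total_rows
    rows_per_part = max(1, int(max_bytes / bytes_per_row))
    return [
        (start, min(start + rows_per_part, total_rows))
        for start in range(0, total_rows, rows_per_part)
    ]

# A 10 GB dataset of 100 million rows, capped at 512 MB per file,
# splits into 20 parts of 5 million rows each.
parts = plan_parts(100_000_000, 10 * 1024**3, 512 * 1024**2)
```

The writer would then stream each planned range into its own Parquet file instead of one monolithic output.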

I think this is a known limitation of the current write_table implementation.

To my knowledge, this should be resolved soon when the DuckDB bindings are released, which should unlock a lower-level API for writing to the output, giving you full control over write options.

I’ll reply again once this is released.


@Salim this has been working for me for the time being:

    # df is a polars LazyFrame (use df.lazy() on an eager DataFrame);
    # sink_parquet streams the result into size-capped parquet parts
    # under the output path.
    df.sink_parquet(
        pl.PartitionMaxSize(
            base_path=output.path_for_write_table,
            max_size=5_000_000,  # maximum number of rows per file
        ),
        compression="zstd",
        mkdir=True,  # create the partition directory if it doesn't exist
    )
    # register the written files with the Foundry output
    output.write_table(output.path_for_write_table)

I think this is a good solution. The files are staged in the container, which means there is a disk-size limit, and performance will be a bit slower since uploading the files takes time. Overall, this shouldn’t make a big difference in job runtime.

Hello @rfisk ,
Thank you very much for the proposed solution. This is exactly what I was looking for!

@nicornk You’re right, it is a little bit slower, but that is acceptable for us!
I will also monitor whether Foundry proposes a more “official” solution to this. I saw that Foundry released an API for DuckDB. Were you referring to that?


It’s worth bearing in mind that because Parquet is splittable, a single large Parquet file is very rarely a problem for Spark consumers. Have you actually seen performance degradation in downstream Spark jobs when trying to read a large Parquet file output from these Polars transforms?
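The splittability point can be quantified with some back-of-the-envelope arithmetic. The numbers below are illustrative; 128 MB is Spark’s default value for spark.sql.files.maxPartitionBytes, the knob that controls how large files are carved into input splits.

```python
# Back-of-the-envelope check: Spark plans one input split per chunk of
# a splittable file (default spark.sql.files.maxPartitionBytes = 128 MB),
# so a single large Parquet file still parallelizes across tasks.

import math

file_bytes = 10 * 1024**3            # one 10 GB Parquet file
max_partition_bytes = 128 * 1024**2  # Spark default split size

splits = math.ceil(file_bytes / max_partition_bytes)
# -> 80 input splits, i.e. up to 80 tasks can read the one file in parallel
```

So for Spark consumers, the single-file output is mostly a non-issue as long as the file contains multiple row groups.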


No, I actually haven’t seen any performance loss. My earlier statement may be incorrect; I took for granted what a developer told me. I haven’t benchmarked it myself yet. I’ll probably do that next week and share an update.

Anyway, thanks for the clarification about the Parquet file!

Hello,

Be aware that PartitionMaxSize no longer seems to be supported in newer versions of Polars; an upgrade in your repo configs may break your pipelines.

Additionally, we discovered that ontology object types backed by polars transformations partitioned in this way take a loooong time to index.

Is the .write_table method going to support first-class partitioning?

Polars is under active development, so pin your Polars version in meta.yaml.
For Polars 1.37+, use PartitionBy.
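For readers unfamiliar with key-based partitioning, here is a plain-Python sketch of the on-disk layout it produces; it is illustrative only (the function and names below are not the Polars API, and I haven’t verified the exact signature the PartitionBy mentioned above takes). Each distinct key value gets its own hive-style directory, and rows are routed to that directory’s part file.

```python
# Plain-Python sketch of what key-based partitioning produces on disk --
# illustrative only, not the Polars API. Each distinct key value gets a
# hive-style directory (key=value) holding that key's rows.

from collections import defaultdict

def plan_by_key(rows: list[dict], key: str, base_path: str) -> dict[str, list[dict]]:
    """Map an output path like base/key=value/0.parquet to the rows it would hold."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[f"{base_path}/{key}={row[key]}/0.parquet"].append(row)
    return dict(groups)

rows = [
    {"country": "FR", "v": 1},
    {"country": "DE", "v": 2},
    {"country": "FR", "v": 3},
]
layout = plan_by_key(rows, "country", "out")
# -> out/country=FR/0.parquet holds 2 rows, out/country=DE/0.parquet holds 1
```

Downstream readers that understand hive partitioning can then prune whole directories when filtering on the partition key.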

Additionally, we discovered that ontology object types backed by polars transformations partitioned in this way take a loooong time to index.

We had similar issues; enforce a sort on your list columns. Unlike PySpark, Polars applies no hidden sort.
