[Polars] Output limitation: write_table always writing a single file

Hello,

I am using Lightweight Transforms with Polars for the performance benefits.

However, I’m running into a challenge controlling the output file structure for large datasets with the write_table function.

Polars always outputs a single file. For heavy datasets (e.g. 10 GB), this can cause problems for Spark or Contour users.
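
For context, the transform is shaped roughly like this (paths are placeholders and the body is simplified; I’m writing the Lightweight API parts from memory, so treat it as a sketch):

    import polars as pl
    from transforms.api import transform, Input, Output, lightweight

    @lightweight()
    @transform(
        output=Output("/Project/example/large_output"),  # placeholder path
        source=Input("/Project/example/large_input"),    # placeholder path
    )
    def compute(output, source):
        df = source.polars(lazy=True)        # LazyFrame over the input dataset
        df = df.filter(pl.col("value") > 0)  # stand-in for the real logic
        # write_table materialises everything into a single Parquet file
        output.write_table(df.collect())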

I wasn’t able to find a way to split the output into several parts.

Any recommendations on that?

I guess it shouldn’t be too hard to set up from a platform point of view, but it seems there is currently no maximum file size parameter in the write_table function.

I think this is a known limitation of the current write_table implementation.

To my knowledge, this should be resolved soon when the DuckDB bindings are released; they should unlock a lower-level API for writing to the output, giving you full control over write options.

I’ll reply again once this is released.


@Salim this has been working for me for the time being:

        # Stream the result to multiple Parquet files under the staging path;
        # max_size is the maximum number of rows per generated file.
        df.sink_parquet(
            pl.PartitionMaxSize(
                base_path=output.path_for_write_table,
                max_size=5_000_000,
            ),
            compression="zstd",
            mkdir=True,
        )
        # Commit the staged files as the output dataset.
        output.write_table(output.path_for_write_table)
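
One small note in case anyone copies this with an eager DataFrame: sink_parquet is a LazyFrame method, so as far as I know you would need df.lazy().sink_parquet(...) in that case.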

I think this is a good solution. Keep in mind that the files are staged on the container’s local disk, so there is a disk size limit, and performance will be a little slower since uploading the files takes time. Overall, this shouldn’t make a big difference in job runtime.

Hello @rfisk,
Thank you very much for the proposed solution. This is exactly what I was looking for 💙

@nicornk You’re right, it is a little bit slower, but that’s acceptable for us!
I will also keep an eye out in case Foundry proposes a more “official” solution to this. I saw that Foundry released an API for DuckDB. Were you referring to that?


It’s worth bearing in mind that because Parquet is splittable, a single large Parquet file is very rarely a problem for Spark consumers. Have you actually seen performance degradation in downstream Spark jobs when trying to read a large Parquet file output from these Polars transforms?
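
One quick way to check is to look at the row-group layout of that single file, since Spark’s Parquet reader can only split a file at row-group boundaries. A rough sketch with PyArrow (the path is just illustrative):

    import pyarrow.parquet as pq

    # Inspect the single large Parquet file produced by write_table (illustrative path)
    meta = pq.ParquetFile("large_output.parquet").metadata
    print(f"rows: {meta.num_rows}, row groups: {meta.num_row_groups}")

    # Each row group is an independent unit of work for a Spark Parquet read,
    # so one file with many row groups can still be read in parallel.
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")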


No, I actually haven’t seen any performance loss. My earlier statement may be incorrect; I took for granted what a developer told me. I haven’t benchmarked it myself yet. I’ll probably do that next week and share an update.

Anyway, thanks for the clarification about Parquet files!