I am leveraging Lightweight Transforms with Polars for its performance.
However, I’m running into a challenge when it comes to controlling the output file structure for large datasets with the function write_table
Polars always output a single file. When it comes to heavy dataset (eg 10GB), it can cause problem for spark or contour users.
I wasn’t able to find a way to split the output in several parts.
Any recommendations on that ?
I guess it shouldn’t be too hard to setup from a platform point of view, but it seems no current max size of file parameter is integrated in the write_table function.
I think this is a known limitation of the current write_table implementation.
To my knowledge this should be resolved soon when the duckdb bindings are released which should unlock a lower level API to write to the output where you would have full control over write options.
I think this is a good solution. The files are staged in the container which means there is a disk size limit and the performance will be a little bit slower since the upload of the files takes time. Overall, this shouldn’t make a big difference in job runtime.
Hello @rfisk ,
Thank you very much for the proposed solution. This is exactly what I was looking for
@nicornk You’re right, it is a little bit slower, but this is acceptable for us !
I will also monitor if Foundry proposed a more “official” solution to that. I saw that Foundry released an API for DuckDB. Were you referring to that?
It’s worth bearing in mind that because Parquet is splittable, a single large Parquet file is very rarely a problem for Spark consumers. Have you actually seen performance degradation in downstream Spark jobs when trying to read a large Parquet file output from these Polars transforms?
No, I actually haven’t seen any performance loss. My earlier statement may be incorrect ; I took for granted what a developer told me. I haven’t benchmarked it myself yet. I’ll probably do that next week and share an update.
Anyways, thanks for the clarification about the Parquet file!