I think there is a problem with toArrow(): it collects all the data onto the driver. The output is a single .parquet file even when my input dataset consists of many .parquet files, and even when I write it out without any transformation, right after reading it.
This is a bottleneck for bigger datasets, which would require bigger drivers just to collect all the data once at the end. Is there another method to write into a Foundry dataset directly from the executors?
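A minimal sketch of what I mean (dataset paths are placeholders; this is the shape of my transform, not the exact code):

```python
from transforms.api import transform, Input, Output
import pyarrow.parquet as pq

@transform(
    out=Output("/My/Project/output_dataset"),   # placeholder path
    source=Input("/My/Project/input_dataset"),  # placeholder path
)
def compute(source, out):
    # toArrow() pulls every partition onto the driver as one Arrow table...
    table = source.dataframe().toArrow()
    # ...so the whole dataset comes back out as a single .parquet file,
    # no matter how many files the input had.
    with out.filesystem().open("data.parquet", "wb") as f:
        pq.write_table(table, f)
```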
Thank you @gwalker.
I had to switch the order: first upload the schema, and only then the .parquet files. If we write the schema afterwards, we lose the data, because in that case a snapshot transaction is created. This works well on Spark 4; lower versions would need a small workaround for df.toArrow(), which is used here to derive the schema to upload.
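For reference, the shape of the code is roughly the following. Dataset paths are placeholders, and `upload_schema` is a stand-in for whatever schema-upload call your setup provides (I'm not reproducing mine here); the file copy uses the filesystem()/files() pattern so it runs on the executors:

```python
import shutil
from transforms.api import transform, Input, Output

@transform(
    out=Output("/My/Project/output_dataset"),   # placeholder path
    source=Input("/My/Project/input_dataset"),  # placeholder path
)
def compute(source, out):
    df = source.dataframe()

    # Step 1: upload the schema FIRST. Calling toArrow() on an empty slice
    # yields the Arrow schema without collecting any rows to the driver.
    arrow_schema = df.limit(0).toArrow().schema
    upload_schema(out, arrow_schema)  # hypothetical helper: substitute your schema-upload call

    # Step 2: only then copy the raw .parquet files, in parallel on the
    # executors, so nothing is ever collected onto the driver.
    fs_in = source.filesystem()
    fs_out = out.filesystem()

    def copy_file(row):
        # Runs executor-side; streams one file from input to output.
        with fs_in.open(row.path, "rb") as src, fs_out.open(row.path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    fs_in.files(glob="**/*.parquet").foreach(copy_file)
```

Deriving the schema from df.limit(0).toArrow() keeps the schema step from collecting any rows, which is the part that needs a workaround on Spark versions below 4.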