Will Foundry repartition my dataset automatically?

I have a dataset full of Parquet files without a schema applied. I read them using `ctx.spark_session.read.parquet(list_of_files)` and write them to the output. Will Foundry keep the input partitions or will it run an expensive repartitioning step?
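
For context, this is roughly what the transform looks like (a sketch only; the dataset paths are placeholders, and the file listing uses the transforms filesystem API):

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/folder/output"),      # placeholder path
    raw=Input("/Project/folder/raw_parquet"),  # schema-less dataset of Parquet files
)
def compute(ctx, out, raw):
    fs = raw.filesystem()
    # Build absolute paths to every Parquet file in the input dataset
    list_of_files = [
        f"{fs.hadoop_path}/{f.path}" for f in fs.ls() if f.path.endswith(".parquet")
    ]

    # Read the raw files with the Spark session and write them out unchanged
    df = ctx.spark_session.read.parquet(*list_of_files)
    out.write_dataframe(df)
```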

Hi!

If you just read the Parquet files and write them back, without any transformations and without specifying a new partitioning scheme, Spark will not introduce a shuffle: the write simply reuses the partitions produced by the scan, so there is no expensive repartitioning step.
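
One way to sanity-check this (just a sketch, reusing the `df` from the snippet above) is to look at the partition count and the physical plan; a straight read-and-write produces no Exchange (shuffle) node:

```python
# Number of partitions produced by the scan -- one per file split, no shuffle
print(df.rdd.getNumPartitions())

# The physical plan should contain only a FileScan (and the write), no Exchange node
df.explain()
```

Note that the scan partitioning is driven by the input file sizes and `spark.sql.files.maxPartitionBytes`, so it is not necessarily exactly one partition per original file, but either way no shuffle is involved.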

However, if you add operations in between, it depends on the transformation: narrow operations such as a filter keep the existing partitioning, while wide operations such as a join or groupBy require a shuffle, and adaptive query execution may further coalesce partitions after that shuffle. The same obviously applies if you call .repartition() or .coalesce() yourself.
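
To illustrate (hypothetical DataFrames built with spark.range, only used to compare the query plans):

```python
spark = ctx.spark_session

# Hypothetical DataFrames, only here to inspect the plans
left = spark.range(1_000_000).withColumnRenamed("id", "key")
right = spark.range(1_000_000).withColumnRenamed("id", "key")

# Narrow transformation: the filter keeps the scan partitioning, no Exchange node
left.filter(left["key"] > 0).explain()

# Wide transformation: the join moves data around -- the plan shows an Exchange
# (shuffle) or a BroadcastExchange, depending on the table sizes
left.join(right, on="key").explain()

# Explicit control: repartition() forces a full shuffle; coalesce() merges
# partitions without a shuffle but still changes the output partitioning
left.repartition(200).explain()
left.coalesce(10).explain()
```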

To be very explicit: it's Spark, not Foundry, that is calling the shots here!
