Hello Community team,
I have a question about the way partition_cols works in Palantir Foundry, so I did the following:
I had a dataset repartitioned by two columns, date and id, so I had 10 partitions (1 value of date and 10 ids). I used a transform that takes this dataset as input and does the following: my_output.write_dataframe(my_input.dataframe(), partition_cols=['date', 'id'])
From what I understood, when you use partitioning, it takes your files and divides them by ('date', 'id'). Since my dataset is already repartitioned, I was expecting the output to have exactly the same 10 partitions as the input; instead, the partitions were divided into smaller partitions. What am I missing about the way partition_cols works?
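For reference, the transform looks roughly like this (a minimal sketch; the dataset paths are placeholders, not the real ones):

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/output_dataset"),  # placeholder path
    my_input=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output, my_input):
    # Write the input as-is, partitioned on disk by date and id
    my_output.write_dataframe(my_input.dataframe(), partition_cols=["date", "id"])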
Best Regards,
Soufiane
Hi Soufiane,
Wild guess: it could be that AQE is splitting your partitions after the repartition (skewed shuffle read handling).
Could you please share your Spark physical plan? (please remove column names or any sensitive info)
Hello,
AQE is normally disabled by default and I didn't enable it. Here is the physical plan of my job:

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (6)
+- WriteFiles (5)
   +- Sort (4)
      +- CollectMetrics (3)
         +- Project (2)
            +- BatchScan parquet foundry
Even in the plan details I can't see anything suggesting that Spark will divide my partitions, so it's weird.
Right, apologies, I thought you had a repartition before writing.
Then it is straightforward:
partition_cols doesn’t trigger any repartitioning.
The number of Spark partitions is determined when the data is loaded, in chunks of roughly 128 MB; then, when writing, Spark splits each of those partitions according to partition_cols.
If you want 10 output partitions, you have to repartition the dataframe yourself before writing, for example:
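Something along these lines should give you your 10 partitions back (a rough sketch based on the snippet you posted, so the input/output names and paths are assumptions):

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/output_dataset"),  # placeholder path
    my_input=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output, my_input):
    df = my_input.dataframe()
    # Shuffle so that all rows of a given (date, id) land in the same Spark
    # partition; the write then produces one file per (date, id) directory.
    df = df.repartition("date", "id")
    my_output.write_dataframe(df, partition_cols=["date", "id"])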
This is not Foundry specific; the same would happen with vanilla Spark:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.partitionBy.html
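For example, the vanilla Spark equivalent would look something like this (just a sketch, the paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/input")   # placeholder path
(
    df.repartition("date", "id")         # co-locate each (date, id) group in one Spark partition
      .write.partitionBy("date", "id")   # Hive-style date=.../id=... directory layout
      .parquet("/data/output")           # placeholder path
)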