Hello Community team,
I have a question about the way partition_cols works in Palantir Foundry, so I did the following:
I had a dataset repartitioned by two columns, date and id, so I had 10 partitions (1 value of date and 10 ids). I used a transform that takes this dataset as input and does the following: my_output.write_dataframe(my_input.dataframe(), partition_cols=['date', 'id'])
From what I understood, when you use partitioning, it takes your files and divides them by ('date', 'id'). Since my dataset is already repartitioned, I was expecting the output to have exactly the same 10 partitions as the input; instead, the partitions were divided into smaller partitions. What am I missing about the way partition_cols works?
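For reference, the transform looks roughly like this (a minimal sketch; the dataset paths are placeholders, not the real ones):

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/output_dataset"),  # placeholder path
    my_input=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output, my_input):
    # Write the input as-is, partitioned on disk by date and id
    my_output.write_dataframe(my_input.dataframe(), partition_cols=["date", "id"])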
Best Regards,
Soufiane
Hi Soufiane,
Wild guess: it could be that AQE is splitting your partitions after the repartition (skewed shuffle read handling).
Could you please share your Spark physical plan? (please remove column names or any sensitive info)
Hello,
AQE is normally disabled by default and I didn't enable it. Here is the physical plan of my job:

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (6)
+- WriteFiles (5)
   +- Sort (4)
      +- CollectMetrics (3)
         +- Project (2)
            +- BatchScan parquet foundry
Even in the plan details I can't see anything suggesting that Spark will divide my partitions, so it's weird.
Right, apologies, I thought you had a repartition before writing.
Then it is straightforward:
partition_cols doesn’t trigger any repartitioning.
The number of Spark partitions is determined when the data is loaded, in chunks of roughly 128 MB; then, when writing, Spark splits each of those partitions according to partition_cols.
If you want 10 output partitions, you have to repartition the dataframe yourself before writing, for example:
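Something along these lines should give you your 10 partitions back (a rough sketch based on the snippet you posted, so the input/output names and paths are assumptions):

from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/output_dataset"),  # placeholder path
    my_input=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output, my_input):
    df = my_input.dataframe()
    # Shuffle so that all rows of a given (date, id) land in the same Spark
    # partition; the write then produces one file per (date, id) directory.
    df = df.repartition("date", "id")
    my_output.write_dataframe(df, partition_cols=["date", "id"])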
This is not Foundry specific; the same would happen with vanilla Spark:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.partitionBy.html
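For example, the vanilla Spark equivalent would look something like this (just a sketch, the paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/input")   # placeholder path
(
    df.repartition("date", "id")         # co-locate each (date, id) group in one Spark partition
      .write.partitionBy("date", "id")   # Hive-style date=.../id=... directory layout
      .parquet("/data/output")           # placeholder path
)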