I have saved a dataset using the option partition_cols=["my_col"]
in code repos.
In a different pipeline I am using that same dataset as input, and I want to make sure Spark recognizes that it's reading a partitioned input. How can I check this?
I tried in_rdd.dataframe().rdd.partitioner, but it returns None, so that does not seem to be the right check.
Thanks!
Hello,
First, a quick aside: rdd.partitioner being None is expected here, since it describes the in-memory partitioner of a key-value RDD, not the on-disk partition layout of the files.
When a dataset is written with partition_cols=["my_col"], the simplest check is at the file level: the files are laid out in Hive-style directories, so the file paths contain the distinct values of your column (e.g. a my_col=A segment), as in the sketch below.
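For example, a minimal sketch of a transform that surfaces the file paths backing the input so you can eyeball the my_col=<value> segments (the dataset paths are hypothetical; filesystem().ls() is the transforms API for listing a dataset's underlying files):

```python
from transforms.api import transform, Input, Output


@transform(
    my_input=Input("/Project/datasets/partitioned_dataset"),  # hypothetical path
    out=Output("/Project/datasets/layout_check"),             # hypothetical path
)
def check_layout(my_input, out, ctx):
    # List the raw files backing the input dataset. If it was written with
    # partition_cols=["my_col"], each path contains a "my_col=<value>" segment.
    paths = [(f.path,) for f in my_input.filesystem().ls()]
    out.write_dataframe(ctx.spark_session.createDataFrame(paths, ["path"]))
```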
What you can also do is apply a filter transformation on the column you partitioned by, then inspect the physical plan of your Spark build: in the FileScan's PartitionFilters entry you will see that Spark is taking your partitioning into account (if your dataset is not partitioned, that list will be empty). See the sketch below.
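Concretely, something like this (a sketch; my_col and the filter value are placeholders, and my_input is assumed to be the same transform input as above):

```python
from pyspark.sql import functions as F

df = my_input.dataframe()

# Filter on the partition column, then print the query plans.
filtered = df.filter(F.col("my_col") == "some_value")
filtered.explain(True)

# In the FileScan line of the physical plan, look for something like:
#   PartitionFilters: [isnotnull(my_col#12), (my_col#12 = some_value)]
# An empty list, i.e. PartitionFilters: [], means Spark is not treating
# my_col as a partition column for this input.
```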