Hey!
I'm trying to dynamically filter a Hive-partitioned dataset (to be specific, a view over multiple Hive-partitioned datasets) by broadcast-joining a much smaller, non-partitioned dataset on the Hive partition columns. As a result, during the file-loading stage only the files whose partitions are present in the broadcast dataset should be read. For example:
In the big dataset I have files partitioned by year, year=2024/file_a and year=2025/file_b, and in the small dataset I have a year column with the value 2025. Joining on year should result in loading only one file (year=2025/file_b) instead of both of them.
By my understanding this option should allow me to do it: ctx.spark_session.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", True)
however that doesn't seem to be the case.
So far I've tested 4 different scenarios (year, month, day are my Hive partition columns):
1) join on: year, month, day, colA - result: the BIG dataset loads all files
2) join on: year, month, day - result: the BIG dataset loads all files
3) join on: year, month, day, with the broadcast dataset repartitioned by those 3 columns - result: the BIG dataset loads all files
4) join on: year, month, day, using a static dataframe for the join instead - result: the BIG dataset loads all files
Here is example code for case 4:
def compute(ctx, source_df):