Hey!
I'm trying to dynamically filter a Hive-partitioned dataset (to be specific, a view over multiple Hive-partitioned datasets) by broadcast-joining a much smaller, non-partitioned dataset on the Hive partition columns. As a result, during the file-loading stage only the files whose partitions are present in the broadcast dataset should be read. For example:
In the big dataset I have files partitioned by year, year=2024/file_a and year=2025/file_b, and in the small dataset I have a year column with the value 2025. Joining on year should result in loading only one file (year=2025/file_b) instead of both of them.
By my understanding this option should allow me to do it: ctx.spark_session.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", True)
however that doesn't seem to be the case.
So far I've tested 4 different scenarios (year, month, day are my Hive partition columns):
1) join on: year, month, day, colA - result: the BIG dataset loads all files
2) join on: year, month, day - result: the BIG dataset loads all files
3) join on: year, month, day, with the broadcast dataset repartitioned by those 3 columns - result: the BIG dataset loads all files
4) join on: year, month, day, using a static dataframe for the join instead - result: the BIG dataset loads all files
Here is example code for case 4:
def compute(ctx, source_df):