Bucketing on incremental datasets

Hello community team,

We are currently exploring Hive partitioning vs. bucketing to better lay out our dataset, but I found this Foundry Stack Overflow post, https://stackoverflow.com/questions/72608763/spark-writedataframe-with-partitioningbyrange-in-foundry, stating that bucketing is not possible on incremental datasets.
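
For context, the kind of write we have in mind looks roughly like this. This is only a sketch: the dataset paths and the `day`/`user_id` columns are placeholders, and I'm assuming the `partition_cols`/`bucket_cols`/`bucket_count` arguments of `write_dataframe` from the transforms-python API.

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/datasets/events_bucketed"),  # placeholder path
    source=Input("/Project/datasets/events"),         # placeholder path
)
def compute(source, out):
    df = source.dataframe()
    # Hive-partition by day and bucket by user_id (placeholder columns).
    # On a snapshot build this writes one directory per day, with files
    # bucketed by user_id inside each partition; the open question is
    # whether that bucket layout can be maintained once the dataset
    # builds incrementally.
    out.write_dataframe(
        df,
        partition_cols=["day"],
        bucket_cols=["user_id"],
        bucket_count=32,
    )
```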

Would anyone know if this is true and whether it is still the case?

Best,

Hi @55b969276dcf43cd2235,

If your pipeline is append-only, a projection would be the best way to improve query/filtering performance when reading this dataset. A projection is able to compact files to maintain performance even after a large number of incremental transactions.

The docs for projections are here, and specifically for incremental pipelines here.

Ben

Hi Ben,

Thanks for your reply. We're indeed aware of projections, but we can't use them for now: the schema of our incremental datasets is evolving, and we need to keep the flexibility to delete files/transactions when necessary.

That's why we're experimenting with Hive partitioning and bucketing instead, so we don't end up blocked; a sketch of what we're testing is below.
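
The Hive-partitioning variant looks roughly like this (again just a sketch, assuming the transforms-python incremental API; the paths and the `day` column are placeholders):

```python
from transforms.api import incremental, transform, Input, Output


@incremental()
@transform(
    out=Output("/Project/datasets/events_partitioned"),  # placeholder path
    source=Input("/Project/datasets/events"),            # placeholder path
)
def compute(source, out):
    # With @incremental(), dataframe() returns only the newly added rows
    # by default, and the write appends them into the Hive partition
    # directories matching their day values.
    new_rows = source.dataframe()
    out.write_dataframe(new_rows, partition_cols=["day"])
```

Each append then only touches the partition directories its new rows fall into, so reads that filter on `day` stay cheap even as transactions accumulate.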

Coming back to my original question: would you know if bucketing is currently possible on incremental datasets?

Best,
Wilfried