Bucketing on incremental datasets

Hello community team,

We are currently exploring Hive partitioning vs. bucketing to better lay out our dataset, but I found this Foundry Stack Overflow post, https://stackoverflow.com/questions/72608763/spark-writedataframe-with-partitioningbyrange-in-foundry, stating that bucketing is not possible on incremental datasets.
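
For context, the kind of write we have in mind looks roughly like this. This is only a sketch: the dataset paths and the `day`/`user_id` columns are placeholders, and I'm assuming the `partition_cols`/`bucket_cols`/`bucket_count` arguments of `write_dataframe` from the transforms-python API.

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/datasets/events_bucketed"),  # placeholder path
    source=Input("/Project/datasets/events"),         # placeholder path
)
def compute(source, out):
    df = source.dataframe()
    # Hive-partition by day and bucket by user_id (placeholder columns).
    # On a snapshot build this writes one directory per day, with files
    # bucketed by user_id inside each partition; the open question is
    # whether that bucket layout can be maintained once the dataset
    # builds incrementally.
    out.write_dataframe(
        df,
        partition_cols=["day"],
        bucket_cols=["user_id"],
        bucket_count=32,
    )
```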

Would anyone know if this is true and whether it is still the case?

Best,

Hi @55b969276dcf43cd2235,

If your pipeline is append-only, a projection would be the best way to improve query/filtering performance when reading this dataset. A projection is able to compact files to maintain performance even after a large number of incremental transactions.

The docs for projections are here, and specifically for incremental pipelines here.

Ben

Hi Ben,

Thanks for your reply. We're indeed aware of projections, but we can't use them for now: the schema of our incremental datasets is evolving, and we need to keep the flexibility to delete files/transactions when necessary.

That's why we're experimenting with Hive partitioning and bucketing instead, so we don't end up blocked; a sketch of what we're testing is below.
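
The Hive-partitioning variant looks roughly like this (again just a sketch, assuming the transforms-python incremental API; the paths and the `day` column are placeholders):

```python
from transforms.api import incremental, transform, Input, Output


@incremental()
@transform(
    out=Output("/Project/datasets/events_partitioned"),  # placeholder path
    source=Input("/Project/datasets/events"),            # placeholder path
)
def compute(source, out):
    # With @incremental(), dataframe() returns only the newly added rows
    # by default, and the write appends them into the Hive partition
    # directories matching their day values.
    new_rows = source.dataframe()
    out.write_dataframe(new_rows, partition_cols=["day"])
```

Each append then only touches the partition directories its new rows fall into, so reads that filter on `day` stay cheap even as transactions accumulate.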

Coming back to my original question: would you know if bucketing is currently possible on incremental datasets?

Best,
Wilfried