Repartitioning on an incremental dataset in PB doesn't seem to work

dkoman · January 23, 2025, 2:29pm

Hi! Am I right that if we add a repartition board to the transformations on an incremental dataset and then replay that dataset on deploy, the entire dataset (from the start of the data) should be repartitioned?

If so, that’s not the behavior I’m seeing. I have an 8.2GB dataset backed by 50,000 files. I followed the process described above, set the number of partitions to 50, and replayed the dataset, but the repartitioning didn’t happen. What might be causing this?

Thanks!
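
For a sense of scale, here is a quick back-of-the-envelope check of the figures in the question. It is only arithmetic, assuming decimal units (1 GB = 10^9 bytes) and evenly sized files; the constant and function names are made up for illustration.

```python
# Figures quoted in the question. Decimal units (1 GB = 10**9 bytes) and
# evenly sized files are assumptions made for this estimate.
DATASET_BYTES = 8_200_000_000   # 8.2 GB
CURRENT_FILES = 50_000          # files currently backing the dataset
TARGET_PARTITIONS = 50          # partition count set on the repartition board


def average_file_size(total_bytes: int, file_count: int) -> float:
    """Return the mean file size in bytes for a dataset split into file_count files."""
    return total_bytes / file_count


current_avg_kb = average_file_size(DATASET_BYTES, CURRENT_FILES) / 10**3
target_avg_mb = average_file_size(DATASET_BYTES, TARGET_PARTITIONS) / 10**6

print(f"current average file size: {current_avg_kb:.0f} KB")
print(f"average file size at 50 partitions: {target_avg_mb:.0f} MB")
```

So the dataset currently averages well under a megabyte per file, and a successful repartition to 50 would make each file roughly a thousand times larger.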

achung · January 24, 2025, 4:40pm

Hey, the write mode might be the issue here. A snapshot replace write mode, for example, performs a left join between the old and new data after your coalesce operation, so your data may end up in a different number of files than you expect. To output the expected number of files, make sure the coalesce operation comes at the end of the pipeline and that you’re using the default write mode.
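
To illustrate why the ordering matters, here is a toy model in plain Python. It is not Pipeline Builder or Spark code, and every function name in it is hypothetical. It assumes the join redistributes rows by key into a fixed number of shuffle partitions; the value 200 is only a placeholder, and the real number depends on the engine configuration.

```python
def coalesce(partitions, n):
    """Merge existing partitions into at most n partitions, keeping every row."""
    out = [[] for _ in range(min(n, len(partitions)))]
    for i, part in enumerate(partitions):
        out[i % len(out)].extend(part)
    return out


def shuffle_by_key(partitions, num_shuffle_partitions):
    """Redistribute rows by key, the way a join typically has to."""
    out = [[] for _ in range(num_shuffle_partitions)]
    for part in partitions:
        for key in part:
            out[key % num_shuffle_partitions].append(key)
    return out


def left_join(new_parts, old_parts, num_shuffle_partitions):
    """Toy left join: output partitioning follows the shuffle, not the input."""
    old_keys = {key for part in old_parts for key in part}
    shuffled = shuffle_by_key(new_parts, num_shuffle_partitions)
    return [[(key, key in old_keys) for key in part] for part in shuffled]


# 1,000 tiny "files" of 10 rows each stand in for the many small input files.
small_files = [[i * 10 + j for j in range(10)] for i in range(1000)]
old_data = [list(range(5000))]
SHUFFLE_PARTITIONS = 200  # assumed placeholder, not a Foundry setting

coalesced = coalesce(small_files, 50)
joined_after_coalesce = left_join(coalesced, old_data, SHUFFLE_PARTITIONS)
coalesced_last = coalesce(joined_after_coalesce, 50)

print(len(coalesced))              # partitions right after the coalesce
print(len(joined_after_coalesce))  # partitions once the join has run
print(len(coalesced_last))         # partitions when the coalesce comes last
```

In this model the partition count set by the coalesce is lost as soon as the join runs, and it is only preserved when the coalesce is the final step, which is the ordering suggested above.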