In Pipeline Builder, I want to drop all duplicate rows, but I get the following warning message when using the provided Drop Duplicates transform:
Drop duplicates picks rows at random using the given columns, making it nondeterministic. Use a checkpoint to ensure results are identical between downstream outputs.
Why is this the case, and how can I deterministically drop duplicate rows? I was thinking of writing a Python function, but I'd rather stick to built-ins if possible.
Hey! If multiple rows share the same values in the given columns, we can't guarantee which of those rows is kept on any given run. However, you can use the strategies below to force deterministic behavior:
(Also note that we're just using Spark's dropDuplicates implementation under the hood, so switching over to code would give the same results.)
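To illustrate what the warning is describing, here is a minimal PySpark sketch (the DataFrame and the column names name and value are made up for this example, not taken from your pipeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: two rows share the same "name" key.
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["name", "value"])

# dropDuplicates(["name"]) keeps one row per name, but which ("a", ...)
# row survives depends on partitioning and task scheduling, so repeated
# runs of the same pipeline can keep different rows.
df.dropDuplicates(["name"]).show()
```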
How to Make dropDuplicates Deterministic?
To ensure the results are always the same, you can follow these strategies:
Sort the Data Consistently: Before removing duplicates, sort the data in a specific order. This ensures that the same rows are kept each time.
Rank Rows Within Groups: Use a window function to rank rows within each group of duplicates, then keep only the top-ranked row from each group, ensuring consistency (see the sketch after this list).
Use Unique Identifiers: If your data has unique identifiers (like timestamps or IDs), use these to sort and manage duplicates. This makes sure the same rows are kept every time.
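Here is a rough PySpark sketch combining these strategies: partition into duplicate groups, sort each group consistently by a unique identifier, rank the rows, and keep only the top-ranked one. The column names (name as the dedup key, updated_at and value as ordering columns) are hypothetical stand-ins for your own schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "name" is the dedup key; "updated_at" plus "value"
# act as the unique identifiers we sort on.
df = spark.createDataFrame(
    [("a", 1, "2024-01-02"), ("a", 2, "2024-01-01"), ("b", 3, "2024-01-01")],
    ["name", "value", "updated_at"],
)

# Rank rows within each duplicate group using a consistent sort. The
# orderBy columns should uniquely identify a row so that ties are impossible.
w = Window.partitionBy("name").orderBy(F.col("updated_at").desc(), F.col("value"))

# Keep only the top-ranked row per group, then drop the helper column.
deduped = (
    df.withColumn("_rn", F.row_number().over(w))
      .filter(F.col("_rn") == 1)
      .drop("_rn")
)
deduped.show()
```

Because row_number (unlike rank) never assigns ties, the surviving row is fully determined by the orderBy columns, so the output is identical between runs and between downstream outputs.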
Thank you! I was just confused by the wording of the warning message. I think it may be better to say something like this: Drop duplicates cannot guarantee which row is kept, making it nondeterministic... <insert suggestion here to make it deterministic>
For my implementation, it doesn't matter which row of the duplicates is kept. I thought the warning message meant that you took a sample of the data and applied drop duplicates only to that sample set, which confused me.