Essentially, I have two versions of my dataset: the transaction before my code change, and the transaction after my code change. I want to be sure that my code change hasn’t affected any of the rows, because in theory the logic is exactly the same.
Is there a way to get a row by row diff or comparison of these two dataset?
Solved! Contour to the rescue! It let’s us do Set maths on the two transactions. We can use the ‘remove rows’ functionality to figure out a difference. Here’s the step by step instructions. Note: All the below pictures are on notional and fake data only.
1. Create two paths in contour, one with the dataset transaction before, and another with the transaction after.
2. Use the multi-column editor on each path to make sure you have the same columns. If there are columns you expect to be different, then deselect these. The remaining columns will be used for the diff comparison.
Make sure you add a final board at the end of the final path to sort by the primary key of your dataset, so that you can see related rows right next to each other.
As you can see, clicking ‘show data’ on the final board reveals an empty dataset. That’s because my ‘text messages - Before’ board and ‘text messages - After’ board are identical and nothing has changes between the two!
You could also: create a unique PK for your dataset such that the PK for a row is the same for A and B, and then union anti joins to get your result: (A anti B) U (B anti A). That should give you : all rows that are in A that aren’t in B, and all rows in B that aren’t in A.
The underlying Spark mechanism (that you can utilize in, for example, Transforms) is DataFrame.subtract. For those familiar with SQL, this is effectively EXCEPT DISTINCT on dataframes.