How can I see a row by row comparison of two datasets?

ishan · April 17, 2024, 9:02am

Essentially, I have two versions of my dataset: the transaction before my code change, and the transaction after my code change. I want to be sure that my code change hasn’t affected any of the rows, because in theory the logic is exactly the same.

Is there a way to get a row by row diff or comparison of these two dataset?

ishan · April 18, 2024, 2:52pm

Solved! Contour to the rescue! It let’s us do Set maths on the two transactions. We can use the ‘remove rows’ functionality to figure out a difference. Here’s the step by step instructions. Note: All the below pictures are on notional and fake data only.

1. Create two paths in contour, one with the dataset transaction before, and another with the transaction after.

2. Use the multi-column editor on each path to make sure you have the same columns. If there are columns you expect to be different, then deselect these. The remaining columns will be used for the diff comparison.

3. Create a new path from your ‘Before’ path.

4. Use the union board to subtract the rows in the ‘After’ dataset from the ‘Before’ dataset.

Also you might want to rename this third path to ‘Before minus after’.

5. Do the same thing to get After minus Before.

6. Create a new path with the union of these two diffs.
First, create a new path from ‘Before minus After’.

Then, use the union board to add the rows from the ‘After minus Before’ path.

This final path should look something like this

Make sure you add a final board at the end of the final path to sort by the primary key of your dataset, so that you can see related rows right next to each other.

As you can see, clicking ‘show data’ on the final board reveals an empty dataset. That’s because my ‘text messages - Before’ board and ‘text messages - After’ board are identical and nothing has changes between the two!

Best of luck!

david · May 6, 2024, 8:06pm

You could also: create a unique PK for your dataset such that the PK for a row is the same for A and B, and then union anti joins to get your result: (A anti B) U (B anti A). That should give you : all rows that are in A that aren’t in B, and all rows in B that aren’t in A.

Kuba · May 8, 2024, 8:24pm

The underlying Spark mechanism (that you can utilize in, for example, Transforms) is DataFrame.subtract. For those familiar with SQL, this is effectively EXCEPT DISTINCT on dataframes.