I have two snapshots of the same dataset. The dataset has a primary key and all rows are unique. We want to compute the differences between the two snapshots, in particular which rows were added, updated, or deleted. Is there a better way to do this than just using except?
I was recommended to use a full outer join on the primary key column. In that case added rows are those where old.pk == null and new.pk != null, and deleted rows are those where old.pk != null and new.pk == null.
The question is about updated rows, where old.pk != null and new.pk != null: how do I detect which of those rows actually have different data?
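To make the classification concrete, here is a minimal plain-Python sketch of the full-outer-join bucketing (toy rows mapping pk to non-key columns; this illustrates the logic only, it is not Spark code). An updated row is one whose key appears on both sides but whose non-key columns differ:

```python
# Hypothetical snapshots: pk -> tuple of non-key columns.
old = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
new = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}

# Added: key only on the new side (old.pk would be null in the join).
added = {pk: new[pk] for pk in new.keys() - old.keys()}

# Deleted: key only on the old side (new.pk would be null in the join).
deleted = {pk: old[pk] for pk in old.keys() - new.keys()}

# Updated: key on both sides, but the non-key data differs.
updated = {pk: (old[pk], new[pk])
           for pk in old.keys() & new.keys()
           if old[pk] != new[pk]}

print(added)    # {4: ('dave', 40)}
print(deleted)  # {3: ('carol', 30)}
print(updated)  # {2: (('bob', 20), ('bob', 25))}
```

In Spark terms, the last bucket corresponds to comparing every non-key column (or a hash of them) between the joined old and new sides.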
As you mentioned, it seems like exceptAll would be your best bet. I'm not familiar with the Java Spark API, but the equivalent function in PySpark is pyspark.sql.DataFrame.exceptAll.
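For intuition, exceptAll is a multiset difference (unlike except, it preserves duplicates). Combined with the key sets from both snapshots, it yields the same three buckets. A plain-Python sketch of that idea, using toy (pk, value) rows rather than real DataFrames:

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference: rows of `left` not matched by rows of `right`,
    preserving duplicates -- the semantics of DataFrame.exceptAll."""
    remaining = Counter(right)
    out = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # consume one matching row from the right side
        else:
            out.append(row)
    return out

old = [(1, "a"), (2, "b"), (3, "c")]
new = [(1, "a"), (2, "B"), (4, "d")]

old_keys = {pk for pk, _ in old}
new_keys = {pk for pk, _ in new}

changed_in_new = except_all(new, old)  # rows that were added or updated
changed_in_old = except_all(old, new)  # rows that were deleted or replaced

added   = [r for r in changed_in_new if r[0] not in old_keys]
updated = [r for r in changed_in_new if r[0] in old_keys]
deleted = [r for r in changed_in_old if r[0] not in new_keys]

print(added)    # [(4, 'd')]
print(updated)  # [(2, 'B')]
print(deleted)  # [(3, 'c')]
```

In actual Spark you would compute new.exceptAll(old) and old.exceptAll(new), then split each result by whether its primary key exists on the other side.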