I have two snapshots of the same dataset. The dataset has a primary key and all rows are unique. We want to compute the differences between the two snapshots, in particular which rows were added, updated, or deleted. Is there a better way to do this than just using except?
I was recommended to use a full outer join on the primary key column. In that case added rows are those where old.pk == null and new.pk != null, and deleted rows are those where old.pk != null and new.pk == null.
The question is about updated rows, where old.pk != null and new.pk != null: how do I detect which of those rows actually have different data?
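To make the classification concrete, here is a minimal plain-Python sketch of the full-outer-join bucketing (toy rows mapping pk to non-key columns; this illustrates the logic only, it is not Spark code). An updated row is one whose key appears on both sides but whose non-key columns differ:

```python
# Hypothetical snapshots: pk -> tuple of non-key columns.
old = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
new = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}

# Added: key only on the new side (old.pk would be null in the join).
added = {pk: new[pk] for pk in new.keys() - old.keys()}

# Deleted: key only on the old side (new.pk would be null in the join).
deleted = {pk: old[pk] for pk in old.keys() - new.keys()}

# Updated: key on both sides, but the non-key data differs.
updated = {pk: (old[pk], new[pk])
           for pk in old.keys() & new.keys()
           if old[pk] != new[pk]}

print(added)    # {4: ('dave', 40)}
print(deleted)  # {3: ('carol', 30)}
print(updated)  # {2: (('bob', 20), ('bob', 25))}
```

In Spark terms, the last bucket corresponds to comparing every non-key column (or a hash of them) between the joined old and new sides.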
As you mentioned, it seems like exceptAll would be your best bet. I'm not familiar with the Java Spark API, but the equivalent function in PySpark is pyspark.sql.DataFrame.exceptAll.
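For intuition, exceptAll is a multiset difference (unlike except, it preserves duplicates). Combined with the key sets from both snapshots, it yields the same three buckets. A plain-Python sketch of that idea, using toy (pk, value) rows rather than real DataFrames:

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference: rows of `left` not matched by rows of `right`,
    preserving duplicates -- the semantics of DataFrame.exceptAll."""
    remaining = Counter(right)
    out = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # consume one matching row from the right side
        else:
            out.append(row)
    return out

old = [(1, "a"), (2, "b"), (3, "c")]
new = [(1, "a"), (2, "B"), (4, "d")]

old_keys = {pk for pk, _ in old}
new_keys = {pk for pk, _ in new}

changed_in_new = except_all(new, old)  # rows that were added or updated
changed_in_old = except_all(old, new)  # rows that were deleted or replaced

added   = [r for r in changed_in_new if r[0] not in old_keys]
updated = [r for r in changed_in_new if r[0] in old_keys]
deleted = [r for r in changed_in_old if r[0] not in new_keys]

print(added)    # [(4, 'd')]
print(updated)  # [(2, 'B')]
print(deleted)  # [(3, 'c')]
```

In actual Spark you would compute new.exceptAll(old) and old.exceptAll(new), then split each result by whether its primary key exists on the other side.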