I recently came across a pipeline where a data expectation had been added to all relevant inputs to check that each input dataset contains at least one row. In the past we had issues with snapshot ingestion inputs arriving empty, so this was added as a safeguard. This is the current config:
```python
dataset_emptiness_check = E.count().not_equals(0)
```
Looking at the Spark query details, this check performs a full row count over the entire dataset (>2 billion rows), which is quite expensive, especially since the data is usually Hive-partitioned.
I changed the code to do this instead (and removed the data expectation):
```python
# first() returns None for an empty DataFrame, so this avoids a full count
if df.first() is None:
    raise ValueError('One of the input datasets is empty')
```
This runs much faster, but obviously doesn't use the data expectations framework. Now my questions:
- Is there a way, using data expectations, to check for an empty dataset more efficiently than a full row count?
- If not, could a check along the lines of my snippet above be added to the framework?
Thanks!