Hello there. I’m having issues with datasets and was wondering whether it is somehow possible to generate a stable ID for a dataset that doesn’t have one and is the result of a build? Essentially, the dataset doesn’t have any unique identifier, and this is making working with it a lot more difficult, for various reasons.

I’ve experimented with incremental transforms on one dataset, but can’t confirm whether it works the way I intend. I am concerned that changing any of the columns will result in a different ID being generated. Currently, the incremental algorithm is “Get the original dataset, compare it with the new version that has IDs assigned, and assign new IDs to any new rows”. What concerns me is whether updates or deletions of rows will result in a different ID.

In other words, I’m looking for an “assign and forget” algorithm that ensures that if the value “10” is assigned to a row, the value remains “10” even if I update one of its columns. And if I delete the row, the next new ID assigned becomes “11”. Can anyone please help me out with this?
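To make the desired behaviour concrete, here is a minimal pure-Python sketch of the “assign and forget” semantics I have in mind. The key name `k`, the `id` column, and the in-memory row representation are all hypothetical; in practice this would run as an incremental transform against the previous build’s output:

```python
def assign_ids(previous, current, key):
    """previous: rows already carrying an 'id'; current: the new snapshot
    without IDs. Rows are plain dicts; matching uses the stable key `key`."""
    existing = {row[key]: row["id"] for row in previous}
    # Next ID is one past the highest ever assigned, so deleted rows'
    # IDs are never reused.
    next_id = max(existing.values(), default=0) + 1
    out = []
    for row in current:
        if row[key] in existing:
            # An updated row keeps its previously assigned ID.
            out.append({**row, "id": existing[row[key]]})
        else:
            # A brand-new row gets the next ID.
            out.append({**row, "id": next_id})
            next_id += 1
    return out
```

So if the row keyed “a” already has ID 10 and one of its other columns changes, it keeps 10, and the next new row gets 11. Note this only works if the previous snapshot (or at least the high-water mark of assigned IDs) is retained across builds, which is exactly the part I’m unsure how to guarantee.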
So, just to clarify: you are speaking about an “identifier” here, but what you are looking for is a “primary key”, correct?
So your question is “Can I generate a stable primary key?”.
Short answer: yes, but it depends on how your data is shaped.
A trivial approach is to concatenate multiple columns as strings. You need to identify which columns are “stable” over time: if one of them changes, that means a new row is being created.
For example, when you say “if the value 10 changes to something else, I want the primary key to remain the same”, then this “value” column shouldn’t be included in the primary key.
You might use multiple columns (e.g. the date, the user, the item_number, etc.) to generate your primary key.
Example in PySpark (adding the import the snippet needs):
from pyspark.sql import functions as F

df_with_pk = df.withColumn("primary_key", F.concat_ws("__", F.col("name"), F.col("date"), F.col("id")))
If you want your key to look nicer (though that makes debugging a bit harder and doesn’t provide much additional benefit), you can hash this concatenation:
df_with_pk = df.withColumn("primary_key", F.hash(F.concat_ws("__", F.col("name"), F.col("date"), F.col("id"))))