tl;dr: What are some good strategies for pruning objects from the Ontology after a certain time?
Our application processes incoming data that captures changes and updates (let’s call the incoming data “updates”) to various metrics that are being tracked. Whenever updates are received, we process them using an incremental transform to modify the current state of the metrics.
It is useful for our application to store not just the metrics themselves but also the update records as ontology objects. However, we receive a high volume of updates (millions every month), and after some period of time we don’t need those objects anymore.
Additionally, since updates capture changes in the metrics, we don’t want to have to iterate over years of data in the future if, for example, we change the logic that parses the updates into the metrics. Nor do we want to have to retain all update records for all time.
So, how do we effectively prune old update records, and how do we set up our logic so that we can avoid needing the records we prune?
One idea we had was to periodically store “snapshots” of the metrics at a given time and generate partitions of the update records organized by date range. That would allow us to answer a question such as “what are the current values of the metrics?” by starting from a snapshot and then iterating through only the updates since that snapshot.
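For illustration, the read path we are imagining looks roughly like this in PySpark (the column names `metric_id`, `value`, `delta`, `update_ts`, and `snapshot_ts` are hypothetical placeholders, not our real schema):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical shapes:
#   snapshots: metric_id, value, snapshot_ts   (one row per metric per snapshot)
#   updates:   metric_id, delta, update_ts     (one row per incoming update)

def current_metrics(snapshots, updates):
    # Take the most recent snapshot of each metric as the baseline.
    w = Window.partitionBy("metric_id").orderBy(F.col("snapshot_ts").desc())
    latest_snapshot = (
        snapshots
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )

    # Sum only the deltas that arrived after each metric's latest snapshot.
    deltas_since = (
        updates
        .join(latest_snapshot.select("metric_id", "snapshot_ts"), "metric_id")
        .filter(F.col("update_ts") > F.col("snapshot_ts"))
        .groupBy("metric_id")
        .agg(F.sum("delta").alias("delta_since_snapshot"))
    )

    # Current value = snapshot baseline + everything recorded since it.
    return (
        latest_snapshot
        .join(deltas_since, "metric_id", "left")
        .withColumn(
            "current_value",
            F.col("value") + F.coalesce(F.col("delta_since_snapshot"), F.lit(0)),
        )
        .select("metric_id", "current_value")
    )
```

The date-range partitions of the update records would then just bound how much of `updates` this ever has to scan.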
Some points we are thinking about are:
Could we change the backing dataset for the “updates” objects to a dataset that only pulls in the last several months of update records (see the rough sketch after this list)? If we do that, how do we make sure we can re-compute the current metrics when we want to re-iterate over many months of update records (again if, for example, we change the parsing logic or the insights we are looking for from the updates)?
We want to set up something that will avoid a big refactor down the line.
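For concreteness on the first point, by “a dataset that only pulls in the last several months” we mean something like a filtered dataset built downstream of the full history, with the full history kept around for any future reprocessing (the 6-month window and the `update_ts` column are hypothetical):

```python
from pyspark.sql import functions as F

def recent_updates(all_updates, months_to_keep=6):
    # `all_updates` keeps the complete history for any future re-iteration;
    # only this filtered slice would back the "updates" object type.
    cutoff = F.add_months(F.current_date(), -months_to_keep)
    return all_updates.filter(F.col("update_ts") >= cutoff)
```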
What do y’all think? What are some strategies to deal with this sort of situation?
Hey @paulm, so my understanding from your question is that you currently have an incremental dataset, and your current logic only takes the appended rows to update the metrics that come out of this dataset. You’re looking for a way to only calculate metrics on a subset of rows because the dataset is getting too large. (Let me know if I misunderstood anything.)
My initial thought is that if you want to calculate the metrics for a subset (e.g. the last X months), it’s easier to just have a snapshot build that calculates the metrics from scratch, rather than incremental logic that has to both take in the new rows and account for all the rows that have become too historic and should be filtered out (though maybe, based on whatever logic you have, the latter is easier; I’ll leave that up to you). One question I do have: once you calculate the metrics for, let’s say, the last 6 months, do you always just want the metrics to reflect the last 6 months? Or do you want to be able to switch between the last 6 months and everything?
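As a rough sketch of that first option (the window size and column names here are made up), the snapshot build would just be a filter plus an aggregation from scratch on every run:

```python
from pyspark.sql import functions as F

def metrics_last_x_months(updates, months=6):
    # Recompute over only the recent window on every build; no incremental state to manage.
    cutoff = F.add_months(F.current_date(), -months)
    return (
        updates
        .filter(F.col("update_ts") >= cutoff)
        .groupBy("metric_id")
        .agg(F.sum("delta").alias("value_over_window"))
    )
```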
So the metrics in question are technically cumulative for all time. Each update says something like “Measurement XYZ increased by 12.” So we want to be able to ask questions like “if Measurement XYZ was 372 in June of 2019, what is its value now given all these updates we’ve recorded?” However, in June of 2024, we don’t want to have to process 5 years’ worth of updates to do the computation.
And most importantly, the volume of updates we are receiving (tens of millions per year) adds up to a decent amount of computation that we don’t want to redo just to look at recent data. That’s why we have an incremental transform doing most of the heavy lifting here: we don’t want to run snapshot builds over years of data to get to the current state.
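Roughly, that incremental transform has the shape of the sketch below. It is a simplified stand-in rather than our actual code: the dataset paths and the `metric_id`/`delta` columns are placeholders, and it assumes the standard transforms-python incremental API.

```python
from transforms.api import transform, Input, Output, incremental
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Placeholder schema for the previously written metric state.
METRICS_SCHEMA = T.StructType([
    T.StructField("metric_id", T.StringType()),
    T.StructField("current_value", T.LongType()),
])


@incremental(semantic_version=1)  # bump to force a full (non-incremental) rebuild after a logic change
@transform(
    metrics=Output("/Example/metrics_current"),
    updates=Input("/Example/updates"),
)
def compute(metrics, updates):
    # Under @incremental, updates.dataframe() returns only the rows
    # appended since the last successful build.
    new_deltas = (
        updates.dataframe()
        .groupBy("metric_id")
        .agg(F.sum("delta").alias("delta"))
    )

    # Merge the new deltas into the metric state written by the previous build.
    previous = metrics.dataframe("previous", schema=METRICS_SCHEMA)
    merged = (
        previous
        .join(new_deltas, "metric_id", "outer")
        .select(
            "metric_id",
            (
                F.coalesce(F.col("current_value"), F.lit(0))
                + F.coalesce(F.col("delta"), F.lit(0))
            ).alias("current_value"),
        )
    )

    metrics.set_mode("replace")
    metrics.write_dataframe(merged)
```

Bumping the semantic version (or running for the first time) drops back to a full snapshot build over everything, which is exactly the expensive case we would like the snapshot-plus-recent-updates idea to make cheaper.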
Another situation that comes to mind would be one where the nature of the update records themselves changes, such that they mean something different than they used to. For example, suppose a sensor starts reporting in a different unit of measurement without us knowing about the change, and then we find out that “two months ago this sensor started reporting in feet instead of meters.” Then, we might need to go re-calculate our metrics from that point forward with some new logic, but we again would not want to have to look across several years of data when we only really care about re-iterating over the last several months.
To summarize the core of the situation:
We want to create objects that represent each update.
The metrics reported are based on the aggregation of those updates.
There are too many updates for it to be practical for us to retain them forever in the ontology.
From time to time we may want to change our aggregation logic such that we need to re-iterate over some update records, but it won’t always make sense to iterate over very old records, especially if only the new update records have something different about them.
Thanks for the info! You could start off by having a metrics dataset that holds the latest and most accurate metric values, and then build on that with the incremental logic you have. It sounds like you could also benefit from having an archive of metrics (maybe per month or per year), so that if you do need to recalculate, you only recalculate from the last time you saved known-accurate metrics.
Keeping all rows inside a historical dataset not linked to an object is also fine, since Foundry can handle tens and even hundreds of millions of rows. You would need to filter based on the date and on the last metric snapshot you kept in order to backfill incorrect entries, though, which makes me wonder whether your users could correct the metric values at the ontology level instead (e.g. in a Workshop app with an Action?).
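To sketch the archive idea (dataset names, the monthly cadence, and the columns are placeholders): a scheduled snapshot build could stamp the current metric state and append it to an archive dataset, so a recalculation only ever has to replay the updates that are newer than the relevant stamp.

```python
from pyspark.sql import functions as F

def archive_snapshot(current_metrics):
    # current_metrics: metric_id, current_value  (the output of the incremental build)
    # Each scheduled run (e.g. monthly) appends one stamped copy of the current state.
    return current_metrics.withColumn("snapshot_ts", F.current_timestamp())
```

A backfill after a logic change would then start from the most recent archived `snapshot_ts` that predates the change and re-aggregate only the updates after it, instead of going back to the beginning of time.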
I’ll let others chime in with their thoughts as well!