Hello @57caf7cbd2918a494860 ,
I can give you a few directions to get started, but I’ll need to make a few assumptions.
As a general pointer, you can try to design your workflows in Solution Designer via AIP Architect at /workspace/solution-design/onboarding on your Foundry instance.
See https://www.palantir.com/docs/foundry/solution-designer/overview
At a high level, your pipeline likely looks something like: ingest → anomaly detection / modeling → alert generation → Ontology → applications (maybe without the modeling step).
The main question here is: how does your anomaly detection work? I assume you take the last N points for each generator/unit/etc. and try to detect a point that breaks some kind of N-point mean, for example?
If that’s the case, you need to read the “old” points in order to generate alerts.
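If that is roughly the shape of the check, here is a minimal PySpark sketch of it (the column names unit_id/timestamp/value, the 50-point window and the 3-sigma threshold are all assumptions you would adapt to your data):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for your readings (assumed columns: unit_id, timestamp, value)
readings = spark.createDataFrame(
    [("gen-1", "2024-01-01 00:00:00", 10.0),
     ("gen-1", "2024-01-01 00:10:00", 10.2),
     ("gen-1", "2024-01-01 00:20:00", 55.0)],  # last point is the outlier
    ["unit_id", "timestamp", "value"],
)

N = 50  # hypothetical number of "old" points to compare against

# Rolling statistics over the previous N points, per generator/unit
w = Window.partitionBy("unit_id").orderBy("timestamp").rowsBetween(-N, -1)

scored = (
    readings
    .withColumn("rolling_mean", F.avg("value").over(w))
    .withColumn("rolling_std", F.stddev("value").over(w))
    .withColumn(
        "is_anomaly",
        F.abs(F.col("value") - F.col("rolling_mean")) > 3 * F.col("rolling_std"),
    )
)
scored.filter("is_anomaly").show()
```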
If you implement this in Pipeline Builder, by default the whole history will be processed on each run (this is what is called a snapshot transformation). Given the size of your dataset and the volume of new rows on each run (only one or a few rows!), this can become inefficient, potentially even leading to failures in Spark.
What you might want to look into are incremental transforms: they only process "the new data". However, in your case a simple incremental transform won't be good enough, because processing only the "new rows" is not sufficient: you also need the N rows of "history" to compare against a mean, for example.
See https://www.palantir.com/docs/foundry/transforms-python/incremental-reference
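As a rough sketch of what such a transform could look like in a code repository, assuming hypothetical dataset paths and column names, a 7-day look-back, and a score_anomalies helper standing in for the rolling-mean logic sketched above:

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    alerts=Output("/Project/datasets/alerts"),            # hypothetical path
    readings=Input("/Project/datasets/sensor_readings"),  # hypothetical path
)
def generate_alerts(readings, alerts):
    # On an incremental run, this is only the rows appended since the last build
    new_rows = readings.dataframe()

    # Pull the recent history explicitly, since the rolling statistics need it
    history = readings.dataframe("previous").filter(
        F.col("timestamp") >= F.date_sub(F.current_date(), 7)  # hypothetical look-back
    )

    # score_anomalies is a hypothetical helper holding the rolling-mean logic
    scored = score_anomalies(history.unionByName(new_rows))

    # Default incremental write mode appends the new alerts to the existing ones
    alerts.write_dataframe(scored.filter("is_anomaly"))
```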
As a side note, appending that many "small updates" (one or a few rows) every 10 minutes will create a new file in the dataset each time, which will land you in the "small files" problem.
Here are a few ideas:
- Idea 1 - Partitioning - If the data written to the input dataset is properly partitioned (for example by date), Spark will simply skip the files that are not relevant, hence doing exactly what we want: only processing the few rows we need. In your case, what you can do is create an intermediary dataset where the data is "repartitioned" periodically (e.g. every week) and the rest of the time the new rows are appended (see the first sketch after this list).
- Idea 2 - Filtering - You probably don't need the WHOLE history to compute those alerts. So what you might do is have an intermediary dataset between your ingest and your alert generation that keeps only the last N days/weeks/periods you need to run your alerts, with the latest new rows appended every day. This is a bit harder to implement than a normal incremental pipeline and will require incremental transforms written specifically for this case (see the second sketch after this list).
- Idea 3 - Streaming - You can convert your workflow to a streaming workflow, either by ingesting the data directly as a stream or by converting the dataset you already have into a stream. You can then use Pipeline Builder to implement your alerts, using "window" functions that keep track of the sum/count/array of values/etc. you need (so it won't re-read the history, it will just keep the latest value and update it; a conceptual sketch follows this list). The downside is that a streaming workflow is essentially a transform that runs "all the time" from a compute perspective. Hence it can be more or less expensive than your current workflow (e.g. a transform that runs constantly is often more expensive in compute than a streaming workflow, as the resources allocated to the transform are by default larger than those allocated to the stream).
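For idea 1, here is a minimal sketch of the intermediary, date-partitioned dataset (paths and column names are assumptions; the periodic full repartition/compaction would be handled separately, e.g. by forcing a snapshot build from time to time):

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform


@incremental()
@transform(
    partitioned=Output("/Project/datasets/readings_by_date"),  # hypothetical path
    raw=Input("/Project/datasets/sensor_readings"),            # hypothetical path
)
def partition_by_date(raw, partitioned):
    df = raw.dataframe().withColumn("date", F.to_date("timestamp"))

    # Hive-style partitioning on the date column lets downstream transforms read
    # only the handful of files they actually need instead of the whole history
    partitioned.write_dataframe(df, partition_cols=["date"])
```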
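For idea 2, here is a sketch of an incremental transform that maintains only a rolling window of recent rows by reading back its own previous output, following the pattern from the incremental-transforms reference linked above (paths, column names and the 30-day window are assumptions):

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform

LOOKBACK_DAYS = 30  # hypothetical retention window


@incremental()
@transform(
    recent=Output("/Project/datasets/readings_recent"),  # hypothetical path
    raw=Input("/Project/datasets/sensor_readings"),       # hypothetical path
)
def keep_recent_window(raw, recent):
    new_rows = raw.dataframe()  # only the rows appended since the last build

    # What this transform wrote on the previous run, checkpointed so we can
    # safely overwrite the same dataset below
    previous = recent.dataframe("previous", new_rows.schema).localCheckpoint()

    cutoff = F.date_sub(F.current_date(), LOOKBACK_DAYS)
    windowed = previous.unionByName(new_rows).filter(F.col("timestamp") >= cutoff)

    # Replace the contents on every run so the dataset never grows past the window
    recent.set_mode("replace")
    recent.write_dataframe(windowed)
```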
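For idea 3, the windowed aggregations would be configured directly in Pipeline Builder rather than hand-written, but conceptually they behave like a Spark Structured Streaming sliding window. A self-contained toy illustration (the rate source and column names are just placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy streaming source standing in for your sensor readings
readings = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("unit_id", (F.col("value") % 5).cast("string"))
    .withColumn("reading", F.rand() * 100)
)

# Sliding one-hour window advancing every 10 minutes: the stream keeps the
# running aggregate up to date instead of re-reading the history on every run
rolling = (
    readings
    .withWatermark("timestamp", "30 minutes")
    .groupBy(F.window("timestamp", "1 hour", "10 minutes"), "unit_id")
    .agg(F.avg("reading").alias("rolling_mean"), F.count("*").alias("n_points"))
)

query = rolling.writeStream.outputMode("update").format("console").start()
```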
To summarize, ideas 1 and 2 will require a code repository and incremental transforms that are specifically written for those cases.
Idea 3 has a compute-cost impact (because of the nature of streaming) but might be more straightforward to implement and overall a better fit for the nature of your use case, especially if you are looking to increase the frequency of the updates in the future.
For the rest of your questions (RAG, etc.), this is a matter of Ontology.
You will want to create a few object types representing what you are describing (Site, Generator, Unit, Alert, etc.) and attach properties to those objects, potentially including time series.
Once the Ontology is in place, applications like Workshop, AIP Logic, etc. will be able to query the Ontology to answer what you need there, and you can build applications that fit the workflows you want to cover (display the prioritized list of alerts, drill into the reasons behind those alerts, drill into the history and how similar alerts were resolved in the past, pivot to similar sites that could experience the same issue, etc.).