Implementing Anomaly Detection & AI Query Models in Foundry for Large-Scale Power Generation Data

Hello Palantir Community,

I’m new to the Foundry system and need some guidance on implementing a few tasks within it. Although I have a strong background in AI development and data analysis (with multiple Python projects including ML/DL, RAG), I haven’t yet worked with Foundry. Here’s an overview of the tasks I need to accomplish and some context about my dataset:

  • Anomaly detection for generated power – I want to compare yesterday’s generated power with today’s; if the difference is ±5% flag it red, if ±3% flag it yellow, otherwise mark it normal. The flag can be written to its own column (a rough sketch of this logic is included after this list).
  • A RAG-style AI model over the power generation data – for example, if I ask how much power was generated it should answer, or if I ask about the maximum power generated on day X it should return the source with the maximum generated power.
  • A dashboard that showcases the anomaly detection results for each power source, with drill-down to view the power units along with their current status.
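
To make the first requirement concrete, here is a rough PySpark sketch of the comparison logic I have in mind (column names such as `generator_id` and `power_generated` are placeholders for my actual schema, and the thresholds are the ±3% / ±5% ones above):

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def flag_daily_power(readings: DataFrame) -> DataFrame:
    """Compare each generator's daily power to the previous day and flag it.

    `readings` is assumed to carry generator_id, timestamp and power_generated
    columns (placeholder names for the real schema).
    """
    daily = (
        readings
        .withColumn("date", F.to_date("timestamp"))
        .groupBy("generator_id", "date")
        .agg(F.sum("power_generated").alias("daily_power"))
    )
    w = Window.partitionBy("generator_id").orderBy("date")
    return (
        daily
        .withColumn("prev_power", F.lag("daily_power").over(w))
        .withColumn(
            "pct_change",
            (F.col("daily_power") - F.col("prev_power")) / F.col("prev_power") * 100,
        )
        # First day per generator has no previous value, so pct_change is null
        # and the row falls through to "normal".
        .withColumn(
            "flag",
            F.when(F.abs(F.col("pct_change")) >= 5, "red")
             .when(F.abs(F.col("pct_change")) >= 3, "yellow")
             .otherwise("normal"),
        )
    )
```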

Dataset Overview: I’m working with a large dataset; some tables contain around 18 million, 40 million, and 70 million rows. When using Pipeline Builder for the anomaly detection, the build fails or takes an excessive amount of time. This suggests I might need a different approach within Foundry, and I’m not sure which tools I should use to achieve this task.

  • Power generator (a new row every 10 minutes for each power generator; there are multiple power generators) – contains power generator ID, timestamp, power generated
  • Power unit (a new row every 20 minutes) – contains ID, last maintenance done, timestamp, status, speed
  • Power generator & power unit (a new row every 30 minutes) – contains ID, power generator ID, power unit ID (the unit connected to the power generator), run date, expected generating power, heat

I’d appreciate any tips, best practices, or alternative approaches within Foundry that could help me:

  • Efficiently build and execute pipelines for large datasets.
  • Implement the anomaly detection logic without running into build failures.
  • Develop the RAG AI model for querying the data.
  • Create an interactive dashboard that meets the requirements above.

Thank you in advance for any help

Hello @57caf7cbd2918a494860 ,

I can give you a few directions to get started, but I’ll need to make a few assumptions.

As a generic pointer, you can try to design workflows in Solution Designer via AIP Architect at /workspace/solution-design/onboarding on your Foundry instance.
See https://www.palantir.com/docs/foundry/solution-designer/overview

At a high level, you likely have a flow that looks like: ingestion → pipeline (alert generation) → (maybe a modeling step) → Ontology → end-user applications.

The main question here is: how does your anomaly detection work? I assume you take the last N points for each generator/unit/etc. and try to detect a point that breaks away from some kind of N-point mean, for example?
If that’s the case, you need to read the “old” points in order to generate alerts.

If you implement this in Pipeline Builder, by default the whole history will be processed on each run (this is called a snapshot transformation). Given the size of your dataset, and the volume of new rows on each run (only one or a few rows!), this can become inefficient, potentially even leading to failures in Spark.

What you might be interested to check are incremental transforms: they only process “the new data”. However, in your case, a simple incremental transform won’t be good enough, because processing only the “new rows” is not sufficient: you also need the N rows of “history” to compare against the mean, for example.
See https://www.palantir.com/docs/foundry/transforms-python/incremental-reference
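
For orientation, a minimal incremental transform skeleton using the transforms-python API described on that page could look like the sketch below (dataset paths are made up, and I’m assuming the default read/write modes behave as described there: ‘added’ for incremental inputs, ‘modify’ for the output):

```python
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    alerts=Output("/Project/power/alerts"),            # hypothetical output path
    readings=Input("/Project/power/power_generator"),  # hypothetical input path
)
def compute(readings, alerts):
    # In incremental mode, dataframe() defaults to the 'added' read mode,
    # i.e. only the rows appended since the last successful build.
    new_rows = readings.dataframe()

    # ...compute flags here; note the caveat above: the new rows alone do not
    # contain the N points of history needed for a rolling comparison...

    # Default write mode is 'modify', which appends to the existing output.
    alerts.write_dataframe(new_rows)
```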

As a side note, appending that many “small updates” (one or a few rows) every 10 minutes will create a new file in the dataset each time, which will make you run into the “small files” problem.

Here are a few ideas:

  • Idea 1 - Partitioning - If the data written to the input dataset is partitioned properly (for example by date), Spark will simply skip the files that are not relevant, hence doing exactly what we want: only processing the few rows we need. In your case, you can create an intermediary dataset where the data is “repartitioned” periodically (e.g. every week) and where new rows are appended the rest of the time.

  • Idea 2 - Filtering - You probably don’t need the WHOLE history to compute those alerts. Hence what you might do is have an intermediary dataset between your ingest and your alert generation, which keeps only the last N days/weeks/periods you need to run your alerts and gets the latest new rows appended every day. This is a bit harder to implement than a normal incremental pipeline and will require custom incremental transforms written in a Code Repository (a minimal sketch combining ideas 1 and 2 follows this list).

  • Idea 3 - Streaming - You can convert your workflow to a streaming workflow, either by ingesting the data directly as a stream, or by converting the dataset you already have into a stream. You can then use Pipeline Builder to implement your alerts, using “window” functions that keep track of the sum/count/array of values/etc. you need (so it won’t re-read the history; it just keeps the latest value and updates it). The downside is that a streaming workflow is essentially a transform that runs “all the time” from a compute perspective. Hence it can be more or less expensive than your current workflow (e.g. a batch transform that runs constantly is often more expensive in compute than a streaming workflow, as the resources allocated to a transform are by default larger than those allocated to a stream).
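
As a rough illustration of ideas 1 and 2 combined (paths and column names are assumptions, and I’m assuming the `partition_cols` option of `write_dataframe` for hive-style partitioning), the intermediary dataset could look like:

```python
from transforms.api import transform, Input, Output
from pyspark.sql import functions as F

@transform(
    recent=Output("/Project/power/power_generator_recent"),  # hypothetical path
    raw=Input("/Project/power/power_generator"),             # hypothetical path
)
def compute(raw, recent):
    df = raw.dataframe().withColumn("date", F.to_date("timestamp"))

    # Idea 2: keep only the window of history the alert logic actually needs
    # (30 days here, purely as an example).
    df = df.filter(F.col("date") >= F.date_sub(F.current_date(), 30))

    # Idea 1: write the dataset partitioned by date so downstream transforms
    # can skip the files that are not relevant to the current run.
    recent.write_dataframe(df.repartition("date"), partition_cols=["date"])
```

In practice you would run something like this periodically as the “repartition” step, and handle the day-to-day appends with an incremental transform like the skeleton shown earlier.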

To summarize, ideas 1 and 2 will require a Code Repository and incremental transforms specifically written for those cases.
Idea 3 has a compute-cost impact (because of the nature of streaming) but might also be more straightforward to implement and overall a better fit for your use case, especially if you are looking to increase the frequency of the updates in the future.

For the rest of your questions (RAG, etc.), this is a question of Ontology.
You will want to create a few object types representing what you are describing (Site, Generator, Unit, Alert, etc.) and attach properties to those objects, potentially time series properties.
Once the Ontology is in place, applications like Workshop, AIP Logic, etc. will be able to query the Ontology to answer what you need there and to build applications fitting the workflows you want to cover (display the prioritized list of alerts, drill into the reasons backing those alerts, drill into the history and how similar alerts were resolved in the past, pivot to similar sites that could experience a similar issue, etc.).


Another part of the platform to explore, especially with regard to working with this kind of signal/sensor data in Foundry, is the concept of a Time Series. Implementing your pipeline - taking into account @VincentF's excellent guidance above - so that at least one of the outputs follows the time series schema will unlock quite a bit of functionality, including Time Series alerting.

Under the hood, time series data is stored in a database optimized specifically for this schema and data scale, while also being available within the “regular” Ontology for building operational workflows, including setting up different alerting patterns.
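
As a sketch of what that means in practice (column names are assumptions), a time series output is typically just a narrow dataset with one series identifier, one timestamp, and one numeric value per row:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def to_time_series(readings: DataFrame) -> DataFrame:
    """Shape generator readings into a (series id, timestamp, value) layout
    suitable for backing a time series sync / time-dependent property."""
    return readings.select(
        F.col("generator_id").alias("series_id"),
        F.col("timestamp").alias("timestamp"),
        F.col("power_generated").alias("value"),
    )
```

Each generator then owns one series, which can be attached to the corresponding object type in the Ontology.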

Again, I would think of this approach as complementary to the pipeline-based alerting Vincent described, and simply another capability to consider as you become familiar with the platform.
