Fastest Pipeline Possible

Hi - I have really small data. I ingest it from an API (incrementally) and then want to transform it a bit and push it to the Ontology.

  1. How do I make the ingestion fast? I am using lightweight and incremental transforms, but my last job took 20 seconds to start. The Spark work downstream took 2 seconds…

  2. From there, how do I make the transforms really fast? I tried using Pipeline Builder with a stream but was confused about how to use the dataset input and output a stream. I would love for any incremental updates to the input dataset to be pushed through the stream.


Hi there! I don’t know the best way to make your data ingestion fast, but it seems like you have your wires crossed with streaming vs incremental pipelines.

A stream will process data in real time, while incremental pipelines are batch pipelines that require distinct builds (where each build takes some amount of time).

That’s all to say a streaming pipeline will be “faster” since it runs in real time, but make sure it’s actually what you want.

Here’s the documentation for creating a streaming pipeline in Pipeline Builder: https://www.palantir.com/docs/foundry/building-pipelines/create-stream-pipeline-pb/


Hi Sperchanok - my wires are not crossed. As I know all too well, the product can be confusing.

Ingestion: I have to write the ingestion via code. My API has an endpoint to get the latest changes. I fetch those latest changes and then save them via an incremental update to a dataset. My question re: speed is not about my ingestion code, but rather the overall environment spin-up. My code runs very quickly. How can I minimize the spin-up time for my ingestion code? This was a recent build:

Secondly, after I ingest that data, I want to process it really quickly. It seems Pipeline Builder can load a dataset into a streaming pipeline (note: there are two options there, Stream and Snapshot, which are not documented, so I don’t understand the difference). My idea was that incremental updates to that upstream dataset could then be pushed into the streaming pipeline. Perhaps a better explanation of what adding a dataset to a streaming Pipeline Builder pipeline actually does is what is warranted here.


Hey @brentshulman-silklin,

On ingestion:

  • I know the team is working on improvements to lightweight startup time. Watch this space; it should get faster in the next few months.
  • In the next few weeks we’ll be publishing examples of how to use Compute Modules in this fashion. Basically, we’ll show how to have a constantly running Compute Module check an API on an ongoing basis and push the data into a stream for processing. Maybe @jeg can share some early versions of the guide?

On using Builder streaming:

  • If you end up going with the Compute Module approach, you don’t need an intermediate dataset. You’ll create a new raw stream and push into it with an API call from your Compute Module (see the sketch after this list).
  • If you’re fine with wherever we get to on lightweight spin-up time, you can just take the dataset and use it as a streaming input using the “Stream” mode. This picks up the data as soon as the new append transaction closes. “Snapshot” can be used to join an existing static dataset with a stream. It is in fact not well documented; we’ll fix it.
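For reference, here’s a minimal sketch of what that push from a Compute Module could look like in TypeScript. The publish endpoint path, environment variable names, and record shape are assumptions for illustration only; confirm the exact Streams API path for your enrollment in the platform API docs.

```typescript
// Minimal sketch: publish one JSON record into a Foundry raw stream.
// FOUNDRY_HOST, FOUNDRY_TOKEN, STREAM_DATASET_RID and STREAM_BRANCH are
// placeholders you would supply; the endpoint path below is an assumption
// based on the platform Streams API and should be confirmed in the docs.
const FOUNDRY_HOST = process.env.FOUNDRY_HOST!;        // e.g. "<your-enrollment>.palantirfoundry.com"
const FOUNDRY_TOKEN = process.env.FOUNDRY_TOKEN!;      // token with access to the stream
const STREAM_DATASET_RID = process.env.STREAM_DATASET_RID!;
const STREAM_BRANCH = process.env.STREAM_BRANCH ?? "master";

export async function publishRecord(record: Record<string, unknown>): Promise<void> {
  // Assumed path; check the Streams section of the platform API docs.
  const url =
    `https://${FOUNDRY_HOST}/api/v2/streams/datasets/${STREAM_DATASET_RID}` +
    `/streams/${STREAM_BRANCH}/publishRecord?preview=true`;

  const response = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FOUNDRY_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ record }),
  });

  if (!response.ok) {
    throw new Error(`Failed to publish record: ${response.status} ${await response.text()}`);
  }
}
```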

@malbin - thanks for the very detailed response!

Re: container spin-up - amazing! Excited to see the progress here. Historically, small data was actually better run outside Foundry (because why make a script take 2 minutes when it runs in 5 seconds locally). Looking forward to more here.

Re: compute modules - I’m assuming the tradeoff of having a Compute Module set up like this is that it negates the “scale up and down” nature of builds?

Re: Pipeline Builder - Thank you for acknowledging the lack of documentation here. It sounds like my intuition was correct: New incremental updates get pushed into the stream when in Stream mode. I was having a bit of trouble getting this to work in practice, but I am going to give it another go now that I know this is the intended behavior.

Once Python functions are GA, serverless, schedulable, and support egress, you could also use them.
… or use AWS Lambda today and PUT your updates through the S3-compatible API.
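If you go that route, here’s a minimal sketch of the Lambda side, assuming the AWS SDK v3 S3 client pointed at Foundry’s S3-compatible endpoint. The endpoint URL, bucket/key naming, and credential wiring are placeholders to confirm against the S3-compatible API docs, not exact values.

```typescript
// Minimal sketch of an AWS Lambda that PUTs a batch of updates into a Foundry
// dataset via the S3-compatible API. Endpoint, bucket/key naming and
// credentials below are assumptions for illustration; check the docs for the
// exact values on your enrollment.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({
  region: "foundry",                                      // placeholder region
  endpoint: process.env.FOUNDRY_S3_ENDPOINT!,             // assumed S3-compatible endpoint URL
  forcePathStyle: true,
  credentials: {
    accessKeyId: process.env.FOUNDRY_S3_ACCESS_KEY_ID!,   // credentials issued for the S3-compat API
    secretAccessKey: process.env.FOUNDRY_S3_SECRET_KEY!,
  },
});

export const handler = async (event: { records: unknown[] }) => {
  // Write the incoming records as newline-delimited JSON into the dataset.
  const body = event.records.map((r) => JSON.stringify(r)).join("\n");
  await s3.send(
    new PutObjectCommand({
      Bucket: process.env.FOUNDRY_DATASET_RID!,           // assumed: dataset identifier as the bucket name
      Key: `updates/${Date.now()}.jsonl`,                 // hypothetical key layout
      Body: body,
      ContentType: "application/x-ndjson",
    })
  );
  return { written: event.records.length };
};
```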

Are you referring to these Python Functions?

If so, is serverless + scheduling + egress on the roadmap?

To my knowledge, serverless and egress: yes. Scheduling: no idea 🙂

Hey sorry for the delayed response here, wanted to get you a visible code sample.

To answer the original question:

  1. How do I make the ingestion fast? This Compute Module is a good start.
  2. How do I make the transforms fast? Use Pipeline Builder in streaming mode on the output.

This is an example of a Compute Module as a data ingestor: it hits an endpoint every 10 seconds (you can ramp that up as high as you want) and writes the output to a streaming dataset. This Compute Module will run indefinitely, but may have periods of downtime (the host is recycled every 24 hours).

It leverages the TypeScript Compute Module SDK.
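Since the sample itself was shared as an image, here is a rough, self-contained sketch of the shape of that ingestor as a plain polling loop, not the exact SDK wiring. The source endpoint, environment variables, stream publish URL, and record shape are placeholders for illustration.

```typescript
// Rough sketch of a Compute Module-style ingestor: poll a source API every
// 10 seconds and push any new records into a Foundry raw stream. The source
// endpoint, environment variables and stream publish URL are placeholders,
// not the exact sample described above.
const SOURCE_URL = process.env.SOURCE_URL!;          // hypothetical "latest changes" endpoint
const STREAM_URL = process.env.STREAM_PUBLISH_URL!;  // e.g. the Streams publishRecord endpoint (assumed)
const FOUNDRY_TOKEN = process.env.FOUNDRY_TOKEN!;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function publish(record: Record<string, unknown>): Promise<void> {
  const response = await fetch(STREAM_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${FOUNDRY_TOKEN}`, "Content-Type": "application/json" },
    body: JSON.stringify({ record }),
  });
  if (!response.ok) throw new Error(`Publish failed: ${response.status}`);
}

async function main(): Promise<void> {
  // Run indefinitely; host recycling mentioned above is handled by the platform.
  while (true) {
    try {
      const latest = (await fetch(SOURCE_URL).then((r) => r.json())) as Record<string, unknown>[];
      for (const record of latest) {
        await publish(record);
      }
    } catch (err) {
      console.error("Ingest tick failed, will retry on next interval", err);
    }
    await sleep(10_000); // poll every 10 seconds, as in the example described above
  }
}

main();
```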

We will go open beta with Compute Modules by the end of the month, enabling this capability with SDKs for Python and TypeScript, but you can use our platform APIs from any client you want.

In response to losing scaling: this is true. In its current form you need to pin your replicas to a minimum and maximum of 1. If your reason for wanting better scaling is throughput, I’d recommend looking at container transforms post-ingestion. If your concern is cost, you can provision the Compute Module with 0.1 CPUs to minimise spend.
