Co-Opting Streaming PBs for Faster Index Times?

Heyo PB team, reaching out with a (potentially hacky) question/ask.

We read data from the client’s SQL DB in near-real-time. Right now this is an incremental pipeline that produces the SNAPSHOT backing datasets behind our objects.

These objects are expected to be as live as possible, and we’re currently working to reduce E2E pipeline latency.

One roadblock that I’ve been thinking about overcoming is long Ontology index times, and I’ve been wondering if there’s a way to leverage streaming-backed objects for this.

On this train of thought, I’ve tried re-implementing our snapshot backing dataset transforms (which live in transforms code right now) in a streaming PB, just to see what the output would look like. The inputs are all in snapshot mode instead of stream.

The resultant dataset is…empty.

Any thoughts here as to why? Is this even something that would work? How else could I “hijack” faster funnel index times with streaming backed objects to get our E2E latency down?

Thankies

To decide between streaming and batch (streaming vs batch comparison), I’d start from your E2E pipeline latency goal. If it’s sub-5/10 minutes, you likely do need streaming. However, if you’re comfortable with a bit longer (say 15 minutes), you should be able to do this in batch.

One option off the bat that would improve your E2E pipeline time is to make your pipeline output incremental; that would expedite the Ontology sync / indexing job.
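To make the incremental-output idea concrete, here’s a minimal hand-rolled sketch in plain Python. Everything here (`incremental_sync`, the watermark, the row tuples) is made up for illustration and is not a Foundry API; a real Foundry pipeline would express this with the incremental transforms decorator rather than a manual watermark:

```python
def incremental_sync(source_rows, output_rows, watermark):
    """Append only rows newer than the last processed watermark.

    source_rows: list of (ts, payload) tuples, the upstream dataset
    output_rows: list the backing dataset is built from (mutated in place)
    watermark:   highest ts already written/indexed

    Returns the new watermark. Because each run writes only the delta
    instead of rewriting the whole snapshot, the downstream Ontology
    sync only has to index the new rows.
    """
    new_rows = [r for r in source_rows if r[0] > watermark]
    output_rows.extend(new_rows)  # APPEND-style write: only the delta lands
    return max((r[0] for r in new_rows), default=watermark)
```

The first run indexes everything; subsequent runs touch only the delta, which is what keeps the sync/indexing step small.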

Ack

Goal is definitely sub 10 minutes

I’m worried about full E2E incrementality for maintainability reasons. It’s nice to have near-E2E incrementality and then snapshot the backing dataset; it allows more schema flexibility/filtering etc. without crazy complicated code.
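The maintainability point can be sketched in one line. With a SNAPSHOT final output, the whole backing dataset is recomputed each run, so changing a filter (or schema) retroactively applies to all history for free; a fully incremental output would need explicit backfill logic for the same change. The names here (`snapshot_build`, `keep`) are illustrative only:

```python
def snapshot_build(all_source_rows, keep):
    """Rebuild the full backing dataset from scratch each run.

    Because nothing is carried over from previous runs, a new `keep`
    predicate (a filter or schema change) covers old rows too, with no
    backfill code needed.
    """
    return [row for row in all_source_rows if keep(row)]
```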

I think it’s doable with snapshot, but it will be hard. What is the scale of your datasets? How long do the pipelines currently take?

Datasets/objects are relatively small by Spark/Highbury standards: from a few hundred to ~1.5M rows

Including Ontology sync times, it takes ~15-20 minutes right now