Heyo PB team, reaching out with a (potentially hacky) question/ask.
We read data in near-real-time from a client’s SQL DB. Right now this is an incremental pipeline that produces SNAPSHOT backing datasets, which back our objects.
These objects are expected to be as live as possible, and we’re currently working to reduce E2E pipeline latency.
One roadblock I’ve been trying to overcome is long Ontology indexing times, and I’ve been wondering whether there’s a way to leverage streaming-backed objects for this.
On this train of thought, I tried re-implementing our snapshot backing-dataset transforms (which currently live in transforms code) in a streaming PB pipeline, just to see what the output would look like. The inputs are all in snapshot mode rather than stream mode.
The resultant dataset is…empty.
Any thoughts as to why? Is this even something that would work? How else could I “hijack” faster Funnel indexing times via streaming-backed objects to get our E2E latency down?
To decide whether you want streaming or batch (streaming vs batch comparison), I’d first pin down your E2E pipeline latency goal. If it’s sub-5/10 minutes, you likely do need streaming. However, if you’re comfortable with a bit longer (say ~15 minutes), you should be able to do this in batch.
One option off the bat that would improve your E2E pipeline time: make your pipeline output incremental. That would expedite the Ontology sync / indexing job, since Funnel only has to index the new rows rather than re-indexing the full snapshot.
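For reference, here’s a minimal sketch of what that could look like in transforms code, assuming the standard `transforms.api` decorators — the dataset paths and the passthrough body are placeholders, not your actual pipeline:

```python
from transforms.api import Input, Output, incremental, transform_df

# Placeholder paths -- substitute your real dataset paths/RIDs.
@incremental()
@transform_df(
    Output("/Project/datasets/clean_output"),
    source=Input("/Project/datasets/raw_sql_sync"),
)
def compute(source):
    # Under @incremental, `source` resolves to only the rows added since
    # the last successful build, and the output is appended to rather
    # than rewritten as a full SNAPSHOT -- so the downstream Ontology
    # sync only needs to pick up the new rows.
    return source
```

The key change versus a snapshot transform is just the `@incremental()` wrapper plus making sure the logic is append-safe (no aggregations or joins that need the full history).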
I’m worried about full E2E incrementality for maintainability reasons. It’s nice to have near-E2E incrementality and then snapshot the backing dataset; that allows more schema flexibility, filtering, etc. without crazy complicated code.