Live stream data using a code repository, NOT Pipeline Builder

I have a somewhat convoluted question about streaming data that is fed through a data transformation in a code repository, and the latency of that data. I will lay out my current prototype project structure and then follow up with my questions. I have put together a practice project that ingests MQTT data from a broker into a stream, which in turn is the input for a data transformation in a code repository, which ultimately outputs a clean, parsed dataset. (In the transformation, I have the input set to the main RID for the stream.) This cleaned dataset is the backing dataset for a telemetry device object in the ontology. I do not want to use the form-based Pipeline Builder, but for a stream it gives you the option of running a streaming pipeline so that you get live or near-live transformation of the input stream. I am looking for near-minute latency on my data, so I am scheduling my data transformation to build once a minute. The problem I seem to be encountering is that the latency from the stream to the data transformation is much higher than expected, as if the transform is reading the stream's archival dataset rather than the live stream data.
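Roughly, the transform in the repository looks like the minimal sketch below (standard Python transforms API; the stream RID, output path, and column names are placeholders for my actual resources):

```python
from transforms.api import transform_df, Input, Output


# Minimal sketch of the batch transform: the input is the stream's dataset RID,
# the output is the cleaned dataset that backs the telemetry object.
# The RID, output path, and columns are placeholders.
@transform_df(
    Output("/Project/datasets/telemetry_clean"),
    raw=Input("ri.foundry.main.dataset.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"),
)
def compute(raw):
    # Parse the raw MQTT payload into typed columns (parsing details elided).
    return raw.select("value", "timestamp")
```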

  • Is this the correct way to go about creating a “streaming-ish” pipeline using a code repository as opposed to Pipeline Builder?

  • What is the latency between a stream and the data transform? Should I be using the “RID” or the “View RID” as the input for the data transformation?

  • If I want better latency than a dataset build once per minute, how do I go about creating a live streaming pipeline with a code repository, not Pipeline Builder?

  • Also, is there documentation you can point me towards on how to process only the latest streaming data and append it to a dataset, so that I am not reprocessing all past entries every time the build runs? (The sketch after this list is roughly what I have in mind.)
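For that last bullet, what I picture is something like an incremental transform (a minimal sketch, assuming the @incremental decorator from the transforms API; the paths and RIDs are placeholders):

```python
from transforms.api import transform, incremental, Input, Output


# Minimal sketch of an incremental version: when running incrementally, the
# input dataframe only contains rows added since the last build, and the
# output dataset is appended to rather than fully rewritten.
@incremental()
@transform(
    clean=Output("/Project/datasets/telemetry_clean"),
    raw=Input("ri.foundry.main.dataset.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"),
)
def compute(clean, raw):
    new_rows = raw.dataframe()  # only the unprocessed rows in incremental mode
    clean.write_dataframe(new_rows)  # default write mode appends to the output
```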

I hope that this is relatively understandable, and I appreciate all of the community's help in advance.

Hello @ahoffbauer127,

At the moment, streaming pipelines can only be defined using Pipeline Builder. I recommend reading the documentation on streaming transforms; some stateful transforms are supported in streaming.

In a code repository, you can only develop and publish custom Flink transformations, which you can then import into Pipeline Builder.

The archiving process from the live stream to the archive dataset is triggered every 10 minutes, so as soon as you switch to the “batch world” you get 10+ minutes of latency. Sometimes you have no other choice, for example if you want to join two streaming datasets.

Regards,

Ben

If you truly want to use code, you can probably do this in a compute module. When setting up a module, you can configure it in pipeline mode; the documentation there describes how to read from and write to the stream directly instead of going through the archive.
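To sketch the shape of that: a pipeline-mode compute module is essentially a long-running process that polls the input stream and pushes transformed records to the output stream, rather than running on a build schedule. The read_stream_records and write_stream_record functions below are hypothetical placeholders for the read/write calls the compute module documentation describes, not real SDK functions:

```python
import json
import time


def read_stream_records(input_alias):
    """Hypothetical placeholder for the stream-read call documented for
    pipeline-mode compute modules; returns a list of new records."""
    raise NotImplementedError


def write_stream_record(output_alias, record):
    """Hypothetical placeholder for the stream-write call."""
    raise NotImplementedError


def parse(record):
    # Placeholder parsing of the raw MQTT payload into the cleaned shape.
    payload = json.loads(record["value"])
    return {"device_id": payload.get("device_id"), "reading": payload.get("reading")}


def main():
    # Poll the input stream, transform each new record, and push it to the
    # output stream, so latency is seconds rather than one build per minute.
    while True:
        for record in read_stream_records("telemetry-input"):
            write_stream_record("telemetry-clean", parse(record))
        time.sleep(1)


if __name__ == "__main__":
    main()
```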

Out of curiosity, what is pushing you to use a code repo instead of pipeline builder?

The example compute module that is given seems to read the same data over and over. Is reading the stream really supported through the API?