My current understanding is that I have to use some dataset, let's say a spreadsheet, as input for Pipeline Builder.
I want to make a streaming pipeline that constantly analyzes an object set and then transforms that data into another object. How can I go about doing this? I know objects need a backing datasource, but if I use a Create Object action, the new record won't go into the datasource. So how can I have a pipeline look at objects instead of datasets?
As a first-order answer to your question: object types need to be materialized to an output dataset before you can work with them in a transform.
If you can explain a bit more about the workflow that you’re looking to enable, maybe with a user story about what the end-user needs to accomplish or experience working with the final tool, I expect there might be other solution design patterns that better match your needs.
Thanks for the response. Ideally, what I would like to do is have two objects, A and B, and then have the pipeline concatenate the two into object A-B, with some additional computed columns.
What I have noticed is that any changes made to the backing datasource will be reflected in the object set, but not the other way around.
Yes, the materialization model in the Ontology specifies a separate output dataset for each object type.
So in your case you’d have
Input dataset A → Object Type A → Output dataset A
Input dataset B → Object Type B → Output dataset B
And then in Pipeline Builder, or wherever else, you can join the output datasets, do additional derivation or analysis, and even produce a new dataset C that backs another object type.
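If it helps to see the shape of that downstream step, here is a minimal sketch of what it could look like as a Python transform in a code repository instead of Pipeline Builder. The dataset paths, the join key "id", and the derived column are placeholders rather than anything from your project:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/derived/dataset_c"),           # hypothetical output dataset C
    object_a=Input("/Project/ontology/output_a"),   # materialized output dataset of object type A
    object_b=Input("/Project/ontology/output_b"),   # materialized output dataset of object type B
)
def compute(object_a, object_b):
    # Join the two materialized object datasets on a shared key (placeholder "id"),
    # then add whatever derived columns the analysis needs.
    joined = object_a.join(object_b, on="id", how="inner")
    return joined.withColumn("processed_at", F.current_timestamp())
```

Dataset C can then back another object type, exactly as described above.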
The other factor to take into account in this setup is scheduling and latency. You can set up the output datasets to materialize on every object set edit, and then set up the pipeline to run whenever there is a new transaction on the output dataset - this gets you a "push"-style pipeline that runs on every edit.
The piece I still don't follow is why you're creating object types for A and B at all. Do you need some user input or edits collected from an application view? If you're just looking to take the input datasets A and B and apply some logic to produce output C, there isn't any reason to detour through the Ontology for the computation. You can still have object types A and B and a pipeline that creates C.
If you go one step less abstract and talk in terms of what you actually need to accomplish (the shape of your data, and who will interact with the various inputs and outputs, and where), rather than staying at the abstract level, I can help a bit more with the solution design.
I see. Now that you mention it, you are correct that I may not need to convert the datasets into objects at all.
I wanted to combine two datasets, one with a list of job listings and the other with a list of candidate records from recently finished interviews, into one object that concatenates the two based on which job each candidate applied for.
Since the candidate dataset will only be updated in dataset form, I actually don't need to convert it to an object. However, the list of jobs can be updated via Workshop, so I still think I would need to convert that one to an object.
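For reference, if I end up writing that join as a code transform rather than in Pipeline Builder, I imagine it would look roughly like this; the paths, the join key "job_id", and the "interview_date" column are just placeholders for my actual schema:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/derived/candidates_by_job"),   # hypothetical combined output
    jobs=Input("/Project/ontology/jobs_output"),    # output dataset of the jobs object type (edited in Workshop)
    candidates=Input("/Project/raw/candidates"),    # plain candidate dataset, never edited as objects
)
def compute(jobs, candidates):
    # Attach each candidate record to the job listing they applied for,
    # assuming both sides share a "job_id" column (placeholder name).
    combined = candidates.join(jobs, on="job_id", how="left")
    # Example of an additional computed column: flag recently finished interviews.
    return combined.withColumn(
        "interviewed_recently",
        F.datediff(F.current_date(), F.col("interview_date")) <= 30,
    )
```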
Thank you for the feedback! I really appreciate your help!