How to change output format to Parquet after materializing object in Pipeline Builder

I created an object in Pipeline Builder and materialized it, but the output is not stored in Parquet format.

I can’t find where to set the output format to Parquet. Where exactly can I configure this?

I’m not using Code Repositories - just the standard visual pipeline

Hey @27207436a3fb2f15a430, the default backing dataset for batch ontology objects from Pipeline Builder should have parquet files. Can you share what you see when you go to your object → open → view backing dataset and then in the dataset view, go to the details tab → files?

See screenshots below for the steps outlined above (you’ll see in the second screenshot there is a list of .parquet files)

Are you maybe talking about the materialization dataset of your object type? That is a View and the backing file format is not exposed to users.

If you need your materialization as parquet you would need to build a pipeline on top to convert it.

Yes, I was referring to the materialization dataset of the object.
How can I convert the materialization to Parquet format?

You can’t (as I said above).

You can build a downstream transform with the materialization as input, using Code Repositories or Pipeline Builder, and have that update on a schedule.
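For example, a minimal Code Repositories transform that just copies the materialization into a regular dataset could look something like this (untested sketch; the dataset paths and names are placeholders you’d swap for your own). The output of such a transform is an ordinary Foundry dataset, which is backed by Parquet files by default:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    # Placeholder paths: point these at your object's materialization
    # dataset and at wherever the Parquet-backed copy should live.
    Output("/Your/Project/datasets/my_object_parquet_copy"),
    source=Input("/Your/Project/datasets/my_object_materialization"),
)
def copy_materialization(source):
    # Plain pass-through: the transform output is a regular dataset,
    # written as Parquet files like any other batch transform output.
    return source
```

Put that build on a schedule and the Parquet copy stays in sync with the materialization.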

Why do you specifically want parquet format? For usage outside of Foundry, I assume?

We originally built the pipelines using the Batch pipeline setup, which processes all the data on each run.

Now we want to switch to Streaming, since the data is coming in more frequently and we want to process it in near real time.

From what I understand, we’ll need to convert the materialized dataset into Parquet format to make it compatible with the streaming environment.

The issue is that the materialization dataset does not give you incremental updates, which means you would first need a downstream incremental transform to detect deltas.
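As a rough, untested sketch of what that delta detection could look like in a Code Repositories incremental transform (assuming an append-only object with a stable key column; the paths and the `primary_key` column name are placeholders):

```python
from transforms.api import transform, incremental, Input, Output


@incremental(snapshot_inputs=["source"])
@transform(
    out=Output("/Your/Project/datasets/my_object_deltas"),
    source=Input("/Your/Project/datasets/my_object_materialization"),
)
def detect_deltas(source, out):
    # The materialization arrives as a full snapshot on every build.
    current = source.dataframe()

    # Compare against what this transform has already written out.
    previous = out.dataframe("previous", current.schema)

    # Rows in the current snapshot that have not been seen before.
    new_rows = current.join(previous, on="primary_key", how="left_anti")

    # Append only the new rows instead of rewriting the whole output.
    out.set_mode("modify")
    out.write_dataframe(new_rows)
```

Note this only catches inserts; updates and deletes to existing objects would need extra handling (e.g. comparing a row hash or tracking a last-modified timestamp).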

What are your latency requirements for the end-to-end data flow? I have seen lightweight Pipeline Builder pipelines able to keep up with <2 minute latency requirements. That would mean you don’t require streaming datasets.

What’s consuming your data at the end of the pipeline?