How to change output format to Parquet after materializing object in Pipeline Builder

I created an object in Pipeline Builder and materialized it, but the output is not stored in Parquet format.

I can’t find where to set the output format to Parquet. Where exactly can I configure this?

I’m not using Code Repositories - just the standard visual pipeline

Hey @27207436a3fb2f15a430, the default backing dataset for batch ontology objects from Pipeline Builder should have parquet files. Can you share what you see when you go to your object → open → view backing dataset and then in the dataset view, go to the details tab → files?

See screenshots below for the steps outlined above (you’ll see in the second screenshot there is a list of .parquet files)

Are you maybe talking about the materialization dataset of your object type? That is a View and the backing file format is not exposed to users.

If you need your materialization as parquet you would need to build a pipeline on top to convert it.

Yes, I was referring to the materialization dataset of the object.
How can I convert the materialization to Parquet format?

You can’t (as I said above).

You can build a downstream transform with the materialization as input, using Code Repositories or Pipeline Builder, and have that update on a schedule.
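For example, a minimal Code Repositories transform that just copies the materialization into a regular dataset could look something like this (untested sketch; the dataset paths and names are placeholders you’d swap for your own). The output of such a transform is an ordinary Foundry dataset, which is backed by Parquet files by default:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    # Placeholder paths: point these at your object's materialization
    # dataset and at wherever the Parquet-backed copy should live.
    Output("/Your/Project/datasets/my_object_parquet_copy"),
    source=Input("/Your/Project/datasets/my_object_materialization"),
)
def copy_materialization(source):
    # Plain pass-through: the transform output is a regular dataset,
    # written as Parquet files like any other batch transform output.
    return source
```

Put that build on a schedule and the Parquet copy stays in sync with the materialization.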

Why do you specifically want parquet format? For usage outside of Foundry, I assume?

We originally built the pipelines using the Batch pipeline setup, which processes all the data on each run.

Now we want to switch to Streaming, since the data is coming in more frequently and we want to process it in near real time.

From what I understand, we’ll need to convert the materialized dataset into Parquet format to make it compatible with the streaming environment.

The issue is that the materialization dataset does not give you incremental updates, which means you would first need a downstream incremental transform to detect deltas.
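As a rough, untested sketch of what that delta detection could look like in a Code Repositories incremental transform (assuming an append-only object with a stable key column; the paths and the `primary_key` column name are placeholders):

```python
from transforms.api import transform, incremental, Input, Output


@incremental(snapshot_inputs=["source"])
@transform(
    out=Output("/Your/Project/datasets/my_object_deltas"),
    source=Input("/Your/Project/datasets/my_object_materialization"),
)
def detect_deltas(source, out):
    # The materialization arrives as a full snapshot on every build.
    current = source.dataframe()

    # Compare against what this transform has already written out.
    previous = out.dataframe("previous", current.schema)

    # Rows in the current snapshot that have not been seen before.
    new_rows = current.join(previous, on="primary_key", how="left_anti")

    # Append only the new rows instead of rewriting the whole output.
    out.set_mode("modify")
    out.write_dataframe(new_rows)
```

Note this only catches inserts; updates and deletes to existing objects would need extra handling (e.g. comparing a row hash or tracking a last-modified timestamp).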

What are your latency requirements for the end-to-end data flow? I have seen lightweight Pipeline Builder pipelines able to keep up with <2 minute latency requirements. That would mean you don’t require streaming datasets.

What’s consuming your data at the end of the pipeline?