Locally Export Large Dataset from Pipeline Builder

geoffs · April 24, 2025, 6:29pm

I have a 13-million-row output dataset that I need to download locally. I tried setting the dataset write format to Parquet, but I’m not exactly sure what that does. Does anyone have any advice?

david · April 24, 2025, 6:41pm

Hey there! Parquet is an open-source columnar data format used by Spark. You can check out the Apache docs for more details (looks like I can’t post links, but a quick Google search will get you there).

Are you experiencing a specific issue with downloading data from the dataset? If you have too many files, you can try running a “coalesce” down to 2 or 3 to reduce the number of files you have to download.

In general, you can download the files in a dataset by navigating to the dataset view of the output, then “details → files → download”.
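If the output was written as CSV and you still end up with several part files after downloading, one option is to stitch them together locally. This is only a rough sketch using the Python standard library: `merge_csv_parts` is a made-up helper name, and it assumes every part file starts with the same header row (which is how Spark normally writes CSV when headers are enabled).

```python
import csv
from pathlib import Path


def merge_csv_parts(part_dir, merged_path):
    """Concatenate every *.csv file in part_dir into merged_path.

    Assumes each part file begins with the same header row; the header
    is written once and skipped in all later parts.
    Returns the number of data rows written.
    """
    merged = Path(merged_path).resolve()
    # Sort for a stable order and skip a merged file left over from an earlier run.
    parts = sorted(p for p in Path(part_dir).glob("*.csv") if p.resolve() != merged)

    rows_written = 0
    header_written = False
    with open(merged, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # empty part file, nothing to copy
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `merge_csv_parts("downloads/my_dataset", "downloads/my_dataset/merged.csv")` (both paths are placeholders) streams the rows one at a time, so it should cope with millions of rows without loading everything into memory.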

geoffs · April 24, 2025, 7:07pm

I wasn’t aware of the last point - downloading directly rather than converting to CSV. Thanks!

VenkatPolimera · April 26, 2025, 2:00pm

The “details → files → download” option helps to download the data, but the result is multiple files (dataset splits). Is there an option to download it as a single, consolidated data file?

nicornk · April 26, 2025, 3:45pm

If you have too many files, you can try running a “coalesce” down to 1 to reduce the number of files you have to download.

It’s in the docs:

https://www.palantir.com/docs/foundry/code-repositories/prepare-datasets-download/#coalesce-partitions-and-set-output-format
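For reference, here is roughly what that looks like in a Python transform. Treat it as a sketch under assumptions rather than a copy of the docs: `write_single_file` is a hypothetical helper name, `df` is assumed to be a Spark DataFrame, and `output` is assumed to be the transform output object with a `write_dataframe` method that accepts `output_format` and `options`, as described on the linked page.

```python
def write_single_file(output, df, output_format="csv"):
    """Collapse df to a single partition and write it to the output dataset.

    Hypothetical helper: `output` is assumed to expose
    write_dataframe(df, output_format=..., options=...).
    """
    # coalesce(1) merges all partitions into one, so Spark writes one data file.
    single_partition = df.coalesce(1)

    write_kwargs = {"output_format": output_format}
    if output_format == "csv":
        # Include a header row so the downloaded CSV is self-describing.
        write_kwargs["options"] = {"header": "true"}

    output.write_dataframe(single_partition, **write_kwargs)
```

Inside a transform you would call this from the compute function with the output and the input's dataframe. Keep in mind that with a single partition one task writes all 13 million rows, so the build may take noticeably longer than usual.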

VenkatPolimera · April 26, 2025, 8:46pm

Cool. Got it. Thank you.