I have a 13 million row output dataset that I need to be able to download locally. I tried setting the dataset write format to Parquet, but I’m not exactly sure what that does. Does anyone have any advice?
Hey there! Parquet is an open source columnar data format used by Spark! You can check out the Apache docs for more details (looks like I can’t post links, but a quick Google search will get you there).
Are you experiencing a specific issue with downloading data from the dataset? If you have too many files, you can try running a “coalesce” down to 2 or 3 to reduce the number of files you have to download.
And in general you can download the files in a dataset by navigating to the dataset view of the output, then “details → files → download”
I wasn’t aware of the last point - downloading directly rather than converting to CSV. Thanks!
The details → files → download option helps to download the data, but it gives multiple files (dataset splits). Is there an option to download a single, consolidated data file?
If you have too many files, you can try running a “coalesce” down to 1 to reduce the number of files you have to download.
It’s in the docs:
https://www.palantir.com/docs/foundry/code-repositories/prepare-datasets-download/#coalesce-partitions-and-set-output-format
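For anyone landing here later, a minimal sketch of the coalesce approach, not a definitive implementation. The helper name `coalesce_for_download` and the dataset paths are made up for illustration. The helper only relies on the PySpark `DataFrame.coalesce` method. The Foundry wiring is left in comments because `transforms.api` is only importable inside a Foundry code repository, and the `write_dataframe` arguments shown there are my reading of the linked docs page, so please check them against it.

```python
def coalesce_for_download(df, num_files=1):
    """Reduce df to at most num_files partitions so it is written as that many files.

    df is expected to be a PySpark DataFrame (anything exposing .coalesce works).
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    # coalesce merges existing partitions without a full shuffle;
    # it can lower the partition count but never raise it.
    return df.coalesce(num_files)


# Hypothetical wiring inside a Foundry Python transform (paths are placeholders):
#
# from transforms.api import transform, Input, Output
#
# @transform(
#     out=Output("/path/to/output_for_download"),
#     source=Input("/path/to/input"),
# )
# def compute(out, source):
#     df = coalesce_for_download(source.dataframe(), num_files=1)
#     out.write_dataframe(df, output_format="csv", options={"header": "true"})
```

With `num_files=1` the whole dataset is written by a single task, so for something like 13 million rows the build may be slower and the resulting file fairly large. Coalescing to 2 or 3 files, as suggested earlier in the thread, is a reasonable middle ground.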
Cool. Got it. Thank you.