Hi all,
how do I partition outputs by specific columns in Pipeline Builder? I would like to do something similar to this line from Code Repositories:
output.write_dataframe(df, partition_cols=['col'])
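For context, the full transform I have in mind looks roughly like this (the dataset paths are just placeholders):

from transforms.api import transform, Input, Output

@transform(
    output=Output("/folder/output_dataset"),
    source=Input("/folder/input_dataset"),
)
def compute(output, source):
    df = source.dataframe()
    # Write the output Hive-partitioned by the 'col' column
    output.write_dataframe(df, partition_cols=['col'])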
Thanks!
There is a transform called Repartition Data that can be used here.
Hi bkaplan, I'm aware of repartition() and partitionBy() in PySpark, and it's the second one I'm looking for. In Code Repositories, using the @transform() decorator with output.write_dataframe(df, partition_cols=["year"]), the files I get in the Details tab of the dataset look like this:
spark/year=2023/part-00000-372287f2-c241-4198-a0b0-47002fdf7c4e.c000.snappy.parquet
spark/year=2024/part-00000-372287f2-c241-4198-a0b0-47002fdf7c4e.c000.snappy.parquet
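To illustrate the distinction I mean, here is a plain PySpark sketch (the output path and sample data are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2023, 1), (2024, 2)], ["year", "value"])

# repartition() only shuffles rows across in-memory partitions;
# it does not control the directory layout of the written files.
df = df.repartition("year")

# partitionBy() on the DataFrameWriter is what produces the
# Hive-style year=<value>/ directories shown above.
df.write.partitionBy("year").parquet("/tmp/partitioned_output")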
With the Repartition Data transform in Pipeline Builder I don't get files with that pattern; it seems to shuffle the data to optimize the pipeline rather than control how the output dataset is written. Is that right?
Thank you
Unfortunately, Hive partitioning is not currently supported on Pipeline Builder outputs. You can, however, add a Projection on the output, which may be sufficient for your use case. See my earlier answer for a discussion of Projections vs. Hive partitioning.
Thank you for the clarifications
Regards