Hi all,
how do I partition outputs by specific columns in Pipeline Builder? I would like to do something similar to this line from Code Repositories:
output.write_dataframe(df, partition_cols=['col'])
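For context, the full transform I have in mind looks roughly like this (the dataset paths are just placeholders):

from transforms.api import transform, Input, Output

@transform(
    output=Output("/folder/output_dataset"),
    source=Input("/folder/input_dataset"),
)
def compute(output, source):
    df = source.dataframe()
    # Write the output Hive-partitioned by the 'col' column
    output.write_dataframe(df, partition_cols=['col'])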
Thanks!
There is a transform called Repartition Data that can be used here.
Hi bkaplan, I'm aware of repartition() and partitionBy() in PySpark, and it's the second one I'm looking for. In Code Repositories, using the @transform() decorator with output.write_dataframe(df, partition_cols=["year"]), the files I get in the Details tab of the dataset look like this:
spark/year=2023/part-00000-372287f2-c241-4198-a0b0-47002fdf7c4e.c000.snappy.parquet
spark/year=2024/part-00000-372287f2-c241-4198-a0b0-47002fdf7c4e.c000.snappy.parquet
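To illustrate the distinction I mean, here is a plain PySpark sketch (the output path and sample data are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2023, 1), (2024, 2)], ["year", "value"])

# repartition() only shuffles rows across in-memory partitions;
# it does not control the directory layout of the written files.
df = df.repartition("year")

# partitionBy() on the DataFrameWriter is what produces the
# Hive-style year=<value>/ directories shown above.
df.write.partitionBy("year").parquet("/tmp/partitioned_output")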
With the Repartition Data transform in Pipeline Builder I don't get files with that pattern; it seems to shuffle the data to optimize the pipeline rather than control how the output dataset is written. Is that right?
Thank you
Unfortunately, Hive partitioning is not currently supported on Pipeline Builder outputs. You can, however, add a Projection on the output, which may be sufficient for your use case. See my earlier answer for a discussion of Projections vs. Hive partitioning.
Thank you for the clarifications
Regards