I am writing a data pipeline that takes as input a 2 TB dataset partitioned by a column my_col. I want to apply a regex to a column and save the output still partitioned by my_col, but with some of the files coalesced according to what Spark thinks is optimal (since some of the files are only a couple hundred KB each). Is there a more efficient way to do this than the following:
from pyspark.sql import functions as F

def compute(timeseries, out):
    df = timeseries.dataframe().withColumn(
        'channel_id',
        F.regexp_extract(F.col('timeseries_id'), r'^([^\|]*\|\|)', 1)
    )
    # Write back out, keeping the on-disk partitioning by my_col
    out.write_dataframe(df, partition_cols=['my_col'])
The reason I am asking is that I am seeing a huge amount of disk spill even though the data should already be partitioned correctly; if I add df.repartition('my_col') before the write, the disk spill goes to zero, but the shuffle writes become very large.
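For concreteness, this is the variant I mean with the explicit repartition (a minimal sketch, using the same placeholder names as above):

    from pyspark.sql import functions as F

    def compute(timeseries, out):
        df = timeseries.dataframe().withColumn(
            'channel_id',
            F.regexp_extract(F.col('timeseries_id'), r'^([^\|]*\|\|)', 1)
        )
        # Explicit shuffle by the partition column: the disk spill disappears,
        # but this shuffle write is itself very large.
        df = df.repartition('my_col')
        out.write_dataframe(df, partition_cols=['my_col'])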
When running a build like this, I also see that the query plan isn't making use of the input's partitioning (the file scan node shows: ..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ...).
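In case it is relevant, this is roughly how I am looking at the plan (the explain call below is just a sketch; the same details also show up in the Spark UI's SQL tab):

    # Print the parsed/analyzed/optimized/physical plans; the file scan line in
    # the physical plan carries the PartitionFilters / PushedFilters shown above.
    timeseries.dataframe().explain(True)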