Hello,
I have a non-repartitioned dataset that I'm trying to snapshot in order to repartition it by a column, and I'm getting the following error:
```
Operation MSCK REPAIR TABLE is not allowed because it is not a partitioned table.
```
Does anyone know what the issue is?
Best Regards,
Soufiane
MSCK REPAIR TABLE is for the Hive metastore, which is not used on Foundry.
https://spark.apache.org/docs/3.5.1/sql-ref-syntax-ddl-repair-table.html
Hello,
In my code I'm not doing the repair, I'm just calling write_dataframe(df, partition_cols=['col']), so does that mean I can't repartition my dataset if it isn't already partitioned?
Using partition_cols works even without repartitioning beforehand.
partition_cols doesn't trigger a shuffle of the data; it just splits each existing Spark partition by the column before writing it to disk.
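To illustrate the difference, here is a minimal sketch, assuming the standard Foundry transforms API (the dataset paths and column name are placeholders):
```
from transforms.api import transform, Input, Output


@transform(
    out=Output("/path/to/output"),    # placeholder path
    source=Input("/path/to/source"),  # placeholder path
)
def compute(out, source):
    df = source.dataframe()
    # partition_cols only controls the on-disk layout: each existing Spark
    # partition is split by 'col' at write time, with no shuffle involved.
    out.write_dataframe(df, partition_cols=['col'])
    # To colocate all rows sharing a 'col' value in one Spark partition,
    # an explicit repartition (a shuffle) would be needed first, e.g.:
    #   out.write_dataframe(df.repartition('col'), partition_cols=['col'])
```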
As for this issue with REPAIR TABLE, are you doing an incremental transform? Can you share the code?
Hello,
here is the code I'm running:
```
import datetime

from transforms.api import transform, Output


@transform(
    out_placeholder=Output("placeholder"),
    batch_dates=Output("batch_dates"),
    out=Output("data_to_be_repartitionned"),
)
def compute(ctx, out, out_placeholder, batch_dates):
    history_schema = my_schema  # defined elsewhere
    begin_date = 'begin_date'
    last_date = 'end_date'
    init_state = True
    DAYS_PER_RUN = 90
    if init_state:
        out_placeholder.set_mode('replace')
        # Snapshot the previous version of the output before repartitioning.
        # The original snippet read history_df.dataframe(...), which uses
        # history_df before it is defined; reading from `out` is presumably
        # the intent.
        history_df = out.dataframe('previous', history_schema)
        out_placeholder.write_dataframe(history_df)
    start_date = begin_date
    end_date = str(datetime.datetime.strptime(start_date, '%Y-%m-%d').date()
                   + datetime.timedelta(days=DAYS_PER_RUN - 1))
    to_write_df = history_df.filter("doing some filters")
    column_name = 'my_col'
    out.set_mode('replace')
    write_partitionned_df(out, to_write_df, column_name, ctx, 7)
```
and write_partitionned_df is essentially doing a repartition(n_files, column_name) followed by a write with partition_cols=[column_name]
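For reference, here is a minimal sketch of what that helper could look like; the name and signature come from the call in the snippet above, but the body is an assumption based on the description:
```
def write_partitionned_df(out, df, column_name, ctx, n_files):
    # Assumed body: shuffle rows so that values of column_name are colocated
    # across n_files Spark partitions, then write with a matching on-disk
    # (Hive-style) partitioning.
    repartitioned = df.repartition(n_files, column_name)
    out.write_dataframe(repartitioned, partition_cols=[column_name])
```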
So the idea is to take a copy of my dataset before repartitioning, output it to a dataset I call placeholder, then repartition a batch of this dataset and output it to my current dataset (I'm also writing a dataset of dates, but that's not important).
Could you also share the full stack trace of the error, please?
Hello, I just found someone else who has the same issue; the logs and even additional information are posted here:
https://community.palantir.com/t/partition-a-big-incremental-dataset/1150