Hello,
I have a non-repartitioned dataset that I'm trying to snapshot in order to repartition it by a column, and I'm getting the following error:
```
Operation MSCK REPAIR TABLE is not allowed because it is not a partitioned table.
```
Does anyone know what the issue is?
Best Regards,
Soufiane
MSCK REPAIR TABLE is for the Hive metastore, which is not used on Foundry.
https://spark.apache.org/docs/3.5.1/sql-ref-syntax-ddl-repair-table.html
Hello,
In my code I'm not doing the repair, I'm just calling write_dataframe(df, partition_cols=['col']), so does that mean I can't repartition my dataset if it isn't already partitioned?
Using partition_cols works even without repartitioning beforehand.
partition_cols doesn't trigger a shuffle of the data; it just splits each existing Spark partition by the column before writing it to disk.
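To illustrate the difference, here is a minimal sketch, assuming the standard Foundry transforms API (the dataset paths and column name are placeholders):
```
from transforms.api import transform, Input, Output


@transform(
    out=Output("/path/to/output"),    # placeholder path
    source=Input("/path/to/source"),  # placeholder path
)
def compute(out, source):
    df = source.dataframe()
    # partition_cols only controls the on-disk layout: each existing Spark
    # partition is split by 'col' at write time, with no shuffle involved.
    out.write_dataframe(df, partition_cols=['col'])
    # To colocate all rows sharing a 'col' value in one Spark partition,
    # an explicit repartition (a shuffle) would be needed first, e.g.:
    #   out.write_dataframe(df.repartition('col'), partition_cols=['col'])
```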
As for this issue with REPAIR TABLE, are you doing an incremental transform? Can you share the code?
Hello,
here is the code I'm running:
```
import datetime

from transforms.api import transform, Output


@transform(
    out_placeholder=Output("placeholder"),
    batch_dates=Output("batch_dates"),
    out=Output("data_to_be_repartitionned"),
)
def compute(ctx, out, out_placeholder, batch_dates):
    history_schema = my_schema  # defined elsewhere
    begin_date = 'begin_date'
    last_date = 'end_date'
    init_state = True
    DAYS_PER_RUN = 90
    if init_state:
        out_placeholder.set_mode('replace')
        # Snapshot the previous version of the output before repartitioning.
        # The original snippet read history_df.dataframe(...), which uses
        # history_df before it is defined; reading from `out` is presumably
        # the intent.
        history_df = out.dataframe('previous', history_schema)
        out_placeholder.write_dataframe(history_df)
    start_date = begin_date
    end_date = str(datetime.datetime.strptime(start_date, '%Y-%m-%d').date()
                   + datetime.timedelta(days=DAYS_PER_RUN - 1))
    to_write_df = history_df.filter("doing some filters")
    column_name = 'my_col'
    out.set_mode('replace')
    write_partitionned_df(out, to_write_df, column_name, ctx, 7)
```
and write_partitionned_df is essentially doing a repartition(n_files, column_name) followed by a write with partition_cols=[column_name]
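For reference, here is a minimal sketch of what that helper could look like; the name and signature come from the call in the snippet above, but the body is an assumption based on the description:
```
def write_partitionned_df(out, df, column_name, ctx, n_files):
    # Assumed body: shuffle rows so that values of column_name are colocated
    # across n_files Spark partitions, then write with a matching on-disk
    # (Hive-style) partitioning.
    repartitioned = df.repartition(n_files, column_name)
    out.write_dataframe(repartitioned, partition_cols=[column_name])
```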
So the idea is to take a copy of my dataset before repartitioning, output it to a dataset I call placeholder, then repartition a batch of this dataset and output it to my current dataset (I'm also writing a dataset of dates, but that's not important).
Could you also share the full stack trace of the error, please?
Hello, I just found someone else who has the same issue; the logs and even additional information are posted here:
https://community.palantir.com/t/partition-a-big-incremental-dataset/1150