I am having a problem with one table: we recently discovered that it is not partitioned, and we are now in the middle of a development effort that requires it to be properly partitioned to make the project “future proof”.
Input dataset (df_raw): 200+ GB
Output dataset (“df_daily”, the one we want to partition): 100+ GB
The pipeline has been running incrementally for a very long time, and recomputing those 200 GB is not an option, so we decided to try the following approach:
1- Stop all schedules
2- Change the code so it just reads the previous output and writes it back in “replace” mode, partitioned by date (the original code is commented out)
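Simplified, the new transform looks roughly like this (dataset paths and the “date” column name are illustrative; the real logic lives in df_daily.py):

```python
from transforms.api import transform, Input, Output, incremental


@incremental(semantic_version=2)  # bump the version so the next run starts clean
@transform(
    out=Output("/Project/datasets/df_daily"),   # illustrative path
    source=Input("/Project/datasets/df_raw"),   # kept wired, but not re-read in full
)
def compute(out, source):
    # Take the rows we already have instead of recomputing 200+ GB from df_raw
    previous = out.dataframe("previous")

    # Replace the whole output, this time partitioned by date
    out.set_mode("replace")
    out.write_dataframe(previous, partition_cols=["date"])

    # --- original incremental logic commented out while we repartition ---
    # new_rows = source.dataframe()
    # out.set_mode("modify")
    # out.write_dataframe(new_rows)
```

The partitioned write (partition_cols=["date"]) appears to be what triggers the MSCK REPAIR TABLE step and, with it, the error below.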
But now we are getting the error:
The build failed due to an org.apache.spark.sql.AnalysisException. The error message indicates that the operation MSCK REPAIR TABLE is not allowed because the table in question is not a partitioned table. This operation is typically used to recover partitions and data associated with partitions.
The error occurred in the file df_daily.py at line 32, where the function write_partitioned_dataframe_row_based is called. This function seems to be trying to write to a non-partitioned table as if it were partitioned.
To fix this issue, you can either convert the table to a partitioned table or modify your code to not treat the table as partitioned. If you’re unsure how to proceed, please consult with Palantir support.
It is not letting us partition because of the “MSCK REPAIR TABLE” step: in the table’s metadata the dataset seems to be flagged as “non-partitioned”, and apparently you cannot make it partitioned even if you completely replace the data inside…
We also considered the following idea:
Current status:
<df_raw> → <df_daily (non partitioned)>
Potential Solution:
1st step: “create a copy of df_daily partitioned with @incremental”
<df_raw> → <df_daily (non partitioned)> → <df_daily_2 (partitioned)>
2nd step: “rewire to make the copy read from raw” (we don’t know whether this would work; a rough sketch of both steps follows below)
<df_raw> → <df_daily_2 (partitioned)>
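Roughly, the two steps would look like this (paths and the “date” column are again illustrative):

```python
from transforms.api import transform, Input, Output

# 1st step: a one-off snapshot copy into a brand-new dataset. Since df_daily_2
# has never been written as non-partitioned, its metadata should carry no
# "non-partitioned" flag, so the partitioned write should be accepted.
@transform(
    out=Output("/Project/datasets/df_daily_2"),
    existing=Input("/Project/datasets/df_daily"),
)
def backfill(out, existing):
    out.write_dataframe(existing.dataframe(), partition_cols=["date"])
```

```python
from transforms.api import transform, Input, Output, incremental

# 2nd step (the part we are unsure about): rewire df_daily_2 to read from
# df_raw with @incremental, appending only new rows on top of the backfilled,
# partitioned data.
@incremental()
@transform(
    out=Output("/Project/datasets/df_daily_2"),
    source=Input("/Project/datasets/df_raw"),
)
def compute(out, source):
    new_rows = source.dataframe()  # only unprocessed input rows on an incremental run
    out.set_mode("modify")         # append; keep the partitions written in step 1
    out.write_dataframe(new_rows, partition_cols=["date"])
```

What we don’t know is whether swapping the input (df_daily → df_raw) invalidates the incremental state and forces a SNAPSHOT run, which would recompute everything from df_raw, which is exactly what we are trying to avoid.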
The question, in the end, is: how can I partition a BIG incremental dataset by date without recomputing the entire input dataset?