What is "Executor decommission" and why did it fail a whole stage in my build?

My Spark job had a whole stage fail:

Shuffle stage failing due to executor loss

A shuffle stage failed despite retries, indicating repeated loss of executors holding the shuffle blocks. Likely reasons for executor loss include executors running out of memory or out of ephemeral storage during shuffle stages. If the failure is caused by executor OOM, audit your join operations, decrease partition sizes, make sure your data is not skewed, or increase executor memory (possibly just memoryOverhead) to keep the process from crashing. If your job shuffles a lot of data, consider optimizing your shuffle stages (which occur through non-broadcasted joins, groupBy operations, repartitions, etc.), or increase the number of executors to acquire more ephemeral storage capacity.

For more help debugging, please read the available documentation.
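For context, my understanding is that the mitigations the message suggests map to standard Spark configuration properties roughly like this (the values and the `my_job.py` script name are placeholders, not our actual settings):

```shell
# Hypothetical tuning flags corresponding to the error's suggestions (values are placeholders).
# spark.executor.memoryOverhead: extra off-heap headroom to reduce OOM kills.
# spark.sql.shuffle.partitions: more/smaller shuffle partitions, so each task holds less data.
# spark.executor.instances: more executors, so more aggregate ephemeral storage for shuffle files.
spark-submit \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.instances=20 \
  my_job.py
```

But tuning these blindly feels like guessing, since the actual failure reason reported is decommissioning, not OOM.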

When I look at the tasks that failed, they all say the executor was decommissioned, for example:

ExecutorLostFailure (executor 10004 exited unrelated to the running tasks) Reason: Executor decommission: Executor 10004 is decommissioned.

What does this mean? How can we prevent it?