Audit log dataset projections and Spark profiles

Has anyone set up dataset projections on their audit logs to improve performance? We attempted to set up a dataset projection on our two-year audit log, and it failed during the scheduled build with the explanation below. Our audit log is about 188 GB. I'm looking for suggestions on applying Spark profiles to a very large dataset.

Summary of Failure:

The Foundry build failed due to a Spark-related issue: Missing an output location for shuffle 1 partition 0. This error typically occurs when Spark cannot find or retrieve shuffle data required for the next stage. The root cause could be related to a configuration error, resource issue, or corruption in intermediate shuffle data.

The ExecutorUnreachable exception suggests that a Spark executor might have become unavailable, likely due to memory limits, network issues, or resource contention within the job.

Suggested Fix:

  1. Check Resource Allocation: The job might require more CPU, memory, or disk space. Ensure that sufficient resources are allocated for the Spark executors to handle the shuffle tasks.
  2. Retry Job: Sometimes this issue occurs transiently. Retry the job to confirm whether the error persists.
  3. Optimize Shuffle Operations:
  • Consider reducing the amount of shuffle data by filtering or aggregating earlier in your pipeline.
  • Adjust Spark configurations, such as increasing spark.executor.memory or modifying the number of shuffle partitions (spark.sql.shuffle.partitions).
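
As a rough illustration of the shuffle-partition adjustment suggested above, here is a minimal Python sketch that estimates a value for spark.sql.shuffle.partitions from the dataset size. The 128 MB per-partition target is a common rule of thumb and an assumption here, not a Foundry recommendation, and the function names are hypothetical. It also assumes the shuffle moves roughly the full 188 GB; the real shuffle size can differ from the on-disk size because of compression and column pruning.

```python
import math


def partition_size_mb(dataset_gb: float, num_partitions: int) -> float:
    """Approximate data per shuffle partition, in MB, for a given partition count."""
    if dataset_gb <= 0 or num_partitions <= 0:
        raise ValueError("dataset_gb and num_partitions must be positive")
    return dataset_gb * 1024 / num_partitions


def estimate_shuffle_partitions(dataset_gb: float, target_partition_mb: int = 128) -> int:
    """Estimate a partition count so each partition holds about target_partition_mb.

    target_partition_mb is an assumed rule-of-thumb value; tune it for your job.
    """
    if dataset_gb <= 0 or target_partition_mb <= 0:
        raise ValueError("dataset_gb and target_partition_mb must be positive")
    return max(1, math.ceil(dataset_gb * 1024 / target_partition_mb))


# Spark's default for spark.sql.shuffle.partitions is 200.
print(partition_size_mb(188, 200))       # data per partition at the default setting
print(estimate_shuffle_partitions(188))  # partitions needed for ~128 MB each
```

At the default of 200 partitions, each partition would carry close to 1 GB under these assumptions, which makes executor memory pressure during the shuffle more likely. A higher partition count spreads the same data over smaller tasks.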

Hey,

It looks like the build hit an executor OOM (out-of-memory) error. Projections try to self-scale, but this can sometimes fail. You can override the Spark profiles used for the build on the build configuration page; see the box labelled “Using spark profiles”.

You will want a profile with high executor memory, e.g. EXECUTOR_MEMORY_LARGE.

I will try that profile and report back after the build runs again. Thanks!