Polars lazy mode in a lightweight transform raises SchemaError for a snapshot input of an incrementally updated dataset that received a new column

I have an incremental transform to which we recently added a new column. An incremental lightweight transform consumes its output as a snapshot input and reads it with Polars in lazy mode. It worked fine until we added the new column, but then it started failing with the error:

polars.exceptions.SchemaError: extra column in file outside of expected schema: {col name}, hint: specify this column in the schema, or pass extra_columns='ignore' in scan options.

extra_columns is not exposed in the API, so I am not sure how to fix it.

We cannot reprocess the data and run it as a snapshot.

We are also running into this error - were you able to find a solution?

Hey, we managed to solve it by adding the following code:

from functools import wraps

import polars as pl

# Keep a reference to the original function so the wrapper can delegate to it
_original_scan_parquet = pl.scan_parquet


@wraps(_original_scan_parquet)
def _patched_scan_parquet(*args, **kwargs):
    # Tolerate schema drift between files unless the caller set these options explicitly
    kwargs.setdefault("extra_columns", "ignore")
    kwargs.setdefault("missing_columns", "insert")
    return _original_scan_parquet(*args, **kwargs)


# Apply the patch before the transform reads its input
pl.scan_parquet = _patched_scan_parquet
