Lightweight Polars scan_parquet parameters

I am using lightweight transforms with polars as often as possible and loving it, but I'm running into an issue when I need to lazily read from datasets that are frequently appended to during ingestion and can be missing columns in some of the underlying parquet files.

Polars has a param for this in scan_parquet:

missing_columns

Configuration for behavior when columns defined in the schema are missing from the data:

  • insert: Inserts the missing columns using NULLs as the row values.

  • raise: Raises an error.

It defaults to raise, so transforms that hit this issue always fail.
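For illustration, here is how that parameter is used in plain polars. This is a minimal sketch with a hypothetical glob path and column names, assuming a polars version that ships missing_columns and the schema parameter on scan_parquet:

import polars as pl

# Hypothetical full schema; "added_later" is absent from older parquet files
full_schema = {"id": pl.Int64, "added_later": pl.Float64}

lf = pl.scan_parquet(
    "dataset/**/*.parquet",
    schema=full_schema,
    missing_columns="insert",  # null-fill "added_later" in older files instead of raising
)
print(lf.collect())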

It would be quite helpful to expose that parameter, as well as schema and any others, if possible.


In general, a public API to get the list of parquet paths for a LightweightTransformInput would be welcome. That way the transforms team wouldn't have to play catch-up with library options, and developers could always fall back to owning the scan_parquet code themselves.

Thanks!

You can use scan_parquet and set your own parameters; it's also possible to get the temporary filepath of your input inside the lightweight container:

from transforms.api import Input, Output, transform
import polars as pl


@transform.using(
    my_input=Input("input_rid"),
    my_output=Output("output_rid")
)
def polars_compute(my_input, my_output):
    # .path() materializes the input dataset on local disk; glob every parquet file
    my_lazyframe = pl.scan_parquet(
        my_input.path() + "/**/*.parquet"
    )
    my_output.write_table(my_lazyframe)
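
For the original question, the same pattern lets you pass missing_columns (and a full schema) yourself. A minimal sketch mirroring the snippet above, with hypothetical column names and assuming a polars version that supports missing_columns:

from transforms.api import Input, Output, transform
import polars as pl


@transform.using(
    my_input=Input("input_rid"),
    my_output=Output("output_rid")
)
def polars_compute(my_input, my_output):
    my_lazyframe = pl.scan_parquet(
        my_input.path() + "/**/*.parquet",
        schema={"id": pl.Int64, "added_later": pl.Float64},  # hypothetical full schema
        missing_columns="insert",  # null-fill columns absent from some files
    )
    my_output.write_table(my_lazyframe)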

.path() will trigger a full download of the input dataset into the container, which is often not practical.

I am sure Palantir is working on exposing an API to pass scan options or to get the list of S3 URIs.

Many thanks @Solv, working for me :+1:

It would indeed be helpful to eventually scan and apply predicates on the files in S3 rather than downloading the full dataset, but this is working fine for me at my current scale.
