Hi,
For Spark with sidecar, I would recommend using the ARROW_ENABLED profile (see Spark profile reference). I expect most of the slowness with Spark comes from having to convert the dataframe from Spark to Pandas format, which is generally slow and memory intensive, but made much more efficient with Arrow. You could also leverage executors via the DistributedInferenceWrapper as described here. This natively uses Arrow via Spark’s mapInPandas method.
We will soon support bringing a model as a sidecar in lightweight transforms as well, so that users can benefit both from the portability of models as containers and the efficient data loading of lightweight transforms.
Lightweight should show good performance as the data is loaded natively in Pandas, and no conversion is needed. I wonder if the out of memory error is coming from the fact that
input_data = inference_input.pandas()[columns_needed].head(1000)
will actually load the entire dataset in memory, then get the first 1000 rows, which could easily fill available memory. In contrast, Spark is able to only read a subset of the data to load the 1000 rows in the dataset when you do:
filtered_spark_df = (
spark_df
# Simplified filter condition
.filter(spark_df["required_column"].isNotNull())
# Simplified selection of input columns
.select("feature_id", "input_text_1")
.limit(10000) # Keep limit for batch context
)
You might want to use the Polars lazy API instead, which supports filter pushdowns and should stream data rather than reading the entirety of the dataset to avoid memory issues.
df = (input.polars(lazy=True)
.select(columns_needed)
.filter(your_filters_here)
.head(10000)
.collect()
).to_pandas()
or use the filesystem API directly to read a subset of files.
Hope this helps - let me know if you’re still not getting the performance you’d like.
PS: As to your initial question, I am curious if you were able to find out why you were seeing such discrepancies between live and batch deployments. Happy to help investigate if not.
Thanks,
Julien