I’ve developed an XGBoost model in Foundry, but it uses pandas DataFrames and doesn’t seem very efficient on large volumes of data.
How can I perform inference on a very large dataset, potentially by leveraging Spark to execute the XGBoost inference?
Is there a template for an XGBoost model in particular, as an example or a starting point?
You could write a pandas UDF that runs distributed across the Spark workers:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Load the trained model once on the driver; Spark pickles it into each task.
model = load_model()

@pandas_udf(returnType=DoubleType(), functionType=PandasUDFType.SCALAR)
def predict(*cols):
    # Reassemble the incoming column batches into a single pandas DataFrame
    pandas_df = pd.concat(cols, axis=1)
    pandas_df.columns = COLUMNS
    # string -> categorical (required by XGBoost)
    for col in CATEGORICAL_COLUMNS:
        pandas_df[col] = pandas_df[col].astype("category")
    # Return the positive-class probability for each row
    yhat = model.predict_proba(pandas_df)
    return pd.Series(yhat[:, 1])

df = df.withColumn("yhat", predict(*[F.col(col) for col in COLUMNS]))
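One thing to be aware of: because the UDF closes over model, Spark pickles the model and ships it with every task. If the model is large, broadcasting it once per executor can be cheaper. A sketch, assuming your SparkSession is named spark and the model object is picklable:

# Broadcast the model once instead of shipping it inside each task closure
bc_model = spark.sparkContext.broadcast(model)
# Inside the UDF, call bc_model.value.predict_proba(pandas_df)
# instead of model.predict_proba(pandas_df)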
Also, I just checked, and it seems that XGBoost has APIs to work directly with Spark. Worth checking out: https://xgboost.readthedocs.io/en/stable/tutorials/spark_estimator.html
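For reference, a rough sketch of what that estimator API looks like, per the linked tutorial. This assumes xgboost>=1.7 installed alongside PySpark; train_df, test_df, COLUMNS, and the "label" column are hypothetical names you would adapt to your dataset:

from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# Assemble the numeric feature columns into the single vector column the estimator expects
assembler = VectorAssembler(inputCols=COLUMNS, outputCol="features")

classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4,  # number of Spark tasks used for distributed training
)
model = classifier.fit(assembler.transform(train_df))

# transform() runs inference distributed across the cluster,
# adding prediction/probability columns to the output DataFrame
predictions = model.transform(assembler.transform(test_df))

Note that this trains the model in Spark as well; if you want to keep your existing pandas-trained model and only distribute inference, the pandas UDF above is the more direct route.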