I’ve developed an XGBoost model in Foundry, but it uses pandas DataFrames and doesn’t seem very efficient on large volumes of data.
How can I perform inference on a very large dataset, potentially by leveraging Spark to execute the XGBoost inference?
Is there a template for an XGBoost model in particular, as an example or a starting point?
You could write a pandas UDF that runs distributed across the Spark workers:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Load the trained model once on the driver; Spark pickles it into each task.
model = load_model()

@pandas_udf(returnType=DoubleType(), functionType=PandasUDFType.SCALAR)
def predict(*cols):
    # Reassemble the incoming column batches into a single pandas DataFrame
    pandas_df = pd.concat(cols, axis=1)
    pandas_df.columns = COLUMNS
    # string -> categorical (required by XGBoost)
    for col in CATEGORICAL_COLUMNS:
        pandas_df[col] = pandas_df[col].astype("category")
    # Return the positive-class probability for each row
    yhat = model.predict_proba(pandas_df)
    return pd.Series(yhat[:, 1])

df = df.withColumn("yhat", predict(*[F.col(col) for col in COLUMNS]))
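One thing to be aware of: because the UDF closes over model, Spark pickles the model and ships it with every task. If the model is large, broadcasting it once per executor can be cheaper. A sketch, assuming your SparkSession is named spark and the model object is picklable:

# Broadcast the model once instead of shipping it inside each task closure
bc_model = spark.sparkContext.broadcast(model)
# Inside the UDF, call bc_model.value.predict_proba(pandas_df)
# instead of model.predict_proba(pandas_df)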
Also, I just checked, and it seems that XGBoost has APIs to work directly with Spark. Worth checking out: https://xgboost.readthedocs.io/en/stable/tutorials/spark_estimator.html
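For reference, a rough sketch of what that estimator API looks like, per the linked tutorial. This assumes xgboost>=1.7 installed alongside PySpark; train_df, test_df, COLUMNS, and the "label" column are hypothetical names you would adapt to your dataset:

from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

# Assemble the numeric feature columns into the single vector column the estimator expects
assembler = VectorAssembler(inputCols=COLUMNS, outputCol="features")

classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4,  # number of Spark tasks used for distributed training
)
model = classifier.fit(assembler.transform(train_df))

# transform() runs inference distributed across the cluster,
# adding prediction/probability columns to the output DataFrame
predictions = model.transform(assembler.transform(test_df))

Note that this trains the model in Spark as well; if you want to keep your existing pandas-trained model and only distribute inference, the pandas UDF above is the more direct route.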