'int' object is not iterable: writing a single output value in PySpark

Based on the Spark documentation (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.evaluation.RegressionMetrics.html), meanAbsoluteError should return a single value, which I should be able to put into a DataFrame (see the code below and the screenshot). But I'm getting the error shown in the screenshot. Any idea why this is happening?

from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.sql import types as T
from transforms.api import Input, Output, transform


@transform(
    mae_output=Output("ri.foundry.main.dataset.496ab2ea-6f7f-4caf-b0de-3f803726d21c"),
    metrics=Input("ri.foundry.main.dataset.9d456872-1e36-40dd-9f0f-599ffe40b12f"),
)
def compute(ctx, metrics, mae_output):
    # metrics = metrics.dataframe()
    predictionAndObservations = ctx.spark_session.sparkContext.parallelize(
        [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
    )
    metrics2 = RegressionMetrics(predictionAndObservations)
    schema = T.StructType([T.StructField("MAE_metric", T.IntegerType(), True)])
    MAE_df = ctx.spark_session.createDataFrame(data=int(metrics2.meanAbsoluteError), schema=schema)
    mae_output.write_dataframe(MAE_df)

Hey,

spark_session.createDataFrame takes an iterable of rows (docs); you're currently giving it a single integer. Wrap the value in square brackets to make it a list: [metrics2.meanAbsoluteError]. This method can be a bit picky, though, so you may need to pass a list of tuples instead: [(metrics2.meanAbsoluteError,)].
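
Here's a minimal sketch of the corrected lines, keeping your variable names. One extra assumption on my part: meanAbsoluteError returns a float, so I've swapped IntegerType for DoubleType in the schema (casting to int would truncate the metric anyway).

    schema = T.StructType([
        T.StructField("MAE_metric", T.DoubleType(), True),  # MAE is a float, not an int
    ])
    # Wrap the single value in a one-element list of one-element tuples so that
    # createDataFrame receives an iterable of rows rather than a bare scalar.
    MAE_df = ctx.spark_session.createDataFrame(
        data=[(metrics2.meanAbsoluteError,)],
        schema=schema,
    )
    mae_output.write_dataframe(MAE_df)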

For solving these kinds of issues, I'd recommend using the debugger in Preview (click to the left of the line numbers to add a breakpoint). It lets you inspect data types and try out single lines of code one at a time!
