I am trying to load a machine learning model saved as a pickle file into Code Repositories, but I am getting an error: the `source_df` object, which is a PySpark DataFrame, does not have a `filesystem` attribute.
This is what I have:
from pyspark.sql import DataFrame
import pandas as pd
from transforms.api import Input, Output, transform_df
import pickle
@transform_df(
    Output(
        "/XXX/pickle file test/iris_preds"
    ),
    source_df=Input(
        "/XXX/pickle file test/iris_features"
    ),
)
def compute(source_df: DataFrame):
    fs = source_df.filesystem()
    with fs.open("iris_regression_model.pkl", "rb") as f:
        model = pickle.load(f)
    predictions = model.predict(source_df)
    # Convert predictions to DataFrame
    df_predictions = pd.DataFrame(predictions, columns=['Predictions'])
    return df_predictions
You’ll need to switch from the @transform_df() decorator to @transform() to access FileSystem objects, per the docs here: Read and write unstructured files
Hi @robind - as indicated by @yix , you’ll want to use the transform decorator instead of transform_df so you can leverage raw file access. Additionally, it’s recommended that you first publish the model in Foundry so it can then be used for inference. An example of how to do this is available here.
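As a rough sketch of what that switch looks like, assuming the same dataset paths as in the original post: with @transform the function receives TransformInput and TransformOutput objects rather than DataFrames, so filesystem() becomes available, and the DataFrame has to be requested and written explicitly. The helper `list_file_paths` is a hypothetical name used for illustration, and the import guard is only there so the snippet can also be loaded outside Foundry.

```python
try:
    from transforms.api import Input, Output, transform
except ImportError:  # not running inside a Foundry code repository
    transform = None


def list_file_paths(fs):
    """Return the paths of the raw files visible through a dataset FileSystem."""
    return [status.path for status in fs.ls()]


if transform is not None:

    @transform(
        out=Output("/XXX/pickle file test/iris_preds"),
        source=Input("/XXX/pickle file test/iris_features"),
    )
    def compute(out, source):
        # source is a TransformInput, not a DataFrame, so filesystem() exists here.
        print(list_file_paths(source.filesystem()))
        # The Spark DataFrame is no longer passed in or returned implicitly:
        # it is requested with dataframe() and written with write_dataframe().
        out.write_dataframe(source.dataframe())
```

Listing the files first is a cheap way to confirm whether the .pkl is actually present in the input before trying to open it.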
Thanks. I am trying to demonstrate the quickest route to applying an ML model via a pickle file, so I am trying to avoid the ‘proper’ model deployment route and just read in the pickle file and apply it to the dataset. I am getting this error, probably due to this: fs = source_df.filesystem(). Do I need to load the pkl file into the dataset folder of the code repo somehow?
Traceback (most recent call last):
  File "/myproject/datasets/rd_pickletest.py", line 30, in compute
    with fs.open("iris_regression_model.pkl", "rb") as f:
  File "/scratch/asset-install/xxx/miniconda311/lib/python3.11/site-packages/foundry_pyls/preview/transforms_api.py", line 493, in open
    file = open(file_path, mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/parquet-datasets/ri.code-assist2.main.workspace.xxx/iris_regression_model.pkl'
Makes sense @robind . Regarding the new error, the last line indicates that you’re seeing this error when you run Preview. Is that accurate? If so, please can you confirm that when selecting the raw files to include in Preview, one of them was named “iris_regression_model.pkl”?
Hello @robind! The pickle file must be inside of a dataset. You can create a new dataset and upload the file there. Then, in your transform, you can read the file system of that dataset.
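A minimal sketch of that setup, under a few assumptions that are not confirmed in this thread: the pickle was uploaded to a separate dataset at the hypothetical path "/XXX/pickle file test/iris_model_files", every column of iris_features is a model feature, and the model exposes a scikit-learn style predict(). `load_pickle` is a hypothetical helper name, and the import guard only exists so the snippet can be loaded outside Foundry.

```python
import pickle

try:
    from transforms.api import Input, Output, transform
except ImportError:  # not running inside a Foundry code repository
    transform = None


def load_pickle(fs, path):
    """Unpickle one raw file from a dataset FileSystem, opened in binary mode."""
    with fs.open(path, "rb") as f:
        return pickle.load(f)


if transform is not None:

    @transform(
        out=Output("/XXX/pickle file test/iris_preds"),
        features=Input("/XXX/pickle file test/iris_features"),
        # Hypothetical dataset holding the uploaded .pkl file.
        model_files=Input("/XXX/pickle file test/iris_model_files"),
    )
    def compute(out, features, model_files):
        model = load_pickle(model_files.filesystem(), "iris_regression_model.pkl")
        # pandas() collects the data to the driver, which is fine for a small demo.
        features_pdf = features.pandas()
        # Assumption: all columns are features the model was trained on.
        preds_pdf = features_pdf.copy()
        preds_pdf["Predictions"] = model.predict(features_pdf)
        out.write_pandas(preds_pdf)
```

Keeping the model file in its own input means the features dataset stays tabular, and the file is read from the dataset that actually contains it.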
Thanks, but I am sorry, I don’t know how to do that. I created a new dataset and then tried to “Import new data”, but it’s just gibberish (as expected, as it’s an ML model). I must be missing something.
That means Foundry tried to infer a schema for your file but was unable to. You can delete the inferred schema by going to Details → Schema → Edit → Delete.
Thank you, I think I am getting close. I had to move from transform_df to transform as per the earlier suggestions. It was asking for the sklearn library, so it’s on the right track. However, I am now getting _pickle.UnpicklingError: invalid load key, '\x05'. Possibly a corrupt pickle file, but I have been able to read it back into Jupyter and it works OK (and I have saved it again as a new file). I have consistent Python versions across Foundry and my laptop, and the MD5 hashes match. Could it be a discrepancy between how it’s saved in Jupyter and how it’s loaded in Foundry?
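One check that may help narrow this down (a guess, not something confirmed in this thread): a pickle written with protocol 2 or higher starts with the byte 0x80 followed by the protocol number, so a protocol 5 file begins with b'\x80\x05'. An "invalid load key, '\x05'" error is consistent with that leading 0x80 byte going missing before pickle sees the data, for example if the file is read or decoded as text somewhere along the way. The sketch below uses only the standard library; `describe_pickle_header` is a hypothetical helper name.

```python
import pickle


def describe_pickle_header(data: bytes) -> str:
    """Report what the first two bytes say about a pickle stream."""
    if len(data) >= 2 and data[0] == 0x80:
        return f"protocol {data[1]} header"
    return f"no protocol header, first byte is {data[:1]!r}"


good = pickle.dumps({"a": 1}, protocol=5)
print(describe_pickle_header(good))     # protocol 5 header

damaged = good[1:]  # simulate losing the leading 0x80 byte
print(describe_pickle_header(damaged))  # no protocol header, first byte is b'\x05'

try:
    pickle.loads(damaged)
except pickle.UnpicklingError as exc:
    # The damaged stream fails to load because 0x05 is not a valid opcode.
    print(exc)
```

Inside the transform, the same helper could be applied to the first few bytes read from the file (opened with mode "rb") before calling pickle.load, to see whether the bytes Foundry hands back still start with 0x80.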