Load pickle file into Code Repository

I am trying to load a machine learning model saved as a pickle file into Code Repository, but I am getting an error because the `source_df` object, which is a PySpark DataFrame, does not have a `filesystem` attribute.

This is what I have:

from pyspark.sql import DataFrame
import pandas as pd
from transforms.api import Input, Output, transform_df
import pickle

@transform_df(
    Output(
        "/XXX/pickle file test/iris_preds"
    ),
    source_df=Input(
        "/XXX/pickle file test/iris_features"
    ),
)

def compute(source_df: DataFrame):

    fs = source_df.filesystem()
    with fs.open("iris_regression_model.pkl", "rb") as f:
        model = pickle.load(f)

    predictions = model.predict(source_df)

    # Convert predictions to DataFrame
    df_predictions = pd.DataFrame(predictions, columns=['Predictions'])
    return df_predictions

You’ll need to switch from the @transform_df() decorator to the @transform() decorator to access FileSystem objects, per the docs here: Read and write unstructured files

Hi @robind - as indicated by @yix , you’ll want to use the transform decorator instead of transform_df so you can leverage raw file access. Additionally, it’s recommended that you first publish the model in Foundry so it can then be used for inference. An example of how to do this is available here.

Thanks. I am trying to demonstrate the quickest route to applying an ML model via a pickle file so am trying to avoid the ‘proper’ model deployment route and just read in the pickle file and apply to the dataset. I am getting this error, probably due to this: fs = source_df.filesystem(). Do I need to load the pkl file into the dataset folder of code repo somehow?

Traceback (most recent call last):
  File "/myproject/datasets/rd_pickletest.py", line 30, in compute
    with fs.open("iris_regression_model.pkl", "rb") as f:
  File "/scratch/asset-install/xxx/miniconda311/lib/python3.11/site-packages/foundry_pyls/preview/transforms_api.py", line 493, in open
    file = open(file_path, mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/parquet-datasets/ri.code-assist2.main.workspace.xxx/iris_regression_model.pkl'
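For context, that traceback is plain-Python open() behaviour on a path that was never written: the scratch directory only contains the files materialized for source_df (the parquet parts and logs), and the .pkl was never part of that dataset. A minimal reproduction outside Foundry (file names are hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Only the dataset's own files exist in the materialized directory...
    open(os.path.join(scratch, "part-0000.snappy.parquet"), "wb").close()

    # ...so opening a file that was never uploaded to that dataset fails.
    try:
        open(os.path.join(scratch, "iris_regression_model.pkl"), "rb")
    except FileNotFoundError as err:
        errno_seen = err.errno  # Errno 2, as in the traceback above
```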

Makes sense @robind . Regarding the new error, the last line indicates that you’re seeing this error when you run Preview. Is that accurate? If so, please can you confirm that when selecting the raw files to include in Preview, one of them was named “iris_regression_model.pkl”?


Thanks. Nope, it's returning only the details of the source_df dataset: the .snappy.parquet and _driver.log files.

Do you know how I can point it towards iris_regression_model.pkl, which is in the same folder as source_df?

Hello @robind! The pickle file must be inside of a dataset. You can create a new dataset and upload the file there. Then, in your transform, you can read the file system of that dataset.

Thanks, but I am sorry, I don't know how to do that. I created a new dataset and then tried to "Import new data", but it's just gibberish (as expected, as it's an ML model). I must be missing something.

Were you able to upload the file into the dataset? The dataset should show the raw .pkl file with no schema applied.

If instead Foundry tried to infer a schema from the file and was unable to, you can delete the inferred schema by going to Details → Schema → Edit → Delete.

Once deleted, the dataset should show the raw file as expected.

In your transform you can now read from it:

from transforms.api import Input, Output, transform
import pandas as pd
import pickle

@transform(
    out=Output(
        "/XXX/pickle file test/iris_preds"
    ),
    source_df=Input(
        "/XXX/pickle file test/iris_features"
    ),
    model_input=Input(
        "/XXX/pickle file test/dataset_with_model_uploaded"
    ),
)
def compute(ctx, out, source_df, model_input):
    # Read the pickle file from the model dataset's file system
    fs = model_input.filesystem()
    with fs.open("iris_regression_model.pkl", "rb") as f:
        model = pickle.load(f)

    # With @transform, inputs are TransformInput objects, so grab the
    # underlying Spark DataFrame and convert to pandas for prediction
    features = source_df.dataframe().toPandas()
    predictions = model.predict(features)

    # Convert predictions to a DataFrame and write to the output
    df_predictions = pd.DataFrame(predictions, columns=['Predictions'])
    out.write_dataframe(ctx.spark_session.createDataFrame(df_predictions))
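If it helps to sanity-check the load-and-predict step outside Foundry, here is a self-contained sketch with a hypothetical stand-in for the sklearn regressor (ToyModel and the column names are made up for illustration; the real model comes from the uploaded .pkl):

```python
import pickle
import pandas as pd

class ToyModel:
    """Hypothetical stand-in for the pickled sklearn regressor."""
    def predict(self, X: pd.DataFrame):
        # Dummy rule: sum each row's features.
        return X.sum(axis=1).tolist()

# Round-trip through pickle, as the transform does via fs.open(..., "rb").
blob = pickle.dumps(ToyModel())
model = pickle.loads(blob)

features = pd.DataFrame({"sepal_length": [5.1, 6.2], "sepal_width": [3.5, 2.9]})
predictions = model.predict(features)
df_predictions = pd.DataFrame(predictions, columns=["Predictions"])
```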

Thank you, I think I am getting close. I had to move from transform_df to transform, as per the earlier suggestions. It was asking for the sklearn library, so it's on the right track. However, I am now getting _pickle.UnpicklingError: invalid load key, '\x05'. Possibly a corrupt pickle file, but I have been able to read it back into Jupyter and it works OK (and have saved it again as a new file). I have consistent Python versions across Foundry and my laptop, and the MD5 hashes match. Could it be a discrepancy between how it's saved in Jupyter and loaded in Foundry?
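One hedged observation on that error: pickle streams written with protocol 2 or later begin with the PROTO opcode b'\x80' followed by the protocol number, so "invalid load key, '\x05'" usually means the loader started one byte into a protocol-5 stream, i.e. the leading b'\x80' was consumed or stripped somewhere between saving and loading (or the bytes being read are not the pickle at all), rather than the file being wholly corrupt. A quick local check of the header bytes (plain Python, no Foundry APIs):

```python
import pickle

# A protocol-5 pickle always begins with the PROTO opcode 0x80, then 0x05.
blob = pickle.dumps({"example": 1}, protocol=5)
header = blob[:2]
assert header == b"\x80\x05"

# If the first byte is lost, load() sees 0x05 as its first opcode and
# raises exactly "invalid load key, '\x05'".
try:
    pickle.loads(blob[1:])
except pickle.UnpicklingError as err:
    message = str(err)
```

Comparing the first two bytes of the file in Jupyter against what the transform reads would confirm whether the stream arrives intact.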