It’s happened with multiple models, and the only solution seems to be to trash the model and make a new one. Basically, when you run a transform that publishes to a model output, but then separately build another transform that references an older version of that model via a ModelInput, something shifts in the transactions on the model: the model asset UI shows it’s updated to the most recent transform, but the other transforms, as well as any calls to the model adapter (i.e. from the evaluator repo), fail with a 403 error (Permission Denied), always against the prior model version. Nothing fixes this other than pinning the version of the ModelInput, which of course is not a good solution. I just have to chuck the model and build it fresh.
A 403 Client Error : Forbidden error occurred calling https://waypoint-envoy.rubix-system.svc.cluster.local:8443/compute/a82899/foundry/models-api/models/api/model/ri.models.main.model.b2f4049f-af71-4c3e-ac89-0cacf6c22a9f/versions/ri.models.main.model-version.dd06de14-a428-4b80-9a3b-642a7f0fc241 . ErrorName: Security:PermissionDenied . Please review and contact support if the problem persists
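For reference, by “fixing the version” I mean pinning the ModelInput to a specific model version RID, something like this (a sketch; the model_version keyword argument is how I understand the API):

```python
from palantir_models.transforms import ModelInput

# Pinning the input to a fixed model version (RIDs taken from the error
# above). This sidesteps the 403, but the transform then never picks up
# newly published model versions.
pinned_input = ModelInput(
    "ri.models.main.model.b2f4049f-af71-4c3e-ac89-0cacf6c22a9f",
    model_version="ri.models.main.model-version.dd06de14-a428-4b80-9a3b-642a7f0fc241",
)
```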
Hi @shah, I’m trying to reproduce your error but am unable to. I’m hoping you can give me some more details, but based on your description I came up with the following:
First I run model training using the adapter and “training” code below. The adapter just returns a df containing whatever value is passed to the model when it’s created (in this case I am using the current time).
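The adapter source wasn’t quoted above, but it is roughly this (a minimal sketch; the DillSerializer choice and the single-column df_in/df_out api are assumptions):

```python
import pandas as pd

import palantir_models as pm
from palantir_models_serializers import DillSerializer


class ExampleModelAdapter(pm.ModelAdapter):
    # "value" is whatever gets passed in at publish time -- the training
    # timestamp in this repro.
    @pm.auto_serialize(value=DillSerializer())
    def __init__(self, value):
        self.value = value

    @classmethod
    def api(cls):
        # The input is ignored by predict; the output is a one-column df.
        inputs = {"df_in": pm.Pandas()}
        outputs = {"df_out": pm.Pandas()}
        return inputs, outputs

    def predict(self, df_in):
        # Return the stored value regardless of the input supplied.
        return pd.DataFrame({"value": [self.value]})
```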
The “training” code is:
```python
from transforms.api import lightweight, transform
from palantir_models.transforms import ModelOutput

from main.model_adapters.adapter import ExampleModelAdapter

import time


@lightweight
@transform(
    model_output=ModelOutput("ri.models.main.model.0d3821ef-b692-4f40-953b-9570aa243eb4"),
)
def compute(model_output):
    # Wrap the current time in the adapter and publish it as a new
    # model version.
    now = str(time.time())
    foundry_model = ExampleModelAdapter(now)
    model_output.publish(
        model_adapter=foundry_model,
        notes=now,
    )
```
and the inference code:
```python
from transforms.api import Output, lightweight, transform
from palantir_models.transforms import ModelInput


@lightweight
@transform(
    model_input=ModelInput("ri.models.main.model.0d3821ef-b692-4f40-953b-9570aa243eb4"),
    output=Output("ri.foundry.main.dataset.9c6db9d1-e1d1-4b8d-8cea-1366b0f95f8e"),
)
def compute(model_input, output):
    # Run inference and record which model version actually served it.
    res = model_input.predict("")
    res["rid"] = model_input.model_version_rid
    output.write_pandas(res)
```
I run the training build, and once it completes I run the subsequent inference build. The output dataset of the inference build has the correct values (the time is correct, and the model version RID in the rid column is correct), so I am not sure what I am missing in order to reproduce your error.
Can you try it without @lightweight? My other concern with this test is that the training doesn’t run long enough for you to start inference against a previous version of the model while training is still in flight, if that makes sense. Maybe add some kind of timer loop in the training, and then run your inference while it’s training; see the sketch below.
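Something like this in the training transform is what I have in mind (a sketch; the ten-minute stall is arbitrary):

```python
from transforms.api import lightweight, transform
from palantir_models.transforms import ModelOutput

from main.model_adapters.adapter import ExampleModelAdapter

import time


@lightweight
@transform(
    model_output=ModelOutput("ri.models.main.model.0d3821ef-b692-4f40-953b-9570aa243eb4"),
)
def compute(model_output):
    now = str(time.time())
    # Stall long enough that the inference build can be launched while
    # this training build is still running.
    for _ in range(60):
        time.sleep(10)
    model_output.publish(
        model_adapter=ExampleModelAdapter(now),
        notes=now,
    )
```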
Yeah, I can give that a shot, but I’m confused: what are you expecting will happen if you have a training job running and launch an inference job while that is still running? It should resolve to the previous model version, not the version that is currently being produced (and if it doesn’t, then yes, there is something wrong and we can take a look at fixing it).