Huge difference between Live inference & Build with a model

Hello,

We are using a ‘large’ PyTorch model as part of our pipeline, and build times are very long (around 1 hour) just to run inference on 1 or 2 rows.
By contrast, with a live deployment on the same architecture (1x T4 GPU), the same inference takes a few dozen seconds.

Are there particular points to pay attention to in order to optimize ‘build inferences’?

Cheers,

~seconds vs. hours feels like you might be running inference on CPU rather than GPU… I’d check that the GPU is actually being used, with something like this:

import torch
import logging
from transforms.api import transform, Output, lightweight

@lightweight(gpu_type='NVIDIA_T4')  # request a T4 for this build
@transform(out=Output('/Project/folder/output'))
def compute(out):
    # Log whether PyTorch can see a CUDA device, and which device was allocated
    logging.info('CUDA available: %s', torch.cuda.is_available())
    if torch.cuda.is_available():
        logging.info('Device: %s', torch.cuda.get_device_name(0))

or even add a failure condition that aborts the build if the device name isn’t the expected GPU.
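
For instance, a rough sketch of that failure condition, reusing the placeholder output path from above (the exact string check is just illustrative):

import torch
import logging
from transforms.api import transform, Output, lightweight

@lightweight(gpu_type='NVIDIA_T4')
@transform(out=Output('/Project/folder/output'))
def compute(out):
    # Abort the build early if the allocated device is not the expected GPU
    if not torch.cuda.is_available():
        raise RuntimeError('CUDA is not available; inference would silently fall back to CPU')
    device_name = torch.cuda.get_device_name(0)
    logging.info('Running on: %s', device_name)
    if 'T4' not in device_name:
        raise RuntimeError(f'Expected an NVIDIA T4, got: {device_name}')

Failing fast like this is usually preferable to a silent CPU fallback, since a 1-hour build that quietly ran on CPU looks exactly like the symptom you’re describing.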