Hello,
We are using a ‘large’ PyTorch model as part of our pipeline, and builds take a long time (around 1 hour) just to run inference on 1 or 2 rows.
In contrast, a live deployment on the same architecture (1*T4 GPU) runs the same inference in a few dozen seconds.
Are there particular points to pay attention to in order to optimize ‘build inference’?
Cheers,
Seconds vs. hours feels like you might be running inference on CPU rather than GPU… I’d check that the GPU is actually being used, with something like this:
import logging

import torch
from transforms.api import transform, Output, lightweight

@lightweight(gpu_type='NVIDIA_T4')
@transform(out=Output('/Project/folder/output'))
def compute(out):
    # Logs the name of the CUDA device the build is actually running on.
    # Note: this call fails if no CUDA device is visible to PyTorch,
    # which is itself a useful signal.
    logging.info(torch.cuda.get_device_name(0))
or even add a failure condition that checks the device name is the expected GPU (see the sketch below).
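Building on that, here is a minimal sketch of such a hard failure, assuming the same lightweight transform setup as above; the output path, the 'T4' substring match, and the commented-out model-loading lines are placeholders rather than your actual pipeline code:

import logging

import torch
from transforms.api import transform, Output, lightweight

@lightweight(gpu_type='NVIDIA_T4')
@transform(out=Output('/Project/folder/output'))
def compute(out):
    # Fail the build outright if the expected GPU is not visible,
    # instead of silently falling back to CPU inference.
    if not torch.cuda.is_available():
        raise RuntimeError('CUDA not available: inference would run on CPU')
    device_name = torch.cuda.get_device_name(0)
    if 'T4' not in device_name:  # assumption: the driver reports something like 'Tesla T4'
        raise RuntimeError(f'Unexpected device: {device_name}')
    logging.info('Running on %s', device_name)

    # Once the device is confirmed, make sure the model and inputs are
    # actually moved onto it; a model left on CPU would explain the slowdown.
    device = torch.device('cuda')
    # model = load_model(...)           # placeholder for however you load your model
    # model = model.to(device).eval()
    # inputs = inputs.to(device)

If the device is present but the build is still slow, the usual culprit is that the model and input tensors were never moved onto it, which is why the .to(device) step matters.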