Thanks for raising this. Would you mind elaborating on which models specifically seem ‘off’ for you? Adding a tokens (k) per second column might be helpful:
| Model | Tokens (k) | Time (s) | Tokens (k)/s |
|---|---|---|---|
| GPT 5.1 | 5.5 | 15.0 | 0.37 |
| GPT 5 | 8.1 | 39.0 | 0.21 |
| GPT 5 mini | 6.8 | 43.0 | 0.16 |
| GPT 5 nano | 9.0 | 46.0 | 0.20 |
| GPT 5 codex | 7.0 | 28.0 | 0.25 |
| GPT 4.1 | 4.8 | 15.0 | 0.32 |
| GPT 4.1 mini | 4.7 | 18.0 | 0.26 |
| GPT 4.1 nano | 12.0 | 16.0 | 0.75 |
| Claude 4 Sonnet | 6.7 | 28.0 | 0.24 |
| Claude 4.1 Opus | 6.7 | 43.0 | 0.16 |
| Claude 4.5 Sonnet | 7.2 | 34.0 | 0.21 |
| Claude 4.5 Haiku | 6.6 | 27.0 | 0.24 |
| Gemini 2.5 Pro | 5.8 | 16.0 | 0.36 |
| Gemini 2.5 Flash | 6.1 | 11.0 | 0.55 |
| Gemini 2.5 Flash Lite | 5.7 | 7.0 | 0.81 |
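The tokens/s column can be reproduced directly from the first two columns. A minimal sketch in Python, using three rows copied from the table above:

```python
# Throughput = tokens generated / wall-clock time, using figures from the table.
runs = {
    "GPT 4.1 nano": (12.0, 16.0),          # (tokens in thousands, seconds)
    "Gemini 2.5 Flash Lite": (5.7, 7.0),
    "Claude 4.1 Opus": (6.7, 43.0),
}
for model, (tokens_k, seconds) in runs.items():
    print(f"{model}: {tokens_k / seconds:.2f} k tokens/s")
```

Rounded to two decimals, these reproduce the 0.75, 0.81, and 0.16 values in the last column.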
At first glance, the heavier-weight models (e.g., Claude 4.1 Opus) have a lower tokens/s than lighter-weight models (e.g., GPT 4.1 nano). Would you be able to call out which models specifically you are seeing as ‘odd’?
Larger models generally have more parameters and deeper architectures, which means each token requires more computation and inference time. Things like input/output complexity can also play a role in the response times for requests.
For pipelines, we recommend users choose models based on the specific use case—balancing speed, cost, and output quality. For high-throughput or latency-sensitive tasks, I lean toward the “Flash” or “Lite” variants, while for more complex reasoning, I’ll use the larger, more capable models. We recommend setting up a robust evaluation suite in AIP Evals wherever possible to help choose the ‘best’ model for each use case.
Thanks for the detailed response. I appreciate the tokens/s breakdown, but I’d like to highlight some concerning patterns we’ve observed in production that go beyond just throughput metrics.
1. Token Consumption Paradox
Our testing shows consistent patterns across identical inputs (3 runs each): smaller models actually consume MORE tokens than their larger counterparts. For example:
- GPT 4.1 nano: 12.0k tokens
- GPT 4.1: 4.8k tokens
- GPT 4.1 mini: 4.7k tokens
- GPT 5 mini: 6.8k tokens
- GPT 5 nano: 9.0k tokens
This contradicts the expected behavior where lighter models should be more efficient. Why would a “nano” model use 2.5x more tokens than the base model?
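The 2.5x figure follows directly from the table entries (assuming “tokens” here means the total tokens reported per run):

```python
# Ratio of GPT 4.1 nano's token usage to the GPT 4.1 base model's,
# using the 12.0k and 4.8k figures from the table.
nano_k, base_k = 12.0, 4.8
print(f"{nano_k / base_k:.1f}x more tokens")  # 2.5x more tokens
```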
2. Speed vs. Model Tier Mismatch
Except for Gemini models, we’re seeing smaller models performing slower than larger ones:
- GPT 4.1 mini (0.26 k tokens/s) is slower than GPT 4.1 (0.32 k tokens/s)
- GPT 5 mini/nano (0.16/0.20 k tokens/s) are both slower than GPT 5 (0.21 k tokens/s)
The pipeline documentation suggests faster/cheaper models for efficiency, but the data shows the opposite.
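The inversion is easy to confirm mechanically from the throughput column; a quick sanity check over the pairings above:

```python
# k tokens/s figures copied from the table; check that each smaller
# variant is slower than its base model.
throughput = {
    "GPT 4.1": 0.32, "GPT 4.1 mini": 0.26,
    "GPT 5": 0.21, "GPT 5 mini": 0.16, "GPT 5 nano": 0.20,
}
pairs = [("GPT 4.1 mini", "GPT 4.1"),
         ("GPT 5 mini", "GPT 5"),
         ("GPT 5 nano", "GPT 5")]
for small, base in pairs:
    assert throughput[small] < throughput[base], f"{small} is not slower than {base}"
print("every smaller GPT variant above is slower than its base model")
```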
Mini/nano models produce inconsistent output quality that makes them unsuitable for sophisticated production pipelines. We’ve relegated them to simple tasks like summarization or basic translation, but even there they’re not faster than the larger models, which means they’re just increasing our Palantir compute-second costs without providing any benefit.
> smaller models actually consume MORE tokens than their larger counterparts
Clarification question: are you using ‘consume’ here to mean output tokens generated?
> Why would a “nano” model use 2.5x more tokens than the base model?
My experience tells me that if we are indeed talking about ‘output tokens’ as the indicator of consumption/usage, then there are a few reasons why a lighter-weight model might output more tokens:
The nano model might require more verbose prompts or produce more verbose outputs to achieve similar quality.
Sometimes, smaller models need more context or step-by-step instructions, especially when reasoning is involved.
> smaller models performing slower than larger ones
Without knowing the exact prompts and configurations involved, this question is hard to answer definitively, but I would point to the reasons above as likely contributors to the surprising speed behavior.
> Mini/nano models produce inconsistent output quality that makes them unsuitable for sophisticated production pipelines.
We recommend setting up AIP Evals to help you determine which model is best for each use case. It sounds like you are already doing this or a similar evaluation analysis, which is great!