Pipeline LLM node token usage and compute-time issue

I’ve calculated the compute seconds and token usage in the LLM node across different models using the same prompts.

The results seem a bit odd when you consider the class/tier of each LLM model. Does this make sense to you, or have you noticed similar patterns?

Also, which model do you generally use for your pipelines?

| Model | Tokens (k) | Time (s) |
| --- | --- | --- |
| GPT 5.1 | 5.5 | 15.0 |
| GPT 5 | 8.1 | 39.0 |
| GPT 5 mini | 6.8 | 43.0 |
| GPT 5 nano | 9.0 | 46.0 |
| GPT 5 codex | 7.0 | 28.0 |
| GPT 4.1 | 4.8 | 15.0 |
| GPT 4.1 mini | 4.7 | 18.0 |
| GPT 4.1 nano | 12.0 | 16.0 |
| Claude 4 Sonnet | 6.7 | 28.0 |
| Claude 4.1 Opus | 6.7 | 43.0 |
| Claude 4.5 Sonnet | 7.2 | 34.0 |
| Claude 4.5 Haiku | 6.6 | 27.0 |
| Gemini 2.5 Pro | 5.8 | 16.0 |
| Gemini 2.5 Flash | 6.1 | 11.0 |
| Gemini 2.5 Flash Lite | 5.7 | 7.0 |

Hi @Jacob_SE :waving_hand:

Thanks for raising this. Would you mind elaborating on which models specifically seem ‘off’ to you? Adding a third column, (k) tokens/s, might be helpful:

| Model | Tokens (k) | Time (s) | (k) tokens/s |
| --- | --- | --- | --- |
| GPT 5.1 | 5.5 | 15.0 | 0.37 |
| GPT 5 | 8.1 | 39.0 | 0.21 |
| GPT 5 mini | 6.8 | 43.0 | 0.16 |
| GPT 5 nano | 9.0 | 46.0 | 0.20 |
| GPT 5 codex | 7.0 | 28.0 | 0.25 |
| GPT 4.1 | 4.8 | 15.0 | 0.32 |
| GPT 4.1 mini | 4.7 | 18.0 | 0.26 |
| GPT 4.1 nano | 12.0 | 16.0 | 0.75 |
| Claude 4 Sonnet | 6.7 | 28.0 | 0.24 |
| Claude 4.1 Opus | 6.7 | 43.0 | 0.16 |
| Claude 4.5 Sonnet | 7.2 | 34.0 | 0.21 |
| Claude 4.5 Haiku | 6.6 | 27.0 | 0.24 |
| Gemini 2.5 Pro | 5.8 | 16.0 | 0.36 |
| Gemini 2.5 Flash | 6.1 | 11.0 | 0.55 |
| Gemini 2.5 Flash Lite | 5.7 | 7.0 | 0.81 |
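If it helps sanity-check the numbers, the throughput column is just tokens divided by time. A minimal Python sketch (figures copied from the table above, a few representative rows only):

```python
# Derive (k) tokens/s from the measured token counts and wall-clock times.
measurements = {
    "GPT 5.1": (5.5, 15.0),            # (tokens in thousands, seconds)
    "GPT 4.1 nano": (12.0, 16.0),
    "Claude 4.1 Opus": (6.7, 43.0),
    "Gemini 2.5 Flash Lite": (5.7, 7.0),
}

for model, (tokens_k, seconds) in measurements.items():
    throughput = tokens_k / seconds  # thousand tokens per second
    print(f"{model}: {throughput:.2f} kt/s")
```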

At first glance, the heavier-weight models (e.g., Claude 4.1 Opus) have a lower TPS than lighter-weight models (e.g., GPT 4.1 nano). Would you be able to call out which models specifically you are seeing as ‘odd’?

Larger models generally have more parameters and deeper architectures, which means each token requires more computation and inference time. Things like input/output complexity can also play a role in the response times for requests.

For pipelines, we recommend users choose models based on the specific use case—balancing speed, cost, and output quality. For high-throughput or latency-sensitive tasks, I lean toward the “Flash” or “Lite” variants, while for more complex reasoning, I’ll use the larger, more capable models. We recommend setting up a robust evaluation suite in AIP Evals wherever possible to help choose the ‘best’ model for each use case.


Hi @Jim

Thanks for the detailed response. I appreciate the tokens/s breakdown, but I’d like to highlight some concerning patterns we’ve observed in production that go beyond just throughput metrics.

1. Token Consumption Paradox

Our testing shows consistent patterns across identical inputs (3 runs each): smaller models actually consume MORE tokens than their larger counterparts. For example:

  • GPT 4.1 nano: 12.0k tokens

  • GPT 4.1: 4.8k tokens

  • GPT 4.1 mini: 4.7k tokens

  • GPT 5 mini: 6.8k tokens

  • GPT 5 nano: 9.0k tokens

This contradicts the expected behavior where lighter models should be more efficient. Why would a “nano” model use 2.5x more tokens than the base model?

2. Speed vs. Model Tier Mismatch

Except for Gemini models, we’re seeing smaller models performing slower than larger ones:

  • GPT 4.1 mini (0.26 t/s) is slower than GPT 4.1 (0.32 t/s)

  • GPT 5 mini/nano (0.16/0.20 t/s) are slower than GPT 5 (0.21 t/s)

The pipeline documentation suggests faster/cheaper models for efficiency, but the data shows the opposite.

3. Limited Use Case Viability

Mini/nano models produce inconsistent output quality that makes them unsuitable for sophisticated production pipelines. We’ve relegated them to simple tasks like summarization or basic translation, but even there, they’re not faster than larger models—which means they’re just increasing our Palantir compute-sec costs without providing any benefit.
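To put a number on the cost point: since compute-sec billing scales with wall-clock time, an equal-or-slower "lighter" model costs at least as much per run. A back-of-the-envelope sketch, using a purely hypothetical flat rate (NOT actual Palantir pricing) and only the times from the table above:

```python
# Hypothetical per-compute-second rate, for illustration only.
RATE_PER_COMPUTE_SEC = 0.01

# Seconds per run, from the measurements above.
runs = {
    "GPT 4.1": 15.0,
    "GPT 4.1 nano": 16.0,
}

for model, seconds in runs.items():
    cost = seconds * RATE_PER_COMPUTE_SEC
    print(f"{model}: ~{cost:.2f} per run")
```

Under this (hypothetical) flat rate, the nano variant is slightly more expensive per run than the base model despite being the "lighter" tier.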

> smaller models actually consume MORE tokens than their larger counterparts

Clarification question: are you using ‘consume’ to indicate output tokens created?

> Why would a “nano” model use 2.5x more tokens than the base model?

My experience tells me that if we are indeed talking about ‘output tokens’ as an indicator of consumption/usage, then there could be a few reasons why a lighter-weight model would output more tokens:

  • The nano model might require more verbose prompts or produce more verbose outputs to achieve similar quality.
  • Sometimes, smaller models need more context or step-by-step instructions, especially when reasoning is involved.
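To settle the input-vs-output question directly, it's worth logging the two sides separately: most chat-style APIs report them in a usage payload. A minimal sketch (the field names follow the OpenAI-style usage object; the example numbers are made up):

```python
# Split a usage payload into prompt vs completion tokens so
# "consumption" can be attributed precisely. The dict shape mirrors
# OpenAI-style usage objects; adapt to whatever your LLM node logs.
def split_usage(usage: dict) -> tuple[int, int]:
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return prompt, completion

# Hypothetical example payload, for illustration only.
example = {"prompt_tokens": 3100, "completion_tokens": 8900, "total_tokens": 12000}
prompt, completion = split_usage(example)
print(f"input: {prompt}, output: {completion}")
```

If the completion side dominates for the nano model, that would support the "more verbose outputs" explanation above.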

> smaller models performing slower than larger ones

Without knowing the exact prompts and configurations involved, this question is difficult to answer, though I would highlight the points above as potential reasons for the surprising speed results.

> Mini/nano models produce inconsistent output quality that makes them unsuitable for sophisticated production pipelines.

We recommend setting up AIP Evaluations to help you determine which model is best for your use case! It sounds like you are doing this or a similar evaluation analysis already, which is great!
