TL;DR
Tokens per second measures the wrong thing for production AI agents. The right metric is loops completed per minute under realistic concurrency. A single agent command fans out into 8 to 12 internal API calls, and standard throughput metrics measure the calls, not the outcomes.
This post explains what to measure, why bursty agentic workloads break rate-limited APIs, and how reserved-throughput AI inference changes the curve.
8-12x
API calls per single user-visible agent command
15x
Infrastructure amplification factor at peak concurrency
p99
The latency percentile that defines agent loop speed
The metric nobody on your team can quote
Loops per minute is the throughput metric that matters for AI agents. Tokens per second is the metric AI inference vendors prefer to report.
Ask any AI team how their agentic workflow performs and you will get a flurry of numbers that do not add up to an answer. Tokens per second on the model. Time to first token at the endpoint. Requests per minute against the LLM API. Average response time in the dashboard. Each of these is a real metric, measured by a real tool, and none of them tells you what you actually want to know: how many useful agent loops did this system complete per minute, and how often did one fail in a way the user noticed?
This is the central measurement problem in production AI infrastructure today. The metrics inherited from the single-prompt era do not describe how AI agents actually run. An agent is not one request. It is a sequence of requests, tool calls, retries, and reasoning steps that compose into a single user-visible outcome.
What an agentic workflow actually is
An agentic workflow is a multi-step AI process where a model decides, mid-execution, what to do next. The common shape is simple: one human-visible input goes in, many model calls happen inside, and one human-visible output comes out.
The ratio of internal model calls to user-visible inputs depends on the workload. Simple chat sits at about 1:1. A coding agent typically operates at 5:1 to 15:1. A deep research agent can run at 50:1 or more, with one user query expanding into dozens of internal retrievals, summaries, and reasoning steps.
This ratio is what standard AI infrastructure metrics fail to capture. The model API sees 50 calls and reports its tokens-per-second number on each one. The user sees one outcome and waits for the whole loop to finish.
Why tokens per second is the wrong unit
Tokens per second measures model generation speed in isolation. In a production agentic workflow, generation is only one component of loop time, and the token rate conceals the factors that determine real performance.
| What is hidden | Why it matters | Impact |
|---|---|---|
| Queue time | Router wait before generation starts is invisible to model throughput metrics. | Critical |
| Tool latency | Search APIs, database lookups, and embedding retrievals run while the model is idle. | High |
| Cold-start tax | The first call pays cold-start cost while token rate measures warm steady state only. | High |
| Failed and retried steps | Each retry degrades loop completion time but disappears inside token throughput. | Critical |
| Thinking vs useful tokens | Reasoning models spend thousands of tokens on internal work users never see. | Medium |
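As a rough illustration of the table above, the sketch below splits one loop's wall-clock time into queue, generation, tool, and retry time. The `StepTiming` type, the `loop_breakdown` function, and every number in the example are hypothetical, chosen only to show how a healthy model token rate can coexist with a much lower effective rate per loop.

```python
from dataclasses import dataclass


@dataclass
class StepTiming:
    """Timing for one model call inside a loop (hypothetical schema, seconds)."""
    queue_s: float          # wait before generation starts
    generate_s: float       # time the model spends emitting tokens
    tool_s: float           # tool work tied to this step (search, DB, retrieval)
    retry_s: float = 0.0    # time lost to failed attempts and backoff
    output_tokens: int = 0


def loop_breakdown(steps: list[StepTiming]) -> dict[str, float]:
    """Compare the token rate the model reports with the rate the loop delivers.

    Assumes at least one step with non-zero generation time.
    """
    total = sum(s.queue_s + s.generate_s + s.tool_s + s.retry_s for s in steps)
    generate = sum(s.generate_s for s in steps)
    tokens = sum(s.output_tokens for s in steps)
    return {
        "loop_seconds": total,
        "generation_share": generate / total,
        "model_tokens_per_s": tokens / generate,    # what a vendor dashboard shows
        "effective_tokens_per_s": tokens / total,   # what the loop actually gets
    }


# Illustrative numbers only: a two-step loop where the second step is retried once.
steps = [
    StepTiming(queue_s=0.5, generate_s=2.0, tool_s=1.0, output_tokens=200),
    StepTiming(queue_s=0.5, generate_s=2.0, tool_s=1.5, retry_s=2.5, output_tokens=200),
]
report = loop_breakdown(steps)
print(report)
```

In this made-up loop the model generates at 100 tokens per second, yet generation accounts for only 40% of the 10-second loop, so the loop as a whole delivers 40 tokens per second.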
The three units that actually matter
The right throughput stack has three metrics for any production AI inference platform.
01
Loops completed per minute
The user-visible unit. It counts how many full agent runs finished in the measurement window, regardless of how many internal calls each loop made.
02
Steps per loop
The structural unit. A workflow needing 15 steps to do what could be done in 7 is wasting throughput on its own architecture.
03
Tail of step latency
The slowest steps in an agentic loop determine perceived speed because loops are blocking and the slowest step gates everything after it.
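The three units can be derived from the same set of loop records. The sketch below is one possible way to compute them, assuming a hypothetical `LoopRecord` that stores when each loop finished and how long each internal step took. The nearest-rank percentile is a simplification chosen for readability, not a recommendation over other percentile methods.

```python
import math
from dataclasses import dataclass


@dataclass
class LoopRecord:
    """One completed agent loop (hypothetical schema)."""
    finished_s: float               # completion time, seconds since the window origin
    step_latencies_s: list[float]   # latency of each internal call in the loop


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput_stack(loops: list[LoopRecord], window_start_s: float, window_end_s: float) -> dict[str, float]:
    """Loops per minute, steps per loop, and step-latency tail for one window."""
    in_window = [l for l in loops if window_start_s <= l.finished_s < window_end_s]
    if not in_window:
        return {"loops_per_min": 0.0, "steps_per_loop": 0.0, "step_p99_s": 0.0}
    steps = [lat for l in in_window for lat in l.step_latencies_s]
    minutes = (window_end_s - window_start_s) / 60
    return {
        "loops_per_min": len(in_window) / minutes,
        "steps_per_loop": len(steps) / len(in_window),
        "step_p99_s": percentile(steps, 99),
    }


# Illustrative data: three loops finish inside a two-minute window, one just after it.
loops = [
    LoopRecord(finished_s=30.0, step_latencies_s=[1.0, 1.0, 1.0, 4.0]),
    LoopRecord(finished_s=75.0, step_latencies_s=[1.0, 1.0]),
    LoopRecord(finished_s=110.0, step_latencies_s=[2.0, 1.0, 1.0]),
    LoopRecord(finished_s=130.0, step_latencies_s=[9.0]),
]
stack = throughput_stack(loops, window_start_s=0.0, window_end_s=120.0)
print(stack)
```

Note that the loop count ignores how many calls each loop made, which is the point of the first unit: a loop with two calls and a loop with four count equally.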
Where bursty workloads collapse the standard model
Agent loops fire bursts of parallel calls. Rate-limited APIs serialize them. That is where throughput collapses.
Modern agentic frameworks parallelize wherever they can. A research agent reading 10 documents fires 10 embedding calls at once, then 10 retrieval calls, then a single synthesis call. This burst is not pathological. It is the correct design.
Against a rate-limited API, the burst plays out differently. The first 8 calls go through. The 9th and 10th hit a 429. The agent retries. The retries hit the same rate limit. Exponential backoff begins. By the time all 10 complete, the loop can have spent 45 seconds doing what should have taken 3.
Concurrency is the honest measurement axis for AI inference scalability, not tokens per second. A reserved lane offering 10 concurrent connections handles a 10-call burst in one shot. A token-priced API with the same nominal throughput may not, because request count matters separately from token count.
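The mechanism can be reproduced with a toy simulation. The sketch below is a deliberately simplified model, not a description of any real provider: an in-flight cap stands in for the rate limit, every call takes the same time, and rejected calls retry with jitter-free exponential backoff. The function name and all parameters are assumptions made for illustration.

```python
import heapq


def simulate_burst(calls: int, max_in_flight: int, call_s: float = 3.0,
                   backoff_base_s: float = 1.0) -> tuple[float, int]:
    """Return (seconds until the whole burst completes, count of 429-style rejections).

    Toy model: all calls arrive at t=0, each admitted call runs for `call_s`,
    and a rejected call waits backoff_base_s * 2**attempt before trying again.
    """
    # Each pending entry is (attempt time, call id, attempts already made).
    pending = [(0.0, call_id, 0) for call_id in range(calls)]
    heapq.heapify(pending)
    finish_times: list[float] = []
    rejections = 0
    while pending:
        now, call_id, attempt = heapq.heappop(pending)
        in_flight = sum(1 for t in finish_times if t > now)
        if in_flight < max_in_flight:
            finish_times.append(now + call_s)       # admitted
        else:
            rejections += 1                         # rejected, schedule a retry
            delay = backoff_base_s * 2 ** attempt
            heapq.heappush(pending, (now + delay, call_id, attempt + 1))
    return (max(finish_times) if finish_times else 0.0), rejections


print(simulate_burst(calls=10, max_in_flight=10))  # full burst fits in one shot
print(simulate_burst(calls=10, max_in_flight=8))   # two calls are pushed into backoff
```

Even in this mild setup, two rejected calls double the burst's completion time from 3 to 6 seconds. Real rate limits with longer windows, jittered backoff, and competing tenants can stretch the gap much further.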
The amplification trap
Agentic workflows turn modest user traffic into massive AI infrastructure traffic. Most metrics measure only one side of the multiplier.
This amplification is invisible at the product layer. The dashboard shows 1,000 users. The AI inference layer sees 15,000 requests. Token-per-second metrics are reported at the layer that sees the 15,000, with no link back to the 1,000 that actually matter.
The product feels broken at exactly the moment the founder wants it to feel fast.
5-15x
Typical call amplification per agent loop
15,000
Infrastructure requests/min for 1,000 users at 15x amplification
30x
Infrastructure requests relative to baseline users after a 2x user spike at 15x amplification
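The arithmetic behind these figures is a single multiplication. The sketch below assumes each active user starts one loop per minute, which is what the 1,000-users-to-15,000-requests figure implies. The function name and its parameters are illustrative.

```python
def infrastructure_requests_per_min(active_users: int, calls_per_loop: int,
                                    loops_per_user_per_min: float = 1.0) -> float:
    """Requests the inference layer sees per minute for a given user load."""
    return active_users * loops_per_user_per_min * calls_per_loop


baseline = infrastructure_requests_per_min(active_users=1_000, calls_per_loop=15)
spike = infrastructure_requests_per_min(active_users=2_000, calls_per_loop=15)

print(baseline)          # requests per minute at normal load
print(spike)             # requests per minute after users double
print(spike / 1_000)     # requests per minute for every baseline user
```

The product dashboard registers the spike as a doubling of users. The inference layer registers 15,000 additional requests per minute against the same quota.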
How reserved throughput changes the curve
Reserved-throughput AI inference removes the shared-capacity bottleneck. The metrics that move are the ones users actually feel.
| Metric | Shared / rate-limited API | Reserved throughput |
|---|---|---|
| Concurrent burst handling | Queued requests and 429 errors | Full burst in parallel |
| Tail latency spread | Wide and unpredictable | p99 close to p50 |
| Cold-start penalty | High on first burst | Reserved lane stays warm |
| Retry rate from capacity errors | High under load | Near zero |
| Cost predictability | Variable per-token drift | Flat rate and forecastable |
| Amplification handling | Quota exhaustion at 15x | Concurrency-bounded plan |
| Vendor lock-in risk | API-specific formats | OpenAI-compatible drop-in |
A measurement protocol for your own agentic workflow
This protocol takes one to two days of instrumentation and produces the only numbers worth comparing across AI inference providers.
01
Instrument loop completion time
Wrap your agent's outer entry point with a timer that starts when the user message arrives and stops when the user-visible output is ready.
02
Count internal steps per loop
Track the distribution. If the average loop makes 7 calls but the p95 loop makes 22, you have a workflow design problem hiding behind infrastructure metrics.
03
Measure step latency at p50, p95, and p99
Use p99 to estimate a pessimistic loop time. If p99 step latency is 4 seconds and the average loop has 7 sequential steps, a loop in which every step lands at p99 takes about 28 seconds (7 × 4).
04
Track retry rate per loop
A workflow with a 12% retry rate pays a hidden tax in both latency and token cost. Above 5% is a warning. Above 10% is a fire.
05
Compare warm-loop vs cold-loop throughput
Run the same workflow against a freshly idle endpoint and against an endpoint under sustained load. A significant difference reveals a cold-start problem.
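The first four steps of the protocol can share one small piece of instrumentation. The sketch below is a minimal version under stated assumptions: `LoopTrace`, `traced_loop`, `retry_rate`, and `worst_case_loop_s` are hypothetical names, the agent calls `record_retry()` itself whenever it re-issues a failed call, and retry rate is defined here as retries divided by total internal calls, which is one reasonable reading of "retry rate per loop".

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class LoopTrace:
    """Raw measurements for one agent loop."""
    loop_seconds: float = 0.0
    step_latencies_s: list[float] = field(default_factory=list)
    retries: int = 0

    @contextmanager
    def step(self, clock=time.perf_counter):
        """Time one internal call (model or tool) and record its latency."""
        start = clock()
        try:
            yield
        finally:
            self.step_latencies_s.append(clock() - start)

    def record_retry(self) -> None:
        """Call this whenever the agent re-issues a failed call."""
        self.retries += 1


@contextmanager
def traced_loop(clock=time.perf_counter):
    """Wrap the agent's outer entry point: user message in, user-visible output out."""
    trace = LoopTrace()
    start = clock()
    try:
        yield trace
    finally:
        trace.loop_seconds = clock() - start


def retry_rate(traces: list[LoopTrace]) -> float:
    """Retries as a fraction of all internal calls (an assumed definition)."""
    calls = sum(len(t.step_latencies_s) for t in traces)
    retries = sum(t.retries for t in traces)
    return retries / calls if calls else 0.0


def worst_case_loop_s(p99_step_s: float, steps_per_loop: int) -> float:
    """Pessimistic loop time if every sequential step lands at p99."""
    return p99_step_s * steps_per_loop


# Usage: three internal calls, one of which was a retry of a failed call.
with traced_loop() as trace:
    for _ in range(3):
        with trace.step():
            pass  # replace with a real model or tool call
    trace.record_retry()

print(len(trace.step_latencies_s), trace.retries, retry_rate([trace]))
print(worst_case_loop_s(p99_step_s=4, steps_per_loop=7))
```

Because each trace keeps raw per-step latencies rather than pre-aggregated averages, the p50, p95, and p99 figures from step 03 can be computed later over any window. The same traces collected against an idle endpoint and a loaded one give the comparison in step 05.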
The new metric stack
The production metric stack for AI agents should connect user-visible outcomes, workflow shape, infrastructure quality, reliability, concurrency, and economics.
| Metric | What it measures | Threshold | Signal |
|---|---|---|---|
| Loops/min | User-visible throughput | Business KPI | North star |
| Steps/loop p50 and p95 | Workflow efficiency | p95 below 2x p50 | Trending up means design debt |
| Step latency p50/p99 | Infrastructure quality | p99 below 2x p50 | Widening means capacity issue |
| Retry rate/loop | Reliability | Above 5% warn | Above 10% critical |
| Concurrent calls at peak | Amplification factor | Size reserved plan | True concurrency need |
| Loops per dollar | Economic efficiency | Predictable on flat rate | Unstable on per-token |
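The thresholds in the table translate directly into an automated check. The sketch below is one possible encoding, with two assumptions worth flagging: the latency threshold is read as p99 staying below twice p50, and the metric names in the input dictionary are invented for this example. The sample values are illustrative.

```python
def evaluate_metric_stack(m: dict[str, float]) -> list[str]:
    """Return a list of findings for metrics that breach the table's thresholds."""
    findings = []
    if m["steps_p95"] >= 2 * m["steps_p50"]:
        findings.append("design debt: steps/loop p95 is at least 2x p50")
    if m["latency_p99_s"] >= 2 * m["latency_p50_s"]:
        findings.append("capacity issue: step latency p99 is at least 2x p50")
    if m["retry_rate"] > 0.10:
        findings.append("critical: retry rate above 10%")
    elif m["retry_rate"] > 0.05:
        findings.append("warning: retry rate above 5%")
    return findings


# Illustrative measurements for one workflow.
measured = {
    "steps_p50": 7,
    "steps_p95": 22,
    "latency_p50_s": 1.5,
    "latency_p99_s": 4.0,
    "retry_rate": 0.12,
}
findings = evaluate_metric_stack(measured)
for line in findings:
    print(line)
```

Loops per minute and loops per dollar are left out of the check on purpose: the table treats them as business targets rather than fixed thresholds.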
How to measure agentic workflow throughput in 2026
The problem: standard AI metrics such as tokens per second, average response time, and requests per minute measure individual model calls, not complete user-visible outcomes. Production AI agents amplify single user commands into 8 to 12 internal API calls, and up to 15 at peak concurrency, making call-level metrics structurally misleading.
The solution: measure loops completed per minute for user-visible throughput, steps per loop at p95 for workflow efficiency, and step latency at p99 for infrastructure quality. Track retry rate per loop; above 5% is a warning, and above 10% is critical.
The infrastructure fix: reserved-throughput AI inference eliminates rate-limit-driven retries, compresses tail latency, and handles concurrent bursts without queue-based serialization. It changes the binding constraint from infrastructure ceiling to workflow design, which is the constraint teams can actually optimize.
Run your agentic workloads on reserved throughput
No shared quotas. No burst penalties. OpenAI-compatible endpoint. Measure your loops-per-minute improvement from day one.
OpenBandwidth plans are built around flat-rate reserved throughput: Starter $20/month, Pro $40/month, and Team $90/month. Launch offer: 20% off for the first three months.
FAQ: Why is tokens per second the wrong throughput metric for AI agents?
Agentic workflows complete one user-visible task using many internal model calls. Tokens per second measures the calls, not the tasks. A loop of fifteen calls at a high token rate and a loop of fifteen calls at a low token rate deliver the same loops-per-minute throughput if their total wall-clock time is equal, for example when the faster model loses its advantage to queueing and retries.
FAQ: How many concurrent connections does a typical AI agent need?
A simple chat assistant needs one. A coding agent doing parallel file analysis needs 5 to 10. A research agent doing parallel document retrieval needs 10 to 20. Sizing reserved AI capacity for agentic workloads is a concurrency question, not a token question.
FAQ: Does reserved capacity make every agentic workflow faster?
Reserved capacity removes the throughput limits imposed by shared AI infrastructure. It does not fix workflows that are slow because of their own design. A 50-step loop will still take longer than a 7-step loop on any infrastructure.
FAQ: How do I compare AI providers fairly for agentic workloads?
Run the same agentic workflow against each provider with identical prompts and tools, and measure loops per minute under realistic concurrency. Drop-in OpenAI-compatible endpoint testing through Claude Code, OpenClaw, or OpenCode makes the comparison a one-environment-variable change.
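A comparison harness for this can be small. The sketch below assumes a hypothetical async callable `run_loop` that executes one full agent loop against whichever endpoint the environment points to. It calls no real provider SDK, and the stand-in loop in the usage example only sleeps.

```python
import asyncio
import time


async def measure_loops_per_min(run_loop, total_loops: int, concurrency: int):
    """Run `total_loops` agent loops with at most `concurrency` in flight.

    Returns (completed, failed, loops_per_min). `run_loop` is an assumed
    zero-argument coroutine function that raises on a user-visible failure.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def one_loop() -> bool:
        async with semaphore:
            try:
                await run_loop()
                return True
            except Exception:
                return False

    start = time.perf_counter()
    results = await asyncio.gather(*(one_loop() for _ in range(total_loops)))
    elapsed = time.perf_counter() - start
    completed = sum(results)
    return completed, total_loops - completed, completed * 60 / elapsed


async def fake_loop() -> None:
    """Stand-in for a real agent loop so the sketch runs without a provider."""
    await asyncio.sleep(0.01)


completed, failed, loops_per_min = asyncio.run(
    measure_loops_per_min(fake_loop, total_loops=6, concurrency=3)
)
print(completed, failed, round(loops_per_min))
```

Running the same harness once per provider, with identical `total_loops` and `concurrency`, keeps the comparison at the loop level rather than the call level.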
FAQ: What is the amplification trap in agentic AI infrastructure?
Agentic workflows multiply every user request into 5 to 15 internal API calls. At 15x amplification, 1,000 active users produce 15,000 infrastructure requests per minute. A 2x user surge pushes that to 30,000 requests per minute, 30 requests for every user in the original baseline, all drawn from the same AI API quota.
FAQ: What retry rate is acceptable for production agentic workflows?
Above 5% is a warning. Above 10% is critical. If retries are driven by rate limits or capacity errors, the fix is reserved AI capacity. If retries are driven by model output failures, the fix is the prompt or output validator.
FAQ: Is loops per minute really the right north-star metric for AI agents?
It is the metric most correlated with what users perceive. If your product succeeds when users accomplish tasks efficiently, loops per minute aligns with your product success. Tokens per second aligns with vendor upsell.