The token bill nobody puts on the slide
You’re sold AI by the token. You pay for it by the task. One popular agent burns ~13,900 tokens of overhead before you type a word — and its own users have the receipts.

Exhibit A is documented and public: Nous Research’s Hermes Agent re-sends a fixed payload of roughly 13,900 tokens (tool definitions, system prompt, and a skills index loaded “regardless of need”) on every call, before you type a word. Its own users measured 73% of each API call as fixed overhead, and logged a 12-hour session that burned 2.6M tokens with 69% lost to context replay: “an entire monthly API budget in one sitting.” The benchmark is the ad. The token bill is the product. Stop optimizing cost-per-token and start measuring cost-per-task.
Every AI invoice tells the same lie by omission. It quotes a price per token — small, tidy, reassuring, the number that goes on the slide. It does not quote the one that actually empties the account: how many tokens the task really burned to get to an answer. For the gap between those two figures, you don’t need my opinion. You can read the receipts.
Exhibit A: an agent that re-reads its manual before every reply
Take Hermes Agent, Nous Research’s popular open agent framework. In a token-overhead analysis its own community filed, a contributor measured that 73% of every API call is fixed overhead — about 13,900 tokens of tool definitions, system prompt and a skills index loaded “regardless of need,” all of it, in the issue’s words, “paid before any conversation content is processed.” Tool definitions alone: 8,759 tokens. The system prompt, dragging along the agent’s SOUL.md and skills catalog: another 5,176. Ask Hermes the most trivial question — are you even running? — and you still pay the full freight, because the whole manual gets read aloud before every single reply.
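The arithmetic behind those figures is easy to check. The sketch below adds the two quoted components and, assuming the 73% share and the fixed payload describe the same typical call (an assumption, not something the analysis states in these words), backs out how large that call would be and how little of it is actual conversation.

```python
# Figures quoted from the community analysis cited above.
TOOL_DEFINITION_TOKENS = 8_759
SYSTEM_PROMPT_TOKENS = 5_176
OVERHEAD_SHARE = 0.73  # fraction of each call reported as fixed overhead

fixed_overhead = TOOL_DEFINITION_TOKENS + SYSTEM_PROMPT_TOKENS

# Assumption: the 73% share and the fixed payload describe the same typical
# call, so the implied call size is overhead divided by share.
implied_call_tokens = round(fixed_overhead / OVERHEAD_SHARE)
implied_content_tokens = implied_call_tokens - fixed_overhead

print(f"fixed overhead:          {fixed_overhead:,} tokens")
print(f"implied call size:       {implied_call_tokens:,} tokens")
print(f"left for the actual ask: {implied_content_tokens:,} tokens")
```

On those numbers, a call of about 19,000 tokens carries only about 5,000 tokens of anything the user actually said.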
And it compounds. In a field report from heavy production use, a developer logged a 12-hour session that consumed 2.6 million tokens, 69% of it lost to context replay — fragmented sessions silently re-sending the entire history as fresh input. One conversation: 1.9M tokens spent against the ~190K actually needed. That is 89% waste. In the author’s own words, it turned “a productive 12-hour workday into burning through an entire monthly API budget in one sitting.”
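A rough check of the replay numbers, using only the rounded figures quoted above. The helper name `wasted_share` is mine, for illustration. Because the inputs are rounded, the result lands at about 90%, in line with the 89% the report cites.

```python
def wasted_share(tokens_spent: int, tokens_needed: int) -> float:
    """Fraction of the spend that did no new work."""
    if tokens_spent <= 0:
        raise ValueError("tokens_spent must be positive")
    return 1 - tokens_needed / tokens_spent


# One conversation: ~1.9M tokens spent against ~190K needed (rounded figures).
conversation_waste = wasted_share(1_900_000, 190_000)

# Whole session: 2.6M tokens, 69% of it attributed to context replay.
session_replay_tokens = round(2_600_000 * 0.69)

print(f"conversation waste: {conversation_waste:.0%}")
print(f"session tokens lost to replay: {session_replay_tokens:,}")
```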
I’m not theorizing from a blog post. We run a stack of these models and agents in production and pay the bill at the end of every month — so this is the meter we actually read, and Hermes isn’t an outlier. It is just unusually well-documented, because its users did the math in public. Every agent framework has a version of this preamble tax; most of them simply haven’t been measured out loud yet.
The burn you don’t see
Two things balloon the bill, and neither shows up in the demo. First, the preamble: every step of an agent loop re-sends the full system prompt and tool definitions — the ~13,900-token freight Hermes’ users clocked — so a task that takes ten steps pays that tax ten times over. Second, reasoning: the models winning the leaderboards “think” in tokens you never see but absolutely pay for, so a short visible answer can sit on top of a long, billed, invisible monologue. You didn’t buy a chatbot that gives clever answers. You bought a process that re-reads the manual and mutters to itself before every move — and pays by the word for both.
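To put a number on the preamble tax, here is a minimal model of an agent loop, assuming every step re-sends the fixed preamble plus the full history so far. The 500 new tokens per step is a made-up figure for illustration, and hidden reasoning tokens are left out entirely, so real bills would run higher.

```python
PREAMBLE_TOKENS = 13_900  # approximate fixed payload reported for Hermes Agent


def loop_input_tokens(steps: int, new_tokens_per_step: int,
                      preamble_tokens: int = PREAMBLE_TOKENS) -> dict:
    """Estimate billed input tokens for a loop that re-sends everything each step."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # The preamble is paid once per step.
    preamble_total = preamble_tokens * steps
    # Step k re-sends the content of steps 1..k, so history grows as a
    # triangular number: 1 + 2 + ... + steps.
    history_total = new_tokens_per_step * steps * (steps + 1) // 2
    total = preamble_total + history_total
    return {
        "preamble": preamble_total,
        "history": history_total,
        "total": total,
        "preamble_share": preamble_total / total,
    }


# Hypothetical ten-step task with 500 new tokens of content per step.
ten_step_task = loop_input_tokens(steps=10, new_tokens_per_step=500)
print(ten_step_task)
```

Under these assumptions the ten-step task bills 166,500 input tokens, of which 139,000 are the same preamble sent ten times.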
Cost-per-token is a vanity metric
I’ve bought media my whole career, so this pattern is an old friend. A low price per token is a cheap CPM on a channel nobody converts on. It looks efficient on the plan and does nothing for the business. The only number that has ever mattered is the effective one: cost per actual outcome. In media that’s cost per acquisition. In AI it’s cost per completed task.
The benchmark is the ad. The token bill is the product.
Quote me a per-token rate and you’ve told me the rate card. Tell me what it costs to reliably finish one real unit of work — the retrieval, the reasoning, the retries, the tool calls, the whole loop — and now you’ve told me whether this thing makes money. Almost nobody is measuring the second number, which is exactly why almost nobody knows their AI is underwater until the quarterly bill lands.
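One way to compute that second number, as a sketch: add up the billed tokens of every attempt, failed ones included, and divide by the tasks that actually finished. The `Attempt` record, the token counts, and the prices ($3 and $15 per million tokens) are all hypothetical placeholders. Substitute your own logs and rate card.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    input_tokens: int
    output_tokens: int  # should include any billed reasoning tokens
    succeeded: bool


def cost_per_completed_task(attempts: List[Attempt],
                            input_price_per_m: float,
                            output_price_per_m: float) -> float:
    """Total spend across all attempts divided by the number of completed tasks."""
    total_cost = sum(
        a.input_tokens * input_price_per_m + a.output_tokens * output_price_per_m
        for a in attempts
    ) / 1_000_000
    completed = sum(1 for a in attempts if a.succeeded)
    if completed == 0:
        return float("inf")  # money spent, nothing finished
    return total_cost / completed


# Made-up log: two completed tasks, one of which needed a failed attempt first.
log = [
    Attempt(input_tokens=166_500, output_tokens=4_000, succeeded=False),
    Attempt(input_tokens=166_500, output_tokens=4_000, succeeded=True),
    Attempt(input_tokens=80_000, output_tokens=2_000, succeeded=True),
]
cost = cost_per_completed_task(log, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"cost per completed task: ${cost:.4f}")
```

The failed attempt never shows up in a per-token price, but it is a third of the tokens in this example.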
You’re paying a luxury markup for a tie
Here’s the part that should sting. On the hardest public benchmarks the frontier models now cluster within a few points of each other — the top of the table is effectively a tie. Yet their published prices differ by an order of magnitude. So the same task, routed to the trophy model instead of a capable cheaper one, can cost many times more for a benchmark edge that is mostly noise. You are paying a luxury markup to win a photo finish that doesn’t change the outcome of your task.
Meanwhile the agent hype is running well ahead of the agent reality. Gartner’s read: of the thousands of vendors describing themselves as “agentic,” only around 130 actually are; the rest is “agent washing.” It also projects that more than 40% of agentic-AI projects will be cancelled by the end of 2027. The killer in those post-mortems is rarely capability. It’s cost and governance: the token bill nobody put on the slide, finally read out loud.
The case for the premium model — and against using it for everything
The honest counter: sometimes the expensive model genuinely earns it. There’s a hard 15% of any real workload — the gnarly edge cases — where the cheaper model fails and the frontier one doesn’t, and there the premium is the cheapest thing you can buy. Completely true. That’s the entire point. Route the hard 15% to the expensive model and the boring 85% to the cheap one, and your cost-per-task collapses. Most stacks route 100% of traffic to the trophy model — not because the work demands it, but because nobody measured which work did.
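A back-of-envelope version of that split, assuming a premium model that costs ten times as much per task as the cheap one (hypothetical prices of $1.00 and $0.10) and a router that sends only the hard 15% to it. The sketch also assumes the router classifies tasks perfectly, which a real one will not.

```python
def blended_cost_per_task(hard_share: float, premium_cost: float,
                          cheap_cost: float) -> float:
    """Average cost per task when only the hard share goes to the premium model."""
    if not 0.0 <= hard_share <= 1.0:
        raise ValueError("hard_share must be between 0 and 1")
    return hard_share * premium_cost + (1.0 - hard_share) * cheap_cost


PREMIUM_COST = 1.00  # hypothetical dollars per task on the frontier model
CHEAP_COST = 0.10    # hypothetical dollars per task on the cheaper model

all_premium = blended_cost_per_task(1.0, PREMIUM_COST, CHEAP_COST)
routed = blended_cost_per_task(0.15, PREMIUM_COST, CHEAP_COST)
saving = 1 - routed / all_premium

print(f"everything on premium: ${all_premium:.3f} per task")
print(f"15/85 routing:         ${routed:.3f} per task")
print(f"saving:                {saving:.1%}")
```

With those placeholder prices, routing cuts the average cost per task from $1.00 to about $0.235, with no loss on the hard cases.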
Read the meter, own the routing
So the edge in this phase isn’t having the biggest model. It’s the system around it: the layer that routes the right model to the right task, retries intelligently, caches what it can, and logs what every call actually cost. That’s unglamorous engineering — routing, evaluation, governance, owned infrastructure — and it’s exactly the kind of build work that’s abundant where the industry isn’t looking. A team that owns its routing and reads its own meter will out-operate one that rents a trophy model and never checks the bill. Pair that discipline with cheap power and deep build talent, and you don’t compete on whose model is loudest — you win on unit economics, which is the only contest that pays.
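As an illustration of what that layer does, here is a deliberately tiny sketch: try the cheap model first, escalate to the premium one only on failure, cache answers, and write every call to a ledger. The `MeteredRouter` class, the `ModelCall` signature, and the stub models with their costs are all hypothetical, not any framework’s API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: a model call returns (answer, or None on failure; cost in dollars).
ModelCall = Callable[[str], Tuple[Optional[str], float]]


class MeteredRouter:
    def __init__(self, cheap: ModelCall, premium: ModelCall) -> None:
        self.cheap = cheap
        self.premium = premium
        self.cache: Dict[str, str] = {}
        self.ledger: List[Tuple[str, str, float]] = []  # (task, model, cost)

    def run(self, task: str) -> Optional[str]:
        # Serve repeats from the cache at zero cost, but still log them.
        if task in self.cache:
            self.ledger.append((task, "cache", 0.0))
            return self.cache[task]
        # Cheap model first; escalate only if it fails.
        for name, call in (("cheap", self.cheap), ("premium", self.premium)):
            answer, cost = call(task)
            self.ledger.append((task, name, cost))
            if answer is not None:
                self.cache[task] = answer
                return answer
        return None

    def total_cost(self) -> float:
        return sum(cost for _, _, cost in self.ledger)


# Stub models for the demo: the cheap one fails on anything marked "hard".
def cheap_model(task: str) -> Tuple[Optional[str], float]:
    return (None, 0.01) if "hard" in task else ("ok: " + task, 0.01)


def premium_model(task: str) -> Tuple[Optional[str], float]:
    return ("ok: " + task, 0.10)


router = MeteredRouter(cheap_model, premium_model)
for task in ("easy question", "hard question", "easy question"):
    router.run(task)

for entry in router.ledger:
    print(entry)
print(f"total: ${router.total_cost():.2f}")
```

The ledger is the point: once every call is logged with its model and cost, cost-per-task stops being a guess.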
Stop buying ad space on a leaderboard. Measure the task, route the model, read the meter. The token bill nobody puts on the slide is the only number that decides whether your AI makes money — so make it the first one you look at, not the last.