
Jonathan Barazany

Originally published at barazany.dev

The 5-Hour Quota, Boris's Tweet, and What the Source Code Actually Reveals

Yesterday I published a deep dive into Claude Code's compaction engine. At the end, I made a promise: go deeper on the caching optimizations that happen outside of compaction.

Actually, the caching rabbit hole started before that post - with a tweet from about ten days ago.

The Tweet That Confused Me

If you're a heavy Claude Code user, you felt the 5-hour usage cap snap shut after Anthropic's two-week promotional window closed. The complaints flooded in. Someone tagged Boris - the engineer who built Claude Code - asking what he planned to do about it.

His answer: improvements are coming to squeeze more out of the current quota.

My first reaction: what can he possibly do? The quota is server-side. It's rate limits and token budgets. There's no client trick that changes how many tokens you're allowed per hour.

That question sat with me. Then yesterday's compaction post led me to look harder at the source - and the answer became obvious.

Cache Hit Ratio Is the Quota

Every message you send to Claude Code costs tokens. But tokens aren't priced flat. Cache reads are billed at roughly a tenth of the base input rate. Cache writes cost 1.25x - on a miss, you're not just paying full price, you're paying a penalty to re-cache.

If your cache hit ratio is high, you stretch the same quota dramatically further than someone whose cache keeps busting. The quota doesn't change. What you extract from it does.
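To make that concrete, here's back-of-envelope math. The 0.1x cache-read and 1.25x cache-write multipliers are Anthropic's published prompt-caching rates; everything else is an illustrative model, not Claude Code's actual accounting:

```python
CACHE_READ_MULT = 0.10   # cache hit: ~10% of the base input price
CACHE_WRITE_MULT = 1.25  # cache miss: re-cached at a 25% premium

def effective_multiplier(hit_ratio: float) -> float:
    """Average cost multiplier per input token at a given cache hit ratio."""
    return hit_ratio * CACHE_READ_MULT + (1 - hit_ratio) * CACHE_WRITE_MULT

def quota_stretch(good_ratio: float, bad_ratio: float) -> float:
    """How much further the same token budget goes for the luckier user."""
    return effective_multiplier(bad_ratio) / effective_multiplier(good_ratio)

if __name__ == "__main__":
    # A 90% hit ratio vs. a 30% hit ratio on the same quota:
    print(round(quota_stretch(0.9, 0.3), 1))  # ~4.2x more mileage
```

Same server-side limit, wildly different mileage - which is exactly the lever a client-side fix can pull.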

This is the reframe. When Boris says improvements are coming, he's not talking about changing server limits. He's talking about recovering cache hit ratio - which is the same thing as handing quota back to users.

What Claude Code Already Does About This

When I asked Claude to analyze its own source code, what came back wasn't a simple "we cache the system prompt." It was twelve distinct mechanisms working together, each one plugging a specific leak.

Two stood out - and they reveal how deeply Anthropic thinks about cache economics.

The first solves a combinatorial explosion: five runtime booleans in the system prompt means 32 possible cache entries, most of which will never get a second hit. Claude Code's fix involves a literal boundary string in the source that splits stable content from dynamic content, with the stable prefix shared globally across every user on Earth.
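A minimal sketch of that split, assuming the mechanism described above: the boundary string is real (it appears in the source), but the helper names, the stable-prefix text, and the flag set here are mine, and the hash construction is only an illustration of prefix hashing:

```python
import hashlib

BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"
STABLE_PREFIX = "You are Claude Code..."  # placeholder for the globally shared prefix

def build_system_prompt(flags: dict[str, bool]) -> str:
    """Stable content first, then the boundary, then the per-session flags."""
    dynamic = "\n".join(f"{name}={on}" for name, on in sorted(flags.items()))
    return f"{STABLE_PREFIX}\n{BOUNDARY}\n{dynamic}"

def cache_key(prompt: str) -> str:
    # Hash only the content before the boundary: every combination of the
    # five runtime booleans maps to one cached prefix instead of
    # 2**5 = 32 distinct entries, most of which would never hit twice.
    stable = prompt.split(BOUNDARY, 1)[0]
    return hashlib.blake2b(stable.encode(), digest_size=16).hexdigest()

a = cache_key(build_system_prompt({"vim_mode": True, "web_access": False}))
b = cache_key(build_system_prompt({"vim_mode": False, "web_access": True}))
assert a == b  # flipping runtime flags never busts the stable-prefix cache
```

The design choice worth noting: the cache key is derived from the prefix alone, so the dynamic tail can churn freely without ever invalidating the shared entry.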

The second is even more interesting: a side-channel called cache_edits that surgically removes old tool results from the cached KV store without changing a single byte in the actual message. No cache invalidation. No reprocessing penalty.
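To see why that's surgical rather than destructive, here is a sketch of what such a request could look like. The cache_edits name comes from the source; the payload shape, field names, and helper below are entirely my guess - treat every detail as hypothetical:

```python
def make_request(messages: list[dict], stale_tool_result_ids: list[str]) -> dict:
    """Attach cache-edit instructions alongside the untouched messages."""
    return {
        "messages": messages,  # not a single byte changes here
        "cache_edits": [
            # Each edit would tell the server to drop one tool result from
            # the cached KV entries without invalidating the prefix cache.
            {"type": "remove_tool_result", "tool_use_id": tid}
            for tid in stale_tool_result_ids
        ],
    }

req = make_request(
    messages=[{"role": "user", "content": "continue"}],
    stale_tool_result_ids=["toolu_01", "toolu_02"],
)
```

The point of the side-channel framing: because the message bytes are identical, the prefix cache stays valid, while the bulky old tool output stops occupying cached KV space.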

But those are just two of twelve mechanisms. The full picture includes a 728-line diagnostic system that treats cache misses as bugs, a function literally named DANGEROUS_uncachedSystemPromptSection(), and a one-sentence prompt rewrite that saved 20K tokens per budget flip.

Read the full source code analysis on my blog →

Here's what you'll find in the full post:

  • How __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ solves the 2^N cache key explosion with Blake2b prefix hashing
  • The cache_edits side-channel: surgery without invalidation
  • Why there's a function called DANGEROUS_uncachedSystemPromptSection() and what it forces engineers to do
  • The real mechanism behind the /clear warning (it's called "willow" internally)
  • What Boris can actually ship to stretch your quota further

Previously: Claude Code's Compaction Engine: What the Source Code Actually Reveals
