60–95% fewer tokens in your agent loops, same answers. Meet Headroom.

#agents #ai #api #llm

AI coding agents are expensive — not because models cost too much per token, but because they send too many of them. An SRE debugging session with a raw agent: 65,694 tokens in. With Headroom in the middle: 5,118. Same bug found.

Headroom is a new open-source context compression layer that intercepts everything your agent reads — tool outputs, log dumps, RAG chunks, files, conversation history — and compresses it before the LLM ever sees it. It's local, reversible, and available as a drop-in proxy, a library, or an MCP server.

The numbers that matter

Savings on real agent workloads:

Code search (100 results): 17,765 → 1,408 tokens (92% reduction)
SRE incident debugging: 65,694 → 5,118 tokens (92%)
GitHub issue triage: 54,174 → 14,761 tokens (73%)
Codebase exploration: 78,502 → 41,254 tokens (47%)

Accuracy on standard benchmarks (GSM8K, TruthfulQA, SQuAD v2, BFCL) is preserved — some scores actually improve slightly, likely because the model sees cleaner signal.

What's doing the compression

Under the hood, Headroom routes content through a stack of specialised compressors:

SmartCrusher — JSON, nested objects, arrays of dicts
CodeCompressor — AST-aware for Python, JS, Go, Rust, Java, C++
Kompress-base — a custom HuggingFace model trained on agentic traces, for prose and mixed content
CacheAligner — stabilises prompt prefixes so Anthropic/OpenAI KV caches actually hit

It also does CCR (reversible compression) — originals are cached locally and the LLM can retrieve them on demand if it needs them. Nothing is destroyed.

Why the proxy mode matters

The most interesting deployment path: headroom proxy --port 8787, then point your existing tool at localhost. Zero code changes. Works with any language.

Or even simpler: headroom wrap claude wraps Claude Code, routes its traffic through Headroom automatically. One command, savings start immediately. Same for Codex, Cursor, Aider, Copilot CLI.

"Library — compress(messages) in Python or TypeScript, inline in any app. Proxy — headroom proxy --port 8787, zero code changes, any language."

There's also a cross-agent memory store — shared context across Claude, Codex, and Gemini sessions with auto-dedup — and a headroom learn feature that mines past failed sessions and writes corrections back to your CLAUDE.md / AGENTS.md.

What to do

Running Claude Code or Codex daily? pip install "headroom-ai[all]" then headroom wrap claude. See the savings in five minutes.
Using any OpenAI-compatible client? headroom proxy --port 8787 and point your client at localhost. No code changes needed.
On LangChain, Agno, or Vercel AI SDK? Native middleware integrations are available — no proxy required.
On Opus-class models? Also enable HEADROOM_OUTPUT_SHAPER=1 — it trims verbose model output too, and on 5× output pricing that adds up fast.
Not burning tokens on agent context yet? Bookmark it. You will be.