We Cut Our LLM API Bill 30% With Four Lines of YAML

#ai #devops #programming #discuss

Our gateway handles a few thousand LLM calls per hour. Mostly internal tools, some customer-facing agents. We noticed something in the logs: a lot of prompts were basically the same question worded differently.

"Summarize this quarterly report" and "give me a summary of the Q2 report" hitting the same model, getting nearly identical responses, costing us twice. Multiply that across a few hundred users and it adds up fast.

The math on duplicate calls

Quick back-of-envelope. GPT-4o runs $2.50 per million input tokens, $10 per million output. Claude Sonnet is $3/$15. A typical summarization request with context is maybe 2K input tokens and 500 output. That's roughly $0.01 per call on GPT-4o ($0.005 for input plus $0.005 for output).

Doesn't sound like much until you're doing 50K calls a day and 30-40% of them are semantically identical. That's $150+/day in duplicate spend. About $4.5K/month. For responses you already generated.
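If you want to rerun this arithmetic against your own traffic, a small sketch like the one below is enough. The prices and token counts are the ones quoted above; the function names are made up for illustration.

```python
# Prices in USD per million tokens, as quoted above for GPT-4o.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single completion call in USD."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def duplicate_spend_per_day(
    calls_per_day: int,
    duplicate_share: float,
    input_tokens: int = 2_000,
    output_tokens: int = 500,
) -> float:
    """Daily spend on calls whose answer had already been generated once."""
    return calls_per_day * duplicate_share * cost_per_call(input_tokens, output_tokens)


per_call = cost_per_call(2_000, 500)
low = duplicate_spend_per_day(50_000, 0.30)
high = duplicate_spend_per_day(50_000, 0.40)

print(f"per call: ${per_call:.4f}")
print(f"duplicate spend: ${low:.0f} to ${high:.0f} per day")
print(f"per 30-day month: ${low * 30:.0f} to ${high * 30:.0f}")
```

Swap in your own model prices and token counts; the duplicate share is the number worth measuring from logs before trusting any of this.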

Semantic caching on the gateway

The fix is semantic caching at the gateway layer. Instead of matching prompts by exact string (which almost never hits because users word things differently), you embed the prompt into a vector and check cosine similarity against cached responses. Similar enough prompt? Return the cached response. Skip the model call entirely.
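To see why exact-string matching almost never hits, here is a minimal sketch of a conventional hash-keyed cache. The key scheme (SHA-256 over model plus prompt) is an assumption for illustration, not how any particular gateway builds its keys.

```python
import hashlib


def exact_key(model: str, prompt: str) -> str:
    """Key for a conventional exact-match cache: hash of model + prompt text."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


exact_cache = {}

first = "Summarize this quarterly report"
second = "give me a summary of the Q2 report"

# Cache the response to the first wording.
exact_cache[exact_key("gpt-4o", first)] = "cached summary"

# Identical text hits; the reworded prompt hashes to a different key and misses.
hit_same = exact_cache.get(exact_key("gpt-4o", first))
hit_reworded = exact_cache.get(exact_key("gpt-4o", second))

print(hit_same)
print(hit_reworded)
```

A semantic cache replaces the key equality check with a similarity check on embeddings, which is what lets the second wording reuse the first response.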

We'd been running this on Redis with RediSearch. Worked well but RediSearch needs Redis Stack, which isn't standard Redis anymore. When we moved to Valkey (like a lot of teams post-license-change), we needed the same thing on valkey-search.

LiteLLM shipped a valkey-semantic cache backend that does exactly this. A few lines in the config:

litellm_settings:
  cache: True
  cache_params:
    type: valkey-semantic
    host: os.environ/VALKEY_HOST
    port: os.environ/VALKEY_PORT
    valkey_semantic_cache_embedding_model: openai-embedding
    similarity_threshold: 0.8

The similarity_threshold controls how close a match needs to be. 0.8 worked well for us. Too low and you get false positives. Too high and you miss obvious duplicates. Tune it for your traffic.
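One way to tune the threshold is to label a sample of prompt pairs by hand (same question or not), record their similarity scores, and count the errors at each candidate threshold. The sketch below does that; the scores and labels are invented for illustration and are not measurements from our traffic.

```python
from typing import List, Sequence, Tuple


def sweep(
    pairs: Sequence[Tuple[float, bool]], thresholds: Sequence[float]
) -> List[Tuple[float, int, int]]:
    """For each threshold, count false hits and missed duplicates.

    pairs: (similarity score, True if the two prompts really ask the same thing)
    """
    rows = []
    for t in thresholds:
        # Would be served from cache even though the questions differ.
        false_hits = sum(1 for score, same in pairs if score >= t and not same)
        # Real duplicates that still go to the model.
        missed = sum(1 for score, same in pairs if score < t and same)
        rows.append((t, false_hits, missed))
    return rows


# Hypothetical labeled sample, for illustration only.
labeled_pairs = [
    (0.95, True), (0.88, True), (0.83, True), (0.79, True),
    (0.84, False), (0.76, False), (0.71, False), (0.60, False),
]

results = sweep(labeled_pairs, [0.7, 0.8, 0.9])
for threshold, false_hits, missed in results:
    print(f"threshold={threshold}: false hits={false_hits}, missed duplicates={missed}")
```

False hits are usually the more expensive mistake, since a user gets an answer to a question they didn't ask, so it can make sense to lean toward a higher threshold when in doubt.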

What happens under the hood

Every prompt gets embedded (using whatever model you configure), stored in an HNSW vector index on Valkey, and tagged with a scope key so different users or API keys don't cross-contaminate caches. At lookup time it runs a KNN query and returns the cached response if cosine similarity clears the threshold.
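The flow can be sketched in a few dozen lines of plain Python. This is a toy under loud assumptions: a brute-force scan stands in for the HNSW index and KNN query, a dict stands in for Valkey, and the hand-written vectors stand in for real embeddings. The class and function names are hypothetical, not LiteLLM's.

```python
import math
from collections import defaultdict
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ScopedSemanticCache:
    """In-memory stand-in: entries are grouped by scope key, lookups scan one scope."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8):
        self._embed = embed
        self._threshold = threshold
        self._entries = defaultdict(list)  # scope -> [(vector, response)]

    def set(self, scope: str, prompt: str, response: str) -> None:
        self._entries[scope].append((self._embed(prompt), response))

    def get(self, scope: str, prompt: str) -> Optional[str]:
        query = self._embed(prompt)
        best_score, best_response = -1.0, None
        # Only entries stored under the same scope are candidates.
        for vector, response in self._entries.get(scope, []):
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None


# Hand-made vectors standing in for an embedding model (illustration only).
TOY_VECTORS = {
    "explain kubernetes pods": [0.9, 0.1, 0.0],
    "what are pods in k8s": [0.85, 0.2, 0.05],
    "summarize the Q2 report": [0.0, 0.2, 0.9],
}

cache = ScopedSemanticCache(lambda prompt: TOY_VECTORS[prompt])
cache.set("team-a", "explain kubernetes pods", "A pod is the smallest deployable unit.")

same_scope = cache.get("team-a", "what are pods in k8s")      # similar prompt, same scope
other_scope = cache.get("team-b", "what are pods in k8s")     # same prompt, different scope
unrelated = cache.get("team-a", "summarize the Q2 report")    # same scope, different topic

print(same_scope, other_scope, unrelated, sep="\n")
```

The scope check is what keeps one team's cached answers from leaking into another's; the threshold check is what keeps unrelated prompts in the same scope from matching.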

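The embedding call itself costs something (text-embedding-3-small is $0.02 per million tokens), but it's two orders of magnitude cheaper than the model call you're skipping. For a 2K-token prompt that's about $0.00004 per lookup, paid on every request whether it hits or not, so even a modest hit rate comes out well ahead.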
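A rough way to check the net effect, assuming the whole prompt is embedded on every request (hit or miss) and the model is only called on misses. The prices are the ones quoted in this post; the function name and the 2K/500 token defaults are illustrative.

```python
# USD per million tokens, as quoted in this post.
EMBED_PRICE_PER_M = 0.02    # text-embedding-3-small
INPUT_PRICE_PER_M = 2.50    # GPT-4o input
OUTPUT_PRICE_PER_M = 10.00  # GPT-4o output


def expected_cost_per_request(
    hit_rate: float, prompt_tokens: int = 2_000, output_tokens: int = 500
) -> float:
    """Average cost per request with the semantic cache in front of the model."""
    # Assumption: the embedding is paid on every request.
    embed_cost = prompt_tokens * EMBED_PRICE_PER_M / 1_000_000
    # The model call is paid only when the cache misses.
    model_cost = (
        prompt_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return embed_cost + (1.0 - hit_rate) * model_cost


no_cache = expected_cost_per_request(0.0) - 2_000 * EMBED_PRICE_PER_M / 1_000_000
with_cache = expected_cost_per_request(0.30)
saving = 1.0 - with_cache / no_cache

print(f"without cache: ${no_cache:.5f} per request")
print(f"with cache at 30% hits: ${with_cache:.5f} per request")
print(f"net saving: {saving:.1%}")
```

Under these assumptions the cache pays for itself at any hit rate above a fraction of a percent; the embedding overhead only starts to matter if the hit rate is close to zero.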

Cache hits come back with an x-litellm-semantic-similarity header so you can track your hit rate and measure actual savings.
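A small sketch of how that tracking might look, given response headers collected from the gateway. It assumes the header is present only on cache hits and that its value parses as a float; both are assumptions to verify against your own responses. The function name is made up.

```python
from typing import Dict, Iterable, Mapping, Optional, Union

SIMILARITY_HEADER = "x-litellm-semantic-similarity"  # header name from the text above


def summarize_hits(
    header_sets: Iterable[Mapping[str, str]],
) -> Dict[str, Union[int, float, None]]:
    """Count requests that carried the similarity header and average the scores."""
    total = 0
    scores = []
    for headers in header_sets:
        total += 1
        # HTTP header names are case-insensitive, so normalize before lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        value = lowered.get(SIMILARITY_HEADER)
        if value is not None:
            scores.append(float(value))
    avg: Optional[float] = sum(scores) / len(scores) if scores else None
    return {
        "requests": total,
        "hits": len(scores),
        "hit_rate": len(scores) / total if total else 0.0,
        "avg_similarity": avg,
    }


# Hypothetical sample of collected response headers.
sample = [
    {"x-litellm-semantic-similarity": "0.93"},
    {"content-type": "application/json"},
    {"X-LiteLLM-Semantic-Similarity": "0.87"},
    {},
]

stats = summarize_hits(sample)
print(stats)
```

Multiplying the hit count by your per-call cost gives a first estimate of the savings you are actually getting.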

Try it locally in 30 seconds

docker run -d -p 6379:6379 valkey/valkey-bundle:8.1

Set VALKEY_HOST=localhost, VALKEY_PORT=6379, start LiteLLM with the config above. Send the same question two different ways. Second one returns instantly from cache.

SDK version

If you're using LiteLLM as a library:

import os
import litellm
from litellm.caching.caching import Cache

litellm.cache = Cache(
    type="valkey-semantic",
    host=os.environ["VALKEY_HOST"],
    port=os.environ["VALKEY_PORT"],
    similarity_threshold=0.8,
    valkey_semantic_cache_embedding_model="text-embedding-ada-002",
)

response1 = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "explain kubernetes pods"}],
)

# different wording, same question, served from cache
response2 = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "what are pods in k8s"}],
)

assert response1.id == response2.id

AWS ElastiCache notes

If you're on AWS, ElastiCache for Valkey supports this on node-based Valkey 8.2+ clusters. Serverless doesn't have vector search yet. Cluster-mode-disabled with read replicas works fine. For TLS, add ssl: true or use a rediss:// URL. IAM auth is supported; just omit the password.

The env vars (VALKEY_HOST, VALKEY_PORT, VALKEY_PASSWORD) fall back to REDIS_HOST/REDIS_PORT/REDIS_PASSWORD, so if you're migrating from Redis you don't even need to update your environment.

Where it doesn't help

Semantic caching is not a silver bullet. It works best for read-heavy, repetitive workloads: internal Q&A bots, document summarization, support tools. It's less useful for creative generation or highly personalized responses where similar prompts should produce different outputs.

Also, if your prompts include large, unique contexts (like full documents), the semantic similarity might not trigger even for functionally identical questions, because the embedding is dominated by the context rather than the question.
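A toy illustration of that effect, using bag-of-words cosine similarity as a crude stand-in for a real embedding model. Real embeddings behave differently in detail, and the documents below are invented, but the dilution pattern is the same: the more unique context each prompt carries, the less the shared question contributes to the score.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (toy stand-in for embedding similarity)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * cb[word] for word, count in ca.items())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


question = "Summarize the key risks in this document."

# Two invented documents with no vocabulary in common.
doc_a = (
    "Revenue grew twelve percent while churn fell across enterprise accounts "
    "and support tickets dropped after onboarding changed"
)
doc_b = (
    "Cluster upgrade failed twice because node drains timed out so pods "
    "restarted during peak traffic on Tuesday"
)

bare = bow_cosine(question, question)
with_context = bow_cosine(f"{question}\n{doc_a}", f"{question}\n{doc_b}")

print(f"question only: {bare:.2f}")
print(f"question + different documents: {with_context:.2f}")
```

With the question alone the two prompts match; once each carries its own document, the score falls well below a 0.8 threshold even though the question is word-for-word identical.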

Know your traffic patterns. Check the similarity header. Tune the threshold.

Full setup: docs.litellm.ai/docs/proxy/caching | Blog post: docs.litellm.ai/blog/valkey_semantic_caching