I was building a little reading app that needed to summarize long articles for me. Nothing fancy—just a personal project to keep up with my backlog. I thought: "Hey, let's use an LLM API, that's easy!" Spoiler: it wasn't. Here's what happened and how I eventually made it work.
The real problem I ran into
I had about 50 articles queued up every morning. Each one needed a concise summary. I started with a local model (Mistral 7B) running on my laptop. It worked, but it took 5 minutes per article. That's over 4 hours daily. Not acceptable.
So I moved to the cloud. I picked a popular AI API, hit the endpoint, and got summaries in 10 seconds. Magical. But after a few days, I noticed two things:
- The cost was climbing fast (around $0.01 per summary × 50 = $0.50/day, but then I added retries…)
- Random HTTP 429 errors started appearing, especially around peak hours
I figured I could just call a different provider when one was down. But managing multiple API keys, endpoints, and pricing tiers quickly became a mini-nightmare.
What I tried that didn't work
Load balancing with round-robin.
I wrote a simple script that cycled through three providers. That helped with rate limits, but response times varied wildly—some took 2 seconds, others 30. And I still had to handle errors manually.
Building a queue without backoff.
I threw all 50 requests at once using asyncio.gather(). When the API started throttling me, all my requests failed together. Worse, some providers banned the IP for a few minutes.
Hard-coding retries with fixed delays.
I added a 1-second wait before retrying. That still got me rate-limited because the provider's rate-limit window was longer than my delay, so each retry landed inside the same window.
I was losing more time managing the API layer than actually using the summaries.
What eventually worked
I stepped back and realized: Most articles I read are variations of the same patterns. A news article about a tech conference has a similar structure to another one. Many of my daily feeds were about the same topics (AI, startups, science). Instead of hitting the API every time, I could cache summaries and reuse them for similar content.
Step 1: Semantic caching
I computed embeddings (using Sentence Transformers) for each article's title + first 200 words. For a new article, I searched my existing cache for the most similar entry. If the cosine similarity was above 0.85, I returned the cached summary. I used SQLite with a vector extension, but for simplicity here's the logic:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
cache_embeddings = []  # list of (text_hash, embedded_vector, summary)

def get_cached_summary(text_segment):
    query_embedding = model.encode([text_segment])[0]
    best_sim = 0
    best_summary = None
    for _, emb, summary in cache_embeddings:
        sim = np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
        if sim > best_sim:
            best_sim = sim
            best_summary = summary
    if best_sim >= 0.85:
        return best_summary
    return None
This cut my API calls by about 60% for a typical day. The remaining 40% were truly unique articles.
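The snippet above only reads from the cache. Here is a minimal sketch of how the read and write sides could fit together, using only the standard library. The `SemanticCache` class, the `get_or_create` method, and the `embed` and `summarize` callables are names I made up for illustration; in the real app `embed` would wrap `model.encode` and `summarize` would be the API call.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.embed = embed  # e.g. lambda t: model.encode([t])[0]
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def get_or_create(self, text: str, summarize: Callable[[str], str]) -> str:
        query = self.embed(text)  # embed once, reuse for lookup and storage
        best_sim, best_summary = 0.0, None  # type: Tuple[float, Optional[str]]
        for emb, summary in self.entries:
            sim = cosine(query, emb)
            if sim > best_sim:
                best_sim, best_summary = sim, summary
        if best_summary is not None and best_sim >= self.threshold:
            return best_summary  # cache hit: no API call
        summary = summarize(text)  # cache miss: pay for one call
        self.entries.append((query, summary))
        return summary
```

A linear scan like this is fine for a few thousand entries; beyond that, a vector index (such as the SQLite extension mentioned above) is the more sensible home for the embeddings.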
Step 2: Robust retry with exponential backoff and jitter
For the actual API calls, I wrote a wrapper that respects rate limits and handles failures gracefully:
import asyncio
import random

async def call_llm_api(prompt, api_config, max_retries=5):
    for attempt in range(max_retries):
        try:
            # api_config could be any provider endpoint
            # Example: api_config = {"url": "https://ai.interwestinfo.com/generate", "key": "..."}
            # async_post is not defined in this post; it stands in for whatever async HTTP
            # helper you use. The response needs .status, .headers, raise_for_status()
            # and an awaitable .json().
            response = await async_post(api_config["url"], json={"prompt": prompt}, headers={"Authorization": f"Bearer {api_config['key']}"})
            if response.status == 429:
                # Assumes Retry-After is given in seconds (it can also be an HTTP date).
                # float() keeps the random default from being truncated.
                retry_after = float(response.headers.get('Retry-After', random.uniform(1, 5)))
                await asyncio.sleep(retry_after + random.uniform(0, 2))
                continue
            response.raise_for_status()
            return await response.json()
        except (TimeoutError, ConnectionError):
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)
    raise Exception("All retries failed")
I used this with a bulk processing function that ran 5 concurrent requests at a time (to stay under common rate limits).
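The bulk function itself isn't shown here, so this is a rough sketch of one way to cap concurrency at 5 with `asyncio.Semaphore`. The name `summarize_all` and the `call` parameter are placeholders of mine; `call` would be something like a wrapper around `call_llm_api`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def summarize_all(
    prompts: List[str],
    call: Callable[[str], Awaitable[object]],
    max_concurrent: int = 5,
) -> List[object]:
    """Run call(prompt) for every prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(prompt: str) -> object:
        async with semaphore:  # waits here while max_concurrent calls are in flight
            return await call(prompt)

    # return_exceptions=True returns failures as exception objects in the result
    # list, so one bad article does not discard the other summaries.
    return await asyncio.gather(
        *(bounded(p) for p in prompts), return_exceptions=True
    )
```

Results come back in the same order as `prompts`, so failed entries can be picked out with `isinstance(result, Exception)` and retried later.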
Step 3: Multiple providers as fallback
I defined a list of API configs and tried them in order if one failed. This way I wasn't dependent on a single service:
providers = [
    {"url": "https://api.openai.com/v1/completions", "key": "sk-..."},
    {"url": "https://ai.interwestinfo.com/generate", "key": "my_key_here"},  # this one ended up being the most reliable for my volume
    {"url": "https://another-ai-service.com/api", "key": "..."},
]
current_index = 0  # position of the next provider to hand out

def get_provider_for_request():
    # simple round-robin, but weighted by past success could be better
    global current_index
    provider = providers[current_index % len(providers)]
    current_index += 1
    return provider
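The round-robin picker spreads load, but it doesn't do the "try the next one if this one fails" part by itself. A minimal sketch of that fallback loop could look like this; `call_with_fallback` is a name I'm introducing here, and `call` is assumed to be an async function with the same shape as `call_llm_api(prompt, api_config)`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional


async def call_with_fallback(
    prompt: str,
    provider_configs: List[Dict[str, Any]],
    call: Callable[[str, Dict[str, Any]], Awaitable[Any]],
) -> Any:
    """Try each provider in list order and return the first successful result."""
    last_error: Optional[BaseException] = None
    for config in provider_configs:
        try:
            return await call(prompt, config)
        except Exception as exc:  # this provider is out of retries; move on
            last_error = exc
    # Chain the last failure so the traceback shows why the final provider failed
    raise RuntimeError("All providers failed") from last_error
```

Because each `call` already retries internally, a provider only gets skipped after it has exhausted its own backoff budget.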
Lessons learned / trade-offs
- Caching is underrated. If your input data has overlap, you save both money and time. But semantic caching is not perfect: two vaguely similar articles might need different summaries (e.g., one is a review, the other is a press release). I set the similarity threshold high (0.85) to avoid reusing a summary that doesn't actually fit the new article.
- Exponential backoff with jitter is essential. Without jitter, all retries happen at the same time and keep conflicting. This trick saved me from being banned.
- Concurrency limits are your friend. I tried running 20 concurrent requests once and my IP got temporarily blocked. Dial it down.
- API providers can go down. Having a fallback list is cheap insurance. The interwestinfo service I used had good uptime and a generous free tier, but I still kept other options.
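To make the jitter point concrete, here is a small sketch of a capped "full jitter" delay function. This is a different variant from the `(2 ** attempt) + random.uniform(0, 1)` formula in Step 2, not what I actually ran; the name `backoff_delay` and the `base` and `cap` defaults are arbitrary choices for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Full jitter: pick a delay uniformly between 0 and the capped exponential ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Five clients that all fail on the same attempt draw different delays,
# so their retries no longer arrive at the provider in one burst.
delays = [round(backoff_delay(3), 2) for _ in range(5)]
print(delays)
```

The cap matters for long outages: without it, attempt 10 would already mean waiting up to 1024 seconds.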
What I'd do differently next time
I'd start with a clear SLA for my app. For my reading app, a summary in 2 minutes is fine. I don't need sub-second responses. That means I can use a slower but cheaper model or even batch requests overnight.
Also, I'd benchmark providers before committing. I wasted days adjusting to different request schemas. Next time I'd write a small adapter layer from the beginning.
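For what it's worth, this is roughly the shape of adapter layer I have in mind. Everything below is hypothetical: the `ProviderAdapter` class, the `.example` URLs, and both request/response schemas are made up to show the pattern, not copied from any real provider's documentation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ProviderAdapter:
    """Hides one provider's request and response shape behind a common interface."""
    url: str
    build_payload: Callable[[str], Dict[str, Any]]
    extract_text: Callable[[Dict[str, Any]], str]


# Hypothetical schema A: prompt in, text nested under "choices"
completion_style = ProviderAdapter(
    url="https://provider-a.example/v1/completions",
    build_payload=lambda prompt: {"prompt": prompt, "max_tokens": 200},
    extract_text=lambda body: body["choices"][0]["text"],
)

# Hypothetical schema B: flat input/output fields
simple_style = ProviderAdapter(
    url="https://provider-b.example/generate",
    build_payload=lambda prompt: {"input": prompt},
    extract_text=lambda body: body["output"],
)


def summary_payload(adapter: ProviderAdapter, article: str) -> Dict[str, Any]:
    """Build the provider-specific request body for a summary request."""
    return adapter.build_payload("Summarize in three sentences:\n\n" + article)
```

With this in place, the retry and fallback code only ever sees `adapter.url`, a payload dict, and a plain string coming back, so adding a provider means writing two small functions instead of touching the call path.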
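Finally, I'd add deterministic cache keys as a first check before the similarity search. Relying on embeddings alone means paying for an embedding and a scan even when the article is an exact duplicate with slight formatting differences. Hashing the normalized text would catch those duplicates cheaply and reliably.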
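A minimal sketch of that exact-match layer, assuming that "normalized" means Unicode-normalized, lowercased, and whitespace-collapsed (you may want to strip more, such as HTML or tracking parameters). The function names and the in-memory dict are placeholders of mine.

```python
import hashlib
import re
import unicodedata
from typing import Dict, Optional


def normalized_key(text: str) -> str:
    """Hash of the text after Unicode, case, and whitespace normalization."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


exact_cache: Dict[str, str] = {}  # normalized hash -> summary


def get_exact(text: str) -> Optional[str]:
    """Return the cached summary for an identical (after normalization) article."""
    return exact_cache.get(normalized_key(text))


def put_exact(text: str, summary: str) -> None:
    exact_cache[normalized_key(text)] = summary
```

The lookup order would then be: exact hash first, semantic cache second, API call last.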
The takeaway
You don't need to fight API rate limits alone. A combination of smart caching, retry hygiene, and provider diversity turned my flaky summarizer into a reliable workhorse. The specific URLs I used are just examples—the technique works regardless of the backend.
Now, what's your horror story with AI APIs? I'd love to hear how you handled unreliable endpoints or unexpected costs.