Your Agents Are DDoS-ing Your Own Infrastructure. The Retry Logic You Copied From Stack Overflow Is Why.

#ai #agents #performance #distributed

Agentic workflows chain 10-20 sequential API calls in rapid bursts. Under traditional request-per-minute rate limits, this traffic pattern is indistinguishable from a distributed denial-of-service attack.

tianpan.co documented what happens next: "The naive retry logic that ships in most SDK examples is closer to a denial-of-service tool than a resilience pattern." When an agent hits a rate limit and retries immediately, and that retry is rate-limited and retried in turn, you have built an amplification loop against your own infrastructure.

The most common production incident in AI inference architectures is queue buildup: requests arrive faster than they can be processed, pods run out of memory (OOM), and the failures cascade.

Your agents are not under attack. They are the attack.

Why Agent Traffic Breaks Traditional Rate Limiting

Traditional rate limiting was designed for human interaction patterns: one user, one browser, a few requests per second, with natural pauses between actions.

# Human traffic pattern (rate limiting works):
# 09:00:01 - User clicks search (1 request)
# 09:00:03 - User reads results (0 requests)
# 09:00:08 - User clicks result (1 request)
# 09:00:15 - User reads page (0 requests)
# Pattern: ~4 requests/minute, natural gaps

# Agent traffic pattern (rate limiting breaks):
# 09:00:01.000 - Agent starts workflow
# 09:00:01.050 - Tool call 1 (web search)
# 09:00:01.120 - Tool call 2 (database query)
# 09:00:01.180 - Tool call 3 (API fetch)
# 09:00:01.250 - Agent reasons, spawns sub-agent
# 09:00:01.300 - Sub-agent tool call 1
# 09:00:01.350 - Sub-agent tool call 2
# ... 20 calls in 500ms
# Pattern: 2,400 requests/minute equivalent burst

# Rate limiter response: BLOCK (looks like DDoS)
# Agent response: RETRY IMMEDIATELY
# Rate limiter: BLOCK HARDER
# Agent: RETRY WITH ALL SUB-AGENTS
# Result: Thundering herd. Self-inflicted DDoS.

fast.io confirmed: "Traditional rate limiting was built for browsers and apps used by humans. AI agent rate limiting involves controlling how frequently agents make API calls, access resources, and consume credits to prevent service disruptions."

The Naive Retry Amplification Problem

Every agent framework ships with retry logic. Most copy the same pattern: catch error, wait a bit, try again. This pattern, applied to multi-agent systems, creates exponential amplification:

# What most agent frameworks ship:
import asyncio

async def call_with_retry(fn, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:  # whatever your client raises on HTTP 429
            await asyncio.sleep(2 ** attempt)  # "Exponential backoff": 1s, 2s, 4s
            # Problem: 5 agents all retry at the same intervals
            # Their retries stay synchronized. Thundering herd.
    # Falls through and silently returns None once retries are exhausted

# 5 agents hit rate limit simultaneously:
# T+0s: All 5 blocked, all sleep 1s
# T+1s: All 5 retry at once → 5x the load → all blocked again, all sleep 2s
# T+3s: All 5 retry at once → still synchronized, all sleep 4s
# T+7s: All 5 give up at the same moment, having retried in lockstep
# The "backoff" is identical across agents. They move in sync.

# What production systems need (with rosud-call):
from rosud_call import Channel, BackpressurePolicy

channel = Channel.create(
    agents=["planner", "researcher", "writer", "reviewer", "publisher"],
    backpressure=BackpressurePolicy(
        # Admission control: don't let all agents fire at once
        admission={
            "max_concurrent_messages": 10,
            "queue_depth_limit": 100,
            "on_queue_full": "reject_with_signal"  # Tell sender to slow down
        },

        # Token bucket per agent (not global)
        rate_limit={
            "per_agent_tokens": 20,
            "refill_rate": 5,  # tokens per second
            "burst_allowed": 10
        },

        # Jittered backoff (prevents thundering herd)
        retry={
            "strategy": "decorrelated_jitter",  # Not synchronized
            "base_ms": 100,
            "max_ms": 30000,
            "per_agent_randomization": True  # Each agent has different timing
        },

        # Dead letter queue (don't lose messages)
        failure={
            "dead_letter_queue": True,
            "max_attempts": 5,
            "alert_on_dlq_depth": 10
        }
    )
)
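
If you want to see what "decorrelated jitter" means without any particular library, here is a minimal sketch. It uses the commonly cited form, where each delay is drawn at random between the base and three times the previous delay, then capped. The names `decorrelated_jitter`, `call_with_jitter`, and the `RateLimitError` stand-in are hypothetical, and whether rosud-call computes its delays this way is an assumption. The 0.1s base, 30s cap, and 5 attempts mirror `base_ms`, `max_ms`, and `max_attempts` in the config above.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for whatever exception your client raises on HTTP 429."""


def decorrelated_jitter(base: float, cap: float, rng: random.Random):
    """Yield an endless series of delays in seconds.

    Each delay is drawn from [base, 3 * previous delay] and clamped to cap,
    so callers with different random streams drift apart quickly.
    """
    delay = base
    while True:
        delay = min(cap, rng.uniform(base, delay * 3))
        yield delay


async def call_with_jitter(fn, max_attempts=5, base=0.1, cap=30.0, rng=None):
    """Retry fn on RateLimitError, sleeping a jittered delay between attempts."""
    rng = rng or random.Random()
    delays = decorrelated_jitter(base, cap, rng)
    for attempt in range(max_attempts):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(next(delays))
```

Because every agent draws from its own random stream, five agents blocked at the same instant come back at five different times instead of in lockstep. Unlike the naive version, this one re-raises after the last attempt rather than returning None.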

The p95 Latency Problem

Without backpressure, agent systems show a characteristic pattern: p50 latency looks fine, p95 is catastrophic. markaicode documented the fix: async queue + backpressure drops p95 latency by 40%.

# Without backpressure management:
latency_profile = {
    "p50": "200ms",   # Looks fine in dashboards
    "p75": "800ms",   # Starting to degrade
    "p95": "12,000ms", # 12 seconds! Users abandon.
    "p99": "timeout",  # Complete failure
}
# A dashboard that tracks only the median shows 200ms → looks healthy
# But 5% of users wait 12+ seconds. 1% get timeouts.
# Root cause: agent message bursts filling queues faster than drain rate

# With rosud-call backpressure:
latency_profile_with_backpressure = {
    "p50": "180ms",   # Slightly better (less queue contention)
    "p75": "350ms",   # Significantly better
    "p95": "1,200ms", # 90% improvement (12s → 1.2s)
    "p99": "2,800ms", # Actually completes (vs timeout)
}
# Improvement: p95 from 12,000ms to 1,200ms = 90% reduction
# How: admission control prevents queue overflow
#      token buckets prevent burst amplification
#      jittered retry prevents thundering herd
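
The per-agent token bucket is the easiest of those three mechanisms to sketch on its own. The version below is a rough illustration, not rosud-call's implementation: `TokenBucket` and `PerAgentLimiter` are hypothetical names, and the injectable `clock` exists only so the behaviour can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then throttle to `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.refill_rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should back off instead of retrying immediately


class PerAgentLimiter:
    """One bucket per agent, so a noisy agent cannot starve the others."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self._capacity = capacity
        self._refill_rate = refill_rate
        self._clock = clock
        self._buckets = {}

    def allow(self, agent_id: str) -> bool:
        if agent_id not in self._buckets:
            self._buckets[agent_id] = TokenBucket(
                self._capacity, self._refill_rate, self._clock)
        return self._buckets[agent_id].try_acquire()
```

With `capacity=20` and `refill_rate=5`, an agent can fire its 20-call burst once, after which it is held to 5 calls per second, while every other agent keeps its own untouched budget.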

Little's Law Applied to Agent Messaging

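Queueing theory gives us the relationship through Little's Law: L = λW (average queue length = average arrival rate × average wait time). The law describes a stable system, and a system is only stable while the arrival rate stays below the processing rate. When agents burst 2,400 requests/minute into a system that processes 100/minute, there is no steady state: the backlog grows by 2,300 requests for every minute the burst lasts, until something crashes.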
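
To put numbers on the overload case, here is a back-of-the-envelope sketch. It assumes the 2,400/minute rate is sustained, the queue starts empty and is served first-in first-out, and randomness is ignored. The helper names and the 5-minute window are illustrative choices, not figures from any measurement.

```python
def backlog(arrival_per_min: float, service_per_min: float, minutes: float) -> float:
    """Queue depth after `minutes`, starting from an empty queue."""
    return max(0.0, float(arrival_per_min - service_per_min) * minutes)


def wait_for_newest(depth: float, service_per_min: float) -> float:
    """Minutes the most recently queued request waits before it is served (FIFO)."""
    return depth / service_per_min


depth = backlog(arrival_per_min=2400, service_per_min=100, minutes=5)
print(depth)                        # 11500.0 requests queued after 5 minutes
print(wait_for_newest(depth, 100))  # 115.0 minutes of waiting for the newest one
```

Five minutes of overload leaves almost two hours of work in the queue, which is why an unbounded queue eventually falls over rather than catching up.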

The solution is not faster processing. It is controlled admission:

Admission control: reject excess load before it enters the queue
Backpressure signals: tell agents to slow down (not just block them)
Load shedding: drop low-priority messages under pressure
Priority queues: critical agent messages skip the line
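
Those four ideas fit in one small data structure. The sketch below is a hypothetical `AdmissionQueue`, written to illustrate the pattern rather than to mirror any library: it rejects new work when full (admission control), tells the sender when to come back (backpressure signal), evicts a less important message to make room for a more important one (load shedding), and always serves the most important message first (priority queue).

```python
import heapq
import itertools


class QueueFull(Exception):
    """Backpressure signal: the sender should slow down and retry later."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"queue full, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


class AdmissionQueue:
    """Bounded priority queue. A lower `priority` number is served first."""

    def __init__(self, depth_limit: int, retry_after_s: float = 1.0):
        self.depth_limit = depth_limit
        self.retry_after_s = retry_after_s
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority

    def offer(self, message, priority: int = 1):
        """Enqueue `message`. Returns a shed message, or None if nothing was dropped."""
        shed = None
        if len(self._heap) >= self.depth_limit:
            worst = max(self._heap)  # least important, most recent entry
            if priority >= worst[0]:
                # Admission control: refuse before the queue grows any further.
                raise QueueFull(self.retry_after_s)
            # Load shedding: drop the least important entry to make room.
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            shed = worst[2]
        heapq.heappush(self._heap, (priority, next(self._seq), message))
        return shed

    def take(self):
        """Remove and return the most important queued message."""
        return heapq.heappop(self._heap)[2]
```

A sender that catches `QueueFull` can sleep for `retry_after_s` instead of hammering the queue. The linear scan in `max` is fine for a sketch, though a production queue would track its lowest-priority entry more efficiently.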

The Bottom Line

Your agents produce traffic patterns identical to DDoS attacks. Traditional rate limiting blocks them. Naive retry amplifies the problem. The queue fills, p95 explodes, pods OOM, and the entire system cascades.

rosud-call has backpressure built into the messaging layer. Token bucket rate limiting per agent. Decorrelated jitter on retries. Admission control before the queue. Dead letter queues for failed messages. The difference between a system that handles 10 agents gracefully and one that collapses at 5.

Stop DDoS-ing yourself. Add backpressure to your agent communication.

Add backpressure to agent messaging: rosud.com/docs