DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Synthetic Monitoring Best Practices: What to Monitor and How Often

6 min read
What Is Synthetic Monitoring? The Complete Guide

6 min read
Synthetic Monitoring vs Real User Monitoring (RUM): The Difference

4 min read
Ten 95% Reliable Agents Chained Together Give You a 60% System. Microservices Solved This a Decade Ago.

4 min read
Your MCP Agent is Logging "Success: true" While the task never ran

3 min read
Three AI providers went down on the same day. Here's the architecture that didn't care.

5 min read
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves

5 min read
Software Reliability

3 min read
Sliding-Window Spend Guard: the $47K Loop Per-Call Caps Miss

11 min read
Graceful Degradation: Circuit Breakers for External API Dependencies

5 min read
Building a Chaos Testing Harness for Multi-Region Video API Endpoints

10 min read
Error budgets when downtime costs money: reliability engineering for payment-critical systems

10 min read
Distributed Tracing 101: The Mental Model, the Standards, and Your First Pipeline

5 min read
Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization

17 min read
AI SRE: What an Autonomous Agent Doing On-Call Actually Looks Like

6 min read