DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

AI Model Failover Drills: Keep Agents Useful When Providers Break

1
Comments
10 min read
Humanizing Artificial Intelligence in DevOps Documentation: Making Runbooks Easier to Create and Use

Comments
9 min read
Runbook Hygiene: Why Yours Are Lying to You

Comments
2 min read
What's the Most Annoying Part of Incident Response? I Built 5 AI Tools Trying to Solve It

Comments
1 min read
An 8-minute outage from a dead NLB and a JVM that cached DNS forever

1
Comments
4 min read
Fault-injecting our LLM provider to trust Bifrost fallbacks

Comments 1
4 min read
Why Retries Are More Dangerous Than Failures in Production Systems

Comments 1
2 min read
Most AI dev tools assume you have a repo. Ops engineers have a broken node and a 3am page.

Comments
4 min read
Record of Site Issues #1

2
Comments
2 min read
A provider latency spike stalled our whole build queue

Comments
4 min read
What Is Multi-Agent SRE? A Practical Introduction

Comments
3 min read
5-Minute Post-Deploy Postmortem with SignalPilot

Comments
3 min read
The Future of SRE: What the Next 5 Years Look Like

Comments
3 min read
Great Stack to Doesn't Work #10 — Season Finale: "When PagerDuty Calls at 3 AM"

Comments
12 min read
Stop breaking production: a migration path to unified platforms 🛠️

Comments
1 min read