paul_h

Loop Engineering: Building an Agent Loop with agent-runbook

Recently, another interesting new term has appeared in the AI industry.

Loop Engineering.

If you follow the AI space, you've probably seen it everywhere in the past couple of days. It's all over X, all over various social media, and quite a few people are discussing it in group chats too.

Addy Osmani recently formalized the concept as Loop Engineering: the fourth "Engineering" after Prompt Engineering, Context Engineering, and Harness Engineering.

What is a Loop? Here's a concrete scenario:

You have a project with 16 failing tests. Previously you'd run the tests, see what failed, tell Claude "fix this", wait for the fix, run the tests again, find new issues, prompt again, and so on. Back and forth, you are the one driving the loop.

The idea behind Loop Engineering is: you no longer manually drive it round by round. You define the goal (all tests pass), define what to do each round (run tests → fix code), define constraints (can't modify test files), then let go. The system runs on its own until the goal is met.
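
As a rough illustration of that shape, here is a minimal Python sketch of a loop with a goal, a per-round body, and an iteration cap. The names `run_loop`, `observe`, `act`, and `goal_met` are hypothetical and not part of Claude Code or any tool mentioned here, and the "tests" are simulated with a plain counter rather than a real test suite.

```python
from typing import Callable, Tuple


def run_loop(
    observe: Callable[[], int],
    act: Callable[[int], None],
    goal_met: Callable[[int], bool],
    max_iterations: int,
) -> Tuple[bool, int]:
    """Repeat observe -> act until the goal holds or the cap is hit.

    Returns (goal_reached, rounds_of_fixing_used).
    """
    for round_no in range(1, max_iterations + 1):
        state = observe()
        if goal_met(state):
            return True, round_no - 1
        act(state)
    # One last observation after the final permitted round.
    return goal_met(observe()), max_iterations


# Simulated project (assumption): 16 failing tests, each round fixes up to 6.
failing = {"count": 16}


def observe() -> int:
    return failing["count"]


def act(count: int) -> None:
    failing["count"] = max(0, count - 6)


reached, rounds = run_loop(observe, act, lambda n: n == 0, max_iterations=10)
print(reached, rounds)  # prints: True 3
```

The human's job moves from driving each round to choosing the three inputs: what is observed, what is allowed to change, and when the loop must stop.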

/goal Is Not Enough

At this point you might say: doesn't Claude Code already have the /goal command? Can't I just /goal "all tests pass" and be done?

On the surface, yes. /goal gives you a completion condition, and Claude works on its own until it's satisfied. But after using it a few times you'll notice the problem: the goal is defined, yet the agent still doesn't behave the way you need, because you only told it "what counts as done" and never told it "what to do each round".

What /goal "all tests pass" does:

  • Tells the agent "keep going until this condition is met"
  • At the end of each round, an independent model judges whether the goal is satisfied
  • The agent has complete freedom in what it does each round

What it doesn't do:

  1. Doesn't define the internal structure of each round. In /goal the agent does whatever it wants each round. Maybe the first round it runs tests + fixes code, the second round it suddenly goes refactoring, the third round it modifies test files.
  2. No iteration-level constraints. /goal only has a termination condition. There's no guardrail like "only modify one file per round", and you can't control when the agent goes out of bounds.
  3. Not reusable. /goal "all tests pass" is gone once you type it. Next time you switch repos or switch people, you have to type it all over again.
  4. Not auditable. When your boss asks "what's the logic of this automated fix workflow", you can't show them /goal.

To summarize: /goal solves "keeping the agent from stopping", but doesn't solve "making the agent follow the rules".

What you need is a place to write down the loop's structure, constraints, and goals — not a one-time command typed into the terminal, but a file that can be committed to the repo, where anyone who gets it can run it and get the same behavior.

agent-runbook: The Contract Format for Loops

This is what agent-runbook does.

agent-runbook is an open source project (github.com/KnoxOps/agent-runbook). It is not an execution engine for loops but a contract format for them. You declare in YAML what to iterate on, when to stop, and what the constraints are for each round, and the compiler generates a SKILL.md for you. SKILL.md is the reusable instruction format for Claude Code and Codex: put it in your project and it can be invoked directly with claude --skill.

A loop step has three elements:

  • body: what to do each round (the rhythm of observe → act → verify)
  • goal: when to stop (must be a machine-verifiable condition)
  • max_iterations: safety boundary (exceeding this number means the design has a problem, prevents burning tokens)

There's one more key element: quality_check. This is an iteration-level guardrail: after each round it checks whether the agent went out of bounds (e.g. modified files it shouldn't have). With blocking: true, a round that fails the check doesn't count as complete.
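
To make the guardrail idea concrete, here is a small Python sketch of what a mechanical version of such a check could look like. This is an assumption for illustration only: the article does not show how agent-runbook evaluates quality_check rules, and `check_round`, `round_counts`, and the file names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def check_round(modified: List[str], allowed_dir: str = "src") -> List[str]:
    """Return the rule violations for one round (an empty list means pass)."""
    violations = []
    # Rule 1: every modified file must live under the allowed directory.
    outside = [p for p in modified if PurePosixPath(p).parts[:1] != (allowed_dir,)]
    if outside:
        violations.append(f"modified outside {allowed_dir}/: {outside}")
    # Rule 2: exactly one file may be modified per round.
    if len(modified) != 1:
        violations.append(f"expected exactly 1 modified file, got {len(modified)}")
    return violations


def round_counts(modified: List[str], blocking: bool = True) -> bool:
    """With blocking=True, a round with any violation does not count as complete."""
    return not (blocking and check_round(modified))


print(round_counts(["src/calculator.py"]))  # prints: True
print(round_counts(["src/calculator.py", "tests/test_calculator.py"]))  # prints: False
```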

Hands-on: Building an Automated Test Fix Loop

Here's a simple example to show you how we use agent-runbook to build an agent loop.

We're going to build an automated test fix loop. The loop is simple: the goal is a 100% unit test pass rate, and each iteration has only two steps:

  1. run_tests - run the tests, see which ones are still failing
  2. fix - launch a clean context agent to fix the discovered issues

Beyond that, we also need to define our safety boundary: max_iterations. I wonder if any readers here have had the experience of burning through all their tokens with the /goal command — max_iterations is what prevents that.

Here's the full runbook, defined in structured YAML:

name: fix-failing-tests
description: Iteratively fix all failing tests until the test suite is green

steps:
  - id: fix_loop
    type: loop
    description: "Run tests, analyze failures, fix source code, repeat until green"
    goal: "pytest exits with 0 failures (all tests pass)"
    max_iterations: 10
    depends_on: []
    body:
      - id: run_tests
        type: script
        command: "cd examples/fix-loop && python3 -m pytest tests/ --tb=short 2>&1 | tail -60"
        depends_on: []
      - id: fix
        type: agent
        prompt: |
          Look at the pytest failures from run_tests.
          Pick ONE source file that has failing tests and fix the bugs in that file.

          Rules:
            - Only modify files in src/, NEVER modify test files
            - Fix exactly ONE file, then stop immediately
            - Do NOT read or modify any other source files
        depends_on: [run_tests]
        quality_check:
          blocking: true
          rules:
            - "Only files in src/ were modified, not test files"
            - "Exactly one source file was modified"

  - id: present
    type: inline
    prompt: |
      Generate a markdown report summarizing the fix loop results.
      Include:
        - Total iterations taken
        - What was fixed in each iteration (file + bug description)
        - Final test results
        - How cascading dependencies caused failures to clear automatically
      Write the report to fix_report.md
    depends_on: [fix_loop]

From YAML to Executable SKILL.md

Next we need to compile the YAML into a SKILL.md that Claude Code/Codex can directly execute. The generation command is simple:

python3 -m agent_runbook generate runbook.yaml -o output/

The generated SKILL.md looks like this:

---
name: fix-failing-tests
description: >-
  Iteratively fix all failing tests until the test suite is green
user-invocable: true
---

## Execution Flow

### Task Context

Before starting execution, initialize `task_context.json`:

```json
{
  "task_id": "<task_id from input>",
  "current_step": 0,
  "current_step_id": null,
  "status": "running",
  "steps": {
    "fix_loop": "pending",
    "present": "pending"
  },
  "updated_at": "<ISO timestamp>"
}
```

Update this file after each step completes. On error, set step status to `"failed"` and overall `status` to `"failed"`.

### Step 1: fix_loop

**Type:** loop
**Description:** Run tests, analyze failures, fix source code, repeat until green

## Iteration Loop

**Goal:** pytest exits with 0 failures (all tests pass)
**Max Iterations:** 10

> This step executes as a loop. The body steps repeat until the goal is met or max iterations reached.

## Loop Body (repeats each iteration)

#### Body Step 1: run_tests

**Type:** script

**Execution:** Execute the following command:
```bash
cd examples/fix-loop && python3 -m pytest tests/ --tb=short 2>&1 | tail -60
```

#### Body Step 2: fix

**Type:** agent

**Execution:** Launch an independent agent with the following prompt file:

Look at the pytest failures from run_tests.
Pick ONE source file that has failing tests and fix the bugs in that file.

Rules:
  - Only modify files in src/, NEVER modify test files
  - Fix exactly ONE file, then stop immediately
  - Do NOT read or modify any other source files


## Goal Evaluation

After all body steps complete, evaluate:

**Goal:** pytest exits with 0 failures (all tests pass)

1. If goal IS met → mark this step completed, proceed to next step.
2. If goal NOT met and iterations remain → reset body steps, start next iteration.
3. If max iterations reached → mark step completed with status "max_iterations_reached", report what remains.

Append a summary to `iteration_history` after each iteration.

### Progress Tracking

After completing this step, update `task_context.json`:
- Set `current_step_id` to `"fix_loop"`
- Set `steps.fix_loop` to `"completed"`
### Step 2: present

**Type:** inline

## Execution
Follow these instructions:

Generate a markdown report summarizing the fix loop results.
Include:
  - Total iterations taken
  - What was fixed in each iteration (file + bug description)
  - Final test results
  - How cascading dependencies caused failures to clear automatically
Write the report to fix_report.md


### Progress Tracking

After completing this step, update `task_context.json`:
- Set `current_step_id` to `"present"`
- Set `steps.present` to `"completed"`

What does the generated SKILL.md contain? It translates the contracts you declared in YAML into execution instructions that the agent can understand:

  • iteration_history: requires the agent to record what was done each round and whether the goal was met, forming structured iteration memory
  • goal evaluation: the judgment logic after each round — if met then stop, if not met then continue, if limit reached then report
  • progress tracking: tracks overall progress through task_context.json, supports checkpoint resume
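
As one way to picture checkpoint resume, the Python sketch below reads a `task_context.json` with the structure shown in the generated SKILL.md and returns the first step that is not yet `"completed"`. The function name `next_pending_step` is hypothetical, and treating "resume" as "first non-completed step" is an assumption; the article does not describe the actual resume logic.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def next_pending_step(context_path: Path) -> Optional[str]:
    """Return the id of the first step not marked "completed", or None."""
    context = json.loads(context_path.read_text())
    # JSON object order is preserved, so steps are visited in declared order.
    for step_id, status in context["steps"].items():
        if status != "completed":
            return step_id
    return None


# Usage: a run that crashed during the second step.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "task_context.json"
    path.write_text(json.dumps({
        "status": "failed",
        "steps": {"fix_loop": "completed", "present": "failed"},
    }))
    print(next_pending_step(path))  # prints: present
```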

Running It: 3-Round Convergence

Now we can trigger this skill in Claude Code.

The run included three iterations:

  • Iteration 1: calculator fix → 6 failures disappeared
  • Iteration 2: validator fix → 5 failures disappeared
  • Iteration 3: formatter fix → all green

Finally, the present step produced fix_report.md after the loop, as defined earlier in the runbook.

Key Points for Designing a Good Loop

  1. Choose the right task. Not all tasks are suitable for loops. A good loop task has two characteristics: objective feedback signals (test results, lint output, whether compilation passes), and the ability to make incremental progress building on the previous round. Fixing tests, code migration, and performance optimization are all good candidates. Tasks requiring one-time creative decisions (architecture choices, naming) are not suitable.

  2. Write the goal as a decidable end state. "pytest exit 0" is a good goal, "better code quality" is not. The agent must be able to determine true or false on its own through tool output, otherwise the loop never knows whether it should stop.

  3. Keep the body in an "observe, then act" rhythm. First use script steps to see the current state clearly (run tests, run lint), then use agent steps to make decisions and modifications. Don't let a single agent step observe, act, and verify all at once. Split them up so each step has a clear responsibility, which also makes failures easier to locate.

  4. Leave an exit for failure. max_iterations is not the number of rounds you expect, but a safety valve for "exceeding this number means the approach has a problem". A normal loop should converge well below the upper limit. If it maxes out, it means the goal is too hard or the body design has flaws, and human intervention is needed.
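
To illustrate the second point, a decidable goal can be reduced to a command's exit code. The sketch below is a minimal Python example under that assumption; `goal_met` is a hypothetical helper, and the stand-in commands replace a real pytest run so the snippet works without a test suite.

```python
import subprocess
import sys
from typing import List


def goal_met(command: List[str]) -> bool:
    """A goal is decidable when a command's exit code answers it."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# For the article's example the command would be something like:
#   [sys.executable, "-m", "pytest", "tests/"]
# Stand-in commands (assumption) so the sketch runs anywhere:
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(goal_met(passing), goal_met(failing))  # prints: True False
```

One detail to keep in mind for a mechanical check like this: in a shell pipeline such as `pytest ... | tail -60`, the exit status the shell reports is that of `tail` unless `pipefail` is set, so the check should run pytest directly rather than through the pipe.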

agent-runbook: More Than Just Loops

Because of the AI product I'm developing, I often need to write long-running DevOps skills for SREs that are as error-free as possible.

During debugging I often encounter two types of problems:

  • One is agents not following instructions — you tell it to only restart the service, and it goes ahead and changes the configuration too.
  • The other is that in a complex multi-step skill, agents don't collaborate according to the established norms: the next step doesn't read the previous step's output at all, or it is read but its format is wrong.

Based on these problems, I developed agent-runbook: a contract-based skill generation tool, where the generated SKILL.md can be directly used as a skill integrated into the Claude Code/Codex ecosystem.

Its core philosophy is: use contracts to constrain agent collaboration, instead of relying on prompts and hoping for the best.
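
A tiny Python sketch can show what "contract instead of hope" means for inter-step output. This is a simplified stand-in, not agent-runbook's mechanism: the contract here is a plain dict of required keys and types, and `RUN_TESTS_CONTRACT`, `validate_output`, and the field names are all hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical contract for the JSON a "run_tests" step would hand to "fix".
RUN_TESTS_CONTRACT: Dict[str, type] = {"failed": int, "failures": list}


def validate_output(raw: str, contract: Dict[str, type]) -> Dict[str, Any]:
    """Parse a step's JSON output and check required keys and types."""
    data = json.loads(raw)
    for key, expected in contract.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} should be {expected.__name__}")
    return data


good = '{"failed": 2, "failures": ["test_add", "test_div"]}'
print(validate_output(good, RUN_TESTS_CONTRACT)["failed"])  # prints: 2

try:
    validate_output('{"failed": "two"}', RUN_TESTS_CONTRACT)
except ValueError as err:
    print(err)  # prints: failed should be int
```

The next step then fails fast on a malformed handoff instead of silently working from the wrong data.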

This table gives you a quick sense of how agent-runbook differs from /goal:

| | /goal | agent-runbook |
|---|---|---|
| Per-round structure | Agent does whatever it wants | Body declaratively defines each round's steps |
| Iteration constraints | None, only a termination condition | quality_check guardrails, out-of-bounds doesn't count as complete |
| Inter-step communication | Relies on LLM context passing | JSON Schema files, inspectable, parallel-readable |
| Error recovery | Start over | Checkpoint & Resume, pick up from where it crashed |
| Build-time checks | None | DAG cycle detection, schema reference validation, contract closure checks |
| Reusability | Gone once you type it | Commit to repo, anyone can run it with the same behavior |
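
As an example of what a build-time check such as DAG cycle detection can look like, here is a short Python sketch over `depends_on` relationships using the standard library's `graphlib`. It is an illustration under assumptions, not agent-runbook's compiler code; `execution_order` is a hypothetical function name.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_order(steps: Dict[str, List[str]]) -> List[str]:
    """Return step ids in dependency order; raise ValueError on a bad graph.

    `steps` maps each step id to the ids listed in its depends_on.
    """
    for step_id, deps in steps.items():
        unknown = [d for d in deps if d not in steps]
        if unknown:
            raise ValueError(f"{step_id} depends on unknown steps: {unknown}")
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


# The two top-level steps from the runbook above.
print(execution_order({"fix_loop": [], "present": ["fix_loop"]}))
# prints: ['fix_loop', 'present']
```

Catching a cycle or a dangling reference at compile time means the problem surfaces before any agent run starts.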

Loop is a step type added on top of this foundation — when your task requires iteration, use the same contract-based approach to define the loop's body, goal, and constraints.

You don't have to start from scratch either. open-devops-skills is a production-grade DevOps skill library built on agent-runbook, currently featuring infrastructure/cloud resource cost optimization skills, with more DevOps scenarios to be expanded in the future. You can use them directly, or use them as reference for designing your own skills.

It's also worth mentioning that agent-runbook itself is not limited to DevOps. Any scenario requiring multi-step orchestration, inter-agent collaboration, and long-term reliable operation is suitable — code migration, security auditing, documentation generation, data pipeline validation. As long as your task can be broken down into "steps + contracts + dependencies", it can be expressed with a runbook.

The repo is at github.com/KnoxOps/agent-runbook. Feel free to try it out and give feedback. If you have a workflow where you're repeatedly prompting agents manually, try writing it as a runbook: you'll find that once it becomes a contract, the cost of debugging and reuse drops significantly.
