DEV Community

Nic Lydon

The Drift from Chat to Backlog: How My AI Task Planning Evolved Over Three Months

Three months ago, my entire task-management system was a chat window I'd lose when the tab closed. Today it's a Postgres backlog that three different coding agents — Claude Code, Codex, Grok — pull work off autonomously, stamp with attribution, and close against git history. I never decided to build a project-management system. I just kept hitting a wall, patching it, and hitting the next one the patch exposed.

There's a clean way to read the whole arc, though, and it comes down to a single variable: where the plan lives. Watch that, and every step makes sense — including why you probably want to stop well before the end.

The setup

I run a self-hosted personal data platform called Nexus, on a 128GB Strix Halo box named Furnace, surrounded by ~100 repos: MCP servers, ingestion pipelines, iOS apps, content tooling. My execution tools are Claude Code and Codex CLI. The work is bursty — during one 35-day stretch in spring I shipped roughly 557K lines across those repos — and that throughput is the pressure that broke each planning approach in turn. At a calmer pace you'll hit the same walls later, but you'll hit them.

Phase 1: The plan lives in the chat (mid-to-late March)

At the start there was no task system. Planning was the conversation. I'd open a chat, think out loud about an architecture problem, get to something coherent, and then go build it. The artifact of planning was a better mental model in my head, not a written thing.

This is, for the record, exactly what every best-practice guide tells you to do, and it's right. The math is brutal and well-known: if Claude makes the right call 80% of the time on any single decision, a feature with 20 decision points lands all 20 at 0.8^20 — about 1%. Planning collapses those 20 live decisions into a reviewed spec where each one is already made. I'd never give up the plan-first instinct; it's the one habit from this whole story that never changed.
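The arithmetic is easy to check. A minimal sketch, assuming every decision is independent and has the same success rate (a simplification; real decisions in one feature are usually correlated), with `all_correct` as an illustrative name:

```python
def all_correct(p: float, n: int) -> float:
    """Probability that n independent decisions, each right with probability p, all land."""
    return p ** n


for n in (1, 5, 10, 20):
    print(f"{n:>2} decisions at 80%: {all_correct(0.8, n):.1%}")
# 20 decisions at 80% comes out to about 1.2%
```

The curve is what matters more than the exact figure: each extra live decision multiplies the odds down again, which is why moving decisions into a reviewed plan pays off so quickly.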

The problem was narrower: the plan evaporated when the thread ended. A late-March session designing a multi-agent system for Nexus produced genuinely good architecture — deterministic behavior under load, agents that self-regulate instead of spiraling, adaptive thresholds. None of it was anywhere I could act on the next morning except my memory and a scrollback buffer. At one-feature-a-day that's survivable. At my pace it was lossy in a way that actively cost me work.

The wall: plans that exist only in chat history can't be acted on later, can't be prioritized against each other, and can't be handed to anything but the version of you that remembers the conversation.

Phase 2: The plan lives in a file (mid-to-late April)

The first durable fix was embarrassingly simple: a TODO.md in each repo. But the structure I landed on is the part worth stealing, because it wasn't a checklist. Each item was a small spec. Here's a real one, still in my broadside/TODO.md:

## Idempotency on publish operations

**Status:** captured — flagged during the 2026-04-26 reality-sync session.
**Trigger:** before letting an agent publish unsupervised at any volume.

Today, POST /api/posts/[id]/twitter (and the bluesky / linkedin / devto
siblings) don't refuse a re-publish. If a publish succeeds upstream but the
response gets lost in a network blip, an agent's retry would publish a
duplicate — visibly, to the audience.

**What the work will involve:**
1. Before calling upstream: SELECT posted_url, status FROM posts WHERE id=$1.
   If already posted, refuse with 409 + { already_posted, posted_url }.
2. Optional: accept an Idempotency-Key header, TTL'd table, ~10min window.
3. Update the broadside_publish_to_platform MCP tool description accordingly.

**Risks worth knowing:**
- The "republish" path is sometimes intentional (manual delete + re-fire).
  Mitigate with a posted_at recency check or a force=true param.

**Why it matters:** an agent that double-posts even once is visible to the
audience; the cost is reputational, not just internal.

Four things made this carry weight beyond a checkbox:

  • Status + date. Every item is stamped with when it was captured. Trivial-sounding; it's what lets you reason about staleness later.
  • Trigger, not priority. Instead of P1/P2/P3, each item records the condition under which it becomes urgent. "Before letting an agent publish unsupervised" tells future-me exactly when to pull this off the shelf; "high priority" tells me nothing.
  • The work is pre-decided. That numbered list is a plan captured at the moment I understood the problem best. Hand it to Claude Code three weeks later and the 20 decisions are already made — Phase 1's whole value, persisted.
  • Risks are written down. The "republish is sometimes intentional" note is exactly the edge case I'd have forgotten and an agent would have trampled.
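To show how far a pre-decided item gets you, here is a rough sketch of step 1 from that TODO entry. It is not the Broadside code: it uses an in-memory SQLite table as a stand-in for Postgres, and the `publish_guard` name, the `"posted"` status value, and the 404 branch are all assumptions made for illustration. Only the SELECT, the 409 body, and the `force` escape hatch come from the item itself.

```python
import sqlite3


def publish_guard(conn, post_id, force=False):
    """Return (http_status, body) saying whether a publish call may proceed."""
    row = conn.execute(
        "SELECT posted_url, status FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    if row is None:
        return 404, {"error": "unknown post"}  # assumed behaviour, not in the TODO
    posted_url, status = row
    if status == "posted" and not force:  # "posted" is an assumed status value
        return 409, {"already_posted": True, "posted_url": posted_url}
    return 200, {"ok": True}  # the caller may now hit the upstream platform


# Hypothetical fixture data for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, status TEXT, posted_url TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "posted", "https://example.com/p/1"), (2, "draft", None)],
)

print(publish_guard(conn, 1))              # refused: already posted
print(publish_guard(conn, 1, force=True))  # intentional republish is allowed
print(publish_guard(conn, 2))              # never posted, so it goes through
```

A read-then-publish check like this narrows the duplicate window but is not atomic, so two retries arriving at the same moment could both pass it.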

Notice the recurring phrase: "reality-sync session." Concretely, that was a 20-minute pass, usually before a planning block: open each repo's TODO.md next to its recent git log, close anything the commits show as already shipped, and re-date anything still open so I could tell stale from live at a glance. Reconciling the plan against ground truth on a cadence — that habit turns out to be the seed of everything in Phase 3.
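The re-dating half of that pass is mechanical enough to sketch. The snippet below assumes each item starts with a `## ` heading and carries an ISO date somewhere on its `**Status:**` line, as in the example above; the `stale_items` helper, the 30-day threshold, and the sample text are hypothetical. It does not attempt the git-log comparison, which still needs a human or an agent reading the commits.

```python
import re
from datetime import date

# First ISO date found on a "**Status:**" line.
STATUS_DATE = re.compile(r"^\*\*Status:\*\*.*?(\d{4}-\d{2}-\d{2})", re.MULTILINE)


def stale_items(todo_text, today, max_age_days=30):
    """Return (title, age_in_days) for items whose status date is older than the threshold."""
    stale = []
    for section in re.split(r"(?m)^## ", todo_text)[1:]:
        title, _, body = section.partition("\n")
        match = STATUS_DATE.search(body)
        if match is None:
            continue  # undated items need a manual look
        age = (today - date.fromisoformat(match.group(1))).days
        if age > max_age_days:
            stale.append((title.strip(), age))
    return stale


# Hypothetical TODO.md content for the demo.
sample = """## Idempotency on publish operations

**Status:** captured, flagged during the 2026-04-26 reality-sync session.

## Some newer item

**Status:** captured 2026-05-20.
"""

print(stale_items(sample, today=date(2026, 6, 3)))
```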

The wall: TODO.md is per-repo. With ~100 repos, I had no single surface that could tell me what to work on next across all of them, no way to prioritize globally, and nothing an agent could pull from as a queue. The plan was durable but fragmented.

Phase 3: The plan lives in a database with a gate (early-to-mid May onward)

This is the structural leap. Task state moved out of flat files and into Nexus's Postgres as the Operator Backlog (OB) — with a real intake-to-execution lifecycle instead of a list.

The shape:

  • Work enters as a candidate, in pending state — not yet a real task. A candidate won't become an OB item until I approve it in a #pmo-review Discord flow.
  • Approval mints an OB-##### row in operator_backlog_items.
  • Items move through status lanes (requires_triage, requires_decision, requires_investigation, requires_clickops, autonomous_safe), and agents drain those lanes.
  • Every commit references its OB id, joining the backlog directly to git history. "OB-27081 H2/M2 — close /register on RS" is a real commit subject across several of my repos.
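The shape above can be modelled in a few lines. This is an in-memory sketch, not the Nexus schema: the class names, the `approve` method, the counter-based id, and the choice of `requires_triage` as the default landing lane are all assumptions. Only the lane names, the pending state, and the OB id in commit subjects come from the description.

```python
import re
from dataclasses import dataclass
from enum import Enum
from itertools import count


class Lane(Enum):
    REQUIRES_TRIAGE = "requires_triage"
    REQUIRES_DECISION = "requires_decision"
    REQUIRES_INVESTIGATION = "requires_investigation"
    REQUIRES_CLICKOPS = "requires_clickops"
    AUTONOMOUS_SAFE = "autonomous_safe"


@dataclass
class Candidate:
    title: str
    state: str = "pending"  # not a real task until approved


@dataclass
class BacklogItem:
    ob_id: str
    title: str
    lane: Lane


class Backlog:
    def __init__(self, first_id=1):
        self._ids = count(first_id)
        self.items = {}

    def approve(self, candidate, lane=Lane.REQUIRES_TRIAGE):
        """The gate: only an explicit approval turns a candidate into an OB row."""
        if candidate.state != "pending":
            raise ValueError("candidate was already decided")
        candidate.state = "approved"
        item = BacklogItem(f"OB-{next(self._ids)}", candidate.title, lane)
        self.items[item.ob_id] = item
        return item


OB_REF = re.compile(r"\bOB-\d+\b")


def ob_ids_in_subject(subject):
    """Extract OB ids from a commit subject so git history can be joined to the backlog."""
    return OB_REF.findall(subject)


backlog = Backlog(first_id=100)
item = backlog.approve(Candidate("tighten intake regex"))
print(item.ob_id, item.lane.value)
print(ob_ids_in_subject(f"{item.ob_id} tighten intake regex"))
```

Keeping the candidate and the backlog item as separate types makes the gate hard to bypass by accident: nothing gets an OB id without passing through `approve`.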

The lanes are the interesting part, and they map almost exactly onto a distinction Anthropic draws in Building Effective Agents: the difference between work an agent can drive autonomously and work that needs a human checkpoint "before irreversible actions." autonomous_safe is the lane an agent can just do. requires_decision is the lane that needs me. The backlog isn't just storage; it's a router that sorts work by how much human judgment it still needs.

The single most important addition in this phase wasn't the lanes, though. It was the approval gate. Phase 2 captured indiscriminately — anything I typed that looked like a task got written down. Phase 3 added a filter, and I know exactly why, because I watched it fail without one. From a real session log on June 3:

Drained requires_triage + requires_decision queues (19 items → autonomous_safe/closed/ignored); 8 decisions made; discovered auto-filer over-captures on soft prose in narration (e.g. "blocked on" → filed OB-4736 to tighten regex); priority-lane starvation (skillopt-train starving PM jobs) diagnosed in OB-4715.

An automated process was scraping my session narration for tasks and mistaking the phrase "blocked on" — used conversationally — for a real blocker. The system was filing garbage into its own backlog. The fix (OB-4736) was itself filed as an OB item, through the gate. The backlog had become self-correcting: its own intake bugs are tracked in the same substrate as everything else.
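The failure mode is easy to reproduce. The actual auto-filer regex isn't shown here, so both patterns below are hypothetical: `LOOSE` stands in for a rule that fires on the phrase alone, and `STRICT` for one possible tightening that only files lines carrying an explicit marker (the `BLOCKER:` and `TODO:` prefixes are invented for the example).

```python
import re

# Hypothetical "before": fires on the phrase wherever it appears in narration.
LOOSE = re.compile(r"\bblocked on\b", re.IGNORECASE)

# Hypothetical "after": only lines that start with an explicit marker are filed.
STRICT = re.compile(r"^\s*(?:BLOCKER|TODO):\s+(?P<task>.+)$", re.MULTILINE)

narration = """Spent the morning blocked on a slow test suite, fine now.
BLOCKER: staging deploy fails until the cert is renewed
"""

loose_hits = [line for line in narration.splitlines() if LOOSE.search(line)]
strict_hits = [m.group("task") for m in STRICT.finditer(narration)]

print(loose_hits)   # the conversational aside gets filed as a task
print(strict_hits)  # only the explicitly marked blocker is captured
```

The trade-off is the usual one for intake filters: the strict pattern will miss real blockers that are only mentioned in passing, which is one more reason to keep a human review step behind it.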

That same log entry shows the daily rhythm this phase settled into. "Draining queues" became a literal, recurring operation — pull the items in a lane, make the decisions, move them to autonomous_safe or close them. Eight decisions in one sweep. It reads like an on-call shift, because that's effectively what it is: I'm the operator, the backlog is the queue.

The wall: a database-backed queue with a governance gate is great, but it assumes disciplined intake and it assumes someone drains the lanes. As volume grew, I was the bottleneck. The backlog could hold more work than I could personally execute.

Phase 4: Many agents drain the backlog (late May → now)

The most recent shift isn't about how tasks are tracked — Phase 3 settled that. It's about who executes them, and how you stay accountable when it's not just you.

The OB backlog is now a shared work queue that multiple runners pull from: Claude Code, Codex, and Grok, each tagged so I can tell after the fact which agent did which item. The same status lanes from Phase 3 keep it safe — an agent only picks up work already in autonomous_safe, never anything still sitting in requires_decision. Conflict is handled by leases: a runner claims an item, a second runner sees it's taken and skips it, and if the first agent dies the lease expires and the item frees itself.
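The lease behaviour described above (claim, skip if taken, free on expiry) fits in a small class. This is a single-process, in-memory sketch with hypothetical names (`LeaseTable`, `claim`, `release`) and an injectable clock so expiry can be shown without waiting. A version backed by Postgres would need the claim to be one atomic statement, which this does not attempt.

```python
import time


class LeaseTable:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._leases = {}  # ob_id -> (runner, expires_at)

    def claim(self, ob_id, runner):
        """Return True if the runner now holds the item, False if another live lease exists."""
        now = self.clock()
        holder = self._leases.get(ob_id)
        if holder is not None and holder[0] != runner and holder[1] > now:
            return False  # taken by someone else and not yet expired: skip it
        self._leases[ob_id] = (runner, now + self.ttl)  # new claim or renewal
        return True

    def release(self, ob_id, runner):
        """Give the item back early; only the current holder can release it."""
        holder = self._leases.get(ob_id)
        if holder is not None and holder[0] == runner:
            del self._leases[ob_id]


# Fake clock so the demo can jump forward in time.
now = [0.0]
leases = LeaseTable(ttl_seconds=60, clock=lambda: now[0])

print(leases.claim("OB-1", "claude-code"))  # True: first claim wins
print(leases.claim("OB-1", "codex"))        # False: a second runner skips it
now[0] = 61.0                               # the first runner died; its lease ran out
print(leases.claim("OB-1", "codex"))        # True: the item freed itself
```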

Attribution stamping is the load-bearing piece. Because each commit and OB resolution is tagged by runner, "who did this and why" stays answerable even with three agents touching ~100 repos. The rule is that the human owns the commit while the runner tag lives in the backlog, never forged into git history. Each run executes in its own isolated git worktree, so parallel agents never share a working directory.

That's the whole machine. Here's the moment it stopped being theory for me. OB-1623 was "wire a model-provenance footer into report delivery." Claude Code claimed it on May 30, started working — and then refused to finish it. It had discovered the task's premise was wrong: the function the task named was a shared primitive with ten callers, and the files it pointed at didn't even hold the data the footer needed. Instead of forcing a fix that would have quietly broken nine other call sites, it blocked the item, filed a corrected prerequisite as a new OB, and released its claim with a note explaining exactly why. Two days later, after the prerequisite landed, Codex picked up the same OB cold — no shared memory with the Claude Code run, just the backlog row and the blocking note — and shipped it end to end: PR merged, Nexus deployed to both Furnace and Crucible at c584d2a8, the live footer renderer verified emitting the right markers.

Read that sequence again, because it's the whole point of Phase 4 in one item. Two different models, no human mediating the handoff, and the system self-corrected across the gap — one agent's refusal to do the wrong thing became another agent's clean win, because the reason for the refusal was written down in the one place both of them could see. That's not parallelism. That's the backlog doing the thing a good engineering team does: catching a bad assumption before it ships, and carrying the correction forward to whoever picks the work up next.

The pattern underneath

Here's the whole arc as a table. Read the "Fixed" and "New limit" columns as a chain — each phase's new limit is the next phase's reason to exist.

| Phase | Where the plan lives | Fixed | New limit |
| --- | --- | --- | --- |
| 1. Conversational | Chat history | Decision quality (plan-first) | Plans evaporate |
| 2. TODO.md | Per-repo files | Persistence | No global view or priority |
| 3. OB / PMO | Postgres + approval gate | Global queue, governance, routing | Needs disciplined intake; you drain it |
| 4. Multi-agent OB | Backlog + worktrees + attribution | Parallel execution, accountability | Coordination overhead |

Two things are worth pulling out of that table.

First: you do not need to reach Phase 4 to get most of the value. Take the Phase 2 TODO.md format — status, trigger, pre-decided steps, risks — and nothing else. It's a text file; it costs nothing; and everything after it is just scaling that same captured-plan idea to more repos and more executors. If you steal one thing from this post, steal that.

Second, and this is the part I'd defend hardest: one habit spans all four phases and predates the tooling. I re-ground against real state before I plan. It's the April "reality-sync session." It's the Phase 3 gate reconciling candidates against truth. And it shows up, almost word for word, when I catch an analysis cutting corners. A workflow review I ran in June leaned on convenient pre-summarized views instead of the raw tables, and my response was blunt:

This doesn't seem like it's aware of any of the PMO processes or project initiation… did you look through all of the actual raw ingestion tables?

The tooling got more elaborate across these three months; the discipline never changed: plan only against verified current state. The backlog, in the end, is just the most durable place I've found to keep that state, so that planning, whether it's me or an agent doing it, starts from the ground and not from a guess. The plan-first math everyone quotes only holds if the plan rests on true premises. A perfectly-structured spec built on stale context fails all 20 decisions just as surely as no plan at all; it just fails them faster, and with more confidence. Every phase here was, underneath, a better answer to the same question: where do I keep the truth the plan depends on?


I build a self-hosted personal AI data platform in the open. The one design call I'm still least sure about: whether the human approval gate at Phase 3 is a permanent feature or just scaffolding I haven't automated away yet. If you've run agents against a shared queue, where did you draw that line — and did it hold?

Top comments (1)

twRty Connect

"Trigger, not priority" is the sleeper insight here and I think it deserves more attention than the multi-agent architecture.

Priority labels (P1/P2/P3) encode someone's assessment at a point in time, and that assessment ages badly. "Before letting an agent publish unsupervised" never goes stale — it's still true six months later, and it's self-executing: the moment the condition is met, you know exactly which backlog item to pull. Priority tells you how urgent it felt when you wrote it. Trigger tells you when it becomes urgent.

The discipline you call out — "plan only against verified current state" — is the thread everything else hangs on. The June example where an analysis was using summarized views instead of raw tables is the failure mode hiding inside most AI workflows: the model produces something structurally correct but built on stale premises. The backlog as "the most durable place to keep the truth the plan depends on" is a clean reframe of what a project management system is actually for.

On your open question about the approval gate: my instinct is it's load-bearing in a way that won't automate away cleanly. The gate isn't just filtering bad tasks — it's where you reconcile what agents think the work is with what you actually want. That reconciliation loop seems like the kind of human judgment that degrades gracefully if you skip it but whose absence is invisible until something unexpected ships.