For a while I felt slightly embarrassed about keeping two agentic coding tools open at once. Claude Code in one terminal, Codex in another. It looked like I couldn't commit to one. Then I noticed I was reaching for each of them at different moments, on purpose, and the embarrassment turned into a workflow.
The short version: one of them is for building and exploring, the other is for running the boring, repeatable work. This post is the division of labor I landed on, built around the routine automation that made it obvious, plus the cost logic underneath it. I build WordPress plugins, so my examples lean that way, but the split is general.
The split that took me a while to see
Some tasks are a conversation. You poke at the problem, change your mind, follow a thread, back up. Other tasks are a straight line. You know exactly what you want done, you just don't want to do it by hand for the fortieth time.
I use Claude Code for the first kind. It holds the whole project in its head and is comfortable going back and forth while a design takes shape. I use Codex, specifically its non-interactive mode, for the second kind: the straight-line, do-this-exact-thing work that I want to fire from a script.
Once I framed it as conversation versus straight line, the choice of tool stopped being a vibe and became a question I could answer in a second.
Codex in non-interactive mode is the automation workhorse
The piece that made the split practical is codex exec. Instead of opening a chat, you hand Codex one instruction and it runs once and prints the result to stdout. That is the part you can put in a script.
codex exec "summarize the structure of this repo in one paragraph"
I set the model and reasoning once, in ~/.codex/config.toml:
model = "gpt-5.5"
model_reasoning_effort = "medium"
approval_policy = "on-request"
sandbox_mode = "workspace-write"
Medium reasoning is a deliberate choice. Routine work is not hard design thinking, it's mechanical edits and summaries, and pointing heavy reasoning at it just makes the run slower and pricier without changing the output. GPT-5.5 at medium is plenty for this, and I bump it up in the moment only when a task actually turns hard. approval_policy = "on-request" makes Codex ask before it writes files or runs commands, and sandbox_mode = "workspace-write" keeps it from touching anything outside the working folder. Both are safety rails I leave on by default.
Project conventions go in AGENTS.md, which is Codex's version of a CLAUDE.md. Codex reads it before each task, so the output stays consistent with how the project wants things done.
The routine work I actually automate
Here is the boring stuff that used to nibble at my day.
Commit messages, from the staged diff:
git add -A
codex exec "read git diff --staged and output a single-line commit message that summarizes the change. No preamble, message only."
Version bumps, which is the one that earns its keep. A WordPress plugin keeps its version in two places that have to match: the Version: header in the main PHP file and the Stable tag: in readme.txt. Miss one and the release breaks. By hand, I get this wrong often enough to dread it.
codex exec "bump this plugin's version from 1.0.9.10 to 1.0.9.11. Change two places: the Version: header in the main PHP file and the Stable tag in readme.txt. Change nothing else."
With on-request, Codex shows me the diff before applying it, so I confirm the two changes are exactly what I asked for. Then I wrap the release chores into one script:
#!/usr/bin/env bash
set -euo pipefail
NEW_VERSION="$1"
codex exec "bump the plugin version to $NEW_VERSION in the PHP header and readme.txt Stable tag, nothing else."
codex exec "add a $NEW_VERSION section to the top of CHANGELOG.md from recent commits, matching the existing format."
git diff # I read this before anything ships
bash release-prep.sh 1.0.9.11
The thing that used to be a careful five-minute ritual is now one command and a diff review.
Where the second tool earns its place
If codex exec handles the straight-line work, why keep Claude Code in the loop at all? Because the two are good at different things, and a few patterns only work when you have both.
The one I use most is cross-model review. I build something with Claude Code, then have Codex review the diff:
codex exec "review git diff for security issues and bugs. Cite file and line for each problem. Give findings only, not praise or general impressions."
A model reviewing its own output tends to like what it wrote. Hand the diff to a different model and it trips on things the first one walked past as obvious. The instruction to skip praise matters more than it looks. Without it you get "this looks solid" followed by a soft non-answer. Ask for problems and locations, nothing else, and the review gets useful.
The second pattern is extract-the-repeat. I explore a new feature interactively in Claude Code, and somewhere in that mess I notice a step I'm going to do every time. That step gets pulled out into a codex exec line and added to a script. The thinking stays in the conversational tool, the repetition moves to the straight-line one.
The third I save for changes I can't afford to get wrong: run the same request through both and compare.
claude -p "propose a refactor for this function" > claude.txt
codex exec "propose a refactor for this function" > codex.txt
diff claude.txt codex.txt
If both land in the same place, I relax. If they diverge, that gap is exactly where a human decision is needed. It's too heavy to do constantly, so it's reserved for the scary diffs.
Handing context between them
Switching tools has a small tax, and how you pay it matters. Claude Code reads CLAUDE.md, Codex reads AGENTS.md. I keep both in the repo with the same conventions so either tool behaves the same way. The trap is updating one and forgetting the other, so changing a convention means editing both, every time.
When I move a long task from one tool to the other, I don't dump the whole history across. I have the first tool summarize where things stand, and hand over the summary. These tools can only hold so much at once, so moving the gist instead of the full transcript keeps the second tool sharp.
The cost logic, including the June 15 change
Money is part of why two tools beats one here. My rough rule: do the long, exploratory work where it's flat-rate, and the short, mechanical work where metered is cheap anyway. Interactive Claude Code runs inside the subscription. A codex exec call is small, so even metered it costs little per run.
This got sharper on June 15, 2026, when Anthropic moved programmatic Claude use, the claude -p headless path and the Agent SDK, off the subscription and onto separate metered credit. Interactive Claude Code in the terminal stayed on the plan. So scripting Claude with claude -p is no longer a flat-rate move. Which lines up neatly with the split I already had: explore interactively in Claude Code on the flat plan, run short automation through codex exec where metered is cheap. Pricing and terms shift, so check the current numbers on each vendor, but the shape of the logic holds.
The loop, end to end
Put together, a small feature looks like this:
- Build it interactively in Claude Code.
- Have Codex review the diff.
- Fold in the findings, read the diff myself, commit.
- Run
release-prep.shfor the version bump and changelog. - Read
git diffone more time, then push.
Build, check, tidy, each handed to the tool that's good at it, with judgment and the final read kept in my hands.
What goes wrong with two tools
Running two has its own friction, and pretending it doesn't is how you lose the benefit.
- Two convention files drift.
CLAUDE.mdandAGENTS.mdfalling out of sync means the tools start behaving differently. Edit both. - Don't point both at the same files at once. One tool's edit can stomp the other's. Work one at a time.
- Reviewers drift into agreement. Without "findings only," the review turns into polite approval.
- Don't over-tool. Two tools on a job that needs one is just more setup and more cost.
Two is not always better than one. If one tool covers it, use one. Reach for both only when the split pays: a separate reviewer, a flat-versus-metered cost difference, a build phase and a repeat phase that genuinely want different strengths.
One rule I don't break
The speed is real, and it makes a bad habit tempting: approving diffs without reading them, or running automation with the approval prompt turned off. Treat anything either tool writes as untrusted input until you've read it. Version numbers, config, anything touching user input, get a human diff review before they commit or ship, no matter which model produced them. Keep the approval prompt on outside of contexts you fully understand. The point of automating the boring work is to free up attention, so spend some of it on the review.
A note to my next self
The division held up because it maps to something real: some work is a conversation and some work is a straight line, and the tools are honestly better at one or the other. Build and explore in the conversational one, run and repeat in the straight-line one, let a different model check the first model's work, and put the boring release chores behind a single command. Keep judgment and the last diff for yourself. That's the whole system, and the day one tool covers the job, I'll happily use one.
References
- Codex CLI docs (OpenAI Developers)
- Codex non-interactive mode (codex exec)
- Anthropic Help Center, on the June 15 billing change
I build WordPress plugins and write about AI tooling and security at https://raplsworks.com/.
Top comments (26)
tried this same setup - the embarrassment fades. what i noticed: the gap isn't capability, it's latency tolerance. repeatable work can wait a bit longer for a result.
The latency-tolerance framing is sharper than the one I used, and I think it's the same line from a better angle. Conversation work is latency-intolerant because you're in the loop waiting to react, so an interactive tool that answers fast keeps the thread alive. Straight-line work is latency-tolerant because nobody's watching the result land, so it can run async, batched, fired from a script. That's why codex exec fits the repeatable side: not that it's less capable, but that the work it's doing doesn't mind waiting. You named the property I was sorting on without saying it out loud. Thanks for that.
latency-tolerant vs latency-intolerant is a better cut than interactive vs batch - it captures why the boundary shifts. once nobody is waiting in real time, the tool matters less than how well it handles queued work.
That last point is the part I hadn't followed through. Once nobody's waiting in real time, the question stops being "which tool" and becomes "how well does it handle queued work," which is a completely different evaluation. Interactive tools get judged on responsiveness; async tools get judged on throughput, retries, how cleanly they report a failure you weren't watching land. Same task, different scoring function, decided entirely by whether a human is in the loop.
Which is a nicer version of my whole post than my whole post was. "Conversation vs straight line" was the symptom, latency tolerance is the mechanism, and the tool choice falls out of it instead of being the thing you start from. Good thread, this sharpened it.
retry contract changes completely with it too - interactive you want fast fail, async you want smart retry with a clean error state you can inspect hours later. treating them the same is where teams get burned
The retry contract is the perfect closing piece, because it's where latency tolerance and reversibility meet. Interactive wants fast fail: a human's waiting, surface the error now and let them decide. Async wants smart retry into a clean, inspectable error state, because the only debugger present hours later is the state you left behind. Same operation, opposite retry semantics, and the deciding factor is whether anyone's there to react when it breaks.
Treating them the same is exactly where teams get burned, usually by giving async work the interactive contract: fail fast into a void nobody was watching. The failure you can't see at the moment it happens is the one that has to leave evidence. That ties the whole thread together. Good exchange, this one taught me something.
and in async pipelines that state is also your only audit trail - no replay button, no conversation to scroll back through. clean error states aren't nice to have, they're the whole debugging surface.
"The whole debugging surface" is the line that ends this for me. Interactive work has a conversation you can scroll back through, so a sloppy error state is survivable. Async has none of that: no replay, no transcript, just whatever the process wrote down before it stopped. The error state isn't reporting the failure, it is the failure as far as anyone investigating later can see.
Which is the same lesson the reversibility thread landed on, from the other side: the part you can't watch in real time has to leave evidence, because the evidence is all future-you gets. Clean error states are the audit trail you're choosing whether or not to have. Best exchange I've had on here in a while. Thanks for taking it the whole way.
Error state as the failure itself, not just the report - that reframe changes the design question completely. Instead of how do we format this error, it becomes what does an investigator need to reconstruct what happened. Those are very different specs, and most teams are still writing to the first one.
That reframe is the whole payload. "How do we format this error" optimizes for a human reading it now; "what does an investigator need to reconstruct what happened" optimizes for a stranger reading it cold, later, with no other context. Different reader, different spec. The first writes a message; the second writes a record.
And you're right that most teams write to the first one, because the first reader is the one they can picture. The investigator is hypothetical until the incident, and by then the spec's already set. Designing for the reader who isn't in the room yet is the hard discipline. That's the note I'm ending this on. Genuinely good thread, Mykola.
that timeline gap is the thing nobody designs for. useful right now vs useful in 3 months during a postmortem are genuinely different specs, and almost every error message I have seen is optimized for the first one.
"Useful now vs useful in 3 months" is the cleanest way to say it. That's the gap I'm walking away from this thread thinking about. Thanks, Mykola.
glad it resonated. that three-month gap is also where org memory breaks — the people who built the system often aren’t the ones debugging it later, which makes the "useful in a postmortem" spec even harder to nail in advance.
Right, and that makes the reader even more of a stranger: not future-you, but someone else entirely, reading the error state with none of the context you had when you wrote it. The spec gets harder because you can't even rely on shared memory to fill the gaps. You're writing for a person who was never in the room, which is the strongest version of the discipline. Good place to leave it. Thanks, Mykola.
yeah - 'never in the room' is the right frame. good thread.
The version-bump example hit a nerve — that Version header / Stable tag mismatch breaking a release is exactly the kind of silent failure that's invisible until it ships. Automating that diff-reviewed is smart.
Tangent from your WordPress angle: I've been scraping the plugin directory lately and the abandonment patterns are wild — plugins with 50k+ active installs, recent reviews full of "broke on the latest WP version," and an author who last shipped 18 months ago. Your two-tool loop is basically the antidote to how those plugins died: nobody kept a cheap, repeatable release ritual, so updates got heavy and eventually stopped. The "boring chores behind one command" discipline is what keeps a plugin alive past year two.
One question on the cross-model review — when Codex flags something Claude wrote (or vice versa), how often is it a real bug vs stylistic noise? Curious whether the divergence rate is high enough to be worth the second pass on routine diffs, or only on the scary ones.
The abandonment data is the part I want to sit with. "Nobody kept a cheap, repeatable release ritual, so updates got heavy and eventually stopped" is a better description of how plugins die than anything I wrote. It's not a dramatic failure, it's the release getting a little more painful each time until the author quietly stops paying it. The 50k-install plugin with an 18-month-old last release isn't abandoned by decision, it's abandoned by friction. Keeping the chores behind one command is exactly the thing that postpones that day.
On your question, the honest split from my own diffs: most cross-model flags are stylistic noise. Roughly, if I had to put numbers on it, the large majority is taste, naming, structure, "I'd have done it differently" with no actual defect. A smaller slice is real, and a thin slice inside that is the kind of silent bug that ships. So the divergence rate alone doesn't justify a second pass on every routine diff. What I do instead is gate by blast radius, not by how scary the diff looks. Anything that touches state I can't cleanly roll back, a migration, a write, a release step like that version bump, gets the second model every time, because there the rare real bug is unrecoverable. Pure read paths and local refactors I let ride on one model, because the worst case is cheap. The second pass isn't paying for itself on average. It's paying for itself on the tail, and the tail is where reversibility decides.
What's your read from the scraped data: do the plugins that survive past year two show any visible sign of that ritual, or is it invisible from the outside until you notice they just kept shipping small updates?
Great question, and the data backs your "death by friction" read almost too well. The survivors are visible from the outside, but only if you look at update cadence instead of update size. The plugins still alive past year two show a steady drip — many small releases, often just "tested up to WP 6.x" bumps and minor fixes, nothing dramatic. The dead ones show the opposite signature: a few big sporadic releases, then a gap that keeps stretching. You can almost see the friction winning in the commit rhythm — the releases get further apart before they stop entirely, like the author needed a bigger and bigger reason to face the ritual each time.
The other tell is the review-to-update lag. On survivors, a "broke on 6.x" review gets a response release within weeks. On the ones sliding toward abandonment, those reviews pile up unanswered for a release cycle or two first — the friction shows up as latency before it shows up as silence. So it's not invisible, but it's a derivative signal: you're watching the gaps grow, not any single release.
Which loops back to your point — the cheap repeatable ritual isn't just hygiene, it's literally what keeps the cadence regular enough to survive. The plugins that automated the boring part kept shipping small; the ones that didn't let each release get heavy until it got skipped.
"The friction shows up as latency before it shows up as silence" is the sharpest thing in this whole thread, and the review-to-update lag is the tell I wouldn't have thought to look for. It makes the decline measurable while the plugin is still technically alive. By the time there's silence it's a postmortem, but a stretching response lag is a vital sign you can read in real time.
The derivative framing is the part I keep turning over. You're not reading any single release, you're reading the second-order signal, whether the gaps are widening. That matches what it feels like from the inside exactly. No plugin dies on a decision. What actually happens is each release costs a little more focus than the last, so you wait for a bigger reason to start, and the interval quietly stretches until one day the reason never gets big enough. The cadence doesn't break, it decays.
Which is why the cheap ritual matters more than it looks: it isn't saving time per release, it's keeping the per-release cost flat so the interval never starts stretching in the first place. Flat cost, steady cadence, survival. Rising cost, widening gaps, the slide. You can almost predict which plugins make it from the slope of their release intervals alone. Genuinely good data, thanks for digging into it.
Appreciate you taking it the whole way — "flat cost keeps the interval from stretching" is the version I'm keeping. That's the thing the maintained alternative actually sells, when you think about it: not features, just a cadence the original author couldn't sustain. Good thread, learned from it. I'll be around — following your stuff.
Great breakdown of the conversation vs straight-line split. That got me thinking about a third mode I keep running into: brainstorming/ideation. Both Claude Code and Codex tend to jump straight to implementation when asked to explore ideas — that execution drift is a real productivity killer.
I put together a small hook-based plugin (Brainstorm-Mode by mehmetcanfarsak on GitHub) that uses PreToolUse hooks to keep agents in ideation mode and block premature tool calls. Three modes (divergent, actionable, academic) so you pick the right headspace. Plugs right into the hook system if anyone's interested in the thinking behind it.
The third mode is a real one, and "execution drift" is a good name for it. Both tools do lean toward implementation the moment you ask them to explore, and the cost is that you never actually get the divergent thinking you wanted, you get a half-built version of the first idea instead.
What's interesting is that your fix sits on a different axis than my post. Mine was about reach, which tool talks to the outside. Keeping an agent in ideation and blocking premature tool calls is the suggestion-versus-guarantee axis another commenter raised: a prompt can ask the model to stay in exploration, but a PreToolUse hook can actually stop the tool call. That's the same reason hooks beat a Skill for load-bearing rules. "Don't execute yet" is exactly the kind of rule you want to have teeth, because a model that drifts past it once has already cost you the mode.
nice setup. but sincerely i prefer using one model for almost everything.
Like
Some comments may only be visible to logged-in visitors. Sign in to view all comments.