OpenAI Codex CLI in the Real World: What Actually Works

A deep dive on OpenAI Codex CLI in real workflows: where it beats Claude, where it fails, and the patterns that let it earn a permanent pane.

April 14, 2026 · 7 min read

Codex CLI was easy to underrate at launch. It looked like a Claude Code clone with a different wrapper, and a lot of reviewers treated it that way. After six months of daily use alongside Claude Code in a parallel grid, I can say it isn't a clone — it has a distinct character, different strengths, and a real place in a senior developer's workflow.

This is the deep dive I wanted when I started. Where Codex actually wins, where it loses to Claude, and the operating manual that keeps it useful instead of frustrating.

The shape of Codex

Codex's design center is "do the thing, minimize back-and-forth." It reads less, talks less, and commits to a direction faster. For the right task, that's exactly what you want. For the wrong task, it produces confident code that isn't what you asked for.

Three things I noticed early:

  • It rarely asks clarifying questions. If your prompt is vague, it guesses.
  • Its tool-use loop is tight. Fewer re-reads, faster iterations.
  • It has stronger shell-flavored instincts than Claude. It reaches for grep, sed, and inline scripts more naturally.

This makes it a great implementer, a weak architect, and an excellent "just type it" tool. Full head-to-head is in Claude vs Codex vs Qwen.

Where Codex beats Claude

There are specific tasks where I'd pick Codex over Claude every time:

Shell-heavy workflows. Pipelines, grep-and-transform, running a bunch of small commands in sequence. Codex is faster here because it doesn't narrate; it just runs.

Specced implementation. If you hand it a two-sentence spec and a clear acceptance criterion, Codex implements it faster than Claude does. The speed is real — in my timing, something like 20-30% less wall-clock time on specced tasks.

File-level edits when you know the file. "In utils.ts, change parseDate to handle ISO-with-offset input." Codex dives in. Claude often reads three neighboring files first.

Quick one-offs in a shell pane. "Show me a Dockerfile for a Python 3.12 FastAPI app with poetry." Codex produces usable output fast and without ceremony.
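For context, roughly the shape of output that prompt yields. Treat it as a starting point rather than a vetted image; it assumes `uvicorn` is declared as a project dependency and an `app.main:app` entrypoint, both placeholders:

```dockerfile
FROM python:3.12-slim

# Install poetry and keep it from creating a nested virtualenv inside the container
ENV POETRY_VIRTUALENVS_CREATE=false
RUN pip install --no-cache-dir poetry

WORKDIR /app

# Install dependencies first so Docker layer caching survives code changes
COPY pyproject.toml poetry.lock ./
RUN poetry install --only main --no-root

COPY . .
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```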

If your week is mostly well-specced implementation, Codex is probably your best daily driver. For the role-based grid pattern, see agentic coding setup.

Where Codex loses to Claude

Judgment calls. "Should we migrate from X to Y?" Claude will argue both sides; Codex will pick one. Sometimes that's fine. For real architecture, Claude is better.

Large codebases. Codex's context management on very big repos is less graceful. It'll work, but it'll miss things Claude catches by reading more supporting files.

Ambiguous prompts. Codex's willingness to guess is a liability when you don't know exactly what you want. You'll get confident wrong code.

Refactors across many files. Codex handles one-file refactors well. Cross-file refactors with semantic coupling are where Claude's more thorough exploration pays off.

The "confident wrong" problem

This is Codex's signature failure mode. A vague prompt produces confident code that looks right, compiles, even runs, but doesn't do what you meant.

Two mitigations that work:

Tight prompts. "Add input validation" is vague. "Add input validation to createUser: reject if email is missing, if password is < 8 chars, or if name contains digits. Throw ValidationError with the specific reason." Codex does this well when the edges are sharp.

Diff review. Always review the diff before accepting. Codex's wrong code is harder to spot in chat than in a git diff. I run every Codex change through a diff pane before merging.
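The diff-review habit is mechanical enough to sketch. A minimal version, where the second file write stands in for a Codex edit and the filenames are placeholders:

```shell
set -e
# Throwaway repo to demo the habit; in practice this is your project checkout.
cd "$(mktemp -d)" && git init -q .
printf 'export const parseDate = () => null;\n' > utils.ts
git add . && git -c user.name=t -c user.email=t@t commit -qm base

# ...Codex edits the file (simulated here)...
printf 'export const parseDate = (s: string) => new Date(s);\n' > utils.ts

# The habit: never accept on vibes. Read the diff first.
git diff --stat   # one-line shape check: which files, how much churn
git diff          # line-by-line review before staging or merging
```

The `--stat` pass catches collateral edits to files you never mentioned; the full diff catches the confident-wrong logic.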

Tool use: fast and occasionally too eager

Codex's tool loop is impressive. It runs commands in tight succession, reacts to output, keeps going. On paper that's ideal. In practice it has one annoying habit: it'll run destructive-ish commands without asking more readily than Claude.

Examples I've seen:

  • rm on files it thought were intermediate artifacts.
  • git reset --hard to "start over" after a failed attempt.
  • npm install on a branch I hadn't wanted to pollute.

Mitigation: set an allowlist of commands or run in a worktree so the blast radius is contained. The background AI agents post covers the sandbox patterns.
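The worktree version of that containment is a few commands. A sketch, with a temp directory standing in for your real checkout and the branch name as a placeholder:

```shell
set -e
# Demo repo standing in for your real checkout (normally you're already in one).
repo=$(mktemp -d) && cd "$repo"
git init -q . && git -c user.name=t -c user.email=t@t commit -q --allow-empty -m init

# Give Codex its own worktree; a stray `rm` or `git reset --hard` stays contained.
git worktree add -q "$repo-codex-work" -b codex/scratch
# ...point the Codex pane at $repo-codex-work and let it run...

# Session went sideways? Discard it wholesale; the main checkout is untouched.
git worktree remove --force "$repo-codex-work"
git branch -q -D codex/scratch
```

If the session goes well instead, you diff-review in the worktree and merge `codex/scratch` back rather than removing it.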

The AGENTS.md question

Codex uses an AGENTS.md file (or .codex/config, depending on version) where Claude uses CLAUDE.md. Functionally similar. A few differences in practice:

  • Codex's config format is more explicit about tools and permissions.
  • Codex respects AGENTS.md strictly; if you say "don't touch auth/", it doesn't.
  • Codex's context loading is simpler — it reads the file at session start, doesn't re-read per subdirectory by default.

Write a good AGENTS.md. Same rules as CLAUDE.md: stack overview, style guardrails, commands, things to never touch. For a shared file that both CLIs can read, I sometimes symlink AGENTS.md -> CLAUDE.md and it works fine with minor edits.
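A minimal skeleton following those rules. Every concrete detail below (stack, commands, paths) is a placeholder for your project:

```markdown
# AGENTS.md

## Stack
TypeScript + Node 20, Fastify, Postgres. pnpm for everything.

## Commands
- Test: `pnpm test`
- Typecheck: `pnpm typecheck`
- Lint: `pnpm lint`

## Style
- Named exports only; no default exports.
- Errors are thrown, never returned.

## Never touch
- `auth/` (ask first)
- Anything under `src/generated/`
```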

A workflow that uses Codex well

My Codex pattern:

  1. I'm in the Claude (Driver) pane, planning a feature.
  2. I write a spec — two sentences, acceptance criteria — and paste it into the Codex pane.
  3. Codex runs with it. Implements, iterates, tests.
  4. I glance at the Codex pane every couple of minutes. It rarely needs me.
  5. When it's done, I diff-review and merge.

In this pattern, Codex is the implementer and Claude is the planner. They play different roles, and the grid makes the handoff frictionless. See the parallel AI agents use case for the full layout.

Cost comparison

Rough numbers, per active hour in the implementer role:

| Model | Cost per hour (directional) | Speed | Quality on specced tasks |
| --- | --- | --- | --- |
| Claude Opus | High | Medium | Excellent |
| Claude Sonnet | Medium-low | Medium-fast | Good |
| Codex (GPT-5) | Medium | Fast | Good |
| Codex (GPT-4 class) | Low | Fast | Decent |

For implementer work, Codex and Sonnet sit in a similar price-performance band, with Codex biased toward speed and Sonnet biased toward quality. I mix them: Sonnet for one pane, Codex for another. For the cost playbook see cutting your AI coding bill in half.

When not to use Codex

Some jobs I've learned to hand back to Claude:

  • First-time exploration of a new codebase (Claude's reconnaissance is better).
  • Architecture decisions (Codex commits too fast).
  • Anything touching auth, payments, or data privacy (I want Claude's caution).
  • Cross-cutting refactors (more than three files).

For everything else, Codex is competitive or better, especially on speed.

Setup in a grid

In SpaceSpider I keep a permanent Codex pane in every space. The setup:

  • One pane assigned to Codex CLI.
  • Worktree-isolated (../myapp-codex-work).
  • AGENTS.md committed at the worktree root.
  • Allowlist in the Codex config limiting destructive commands.
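What that allowlist setup might look like. I'm sketching from memory and the key names change between Codex versions, so treat every name here as an assumption and check the config reference for your build:

```toml
# ~/.codex/config.toml -- illustrative only; verify keys against your Codex version
model = "gpt-5"
approval_policy = "untrusted"     # prompt before commands that aren't pre-approved
sandbox_mode = "workspace-write"  # writes confined to the current worktree
```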

See getting started for how spaces and per-pane CLIs work. Once configured, the pane persists across sessions and I can pick up where I left off.

Codex-specific tips

Small habits that make Codex work better:

  • Give it the test command in the first message. It'll use it for verification instead of guessing.
  • Be explicit about file boundaries. "Modify only src/foo.ts" saves a lot of collateral editing.
  • Ask for a plan first on anything non-trivial. Codex is happy to plan when asked; it just won't volunteer.
  • Use its shell-first instincts. "Use ripgrep to find X" works better than "find X in the code."

None of these are Codex-exclusive, but they're especially useful because Codex's defaults skew toward "do it now."

Key takeaways

Codex CLI isn't a Claude substitute. It's a fast, shell-native implementer that earns its pane in a parallel grid when paired with Claude as the planner. Its strengths — speed, tool-use tightness, directness — are exactly the weaknesses of Opus-heavy workflows.

Use Codex for specced implementation, shell-heavy workflows, and quick one-offs. Leave Claude for judgment, exploration, and anything ambiguous. The two together are genuinely more than either alone, and the grid is how you make them coexist without friction.
