Running Background AI Agents Without Losing Your Mind
A practical tutorial for running background AI agents safely: sandboxing, timeouts, cost caps, and the supervision patterns that actually work.
April 11, 2026 · 7 min read
"Background agent" is one of those phrases that sounds magical until you leave one unsupervised for an hour and find it's committed 47 times, force-pushed to main, and opened three pull requests titled "fix stuff." The magic is real, the disasters are also real, and most people overcorrect in one direction or the other.
This tutorial is about the middle path: running agents in the background where they save you time, keeping them from destroying your repo, and knowing when to pull them to the foreground. The guardrails are cheap to set up and the benefits compound.
What "background" actually means
Background doesn't mean autonomous. It means "running in a pane I'm not currently focused on, with clear boundaries on what it can do." The agent still needs supervision; the supervision is just cheaper.
In practice, a background agent:
- Runs in a pane you can see in your grid.
- Has a narrow, well-specced task.
- Has strict limits on what tools it can use.
- Has a time budget or iteration cap.
- Gets checked on every few minutes, not every few seconds.
If you're imagining "agent runs all night, wakes me up when done," that's a different beast and it requires much stronger guardrails. Start with the more modest version.
The sandbox question
The first question is isolation. A background agent with full filesystem access is a bad idea. A background agent restricted to a single working directory is a reasonable idea.
Three layers of isolation I use:
- Worktree isolation: the agent works in a `git worktree`, not the main checkout. Worst case it trashes the worktree, not your main branch.
- Permission scoping: the agent has an explicit allowlist of commands it can run. No `rm -rf`, no `git push`, no `npm publish`.
- Directory restriction: the agent's CLI config limits file operations to the worktree path.
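The worktree pattern can be sketched in a few commands. The repo path, worktree path, and branch name below are illustrative; a scratch repo stands in for your real project:

```shell
set -e
# Scratch repo standing in for your project (paths are illustrative).
mkdir -p /tmp/myapp-demo && cd /tmp/myapp-demo
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# Give the agent its own disposable worktree on a scratch branch.
git worktree add -q ../myapp-agent -b agent/scratch

# The agent runs inside ../myapp-agent; if it trashes things,
# you throw the whole sandbox away without touching main:
git worktree remove --force ../myapp-agent
git branch -q -D agent/scratch
echo "sandbox cycle complete"
```

Launching the agent's CLI from inside the worktree also makes directory restriction mostly automatic, since relative paths resolve inside the sandbox.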
Claude Code has built-in permission prompting for this. Codex supports allowlists. Qwen is more permissive and needs a tighter sandbox at the system level. See Claude vs Codex vs Qwen for the per-CLI behavior.
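For Claude Code specifically, project-level permission rules can live in `.claude/settings.json`. A minimal deny-list sketch; treat the exact rule syntax as an assumption to verify against the current Claude Code permissions docs:

```json
{
  "permissions": {
    "deny": [
      "Bash(git push:*)",
      "Bash(rm:*)",
      "Bash(npm publish:*)"
    ]
  }
}
```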
Time and iteration budgets
An agent in a tight loop will happily iterate forever. You need a budget. There are two kinds:
Wall-clock budgets: "run for at most 10 minutes, then stop and ask." This is enforced by you, looking at the clock.
Iteration budgets: "edit the file, run the test, try again — but only three times." This is enforced by prompting: "If the tests fail three times in a row, stop and summarize the issue."
Both are cheap to communicate. Add them to the initial prompt for every background task:
Task: Add input validation to `user.ts`. Write tests first. If tests fail three times, stop and report. Do not commit, do not push.
The "do not commit, do not push" is load-bearing. Most agents want to be helpful; they'll commit if you don't stop them.
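If you'd rather enforce the wall-clock budget mechanically than by glancing at the clock, coreutils `timeout` gives a hard stop. In this sketch `sleep 600` is a placeholder for a long-running agent process; there is no real agent CLI here:

```shell
# `timeout` kills the child when the budget expires (exit code 124).
# `sleep 600` is a placeholder for a long-running agent process.
if timeout 2s sleep 600; then
  echo "agent finished inside budget"
else
  echo "agent stopped: wall-clock budget exhausted"
fi
```

This works for one-shot invocations; for interactive sessions, the prompt-level budget remains the practical mechanism.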
Cost caps
Background agents cost money even when idle — they're holding context, they're running tool calls, they're burning tokens. A cost cap is a sanity constraint.
Per-task cost caps I use:
- Trivial task (one function): fraction of a dollar.
- Small task (module-scale): small dollar amount.
- Medium task (multi-file feature): a few dollars.
- Large task (refactor across package): ten dollars or more — and probably not a background task at all.
If an agent blows through its budget, the correct response is to pull it to the foreground and figure out why. Not to raise the budget. For the full cost playbook, see cutting your AI coding bill in half.
Supervision cadence
Background doesn't mean unattended. My cadence:
- Every 2-3 minutes during active parallel work, I scan the grid.
- Every 5-10 minutes I check any pane I've been ignoring.
- When a pane has a question, I answer within a minute or two.
- When a pane has been silent for a long time, I check for stuck tool calls.
A grid terminal with activity indicators makes this nearly free. Without one, the supervision cadence collapses because you can't tell at a glance which pane needs you. See why grid terminals beat tabs.
What to run in the background
Good background tasks share features:
- Narrow scope (one function, one module, one test file).
- Clear acceptance criteria (tests pass, output matches).
- No ambiguity (you'd write the same code if you had the time).
- Low blast radius (trashing the worktree is cheap).
Bad background tasks:
- Architectural decisions.
- Open-ended refactors.
- Anything touching secrets or auth.
- Anything you don't personally know how to verify.
If a task requires judgment, it belongs in the foreground. If a task is mechanical, it parallelizes beautifully. The parallel AI agents use case has examples.
Prompts designed for background work
Background prompts are different from foreground prompts. They're more explicit about:
- What to do when stuck ("stop and ask; do not guess").
- What to never touch ("do not modify `src/auth/*`").
- What success looks like ("`npm test` exits 0").
- When to give up ("after three failed attempts, stop").
A template I reuse:
Background task.
Scope: <one sentence>.
Success: <exact command or behavior>.
Do not: commit, push, modify files outside <dir>, touch auth code.
Stop condition: success, three failures, or 10 minutes — whichever first.
On stop, summarize what you did, what you tried, and what's left.
This template is worth memorizing. Every background session starts with some version of it.
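A small shell helper keeps the template one keystroke away. The function name is hypothetical; it just prints a filled-in preamble to paste into the agent's pane:

```shell
# Hypothetical helper: fills in the background-task template.
bg_prompt() {
  local scope="$1" success="$2" dir="$3"
  cat <<EOF
Background task.
Scope: ${scope}.
Success: ${success}.
Do not: commit, push, modify files outside ${dir}, touch auth code.
Stop condition: success, three failures, or 10 minutes — whichever first.
On stop, summarize what you did, what you tried, and what's left.
EOF
}

bg_prompt "add input validation to user.ts" "npm test exits 0" "src/"
```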
Recovery when things go wrong
They will go wrong. A disciplined recovery playbook:
- Interrupt the agent immediately when you notice drift.
- Inspect what it did: `git status`, `git diff`, read the chat.
- Revert cleanly: `git reset --hard`, `git clean -fd` in the worktree if needed.
- Diagnose the prompt problem: was the spec unclear, were the guardrails missing, did the agent hit an edge case?
- Fix the prompt, not the agent, and retry.
Most disasters come from under-specified prompts, not model failure. The model did what you asked; you just asked for the wrong thing. The fix is in the prompt.
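The inspect-and-revert steps can be rehearsed in a scratch repo. Paths and filenames below are illustrative; the point is the `git reset --hard` plus `git clean -fd` pair inside the worktree:

```shell
set -e
# Scratch repo standing in for an agent worktree (illustrative paths).
mkdir -p /tmp/agent-recovery && cd /tmp/agent-recovery
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "baseline"

# Simulate agent drift: a staged stray file plus untracked junk.
echo "oops" > stray.txt && git add stray.txt
echo "junk" > debris.tmp

git status --short          # 1. inspect what the agent did

git reset --hard -q         # 2. drop staged and tracked changes
git clean -fdq              #    and sweep untracked files

[ -z "$(git status --short)" ] && echo "worktree back to baseline"
```

Note that `git clean -fd` deletes untracked files permanently, which is exactly why it belongs in a disposable worktree and not your main checkout.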
A concrete background task
An example of a task I'd run in the background:
Task: In the worktree at `../myapp-tests`, write unit tests for `src/lib/cli.ts`. Use vitest. Target 80% branch coverage on the exported functions. Do not modify source files. Do not commit. Run `pnpm test cli.ts` to verify each test. Stop after all exported functions have tests or after 15 minutes.
I can leave this running, check on it every few minutes, and it'll either succeed or stop cleanly with a summary. If it gets confused, it'll ask a question in the pane. That's the ideal background agent.
Comparison of background-friendliness across CLIs
| CLI | Background discipline | Notes |
|---|---|---|
| Claude Code | Strong | Good permission prompts, respects boundaries |
| Codex CLI | Medium | Fast, but happy to commit when not explicitly forbidden |
| Qwen Code | Weaker | Needs explicit system-level sandboxing |
| Aider | Strong | Commit-oriented, easy to review |
| Plain shell | N/A | Not an agent |
Claude is my default for background work specifically because it's the least likely to take unprompted destructive actions. It'll ask rather than assume. See the Claude Code integration docs for the permission setup.
Key takeaways
Background AI agents are a force multiplier when they're sandboxed, time-budgeted, cost-capped, and running narrow tasks. They're a liability when they're unsupervised, full-access, open-ended, or running work you can't verify.
The mental model is not "autonomous robot." It's "junior engineer on a defined ticket who'll ping you when stuck." Treat them that way, and background agents save real time. Treat them as autonomous, and you'll spend more time cleaning up than you saved.
FAQ
How many background agents can I run at once? I cap at three background + one foreground. Beyond that, supervision breaks down and you start missing agent questions.
Should I use a Dockerized sandbox? For most workflows, worktrees plus CLI-level permissions are enough. Docker makes sense for very high-risk work or compliance needs; most people don't need it.
Can I run a background agent overnight? I don't recommend it unless the task is fully specced and sandboxed with a hard cost cap. "Ten hours of autonomous agent" is where most of the horror stories come from.
Keep reading
- From Cursor to a Terminal Grid: A Migration Story. An honest migration story from Cursor to a terminal grid of AI CLIs: what I missed, what I gained, and why I didn't switch back.
- The Developer Productivity Stack for an AI-First Team. A practical productivity stack for AI-first teams: shared spaces, CLI conventions, review loops, and team-level habits that compound across developers.
- AI Pair Programming in 2026: Past the Hype. AI pair programming is past the hype phase and into the workflow phase. What actually works in 2026, what's overrated, and how senior devs are using it.
- OpenAI Codex CLI in the Real World: What Actually Works. A deep dive on OpenAI Codex CLI in real workflows: where it beats Claude, where it fails, and the patterns that let it earn a permanent pane.
- 10 Claude Code Power Tips You Haven't Seen on Twitter. Ten practical Claude Code tips beyond the basics: session surgery, skill composition, CLAUDE.md patterns, and parallel tricks that actually ship code faster.
- Multi-Model Code Review: Claude, GPT, and Qwen in One Grid. A step-by-step tutorial for multi-model code review with Claude, GPT/Codex, and Qwen running in parallel panes. Catch bugs none of them would catch alone.