Running Background AI Agents Without Losing Your Mind

A practical tutorial for running background AI agents safely: sandboxing, timeouts, cost caps, and the supervision patterns that actually work.

April 11, 2026 · 7 min read

"Background agent" is one of those phrases that sounds magical until you leave one unsupervised for an hour and find it's committed 47 times, force-pushed to main, and opened three pull requests titled "fix stuff." The magic is real, the disasters are also real, and most people overcorrect in one direction or the other.

This tutorial is about the middle path: running agents in the background where they save you time, keeping them from destroying your repo, and knowing when to pull them to the foreground. The guardrails are cheap to set up and the benefits compound.

What "background" actually means

Background doesn't mean autonomous. It means "running in a pane I'm not currently focused on, with clear boundaries on what it can do." The agent still needs supervision; the supervision is just cheaper.

In practice, a background agent:

  • Runs in a pane you can see in your grid.
  • Has a narrow, well-specced task.
  • Has strict limits on what tools it can use.
  • Has a time budget or iteration cap.
  • Gets checked on every few minutes, not every few seconds.

If you're imagining "agent runs all night, wakes me up when done," that's a different beast and it requires much stronger guardrails. Start with the more modest version.

The sandbox question

The first question is isolation. A background agent with full filesystem access is a bad idea. A background agent restricted to a single working directory is a reasonable idea.

Three layers of isolation I use:

  1. Worktree isolation: the agent works in a git worktree, not the main checkout. Worst case it trashes the worktree, not your main branch.
  2. Permission scoping: the agent has an explicit allowlist of commands it can run. No rm -rf, no git push, no npm publish.
  3. Directory restriction: the agent's CLI config limits file operations to the worktree path.
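The worktree step is scriptable. A minimal sketch of building that command (the paths and branch name are examples, not a convention any tool requires):

```python
def sandbox_worktree_cmd(repo: str, path: str, branch: str) -> list[str]:
    """Build the git command that creates an isolated worktree for an agent."""
    # -b creates a fresh branch, so even if a commit slips through it never
    # lands on an existing branch; worst case, delete the whole worktree.
    return ["git", "-C", repo, "worktree", "add", "-b", branch, path]

# usage (hypothetical paths), e.g. via subprocess.run(..., check=True):
# sandbox_worktree_cmd(".", "../myapp-agent", "agent/task-123")
```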

Claude Code has built-in permission prompting for this. Codex supports allowlists. Qwen is more permissive and needs a tighter sandbox at the system level. See Claude vs Codex vs Qwen for the per-CLI behavior.
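The allowlist logic itself is simple, and worth understanding even though the real enforcement belongs in the CLI's permission config. A sketch, default-deny with an explicit allowlist (the specific commands listed are examples):

```python
# Example lists -- tune these per project; real enforcement lives in the CLI.
DENIED_PREFIXES = ("rm -rf", "git push", "npm publish")
ALLOWED_PREFIXES = ("git status", "git diff", "pnpm test", "ls", "cat")

def is_permitted(command: str) -> bool:
    """Deny known-destructive commands first, then allow only what's listed."""
    if command.startswith(DENIED_PREFIXES):
        return False
    return command.startswith(ALLOWED_PREFIXES)
```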

Time and iteration budgets

An agent in a tight loop will happily iterate forever. You need a budget. There are two kinds:

Wall-clock budgets: "run for at most 10 minutes, then stop and ask." This is enforced by you, looking at the clock.

Iteration budgets: "edit the file, run the test, try again — but only three times." This is enforced by prompting: "If the tests fail three times in a row, stop and summarize the issue."

Both are cheap to communicate. Add them to the initial prompt for every background task:

Task: Add input validation to user.ts. Write tests first. If tests fail three times, stop and report. Do not commit, do not push.

The "do not commit, do not push" is load-bearing. Most agents want to be helpful; they'll commit if you don't stop them.
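An iteration budget can also be enforced outside the prompt, in whatever harness launches each attempt. A minimal sketch, where `run_attempt` is a stand-in for one edit-and-test cycle:

```python
from typing import Callable

def run_with_iteration_budget(run_attempt: Callable[[], bool],
                              max_failures: int = 3) -> str:
    """Call run_attempt until it succeeds or the failure budget is spent."""
    for attempt in range(1, max_failures + 1):
        if run_attempt():
            return f"success on attempt {attempt}"
    # Budget exhausted: stop and surface the problem instead of looping forever.
    return f"stopped after {max_failures} failed attempts; summarize and ask"
```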

Cost caps

Background agents cost money the whole time they're working: every tool call, every retry, and every chunk of held context burns tokens, whether or not you're watching. A cost cap is a sanity constraint.

Per-task cost caps I use:

  • Trivial task (one function): fraction of a dollar.
  • Small task (module-scale): small dollar amount.
  • Medium task (multi-file feature): a few dollars.
  • Large task (refactor across a package): ten dollars or more, and probably not a background task at all.

If an agent blows through its budget, the correct response is to pull it to the foreground and figure out why. Not to raise the budget. For the full cost playbook, see cutting your AI coding bill in half.
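The cap check is the easy part; the discipline is in what you do when it trips. A sketch of the check itself (the dollar figures are placeholders, not recommendations):

```python
# Example caps in USD -- placeholders matching the tiers above, not advice.
COST_CAPS = {"trivial": 0.50, "small": 2.00, "medium": 5.00, "large": 10.00}

def budget_status(task_size: str, spent_usd: float) -> str:
    """Compare spend against the cap; over budget means foreground, not a raise."""
    cap = COST_CAPS[task_size]
    if spent_usd >= cap:
        return "over budget: pull to foreground and diagnose"
    return f"ok: ${cap - spent_usd:.2f} remaining"
```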

Supervision cadence

Background doesn't mean unattended. My cadence:

  • Every 2-3 minutes during active parallel work, I scan the grid.
  • Every 5-10 minutes I check any pane I've been ignoring.
  • When a pane has a question, I answer within a minute or two.
  • When a pane has been silent for a long time, I check for stuck tool calls.

A grid terminal with activity indicators makes this nearly free. Without one, the supervision cadence collapses because you can't tell at a glance which pane needs you. See why grid terminals beat tabs.

What to run in the background

Good background tasks share features:

  • Narrow scope (one function, one module, one test file).
  • Clear acceptance criteria (tests pass, output matches).
  • No ambiguity (you'd write the same code if you had the time).
  • Low blast radius (trashing the worktree is cheap).

Bad background tasks:

  • Architectural decisions.
  • Open-ended refactors.
  • Anything touching secrets or auth.
  • Anything you don't personally know how to verify.

If a task requires judgment, it belongs in the foreground. If a task is mechanical, it parallelizes beautifully. The parallel AI agents use case has examples.

Prompts designed for background work

Background prompts are different from foreground prompts. They're more explicit about:

  • What to do when stuck ("stop and ask; do not guess").
  • What to never touch ("do not modify src/auth/*").
  • What success looks like ("npm test exits 0").
  • When to give up ("after three failed attempts, stop").

A template I reuse:

Background task.
Scope: <one sentence>.
Success: <exact command or behavior>.
Do not: commit, push, modify files outside <dir>, touch auth code.
Stop condition: success, three failures, or 10 minutes, whichever comes first.
On stop, summarize what you did, what you tried, and what's left.

This template is worth memorizing. Every background session starts with some version of it.
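If you'd rather not retype it, the template is trivial to mechanize so every session starts from the same skeleton. A sketch (the function and parameter names are mine, not any tool's API):

```python
def background_prompt(scope: str, success: str, sandbox_dir: str,
                      max_failures: int = 3, minutes: int = 10) -> str:
    """Render the background-task template with the task-specific blanks filled in."""
    return (
        "Background task.\n"
        f"Scope: {scope}.\n"
        f"Success: {success}.\n"
        f"Do not: commit, push, modify files outside {sandbox_dir}, touch auth code.\n"
        f"Stop condition: success, {max_failures} failures, or {minutes} minutes, "
        "whichever comes first.\n"
        "On stop, summarize what you did, what you tried, and what's left."
    )
```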

Recovery when things go wrong

They will go wrong. A disciplined recovery playbook:

  1. Interrupt the agent immediately when you notice drift.
  2. Inspect what it did: git status, git diff, read the chat.
  3. Revert cleanly: git reset --hard, git clean -fd in the worktree if needed.
  4. Diagnose the prompt problem: was the spec unclear, were the guardrails missing, did the agent hit an edge case?
  5. Fix the prompt, not the agent, and retry.

Most disasters come from under-specified prompts, not model failure. The model did what you asked; you just asked for the wrong thing. The fix is in the prompt.

A concrete background task

An example of a task I'd run in the background:

Task: In the worktree at ../myapp-tests, write unit tests for src/lib/cli.ts. Use vitest. Target 80% branch coverage on the exported functions. Do not modify source files. Do not commit. Run pnpm test cli.ts to verify each test. Stop after all exported functions have tests or after 15 minutes.

I can leave this running, check on it every few minutes, and it'll either succeed or stop cleanly with a summary. If it gets confused, it'll ask a question in the pane. That's the ideal background agent.

Comparison of background-friendliness across CLIs

CLI         | Background discipline | Notes
------------|-----------------------|------------------------------------------------------
Claude Code | Strong                | Good permission prompts, respects boundaries
Codex CLI   | Medium                | Fast, but happy to commit when not explicitly forbidden
Qwen Code   | Weaker                | Needs explicit system-level sandboxing
Aider       | Strong                | Commit-oriented, easy to review
Plain shell | N/A                   | Not an agent

Claude is my default for background work specifically because it's the least likely to take unprompted destructive actions. It'll ask rather than assume. See the Claude Code integration docs for the permission setup.

Key takeaways

Background AI agents are a force multiplier when they're sandboxed, time-budgeted, cost-capped, and running narrow tasks. They're a liability when they're unsupervised, full-access, open-ended, or running work you can't verify.

The mental model is not "autonomous robot." It's "junior engineer on a defined ticket who'll ping you when stuck." Treat them that way, and background agents save real time. Treat them as autonomous, and you'll spend more time cleaning up than you saved.

FAQ

How many background agents can I run at once? I cap at three background + one foreground. Beyond that, supervision breaks down and you start missing agent questions.

Should I use a Dockerized sandbox? For most workflows, worktrees plus CLI-level permissions are enough. Docker makes sense for very high-risk work or compliance needs; most people don't need it.

Can I run a background agent overnight? I don't recommend it unless the task is fully specced and sandboxed with a hard cost cap. "Ten hours of autonomous agent" is where most of the horror stories come from.
