
Your AI Safety Model Can't Be Popups

AI coding agents need boundaries, not just approval prompts. Frequent approvals cap autonomy, while broad access without boundaries creates risk.

Every serious AI coding workflow eventually teaches you two things about the little approve button.

First, the button is useful. If an agent is about to run a shell command, edit a file, install a package, or call a tool, I would rather see the question than not see it.

Second, the button is nowhere near enough.

Approval prompts fail in both directions.

They are too weak to be the whole safety model, because they ask a human to evaluate low-level actions without enough context about the environment. They are also too interruptive to enable real autonomy, because an agent that has to stop every minute for permission cannot run very far.

That is the worst of both worlds: broad access plus frequent babysitting.

I wrote recently that AI changed your pipeline, not just your editor, and that someone has to own the AI delivery system. This is the safety version of the same argument. If AI is becoming part of the path from ticket to production, AI safety has to live in that path too.

If your safety model is “the developer will click approve carefully,” you do not have an AI safety model. You have an optimistic UI. And if your autonomy model is “the agent will pause until the developer comes back,” you do not have much autonomy either.

Babysitting is not autonomy

Permission prompts feel rigorous at first.

The agent asks to run a command. You read the command. You approve it. The agent asks to edit a file. You skim the path. You approve it. The agent asks to install a package. You pause, maybe read a little more carefully, then approve it.

After the thirtieth prompt, the ritual changes.

You stop reading each one as a risk decision. You start treating the prompt as friction between you and the work finishing. The button becomes a speed bump, not a boundary.

There is a productivity failure hiding in that pattern too.

If every non-trivial command needs approval, the agent cannot build momentum. It cannot inspect a failure, adjust the code, rerun the test, inspect the next failure, and keep going while you do something else. You are still in the loop, but not at the judgment points. You are in the loop at the plumbing points.

That is not the goal.

Anthropic made this point directly in its write-up on Claude Code sandboxing. Its answer was not “make users click more carefully.” It was to put Claude inside explicit filesystem and network boundaries. Inside the boundary, the agent can move faster. Outside the boundary, the system has something real to enforce.

That is the important shift. The point is not to make agents ask less because we trust them more. The point is to make them ask less because the boundary is better designed.

Is npm test safe? Usually.

Is npm test safe in a repo where pretest runs an arbitrary script from an untrusted branch? Different question.
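
To make that concrete, here is a minimal sketch of an npm lifecycle hook. The package name and script path are made up, but the mechanism is standard: npm runs the pretest script automatically before test, so approving “npm test” also approves whatever pretest points at.

```json
{
  "name": "example-app",
  "scripts": {
    "pretest": "node ./scripts/fetch-fixtures.js",
    "test": "jest"
  }
}
```

On a trusted main branch that hook is harmless. On an untrusted branch, it is arbitrary code running under the same approval.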

The right answer depends on the boundary. A popup only shows the immediate action.

The workspace is the autonomy decision

When I built Hivemind, one of the boring decisions that mattered most was running agents in isolated Docker containers.

That was not because containers are magic. You can still give a container the wrong credentials, mount too much of the filesystem, or let it reach too much of the network.

The value was that the execution boundary became explicit.

What code does the agent see? Which files are mounted read-only? Which commands are available? Which commands are never allowed? Where do test artifacts go? What has to be copied out intentionally?

The point was to give the agent a smaller room. It could work on the repo and run tests without needing access to SSH keys, cloud credentials, browser sessions, sibling checkouts, or everything else that happens to live on a developer laptop.
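
The mechanics were specific to Hivemind, but the shape is easy to sketch. Everything below is illustrative: the image name, the mount layout, and the choice to cut the network entirely are assumptions, not a recommended configuration.

```bash
# Illustrative boundary for one agent task; names and flags are a sketch,
# not a hardened setup.
docker run --rm \
  --network none \
  --read-only \
  --mount type=bind,src="$PWD",dst=/workspace \
  --mount type=tmpfs,dst=/tmp \
  --workdir /workspace \
  --cap-drop ALL \
  --pids-limit 256 \
  agent-runner:latest \
  npm test
```

The exact flags matter less than what they encode: the agent sees one repo, gets scratch space for test artifacts, and reaches nothing else unless someone widens the boundary on purpose.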

That is the shape I want more teams to reach for.

Not necessarily Docker. Maybe it is a local sandbox. Maybe it is an ephemeral cloud workspace. Maybe it is a CI runner with sharply scoped permissions. The mechanism matters less than the principle: decide what the agent can reach before the task starts, not one popup at a time after it is already running.

Good boundaries let the agent move. Bad boundaries make you choose between dangerous autonomy and safe babysitting.

Credentials need a separate lane

For years, engineering teams have treated secrets management as mostly a CI, deployment, and runtime problem. Keep production credentials out of source control. Scope CI tokens. Rotate API keys. Limit who can deploy.

Agents create a new category: execution-time secrets for AI work.

GitHub’s May 2026 changelog for Copilot cloud agent secrets and variables makes this concrete. Copilot cloud agent now has dedicated “Agents” secrets and variables alongside Actions, Codespaces, and Dependabot scopes.

That is useful. It is also a signal.

Agent configuration is becoming a first-class delivery-system surface.

If an agent needs access to an internal package registry, a staging API, an MCP server, or a test data service, someone has to decide which credential it gets. That credential should not automatically be the developer’s credential. It should not automatically be the CI credential. It should not automatically be the deployment credential.

Human, CI, deployment, and agent credentials have different jobs.

Mixing them is how a convenience feature turns into a blast-radius problem.
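
A hedged sketch of what that separation can look like in practice. AGENT_REGISTRY_TOKEN is an illustrative name for a short-lived, read-only credential minted specifically for agent runs; the point is as much what the container does not receive.

```bash
# Illustrative: the agent run gets one narrowly scoped credential.
# It never sees ~/.ssh, ~/.aws, the developer's tokens, the CI token,
# or anything that can deploy.
docker run --rm \
  --env NPM_TOKEN="$AGENT_REGISTRY_TOKEN" \
  --mount type=bind,src="$PWD",dst=/workspace \
  --workdir /workspace \
  agent-runner:latest \
  npm ci
```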

Repo config is policy now

A lot of engineering teams still treat repo-local agent files as personal productivity notes.

CLAUDE.md. .cursorrules. .mcp.json. Custom instructions. Workflow prompts. Agent skills. Tool allowlists. CI actions that call models.

Some of these files are just documentation. Some are policy. Some can change execution.

That means code review needs to catch up.

Adversa’s TrustFall research made the risk concrete: repo-local agent config can affect what starts, what gets trusted, and what an agent is allowed to do. Treat the cross-tool details as security-research claims, but the operating lesson is stable: agent config belongs in review.

The same is true for MCP. A Stack Overflow post framed MCP as a way to give agents secure access to enterprise context: internal knowledge, APIs, documents, and workflow systems. That promise is real. Agents get more useful when they can see the right context.

But useful context is still access.

If a PR adds .mcp.json, someone should inspect the command. If it changes a workflow that interpolates issue text into an LLM prompt, someone should inspect the trust boundary. If it grants an agent write access, someone should ask why read-only is not enough.
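
The exact schema varies by client, but a project-scoped MCP config usually boils down to a few lines like the sketch below. Every name, URL, and variable here is made up; the point is how much trust those few lines carry.

```json
{
  "mcpServers": {
    "internal-docs": {
      "command": "npx",
      "args": ["-y", "@example/docs-mcp-server"],
      "env": {
        "DOCS_API_URL": "https://docs.internal.example.com",
        "DOCS_API_TOKEN": "${DOCS_API_TOKEN}"
      }
    }
  }
}
```

Reviewing that entry means answering three questions: what binary runs, what it can reach, and which credential it is handed.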

GitHub’s guidance on reviewing agent pull requests is useful here because it does not stop at the diff. It calls out prompt inputs, model output piped to shell, token scope, secret access, least-privilege workflow permissions, and human approval gates for production-touching work.

That is the right review frame.

Review the code. Also review the path that produced, checked, or executed the code.
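
As one hedged illustration of the least-privilege and trust-boundary points, here is a skeleton of an analysis-only workflow. It is not GitHub’s recommended setup, and the names are placeholders.

```yaml
# Illustrative skeleton, not an official recommendation.
name: agent-analysis

on:
  pull_request:

permissions:
  contents: read        # the job can read the repo
  pull-requests: write  # and post a review comment, nothing more

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Analysis steps go here. Untrusted text (issue bodies, PR
      # descriptions) is passed as data, never interpolated into shell
      # commands, and model output is posted as a comment rather than
      # executed.
```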

The button should be the last mile

The practical alternative to popup-driven safety is not a giant governance program.

It is a few boring defaults:

  • run agents in constrained workspaces
  • keep sensitive paths and unnecessary network access out of reach by default
  • define commands the agent can run without asking, and commands it can never run (see the sketch after this list)
  • give agents scoped credentials, not human or deployment credentials
  • treat MCP and repo-local agent config as reviewable policy
  • separate analysis from execution in CI
  • require human approval for production-touching actions
  • log enough to reconstruct what happened
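
For the command split specifically, the policy can be as small as a file like this. The format below is invented for illustration; most agent runtimes have their own equivalent of an allow / ask / never list.

```yaml
# Hypothetical policy format, not any specific tool's schema.
commands:
  allow_without_asking:
    - npm test
    - npm run lint
    - git status
    - git diff
  always_ask:
    - npm install
    - git push
  never:
    - ssh
    - curl
    - rm -rf
```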

None of that makes agents perfectly safe.

Good.

“Perfectly safe” is usually where useful conversations go to die.

The goal is to make the risk explicit, bounded, reviewable, and recoverable.

That is what mature engineering systems already do everywhere else. We do not make deploys safe by asking one tired engineer to click carefully. We use environments, permissions, feature flags, CI gates, rollback paths, logs, and ownership. We do not make database migrations safe by adding a scarier modal. We make the blast radius smaller and the recovery path clearer.

AI agents deserve the same treatment.

Permission prompts are still useful. I want the button to exist.

But the button should be the last-mile question, not the architecture. It should appear when the agent hits a boundary or needs human judgment, not every time it wants to do ordinary work inside an approved lane.

If the agent can act like infrastructure, it needs infrastructure-shaped controls.

Popups are not enough.