Safety Design in the Age of Agents — From Autonomous Execution Engines to Controllable Decision Systems

Introduction

In recent years, AI companies such as Anthropic have significantly advanced agent technologies.

AI is no longer:

  • something that answers

but has become:

👉 something that acts (Agent)

  • Calling tools
  • Accessing external data
  • Executing code
  • Autonomously progressing tasks

This evolution has brought AI into the realm of practical deployment.

At the same time, it introduces a class of problems that was previously far less visible:

👉 How do we safely control agents?


In traditional AI systems, evaluation focused on:

  • Output accuracy
  • Naturalness of responses

But in the age of agents, the problem shifts:

👉 Not what to output, but what to execute


The moment AI begins to act,
it becomes:

👉 a safety-critical system


Problem: Why Agents Do Not Stop

As agent adoption increases, the following issues are emerging in real-world environments:

  • Executing unnecessary operations
  • Proceeding without required confirmation
  • Producing inconsistent decisions for the same input
  • Being unable to explain why a specific action was taken

At first glance, these may appear to be:

  • Accuracy issues
  • Model limitations
  • Implementation bugs

However, this is not the case.


👉 This is a safety design problem


More precisely:

👉 The structure for stopping and decision-making is not defined


In traditional AI:

  • Output quality
  • Response correctness

were the main concerns.

But agents are different.


👉 The moment AI starts acting, it must be designed as a controlled system


Core Insight: Signal ≠ Decision

Agents like those from Anthropic operate on:

  • Search results
  • Inference outputs
  • Generated text
  • Action candidates

These are:

👉 Signals (inputs to judgment)


However, what real-world systems require is:

👉 Decision (final commitment to action)
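
To make the distinction concrete, here is a minimal sketch in Python. The type names and fields are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Raw material for judgment: ranked candidates, no commitment."""
    candidates: list[str]   # e.g. ["close_ticket", "escalate", "reply_faq"]
    scores: list[float]     # continuous model confidence per candidate

@dataclass
class Decision:
    """A single committed action, selected by structure outside the model."""
    action: str             # exactly one action the system commits to
    authority: str          # the rule or human that committed to it
```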


Why This Becomes a Safety Problem

The core issue is this:


Agents are very good at generating Signals.

But:

👉 They do not take responsibility for selecting Decisions


As a result:

  • What gets executed
  • Where to stop
  • When to escalate to humans

all become ambiguous.


👉 There is no structure for safe stopping


This is why:

👉 Agents do not stop


Internal Structure of Agents

A typical agent operates in a loop:

State → Reason → Action → State → Reason → Action …

This loop is powerful:

  • Observe environment
  • Infer next action
  • Execute
  • Update state
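
As a sketch, the loop looks like this in Python. The helpers reason, execute, and observe are hypothetical stand-ins for whatever the framework provides:

```python
def reason(state):           # placeholder: the model proposes the next action
    return {"action": "noop", "state": state}

def execute(signal):         # placeholder: the proposal is acted on immediately
    return signal

def observe(result):         # placeholder: read the environment back
    return result

def agent_loop(state, max_steps=10):
    # max_steps exists only so this sketch terminates; the real loop is
    # bounded by nothing except the model's willingness to keep proposing.
    for _ in range(max_steps):
        signal = reason(state)     # Reason: propose an action (a Signal)
        result = execute(signal)   # Action: executed with no gate in between
        state = observe(result)    # State: update and go around again
    return state
```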

However, from a safety perspective, it has a critical omission:

There is no explicit:

  • Decision selection
  • Stopping condition (Boundary)
  • Human delegation

👉 Actions are generated, but not validated


This leads to:

  • Continuous action generation
  • Implicit stopping (not guaranteed)
  • Uncertain human escalation

👉 Therefore, agents do not stop


Wrong Approach

Many systems attempt to address this with prompts such as:

  • “Stop if dangerous”
  • “Confirm before execution”
  • “Ask a human if uncertain”

This appears safe—but it is not.


Because agents are not:

👉 systems that enforce strict conditions

They are:

👉 systems that generate the most plausible next action


Instructions are interpreted as:

  • vague guidelines
  • context-dependent suggestions

As a result:

👉 behavior varies by situation

👉 safety is not structurally guaranteed


Concrete Failures

Example 1: Auto-closing customer inquiries

Even with:

“Ask a human if uncertain”

Agents may:

  • generate a plausible FAQ response
  • mark the issue as resolved
  • skip escalation

Example 2: Over-modifying code

Even with:

“Avoid risky changes”

Agents may:

  • modify multiple files
  • update configurations
  • rewrite tests

Example 3: Executing unintended operations

Agents may:

  • delete records
  • send emails
  • update systems

because:

👉 “importance” is not structurally defined


Core Issue

All these share one root cause:

👉 Stopping conditions are not structurally defined


Prompts like:

  • “Be careful”
  • “Confirm”

are:

👉 policies, not rules


Agents interpret them as:

👉 ambiguous context


Fundamental Limitation

This is not an implementation issue.

👉 It is a fundamental limitation


LLMs are:

👉 systems that generate the most probable next output


They do not:

  • enforce rules
  • evaluate conditions strictly

They only produce:

👉 continuous sequences of tokens

as described in “Judgment Cannot Be Expressed by Smooth Computation — The Problem of Discontinuity in AI”.


Continuous vs Discrete

This creates a fundamental gap:

  • LLM → continuous generation
  • Decision → discrete commitment

👉 A continuous generative process cannot, by itself, produce a discrete commitment
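
A toy illustration of the gap, with made-up numbers. The distribution itself never says “stop”; the commitment is a step function imposed from outside:

```python
# The model's output is a continuous distribution over candidates.
# Turning it into a discrete commitment requires a rule that lives
# outside the distribution itself. All numbers are made up.
scores = {"delete_record": 0.48, "send_email": 0.32, "ask_human": 0.20}

THRESHOLD = 0.8  # an external policy choice, not derived from the model

top_action, top_score = max(scores.items(), key=lambda kv: kv[1])
if top_score >= THRESHOLD:
    decision = top_action           # discrete commit
else:
    decision = "escalate_to_human"  # discrete refusal to commit
print(decision)                     # -> escalate_to_human
```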


Why Decision Cannot Be Embedded

A Decision requires:

  • evaluation
  • selection
  • responsibility

But LLM outputs are:

👉 generated candidates

not:

👉 committed decisions


Therefore:

  • No entity determines “must stop”
  • No guarantee exists

👉 Prompt-based safety cannot work


Solution: Boundary, Not Prompt

Instead of saying:

“Please stop if dangerous”

We must define:

  • Always stop under condition X
  • Always require human approval under condition Y
  • Always redirect under condition Z

👉 Stopping conditions must exist as structure
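
A minimal sketch of what “as structure” can mean in practice. The action fields, conditions, and thresholds are illustrative assumptions:

```python
# A Boundary as data, not prose: conditions X, Y, Z become explicit
# predicates that are evaluated the same way every time.
BOUNDARY_RULES = [
    # (predicate over the proposed action, verdict)
    (lambda a: a.get("type") == "delete" and a.get("scope") == "production", "STOP"),
    (lambda a: a.get("type") == "send_email" and a.get("recipients", 0) > 100, "HUMAN"),
    (lambda a: a.get("type") == "code_change" and a.get("files", 0) > 5, "REDIRECT"),
]

def boundary(action: dict) -> str:
    """Return the first matching verdict; the default is EXECUTE."""
    for predicate, verdict in BOUNDARY_RULES:
        if predicate(action):
            return verdict
    return "EXECUTE"

# The same input always yields the same verdict -- unlike a prompt.
print(boundary({"type": "delete", "scope": "production"}))  # -> STOP
```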


Now:

👉 Agents become controllable


Reframing the Problem

The problem is simple:

👉 Agents do not stop


But this is not behavioral.

👉 It is structural.


The Right Question

Not:

👉 How do we stop agents?

But:

👉 Where do we define decisions?


Solution: Decision Trace Model (DTM)

DTM defines decisions as structure:

Event → Signal → Decision → Boundary → Human → Log

This clarifies:

  • What happened
  • How it was interpreted
  • What was selected
  • What constraints applied
  • Whether humans intervened
  • How it was recorded
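
Each stage can be captured in a single trace record. A minimal sketch, with illustrative field names:

```python
import json, time

# One DTM record covering Event -> Signal -> Decision -> Boundary -> Human -> Log.
trace = {
    "event":    {"id": "evt-001", "kind": "customer_inquiry"},    # what happened
    "signal":   {"candidates": ["reply_faq", "escalate"],         # how it was interpreted
                 "scores": [0.71, 0.29]},
    "decision": {"action": "escalate"},                           # what was selected
    "boundary": {"rule": "low_confidence", "verdict": "HUMAN"},   # what constraint applied
    "human":    {"required": True, "approved_by": None},          # whether humans intervened
    "logged_at": time.time(),                                     # how it was recorded
}
print(json.dumps(trace, indent=2))
```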

Key Insight

👉 Decision must pass through Boundary


Boundary determines:

  • Execute
  • Stop
  • Escalate

Redefining “Stopping”

👉 Stopping = not letting the Decision pass the Boundary


Agents continue generating Signals.

But the Boundary can respond with:

  • Reject (Hard Stop)
  • Hold (Soft Stop)
  • Redirect

👉 No execution occurs


Three Types of Stops

1. Hard Stop

Reject the Decision outright; it is never executed.

2. Soft Stop

Hold the Decision and delegate it to a human.

3. Redirect

Route the Decision to a safer alternative path.


Why This Matters

Traditional design:

  • controls behavior
  • relies on prompts

DTM:

👉 controls Decisions


Roles

  • Agent → generates Signals
  • DTM → selects Decisions

👉 Agent does not stop
👉 Decision is stopped


Best Practice Architecture

Layer 1: Capability Control

Restrict what is possible

Layer 2: Decision / Boundary

Control what is allowed

Layer 3: Human-in-the-loop

Restore responsibility

Layer 4: Logging

Enable replay and improvement
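
A compressed sketch of the four layers wired together. Every name is illustrative, and boundary() stands in for a real rule table:

```python
ALLOWED_TOOLS = {"reply_faq", "escalate", "search_docs"}  # Layer 1: what is possible

def boundary(action: dict) -> str:                        # Layer 2: what is allowed
    # Placeholder for the rule table sketched earlier.
    return "HUMAN" if action.get("risk", 0) > 0.5 else "EXECUTE"

def handle(action: dict, log: list) -> str:
    if action.get("type") not in ALLOWED_TOOLS:           # Layer 1: capability control
        verdict = "STOP"
    else:
        verdict = boundary(action)                        # Layer 2: decision / boundary
    if verdict == "HUMAN":                                # Layer 3: human-in-the-loop
        print("queued for human review:", action)
    log.append({"action": action, "verdict": verdict})    # Layer 4: logging
    return verdict

audit_log: list = []
print(handle({"type": "reply_faq", "risk": 0.2}, audit_log))  # -> EXECUTE
```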


Architecture

Event
  ↓
Agent (Signal)
  ↓
Decision (DTM)
  ↓
Boundary
  ├ STOP
  ├ HUMAN
  ├ REDIRECT
  └ EXECUTE
       ↓
   Execution
       ↓
      Log

Why DTM Works

DTM:

  • externalizes decisions
  • defines stopping conditions
  • structures human intervention
  • records everything

👉 This is a paradigm shift


Before / After

Before

  • Autonomous
  • Unstoppable
  • Non-reproducible
  • No accountability

After

  • Controlled per decision
  • Stoppable at boundary
  • Human escalation defined
  • Fully traceable

Conclusion

Anthropic’s agent technology has transformed AI into:

👉 something that acts


But that is not enough.


👉 We must define where Decisions are stopped


Final Statement

👉 Agents generate Signals.
👉 Decisions change the world.


Summary

The best-in-class design for controlling signal-based agents is:

👉 a multi-layered architecture centered on DTM


And the essence is simple:

👉 Agents do not stop.
👉 Decisions stop them.

Ultimately, as agents become autonomous actors, we must design them with the same rigor as safety-critical systems in industries such as automotive manufacturing.

Five-nines reliability is no longer optional; it is a requirement.
