Safety Design in the Age of Agents — From Autonomous Execution Engines to Controllable Decision Systems

Introduction

In recent years, AI companies such as Anthropic have significantly advanced agent technologies.

AI is no longer:

  • something that answers

but has become:

👉 something that acts (Agent)

  • Calling tools
  • Accessing external data
  • Executing code
  • Autonomously progressing tasks

This evolution has brought AI into the realm of practical deployment.

At the same time, it introduces a class of problems that was previously far less visible:

👉 How do we safely control agents?


In traditional AI systems, evaluation focused on:

  • Output accuracy
  • Naturalness of responses

But in the age of agents, the problem shifts:

👉 Not what to output, but what to execute


The moment AI begins to act,
it becomes:

👉 a safety-critical system


Problem: Why Agents Do Not Stop

As agent adoption increases, the following issues are emerging in real-world environments:

  • Executing unnecessary operations
  • Proceeding without required confirmation
  • Producing inconsistent decisions for the same input
  • Being unable to explain why a specific action was taken

At first glance, these may appear to be:

  • Accuracy issues
  • Model limitations
  • Implementation bugs

However, this is not the case.


👉 This is a safety design problem


More precisely:

👉 The structure for stopping and decision-making is not defined


In traditional AI:

  • Output quality
  • Response correctness

were the main concerns.

But agents are different.


👉 The moment AI starts acting, it must be designed as a controlled system


Core Insight: Signal ≠ Decision

Agents like those from Anthropic operate on:

  • Search results
  • Inference outputs
  • Generated text
  • Action candidates

These are:

👉 Signals (inputs to judgment)


However, what real-world systems require is:

👉 Decision (final commitment to action)
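
To make the distinction concrete, here is a minimal sketch in Python. The type names and fields are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Raw material for judgment: ranked candidates, no commitment."""
    candidates: list[str]   # e.g. ["close_ticket", "escalate", "reply_faq"]
    scores: list[float]     # continuous model confidence per candidate

@dataclass
class Decision:
    """A single committed action, selected by structure outside the model."""
    action: str             # exactly one action the system commits to
    authority: str          # the rule or human that committed to it
```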


Why This Becomes a Safety Problem

The core issue is this:


Agents are very good at generating Signals.

But:

👉 They do not take responsibility for selecting Decisions


As a result:

  • What gets executed
  • Where to stop
  • When to escalate to humans

all become ambiguous.


👉 There is no structure for safe stopping


This is why:

👉 Agents do not stop


Internal Structure of Agents

A typical agent operates in a loop:

State → Reason → Action → State → Reason → Action …

This loop is powerful:

  • Observe environment
  • Infer next action
  • Execute
  • Update state
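
As a sketch, the loop looks like this in Python. The helpers reason, execute, and observe are hypothetical stand-ins for whatever the framework provides:

```python
def reason(state):           # placeholder: the model proposes the next action
    return {"action": "noop", "state": state}

def execute(signal):         # placeholder: the proposal is acted on immediately
    return signal

def observe(result):         # placeholder: read the environment back
    return result

def agent_loop(state, max_steps=10):
    # max_steps exists only so this sketch terminates; the real loop is
    # bounded by nothing except the model's willingness to keep proposing.
    for _ in range(max_steps):
        signal = reason(state)     # Reason: propose an action (a Signal)
        result = execute(signal)   # Action: executed with no gate in between
        state = observe(result)    # State: update and go around again
    return state
```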

However, from a safety perspective, it has a critical omission:

There is no explicit:

  • Decision selection
  • Stopping condition (Boundary)
  • Human delegation

👉 Actions are generated, but not validated


This leads to:

  • Continuous action generation
  • Implicit stopping (not guaranteed)
  • Uncertain human escalation

👉 Therefore, agents do not stop


Wrong Approach

Many systems attempt to address this with prompts such as:

  • “Stop if dangerous”
  • “Confirm before execution”
  • “Ask a human if uncertain”

This appears safe—but it is not.


Because agents are not:

👉 systems that enforce strict conditions

They are:

👉 systems that generate the most plausible next action


Instructions are interpreted as:

  • vague guidelines
  • context-dependent suggestions

As a result:

👉 behavior varies by situation

👉 safety is not structurally guaranteed


Concrete Failures

Example 1: Auto-closing customer inquiries

Even with:

“Ask a human if uncertain”

Agents may:

  • generate a plausible FAQ response
  • mark the issue as resolved
  • skip escalation

Example 2: Over-modifying code

Even with:

“Avoid risky changes”

Agents may:

  • modify multiple files
  • update configurations
  • rewrite tests

Example 3: Executing unintended operations

Agents may:

  • delete records
  • send emails
  • update systems

because:

👉 “importance” is not structurally defined


Core Issue

All these share one root cause:

👉 Stopping conditions are not structurally defined


Prompts like:

  • “Be careful”
  • “Confirm”

are:

👉 policies, not rules


Agents interpret them as:

👉 ambiguous context


Fundamental Limitation

This is not an implementation issue.

👉 It is a fundamental limitation


LLMs are:

👉 systems that generate the most probable next output


They do not:

  • enforce rules
  • evaluate conditions strictly

They only produce:

👉 continuous sequences of tokens

as described in “Judgment Cannot Be Expressed by Smooth Computation — The Problem of Discontinuity in AI”.


Continuous vs Discrete

This creates a fundamental gap:

  • LLM → continuous generation
  • Decision → discrete commitment

👉 A continuous generative process cannot, by itself, produce a discrete commitment
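
A toy illustration of the gap, with made-up numbers. The distribution itself never says “stop”; the commitment is a step function imposed from outside:

```python
# The model's output is a continuous distribution over candidates.
# Turning it into a discrete commitment requires a rule that lives
# outside the distribution itself. All numbers are made up.
scores = {"delete_record": 0.48, "send_email": 0.32, "ask_human": 0.20}

THRESHOLD = 0.8  # an external policy choice, not derived from the model

top_action, top_score = max(scores.items(), key=lambda kv: kv[1])
if top_score >= THRESHOLD:
    decision = top_action           # discrete commit
else:
    decision = "escalate_to_human"  # discrete refusal to commit
print(decision)                     # -> escalate_to_human
```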


Why Decision Cannot Be Embedded

A Decision requires:

  • evaluation
  • selection
  • responsibility

But LLM outputs are:

👉 generated candidates

not:

👉 committed decisions


Therefore:

  • No entity determines “must stop”
  • No guarantee exists

👉 Prompt-based safety cannot work


Solution: Boundary, Not Prompt

Instead of saying:

“Please stop if dangerous”

We must define:

  • Always stop under condition X
  • Always require human approval under condition Y
  • Always redirect under condition Z

👉 Stopping conditions must exist as structure
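
A minimal sketch of what “as structure” can mean in practice. The action fields, conditions, and thresholds are illustrative assumptions:

```python
# A Boundary as data, not prose: conditions X, Y, Z become explicit
# predicates that are evaluated the same way every time.
BOUNDARY_RULES = [
    # (predicate over the proposed action, verdict)
    (lambda a: a.get("type") == "delete" and a.get("scope") == "production", "STOP"),
    (lambda a: a.get("type") == "send_email" and a.get("recipients", 0) > 100, "HUMAN"),
    (lambda a: a.get("type") == "code_change" and a.get("files", 0) > 5, "REDIRECT"),
]

def boundary(action: dict) -> str:
    """Return the first matching verdict; the default is EXECUTE."""
    for predicate, verdict in BOUNDARY_RULES:
        if predicate(action):
            return verdict
    return "EXECUTE"

# The same input always yields the same verdict -- unlike a prompt.
print(boundary({"type": "delete", "scope": "production"}))  # -> STOP
```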


Now:

👉 Agents become controllable


Reframing the Problem

The problem is simple:

👉 Agents do not stop


But this is not behavioral.

👉 It is structural.


The Right Question

Not:

👉 How do we stop agents?

But:

👉 Where do we define decisions?


Solution: Decision Trace Model (DTM)

DTM defines decisions as structure:

Event → Signal → Decision → Boundary → Human → Log

This clarifies:

  • What happened
  • How it was interpreted
  • What was selected
  • What constraints applied
  • Whether humans intervened
  • How it was recorded
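
Each stage can be captured in a single trace record. A minimal sketch, with illustrative field names:

```python
import json, time

# One DTM record covering Event -> Signal -> Decision -> Boundary -> Human -> Log.
trace = {
    "event":    {"id": "evt-001", "kind": "customer_inquiry"},    # what happened
    "signal":   {"candidates": ["reply_faq", "escalate"],         # how it was interpreted
                 "scores": [0.71, 0.29]},
    "decision": {"action": "escalate"},                           # what was selected
    "boundary": {"rule": "low_confidence", "verdict": "HUMAN"},   # what constraint applied
    "human":    {"required": True, "approved_by": None},          # whether humans intervened
    "logged_at": time.time(),                                     # how it was recorded
}
print(json.dumps(trace, indent=2))
```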

Key Insight

👉 Decision must pass through Boundary


Boundary determines:

  • Execute
  • Stop
  • Escalate

Redefining “Stopping”

👉 Stopping = not letting the Decision pass the Boundary


Agents continue generating Signals.

But the Boundary can respond with:

  • Reject (Hard Stop)
  • Hold (Soft Stop)
  • Redirect

👉 No execution occurs


Three Types of Stops

1. Hard Stop

Reject the Decision outright; it is never executed.

2. Soft Stop

Hold the Decision and delegate it to a human.

3. Redirect

Route the Decision to a safer alternative path.


Why This Matters

Traditional design:

  • controls behavior
  • relies on prompts

DTM:

👉 controls Decisions


Roles

  • Agent → generates Signals
  • DTM → selects Decisions

👉 Agent does not stop
👉 Decision is stopped


Best Practice Architecture

Layer 1: Capability Control

Restrict what is possible

Layer 2: Decision / Boundary

Control what is allowed

Layer 3: Human-in-the-loop

Restore responsibility

Layer 4: Logging

Enable replay and improvement
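
A compressed sketch of the four layers wired together. Every name is illustrative, and boundary() stands in for a real rule table:

```python
ALLOWED_TOOLS = {"reply_faq", "escalate", "search_docs"}  # Layer 1: what is possible

def boundary(action: dict) -> str:                        # Layer 2: what is allowed
    # Placeholder for the rule table sketched earlier.
    return "HUMAN" if action.get("risk", 0) > 0.5 else "EXECUTE"

def handle(action: dict, log: list) -> str:
    if action.get("type") not in ALLOWED_TOOLS:           # Layer 1: capability control
        verdict = "STOP"
    else:
        verdict = boundary(action)                        # Layer 2: decision / boundary
    if verdict == "HUMAN":                                # Layer 3: human-in-the-loop
        print("queued for human review:", action)
    log.append({"action": action, "verdict": verdict})    # Layer 4: logging
    return verdict

audit_log: list = []
print(handle({"type": "reply_faq", "risk": 0.2}, audit_log))  # -> EXECUTE
```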


Architecture

Event
  ↓
Agent (Signal)
  ↓
Decision (DTM)
  ↓
Boundary
  ├ STOP
  ├ HUMAN
  ├ REDIRECT
  └ EXECUTE
       ↓
   Execution
       ↓
      Log

Why DTM Works

DTM:

  • externalizes decisions
  • defines stopping conditions
  • structures human intervention
  • records everything

👉 This is a paradigm shift


Before / After

Before

  • Autonomous
  • Unstoppable
  • Non-reproducible
  • No accountability

After

  • Controlled per decision
  • Stoppable at boundary
  • Human escalation defined
  • Fully traceable

Conclusion

Anthropic’s agent technology has transformed AI into:

👉 something that acts


But that is not enough.


👉 We must define where Decisions are stopped


Final Statement

👉 Agents generate Signals.
👉 Decisions change the world.


Summary

The best-in-class design for controlling signal-based agents is:

👉 a multi-layered architecture centered on DTM


And the essence is simple:

👉 Agents do not stop.
👉 Decisions stop them.

Ultimately, as agents become autonomous actors, we must design them with the same rigor as safety-critical systems in industries such as automotive manufacturing.

Five-nines reliability is no longer optional; it is a requirement.
