Safety Monitoring Overview

ARTEMIS includes comprehensive safety monitoring to detect and prevent problematic agent behaviors during debates. This is a key differentiator from other multi-agent frameworks.

Why Safety Monitoring?

LLM agents can exhibit concerning behaviors:

  • Sandbagging: Deliberately underperforming to appear less capable
  • Deception: Making false claims or hiding information
  • Behavioral Drift: Gradually shifting behavior over time
  • Ethical Violations: Crossing ethical boundaries

ARTEMIS monitors for these behaviors in real time.

Safety Architecture

┌────────────────────────────────────────────────────────────────┐
│                      Safety Layer                              │
├────────────────────────────────────────────────────────────────┤
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐       │
│  │  Sandbagging  │  │   Deception   │  │   Behavior    │       │
│  │   Detector    │  │    Monitor    │  │    Tracker    │       │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘       │
│          │                  │                  │               │
│          └──────────────────┼──────────────────┘               │
│                             │                                  │
│                    ┌────────▼────────┐                         │
│                    │ safety_monitors │                         │
│                    └────────┬────────┘                         │
│                             │                                  │
│  ┌───────────────┐  ┌──────▼───────┐  ┌───────────────┐        │
│  │    Ethics     │  │    Debate    │  │    Alerts     │        │
│  │     Guard     │──│  Integration │──│   (results)   │        │
│  └───────────────┘  └──────────────┘  └───────────────┘        │
└────────────────────────────────────────────────────────────────┘

Available Monitors

Monitor               Purpose                              Detects
Sandbagging Detector  Detect intentional underperformance  Capability hiding
Deception Monitor     Detect false claims                  Lies, misdirection
Behavior Tracker      Track behavioral changes             Drift, inconsistency
Ethics Guard          Monitor ethical boundaries           Violations, harm

Quick Start

Basic Safety Setup

from artemis.core.agent import Agent
from artemis.core.debate import Debate
from artemis.safety import (
    SandbagDetector,
    DeceptionMonitor,
    BehaviorTracker,
    EthicsGuard,
    MonitorMode,
    EthicsConfig,
)

# Create agents
agents = [
    Agent(name="pro", role="Advocate for the proposition", model="gpt-4o"),
    Agent(name="con", role="Advocate against the proposition", model="gpt-4o"),
]

# Create individual monitors
sandbag = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
)
deception = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)
behavior = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
)
ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(harmful_content_threshold=0.5),
)

# Create debate with safety monitors
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[
        sandbag.process,
        deception.process,
        behavior.process,
        ethics.process,
    ],
)

debate.assign_positions({
    "pro": "supports the proposition",
    "con": "opposes the proposition",
})

result = await debate.run()

# Check safety alerts
for alert in result.safety_alerts:
    print(f"Alert: {alert.type}")
    print(f"Severity: {alert.severity}")
    print(f"Agent: {alert.agent}")

Multiple Monitors

Combine multiple monitors for comprehensive safety:

from artemis.safety import (
    SandbagDetector,
    DeceptionMonitor,
    BehaviorTracker,
    EthicsGuard,
    MonitorMode,
    EthicsConfig,
)

# Create all monitors
monitors = [
    SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=0.7),
    DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6),
    BehaviorTracker(mode=MonitorMode.PASSIVE, sensitivity=0.5),
    EthicsGuard(mode=MonitorMode.PASSIVE, config=EthicsConfig(harmful_content_threshold=0.5)),
]

# Pass their process methods to the debate (agents as defined in the Quick Start)
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[m.process for m in monitors],
)
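
If the same suite is reused across debates, a small factory keeps the setup in one place. The helper below is a convenience of this guide, not part of the artemis API:

def default_monitors(mode=MonitorMode.PASSIVE):
    """Build the standard four-monitor suite (illustrative helper, not an artemis API)."""
    return [
        SandbagDetector(mode=mode, sensitivity=0.7),
        DeceptionMonitor(mode=mode, sensitivity=0.6),
        BehaviorTracker(mode=mode, sensitivity=0.5),
        EthicsGuard(mode=mode, config=EthicsConfig(harmful_content_threshold=0.5)),
    ]

debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[m.process for m in default_monitors()],
)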

Monitor Modes

All monitors support three modes via the MonitorMode enum:

from artemis.safety import MonitorMode

# Available modes
MonitorMode.PASSIVE   # Observe and report only
MonitorMode.ACTIVE    # Can intervene and halt debate
MonitorMode.LEARNING  # Learn patterns without alerting

Passive Mode (Default)

Monitors observe and report but don't intervene:

monitor = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
)
# Alerts are generated but debate continues

Active Mode

Monitors can intervene and halt the debate:

monitor = SandbagDetector(
    mode=MonitorMode.ACTIVE,
    sensitivity=0.7,
)
# Debate may halt if severe issues detected
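
After an active-mode run, it is worth checking the most severe alert to see what, if anything, came close to triggering an intervention. This uses only the alert fields shown earlier:

result = await debate.run()

# In ACTIVE mode a severe detection can end the debate early;
# inspect the most severe alert to see what tripped it.
if result.safety_alerts:
    worst = max(result.safety_alerts, key=lambda a: a.severity)
    print(f"Most severe: {worst.type} ({worst.severity:.0%}) from {worst.agent}")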

Learning Mode

Monitors learn patterns without generating alerts:

monitor = BehaviorTracker(
    mode=MonitorMode.LEARNING,
    sensitivity=0.5,
)
# Gathers data for future reference

Safety Results

Each monitor's process method is called as the debate runs; any alerts it raises are collected on the result:

result = await debate.run()

# All alerts from all monitors
for alert in result.safety_alerts:
    print(f"Type: {alert.type}")
    print(f"Agent: {alert.agent}")
    print(f"Severity: {alert.severity:.0%}")

Alert Severity Levels

Level     Score Range  Description
Low       0.0 - 0.3    Minor concern
Medium    0.3 - 0.6    Notable issue
High      0.6 - 0.9    Serious concern
Critical  0.9 - 1.0    Severe issue
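
These thresholds translate directly into code. The table is ambiguous at the boundary values, so the sketch below assumes half-open intervals where the higher band wins at a boundary:

def severity_level(score: float) -> str:
    """Map a severity score in [0.0, 1.0] to its level name (thresholds per the table)."""
    if score >= 0.9:
        return "critical"
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"

for alert in result.safety_alerts:
    print(f"{alert.type}: {severity_level(alert.severity)}")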

Configuration

Per-Monitor Settings

Each monitor has its own configuration options:

# Sandbagging detector
sandbag = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
    baseline_turns=3,
    drop_threshold=0.3,
)

# Deception monitor
deception = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)

# Behavior tracker
behavior = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
    drift_threshold=0.3,
)

# Ethics guard
ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(
        harmful_content_threshold=0.5,
        bias_threshold=0.4,
        fairness_threshold=0.3,
        enabled_checks=["harmful_content", "bias", "fairness"],
    ),
)

Accessing Safety Data

After Debate

result = await debate.run()

# All safety alerts
print(f"Total alerts: {len(result.safety_alerts)}")

for alert in result.safety_alerts:
    print(f"  {alert.type}: {alert.severity:.0%} - {alert.agent}")

# Filter by type
sandbagging_alerts = [a for a in result.safety_alerts if "sandbag" in a.type.lower()]
deception_alerts = [a for a in result.safety_alerts if "deception" in a.type.lower()]
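
Standard-library tools make quick triage easy; for example, counting alerts per agent and pulling out the serious ones:

from collections import Counter

# Count alerts per agent to see which debater raises the most concerns
by_agent = Counter(a.agent for a in result.safety_alerts)
for agent, count in by_agent.most_common():
    print(f"{agent}: {count} alert(s)")

# Keep only high-severity alerts (>= 0.6 per the severity table)
serious = [a for a in result.safety_alerts if a.severity >= 0.6]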

Best Practices

  1. Start with passive mode: Understand behavior before enabling active intervention
  2. Tune sensitivity: Adjust based on false positive rates (see the tuning sketch after this list)
  3. Combine monitors: Multiple monitors catch more issues
  4. Review alerts: Verify detections before taking action
  5. Consider context: Some patterns may be legitimate
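
For practice 2, a crude tuning sketch: re-run the same debate setup at several sensitivities and compare alert counts to estimate the false positive rate. This assumes the debate is cheap enough to repeat, and since LLM output is non-deterministic the counts will vary between runs:

import asyncio

async def alert_count(sensitivity: float) -> int:
    """Run one passive-mode debate and count alerts (tuning sketch; reuses agents from above)."""
    monitor = SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=sensitivity)
    debate = Debate(
        topic="Your topic",
        agents=agents,
        safety_monitors=[monitor.process],
    )
    debate.assign_positions({
        "pro": "supports the proposition",
        "con": "opposes the proposition",
    })
    result = await debate.run()
    return len(result.safety_alerts)

for s in (0.5, 0.6, 0.7, 0.8):
    print(f"sensitivity={s}: {asyncio.run(alert_count(s))} alerts")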

Next Steps