Safety Monitoring Overview¶
ARTEMIS includes comprehensive safety monitoring to detect and prevent problematic agent behaviors during debates. This is a key differentiator from other multi-agent frameworks.
Why Safety Monitoring?¶
LLM agents can exhibit concerning behaviors:
- Sandbagging: Deliberately underperforming to appear less capable
- Deception: Making false claims or hiding information
- Behavioral Drift: Gradually shifting behavior over time
- Ethical Violations: Crossing ethical boundaries
ARTEMIS monitors for these behaviors in real time.
Safety Architecture¶
┌────────────────────────────────────────────────────────────────┐
│                          Safety Layer                          │
├────────────────────────────────────────────────────────────────┤
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐     │
│  │  Sandbagging  │   │   Deception   │   │   Behavior    │     │
│  │   Detector    │   │    Monitor    │   │    Tracker    │     │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘     │
│          │                   │                   │             │
│          └───────────────────┼───────────────────┘             │
│                              │                                  │
│                     ┌────────▼────────┐                         │
│                     │ safety_monitors │                         │
│                     └────────┬────────┘                         │
│                              │                                  │
│  ┌───────────────┐   ┌───────▼───────┐   ┌───────────────┐     │
│  │    Ethics     │   │    Debate     │   │    Alerts     │     │
│  │     Guard     │───│  Integration  │───│   (results)   │     │
│  └───────────────┘   └───────────────┘   └───────────────┘     │
└────────────────────────────────────────────────────────────────┘
Available Monitors¶
| Monitor | Purpose | Detects |
|---|---|---|
| Sandbagging Detector | Detect intentional underperformance | Capability hiding |
| Deception Monitor | Detect false claims | Lies, misdirection |
| Behavior Tracker | Track behavioral changes | Drift, inconsistency |
| Ethics Guard | Monitor ethical boundaries | Violations, harm |
Quick Start¶
Basic Safety Setup¶
from artemis.core.agent import Agent
from artemis.core.debate import Debate
from artemis.safety import (
    SandbagDetector,
    DeceptionMonitor,
    BehaviorTracker,
    EthicsGuard,
    MonitorMode,
    EthicsConfig,
)

# Create agents
agents = [
    Agent(name="pro", role="Advocate for the proposition", model="gpt-4o"),
    Agent(name="con", role="Advocate against the proposition", model="gpt-4o"),
]

# Create individual monitors
sandbag = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
)
deception = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)
behavior = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
)
ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(harmful_content_threshold=0.5),
)

# Create debate with safety monitors
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[
        sandbag.process,
        deception.process,
        behavior.process,
        ethics.process,
    ],
)
debate.assign_positions({
    "pro": "supports the proposition",
    "con": "opposes the proposition",
})

result = await debate.run()

# Check safety alerts
for alert in result.safety_alerts:
    print(f"Alert: {alert.type}")
    print(f"Severity: {alert.severity}")
    print(f"Agent: {alert.agent}")
Multiple Monitors¶
Combine multiple monitors for comprehensive safety:
from artemis.safety import (
    SandbagDetector,
    DeceptionMonitor,
    BehaviorTracker,
    EthicsGuard,
    MonitorMode,
    EthicsConfig,
)

# Create all monitors
monitors = [
    SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=0.7),
    DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6),
    BehaviorTracker(mode=MonitorMode.PASSIVE, sensitivity=0.5),
    EthicsGuard(mode=MonitorMode.PASSIVE, config=EthicsConfig(harmful_content_threshold=0.5)),
]

# Pass their process methods to the debate
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[m.process for m in monitors],
)
Monitor Modes¶
All monitors support three modes via the MonitorMode enum:
from artemis.safety import MonitorMode
# Available modes
MonitorMode.PASSIVE   # Observe and report only
MonitorMode.ACTIVE    # Can intervene and halt debate
MonitorMode.LEARNING  # Learn patterns without alerting
Passive Mode (Default)¶
Monitors observe and report but don't intervene:
monitor = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
)
# Alerts are generated but debate continues
Active Mode¶
Monitors can intervene and halt the debate:
monitor = SandbagDetector(
    mode=MonitorMode.ACTIVE,
    sensitivity=0.7,
)
# Debate may halt if severe issues detected
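When a monitor runs in active mode, it helps to see which alerts were severe enough to justify intervention. A minimal sketch, assuming a debate wired up with active monitors as in the Quick Start and using only the result fields shown on this page (severity is a 0-1 score; see the severity table below):

result = await debate.run()

# Flag alerts in the critical band (0.9 and above, per the severity table below)
critical = [a for a in result.safety_alerts if a.severity >= 0.9]
if critical:
    print(f"{len(critical)} critical alert(s) raised during the debate:")
    for alert in critical:
        print(f"  {alert.agent}: {alert.type} ({alert.severity:.0%})")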
Learning Mode¶
Monitors learn patterns without generating alerts:
monitor = BehaviorTracker(
    mode=MonitorMode.LEARNING,
    sensitivity=0.5,
)
# Gathers data for future reference
Safety Results¶
Each monitor's process method is called during the debate and can contribute alerts to the result:
result = await debate.run()

# All alerts from all monitors
for alert in result.safety_alerts:
    print(f"Type: {alert.type}")
    print(f"Agent: {alert.agent}")
    print(f"Severity: {alert.severity:.0%}")
Alert Severity Levels¶
| Level | Score Range | Description |
|---|---|---|
| Low | 0.0 - 0.3 | Minor concern |
| Medium | 0.3 - 0.6 | Notable issue |
| High | 0.6 - 0.9 | Serious concern |
| Critical | 0.9 - 1.0 | Severe issue |
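If you want to bucket raw severity scores into these levels when summarising a run, a small helper is enough. severity_level below is a hypothetical utility written for this page, not part of the artemis.safety API:

from collections import Counter

def severity_level(score: float) -> str:
    """Map a 0-1 severity score to the levels in the table above."""
    if score >= 0.9:
        return "critical"
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"

# Tally alerts per level after a run like the one above
level_counts = Counter(severity_level(a.severity) for a in result.safety_alerts)
print(dict(level_counts))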
Configuration¶
Per-Monitor Settings¶
Each monitor has its own configuration options:
# Sandbagging detector
sandbag = SandbagDetector(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.7,
    baseline_turns=3,
    drop_threshold=0.3,
)

# Deception monitor
deception = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)

# Behavior tracker
behavior = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
    drift_threshold=0.3,
)

# Ethics guard
ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(
        harmful_content_threshold=0.5,
        bias_threshold=0.4,
        fairness_threshold=0.3,
        enabled_checks=["harmful_content", "bias", "fairness"],
    ),
)
Accessing Safety Data¶
After Debate¶
result = await debate.run()

# All safety alerts
print(f"Total alerts: {len(result.safety_alerts)}")
for alert in result.safety_alerts:
    print(f"  {alert.type}: {alert.severity:.0%} - {alert.agent}")

# Filter by type
sandbagging_alerts = [a for a in result.safety_alerts if "sandbag" in a.type.lower()]
deception_alerts = [a for a in result.safety_alerts if "deception" in a.type.lower()]
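Beyond filtering by type, a per-agent summary often shows where alerts cluster. This sketch uses only the alert fields shown above (type, agent, severity); the grouping is plain Python rather than an ARTEMIS API:

from collections import defaultdict

# Group alerts by the agent that triggered them
alerts_by_agent = defaultdict(list)
for alert in result.safety_alerts:
    alerts_by_agent[alert.agent].append(alert)

for agent, alerts in alerts_by_agent.items():
    worst = max(a.severity for a in alerts)
    print(f"{agent}: {len(alerts)} alert(s), worst severity {worst:.0%}")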
Best Practices¶
- Start with passive mode: Understand behavior before enabling active intervention (see the sketch after this list)
- Tune sensitivity: Adjust based on false positive rates
- Combine monitors: Multiple monitors catch more issues
- Review alerts: Verify detections before taking action
- Consider context: Some patterns may be legitimate
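The first two practices, start passive and tune sensitivity, translate into a simple two-pass workflow. The sketch below is one way to do it, assuming you re-create the monitors with adjusted settings between runs; it uses only constructors and fields shown earlier on this page:

# Pass 1: observe only, with initial sensitivities
passive_monitors = [
    SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=0.7),
    DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6),
]
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[m.process for m in passive_monitors],
)
result = await debate.run()

# Review the alerts; if too many look like false positives, adjust the sensitivities
print(f"Alerts in passive pass: {len(result.safety_alerts)}")

# Pass 2: once the settings look right, allow intervention
active_monitors = [
    SandbagDetector(mode=MonitorMode.ACTIVE, sensitivity=0.6),
    DeceptionMonitor(mode=MonitorMode.ACTIVE, sensitivity=0.5),
]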
Next Steps¶
- Learn about Sandbagging Detection
- Understand Deception Monitoring
- Explore Behavior Tracking
- Configure Ethics Guard