Behavior Tracking¶
The Behavior Tracker monitors how agent behavior changes over time, detecting drift, inconsistencies, and concerning patterns.
What It Tracks¶
Behavior tracking monitors:
- Style Drift: Changes in communication style
- Position Drift: Shifts in argued position
- Capability Drift: Changes in demonstrated abilities
- Engagement Patterns: How agents interact
Why Track Behavior?¶
Behavior tracking helps detect:
- Goal Drift: An agent straying from its assigned position
- Strategic Shifts: Concerning changes in strategy
- Manipulation: Gradual influence by an opponent
- Degradation: Performance declining over time
Usage¶
Basic Setup¶
from artemis.core.debate import Debate
from artemis.safety import BehaviorTracker, MonitorMode

tracker = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
)

# agents is a list of Agent instances (see Integration below)
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[tracker.process],
)
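With the tracker registered as a safety monitor, run the debate as usual; any drift findings surface afterwards through the result's safety alerts (see Results below). A minimal sketch, assuming result.safety_alerts is a plain list as the later examples suggest:
# Run the debate with the tracker attached.
result = await debate.run()
print(f"Safety alerts raised: {len(result.safety_alerts)}")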
Configuration Options¶
tracker = BehaviorTracker(
    mode=MonitorMode.PASSIVE,  # PASSIVE, ACTIVE, or LEARNING
    sensitivity=0.5,           # 0.0 to 1.0
    window_size=5,             # Turns to consider for drift detection
    drift_threshold=0.3,       # Threshold for drift alerts
)
Configuration Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | MonitorMode | PASSIVE | Monitor mode |
| sensitivity | float | 0.5 | Detection sensitivity (0-1) |
| window_size | int | 5 | Turns to consider for drift detection |
| drift_threshold | float | 0.3 | Drift detection threshold |
Tracked Metrics¶
Style Metrics¶
How the agent communicates:
- Formality level
- Aggression/assertiveness
- Complexity
- Sentiment
- Confidence
Position Metrics¶
What position the agent argues:
- Position alignment with assigned stance
- Position strength
- Concession rate
- Counter-argument engagement
Capability Metrics¶
How capable the agent appears:
- Vocabulary complexity
- Reasoning depth
- Evidence usage
- Argument structure
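Conceptually, each turn can be summarized as one snapshot across these three metric families. The dataclass below is purely illustrative; its name and fields are assumptions for exposition, not BehaviorTracker's internal representation.
from dataclasses import dataclass

# Illustrative only: a hypothetical per-turn snapshot grouping the three
# metric families above. Not part of the artemis API.
@dataclass
class BehaviorSnapshot:
    # Style: how the agent communicates
    formality: float           # 0.0 (casual) to 1.0 (formal)
    confidence: float          # 0.0 (heavily hedged) to 1.0 (assertive)
    # Position: what the agent argues
    position_alignment: float  # agreement with the assigned stance
    concession_rate: float     # fraction of opposing points conceded
    # Capability: how capable the agent appears
    reasoning_depth: float     # depth of argument chains
    evidence_usage: float      # density of cited evidence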
Drift Detection¶
Drift Types¶
| Drift Type | Description | Concern Level |
|---|---|---|
| Gradual | Slow, steady change | Low |
| Sudden | Abrupt shift | High |
| Oscillating | Back and forth | Medium |
| Escalating | Increasing severity | High |
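As a rough intuition for how these categories differ, the sketch below classifies a window of per-turn behavior scores by step size, direction changes, and cumulative change. It is an illustrative approximation, not BehaviorTracker's actual algorithm.
# Illustrative only: classify drift over a window of per-turn scores (0.0-1.0).
def classify_drift(scores: list[float], drift_threshold: float = 0.3) -> str:
    if len(scores) < 2:
        return "none"
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    total_change = scores[-1] - scores[0]
    largest_step = max(abs(d) for d in deltas)
    direction_flips = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)

    if largest_step >= drift_threshold:
        return "sudden"          # one abrupt shift dominates the window
    if direction_flips >= max(1, len(deltas) // 2):
        return "oscillating"     # behavior keeps reversing direction
    if abs(total_change) >= drift_threshold:
        steps = [abs(d) for d in deltas]
        if all(later >= earlier for earlier, later in zip(steps, steps[1:])):
            return "escalating"  # each step larger than the last
        return "gradual"         # small steps that add up
    return "none"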
Results¶
The tracker surfaces its findings through the debate result's safety alerts:
result = await debate.run()
# Check for behavior drift alerts
for alert in result.safety_alerts:
    if "behavior" in alert.type.lower() or "drift" in alert.type.lower():
        print(f"Agent: {alert.agent}")
        print(f"Severity: {alert.severity:.0%}")
Integration¶
With Debate¶
from artemis.core.agent import Agent
from artemis.core.debate import Debate
from artemis.safety import BehaviorTracker, MonitorMode
agents = [
    Agent(name="pro", role="Advocate for the proposition", model="gpt-4o"),
    Agent(name="con", role="Advocate against the proposition", model="gpt-4o"),
]

tracker = BehaviorTracker(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.5,
    window_size=5,
)

debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[tracker.process],
)

debate.assign_positions({
    "pro": "supports the proposition",
    "con": "opposes the proposition",
})

result = await debate.run()

# Check for behavior alerts
behavior_alerts = [
    a for a in result.safety_alerts
    if "behavior" in a.type.lower()
]
for alert in behavior_alerts:
    print(f"Agent {alert.agent}: {alert.severity:.0%} severity")
With Other Monitors¶
Behavior tracking complements other monitors:
from artemis.safety import (
    BehaviorTracker,
    SandbagDetector,
    DeceptionMonitor,
    MonitorMode,
)
behavior = BehaviorTracker(mode=MonitorMode.PASSIVE, sensitivity=0.5, window_size=5)
sandbag = SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=0.7)
deception = DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6)
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[
        behavior.process,
        sandbag.process,
        deception.process,
    ],
)
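Because every monitor writes into the same result.safety_alerts list, alerts can be cross-referenced per agent after the run. A minimal sketch, using only the alert fields shown in the earlier examples (agent, type, severity):
result = await debate.run()

# Group alerts by agent so behavior drift can be correlated with
# sandbagging or deception signals from the other monitors.
alerts_by_agent: dict[str, list] = {}
for alert in result.safety_alerts:
    alerts_by_agent.setdefault(alert.agent, []).append(alert)

for agent_name, alerts in alerts_by_agent.items():
    kinds = sorted({a.type for a in alerts})
    worst = max(a.severity for a in alerts)
    print(f"{agent_name}: {len(alerts)} alerts ({', '.join(kinds)}), worst {worst:.0%}")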
Sensitivity Tuning¶
Low Sensitivity (0.3)¶
- Only catches severe drift
- Minimal false positives
- May miss gradual changes
Medium Sensitivity (0.5)¶
- Balanced detection
- Catches most concerning drift
- Good general setting
High Sensitivity (0.8)¶
- Catches subtle changes
- More false positives
- For high-stakes scenarios
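These profiles map directly onto the constructor's sensitivity argument. The preset names and helper below are illustrative conventions, not part of the library:
from artemis.safety import BehaviorTracker, MonitorMode

# Illustrative presets; the names and cutoffs follow the guidance above.
SENSITIVITY_PRESETS = {
    "lenient": 0.3,   # only severe drift, minimal false positives
    "balanced": 0.5,  # good general setting
    "strict": 0.8,    # high-stakes scenarios, more false positives
}

def make_tracker(profile: str = "balanced") -> BehaviorTracker:
    # Hypothetical helper: builds a tracker for the chosen profile.
    return BehaviorTracker(
        mode=MonitorMode.PASSIVE,
        sensitivity=SENSITIVITY_PRESETS[profile],
        window_size=5,
    )

tracker = make_tracker("strict")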
Best Practices¶
- Set an appropriate window: a window of 5-7 turns typically works well
- Account for natural variation: Some drift is normal
- Consider debate phase: Closing arguments differ from opening
- Monitor all agents: Compare behavior across participants
- Correlate with other signals: Drift may accompany other issues
Next Steps¶
- Learn about Sandbagging Detection
- Explore Deception Monitoring
- Configure Ethics Guard