Deception Monitoring
The Deception Monitor detects when agents make false claims, misrepresent information, or attempt to mislead.
What is Deception?
Deception in debates includes:
- Factual Falsity: Making claims that are demonstrably false
- Logical Fallacies: Using invalid reasoning to mislead
- Misrepresentation: Distorting sources or opponent positions
- Selective Omission: Hiding relevant information
- Misdirection: Distracting from key issues
Detection Capabilities
The Deception Monitor checks multiple dimensions:
| Dimension | What It Checks |
|---|---|
| Factual | Are claims consistent and plausible? |
| Logical | Is reasoning valid? |
| Consistency | Do claims contradict each other? |
| Source | Are sources accurately represented? |
| Context | Is context preserved? |
Usage
Basic Setup
from artemis.core.debate import Debate
from artemis.safety import DeceptionMonitor, MonitorMode

monitor = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)

# `agents` is a list of Agent instances (see the full example under Integration)
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[monitor.process],
)
Configuration Options
monitor = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,  # PASSIVE, ACTIVE, or LEARNING
    sensitivity=0.6,           # 0.0 to 1.0
)
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `MonitorMode` | `PASSIVE` | Monitor mode |
| `sensitivity` | `float` | `0.5` | Detection sensitivity (0-1) |
What It Detects
Logical Fallacies
Common fallacies detected:
| Fallacy | Description |
|---|---|
| Ad Hominem | Attacking the person, not the argument |
| Straw Man | Misrepresenting opponent's position |
| False Dichotomy | Presenting only two options when more exist |
| Appeal to Authority | Using authority as sole justification |
| Circular Reasoning | Conclusion restates the premise |
| Red Herring | Introducing irrelevant information |
| Slippery Slope | Assuming inevitable chain of events |
| Hasty Generalization | Drawing broad conclusions from few examples |
Consistency Issues
- Internal contradictions within an agent's arguments
- Position shifts that contradict earlier statements
- Conflicting evidence claims
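To make the idea of a position shift concrete, here is a minimal sketch of a claim ledger that flags an agent asserting the opposite of something it said earlier. It is an illustration only and not how `DeceptionMonitor` is implemented: `ClaimLedger` and its `record` method are hypothetical names, and matching is by normalised exact text, whereas a real monitor would need semantic comparison.

```python
from collections import defaultdict


class ClaimLedger:
    """Toy record of each agent's stance on normalised propositions (hypothetical)."""

    def __init__(self):
        # agent -> proposition -> (stance, turn at which it was first asserted)
        self._stances = defaultdict(dict)

    def record(self, agent, proposition, asserts_true, turn):
        """Store a claim; return a contradiction description, or None."""
        key = proposition.strip().lower()
        previous = self._stances[agent].get(key)
        if previous is None:
            self._stances[agent][key] = (asserts_true, turn)
            return None
        prev_stance, prev_turn = previous
        if prev_stance != asserts_true:
            return f"{agent} contradicts turn {prev_turn} on '{key}' at turn {turn}"
        return None


ledger = ClaimLedger()
first = ledger.record("pro", "Remote work raises productivity", True, turn=1)
other = ledger.record("con", "Remote work raises productivity", False, turn=2)
second = ledger.record("pro", "remote work raises productivity", False, turn=3)
print(second)
```

Stances are tracked per agent, so two agents disagreeing with each other is not flagged; only an agent reversing its own earlier claim is.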
Misrepresentation
- Distorting opponent's position
- Taking sources out of context
- Selective quoting
Results
Deception findings are reported through the debate result's `safety_alerts`:
# `await` requires an async context (an `async def` function or an asyncio-aware REPL)
result = await debate.run()

# Check for deception alerts
for alert in result.safety_alerts:
    if "deception" in alert.type.lower():
        print(f"Agent: {alert.agent}")
        print(f"Severity: {alert.severity:.0%}")
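When a debate produces many alerts, a per-agent summary can be easier to read than a flat list. The sketch below relies only on the three alert attributes used above (`type`, `agent`, `severity`). `Alert` is a stand-in dataclass defined here so the example runs on its own, not the library's alert class, and `summarize_deception_alerts` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    """Stand-in for a safety alert; only the fields used on this page."""
    type: str
    agent: str
    severity: float  # assumed to lie in [0.0, 1.0]


def summarize_deception_alerts(alerts):
    """Return {agent: (count, max_severity)} for deception alerts only."""
    grouped = defaultdict(list)
    for alert in alerts:
        if "deception" in alert.type.lower():
            grouped[alert.agent].append(alert.severity)
    return {agent: (len(sev), max(sev)) for agent, sev in grouped.items()}


sample = [
    Alert("deception", "pro", 0.4),
    Alert("Deception", "pro", 0.7),
    Alert("sandbagging", "con", 0.9),
]
summary = summarize_deception_alerts(sample)
for agent, (count, worst) in summary.items():
    print(f"{agent}: {count} alert(s), worst severity {worst:.0%}")
```

In real use, `result.safety_alerts` would take the place of `sample`, provided the alert objects expose the same three attributes.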
Distinguishing Intent
Not all false claims are intentional deception:
| Type | Description | Severity |
|---|---|---|
| Mistake | Unintentional error | Low |
| Negligence | Careless claim | Medium |
| Deception | Intentional misleading | High |
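For triage, it can be convenient to bucket a numeric `alert.severity` into the Low / Medium / High bands from the table. The sketch below is an assumption-laden illustration: the source does not say how severity scores map to these bands, so `severity_band` and its cutoffs (0.33 and 0.66) are hypothetical and should be tuned to your own data.

```python
def severity_band(score, low_cutoff=0.33, high_cutoff=0.66):
    """Map a severity score in [0, 1] to 'Low', 'Medium', or 'High'.

    The cutoffs are illustrative defaults, not values taken from the library.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"severity must be in [0, 1], got {score}")
    if score < low_cutoff:
        return "Low"
    if score < high_cutoff:
        return "Medium"
    return "High"


bands = [severity_band(s) for s in (0.1, 0.5, 0.9)]
print(bands)
```

A band alone does not establish intent; a high-severity alert is a prompt for closer review rather than proof of deliberate deception.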
Integration
With Debate
from artemis.core.agent import Agent
from artemis.core.debate import Debate
from artemis.safety import DeceptionMonitor, MonitorMode

agents = [
    Agent(name="pro", role="Advocate for the proposition", model="gpt-4o"),
    Agent(name="con", role="Advocate against the proposition", model="gpt-4o"),
]

monitor = DeceptionMonitor(
    mode=MonitorMode.PASSIVE,
    sensitivity=0.6,
)

debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[monitor.process],
)
debate.assign_positions({
    "pro": "supports the proposition",
    "con": "opposes the proposition",
})

# `await` requires an async context (an `async def` function or an asyncio-aware REPL)
result = await debate.run()

# Collect the deception alerts
deception_alerts = [
    a for a in result.safety_alerts
    if "deception" in a.type.lower()
]
for alert in deception_alerts:
    print(f"Agent: {alert.agent}")
    print(f"Severity: {alert.severity:.0%}")
With Other Monitors
from artemis.core.debate import Debate
from artemis.safety import (
    DeceptionMonitor,
    SandbagDetector,
    EthicsGuard,
    MonitorMode,
    EthicsConfig,
)

deception = DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6)
sandbag = SandbagDetector(mode=MonitorMode.PASSIVE, sensitivity=0.7)
ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(harmful_content_threshold=0.5),
)

# `agents` is defined as in the "With Debate" example
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[
        deception.process,
        sandbag.process,
        ethics.process,
    ],
)
Sensitivity Tuning
Low Sensitivity (0.3)
- Catches only obvious deception
- Few false positives
- May miss subtle cases
Medium Sensitivity (0.6)
- Balanced detection
- Some false positives
- Good general setting
High Sensitivity (0.8)
- Catches subtle deception
- More false positives
- Good for high-stakes scenarios
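The trade-off between the three settings can be pictured with a toy model. The sketch below assumes, purely for illustration, that each statement receives a suspicion score in [0, 1] and is flagged when that score is at least `1 - sensitivity`. The source does not document how `DeceptionMonitor` applies `sensitivity` internally, so the rule, the `flagged` helper, and the sample scores are all hypothetical.

```python
def flagged(scores, sensitivity):
    """Return the scores flagged under the assumed rule: score >= 1 - sensitivity."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError(f"sensitivity must be in [0, 1], got {sensitivity}")
    threshold = 1.0 - sensitivity
    return [s for s in scores if s >= threshold]


# Made-up suspicion scores for five statements
scores = [0.15, 0.35, 0.5, 0.75, 0.9]

counts = {s: len(flagged(scores, s)) for s in (0.3, 0.6, 0.8)}
for sensitivity, count in counts.items():
    print(f"sensitivity={sensitivity}: {count} of {len(scores)} statements flagged")
```

Under this model, raising sensitivity lowers the bar for flagging, so subtler cases are caught at the cost of more statements needing review.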
Best Practices
- Enable comprehensive monitoring: Combine with other monitors
- Track consistency: Many deceptions are revealed by contradictions
- Consider intent: Not all false claims are deceptive
- Review edge cases: Some content needs human judgment
- Combine with ethics: Deception often accompanies ethical violations
Next Steps
- Learn about Sandbagging Detection
- Explore Behavior Tracking
- Configure Ethics Guard