# Ethics Guard

The Ethics Guard monitors debates for ethical boundary violations, ensuring arguments remain within acceptable moral limits.
## Overview

The Ethics Guard detects:

- Harmful Content: Arguments promoting harm
- Discrimination: Unfair treatment of groups
- Manipulation: Psychological manipulation tactics
- Privacy Violations: Exposing private information
- Deceptive Claims: Intentionally false statements
## Usage

### Basic Setup

```python
from artemis.core.debate import Debate
from artemis.safety import EthicsGuard, MonitorMode, EthicsConfig

guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(
        harmful_content_threshold=0.4,
        bias_threshold=0.4,
        fairness_threshold=0.3,
        enabled_checks=["harmful_content", "bias", "fairness"],
    ),
)

# `agents` defined as in the Integration example below
debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[guard.process],
)
```

In `MonitorMode.PASSIVE` the guard only records alerts; `MonitorMode.ACTIVE` can halt the debate (see Sensitive Topics below).
### Configuration Options

```python
from artemis.safety import EthicsGuard, MonitorMode, EthicsConfig

config = EthicsConfig(
    harmful_content_threshold=0.4,
    bias_threshold=0.4,
    fairness_threshold=0.3,
    enabled_checks=[
        "harmful_content",
        "bias",
        "fairness",
        "privacy",
        "manipulation",
    ],
)

guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=config,
)
```
### EthicsConfig Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `harmful_content_threshold` | `float` | `0.5` | Threshold for harmful content |
| `bias_threshold` | `float` | `0.4` | Threshold for bias detection |
| `fairness_threshold` | `float` | `0.3` | Threshold for fairness violations |
| `enabled_checks` | `list[str]` | default set | Ethics checks to enable |
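Parameters left unset fall back to these defaults. A minimal sketch, assuming `EthicsConfig` exposes its fields as attributes (e.g. a dataclass or pydantic-style model):

```python
# Construct a config with defaults only; values come from the table above.
config = EthicsConfig()
print(config.harmful_content_threshold)  # 0.5
print(config.bias_threshold)             # 0.4
```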
## Ethical Principles

### Built-in Principles

| Principle | Description | Detects |
|---|---|---|
| Fairness | Equal treatment | Discrimination, bias |
| Transparency | Honest communication | Hidden agendas, misdirection |
| Non-harm | Avoiding harm | Violence, dangerous advice |
| Respect | Dignified treatment | Insults, dehumanization |
| Accuracy | Truthful claims | Misinformation, false claims |
### Custom Checks

You can specify which checks to enable:

```python
config = EthicsConfig(
    harmful_content_threshold=0.3,
    enabled_checks=["harmful_content", "bias"],  # Only these two
)
```
## Detection Categories

### Harmful Content

Content that promotes or glorifies harm:

- Violence advocacy
- Self-harm promotion
- Dangerous activities
- Harmful advice
### Discrimination

Unfair treatment based on protected characteristics:

- Racial discrimination
- Gender discrimination
- Religious discrimination
- Age discrimination
- Disability discrimination
### Manipulation

Psychological manipulation tactics:

- Fear mongering
- Guilt tripping
- Gaslighting language
- Coercion
- Emotional exploitation
### Privacy Violations

Exposure of private information:

- Personal identification
- Location disclosure
- Financial information
- Health information
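Each category is targeted through `enabled_checks`. A minimal sketch using the check names shown in Configuration Options above (mapping Discrimination onto the `bias` and `fairness` checks is an assumption):

```python
# Watch only for privacy violations and manipulation; other categories are skipped.
focused_guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(enabled_checks=["privacy", "manipulation"]),
)
```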
## Results

The Ethics Guard contributes to the debate's safety alerts:

```python
result = await debate.run()

# Check for ethics alerts
for alert in result.safety_alerts:
    if "ethics" in alert.type.lower():
        print(f"Agent: {alert.agent}")
        print(f"Severity: {alert.severity:.0%}")
```
## Integration

### With Debate

```python
from artemis.core.agent import Agent
from artemis.core.debate import Debate
from artemis.safety import EthicsGuard, MonitorMode, EthicsConfig

agents = [
    Agent(name="pro", role="Advocate for the proposition", model="gpt-4o"),
    Agent(name="con", role="Advocate against the proposition", model="gpt-4o"),
]

guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(harmful_content_threshold=0.4),
)

debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[guard.process],
)

debate.assign_positions({
    "pro": "supports the proposition",
    "con": "opposes the proposition",
})

result = await debate.run()

# Check for ethics violations
ethics_alerts = [
    a for a in result.safety_alerts
    if "ethics" in a.type.lower()
]
for alert in ethics_alerts:
    print(f"Agent: {alert.agent}")
    print(f"Severity: {alert.severity:.0%}")
```
### With Other Monitors

```python
from artemis.safety import (
    EthicsGuard,
    DeceptionMonitor,
    BehaviorTracker,
    MonitorMode,
    EthicsConfig,
)

ethics = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(harmful_content_threshold=0.4),
)
deception = DeceptionMonitor(mode=MonitorMode.PASSIVE, sensitivity=0.6)
behavior = BehaviorTracker(mode=MonitorMode.PASSIVE, sensitivity=0.5)

debate = Debate(
    topic="Your topic",
    agents=agents,
    safety_monitors=[
        ethics.process,
        deception.process,
        behavior.process,
    ],
)
```
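All three monitors report into the same `result.safety_alerts` stream. Assuming each alert's `type` identifies the monitor that raised it, you can review each monitor's findings separately:

```python
from collections import defaultdict

result = await debate.run()

# Group alerts by type so each monitor's findings can be reviewed on their own.
alerts_by_type: dict[str, list] = defaultdict(list)
for alert in result.safety_alerts:
    alerts_by_type[alert.type].append(alert)

for alert_type, alerts in sorted(alerts_by_type.items()):
    print(f"{alert_type}: {len(alerts)} alert(s)")
```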
## Sensitivity Tuning

The levels below are on a scale where higher means stricter detection. Note that the `EthicsConfig` thresholds run the other way: *lower* threshold values mean stricter enforcement (compare the Sensitive Topics example below).

### Low Sensitivity (0.3)

- Only catches severe violations
- Minimal false positives
- Allows controversial but not harmful content

### Medium Sensitivity (0.6)

- Catches most concerning content
- Balanced false-positive rate
- A good general setting

### High Sensitivity (0.9)

- Very strict enforcement
- More false positives
- For sensitive contexts
## Common Configurations

### Academic Debate

```python
guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(
        harmful_content_threshold=0.5,
        bias_threshold=0.5,
        enabled_checks=["harmful_content", "bias", "fairness"],
    ),
)
```

### Sensitive Topics

```python
guard = EthicsGuard(
    mode=MonitorMode.ACTIVE,  # Can halt the debate
    config=EthicsConfig(
        harmful_content_threshold=0.2,  # Very strict
        bias_threshold=0.3,
        enabled_checks=["harmful_content", "bias", "fairness", "privacy"],
    ),
)
```

### Policy Debate

```python
guard = EthicsGuard(
    mode=MonitorMode.PASSIVE,
    config=EthicsConfig(
        harmful_content_threshold=0.4,
        bias_threshold=0.4,
        enabled_checks=["harmful_content", "bias", "fairness"],
    ),
)
```
## Best Practices

- Set appropriate sensitivity: Match strictness to the debate context
- Define clear principles: Be explicit about boundaries
- Use passive mode initially: Understand alert patterns before blocking
- Review edge cases: Some content needs human judgment
- Document decisions: Track why alerts were generated (see the sketch below)
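A minimal audit-log sketch for that last point, given a `result` from `debate.run()` as above; `log_alert` is a hypothetical helper, not part of artemis:

```python
import json
import time

def log_alert(alert, path: str = "ethics_alerts.jsonl") -> None:
    """Append an alert to a JSONL audit log (hypothetical helper)."""
    record = {
        "timestamp": time.time(),
        "agent": alert.agent,
        "type": alert.type,
        "severity": alert.severity,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

for alert in result.safety_alerts:
    log_alert(alert)
```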
## Next Steps

- Learn about Safety Overview
- Explore Deception Monitoring
- Configure Behavior Tracking