L-AE-CR: Adaptive Evaluation with Causal Reasoning

L-AE-CR (Leveled Adaptive Evaluation with Causal Reasoning) is ARTEMIS's dynamic evaluation system. Unlike static scoring, it adapts evaluation criteria based on context and traces causal relationships in arguments.

Core Principles

1. Adaptive Criteria Weighting

Evaluation criteria aren't fixed—they shift based on:

  • Topic Domain: Technical vs. ethical vs. policy debates
  • Round Context: Opening statements vs. rebuttals
  • Argument Type: Evidence-based vs. logical vs. emotional appeals

2. Causal Chain Analysis

Arguments are evaluated not just on content but on the strength of their causal reasoning (a toy chain check is sketched after this list):

  • Does A actually cause B?
  • Is the causal chain complete?
  • Are there gaps or logical fallacies?
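
For intuition, here is a toy version of the completeness check (not ARTEMIS's internal logic): treat a chain as a list of cause/effect pairs and flag places where consecutive links fail to connect.

# Illustrative only: a hand-rolled gap check on (cause, effect) pairs.
def chain_gaps(chain: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (effect, next_cause) pairs where consecutive links do not connect."""
    return [
        (effect, next_cause)
        for (_, effect), (next_cause, _) in zip(chain, chain[1:])
        if effect != next_cause
    ]

chain = [
    ("increased_regulation", "higher_compliance_costs"),
    ("higher_compliance_costs", "reduced_innovation_speed"),
]
print(chain_gaps(chain))  # [] -- every effect feeds the next cause, so no gaps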

Evaluation Criteria

Standard Criteria

The EvaluationCriteria class defines the default weights:

from artemis.core.types import EvaluationCriteria

criteria = EvaluationCriteria(
    logical_coherence=0.25,   # Internal consistency of argument
    evidence_quality=0.25,    # Strength and relevance of evidence
    causal_reasoning=0.20,    # Strength of causal reasoning
    ethical_alignment=0.15,   # Ethical soundness
    persuasiveness=0.15,      # Overall persuasiveness
)
| Criterion | Description | Default Weight |
|---|---|---|
| logical_coherence | Internal consistency of argument | 0.25 |
| evidence_quality | Strength and relevance of evidence | 0.25 |
| causal_reasoning | Strength of causal reasoning | 0.20 |
| ethical_alignment | Ethical soundness | 0.15 |
| persuasiveness | Overall persuasiveness | 0.15 |

Custom Weights

You can customize evaluation weights in the debate configuration:

from artemis.core.types import DebateConfig, EvaluationCriteria

# Technical domain: higher evidence weight
technical_criteria = EvaluationCriteria(
    logical_coherence=0.30,
    evidence_quality=0.35,
    causal_reasoning=0.15,
    ethical_alignment=0.10,
    persuasiveness=0.10,
)

# Ethical domain: higher ethics weight
ethical_criteria = EvaluationCriteria(
    logical_coherence=0.20,
    evidence_quality=0.15,
    causal_reasoning=0.15,
    ethical_alignment=0.35,
    persuasiveness=0.15,
)

config = DebateConfig(
    evaluation_criteria=technical_criteria,
    adaptation_enabled=True,
    adaptation_rate=0.1,
)
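
Like the defaults, both presets keep the five weights summing to 1.0, so total scores remain on a comparable scale across configurations.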

Argument Evaluation

Each argument receives an ArgumentEvaluation with detailed scores:

# After running a debate
result = await debate.run()

for turn in result.transcript:
    if turn.evaluation:
        evaluation = turn.evaluation
        print(f"Agent: {turn.agent}")
        print(f"Total Score: {evaluation.total_score:.2f}")
        print("Criterion Scores:")
        for criterion, score in evaluation.scores.items():
            weight = evaluation.weights.get(criterion, 0)
            print(f"  {criterion}: {score:.2f} (weight: {weight:.2f})")
        print(f"Causal Score: {evaluation.causal_score:.2f}")

ArgumentEvaluation Fields

| Field | Type | Description |
|---|---|---|
| argument_id | str | ID of evaluated argument |
| scores | dict | Score for each criterion |
| weights | dict | Adapted weight for each criterion |
| criterion_details | list | Detailed per-criterion breakdown |
| causal_score | float | Score for causal reasoning |
| total_score | float | Weighted total score |
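
As a mental model, assuming total_score is simply the weight-weighted sum of the criterion scores (consistent with the fields above), the total can be reproduced from scores and weights:

# Illustrative: recompute a weighted total from an evaluation's scores and weights.
def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    return sum(score * weights.get(criterion, 0.0) for criterion, score in scores.items())

scores = {"logical_coherence": 0.75, "evidence_quality": 0.5}
weights = {"logical_coherence": 0.25, "evidence_quality": 0.25}
print(weighted_total(scores, weights))  # 0.3125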

Causal Reasoning Analysis

Arguments contain causal links that are evaluated:

from artemis.core.types import CausalLink

# Causal links in arguments
link = CausalLink(
    cause="increased_regulation",
    effect="reduced_innovation_speed",
    mechanism="Compliance overhead diverts resources",
    strength=0.7,
    bidirectional=False,
)
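
A multi-step argument can carry several such links. For illustration (the values here are invented), the effect of one link becomes the cause of the next:

# A two-link chain: increased_regulation -> higher_compliance_costs -> reduced_innovation_speed
chain = [
    CausalLink(
        cause="increased_regulation",
        effect="higher_compliance_costs",
        mechanism="New rules require audits and reporting",
        strength=0.8,
        bidirectional=False,
    ),
    CausalLink(
        cause="higher_compliance_costs",
        effect="reduced_innovation_speed",
        mechanism="Compliance overhead diverts resources",
        strength=0.7,
        bidirectional=False,
    ),
]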

Causal Evaluation

The evaluation system assesses four aspects of causal reasoning (see the sketch after this list):

  1. Completeness: Is the causal chain fully specified?
  2. Strength: How strong is each link?
  3. Evidence: Is there evidence supporting causation?
  4. Validity: Are there logical fallacies?
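
As a rough illustration only (the real scoring is model-driven and more nuanced), these four checks can be pictured as 0-1 sub-scores that combine into the causal score:

# Toy heuristic, not ARTEMIS's algorithm: average four 0-1 sub-scores.
def toy_causal_score(completeness: float, strength: float, evidence: float, validity: float) -> float:
    return (completeness + strength + evidence + validity) / 4

print(toy_causal_score(1.0, 0.75, 0.5, 0.75))  # 0.75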

Common Causal Fallacies Detected

| Fallacy | Description |
|---|---|
| Post Hoc | Assumes causation from sequence |
| Correlation | Treats correlation as causation |
| Single Cause | Ignores multiple factors |
| Slippery Slope | Assumes inevitable escalation |

Evaluation Flow

graph TD
    A[Receive Argument] --> B[Identify Context]
    B --> C[Adapt Criteria Weights]
    C --> D[Extract Causal Links]
    D --> E[Score Each Criterion]
    E --> F[Compute Causal Score]
    F --> G[Compute Weighted Total]
    G --> H[Return Evaluation]

Adaptive Weight Adjustment

When adaptation_enabled=True, weights are dynamically adjusted:

from artemis.core.types import DebateConfig

config = DebateConfig(
    adaptation_enabled=True,
    adaptation_rate=0.1,  # How fast weights adjust (0-0.5)
)

Adaptation Factors

Weights adapt based on the following factors (one possible update rule is sketched after the list):

  • Topic sensitivity: Higher ethical weight for sensitive topics
  • Topic complexity: Higher causal weight for complex topics
  • Round progress: Different expectations for opening vs. closing
  • Argument type: Evidence-heavy arguments get higher evidence weight
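
One plausible reading of adaptation_rate (an assumption for illustration, not the documented update rule) is as a step size that blends the current weights toward context-derived target weights:

# Illustrative update rule: move each weight a fraction `rate` of the way toward its target.
def adapt_weights(current: dict[str, float], target: dict[str, float], rate: float = 0.1) -> dict[str, float]:
    return {
        criterion: (1 - rate) * weight + rate * target[criterion]
        for criterion, weight in current.items()
    }

current = {"evidence_quality": 0.25, "ethical_alignment": 0.15}
target = {"evidence_quality": 0.35, "ethical_alignment": 0.10}
print(adapt_weights(current, target, rate=0.1))
# approximately {'evidence_quality': 0.26, 'ethical_alignment': 0.145}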

Using the Evaluator

The AdaptiveEvaluator runs internally during debates, but you can also call it directly:

from artemis.core.evaluation import AdaptiveEvaluator
from artemis.core.types import DebateContext

evaluator = AdaptiveEvaluator()

# Evaluate a single argument (`argument` and `debate_context` come from an existing debate)
evaluation = await evaluator.evaluate_argument(
    argument=argument,
    context=debate_context,
)

print(f"Total Score: {evaluation.total_score:.2f}")
print(f"Breakdown: {evaluation.scores}")

Integration with Jury

L-AE-CR provides scores to the jury mechanism:

from artemis.core.jury import JuryPanel

# Jury members receive evaluation results
panel = JuryPanel(evaluators=5, model="gpt-4o")

# Each jury member considers:
# - Evaluation scores from L-AE-CR
# - Their perspective-specific weights
# - Argument content and structure

Score Components

Logical Coherence Score

Evaluates internal consistency:

  • Premises support conclusion
  • No contradictions
  • Valid inference patterns

Evidence Quality Score

Evaluates supporting evidence:

  • Source credibility
  • Relevance to claims
  • Recency and accuracy
  • Diversity of sources

Causal Reasoning Score

Evaluates the rigor of causal claims:

  • Chain completeness
  • Link strength
  • Evidence for causation
  • Fallacy absence

Ethical Alignment Score

Evaluates ethical soundness:

  • Fairness of reasoning
  • Consideration of stakeholders
  • Avoidance of harmful claims
  • Respect for values

Persuasiveness Score

Evaluates rhetorical effectiveness:

  • Clarity of thesis
  • Effectiveness of rhetoric
  • Audience appropriateness
  • Counter-argument handling

Benefits of L-AE-CR

1. Context-Aware Evaluation

  • Adapts to topic domain automatically
  • Adjusts for round context
  • Considers argument type

2. Transparent Scoring

  • Clear criteria breakdown
  • Weighted contributions visible
  • Feedback explains scores

3. Causal Rigor

  • Validates causal claims
  • Detects logical fallacies
  • Measures chain strength

4. Fair Assessment

  • Multiple criteria considered
  • Weights prevent single-focus bias
  • Adaptation ensures relevance

Next Steps