Contributing to ARTEMIS

Thank you for your interest in contributing to ARTEMIS! This guide will help you get started.

Getting Started

Prerequisites

  • Python 3.10 or higher
  • Git
  • A GitHub account

Setting Up Development Environment

  1. Fork and clone the repository

git clone https://github.com/your-username/artemis-agents.git
cd artemis-agents

  2. Create a virtual environment

python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

  3. Install development dependencies

pip install -e ".[dev]"

  4. Set up environment variables

cp .env.example .env
# Edit .env with your API keys

  5. Run tests to verify setup

pytest tests/ -v

Development Workflow

Creating a Branch

git checkout -b feature/your-feature-name
# or: git checkout -b fix/your-bug-fix

Running Tests

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/unit/test_debate.py -v

# Run with coverage
pytest tests/ --cov=artemis --cov-report=html

Code Quality

# Type checking
mypy artemis/

# Linting
ruff check artemis/

# Formatting
ruff format artemis/

Pre-commit Checks

Before committing, run:

# Format code
ruff format artemis/

# Check for issues
ruff check artemis/

# Run type checks
mypy artemis/

# Run tests
pytest tests/

Code Guidelines

Type Hints

Use Python 3.10+ type hints everywhere:

# Good
def process_argument(
    argument: Argument,
    context: DebateContext | None = None,
) -> ProcessedArgument:
    ...

# Avoid
def process_argument(argument, context=None):
    ...

Pydantic Models

Use Pydantic v2 for data classes:

from pydantic import BaseModel, Field

# ArgumentLevel and Evidence are existing artemis types; import them
# from wherever they live in the codebase.
class Argument(BaseModel):
    content: str = Field(description="The argument content")
    level: ArgumentLevel
    evidence: list[Evidence] = Field(default_factory=list)
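
Note that Pydantic v2 renames the serialization helpers. A quick usage sketch (ArgumentLevel.STRATEGIC is an assumed enum member, not confirmed against the codebase):

# Construct and serialize; model_dump() replaces v1's .dict()
arg = Argument(content="Solar is cheapest long-run", level=ArgumentLevel.STRATEGIC)
data = arg.model_dump()

# Validate from a dict; model_validate() replaces v1's .parse_obj()
arg2 = Argument.model_validate(data)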

Async Code

Core debate logic should be async:

# Good
async def generate_argument(self, context: DebateContext) -> Argument:
    response = await self.model.generate(messages)
    return self.parse_argument(response)

# Avoid sync in core logic
def generate_argument(self, context: DebateContext) -> Argument:
    ...
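
To call async core logic from a synchronous script or a REPL experiment, use the standard-library entry point asyncio.run (a usage sketch; agent and context stand in for objects from the examples above):

import asyncio

# Run a single coroutine to completion from sync code.
argument = asyncio.run(agent.generate_argument(context))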

Docstrings

Use Google-style docstrings:

async def evaluate(
    self,
    argument: Argument,
    context: DebateContext,
) -> Evaluation:
    """Evaluate an argument using adaptive criteria.

    Args:
        argument: The argument to evaluate.
        context: The debate context for evaluation.

    Returns:
        Evaluation result with scores and feedback.

    Raises:
        EvaluationError: If evaluation fails.
    """
    ...

Error Handling

Use custom exceptions:

from artemis.exceptions import AgentError, ModelError  # ModelError's module assumed

try:
    result = await agent.generate_argument(context)
except ModelError as e:
    raise AgentError(f"Failed to generate argument: {e}") from e
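
When no existing exception fits, add one to the project's exception module. A minimal sketch, assuming DebateError is the base of the hierarchy (check artemis/exceptions.py for the actual layout); TranscriptError is a hypothetical name:

class TranscriptError(DebateError):
    """Raised when a debate transcript cannot be parsed or stored."""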

Adding New Features

New LLM Provider

  1. Create artemis/models/{provider}.py
  2. Implement BaseModel interface
  3. Add to artemis/models/__init__.py
  4. Add tests in tests/unit/models/test_{provider}.py
  5. Update documentation

# artemis/models/newprovider.py
from artemis.models.base import BaseModel

class NewProviderModel(BaseModel):
    async def generate(
        self,
        messages: list[Message],
        **kwargs,
    ) -> Response:
        ...

    async def generate_with_reasoning(
        self,
        messages: list[Message],
        thinking_budget: int = 8000,
    ) -> ReasoningResponse:
        ...
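
Step 3 usually amounts to re-exporting the new class. A sketch, assuming the package lists its public names in __all__ (check artemis/models/__init__.py for the actual pattern):

# artemis/models/__init__.py
from artemis.models.newprovider import NewProviderModel

__all__ = [
    # ...existing exports...
    "NewProviderModel",
]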

New Safety Monitor

  1. Create artemis/safety/{monitor}.py
  2. Implement SafetyMonitor interface
  3. Add tests with mock scenarios
  4. Document detection methodology

# artemis/safety/newmonitor.py
from artemis.safety.base import SafetyMonitor, SafetyResult

class NewMonitor(SafetyMonitor):
    async def analyze(
        self,
        turn: Turn,
        context: DebateContext,
    ) -> SafetyResult:
        ...

New Framework Integration

  1. Create artemis/integrations/{framework}.py
  2. Follow the framework's extension patterns (see the sketch after this list)
  3. Add an example in examples/
  4. Add integration tests
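
The usual shape is a thin wrapper that exposes a debate as whatever callable or tool abstraction the target framework expects. A sketch only, not an actual artemis API: the import path and the make_agents() helper are assumptions, while the Debate constructor matches the testing examples below.

# artemis/integrations/newframework.py -- illustrative sketch
from artemis import Debate  # import path assumed

async def debate_tool(topic: str, rounds: int = 2) -> str:
    """Expose an ARTEMIS debate as a plain async callable that a
    framework can register as a tool."""
    # make_agents() is a hypothetical helper that builds the debaters.
    debate = Debate(topic=topic, agents=make_agents(), rounds=rounds)
    result = await debate.run()
    return str(result.verdict)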

Commit Guidelines

Format

<type>(<scope>): <description>

[optional body]

[optional footer]

Types

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance

Examples

feat(core): implement H-L-DAG argument generation

Add hierarchical argument generation with three levels:
- Strategic: high-level thesis
- Tactical: supporting points
- Operational: specific evidence

fix(models): handle o1 reasoning token limits

The o1 model has specific token limits for reasoning.
This fix ensures we properly manage the thinking budget.

Fixes #123

Pull Request Process

  1. Update tests: Add tests for new features
  2. Update docs: Document new functionality
  3. Run checks: Ensure all tests pass
  4. Create PR: Use the PR template
  5. Address feedback: Respond to review comments

PR Template

## Description
Brief description of changes.

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Tests added
- [ ] All tests pass
- [ ] Manual testing done

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Testing Guidelines

Unit Tests

Test individual components in isolation:

import pytest
from unittest.mock import AsyncMock

# Agent, BaseModel, Response, and Usage are the artemis types used in
# this guide; async tests assume pytest-asyncio (or asyncio_mode = auto).

@pytest.fixture
def mock_model():
    model = AsyncMock(spec=BaseModel)
    model.generate.return_value = Response(
        content="Test argument",
        usage=Usage(prompt_tokens=100, completion_tokens=50),
    )
    return model

async def test_agent_generates_argument(mock_model):
    agent = Agent(name="test", role="Test agent", model=mock_model)
    argument = await agent.generate_argument(context)  # context: a DebateContext fixture
    assert argument.content == "Test argument"

Integration Tests

Test component interactions:

async def test_debate_runs_complete():
    debate = Debate(
        topic="Test topic",
        agents=agents,  # two agents, matching the transcript length below
        rounds=2,
    )
    result = await debate.run()
    assert result.verdict is not None
    assert len(result.transcript) == 4  # 2 rounds * 2 agents

Safety Tests

Verify monitor detection:

from artemis.safety import SandbagDetector, MonitorMode

async def test_detects_sandbagging():
    detector = SandbagDetector(
        mode=MonitorMode.PASSIVE,
        sensitivity=0.8,
    )

    # Create turns with declining capability
    turns = create_declining_capability_turns()

    results = []
    for turn in turns:
        result = await detector.process(turn, context)
        if result:
            results.append(result)

    assert any(r.severity > 0.7 for r in results)

Getting Help

  • Questions: Open a GitHub Discussion
  • Bugs: Open a GitHub Issue
  • Features: Open a GitHub Issue with the [Feature Request] prefix

Code of Conduct

Please be respectful and inclusive in all interactions. We follow the Contributor Covenant.

License

By contributing, you agree that your contributions will be licensed under the project's MIT License.