Integration Contract
Keep scorer modules narrow. They should grade an artifact, not own the entire rollout.
1. Package Layout
2. Result Type
Return a type the training or evaluation loop can consume directly.
src/agentix/swebench/__init__.py
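A result type of this kind might look like the sketch below. The class name `ScoreResult` and all of its fields are illustrative assumptions, not the package's actual definition; the point is only that the loop receives plain, structured data it can log or serialize without further parsing.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict


@dataclass(frozen=True)
class ScoreResult:
    """Hypothetical result type; every field name here is an assumption."""

    instance_id: str          # which benchmark instance was graded
    resolved: bool            # did the artifact pass the benchmark's checks?
    score: float              # numeric reward the training loop can consume
    logs: str = ""            # harness output, kept for later correlation
    details: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Plain dict form for JSON logging or sending across a process boundary.
        return asdict(self)


result = ScoreResult(instance_id="demo-1", resolved=True, score=1.0)
payload = result.to_dict()
```

Freezing the dataclass keeps a returned result from being mutated by the caller after the scorer has produced it.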
3. Scoring Function
src/agentix/swebench/__init__.py
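As a rough sketch of the shape such a function could take: the signature below, the `ScoreResult` fields, and the injected `run_harness` callable are all assumptions made for illustration. In a real scorer, `run_harness` would wrap the official benchmark evaluation code rather than reimplement it. The `instance_id` key is assumed to be present on the dataset row.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple


@dataclass(frozen=True)
class ScoreResult:
    # Hypothetical minimal result type, redefined here so the sketch is self-contained.
    instance_id: str
    resolved: bool
    logs: str


# Assumed harness shape: (dataset row, patch text) -> (resolved, logs).
Harness = Callable[[Mapping[str, object], str], Tuple[bool, str]]


def score(instance: Mapping[str, object], patch: str, run_harness: Harness) -> ScoreResult:
    """Grade one patch against one instance; no dataset loading, no agent calls."""
    instance_id = str(instance["instance_id"])
    if not patch.strip():
        # An empty patch cannot resolve anything, so skip the expensive harness run.
        return ScoreResult(instance_id, False, "empty patch")
    resolved, logs = run_harness(instance, patch)
    return ScoreResult(instance_id, bool(resolved), logs)


# Usage with a stand-in harness (a real one would run the benchmark's tests).
def fake_harness(instance, patch):
    return ("fix" in patch, "ran 1 test")


ok = score({"instance_id": "demo-1"}, "fix: handle None", fake_harness)
empty = score({"instance_id": "demo-2"}, "   ", fake_harness)
```

Passing the harness in as an argument is a design choice of this sketch; it keeps the function testable without a benchmark image.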
See agentix-cookbook/swebench for a full working scorer.
4. Packaging
pyproject.toml
Once agentix-swebench is installed, the worker can import agentix.swebench and call score.
5. Sandbox Requirements
Benchmark harnesses often assume system tools. SWE-bench commonly needs:
- bash, git, and curl on PATH
- Miniconda or benchmark-specific execution images
- Enough disk and timeout budget for dependency setup and tests
Provide these through default.nix, or a deployment backend that knows how to start benchmark-provided images.
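One way to catch a misconfigured sandbox early is a preflight check that runs before any scoring starts. The sketch below uses only the standard library; the function name `preflight` and the 20 GB default threshold are arbitrary assumptions, not values taken from SWE-bench.

```python
import shutil
from typing import List, Sequence

REQUIRED_TOOLS = ("bash", "git", "curl")


def preflight(
    path: str = ".",
    min_free_gb: float = 20.0,  # assumed threshold; tune per benchmark
    tools: Sequence[str] = REQUIRED_TOOLS,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sandbox looks usable."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing tool on PATH: {tool}")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free at {path}, need {min_free_gb:.1f} GB")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

Returning the problems as a list, rather than raising on the first one, lets the caller report every missing requirement in a single pass.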
Caller Flow
The caller owns the rollout choreography. This keeps every part swappable: dataset row, repo setup, agent, patch extraction, scorer.
Design Rules
- A scorer grades one artifact and returns structured data.
- The caller chooses the dataset split and instance.
- The caller prepares repos and extracts patches with generic primitives.
- The scorer should reuse official benchmark code whenever possible.
- Logs should be returned or persisted in a way the rollout system can correlate with the call.
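The caller flow and the rules above can be sketched as a single function that takes each stage as a parameter. Everything here is hypothetical: `run_rollout`, the `Rollout` record, and the four stage callables are names invented for illustration, and the `call_id` is just one possible way to let the rollout system correlate logs with a call.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class Rollout:
    call_id: str       # correlation key for logs and persisted artifacts
    instance_id: str
    result: Any        # whatever structured data the scorer returned


def run_rollout(
    row: Mapping[str, Any],
    setup_repo: Callable[[Mapping[str, Any]], str],
    run_agent: Callable[[Mapping[str, Any], str], None],
    extract_patch: Callable[[str], str],
    score: Callable[[Mapping[str, Any], str], Any],
) -> Rollout:
    """Caller-owned choreography; each stage is swappable independently."""
    call_id = uuid.uuid4().hex
    workdir = setup_repo(row)          # caller prepares the repository
    run_agent(row, workdir)            # agent edits files in the working directory
    patch = extract_patch(workdir)     # generic primitive, e.g. a diff of the tree
    result = score(row, patch)         # scorer grades exactly one artifact
    return Rollout(call_id, str(row["instance_id"]), result)


# Usage with stand-in stages.
visited = []
rollout = run_rollout(
    {"instance_id": "demo-1"},
    setup_repo=lambda row: "/tmp/demo-1",
    run_agent=lambda row, workdir: visited.append(workdir),
    extract_patch=lambda workdir: "diff --git a/x b/x",
    score=lambda row, patch: {"resolved": patch.startswith("diff")},
)
```

Because the scorer only ever sees the row and the patch, swapping the agent or the repo setup does not require touching the scorer module.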