A dataset integration turns a benchmark harness into a remote scoring function. The agent does the work, the scorer judges the result, and the caller composes both pieces in one rollout.
from agentix.swebench import score

report = await client.remote(score, instance=inst, patch=model_patch)
This guide uses SWE-bench Verified as the example, but the same shape works for internal evals, MLE-Bench-style tasks, or any harness that can run inside the sandbox.

Integration Contract

Keep scorer modules narrow. They should grade an artifact, not own the entire rollout.
async def score(
    *,
    instance: dict[str, Any],
    patch: str,
    setup_timeout: float = 1800,
    eval_timeout: float = 1800,
) -> Score:
    ...
Dataset enumeration, repo checkout, agent execution, and patch extraction stay on the caller side. That separation lets one scorer work with many agents.

1. Package Layout

agentix-swebench/
├── pyproject.toml
└── src/agentix/
    └── swebench/
        └── __init__.py

2. Result Type

Return a type the training or evaluation loop can consume directly.
src/agentix/swebench/__init__.py
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Score:
    resolved: bool
    patch_applied: bool
    fail_to_pass_resolved: list[str]
    fail_to_pass_missing: list[str]
    pass_to_pass_kept: list[str]
    pass_to_pass_broken: list[str]
    logs: str
The type can be a dataclass or a Pydantic model, as long as Agentix can serialize it through the shared codec.
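A quick way to check that a result type is serialization-friendly is to round-trip it through a plain encoding. The sketch below uses JSON purely as a stand-in; the actual Agentix codec is not described here, so `roundtrip` is a hypothetical helper and not part of the Agentix API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Score:
    resolved: bool
    patch_applied: bool
    fail_to_pass_resolved: list[str]
    fail_to_pass_missing: list[str]
    pass_to_pass_kept: list[str]
    pass_to_pass_broken: list[str]
    logs: str


def roundtrip(score: Score) -> Score:
    """Encode to JSON and back. JSON is an assumption standing in for the real codec."""
    payload = json.dumps(asdict(score))
    return Score(**json.loads(payload))


original = Score(
    resolved=False,
    patch_applied=True,
    fail_to_pass_resolved=["tests/test_a.py::test_fixed"],
    fail_to_pass_missing=["tests/test_a.py::test_still_failing"],
    pass_to_pass_kept=[],
    pass_to_pass_broken=[],
    logs="",
)
restored = roundtrip(original)
```

Because every field is a bool, a string, or a list of strings, the restored value compares equal to the original. Fields such as open file handles or custom objects would not survive this kind of round trip.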

3. Scoring Function

src/agentix/swebench/__init__.py
# Same module as Score above, so the __future__ import at the top of the file already applies.

from typing import Any


async def score(
    *,
    instance: dict[str, Any],
    patch: str,
    setup_timeout: float = 1800,
    eval_timeout: float = 1800,
) -> Score:
    """Run the official SWE-bench evaluation against `patch`."""
    from swebench.harness.grading import get_eval_report
    from swebench.harness.test_spec.test_spec import make_test_spec

    spec = make_test_spec(instance)

    # Set up the per-instance environment, apply `patch`, apply the
    # benchmark's test patch, run the eval script, and parse the logs.
    # Then convert the official report into the Score dataclass.
    report = get_eval_report(...)
    return _to_score(report)
The complete implementation is intentionally harness-specific. For SWE-bench, lean on the official package for test specs, log parsing, and grading instead of reimplementing benchmark rules. See agentix-cookbook/swebench for a full working scorer.

4. Packaging

pyproject.toml
[project]
name = "agentix-swebench"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["swebench>=4.0"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/agentix"]
When a rollout bundle depends on agentix-swebench, the worker can import agentix.swebench and call score.

5. Sandbox Requirements

Benchmark harnesses often assume that specific system tools are installed. SWE-bench commonly needs:
  • bash, git, and curl on PATH
  • Miniconda or benchmark-specific execution images
  • Enough disk space and timeout budget for dependency setup and test runs
Put those requirements in the bundle’s base image, a default.nix, or a deployment backend that knows how to start benchmark-provided images.
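A preflight check can surface a missing requirement before an expensive evaluation starts. The sketch below uses only the standard library; the helper names and the 20 GB default are hypothetical and should be tuned to the benchmark.

```python
import shutil

# Tools the list above says SWE-bench commonly needs on PATH.
REQUIRED_TOOLS = ("bash", "git", "curl")


def missing_tools(tools: tuple[str, ...] = REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_free_disk(path: str = ".", min_free_gb: float = 20.0) -> bool:
    """Check free space on the filesystem holding `path` (threshold is an assumption)."""
    return shutil.disk_usage(path).free >= min_free_gb * 1024**3
```

For the result to mean anything, the check has to run inside the sandbox, for example at the start of the scorer, where it can fail fast with a clear message instead of a half-finished environment setup.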

Caller Flow

The caller owns the rollout choreography. This keeps every part swappable: dataset row, repo setup, agent, patch extraction, scorer.
from datasets import load_dataset
from agentix import RuntimeClient
from agentix.bash import run as bash_run
from agentix.claude_code import run as run_claude
from agentix.swebench import score


# `sandbox` is an already-started sandbox and `api_key` an Anthropic API key;
# both come from the caller's own setup. Run with asyncio.run(rollout(...)).
async def rollout(sandbox, api_key: str) -> None:
    inst = dict(load_dataset("princeton-nlp/SWE-bench_Verified", split="test")[0])

    async with RuntimeClient(sandbox.runtime_url) as client:
        # Check out the repository at the instance's base commit.
        await client.remote(
            bash_run,
            command=(
                "rm -rf /testbed && "
                f"git clone https://github.com/{inst['repo']}.git /testbed && "
                f"cd /testbed && git checkout {inst['base_commit']}"
            ),
        )

        # Let the agent work on the problem statement inside the checkout.
        await client.remote(
            run_claude,
            instruction=inst["problem_statement"],
            workdir="/testbed",
            env={"ANTHROPIC_API_KEY": api_key},
        )

        # Stage everything, including new files, and capture the diff as the patch.
        diff = await client.remote(
            bash_run,
            command="cd /testbed && git add -A && git diff --cached --no-color",
        )
        report = await client.remote(score, instance=inst, patch=diff.stdout)

        print("PASS" if report.resolved else "FAIL")
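The checkout step interpolates dataset fields straight into a shell string. For internal evals where those fields are less trusted, quoting them first is a cheap safeguard. The helper below is a hypothetical sketch built on the standard library's `shlex.quote`; it is not part of Agentix.

```python
import shlex


def checkout_command(repo: str, base_commit: str, workdir: str = "/testbed") -> str:
    """Build the clone-and-checkout command with every interpolated value quoted."""
    url = f"https://github.com/{repo}.git"
    steps = [
        f"rm -rf {shlex.quote(workdir)}",
        f"git clone {shlex.quote(url)} {shlex.quote(workdir)}",
        f"cd {shlex.quote(workdir)}",
        f"git checkout {shlex.quote(base_commit)}",
    ]
    return " && ".join(steps)


command = checkout_command("astropy/astropy", "abc123")
```

Ordinary values pass through unchanged, so the command matches the inline version for typical rows, while a value containing spaces or shell metacharacters is wrapped in single quotes and treated as one argument.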

Design Rules

  • A scorer grades one artifact and returns structured data.
  • The caller chooses the dataset split and instance.
  • The caller prepares repos and extracts patches with generic primitives.
  • The scorer should reuse official benchmark code whenever possible.
  • Logs should be returned or persisted in a way the rollout system can correlate with the call.
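As one illustration of the last rule, logs can be written under a path derived from the instance id and a per-call id, so the rollout system can find them later. This is a sketch under assumptions: `persist_logs`, the directory layout, and the call id format are all hypothetical.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def persist_logs(
    logs: str,
    instance_id: str,
    root: Path,
    call_id: Optional[str] = None,
) -> Path:
    """Write logs to <root>/<instance_id>/<call_id>.log and return the path."""
    # Generate a random id when the caller does not supply one.
    call_id = call_id or uuid.uuid4().hex
    path = root / instance_id / f"{call_id}.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(logs, encoding="utf-8")
    return path


# Example with a temporary directory and made-up identifiers.
root = Path(tempfile.mkdtemp())
log_path = persist_logs("eval log text", "example__repo-1", root, call_id="call-0001")
```

Returning the path alongside, or instead of, the raw log text keeps the `Score` payload small when evaluation logs are large.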