Integrate a dataset

A dataset integration turns a benchmark harness into a remote scoring function. The agent does the work, the scorer judges the result, and the caller composes both pieces in one rollout.

from agentix.swebench import score

report = await client.remote(score, instance=inst, patch=model_patch)

This guide uses SWE-bench Verified as the example, but the same shape works for internal evals, MLE-Bench-style tasks, or any harness that can run inside the sandbox.

Integration Contract

Keep scorer modules narrow. They should grade an artifact, not own the entire rollout.

async def score(
    *,
    instance: dict[str, Any],
    patch: str,
    setup_timeout: float = 1800,
    eval_timeout: float = 1800,
) -> Score:
    ...

Dataset enumeration, repo checkout, agent execution, and patch extraction stay on the caller side. That separation lets one scorer work with many agents.

1. Package Layout

agentix-swebench/
├── pyproject.toml
└── src/agentix/
    └── swebench/
        └── __init__.py

2. Result Type

Return a type the training or evaluation loop can consume directly.

src/agentix/swebench/__init__.py

from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Score:
    resolved: bool
    patch_applied: bool
    fail_to_pass_resolved: list[str]
    fail_to_pass_missing: list[str]
    pass_to_pass_kept: list[str]
    pass_to_pass_broken: list[str]
    logs: str

The type can be a dataclass or Pydantic model as long as Agentix can serialize it through the shared codec.
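
The shared codec itself is not shown in this guide, so the sketch below uses JSON as a stand-in to check that a result type round-trips cleanly. The roundtrip helper is hypothetical; the point is only that a Score made of plain bools, strings, and lists survives serialization unchanged.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Score:
    resolved: bool
    patch_applied: bool
    fail_to_pass_resolved: list[str]
    fail_to_pass_missing: list[str]
    pass_to_pass_kept: list[str]
    pass_to_pass_broken: list[str]
    logs: str


def roundtrip(score: Score) -> Score:
    """Serialize to JSON and back (a stand-in for the real codec)."""
    payload = json.dumps(asdict(score))
    return Score(**json.loads(payload))


if __name__ == "__main__":
    original = Score(
        resolved=False,
        patch_applied=True,
        fail_to_pass_resolved=["test_a"],
        fail_to_pass_missing=["test_b"],
        pass_to_pass_kept=["test_c"],
        pass_to_pass_broken=[],
        logs="1 failed, 2 passed",
    )
    assert roundtrip(original) == original
```

Fields that hold non-serializable objects (open file handles, exceptions, harness-internal classes) would fail a check like this, which is a reason to keep the result type flat.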

3. Scoring Function

src/agentix/swebench/__init__.py (continued, below the Score dataclass)

from typing import Any


async def score(
    *,
    instance: dict[str, Any],
    patch: str,
    setup_timeout: float = 1800,
    eval_timeout: float = 1800,
) -> Score:
    """Run the official SWE-bench evaluation against `patch`."""
    from swebench.harness.grading import get_eval_report
    from swebench.harness.test_spec.test_spec import make_test_spec

    spec = make_test_spec(instance)

    # Set up the per-instance environment, apply `patch`, apply the
    # benchmark's test patch, run the eval script, and parse the logs.
    # Then convert the official report into the Score dataclass.
    report = get_eval_report(...)
    return _to_score(report)

The body above is a skeleton because the complete implementation is harness-specific. For SWE-bench, lean on the official package for test specs, log parsing, and grading instead of reimplementing benchmark rules. See agentix-cookbook/swebench for a full working scorer.

4. Packaging

pyproject.toml

[project]
name = "agentix-swebench"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["swebench>=4.0"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/agentix"]

When a rollout bundle depends on agentix-swebench, the worker can import agentix.swebench and call score.

5. Sandbox Requirements

Benchmark harnesses often assume system tools. SWE-bench commonly needs:

bash, git, and curl on PATH
Miniconda or benchmark-specific execution images
Enough disk and timeout budget for dependency setup and tests

Put those requirements in the bundle’s base image, a default.nix, or a deployment backend that knows how to start benchmark-provided images.
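
A scorer can fail fast when the sandbox is missing something. The sketch below is one way to run such a preflight check with the standard library; the function names and the tool list constant are hypothetical, and the disk threshold you compare against is for you to choose.

```python
import shutil

# Tools named in the requirements above.
REQUIRED_TOOLS = ("bash", "git", "curl")


def missing_tools(tools: tuple[str, ...] = REQUIRED_TOOLS) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def free_disk_gib(path: str = "/") -> float:
    """Return free disk space at `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit(f"sandbox is missing required tools: {', '.join(missing)}")
    print(f"free disk: {free_disk_gib():.1f} GiB")
```

Running the check at the start of score turns a confusing mid-evaluation failure into an immediate, readable error.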

Caller Flow

The caller owns the rollout choreography. This keeps every part swappable: dataset row, repo setup, agent, patch extraction, scorer.

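from datasets import load_dataset
from agentix import RuntimeClient
from agentix.bash import run as bash_run
from agentix.claude_code import run as run_claude
from agentix.swebench import score

# Pick one instance; the caller decides the split and the row.
inst = dict(load_dataset("princeton-nlp/SWE-bench_Verified", split="test")[0])

# `sandbox` (an already-provisioned sandbox) and `api_key` come from the
# caller's own setup and are not defined in this snippet. Run it inside
# a coroutine or another async context.
async with RuntimeClient(sandbox.runtime_url) as client:
    # 1. Check out the repository at the benchmark's base commit.
    await client.remote(
        bash_run,
        command=(
            "rm -rf /testbed && "
            f"git clone https://github.com/{inst['repo']}.git /testbed && "
            f"cd /testbed && git checkout {inst['base_commit']}"
        ),
    )

    # 2. Let the agent work on the issue inside the checkout.
    await client.remote(
        run_claude,
        instruction=inst["problem_statement"],
        workdir="/testbed",
        env={"ANTHROPIC_API_KEY": api_key},
    )

    # 3. Stage everything so new files are included, then capture the
    #    diff as the candidate patch.
    diff = await client.remote(
        bash_run,
        command="cd /testbed && git add -A && git diff --cached --no-color",
    )

    # 4. Hand the patch to the scorer.
    report = await client.remote(score, instance=inst, patch=diff.stdout)

    print("PASS" if report.resolved else "FAIL")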
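
An agent can finish without modifying any files, in which case the captured diff is empty and a full evaluation run has nothing to grade. A caller might guard against that before invoking the scorer. The helper below is a hypothetical sketch, not part of Agentix; it only checks for the header line that git writes at the start of each file diff.

```python
def has_changes(patch: str) -> bool:
    """Return True if `patch` contains at least one git file diff."""
    return any(line.startswith("diff --git ") for line in patch.splitlines())


if __name__ == "__main__":
    sample = (
        "diff --git a/pkg/mod.py b/pkg/mod.py\n"
        "--- a/pkg/mod.py\n"
        "+++ b/pkg/mod.py\n"
        "@@ -1 +1 @@\n"
        "-x = 1\n"
        "+x = 2\n"
    )
    print(has_changes(sample))
    print(has_changes(""))
```

Whether an empty patch should be recorded as a failure or skipped entirely is a choice for the caller, which is consistent with keeping rollout decisions out of the scorer.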

Design Rules

A scorer grades one artifact and returns structured data.
The caller chooses the dataset split and instance.
The caller prepares repos and extracts patches with generic primitives.
The scorer should reuse official benchmark code whenever possible.
Logs should be returned or persisted in a way the rollout system can correlate with the call.
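
The same rules apply to harnesses much smaller than SWE-bench. As an illustration only, here is a sketch of a scorer for a hypothetical internal exact-match eval; the ExactMatchScore type and the "expected" field on the instance are invented for this example.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class ExactMatchScore:
    correct: bool
    expected: str
    got: str


async def score(*, instance: dict[str, Any], answer: str) -> ExactMatchScore:
    """Grade one artifact (`answer`) against one dataset row (`instance`)."""
    expected = str(instance["expected"]).strip()
    got = answer.strip()
    return ExactMatchScore(correct=(got == expected), expected=expected, got=got)


if __name__ == "__main__":
    result = asyncio.run(score(instance={"expected": "42"}, answer=" 42\n"))
    print(result)
```

The function takes keyword-only arguments, knows nothing about how the answer was produced, and returns structured data, so any agent that yields a string could be paired with it.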
