
Evals — score your agent as you debug it

Agent evals work like pytest: write functions named eval_*, and the framework discovers and runs them. The difference from tests: evals return scores (0–1) rather than a bare pass/fail, because agent output quality is a spectrum.

# eval_my_agent.py — discovered automatically by eval_discover()

def eval_tests_passed(ctx):
    """Did the agent get tests to pass?"""
    for s in reversed(ctx.steps):
        if s.tool_call.tool == "bash" and "pytest" in s.tool_call.args.get("command", ""):
            return s.tool_result.data.get("exit_code") == 0
    return False

def eval_efficiency(ctx):
    """Score 0-1: fewer steps = better."""
    return min(5 / max(ctx.step_count, 1), 1.0)

def eval_ioc_quality(ctx):
    """Return multiple metrics at once."""
    return {"precision": 0.9, "recall": 0.75, "f1": 0.82}

def eval_reasoning_gaps(ctx, llm):
    """LLM-as-judge: are conclusions supported by data?"""
    resp = llm.generate(f"Score 0-1: {ctx.final_output} supported by {ctx.session_log_text}?")
    return float(resp.strip())  # assumes the judge replies with a bare number

Return anything: float, bool, str, dict, or EvalResult. The framework normalises. If your function takes an llm parameter, the framework passes the judge LLM automatically.
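The exact normalisation rules aren't spelled out here, but a plausible reading is: bools map to 0/1, numbers are clamped to [0, 1], and dicts are normalised per metric. A minimal sketch of that behaviour (normalize_score is a hypothetical helper, not looplet's code; str and EvalResult handling are omitted):

```python
def normalize_score(value):
    """Hypothetical sketch of how a framework might normalise eval returns."""
    if isinstance(value, bool):               # check bool first: bool is an int subclass
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return max(0.0, min(1.0, float(value)))   # clamp into [0, 1]
    if isinstance(value, dict):
        return {k: normalize_score(v) for k, v in value.items()}
    raise TypeError(f"unsupported eval return type: {type(value).__name__}")
```

Checking bool before float matters because isinstance(True, int) is True in Python; without it, booleans would fall through to the numeric branch (harmlessly here, but the intent is clearer).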

Attach to your loop

For live scoring during development:

from looplet import EvalHook

hook = EvalHook(
    evaluators=[eval_tests_passed, eval_efficiency],
    verbose=True,   # prints scores after each run
)
for step in composable_loop(..., hooks=[hook]):
    ...
print(hook.summary())          # "2 scored (avg 0.90)"
hook.save("evals/run_1.json")
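Under the hood, a hook like this only needs the evaluator list and the finished run context. A minimal stand-in (SimpleEvalHook is hypothetical, for illustration only, not looplet's EvalHook) shows the shape:

```python
class SimpleEvalHook:
    """Hypothetical stand-in for EvalHook: scores a finished run."""

    def __init__(self, evaluators, verbose=False):
        self.evaluators = evaluators
        self.verbose = verbose
        self.scores = {}

    def on_run_end(self, ctx):
        # Run every evaluator against the finished context and record a float score.
        for fn in self.evaluators:
            score = float(fn(ctx))
            self.scores[fn.__name__] = score
            if self.verbose:
                print(f"{fn.__name__}: {score:.2f}")

    def summary(self):
        avg = sum(self.scores.values()) / len(self.scores)
        return f"{len(self.scores)} scored (avg {avg:.2f})"
```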

Discover and batch-run across saved trajectories

from looplet import eval_discover, eval_run, EvalContext

evals = eval_discover("eval_my_agent.py")       # finds all eval_* functions
ctx = EvalContext.from_trajectory_dir("traces/run_1/")
results = eval_run(evals, ctx, judge_llm=my_judge)
for r in results:
    print(r.pretty())

The workflow: debug a run → notice a failure pattern → write a 5-line eval_* function → it runs automatically on every future run. Your debugging becomes your eval suite.

Discovery scope. eval_discover only collects functions defined in each eval_*.py file. Re-exports like from looplet import eval_mark are filtered out, so you can freely import decorators and helpers without them accidentally being run as evaluators.
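That filtering can be implemented by comparing each function's __module__ against the module being scanned; re-exported functions keep the __module__ of their defining module and drop out. A rough sketch (discover_evals is illustrative, not looplet's actual eval_discover):

```python
import importlib.util
import inspect

def discover_evals(path):
    """Collect eval_* functions defined in `path` itself, skipping re-exports."""
    spec = importlib.util.spec_from_file_location("eval_module", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return [
        fn for name, fn in inspect.getmembers(mod, inspect.isfunction)
        if name.startswith("eval_") and fn.__module__ == mod.__name__
    ]
```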

Distinguish "done" from hook-triggered early stops

Hooks that terminate the loop early (budget caps, source counters, timeouts, quality gates) leave the agent without a done() call in the trajectory. Evals should dispatch on ctx.stop_reason:

def eval_completed_normally(ctx):
    """Agent called done() itself (not stopped by a hook)."""
    return ctx.completed          # shorthand for ctx.stop_reason == "done"

def eval_stopped_within_budget(ctx):
    """Either finished normally OR stopped by the budget hook (both are fine)."""
    return ctx.stop_reason in {"done", "budget_exceeded"}

def eval_not_hit_timeout(ctx):
    return ctx.stop_reason != "timeout"
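If different stop reasons deserve partial credit rather than a hard pass/fail, one pattern is a lookup table over ctx.stop_reason (the score values here are illustrative, not recommendations):

```python
# Partial-credit scoring keyed on stop reason; the table is illustrative.
STOP_SCORES = {
    "done": 1.0,             # agent finished on its own
    "budget_exceeded": 0.5,  # acceptable, but it never wrapped up
    "timeout": 0.0,          # worst case: the run hung
}

def eval_stop_quality(ctx):
    # Unknown hook stops (e.g. "hook_stop") get a low default.
    return STOP_SCORES.get(ctx.stop_reason, 0.25)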

stop_reason is populated from both live EvalHook runs (read from state) and saved trajectories (read from trajectory.json). Hooks should pass a meaningful label when they stop the loop:

from looplet import HookDecision

class BudgetCap:
    def __init__(self, budget):
        self.budget = budget
        self.tokens = 0            # updated as the loop consumes tokens

    def should_stop(self, state, step_num, new_entities):
        if self.tokens > self.budget:
            return HookDecision(stop="budget_exceeded")   # shows up as ctx.stop_reason
        return False

Returning a plain True from should_stop is still supported; it records stop_reason="hook_stop".
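The same pattern works for any early-stop condition. A wall-clock timeout hook might look like this (a sketch assuming the should_stop signature shown above; the try/except fallback just lets the snippet run without looplet installed):

```python
import time

try:
    from looplet import HookDecision
except ImportError:                     # stand-in so the sketch runs standalone
    class HookDecision:
        def __init__(self, stop=None):
            self.stop = stop

class WallClockTimeout:
    """Stops the loop once `limit_s` seconds have elapsed (sketch)."""

    def __init__(self, limit_s):
        self.limit_s = limit_s
        self.started = time.monotonic()

    def should_stop(self, state, step_num, new_entities):
        if time.monotonic() - self.started > self.limit_s:
            return HookDecision(stop="timeout")   # evals see ctx.stop_reason == "timeout"
        return False
```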

Tag evals with marks for filtering

from looplet import eval_mark

@eval_mark("verdict", "fast")
def eval_verdict_correct(ctx): ...

@eval_mark("ioc", "slow")
def eval_ioc_quality(ctx, llm): ...

# Run only "verdict" evals:
results = eval_run(evals, ctx, include=["verdict"])

# Skip "slow" evals in CI:
results = eval_run(evals, ctx, exclude=["slow"])
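A mark decorator of this shape is just metadata attached to the function; filtering then intersects each evaluator's marks with include/exclude. A rough equivalent (the _eval_marks attribute name and the select helper are assumptions, not looplet internals):

```python
def eval_mark(*marks):
    """Attach filter tags to an evaluator (sketch of the decorator shape)."""
    def wrap(fn):
        fn._eval_marks = set(marks)   # attribute name is a guess, not looplet's
        return fn
    return wrap

def select(evals, include=None, exclude=None):
    """Filter evaluators by marks, mirroring eval_run's include/exclude."""
    out = []
    for fn in evals:
        marks = getattr(fn, "_eval_marks", set())
        if include and not marks & set(include):
            continue                  # no requested mark -> skip
        if exclude and marks & set(exclude):
            continue                  # carries an excluded mark -> skip
        out.append(fn)
    return out
```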

Batch-run across multiple trajectories

from looplet import eval_run_batch

contexts = [EvalContext.from_trajectory_dir(d) for d in trace_dirs]
table = eval_run_batch(evals, contexts)
for row in table:
    print(f"{row['name']:30s} avg={row['avg_score']:.2f}")
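The aggregation itself is simple. If you only have raw per-run score dicts, the same table can be built in plain Python (a sketch; the row field names mirror the loop above):

```python
from statistics import mean

def aggregate(runs):
    """runs: list of {eval_name: score} dicts, one per trajectory."""
    names = sorted({name for run in runs for name in run})
    table = []
    for name in names:
        scores = [run[name] for run in runs if name in run]
        table.append({
            "name": name,
            "avg_score": mean(scores),
            "min_score": min(scores),
            "max_score": max(scores),
        })
    return table
```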

CLI runner for CI

Like pytest with exit codes:

looplet eval traces/ --evals eval_agent.py --threshold 0.7 -v
  ✓ eval_verdict_correct           avg=1.00  min=1.00  max=1.00  (5 runs)
  ✗ eval_ioc_quality               avg=0.42  min=0.20  max=0.80  (5 runs)
  ✓ eval_no_tool_errors            avg=1.00  min=1.00  max=1.00  (5 runs)

  overall: 0.81
  threshold: 0.70  → PASS