
Evals — score your agent as you debug it

Agent evals work like pytest: write functions named eval_*, and the framework discovers and runs them. The difference from tests: evals return scores (0–1) rather than a bare pass/fail, because agent output quality is a spectrum.

# eval_my_agent.py — discovered automatically by eval_discover()

def eval_tests_passed(ctx):
    """Did the agent get tests to pass?"""
    for s in reversed(ctx.steps):
        if s.tool_call.tool == "bash" and "pytest" in s.tool_call.args.get("command", ""):
            return s.tool_result.data.get("exit_code") == 0
    return False

def eval_efficiency(ctx):
    """Score 0-1: fewer steps = better."""
    return min(5 / max(ctx.step_count, 1), 1.0)

def eval_ioc_quality(ctx):
    """Return multiple metrics at once."""
    return {"precision": 0.9, "recall": 0.75, "f1": 0.82}

def eval_reasoning_gaps(ctx, llm):
    """LLM-as-judge: are conclusions supported by data?"""
    resp = llm.generate(f"Score 0-1: {ctx.final_output} supported by {ctx.session_log_text}?")
    return float(resp.strip())  # assumes the judge replies with a bare number

Return anything: float, bool, str, dict, or EvalResult. The framework normalises. If your function takes an llm parameter, the framework passes the judge LLM automatically.
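The exact normalisation rules aren't spelled out here, but a plausible reading is: bools map to 0/1, numbers are clamped to [0, 1], and dicts are normalised per metric. A minimal sketch of that behaviour (normalize_score is a hypothetical helper, not looplet's code; str and EvalResult handling are omitted):

```python
def normalize_score(value):
    """Hypothetical sketch of how a framework might normalise eval returns."""
    if isinstance(value, bool):               # check bool first: bool is an int subclass
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return max(0.0, min(1.0, float(value)))   # clamp into [0, 1]
    if isinstance(value, dict):
        return {k: normalize_score(v) for k, v in value.items()}
    raise TypeError(f"unsupported eval return type: {type(value).__name__}")
```

Checking bool before float matters because isinstance(True, int) is True in Python; without it, booleans would fall through to the numeric branch (harmlessly here, but the intent is clearer).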

Attach to your loop

For live scoring during development:

from looplet import EvalHook

hook = EvalHook(
    evaluators=[eval_tests_passed, eval_efficiency],
    verbose=True,   # prints scores after each run
)
for step in composable_loop(..., hooks=[hook]):
    ...
print(hook.summary())          # "2 scored (avg 0.90)"
hook.save("evals/run_1.json")
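Under the hood, a hook like this only needs the evaluator list and the finished run context. A minimal stand-in (SimpleEvalHook is hypothetical, for illustration only, not looplet's EvalHook) shows the shape:

```python
class SimpleEvalHook:
    """Hypothetical stand-in for EvalHook: scores a finished run."""

    def __init__(self, evaluators, verbose=False):
        self.evaluators = evaluators
        self.verbose = verbose
        self.scores = {}

    def on_run_end(self, ctx):
        # Run every evaluator against the finished context and record a float score.
        for fn in self.evaluators:
            score = float(fn(ctx))
            self.scores[fn.__name__] = score
            if self.verbose:
                print(f"{fn.__name__}: {score:.2f}")

    def summary(self):
        avg = sum(self.scores.values()) / len(self.scores)
        return f"{len(self.scores)} scored (avg {avg:.2f})"
```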

Discover and batch-run across saved trajectories

from looplet import eval_discover, eval_run, EvalContext

evals = eval_discover("eval_my_agent.py")       # finds all eval_* functions
ctx = EvalContext.from_trajectory_dir("traces/run_1/")
results = eval_run(evals, ctx, judge_llm=my_judge)
for r in results:
    print(r.pretty())

The workflow: debug a run → notice a failure pattern → write a 5-line eval_* function → it runs automatically on every future run. Your debugging becomes your eval suite.

Discovery scope. eval_discover only collects functions defined in each eval_*.py file. Re-exports like from looplet import eval_mark are filtered out, so you can freely import decorators and helpers without them accidentally being run as evaluators.
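That filtering can be implemented by comparing each function's __module__ against the module being scanned; re-exported functions keep the __module__ of their defining module and drop out. A rough sketch (discover_evals is illustrative, not looplet's actual eval_discover):

```python
import importlib.util
import inspect

def discover_evals(path):
    """Collect eval_* functions defined in `path` itself, skipping re-exports."""
    spec = importlib.util.spec_from_file_location("eval_module", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return [
        fn for name, fn in inspect.getmembers(mod, inspect.isfunction)
        if name.startswith("eval_") and fn.__module__ == mod.__name__
    ]
```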

Distinguish "done" from hook-triggered early stops

Hooks that terminate the loop early (budget caps, source counters, timeouts, quality gates) leave the agent without a done() call in the trajectory. Evals should dispatch on ctx.stop_reason:

def eval_completed_normally(ctx):
    """Agent called done() itself (not stopped by a hook)."""
    return ctx.completed          # shorthand for ctx.stop_reason == "done"

def eval_stopped_within_budget(ctx):
    """Either finished normally OR stopped by the budget hook (both are fine)."""
    return ctx.stop_reason in {"done", "budget_exceeded"}

def eval_not_hit_timeout(ctx):
    return ctx.stop_reason != "timeout"
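If different stop reasons deserve partial credit rather than a hard pass/fail, one pattern is a lookup table over ctx.stop_reason (the score values here are illustrative, not recommendations):

```python
# Partial-credit scoring keyed on stop reason; the table is illustrative.
STOP_SCORES = {
    "done": 1.0,             # agent finished on its own
    "budget_exceeded": 0.5,  # acceptable, but it never wrapped up
    "timeout": 0.0,          # worst case: the run hung
}

def eval_stop_quality(ctx):
    # Unknown hook stops (e.g. "hook_stop") get a low default.
    return STOP_SCORES.get(ctx.stop_reason, 0.25)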

stop_reason is populated from both live EvalHook runs (read from state) and saved trajectories (read from trajectory.json). Hooks should pass a meaningful label when they stop the loop:

from looplet import HookDecision

class BudgetCap:
    def __init__(self, budget):
        self.budget = budget
        self.tokens = 0            # updated as the loop consumes tokens

    def should_stop(self, state, step_num, new_entities):
        if self.tokens > self.budget:
            return HookDecision(stop="budget_exceeded")   # shows up as ctx.stop_reason
        return False

Returning a plain True from should_stop is still supported; it records stop_reason="hook_stop".
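The same pattern works for any early-stop condition. A wall-clock timeout hook might look like this (a sketch assuming the should_stop signature shown above; the try/except fallback just lets the snippet run without looplet installed):

```python
import time

try:
    from looplet import HookDecision
except ImportError:                     # stand-in so the sketch runs standalone
    class HookDecision:
        def __init__(self, stop=None):
            self.stop = stop

class WallClockTimeout:
    """Stops the loop once `limit_s` seconds have elapsed (sketch)."""

    def __init__(self, limit_s):
        self.limit_s = limit_s
        self.started = time.monotonic()

    def should_stop(self, state, step_num, new_entities):
        if time.monotonic() - self.started > self.limit_s:
            return HookDecision(stop="timeout")   # evals see ctx.stop_reason == "timeout"
        return False
```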

Tag evals with marks for filtering

from looplet import eval_mark

@eval_mark("verdict", "fast")
def eval_verdict_correct(ctx): ...

@eval_mark("ioc", "slow")
def eval_ioc_quality(ctx, llm): ...

# Run only "verdict" evals:
results = eval_run(evals, ctx, include=["verdict"])

# Skip "slow" evals in CI:
results = eval_run(evals, ctx, exclude=["slow"])
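A mark decorator of this shape is just metadata attached to the function; filtering then intersects each evaluator's marks with include/exclude. A rough equivalent (the _eval_marks attribute name and the select helper are assumptions, not looplet internals):

```python
def eval_mark(*marks):
    """Attach filter tags to an evaluator (sketch of the decorator shape)."""
    def wrap(fn):
        fn._eval_marks = set(marks)   # attribute name is a guess, not looplet's
        return fn
    return wrap

def select(evals, include=None, exclude=None):
    """Filter evaluators by marks, mirroring eval_run's include/exclude."""
    out = []
    for fn in evals:
        marks = getattr(fn, "_eval_marks", set())
        if include and not marks & set(include):
            continue                  # no requested mark -> skip
        if exclude and marks & set(exclude):
            continue                  # carries an excluded mark -> skip
        out.append(fn)
    return out
```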

Batch-run across multiple trajectories

from looplet import eval_run_batch

contexts = [EvalContext.from_trajectory_dir(d) for d in trace_dirs]
table = eval_run_batch(evals, contexts)
for row in table:
    print(f"{row['name']:30s} avg={row['avg_score']:.2f}")
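The aggregation itself is simple. If you only have raw per-run score dicts, the same table can be built in plain Python (a sketch; the row field names mirror the loop above):

```python
from statistics import mean

def aggregate(runs):
    """runs: list of {eval_name: score} dicts, one per trajectory."""
    names = sorted({name for run in runs for name in run})
    table = []
    for name in names:
        scores = [run[name] for run in runs if name in run]
        table.append({
            "name": name,
            "avg_score": mean(scores),
            "min_score": min(scores),
            "max_score": max(scores),
        })
    return table
```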

CLI runner for CI

Like pytest with exit codes:

looplet eval traces/ --evals eval_agent.py --threshold 0.7 -v
  ✓ eval_verdict_correct           avg=1.00  min=1.00  max=1.00  (5 runs)
  ✗ eval_ioc_quality               avg=0.42  min=0.20  max=0.80  (5 runs)
  ✓ eval_no_tool_errors            avg=1.00  min=1.00  max=1.00  (5 runs)

  overall: 0.81
  threshold: 0.70  → PASS