Evals — score your agent as you debug it¶
Agent evals work like pytest: write functions named eval_*, and the
framework discovers and runs them. The difference from tests: evals
return scores (0–1), not just pass/fail, because agent output
quality is a spectrum.
```python
# eval_my_agent.py — discovered automatically by eval_discover()

def eval_tests_passed(ctx):
    """Did the agent get tests to pass?"""
    for s in reversed(ctx.steps):
        if s.tool_call.tool == "bash" and "pytest" in s.tool_call.args.get("command", ""):
            return s.tool_result.data.get("exit_code") == 0
    return False

def eval_efficiency(ctx):
    """Score 0-1: fewer steps = better."""
    return min(5 / max(ctx.step_count, 1), 1.0)

def eval_ioc_quality(ctx):
    """Return multiple metrics at once."""
    return {"precision": 0.9, "recall": 0.75, "f1": 0.82}

def eval_reasoning_gaps(ctx, llm):
    """LLM-as-judge: are conclusions supported by data?"""
    resp = llm.generate(f"Score 0-1: {ctx.final_output} supported by {ctx.session_log_text}?")
    return float(resp.strip())
```
Return anything — float, bool, str, dict, or EvalResult.
The framework normalises. If your function takes an llm parameter,
the framework passes the judge LLM automatically.
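One way to picture the normalisation is the sketch below. The `normalize` helper and the minimal `EvalResult` dataclass are hypothetical stand-ins, not looplet's actual implementation; the point is only that each supported return type collapses to named scores in [0, 1]:

```python
# Hypothetical sketch of return-value normalisation.
# `EvalResult` here is a stand-in for looplet's real class.
from dataclasses import dataclass

@dataclass
class EvalResult:
    scores: dict  # metric name -> float in [0, 1]

def normalize(value, name="score"):
    """Coerce an evaluator's return value into an EvalResult."""
    if isinstance(value, EvalResult):
        return value
    if isinstance(value, bool):          # check bool before int/float
        return EvalResult({name: 1.0 if value else 0.0})
    if isinstance(value, (int, float)):
        return EvalResult({name: float(value)})
    if isinstance(value, str):           # e.g. an LLM judge's "0.8"
        return EvalResult({name: float(value.strip())})
    if isinstance(value, dict):          # multiple metrics at once
        return EvalResult({k: float(v) for k, v in value.items()})
    raise TypeError(f"unsupported eval return type: {type(value)!r}")
```

Note the `bool` branch comes first: in Python `True` is an `int`, so checking numerics first would silently turn pass/fail into 1/0 without the explicit intent.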
Attach to your loop¶
For live scoring during development:
```python
from looplet import EvalHook

hook = EvalHook(
    evaluators=[eval_tests_passed, eval_efficiency],
    verbose=True,  # prints scores after each run
)

for step in composable_loop(..., hooks=[hook]):
    ...

print(hook.summary())  # "2 scored (avg 0.90)"
hook.save("evals/run_1.json")
```
Discover and batch-run across saved trajectories¶
```python
from looplet import eval_discover, eval_run, EvalContext

evals = eval_discover("eval_my_agent.py")  # finds all eval_* functions
ctx = EvalContext.from_trajectory_dir("traces/run_1/")
results = eval_run(evals, ctx, judge_llm=my_judge)
for r in results:
    print(r.pretty())
```
The workflow: debug a run → notice a failure pattern → write a 5-line
eval_* function → it runs automatically on every future run. Your
debugging becomes your eval suite.
Discovery scope.
`eval_discover` only collects functions defined in each `eval_*.py` file. Re-exports like `from looplet import eval_mark` are filtered out, so you can freely import decorators and helpers without them accidentally being run as evaluators.
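The filtering can be approximated with a `__module__` check: a function defined in the file carries the file's module name, while a re-exported one keeps its original module. A minimal sketch (the `discover_evals` helper is hypothetical, not looplet's internals):

```python
# Hypothetical sketch: collect eval_* functions defined in the file
# itself, skipping names that were merely imported into it.
import importlib.util
import inspect

def discover_evals(path):
    spec = importlib.util.spec_from_file_location("eval_module", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return [
        fn
        for name, fn in inspect.getmembers(mod, inspect.isfunction)
        if name.startswith("eval_")            # naming convention
        and fn.__module__ == mod.__name__      # defined here, not re-exported
    ]
```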
Distinguish "done" from hook-triggered early stops¶
Hooks that terminate the loop early (budget caps, source counters,
timeouts, quality gates) leave the agent without a done() call in the
trajectory. Evals should dispatch on ctx.stop_reason:
```python
def eval_completed_normally(ctx):
    """Agent called done() itself (not stopped by a hook)."""
    return ctx.completed  # shorthand for ctx.stop_reason == "done"

def eval_stopped_within_budget(ctx):
    """Either finished normally OR stopped by the budget hook (both are fine)."""
    return ctx.stop_reason in {"done", "budget_exceeded"}

def eval_not_hit_timeout(ctx):
    return ctx.stop_reason != "timeout"
```
stop_reason is populated from both live EvalHook runs (read from
state) and saved trajectories (read from trajectory.json). Hooks
should pass a meaningful label when they stop the loop:
```python
from looplet import HookDecision

class BudgetCap:
    def should_stop(self, state, step_num, new_entities):
        if self.tokens > self.budget:
            return HookDecision(stop="budget_exceeded")  # shows up as ctx.stop_reason
        return False
```
Returning a plain True from should_stop is still supported; it
records stop_reason="hook_stop".
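The mapping from a hook's return value to the recorded `stop_reason` can be sketched like this (the `resolve_stop_reason` helper is hypothetical, shown only to make the three cases concrete):

```python
# Hypothetical sketch: how a should_stop() return value might map
# onto the stop_reason recorded in the trajectory.
def resolve_stop_reason(decision):
    """None/False -> keep running; HookDecision with a label -> that
    label; plain True -> the generic "hook_stop" fallback."""
    if not decision:
        return None                       # loop continues
    stop = getattr(decision, "stop", None)
    if isinstance(stop, str):
        return stop                       # e.g. "budget_exceeded"
    return "hook_stop"                    # plain True fallback
```

This is why passing a meaningful label is worth the extra keystrokes: a plain `True` collapses every hook into the same indistinguishable `"hook_stop"`.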
Tag evals with marks for filtering¶
```python
from looplet import eval_mark

@eval_mark("verdict", "fast")
def eval_verdict_correct(ctx): ...

@eval_mark("ioc", "slow")
def eval_ioc_quality(ctx, llm): ...

# Run only "verdict" evals:
results = eval_run(evals, ctx, include=["verdict"])

# Skip "slow" evals in CI:
results = eval_run(evals, ctx, exclude=["slow"])
```
Batch-run across multiple trajectories¶
```python
from looplet import eval_run_batch

contexts = [EvalContext.from_trajectory_dir(d) for d in trace_dirs]
table = eval_run_batch(evals, contexts)
for row in table:
    print(f"{row['name']:30s} avg={row['avg_score']:.2f}")
```
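The aggregation behind `avg_score` amounts to a mean per evaluator name across all trajectories. A minimal sketch, assuming each per-context result can be read as a `{"name": ..., "score": ...}` record (the `aggregate` helper is hypothetical):

```python
# Hypothetical sketch: average each eval's score across trajectories.
from collections import defaultdict

def aggregate(results_per_context):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for results in results_per_context:   # one list per trajectory
        for r in results:
            sums[r["name"]] += r["score"]
            counts[r["name"]] += 1
    return [
        {"name": name, "avg_score": sums[name] / counts[name]}
        for name in sums
    ]
```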
CLI runner for CI¶
Like pytest with exit codes: