Provenance Guide¶
looplet.provenance captures exactly what your agent saw and did —
every prompt, every response, every step — and writes them to a
diff-friendly directory you can cat, grep, and check into git.
The loop: capture → read → replay.
```python
# 1. Capture
sink = ProvenanceSink(dir="traces/run_1/")
llm = sink.wrap_llm(AnthropicBackend(...))
for step in composable_loop(llm=llm, tools=tools, state=state,
                            hooks=[sink.trajectory_hook()]):
    print(step.pretty())
sink.flush()
```

```shell
# 2. Read (from the shell)
$ python -m looplet show traces/run_1/
```

```python
# 3. Replay (change hooks/tools, no LLM cost)
from looplet import replay_loop

for step in replay_loop("traces/run_1/", tools=tools):
    print(step.pretty())
```
Two primitives, one facade, two extras:
| Class / function | What it does | Implements |
|---|---|---|
| `RecordingLLMBackend` | Every call to `generate` / `generate_with_tools` | `LLMBackend` (wraps any backend) |
| `AsyncRecordingLLMBackend` | Same, async | `AsyncLLMBackend` |
| `TrajectoryRecorder` | Every loop step, with context-before and linked LLM calls | `LoopHook` |
| `ProvenanceSink` | Both of the above in a 3-line drop-in | — |
| `replay_loop(dir, ...)` | Rerun the loop against cached LLM output | generator |
| `python -m looplet show <dir>` | One-page readable summary | CLI |
All of it is zero-dependency and safe by default, and the wrappers preserve the
`NativeToolBackend` capability surface of whatever backend they wrap.
Quick start — ProvenanceSink¶
```python
from looplet import ProvenanceSink, composable_loop, DefaultState
from looplet.backends import AnthropicBackend

sink = ProvenanceSink(dir="traces/run_1/")
llm = sink.wrap_llm(AnthropicBackend(api_key=...))
hooks = [sink.trajectory_hook()]
state = DefaultState(max_steps=15)

for step in composable_loop(llm=llm, tools=tools, state=state, hooks=hooks):
    print(step.pretty())

sink.flush()
```
After `flush()`, the directory contains:
```
traces/run_1/
├── trajectory.json        # run_id, steps, termination_reason, metadata
├── steps/
│   ├── step_01.json       # tool_call, tool_result, context_before, linked LLM indices
│   └── step_02.json
├── call_00_prompt.txt     # exact prompt sent (system + user + tools)
├── call_00_response.txt   # raw response
├── call_01_prompt.txt
├── call_01_response.txt
└── manifest.jsonl         # one LLMCall summary per line
```
Every file is human-readable. `trajectory.json` and `manifest.jsonl` are also
machine-parseable. You can diff them, grep them, version-control them, attach
them to bug reports, or feed them to a separate analysis pipeline.
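Because `manifest.jsonl` and `trajectory.json` are plain JSON, a few stdlib lines are enough to roll a trace up programmatically. A minimal sketch (`summarize_trace` is a hypothetical helper name; it assumes only the file layout and field names documented above):

```python
import json
from pathlib import Path

def summarize_trace(trace_dir: str) -> dict:
    """Roll a trace directory up into one dict, using the documented layout."""
    root = Path(trace_dir)
    # manifest.jsonl: one LLMCall summary per line
    calls = [json.loads(line)
             for line in (root / "manifest.jsonl").read_text().splitlines()
             if line.strip()]
    traj = json.loads((root / "trajectory.json").read_text())
    return {
        "run_id": traj.get("run_id"),
        "steps": len(traj.get("steps", [])),
        "llm_calls": len(calls),
        "errors": sum(1 for c in calls if c.get("error")),
    }
```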
LLM-call provenance only¶
If you don't need the full trajectory (e.g. you're debugging prompt rendering in isolation), just wrap the backend:
```python
from looplet.provenance import RecordingLLMBackend

llm = RecordingLLMBackend(MyBackend())

# Use `llm` anywhere a backend is expected — `composable_loop`, a sub-agent,
# a one-off `llm.generate(...)` call. Every invocation is captured.
response = llm.generate("hello", system_prompt="be brief", max_tokens=50)

# Inspect in memory:
assert len(llm.calls) == 1
c = llm.calls[0]
print(c.prompt, c.response, c.duration_ms, c.error)

# Or dump to disk:
llm.save("traces/debug_run/")
```
The LLMCall dataclass¶
Each entry in `llm.calls` exposes:

| Field | Type | Notes |
|---|---|---|
| `index` | `int` | 0-based, monotonic |
| `timestamp` | `float` | Unix time when the call started |
| `duration_ms` | `float` | Wall-clock around the wrapped call |
| `method` | `str` | `"generate"` or `"generate_with_tools"` |
| `prompt` | `str` | Exact text passed in (after redaction/truncation) |
| `system_prompt` | `str` | Same |
| `response` | `str \| list[dict]` | String for `generate`; content blocks for `generate_with_tools` |
| `tools` | `list[dict] \| None` | Tool schemas passed, when applicable |
| `temperature` | `float` | As passed |
| `max_tokens` | `int` | As passed |
| `step_num` | `int \| None` | Set by `TrajectoryRecorder` to link calls to loop steps |
| `error` | `str \| None` | `f"{type(e).__name__}: {e}"` if the wrapped backend raised |
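These fields make post-hoc triage a couple of list comprehensions. The helpers below are illustrative, not part of looplet; they work on anything exposing the attributes above:

```python
def failed_calls(calls):
    """(index, error) pairs for calls where the wrapped backend raised."""
    return [(c.index, c.error) for c in calls if c.error is not None]

def slowest_call(calls):
    """The call with the largest wall-clock duration, or None for an empty run."""
    return max(calls, key=lambda c: c.duration_ms, default=None)
```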
Safety knobs¶
```python
llm = RecordingLLMBackend(
    backend,
    max_chars_per_call=200_000,  # truncate huge prompts with an elision marker
    redact=lambda s: SECRET_RE.sub("[REDACTED]", s),  # scrub PII/secrets
)
```
By default, redact scrubs the prompt BEFORE it is forwarded to the
wrapped backend AND before it is written to disk. PII and secrets
never leave the process. This is the right default for compliance,
approval gates, and staging deployments.
`ProvenanceSink(redact=...)` accepts the same callable and inherits the
same semantics:

```python
sink = ProvenanceSink(dir="traces/", redact=redact_pii)
llm = sink.wrap_llm(AnthropicBackend(...))
# PII is stripped before Anthropic sees it AND before the trace is written.
```
If you want the legacy record-only behaviour (scrub the trace but forward the raw prompt to the provider — useful when the provider itself needs the PII to respond correctly), opt out explicitly when constructing the recorder.
generate_with_tools is preserved¶
```python
rec = RecordingLLMBackend(native_backend)
assert hasattr(rec, "generate_with_tools")  # surfaced only when the wrapped backend supports it
blocks = rec.generate_with_tools("...", tools=[...])
assert rec.calls[-1].method == "generate_with_tools"
assert rec.calls[-1].tools == [...]
```
Trajectory provenance only¶
Install `TrajectoryRecorder` as a hook. It captures a structured
`Trajectory` for the whole run using the standard hook surface —
`pre_loop`, `pre_prompt`, `post_dispatch`, `on_loop_end`. No changes to
the loop are needed.
```python
from looplet.provenance import TrajectoryRecorder

hook = TrajectoryRecorder()
for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[hook]):
    ...

hook.save("traces/run_1/")  # writes trajectory.json + steps/*.json
```
Linking steps to LLM calls¶
Pair `TrajectoryRecorder` with `RecordingLLMBackend` and every
`StepRecord.llm_call_indices` will point into
`Trajectory.llm_calls` — so you always know which prompt produced a
given tool call:
```python
rec_llm = RecordingLLMBackend(MyBackend())
hook = TrajectoryRecorder(recording_llm=rec_llm)
for step in composable_loop(llm=rec_llm, tools=tools, state=state, hooks=[hook]):
    ...

# Which prompts produced step 3?
indices = hook.trajectory.steps[2].llm_call_indices
for i in indices:
    print(hook.trajectory.llm_calls[i].prompt)
```
What a StepRecord captures¶
| Field | Notes |
|---|---|
| `step_num` | 1-based loop step index |
| `timestamp`, `duration_ms` | When the step ran and how long |
| `pretty` | `Step.pretty()` — `#N ✓ tool(args) → result [Xms]` |
| `tool_call` | `ToolCall.to_dict()` |
| `tool_result` | `ToolResult.to_dict()` (truncated safely) |
| `context_before` | The briefing shown to the LLM before this step's prompt |
| `llm_call_indices` | Into `trajectory.llm_calls` (empty if no recording backend) |
Termination inference¶
`on_loop_end` inspects the last step and sets `termination_reason`:

- `"done"` — the agent called the `done` tool successfully
- `"error"` — the last step returned an error
- `"max_steps_or_stop"` — budget exhausted or a `should_stop` hook fired
- `"no_steps"` — the loop produced nothing
The loop's done path bypasses `post_dispatch`, so `TrajectoryRecorder`
sweeps `state.steps` at `on_loop_end` to catch that final step.
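In CI you often only care that the run ended cleanly. A small sketch that reads `trajectory.json` directly (`assert_clean_termination` is a hypothetical helper name; it relies only on the fields documented above):

```python
import json
from pathlib import Path

def assert_clean_termination(trace_dir: str) -> None:
    """Fail loudly unless the captured run ended via the done tool."""
    traj = json.loads((Path(trace_dir) / "trajectory.json").read_text())
    reason = traj["termination_reason"]
    if reason != "done":
        raise AssertionError(f"run ended with {reason!r}, not 'done'")
```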
Embedded Tracer¶
`TrajectoryRecorder` embeds a `Tracer` by default — no extra hook
needed. Pass your own if you want to export spans elsewhere:
```python
from looplet import Tracer

my_tracer = Tracer()
hook = TrajectoryRecorder(tracer=my_tracer)
...
# Export my_tracer.root_spans to OTel, Datadog, etc.
```
Replay a captured run¶
```python
from looplet import replay_loop

for step in replay_loop("traces/run_1/", tools=my_tools):
    print(step.pretty())
```
`replay_loop` reads `manifest.jsonl` + `call_NN_response.txt` from the
trace directory and feeds them back into `composable_loop` in order.
The LLM is not called again — your tools, hooks, permission
engine, and state are fresh. This is the whole point: change any of
those, diff the step output, and you get a cost-free A/B.
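One way to make that A/B concrete: collect `step.pretty()` lines from a baseline replay and a candidate replay, then diff them with the stdlib. A sketch (`diff_runs` is not a looplet API):

```python
import difflib

def diff_runs(before: list[str], after: list[str]) -> str:
    """Unified diff of two runs' step.pretty() lines; empty string means identical."""
    return "".join(difflib.unified_diff(
        [line + "\n" for line in before],
        [line + "\n" for line in after],
        fromfile="baseline", tofile="candidate",
    ))
```

For example: `diff_runs([s.pretty() for s in replay_loop(d, tools=old_tools)], [s.pretty() for s in replay_loop(d, tools=new_tools)])`.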
Common uses:
- Change a hook, re-run. Add a new permission rule or logging hook, replay, and see how step outputs differ.
- Upgrade looplet and diff the loop. If the replay produces different steps with the same LLM output, the loop behavior changed.
- Golden tests. Capture once, check the directory into git, `replay_loop` in CI.
Constraints:

- If your loop now asks for more LLM calls than were recorded, replay raises `RuntimeError` — reduce `max_steps` or re-record.
- If a call was recorded as `generate_with_tools` but the replay loop uses `generate` (or vice versa), replay raises `RuntimeError` — the divergence is almost certainly a bug you want to see.
- Tools do execute at replay time. Replace them with mocks if you don't want side effects.
- If `manifest.jsonl` is missing, replay falls back to the `call_NN_response.txt` files alone (every call treated as `generate`).
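For the side-effect constraint, a stub factory is often enough. A sketch that assumes tools are a plain name-to-callable mapping (adapt to whatever tool abstraction you actually use):

```python
def mock_tools(real_tools: dict):
    """Return (mocks, log): same tool names, but each body is a no-op that
    records its arguments instead of executing."""
    log: list[tuple[str, dict]] = []

    def make_stub(name: str):
        def stub(**kwargs):
            log.append((name, kwargs))
            return {"mocked": True}  # inert placeholder result
        return stub

    return {name: make_stub(name) for name in real_tools}, log
```

Then replay with `mocks` instead of the real tools and inspect `log` afterwards.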
Inspect a trace from the CLI¶
```
$ python -m looplet show traces/run_1/

154a1edb893c ✓ done  4 steps  4 LLM calls  0ms
  #1 ✓ add(a=1, b=2) → 1 keys [ 12ms]  call 0
  #2 ✓ add(a=3, b=4) → 1 keys [ 10ms]  call 1
  #3 ✓ done(answer=ok) → 1 keys [  8ms]
  LLM: 4 calls, 1,582 in / 230 out chars, 0 errors
```
Zero dependencies, stdlib-only. Useful when you just want a one-glance answer to "did this run do what I expected?" without loading an IDE or a notebook.
Exit codes: 0 on success, 1 for a missing or malformed directory.
Recipes¶
1. Golden-trajectory regression tests¶
```python
# Capture the golden run once:
sink = ProvenanceSink(dir="tests/goldens/login_flow/")
llm = sink.wrap_llm(backend)
for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[sink.trajectory_hook()]):
    ...
sink.flush()

# In CI, replay against the stored manifest and rebuild `actual`.
# (to_dict() is a hypothetical serializer -- adapt to however you turn
# the recorded trajectory into a dict.)
rec = TrajectoryRecorder()
for step in replay_loop("tests/goldens/login_flow/", tools=tools, hooks=[rec]):
    ...
actual = rec.trajectory.to_dict()

expected = json.loads((Path("tests/goldens/login_flow") / "trajectory.json").read_text())
assert actual["step_count"] == expected["step_count"]
assert [s["tool_call"]["tool"] for s in actual["steps"]] == \
       [s["tool_call"]["tool"] for s in expected["steps"]]
```
2. Cost accounting¶
```python
import json
from collections import Counter

llm = RecordingLLMBackend(AnthropicBackend(...))
# ... run the loop ...

total_prompt_chars = sum(len(c.prompt) for c in llm.calls)
total_response_chars = sum(
    len(c.response) if isinstance(c.response, str) else len(json.dumps(c.response))
    for c in llm.calls
)
by_method = Counter(c.method for c in llm.calls)
print(f"{len(llm.calls)} calls, {total_prompt_chars:,} in / {total_response_chars:,} out, {by_method}")
```
3. Secret scrubbing¶
```python
import re

SECRET_RE = re.compile(r"(api_key|token|password)\s*[:=]\s*\S+", re.I)

sink = ProvenanceSink(
    dir="traces/run_1/",
    redact=lambda s: SECRET_RE.sub(r"\1=[REDACTED]", s),
)
```
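The regex is doing real security work, so it is worth a unit test of its own before trusting it with a trace. These assertions exercise plain `re` behaviour, independent of looplet:

```python
import re

SECRET_RE = re.compile(r"(api_key|token|password)\s*[:=]\s*\S+", re.I)

def redact(s: str) -> str:
    # Keep the key name (backreference \1), drop the value.
    return SECRET_RE.sub(r"\1=[REDACTED]", s)

# Case-insensitive; handles ':' or '=' with optional surrounding whitespace:
assert redact("API_KEY: sk-live-123") == "API_KEY=[REDACTED]"
assert redact("password=hunter2 rest") == "password=[REDACTED] rest"
assert redact("no secrets here") == "no secrets here"
```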
4. Per-scenario benchmarks¶
```python
for scenario in scenarios:
    sink = ProvenanceSink(dir=f"traces/{scenario.name}/")
    llm = sink.wrap_llm(backend)
    for step in composable_loop(llm=llm, tools=tools, state=scenario.fresh_state(),
                                hooks=[sink.trajectory_hook()]):
        ...
    sink.flush()

# Now you have traces/<name>/trajectory.json for every scenario —
# diff across runs, feed to a notebook, attach to PR descriptions.
```
5. Bug-report bundles¶
```python
# On failure, flush whatever was captured so far:
try:
    for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[sink.trajectory_hook()]):
        ...
except Exception:
    sink.flush()  # partial trace written; safe to attach to the issue
    raise
```
Performance¶
- Capture overhead is one dict append + one `str()` per LLM call; per step, one `StepRecord` append.
- Memory is bounded by `max_chars_per_call` (default 200,000 chars) on each `prompt`/`response`/`system_prompt`. Tool schemas and response blocks are stored as-is — if you pass giant tool schemas, consider redacting or truncating before the call.
- `save()` is the only disk I/O. Call it once at loop end (or inside an `except` handler) — not inside the loop.
- Both sync and async variants are correct under `asyncio.gather(...)` concurrency at the call site, but note that `TrajectoryRecorder` is designed for a single loop at a time; use one sink per concurrent run.
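To see why call-site concurrency is safe, here is a toy stand-in for the async recorder (not the library class): each awaited call appends exactly one record, and list appends are atomic under asyncio's single-threaded scheduling.

```python
import asyncio

class StubAsyncBackend:
    """Stand-in for a real async backend (illustration only)."""
    async def generate(self, prompt, **kwargs):
        await asyncio.sleep(0)  # yield control, like a network call would
        return f"echo:{prompt}"

class MiniAsyncRecorder:
    """Toy sketch of what AsyncRecordingLLMBackend does around each call:
    await the wrapped backend, then append exactly one record."""
    def __init__(self, backend):
        self.backend, self.calls = backend, []

    async def generate(self, prompt, **kwargs):
        response = await self.backend.generate(prompt, **kwargs)
        self.calls.append({"prompt": prompt, "response": response})
        return response

async def main():
    rec = MiniAsyncRecorder(StubAsyncBackend())
    # Fan out five concurrent calls; each appends its own record.
    await asyncio.gather(*(rec.generate(f"p{i}") for i in range(5)))
    return rec

rec = asyncio.run(main())
assert len(rec.calls) == 5
```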
When not to use¶
- Production hot paths with throughput above ~100 LLM calls/sec: `RecordingLLMBackend` is cheap but not free. Either use a sampling wrapper in front of it or use OTel via `TracingHook`, which is designed for that.
- PII-sensitive deployments without a `redact=` callable. The recorder writes prompts verbatim — if your prompts contain PII and you can't scrub, pick a different tool.
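The "sampling wrapper" can be a few lines: route a sampled fraction of calls through the recording backend and the rest straight to the raw one. A sketch, not a looplet API (`rng` is injectable so the sampling is testable):

```python
import random

class SampledRecorder:
    """Forward every call; record only a sampled fraction."""
    def __init__(self, recorded, raw, rate: float = 0.01, rng=random.random):
        self.recorded, self.raw = recorded, raw  # recording vs bare backend
        self.rate, self.rng = rate, rng

    def generate(self, prompt, **kwargs):
        # With probability `rate`, go through the recording backend.
        backend = self.recorded if self.rng() < self.rate else self.raw
        return backend.generate(prompt, **kwargs)
```

Here `recorded` would be a `RecordingLLMBackend(backend)` and `raw` the bare backend.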
For everything else — development, CI, benchmarks, debugging, golden tests — this is what you want.