Provenance Guide

looplet.provenance captures exactly what your agent saw and did — every prompt, every response, every step — and writes them to a diff-friendly directory you can cat, grep, and check into git.

The loop: capture → read → replay.

# 1. Capture
sink = ProvenanceSink(dir="traces/run_1/")
llm = sink.wrap_llm(AnthropicBackend(...))
for step in composable_loop(llm=llm, tools=tools, state=state,
                            hooks=[sink.trajectory_hook()]):
    print(step.pretty())
sink.flush()

# 2. Read (from the shell)
$ python -m looplet show traces/run_1/

# 3. Replay (change hooks/tools, no LLM cost)
from looplet import replay_loop
for step in replay_loop("traces/run_1/", tools=tools):
    print(step.pretty())

Two primitives, one facade, two extras:

| Class / function | What it does | Implements |
| --- | --- | --- |
| `RecordingLLMBackend` | Records every call to `generate` / `generate_with_tools` | `LLMBackend` (wraps any backend) |
| `AsyncRecordingLLMBackend` | Same, async | `AsyncLLMBackend` |
| `TrajectoryRecorder` | Records every loop step, with context-before and linked LLM calls | `LoopHook` |
| `ProvenanceSink` | Both of the above in a 3-line drop-in | |
| `replay_loop(dir, ...)` | Reruns the loop against cached LLM output | generator |
| `python -m looplet show <dir>` | One-page readable summary | CLI |

All of it is zero-dependency and safe by default, and each wrapper preserves the NativeToolBackend capability surface of the backend it wraps.


Quick start — ProvenanceSink

from looplet import ProvenanceSink, composable_loop, DefaultState
from looplet.backends import AnthropicBackend

sink = ProvenanceSink(dir="traces/run_1/")

llm = sink.wrap_llm(AnthropicBackend(api_key=...))
hooks = [sink.trajectory_hook()]
state = DefaultState(max_steps=15)

for step in composable_loop(llm=llm, tools=tools, state=state, hooks=hooks):
    print(step.pretty())

sink.flush()

After flush(), the directory contains:

traces/run_1/
├── trajectory.json         # run_id, steps, termination_reason, metadata
├── steps/
│   ├── step_01.json        # tool_call, tool_result, context_before, linked LLM indices
│   └── step_02.json
├── call_00_prompt.txt      # exact prompt sent (system + user + tools)
├── call_00_response.txt    # raw response
├── call_01_prompt.txt
├── call_01_response.txt
└── manifest.jsonl          # one LLMCall summary per line

Every file is human-readable. trajectory.json and manifest.jsonl are machine-parseable. You can diff, grep, version-control, attach to bug reports, or feed to a separate analysis pipeline.
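As a sketch of that "feed to a separate analysis pipeline" step, here is a minimal manifest reader. It assumes only what the layout above states — one JSON-encoded LLMCall summary per line — plus the `duration_ms` and `error` fields from the LLMCall table; adjust the keys if your manifest differs:

```python
import json
from pathlib import Path

def summarize_manifest(path):
    """Tally LLM calls from a manifest.jsonl (one JSON summary per line)."""
    lines = Path(path).read_text().splitlines()
    calls = [json.loads(line) for line in lines if line.strip()]
    return {
        "calls": len(calls),
        "total_ms": sum(c.get("duration_ms", 0) for c in calls),
        "errors": sum(1 for c in calls if c.get("error")),
    }
```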


LLM-call provenance only

If you don't need the full trajectory (e.g. you're debugging prompt rendering in isolation), just wrap the backend:

from looplet.provenance import RecordingLLMBackend

llm = RecordingLLMBackend(MyBackend())

# Use `llm` anywhere a backend is expected — `composable_loop`, a sub-agent,
# a one-off `llm.generate(...)` call. Every invocation is captured.

response = llm.generate("hello", system_prompt="be brief", max_tokens=50)

# Inspect in memory:
assert len(llm.calls) == 1
c = llm.calls[0]
print(c.prompt, c.response, c.duration_ms, c.error)

# Or dump to disk:
llm.save("traces/debug_run/")

The LLMCall dataclass

Each entry in llm.calls exposes:

| Field | Type | Notes |
| --- | --- | --- |
| `index` | `int` | 0-based, monotonic |
| `timestamp` | `float` | Unix time when the call started |
| `duration_ms` | `float` | Wall-clock around the wrapped call |
| `method` | `str` | `"generate"` or `"generate_with_tools"` |
| `prompt` | `str` | Exact text passed in (after redaction/truncation) |
| `system_prompt` | `str` | Same |
| `response` | `str \| list[dict]` | String for `generate`; content blocks for `generate_with_tools` |
| `tools` | `list[dict] \| None` | Tool schemas passed, when applicable |
| `temperature` | `float` | As passed |
| `max_tokens` | `int` | As passed |
| `step_num` | `int \| None` | Set by `TrajectoryRecorder` to link calls to loop steps |
| `error` | `str \| None` | `f"{type(e).__name__}: {e}"` if the wrapped backend raised |

Safety knobs

llm = RecordingLLMBackend(
    backend,
    max_chars_per_call=200_000,     # truncate huge prompts with an elision marker
    redact=lambda s: SECRET_RE.sub("[REDACTED]", s),  # scrub PII/secrets
)

By default, redact scrubs the prompt BEFORE it is forwarded to the wrapped backend AND before it is written to disk. PII and secrets never leave the process. This is the right default for compliance, approval gates, and staging deployments.

ProvenanceSink(redact=...) accepts the same callable and inherits the same semantics:

sink = ProvenanceSink(dir="traces/", redact=redact_pii)
llm  = sink.wrap_llm(AnthropicBackend(...))
# PII is stripped before Anthropic sees it AND before the trace is written.

If you want the legacy record-only behaviour (scrub the trace but forward the raw prompt to the provider — useful when the provider itself needs the PII to respond correctly), opt out:

sink = ProvenanceSink(dir="traces/", redact=redact_pii, redact_upstream=False)

generate_with_tools is preserved

rec = RecordingLLMBackend(native_backend)
assert hasattr(rec, "generate_with_tools")   # surfaced only when wrapped supports it

blocks = rec.generate_with_tools("...", tools=[...])
assert rec.calls[-1].method == "generate_with_tools"
assert rec.calls[-1].tools == [...]

Trajectory provenance only

Install TrajectoryRecorder as a hook. It captures a structured Trajectory for the whole run using the standard hook surface — pre_loop, pre_prompt, post_dispatch, on_loop_end. No changes to the loop are needed.

from looplet.provenance import TrajectoryRecorder

hook = TrajectoryRecorder()
for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[hook]):
    ...
hook.save("traces/run_1/")     # writes trajectory.json + steps/*.json

Linking steps to LLM calls

Pair TrajectoryRecorder with RecordingLLMBackend and every StepRecord.llm_call_indices will point into Trajectory.llm_calls — so you always know which prompt produced a given tool call:

rec_llm = RecordingLLMBackend(MyBackend())
hook = TrajectoryRecorder(recording_llm=rec_llm)

for step in composable_loop(llm=rec_llm, tools=tools, state=state, hooks=[hook]):
    ...

# Which prompts produced step 3?
indices = hook.trajectory.steps[2].llm_call_indices
for i in indices:
    print(hook.trajectory.llm_calls[i].prompt)

What a StepRecord captures

| Field | Notes |
| --- | --- |
| `step_num` | 1-based loop step index |
| `timestamp`, `duration_ms` | When the step ran and how long |
| `pretty` | `Step.pretty()` output: `#N ✓ tool(args) → result [Xms]` |
| `tool_call` | `ToolCall.to_dict()` |
| `tool_result` | `ToolResult.to_dict()` (truncated safely) |
| `context_before` | The briefing shown to the LLM before this step's prompt |
| `llm_call_indices` | Indices into `trajectory.llm_calls` (empty if no recording backend) |

Termination inference

on_loop_end inspects the last step and sets termination_reason:

  • "done" — the agent called the done tool successfully
  • "error" — the last step returned an error
  • "max_steps_or_stop" — budget exhausted or a should_stop hook fired
  • "no_steps" — the loop produced nothing

The loop's done path bypasses post_dispatch, so TrajectoryRecorder sweeps state.steps at on_loop_end to catch that final step.
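The classification logic itself is small enough to sketch. This standalone version mirrors the four rules above; the `tool` and `error` keys on each step record are an assumed shape for illustration, not looplet's actual internals:

```python
def infer_termination(steps):
    """Classify a finished run from its step records.
    Each step is assumed to be a dict with 'tool' and 'error' keys."""
    if not steps:
        return "no_steps"          # the loop produced nothing
    last = steps[-1]
    if last.get("error"):
        return "error"             # the last step returned an error
    if last.get("tool") == "done":
        return "done"              # the done tool succeeded
    return "max_steps_or_stop"     # budget exhausted or should_stop fired
```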

Embedded Tracer

TrajectoryRecorder embeds a Tracer by default — no extra hook needed. Pass your own if you want to export spans elsewhere:

from looplet import Tracer
my_tracer = Tracer()
hook = TrajectoryRecorder(tracer=my_tracer)
...
# Export my_tracer.root_spans to OTel, Datadog, etc.

Replay a captured run

from looplet import replay_loop

for step in replay_loop("traces/run_1/", tools=my_tools):
    print(step.pretty())

replay_loop reads manifest.jsonl + call_NN_response.txt from the trace directory and feeds them back into composable_loop in order. The LLM is not called again — your tools, hooks, permission engine, and state are fresh. This is the whole point: change any of those, diff the step output, and you get a cost-free A/B.

Common uses:

  • Change a hook, re-run. Add a new permission rule or logging hook, replay, see how step outputs differ.
  • Upgrade looplet and diff the loop. If the replay produces different steps with the same LLM output, the loop behavior changed.
  • Golden tests. Capture once, check the directory into git, replay_loop in CI.

Constraints:

  • If your loop now asks for more LLM calls than were recorded, replay raises RuntimeError — reduce max_steps or re-record.
  • If a call was recorded as generate_with_tools but the replay loop uses generate (or vice versa), replay raises RuntimeError — the divergence is almost certainly a bug you want to see.
  • Tools do execute at replay time. Replace with mocks if you don't want side effects.
  • If manifest.jsonl is missing, replay falls back to the call_NN_response.txt files alone (every call treated as generate).
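For the "tools do execute" constraint, side-effect-free stand-ins can be generated mechanically. This is a hypothetical sketch that assumes a dict-of-callables tool shape — adapt it to whatever interface your tool registry actually uses:

```python
def mock_tools(tool_names, canned="mocked"):
    """Build no-op stand-ins for replay (hypothetical dict-of-callables
    tool shape; adapt to your real tool registry)."""
    return {
        name: (lambda *a, _n=name, **kw: {"tool": _n, "result": canned})
        for name in tool_names
    }

# tools = mock_tools(["send_email", "write_file"])
# for step in replay_loop("traces/run_1/", tools=tools): ...
```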

Inspect a trace from the CLI

$ python -m looplet show traces/run_1/
154a1edb893c  ✓ done  4 steps  4 LLM calls  0ms

#1  ✓ add(a=1, b=2)       → 1 keys    [   12ms] call 0
#2  ✓ add(a=3, b=4)       → 1 keys    [   10ms] call 1
#3  ✓ done(answer=ok)     → 1 keys    [    8ms]

LLM: 4 calls, 1,582 in / 230 out chars, 0 errors

Zero dependencies, stdlib-only. Useful when you just want a one-glance answer to "did this run do what I expected?" without loading an IDE or a notebook.

Exit codes: 0 success, 1 missing / malformed directory.


Recipes

1. Golden-trajectory regression tests

# Capture the golden run once:
sink = ProvenanceSink(dir="tests/goldens/login_flow/")
llm = sink.wrap_llm(backend)
for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[sink.trajectory_hook()]):
    ...
sink.flush()

# In CI, replay against the stored manifest and diff the trajectories.
# (`actual` below is the trajectory.json dict captured from the replay run.)
import json
from pathlib import Path

expected = json.loads((Path("tests/goldens/login_flow") / "trajectory.json").read_text())
assert actual["step_count"] == expected["step_count"]
assert [s["tool_call"]["tool"] for s in actual["steps"]] == \
       [s["tool_call"]["tool"] for s in expected["steps"]]

2. Cost accounting

import json
from collections import Counter

llm = RecordingLLMBackend(AnthropicBackend(...))
# ... run the loop ...

total_prompt_chars = sum(len(c.prompt) for c in llm.calls)
total_response_chars = sum(
    len(c.response) if isinstance(c.response, str) else len(json.dumps(c.response))
    for c in llm.calls
)
by_method = Counter(c.method for c in llm.calls)
print(f"{len(llm.calls)} calls, {total_prompt_chars:,} in / {total_response_chars:,} out, {by_method}")

3. Secret scrubbing

import re
SECRET_RE = re.compile(r"(api_key|token|password)\s*[:=]\s*\S+", re.I)

sink = ProvenanceSink(
    dir="traces/run_1/",
    redact=lambda s: SECRET_RE.sub(r"\1=[REDACTED]", s),
)
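It is worth unit-testing the redactor before trusting it with a trace. The pattern above behaves like this (note that `\S+` stops at whitespace, so a secret containing spaces would leak its tail):

```python
import re

# Same pattern and replacement as the recipe above.
SECRET_RE = re.compile(r"(api_key|token|password)\s*[:=]\s*\S+", re.I)
redact = lambda s: SECRET_RE.sub(r"\1=[REDACTED]", s)

print(redact("api_key: sk-abc123 and password=hunter2"))
# → api_key=[REDACTED] and password=[REDACTED]
```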

4. Per-scenario benchmarks

for scenario in scenarios:
    sink = ProvenanceSink(dir=f"traces/{scenario.name}/")
    llm = sink.wrap_llm(backend)
    for step in composable_loop(llm=llm, tools=tools, state=scenario.fresh_state(),
                                hooks=[sink.trajectory_hook()]):
        ...
    sink.flush()
# Now you have traces/<name>/trajectory.json for every scenario —
# diff across runs, feed to a notebook, attach to PR descriptions.

5. Bug-report bundles

# On failure, flush whatever was captured so far:
try:
    for step in composable_loop(llm=llm, tools=tools, state=state, hooks=[sink.trajectory_hook()]):
        ...
except Exception:
    sink.flush()   # partial trace written; safe to attach to the issue
    raise

Performance

  • Capture overhead is one dict append + one str() per LLM call; per step: one StepRecord append.
  • Memory is bounded by max_chars_per_call (default 200,000 chars) on each prompt/response/system_prompt. Tool schemas and response blocks are stored as-is — if you pass giant tool schemas, consider redacting or truncating before the call.
  • save() is the only disk I/O. Call it once at loop end (or inside an except handler) — not inside the loop.
  • Both sync and async variants are correct under asyncio.gather(...) concurrency at the call site, but note that TrajectoryRecorder is designed for a single loop at a time; use one sink per concurrent run.

When not to use

  • Production hot paths with throughput above ~100 LLM calls/sec: RecordingLLMBackend is cheap but not free. Either put a sampling wrapper in front of it, or use OTel via TracingHook, which is designed for that regime.
  • PII-sensitive deployments without a redact= callable. The recorder writes prompts verbatim — if your prompts contain PII and you can't scrub, pick a different tool.
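The sampling wrapper mentioned above can be very small. This is a hypothetical sketch — only the delegation logic is shown; `recorded` would be the same backend wrapped in RecordingLLMBackend, and the `generate` signature is assumed from the examples earlier in this guide:

```python
import random

class SamplingRecorder:
    """Route a fraction of calls through a recording backend,
    the rest through the raw one (hypothetical sketch)."""
    def __init__(self, raw, recorded, rate=0.01, rng=random.random):
        self.raw, self.recorded = raw, recorded
        self.rate, self.rng = rate, rng

    def generate(self, prompt, **kw):
        # Record roughly `rate` of all calls; forward the rest untouched.
        target = self.recorded if self.rng() < self.rate else self.raw
        return target.generate(prompt, **kw)
```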

For everything else — development, CI, benchmarks, debugging, golden tests — this is what you want.