Quickstart
Bring your agent. traxr does the rest.
Your agent is any Python callable (Task) -> str that talks
to an OpenAI-compatible endpoint. Wrap its client with
traxr.instrument(). That is the whole integration.
# your agent, unchanged, except the client is instrumented
import openai, traxr
client = traxr.instrument(openai.OpenAI())  # records each call as a trace step
def my_agent(task: traxr.Task) -> str:
    messages = [{"role": "user", "content": task.question}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        ).choices[0].message
        if not reply.tool_calls:  # no tool call -> done
            return reply.content or ""
        messages += [reply, *run_tools(reply, task.files)]  # tools read the data
experiment = traxr.Experiment(
    files="examples/sales.csv",
    question="Which region had the highest Q3 revenue?",
    expected_answer="EMEA",
    agent=my_agent,
)
experiment.run(dry_run=True)  # the full plan: zero LLM calls
results = experiment.run()
print(results.summary())
tools is your function-schema list; run_tools is your own helper that
executes the tool calls in a reply and returns the resulting tool messages
(full example →).
Multi-agent? instrument captures the LLM calls, but not
which agent is acting. Tag it with
traxr.emit at routes and handoffs
(expose your agents →):
# agent_name = the agent acting now; chosen_agent = who the orchestrator picked.
# No-op outside a Traxr run, so it's safe in production code.
traxr.emit("routing_decision", {"chosen_agent": "researcher"}, agent_name="orchestrator")
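As a rough sketch of where those calls might sit in a two-agent flow: the snippet below defines a local stand-in emit with the same call shape as the line above, so it runs without traxr installed. In real code you would call traxr.emit instead. The agents, the "handoff" event name, and its "to_agent" payload key are hypothetical and only illustrate placement.

```python
from typing import Any, Dict, List, Optional

events: List[Dict[str, Any]] = []  # stand-in sink; traxr would own this in a real run


def emit(event_type: str, payload: Dict[str, Any], agent_name: Optional[str] = None) -> None:
    """Stand-in for traxr.emit (same call shape as the quickstart line)."""
    events.append({"type": event_type, "payload": payload, "agent_name": agent_name})


def researcher(question: str) -> str:
    # Hypothetical sub-agent: a real one would make instrumented LLM calls here.
    return f"notes on: {question}"


def writer(notes: str) -> str:
    # Hypothetical sub-agent that turns notes into the final answer.
    return f"answer based on {notes}"


def orchestrator(question: str) -> str:
    # Route: the orchestrator is acting and it picked the researcher.
    emit("routing_decision", {"chosen_agent": "researcher"}, agent_name="orchestrator")
    notes = researcher(question)
    # Handoff: "handoff" and "to_agent" are assumed names, not confirmed traxr vocabulary.
    emit("handoff", {"to_agent": "writer"}, agent_name="researcher")
    return writer(notes)


answer = orchestrator("Which region had the highest Q3 revenue?")
```

Tagging at the point of decision, rather than inside each sub-agent, keeps the label next to the code that actually knows who is acting.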
How it works
One variable at a time.
- Perturb. One operator changes a copy of one input: seeded, deterministic, single-variable. A swapped column, OCR noise, a redacted section, a missing page. The means, not the point.
- Run paired. Your agent runs on the original, then on each variant, with identical seeds, fresh temp dirs, and original filenames. It cannot tell which condition it is in.
- Compare traces. Every LLM call, tool call, and routing decision becomes an event. Paired traces are aligned and compared structurally: content noise ignored, behavior changes counted.
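To make the first two steps concrete, here is a minimal sketch of a seeded, single-variable operator and of staging each condition in its own temp dir under the original filename. It is an illustration and not traxr's implementation: the exact semantics of column_swap are assumed (two whole columns trade places), and stage_input is a hypothetical helper.

```python
import csv
import io
import pathlib
import random
import tempfile


def column_swap(csv_text: str, seed: int) -> str:
    """Swap two whole columns (header and data), chosen by a seeded RNG."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or len(rows[0]) < 2:
        return csv_text  # nothing to swap
    rng = random.Random(seed)  # local RNG: the same seed always yields the same variant
    i, j = rng.sample(range(len(rows[0])), 2)
    for row in rows:
        if len(row) > max(i, j):  # leave ragged short rows untouched
            row[i], row[j] = row[j], row[i]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def stage_input(content: str, filename: str) -> pathlib.Path:
    """Write content into a fresh temp dir, keeping the original filename."""
    path = pathlib.Path(tempfile.mkdtemp()) / filename
    path.write_text(content, encoding="utf-8")
    return path


original = "region,q3_revenue,units\nEMEA,120,30\nAPAC,95,41\n"
variant = column_swap(original, seed=7)

# Both conditions see a file called sales.csv; only the parent directory differs.
base_path = stage_input(original, "sales.csv")
variant_path = stage_input(variant, "sales.csv")
```

Because the filename is identical in both conditions, nothing in the path tells the agent whether it is reading the original or the variant.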
The metrics
What the final answer hides.
- d_norm: Normalized edit distance between paired traces. 0 is an identical process; 1 is a completely different execution.
- t*: The step where divergence first appears, and how early in the run that is (t*/T, where T is the total number of steps in the run).
- manifestation: How divergence showed up: silent semantic corruption, a strategy reroute, early termination, catastrophic failure, or a recovery that lands the same answer anyway.
- token overhead: Perturbed-run tokens over baseline. The cost of a wobble you would never see in the output.
- noise floor: Baseline-vs-itself divergence from clean re-runs. Anything at or below it is indistinguishable from sampling noise. Measured by default for external agents.
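The numeric metrics above can be sketched on toy traces, treating each trace as a list of event labels. This is an assumption-laden illustration, not traxr's code: it uses plain Levenshtein distance divided by the longer trace length, takes T to be the baseline length, and every function name is hypothetical.

```python
from typing import Optional, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over whole events (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def d_norm(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled into [0, 1] by the longer trace."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def first_divergence(a: Sequence[str], b: Sequence[str]) -> Optional[int]:
    """Index of the first step where the traces differ, or None if identical."""
    for t, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return t
    return None if len(a) == len(b) else min(len(a), len(b))


def token_overhead(perturbed_tokens: int, baseline_tokens: int) -> float:
    """Perturbed-run tokens over baseline tokens."""
    return perturbed_tokens / baseline_tokens


def is_signal(divergence: float, noise_floor: float) -> bool:
    """At or below the noise floor counts as sampling noise."""
    return divergence > noise_floor


baseline = ["llm", "tool:read_csv", "llm", "tool:aggregate", "llm"]
perturbed = ["llm", "tool:read_csv", "llm", "tool:read_csv", "llm", "tool:aggregate", "llm"]

distance = d_norm(baseline, perturbed)         # 2 inserted events over 7 steps
t_star = first_divergence(baseline, perturbed)  # step 3: the extra read_csv call
relative_t = t_star / len(baseline)             # t*/T with T assumed to be the baseline length
overhead = token_overhead(1800, 1200)           # made-up token counts, ratio 1.5
```

Here the perturbed run re-reads the file once before aggregating, so the final answer could still match while d_norm, t*, and token overhead all register the detour.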
Operators
The v1 catalog.
| input | operators | delivery |
|---|---|---|
| csv / xlsx | column_swap · label_corrupt · data_type_corrupt · row_duplicate · irrelevant_columns · unit_change · null_content | file round-trip |
| txt / md | ocr_noise · number_corruption · text_redaction · paragraph_shuffle · encoding_error · section_removal · null_content | file round-trip |
| pdf (any agent) | number_corruption · text_redaction · section_removal · page_removal · page_shuffle · null_content | surgical in-place edits |
| pdf (built-in agent) | ocr_noise · paragraph_shuffle · encoding_error | content injection |
Is my agent traceable?
traxr captures at two tiers. Tier 0 is automatic
capture at the OpenAI-SDK boundary: chat.completions calls
(sync, async, streaming, tool calls) against any OpenAI-compatible
endpoint. Tier 1 is framework-native capture via
callbacks, so LangGraph graphs get a richer adapter with full tool
fidelity. Anything your code knows (routes, handoffs, memory reads) you
surface with traxr.emit. Other SDKs, raw HTTP, and
subprocess calls are invisible in v1, so runs that capture nothing are
flagged rather than reported as zero divergence.
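Why flag empty captures instead of scoring them? Two empty traces are trivially identical, so a naive comparison would report perfect stability for an agent that was never observed. A minimal sketch of the guard follows, with hypothetical names (PairResult, compare_pair, mismatch_rate) and a deliberately simple distance standing in for the real one.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PairResult:
    status: str                  # "ok" or "no_capture"
    divergence: Optional[float]  # None when there was nothing to compare


def mismatch_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Toy distance: share of positions that differ, over the longer trace."""
    longest = max(len(a), len(b))
    same = sum(x == y for x, y in zip(a, b))
    return (longest - same) / longest


def compare_pair(
    baseline: Sequence[str],
    perturbed: Sequence[str],
    distance: Callable[[Sequence[str], Sequence[str]], float] = mismatch_rate,
) -> PairResult:
    # A run that captured nothing says nothing about behavior, so flag it
    # instead of letting it read as 0.0 divergence.
    if not baseline or not perturbed:
        return PairResult("no_capture", None)
    return PairResult("ok", distance(baseline, perturbed))


unseen = compare_pair([], [])                           # agent used an uncaptured SDK
seen = compare_pair(["llm", "tool"], ["llm", "llm"])    # one of two steps differs
```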
The honest scope →
Before you run it
Perturbed inputs are an injection-adjacent vector into your agent, and
traxr cannot sandbox your agent. Disable side-effectful tools or run
inside a container. dry_run=True enumerates every run
before any agent executes; max_llm_calls_per_run is
enforced inside the capture wrapper. Security notes →
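For intuition on what enforcing a call cap inside a wrapper can look like, here is a small sketch. The parameter name max_llm_calls_per_run comes from the text above; the class names and the fake client function are hypothetical, and traxr's actual mechanism may differ.

```python
from typing import Any, Callable


class CallBudgetExceeded(RuntimeError):
    """Raised when a run tries to exceed its LLM call budget."""


class BudgetedCreate:
    """Wraps a create-style callable and refuses calls past a fixed budget."""

    def __init__(self, create: Callable[..., Any], max_calls: int) -> None:
        self._create = create
        self._max_calls = max_calls
        self.calls = 0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Check before forwarding, so an over-budget call never reaches the endpoint.
        if self.calls >= self._max_calls:
            raise CallBudgetExceeded(f"budget of {self._max_calls} LLM calls exhausted")
        self.calls += 1
        return self._create(*args, **kwargs)


def fake_create(**kwargs: Any) -> dict:
    # Stand-in for client.chat.completions.create; no network involved.
    return {"model": kwargs.get("model")}


guarded = BudgetedCreate(fake_create, max_calls=2)
replies = [guarded(model="gpt-4o-mini"), guarded(model="gpt-4o-mini")]

try:
    guarded(model="gpt-4o-mini")  # third call: blocked before it is forwarded
    blocked = False
except CallBudgetExceeded:
    blocked = True
```

A cap like this bounds cost if a perturbed input sends the agent into a loop, but it does not restrict what the agent's tools can do, which is why the advice to disable side-effectful tools or use a container still applies.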