traxr: evaluate multi-agent systems beyond the final answer

Quickstart

Bring your agent. traxr does the rest.

Your agent is any Python callable (Task) -> str that talks to an OpenAI-compatible endpoint. Wrap its client with traxr.instrument(). That is the whole integration.

# your agent, unchanged, except the client is instrumented
import openai, traxr

client = traxr.instrument(openai.OpenAI())   # records each call as a trace step

def my_agent(task: traxr.Task) -> str:
    messages = [{"role": "user", "content": task.question}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        ).choices[0].message
        if not reply.tool_calls:                       # no tool call -> done
            return reply.content or ""
        messages += [reply, *run_tools(reply, task.files)]   # tools read the data

experiment = traxr.Experiment(
    files="examples/sales.csv",
    question="Which region had the highest Q3 revenue?",
    expected_answer="EMEA",
    agent=my_agent,
)
experiment.run(dry_run=True)  # the full plan: zero LLM calls
results = experiment.run()
print(results.summary())

tools is your function-schema list; run_tools runs them and returns the tool messages (full example →).
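
For orientation, here is one way tools and run_tools could look. This is a sketch, not traxr's own example: the tool name read_file, the helper bodies, and the assumption that task.files is an iterable of file paths are all illustrative. Only the schema and tool-message shapes follow the OpenAI chat-completions tool-calling format.

```python
import json
from pathlib import Path

# Function-schema list in the OpenAI chat-completions tool format.
# `read_file` is a hypothetical tool name chosen for this sketch.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the text content of one of the task's input files.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string", "description": "File name"}},
            "required": ["name"],
        },
    },
}]


def run_tools(reply, files):
    """Execute each tool call in `reply` and return the matching tool messages.

    Assumption: `files` is an iterable of paths; the real type of task.files may differ.
    """
    by_name = {Path(f).name: Path(f) for f in files}
    messages = []
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)   # arguments arrive as a JSON string
        path = by_name.get(args.get("name", ""))
        if call.function.name != "read_file":
            content = f"error: unknown tool {call.function.name}"
        elif path is None:
            content = f"error: unknown file {args.get('name')!r}"
        else:
            content = path.read_text(encoding="utf-8", errors="replace")
        # One tool message per call, linked back by tool_call_id.
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages
```

Returning errors as tool content, rather than raising, lets the model see the failure and decide what to do next, which is also behavior a trace comparison can pick up.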

Multi-agent? traxr.instrument captures the LLM calls but not which agent is acting. Tag the acting agent with traxr.emit at routes and handoffs (expose your agents →):

# agent_name = the agent acting now; chosen_agent = who the orchestrator picked.
# No-op outside a traxr run, so it's safe in production code.
traxr.emit("routing_decision", {"chosen_agent": "researcher"}, agent_name="orchestrator")

Full multi-agent example (orchestrator + workers) →

How it works

One variable at a time.

Perturb

One operator changes a copy of one input: seeded, deterministic, single-variable. A swapped column, OCR noise, a redacted section, a missing page. The means, not the point.
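
To make "seeded, deterministic, single-variable" concrete, here is a minimal sketch of a column-swapping operator on CSV text. The function name swap_two_columns and its behavior are assumptions for illustration; traxr's own column_swap may work differently.

```python
import csv
import io
import random


def swap_two_columns(csv_text: str, seed: int) -> str:
    """Return a copy of `csv_text` with two columns exchanged (hypothetical operator).

    Header and data move together, so only the column order changes.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or len(rows[0]) < 2:
        return csv_text                      # nothing to swap
    rng = random.Random(seed)                # local RNG: same seed, same variant
    i, j = rng.sample(range(len(rows[0])), 2)
    for row in rows:
        if max(i, j) < len(row):             # leave ragged rows untouched
            row[i], row[j] = row[j], row[i]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Using a local random.Random(seed) instead of the module-level generator keeps the variant reproducible regardless of what else the process has sampled.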

Run paired

Your agent runs on the original, then on each variant, with identical seeds, fresh temp dirs, and original filenames. It cannot tell which condition it is in.
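
A rough sketch of what a paired run involves, under stated assumptions: the helper run_paired and the agent signature (path, seed) -> str are hypothetical, not traxr's API. It shows the three properties named above: same seed, a fresh temp dir per condition, and the original filename in both.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_paired(agent: Callable[[Path, int], str], original: Path,
               variant_text: str, seed: int) -> tuple[str, str]:
    """Run `agent` on the original file and on a perturbed copy (hypothetical helper)."""
    answers = []
    for content in (None, variant_text):                 # baseline first, then variant
        with tempfile.TemporaryDirectory() as tmp:       # fresh directory per condition
            target = Path(tmp) / original.name           # same filename in both conditions
            if content is None:
                shutil.copy(original, target)            # baseline: untouched copy
            else:
                target.write_text(content, encoding="utf-8")   # variant: perturbed content
            answers.append(agent(target, seed))          # identical seed for both runs
    return answers[0], answers[1]
```

Because the path the agent sees differs only in the temp directory, nothing in the filename signals which condition is running.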

Compare traces

Every LLM call, tool call, and routing decision becomes an event. Paired traces are aligned and compared structurally: content noise ignored, behavior changes counted.
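
One way to read "content noise ignored, behavior changes counted" is to project each event onto its structural fields before comparing. The Event type and its fields below are hypothetical; traxr's actual event schema is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical trace event; field names are assumptions for this sketch."""
    kind: str                                       # "llm_call", "tool_call", "routing_decision"
    agent: str = ""
    name: str = ""                                  # tool name or chosen agent, if any
    payload: dict = field(default_factory=dict)     # free-text content, ignored below


def signature(trace: list[Event]) -> list[tuple[str, str, str]]:
    """Project a trace onto its structure, dropping content that varies run to run."""
    return [(e.kind, e.agent, e.name) for e in trace]
```

Two runs whose LLM replies are worded differently but which call the same tools in the same order then share a signature, while an extra or missing tool call changes it.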

The metrics

What the final answer hides.

d_norm: Normalized edit distance between paired traces. 0 is an identical process; 1 is a completely different execution.
t*: The step where divergence first appears, and how early in the run that is (t*/T, with T the run's total step count).
manifestation: How divergence showed up: silent semantic corruption, a strategy reroute, early termination, catastrophic failure, or a recovery that lands the same answer anyway.
token overhead: Perturbed-run tokens over baseline. The cost of a wobble you would never see in the output.
noise floor: Baseline-vs-itself divergence from clean re-runs. Anything at or below it is indistinguishable from sampling noise. Measured by default for external agents.
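
As an illustration of d_norm and t*, the sketch below computes a Levenshtein distance over trace steps, normalizes it by the longer trace, and finds the first differing step. The normalization choice and the function names are assumptions; traxr's exact definitions may differ.

```python
from typing import Hashable, Optional, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance over trace steps (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute, or match at cost 0
        prev = cur
    return prev[-1]


def d_norm(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Edit distance scaled to [0, 1] by the longer trace (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def first_divergence(a: Sequence[Hashable], b: Sequence[Hashable]) -> Optional[int]:
    """0-based index of the first differing step, or None if the traces are identical."""
    for t, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return t
    return None if len(a) == len(b) else min(len(a), len(b))


baseline = ["llm", "tool:read", "llm", "tool:sum", "llm"]
perturbed = ["llm", "tool:read", "llm", "llm"]        # the sum step was skipped
print(d_norm(baseline, perturbed))                    # 0.2
print(first_divergence(baseline, perturbed))          # 3
```

Here one deleted step out of five gives d_norm = 0.2, and the traces first differ at index 3, so t*/T = 3/5 when T is taken as the baseline length.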

Operators

The v1 catalog.

input	operators	delivery
csv / xlsx	column_swap · label_corrupt · data_type_corrupt · row_duplicate · irrelevant_columns · unit_change · null_content	file round-trip
txt / md	ocr_noise · number_corruption · text_redaction · paragraph_shuffle · encoding_error · section_removal · null_content	file round-trip
pdf (any agent)	number_corruption · text_redaction · section_removal · page_removal · page_shuffle · null_content	surgical in-place edits
pdf (built-in agent)	ocr_noise · paragraph_shuffle · encoding_error	content injection

Is my agent traceable?

traxr captures at two tiers. Tier 0 is automatic capture at the OpenAI-SDK boundary: chat.completions calls (sync, async, streaming, tool calls) against any OpenAI-compatible endpoint. Tier 1 is framework-native capture via callbacks, so LangGraph graphs get a richer adapter with full tool fidelity. Anything your code knows (routes, handoffs, memory reads) you surface with traxr.emit. Other SDKs, raw HTTP, and subprocess calls are invisible in v1. Runs that capture nothing are flagged rather than reported as zero divergence. The honest scope →

Before you run it

Perturbed inputs are an injection-adjacent vector into your agent, and traxr cannot sandbox your agent. Disable side-effectful tools or run inside a container. dry_run=True enumerates every run before any agent executes; max_llm_calls_per_run is enforced inside the capture wrapper. Security notes →
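
To show what enforcing a call budget inside a wrapper can mean, here is a minimal sketch. BudgetedCall and CallBudgetExceeded are hypothetical names, and this is not traxr's implementation of max_llm_calls_per_run.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a run tries to make more LLM calls than allowed."""


class BudgetedCall:
    """Wrap a callable and refuse to invoke it more than `max_calls` times."""

    def __init__(self, fn, max_calls: int):
        self._fn = fn
        self._max = max_calls
        self.calls = 0

    def __call__(self, *args, **kwargs):
        if self.calls >= self._max:
            raise CallBudgetExceeded(f"budget of {self._max} LLM calls exhausted")
        self.calls += 1          # count before calling, so failed calls still use budget
        return self._fn(*args, **kwargs)
```

Putting the check in the wrapper rather than in the agent loop means a perturbed input that sends the agent into a long retry loop still stops at the cap.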
