trace-level evaluation for multi-agent systems

Look past the final answer.

A multi-agent system can reach the right answer through the wrong process, and final-answer accuracy never sees it. traxr evaluates the execution trace itself: run paired experiments on your agent and measure how its behavior diverged, how early the divergence started, and what it cost.

pip install "traxr[document,openai,pandas] @ git+https://github.com/anna-mazhar/traxr.git@v1.0.0"

Quickstart

Bring your agent. traxr does the rest.


Your agent is any Python callable (Task) -> str that talks to an OpenAI-compatible endpoint. Wrap its client with traxr.instrument(). That is the whole integration.

# your agent, unchanged, except the client is instrumented
import openai, traxr

client = traxr.instrument(openai.OpenAI())   # records each call as a trace step

def my_agent(task: traxr.Task) -> str:
    messages = [{"role": "user", "content": task.question}]
    while True:
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        ).choices[0].message
        if not reply.tool_calls:                       # no tool call -> done
            return reply.content or ""
        messages += [reply, *run_tools(reply, task.files)]   # tools read the data

experiment = traxr.Experiment(
    files="examples/sales.csv",
    question="Which region had the highest Q3 revenue?",
    expected_answer="EMEA",
    agent=my_agent,
)
experiment.run(dry_run=True)  # the full plan: zero LLM calls
results = experiment.run()
print(results.summary())

Here tools is your function-schema list, and run_tools executes the tool calls in a reply and returns the resulting tool messages (full example →).
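
For readers who want something concrete behind those two names, here is a minimal sketch. The schema and tool-message shapes follow the OpenAI chat-completions convention; the read_file_head tool, its arguments, and the assumption that task.files is a path or a list of paths are illustrative and not taken from traxr.

```python
import json
from pathlib import Path

# Hypothetical tool schema: one function the model may call.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file_head",
        "description": "Return the first n lines of a task file.",
        "parameters": {
            "type": "object",
            "properties": {
                "index": {"type": "integer", "description": "Position in the task's file list."},
                "n": {"type": "integer", "description": "Number of lines to return."},
            },
            "required": ["index"],
        },
    },
}]


def read_file_head(files, index, n=20):
    """Read up to n lines from files[index]."""
    lines = Path(files[index]).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[:n])


def run_tools(reply, files):
    """Execute every tool call in an assistant reply; return one tool message per call."""
    # Assumption: files is a single path or an iterable of paths.
    files = [files] if isinstance(files, (str, Path)) else list(files)
    messages = []
    for call in reply.tool_calls:
        try:
            args = json.loads(call.function.arguments or "{}")
            if call.function.name != "read_file_head":
                raise ValueError(f"unknown tool: {call.function.name}")
            content = read_file_head(files, **args)
        except Exception as exc:  # report the failure to the model instead of crashing
            content = f"error: {exc}"
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages
```

Returning errors as tool messages keeps the loop alive when a perturbed file breaks a tool, so the failure shows up in the trace rather than as an exception outside it.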

Multi-agent? instrument captures the LLM calls, but not which agent is acting. Tag the acting agent with traxr.emit at routing decisions and handoffs (expose your agents →):

# agent_name = the agent acting now; chosen_agent = who the orchestrator picked.
# No-op outside a traxr run, so it's safe in production code.
traxr.emit("routing_decision", {"chosen_agent": "researcher"}, agent_name="orchestrator")

Full multi-agent example (orchestrator + workers) →

How it works

One variable at a time.

  1. Perturb

    One operator changes a copy of one input: seeded, deterministic, single-variable. A swapped column, OCR noise, a redacted section, a missing page. The means, not the point.

  2. Run paired

    Your agent runs on the original, then on each variant, with identical seeds, fresh temp dirs, and original filenames. It cannot tell which condition it is in.

  3. Compare traces

    Every LLM call, tool call, and routing decision becomes an event. Paired traces are aligned and compared structurally: content noise ignored, behavior changes counted.
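
The first two steps can be sketched in plain Python. This is not traxr's implementation: column_swap below is a hand-rolled stand-in for the operator of the same name, and run_paired, the temp-dir layout, and an agent that takes a file path are assumptions made for illustration.

```python
import csv
import random
import shutil
import tempfile
from pathlib import Path


def column_swap(src: Path, dst: Path, seed: int) -> tuple[int, int]:
    """Write a copy of a CSV with the values of two seeded-random columns swapped.

    Headers stay in place, so the labels no longer match the data beneath them.
    Assumes a rectangular CSV with a header row and at least two columns.
    """
    with src.open(newline="", encoding="utf-8") as f:
        header, *rows = list(csv.reader(f))
    i, j = random.Random(seed).sample(range(len(header)), 2)  # same seed -> same pair
    for row in rows:
        row[i], row[j] = row[j], row[i]
    with dst.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([header, *rows])
    return i, j


def run_paired(agent, original: Path, seed: int = 0) -> dict[str, str]:
    """Run the agent once on the original and once on a perturbed copy.

    Each condition gets a fresh temp dir and the original filename, so the
    path carries no hint about which condition the agent is in.
    """
    answers = {}
    for condition in ("baseline", "column_swap"):
        with tempfile.TemporaryDirectory() as tmp:
            target = Path(tmp) / original.name
            if condition == "baseline":
                shutil.copy(original, target)
            else:
                column_swap(original, target, seed)
            answers[condition] = agent(target)
    return answers
```

Using a private random.Random(seed) instead of the global generator keeps the perturbation reproducible regardless of what else in the process draws random numbers.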

The metrics

What the final answer hides.

d_norm
Normalized edit distance between paired traces. 0 is an identical process; 1 is a completely different execution.
t*
The step where divergence first appears, and how early in the run that is (t*/T).
manifestation
How divergence showed up: silent semantic corruption, a strategy reroute, early termination, catastrophic failure, or a recovery that lands the same answer anyway.
token overhead
Perturbed-run tokens over baseline. The cost of a wobble you would never see in the output.
noise floor
Baseline-vs-itself divergence from clean re-runs. Anything at or below it is indistinguishable from sampling noise. Measured by default for external agents.
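
To make these definitions concrete, here is a minimal sketch over traces reduced to lists of event labels. It assumes plain Levenshtein distance normalized by the longer trace, takes t* as the first position (0-indexed) where the two traces differ, and treats token overhead as a ratio. traxr's actual alignment and normalization may differ, and every function name here is hypothetical.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two event sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def d_norm(baseline: list[str], perturbed: list[str]) -> float:
    """Edit distance scaled into [0, 1] by the longer trace."""
    longest = max(len(baseline), len(perturbed))
    return edit_distance(baseline, perturbed) / longest if longest else 0.0


def first_divergence(baseline: list[str], perturbed: list[str]):
    """0-indexed step where the traces first differ, or None if identical."""
    for t, (x, y) in enumerate(zip(baseline, perturbed)):
        if x != y:
            return t
    if len(baseline) != len(perturbed):
        return min(len(baseline), len(perturbed))  # one trace ended early
    return None


def token_overhead(baseline_tokens: int, perturbed_tokens: int) -> float:
    """Perturbed-run tokens as a multiple of baseline (assumes baseline > 0)."""
    return perturbed_tokens / baseline_tokens


def above_noise_floor(d: float, floor: float) -> bool:
    """At or below the floor is treated as sampling noise."""
    return d > floor


# Illustrative traces: the perturbed run re-reads the file before aggregating.
baseline = ["llm", "tool:read_csv", "llm", "tool:aggregate", "llm"]
perturbed = ["llm", "tool:read_csv", "llm", "tool:read_csv", "llm", "tool:aggregate", "llm"]
```

For these two traces the edit distance is 2 (two inserted steps), d_norm is 2/7, and the first divergence is at index 3, even though both runs end with the same final LLM step.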

Operators

The v1 catalog.

| input | operators | delivery |
| --- | --- | --- |
| csv / xlsx | column_swap · label_corrupt · data_type_corrupt · row_duplicate · irrelevant_columns · unit_change · null_content | file round-trip |
| txt / md | ocr_noise · number_corruption · text_redaction · paragraph_shuffle · encoding_error · section_removal · null_content | file round-trip |
| pdf (any agent) | number_corruption · text_redaction · section_removal · page_removal · page_shuffle · null_content | surgical in-place edits |
| pdf (built-in agent) | ocr_noise · paragraph_shuffle · encoding_error | content injection |

Is my agent traceable?

traxr captures at two tiers. Tier 0 is automatic capture at the OpenAI-SDK boundary: chat.completions calls (sync, async, streaming, tool calls) against any OpenAI-compatible endpoint. Tier 1 is framework-native capture via callbacks, so LangGraph graphs get a richer adapter with full tool fidelity. Anything your code knows (routes, handoffs, memory reads) you surface with traxr.emit. Other SDKs, raw HTTP, and subprocess calls are invisible in v1: runs that capture nothing are flagged, not reported as zero divergence. The honest scope →
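
One way to picture boundary capture is a thin proxy around a create-style call that appends one event per call, plus a check that refuses to treat an empty trace as a clean result. The sketch below is a generic illustration with hypothetical names (RecordingCompletions, check_captured); it says nothing about how traxr.instrument is actually built.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RecordingCompletions:
    """Wraps a create-style callable and records one event per completed call."""
    create_fn: Callable[..., Any]
    events: list = field(default_factory=list)

    def create(self, **kwargs: Any) -> Any:
        response = self.create_fn(**kwargs)  # forward unchanged to the real client
        self.events.append({
            "step": len(self.events),
            "model": kwargs.get("model"),
            "n_messages": len(kwargs.get("messages", [])),
        })
        return response


def check_captured(events: list) -> str:
    """An empty trace means capture failed; it is not evidence of zero divergence."""
    return "ok" if events else "flagged: nothing captured"
```

An agent that reaches its model through raw HTTP would never pass through such a proxy, which is why an empty event list has to be reported as a capture failure instead of a perfect score.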

Before you run it

Perturbed inputs are an injection-adjacent vector into your agent, and traxr cannot sandbox your agent. Disable side-effectful tools or run inside a container. dry_run=True enumerates every run before any agent executes; max_llm_calls_per_run is enforced inside the capture wrapper. Security notes →
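
As an illustration of what enforcing a call budget inside a wrapper can look like, here is a minimal sketch. The names with_call_budget and CallBudgetExceeded are hypothetical, and traxr's own handling of max_llm_calls_per_run may behave differently.

```python
from typing import Any, Callable


class CallBudgetExceeded(RuntimeError):
    """Raised when a run tries to make more LLM calls than allowed."""


def with_call_budget(create_fn: Callable[..., Any], max_calls: int) -> Callable[..., Any]:
    """Return a wrapper that refuses to forward calls once max_calls have been made."""
    calls = 0

    def guarded(*args: Any, **kwargs: Any) -> Any:
        nonlocal calls
        if calls >= max_calls:
            # Checked before forwarding, so the over-budget call never reaches the endpoint.
            raise CallBudgetExceeded(f"budget of {max_calls} LLM calls exhausted")
        calls += 1
        return create_fn(*args, **kwargs)

    return guarded
```

Placing the check in the wrapper, not in the agent loop, means a perturbed input that sends the agent into a retry spiral is stopped even if the agent's own termination logic fails.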