traxr

Evaluate multi-agent systems beyond final-answer accuracy. A multi-agent system can land the right answer through the wrong process, and answer-level metrics never see it. traxr evaluates the execution trace itself: point your agent at your data, run paired experiments, and measure how its behavior diverged. The measurements cover how much it diverged (d_norm), where the divergence started (t*), how it manifested, and what it cost in tokens. Controlled input perturbation is the instrument; the trace is the measurement.

traxr operationalizes the paper “Trace-Level Analysis of Information Contamination in Multi-Agent Systems” (CAIS 2026; Mazhar, Suri, Galhotra) as an SDK. It works with any Python agent that talks to an OpenAI-compatible endpoint through the OpenAI SDK.

How an experiment works

Perturb: one operator corrupts a copy of one file, seeded, deterministic, single-variable.
Run paired: your agent runs on the clean file, then on each corrupted copy. Every run uses the same seed and a fresh temp dir, and the file keeps its original basename, so the agent cannot tell which condition it is in.
Compare traces: every LLM call, tool call, and routing decision is an event; paired traces are aligned and compared structurally.
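The three steps above can be sketched in plain Python to make the idea concrete. This is an illustration only and does not use traxr's API: `stage_copy`, `bump_one_digit`, `toy_agent`, `first_divergence`, and `normalized_distance` are hypothetical names, the digit-bumping operator is invented for the example, and the edit-distance ratio from `difflib` is an assumed stand-in for a divergence score, not the paper's definition of d_norm or t*.

```python
import random
import shutil
import tempfile
from difflib import SequenceMatcher
from pathlib import Path
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (kind, name), e.g. ("tool_call", "read_file")


def stage_copy(src: Path) -> Path:
    """Copy src into a fresh temp dir, keeping its original basename."""
    workdir = Path(tempfile.mkdtemp(prefix="paired_run_"))
    dst = workdir / src.name
    shutil.copy(src, dst)
    return dst


def bump_one_digit(path: Path, seed: int) -> None:
    """Hypothetical operator: change exactly one ASCII digit, picked by a seeded RNG."""
    text = path.read_text()
    positions = [i for i, ch in enumerate(text) if ch in "0123456789"]
    if not positions:
        return  # nothing to perturb
    i = random.Random(seed).choice(positions)
    path.write_text(text[:i] + str((int(text[i]) + 1) % 10) + text[i + 1:])


def toy_agent(path: Path) -> List[Event]:
    """Stand-in agent: records each step as an event instead of calling an LLM."""
    trace: List[Event] = [("tool_call", "read_file")]
    total = sum(int(tok) for tok in path.read_text().split() if tok.isdecimal())
    trace.append(("llm_call", "summarize"))
    # The routing decision depends on the data, so a corrupted input can change it.
    trace.append(("route", "escalate" if total > 100 else "approve"))
    return trace


def first_divergence(clean: List[Event], corrupted: List[Event]) -> Optional[int]:
    """Index of the first event where the two traces differ, or None if identical."""
    for i, (a, b) in enumerate(zip(clean, corrupted)):
        if a != b:
            return i
    if len(clean) != len(corrupted):
        return min(len(clean), len(corrupted))
    return None


def normalized_distance(clean: List[Event], corrupted: List[Event]) -> float:
    """Assumed stand-in score in [0, 1]: 0 means identical event sequences."""
    return 1.0 - SequenceMatcher(None, clean, corrupted, autojunk=False).ratio()
```

A paired run under these assumptions would call `stage_copy` once for the clean condition and once per corrupted condition, apply `bump_one_digit` to the corrupted copies with a fixed seed, run `toy_agent` on each staged path, and pass the resulting traces to `first_divergence` and `normalized_distance`.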

Start with the quickstart, including how to score free-text answers and how to expose your agents with traxr.emit. Check which agents are traceable, and read the security notes before running an agent with real tools.

Install

pip install "traxr[document,openai,pandas] @ git+https://github.com/anna-mazhar/traxr.git@main"

extra	provides
`document`	PDF + XLSX support (PyMuPDF, pdfplumber, openpyxl)
`openai`	the built-in reference agent's LLM client
`pandas`	DataFrame export; required by the built-in reference agent
`langgraph`	the LangGraph adapter
`viz`	matplotlib plots over results

External agents that bring their own OpenAI client need no extras at all.