API reference
The curated public surface. Everything here is importable from traxr
unless noted.
Experiments
traxr.experiment.Experiment
A controlled-perturbation experiment over one agent and its data.
Exactly one of agent (a stateless :data:~traxr.agents.AgentRunner
callable, reused across runs), agent_factory (zero-arg factory called
once per run, the fresh-state path), or llm (the built-in reference
agent over your :class:~traxr.llm.LLMClient) must be given.
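The exactly-one-of rule above can be sketched as a small validation helper. This is a minimal illustration, not the library's code: `pick_agent_source` is a hypothetical name, and `ValueError` stands in for whatever configuration error the library actually raises.

```python
from typing import Any, Callable, Optional


def pick_agent_source(
    agent: Optional[Callable[..., Any]] = None,
    agent_factory: Optional[Callable[[], Any]] = None,
    llm: Optional[Any] = None,
) -> str:
    """Return the name of the single option that was supplied (hypothetical helper)."""
    supplied = {
        name: value
        for name, value in (
            ("agent", agent),
            ("agent_factory", agent_factory),
            ("llm", llm),
        )
        if value is not None
    }
    if len(supplied) != 1:
        # Stand-in for the library's fail-fast configuration error.
        names = sorted(supplied) or "none"
        raise ValueError(f"expected exactly one of agent/agent_factory/llm, got {names}")
    return next(iter(supplied))
```

Passing none of the three, or more than one, fails before any run is planned, which matches the fail-fast behaviour the reference describes for configuration problems.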
run
run(dry_run=False)
Run the experiment (or, with dry_run=True, just plan it).
traxr.experiment.ExperimentConfig
dataclass
Knobs for :class:Experiment (frozen; the controlled-variable invariant).
Attributes:
| Name | Type | Description |
|---|---|---|
| perturbations | str \| Sequence[PerturbationType] | |
| max_steps / max_tokens / enable_web_tools / enable_python_tool | | Built-in-agent knobs (ignored for external agents). |
| max_llm_calls_per_run | int \| None | External-agent budget, enforced inside the Tier 0 wrapper; the only honest runaway bound for code we don't own. |
| store_llm_content | bool | Include raw LLM/tool content in trace payloads (hashes only by default; final answers are always stored raw). |
| require_sequential | bool | Raise instead of warn when concurrent LLM calls are detected during a run. |
| scorer | Scorer | |
| on_run_error | str | |
| keep_artifacts | bool | Keep the per-run temp dirs (perturbed file copies). |
| noise_floor_runs | int \| None | Clean re-runs measuring the nondeterminism floor. |
| max_permutations | int | Matrix size cap (:class: |
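What "frozen" means in practice can be shown with a plain standard-library dataclass. The sketch below is an assumption-laden stand-in: `DemoConfig` is hypothetical, it carries only two of the documented knobs, and its default values are illustrative rather than the library's.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in; field defaults are assumptions."""

    max_llm_calls_per_run: Optional[int] = None
    store_llm_content: bool = False


def mutation_is_rejected(config: DemoConfig) -> bool:
    """Return True when in-place assignment fails, as it does for frozen dataclasses."""
    try:
        config.store_llm_content = True  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False


def with_content_stored(config: DemoConfig) -> DemoConfig:
    """Derive a new config instead of mutating the existing one."""
    return replace(config, store_llm_content=True)
```

Deriving a changed copy with `dataclasses.replace` leaves the original untouched, so the configuration used for a clean run and its perturbed counterpart cannot drift apart by accident.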
traxr.experiment.ExperimentPlan
dataclass
The execution plan run(dry_run=True) returns (no agent ran).
Results
traxr.results.ExperimentResults
dataclass
Everything one Experiment.run() produced.
Attributes:
| Name | Type | Description |
|---|---|---|
| pairs | list[PairResult] | One :class: |
| traces | dict[str, dict[str, Any]] | |
| answers | dict[str, str \| None] | |
| fingerprint | dict[str, Any] | Environment/config fingerprint for reproducibility. |
| noise_floor | float \| None | Baseline-vs-itself |
| noise_floor_runs | int | How many clean re-runs measured the floor. |
manifestation_prevalence
manifestation_prevalence()
Fraction of scored pairs per fine manifestation category.
divergence_summary
divergence_summary()
Count / mean / max of d_norm and mean t*_norm over measured pairs.
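This reference does not define d_norm or t*_norm, so the following is only one plausible reading: a Levenshtein distance over per-event signatures divided by the longer trace length, and the index of the first differing event divided by the same length. All function names and the normalization choice are assumptions.

```python
from typing import Optional, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of event signatures."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def demo_d_norm(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: distance divided by the longer trace length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def demo_t_star_norm(a: Sequence[str], b: Sequence[str]) -> Optional[float]:
    """Assumed definition: first differing index over the longer length, None if identical."""
    longest = max(len(a), len(b))
    for index in range(longest):
        if index >= len(a) or index >= len(b) or a[index] != b[index]:
            return index / longest
    return None
```

Under these assumptions, a perturbed trace that inserts a single extra step into a four-step clean trace has a small d_norm and a t*_norm that marks where the two runs first part ways.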
recovery_rate
recovery_rate()
Fraction of diverged pairs whose answer survived (recovery=True).
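As a rough illustration of the fraction described above, the sketch below computes it over hypothetical pair records. `DemoPair`, its `diverged` field, and the choice to return `None` when nothing diverged are assumptions; only the `recovery` flag is named in this reference.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DemoPair:
    """Hypothetical pair record; field names are illustrative."""

    diverged: bool
    recovery: bool


def demo_recovery_rate(pairs: Sequence[DemoPair]) -> Optional[float]:
    """Share of diverged pairs whose answer survived."""
    diverged = [pair for pair in pairs if pair.diverged]
    if not diverged:
        # Assumption: the rate is undefined when no pair diverged.
        return None
    return sum(pair.recovery for pair in diverged) / len(diverged)
```

Pairs that never diverged are left out of the denominator, so the rate describes robustness after a divergence rather than overall accuracy.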
token_overhead_summary
token_overhead_summary()
Mean / max token-inflation ratio over pairs with usage data.
to_dict
to_dict(*, include_traces=True)
Canonical dict form (timestamps excluded; see :meth:to_json).
to_json
to_json(path=None, *, include_traces=True)
Canonical JSON: sorted keys, timestamps excluded, byte-stable for deterministic (stub / mock-transport) experiments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Any | Optional file path to also write the JSON to. | None |
| include_traces | bool | Include the full serialized traces. | True |
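The byte-stability claim rests on two ingredients named above: sorted keys and excluded timestamps. A minimal sketch of that idea with the standard `json` module follows; the set of timestamp key names and the compact separators are assumptions, not the library's actual format.

```python
import json
from typing import Any

# Assumption: volatile, timestamp-like fields live under these keys.
VOLATILE_KEYS = frozenset({"timestamp", "started_at", "finished_at"})


def strip_volatile(value: Any) -> Any:
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_volatile(item)
            for key, item in value.items()
            if key not in VOLATILE_KEYS
        }
    if isinstance(value, list):
        return [strip_volatile(item) for item in value]
    return value


def canonical_json(data: Any) -> str:
    """Serialize with sorted keys so equal content yields equal bytes."""
    return json.dumps(strip_volatile(data), sort_keys=True, separators=(",", ":"))
```

Two result dicts that differ only in key insertion order or in their timestamps serialize to the same string, which is what makes golden-file comparisons practical for stubbed experiments.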
to_dataframe
to_dataframe()
The pairs as a pandas DataFrame (needs the [pandas] extra).
to_report
to_report(fmt='md')
Human-readable report ("md" or "html").
"md" is a plain markdown document (good for terminals and PRs).
"html" is a single self-contained file (inline styles, no scripts,
no external assets) that embeds the :mod:traxr.viz figures when
matplotlib is installed and degrades gracefully when it is not.
summary
summary()
Compact printable summary.
Reads top-down as a diagnosis: how many pairs ran, how perturbations manifested, how far the traces diverged (against the noise floor), whether the answer survived, and what it cost.
traxr.results.PairResult
dataclass
Metrics for one clean-vs-perturbed pair.
scored
property
scored
Whether this pair produced metrics (the perturbed run happened).
Capture
traxr.capture.openai_wrap.instrument
instrument(client)
Capture LLM calls made through client during Traxr runs.
Wraps client.chat.completions.create in place (sync OpenAI or
AsyncOpenAI, including streaming) and returns the same client.
Construct your agent with the instrumented client; outside a Traxr run
the wrapper is a pure passthrough, so the agent keeps working standalone.
Idempotent: instrumenting an already-instrumented client is a no-op.
Raises:
| Type | Description |
|---|---|
| TypeError | If |
traxr.capture.patch.patch_openai
patch_openai()
Capture LLM calls from every OpenAI client created inside the context.
Raises:
| Type | Description |
|---|---|
| OptionalDependencyError | If the |
traxr.capture.context.emit
emit(event_type, payload=None, *, agent_name='user')
Manually emit a trace event from inside your agent (escape hatch).
The event lands at the current step of the active run. Unregistered event
types fall back to the unknown:{event_type} signature (with a one-time
:class:~traxr.errors.UnknownEventTypeWarning); upgrade them via
:func:traxr.register_signature.
Outside a Traxr run this is a no-op (same passthrough principle as
instrument()), so agents that call traxr.emit() keep working
standalone.
The agent contract
traxr.agents.task.Task
dataclass
One run's input, as handed to an :data:AgentRunner.
Attributes:
| Name | Type | Description |
|---|---|---|
| question | str | The task question. |
| files | tuple[Path, ...] | Input artifact paths. Clean runs receive the originals; perturbed runs receive copies with the ORIGINAL basenames in a fresh temp dir, so file names never leak the condition. |
| run_label | str | |
| metadata | Mapping[str, Any] | Extra experiment context (never condition-revealing). |
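The file-handling rule for perturbed runs (same basenames, fresh temp dir) can be sketched with the standard library. `stage_copies` is a hypothetical helper; it only copies files and does not apply any perturbation, and it assumes the input basenames are unique.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Sequence, Tuple


def stage_copies(originals: Sequence[Path]) -> Tuple[Path, ...]:
    """Copy files into a fresh temp dir, keeping each original basename."""
    # The neutral prefix avoids hinting at the experimental condition.
    staging_dir = Path(tempfile.mkdtemp(prefix="run-"))
    staged = []
    for source in originals:
        target = staging_dir / source.name  # same basename, new directory
        shutil.copy2(source, target)
        staged.append(target)
    return tuple(staged)
```

An agent that inspects only file names sees `data.csv` in both conditions; the directory differs, but nothing in the path says whether the content was perturbed.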
traxr.agents.task.invoke_agent
invoke_agent(
runner,
task,
collector,
*,
max_llm_calls_per_run=None,
store_llm_content=False,
session=None,
)
Run runner on task with Tier 0 capture bound to collector.
The harness, not the agent, emits the final_answer event from the
return value. An agent exception is recorded as an agent_error event
(the partial trace stays analyzable) and re-raised; the record-vs-raise
policy is the experiment runner's concern.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runner | AgentRunner | The agent callable. | required |
| task | Task | This run's input. | required |
| collector | TraceCollector | The run's trace collector. | required |
| max_llm_calls_per_run | int \| None | LLM-call budget enforced inside the Tier 0 wrapper (the only honest runaway bound for code we don't own). | None |
| store_llm_content | bool | Include raw LLM/tool content in event payloads instead of hashes only. | False |
| session | CaptureSession \| None | Pre-built capture session to bind instead of constructing one (its collector must be | None |
Returns:
| Type | Description |
|---|---|
| str | The agent's final answer. |
Raises:
| Type | Description |
|---|---|
| AgentContractError | If the agent returns a non- |
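The division of labour described for invoke_agent (the harness records the final answer, records and re-raises agent failures, and rejects non-string returns) might look roughly like this. It is a simplified sketch: the runner takes a plain question string instead of a Task, events are plain dicts, and `TypeError` stands in for the library's contract error.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def run_with_harness(
    runner: Callable[[str], Any], question: str, events: List[Event]
) -> str:
    """Run the agent and let the harness, not the agent, record the outcome."""
    try:
        answer = runner(question)
    except Exception as exc:
        # Keep the partial trace analyzable, then let the caller decide.
        events.append({"event_type": "agent_error", "error": repr(exc)})
        raise
    if not isinstance(answer, str):
        # Stand-in for the library's agent-contract error.
        raise TypeError(f"agent returned {type(answer).__name__}, expected str")
    events.append({"event_type": "final_answer", "answer": answer})
    return answer
```

Re-raising after recording keeps the policy decision (abort the experiment or record the failure and continue) with the caller, which is where the reference places it.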
traxr.agents.langgraph.from_langgraph
from_langgraph(
compiled_graph,
input_builder=None,
output_extractor=None,
)
Wrap a compiled LangGraph as an AgentRunner with Tier 1 capture.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compiled_graph | Any | The compiled graph (anything with | required |
| input_builder | Callable[[Task], Any] \| None | | None |
| output_extractor | Callable[[Any], str] \| None | | None |
Raises:
| Type | Description |
|---|---|
| OptionalDependencyError | If langchain-core is not installed. |
traxr.agents.builtin.builtin_agent
builtin_agent(
llm,
*,
enable_web_tools=False,
enable_python_tool=True,
max_steps=12,
max_tokens=100000,
python_tool_timeout=30.0,
seed=42,
)
Build a factory of :class:BuiltinAgent instances over the reference MAS.
The returned zero-argument factory is called once per run (clean baseline or perturbation), so every run gets fresh router/plan state while sharing the same LLM client and configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| llm | LLMClient | Any :class: | required |
| enable_web_tools | bool | Register web_search/web_fetch tools. Default OFF; opt-in only, keeps runs offline and deterministic. | False |
| enable_python_tool | bool | Register the (subprocess-sandboxed) python tool. | True |
| max_steps | int | Episode step budget. | 12 |
| max_tokens | int \| None | Episode token budget ( | 100000 |
| python_tool_timeout | float | Seconds before LLM-written code is killed. | 30.0 |
| seed | int | Seed recorded in the episode spec and used by retrieval. | 42 |
Returns:
| Type | Description |
|---|---|
| Callable[[], BuiltinAgent] | A zero-argument callable producing fresh :class: instances. |
Raises:
| Type | Description |
|---|---|
| OptionalDependencyError | If the reference agent's dependencies (pandas, ...) are not installed. |
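Example::

    from traxr.llm import DeterministicLLMStub
    from traxr.agents import builtin_agent

    # Build the per-run factory over a scripted LLM client.
    factory = builtin_agent(llm=DeterministicLLMStub("identity"))
    # Each factory call yields an agent with fresh router/plan state.
    agent = factory()
    answer = agent.run(["data.csv"], "How many rows does the table have?")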
LLM clients
traxr.llm.protocol.LLMClient
Bases: Protocol
Protocol for LLM clients usable with :func:traxr.agents.builtin_agent.
response_type is a routing hint ("plan", "route",
"data_analyst", "synthesize", ...) used to select system prompts
or, for the deterministic stub, scripted replies.
generate
generate(prompt, response_type='default', context=None)
Generate a plain-text response.
generate_with_tools
generate_with_tools(
prompt,
tools,
response_type="default",
context=None,
system_prompt_override=None,
)
Generate a response that may include structured tool calls.
tools is a list of tool schemas (ToolSchema for the built-in
agent). Returned tool_calls items expose tool_name,
operation, arguments and call_id.
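Because the client contract is a Protocol, any object with matching methods qualifies without inheriting from anything. The sketch below mirrors only the documented `generate` signature in a local Protocol; `TextGenerator` and `EchoClient` are hypothetical names, and the `str` return type is an assumption based on "plain-text response".

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class TextGenerator(Protocol):
    """Local mirror of the documented generate() signature only."""

    def generate(
        self,
        prompt: str,
        response_type: str = "default",
        context: Optional[Any] = None,
    ) -> str: ...


class EchoClient:
    """Toy client: structurally compatible, no base class required."""

    def __init__(self) -> None:
        self.call_count = 0

    def generate(
        self,
        prompt: str,
        response_type: str = "default",
        context: Optional[Any] = None,
    ) -> str:
        self.call_count += 1
        # The routing hint is echoed back so its effect is visible.
        return f"[{response_type}] {prompt}"
```

A full client would also need `generate_with_tools`; this sketch stops at the text path to keep the structural-typing point clear.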
traxr.llm.openai_compat.OpenAICompatibleClient
:class:~traxr.llm.LLMClient for any OpenAI-compatible endpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str | Model name as known to the endpoint. | 'gpt-4o-mini' |
| api_key | str \| None | API key; falls back to the | None |
| base_url | str \| None | Endpoint base URL ( | None |
| seed | int | Sampling seed forwarded to the endpoint (reproducibility). | 42 |
| temperature | float \| None | Sampling temperature; defaults to | 0.0 |
| max_retries | int | How many times the OpenAI SDK retries a transient failure (network error, | 2 |
Raises:
| Type | Description |
|---|---|
| OptionalDependencyError | If the |
| LLMConnectionError | If no API key is available. |
call_count
property
call_count
Number of LLM calls made by this client instance.
generate
generate(prompt, response_type='default', context=None)
Generate a plain-text response.
generate_with_tools
generate_with_tools(
prompt,
tools,
response_type="default",
context=None,
system_prompt_override=None,
)
Generate a response with native OpenAI function calling.
generate_with_retrieval
generate_with_retrieval(
prompt, retrieval_content, response_type="research"
)
Generate a response that incorporates retrieval content.
generate_with_image
generate_with_image(
prompt,
image_base64,
media_type="image/png",
response_type="research",
)
Generate a response that analyzes an image (vision-capable models).
Retained from the extracted client for interface fidelity; the built-in agent's image path is inert in v1.
reset_call_count
reset_call_count()
Reset the call counter (determinism between paired runs).
close
close()
Close the underlying HTTP client to release connections.
traxr.llm.stub.DeterministicLLMStub
Scripted :class:~traxr.llm.LLMClient for tests, goldens, and demos.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scenario | str | One of :data: | 'identity' |
| final_answer | str | The answer the | '4' |
| wrong_answer | str | The answer the | '999' |
| script | dict[str, list[StubReply]] \| None | Custom mapping | None |
call_count
property
call_count
Number of LLM calls made since the last reset.
generate
generate(prompt, response_type='default', context=None)
Return the next scripted plain-text reply.
generate_with_tools
generate_with_tools(
prompt,
tools,
response_type="default",
context=None,
system_prompt_override=None,
)
Return the next scripted reply, including any scripted tool calls.
generate_with_retrieval
generate_with_retrieval(
prompt, retrieval_content, response_type="research"
)
Retrieval-augmented text generation (same scripted replies).
reset_call_count
reset_call_count()
Reset all call counters (determinism between paired runs).
Scoring and plots
traxr.scoring.check_answer_match
check_answer_match(expected, actual)
Whether actual matches expected after normalization.
Numeric answers compare as floats (so "42" matches "42.0");
everything else is normalized string equality.
    >>> check_answer_match("Paris", " PARIS. ")
    True
    >>> check_answer_match("1,000", "$1000")
    True
    >>> check_answer_match("42", "43")
    False
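One way to get the behaviour shown in the examples above is to normalize both strings, try a numeric comparison, and fall back to string equality. The sketch below is not the library's implementation: `demo_answer_match` and its exact normalization rules (which punctuation is trimmed, which symbols are ignored in numbers) are assumptions chosen to reproduce the documented cases.

```python
import re
from typing import Optional


def _normalize(text: str) -> str:
    # Trim whitespace, lowercase, and drop trailing sentence punctuation.
    return text.strip().lower().rstrip(".!?").strip()


def _as_number(text: str) -> Optional[float]:
    # Assumption: a dollar sign, thousands separators, and spaces are ignored.
    cleaned = re.sub(r"[$,\s]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def demo_answer_match(expected: str, actual: str) -> bool:
    """Numeric comparison when both sides parse as numbers, else string equality."""
    left, right = _normalize(expected), _normalize(actual)
    left_num, right_num = _as_number(left), _as_number(right)
    if left_num is not None and right_num is not None:
        return left_num == right_num
    return left == right
```

Comparing as floats is what lets "42" match "42.0", while the string fallback keeps the check deterministic for non-numeric answers.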
traxr.scoring.llm_judge_match
llm_judge_match(expected, actual, llm=None)
Semantic match via LLM judge. Opt-in only, usable directly as
ExperimentConfig(scorer=llm_judge_match).
Unlike :func:check_answer_match, this is not deterministic: results
can vary between runs and providers. Use it when literal/numeric
matching is too strict for your question's expected phrasing.
If llm is omitted, lazily constructs and caches a default
OpenAICompatibleClient; this reads OPENAI_API_KEY from the
environment and makes a live network call per scored answer. Pass your
own llm (e.g. via functools.partial) to use another provider or
avoid the implicit network/API-key dependency.
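The functools.partial suggestion above amounts to pre-binding the `llm` argument so the result is a two-argument scorer. The sketch uses a hypothetical `demo_judge` with the same `(expected, actual, llm=None)` shape and a toy `AlwaysYes` client; the prompt wording and the yes/no parsing are assumptions, and unlike the documented function this stand-in raises instead of constructing a default client.

```python
from functools import partial
from typing import Any, Callable, Optional


def demo_judge(expected: str, actual: str, llm: Optional[Any] = None) -> bool:
    """Hypothetical judge with the documented (expected, actual, llm=None) shape."""
    if llm is None:
        raise RuntimeError("no judge client bound")
    verdict = llm.generate(
        f"Expected: {expected}\nActual: {actual}\nSame meaning? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")


class AlwaysYes:
    """Toy client that approves every comparison."""

    def generate(
        self, prompt: str, response_type: str = "default", context: Any = None
    ) -> str:
        return "Yes"


# Binding llm up front yields a plain two-argument scorer.
scorer: Callable[[str, str], bool] = partial(demo_judge, llm=AlwaysYes())
```

The bound callable no longer depends on environment variables or an implicit default client, so the provider used for judging is explicit at the call site.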
traxr.viz.plot_d_norm
plot_d_norm(results, ax=None)
Per-pair normalized edit distance, with the noise floor when measured.
traxr.viz.plot_t_star
plot_t_star(results, ax=None, bins=10)
Histogram of normalized first-divergence positions (t*/T).
traxr.viz.plot_manifestations
plot_manifestations(results, ax=None)
Manifestation-category prevalence over scored pairs.
Extending the event vocabulary
traxr.trace.registry.register_signature
register_signature(
event_type,
signature,
*,
classifier=None,
structural_types=frozenset(),
key_fields_equal=None,
)
Upgrade a custom event type from the unknown:{event_type} fallback.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| event_type | str | The custom event type emitted via | required |
| signature | SignatureFn | | required |
| classifier | ClassifierFn \| None | Optional | None |
| structural_types | frozenset[str] \| set[str] | The divergence-type strings the classifier can return (so | frozenset() |
| key_fields_equal | KeyFieldsFn \| None | Optional semantic-equality predicate on payloads. | None |
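The fallback-then-upgrade behaviour can be pictured as a small registry. This is a toy model under stated assumptions: the signature function is taken to map a payload dict to a string, `demo_register` and `demo_signature` are hypothetical names, and a plain `UserWarning` stands in for the library's warning class.

```python
import warnings
from typing import Any, Callable, Dict, Set

# Assumption: a signature function maps an event payload to a string.
SignatureFn = Callable[[Dict[str, Any]], str]

_signatures: Dict[str, SignatureFn] = {}
_warned: Set[str] = set()


def demo_register(event_type: str, signature: SignatureFn) -> None:
    """Upgrade an event type from the fallback to a real signature."""
    _signatures[event_type] = signature


def demo_signature(event_type: str, payload: Dict[str, Any]) -> str:
    """Return the registered signature, or the unknown:{event_type} fallback."""
    fn = _signatures.get(event_type)
    if fn is not None:
        return fn(payload)
    if event_type not in _warned:
        # Warn only the first time each unknown type is seen.
        _warned.add(event_type)
        warnings.warn(
            f"unregistered event type {event_type!r}", UserWarning, stacklevel=2
        )
    return f"unknown:{event_type}"
```

Until a type is registered, every event of that type collapses to the same fallback signature, so differences in its payload cannot show up in divergence metrics.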
Errors and warnings
traxr.errors
Typed exception and warning hierarchy for Traxr.
Principle: fail loud with actionable messages; never silently produce wrong metrics. User-configuration errors fail fast; data-dependent failures are skipped and recorded (as warnings collected per pair).
All exceptions derive from :class:TraxrError; all warnings derive from
:class:TraxrWarning.
TraxrError
Bases: Exception
Base class for all Traxr exceptions.
UnsupportedModalityError
Bases: TraxrError
Unknown extension or unsupported modality.
Raised for docx/pptx/image/audio inputs in v1. The message names what IS supported (CSV/XLSX/TXT/MD/PDF) and points at the roadmap for the rest.
ModalityMismatchError
Bases: TraxrError
Declared modality does not match the detected modality of the file.
InvalidArtifactError
Bases: TraxrError
Input file is missing, unreadable, or corrupt.
OptionalDependencyError
Bases: TraxrError
An optional dependency needed for the requested feature is not installed.
Raised when openai/PyMuPDF/pdfplumber/openpyxl/matplotlib/langchain-core
is needed but missing. The message names the pip extra that provides it
(e.g. pip install "traxr[document]").
LLMConnectionError
Bases: TraxrError
Missing/invalid API key or unreachable base_url (built-in agent path).
AgentContractError
Bases: TraxrError
The user's agent violated the AgentRunner contract.
For example, it returned a non-str value.
RunBudgetExceeded
Bases: TraxrError
The agent exceeded max_llm_calls_per_run.
Raised inside the agent by the Tier 0 capture wrapper.
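A call budget enforced inside a wrapper can be sketched as a counting closure. `with_call_budget` and `DemoBudgetExceeded` are hypothetical names, and the wrapped callable is a simplified stand-in for an LLM call.

```python
from typing import Callable, Optional


class DemoBudgetExceeded(RuntimeError):
    """Hypothetical stand-in for the library's budget error."""


def with_call_budget(
    call: Callable[[str], str], max_calls: Optional[int]
) -> Callable[[str], str]:
    """Wrap a call so it raises once the budget is used up (None means unlimited)."""
    used = 0

    def guarded(prompt: str) -> str:
        nonlocal used
        if max_calls is not None and used >= max_calls:
            raise DemoBudgetExceeded(f"LLM call budget of {max_calls} exhausted")
        used += 1
        return call(prompt)

    return guarded
```

Since the exception surfaces at the call site inside the agent, a looping agent is stopped at its next LLM call rather than after the run finishes.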
ExperimentConfigError
Bases: TraxrError
Invalid experiment configuration.
Raised when more than one, or none, of agent/agent_factory/llm is
resolvable, or for other fail-fast configuration problems.
MatrixTooLargeError
Bases: TraxrError
The permutation matrix exceeds the configured cap.
MalformedEventError
Bases: TraxrError
Malformed or missing event payload, or empty/non-string event_type.
ControlledVariableError
Bases: TraxrError
Experiment configuration was mutated between paired runs.
TraxrWarning
Bases: UserWarning
Base class for all Traxr warnings (non-fatal, collected per pair).
UnknownEventTypeWarning
Bases: TraxrWarning
An unregistered event type fell back to the unknown:{event_type} signature.
Emitted once per unknown type; upgrade the type via
traxr.register_signature() to include its structure in divergence
metrics.
EmptyTraceWarning
Bases: TraxrWarning
A run produced no trace events.
NonDeterminismWarning
Bases: TraxrWarning
Paired runs differ by more than the measured noise floor would predict.
ConcurrentTraceWarning
Bases: TraxrWarning
Concurrent LLM calls detected while tracing a single run.
TokenUnavailableWarning
Bases: TraxrWarning
Token usage could not be captured for one or more LLM calls.
PerturbationSkippedWarning
Bases: TraxrWarning
A perturbation was not applicable and was skipped (recorded).
Corresponds to the engine reporting applied=False with a
skip_reason.