Skip to content

API reference

The curated public surface. Everything here is importable from traxr unless noted.

Experiments

traxr.experiment.Experiment

A controlled-perturbation experiment over one agent and its data.

Exactly one of agent (a stateless :data:~traxr.agents.AgentRunner callable, reused across runs), agent_factory (zero-arg factory called once per run, the fresh-state path), or llm (the built-in reference agent over your :class:~traxr.llm.LLMClient) must be given.

run

run(dry_run=False)

Run the experiment (or, with dry_run=True, just plan it).

traxr.experiment.ExperimentConfig dataclass

Knobs for :class:Experiment (frozen; the controlled-variable invariant).

Attributes:

Name Type Description
perturbations str | Sequence[PerturbationType]

"all" or an explicit operator list.

max_steps / max_tokens / enable_web_tools / enable_python_tool

Built-in-agent knobs (ignored for external agents).

max_llm_calls_per_run int | None

External-agent budget, enforced inside the Tier 0 wrapper; the only honest runaway bound for code we don't own.

store_llm_content bool

Include raw LLM/tool content in trace payloads (hashes only by default; final answers are always stored raw).

require_sequential bool

Raise instead of warn when concurrent LLM calls are detected during a run.

scorer Scorer

(expected, actual) -> bool for task_success.

on_run_error str

"record" keeps a CRASHED run record and continues; "raise" propagates the agent's exception.

keep_artifacts bool

Keep the per-run temp dirs (perturbed file copies).

noise_floor_runs int | None

Clean re-runs measuring the nondeterminism floor. None means the agent-kind default: 1 for external agents, 0 for the built-in agent.

max_permutations int

Matrix size cap (:class:MatrixTooLargeError).

traxr.experiment.ExperimentPlan dataclass

The execution plan run(dry_run=True) returns (no agent ran).

Results

traxr.results.ExperimentResults dataclass

Everything one Experiment.run() produced.

Attributes:

Name Type Description
pairs list[PairResult]

One :class:PairResult per permutation.

traces dict[str, dict[str, Any]]

run_label -> serialized trace (collector to_dict()).

answers dict[str, str | None]

run_label -> raw final answer (stored raw by design, since scoring and answer_changed need them; see the security docs).

fingerprint dict[str, Any]

Environment/config fingerprint for reproducibility.

noise_floor float | None

Baseline-vs-itself d_norm (None when unmeasured).

noise_floor_runs int

How many clean re-runs measured the floor.

manifestation_prevalence

manifestation_prevalence()

Fraction of scored pairs per fine manifestation category.

divergence_summary

divergence_summary()

Count / mean / max of d_norm and mean t*_norm over measured pairs.

recovery_rate

recovery_rate()

Fraction of diverged pairs whose answer survived (recovery=True).

token_overhead_summary

token_overhead_summary()

Mean / max token-inflation ratio over pairs with usage data.

to_dict

to_dict(*, include_traces=True)

Canonical dict form (timestamps excluded; see :meth:to_json).

to_json

to_json(path=None, *, include_traces=True)

Canonical JSON: sorted keys, timestamps excluded, byte-stable for deterministic (stub / mock-transport) experiments.

Parameters:

Name Type Description Default
path Any

Optional file path to also write the JSON to.

None
include_traces bool

Include the full serialized traces.

True

to_dataframe

to_dataframe()

The pairs as a pandas DataFrame (needs the [pandas] extra).

to_report

to_report(fmt='md')

Human-readable report ("md" or "html").

"md" is a plain markdown document (good for terminals and PRs). "html" is a single self-contained file (inline styles, no scripts, no external assets) that embeds the :mod:traxr.viz figures when matplotlib is installed and degrades gracefully when it is not.

summary

summary()

Compact printable summary.

Reads top-down as a diagnosis: how many pairs ran, how perturbations manifested, how far the traces diverged (against the noise floor), whether the answer survived, and what it cost.

traxr.results.PairResult dataclass

Metrics for one clean-vs-perturbed pair.

scored property

scored

Whether this pair produced metrics (the perturbed run happened).

Capture

traxr.capture.openai_wrap.instrument

instrument(client)

Capture LLM calls made through client during Traxr runs.

Wraps client.chat.completions.create in place (sync OpenAI or AsyncOpenAI, including streaming) and returns the same client. Construct your agent with the instrumented client; outside a Traxr run the wrapper is a pure passthrough, so the agent keeps working standalone.

Idempotent: instrumenting an already-instrumented client is a no-op.

Raises:

Type Description
TypeError

If client has no chat.completions.create.

traxr.capture.patch.patch_openai

patch_openai()

Capture LLM calls from every OpenAI client created inside the context.

Raises:

Type Description
OptionalDependencyError

If the openai package is not installed.

traxr.capture.context.emit

emit(event_type, payload=None, *, agent_name='user')

Manually emit a trace event from inside your agent (escape hatch).

The event lands at the current step of the active run. Unregistered event types fall back to the unknown:{event_type} signature (with a one-time :class:~traxr.errors.UnknownEventTypeWarning); upgrade them via :func:traxr.register_signature.

Outside a Traxr run this is a no-op (same passthrough principle as instrument()), so agents that call traxr.emit() keep working standalone.

The agent contract

traxr.agents.task.Task dataclass

One run's input, as handed to an :data:AgentRunner.

Attributes:

Name Type Description
question str

The task question.

files tuple[Path, ...]

Input artifact paths. Clean runs receive the originals; perturbed runs receive copies with the ORIGINAL basenames in a fresh temp dir, so file names never leak the condition.

run_label str

"baseline" or the perturbation name (informational).

metadata Mapping[str, Any]

Extra experiment context (never condition-revealing).

traxr.agents.task.invoke_agent

invoke_agent(
    runner,
    task,
    collector,
    *,
    max_llm_calls_per_run=None,
    store_llm_content=False,
    session=None,
)

Run runner on task with Tier 0 capture bound to collector.

The harness, not the agent, emits the final_answer event from the return value. An agent exception is recorded as an agent_error event (the partial trace stays analyzable) and re-raised; the record-vs-raise policy is the experiment runner's concern.

Parameters:

Name Type Description Default
runner AgentRunner

The agent callable.

required
task Task

This run's input.

required
collector TraceCollector

The run's trace collector.

required
max_llm_calls_per_run int | None

LLM-call budget enforced inside the Tier 0 wrapper (the only honest runaway bound for code we don't own).

None
store_llm_content bool

Include raw LLM/tool content in event payloads instead of hashes only.

False
session CaptureSession | None

Pre-built capture session to bind instead of constructing one (its collector must be collector). Lets the caller read per-run session state afterwards (e.g. concurrent_detected); the budget/content keywords are ignored when given.

None

Returns:

Type Description
str

The agent's final answer.

Raises:

Type Description
AgentContractError

If the agent returns a non-str value.

traxr.agents.langgraph.from_langgraph

from_langgraph(
    compiled_graph,
    input_builder=None,
    output_extractor=None,
)

Wrap a compiled LangGraph as an AgentRunner with Tier 1 capture.

Parameters:

Name Type Description Default
compiled_graph Any

The compiled graph (anything with invoke(input, config=...)).

required
input_builder Callable[[Task], Any] | None

(Task) -> graph input. Default builds a messages-state input from the question + file-path listing.

None
output_extractor Callable[[Any], str] | None

(graph output) -> str. Default takes the last message's content from a messages-state result.

None

Raises:

Type Description
OptionalDependencyError

If langchain-core is not installed.

traxr.agents.builtin.builtin_agent

builtin_agent(
    llm,
    *,
    enable_web_tools=False,
    enable_python_tool=True,
    max_steps=12,
    max_tokens=100000,
    python_tool_timeout=30.0,
    seed=42,
)

Build a factory of :class:BuiltinAgent instances over the reference MAS.

The returned zero-argument factory is called once per run (clean baseline or perturbation), so every run gets fresh router/plan state while sharing the same LLM client and configuration.

Parameters:

Name Type Description Default
llm LLMClient

Any :class:~traxr.llm.LLMClient (OpenAICompatibleClient, DeterministicLLMStub, or your own implementation).

required
enable_web_tools bool

Register web_search/web_fetch tools. Default OFF; opt-in only, keeps runs offline and deterministic.

False
enable_python_tool bool

Register the (subprocess-sandboxed) python tool.

True
max_steps int

Episode step budget.

12
max_tokens int | None

Episode token budget (None disables the check).

100000
python_tool_timeout float

Seconds before LLM-written code is killed.

30.0
seed int

Seed recorded in the episode spec and used by retrieval.

42

Returns:

Type Description
Callable[[], BuiltinAgent]

A zero-argument callable producing fresh :class:BuiltinAgent

Callable[[], BuiltinAgent]

instances.

Raises:

Type Description
OptionalDependencyError

If the reference agent's dependencies (pandas, ...) are not installed.

Example::

from traxr.llm import DeterministicLLMStub
from traxr.agents import builtin_agent

factory = builtin_agent(llm=DeterministicLLMStub("identity"))
agent = factory()
answer = agent.run(["data.csv"], "How many rows does the table have?")

LLM clients

traxr.llm.protocol.LLMClient

Bases: Protocol

Protocol for LLM clients usable with :func:traxr.agents.builtin_agent.

response_type is a routing hint ("plan", "route", "data_analyst", "synthesize", ...) used to select system prompts or, for the deterministic stub, scripted replies.

generate

generate(prompt, response_type='default', context=None)

Generate a plain-text response.

generate_with_tools

generate_with_tools(
    prompt,
    tools,
    response_type="default",
    context=None,
    system_prompt_override=None,
)

Generate a response that may include structured tool calls.

tools is a list of tool schemas (ToolSchema for the built-in agent). Returned tool_calls items expose tool_name, operation, arguments and call_id.

traxr.llm.openai_compat.OpenAICompatibleClient

:class:~traxr.llm.LLMClient for any OpenAI-compatible endpoint.

Parameters:

Name Type Description Default
model str

Model name as known to the endpoint.

'gpt-4o-mini'
api_key str | None

API key; falls back to the OPENAI_API_KEY environment variable. Local servers (Ollama, LM Studio, ...) usually accept any non-empty string.

None
base_url str | None

Endpoint base URL (None = api.openai.com).

None
seed int

Sampling seed forwarded to the endpoint (reproducibility).

42
temperature float | None

Sampling temperature; defaults to 0.0 for the controlled-variable invariant. None omits the parameter.

0.0
max_retries int

How many times the OpenAI SDK retries a transient failure (network error, 429, 5xx) before raising. Forwarded to openai.OpenAI(max_retries=...); defaults to 2 (the SDK default). Set 0 to disable automatic retries.

2

Raises:

Type Description
OptionalDependencyError

If the openai package is not installed.

LLMConnectionError

If no API key is available.

call_count property

call_count

Number of LLM calls made by this client instance.

generate

generate(prompt, response_type='default', context=None)

Generate a plain-text response.

generate_with_tools

generate_with_tools(
    prompt,
    tools,
    response_type="default",
    context=None,
    system_prompt_override=None,
)

Generate a response with native OpenAI function calling.

generate_with_retrieval

generate_with_retrieval(
    prompt, retrieval_content, response_type="research"
)

Generate a response that incorporates retrieval content.

generate_with_image

generate_with_image(
    prompt,
    image_base64,
    media_type="image/png",
    response_type="research",
)

Generate a response that analyzes an image (vision-capable models).

Retained from the extracted client for interface fidelity; the built-in agent's image path is inert in v1.

reset_call_count

reset_call_count()

Reset the call counter (determinism between paired runs).

close

close()

Close the underlying HTTP client to release connections.

traxr.llm.stub.DeterministicLLMStub

Scripted :class:~traxr.llm.LLMClient for tests, goldens, and demos.

Parameters:

Name Type Description Default
scenario str

One of :data:SCENARIOS (ignored when script is given).

'identity'
final_answer str

The answer the identity/reroute/loop synthesizer reply returns. Match it to the fixture data (e.g. the row count printed by the scripted analyst code).

'4'
wrong_answer str

The answer the wrong_answer scenario returns.

'999'
script dict[str, list[StubReply]] | None

Custom mapping response_type -> list[StubReply]; overrides the scenario entirely.

None

call_count property

call_count

Number of LLM calls made since the last reset.

generate

generate(prompt, response_type='default', context=None)

Return the next scripted plain-text reply.

generate_with_tools

generate_with_tools(
    prompt,
    tools,
    response_type="default",
    context=None,
    system_prompt_override=None,
)

Return the next scripted reply, including any scripted tool calls.

generate_with_retrieval

generate_with_retrieval(
    prompt, retrieval_content, response_type="research"
)

Retrieval-augmented text generation (same scripted replies).

reset_call_count

reset_call_count()

Reset all call counters (determinism between paired runs).

Scoring and plots

traxr.scoring.check_answer_match

check_answer_match(expected, actual)

Whether actual matches expected after normalization.

Numeric answers compare as floats (so "42" matches "42.0"); everything else is normalized string equality.

check_answer_match("Paris", " PARIS. ") True check_answer_match("1,000", "$1000") True check_answer_match("42", "43") False

traxr.scoring.llm_judge_match

llm_judge_match(expected, actual, llm=None)

Semantic match via LLM judge. Opt-in only, usable directly as ExperimentConfig(scorer=llm_judge_match).

Unlike :func:check_answer_match, this is not deterministic: results can vary between runs and providers. Use it when literal/numeric matching is too strict for your question's expected phrasing.

If llm is omitted, lazily constructs and caches a default OpenAICompatibleClient; this reads OPENAI_API_KEY from the environment and makes a live network call per scored answer. Pass your own llm (e.g. via functools.partial) to use another provider or avoid the implicit network/API-key dependency.

traxr.viz.plot_d_norm

plot_d_norm(results, ax=None)

Per-pair normalized edit distance, with the noise floor when measured.

traxr.viz.plot_t_star

plot_t_star(results, ax=None, bins=10)

Histogram of normalized first-divergence positions (t*/T).

traxr.viz.plot_manifestations

plot_manifestations(results, ax=None)

Manifestation-category prevalence over scored pairs.

Extending the event vocabulary

traxr.trace.registry.register_signature

register_signature(
    event_type,
    signature,
    *,
    classifier=None,
    structural_types=frozenset(),
    key_fields_equal=None,
)

Upgrade a custom event type from the unknown:{event_type} fallback.

Parameters:

Name Type Description Default
event_type str

The custom event type emitted via traxr.emit().

required
signature SignatureFn

payload -> str structural signature builder.

required
classifier ClassifierFn | None

Optional (clean_payload, perturbed_payload) -> str | None returning a structural divergence type (or None for lexical-only differences).

None
structural_types frozenset[str] | set[str]

The divergence-type strings the classifier can return (so t* typing recognizes them as structural).

frozenset()
key_fields_equal KeyFieldsFn | None

Optional semantic-equality predicate on payloads.

None

Errors and warnings

traxr.errors

Typed exception and warning hierarchy for Traxr.

Principle: fail loud with actionable messages; never silently produce wrong metrics. User-configuration errors fail fast; data-dependent failures are skipped and recorded (as warnings collected per pair).

All exceptions derive from :class:TraxrError; all warnings derive from :class:TraxrWarning.

TraxrError

Bases: Exception

Base class for all Traxr exceptions.

UnsupportedModalityError

Bases: TraxrError

Unknown extension or unsupported modality.

Raised for docx/pptx/image/audio inputs in v1. The message names what IS supported (CSV/XLSX/TXT/MD/PDF) and points at the roadmap for the rest.

ModalityMismatchError

Bases: TraxrError

Declared modality does not match the detected modality of the file.

InvalidArtifactError

Bases: TraxrError

Input file is missing, unreadable, or corrupt.

OptionalDependencyError

Bases: TraxrError

A required optional dependency is not installed.

Raised when openai/PyMuPDF/pdfplumber/openpyxl/matplotlib/langchain-core is needed but missing. The message names the pip extra that provides it (e.g. pip install "traxr[document]").

LLMConnectionError

Bases: TraxrError

Missing/invalid API key or unreachable base_url (built-in agent path).

AgentContractError

Bases: TraxrError

The user's agent violated the AgentRunner contract.

For example, it returned a non-str value.

RunBudgetExceeded

Bases: TraxrError

The agent exceeded max_llm_calls_per_run.

Raised inside the agent by the Tier 0 capture wrapper.

ExperimentConfigError

Bases: TraxrError

Invalid experiment configuration.

Raised when both or neither of agent/agent_factory/llm are resolvable, or other fail-fast configuration problems.

MatrixTooLargeError

Bases: TraxrError

The permutation matrix exceeds the configured cap.

MalformedEventError

Bases: TraxrError

Malformed or missing event payload, or empty/non-string event_type.

ControlledVariableError

Bases: TraxrError

Experiment configuration was mutated between paired runs.

TraxrWarning

Bases: UserWarning

Base class for all Traxr warnings (non-fatal, collected per pair).

UnknownEventTypeWarning

Bases: TraxrWarning

An unregistered event type fell back to the unknown:{event_type} signature.

Emitted once per unknown type; upgrade the type via traxr.register_signature() to include its structure in divergence metrics.

EmptyTraceWarning

Bases: TraxrWarning

A run produced no trace events.

NonDeterminismWarning

Bases: TraxrWarning

Paired runs differ beyond the measured noise floor's expectation.

ConcurrentTraceWarning

Bases: TraxrWarning

Concurrent LLM calls detected while tracing a single run.

TokenUnavailableWarning

Bases: TraxrWarning

Token usage could not be captured for one or more LLM calls.

PerturbationSkippedWarning

Bases: TraxrWarning

A perturbation was not applicable and was skipped (recorded).

Corresponds to the engine reporting applied=False with a skip_reason.