Is my agent traceable?
The honest headline: traxr v1 measures any Python agent that talks to an OpenAI-compatible endpoint via the OpenAI SDK, not "any agent".
traxr captures at two tiers. Tier 0 is automatic capture at the OpenAI-SDK
boundary (instrument() / patch_openai()): provider-agnostic, but coarse (no
memory or retrieval events, tool success unknown). Tier 1 is framework-native
capture via callbacks (the LangGraph adapter): richer, with node transitions and
tool success/failure. Everything below is Tier 0 unless noted.
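To make "capture at the SDK boundary" concrete, here is a minimal sketch of class-level patching behind a context manager. It does not reproduce traxr's implementation: FakeClient, patch_create, and the event shape are hypothetical stand-ins, and the real OpenAI client classes are not touched.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for an SDK client class; not the real OpenAI client."""

    def create(self, prompt):
        return f"echo: {prompt}"


@contextmanager
def patch_create(cls, events):
    """Temporarily swap cls.create for a wrapper that records each call."""
    original = cls.create

    def recording_create(self, prompt):
        result = original(self, prompt)
        # Hypothetical event shape, chosen only for this illustration.
        events.append({"type": "llm_call", "prompt": prompt, "result": result})
        return result

    cls.create = recording_create
    try:
        yield events
    finally:
        cls.create = original  # always restore the class, even on error


events = []
with patch_create(FakeClient, events):
    # The client is constructed inside the block, as third-party code would do.
    FakeClient().create("hi")

# Outside the block the class is restored, so this call is not recorded.
FakeClient().create("not recorded")
```

Because the patch sits on the class rather than on one instance, clients created anywhere while the block is active are captured, which is the property that matters for code you cannot edit.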
Covered
- openai.OpenAI / openai.AsyncOpenAI clients you can pass into your agent: wrap once with traxr.instrument(client). Sync, async, and streaming calls (with tool-call delta reassembly) all become trace events. Works against any OpenAI-compatible endpoint: OpenAI, Azure, Ollama, vLLM, Together, Groq, OpenRouter, a proxied Anthropic.
- Clients constructed inside code you can't change: the traxr.capture.patch_openai() context manager patches the SDK at class level for the duration of the experiment.
- LangGraph graphs: traxr.from_langgraph(graph) captures node transitions (as routing events, so reroute metrics work unchanged), tool calls with full success/failure fidelity, and LLM calls, via langchain-core callbacks.
- Anything you emit yourself: traxr.emit("my_event", {...}) is the manual escape hatch; upgrade custom event types into the structural metrics with traxr.register_signature().
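The "tool-call delta reassembly" mentioned for streaming calls can be pictured with a short sketch. It assumes deltas shaped like the Chat Completions streaming format, where each tool-call fragment carries an index and the function arguments arrive as string pieces. Plain dicts stand in for SDK objects, and reassemble_tool_calls is a hypothetical helper, not part of traxr.

```python
import json


def reassemble_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            call = calls.setdefault(
                frag["index"], {"id": None, "name": "", "arguments": ""}
            )
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            # Name and arguments arrive as fragments; concatenate in order.
            call["name"] += fn.get("name") or ""
            call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Made-up stream: one tool call whose JSON arguments span two chunks.
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city": '}}]},
    {"content": "plain text chunk, no tool call"},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Oslo"}'}}]},
]

calls = reassemble_tool_calls(deltas)
args = json.loads(calls[0]["arguments"])
```

The arguments string is only valid JSON once the last fragment has arrived, so a single trace event per tool call can be emitted only after the stream ends.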
Not covered (v1)
- Other provider SDKs (native Anthropic, Gemini, …)
- Raw HTTP calls
- The OpenAI Responses / Assistants APIs
- LLM calls made in subprocesses
Runs that capture nothing are flagged with EmptyTraceWarning and status
empty, never silently reported as zero divergence. A localhost proxy and
OpenTelemetry ingestion are the roadmap answers for the rest.
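The empty-run rule can be sketched as follows. This is an illustration of the behaviour described above, not traxr's code: the EmptyTraceWarning class here is a local stand-in that borrows the name from the text, and summarize_run and placeholder_metric are hypothetical.

```python
import warnings


class EmptyTraceWarning(UserWarning):
    """Local stand-in that borrows the warning name used in the text."""


def placeholder_metric(events):
    """Hypothetical metric; a real divergence computation would go here."""
    return 0.0


def summarize_run(events, metric):
    """Return a result record; an empty trace never yields a numeric score."""
    if not events:
        warnings.warn("run captured no events", EmptyTraceWarning, stacklevel=2)
        return {"status": "empty", "divergence": None}
    return {"status": "ok", "divergence": metric(events)}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty_result = summarize_run([], placeholder_metric)

ok_result = summarize_run([{"type": "llm_call"}], placeholder_metric)
```

Returning None rather than 0.0 for the empty case keeps "nothing was measured" distinguishable from "measured, and identical".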
Caveats that keep the numbers honest
External traces are coarser than built-in traces. Tier 0 sees LLM
calls, tool requests, and tool results, with no memory or retrieval events,
and tool success/failure is unknown at the SDK boundary. d_norm and t*
remain valid (the metrics are vocabulary-agnostic), but built-in and
external values are not cross-comparable.
Concurrency degrades comparability. Parallel LLM calls interleave
nondeterministically, and the resulting scheduling noise inflates d_norm.
traxr detects this (ConcurrentTraceWarning, order_nondeterministic on the
pair) and the noise floor absorbs it empirically; set
require_sequential=True to fail fast instead.
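One way to picture the detection step is to check whether any two LLM calls overlap in time. The sketch below is an assumption about how such a check could work, not a description of traxr's detector: has_overlapping_calls, check_run, and SequentialRequired are hypothetical names, and the (start, end) timestamps are made up.

```python
class SequentialRequired(RuntimeError):
    """Hypothetical error raised when overlapping calls are not allowed."""


def has_overlapping_calls(spans):
    """Return True if any two (start, end) call spans overlap in time."""
    latest_end = None
    for start, end in sorted(spans):
        if latest_end is not None and start < latest_end:
            return True
        latest_end = end if latest_end is None else max(latest_end, end)
    return False


def check_run(spans, require_sequential=False):
    """Flag a run as order-nondeterministic, or fail fast if asked to."""
    concurrent = has_overlapping_calls(spans)
    if concurrent and require_sequential:
        raise SequentialRequired("overlapping LLM calls detected")
    return {"order_nondeterministic": concurrent}


sequential_run = [(0.0, 1.0), (1.0, 2.0)]  # back to back, no overlap
parallel_run = [(0.0, 2.0), (1.0, 3.0)]    # second call starts mid-first

seq_flag = check_run(sequential_run)
par_flag = check_run(parallel_run)

try:
    check_run(parallel_run, require_sequential=True)
    failed_fast = False
except SequentialRequired:
    failed_fast = True
```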
Budget enforcement is Tier 0 only. max_llm_calls_per_run is enforced
inside the instrument wrapper. LangGraph (Tier 1) runs suppress Tier 0 to
avoid double counting, so the budget is not enforced there in v1.
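A budget enforced inside a wrapper can be sketched like this. The parameter name max_llm_calls_per_run comes from the text; everything else (BudgetedClient, StubClient, BudgetExceeded, and the single create method) is a hypothetical stand-in rather than traxr's wrapper.

```python
class BudgetExceeded(RuntimeError):
    """Hypothetical error raised once the call budget is spent."""


class StubClient:
    """Stand-in for an LLM client; returns a canned reply."""

    def create(self, prompt):
        return f"reply to {prompt}"


class BudgetedClient:
    """Wrapper that counts calls and refuses any beyond the budget."""

    def __init__(self, client, max_llm_calls_per_run):
        self._client = client
        self._max = max_llm_calls_per_run
        self.calls = 0

    def create(self, prompt):
        if self.calls >= self._max:
            raise BudgetExceeded(f"budget of {self._max} LLM calls exhausted")
        self.calls += 1
        return self._client.create(prompt)


client = BudgetedClient(StubClient(), max_llm_calls_per_run=2)
replies = [client.create("a"), client.create("b")]

try:
    client.create("c")  # third call exceeds the budget of two
    third_call_blocked = False
except BudgetExceeded:
    third_call_blocked = True
```

The counter only sees calls that pass through the wrapper, which illustrates why a run that bypasses or suppresses the wrapper has no budget enforcement.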