Observability Metrics & Traces Reference

This page describes the available metrics and traces presets. These parameterized query templates provide access to common telemetry patterns for monitoring AI Refinery inference services, agent workflows, and user sessions — without writing raw PromQL or TraceQL queries.

Note: The Observability APIs are available by default on api.airefinery.accenture.com. This feature is available starting from SDK version 1.25.0.

Time-series queries: Any metric that accepts time_window can also accept an optional step parameter (e.g., "step": "15m"). When provided, the response includes multiple data points at regular intervals instead of a single aggregated value — useful for building charts and trend visualizations.
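As a sketch, a single-value query and its time-series counterpart might differ only in the presence of step. The payload field names below are assumptions for illustration; only time_window and step come from this page:

```python
# Hypothetical query payloads -- the exact request schema depends on your
# SDK version, so treat the field names as illustrative.
single_value = {
    "metric": "inference_requests_total",
    "time_window": "24h",   # required: one aggregated value comes back
}

time_series = {
    "metric": "inference_requests_total",
    "time_window": "24h",
    "step": "15m",          # optional: switches to multiple data points
}
```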

Metrics

Inference Metrics

Metrics for monitoring LLM inference performance, including request counts, latency distributions, error rates, and model usage patterns.


inference_requests_total

  • Total number of inference requests over the specified time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_active_model_count

  • Number of distinct models that have received requests within the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_model_usage

  • Per-model inference usage rate over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_latency

  • Inference latency at a specified percentile. Defaults to p95 when percentile is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `percentile` (optional) — e.g., `0.50`, `0.90`, `0.95`, `0.99`, or `50`, `90`, `95`, `99`. Default: `0.95`
- `step` (optional)
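Since percentile accepts both quantile form (0.95) and percent form (95), a client may want to normalize before comparing or caching values. This is a client-side sketch mirroring the documented behavior, not the service's actual parser:

```python
def normalize_percentile(p):
    """Accept either quantile form (0.95) or percent form (95) and
    return the quantile form, as both are documented as valid inputs."""
    p = float(p)
    if p > 1:                 # percent form, e.g. 95 -> 0.95
        p = p / 100.0
    if not 0 < p < 1:
        raise ValueError(f"percentile out of range: {p}")
    return p
```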

inference_error_rate

  • Inference error rate as a ratio of errors to total requests. Returns a value between 0 and 1 (e.g., 0.05 means 5% error rate).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

Agent Metrics

All agent metrics support filtering by both agent_name and agent_class. The agent_class refers to the implementation type (e.g., ToolUseAgent, SearchAgent, CustomAgent) and is useful for aggregating across agents of the same type regardless of their user-defined names.


agent_task_total

  • Total agent tasks broken down by agent name, agent class, and status (success/failure/timeout) over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_performance_rate

  • Agent task rate by status over the time window. Defaults to success rate when status is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `status` (optional) — defaults to `success`. Can also be `failure` or `timeout`
- `time_window` (required)
- `step` (optional)

agent_throughput

  • Agent task completion rate in tasks per second.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_latency

  • Agent task latency at a specified percentile, grouped by agent name and agent class. Defaults to p95 when percentile is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `percentile` (optional) — e.g., `0.25`, `0.50`, `0.75`, `0.90`, `0.95`, `0.99`. Default: `0.95`
- `step` (optional)

agent_latency_boxplot

  • Agent task latency at four percentiles (p25, p50, p75, p95) per agent, returned in a single response. Designed for rendering box plot visualizations — p25 maps to Q1, p50 to median, p75 to Q3, and p95 to the upper whisker. Each result includes a percentile label indicating which percentile it represents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)
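Because each agent comes back as four records labeled by percentile, a renderer typically regroups them into one box per agent. The record field names (agent_name, percentile, value) are assumptions for illustration; the p25/p50/p75/p95 mapping comes from the description above:

```python
# Map percentile labels to box-plot fields: p25 -> Q1, p50 -> median,
# p75 -> Q3, p95 -> upper whisker (as documented above).
PERCENTILE_TO_BOX_FIELD = {
    "p25": "q1",
    "p50": "median",
    "p75": "q3",
    "p95": "upper_whisker",
}

def to_boxplot(records):
    """Group per-agent percentile records into one dict per agent."""
    boxes = {}
    for r in records:
        box = boxes.setdefault(r["agent_name"], {})
        box[PERCENTILE_TO_BOX_FIELD[r["percentile"]]] = r["value"]
    return boxes
```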

agent_duration

  • Total time spent per agent in seconds over the time window, grouped by agent name and agent class.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_dependency_calls

  • Count of external dependency calls over the time window, broken down by agent name, agent class, API type, and source.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_tool_calls

  • Count of tool calls over the time window, broken down by agent name, agent class, API type, and tool name.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_messages

  • Inter-agent message counts over the time window, aggregated by sender and receiver agent pair. Returns the number of messages exchanged between each pair of agent classes. Use this metric for the "Messages" column in inter-agent communication tables.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

agent_messages_with_tokens

  • Inter-agent token counts over the time window, aggregated by sender/receiver agent pair and token type (input/output/total). This metric does not return message counts — use agent_messages for that. Each agent pair returns three records (one per token_type). Filter with token_type: "total" to avoid double-counting input and output tokens. Use this metric for the "Tokens" column in inter-agent communication tables.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `token_type` (optional) — `total`, `input`, or `output`. When omitted, returns all three per agent pair.
- `time_window` (required)
- `step` (optional)
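When token_type is omitted and all three records per pair come back, keeping only the total rows avoids double-counting, as noted above. A minimal sketch, assuming records carry sender, receiver, token_type, and a string value field (field names are illustrative):

```python
def total_tokens_by_pair(token_records):
    """Keep only token_type == "total" so input and output tokens are
    not double-counted for each sender/receiver pair. Values arrive as
    floating-point strings (see Notes), so convert with float()."""
    return {
        (r["sender"], r["receiver"]): float(r["value"])
        for r in token_records
        if r["token_type"] == "total"
    }
```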

agent_orchestration_overhead

  • Orchestration overhead ratio (p95) — the fraction of total orchestrator time spent on coordination rather than agent execution. A value of 0.3 means 30% overhead.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

Token Consumption Metrics

Metrics for tracking text LLM token usage across models and agents, including input/output breakdowns for cost analysis and usage optimization.

Important: Token consumption metrics only track text tokens from LLM inference calls. They include tokens used by system prompts, context, and internal orchestrator calls — not just the visible user message and response. A single user query can generate several internal LLM calls, each with its own system prompt and context overhead, so token counts will be higher than the text you see in the conversation.

Agents that do not make LLM calls (e.g., ImageGenerationAgent) will not appear in these metrics. See the Notes section for details on which agents consume tokens and which do not.


token_consumption

  • Total token consumption grouped by organization, project, and model. Supports optional agent filtering to narrow down consumption to specific agents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

token_input_total / token_output_total

  • Input and output tokens broken out separately. Both support optional agent filtering to narrow consumption to specific agents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

token_consumption_by_agent

  • Token consumption grouped by agent name and agent class over the time window. Only agents that make LLM inference calls will appear. Agents that use non-LLM APIs (e.g., ImageGenerationAgent which calls a diffuser API) will not be listed. You may see system-level agent classes such as DirectInference (direct API calls), FallbackAgent (orchestrator fallback), and Orchestrator (routing overhead) in the results.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

Session Metrics

Metrics for monitoring user session activity, including session counts, durations, and request throughput.


sessions_total

  • Total number of sessions started over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

sessions_active

  • Number of currently active sessions (gauge — returns current value, no time window needed).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

session_duration

  • Session duration (p95) from a pre-computed recording rule.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

session_requests_total

  • Total requests processed within sessions over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

session_requests_rate

  • Session request rate in requests per second.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

RAI Compliance Metrics

Metrics for tracking Responsible AI (RAI) compliance checks, including check counts, rejection rates by category, and latency.


rai_check_total

  • Total number of RAI compliance checks performed over the time window. Every user query sent through the orchestrator triggers exactly one RAI check, so this metric serves as a proxy for total user messages through the orchestrator. Direct inference calls (client.chat.completions.create()) do not trigger RAI checks and are not counted here.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

rai_rejection_total

  • Total number of queries that failed RAI compliance checks over the time window, grouped by rejection category. This is a subset of rai_check_total — only queries where is_pass = false. When category is omitted, returns counts for each category separately. To get a single total, sum the values client-side or filter to a specific category. Zero-value categories are automatically filtered out.
  • Categories are extracted from RAI rule names. Default rules produce categories like harassment, hate, self-harm, sexual, violence, illicit. Custom user-defined rules produce categories derived from the rule name (e.g., a rule named "Prompt Injection Protection" produces prompt_injection_protection).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `category` (optional) — filter by rejection category: `harassment`, `hate`, `self-harm`, `sexual`, `violence`, `illicit`, `illegal_content`, `harmful_content`, or custom rule-derived categories (e.g., `prompt_injection_protection`)
- `time_window` (required)
- `step` (optional)
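Two client-side helpers follow from the description above: deriving a category from a custom rule name, and summing per-category counts into one total. The name-to-category transformation is inferred from the single documented example ("Prompt Injection Protection" becomes prompt_injection_protection), not from the service source, so treat it as an assumption:

```python
def rule_name_to_category(rule_name):
    """Sketch of how custom rule names appear to map to categories:
    lowercased, spaces replaced by underscores (inferred from the
    documented example, not a guaranteed contract)."""
    return rule_name.strip().lower().replace(" ", "_")

def total_rejections(per_category_counts):
    """Sum per-category counts client-side to get a single total,
    since the preset returns one count per category."""
    return sum(per_category_counts.values())
```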

rai_check_latency

  • RAI compliance check latency (p95) per project, optionally filtered by category. When no category is specified, returns latency for all categories including passed (checks that did not trigger a rejection). Useful for building per-category latency tables (e.g., "RAI rejections by category" with a Latency column).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `category` (optional) — filter by RAI result category (e.g., `passed`, `hate`, `prompt_injection_protection`)
- `time_window` (required)
- `step` (optional)

rai_check_latency_global

  • Global RAI compliance check latency (p95) across all projects. Returns a single aggregated value per organization — suitable for dashboard summary metrics (BANs) where a single latency number is needed rather than per-project values.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

rai_check_duration

  • Average RAI compliance check duration in seconds, computed as total duration divided by total checks. Returns a single value across all projects — suitable for dashboard summary metrics (BANs) showing mean check time.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

Traces


inference_traces

  • Traces for inference service requests.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

distiller_traces

  • Traces for distiller service operations.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

Notes

  • Time windows: Prometheus duration format (5m, 1h, 24h). Default: 1h
  • Percentile: Accepts 0.95 or 95 format. Default: 0.95 (p95)
  • Time-series mode: Pass step (e.g., "15m") to get matrix data for charting. Omit step to get a single aggregated value. When step is included, the response contains ceil(time_window / step) + 1 data points.
  • Agent class: Filter by implementation type (e.g., ToolUseAgent) across all agents of that type.
  • Agent classes in results: Responses may include agent classes that are not part of your project's agent list. DirectInference represents direct API calls outside the orchestrator. FallbackAgent is the default agent when the orchestrator cannot route a query. Orchestrator is the orchestrator itself. These are system-level agent classes, not user-configured agents.
  • Multiple session IDs: A single user conversation with the orchestrator produces multiple internal telemetry sessions (one per agent dispatch). Seeing multiple session IDs for one chat session is expected behavior.
  • Token values are decimals: Token counts are returned as floating-point strings (e.g., "1938.47"). This is standard Prometheus behavior due to counter interpolation. Values are accurate to within ~1% of the true count.
  • Image generation tokens: ImageGenerationAgent does not produce text tokens and will not appear in token consumption metrics. ImageUnderstandingAgent does consume tokens via its Vision Language Model (VLM), which may use a platform-default model (e.g., Qwen/Qwen3-VL-32B-Instruct) regardless of the llm_config in your YAML.
  • 0 vs empty results: A value of "0" means the metric exists but had no activity. An empty result ([]) means no matching data exists at all.
  • agent_messages vs agent_messages_with_tokens: These are separate metrics. agent_messages returns message counts; agent_messages_with_tokens returns token counts. Query both to build a complete inter-agent communication view.
  • RAI check = user message count: rai_check_total counts one check per user query through the orchestrator. It serves as a proxy for total user messages. Direct inference calls bypass RAI and are not counted.
  • RAI rejection categories: rai_rejection_total groups results by category. Default rules produce categories like harassment, hate, violence, illicit. Custom user-defined rules produce categories derived from the rule name (e.g., a rule named "Prompt Injection Protection" produces prompt_injection_protection). Zero-value categories are automatically filtered out. To get a single total rejection count, sum all category values client-side.
  • Latency metrics and traffic volume: Histogram-based latency metrics (rai_check_latency, rai_check_latency_global, agent_latency, agent_latency_boxplot) require sustained traffic to produce values. With very low traffic, these may return null until enough data points accumulate for Prometheus to compute meaningful percentiles.
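The point-count rule from the time-series note above can be sanity-checked with a small helper. Only the ceil(time_window / step) + 1 formula and the duration format come from this page; the parser below handles just single-unit durations (s, m, h, d) and is a sketch, not the service's parser:

```python
import math

# Seconds per Prometheus duration unit, for the subset used on this page.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def duration_seconds(d):
    """Parse simple single-unit durations like "15m" or "24h"."""
    return int(d[:-1]) * _UNIT_SECONDS[d[-1]]

def expected_points(time_window, step):
    """Data points in a time-series response: ceil(window / step) + 1."""
    return math.ceil(duration_seconds(time_window) / duration_seconds(step)) + 1
```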