Observability Metrics & Traces Reference

This page describes the available metrics and traces presets. These parameterized query templates provide access to common telemetry patterns for monitoring AI Refinery inference services, agent workflows, and user sessions — without writing raw PromQL or TraceQL queries.

Note: The Observability APIs are available by default on api.airefinery.accenture.com. This feature is available starting from SDK version 1.25.0.

Time-series queries: Any metric that accepts time_window can also accept an optional step parameter (e.g., "step": "15m"). When provided, the response includes multiple data points at regular intervals instead of a single aggregated value — useful for building charts and trend visualizations.
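As a sketch, a single-value query and its time-series counterpart might differ only in the presence of step. The payload field names below are assumptions for illustration; only time_window and step come from this page:

```python
# Hypothetical query payloads -- the exact request schema depends on your
# SDK version, so treat the field names as illustrative.
single_value = {
    "metric": "inference_requests_total",
    "time_window": "24h",   # required: one aggregated value comes back
}

time_series = {
    "metric": "inference_requests_total",
    "time_window": "24h",
    "step": "15m",          # optional: switches to multiple data points
}
```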

Metrics

Inference Metrics

Metrics for monitoring LLM inference performance, including request counts, latency distributions, error rates, and model usage patterns.


inference_requests_total

  • Total number of inference requests over the specified time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_active_model_count

  • Number of distinct models that have received requests within the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_model_usage

  • Per-model inference usage rate over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

inference_latency

  • Inference latency at a specified percentile. Defaults to p95 when percentile is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `percentile` (optional) — e.g., `0.50`, `0.90`, `0.95`, `0.99`, or `50`, `90`, `95`, `99`. Default: `0.95`
- `step` (optional)
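Since percentile accepts both quantile form (0.95) and percent form (95), a client may want to normalize before comparing or caching values. This is a client-side sketch mirroring the documented behavior, not the service's actual parser:

```python
def normalize_percentile(p):
    """Accept either quantile form (0.95) or percent form (95) and
    return the quantile form, as both are documented as valid inputs."""
    p = float(p)
    if p > 1:                 # percent form, e.g. 95 -> 0.95
        p = p / 100.0
    if not 0 < p < 1:
        raise ValueError(f"percentile out of range: {p}")
    return p
```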

inference_error_rate

  • Inference error rate as a ratio of errors to total requests. Returns a value between 0 and 1 (e.g., 0.05 means 5% error rate).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `time_window` (required)
- `step` (optional)

Agent Metrics

All agent metrics support filtering by both agent_name and agent_class. The agent_class refers to the implementation type (e.g., ToolUseAgent, SearchAgent, CustomAgent) and is useful for aggregating across agents of the same type regardless of their user-defined names.


agent_task_total

  • Total agent tasks broken down by agent name, agent class, and status (success/failure/timeout) over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_performance_rate

  • Agent task rate by status over the time window. Defaults to success rate when status is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `status` (optional) — defaults to `success`. Can also be `failure` or `timeout`
- `time_window` (required)
- `step` (optional)

agent_throughput

  • Agent task completion rate in tasks per second.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_latency

  • Agent task latency at a specified percentile, grouped by agent name and agent class. Defaults to p95 when percentile is not provided.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `percentile` (optional) — e.g., `0.25`, `0.50`, `0.75`, `0.90`, `0.95`, `0.99`. Default: `0.95`
- `step` (optional)

agent_latency_boxplot

  • Agent task latency at four percentiles (p25, p50, p75, p95) per agent, returned in a single response. Designed for rendering box plot visualizations — p25 maps to Q1, p50 to median, p75 to Q3, and p95 to the upper whisker. Each result includes a percentile label indicating which percentile it represents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)
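Because each agent comes back as four records labeled by percentile, a renderer typically regroups them into one box per agent. The record field names (agent_name, percentile, value) are assumptions for illustration; the p25/p50/p75/p95 mapping comes from the description above:

```python
# Map percentile labels to box-plot fields: p25 -> Q1, p50 -> median,
# p75 -> Q3, p95 -> upper whisker (as documented above).
PERCENTILE_TO_BOX_FIELD = {
    "p25": "q1",
    "p50": "median",
    "p75": "q3",
    "p95": "upper_whisker",
}

def to_boxplot(records):
    """Group per-agent percentile records into one dict per agent."""
    boxes = {}
    for r in records:
        box = boxes.setdefault(r["agent_name"], {})
        box[PERCENTILE_TO_BOX_FIELD[r["percentile"]]] = r["value"]
    return boxes
```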

agent_duration

  • Total time spent per agent in seconds over the time window, grouped by agent name and agent class.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_dependency_calls

  • Count of external dependency calls over the time window, broken down by agent name, agent class, API type, and source.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_tool_calls

  • Count of tool calls over the time window, broken down by agent name, agent class, API type, and tool name.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

agent_messages

  • Inter-agent message counts over the time window, aggregated by sender and receiver agent pair. Returns the number of messages exchanged between each pair of agent classes. Use this metric for the "Messages" column in inter-agent communication tables.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

agent_messages_with_tokens

  • Inter-agent token counts over the time window, aggregated by sender/receiver agent pair and token type (input/output/total). This metric does not return message counts — use agent_messages for that. Each agent pair returns three records (one per token_type). Filter with token_type: "total" to avoid double-counting input and output tokens. Use this metric for the "Tokens" column in inter-agent communication tables.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `token_type` (optional) — `total`, `input`, or `output`. When omitted, returns all three per agent pair.
- `time_window` (required)
- `step` (optional)
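When token_type is omitted and all three records per pair come back, keeping only the total rows avoids double-counting, as noted above. A minimal sketch, assuming records carry sender, receiver, token_type, and a string value field (field names are illustrative):

```python
def total_tokens_by_pair(token_records):
    """Keep only token_type == "total" so input and output tokens are
    not double-counted for each sender/receiver pair. Values arrive as
    floating-point strings (see Notes), so convert with float()."""
    return {
        (r["sender"], r["receiver"]): float(r["value"])
        for r in token_records
        if r["token_type"] == "total"
    }
```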

agent_orchestration_overhead

  • Orchestration overhead ratio (p95) — the fraction of total orchestrator time spent on coordination rather than agent execution. A value of 0.3 means 30% overhead.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

Token Consumption Metrics

Metrics for tracking text LLM token usage across models and agents, including input/output breakdowns for cost analysis and usage optimization.

Important: Token consumption metrics only track text tokens from LLM inference calls. They include tokens used by system prompts, context, and internal orchestrator calls — not just the visible user message and response. A single user query can generate several internal LLM calls, each with its own system prompt and context overhead, so token counts will be higher than the text you see in the conversation.

Agents that do not make LLM calls (e.g., ImageGenerationAgent) will not appear in these metrics. See the Notes section for details on which agents consume tokens and which do not.


token_consumption

  • Total token consumption grouped by organization, project, and model. Supports optional agent filtering to narrow down consumption to specific agents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

token_input_total / token_output_total

  • Input and output tokens broken out separately. Both support optional agent filtering to narrow consumption to specific agents.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `model_key` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

token_consumption_by_agent

  • Token consumption grouped by agent name and agent class over the time window. Only agents that make LLM inference calls will appear. Agents that use non-LLM APIs (e.g., ImageGenerationAgent which calls a diffuser API) will not be listed. You may see system-level agent classes such as DirectInference (direct API calls), FallbackAgent (orchestrator fallback), and Orchestrator (routing overhead) in the results.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `agent_name` (optional)
- `agent_class` (optional)
- `time_window` (required)
- `step` (optional)

Session Metrics

Metrics for monitoring user session activity, including session counts, durations, and request throughput.


sessions_total

  • Total number of sessions started over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

sessions_active

  • Number of currently active sessions (gauge — returns current value, no time window needed).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

session_duration

  • Session duration (p95) from a pre-computed recording rule.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

session_requests_total

  • Total requests processed within sessions over the time window.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

session_requests_rate

  • Session request rate in requests per second.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

RAI Compliance Metrics

Metrics for tracking Responsible AI (RAI) compliance checks, including check counts, rejection rates by category, and latency.


rai_check_total

  • Total number of RAI compliance checks performed over the time window. Every user query sent through the orchestrator triggers exactly one RAI check, so this metric serves as a proxy for total user messages through the orchestrator. Direct inference calls (client.chat.completions.create()) do not trigger RAI checks and are not counted here.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

rai_rejection_total

  • Total number of queries that failed RAI compliance checks over the time window, grouped by rejection category. This is a subset of rai_check_total — only queries where is_pass = false. When category is omitted, returns counts for each category separately. To get a single total, sum the values client-side or filter to a specific category. Zero-value categories are automatically filtered out.
  • Categories are extracted from RAI rule names. Default rules produce categories like harassment, hate, self-harm, sexual, violence, illicit. Custom user-defined rules produce categories derived from the rule name (e.g., a rule named "Prompt Injection Protection" produces prompt_injection_protection).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `category` (optional) — filter by rejection category: `harassment`, `hate`, `self-harm`, `sexual`, `violence`, `illicit`, `illegal_content`, `harmful_content`, or custom rule-derived categories (e.g., `prompt_injection_protection`)
- `time_window` (required)
- `step` (optional)
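Two client-side helpers follow from the description above: deriving a category from a custom rule name, and summing per-category counts into one total. The name-to-category transformation is inferred from the single documented example ("Prompt Injection Protection" becomes prompt_injection_protection), not from the service source, so treat it as an assumption:

```python
def rule_name_to_category(rule_name):
    """Sketch of how custom rule names appear to map to categories:
    lowercased, spaces replaced by underscores (inferred from the
    documented example, not a guaranteed contract)."""
    return rule_name.strip().lower().replace(" ", "_")

def total_rejections(per_category_counts):
    """Sum per-category counts client-side to get a single total,
    since the preset returns one count per category."""
    return sum(per_category_counts.values())
```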

rai_check_latency

  • RAI compliance check latency (p95) per project, optionally filtered by category. When no category is specified, returns latency for all categories including passed (checks that did not trigger a rejection). Useful for building per-category latency tables (e.g., "RAI rejections by category" with a Latency column).

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `category` (optional) — filter by RAI result category (e.g., `passed`, `hate`, `prompt_injection_protection`)
- `time_window` (required)
- `step` (optional)

rai_check_latency_global

  • Global RAI compliance check latency (p95) across all projects. Returns a single aggregated value per organization — suitable for dashboard summary metrics (BANs) where a single latency number is needed rather than per-project values.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

rai_check_duration

  • Average RAI compliance check duration in seconds, computed as total duration divided by total checks. Returns a single value across all projects — suitable for dashboard summary metrics (BANs) showing mean check time.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)
- `time_window` (required)
- `step` (optional)

Traces


inference_traces

  • Traces for inference service requests.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

distiller_traces

  • Traces for distiller service operations.

Parameters:

- `organization_id` (auto-resolved from token)
- `project_name` (optional)

Notes

  • Time windows: Prometheus duration format (5m, 1h, 24h). Default: 1h
  • Percentile: Accepts 0.95 or 95 format. Default: 0.95 (p95)
  • Time-series mode: Pass step (e.g., "15m") to get matrix data for charting. Omit step to get a single aggregated value. When step is included, the response contains ceil(time_window / step) + 1 data points.
  • Agent class: Filter by implementation type (e.g., ToolUseAgent) across all agents of that type.
  • Agent classes in results: Responses may include agent classes that are not part of your project's agent list. DirectInference represents direct API calls outside the orchestrator. FallbackAgent is the default agent when the orchestrator cannot route a query. Orchestrator is the orchestrator itself. These are system-level agent classes, not user-configured agents.
  • Multiple session IDs: A single user conversation with the orchestrator produces multiple internal telemetry sessions (one per agent dispatch). Seeing multiple session IDs for one chat session is expected behavior.
  • Token values are decimals: Token counts are returned as floating-point strings (e.g., "1938.47"). This is standard Prometheus behavior due to counter interpolation. Values are accurate to within ~1% of the true count.
  • Image generation tokens: ImageGenerationAgent does not produce text tokens and will not appear in token consumption metrics. ImageUnderstandingAgent does consume tokens via its Vision Language Model (VLM), which may use a platform-default model (e.g., Qwen/Qwen3-VL-32B-Instruct) regardless of the llm_config in your YAML.
  • 0 vs empty results: A value of "0" means the metric exists but had no activity. An empty result ([]) means no matching data exists at all.
  • agent_messages vs agent_messages_with_tokens: These are separate metrics. agent_messages returns message counts; agent_messages_with_tokens returns token counts. Query both to build a complete inter-agent communication view.
  • RAI check = user message count: rai_check_total counts one check per user query through the orchestrator. It serves as a proxy for total user messages. Direct inference calls bypass RAI and are not counted.
  • RAI rejection categories: rai_rejection_total groups results by category. Default rules produce categories like harassment, hate, violence, illicit. Custom user-defined rules produce categories derived from the rule name (e.g., a rule named "Prompt Injection Protection" produces prompt_injection_protection). Zero-value categories are automatically filtered out. To get a single total rejection count, sum all category values client-side.
  • Latency metrics and traffic volume: Histogram-based latency metrics (rai_check_latency, rai_check_latency_global, agent_latency, agent_latency_boxplot) require sustained traffic to produce values. With very low traffic, these may return null until enough data points accumulate for Prometheus to compute meaningful percentiles.
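The point-count rule from the time-series note above can be sanity-checked with a small helper. Only the ceil(time_window / step) + 1 formula and the duration format come from this page; the parser below handles just single-unit durations (s, m, h, d) and is a sketch, not the service's parser:

```python
import math

# Seconds per Prometheus duration unit, for the subset used on this page.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def duration_seconds(d):
    """Parse simple single-unit durations like "15m" or "24h"."""
    return int(d[:-1]) * _UNIT_SECONDS[d[-1]]

def expected_points(time_window, step):
    """Data points in a time-series response: ceil(window / step) + 1."""
    return math.ceil(duration_seconds(time_window) / duration_seconds(step)) + 1
```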