Overview
Metrics provide a way of monitoring and understanding behavior in aggregate. The DynamoAI API provides two types of metrics:
- System Level Metrics - The API captures request statistics and sends them to OpenTelemetry (available from release 3.24.0)
- Application Level Metrics - Currently limited to DynamoGuard latency metrics
Architecture Overview
The metrics collection architecture follows a pipeline pattern that enables flexibility in metric storage and analysis:
┌─────────────┐
│  DynamoAI   │
│     API     │
└──────┬──────┘
       │
       │ Sends metrics via OTLP
       │
       ├──────────────────────────┐
       │                          │
       ▼                          ▼
┌─────────────────────┐    ┌─────────────────────────────┐
│  opentelemetry-     │    │  opentelemetry-collector-   │
│  collector          │    │  application                │
│  (System Metrics)   │    │  (Application Metrics)      │
└──────┬──────────────┘    └──────┬──────────────────────┘
       │                          │
       │                          │
       └──────────┬───────────────┘
                  │
                  │ Exports metrics
                  │
                  ▼
         ┌─────────────────┐
         │   Prometheus    │
         │    (Storage)    │
         └─────────────────┘
Flow Description
- API Layer: The DynamoAI API captures metrics at two levels:
  - System Metrics: Automatically captured at the HTTP level (request/response statistics, errors, latency)
  - Application Metrics: Explicitly sent by the application code (DynamoGuard latency metrics)
- OpenTelemetry Collectors: Metrics are sent via OTLP (OpenTelemetry Protocol) to two separate collector deployments:
  - opentelemetry-collector: Receives and processes System Level Metrics
  - opentelemetry-collector-application: Receives and processes Application Level Metrics
- Storage Backend: The OpenTelemetry collectors export metrics to Prometheus (when Metrics Storage is enabled in the DynamoAI package). Prometheus acts as the time-series database for metric storage and querying.
- Visualization: Grafana (shipped with DynamoAI package) connects to Prometheus as a data source to provide dashboards and visualizations.
This architecture provides flexibility - customers can configure the OpenTelemetry collectors to export metrics to their own backends (e.g., their own Prometheus instance, Datadog, New Relic, etc.) in addition to or instead of the DynamoAI-provided Prometheus instance.
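As a rough illustration of the export flexibility described above, the sketch below assembles an OpenTelemetry Collector configuration with one metrics pipeline and an optional second exporter. The `otlp` receiver and the `prometheus` and `prometheusremotewrite` exporters are standard Collector components (the exporters ship in the contrib distribution). The function name, ports, and remote-write URL are assumptions made for this example, and the configuration actually shipped with DynamoAI may look different.

```python
import json


def build_collector_config(remote_write_url=None):
    """Return a Collector config dict with a single metrics pipeline.

    remote_write_url is a hypothetical customer-owned endpoint. When given,
    it is added next to the local Prometheus scrape endpoint.
    """
    config = {
        "receivers": {
            "otlp": {
                "protocols": {
                    "grpc": {"endpoint": "0.0.0.0:4317"},
                    "http": {"endpoint": "0.0.0.0:4318"},
                }
            }
        },
        "exporters": {
            # Exposes received metrics on a port that Prometheus can scrape.
            "prometheus": {"endpoint": "0.0.0.0:8889"},
        },
        "service": {
            "pipelines": {
                "metrics": {"receivers": ["otlp"], "exporters": ["prometheus"]}
            }
        },
    }
    if remote_write_url:
        # Assumed extra backend: push the same metrics to another store.
        config["exporters"]["prometheusremotewrite"] = {"endpoint": remote_write_url}
        config["service"]["pipelines"]["metrics"]["exporters"].append(
            "prometheusremotewrite"
        )
    return config


if __name__ == "__main__":
    example = build_collector_config("https://prometheus.example.com/api/v1/write")
    print(json.dumps(example, indent=2))
```

Because YAML 1.2 is a superset of JSON, the printed output can generally be used where a YAML Collector configuration is expected.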
System Metrics
Note: System Metrics were introduced in release 3.24.0.
System Level Metrics help determine the health of the system, including:
- API response time
- Error rates
- Request/response statistics
These metrics are captured at the HTTP level and provide insights into API performance and reliability.
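To make these HTTP-level signals concrete, here is a minimal sketch that derives a request count, an error rate, and a 95th-percentile response time from a batch of request records. The `RequestRecord` type, the `summarize` function, and the choice to count only 5xx responses as errors are assumptions for illustration. They do not describe how the DynamoAI API computes its metrics internally.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    status_code: int    # HTTP status returned to the caller
    duration_ms: float  # time taken to serve the request


def summarize(records):
    """Aggregate request records into simple health indicators."""
    total = len(records)
    if total == 0:
        return {"requests": 0, "error_rate": 0.0, "p95_ms": None}

    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status_code >= 500)
    durations = [r.duration_ms for r in records]

    if total >= 2:
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(durations, n=20)[-1]
    else:
        p95 = durations[0]

    return {"requests": total, "error_rate": errors / total, "p95_ms": p95}


if __name__ == "__main__":
    sample = [
        RequestRecord(200, 10.0),
        RequestRecord(500, 30.0),
        RequestRecord(200, 20.0),
        RequestRecord(404, 15.0),
    ]
    print(summarize(sample))
```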
Application Metrics
Application Level Metrics capture what is happening at the application level. Currently, this includes:
- DynamoGuard latency metrics for input guardrails, LLM completion, and output guardrails
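The three latency stages listed above can be pictured with a small timing helper. This is only a sketch: `StageTimer`, `handle_request`, and the stage names are hypothetical, and the placeholder bodies stand in for the real guardrail checks and LLM call, which are not described here.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records wall-clock latency per named stage, in milliseconds."""

    def __init__(self):
        self.latencies_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the latency even if the stage raised an exception.
            self.latencies_ms[name] = (time.perf_counter() - start) * 1000.0


def handle_request(prompt, timer):
    """Hypothetical request flow with one timed block per stage."""
    with timer.stage("input_guardrails"):
        checked = prompt.strip()      # placeholder for input guardrail checks
    with timer.stage("llm_completion"):
        completion = checked.upper()  # placeholder for the LLM call
    with timer.stage("output_guardrails"):
        result = completion           # placeholder for output guardrail checks
    return result


if __name__ == "__main__":
    timer = StageTimer()
    handle_request(" hello ", timer)
    for name, ms in timer.latencies_ms.items():
        print(f"{name}: {ms:.3f} ms")
```

Timing each stage separately makes it possible to tell whether slow responses come from the guardrails or from the model itself.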
How does DynamoAI capture these metrics?
DynamoAI uses OpenTelemetry collectors as an intermediary layer to collect these metrics. This architecture allows customers to ship metrics to their own backends as well.
There are two separate OpenTelemetry collector deployments in the Kubernetes cluster:
| Collector Deployment | Metric Level | Purpose |
|---|---|---|
| opentelemetry-collector | System Metrics | Captures system-level request statistics emitted by the API |
| opentelemetry-collector-application | Application Metrics | Receives metrics sent by the application and delivers them to the appropriate backend |
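The split between the two collectors can be sketched as a routing step in front of an OTLP/HTTP payload builder. The payload follows the OTLP JSON encoding, and `/v1/metrics` on port 4318 is the OTLP/HTTP default. Everything else is an assumption for illustration: using the deployment names as hostnames, the `dynamoai-api` service name, the helper names, and modelling latency as a gauge (a histogram would be the more common choice). The sketch only builds the request body and does not send it.

```python
import json
import time

# Hypothetical in-cluster endpoints; real service names and ports may differ.
COLLECTOR_ENDPOINTS = {
    "system": "http://opentelemetry-collector:4318/v1/metrics",
    "application": "http://opentelemetry-collector-application:4318/v1/metrics",
}


def _kv(key, value):
    """Encode one attribute in OTLP JSON key/value form."""
    return {"key": key, "value": {"stringValue": str(value)}}


def build_gauge_payload(name, value, unit="ms", attributes=None,
                        service_name="dynamoai-api"):
    """Build an OTLP/HTTP JSON body carrying a single gauge data point."""
    point = {
        "asDouble": float(value),
        # 64-bit integers are encoded as strings in OTLP JSON.
        "timeUnixNano": str(time.time_ns()),
        "attributes": [_kv(k, v) for k, v in (attributes or {}).items()],
    }
    metric = {"name": name, "unit": unit, "gauge": {"dataPoints": [point]}}
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [_kv("service.name", service_name)]},
            "scopeMetrics": [{
                "scope": {"name": "example.metrics"},
                "metrics": [metric],
            }],
        }]
    }


def endpoint_for(level):
    """Pick the collector endpoint for a metric level."""
    return COLLECTOR_ENDPOINTS[level]


if __name__ == "__main__":
    body = build_gauge_payload("example_guard_latency", 12.5,
                               attributes={"stage": "input_guardrails"})
    print(endpoint_for("application"))
    print(json.dumps(body, indent=2))
```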
Where does DynamoAI store these metrics?
All metrics are sent to the OpenTelemetry collectors first, from which they can be exported to any desired backend.
- If you have opted for Metrics Storage from DynamoAI, a Prometheus instance is included as part of the DynamoAI package
- All collected metrics (both System and Application level) are sent to the Prometheus instance deployed on the Kubernetes cluster
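Once metrics are stored in Prometheus, they can be read back through its HTTP API. The sketch below builds an instant-query URL for the standard `/api/v1/query` endpoint and parses a vector result. The base URL and helper names are assumptions, and the built-in `up` metric is used because the names of the DynamoAI metrics are not listed here.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_text):
    """Return (labels, value) pairs from an instant-query vector response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # Each sample value is a [timestamp, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in body["data"]["result"]]


if __name__ == "__main__":
    # Hypothetical in-cluster address for the bundled Prometheus instance.
    print(instant_query_url("http://prometheus:9090", "up"))
    sample = ('{"status":"success","data":{"resultType":"vector",'
              '"result":[{"metric":{"job":"api"},"value":[1700000000,"1"]}]}}')
    print(parse_vector(sample))
```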