Overview
The Dynamo AI system runs in your Kubernetes cluster, on-premises or in the cloud. This deployment guide covers deploying the Dynamo AI system in AWS and on-premises environments. Detailed deployment guides for Azure, GCP, and other environments will be published soon. If you have questions, please feel free to reach out to Dynamo AI.
Architecture overview
The following diagram illustrates the architecture of an example on-premises setup, including the integrated systems running in your corporate VPN and production environment.
The Dynamo AI system is deployed in a dedicated namespace in Kubernetes, with a few necessary components deployed in the kube-system namespace. The Kubernetes cluster is managed by you in your production environment.
Components
- Nginx: An Nginx ingress gateway that handles incoming requests to the Dynamo AI system.
- UI Application: An application that serves user interactions through the UI.
- API Application: An application that serves the API calls to the Dynamo AI system.
- Keycloak: A Keycloak-based application that handles authentication and authorization for all incoming requests.
- NATS: A NATS-based application that queues asynchronous jobs.
- KEDA: A KEDA-based application that calls the Kubernetes API server to scale ML worker pods up and down based on the NATS job queue length (see the sketch after this list).
- Moderator Service: The service that routes real-time traffic to the ML workers for moderation in DynamoGuard.
- Auto-scaled ML workers: Depending on your setup, various ML workers that generate test prompts, generate policies, moderate real-time traffic, and so on. These workers can be automatically scaled up and down by KEDA.
- PostgreSQL: There are two PostgreSQL instances in the system. One stores the system configuration and application data. The other serves as the backend for Keycloak, storing authentication and authorization data. This isolation ensures a strong security boundary between security data and application data.
- MongoDB: One MongoDB instance to store the application data.
- FluentBit: A FluentBit application that collects the application logs from the system.
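As a rough illustration of the KEDA-driven scaling described above, the sketch below uses the official Kubernetes Python client to inspect the ML worker deployment and the KEDA ScaledObjects in the Dynamo AI namespace. The namespace (`dynamoai`) and deployment name (`ml-worker`) are assumptions for illustration; substitute the names used in your cluster.

```python
# Sketch: inspect KEDA-driven autoscaling of ML workers (names are assumptions).
from kubernetes import client, config

NAMESPACE = "dynamoai"        # assumed Dynamo AI namespace
DEPLOYMENT = "ml-worker"      # assumed ML worker deployment name

config.load_kube_config()     # or config.load_incluster_config() inside the cluster

# Current replica count of the ML worker deployment.
apps = client.AppsV1Api()
dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
print(f"{DEPLOYMENT}: {dep.status.ready_replicas or 0}/{dep.spec.replicas} replicas ready")

# KEDA ScaledObjects that govern scaling (KEDA CRD group keda.sh/v1alpha1).
custom = client.CustomObjectsApi()
scaled = custom.list_namespaced_custom_object(
    group="keda.sh", version="v1alpha1",
    namespace=NAMESPACE, plural="scaledobjects",
)
for item in scaled.get("items", []):
    name = item["metadata"]["name"]
    triggers = [t.get("type") for t in item["spec"].get("triggers", [])]
    print(f"ScaledObject {name}: triggers={triggers}")
```

Running this while jobs accumulate in the NATS queue should show the replica count rise and fall as KEDA reacts to queue length.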
System personas
Generally, there are four groups of personas who may interact with the Dynamo AI system for usage and maintenance purposes:
- General Users: External and internal users who utilize the LLM models. Users may interact with an application running on their machine inside your corporate VPN, or with an application server running in the production environment.
- System Admin: Individuals responsible for managing system configurations and setting user permissions. System admins manage the system through the Dynamo UI admin dashboard.
- ML Engineers: Engineers who run tests, develop policies, and so on. ML engineers can use the Dynamo UI or the Dynamo SDK to interact with the system.
- DevOps / Infra Team: Professionals who deploy and manage the system and monitor system health in real time. They also subscribe to Dynamo AI for new system and model updates, and apply those updates to improve performance and fix vulnerabilities.
Interaction with external systems and alternatives
The Dynamo AI system may interact with other systems in the following ways:
- Ingress Gateway: Dynamo AI utilizes an Nginx ingress gateway to handle incoming requests to its services.
- Integration with IAM Systems: If you have an existing IAM system, Dynamo AI can integrate with it using Keycloak, which connects via the OIDC endpoint (see the token sketch after this list).
- Model and Configuration Storage and Retrieval (see the download sketch after this list):
  - If your organization has an existing Object Storage service, ML workers can download/upload Dynamo AI models from/to your organization's Object Storage service.
  - If your cluster has access to Hugging Face, ML workers can retrieve Dynamo AI models from this external service.
  - Otherwise, Dynamo AI can deploy an in-cluster MinIO service for object storage. Models can also be shipped to you directly from Dynamo AI.
- Auto-scaling: The KEDA component interacts with the Kubernetes API Server to automatically scale ML worker pods based on jobs in the NATS queue.
- External LLM Services for Testing:
  - ML workers can call external LLM service endpoints, such as Mistral AI, to generate test prompts and datasets.
  - Alternatively, you can host open-source models locally, which requires additional GPU provisioning.
- Managed Inference with DynamoGuard: When using DynamoGuard in managed inference mode, and if the target LLM is external, ML workers send compliant prompts to the external LLM service endpoints to receive inference responses.
- Log Management: A FluentBit application collects application logs from the system.
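As a rough sketch of the Keycloak/OIDC integration point, the snippet below obtains an access token from a Keycloak realm via the standard OpenID Connect token endpoint and presents it on an API call. The base URL, realm, client ID, client secret, and API path are illustrative assumptions, not values defined by Dynamo AI; your deployment will use its own.

```python
# Sketch: obtain an OIDC access token from Keycloak and call the Dynamo AI API.
# All URLs, realm, client, and path values below are assumptions for illustration.
import requests

KEYCLOAK_BASE = "https://keycloak.example.com"     # assumed Keycloak URL
REALM = "dynamoai"                                  # assumed realm name
TOKEN_URL = f"{KEYCLOAK_BASE}/realms/{REALM}/protocol/openid-connect/token"

# Standard OIDC client-credentials grant against Keycloak's token endpoint.
resp = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "dynamo-sdk",                  # assumed client ID
        "client_secret": "<client-secret>",
    },
    timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# Use the bearer token on a request to the Dynamo AI API (path is hypothetical).
api_resp = requests.get(
    "https://dynamo.example.com/api/v1/health",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
print(api_resp.status_code)
```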
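The object-storage alternative can be illustrated with a short boto3 sketch that downloads a model artifact from an S3-compatible bucket, whether your organization's object storage or an in-cluster MinIO service. The endpoint, credentials, bucket name, and object key are assumptions for illustration only.

```python
# Sketch: fetch a model artifact from an S3-compatible object store (names are assumptions).
import boto3

# For in-cluster MinIO, point endpoint_url at the MinIO service; omit it for AWS S3.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.dynamoai.svc.cluster.local:9000",  # assumed MinIO endpoint
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# Download a model artifact to local disk for an ML worker to load.
s3.download_file(
    Bucket="dynamo-models",                        # assumed bucket name
    Key="guard/policy-model/model.safetensors",    # assumed object key
    Filename="/models/model.safetensors",
)
```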
DynamoGuard service modes
DynamoGuard can run in two service modes: managed inference mode and analyze mode. A sketch of both call patterns follows the list below.
- Managed inference mode: DynamoGuard moderates the input prompt, forwards compliant requests to the target LLM service, receives the inference result, and moderates the response before forwarding it to the user. Your application sends the LLM prompt to DynamoGuard and receives either the inference response or an error message (if non-compliant) from DynamoGuard.
- Analyze mode: DynamoGuard only moderates the input and output to and from the target LLM service, without forwarding the request to the target LLM service itself. In this case, your application calls the LLM service directly for inference and calls DynamoGuard separately to moderate the input and output. This interaction is depicted as the dot-dash line in the architecture diagram.
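The sketch below contrasts the two call patterns from an application's point of view. The endpoint paths, payload fields, and URLs are hypothetical placeholders, not the documented DynamoGuard API; consult the Dynamo SDK or API reference for the actual interface.

```python
# Sketch of the two DynamoGuard interaction patterns (endpoints and fields are hypothetical).
import requests

DYNAMO_BASE = "https://dynamo.example.com/api/v1"   # assumed DynamoGuard base URL
LLM_URL = "https://llm.example.com/v1/chat"         # assumed target LLM endpoint
HEADERS = {"Authorization": "Bearer <access-token>"}

prompt = "Summarize our refund policy."

# Managed inference mode: send the prompt to DynamoGuard, which moderates it,
# forwards compliant requests to the target LLM, moderates the response, and
# returns either the inference result or an error for non-compliant traffic.
managed = requests.post(
    f"{DYNAMO_BASE}/guard/chat",                    # hypothetical path
    json={"prompt": prompt},
    headers=HEADERS,
    timeout=60,
)
print("managed inference:", managed.json())

# Analyze mode: the application calls the LLM directly, then asks DynamoGuard
# to moderate the input and output; DynamoGuard does not forward anything.
llm = requests.post(LLM_URL, json={"prompt": prompt}, timeout=60)
completion = llm.json().get("output", "")

analysis = requests.post(
    f"{DYNAMO_BASE}/guard/analyze",                 # hypothetical path
    json={"input": prompt, "output": completion},
    headers=HEADERS,
    timeout=60,
)
print("analyze mode:", analysis.json())
```

In both patterns authentication is handled by Keycloak, as described in the interaction list above; only the routing of the LLM call differs.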