Introduction
Envoy AI Gateway
Alauda Build of Envoy AI Gateway is based on the Envoy AI Gateway project. Envoy AI Gateway is a Kubernetes-native, AI-specific gateway layer built on top of Envoy Gateway, providing intelligent traffic management, routing, and policy enforcement for AI inference workloads.
Main components and capabilities include:
- AI-Aware Routing: Routes inference requests to the appropriate backend model service based on request content, model name, and backend availability — enabling transparent multi-model serving behind a single endpoint.
- OpenAI-Compatible API: Exposes a unified, OpenAI-compatible API surface (
/v1/chat/completions,/v1/completions,/v1/models) for all downstream inference services, regardless of the underlying runtime. - Per-Model Rate Limiting & Policies: Enforces fine-grained rate limiting, token quotas, and traffic policies at the individual model level, preventing resource starvation and ensuring fair usage across tenants.
- Backend Load Balancing: Distributes inference requests across multiple replicas of the same model using configurable load-balancing strategies, with health checking and automatic failover.
- Envoy Gateway Integration: Runs as an extension of Envoy Gateway, inheriting its Kubernetes Gateway API-native control plane, TLS termination, and observability features (metrics, access logs, distributed tracing).
- Gateway API Inference Extension (GIE): Integrates with the Kubernetes SIG Gateway API Inference Extension for advanced, inference-aware scheduling and load balancing decisions based on real-time backend state.
Envoy AI Gateway is a required dependency of Alauda Build of KServe for exposing inference services.
For installation on the platform, see Install Envoy AI Gateway.
Guides
The following guides configure Envoy AI Gateway as a multi-tenant model serving control plane:
- Authenticating Consumers — verify SSO tokens or API keys and propagate caller identity.
- Configuring Token Quotas — enforce per-user, per-department, and per-tier token budgets.
- Metering Token Usage — report consumption per tenant and feed chargeback.
- Routing to LLM Providers — front external providers with credential injection and failover.
Create the Gateway and its AIGatewayRoute in a dedicated namespace, such as maas-system, rather than in the Envoy Gateway control-plane namespace envoy-gateway-system. This keeps tenant gateways separate from the control plane, and avoids an issue on some versions where a gateway placed in the control-plane namespace does not get the AI Gateway request-processing (ext_proc) filter or SecurityPolicy rules applied to its listener — which silently breaks model routing, token quotas, and authentication. The data-plane proxy pods are created in envoy-gateway-system either way; only the Gateway resource's namespace matters here.
Documentation
Envoy AI Gateway upstream documentation and related resources:
- Envoy AI Gateway Documentation: https://aigateway.envoyproxy.io/ — Official documentation covering architecture, configuration, and API references.
- Envoy AI Gateway GitHub: https://github.com/envoyproxy/ai-gateway — Source code, release notes, and issues.
- Envoy Gateway: https://gateway.envoyproxy.io/ — The underlying gateway infrastructure that Envoy AI Gateway extends.
- Gateway API Inference Extension (GIE): https://gateway-api-inference-extension.sigs.k8s.io/ — Kubernetes SIG project for AI-aware routing integrated with Envoy AI Gateway.
- KServe (Alauda Build): ../kserve/intro — KServe uses Envoy AI Gateway as a required dependency for exposing and routing inference services.