Configuring Token Quotas

Introduction

Envoy AI Gateway can rate-limit by token usage rather than request count, and track a separate budget for each caller based on identity. This prevents a single consumer or a runaway agent from exhausting a shared model budget, and it lets you offer per-user, per-department, and per-tier token quotas on one gateway.

A token quota combines three pieces:

AIGatewayRoute.llmRequestCosts extracts token counts from each LLM response into Envoy dynamic metadata.
A global rate limit backend (Redis) accumulates the cost. Because the token cost is known only after the response, it cannot be tracked by the per-pod local limiter.
A BackendTrafficPolicy of type Global defines the budget and the identity key.

Use Cases

Give each user a monthly token budget on an expensive model, with a higher budget for a premium tier.
Cap the total tokens a department can consume across all its applications.
Protect a shared model from a single misbehaving automation account.

Prerequisites

Envoy AI Gateway is installed, with an AIGatewayRoute routing to your model backends. Confirm the relevant CRDs are present:
kubectl get crd \
  aigatewayroutes.aigateway.envoyproxy.io \
  backendtrafficpolicies.gateway.envoyproxy.io
Caller identity is propagated as request headers, for example x-user-id. See Authenticating Consumers. Without an identity header the budget below collapses to a single counter shared by all callers, so this step is what makes the quota per consumer.
A Redis instance is available. Use a managed instance from Cache Service for Redis as the default provider, then record its access address (host:port) and credentials. Verify reachability from the cluster before going further:
kubectl run redis-probe --rm -i --restart=Never \
  --image=redis:7-alpine -- \
  redis-cli -h <redis-host> -p <redis-port> PING
# expect: PONG
Note the Gateway name and namespace — needed later to restart the right data-plane proxy:
kubectl get gateway -A

NOTE

Create the Gateway and AIGatewayRoute in a dedicated namespace (for example maas-system), not in the Envoy Gateway control-plane namespace envoy-gateway-system. A gateway placed in the control-plane namespace may not have the AI Gateway request-processing filter and SecurityPolicy applied to its listener, which silently breaks routing and policy enforcement. See Envoy AI Gateway.

Steps

Enable the global rate limit backend

The local rate limiter cannot accumulate response-derived token cost, so the gateway must use the Global Rate Limit service backed by Redis. Take the access address of the Redis instance prepared in the prerequisites (shown on its instance detail page in Cache Service for Redis) and set it in the envoy-gateway-config ConfigMap (namespace envoy-gateway-system), under data."envoy-gateway.yaml":

kubectl edit configmap envoy-gateway-config -n envoy-gateway-system

Add the rateLimit block under the top-level config (keep any existing keys such as gateway: and provider: intact):

rateLimit:
  backend:
    type: Redis
    redis:
      url: <redis-host>:<redis-port>  # access address of the Redis instance
      # tls:                          # uncomment if Redis requires TLS
      #   certificateRef:
      #     name: redis-client-cert

<redis-host>:<redis-port>: the access address copied from the Redis instance detail page.
For a password-protected Redis, also create an Opaque Secret with the keys redis-username and redis-password, then reference it via rateLimit.backend.redis.auth.passwordRef. See the Envoy Gateway rate-limit docs for the full schema.

The Envoy Gateway control plane reads this bootstrap configuration only at startup and does not hot-reload it, so restart its Deployment to apply the change, then confirm the dedicated envoy-ratelimit Deployment is healthy:

kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
kubectl rollout status  deployment envoy-gateway   -n envoy-gateway-system
kubectl rollout status  deployment envoy-ratelimit -n envoy-gateway-system
# the envoy-ratelimit Deployment is created by the EG controller the first time
# rateLimit.backend is set; if it never appears, the config above was not parsed
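
Because the envoy-ratelimit Deployment only appears after the controller has parsed the new configuration, a rollout status call issued too early can fail with "not found". A small polling helper is one way to wait for it. This is a minimal sketch: wait_for is a hypothetical helper name, not part of kubectl or Envoy Gateway, and the attempt count and delay in the usage comment are arbitrary.

```shell
# wait_for ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it exits 0 or ATTEMPTS is exhausted.
# Hypothetical helper for illustration only.
wait_for() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$_i" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _i=$((_i + 1))
  done
  return 1
}

# Usage (requires a kubectl context pointing at the cluster):
#   wait_for 30 5 kubectl get deployment envoy-ratelimit -n envoy-gateway-system \
#     || echo "envoy-ratelimit never appeared; re-check the rateLimit block"
```

If the helper gives up, the likely cause is the one named in the comment above: the rateLimit block was not parsed.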

NOTE

A single Redis instance per Envoy Gateway is sufficient. Any reachable Redis also works, but a managed instance is recommended for availability and backup.

If the Gateway was already running before the rate limit backend was enabled, also restart its data plane so the proxy picks up the rate limit service:

kubectl rollout restart deployment -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=<gateway-name>

Capture token usage on the route

Add llmRequestCosts to the AIGatewayRoute so the gateway writes token counts into Envoy dynamic metadata (the per-request scratch space filters use to talk to each other) under the namespace io.envoy.ai_gateway. The rate-limit filter reads from this namespace in the next step.

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: <aigatewayroute-name>
  namespace: <your-namespace>
spec:
  # ... existing parentRefs and rules ...
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken

metadataKey: the key under io.envoy.ai_gateway where the count is written. Pick any name; the BackendTrafficPolicy below must reference the same string.
type: InputToken counts the prompt, OutputToken counts the completion, TotalToken is the sum. Use CEL for a custom formula — for example, charge output tokens 3× because they are slower:
- metadataKey: llm_weighted_cost
  type: CEL
  cel: "input_tokens + output_tokens * 3"
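
To get a feel for how a weighted formula changes what a call costs, the same arithmetic can be reproduced locally before committing to it. This sketch only mirrors the expression from the example above in shell arithmetic. The function name and the token counts are made up for illustration, and nothing here talks to the gateway.

```shell
# weighted_cost INPUT_TOKENS OUTPUT_TOKENS [OUTPUT_WEIGHT]
# Mirrors "input_tokens + output_tokens * 3"; the weight defaults to 3.
weighted_cost() {
  _in=$1
  _out=$2
  _weight=${3:-3}
  echo $((_in + _out * _weight))
}

# Illustrative call: 1200 prompt tokens, 400 completion tokens.
weighted_cost 1200 400    # prints 2400, versus 1600 under TotalToken
```

With these sample numbers the weighted key charges 50% more than llm_total_token would, so a budget sized for TotalToken would run out correspondingly sooner.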

Apply and confirm the route is still accepted (the new field should not break translation):

kubectl get aigatewayroute <aigatewayroute-name> -n <your-namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}'
# expect: True

Define the token budget by identity

Attach a BackendTrafficPolicy with a Global rate limit. Set the request cost to 0 and the response cost to the captured token metadata, so that only tokens count against the limit. Use clientSelectors to scope the budget per identity and per model.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: maas-token-quota
  namespace: <your-namespace>
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute                # the HTTPRoute generated from the AIGatewayRoute
      name: <aigatewayroute-name>    # it shares the AIGatewayRoute's name
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id       # identity header from the SecurityPolicy
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: my-llm
          limit:
            requests: 200000             # 200k tokens per window
            unit: Hour
          cost:
            request:
              from: Number
              number: 0                   # requests do not count
            response:
              from: Metadata              # tokens count
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token

x-user-id with type: Distinct gives each caller an independent counter, which produces a per-user quota. Use x-user-group to aggregate a department against one budget, or match a specific group such as premium with type: Exact for tiered limits.
limit.requests is interpreted as a token budget here, because the cost is sourced from token metadata. With 200000 tokens/hour and a typical chat call costing roughly 1.5–2k tokens, expect ~100–130 calls/hour per caller before throttling kicks in.
cost.request.number: 0 means a request that fails to reach the upstream (e.g. malformed body) consumes no quota. Set it to 1 if you want pre-flight throttling on call count as well.
cost.response.metadata.key must match a metadataKey declared on the route.
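
The calls-per-hour estimate above is plain integer division of the budget by the average cost of a call, which makes it easy to redo for other budgets. A small sketch, where calls_per_window is a hypothetical helper name and the per-call token figures are the rough ones quoted above:

```shell
# calls_per_window BUDGET_TOKENS TOKENS_PER_CALL
# Number of whole calls that fit into one rate-limit window.
calls_per_window() {
  echo $(( $1 / $2 ))
}

calls_per_window 200000 2000    # prints 100
calls_per_window 200000 1500    # prints 133
```

Treat the result as an estimate rather than a hard cap. Since the token cost is only known after the response, the call that crosses the budget has presumably already completed by the time it is counted.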

Verification

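Drive a short burst with a valid identity token, then send a request from a different identity, and confirm that only the first identity is throttled. With the 200000-token budget from the example policy, eight short calls will not come close to the limit, so for this test temporarily lower limit.requests to a value that a handful of short calls can exhaust, and restore it afterwards: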

GATEWAY=http://<gateway-address>
TOKEN_ALICE=<alice-jwt-or-api-key>
TOKEN_BOB=<bob-jwt-or-api-key>

# Send 8 calls as alice; expect early 200s then 429 once the budget is used.
for i in $(seq 1 8); do
  curl -s -o /dev/null -w "alice #$i -> %{http_code}\n" \
    -H "Authorization: Bearer $TOKEN_ALICE" \
    -H 'Content-Type: application/json' \
    -d '{"model":"my-llm","messages":[{"role":"user","content":"hi"}]}' \
    $GATEWAY/v1/chat/completions
done

# Same model, different user — must succeed because the budget is per x-user-id.
curl -s -o /dev/null -w "bob       -> %{http_code}\n" \
  -H "Authorization: Bearer $TOKEN_BOB" \
  -H 'Content-Type: application/json' \
  -d '{"model":"my-llm","messages":[{"role":"user","content":"hi"}]}' \
  $GATEWAY/v1/chat/completions
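
For longer bursts it can be easier to tally the status codes than to read every line. The sketch below assumes the output format produced by the -w strings above, where the status code is the last field on each line. tally_codes is a hypothetical helper name, and the sample lines fed to it are made up.

```shell
# tally_codes: reads lines such as "alice #3 -> 200" on stdin and prints
# one "CODE COUNT" line per distinct status code, sorted by code.
tally_codes() {
  awk '{ count[$NF]++ } END { for (code in count) print code, count[code] }' | sort
}

# Sample input standing in for real loop output:
printf 'alice #1 -> 200\nalice #2 -> 200\nalice #3 -> 429\n' | tally_codes
# prints:
# 200 2
# 429 1
```

To use it on a real run, pipe the loop into the function, for example by ending the loop with "done | tally_codes".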

A run that exhausts alice's quota will print something like alice #1 -> 200 … alice #5 -> 429 … bob -> 200. To inspect the counter directly in Redis, scan for keys containing the identity value — Envoy Gateway names each counter <gateway-namespace>/<gateway-name>/<listener>_<route>_..._<x-user-id-value>_..._<window-timestamp>, so the user identity is the simplest filter:

ALICE_USER_ID=<alice-x-user-id-value>   # the value the gateway sees in x-user-id

kubectl run redis-cli --rm -i --restart=Never \
  --image=redis:7-alpine -- \
  redis-cli -h <redis-host> -p <redis-port> \
  --scan --pattern "*$ALICE_USER_ID*" | head
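
The window timestamp at the end of the key also hints at when a throttled caller will be admitted again. The sketch below rests on an assumption that the source does not state: that windows are fixed and aligned to multiples of the unit in Unix time, so an Hour window resets at the top of each hour. seconds_until_reset is a hypothetical helper name.

```shell
# seconds_until_reset NOW_EPOCH WINDOW_SECONDS
# Seconds left in the current fixed window, ASSUMING windows are aligned
# to multiples of WINDOW_SECONDS since the Unix epoch.
seconds_until_reset() {
  echo $(( $2 - $1 % $2 ))
}

# Time left in the current hourly window (3600 seconds):
seconds_until_reset "$(date +%s)" 3600
```

If the window timestamp in the Redis key does not fall on such a boundary, the alignment assumption does not hold for your deployment and this estimate should not be relied on.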

WARNING

If Redis is unreachable, Envoy fails open by default: requests pass through unmetered until Redis recovers. Watch the rate-limit pod's logs (kubectl logs deploy/envoy-ratelimit -n envoy-gateway-system) and alert on its Ready condition so silent quota loss does not go unnoticed.
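
As a stopgap until proper alerting is in place, the readiness of the rate-limit Deployment can be polled from a script. This is a rough sketch rather than a replacement for real monitoring. ratelimit_ready is a hypothetical helper name, and the kubectl invocation in the usage comment is one possible way to obtain its input.

```shell
# ratelimit_ready "READY/DESIRED"
# Succeeds only when the ready count equals a non-zero desired count.
ratelimit_ready() {
  _ready=${1%%/*}
  _desired=${1##*/}
  # Kubernetes omits status.readyReplicas when it is zero, so treat empty as 0.
  _ready=${_ready:-0}
  _desired=${_desired:-0}
  [ "$_desired" -gt 0 ] && [ "$_ready" -eq "$_desired" ]
}

# Usage (requires a kubectl context pointing at the cluster):
#   state=$(kubectl get deployment envoy-ratelimit -n envoy-gateway-system \
#     -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')
#   ratelimit_ready "$state" || echo "envoy-ratelimit not ready: quotas may be unenforced"
```

A ready Deployment does not prove that Redis itself is reachable, so the log check mentioned above is still worth keeping.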

Learn More

Next Steps

Configure Metering Token Usage to report consumption per tenant and feed chargeback.
