Ship Models, Not Outages: Canary Deployments for AI Workloads on Kubernetes

How I stopped nuking my inference endpoints every time I swapped a model — and started sleeping through deployments instead.

The 3 a.m. Problem

You’ve trained a new model. The eval numbers look great. You push it to production, swap the container image, and… latency spikes. Accuracy drops on a long-tail query distribution nobody tested. Your on-call gets paged at 3 a.m. Rollback. Post-mortem. Repeat.

Sound familiar? If you’re running ML inference on Kubernetes, you’ve probably lived through this cycle. The dirty secret of MLOps is that deploying a model is the easy part — safely deploying a model is the hard part.

This post walks through how I built a canary deployment pipeline for AI model serving on Kubernetes, tested end-to-end on a single-node K3s cluster with an NVIDIA GB10 Blackwell GPU. We’ll cover the architecture, the manifests, the traffic-splitting logic, and the automated rollback that lets you ship model updates with zero downtime and actual confidence.

Why Canary for Models?

Traditional canary deployments shift a small percentage of HTTP traffic to a new version of a service. For model inference, the stakes are different:

Latency regression is silent. A model that’s 40% slower won’t throw a 500 — it’ll just drain your SLO budget.
Accuracy regression is invisible to the load balancer. A model that confidently returns wrong answers looks “healthy” to every health check you’ve written.
GPU resources are expensive. Running two full replicas during a canary window costs real money. You need the window to be short and decisive.

Canary deployments for models need to be metric-aware — not just “is the pod alive?” but “is the model actually performing well on live traffic?”

Architecture Overview

┌─────────────────────────────────────────────────────┐
│                    Ingress / Gateway                │
│              (Gateway API HTTPRoute)                │
└──────────┬──────────────────────┬───────────────────┘
           │  90% traffic         │  10% traffic
           ▼                      ▼
┌─────────────────┐    ┌─────────────────────┐
│Stable Deployment│    │  Canary Deployment  │
│model-v1         │    │  model-v2           │
│(vLLM+Qwen0.5B)  │    │  (vLLM+Qwen 1.5B)   │
└─────────┬───────┘    └──────────┬──────────┘
          │                       │
          └───────┬───────────────┘
                  ▼
       ┌─────────────────┐
       │  Prometheus     │
       │  Metrics        │
       │  Collector      │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │  Canary         │
       │  Analysis       │
       │  Controller     │
       │  (Argo Rollouts)│
       └─────────────────┘

The key components:

Two Deployments behind two Services, one for the stable model and one for the canary.
Gateway API HTTPRoute with weighted backend references for traffic splitting.
Prometheus scraping vLLM’s native /metrics endpoint for latency, throughput, and token-level stats.
Argo Rollouts orchestrating the canary analysis, promotion, and rollback.

In my homelab I run a single NVIDIA GB10 Blackwell GPU, so I had to configure GPU time slicing to fit both the stable and canary releases on it. I am running gpu-operator-v26.3.0 with the time-slicing config below.

Install the GPU operator and point its device plugin at the config:

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.config.name=time-slicing-config

time-slicing-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  gb10: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 2

Once the config is applied, the node labels show the shared GPU:

kubectl get node $(hostname) -o jsonpath='{.metadata.labels}' | jq 'to_entries[] | select(.key | startswith("nvidia"))'
{
  "key": "nvidia.com/gpu.product",
  "value": "NVIDIA-GB10-SHARED"
}
{
  "key": "nvidia.com/gpu.replicas",
  "value": "2"
}
{
  "key": "nvidia.com/gpu.sharing-strategy",
  "value": "time-slicing"
}

Below are minimal manifests for the stable model and the canary model.

Stable Model Deployment

First, let’s deploy our baseline model. We’ll use vLLM to serve Qwen/Qwen2.5-0.5B-Instruct, a compact model that runs comfortably on the GB10 and matches the stable side of the diagram above.

stable-model.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-0.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http

		

canary-model.yaml

			
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-1.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "canary"
---
apiVersion: v1
kind: Service
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http

		

Apply the stable manifest and wait for the Deployment to become ready.

			
kubectl apply -f stable-model.yaml
kubectl rollout status deployment/model-stable --timeout=300s
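
Time slicing shares the GPU's compute but does not partition its memory, so the --gpu-memory-utilization fractions of every pod on the device have to fit together. A quick sanity check, sketched in Python; the function name and the 10% headroom are my own assumptions, not part of vLLM or the GPU operator:

```python
def fits_on_shared_gpu(fractions, headroom=0.10):
    """Return True if the per-pod memory fractions fit on one shared GPU.

    fractions: the --gpu-memory-utilization value of each pod on the GPU.
    headroom:  fraction kept free for CUDA contexts and other processes
               (an assumed safety margin, tune it for your hardware).
    """
    if any(not 0.0 < f <= 1.0 for f in fractions):
        raise ValueError("each fraction must be in (0, 1]")
    return sum(fractions) <= 1.0 - headroom


# Stable and canary pods from the manifests above, 0.35 each.
print(fits_on_shared_gpu([0.35, 0.35]))
# Two pods left at vLLM's default of 0.9 would oversubscribe the device.
print(fits_on_shared_gpu([0.9, 0.9]))
```

This is only arithmetic on the configured fractions; it says nothing about what the model weights and KV cache actually need.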

Gateway API Traffic Splitting

This is where it gets interesting. Instead of using an Ingress with annotations, we’ll use the Gateway API’s HTTPRoute with weighted backends — the standard, portable way to split traffic in Kubernetes.

gateway.yaml

			
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # optional
spec:
  gatewayClassName: envoy  # or istio, cilium, etc.
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: model-stable
          port: 80
          weight: 100       # ← Argo Rollouts will adjust these
        - name: model-canary
          port: 80
          weight: 0          # ← starts at 0, ramps up during canary

		

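The magic here is that Argo Rollouts can drive the Gateway API through its traffic-router plugin (argoproj-labs/gatewayAPI, referenced in the Rollout below), which has to be installed alongside the controller. It patches the weight fields on this HTTPRoute as the canary progresses, with no sidecars and no service mesh required.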
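
To make the weight semantics concrete, here is a small Python simulation of proportional backend selection. It is a model of what the weights mean, not how any particular gateway implements load balancing; the function name and the seed are illustrative assumptions.

```python
import random
from collections import Counter


def pick_backend(backends, rng):
    """Pick one backend name with probability proportional to its weight."""
    names = [name for name, weight in backends if weight > 0]
    weights = [weight for _, weight in backends if weight > 0]
    if not names:
        raise ValueError("at least one backend needs a positive weight")
    return rng.choices(names, weights=weights, k=1)[0]


# The 90/10 split from the architecture diagram.
backends = [("model-stable", 90), ("model-canary", 10)]
rng = random.Random(42)  # seeded so the simulation is repeatable
counts = Counter(pick_backend(backends, rng) for _ in range(10_000))
for name, weight in backends:
    print(f"{name}: weight={weight} share={counts[name] / 10_000:.1%}")
```

A backend with weight 0, like model-canary before the rollout starts, is never selected.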

Prometheus Metrics for Model Health

vLLM exposes rich Prometheus metrics out of the box. The ones we care about for canary analysis:

Metric                               What it tells us
vllm:request_success_total           Total successful completions
vllm:e2e_request_latency_seconds     End-to-end request latency
vllm:time_to_first_token_seconds     TTFT, critical for streaming UX
vllm:num_requests_running            Current in-flight requests
vllm:gpu_cache_usage_perc            KV-cache pressure (memory health)
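
The latency metrics in this table are Prometheus histograms: vLLM exports cumulative _bucket counters rather than ready-made percentiles, which is why the canary checks later in this post go through histogram_quantile. The sketch below is a simplified Python approximation of that interpolation, and the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound
             and ending with the +Inf bucket, like *_bucket series.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in seconds: (le, cumulative request count).
sample = [(0.5, 50), (1.0, 80), (2.0, 96), (5.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
print(f"estimated P95 latency: {p95:.3f}s")
```

The estimate is only as fine-grained as the bucket boundaries, so a threshold that sits between two wide buckets will be judged on interpolated values.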

Set up a ServiceMonitor so Prometheus scrapes both deployments:

			
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: inference
  endpoints:
    - port: http
      interval: 15s
      path: /metrics

		

The Argo Rollout — Tying It All Together

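Here’s the heart of the system. This Rollout resource takes over from the standalone Deployments: Argo Rollouts creates the stable and canary ReplicaSets itself and adds a rollouts-pod-template-hash selector to the model-stable and model-canary Services, so each Service targets only its own ReplicaSet. It then orchestrates the entire canary lifecycle: ramp traffic, analyze metrics, decide to promote or abort.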

rollout.yaml

			
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-canary-rollout
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: inference
      track: canary
  template:
    metadata:
      labels:
        app: inference
        track: canary
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-0.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
  strategy:
    canary:
      canaryService: model-canary
      stableService: model-stable
      # Gateway API integration — no Istio required
      trafficRouting:
        plugins:
          argoproj-labs/gatewayAPI:
            httpRoute: model-route
            namespace: default
      # Canary progression steps
      steps:
        # Step 1: 10% traffic, analyze for 5 minutes
        - setWeight: 10
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 2: 30% traffic, analyze again
        - setWeight: 30
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 3: 60% traffic — the confidence gate
        - setWeight: 60
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
              - templateName: model-gpu-health
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 10m
        # Step 4: Full promotion
        - setWeight: 100
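
Because inline analysis steps block until their AnalysisRun finishes, the step list above implies a minimum length for the canary window. A rough Python estimate follows, using the count and interval values from the analysis templates in this post; the assumption that the first measurement is taken immediately is mine, so treat the result as a lower bound.

```python
from dataclasses import dataclass


@dataclass
class Step:
    weight: int          # canary traffic percentage set by setWeight
    analysis_count: int  # largest "count" among the step's analysis metrics
    interval_s: int      # seconds between measurements
    pause_s: int         # duration of the pause that follows the analysis


def analysis_seconds(step):
    # Assumption: the first measurement runs immediately, so N measurements
    # span (N - 1) intervals. Query time and pod startup are ignored.
    return max(step.analysis_count - 1, 0) * step.interval_s


steps = [
    Step(weight=10, analysis_count=5, interval_s=60, pause_s=5 * 60),
    Step(weight=30, analysis_count=5, interval_s=60, pause_s=5 * 60),
    Step(weight=60, analysis_count=5, interval_s=60, pause_s=10 * 60),
]

elapsed = 0
for step in steps:
    elapsed += analysis_seconds(step) + step.pause_s
    print(f"{step.weight:>3}% step finished after ~{elapsed / 60:.0f} min")
total_minutes = elapsed / 60
print(f"minimum window before full promotion: ~{total_minutes:.0f} min")
```

That is how long both model copies occupy the GPU, which is worth knowing before tuning pause durations.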

		

Analysis Templates

This is what makes canary deployments actually useful for ML workloads. The AnalysisTemplate resources below define Prometheus queries that Argo Rollouts evaluates at each step. If any check fails, the rollout automatically aborts and shifts traffic back to stable.

			
# analysis-templates.yaml
# Check 1: P95 latency must stay under 2 seconds
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-latency-check
spec:
  args:
    - name: service-name
  metrics:
    - name: p95-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 2.0
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(
                vllm:e2e_request_latency_seconds_bucket{
                  kubernetes_name="{{args.service-name}}"
                }[2m]
              )) by (le)
            )
---
# Check 2: Error rate must stay under 5%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.05
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
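          # Assumption to verify: this relies on a finished="false" label. Check your
          # vLLM version's /metrics output; recent releases label this counter with
          # finished_reason instead, in which case the numerator matches nothing.
          query: |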
            sum(rate(
              vllm:request_success_total{
                kubernetes_name="{{args.service-name}}",
                finished="false"
              }[2m]
            )) /
            sum(rate(
              vllm:request_success_total{
                kubernetes_name="{{args.service-name}}"
              }[2m]
            ))
---
# Check 3: GPU memory pressure — KV cache usage under 90%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-gpu-health
spec:
  args:
    - name: service-name
  metrics:
    - name: gpu-cache-pressure
      interval: 60s
      count: 3
      successCondition: result[0] < 0.90
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            avg(
              vllm:gpu_cache_usage_perc{
                kubernetes_name="{{args.service-name}}"
              }
            )

		

Let’s break down what each check is doing:

P95 Latency Check: Runs every 60 seconds, 5 times. The canary passes if the 95th percentile request latency stays under 2 seconds. It tolerates up to 2 failures (cold-start grace period).
Error Rate Check: Same cadence. Fails the canary if more than 5% of requests are errors. This catches model crashes, OOM kills, or CUDA errors that produce HTTP 500s.
GPU Health Check: Only runs at the 60% traffic step. Checks whether the KV cache is being overwhelmed. A model that already uses 90%+ of its KV cache under partial traffic is likely to run out of cache under full load and start queueing or preempting requests, so abort early.
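
The interaction between count and failureLimit in the checks above can be sketched in a few lines of Python. This is a simplified model of the pass/fail rule only (it ignores inconclusive and errored measurements), and the latency samples are invented.

```python
def evaluate_metric(values, threshold, count, failure_limit):
    """Return "Successful" or "Failed" for one metric of an analysis run.

    values: measured results in order (for example P95 latency in seconds).
    A measurement fails when it is not below the threshold, and the metric
    fails as soon as the number of failures exceeds failure_limit.
    """
    failures = 0
    for value in values[:count]:
        if not value < threshold:
            failures += 1
        if failures > failure_limit:
            return "Failed"
    return "Successful"


# Hypothetical P95 samples: two slow readings during warm-up, then healthy.
warm_up = [2.4, 2.1, 1.3, 1.2, 1.1]
# Hypothetical P95 samples from a model that never recovers.
too_slow = [2.4, 2.1, 2.2, 2.3, 2.5]

print(evaluate_metric(warm_up, threshold=2.0, count=5, failure_limit=2))
print(evaluate_metric(too_slow, threshold=2.0, count=5, failure_limit=2))
```

With failureLimit set to 2, the third bad measurement is what fails the check, so a persistently slow canary is caught about three intervals in rather than after all five.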

Running a Canary Release

With everything deployed, triggering a canary is as simple as updating the image or model arguments:

kubectl patch rollout model-canary-rollout --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/args/1",
   "value": "Qwen/Qwen2.5-1.5B-Instruct"}
]'

kubectl argo rollouts get rollout model-canary-rollout --watch
Name:            model-canary-rollout
Namespace:       default
Status:          ॥ Paused
Message:         CanaryPauseStep
Strategy:        Canary
  Step:          3/7
  SetWeight:     30
  ActualWeight:  50
Images:          vllm/vllm-openai:latest (canary, stable)
Replicas:
  Desired:       1
  Current:       2
  Updated:       1
  Ready:         2
  Available:     2

NAME                                              KIND        STATUS        AGE    INFO
⟳ model-canary-rollout                            Rollout     ॥ Paused      16m    
├──# revision:3                                                                    
│  └──⧉ model-canary-rollout-5b5dcd5588           ReplicaSet  ✔ Healthy     2m53s  canary
│     └──□ model-canary-rollout-5b5dcd5588-wzd8g  Pod         ✔ Running     2m52s  ready:1/1
├──# revision:2                                                                    
│  └──⧉ model-canary-rollout-c66fdbdd4            ReplicaSet  • ScaledDown  8m20s  
└──# revision:1                                                                    
   └──⧉ model-canary-rollout-f47959d95            ReplicaSet  ✔ Healthy     16m    stable
      └──□ model-canary-rollout-f47959d95-x7wxm   Pod         ✔ Running     16m    ready:1/1

kubectl get pods -o custom-columns='NAME:.metadata.name,ARGS:.spec.containers[*].args'
NAME                                   ARGS
model-canary-rollout-5b5dcd5588-wzd8g   [--model Qwen/Qwen2.5-1.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]
model-canary-rollout-f47959d95-x7wxm    [--model Qwen/Qwen2.5-0.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]

If the analysis passes at every step, the canary gets promoted and the old ReplicaSet scales down. If any check fails, the rollout aborts and traffic shifts back to stable instantly. No pages. No rollback scripts.
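
The promote-or-abort behavior boils down to a small state machine. Here is a Python sketch of it; the step_passes hook is a hypothetical stand-in for the analysis at each weight, not an Argo Rollouts API.

```python
def run_canary(analyzed_weights, step_passes):
    """Walk the canary weights and return (outcome, canary_weight, history).

    analyzed_weights: canary percentages that are followed by an analysis.
    step_passes:      callable taking a weight and returning True when the
                      analysis at that weight succeeds (hypothetical hook).
    """
    history = []
    for weight in analyzed_weights:
        history.append(weight)
        if not step_passes(weight):
            # Abort: the canary weight drops to 0 and stable serves everything.
            return "aborted", 0, history
    # Every gate passed: the canary takes all traffic and becomes stable.
    return "promoted", 100, history


healthy = run_canary([10, 30, 60], lambda weight: True)
# A canary that only degrades once it sees 30% of traffic.
degrades = run_canary([10, 30, 60], lambda weight: weight < 30)
print(healthy)
print(degrades)
```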

Before you run this in production, consider these additions:

Custom metrics for ML quality. The analysis templates above catch infrastructure failures (latency, errors, GPU health). For true model quality monitoring, add a sidecar that samples canary responses and evaluates them against a reference dataset, then exposes an accuracy or quality_score metric to Prometheus.
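
As a starting point for such a sidecar, the Python sketch below scores sampled canary answers against a reference set and formats the result in the Prometheus text exposition format. The exact-match rule, the track label, and the sample data are illustrative assumptions; a real evaluator would need a scoring method suited to the task.

```python
def quality_score(samples, reference):
    """Fraction of sampled prompts whose canary answer matches the reference.

    samples:   dict of prompt -> canary answer sampled from live traffic.
    reference: dict of prompt -> expected answer (the reference dataset).
    Prompts missing from the reference are skipped. Exact match after
    normalisation is a deliberately crude scoring rule.
    """
    scored = [prompt for prompt in samples if prompt in reference]
    if not scored:
        return None
    hits = sum(
        samples[p].strip().lower() == reference[p].strip().lower() for p in scored
    )
    return hits / len(scored)


def exposition_line(score, track="canary"):
    """Format the score as one line of Prometheus text exposition."""
    return f'quality_score{{track="{track}"}} {score:.3f}'


# Hypothetical reference set and sampled canary responses.
reference = {"2+2?": "4", "Capital of France?": "Paris", "H2O is?": "water"}
samples = {
    "2+2?": "4",
    "Capital of France?": "paris ",
    "H2O is?": "hydrogen",
    "Unseen prompt": "anything",
}
score = quality_score(samples, reference)
line = exposition_line(score)
print(line)
```

Once a metric like this is scraped, it can be gated with an AnalysisTemplate in the same way as the latency and error-rate checks.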

Notification hooks. Argo Rollouts supports Slack, PagerDuty, and webhook notifications. Wire up an on-abort hook so the team knows why a canary failed.

Multi-GPU progressive rollout. If you have a cluster with multiple GPU nodes, run the canary on a single node first, then progressively schedule it across more nodes. Use node affinity and topology spread constraints to control this.

A/B testing with header-based routing. The Gateway API HTTPRoute supports header-based matching. Route internal/dogfood traffic to the canary via a custom header while external traffic stays on stable.
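
The routing decision itself is simple. Sketched in Python with a hypothetical x-canary header (the header name and value are assumptions, pick whatever your clients can send):

```python
def route_request(headers, canary_header="x-canary", canary_value="true"):
    """Return the backend Service name for a request.

    The header name and value are hypothetical. HTTP header names are
    case-insensitive, so they are lowercased before the lookup; the value
    is compared exactly.
    """
    normalised = {name.lower(): value for name, value in headers.items()}
    if normalised.get(canary_header) == canary_value:
        return "model-canary"
    return "model-stable"


print(route_request({"X-Canary": "true"}))        # dogfood traffic
print(route_request({"User-Agent": "curl/8.0"}))  # everyone else
```

In the cluster this logic would live in an HTTPRoute rule with a header match, evaluated by the gateway rather than by application code.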

Key Takeaways

Canary deployments are not optional for production ML. Swapping a model binary is not the same as swapping a stateless microservice. Models have hidden performance characteristics — latency distributions, memory profiles, accuracy regressions — that only surface under real traffic.

Gateway API is the right abstraction for traffic splitting. It’s portable across gateway controllers (Envoy, Istio, Cilium), it’s part of the Kubernetes standard, and Argo Rollouts integrates with it through the argoproj-labs/gatewayAPI traffic-router plugin. Stop writing Ingress annotation hacks.

Metrics-driven rollback is the whole point. The AnalysisTemplate resources are the most important part of this entire setup. Without them, you’re just doing a slow rollout, not a canary. The P95 latency check, error rate check, and GPU health check together form a safety net that catches infrastructure failures, latency regressions, and resource exhaustion.

Single-GPU canary testing is possible but noisy. On a GB10 or similar single-GPU setup, the stable and canary pods contend for the same time-sliced GPU, so use generous failure tolerances to account for that contention. The analysis still adds value; it just needs tuned thresholds.

Start with infrastructure metrics, graduate to quality metrics. Latency and error rate catch a large share of bad deployments. Add model-quality scoring as a second phase once the pipeline is stable.