Ship Models, Not Outages: Canary Deployments for AI Workloads on Kubernetes

How I stopped nuking my inference endpoints every time I swapped a model — and started sleeping through deployments instead.

The 3 a.m. Problem

You’ve trained a new model. The eval numbers look great. You push it to production, swap the container image, and… latency spikes. Accuracy drops on a long-tail query distribution nobody tested. Your on-call gets paged at 3 a.m. Rollback. Post-mortem. Repeat.

Sound familiar? If you’re running ML inference on Kubernetes, you’ve probably lived through this cycle. The dirty secret of MLOps is that deploying a model is the easy part — safely deploying a model is the hard part.

This post walks through how I built a canary deployment pipeline for AI model serving on Kubernetes, tested end-to-end on a single-node K3s cluster with an NVIDIA GB10 Blackwell GPU. We’ll cover the architecture, the manifests, the traffic-splitting logic, and the automated rollback that lets you ship model updates with zero downtime and actual confidence.

Why Canary for Models?

Traditional canary deployments shift a small percentage of HTTP traffic to a new version of a service. For model inference, the stakes are different:

  • Latency regression is silent. A model that’s 40% slower won’t throw a 500 — it’ll just drain your SLO budget.
  • Accuracy regression is invisible to the load balancer. A model that confidently returns wrong answers looks “healthy” to every health check you’ve written.
  • GPU resources are expensive. Running two full replicas during a canary window costs real money. You need the window to be short and decisive.

Canary deployments for models need to be metric-aware — not just “is the pod alive?” but “is the model actually performing well on live traffic?”

Architecture Overview
┌────────────────────────────────────────────────┐
│               Ingress / Gateway                │
│            (Gateway API HTTPRoute)             │
└──────────┬──────────────────────────┬──────────┘
           │ 90% traffic              │ 10% traffic
           ▼                          ▼
┌─────────────────────┐    ┌─────────────────────┐
│  Stable Deployment  │    │  Canary Deployment  │
│      model-v1       │    │      model-v2       │
│ (vLLM + Qwen 0.5B)  │    │ (vLLM + Qwen 1.5B)  │
└──────────┬──────────┘    └──────────┬──────────┘
           │                          │
           └────────────┬─────────────┘
                        ▼
               ┌─────────────────┐
               │   Prometheus    │
               │     Metrics     │
               │    Collector    │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │     Canary      │
               │    Analysis     │
               │   Controller    │
               │ (Argo Rollouts) │
               └─────────────────┘

The key components:

  • Two Deployments behind two Services, one for the stable model and one for the canary.
  • Gateway API HTTPRoute with weighted backend references for traffic splitting.
  • Prometheus scraping vLLM’s native /metrics endpoint for latency, throughput, and token-level stats.
  • Argo Rollouts orchestrating the canary analysis, promotion, and rollback.

In my homelab I have an NVIDIA GB10 Blackwell setup with a single GPU, so I had to configure GPU time-slicing to run the stable and canary releases side by side. I am running gpu-operator-v26.3.0 with a time-slicing config like the one below:

time-slicing-config
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.config.name=time-slicing-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  gb10: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 2

kubectl get node $(hostname) -o jsonpath='{.metadata.labels}' | jq 'to_entries[] | select(.key | startswith("nvidia"))'
{
  "key": "nvidia.com/gpu.product",
  "value": "NVIDIA-GB10-SHARED"
}
{
  "key": "nvidia.com/gpu.replicas",
  "value": "2"
}
{
  "key": "nvidia.com/gpu.sharing-strategy",
  "value": "time-slicing"
}

Here are a simple stable model and a canary model.

Stable Model Deployment

First, let’s deploy our baseline model. We’ll use vLLM to serve Qwen/Qwen2.5-0.5B-Instruct, a compact model that runs comfortably on the GB10. The canary serves the larger Qwen/Qwen2.5-1.5B-Instruct, matching the architecture diagram above.

stable-model.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-0.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            # Both pods share one time-sliced GPU, so each takes well under half the memory
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  # Matches every pod labelled app: inference. Argo Rollouts later adds a
  # rollouts-pod-template-hash entry so this Service targets only the stable ReplicaSet.
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http

canary-model.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-1.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "canary"
---
apiVersion: v1
kind: Service
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  # Same selector as model-stable; Argo Rollouts narrows it to the canary ReplicaSet.
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
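
Time-slicing shares GPU compute between pods but does not partition GPU memory, which is why both manifests pass --gpu-memory-utilization 0.35. A quick way to sanity-check those fractions is sketched below. The function name and the 0.90 ceiling are my own illustrative assumptions, not vLLM or NVIDIA settings.

```python
def fits_on_shared_gpu(fractions, ceiling=0.90):
    """Return True if the combined --gpu-memory-utilization values leave headroom.

    `ceiling` is an assumed safety margin for this sketch, not a real setting.
    """
    if any(not 0.0 < f <= 1.0 for f in fractions):
        raise ValueError("each fraction must be in (0, 1]")
    return sum(fractions) <= ceiling


# Stable and canary pods from the manifests above, 0.35 each (0.70 combined).
print(fits_on_shared_gpu([0.35, 0.35]))
# vLLM's default of 0.9 per process would leave no room for a second pod.
print(fits_on_shared_gpu([0.90, 0.90]))
```

The first call passes and the second does not, which is the whole reason the flag is set explicitly on a shared GPU.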

Apply the stable manifest and wait for it to become ready.

kubectl apply -f stable-model.yaml
kubectl rollout status deployment/model-stable --timeout=300s

Gateway API Traffic Splitting

This is where it gets interesting. Instead of using an Ingress with annotations, we’ll use the Gateway API’s HTTPRoute with weighted backends — the standard, portable way to split traffic in Kubernetes.

gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod # optional
spec:
  gatewayClassName: envoy # or istio, cilium, etc.
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: model-stable
          port: 80
          weight: 100 # ← Argo Rollouts will adjust these
        - name: model-canary
          port: 80
          weight: 0 # ← starts at 0, ramps up during canary

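The magic here is that Argo Rollouts can drive the Gateway API through its traffic-router plugin (the argoproj-labs/gatewayAPI entry in the Rollout below, which has to be installed alongside the controller). It patches the weight fields on this HTTPRoute as the canary progresses, with no sidecars and no service mesh required.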
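
To make the weight fields concrete: in the Gateway API, each backend receives its weight divided by the sum of all weights in the rule. The sketch below shows that arithmetic. The helper names are hypothetical, and the assumption that the stable weight is always 100 minus the canary weight is mine for illustration.

```python
def traffic_shares(weights):
    """Map backend name -> fraction of requests (weight / sum of weights)."""
    total = sum(weights.values())
    if total == 0:
        # All-zero weights are left to the gateway implementation; not modelled here.
        raise ValueError("at least one backend needs a non-zero weight")
    return {name: w / total for name, w in weights.items()}


def canary_weights(canary_percent):
    """Assumed mapping from a canary percentage to the two backendRef weights."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {"model-stable": 100 - canary_percent, "model-canary": canary_percent}


for percent in (0, 10, 30, 60, 100):
    print(percent, traffic_shares(canary_weights(percent)))
```

At a canary weight of 10, roughly one request in ten reaches model-canary, which matches the 90/10 split in the architecture diagram.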

Prometheus Metrics for Model Health

vLLM exposes rich Prometheus metrics out of the box. The ones we care about for canary analysis:

Metric                              What it tells us
vllm:request_success_total          Total successful completions
vllm:e2e_request_latency_seconds    End-to-end request latency
vllm:time_to_first_token_seconds    TTFT, critical for streaming UX
vllm:num_requests_running           Current in-flight requests
vllm:gpu_cache_usage_perc           KV-cache pressure (memory health)
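
The latency metrics are histograms, so a percentile such as P95 is estimated from cumulative bucket counts rather than read off directly. Here is a minimal sketch of the linear interpolation that Prometheus' histogram_quantile performs. The bucket boundaries and counts are made up for illustration, and edge cases that Prometheus handles (such as non-monotonic buckets) are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == math.inf:
                # Quantile falls in the overflow bucket: report the last finite bound.
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Hypothetical cumulative counts for a latency histogram (seconds).
sample = [(0.5, 40), (1.0, 80), (2.5, 98), (5.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # roughly 2.25
```

With these made-up buckets the estimate lands around 2.25 seconds, so a "P95 under 2 seconds" gate would fail even though 80% of requests finished within one second.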

Set up a ServiceMonitor so Prometheus scrapes both deployments:

# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: inference
  endpoints:
    - port: http
      interval: 15s
      path: /metrics

The Argo Rollout: Tying It All Together

Here’s the heart of the system. This Rollout resource takes the place of the standalone Deployments above (the status output later in this post shows it owning both the stable and the canary ReplicaSet) and orchestrates the entire canary lifecycle: ramp traffic, analyze metrics, decide to promote or abort.

rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-canary-rollout
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: inference
      track: canary
  template:
    metadata:
      labels:
        app: inference
        track: canary
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # Args match the pod listing shown later in this post
          args:
            - "--model"
            - "Qwen/Qwen2.5-0.5B-Instruct" # args[1]: the value patched to start a canary
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
  strategy:
    canary:
      canaryService: model-canary
      stableService: model-stable
      # Gateway API integration (no Istio required)
      trafficRouting:
        plugins:
          argoproj-labs/gatewayAPI:
            httpRoute: model-route
            namespace: default
      # Canary progression steps
      steps:
        # Step 1: 10% traffic, run the analysis, then hold for 5 minutes
        - setWeight: 10
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 2: 30% traffic, analyze again
        - setWeight: 30
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 3: 60% traffic, the confidence gate
        - setWeight: 60
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
              - templateName: model-gpu-health
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 10m
        # Step 4: Full promotion
        - setWeight: 100

Analysis Templates

This is what makes canary deployments actually useful for ML workloads. The AnalysisTemplate resources below define Prometheus queries that Argo Rollouts evaluates at each analysis step. If any check fails, the rollout automatically aborts and shifts traffic back to stable.

# analysis-templates.yaml
# Check 1: P95 latency must stay under 2 seconds
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-latency-check
spec:
  args:
    - name: service-name
  metrics:
    - name: p95-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 2.0
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(
                vllm:e2e_request_latency_seconds_bucket{
                  kubernetes_name="{{args.service-name}}"
                }[2m]
              )) by (le)
            )
---
# Check 2: Error rate must stay under 5%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.05
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          # Check the label names on your vLLM version's /metrics output. If the
          # numerator selector matches no series, the query returns an empty result.
          query: |
            sum(rate(
              vllm:request_success_total{
                kubernetes_name="{{args.service-name}}",
                finished="false"
              }[2m]
            )) /
            sum(rate(
              vllm:request_success_total{
                kubernetes_name="{{args.service-name}}"
              }[2m]
            ))
---
# Check 3: GPU memory pressure, KV cache usage under 90%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-gpu-health
spec:
  args:
    - name: service-name
  metrics:
    - name: gpu-cache-pressure
      interval: 60s
      count: 3
      successCondition: result[0] < 0.90
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            avg(
              vllm:gpu_cache_usage_perc{
                kubernetes_name="{{args.service-name}}"
              }
            )

Let’s break down what each check is doing:

  • P95 Latency Check: Runs every 60 seconds, 5 times. The canary passes if the 95th percentile request latency stays under 2 seconds. It tolerates up to 2 failures (cold-start grace period).
  • Error Rate Check: Same cadence. Fails the canary if more than 5% of requests are errors. This catches model crashes, OOM kills, or CUDA errors that produce HTTP 500s.
  • GPU Health Check: Only runs at the 60% traffic step. Checks whether the KV cache is being overwhelmed. A model that already uses 90%+ of its cache under partial traffic has little headroom left for full load, so abort early.
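
The interplay of count and failureLimit in the checks above is easy to misread, so here is a sketch of how I understand it: a metric fails once the number of failed measurements exceeds failureLimit, and otherwise succeeds after count measurements. The function below is a hypothetical simplification that ignores the inconclusive and error states.

```python
def evaluate_metric(values, threshold, count, failure_limit):
    """Return (phase, measurements_taken) for a 'value < threshold' success condition."""
    failures = 0
    taken = 0
    for value in values[:count]:
        taken += 1
        if not value < threshold:
            failures += 1
            if failures > failure_limit:
                # Too many failed measurements: stop early and fail the metric.
                return "Failed", taken
    return "Successful", taken


# P95 latency check: threshold 2.0 s, count 5, failureLimit 2.
print(evaluate_metric([2.6, 2.3, 1.4, 1.2, 1.1], 2.0, 5, 2))  # two slow cold-start samples
print(evaluate_metric([2.6, 2.3, 2.1, 1.2, 1.1], 2.0, 5, 2))  # third slow sample tips it over
# GPU health check: threshold 0.90, count 3, failureLimit 1.
print(evaluate_metric([0.85, 0.93, 0.95], 0.90, 3, 1))
```

The first run survives two slow samples, which is the cold-start grace period described above. The other two exceed their limits and fail on the third measurement.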

Running a Canary Release

With everything deployed, triggering a canary is as simple as updating the image or model arguments:

kubectl patch rollout model-canary-rollout --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/args/1",
   "value": "Qwen/Qwen2.5-1.5B-Instruct"}
]'
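
The path in that patch is a JSON Pointer: containers/0 is the first container and args/1 is the second argument, which is the model name that follows "--model". The sketch below applies the same replace operation to a trimmed, hypothetical stand-in for the Rollout spec. The helper is my own and skips the "~" escape rules of the pointer syntax.

```python
def replace_at(doc, pointer, value):
    """Apply a JSON Patch "replace" at a JSON Pointer ('~' escapes not handled)."""
    *parents, last = pointer.lstrip("/").split("/")
    node = doc
    for token in parents:
        node = node[int(token)] if isinstance(node, list) else node[token]
    key = int(last) if isinstance(node, list) else last
    node[key]  # "replace" requires an existing target; raises if it is missing
    node[key] = value


# Trimmed stand-in for the Rollout; only the fields the pointer walks through.
rollout = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "vllm",
         "args": ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--port", "8000"]},
    ]}}}
}

replace_at(rollout, "/spec/template/spec/containers/0/args/1",
           "Qwen/Qwen2.5-1.5B-Instruct")
print(rollout["spec"]["template"]["spec"]["containers"][0]["args"])
```

Because the patch addresses the argument by position, reordering the args list in the manifest would silently change what gets replaced. Keeping "--model" first avoids that.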

kubectl argo rollouts get rollout model-canary-rollout --watch
Name:            model-canary-rollout
Namespace:       default
Status:          ॥ Paused
Message:         CanaryPauseStep
Strategy:        Canary
  Step:          3/7
  SetWeight:     30
  ActualWeight:  50
Images:          vllm/vllm-openai:latest (canary, stable)
Replicas:
  Desired:       1
  Current:       2
  Updated:       1
  Ready:         2
  Available:     2

NAME                                              KIND        STATUS        AGE    INFO
⟳ model-canary-rollout                            Rollout     ॥ Paused      16m
├──# revision:3
│  └──⧉ model-canary-rollout-5b5dcd5588           ReplicaSet  ✔ Healthy     2m53s  canary
│     └──□ model-canary-rollout-5b5dcd5588-wzd8g  Pod         ✔ Running     2m52s  ready:1/1
├──# revision:2
│  └──⧉ model-canary-rollout-c66fdbdd4            ReplicaSet  • ScaledDown  8m20s
└──# revision:1
   └──⧉ model-canary-rollout-f47959d95            ReplicaSet  ✔ Healthy     16m    stable
      └──□ model-canary-rollout-f47959d95-x7wxm   Pod         ✔ Running     16m    ready:1/1

$ kubectl get pods -o custom-columns=NAME:.metadata.name,ARGS:.spec.containers[*].args
NAME                                    ARGS
model-canary-rollout-5b5dcd5588-wzd8g   [--model Qwen/Qwen2.5-1.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]
model-canary-rollout-f47959d95-x7wxm    [--model Qwen/Qwen2.5-0.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]

If the analysis passes at every step, the canary gets promoted and the old ReplicaSet scales down. If any check fails, the rollout aborts and traffic shifts back to stable. No pages. No rollback scripts.

Before you run this in production, consider these additions:

Custom metrics for ML quality. The analysis templates above catch infrastructure failures (latency, errors, GPU health). For true model quality monitoring, add a sidecar that samples canary responses and evaluates them against a reference dataset, then exposes an accuracy or quality_score metric to Prometheus.
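
As a rough idea of what such a sidecar could compute, here is a sketch that scores sampled canary responses against reference answers and renders the result in the Prometheus text exposition format. Exact-match scoring is a deliberately crude stand-in for a real evaluation, and the metric name canary_quality_score and the track label are hypothetical.

```python
def exact_match_score(responses, references):
    """Fraction of responses equal to their reference, ignoring case and outer whitespace."""
    if not responses or len(responses) != len(references):
        raise ValueError("need equally sized, non-empty response and reference lists")
    hits = sum(
        got.strip().lower() == want.strip().lower()
        for got, want in zip(responses, references)
    )
    return hits / len(responses)


def render_gauge(name, value, labels):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_str}}} {value}\n"


# Made-up sampled responses and reference answers.
score = exact_match_score(["Paris", "4", "blue "], ["paris", "4", "green"])
print(render_gauge("canary_quality_score", score, {"track": "canary"}))
```

An AnalysisTemplate could then gate on this gauge the same way the latency check gates on P95, with a successCondition such as a minimum score.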

Notification hooks. Argo Rollouts supports Slack, PagerDuty, and webhook notifications. Wire up an on-abort hook so the team knows why a canary failed.

Multi-GPU progressive rollout. If you have a cluster with multiple GPU nodes, run the canary on a single node first, then progressively schedule it across more nodes. Use node affinity and topology spread constraints to control this.

A/B testing with header-based routing. The Gateway API HTTPRoute supports header-based matching. Route internal/dogfood traffic to the canary via a custom header while external traffic stays on stable.

Key Takeaways

Canary deployments are not optional for production ML. Swapping a model binary is not the same as swapping a stateless microservice. Models have hidden performance characteristics — latency distributions, memory profiles, accuracy regressions — that only surface under real traffic.

Gateway API is the right abstraction for traffic splitting. It’s portable across gateway controllers (Envoy, Istio, Cilium), it’s part of the Kubernetes standard, and Argo Rollouts integrates with it through a traffic-router plugin. Stop writing Ingress annotation hacks.

Metrics-driven rollback is the whole point. The AnalysisTemplate resources are the most important part of this entire setup. Without them, you’re just doing a slow rollout, not a canary. The P95 latency check, error rate check, and GPU health check together form a safety net that catches infrastructure failures, latency regressions, and resource exhaustion.

Single-GPU canary testing is possible but noisy. On a GB10 or similar single-GPU setup, use GPU time-slicing as shown above (or MIG partitioning where the GPU supports it) together with generous failure tolerances to account for GPU contention. The analysis still adds value; it just needs tuned thresholds.

Start with infrastructure metrics, graduate to quality metrics. Latency and error rate catch 80% of bad deployments. Add model-quality scoring as a second phase once the pipeline is stable.

