
How I stopped nuking my inference endpoints every time I swapped a model — and started sleeping through deployments instead.
The 3 a.m. Problem
You’ve trained a new model. The eval numbers look great. You push it to production, swap the container image, and… latency spikes. Accuracy drops on a long-tail query distribution nobody tested. Your on-call gets paged at 3 a.m. Rollback. Post-mortem. Repeat.
Sound familiar? If you’re running ML inference on Kubernetes, you’ve probably lived through this cycle. The dirty secret of MLOps is that deploying a model is the easy part — safely deploying a model is the hard part.
This post walks through how I built a canary deployment pipeline for AI model serving on Kubernetes, tested end-to-end on a single-node K3s cluster with an NVIDIA GB10 Blackwell GPU. We’ll cover the architecture, the manifests, the traffic-splitting logic, and the automated rollback that lets you ship model updates with zero downtime and actual confidence.
Why Canary for Models?
Traditional canary deployments shift a small percentage of HTTP traffic to a new version of a service. For model inference, the stakes are different:
- Latency regression is silent. A model that’s 40% slower won’t throw a 500 — it’ll just drain your SLO budget.
- Accuracy regression is invisible to the load balancer. A model that confidently returns wrong answers looks “healthy” to every health check you’ve written.
- GPU resources are expensive. Running two full replicas during a canary window costs real money. You need the window to be short and decisive.
Canary deployments for models need to be metric-aware — not just “is the pod alive?” but “is the model actually performing well on live traffic?”
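To make the distinction concrete, here is a minimal Python sketch with made-up latency samples. The numbers, the 2-second budget, and the helper names (`p95`, `liveness_ok`, `metric_aware_ok`) are illustrative assumptions, not measurements from this cluster. Every request succeeds in both cases, so a liveness-style check passes, yet the slower model misses the latency budget.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of latencies (seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]

# Hypothetical per-request latencies in seconds; every request returned HTTP 200.
stable_latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
canary_latencies = [round(x * 1.4, 2) for x in stable_latencies]  # 40% slower

SLO_P95_SECONDS = 2.0  # assumed latency budget

def liveness_ok(latencies):
    # A health check only sees that responses came back at all.
    return len(latencies) > 0

def metric_aware_ok(latencies, budget=SLO_P95_SECONDS):
    # A metric-aware gate also compares tail latency against the budget.
    return p95(latencies) < budget

for name, samples in [("stable", stable_latencies), ("canary", canary_latencies)]:
    print(name, "alive:", liveness_ok(samples), "within SLO:", metric_aware_ok(samples))
```

Both models report as alive, but only the stable one stays within the assumed budget, which is the gap a metric-aware canary is meant to close.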
Architecture Overview
┌─────────────────────────────────────────────────────┐
│                  Ingress / Gateway                  │
│               (Gateway API HTTPRoute)               │
└──────────┬──────────────────────┬───────────────────┘
           │ 90% traffic          │ 10% traffic
           ▼                      ▼
 ┌─────────────────┐   ┌─────────────────────┐
 │Stable Deployment│   │  Canary Deployment  │
 │    model-v1     │   │      model-v2       │
 │ (vLLM+Qwen0.5B) │   │  (vLLM+Qwen 1.5B)   │
 └─────────┬───────┘   └──────────┬──────────┘
           │                      │
           └───────┬──────────────┘
                   ▼
          ┌─────────────────┐
          │   Prometheus    │
          │     Metrics     │
          │    Collector    │
          └────────┬────────┘
                   │
                   ▼
          ┌─────────────────┐
          │     Canary      │
          │    Analysis     │
          │   Controller    │
          │ (Argo Rollouts) │
          └─────────────────┘

The key components:
- Two Deployments behind two Services, one for the stable model and one for the canary.
- Gateway API HTTPRoute with weighted backend references for traffic splitting.
- Prometheus scraping vLLM’s native /metrics endpoint for latency, throughput, and token-level stats.
- Argo Rollouts orchestrating the canary analysis, promotion, and rollback.
My homelab has a single NVIDIA GB10 Blackwell GPU, so I had to enable GPU time-slicing to run the stable and canary releases side by side. I am running gpu-operator-v26.3.0 with the time-slicing config below.
time-slicing-config

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.config.name=time-slicing-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  gb10: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 2

kubectl get node $(hostname) -o jsonpath='{.metadata.labels}' | jq 'to_entries[] | select(.key | startswith("nvidia"))'
{
  "key": "nvidia.com/gpu.product",
  "value": "NVIDIA-GB10-SHARED"
}
{
  "key": "nvidia.com/gpu.replicas",
  "value": "2"
}
{
  "key": "nvidia.com/gpu.sharing-strategy",
  "value": "time-slicing"
}

Here is a simple stable-model and canary-model.
Stable Model Deployment
First, let’s deploy our baseline model. We’ll use vLLM to serve Qwen/Qwen2.5-1.5B-Instruct, a compact model that runs comfortably on the GB10.

stable-model.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-1.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: model-stable
  labels:
    app: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http

canary-model.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-0.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.35"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          env:
            - name: VLLM_USAGE_SOURCE
              value: "canary"
---
apiVersion: v1
kind: Service
metadata:
  name: model-canary
  labels:
    app: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http

Apply and wait for it to become ready.
kubectl apply -f stable-model.yaml
kubectl rollout status deployment/model-stable --timeout=300s

Gateway API Traffic Splitting
This is where it gets interesting. Instead of using an Ingress with annotations, we’ll use the Gateway API’s HTTPRoute with weighted backends: the standard, portable way to split traffic in Kubernetes.

gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod # optional
spec:
  gatewayClassName: envoy # or istio, cilium, etc.
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: model-stable
          port: 80
          weight: 100 # ← Argo Rollouts will adjust these
        - name: model-canary
          port: 80
          weight: 0 # ← starts at 0, ramps up during canary
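For intuition about what those weight fields mean: in the Gateway API, each backend receives a share of requests proportional to its weight divided by the sum of the weights in the rule. The sketch below is plain Python, not the gateway's implementation; `traffic_share`, `pick_backend`, and the request count are hypothetical names and values chosen for illustration.

```python
import random
from collections import Counter

def traffic_share(backends):
    """Map backend name -> expected fraction of requests (weight / total weight)."""
    total = sum(backends.values())
    if total == 0:
        raise ValueError("at least one backend needs a non-zero weight")
    return {name: weight / total for name, weight in backends.items()}

def pick_backend(backends, rng):
    """Pick one backend at random, proportionally to its weight."""
    names = list(backends)
    return rng.choices(names, weights=[backends[n] for n in names], k=1)[0]

# Weights as they would look once the canary is at the 10% step.
weights = {"model-stable": 90, "model-canary": 10}
print(traffic_share(weights))

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(pick_backend(weights, rng) for _ in range(10_000))
print(counts)
```

Because the split is probabilistic per request, a short analysis window at a 10% weight sees few canary requests, which is one reason the analysis intervals later in this post span several minutes.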
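The magic here is that Argo Rollouts can drive the Gateway API through its Gateway API traffic-router plugin (the argoproj-labs/gatewayAPI entry in the Rollout below, which has to be installed in the Argo Rollouts controller). It will patch the weight fields on this HTTPRoute as the canary progresses, with no sidecars and no service mesh required.

Prometheus Metrics for Model Health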
vLLM exposes rich Prometheus metrics out of the box. The ones we care about for canary analysis:
Metric                             What it tells us
vllm:request_success_total         Total successful completions
vllm:request_latency_seconds       End-to-end request latency
vllm:time_to_first_token_seconds   TTFT (critical for streaming UX)
vllm:num_requests_running          Current in-flight requests
vllm:gpu_cache_usage_perc          KV-cache pressure (memory health)

Set up a ServiceMonitor so Prometheus scrapes both deployments:

# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-inference
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: inference
  endpoints:
    - port: http
      interval: 15s
      path: /metrics

The Argo Rollout: Tying It All Together
Here’s the heart of the system. This Rollout resource replaces the canary Deployment and orchestrates the entire canary lifecycle: ramp traffic, analyze metrics, decide to promote or abort.

rollout.yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-canary-rollout
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: inference
      track: canary
  template:
    metadata:
      labels:
        app: inference
        track: canary
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-1.5B-Instruct"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "4096"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
  strategy:
    canary:
      canaryService: model-canary
      stableService: model-stable
      # Gateway API integration, no Istio required
      trafficRouting:
        plugins:
          argoproj-labs/gatewayAPI:
            httpRoute: model-route
            namespace: default
      # Canary progression steps
      steps:
        # Step 1: 10% traffic, analyze for 5 minutes
        - setWeight: 10
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 2: 30% traffic, analyze again
        - setWeight: 30
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 5m
        # Step 3: 60% traffic, the confidence gate
        - setWeight: 60
        - analysis:
            templates:
              - templateName: model-latency-check
              - templateName: model-error-rate-check
              - templateName: model-gpu-health
            args:
              - name: service-name
                value: model-canary
        - pause:
            duration: 10m
        # Step 4: Full promotion
        - setWeight: 100
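The step list reads like a small state machine: raise the weight, run the gates, then either continue or fall back to 0. Below is a rough Python model of that control flow only. It is not how the Argo Rollouts controller is implemented, and `run_canary` and the gate callables are hypothetical stand-ins for the analysis runs.

```python
def run_canary(weights, gates_pass):
    """Walk the canary weights in order.

    weights: list of canary traffic percentages, e.g. [10, 30, 60, 100]
    gates_pass: callable(weight) -> bool, a stand-in for the analysis at that step
    Returns (final_canary_weight, history).
    """
    history = []
    for weight in weights:
        history.append(("setWeight", weight))
        if weight == 100:
            # Last step in the manifest: full promotion, no analysis attached.
            history.append(("promoted", weight))
            return weight, history
        if not gates_pass(weight):
            # A failed analysis aborts the rollout; traffic returns to stable.
            history.append(("aborted", weight))
            return 0, history
        history.append(("analysis-ok", weight))
    return (weights[-1] if weights else 0), history

steps = [10, 30, 60, 100]

# Healthy canary: every gate passes, so the canary ends at 100%.
healthy_final, _ = run_canary(steps, lambda w: True)
print(healthy_final)

# Canary that degrades once it sees more than 30% of traffic.
degraded_final, trail = run_canary(steps, lambda w: w <= 30)
print(degraded_final, trail)
```

The second run stops at the 60% step and returns a canary weight of 0, mirroring the abort path: stable never stops serving, so a failed gate costs canary traffic only.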
This is what makes canary deployments actually useful for ML workloads. These AnalysisTemplate resources define Prometheus queries that Argo Rollouts evaluates at each step. If any check fails, the rollout automatically aborts and shifts traffic back to stable.

Analysis Templates
# analysis-templates.yaml
# Check 1: P95 latency must stay under 2 seconds
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-latency-check
spec:
  args:
    - name: service-name
  metrics:
    - name: p95-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 2.0
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(vllm:request_latency_seconds_bucket{kubernetes_name="{{args.service-name}}"}[2m])) by (le)
            )
---
# Check 2: Error rate must stay under 5%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-error-rate-check
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.05
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            sum(rate(vllm:request_success_total{kubernetes_name="{{args.service-name}}",finished="false"}[2m])) /
            sum(rate(vllm:request_success_total{kubernetes_name="{{args.service-name}}"}[2m]))
---
# Check 3: GPU memory pressure, KV cache usage under 90%
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-gpu-health
spec:
  args:
    - name: service-name
  metrics:
    - name: gpu-cache-pressure
      interval: 60s
      count: 3
      successCondition: result[0] < 0.90
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
          query: |
            avg(vllm:gpu_cache_usage_perc{kubernetes_name="{{args.service-name}}"})

Let’s break down what each check is doing:
- P95 Latency Check: Runs every 60 seconds, 5 times. The canary passes if the 95th percentile request latency stays under 2 seconds. It tolerates up to 2 failures (cold-start grace period).
- Error Rate Check: Same cadence. Fails the canary if more than 5% of requests are errors. This catches model crashes, OOM kills, or CUDA errors that produce HTTP 500s.
- GPU Health Check: Only runs at the 60% traffic step. Checks whether the KV cache is being overwhelmed. A model that already uses 90%+ of its KV cache under partial traffic has little headroom left for full load, so abort early.
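The interplay of count, successCondition, and failureLimit in the checks above can be sketched in a few lines of Python. This is a simplified model under the assumption that a metric fails once the number of failed measurements exceeds failureLimit; it ignores errored or inconclusive measurements, and `evaluate_metric` is a hypothetical helper, not part of Argo Rollouts.

```python
def evaluate_metric(measurements, success_condition, count, failure_limit):
    """Return "Successful" or "Failed" for one metric.

    measurements: values in arrival order (e.g. one P95 sample per interval)
    success_condition: callable(value) -> bool
    count: how many measurements to take
    failure_limit: number of failed measurements that is still tolerated
    """
    failures = 0
    for value in measurements[:count]:
        if not success_condition(value):
            failures += 1
            if failures > failure_limit:
                # One failure beyond the limit fails the metric immediately.
                return "Failed"
    return "Successful"

def under_two_seconds(p95_seconds):
    return p95_seconds < 2.0

# Two slow samples during warm-up, then the model settles: tolerated.
warmup = evaluate_metric([2.6, 2.2, 1.4, 1.3, 1.2], under_two_seconds, count=5, failure_limit=2)
print(warmup)

# A third slow sample exceeds the limit of 2, so the check fails.
too_slow = evaluate_metric([2.6, 2.2, 2.1, 1.3, 1.2], under_two_seconds, count=5, failure_limit=2)
print(too_slow)
```

With failureLimit set to 2, the latency check forgives a cold start but not a sustained regression, which is the trade-off the thresholds in this post are tuned for.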
Running a Canary Release
With everything deployed, triggering a canary is as simple as updating the image or model arguments:
kubectl patch rollout model-canary-rollout --type=json -p='[
{"op": "replace",
"path": "/spec/template/spec/containers/0/args/1",
"value": "Qwen/Qwen2.5-1.5B-Instruct"}
]'
kubectl argo rollouts get rollout model-canary-rollout --watch
Name: model-canary-rollout
Namespace: default
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 3/7
SetWeight: 30
ActualWeight: 50
Images: vllm/vllm-openai:latest (canary, stable)
Replicas:
Desired: 1
Current: 2
Updated: 1
Ready: 2
Available: 2
NAME KIND STATUS AGE INFO
⟳ model-canary-rollout Rollout ॥ Paused 16m
├──# revision:3
│ └──⧉ model-canary-rollout-5b5dcd5588 ReplicaSet ✔ Healthy 2m53s canary
│ └──□ model-canary-rollout-5b5dcd5588-wzd8g Pod ✔ Running 2m52s ready:1/1
├──# revision:2
│ └──⧉ model-canary-rollout-c66fdbdd4 ReplicaSet • ScaledDown 8m20s
└──# revision:1
└──⧉ model-canary-rollout-f47959d95 ReplicaSet ✔ Healthy 16m stable
└──□ model-canary-rollout-f47959d95-x7wxm Pod ✔ Running 16m ready:1/1
$ kubectl get pods -o custom-columns=NAME:.metadata.name,ARGS:.spec.containers[*].args
NAME ARGS
model-canary-rollout-5b5dcd5588-wzd8g [--model Qwen/Qwen2.5-1.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]
model-canary-rollout-f47959d95-x7wxm [--model Qwen/Qwen2.5-0.5B-Instruct --port 8000 --max-model-len 2048 --gpu-memory-utilization 0.35]

If the analysis passes at every step, the canary gets promoted and the old ReplicaSet scales down. If any check fails, the rollout aborts and traffic shifts back to stable instantly. No pages. No rollback scripts.
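One caveat with the patch shown earlier in this section: the path /spec/template/spec/containers/0/args/1 relies on the model name being the second argument. A small helper can look up the position of --model instead of hard-coding it. This is a sketch in plain Python; `build_model_patch` is a hypothetical helper, and it only builds the JSON Patch document that would be passed to kubectl patch with --type=json.

```python
import json

def build_model_patch(container_args, new_model, container_index=0):
    """Build a JSON Patch (RFC 6902) that swaps the value following "--model"."""
    try:
        flag_index = container_args.index("--model")
    except ValueError:
        raise ValueError('"--model" flag not found in container args') from None
    value_index = flag_index + 1
    if value_index >= len(container_args):
        raise ValueError('"--model" flag has no value')
    path = f"/spec/template/spec/containers/{container_index}/args/{value_index}"
    return [{"op": "replace", "path": path, "value": new_model}]

# Args as they appear in the pod listing above.
args = ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--port", "8000",
        "--max-model-len", "2048", "--gpu-memory-utilization", "0.35"]

patch = build_model_patch(args, "Qwen/Qwen2.5-1.5B-Instruct")
print(json.dumps(patch))
```

If someone later reorders the arguments, the helper still targets the right element, whereas a fixed index would silently overwrite a different flag.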
Before you run this in production, consider these additions:
Custom metrics for ML quality. The analysis templates above catch infrastructure failures (latency, errors, GPU health). For true model quality monitoring, add a sidecar that samples canary responses and evaluates them against a reference dataset, then exposes an accuracy or quality_score metric to Prometheus.

Notification hooks. Argo Rollouts supports Slack, PagerDuty, and webhook notifications. Wire up an on-abort hook so the team knows why a canary failed.

Multi-GPU progressive rollout. If you have a cluster with multiple GPU nodes, run the canary on a single node first, then progressively schedule it across more nodes. Use node affinity and topology spread constraints to control this.

A/B testing with header-based routing. The Gateway API HTTPRoute supports header-based matching. Route internal/dogfood traffic to the canary via a custom header while external traffic stays on stable.

Key Takeaways
Canary deployments are not optional for production ML. Swapping a model binary is not the same as swapping a stateless microservice. Models have hidden performance characteristics — latency distributions, memory profiles, accuracy regressions — that only surface under real traffic.
Gateway API is the right abstraction for traffic splitting. It’s portable across gateway controllers (Envoy, Istio, Cilium), it’s an official Kubernetes API, and Argo Rollouts integrates with it through its Gateway API plugin. Stop writing Ingress annotation hacks.
Metrics-driven rollback is the whole point. The AnalysisTemplate resources are the most important part of this entire setup. Without them, you’re just doing a slow rollout, not a canary. The P95 latency check, error rate check, and GPU health check together form a safety net that catches infrastructure failures, latency regressions, and resource exhaustion.

Single-GPU canary testing is possible but noisy. On a GB10 or similar single-GPU setup, share the GPU with time-slicing as in this post (or MIG partitioning where the hardware supports it) and use generous failure tolerances to account for GPU contention. The analysis still adds value; it just needs tuned thresholds.
Start with infrastructure metrics, graduate to quality metrics. Latency and error rate catch 80% of bad deployments. Add model-quality scoring as a second phase once the pipeline is stable.
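As a starting point for that second phase, here is a minimal sketch of a quality scorer. It assumes a tiny hand-written reference set and exact-match scoring, both of which are placeholders; `quality_score`, `render_metric`, and the metric name canary_quality_score are hypothetical. The output uses the Prometheus text exposition format, so a sidecar could serve it on its own /metrics endpoint.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count as errors."""
    return " ".join(text.lower().split())

def quality_score(responses, references):
    """Fraction of sampled prompts whose response matches the reference answer."""
    scored = [prompt for prompt in references if prompt in responses]
    if not scored:
        return 0.0
    hits = sum(normalize(responses[p]) == normalize(references[p]) for p in scored)
    return hits / len(scored)

def render_metric(name, value, labels):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_str}}} {value}\n"

# Hypothetical reference answers and sampled canary responses.
references = {
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "opposite of hot?": "cold",
    "largest planet?": "Jupiter",
}
responses = {
    "capital of France?": "paris",
    "2 + 2?": "4",
    "opposite of hot?": "warm",
    "largest planet?": "Jupiter ",
}

score = quality_score(responses, references)
print(render_metric("canary_quality_score", score, {"service": "model-canary"}))
```

Exact match is far too strict for generative output, so a real scorer would swap in a task-appropriate comparison; the point is only that the result ends up as a number an AnalysisTemplate can query.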