
Over the last 1.5 years, I have built a lot of POCs and end-to-end products that use ML models, LLMs, and the like. With Gemini and Claude at your disposal, I am sure many of you have done the same. By the end of 2025, my home lab was serving 20+ models on a mix of Docker and EKS, with 100+ exporters, and I soon realized I had to do better. I was tired of building huge Docker images with TensorFlow baked in. Coming from a platform background, the sidecar pattern was something I knew well and had used heavily. This article covers what I learned, what worked, what did not, and what alternatives are available.
Overview: The Sidecar in Kubernetes
A Kubernetes Pod is the smallest unit of deployment: it groups one or more containers that share network and IPC namespaces and can mount the same storage volumes. The sidecar pattern leverages this proximity: a secondary container (the “sidecar”) runs alongside the main application to provide supporting functions such as logging, proxying, or configuration management, without altering the primary image.
Kubernetes also supports native sidecars: an entry in initContainers with restartPolicy: Always. The feature arrived as alpha in Kubernetes 1.28 and has been enabled by default since 1.29. A native sidecar starts before the main containers and is terminated after them, which solves the container-ordering problems that often plagued complex ML pipelines.
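To make the rule concrete, here is a minimal sketch that inspects a Pod spec (as a plain Python dict) and lists the init containers Kubernetes would treat as native sidecars. The helper name `native_sidecars` and the sample spec are illustrative assumptions, not part of any Kubernetes client library.

```python
from typing import Any, Dict, List


def native_sidecars(pod_spec: Dict[str, Any]) -> List[str]:
    """Return the names of init containers that run as native sidecars.

    An init container is a native sidecar when its restartPolicy is "Always".
    """
    return [
        container["name"]
        for container in pod_spec.get("initContainers", [])
        if container.get("restartPolicy") == "Always"
    ]


# Hypothetical Pod spec: one native sidecar, one classic run-to-completion init container.
pod_spec = {
    "initContainers": [
        {"name": "model-loader", "image": "python:3.11-slim", "restartPolicy": "Always"},
        {"name": "run-migrations", "image": "busybox"},
    ],
    "containers": [{"name": "vllm", "image": "vllm/vllm-openai:latest"}],
}

print(native_sidecars(pod_spec))
```

Only `model-loader` qualifies here; `run-migrations` has no restartPolicy, so it must exit before the Pod's startup sequence continues.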
What is the ML Sidecar Pattern?
The ML sidecar pattern applies this architectural primitive specifically to machine learning workloads. Rather than bundling model-serving infrastructure, data pipelines, monitoring agents, or GPU management utilities into a single container image, these jobs are factored out into dedicated sidecar containers that run alongside the ML workload.
Common ML sidecar roles may include:
Model Loader / Cache Sidecar
Handles downloading model artifacts from object storage or a model registry, caching them on a shared emptyDir or persistent volume, and serving them to the primary inference container via the local filesystem. This decouples model lifecycle management from inference code.
Inference Gateway Sidecar
A lightweight reverse proxy that handles request batching with adaptive batching windows, request queuing, and load shedding.
Metrics / Observability Sidecar
Collects model-specific telemetry — inference latency distributions, prediction drift statistics, feature distribution histograms, GPU utilization (via NVML or DCGM) — and exports them to Prometheus, Datadog, or a custom telemetry pipeline. This avoids instrumenting the model server itself.
Data Preprocessor Sidecar
Runs feature transformation, tokenization, or input validation in a separate container, often in a different language runtime than the model server. The primary container receives pre-processed tensors over localhost, keeping the serving code focused on inference.
There are many more, such as MPS context management and MIG partitioning, and the list goes on.
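Of the roles above, the inference gateway's batching window is the least obvious, so here is a minimal sketch of the idea using asyncio. It assumes a hypothetical `run_batch` coroutine that takes a list of inputs and returns one result per input; `MicroBatcher`, `max_batch`, and `window_s` are illustrative names, and a real gateway would add queue limits and load shedding on top.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class MicroBatcher:
    """Groups single requests into batches bounded by size and by wait time."""

    def __init__(self, run_batch: Callable[[List[Any]], Awaitable[List[Any]]],
                 max_batch: int = 8, window_s: float = 0.01) -> None:
        self._run_batch = run_batch
        self._max_batch = max_batch
        self._window_s = window_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        """Worker loop: block for the first item, then fill the batch until
        it is full or the batching window has elapsed."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._window_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await self._run_batch([item for item, _ in batch])
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)


async def demo():
    sizes = []

    async def fake_model(inputs):  # stand-in for a real batched inference call
        sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = MicroBatcher(fake_model, max_batch=4, window_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Each caller still sees a one-request-in, one-result-out interface; only the worker knows that requests were grouped.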
Now let us talk about them in detail.
Model Loading
I had a huge number of entrypoint.sh scripts that all looked very similar to this:
shell.sh
#!/bin/bash
# Download the model archive, unpack it, then start TF Serving
gsutil cp gs://models/resourcemonitorai/latest/model.tar.gz /models/
tar xzf /models/model.tar.gz -C /models/
tensorflow_model_server --model_base_path=/models/fraud --rest_api_port=8501
This had a few drawbacks:
- No retries.
- No integrity checks.
- No way to update the model without restarting the Pod.
- Liveness probe only checked if TF Serving was responding, not if it had a model loaded.
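The last drawback is worth a closer look. TF Serving's REST API reports per-version load state at `/v1/models/<name>`, so a probe can ask whether a model is actually loaded rather than whether the process answers. A minimal sketch, assuming the REST port 8501 from the script above; the model name passed in is an assumption (TF Serving uses `default` unless a name is configured), and `model_is_available` and `probe` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict


def model_is_available(status: Dict[str, Any]) -> bool:
    """True if at least one model version reports state AVAILABLE."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def probe(base_url: str, model_name: str, timeout: float = 2.0) -> bool:
    """Query the model status endpoint; any network or parse error counts as not ready."""
    url = f"{base_url}/v1/models/{model_name}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_is_available(json.load(resp))
    except (OSError, ValueError):
        return False


# Example payloads in the shape TF Serving returns for the status endpoint.
loading = {"model_version_status": [{"version": "1", "state": "LOADING"}]}
loaded = {"model_version_status": [{"version": "1", "state": "AVAILABLE"}]}
print(model_is_available(loading), model_is_available(loaded))
```

A script like this can back an exec readiness probe by exiting non-zero when `probe("http://localhost:8501", "default")` returns False.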
The Solution: A Dedicated Model Loader Sidecar
The fix was a dedicated program, running as a native sidecar container, that takes over model downloading, caching, and update checks, and is the natural place for retries and integrity verification.
The code and the manifest look like this.
loader.py
"""Model loader sidecar: downloads a HuggingFace model to a shared volume,
writes a .ready sentinel file, then polls for updates.
"""
import os
import time
import json
import logging
from pathlib import Path
from datetime import datetime, timezone

from huggingface_hub import HfApi, snapshot_download

logging.basicConfig(level=logging.INFO, format="%(asctime)s [loader] %(message)s")
logger = logging.getLogger("model-loader")

MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-1.5B-Instruct")
MODEL_DIR = os.environ.get("MODEL_DIR", "/models/qwen")
READY_FILE = os.environ.get("READY_FILE", "/models/.ready")
METADATA_FILE = os.environ.get("METADATA_FILE", "/models/metadata.json")
POLL_INTERVAL = int(os.environ.get("POLL_INTERVAL", "300"))


def get_remote_sha():
    """Return the commit SHA the model repo currently points to."""
    return HfApi().model_info(MODEL_ID).sha


def download_model(revision):
    """Download the given revision from HuggingFace Hub to the local directory."""
    logger.info(f"Downloading {MODEL_ID}@{revision} to {MODEL_DIR}")
    start = time.monotonic()
    path = snapshot_download(
        repo_id=MODEL_ID,
        revision=revision,
        local_dir=MODEL_DIR,
    )
    elapsed = time.monotonic() - start
    logger.info(f"Download complete in {elapsed:.1f}s -> {path}")
    return path


def write_metadata(remote_sha):
    """Write metadata about the loaded model for other sidecars to read."""
    meta = {
        "model_id": MODEL_ID,
        "model_dir": MODEL_DIR,
        # Recorded so check_for_update() can tell whether the repo has moved on.
        "remote_sha": remote_sha,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "loader_version": "v1.0.0",
    }
    with open(METADATA_FILE, "w") as f:
        json.dump(meta, f, indent=2)
    logger.info(f"Metadata written to {METADATA_FILE}")


def signal_ready():
    """Write the ready file that the inference server's readiness probe checks."""
    Path(READY_FILE).write_text("ready")
    logger.info(f"Ready file written: {READY_FILE}")


def check_for_update():
    """Return the new remote SHA if the model repo has changed, else None."""
    try:
        remote_sha = get_remote_sha()
        local_meta_path = Path(METADATA_FILE)
        if local_meta_path.exists():
            with open(local_meta_path) as f:
                local_meta = json.load(f)
            if local_meta.get("remote_sha") == remote_sha:
                return None
        logger.info(f"New version detected: {remote_sha}")
        return remote_sha
    except Exception as e:
        logger.warning(f"Update check failed: {e}")
        return None


def main():
    # Initial download, pinned to the commit recorded in the metadata
    sha = get_remote_sha()
    download_model(sha)
    write_metadata(sha)
    signal_ready()

    # Poll loop
    logger.info(f"Entering poll loop (interval: {POLL_INTERVAL}s)")
    while True:
        time.sleep(POLL_INTERVAL)
        try:
            new_sha = check_for_update()
            if new_sha:
                logger.info("Downloading updated model...")
                download_model(new_sha)
                write_metadata(new_sha)
                logger.info("Updated model files written to the shared volume")
        except Exception as e:
            logger.error(f"Poll cycle error: {e}")


if __name__ == "__main__":
    main()
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-sidecar
  labels:
    app: qwen-sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-sidecar
  template:
    metadata:
      labels:
        app: qwen-sidecar
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 30
      # ── Model Loader: Native Sidecar ────────────────────────────
      # On by default from K8s 1.29; 1.28 needs the SidecarContainers feature gate.
      initContainers:
        - name: model-loader
          restartPolicy: Always
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q huggingface_hub && \
              python /scripts/loader.py
          env:
            - name: MODEL_ID
              value: "Qwen/Qwen2.5-1.5B-Instruct"
            - name: MODEL_DIR
              value: "/models/qwen"
            - name: READY_FILE
              value: "/models/.ready"
            - name: POLL_INTERVAL
              value: "300"
          # The kubelet starts the main containers only after this probe passes,
          # so vLLM does not start against an empty model directory.
          startupProbe:
            exec:
              command: ["/bin/sh", "-c", "test -f /models/.ready"]
            periodSeconds: 10
            failureThreshold: 60   # allow up to 10 minutes for the first download
          volumeMounts:
            - name: model-volume
              mountPath: /models
            - name: loader-scripts
              mountPath: /scripts
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
      containers:
        # ── Primary: vLLM Inference Server ─────────────────────────
        # GPU mode: set VLLM_TARGET_DEVICE=cuda and add nvidia.com/gpu to resources
        # CPU mode: keep VLLM_TARGET_DEVICE=cpu and leave nvidia.com/gpu out
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args:
            - "serve"
            - "/models/qwen"
            - "--served-model-name"
            - "qwen"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--host"
            - "0.0.0.0"
            - "--dtype"
            - "float32"   # Required for CPU; GPU users can change to auto
          env:
            - name: VLLM_TARGET_DEVICE
              value: "cpu"   # Change to "cuda" for GPU nodes
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - "test -f /models/.ready && curl -sf http://localhost:8000/health"
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 12   # CPU startup is slow: allow 3 minutes
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 15
          lifecycle:
            preStop:
              exec:
                # /models is mounted read-only here, so the shutdown marker
                # goes on a separate writable volume.
                command: ["/bin/sh", "-c", "touch /signals/.shutdown && sleep 10"]
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
            - name: signal-volume
              mountPath: /signals
          resources:
            requests:
              cpu: "2"
              memory: 6Gi
            limits:
              cpu: "4"
              memory: 8Gi
            # For GPU nodes, add nvidia.com/gpu: "1" under both requests and limits.
        # ── Metrics Exporter Sidecar ──────────────────────────────
        - name: metrics-exporter
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q prometheus_client requests && \
              python /scripts/exporter.py
          env:
            - name: VLLM_URL
              value: "http://localhost:8000"
            - name: MODEL_NAME
              value: "qwen"
            - name: METRICS_PORT
              value: "9090"
          ports:
            - name: metrics
              containerPort: 9090
          volumeMounts:
            - name: exporter-scripts
              mountPath: /scripts
            - name: model-volume
              mountPath: /models
              readOnly: true
            - name: signal-volume
              mountPath: /signals
              readOnly: true
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "while ! test -f /signals/.shutdown; do sleep 1; done; sleep 15"
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 10Gi
        - name: signal-volume
          emptyDir: {}
        - name: loader-scripts
          configMap:
            name: model-loader-script
        - name: exporter-scripts
          configMap:
            name: metrics-exporter-script
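Two of the drawbacks listed earlier were missing retries and missing integrity checks, and the loader shown above leaves both to the download library. A minimal sketch of how a loader could add them is below. It assumes the expected SHA-256 digest is published alongside the artifact; `fetch_verified`, `sha256_of`, and the `fetch` callback are hypothetical names, and the callback stands in for whatever actually transfers the file.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(fetch: Callable[[Path], None], dest: Path, expected_sha256: str,
                   attempts: int = 4, base_delay_s: float = 1.0) -> Path:
    """Call fetch(dest), verify the checksum, and retry with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            fetch(dest)
            actual = sha256_of(dest)
            if actual != expected_sha256:
                raise ValueError(f"checksum mismatch: got {actual}")
            return dest
        except Exception as exc:
            last_error = exc
            dest.unlink(missing_ok=True)  # never leave a partial or corrupt file behind
            if attempt < attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Deleting the file on every failure means the ready file is only ever written next to an artifact that passed verification.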
The Metrics Sidecar — Observability Without Instrumentation
Build a metrics exporter that runs alongside every inference server, scrapes the server’s metrics and health endpoints, and exports standardized ML metrics. Because it runs in the same Pod, it reaches the inference server over localhost, with no service discovery and no network policies to configure.
alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qwen-sidecar-alerts
spec:
  groups:
    - name: qwen-inference
      interval: 30s
      rules:
        - alert: ModelNotHealthy
          expr: ml_model_healthy{model_name="qwen"} == 0
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "Qwen model server is not healthy"
        - alert: KVCacheNearFull
          expr: ml_gpu_kv_cache_usage{model_name="qwen"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "KV cache usage > 90%"
        - alert: RequestQueueBacklog
          expr: ml_requests_waiting{model_name="qwen"} > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} requests queued"
        - alert: ModelLoaderCrashLoop
          expr: |
            increase(kube_pod_container_status_restarts_total{
              container="model-loader"
            }[1h]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Model loader crash-looping"
exporter.py
"""Metrics exporter sidecar: scrapes vLLM's native metrics,
adds ML-specific alerting metrics, and exposes them for Prometheus.
"""
import os
import json
import time
import logging

import requests
from prometheus_client import (
    start_http_server,
    Gauge,
    Counter,
    Histogram,
    Info,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s [metrics] %(message)s")
logger = logging.getLogger("metrics-exporter")

VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000")
MODEL_NAME = os.environ.get("MODEL_NAME", "qwen")
METRICS_PORT = int(os.environ.get("METRICS_PORT", "9090"))
INTERVAL = int(os.environ.get("COLLECTION_INTERVAL", "15"))

# ─── Custom metrics (on top of what vLLM already exposes) ────────────
MODEL_HEALTHY = Gauge(
    "ml_model_healthy",
    "Whether the model server is healthy and serving",
    ["model_name"],
)
HEALTH_CHECK_LATENCY = Histogram(
    "ml_health_check_latency_seconds",
    "Latency of health checks to the inference server",
    ["model_name"],
    buckets=(0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
EXPORTER_ERRORS = Counter(
    "ml_exporter_errors_total",
    "Total errors in the metrics exporter sidecar",
    ["error_type"],
)
MODEL_INFO = Info(
    "ml_model",
    "Metadata about the served model",
)

# ─── Parse vLLM's /metrics for key signals ───────────────────────────
# Keys are vLLM metric names; values are the standardized gauges we re-export.
VLLM_GAUGE_PATTERNS = {
    "vllm:num_requests_running": Gauge(
        "ml_requests_running", "Number of requests currently being processed", ["model_name"]
    ),
    "vllm:num_requests_waiting": Gauge(
        "ml_requests_waiting", "Number of requests waiting in queue", ["model_name"]
    ),
    "vllm:gpu_cache_usage_perc": Gauge(
        "ml_gpu_kv_cache_usage", "GPU KV cache usage percentage", ["model_name"]
    ),
    "vllm:cpu_cache_usage_perc": Gauge(
        "ml_cpu_kv_cache_usage", "CPU KV cache usage percentage", ["model_name"]
    ),
}


def scrape_vllm_metrics():
    """Scrape vLLM's Prometheus endpoint and extract key metrics."""
    try:
        resp = requests.get(f"{VLLM_URL}/metrics", timeout=5)
        if resp.status_code != 200:
            return
        for line in resp.text.split("\n"):
            # Skip HELP/TYPE comments and blank lines
            if line.startswith("#") or not line.strip():
                continue
            for pattern, gauge in VLLM_GAUGE_PATTERNS.items():
                if line.startswith(pattern):
                    try:
                        # The sample value is the last whitespace-separated field
                        value = float(line.split()[-1])
                        gauge.labels(model_name=MODEL_NAME).set(value)
                    except (ValueError, IndexError):
                        pass
    except Exception as e:
        EXPORTER_ERRORS.labels(error_type="scrape").inc()
        # Logged at warning level so it is visible with the INFO log level set above
        logger.warning(f"Scrape error: {e}")


def check_health():
    """Check inference server health and record latency."""
    start = time.monotonic()
    try:
        resp = requests.get(f"{VLLM_URL}/health", timeout=5)
        elapsed = time.monotonic() - start
        HEALTH_CHECK_LATENCY.labels(model_name=MODEL_NAME).observe(elapsed)
        is_healthy = resp.status_code == 200
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(1 if is_healthy else 0)
        return is_healthy
    except requests.exceptions.ConnectionError:
        # Server not up yet (or already gone): unhealthy, but not an exporter error
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(0)
        return False
    except Exception:
        EXPORTER_ERRORS.labels(error_type="health_check").inc()
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(0)
        return False


def load_model_metadata():
    """Read metadata written by the model-loader sidecar."""
    try:
        with open("/models/metadata.json") as f:
            meta = json.load(f)
        MODEL_INFO.info({
            "model_id": meta.get("model_id", ""),
            "loaded_at": meta.get("loaded_at", ""),
            "loader_version": meta.get("loader_version", ""),
        })
    except Exception as e:
        logger.warning(f"Could not read model metadata: {e}")


def main():
    start_http_server(METRICS_PORT)
    logger.info(f"Metrics exporter started on :{METRICS_PORT}")
    logger.info(f"Scraping vLLM at {VLLM_URL} every {INTERVAL}s")
    load_model_metadata()
    while True:
        check_health()
        scrape_vllm_metrics()
        time.sleep(INTERVAL)


if __name__ == "__main__":
    main()
What works well in the Sidecar Pattern
- Ready file gating – the test -f /models/.ready check in the readiness probe. Before this, there was a race condition: the inference server would start, pass its own health check (the binary was running), get added to the Service endpoints, receive traffic, and return errors because the model wasn’t loaded yet.
- Decoupled model lifecycle – the model-loader sidecar can refresh the model files on the shared volume while vLLM keeps serving the version it already loaded. The Pod does not have to be rescheduled.
- Standardized metrics – every model deployment gets the same metrics exporter, same metric names, same dashboards.
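One way to keep a model refresh from disturbing a server that is still reading the old files is to download each version into its own directory and switch a `current` symlink atomically. This is a minimal sketch under that assumption, not what loader.py above does; `publish_version`, the directory layout, and the link name are hypothetical, and the server would need to be pointed at the `current` path.

```python
import os
from pathlib import Path


def publish_version(base_dir: Path, version: str, link_name: str = "current") -> Path:
    """Atomically repoint base_dir/link_name at base_dir/version.

    The new link is created under a temporary name and renamed over the old
    one, so readers see either the old target or the new one, never a gap.
    """
    target = base_dir / version
    if not target.is_dir():
        raise FileNotFoundError(f"version directory missing: {target}")
    link = base_dir / link_name
    tmp_link = base_dir / f".{link_name}.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()              # clean up after an interrupted earlier run
    os.symlink(version, tmp_link)      # relative target, resolved inside base_dir
    os.replace(tmp_link, link)         # atomic rename on POSIX filesystems
    return link
```

A loader would call `publish_version(Path("/models"), "<commit-sha>")` only after the download for that version has finished and been verified.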
What did not work so well in the Sidecar Pattern
- Resource overhead – two extra containers per Pod. The metrics exporter is light (128Mi), but the model-loader needs memory for the download. On GPU nodes where CPU/memory is already tight, this matters.
- Metrics are passive – the exporter scrapes and exports. Prometheus collects. Grafana displays. This is a monitoring pipeline, not a routing intelligence. By the time someone sees the KV cache is full, it’s already too late.
- Scaling is all-or-nothing – you can’t scale the metrics exporter independently of the GPU. If you need more observability capacity, you’re scaling entire GPU Pods.
The Ground Is Shifting — Gateway API Inference Extension and InferencePool
The Kubernetes ecosystem has introduced something that fundamentally reshapes how ML inference routing works: the Gateway API Inference Extension and its companion project llm-d. These projects don’t just iterate on existing patterns; they make several of the sidecars I spent months building obsolete.
What Is the Gateway API Inference Extension?
The Gateway API Inference Extension is an official Kubernetes project that extends the standard Gateway API with two new CRDs purpose-built for AI inference: InferencePool and InferenceModel. It uses Envoy’s External Processing (ext-proc) filter to upgrade a compatible gateway, such as Envoy Gateway, kgateway, or NGINX, into an “Inference Gateway” with model-aware, KV-cache-aware routing. The routing decisions come from an Endpoint Picker (EPP) that the InferencePool references.
The core idea is deceptively simple: instead of each Pod carrying its own routing logic, metrics collection, and batching sidecars, those responsibilities move to infrastructure-level components that operate across the entire pool of model-serving Pods.
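To illustrate what pool-level, KV-cache-aware routing means, here is a toy endpoint picker. It is only a sketch of the idea and not the actual scoring logic of the Endpoint Picker; `PodStats`, `pick_endpoint`, the 0.9 saturation limit, and the queue weight are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PodStats:
    name: str
    waiting: int            # requests queued on the Pod
    kv_cache_usage: float   # fraction of KV cache in use, 0.0 to 1.0


def pick_endpoint(pods: Sequence[PodStats], kv_cache_limit: float = 0.9,
                  queue_weight: float = 0.1) -> PodStats:
    """Prefer Pods below the KV-cache limit, then pick the lowest combined load."""
    if not pods:
        raise ValueError("no endpoints available")
    # Drop saturated Pods, unless that would leave nothing to route to.
    candidates = [p for p in pods if p.kv_cache_usage < kv_cache_limit] or list(pods)
    return min(candidates, key=lambda p: p.kv_cache_usage + queue_weight * p.waiting)


pods = [
    PodStats("qwen-pool-a", waiting=0, kv_cache_usage=0.95),  # idle queue, cache nearly full
    PodStats("qwen-pool-b", waiting=3, kv_cache_usage=0.40),
    PodStats("qwen-pool-c", waiting=0, kv_cache_usage=0.60),
]
print(pick_endpoint(pods).name)
```

The difference from the sidecar approach is where the decision is made: these signals are evaluated per request across the whole pool, rather than exported from each Pod for a dashboard.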
The new manifest looks like this — notice how much simpler the Pod spec becomes:
deployment.yaml, gateway.yaml, pool.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-pool
  labels:
    app: qwen-pool
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-pool
  template:
    metadata:
      labels:
        app: qwen-pool
    spec:
      terminationGracePeriodSeconds: 30
      # ── Model Loader Sidecar (still needed) ─────────────────────
      initContainers:
        - name: model-loader
          restartPolicy: Always
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q huggingface_hub && \
              python /scripts/loader.py
          env:
            - name: MODEL_ID
              value: "Qwen/Qwen2.5-1.5B-Instruct"
            - name: MODEL_DIR
              value: "/models/qwen"
            - name: READY_FILE
              value: "/models/.ready"
            - name: POLL_INTERVAL
              value: "300"
          # The kubelet starts the main containers only after this probe passes,
          # so vLLM does not start against an empty model directory.
          startupProbe:
            exec:
              command: ["/bin/sh", "-c", "test -f /models/.ready"]
            periodSeconds: 10
            failureThreshold: 60
          volumeMounts:
            - name: model-volume
              mountPath: /models
            - name: loader-scripts
              mountPath: /scripts
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
      containers:
        # ── Primary: vLLM (native metrics, no exporter sidecar) ──
        # Default: CPU mode. For GPU, set VLLM_TARGET_DEVICE=cuda,
        # change dtype to auto, and add nvidia.com/gpu to resources.
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args:
            - "serve"
            - "/models/qwen"
            - "--served-model-name"
            - "qwen"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--host"
            - "0.0.0.0"
            - "--dtype"
            - "float32"
          env:
            - name: VLLM_TARGET_DEVICE
              value: "cpu"
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - "test -f /models/.ready && curl -sf http://localhost:8000/health"
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 12
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 15
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
          resources:
            requests:
              cpu: "2"
              memory: 6Gi
            limits:
              cpu: "4"
              memory: 8Gi
            # For GPU nodes, add nvidia.com/gpu: "1" under both requests and limits.
        # NOTE: No metrics-exporter sidecar.
        # The EPP queries vLLM's native /metrics endpoint directly.
      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 10Gi
        - name: loader-scripts
          configMap:
            name: model-loader-script
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: http
      port: 80
      protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: qwen-pool
          port: 8000
---
# InferencePool: groups the model-serving Pods
# The EPP monitors all matching Pods and picks the best one per request.
# The EPP itself (Service qwen-pool-epp) is deployed separately and is not shown here.
# Field names below follow the v1 API; the v1alpha2 API used
# targetPortNumber, a flat selector, and extensionRef instead.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  targetPorts:
    - number: 8000
  selector:
    matchLabels:
      app: qwen-pool
  endpointPickerRef:
    name: qwen-pool-epp
    port:
      number: 9002
    failureMode: FailOpen
---
# InferenceModel: maps client-facing model name to backend model(s)
# It lives in the experimental x-k8s.io API group (v1alpha2). Releases that
# ship the v1 InferencePool replace it with InferenceObjective, so match
# this object to the CRDs installed in your cluster.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen
spec:
  modelName: qwen
  criticality: Standard
  poolRef:
    name: qwen-pool
  targetModels:
    - name: qwen
      weight: 100
The Lesson: Sidecars Are a Waypoint, Not a Destination
Looking back, here’s the pattern we see:
- Phase 1 — Monolith: Everything in one container. Works until it doesn’t.
- Phase 2 — Sidecars: Factor out cross-cutting concerns into co-located containers. Huge improvement in modularity and platform standardization.
- Phase 3 — Platform primitives: The concerns that were in sidecars get absorbed into purpose-built infrastructure (Inference Gateway, EPP, InferencePool). The Pod gets simpler again.
The direction we’re converging toward is clear: Pods should contain the model server and anything that needs GPU-local or filesystem-local access (model loading, KV-cache transfer). Everything else (routing, observability aggregation, traffic splitting, priority, load shedding) belongs at the infrastructure layer.