Sidecar Pattern in K8s MLOps

Over the last 1.5 years, I have built a lot of POCs and end-to-end products leveraging ML models, LLMs, and so on. With Gemini and Claude at our disposal, I am sure many of us have done the same. By the end of 2025, my home lab was serving 20+ models with a mix of Docker, EKS, and 100+ exporters, and I soon realized I had to do better. I was tired of building huge Docker images with TF. Coming from a platform background, I knew the sidecar pattern well and used it heavily. This article covers what I learnt, what worked, what did not, and what alternatives are available.

Overview: The Sidecar in Kubernetes

A Kubernetes Pod is the smallest unit of deployment, grouping one or more containers that share namespaces (network, IPC) and storage volumes. The sidecar pattern leverages this proximity: a secondary container (the “sidecar”) runs alongside the main application to provide supporting functions such as logging, proxying, or configuration management, without altering the primary image.

Native sidecar support has been available since Kubernetes 1.28 (enabled by default from 1.29) via initContainers with restartPolicy: Always. This ensures sidecars start before and terminate after the primary workload, solving the container-ordering issues that often plagued complex ML pipelines.

What is the ML Sidecar Pattern?

The ML sidecar pattern applies this architectural primitive specifically to machine learning workloads. Rather than bundling model-serving infrastructure, data pipelines, monitoring agents, or GPU management utilities into a single container image, these jobs are factored out into dedicated sidecar containers that run alongside the ML workload.

Common ML sidecar roles may include:

Model Loader / Cache Sidecar

Handles downloading model artifacts from object storage or a model registry, caching them on a shared emptyDir or persistent volume, and serving them to the primary inference container via the local filesystem. This decouples model lifecycle management from inference code.

Inference Gateway Sidecar

A lightweight reverse proxy that handles request batching, adaptive batching windows, request queuing, and load shedding.

Metrics / Observability Sidecar

Collects model-specific telemetry — inference latency distributions, prediction drift statistics, feature distribution histograms, GPU utilization (via NVML or DCGM) — and exports them to Prometheus, Datadog, or a custom telemetry pipeline. This avoids instrumenting the model server itself.

Data Preprocessor Sidecar

Runs feature transformation, tokenization, or input validation in a separate container, often in a different language runtime than the model server. The primary container receives pre-processed tensors over localhost, keeping the serving code focused on inference.

There are many more, such as MPS context management and MIG partitioning, and the list goes on.
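To make the inference gateway role above more concrete, here is a minimal sketch of micro-batching with load shedding. It is not taken from any particular gateway; the MicroBatcher class and its parameters (max_batch, max_wait_s, max_queue) are hypothetical names chosen for illustration. A bounded queue rejects requests once it is full, and a batch is closed when it reaches the size limit or the wait window expires.

```python
import queue
import time
from typing import List


class MicroBatcher:
    """Collects requests into batches bounded by size and by wait time."""

    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.01, max_queue: int = 64):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        # A bounded queue is what makes load shedding possible.
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=max_queue)

    def submit(self, request: str) -> bool:
        """Enqueue a request; return False (shed it) when the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_batch(self) -> List[str]:
        """Block for the first request, then gather more until size or deadline."""
        batch = [self._queue.get()]
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self._queue.get(timeout=remaining))
            except queue.Empty:
                break
        return batch
```

In a real sidecar, submit would be called from the HTTP handler (returning a 429 or 503 when it yields False) and next_batch from a worker loop that forwards each batch to the model server over localhost.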

Now let us look at two of them in detail.

Model Loading

I had a huge number of entrypoint.sh scripts that looked very similar to this:

shell.sh
#!/bin/bash
gsutil cp gs://models/resourcemonitorai/latest/model.tar.gz /models/
tar xzf /models/model.tar.gz -C /models/
tensorflow_model_server --model_base_path=/models/fraud --rest_api_port=8501

This had a few drawbacks:

  • No retries.
  • No integrity checks.
  • No way to update the model without restarting the Pod.
  • Liveness probe only checked if TF Serving was responding, not if it had a model loaded.

The Solution: A Dedicated Model Loader Sidecar

I wrote a dedicated loader that handles model downloading, caching, and integrity verification, and runs as a native sidecar container.
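The retry and integrity-check drawbacks listed earlier can be addressed with a small helper like the one sketched below. This is an illustration under assumptions, not the loader used in this article: fetch_verified, sha256_of, and the fetch callback are hypothetical names, and the expected checksum is assumed to come from wherever you publish your artifacts.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(
    fetch: Callable[[Path], None],
    dest: Path,
    expected_sha256: str,
    attempts: int = 3,
    base_delay_s: float = 1.0,
) -> Path:
    """Call fetch(tmp_path) with retries, verify the checksum, then publish."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            tmp = dest.with_name(dest.name + ".partial")
            fetch(tmp)
            actual = sha256_of(tmp)
            if actual != expected_sha256:
                raise ValueError(f"checksum mismatch: got {actual}")
            tmp.replace(dest)  # atomic rename on the same filesystem
            return dest
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

The fetch callback is deliberately abstract: it could wrap gsutil, an S3 client, or an HTTP download. Writing to a .partial file first means readers of the shared volume never see a half-written artifact.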

The loader code and the manifest look like this.

loader.py
"""
Model loader sidecar: downloads a HuggingFace model to a shared volume,
writes a .ready sentinel file, then polls for updates.
"""
import os
import time
import json
import logging
from pathlib import Path
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s [loader] %(message)s")
logger = logging.getLogger("model-loader")

MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-1.5B-Instruct")
MODEL_DIR = os.environ.get("MODEL_DIR", "/models/qwen")
READY_FILE = os.environ.get("READY_FILE", "/models/.ready")
METADATA_FILE = os.environ.get("METADATA_FILE", "/models/metadata.json")
POLL_INTERVAL = int(os.environ.get("POLL_INTERVAL", "300"))


def get_remote_sha():
    """Return the current commit SHA of the model repo on the Hub."""
    from huggingface_hub import HfApi

    return HfApi().model_info(MODEL_ID).sha


def download_model(revision=None):
    """Download model from HuggingFace Hub to local directory."""
    from huggingface_hub import snapshot_download

    logger.info(f"Downloading {MODEL_ID} to {MODEL_DIR}")
    start = time.monotonic()
    path = snapshot_download(
        repo_id=MODEL_ID,
        revision=revision,
        local_dir=MODEL_DIR,
        local_dir_use_symlinks=False,
    )
    elapsed = time.monotonic() - start
    logger.info(f"Download complete in {elapsed:.1f}s -> {path}")
    return path


def write_metadata(remote_sha):
    """Write metadata about the loaded model for other sidecars to read."""
    meta = {
        "model_id": MODEL_ID,
        "model_dir": MODEL_DIR,
        # Recorded so check_for_update() can compare against the Hub.
        "remote_sha": remote_sha,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "loader_version": "v1.0.0",
    }
    with open(METADATA_FILE, "w") as f:
        json.dump(meta, f, indent=2)
    logger.info(f"Metadata written to {METADATA_FILE}")


def signal_ready():
    """Write the ready file that the inference server's readiness probe checks."""
    Path(READY_FILE).write_text("ready")
    logger.info(f"Ready file written: {READY_FILE}")


def check_for_update():
    """
    Check if the remote model has been updated.
    Returns the new commit SHA if one is available, otherwise None.
    """
    try:
        remote_sha = get_remote_sha()
        local_meta_path = Path(METADATA_FILE)
        if local_meta_path.exists():
            with open(local_meta_path) as f:
                local_meta = json.load(f)
            if local_meta.get("remote_sha") == remote_sha:
                return None
        logger.info(f"New version detected: {remote_sha}")
        return remote_sha
    except Exception as e:
        logger.warning(f"Update check failed: {e}")
        return None


def main():
    # Initial download, pinned to the SHA that gets recorded in the metadata
    remote_sha = get_remote_sha()
    download_model(revision=remote_sha)
    write_metadata(remote_sha)
    signal_ready()

    # Poll loop
    logger.info(f"Entering poll loop (interval: {POLL_INTERVAL}s)")
    while True:
        time.sleep(POLL_INTERVAL)
        try:
            new_sha = check_for_update()
            if new_sha:
                logger.info("Downloading updated model...")
                download_model(revision=new_sha)
                write_metadata(new_sha)
                logger.info("Model files updated on shared volume")
        except Exception as e:
            logger.error(f"Poll cycle error: {e}")


if __name__ == "__main__":
    main()
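
One caveat with updating files in place is that a reader can observe a half-updated directory. A common alternative, sketched here as an assumption rather than something the loader above does, is to download each version into its own directory and atomically repoint a symlink. The activate_version helper and the "current" link name are hypothetical. A server that has already loaded weights into memory will not pick up the change on its own; it still needs a reload or restart.

```python
import os
from pathlib import Path


def activate_version(models_root: Path, version_dir: Path, link_name: str = "current") -> Path:
    """Atomically repoint models_root/link_name at version_dir.

    version_dir is assumed to be a direct child of models_root, so the link
    target can be relative and stays valid in any container that mounts the
    same volume, whatever the mount path is.
    """
    link = models_root / link_name
    tmp_link = models_root / f".{link_name}.tmp"
    # Clean up a leftover temporary link from an interrupted earlier run.
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(version_dir.name, tmp_link)
    # Renaming over the old link swaps it in one step on POSIX filesystems.
    os.replace(tmp_link, link)
    return link
```

With this layout the inference server would be pointed at /models/current instead of a fixed directory, and old version directories can be garbage-collected once nothing references them.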

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-sidecar
  labels:
    app: qwen-sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-sidecar
  template:
    metadata:
      labels:
        app: qwen-sidecar
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 30

      # ── Model Loader: Native Sidecar (K8s 1.28+) ────────────────
      initContainers:
        - name: model-loader
          restartPolicy: Always
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q huggingface_hub && \
              python /scripts/loader.py
          env:
            - name: MODEL_ID
              value: "Qwen/Qwen2.5-1.5B-Instruct"
            - name: MODEL_DIR
              value: "/models/qwen"
            - name: READY_FILE
              value: "/models/.ready"
            - name: POLL_INTERVAL
              value: "300"
          volumeMounts:
            - name: model-volume
              mountPath: /models
            - name: loader-scripts
              mountPath: /scripts
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi

      containers:
        # ── Primary: vLLM Inference Server ─────────────────────────
        # GPU mode: use image as-is, keep nvidia.com/gpu in resources
        # CPU mode: set VLLM_TARGET_DEVICE=cpu env var, remove nvidia.com/gpu lines
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args:
            - "serve"
            - "/models/qwen"
            - "--served-model-name"
            - "qwen"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--host"
            - "0.0.0.0"
            - "--dtype"
            - "float32"  # Required for CPU; GPU users can change to auto
          env:
            - name: VLLM_TARGET_DEVICE
              value: "cpu"  # Change to "cuda" for GPU nodes
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - "test -f /models/.ready && curl -sf http://localhost:8000/health"
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 12  # CPU startup is slow, so give it 3 minutes
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 15
          lifecycle:
            preStop:
              exec:
                # /models is mounted read-only in this container, so the
                # shutdown flag is written to a separate writable volume.
                command: ["/bin/sh", "-c", "touch /signals/.shutdown && sleep 10"]
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
            - name: shutdown-signal
              mountPath: /signals
          resources:
            requests:
              cpu: "2"
              memory: 6Gi
            limits:
              cpu: "4"
              memory: 8Gi
            # ── Uncomment below for GPU nodes ──
            # requests:
            #   nvidia.com/gpu: "1"
            # limits:
            #   nvidia.com/gpu: "1"

        # ── Metrics Exporter Sidecar ──────────────────────────────
        - name: metrics-exporter
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q prometheus_client requests && \
              python /scripts/exporter.py
          env:
            - name: VLLM_URL
              value: "http://localhost:8000"
            - name: MODEL_NAME
              value: "qwen"
            - name: METRICS_PORT
              value: "9090"
          ports:
            - name: metrics
              containerPort: 9090
          volumeMounts:
            - name: exporter-scripts
              mountPath: /scripts
            - name: model-volume
              mountPath: /models
              readOnly: true
            - name: shutdown-signal
              mountPath: /signals
              readOnly: true
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "while ! test -f /signals/.shutdown; do sleep 1; done; sleep 15"
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi

      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 10Gi
        - name: shutdown-signal
          emptyDir: {}
        - name: loader-scripts
          configMap:
            name: model-loader-script
        - name: exporter-scripts
          configMap:
            name: metrics-exporter-script

The Metrics Sidecar — Observability Without Instrumentation

I built a metrics exporter that runs alongside every inference server, scrapes the server’s health and metrics endpoints, and exports standardized ML metrics. Because it runs in the same Pod, it reaches the inference server over localhost, with no service discovery or network policies to configure.

alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qwen-sidecar-alerts
spec:
  groups:
    - name: qwen-inference
      interval: 30s
      rules:
        - alert: ModelNotHealthy
          expr: ml_model_healthy{model_name="qwen"} == 0
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "Qwen model server is not healthy"
        - alert: KVCacheNearFull
          expr: ml_gpu_kv_cache_usage{model_name="qwen"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "KV cache usage > 90%"
        - alert: RequestQueueBacklog
          expr: ml_requests_waiting{model_name="qwen"} > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "{{ $value }} requests queued"
        - alert: ModelLoaderCrashLoop
          expr: |
            increase(kube_pod_container_status_restarts_total{
              container="model-loader"
            }[1h]) > 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Model loader crash-looping"

exporter.py
"""
Metrics exporter sidecar: scrapes vLLM's native metrics,
adds ML-specific alerting metrics, and exposes them for Prometheus.
"""
import os
import json
import time
import logging
import requests
from prometheus_client import (
    start_http_server,
    Gauge,
    Counter,
    Histogram,
    Info,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s [metrics] %(message)s")
logger = logging.getLogger("metrics-exporter")

VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000")
MODEL_NAME = os.environ.get("MODEL_NAME", "qwen")
METRICS_PORT = int(os.environ.get("METRICS_PORT", "9090"))
INTERVAL = int(os.environ.get("COLLECTION_INTERVAL", "15"))

# ─── Custom metrics (on top of what vLLM already exposes) ────────────
MODEL_HEALTHY = Gauge(
    "ml_model_healthy",
    "Whether the model server is healthy and serving",
    ["model_name"],
)
HEALTH_CHECK_LATENCY = Histogram(
    "ml_health_check_latency_seconds",
    "Latency of health checks to the inference server",
    ["model_name"],
    buckets=(0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
EXPORTER_ERRORS = Counter(
    "ml_exporter_errors_total",
    "Total errors in the metrics exporter sidecar",
    ["error_type"],
)
MODEL_INFO = Info(
    "ml_model",
    "Metadata about the served model",
)

# ─── Parse vLLM's /metrics for key signals ───────────────────────────
VLLM_GAUGE_PATTERNS = {
    "vllm:num_requests_running": Gauge(
        "ml_requests_running", "Number of requests currently being processed", ["model_name"]
    ),
    "vllm:num_requests_waiting": Gauge(
        "ml_requests_waiting", "Number of requests waiting in queue", ["model_name"]
    ),
    "vllm:gpu_cache_usage_perc": Gauge(
        "ml_gpu_kv_cache_usage", "GPU KV cache usage percentage", ["model_name"]
    ),
    "vllm:cpu_cache_usage_perc": Gauge(
        "ml_cpu_kv_cache_usage", "CPU KV cache usage percentage", ["model_name"]
    ),
}


def scrape_vllm_metrics():
    """Scrape vLLM's Prometheus endpoint and extract key metrics."""
    try:
        resp = requests.get(f"{VLLM_URL}/metrics", timeout=5)
        if resp.status_code != 200:
            return
        for line in resp.text.split("\n"):
            # Skip HELP/TYPE comments and blank lines
            if line.startswith("#") or not line.strip():
                continue
            for pattern, gauge in VLLM_GAUGE_PATTERNS.items():
                if line.startswith(pattern):
                    try:
                        value = float(line.split()[-1])
                        gauge.labels(model_name=MODEL_NAME).set(value)
                    except (ValueError, IndexError):
                        pass
    except Exception as e:
        EXPORTER_ERRORS.labels(error_type="scrape").inc()
        logger.debug(f"Scrape error: {e}")


def check_health():
    """Check inference server health and record latency."""
    start = time.monotonic()
    try:
        resp = requests.get(f"{VLLM_URL}/health", timeout=5)
        elapsed = time.monotonic() - start
        HEALTH_CHECK_LATENCY.labels(model_name=MODEL_NAME).observe(elapsed)
        is_healthy = resp.status_code == 200
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(1 if is_healthy else 0)
        return is_healthy
    except requests.exceptions.ConnectionError:
        # Expected while the server is still starting; not counted as an error
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(0)
        return False
    except Exception as e:
        EXPORTER_ERRORS.labels(error_type="health_check").inc()
        MODEL_HEALTHY.labels(model_name=MODEL_NAME).set(0)
        logger.debug(f"Health check error: {e}")
        return False


def load_model_metadata():
    """Read metadata written by the model-loader sidecar."""
    try:
        with open("/models/metadata.json") as f:
            meta = json.load(f)
        MODEL_INFO.info({
            "model_id": meta.get("model_id", ""),
            "loaded_at": meta.get("loaded_at", ""),
            "loader_version": meta.get("loader_version", ""),
        })
    except Exception:
        # The file may not exist yet if the loader is still downloading
        pass


def main():
    start_http_server(METRICS_PORT)
    logger.info(f"Metrics exporter started on :{METRICS_PORT}")
    logger.info(f"Scraping vLLM at {VLLM_URL} every {INTERVAL}s")
    while True:
        # Re-read on every cycle so late or updated metadata is picked up
        load_model_metadata()
        check_health()
        scrape_vllm_metrics()
        time.sleep(INTERVAL)


if __name__ == "__main__":
    main()
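
The prefix-and-split parsing in the exporter is simple, but it can match a longer metric name that happens to share a prefix, and it would read a trailing timestamp as the value if the server ever emitted one. A slightly stricter parser, sketched below with only the standard library, matches the metric name exactly and keeps the labels. The parse_sample function is a hypothetical helper, and it deliberately does not handle a closing brace inside a label value.

```python
import re
from typing import Dict, Optional, Tuple

# name, optional {labels}, then the sample value (anything after it is ignored)
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Optional[Tuple[str, Dict[str, str], float]]:
    """Parse one line of Prometheus text exposition into (name, labels, value)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE comment
    match = _SAMPLE_RE.match(line)
    if match is None:
        return None
    try:
        value = float(match.group("value"))
    except ValueError:
        return None
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, value
```

With this helper the scrape loop can look up VLLM_GAUGE_PATTERNS by the exact parsed name instead of using startswith. The prometheus_client package also ships its own text parser in prometheus_client.parser, which is worth considering when a full parse is needed.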

What works well in the Sidecar Pattern

  • Ready file gating – The test -f /models/.ready check in the readiness probe. Before this, there was a race condition: the inference server would start, pass its own health check (the binary was running), get added to the Service endpoints, receive traffic, and return errors because the model wasn’t loaded yet.
  • Decoupled model lifecycle – the model-loader sidecar can update the model while vLLM keeps serving the old version. No Pod restart needed.
  • Standardized metrics – every model deployment gets the same metrics exporter, same metric names, same dashboards.
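
The ready-file gating described above relies on a shell one-liner that needs curl inside the serving image. Where that is not available, the same check can be expressed as a small Python probe script. This is a sketch under assumptions: is_ready, http_ok, and probe_exit_code are hypothetical names, and the paths and URL mirror the ones used in the manifest.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, timeouts, and non-2xx responses all mean "not ready"
        return False


def is_ready(ready_file: Path, health_check: Callable[[], bool]) -> bool:
    """Ready means the loader has signalled AND the server reports healthy."""
    return ready_file.is_file() and health_check()


def probe_exit_code(
    ready_file: str = "/models/.ready",
    health_url: str = "http://localhost:8000/health",
) -> int:
    """Exit code for an exec readiness probe: 0 = ready, 1 = not ready."""
    ok = is_ready(Path(ready_file), lambda: http_ok(health_url))
    return 0 if ok else 1
```

An exec probe would run a script that ends with sys.exit(probe_exit_code()). The file check comes first so the HTTP call is skipped entirely while the model is still downloading.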

What did not work so well in the Sidecar Pattern

  • Resource overhead – two extra containers per Pod. The metrics exporter is light (128Mi), but the model-loader needs memory for the download. On GPU nodes where CPU/memory is already tight, this matters.
  • Metrics are passive – the exporter scrapes and exports, Prometheus collects, and Grafana displays. This is a monitoring pipeline, not routing intelligence. By the time someone sees that the KV cache is full, it’s already too late.
  • Scaling is all-or-nothing – you can’t scale the metrics exporter independently of the GPU. If you need more observability capacity, you’re scaling entire GPU Pods.

The Ground Is Shifting — Gateway API Inference Extension and InferencePool

The Kubernetes ecosystem has introduced something that fundamentally reshapes how ML inference routing works: the Gateway API Inference Extension and its companion project llm-d. These projects don’t just iterate on existing patterns; they make several of the sidecars I spent months building obsolete.

What Is the Gateway API Inference Extension?

The Gateway API Inference Extension is an official Kubernetes project that extends the standard Gateway API with two new CRDs purpose-built for AI inference: InferencePool and InferenceModel. It works by leveraging Envoy’s External Processing (ext-proc) to upgrade any compatible gateway (such as Envoy Gateway, kgateway, or NGINX) into an “Inference Gateway” with model-aware, KV-cache-aware routing.

The core idea is deceptively simple: instead of each Pod carrying its own routing logic, metrics collection, and batching sidecars, those responsibilities move to infrastructure-level components that operate across the entire pool of model-serving Pods.
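
To illustrate what pool-level, KV-cache-aware routing means, here is a toy endpoint picker. It is not the algorithm used by the extension's endpoint picker (EPP); the PodStats type, the pick_endpoint function, and the scoring weights are all hypothetical. The point is only that a component with visibility across Pods can act on the same signals the sidecar exporter could merely report.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PodStats:
    name: str
    requests_waiting: int
    kv_cache_usage: float  # fraction in [0, 1]


def pick_endpoint(
    pods: Iterable[PodStats],
    kv_cache_limit: float = 0.9,
    queue_weight: float = 0.1,
) -> Optional[PodStats]:
    """Pick the least-loaded Pod, skipping Pods whose KV cache is nearly full.

    Returns None when every Pod is saturated, which the caller can treat as
    a signal to queue or shed the request.
    """
    candidates = [p for p in pods if p.kv_cache_usage < kv_cache_limit]
    if not candidates:
        return None
    # Lower score is better: cache pressure plus a penalty per queued request.
    return min(candidates, key=lambda p: p.kv_cache_usage + queue_weight * p.requests_waiting)
```

For example, given Pods a (5 waiting, 50% cache), b (0 waiting, 60% cache), and c (0 waiting, 95% cache), this sketch skips c and picks b, because a's queue penalty outweighs its lower cache usage.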

The new manifest looks like this — notice how much simpler the Pod spec becomes:

deployment.yaml, gateway.yaml, pool.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-pool
  labels:
    app: qwen-pool
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-pool
  template:
    metadata:
      labels:
        app: qwen-pool
    spec:
      terminationGracePeriodSeconds: 30

      # ── Model Loader Sidecar (still needed) ─────────────────────
      initContainers:
        - name: model-loader
          restartPolicy: Always
          image: python:3.11-slim
          command: ["/bin/sh", "-c"]
          args:
            - |
              pip install -q huggingface_hub && \
              python /scripts/loader.py
          env:
            - name: MODEL_ID
              value: "Qwen/Qwen2.5-1.5B-Instruct"
            - name: MODEL_DIR
              value: "/models/qwen"
            - name: READY_FILE
              value: "/models/.ready"
            - name: POLL_INTERVAL
              value: "300"
          volumeMounts:
            - name: model-volume
              mountPath: /models
            - name: loader-scripts
              mountPath: /scripts
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi

      containers:
        # ── Primary: vLLM (native metrics — no exporter sidecar) ──
        # Default: CPU mode. For GPU, set VLLM_TARGET_DEVICE=cuda,
        # change dtype to auto, and uncomment nvidia.com/gpu lines.
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args:
            - "serve"
            - "/models/qwen"
            - "--served-model-name"
            - "qwen"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "2048"
            - "--host"
            - "0.0.0.0"
            - "--dtype"
            - "float32"
          env:
            - name: VLLM_TARGET_DEVICE
              value: "cpu"
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - "test -f /models/.ready && curl -sf http://localhost:8000/health"
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 12
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 15
          volumeMounts:
            - name: model-volume
              mountPath: /models
              readOnly: true
          resources:
            requests:
              cpu: "2"
              memory: 6Gi
            limits:
              cpu: "4"
              memory: 8Gi
            # ── Uncomment for GPU nodes ──
            # requests:
            #   nvidia.com/gpu: "1"
            # limits:
            #   nvidia.com/gpu: "1"

        # NOTE: No metrics-exporter sidecar.
        # The EPP queries vLLM's native /metrics endpoint directly.

      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 10Gi
        - name: loader-scripts
          configMap:
            name: model-loader-script
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: http
      port: 80
      protocol: HTTP

---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: qwen-pool
          port: 8000
---
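# InferencePool: groups the model-serving Pods
# The EPP monitors all matching Pods and picks the best one per request
# NOTE: field names have changed between API versions of the extension;
# validate this manifest against the CRDs installed in your cluster.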
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  targetPorts:
    - number: 8000
  selector:
    app: qwen-pool
  extensionRef:
    name: qwen-pool-epp
    port: 9002
    failureMode: FailOpen

---
# InferenceModel: maps client-facing model name to backend model(s)
apiVersion: inference.networking.k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen
spec:
  modelName: qwen
  criticality: Standard
  poolRef:
    name: qwen-pool
  targetModels:
    - name: qwen
      weight: 100

The Lesson: Sidecars Are a Waypoint, Not a Destination

Looking back, here’s the pattern we see:

  • Phase 1 — Monolith: Everything in one container. Works until it doesn’t.
  • Phase 2 — Sidecars: Factor out cross-cutting concerns into co-located containers. Huge improvement in modularity and platform standardization.
  • Phase 3 — Platform primitives: The concerns that were in sidecars get absorbed into purpose-built infrastructure (Inference Gateway, EPP, InferencePool). The Pod gets simpler again.

The direction we’re converging toward is clear: Pods should contain the model server and anything that needs GPU-local or filesystem-local access (model loading, KV-cache transfer). Everything else, including routing, observability aggregation, traffic splitting, priority, and load shedding, belongs at the infrastructure layer.
