
Stop treating a $30K A100 like a boolean. Dynamic Resource Allocation (GA in Kubernetes 1.34) lets you claim GPUs by VRAM, compute capability, interconnect topology, and MIG profile — then share them safely across workloads. This article walks through every pattern with real manifests.
The Problem: GPUs Are Not Integers
For years, requesting a GPU in Kubernetes looked like this:
```yaml
resources:
  limits:
    nvidia.com/gpu: "1"  # That's it. That's the whole API.
```
This single integer tells the scheduler nothing about:
- How much VRAM the workload actually needs – a 1.5B model needs ~3GB; a 70B model needs ~140GB
- Which GPU model is acceptable – your A100 and your T4 are not interchangeable
- Whether the GPU can be shared – a small inference workload using 4GB on an 80GB GPU wastes 95% of the device
- Interconnect requirements – distributed training needs NVLink
- MIG partitioning – an A100 can be sliced into up to 7 independent instances, but the old API can't express this
The result: GPUs sit at 15% utilization while the scheduler reports “Insufficient nvidia.com/gpu.” Every GPU is either fully allocated or fully idle. There’s no middle ground.
Dynamic Resource Allocation changes this fundamentally.
How DRA Works: The Mental Model
Think of DRA like PersistentVolumes, but for hardware:
| Storage | DRA (new) |
|---|---|
| StorageClass | DeviceClass |
| PersistentVolumeClaim | ResourceClaim |
| PersistentVolume | ResourceSlice |
| CSI Driver | DRA Driver |
The DRA driver (e.g., NVIDIA’s `k8s-dra-driver-gpu`) runs on each node and publishes ResourceSlices describing what hardware is available — not just “4 GPUs” but detailed attributes like model, VRAM, driver version, MIG profiles, PCIe topology, and NVLink connectivity.
Workloads create ResourceClaims that describe what they *need* using CEL expressions. The scheduler matches claims to slices, and the driver prepares the device for the Pod.
```
        ┌─────────────────────────────────────┐
        │        Kubernetes Scheduler         │
        │   Matches ResourceClaims to Slices  │
        └──────────┬──────────┬───────────────┘
                   │          │
     ┌─────────────┴───┐  ┌───┴─────────────────┐
     │ ResourceSlice   │  │ ResourceSlice       │
     │ Node: gpu-01    │  │ Node: gpu-02        │
     │ 4x A100-80GB    │  │ 2x T4-16GB          │
     │ NVLink: yes     │  │ NVLink: no          │
     │ MIG: enabled    │  │ MIG: not supported  │
     └─────────────────┘  └─────────────────────┘
             ▲                      ▲
     DRA Driver publishes    DRA Driver publishes
```
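A published ResourceSlice looks roughly like this — a simplified sketch, with illustrative attribute values; the exact names, pool layout, and generated object name depend on the driver version, so check `kubectl get resourceslices -o yaml` on your own cluster:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-01-gpu.nvidia.com-0   # driver-generated name
spec:
  driver: gpu.nvidia.com
  nodeName: gpu-01
  pool:
    name: gpu-01
    generation: 1
    resourceSliceCount: 1
  devices:
    - name: gpu-0
      attributes:
        productName:
          string: "NVIDIA A100-SXM4-80GB"
        driverVersion:
          version: "560.35.03"   # illustrative value
      capacity:
        memory:
          value: 80Gi
```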
You can install NVIDIA's DRA driver with Helm:

```shell
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --reuse-values \
  --set "resources.gpus.enabled=true" \
  --set "resources.computeDomains.enabled=true" \
  --set kubeletRootDirectory=/var/lib/kubelet
```

`resources.gpus.enabled=true` is required for the GPU claim patterns below; `resources.computeDomains.enabled` matters only if you also need multi-node NVLink.
Pattern 1: Claiming a GPU by Attributes (Not Count)
Instead of “give me 1 GPU,” we say “give me a GPU with at least 40GB VRAM”
Device Class – defines a category of GPUs
```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
    - cel:
        expression: >
          device.driver == "gpu.nvidia.com" &&
          device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```
Now a workload claims from the class
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: high-memory-gpu
            # Additional filtering beyond the DeviceClass:
            selectors:
              - cel:
                  expression: >
                    device.attributes["gpu.nvidia.com"].productName ==
                    "NVIDIA A100-SXM4-80GB"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-qwen
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
        - name: training-gpu
          resourceClaimTemplateName: training-gpu
      containers:
        - name: trainer
          image: vllm/vllm-openai:latest
          command: ["python", "-c"]
          args:
            - |
              import torch
              print(f"CUDA available: {torch.cuda.is_available()}")
              print(f"Device: {torch.cuda.get_device_name(0)}")
              print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
          resources:
            claims:
              - name: training-gpu
```
What happens
- Kubernetes creates a ResourceClaim from the template for this Pod
- The scheduler reads all ResourceSlices in the cluster
- It finds a node with a GPU that satisfies both the DeviceClass filter (>= 40GB VRAM) and the claim's selector (productName "NVIDIA A100-SXM4-80GB")
- It allocates that specific GPU to the claim
- The DRA driver on that node prepares the GPU (sets up CDI, MIG if needed)
- The Pod starts with exactly the GPU it asked for
Pattern 2: GPU Sharing for Inference
A small inference model using 3GB on an 80GB GPU is an $8/hr waste. DRA lets multiple Pods share a device:
Create a NAMED ResourceClaim (not a template) — multiple Pods reference the same claim and share the same physical GPU.
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-inference-gpu
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: any-nvidia-gpu
---
# Pod 1: Qwen 1.5B inference (~3GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: qwen-inference
spec:
  resourceClaims:
    - name: shared-gpu
      resourceClaimName: shared-inference-gpu  # References the SAME claim
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      command: ["vllm"]
      args: ["serve", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000",
             "--max-model-len", "2048", "--gpu-memory-utilization", "0.3"]
      resources:
        claims:
          - name: shared-gpu
---
# Pod 2: Embedding model on the SAME GPU (~2GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: embedding-service
spec:
  resourceClaims:
    - name: shared-gpu
      resourceClaimName: shared-inference-gpu  # Same claim = same GPU
  containers:
    - name: embedder
      image: vllm/vllm-openai:latest
      command: ["vllm"]
      args: ["serve", "BAAI/bge-base-en-v1.5", "--port", "8001",
             "--gpu-memory-utilization", "0.2"]
      resources:
        claims:
          - name: shared-gpu
```
Both Pods land on the same node and the same GPU. vLLM's `--gpu-memory-utilization` flag limits how much VRAM each process takes. The DRA driver ensures both containers see the device.
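The claim above references an `any-nvidia-gpu` DeviceClass that isn't defined anywhere else in this article. A minimal version might look like the following (a sketch — the name is arbitrary; pick whatever fits your cluster's conventions):

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: any-nvidia-gpu
spec:
  selectors:
    - cel:
        # Match any device published by the NVIDIA GPU DRA driver,
        # regardless of model, VRAM, or topology
        expression: device.driver == "gpu.nvidia.com"
```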
When to use ResourceClaim vs ResourceClaimTemplate
| | ResourceClaim (named) | ResourceClaimTemplate |
|---|---|---|
| Pods share the device | Yes — all Pods reference the same claim | No — each Pod gets its own claim |
| Use case | Inference (multiple models on one GPU) | Training (each replica needs its own GPU) |
| Lifecycle | Manual — persists until deleted | Automatic — created/deleted with the Pod |
Pattern 3: MIG Partitioning
NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG), which physically partitions a GPU into isolated instances. DRA can request specific MIG profiles:
MIG Profiles
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-mixed-workload
spec:
  spec:
    devices:
      requests:
        # A large slice for the primary model
        - name: large-slice
          exactly:
            deviceClassName: mig.nvidia.com
            selectors:
              - cel:
                  expression: >
                    device.attributes["gpu.nvidia.com"].profile == "3g.40gb"
        # A small slice for a helper model
        - name: small-slice
          exactly:
            deviceClassName: mig.nvidia.com
            selectors:
              - cel:
                  expression: >
                    device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
      # CRITICAL: ensure both slices come from the same physical GPU
      constraints:
        - requests: ["large-slice", "small-slice"]
          matchAttribute: "gpu.nvidia.com/parentUUID"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
    spec:
      resourceClaims:
        - name: mig-devices
          resourceClaimTemplateName: mig-mixed-workload
      containers:
        # Main model gets the 3g.40gb slice
        - name: primary-model
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args: ["serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000",
                 "--max-model-len", "4096"]
          resources:
            claims:
              - name: mig-devices
                request: large-slice
        # Embedding model gets the 1g.10gb slice
        - name: embedding-model
          image: vllm/vllm-openai:latest
          command: ["vllm"]
          args: ["serve", "BAAI/bge-base-en-v1.5", "--port", "8001"]
          resources:
            claims:
              - name: mig-devices
                request: small-slice
```
This is impossible with the old device plugin API. You couldn’t express “give me a 3g.40gb and a 1g.10gb from the same GPU.” DRA makes it declarative.
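For comparison, the closest the old device plugin API could get (with the plugin's mixed MIG strategy) was requesting slices as opaque extended resources — counts only, with no way to say the slices must share a parent GPU:

```yaml
# Old device-plugin style: each MIG profile is a separate counted resource.
resources:
  limits:
    nvidia.com/mig-3g.40gb: "1"
    nvidia.com/mig-1g.10gb: "1"  # may land on a different physical GPU
```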
Inspecting DRA State
```shell
# See what hardware the DRA driver has discovered
kubectl get resourceslices -o wide

# See all claims and their allocation status
kubectl get resourceclaims -A

# Inspect a specific claim's allocation details
kubectl get resourceclaim my-model-gpu-xxxxx -o yaml

# See which devices are allocated to a Pod
kubectl get pod my-model-xxxx -o jsonpath='{.status.containerStatuses[*].allocatedResourcesStatus}'

# List DeviceClasses available in the cluster
kubectl get deviceclasses
```
DRA is powerful but not complete
- No fractional GPU memory allocation at the Kubernetes level – you can request a MIG profile (a physical partition), but you can't say "give me 10GB of a non-MIG GPU." Workloads must manage their own memory budgets (e.g., vLLM's `--gpu-memory-utilization`).
- No cross-node device claims – A ResourceClaim is satisfied by devices on a single node. Multi-node NVLink (MNNVL) is handled by NVIDIA’s ComputeDomain abstraction in their DRA driver, not by core Kubernetes.
- CEL expressions can be complex – Filtering by memory, compute capability, interconnect, and MIG profile in a single expression requires fluency in CEL. Platform teams should encapsulate this complexity in DeviceClasses.
- Ecosystem maturity varies – NVIDIA’s DRA driver is the most complete. AMD’s is in beta. Intel, Google TPU, and others are in various stages. Check your vendor’s support before migrating.
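To the CEL-complexity point: a platform team can publish one DeviceClass that hides the filtering behind a friendly name, so application teams only ever write `deviceClassName`. This is a sketch — attribute and capacity names follow NVIDIA's driver conventions used earlier in this article, but verify them against the ResourceSlices in your own cluster:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: training-grade-gpu   # hypothetical name chosen for illustration
spec:
  selectors:
    - cel:
        expression: >
          device.driver == "gpu.nvidia.com" &&
          device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0 &&
          device.attributes["gpu.nvidia.com"].productName.startsWith("NVIDIA A100")
```

Workloads then request `deviceClassName: training-grade-gpu` with no CEL at all, and the platform team can tighten or loosen the filter in one place.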