Beyond Integer GPUs: Mastering DRA for ML Workloads

Stop treating a $30K A100 like a boolean. Dynamic Resource Allocation (GA in Kubernetes 1.34) lets you claim GPUs by VRAM, compute capability, interconnect topology, and MIG profile — then share them safely across workloads. This article walks through every pattern with real manifests.

The Problem: GPUs Are Not Integers

For years, requesting a GPU in Kubernetes looked like this:

resources:
limits:
nvidia.com/gpu: "1" # That's it. That's the whole API.

This single integer tells the scheduler nothing about:

  • How much VRAM the workload actually needs (a 1.5B model needs ~3GB; a 70B model needs ~140GB)
  • Which GPU model is acceptable (your A100 and your T4 are not interchangeable)
  • Whether the GPU can be shared (a small inference workload using 4GB on an 80GB GPU wastes 95% of the device)
  • Interconnect requirements (distributed training needs NVLink)
  • MIG partitioning (an A100 can be sliced into 7 independent instances, but the old API can't express this)

The result: GPUs sit at 15% utilization while the scheduler reports “Insufficient nvidia.com/gpu.” Every GPU is either fully allocated or fully idle. There’s no middle ground.

Dynamic Resource Allocation changes this fundamentally.

How DRA Works: The Mental Model

Think of DRA like PersistentVolumes, but for hardware:

Storage                  DRA (new)
StorageClass             DeviceClass
PersistentVolumeClaim    ResourceClaim
PersistentVolume         ResourceSlice
CSI Driver               DRA Driver

The DRA driver (e.g., NVIDIA’s `k8s-dra-driver-gpu`) runs on each node and publishes ResourceSlices describing what hardware is available — not just “4 GPUs” but detailed attributes like model, VRAM, driver version, MIG profiles, PCIe topology, and NVLink connectivity.
Workloads create ResourceClaims that describe what they *need* using CEL expressions. The scheduler matches claims to slices, and the driver prepares the device for the Pod.

        ┌─────────────────────────────────────┐
        │        Kubernetes Scheduler         │
        │  Matches ResourceClaims to Slices   │
        └───────────┬─────────────┬───────────┘
                    │             │
      ┌─────────────┴──────┐   ┌──┴─────────────────┐
      │ ResourceSlice      │   │ ResourceSlice      │
      │ Node: gpu-01       │   │ Node: gpu-02       │
      │ 4x A100-80GB       │   │ 2x T4-16GB         │
      │ NVLink: yes        │   │ NVLink: no         │
      │ MIG: enabled       │   │ MIG: not supported │
      └────────────────────┘   └────────────────────┘
               ▲                         ▲
      DRA Driver publishes     DRA Driver publishes
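To make the ResourceSlice side concrete, here is a trimmed sketch of what a GPU DRA driver might publish. Field names follow `resource.k8s.io/v1`; the node, device, and attribute values are illustrative and the exact attribute set varies by driver version:

```yaml
# Illustrative ResourceSlice; inspect real ones with:
#   kubectl get resourceslices -o yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-01-gpu.nvidia.com   # hypothetical name
spec:
  nodeName: gpu-01
  driver: gpu.nvidia.com
  pool:
    name: gpu-01
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      productName:
        string: NVIDIA A100-SXM4-80GB
      driverVersion:
        version: "560.35.03"
    capacity:
      memory:
        value: 80Gi
```

Each device entry carries typed attributes (string, version, bool, int) and capacities, which is exactly what the CEL selectors below match against.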

You can install NVIDIA's DRA driver with Helm. Note `resources.gpus.enabled=true`: the examples below allocate GPUs, so the GPU side of the driver must be enabled:

helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --reuse-values \
  --set "resources.gpus.enabled=true" \
  --set "resources.computeDomains.enabled=true" \
  --set kubeletRootDirectory=/var/lib/kubelet

Pattern 1: Claiming a GPU by Attributes (Not Count)

Instead of “give me 1 GPU,” we say “give me a GPU with at least 40GB of VRAM”:


# DeviceClass – defines a category of GPUs
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
  - cel:
      expression: >
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0

Now a workload claims from the class:
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu
spec:
  spec:   # yes, spec.spec – the outer spec wraps the claim template
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: high-memory-gpu
          # Additional filtering beyond the DeviceClass:
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].productName == "NVIDIA A100-SXM4-80GB"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-qwen
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: training-gpu
        resourceClaimTemplateName: training-gpu
      containers:
      - name: trainer
        image: vllm/vllm-openai:latest
        command: ["python", "-c"]
        args:
        - |
          import torch
          print(f"CUDA available: {torch.cuda.is_available()}")
          print(f"Device: {torch.cuda.get_device_name(0)}")
          print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        resources:
          claims:
          - name: training-gpu

What happens

  • Kubernetes creates a ResourceClaim from the template for this Pod
  • The scheduler reads all ResourceSlices in the cluster
  • It finds a node with an NVIDIA A100-SXM4-80GB whose capacity satisfies the >= 40Gi VRAM selector
  • It allocates that specific GPU to the claim
  • The DRA driver on that node prepares the GPU (sets up CDI, MIG if needed)
  • The Pod starts with exactly the GPU it asked for
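Once allocated, the claim's status records the decision. A trimmed sketch of what `kubectl get resourceclaim <name> -o yaml` might show (the pool, device, and Pod names here are illustrative):

```yaml
# Trimmed, illustrative status of an allocated ResourceClaim
status:
  allocation:
    devices:
      results:
      - request: gpu            # the request name from the template
        driver: gpu.nvidia.com
        pool: gpu-01
        device: gpu-0           # the specific physical GPU chosen
    nodeSelector:               # pins the Pod to the node that owns the device
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values: ["gpu-01"]
  reservedFor:                  # which Pods may use this allocation
  - resource: pods
    name: finetune-qwen-x7k2p   # illustrative generated Pod name
```

The `reservedFor` list is what makes sharing auditable: every consumer of the claim is recorded there.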

Pattern 2: GPU Sharing for Inference

A small inference model using 3GB on an 80GB GPU is an $8/hr waste. DRA lets multiple Pods share a device:

Create a NAMED ResourceClaim (not a template) — multiple Pods reference the same claim and share the same physical GPU.
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-inference-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: any-nvidia-gpu  # assumes a broad DeviceClass matching any NVIDIA GPU is defined (not shown)
---
# Pod 1: Qwen 1.5B inference (~3GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: qwen-inference
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-inference-gpu  # References the SAME claim
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command: ["vllm"]
    args: ["serve", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000",
           "--max-model-len", "2048", "--gpu-memory-utilization", "0.3"]
    resources:
      claims:
      - name: shared-gpu
---
# Pod 2: Embedding model on the SAME GPU (~2GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: embedding-service
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-inference-gpu  # Same claim = same GPU
  containers:
  - name: embedder
    image: vllm/vllm-openai:latest
    command: ["vllm"]
    args: ["serve", "BAAI/bge-base-en-v1.5", "--port", "8001",
           "--gpu-memory-utilization", "0.2"]
    resources:
      claims:
      - name: shared-gpu


Both Pods land on the same node, same GPU. vLLM’s `--gpu-memory-utilization` flag limits how much VRAM each process takes. The DRA driver ensures both containers see the device.
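Nothing at the Kubernetes layer enforces those fractions, so it is worth sanity-checking the budget yourself. A minimal sketch in plain Python, assuming an 80 GB device and the two `--gpu-memory-utilization` values above:

```python
# Back-of-envelope VRAM budget for co-located vLLM processes.
# Assumes an 80 GB device; fractions mirror --gpu-memory-utilization above.
TOTAL_VRAM_GB = 80
budgets = {"qwen-inference": 0.3, "embedding-service": 0.2}

used = {name: frac * TOTAL_VRAM_GB for name, frac in budgets.items()}
total_used = sum(used.values())

for name, gb in used.items():
    print(f"{name}: {gb:.0f} GB reserved")
verdict = "fits" if total_used <= TOTAL_VRAM_GB else "OVERCOMMITTED"
print(f"total: {total_used:.0f} / {TOTAL_VRAM_GB} GB ({verdict})")
```

If the total exceeds the device, the second process will typically fail at startup with a CUDA out-of-memory error, because DRA itself does not partition the memory.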

When to use ResourceClaim vs ResourceClaimTemplate

                         ResourceClaim (named)                      ResourceClaimTemplate
Pods share the device    Yes — all Pods reference the same claim    No — each Pod gets its own claim
Use case                 Inference (multiple models on one GPU)     Training (each replica needs its own GPU)
Lifecycle                Manual — persists until deleted            Automatic — created/deleted with the Pod

Pattern 3: MIG Partitioning

NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG), which physically partitions a GPU into isolated instances. DRA can request specific MIG profiles:

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-mixed-workload
spec:
  spec:
    devices:
      requests:
      # A large slice for the primary model
      - name: large-slice
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].profile == "3g.40gb"
      # A small slice for a helper model
      - name: small-slice
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
      # CRITICAL: ensure both slices come from the same physical GPU
      constraints:
      - requests: ["large-slice", "small-slice"]
        matchAttribute: "gpu.nvidia.com/parentUUID"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
    spec:
      resourceClaims:
      - name: mig-devices
        resourceClaimTemplateName: mig-mixed-workload
      containers:
      # Main model gets the 3g.40gb slice
      - name: primary-model
        image: vllm/vllm-openai:latest
        command: ["vllm"]
        args: ["serve", "Qwen/Qwen2.5-7B-Instruct",
               "--port", "8000", "--max-model-len", "4096"]
        resources:
          claims:
          - name: mig-devices
            request: large-slice
      # Embedding model gets the 1g.10gb slice
      - name: embedding-model
        image: vllm/vllm-openai:latest
        command: ["vllm"]
        args: ["serve", "BAAI/bge-base-en-v1.5",
               "--port", "8001"]
        resources:
          claims:
          - name: mig-devices
            request: small-slice

This is impossible with the old device plugin API. You couldn’t express “give me a 3g.40gb and a 1g.10gb from the same GPU.” DRA makes it declarative.

Inspecting DRA State

# See what hardware the DRA driver has discovered
kubectl get resourceslices -o wide
# See all claims and their allocation status
kubectl get resourceclaims -A
# Inspect a specific claim's allocation details
kubectl get resourceclaim my-model-gpu-xxxxx -o yaml
# See which devices are allocated to a Pod
kubectl get pod my-model-xxxx -o jsonpath='{.status.containerStatuses[*].allocatedResourcesStatus}'
# List DeviceClasses available in the cluster
kubectl get deviceclasses

DRA is powerful but not complete

  • No fractional GPU memory allocation – at the Kubernetes level, you can request a MIG profile (which is a physical partition), but you can’t say “give me 10GB of a non-MIG GPU.” Workloads manage their own memory budgets (e.g., vLLM’s `--gpu-memory-utilization`).
  • No cross-node device claims – A ResourceClaim is satisfied by devices on a single node. Multi-node NVLink (MNNVL) is handled by NVIDIA’s ComputeDomain abstraction in their DRA driver, not by core Kubernetes.
  • CEL expressions can be complex – Filtering by memory, compute capability, interconnect, and MIG profile in a single expression requires fluency in CEL. Platform teams should encapsulate this complexity in DeviceClasses.
  • Ecosystem maturity varies – NVIDIA’s DRA driver is the most complete. AMD’s is in beta. Intel, Google TPU, and others are in various stages. Check your vendor’s support before migrating.
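On the CEL point, one mitigation is for the platform team to publish opinionated DeviceClasses so application teams never write CEL themselves. A sketch, assuming NVIDIA's attribute and capacity names (the class name `training-grade-gpu` is made up):

```yaml
# A platform-curated DeviceClass: "A100/H100 with at least 40Gi of VRAM".
# App teams just reference deviceClassName: training-grade-gpu in their claims.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: training-grade-gpu
spec:
  selectors:
  - cel:
      expression: >
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].productName.matches("A100|H100") &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Workload claims then stay one line (`deviceClassName: training-grade-gpu`), and the platform team can tighten or relax the expression in one place.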
