Beyond Integer GPUs: Mastering DRA for ML Workloads

Stop treating a $30K A100 like a boolean. Dynamic Resource Allocation (GA in Kubernetes 1.34) lets you claim GPUs by VRAM, compute capability, interconnect topology, and MIG profile — then share them safely across workloads. This article walks through every pattern with real manifests.

The Problem: GPUs Are Not Integers

For years, requesting a GPU in Kubernetes looked like this:

resources:
limits:
nvidia.com/gpu: "1" # That's it. That's the whole API.

This single integer tells the scheduler nothing about:

  • How much VRAM the workload actually needs (a 1.5B model needs ~3GB; a 70B model needs ~140GB)
  • Which GPU model is acceptable (your A100 and your T4 are not interchangeable)
  • Whether the GPU can be shared (a small inference workload using 4GB on an 80GB GPU wastes 95% of the device)
  • Interconnect requirements (distributed training needs NVLink)
  • MIG partitioning (an A100 can be sliced into 7 independent instances, but the old API can't express this)

The result: GPUs sit at 15% utilization while the scheduler reports “Insufficient nvidia.com/gpu.” Every GPU is either fully allocated or fully idle. There’s no middle ground.

Dynamic Resource Allocation changes this fundamentally.

How DRA Works: The Mental Model

Think of DRA like PersistentVolumes, but for hardware:

Storage                  DRA (new)
StorageClass             DeviceClass
PersistentVolumeClaim    ResourceClaim
PersistentVolume         ResourceSlice
CSI Driver               DRA Driver

The DRA driver (e.g., NVIDIA’s `k8s-dra-driver-gpu`) runs on each node and publishes ResourceSlices describing what hardware is available — not just “4 GPUs” but detailed attributes like model, VRAM, driver version, MIG profiles, PCIe topology, and NVLink connectivity.
Workloads create ResourceClaims that describe what they *need* using CEL expressions. The scheduler matches claims to slices, and the driver prepares the device for the Pod.

        ┌─────────────────────────────────────┐
        │        Kubernetes Scheduler         │
        │  Matches ResourceClaims to Slices   │
        └───────────┬─────────────┬───────────┘
                    │             │
      ┌─────────────┴──────┐   ┌──┴─────────────────┐
      │ ResourceSlice      │   │ ResourceSlice      │
      │ Node: gpu-01       │   │ Node: gpu-02       │
      │ 4x A100-80GB       │   │ 2x T4-16GB         │
      │ NVLink: yes        │   │ NVLink: no         │
      │ MIG: enabled       │   │ MIG: not supported │
      └────────────────────┘   └────────────────────┘
               ▲                         ▲
      DRA Driver publishes     DRA Driver publishes
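To make the ResourceSlice side concrete, here is a trimmed sketch of what a GPU DRA driver might publish. Field names follow `resource.k8s.io/v1`; the node, device, and attribute values are illustrative and the exact attribute set varies by driver version:

```yaml
# Illustrative ResourceSlice; inspect real ones with:
#   kubectl get resourceslices -o yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-01-gpu.nvidia.com   # hypothetical name
spec:
  nodeName: gpu-01
  driver: gpu.nvidia.com
  pool:
    name: gpu-01
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      productName:
        string: NVIDIA A100-SXM4-80GB
      driverVersion:
        version: "560.35.03"
    capacity:
      memory:
        value: 80Gi
```

Each device entry carries typed attributes (string, version, bool, int) and capacities, which is exactly what the CEL selectors below match against.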

You can install NVIDIA's DRA driver with Helm. Note `resources.gpus.enabled=true`: the examples below allocate GPUs, so the GPU side of the driver must be enabled:

helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --reuse-values \
  --set "resources.gpus.enabled=true" \
  --set "resources.computeDomains.enabled=true" \
  --set kubeletRootDirectory=/var/lib/kubelet

Pattern 1: Claiming a GPU by Attributes (Not Count)

Instead of “give me 1 GPU,” we say “give me a GPU with at least 40GB of VRAM”:


# DeviceClass – defines a category of GPUs
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  selectors:
  - cel:
      expression: >
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0

Now a workload claims from the class:
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu
spec:
  spec:   # yes, spec.spec – the outer spec wraps the claim template
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: high-memory-gpu
          # Additional filtering beyond the DeviceClass:
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].productName == "NVIDIA A100-SXM4-80GB"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-qwen
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: training-gpu
        resourceClaimTemplateName: training-gpu
      containers:
      - name: trainer
        image: vllm/vllm-openai:latest
        command: ["python", "-c"]
        args:
        - |
          import torch
          print(f"CUDA available: {torch.cuda.is_available()}")
          print(f"Device: {torch.cuda.get_device_name(0)}")
          print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        resources:
          claims:
          - name: training-gpu

What happens

  • Kubernetes creates a ResourceClaim from the template for this Pod
  • The scheduler reads all ResourceSlices in the cluster
  • It finds a node with an NVIDIA A100-SXM4-80GB whose capacity satisfies the >= 40Gi VRAM selector
  • It allocates that specific GPU to the claim
  • The DRA driver on that node prepares the GPU (sets up CDI, MIG if needed)
  • The Pod starts with exactly the GPU it asked for
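Once allocated, the claim's status records the decision. A trimmed sketch of what `kubectl get resourceclaim <name> -o yaml` might show (the pool, device, and Pod names here are illustrative):

```yaml
# Trimmed, illustrative status of an allocated ResourceClaim
status:
  allocation:
    devices:
      results:
      - request: gpu            # the request name from the template
        driver: gpu.nvidia.com
        pool: gpu-01
        device: gpu-0           # the specific physical GPU chosen
    nodeSelector:               # pins the Pod to the node that owns the device
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values: ["gpu-01"]
  reservedFor:                  # which Pods may use this allocation
  - resource: pods
    name: finetune-qwen-x7k2p   # illustrative generated Pod name
```

The `reservedFor` list is what makes sharing auditable: every consumer of the claim is recorded there.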

Pattern 2: GPU Sharing for Inference

A small inference model using 3GB on an 80GB GPU is an $8/hr waste. DRA lets multiple Pods share a device:

Create a NAMED ResourceClaim (not a template) — multiple Pods reference the same claim and share the same physical GPU.
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-inference-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: any-nvidia-gpu  # assumes a broad DeviceClass matching any NVIDIA GPU is defined (not shown)
---
# Pod 1: Qwen 1.5B inference (~3GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: qwen-inference
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-inference-gpu  # References the SAME claim
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command: ["vllm"]
    args: ["serve", "Qwen/Qwen2.5-1.5B-Instruct", "--port", "8000",
           "--max-model-len", "2048", "--gpu-memory-utilization", "0.3"]
    resources:
      claims:
      - name: shared-gpu
---
# Pod 2: Embedding model on the SAME GPU (~2GB VRAM)
apiVersion: v1
kind: Pod
metadata:
  name: embedding-service
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: shared-inference-gpu  # Same claim = same GPU
  containers:
  - name: embedder
    image: vllm/vllm-openai:latest
    command: ["vllm"]
    args: ["serve", "BAAI/bge-base-en-v1.5", "--port", "8001",
           "--gpu-memory-utilization", "0.2"]
    resources:
      claims:
      - name: shared-gpu


Both Pods land on the same node, same GPU. vLLM’s `--gpu-memory-utilization` flag limits how much VRAM each process takes. The DRA driver ensures both containers see the device.
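Nothing at the Kubernetes layer enforces those fractions, so it is worth sanity-checking the budget yourself. A minimal sketch in plain Python, assuming an 80 GB device and the two `--gpu-memory-utilization` values above:

```python
# Back-of-envelope VRAM budget for co-located vLLM processes.
# Assumes an 80 GB device; fractions mirror --gpu-memory-utilization above.
TOTAL_VRAM_GB = 80
budgets = {"qwen-inference": 0.3, "embedding-service": 0.2}

used = {name: frac * TOTAL_VRAM_GB for name, frac in budgets.items()}
total_used = sum(used.values())

for name, gb in used.items():
    print(f"{name}: {gb:.0f} GB reserved")
verdict = "fits" if total_used <= TOTAL_VRAM_GB else "OVERCOMMITTED"
print(f"total: {total_used:.0f} / {TOTAL_VRAM_GB} GB ({verdict})")
```

If the total exceeds the device, the second process will typically fail at startup with a CUDA out-of-memory error, because DRA itself does not partition the memory.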

When to use ResourceClaim vs ResourceClaimTemplate

                         ResourceClaim (named)                      ResourceClaimTemplate
Pods share the device    Yes — all Pods reference the same claim    No — each Pod gets its own claim
Use case                 Inference (multiple models on one GPU)     Training (each replica needs its own GPU)
Lifecycle                Manual — persists until deleted            Automatic — created/deleted with the Pod

Pattern 3: MIG Partitioning

NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG), which physically partitions a GPU into isolated instances. DRA can request specific MIG profiles:

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-mixed-workload
spec:
  spec:
    devices:
      requests:
      # A large slice for the primary model
      - name: large-slice
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].profile == "3g.40gb"
      # A small slice for a helper model
      - name: small-slice
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: >
                device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
      # CRITICAL: ensure both slices come from the same physical GPU
      constraints:
      - requests: ["large-slice", "small-slice"]
        matchAttribute: "gpu.nvidia.com/parentUUID"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
    spec:
      resourceClaims:
      - name: mig-devices
        resourceClaimTemplateName: mig-mixed-workload
      containers:
      # Main model gets the 3g.40gb slice
      - name: primary-model
        image: vllm/vllm-openai:latest
        command: ["vllm"]
        args: ["serve", "Qwen/Qwen2.5-7B-Instruct",
               "--port", "8000", "--max-model-len", "4096"]
        resources:
          claims:
          - name: mig-devices
            request: large-slice
      # Embedding model gets the 1g.10gb slice
      - name: embedding-model
        image: vllm/vllm-openai:latest
        command: ["vllm"]
        args: ["serve", "BAAI/bge-base-en-v1.5",
               "--port", "8001"]
        resources:
          claims:
          - name: mig-devices
            request: small-slice

This is impossible with the old device plugin API. You couldn’t express “give me a 3g.40gb and a 1g.10gb from the same GPU.” DRA makes it declarative.

Inspecting DRA State

# See what hardware the DRA driver has discovered
kubectl get resourceslices -o wide
# See all claims and their allocation status
kubectl get resourceclaims -A
# Inspect a specific claim's allocation details
kubectl get resourceclaim my-model-gpu-xxxxx -o yaml
# See which devices are allocated to a Pod
kubectl get pod my-model-xxxx -o jsonpath='{.status.containerStatuses[*].allocatedResourcesStatus}'
# List DeviceClasses available in the cluster
kubectl get deviceclasses

DRA is powerful but not complete

  • No fractional GPU memory allocation – at the Kubernetes level, you can request a MIG profile (which is a physical partition), but you can’t say “give me 10GB of a non-MIG GPU.” Workloads manage their own memory budgets (e.g., vLLM’s `--gpu-memory-utilization`).
  • No cross-node device claims – A ResourceClaim is satisfied by devices on a single node. Multi-node NVLink (MNNVL) is handled by NVIDIA’s ComputeDomain abstraction in their DRA driver, not by core Kubernetes.
  • CEL expressions can be complex – Filtering by memory, compute capability, interconnect, and MIG profile in a single expression requires fluency in CEL. Platform teams should encapsulate this complexity in DeviceClasses.
  • Ecosystem maturity varies – NVIDIA’s DRA driver is the most complete. AMD’s is in beta. Intel, Google TPU, and others are in various stages. Check your vendor’s support before migrating.
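On the CEL point, one mitigation is for the platform team to publish opinionated DeviceClasses so application teams never write CEL themselves. A sketch, assuming NVIDIA's attribute and capacity names (the class name `training-grade-gpu` is made up):

```yaml
# A platform-curated DeviceClass: "A100/H100 with at least 40Gi of VRAM".
# App teams just reference deviceClassName: training-grade-gpu in their claims.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: training-grade-gpu
spec:
  selectors:
  - cel:
      expression: >
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].productName.matches("A100|H100") &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

Workload claims then stay one line (`deviceClassName: training-grade-gpu`), and the platform team can tighten or relax the expression in one place.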
