From 9b6eca3ce16ca620742f6bfd52a439d7e20ae6cd Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 2 Apr 2026 15:19:29 +0300 Subject: [PATCH 1/3] docs(gpu): add vGPU setup guide for GPU sharing between VMs Add practical instructions for deploying GPU Operator with vGPU variant: - Building proprietary vGPU Manager container image - Deploying with vgpu variant via Package CR - NLS license server configuration - KubeVirt mediatedDeviceTypes setup - vGPU profile reference table for L40S - VM creation example with vGPU resource Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/v1/virtualization/gpu.md | 208 +++++++++++++++++++++-- 1 file changed, 198 insertions(+), 10 deletions(-) diff --git a/content/en/docs/v1/virtualization/gpu.md b/content/en/docs/v1/virtualization/gpu.md index 36ef3542..676e5b49 100644 --- a/content/en/docs/v1/virtualization/gpu.md +++ b/content/en/docs/v1/virtualization/gpu.md @@ -198,25 +198,213 @@ We are now ready to create a VM. Kernel modules: nvidiafb, nvidia_drm, nvidia ``` -## GPU Sharing for Virtual Machines +## GPU Sharing for Virtual Machines (vGPU) -GPU passthrough assigns an entire physical GPU to a single VM. To share one GPU between multiple VMs, you need **NVIDIA vGPU**. +GPU passthrough assigns an entire physical GPU to a single VM. To share one GPU between multiple VMs, you can use **NVIDIA vGPU**, which creates virtual GPUs from a single physical GPU using mediated devices (mdev). -### vGPU (Virtual GPU) +{{% alert color="info" %}} +**Why not MIG?** MIG (Multi-Instance GPU) partitions a GPU into isolated instances, but these are logical divisions within a single PCIe device. VFIO cannot pass them to VMs — MIG only works with containers. To use MIG with VMs, you need vGPU on top of MIG partitions (still requires a vGPU license). +{{% /alert %}} -NVIDIA vGPU uses mediated devices (mdev) to create virtual GPUs assignable to VMs. This is the only production-ready solution for GPU sharing between VMs. 
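+
+With the vGPU Manager driver loaded, each physical GPU advertises the vGPU types it can create through the standard mediated-device (mdev) interface in sysfs. As a quick sanity check on a GPU node (a sketch, not part of the setup flow: the PCI address below is a placeholder, substitute your card's address from `lspci`; `nvidia-592` is one example type ID):
+
+```bash
+# List the vGPU type IDs this GPU can expose
+ls /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types
+# Show the human-readable profile name and remaining capacity for one type
+cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-592/name
+cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-592/available_instances
+```
+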
+### Prerequisites -**Requirements:** -- NVIDIA vGPU license (commercial, purchased from NVIDIA) -- NVIDIA vGPU Manager installed on host nodes +- A GPU that supports vGPU (e.g., NVIDIA L40S, A100, A30, A16) +- An NVIDIA vGPU Software license (NVIDIA AI Enterprise or vGPU subscription) +- Access to the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) to download the vGPU Manager driver -{{% alert color="info" %}} -**Why not MIG?** MIG (Multi-Instance GPU) partitions a GPU into isolated instances, but these are logical divisions within a single PCIe device. VFIO cannot pass them to VMs — MIG only works with containers. To use MIG with VMs, you need vGPU on top of MIG partitions (still requires a license). +{{% alert color="warning" %}} +The vGPU Manager driver is proprietary software distributed by NVIDIA under a commercial license. Cozystack does not include or redistribute this driver. You must obtain it directly from NVIDIA and build the container image yourself. {{% /alert %}} +### 1. Build the vGPU Manager Image + +Download the vGPU Manager driver from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) and build a container image: + +```bash +# Example Containerfile +FROM ubuntu:22.04 +ARG DRIVER_VERSION +COPY NVIDIA-Linux-x86_64-${DRIVER_VERSION}-vgpu-kvm.run /opt/ +RUN chmod +x /opt/NVIDIA-Linux-x86_64-${DRIVER_VERSION}-vgpu-kvm.run +``` + +```bash +docker build --build-arg DRIVER_VERSION=550.90.05 \ + --tag registry.example.com/nvidia/vgpu-manager:550.90.05 . +docker push registry.example.com/nvidia/vgpu-manager:550.90.05 +``` + +Refer to the [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html) for detailed instructions on building the vGPU Manager image. + +### 2. 
Install the GPU Operator with vGPU Variant + +The GPU Operator provides a `vgpu` variant that enables the vGPU Manager and vGPU Device Manager instead of the VFIO Manager used in passthrough mode. + +1. Label the worker node for vGPU workloads: + + ```bash + kubectl label node --overwrite nvidia.com/gpu.workload.config=vm-vgpu + ``` + +2. Create the GPU Operator Package with the `vgpu` variant, providing your vGPU Manager image coordinates: + + ```yaml + apiVersion: cozystack.io/v1alpha1 + kind: Package + metadata: + name: cozystack.gpu-operator + spec: + variant: vgpu + components: + gpu-operator: + values: + gpu-operator: + vgpuManager: + repository: registry.example.com/nvidia + version: "550.90.05" + ``` + + If your registry requires authentication, create an `imagePullSecret` in the `cozy-gpu-operator` namespace first, then reference it: + + ```yaml + gpu-operator: + vgpuManager: + repository: registry.example.com/nvidia + version: "550.90.05" + imagePullSecrets: + - name: nvidia-registry-secret + ``` + +3. Verify all pods are running: + + ```bash + kubectl get pods -n cozy-gpu-operator + ``` + + Example output: + + ```console + NAME READY STATUS RESTARTS AGE + ... + nvidia-vgpu-manager-daemonset-xxxxx 1/1 Running 0 60s + nvidia-vgpu-device-manager-xxxxx 1/1 Running 0 45s + nvidia-sandbox-validator-xxxxx 1/1 Running 0 30s + ``` + +### 3. Configure NVIDIA License Server (NLS) + +vGPU requires a license to operate. Create a ConfigMap with the NLS client configuration: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: licensing-config + namespace: cozy-gpu-operator +data: + gridd.conf: | + ServerAddress=nls.example.com + ServerPort=443 + FeatureType=1 +``` + +Then reference it in the Package values: + +```yaml +gpu-operator: + vgpuManager: + repository: registry.example.com/nvidia + version: "550.90.05" + driver: + licensingConfig: + configMapName: licensing-config +``` + +### 4. 
Update the KubeVirt Custom Resource + +Configure KubeVirt to permit mediated devices. The `mediatedDeviceTypes` field specifies which vGPU profiles to use, and `permittedHostDevices` makes them available to VMs: + +```bash +kubectl edit kubevirt -n cozy-kubevirt +``` + +```yaml +spec: + configuration: + mediatedDevicesConfiguration: + mediatedDeviceTypes: + - nvidia-592 # Example: NVIDIA L40S-24Q + permittedHostDevices: + mediatedDevices: + - mdevNameSelector: NVIDIA L40S-24Q + resourceName: nvidia.com/NVIDIA_L40S-24Q +``` + +To find the correct type ID and profile name for your GPU, consult the [NVIDIA vGPU User Guide](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/). + +### 5. Create a Virtual Machine with vGPU + +```yaml +apiVersion: apps.cozystack.io/v1alpha1 +appVersion: '*' +kind: VirtualMachine +metadata: + name: gpu-vgpu + namespace: tenant-example +spec: + running: true + instanceProfile: ubuntu + instanceType: u1.medium + systemDisk: + image: ubuntu + storage: 5Gi + storageClass: replicated + gpus: + - name: nvidia.com/NVIDIA_L40S-24Q + cloudInit: | + #cloud-config + password: ubuntu + chpasswd: { expire: False } +``` + +```bash +kubectl apply -f vmi-vgpu.yaml +``` + +Once the VM is running, log in and verify the vGPU is available: + +```bash +virtctl console virtual-machine-gpu-vgpu +``` + +```console +ubuntu@gpu-vgpu:~$ nvidia-smi ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 550.90.05 Driver Version: 550.90.05 CUDA Version: 12.4 | +| | +| GPU Name ... MIG M. | +| 0 NVIDIA L40S-24Q ... N/A | ++-----------------------------------------------------------------------------------------+ +``` + +### vGPU Profiles + +Each GPU model supports specific vGPU profiles that determine how the GPU is partitioned. 
Common profiles for NVIDIA L40S: + +| Profile | Frame Buffer | Max Instances | Use Case | +| --- | --- | --- | --- | +| NVIDIA L40S-1Q | 1 GB | 48 | Light 3D / VDI | +| NVIDIA L40S-2Q | 2 GB | 24 | Medium 3D / VDI | +| NVIDIA L40S-4Q | 4 GB | 12 | Heavy 3D / VDI | +| NVIDIA L40S-6Q | 6 GB | 8 | Professional 3D | +| NVIDIA L40S-8Q | 8 GB | 6 | AI/ML inference | +| NVIDIA L40S-12Q | 12 GB | 4 | AI/ML training | +| NVIDIA L40S-24Q | 24 GB | 2 | Large AI workloads | +| NVIDIA L40S-48Q | 48 GB | 1 | Full GPU equivalent | + ### Open-Source vGPU (Experimental) -NVIDIA is developing open-source vGPU support for the Linux kernel. Once merged, this could enable GPU sharing without a license. +NVIDIA is developing open-source vGPU support for the Linux kernel. Once merged, this could enable GPU sharing without a commercial license. - Status: RFC stage, not merged into mainline kernel - Supports Ada Lovelace and newer (L4, L40, etc.) From 468dd7b8381b320bca8d9838adae8c2a644a0aba Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 2 Apr 2026 15:48:48 +0300 Subject: [PATCH 2/3] docs(gpu): fix vGPU driver container build instructions and NLS config Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file. Add EULA warning about public redistribution of vGPU driver images. Add note about NLS ServerPort being deployment-dependent. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/v1/virtualization/gpu.md | 35 ++++++++++++++++-------- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/content/en/docs/v1/virtualization/gpu.md b/content/en/docs/v1/virtualization/gpu.md index 676e5b49..b983b905 100644 --- a/content/en/docs/v1/virtualization/gpu.md +++ b/content/en/docs/v1/virtualization/gpu.md @@ -218,23 +218,35 @@ The vGPU Manager driver is proprietary software distributed by NVIDIA under a co ### 1. 
Build the vGPU Manager Image -Download the vGPU Manager driver from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) and build a container image: +The GPU Operator expects a pre-built driver container image — it does not install the driver from a raw `.run` file at runtime. -```bash -# Example Containerfile -FROM ubuntu:22.04 -ARG DRIVER_VERSION -COPY NVIDIA-Linux-x86_64-${DRIVER_VERSION}-vgpu-kvm.run /opt/ -RUN chmod +x /opt/NVIDIA-Linux-x86_64-${DRIVER_VERSION}-vgpu-kvm.run -``` +1. Download the vGPU Manager driver from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) (Software Downloads → NVIDIA AI Enterprise → Linux KVM) +2. Build the driver container image using NVIDIA's Makefile-based build system: ```bash -docker build --build-arg DRIVER_VERSION=550.90.05 \ - --tag registry.example.com/nvidia/vgpu-manager:550.90.05 . +# Clone the NVIDIA driver container repository +git clone https://gitlab.com/nvidia/container-images/driver.git +cd driver + +# Place the downloaded .run file in the appropriate directory +cp NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run vgpu/ + +# Build using the provided Makefile +make OS_TAG=ubuntu22.04 \ + VGPU_DRIVER_VERSION=550.90.05 \ + PRIVATE_REGISTRY=registry.example.com/nvidia + +# Push to your private registry docker push registry.example.com/nvidia/vgpu-manager:550.90.05 ``` -Refer to the [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html) for detailed instructions on building the vGPU Manager image. +{{% alert color="info" %}} +The build process compiles kernel modules against the host kernel version. Refer to the [NVIDIA GPU Operator vGPU documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html) for the complete build procedure and supported OS/kernel combinations. 
+{{% /alert %}} + +{{% alert color="warning" %}} +Uploading the vGPU driver to a publicly available registry is a violation of the NVIDIA vGPU EULA. Always use a private registry. +{{% /alert %}} ### 2. Install the GPU Operator with vGPU Variant @@ -306,6 +318,7 @@ data: ServerAddress=nls.example.com ServerPort=443 FeatureType=1 + # ServerPort depends on your NLS deployment (commonly 443 for DLS or 7070 for legacy NLS) ``` Then reference it in the Package values: From 492f318894bbdaca8e5305e3a17d10cfa5b52713 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 2 Apr 2026 15:57:14 +0300 Subject: [PATCH 3/3] fix(gpu): use Secret for licensing config, fix console hostname - Switch licensing config from ConfigMap to Secret (configMapName deprecated) - Add FeatureType comment explaining values (1=vGPU, 2=vCS) - Fix console hostname to match Cozystack naming convention (virtual-machine- prefix) Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/v1/virtualization/gpu.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/content/en/docs/v1/virtualization/gpu.md b/content/en/docs/v1/virtualization/gpu.md index b983b905..404538d1 100644 --- a/content/en/docs/v1/virtualization/gpu.md +++ b/content/en/docs/v1/virtualization/gpu.md @@ -305,23 +305,23 @@ The GPU Operator provides a `vgpu` variant that enables the vGPU Manager and vGP ### 3. Configure NVIDIA License Server (NLS) -vGPU requires a license to operate. Create a ConfigMap with the NLS client configuration: +vGPU requires a license to operate. 
Create a Secret with the NLS client configuration:
 
 ```yaml
 apiVersion: v1
-kind: ConfigMap
+kind: Secret
 metadata:
   name: licensing-config
   namespace: cozy-gpu-operator
-data:
+stringData:
   gridd.conf: |
     ServerAddress=nls.example.com
     ServerPort=443
-    FeatureType=1
+    # FeatureType: 1 for vGPU (vPC/vWS), 2 for Virtual Compute Server (vCS)
+    FeatureType=1
     # ServerPort depends on your NLS deployment (commonly 443 for DLS or 7070 for legacy NLS)
 ```
 
-Then reference it in the Package values:
+Then reference the Secret in the Package values:
 
 ```yaml
 gpu-operator:
@@ -330,7 +330,7 @@ gpu-operator:
     version: "550.90.05"
     driver:
       licensingConfig:
-        configMapName: licensing-config
+        secretName: licensing-config
 ```
 
 ### 4. Update the KubeVirt Custom Resource
@@ -391,7 +391,7 @@ virtctl console virtual-machine-gpu-vgpu
 ```
 
 ```console
-ubuntu@gpu-vgpu:~$ nvidia-smi
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: 12.4     |
 |                                                                                         |