diff --git a/content/en/docs/v1/virtualization/gpu.md b/content/en/docs/v1/virtualization/gpu.md
index 36ef3542..404538d1 100644
--- a/content/en/docs/v1/virtualization/gpu.md
+++ b/content/en/docs/v1/virtualization/gpu.md
@@ -198,25 +198,226 @@ We are now ready to create a VM.
 Kernel modules: nvidiafb, nvidia_drm, nvidia
 ```
 
-## GPU Sharing for Virtual Machines
+## GPU Sharing for Virtual Machines (vGPU)
 
-GPU passthrough assigns an entire physical GPU to a single VM. To share one GPU between multiple VMs, you need **NVIDIA vGPU**.
+GPU passthrough assigns an entire physical GPU to a single VM. To share one GPU between multiple VMs, you can use **NVIDIA vGPU**, which creates virtual GPUs from a single physical GPU using mediated devices (mdev).
 
-### vGPU (Virtual GPU)
+{{% alert color="info" %}}
+**Why not MIG?** MIG (Multi-Instance GPU) partitions a GPU into isolated instances, but these are logical divisions within a single PCIe device. VFIO cannot pass them to VMs — MIG only works with containers. To use MIG with VMs, you need vGPU on top of MIG partitions (still requires a vGPU license).
+{{% /alert %}}
+
+### Prerequisites
+
+- A GPU that supports vGPU (e.g., NVIDIA L40S, A100, A30, A16)
+- An NVIDIA vGPU Software license (NVIDIA AI Enterprise or vGPU subscription)
+- Access to the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) to download the vGPU Manager driver
+
+{{% alert color="warning" %}}
+The vGPU Manager driver is proprietary software distributed by NVIDIA under a commercial license. Cozystack does not include or redistribute this driver. You must obtain it directly from NVIDIA and build the container image yourself.
+{{% /alert %}}
+
+### 1. Build the vGPU Manager Image
+
+The GPU Operator expects a pre-built driver container image — it does not install the driver from a raw `.run` file at runtime.
+
+1. Download the vGPU Manager driver from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com) (Software Downloads → NVIDIA AI Enterprise → Linux KVM)
+2. Build the driver container image using NVIDIA's Makefile-based build system:
+
+```bash
+# Clone the NVIDIA driver container repository
+git clone https://gitlab.com/nvidia/container-images/driver.git
+cd driver
+
+# Place the downloaded .run file in the appropriate directory
+cp NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run vgpu/
-NVIDIA vGPU uses mediated devices (mdev) to create virtual GPUs assignable to VMs. This is the only production-ready solution for GPU sharing between VMs.
+
+# Build using the provided Makefile
+make OS_TAG=ubuntu22.04 \
+     VGPU_DRIVER_VERSION=550.90.05 \
+     PRIVATE_REGISTRY=registry.example.com/nvidia
-
-**Requirements:**
-- NVIDIA vGPU license (commercial, purchased from NVIDIA)
-- NVIDIA vGPU Manager installed on host nodes
+
+# Push to your private registry
+docker push registry.example.com/nvidia/vgpu-manager:550.90.05
+```
 
 {{% alert color="info" %}}
-**Why not MIG?** MIG (Multi-Instance GPU) partitions a GPU into isolated instances, but these are logical divisions within a single PCIe device. VFIO cannot pass them to VMs — MIG only works with containers. To use MIG with VMs, you need vGPU on top of MIG partitions (still requires a license).
+The build process compiles kernel modules against the host kernel version. Refer to the [NVIDIA GPU Operator vGPU documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html) for the complete build procedure and supported OS/kernel combinations.
+{{% /alert %}}
+
+{{% alert color="warning" %}}
+Uploading the vGPU driver to a publicly available registry is a violation of the NVIDIA vGPU EULA. Always use a private registry.
 {{% /alert %}}
 
+### 2. Install the GPU Operator with vGPU Variant
+
+The GPU Operator provides a `vgpu` variant that enables the vGPU Manager and vGPU Device Manager instead of the VFIO Manager used in passthrough mode.
+
+1. Label the worker node for vGPU workloads:
+
+   ```bash
+   kubectl label node <node-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
+   ```
+
+2. Create the GPU Operator Package with the `vgpu` variant, providing your vGPU Manager image coordinates:
+
+   ```yaml
+   apiVersion: cozystack.io/v1alpha1
+   kind: Package
+   metadata:
+     name: cozystack.gpu-operator
+   spec:
+     variant: vgpu
+     components:
+       gpu-operator:
+         values:
+           gpu-operator:
+             vgpuManager:
+               repository: registry.example.com/nvidia
+               version: "550.90.05"
+   ```
+
+   If your registry requires authentication, create an `imagePullSecret` in the `cozy-gpu-operator` namespace first, then reference it:
+
+   ```yaml
+   gpu-operator:
+     vgpuManager:
+       repository: registry.example.com/nvidia
+       version: "550.90.05"
+       imagePullSecrets:
+         - name: nvidia-registry-secret
+   ```
+
+3. Verify all pods are running:
+
+   ```bash
+   kubectl get pods -n cozy-gpu-operator
+   ```
+
+   Example output:
+
+   ```console
+   NAME                                  READY   STATUS    RESTARTS   AGE
+   ...
+   nvidia-vgpu-manager-daemonset-xxxxx   1/1     Running   0          60s
+   nvidia-vgpu-device-manager-xxxxx      1/1     Running   0          45s
+   nvidia-sandbox-validator-xxxxx        1/1     Running   0          30s
+   ```
+
+### 3. Configure NVIDIA License Server (NLS)
+
+vGPU requires a license to operate.
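An unlicensed vGPU runs with reduced performance, so after completing the steps below it is worth confirming the license state from inside the guest. The helper below is an illustrative sketch, not part of Cozystack or the NVIDIA tooling: it reads `nvidia-smi -q` output on stdin and inspects the `License Status` field.

```shell
# check_vgpu_license: report whether the guest driver has acquired a vGPU
# license, based on `nvidia-smi -q` output supplied on stdin.
# Inside a running VM:  nvidia-smi -q | check_vgpu_license
check_vgpu_license() {
  if grep 'License Status' | grep -q 'Licensed'; then
    echo "vGPU licensed"
  else
    echo "vGPU NOT licensed (check gridd.conf and NLS reachability)"
  fi
}
```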
+Create a Secret with the NLS client configuration:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: licensing-config
+  namespace: cozy-gpu-operator
+stringData:
+  gridd.conf: |
+    # FeatureType: 1 = vGPU, 2 = RTX Virtual Workstation (vWS), 4 = Virtual Compute Server (vCS)
+    # ServerPort depends on your NLS deployment (commonly 443 for DLS, 7070 for the legacy license server)
+    ServerAddress=nls.example.com
+    ServerPort=443
+    FeatureType=1
+```
+
+Then reference the Secret in the Package values:
+
+```yaml
+gpu-operator:
+  vgpuManager:
+    repository: registry.example.com/nvidia
+    version: "550.90.05"
+  driver:
+    licensingConfig:
+      secretName: licensing-config
+```
+
+### 4. Update the KubeVirt Custom Resource
+
+Configure KubeVirt to permit mediated devices. The `mediatedDeviceTypes` field specifies which vGPU profiles to use, and `permittedHostDevices` makes them available to VMs:
+
+```bash
+kubectl edit kubevirt -n cozy-kubevirt
+```
+
+```yaml
+spec:
+  configuration:
+    mediatedDevicesConfiguration:
+      mediatedDeviceTypes:
+        - nvidia-592  # Example: NVIDIA L40S-24Q
+    permittedHostDevices:
+      mediatedDevices:
+        - mdevNameSelector: NVIDIA L40S-24Q
+          resourceName: nvidia.com/NVIDIA_L40S-24Q
+```
+
+To find the correct type ID and profile name for your GPU, consult the [NVIDIA vGPU User Guide](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/).
+
+### 5. Create a Virtual Machine with vGPU
+
+```yaml
+apiVersion: apps.cozystack.io/v1alpha1
+appVersion: '*'
+kind: VirtualMachine
+metadata:
+  name: gpu-vgpu
+  namespace: tenant-example
+spec:
+  running: true
+  instanceProfile: ubuntu
+  instanceType: u1.medium
+  systemDisk:
+    image: ubuntu
+    storage: 5Gi
+    storageClass: replicated
+  gpus:
+    - name: nvidia.com/NVIDIA_L40S-24Q
+  cloudInit: |
+    #cloud-config
+    password: ubuntu
+    chpasswd: { expire: False }
+```
+
+```bash
+kubectl apply -f vmi-vgpu.yaml
+```
+
+Once the VM is running, log in and verify the vGPU is available:
+
+```bash
+virtctl console virtual-machine-gpu-vgpu
+```
+
+```console
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: 12.4     |
+|                                                                                         |
+| GPU  Name                              ...                                     MIG M.   |
+|   0  NVIDIA L40S-24Q                   ...                                     N/A      |
++-----------------------------------------------------------------------------------------+
+```
+
+### vGPU Profiles
+
+Each GPU model supports specific vGPU profiles that determine how the GPU is partitioned. Common profiles for NVIDIA L40S:
+
+| Profile         | Frame Buffer | Max Instances | Use Case            |
+| --------------- | ------------ | ------------- | ------------------- |
+| NVIDIA L40S-1Q  | 1 GB         | 32            | Light 3D / VDI      |
+| NVIDIA L40S-2Q  | 2 GB         | 24            | Medium 3D / VDI     |
+| NVIDIA L40S-4Q  | 4 GB         | 12            | Heavy 3D / VDI      |
+| NVIDIA L40S-6Q  | 6 GB         | 8             | Professional 3D     |
+| NVIDIA L40S-8Q  | 8 GB         | 6             | AI/ML inference     |
+| NVIDIA L40S-12Q | 12 GB        | 4             | AI/ML training      |
+| NVIDIA L40S-24Q | 24 GB        | 2             | Large AI workloads  |
+| NVIDIA L40S-48Q | 48 GB        | 1             | Full GPU equivalent |
+
 ### Open-Source vGPU (Experimental)
 
-NVIDIA is developing open-source vGPU support for the Linux kernel. Once merged, this could enable GPU sharing without a license.
+NVIDIA is developing open-source vGPU support for the Linux kernel. Once merged, this could enable GPU sharing without a commercial license.
 - Status: RFC stage, not merged into mainline kernel
 - Supports Ada Lovelace and newer (L4, L40, etc.)
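The mdev type IDs used in the KubeVirt configuration above (for example `nvidia-592` for the `NVIDIA L40S-24Q` profile) can also be read directly from sysfs on a host where the vGPU Manager is running. The snippet below is a sketch assuming the standard `/sys/class/mdev_bus` layout; the sysfs root is a parameter only so the function can be exercised without GPU hardware.

```shell
# list_mdev_types: print each mediated-device type the host exposes, with its
# human-readable vGPU profile name and the number of free instances.
# Usage: list_mdev_types [sysfs-root]   (default: /sys/class/mdev_bus)
list_mdev_types() {
  root="${1:-/sys/class/mdev_bus}"
  for type_dir in "$root"/*/mdev_supported_types/*/; do
    [ -d "$type_dir" ] || continue  # skip when no mdev-capable devices exist
    printf '%s: %s (available: %s)\n' \
      "$(basename "$type_dir")" \
      "$(cat "${type_dir}name")" \
      "$(cat "${type_dir}available_instances")"
  done
}

# On a vGPU-enabled host:
#   list_mdev_types
```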