[docs] Add vGPU setup guide for GPU sharing between VMs#467
Conversation
Add practical instructions for deploying GPU Operator with vGPU variant:

- Building proprietary vGPU Manager container image
- Deploying with vgpu variant via Package CR
- NLS license server configuration
- KubeVirt mediatedDeviceTypes setup
- vGPU profile reference table for L40S
- VM creation example with vGPU resource

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
📝 Walkthrough

Documentation replaces the previous GPU-sharing overview with a focused vGPU (mediated device) guide covering prerequisites, NVIDIA vGPU licensing, the GPU Operator `vgpu` variant, the vGPU Manager image build, NVIDIA License Server wiring, KubeVirt mediated device config, VM examples, and vGPU profile details.
Code Review
This pull request provides comprehensive documentation for configuring NVIDIA vGPU sharing for virtual machines, including prerequisites, image building, operator installation, and licensing setup. The review feedback suggests clarifying the FeatureType parameter in the licensing configuration and updating the example command prompt to maintain consistency with the platform's virtual machine naming conventions.
```yaml
gridd.conf: |
  ServerAddress=nls.example.com
  ServerPort=443
  FeatureType=1
```
It is helpful to clarify what the `FeatureType` value represents to assist users in customizing their configuration. In the NVIDIA GRID configuration, `1` corresponds to the "NVIDIA vGPU" (vPC/vWS) feature, while `2` is for "NVIDIA Virtual Compute Server" (vCS).
```diff
-FeatureType=1
+FeatureType=1 # 1 for vGPU
```
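Putting the pieces of this thread together, a fuller `gridd.conf` sketch might look as follows. The server address is a placeholder, the port is deployment-dependent (as a later commit in this PR notes), and the `FeatureType` comment mirrors the review note above rather than an exhaustive list of values:

```ini
; NVIDIA gridd client licensing configuration (illustrative values only).
; ServerPort is deployment-dependent; confirm it for your NLS setup.
ServerAddress=nls.example.com
ServerPort=443
; FeatureType: 1 = NVIDIA vGPU (vPC/vWS), 2 = NVIDIA Virtual Compute Server (vCS)
FeatureType=1
```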
```console
ubuntu@gpu-vgpu:~$ nvidia-smi
```
For consistency with the GPU passthrough example (line 194) and Cozystack's default naming convention for virtual machine instances, the hostname in the command prompt should include the `virtual-machine-` prefix. Since the VM name is defined as `gpu-vgpu` on line 352, the resulting instance name is `virtual-machine-gpu-vgpu`.
```diff
-ubuntu@gpu-vgpu:~$ nvidia-smi
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
```
Replace simplified Containerfile with NVIDIA's Makefile-based build system from gitlab.com/nvidia/container-images/driver. The GPU Operator expects pre-compiled kernel modules, not a raw .run file.

- Add EULA warning about public redistribution of vGPU driver images.
- Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
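As a rough illustration of the flow this commit describes — not verified commands; the directory layout, file names, and registry are hypothetical, and the authoritative steps live in the README of gitlab.com/nvidia/container-images/driver. The vGPU Manager `.run` installer comes from the NVIDIA licensing portal and, per the EULA warning added here, the resulting image must only be pushed to a private registry:

```shell
# Sketch only: fetch NVIDIA's driver container build tree
git clone https://gitlab.com/nvidia/container-images/driver
cd driver

# Place the proprietary vGPU Manager installer where the build expects it
# (file name and location are illustrative)
cp ~/Downloads/NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run vgpu-manager/

# Build and push to a private registry -- never a public one (EULA)
docker build -t registry.example.com/nvidia/vgpu-manager:550.90.05 vgpu-manager/
docker push registry.example.com/nvidia/vgpu-manager:550.90.05
```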
🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)
360-385: Name the manifest file explicitly before the apply command.
`kubectl apply -f vmi-vgpu.yaml` appears without first labeling the YAML block as `vmi-vgpu.yaml` (unlike the earlier passthrough example). Adding a filename label right above the manifest would remove ambiguity for copy/paste users.

✏️ Suggested doc tweak

````diff
 ### 5. Create a Virtual Machine with vGPU

+**vmi-vgpu.yaml**:
+
 ```yaml
 apiVersion: apps.cozystack.io/v1alpha1
 appVersion: '*'
 kind: VirtualMachine
 ...
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@content/en/docs/v1/virtualization/gpu.md` around lines 360 - 385: Add an explicit filename label above the YAML manifest block so users know the file name to save before running kubectl; specifically, annotate the VirtualMachine manifest block with "vmi-vgpu.yaml" (the same name used in the `kubectl apply -f vmi-vgpu.yaml` command) by placing the filename line immediately before the ```yaml fence, ensuring consistency between the manifest and the kubectl apply invocation.
📒 Files selected for processing (1)
content/en/docs/v1/virtualization/gpu.md
- Switch licensing config from ConfigMap to Secret (configMapName deprecated)
- Add FeatureType comment explaining values (1=vGPU, 2=vCS)
- Fix console hostname to match Cozystack naming convention (virtual-machine- prefix)

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
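To illustrate the ConfigMap-to-Secret switch, a minimal sketch of the licensing Secret. The resource name, namespace, and the assumption that the chart consumes a `gridd.conf` key are illustrative and should be checked against the GPU Operator values this PR actually documents:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config        # hypothetical name, referenced from the gpu-operator licensing settings
  namespace: cozy-gpu-operator  # namespace is deployment-dependent
type: Opaque
stringData:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1  # 1 for vGPU
```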
Actionable comments posted: 1
🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)
279-288: Make the `imagePullSecrets` snippet fully qualified to prevent misplacement. At line 279, the snippet is context-trimmed and can be pasted under the wrong key. Please show the full values path to avoid broken Package configuration.
Proposed doc patch
````diff
-```yaml
-gpu-operator:
-  vgpuManager:
-    repository: registry.example.com/nvidia
-    version: "550.90.05"
-    imagePullSecrets:
-      - name: nvidia-registry-secret
-```
+```yaml
+components:
+  gpu-operator:
+    values:
+      gpu-operator:
+        vgpuManager:
+          repository: registry.example.com/nvidia
+          version: "550.90.05"
+          imagePullSecrets:
+            - name: nvidia-registry-secret
+```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@content/en/docs/v1/virtualization/gpu.md` around lines 279 - 288, The snippet for imagePullSecrets is too context-trimmed and can be pasted under the wrong key; update the example so it shows the full values path (wrap the existing gpu-operator.vgpuManager block under components -> gpu-operator -> values -> gpu-operator -> vgpuManager) so users see the complete hierarchy and the imagePullSecrets entry (refer to symbols: components, gpu-operator, values, gpu-operator, vgpuManager, imagePullSecrets) and replace the trimmed snippet with this fully-qualified version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@content/en/docs/v1/virtualization/gpu.md`:
- Around line 358-391: Add an explicit readiness check for the
VirtualMachineInstance before calling "virtctl console": after applying
vmi-vgpu.yaml (kubectl apply -f vmi-vgpu.yaml) add a step that waits for the VMI
to become Ready (e.g., using "kubectl get vmi -n tenant-example -w" or "kubectl
wait --for=condition=Ready vmi/gpu-vgpu -n tenant-example") so the subsequent
"virtctl console virtual-machine-gpu-vgpu" call won't fail intermittently.
---
Nitpick comments:
In `@content/en/docs/v1/virtualization/gpu.md`:
- Around line 279-288: The snippet for imagePullSecrets is too context-trimmed
and can be pasted under the wrong key; update the example so it shows the full
values path (wrap the existing gpu-operator.vgpuManager block under components
-> gpu-operator -> values -> gpu-operator -> vgpuManager) so users see the
complete hierarchy and the imagePullSecrets entry (refer to symbols: components,
gpu-operator, values, gpu-operator, vgpuManager, imagePullSecrets) and replace
the trimmed snippet with this fully-qualified version.
### 5. Create a Virtual Machine with vGPU

```yaml
apiVersion: apps.cozystack.io/v1alpha1
appVersion: '*'
kind: VirtualMachine
metadata:
  name: gpu-vgpu
  namespace: tenant-example
spec:
  running: true
  instanceProfile: ubuntu
  instanceType: u1.medium
  systemDisk:
    image: ubuntu
    storage: 5Gi
    storageClass: replicated
  gpus:
    - name: nvidia.com/NVIDIA_L40S-24Q
  cloudInit: |
    #cloud-config
    password: ubuntu
    chpasswd: { expire: False }
```

```bash
kubectl apply -f vmi-vgpu.yaml
```

Once the VM is running, log in and verify the vGPU is available:

```bash
virtctl console virtual-machine-gpu-vgpu
```
Add an explicit VM readiness check before opening console.
After Line 384, jumping directly to `virtctl console` can fail intermittently if the VM/VMI is not ready yet. Add a wait/check step to keep the flow deterministic.
Proposed doc patch

````diff
 ```bash
 kubectl apply -f vmi-vgpu.yaml
 ```
+
+Wait until the VM instance is ready:
+
+```bash
+kubectl get vmi -n tenant-example -w
+```

 Once the VM is running, log in and verify the vGPU is available:
````
What this PR does
Expands the GPU documentation page with a practical guide for deploying the GPU Operator in vGPU mode. Replaces the brief theoretical section with step-by-step instructions covering:

- building the proprietary vGPU Manager container image
- deploying the `vgpu` variant via Package CR
- NLS license server configuration
- KubeVirt `mediatedDeviceTypes` setup
- a vGPU profile reference table for L40S
- a VM creation example with a vGPU resource