
[docs] Add vGPU setup guide for GPU sharing between VMs#467

Open
lexfrei wants to merge 3 commits into main from docs/gpu-vgpu-setup

Conversation

@lexfrei
Contributor

@lexfrei lexfrei commented Apr 2, 2026

What this PR does

Expands the GPU documentation page with a practical guide for deploying the GPU Operator in vGPU mode. Replaces the brief theoretical section with step-by-step instructions covering:

  • Building the proprietary vGPU Manager container image
  • Deploying GPU Operator with the vgpu variant via Package CR
  • NVIDIA License Server (NLS) configuration
  • KubeVirt mediatedDeviceTypes setup for VM access
  • vGPU profile reference table for L40S
  • Complete VM creation example with vGPU resource
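
As a rough sketch of the Package CR step listed above, assembled from the values hierarchy discussed in the review comments on this PR (the apiVersion, kind, and surrounding field layout are assumptions for illustration, not taken from the guide itself):

```yaml
# Hypothetical sketch only: apiVersion, kind, and field layout are assumed.
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: gpu-operator
spec:
  variant: vgpu                 # select the vgpu variant of the GPU Operator
  components:
    gpu-operator:
      values:
        gpu-operator:
          vgpuManager:
            # Private registry holding the vGPU Manager image built from
            # NVIDIA's driver container sources (EULA forbids public redistribution).
            repository: registry.example.com/nvidia
            version: "550.90.05"
            imagePullSecrets:
            - name: nvidia-registry-secret
```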

Summary by CodeRabbit

  • Documentation
    • Replaced GPU Sharing guide with an expanded vGPU guide covering mediated devices and why MIG isn’t suitable for passthrough
    • Added prerequisites, licensing notes, and warning that proprietary vGPU Manager driver must be obtained from NVIDIA
    • Provided step-by-step workflow: pre-built driver image build/publish, GPU Operator vgpu variant, License Server wiring, and VM updates to permit mediated devices
    • Added vGPU VM examples, sample verification output, profile reference table, and updated open-source vGPU wording

Add practical instructions for deploying GPU Operator with vGPU variant:
- Building proprietary vGPU Manager container image
- Deploying with vgpu variant via Package CR
- NLS license server configuration
- KubeVirt mediatedDeviceTypes setup
- vGPU profile reference table for L40S
- VM creation example with vGPU resource

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@netlify

netlify bot commented Apr 2, 2026

Deploy Preview for cozystack ready!

  • 🔨 Latest commit: 492f318
  • 🔍 Latest deploy log: https://app.netlify.com/projects/cozystack/deploys/69ce67b64ef3e60008af1961
  • 😎 Deploy Preview: https://deploy-preview-467--cozystack.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

Documentation replaces the previous GPU-sharing overview with a focused vGPU (mediated device) guide covering prerequisites, NVIDIA vGPU licensing, the GPU Operator vgpu variant, the vGPU Manager image build, NVIDIA License Server wiring, KubeVirt mediated device config, VM examples, and vGPU profile details.

Changes

Cohort / File(s): GPU vGPU Documentation (content/en/docs/v1/virtualization/gpu.md)
Summary: Rewrote the GPU sharing section into a full vGPU (mdev) guide: new prerequisites and licensing notes; instructions to build and publish the vGPU Manager driver container; GPU Operator variant: vgpu installation and node labeling; NVIDIA License Server Secret/ConfigMap example and Package wiring; KubeVirt mediatedDevices configuration; a VM example requesting nvidia.com/<profile> and its verification; an added L40S vGPU profiles table; clarified open-source vGPU wording.
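
The KubeVirt mediated-device configuration summarized above could look roughly like the following sketch (the mdev type nvidia-1056 and the selector string are illustrative assumptions; actual types are enumerated under /sys/class/mdev_bus/*/mdev_supported_types on the GPU node):

```yaml
# Sketch only: the mdev type and selector names are assumed, not taken from the guide.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    mediatedDevicesConfiguration:
      mediatedDeviceTypes:
      - nvidia-1056                           # hypothetical mdev type for an L40S profile
    permittedHostDevices:
      mediatedDevices:
      - mdevNameSelector: "NVIDIA L40S-24Q"   # assumed profile display name
        resourceName: nvidia.com/NVIDIA_L40S-24Q
```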

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through docs with glee today,
Tucked mdev notes and profiles away,
Built driver images, licenses in tow,
VMs now share GPUs—watch them go! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change: adding a comprehensive vGPU setup guide for GPU sharing between virtual machines, which aligns with the expanded documentation covering vGPU Manager deployment, licensing, KubeVirt configuration, and VM examples.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides comprehensive documentation for configuring NVIDIA vGPU sharing for virtual machines, including prerequisites, image building, operator installation, and licensing setup. The review feedback suggests clarifying the FeatureType parameter in the licensing configuration and updating the example command prompt to maintain consistency with the platform's virtual machine naming conventions.

```yaml
gridd.conf: |
  ServerAddress=nls.example.com
  ServerPort=443
  FeatureType=1
```
Contributor


Severity: medium

It is helpful to clarify what the FeatureType value represents to assist users in customizing their configuration. In the NVIDIA Grid configuration, 1 corresponds to the "NVIDIA vGPU" (vPC/vWS) feature, while 2 is for "NVIDIA Virtual Compute Server" (vCS).

Suggested change:

```diff
-FeatureType=1
+FeatureType=1 # 1 for vGPU
```
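
For context, a later commit in this PR notes that the licensing config moved from a ConfigMap to a Secret, so the gridd.conf shown here would typically be carried along these lines (the Secret name and namespace are assumptions; FeatureType values per the review comment above):

```yaml
# Sketch only: metadata values are assumed.
# FeatureType: 1 = NVIDIA vGPU (vPC/vWS), 2 = NVIDIA Virtual Compute Server (vCS).
apiVersion: v1
kind: Secret
metadata:
  name: licensing-config          # assumed name
  namespace: cozy-gpu-operator    # assumed namespace
type: Opaque
stringData:
  gridd.conf: |
    ServerAddress=nls.example.com
    ServerPort=443
    FeatureType=1
```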

```console
ubuntu@gpu-vgpu:~$ nvidia-smi
```
Contributor


Severity: medium

For consistency with the GPU passthrough example (line 194) and Cozystack's default naming convention for virtual machine instances, the hostname in the command prompt should include the virtual-machine- prefix. Since the VM name is defined as gpu-vgpu on line 352, the resulting instance name is virtual-machine-gpu-vgpu.

Suggested change:

```diff
-ubuntu@gpu-vgpu:~$ nvidia-smi
+ubuntu@virtual-machine-gpu-vgpu:~$ nvidia-smi
```

@lexfrei lexfrei self-assigned this Apr 2, 2026
Replace simplified Containerfile with NVIDIA's Makefile-based build
system from gitlab.com/nvidia/container-images/driver. The GPU Operator
expects pre-compiled kernel modules, not a raw .run file.

Add EULA warning about public redistribution of vGPU driver images.
Add note about NLS ServerPort being deployment-dependent.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei marked this pull request as ready for review April 2, 2026 12:51
@lexfrei lexfrei requested review from kvaps and lllamnyp as code owners April 2, 2026 12:51
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)

360-385: Name the manifest file explicitly before the apply command.

kubectl apply -f vmi-vgpu.yaml appears without first labeling the YAML block as vmi-vgpu.yaml (unlike the earlier passthrough example). Adding a filename label right above the manifest would remove ambiguity for copy/paste users.

✏️ Suggested doc tweak:

````diff
 ### 5. Create a Virtual Machine with vGPU
 
+**vmi-vgpu.yaml**:
+
 ```yaml
 apiVersion: apps.cozystack.io/v1alpha1
 appVersion: '*'
 kind: VirtualMachine
 ...
````
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@content/en/docs/v1/virtualization/gpu.md`:
- Around line 360-385: Add an explicit filename label above the YAML manifest
block so users know the file name to save before running kubectl; specifically,
annotate the VirtualMachine manifest block with "vmi-vgpu.yaml" (the same name
used in the kubectl apply -f vmi-vgpu.yaml command) by placing the filename line
immediately before the ```yaml fence, ensuring consistency between the manifest
and the kubectl apply invocation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 150cd704-7d01-4c3d-86f5-81294e0a31b5

📥 Commits

Reviewing files that changed from the base of the PR and between 624a38c and 468dd7b.

📒 Files selected for processing (1)
  • content/en/docs/v1/virtualization/gpu.md

- Switch licensing config from ConfigMap to Secret (configMapName deprecated)
- Add FeatureType comment explaining values (1=vGPU, 2=vCS)
- Fix console hostname to match Cozystack naming convention (virtual-machine- prefix)

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
content/en/docs/v1/virtualization/gpu.md (1)

279-288: Make the imagePullSecrets snippet fully qualified to prevent misplacement.

At Line 279, the snippet is context-trimmed and can be pasted under the wrong key. Please show the full values path to avoid broken Package configuration.

Proposed doc patch:

````diff
-    ```yaml
-    gpu-operator:
-      vgpuManager:
-        repository: registry.example.com/nvidia
-        version: "550.90.05"
-        imagePullSecrets:
-        - name: nvidia-registry-secret
-    ```
+    ```yaml
+    components:
+      gpu-operator:
+        values:
+          gpu-operator:
+            vgpuManager:
+              repository: registry.example.com/nvidia
+              version: "550.90.05"
+              imagePullSecrets:
+              - name: nvidia-registry-secret
+    ```
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@content/en/docs/v1/virtualization/gpu.md` around lines 279 - 288, The snippet
for imagePullSecrets is too context-trimmed and can be pasted under the wrong
key; update the example so it shows the full values path (wrap the existing
gpu-operator.vgpuManager block under components -> gpu-operator -> values ->
gpu-operator -> vgpuManager) so users see the complete hierarchy and the
imagePullSecrets entry (refer to symbols: components, gpu-operator, values,
gpu-operator, vgpuManager, imagePullSecrets) and replace the trimmed snippet
with this fully-qualified version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@content/en/docs/v1/virtualization/gpu.md`:
- Around line 358-391: Add an explicit readiness check for the
VirtualMachineInstance before calling "virtctl console": after applying
vmi-vgpu.yaml (kubectl apply -f vmi-vgpu.yaml) add a step that waits for the VMI
to become Ready (e.g., using "kubectl get vmi -n tenant-example -w" or "kubectl
wait --for=condition=Ready vmi/gpu-vgpu -n tenant-example") so the subsequent
"virtctl console virtual-machine-gpu-vgpu" call won't fail intermittently.

---

Nitpick comments:
In `@content/en/docs/v1/virtualization/gpu.md`:
- Around line 279-288: The snippet for imagePullSecrets is too context-trimmed
and can be pasted under the wrong key; update the example so it shows the full
values path (wrap the existing gpu-operator.vgpuManager block under components
-> gpu-operator -> values -> gpu-operator -> vgpuManager) so users see the
complete hierarchy and the imagePullSecrets entry (refer to symbols: components,
gpu-operator, values, gpu-operator, vgpuManager, imagePullSecrets) and replace
the trimmed snippet with this fully-qualified version.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d6caa76-d2a5-4856-97eb-524345257f6b

📥 Commits

Reviewing files that changed from the base of the PR and between 468dd7b and 492f318.

📒 Files selected for processing (1)
  • content/en/docs/v1/virtualization/gpu.md

Comment on lines +358 to +391
### 5. Create a Virtual Machine with vGPU

```yaml
apiVersion: apps.cozystack.io/v1alpha1
appVersion: '*'
kind: VirtualMachine
metadata:
  name: gpu-vgpu
  namespace: tenant-example
spec:
  running: true
  instanceProfile: ubuntu
  instanceType: u1.medium
  systemDisk:
    image: ubuntu
    storage: 5Gi
    storageClass: replicated
  gpus:
  - name: nvidia.com/NVIDIA_L40S-24Q
  cloudInit: |
    #cloud-config
    password: ubuntu
    chpasswd: { expire: False }
```

```bash
kubectl apply -f vmi-vgpu.yaml
```

Once the VM is running, log in and verify the vGPU is available:

```bash
virtctl console virtual-machine-gpu-vgpu
```
Contributor


⚠️ Potential issue | 🟡 Minor

Add an explicit VM readiness check before opening console.

After Line 384, jumping directly to virtctl console can fail intermittently if the VM/VMI is not ready yet. Add a wait/check step to keep the flow deterministic.

Proposed doc patch:

````diff
 ```bash
 kubectl apply -f vmi-vgpu.yaml
 ```
 
+Wait until the VM instance is ready:
+
+```bash
+kubectl get vmi -n tenant-example -w
+```
+
 Once the VM is running, log in and verify the vGPU is available:
````
