
Commit bbc4215

a-mccarthy, rajathagasthya, cdesiniotis, and rahulait committed
Add docs for 26.3.0 release
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Co-authored-by: Rajath Agasthya <rajathagasthya@gmail.com>
Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Co-authored-by: Rahul Sharma <rahulait@users.noreply.github.com>
1 parent 87ae31f commit bbc4215

23 files changed

Lines changed: 630 additions & 515 deletions

gpu-operator/amazon-eks.rst

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ without any limitations, you perform the following high-level actions:
   Make sure the instance type supports enough IP addresses for your workload.
   For example, the ``g4dn.xlarge`` instance type supports ``29`` IP addresses for pods on the node.
 
-* Use an Amazon EKS optimized Amazon Machine Image (AMI) with Ubuntu 20.04, 22.04, or 24.04 on the nodes in the node group.
+* Use an Amazon EKS optimized Amazon Machine Image (AMI) with a `supported operating system <platform-support.html?category=cloud-service-providers#container-platforms>`_ on the nodes in the node group.
 
   AMIs support are specific to an AWS region and Kubernetes version.
   See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as ``ami-00687acd80b7a620a``.

gpu-operator/cdi.rst

Lines changed: 109 additions & 14 deletions
@@ -16,13 +16,15 @@
 
 .. headings # #, * *, =, -, ^, "
 
-############################################################
-Container Device Interface (CDI) Support in the GPU Operator
-############################################################
+#################################################################################
+Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support
+#################################################################################
 
-************************************
-About the Container Device Interface
-************************************
+This page gives an overview of CDI and NRI Plugin support in the GPU Operator.
+
+**************************************
+About Container Device Interface (CDI)
+**************************************
 
 The `Container Device Interface (CDI) <https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md>`_
 is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means,
@@ -31,7 +33,7 @@ ensure that a device is available in a container. CDI simplifies adding support
 the specification is applicable to all container runtimes that support CDI.
 
 Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes.
-Specifically, CDI support in container runtimes, e.g. containerd and cri-o, is used to inject GPU(s) into workload
+Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload
 containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled ``nvidia`` runtime class.
 
 If you are upgrading from a version of the GPU Operator prior to v25.10.0, where CDI was disabled by default, and you are upgrading to v25.10.0 or later, where CDI is enabled by default, no configuration changes are required for standard workloads using GPU allocation through the Device Plugin.
@@ -45,22 +47,27 @@ plugins.
 CDI and GPU Management Containers
 *********************************
 
-When CDI is enabled in GPU Operator versions v25.10.0 and later, GPU Management Containers that use the ``NVIDIA_VISIBLE_DEVICES`` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin, must set ``runtimeClassName: nvidia`` in the pod specification.
-A GPU Management Containers is a container that requires access to all GPUs without them being allocated by Kubernetes.
+When CDI is enabled in GPU Operator versions v25.10.0 and later, GPU Management Containers that use the ``NVIDIA_VISIBLE_DEVICES`` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs, must set ``runtimeClassName: nvidia`` in the pod specification.
+A GPU Management Container is a container that requires access to all GPUs without them being allocated by Kubernetes.
 Examples of GPU Management Containers include monitoring agents and device plugins.
 
-It is recommended that ``NVIDIA_VISIBLE_DEVICES`` only be used by management containers.
+It is recommended that ``NVIDIA_VISIBLE_DEVICES`` only be used by GPU Management Containers.
+
+.. note::
+
+   Setting ``runtimeClassName: nvidia`` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator.
+   Refer to :ref:`About the Node Resource Interface (NRI) Plugin <nri-plugin>`.
+
 
-********************************
-Enabling CDI During Installation
-********************************
+************
+Enabling CDI
+************
 
 CDI is enabled by default during installation in GPU Operator v25.10.0 and later.
 Follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page.
 
 CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later.
 
-*******************************
 Enabling CDI After Installation
 *******************************
 
@@ -138,3 +145,91 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f
     nvidia.com/gpu.deploy.operator-validator=true \
     nvidia.com/gpu.present=true \
     --overwrite
+
+
+.. _nri-plugin:
+
+**********************************************
+About the Node Resource Interface (NRI) Plugin
+**********************************************
+
+Node Resource Interface (NRI) is a standardized interface for plugging in extensions, called NRI Plugins, to OCI-compatible container runtimes like containerd.
+NRI Plugins serve as hooks that intercept pod and container lifecycle events and perform functions such as injecting devices into a container, applying topology-aware placement strategies, and more. For more details on NRI, refer to the `NRI overview <https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background>`_ in the containerd repository.
+
+When enabled in the GPU Operator, the NVIDIA Container Toolkit daemonset runs an NRI Plugin on every GPU node.
+The purpose of the NRI Plugin is to inject GPUs into GPU Management Containers that use the ``NVIDIA_VISIBLE_DEVICES`` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs.
+
+In previous GPU Operator versions, device injection was handled by the ``nvidia`` container runtime. With CDI and the NRI Plugin enabled, the ``nvidia`` runtime class is no longer needed. If you enable the NRI Plugin during installation, the ``nvidia`` runtime class is not created. If you enable the NRI Plugin after installation, the ``nvidia`` runtime class is deleted.
+
+Additionally, with the NRI Plugin enabled, modifications to the container runtime configuration are no longer needed. For example, no modifications are made to containerd's config.toml file.
+This means that on platforms that configure containerd in a non-standard way, like k3s, k0s, and Rancher Kubernetes Engine 2, users no longer need to configure environment variables like ``CONTAINERD_CONFIG``, ``CONTAINERD_SOCKET``, or ``RUNTIME_CONFIG_SOURCE``.
+
+
+***********************
+Enabling the NRI Plugin
+***********************
+
+The NRI Plugin requires the following:
+
+- CDI to be enabled in the GPU Operator.
+
+- containerd v1.7.30, v2.1.x, or v2.2.x.
+  If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator.
+
+.. note::
+   Enabling the NRI Plugin is not supported with cri-o.
+
+To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page and include the ``--set cdi.nriPluginEnabled=true`` argument in your Helm command.
+
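The requirement above, that both CDI and NRI be enabled in the containerd configuration before deploying the GPU Operator, can be checked in containerd's config file. A minimal sketch for a containerd 1.7 layout follows; the key names and defaults vary across containerd releases, so verify against the documentation for your version:

```toml
# /etc/containerd/config.toml -- relevant fragments only (containerd 1.7
# layout, assumed; containerd 2.x uses version = 3 and different defaults).
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # CDI device injection in the CRI plugin; specs are read from
  # the directories listed in cdi_spec_dirs.
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

[plugins."io.containerd.nri.v1.nri"]
  # NRI is disabled by default in containerd 1.7 and must be switched on;
  # containerd 2.x enables it by default.
  disable = false
```

After editing the file, restart containerd so the settings take effect.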
+Enabling the NRI Plugin After Installation
+******************************************
+
+#. Enable the NRI Plugin by modifying the cluster policy:
+
+   .. code-block:: console
+
+      $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+          -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]'
+
+   *Example Output*
+
+   .. code-block:: output
+
+      clusterpolicy.nvidia.com/cluster-policy patched
+
+   After enabling the NRI Plugin, the ``nvidia`` runtime class is deleted.
+
+#. (Optional) Confirm that the container toolkit and device plugin pods restart:
+
+   .. code-block:: console
+
+      $ kubectl get pods -n gpu-operator
+
+   *Example Output*
+
+   .. literalinclude:: ./manifests/output/nri-get-pods-restart.txt
+      :language: output
+      :emphasize-lines: 6,9
+
+
+************************
+Disabling the NRI Plugin
+************************
+
+To disable the NRI Plugin and use the ``nvidia`` runtime class instead, modify the cluster policy:
+
+.. code-block:: console
+
+   $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
+       -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]'
+
+*Example Output*
+
+.. code-block:: output
+
+   clusterpolicy.nvidia.com/cluster-policy patched
+
+
+After disabling the NRI Plugin, the ``nvidia`` runtime class is created.
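As a concrete illustration of the "CDI and GPU Management Containers" guidance in the diff above: a management pod that uses ``NVIDIA_VISIBLE_DEVICES`` to see every GPU on the node, bypassing Kubernetes allocation, sets ``runtimeClassName: nvidia`` when CDI is enabled and the NRI Plugin is not. This sketch is hypothetical; the pod name and image are illustrative and do not come from the commit:

```yaml
# Hypothetical GPU Management Container -- e.g. a monitoring agent that must
# access all GPUs without them being allocated by the Device Plugin or DRA.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-monitoring-agent      # illustrative name
  namespace: gpu-operator
spec:
  # Required when CDI is enabled and the NRI Plugin is NOT enabled.
  # With the NRI Plugin enabled, this field can be omitted.
  runtimeClassName: nvidia
  containers:
  - name: agent
    image: example.com/gpu-monitoring-agent:latest   # illustrative image
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all                  # request access to every GPU on the node
```

Regular workloads that request GPUs through ``resources.limits`` (for example ``nvidia.com/gpu: 1``) do not need this field.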

gpu-operator/conf.py

Lines changed: 0 additions & 226 deletions
This file was deleted.
