19 changes: 8 additions & 11 deletions confidential-containers/overview.rst
@@ -94,7 +94,6 @@ The GPU Operator deploys the components needed to run Confidential Containers to
* NVIDIA Confidential Computing Manager (cc-manager) for Kubernetes - to set the confidential computing (CC) mode on the NVIDIA GPUs.
* NVIDIA Sandbox Device Plugin - to discover NVIDIA GPUs along with their capabilities, to advertise these to Kubernetes, and to allocate GPUs during pod deployment.
Reviewer comment (Contributor):

https://github.com/kata-containers/kata-containers/pull/12651/changes has an updated version of the description of the sandbox device plugin

* NVIDIA VFIO Manager - to bind discovered NVIDIA GPUs to the vfio-pci driver for VFIO passthrough.
Reviewer comment (Contributor):

see https://github.com/kata-containers/kata-containers/pull/12651/changes:
nvidia-vfio-manager: Binding discovered NVIDIA GPUs and nvswitches to
the vfio-pci driver for VFIO passthrough.

* NVIDIA Kata Manager for Kubernetes - to create host-side CDI specifications for GPU passthrough.

**Kata Deploy**

@@ -167,14 +166,13 @@ The following is the component stack to support the open Reference Architecture
| - NVIDIA VFIO Manager
| - NVIDIA Sandbox device plugin
| - NVIDIA Confidential Computing Manager for Kubernetes
| - NVIDIA Kata Manager for Kubernetes
- v25.10.0 and higher
* - CoCo release (EA)
| - Kata 3.25 (w/ kata-deploy helm)
| - Trustee/Guest components 0.17.0
| - KBS protocol 0.4.0
- v0.18.0

- v25.10.0 and higher
* - CoCo release (EA)
Reviewer comment (Contributor):

Is this intentional? I think for our latest stack we need a Kata 3.28 release.
I don't know what 'v0.18.0' is here, and I am not sure if we have the exact trustee/guest components in these versions. We are not using a concrete CoCo release. We are using a Kata release, and this Kata release pulls in CoCo components as dependencies.

| - Kata 3.25 (w/ kata-deploy helm)
| - Trustee/Guest components 0.17.0
| - KBS protocol 0.4.0
- v0.18.0


Cluster Topology Considerations
-------------------------------
@@ -194,8 +192,7 @@ You can configure all the worker nodes in your cluster for running GPU workloads
* NVIDIA MIG Manager for Kubernetes
* Node Feature Discovery
* NVIDIA GPU Feature Discovery
- * NVIDIA Kata Manager for Kubernetes
* NVIDIA Confidential Computing Manager for Kubernetes
- * NVIDIA Confidential Computing Manager for Kubernetes
* NVIDIA Sandbox Device Plugin
* NVIDIA VFIO Manager
* Node Feature Discovery
108 changes: 62 additions & 46 deletions gpu-operator/confidential-containers-deploy.rst
@@ -14,6 +14,10 @@ The implementation relies on the Kata Containers project to provide the lightwei

Refer to the `Confidential Containers overview <https://docs.nvidia.com/datacenter/cloud-native/confidential-containers/latest/overview.html>`_ for details on the reference architecture and supported platforms.

.. tip::

Refer to the :doc:`Kata Containers deployment guide <kata-containers-deploy>` if you want to run workloads in Kata Containers rather than in confidential containers.

.. _coco-prerequisites:

Prerequisites
@@ -60,38 +64,51 @@ Installing and configuring your cluster to support the NVIDIA GPU Operator with

This step installs all required components from the Kata Containers project including the Kata Containers runtime binary, runtime configuration, UVM kernel and initrd that NVIDIA uses for confidential containers and native Kata containers.

3. Install the latest version of the NVIDIA GPU Operator (minimum version: v25.10.0).
3. Install the latest version of the NVIDIA GPU Operator (minimum version: v26.3.0).

You install the Operator and specify options to deploy the operands that are required for confidential containers.

After installation, you can change the confidential computing mode and run a sample GPU workload in a confidential container.

Label nodes and install the Kata Containers Helm Chart
-------------------------------------------------------
Label Nodes
-----------

Perform the following steps to install and verify the Kata Containers Helm Chart:
Add a label to the nodes on which you intend to run confidential containers.

1. Label the nodes on which you intend to run confidential containers as follows::
#. Label the nodes on which you intend to run confidential containers as follows:

$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough

2. Use the 3.24.0 Kata Containers version and chart in environment variables::
By labeling only the nodes that will run confidential containers, you can continue to run traditional GPU or vGPU container workloads on the other nodes in your cluster.
If you plan to run confidential containers on all your worker nodes, you can set the default sandbox workload to ``vm-passthrough`` when you install the GPU Operator.
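As a quick sanity check after labeling, the selector can be kept in a variable and reused; the node name below is a placeholder, and the ``kubectl`` calls are commented out because they require cluster access.

```shell
# The workload label used throughout this guide; keeping it in a
# variable avoids typos when labeling several nodes.
LABEL="nvidia.com/gpu.workload.config=vm-passthrough"

# Apply the label and list the labeled nodes (uncomment against a
# real cluster; "worker-1" is a hypothetical node name):
# kubectl label node worker-1 "$LABEL"
# kubectl get nodes -l "$LABEL" -o name
echo "$LABEL"
```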

Install the Kata Containers Helm Chart
--------------------------------------

#. Get the ``3.24.0`` version of the ``kata-deploy`` Helm chart:

.. code-block:: console

$ export VERSION="3.24.0"
$ export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"

3. Install the Chart::

$ helm install kata-deploy \
--namespace kata-system \
--create-namespace \
-f "https://raw.githubusercontent.com/kata-containers/kata-containers/refs/tags/${VERSION}/tools/packaging/kata-deploy/helm-chart/kata-deploy/try-kata-nvidia-gpu.values.yaml" \
--set nfd.enabled=false \
--set shims.qemu-nvidia-gpu-tdx.enabled=false \
--wait --timeout 10m --atomic \
"${CHART}" --version "${VERSION}"
#. Install the kata-deploy Helm chart:

*Example Output*::
.. code-block:: console


$ helm install kata-deploy "${CHART}" \
--namespace kata-system --create-namespace \
--set nfd.enabled=false \
--wait --timeout 10m \
--set shims.qemu-nvidia-gpu-tdx.enabled=false \
--version "${VERSION}"


*Example Output*

.. code-block:: output

Pulled: ghcr.io/kata-containers/kata-deploy-charts/kata-deploy:3.24.0
Digest: sha256:d87e4f3d93b7d60eccdb3f368610f2b5ca111bfcd7133e654d08cfd192fb3351
@@ -102,7 +119,7 @@ Perform the following steps to install and verify the Kata Containers Helm Chart
REVISION: 1
TEST SUITE: None

4. Optional: View the pod in the kata-system namespace and ensure it is running::
#. Optional: View the pod in the kata-system namespace and ensure it is running:

$ kubectl get pod,svc -n kata-system

@@ -113,51 +130,49 @@ Perform the following steps to install and verify the Kata Containers Helm Chart

Wait a few minutes for kata-deploy to create the base runtime classes.

5. Verify that the kata-qemu-nvidia-gpu and kata-qemu-nvidia-gpu-snp runtime classes are available::
5. Verify that the ``kata-qemu-nvidia-gpu`` and ``kata-qemu-nvidia-gpu-snp`` runtime classes are available:

.. code-block:: console

$ kubectl get runtimeclass

*Example Output*::
*Example Output*

NAME HANDLER AGE
kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 40s
kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 40s

``kata-deploy`` installs several runtime classes. The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.
The ``kata-qemu-nvidia-gpu-snp`` runtime class is used to deploy Confidential Containers.
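To illustrate how a workload opts into the confidential runtime class, the sketch below writes a minimal pod spec that selects ``kata-qemu-nvidia-gpu-snp``. The pod name, container image, and the ``nvidia.com/pgpu`` resource alias are assumptions for this example; adjust them to what your cluster advertises.

```shell
# Minimal pod spec sketch selecting the confidential runtime class.
# Pod name and image are placeholders, not values from this guide.
cat > cc-pod-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-cc-demo
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04
    resources:
      limits:
        "nvidia.com/pgpu": 1
EOF
# kubectl apply -f cc-pod-sketch.yaml   # run against a real cluster
grep runtimeClassName cc-pod-sketch.yaml
```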

Install the NVIDIA GPU Operator
--------------------------------

Perform the following steps to install the Operator for use with confidential containers:

1. Add and update the NVIDIA Helm repository::
1. Add and update the NVIDIA Helm repository:

.. code-block:: console

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update

2. Specify at least the following options when you install the Operator. If you want to run Confidential Containers by default on all worker nodes, also specify ``--set sandboxWorkloads.defaultWorkload=vm-passthrough``::
2. Specify at least the following options when you install the Operator.
If you want to run Confidential Containers by default on all worker nodes, also specify ``--set sandboxWorkloads.defaultWorkload=vm-passthrough``:

.. code-block:: console

$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set kataManager.enabled=true \
--set kataManager.config.runtimeClasses=null \
--set kataManager.repository=nvcr.io/nvidia/cloud-native \
--set kataManager.image=k8s-kata-manager \
--set kataManager.version=v0.2.4 \
--set ccManager.enabled=true \
--set ccManager.defaultMode=on \
--set ccManager.repository=nvcr.io/nvidia/cloud-native \
--set ccManager.image=k8s-cc-manager \
--set ccManager.version=v0.2.0 \
--set sandboxDevicePlugin.repository=nvcr.io/nvidia/cloud-native \
--set sandboxDevicePlugin.image=nvidia-sandbox-device-plugin \
--set sandboxDevicePlugin.version=v0.0.1 \
--set 'sandboxDevicePlugin.env[0].name=P_GPU_ALIAS' \
--set 'sandboxDevicePlugin.env[0].value=pgpu' \
--set nfd.enabled=true \
--set nfd.nodefeaturerules=true
--set sandboxWorkloads.enabled=true \
--set sandboxWorkloads.mode=kata \
--set nfd.enabled=true \
--set nfd.nodefeaturerules=true

*Example Output*::
*Example Output*:

.. code-block:: output

NAME: gpu-operator-1766001809
LAST DEPLOYED: Wed Dec 17 20:03:29 2025
@@ -172,24 +187,25 @@ Perform the following steps to install the Operator for use with confidential co
resource types (such as ``nvidia.com/GH100_H100L_94GB``) instead of the generic
``nvidia.com/pgpu``. For simplicity, this guide uses the generic alias.
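One way to see which resource name a node actually advertises is to inspect its allocatable resources; the jsonpath expression below is a sketch, the node name is a placeholder, and the ``kubectl``/``jq`` invocations are commented out because they require cluster access.

```shell
# Jsonpath for a node's allocatable resources (includes any
# nvidia.com/* extended resources the device plugin registered).
JSONPATH='{.status.allocatable}'
# kubectl get node worker-1 -o jsonpath="$JSONPATH"
# With jq installed, keep only the NVIDIA extended resources:
# kubectl get node worker-1 -o json \
#   | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/")))'
echo "$JSONPATH"
```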

3. Verify that all GPU Operator pods, especially the Kata Manager, Confidential Computing Manager, Sandbox Device Plugin and VFIO Manager operands, are running::
3. Verify that all GPU Operator pods, especially the Confidential Computing Manager, Sandbox Device Plugin and VFIO Manager operands, are running:

.. code-block:: console

$ kubectl get pods -n gpu-operator

*Example Output*::
*Example Output*:

NAME READY STATUS RESTARTS AGE
gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp 1/1 Running 0 86s
gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g 1/1 Running 0 86s
gpu-operator-1766001809-node-feature-discovery-worker-mh4cv 1/1 Running 0 86s
gpu-operator-f48fd66b-vtfrl 1/1 Running 0 86s
nvidia-cc-manager-7z74t 1/1 Running 0 61s
nvidia-kata-manager-k8ctm 1/1 Running 0 62s
nvidia-sandbox-device-plugin-daemonset-d5rvg 1/1 Running 0 30s
nvidia-kata-sandbox-device-plugin-daemonset-d5rvg 1/1 Running 0 30s
nvidia-sandbox-validator-6xnzc 1/1 Running 1 30s
nvidia-vfio-manager-h229x 1/1 Running 0 62s

4. If the nvidia-cc-manager is *not* running, you need to label your CC-capable node(s) by hand. The node labelling capabilities in the early access version are not complete. To label your node(s), run::
4. If the nvidia-cc-manager is *not* running, you need to label your CC-capable node(s) by hand. The node labelling capabilities in the early access version are not complete. To label your node(s), run:

$ kubectl label node <nodename> nvidia.com/cc.capable=true

@@ -206,7 +222,7 @@ Perform the following steps to install the Operator for use with confidential co
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

b. Confirm that the kata-deploy functionality installed the kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu runtime class files::
b. Confirm that the kata-deploy functionality installed the kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu runtime class files:

$ ls -l /opt/kata/share/defaults/kata-containers/ | grep nvidia

1 change: 1 addition & 0 deletions gpu-operator/getting-started.rst
@@ -317,6 +317,7 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
- Specifies the default type of workload for the cluster, one of ``container``, ``vm-passthrough``, or ``vm-vgpu``.

Setting ``vm-passthrough`` or ``vm-vgpu`` can be helpful if you plan to run all or mostly virtual machines in your cluster.
Refer to :doc:`KubeVirt <gpu-operator-kubevirt>`, :doc:`Kata Containers <kata-containers-deploy>`, or :doc:`Confidential Containers <confidential-containers-deploy>` for more details on deploying different workload containers.
- ``container``

* - ``sandboxWorkloads.mode``
1 change: 1 addition & 0 deletions gpu-operator/index.rst
@@ -56,6 +56,7 @@
:hidden:

KubeVirt <gpu-operator-kubevirt.rst>
Kata Containers <kata-containers-deploy.rst>
Confidential Containers <confidential-containers-deploy.rst>

.. toctree::