Skip to content

fix(k8s): make NVIDIA device plugin daemonset and namespace configurable#237

Merged
dmitsh merged 1 commit intomainfrom
ds-data-broker
Mar 21, 2026
Merged

fix(k8s): make NVIDIA device plugin daemonset and namespace configurable#237
dmitsh merged 1 commit intomainfrom
ds-data-broker

Conversation

@dmitsh
Copy link
Copy Markdown
Collaborator

@dmitsh dmitsh commented Mar 20, 2026

No description provided.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR makes the NVIDIA device-plugin DaemonSet name and namespace configurable for the infiniband-k8s provider, replacing two hardcoded strings with defaults that can be overridden at deploy time via CLI flags. It also migrates node-data-broker-initc from the Go stdlib flag package to pflag to support the new double-dash arguments.

Key changes:

  • pkg/providers/infiniband/k8s.go: GetClusterID and GetNodeAnnotations now accept an overrides map[string]string; getDevicePluginInfo resolves the DaemonSet name and namespace from that map with safe fallback defaults.
  • cmd/node-data-broker-initc/main.go: Migrated to pflag, added a --set flag accepting name=value pairs, added getExtras to parse and validate those pairs, and threads the resulting map through to the infiniband provider.
  • charts/.../daemonset.yaml: Fixed the provider flag to use --provider (double-dash, matching pflag) and added a range loop to forward extraArgs as --set arguments to the init container.
  • charts/.../values.yaml: initc.enabled was flipped from false to true by default — this is a potentially breaking change (see inline comment); deployments without global.provider.name configured will now have a failing init container that blocks the main container from starting.

Confidence Score: 3/5

  • The Go logic and Helm template changes are sound, but flipping initc.enabled to true by default is a potentially breaking change that needs to be evaluated before merging.
  • The core Go implementation — pflag migration, getExtras parsing, and the overrides map threading — is correct and well-tested. The Helm template change (--provider double-dash fix, range loop for extra args) is also correct. The main concern is the initc.enabled default flip from false to true in values.yaml: any deployment that does not explicitly set global.provider.name will now have a failing init container that prevents the main container from starting, making this a potential breaking change in existing environments.
  • charts/topograph/charts/node-data-broker/values.yaml — the initc.enabled: true default change warrants careful review before merging.

Important Files Changed

Filename Overview
charts/topograph/charts/node-data-broker/values.yaml Changed initc.enabled from false to true by default and added extraArgs — the default flip is a potentially breaking change for deployments without global.provider.name configured.
charts/topograph/charts/node-data-broker/templates/daemonset.yaml Corrected -provider to --provider (pflag double-dash) and added a range loop to pass each extraArg as --set=<key>=<value> to the init container. Logic is correct.
pkg/providers/infiniband/k8s.go Added configurable daemonset/namespace via overrides map, new constants for arg keys and defaults, and helper getDevicePluginInfo. Nil-map reads are safe in Go; defaults apply correctly when keys are absent.
cmd/node-data-broker-initc/main_test.go Added TestGetExtras with six cases (empty, valid single/multiple, and three error conditions). Updated existing tests to use tt naming convention and pass the new nil extras argument.
pkg/providers/infiniband/k8s_test.go Added TestGetDevicePluginInfo with five cases covering nil overrides, individual overrides, both overrides, and irrelevant keys. Covers the nil-map edge case via the zero-value struct field.
charts/topograph/values.k8s-ib-example.yaml Added initc.extraArgs with the two new configurable keys; both values are identical to the in-code defaults, making them redundant but harmless.
go.mod Added github.com/spf13/pflag v1.0.6 as a direct dependency to support the flag migration.
cmd/node-data-broker-initc/main.go Migrated flag parsing from stdlib to pflag, added a set flag for key-value overrides, and wired the extras map through getAnnotations to the infiniband provider. getExtras validates format and rejects empty keys or values.

Sequence Diagram

sequenceDiagram
    participant HelmChart as Helm Chart
    participant InitC as node-data-broker-initc
    participant K8sAPI as Kubernetes API
    participant DevPlugin as nvidia-device-plugin (pod)

    HelmChart->>InitC: --provider=infiniband-k8s<br/>--set=gpu-operator-namespace=X<br/>--set=device-plugin-daemonset=Y
    InitC->>InitC: pflag.Parse()<br/>getExtras(sets) → map[string]string
    InitC->>K8sAPI: InClusterConfig / NewForConfig
    InitC->>K8sAPI: GetDaemonSetPods(ds, namespace, hostname)
    K8sAPI-->>InitC: PodList
    InitC->>DevPlugin: ExecInPod(sh -c cmdClusterID)
    DevPlugin-->>InitC: ClusterUUID + CliqueId output
    InitC->>InitC: parseClusterID() → clusterID
    InitC->>K8sAPI: Nodes().Get(nodeName)
    K8sAPI-->>InitC: Node object
    InitC->>InitC: mergeNodeAnnotations(node, annotations)
    InitC->>K8sAPI: Nodes().Update(node)
    K8sAPI-->>InitC: Updated Node
Loading

Comments Outside Diff (2)

  1. charts/topograph/charts/node-data-broker/values.yaml, line 17-18 (link)

    initc.enabled flipped to true by default — potential breaking change

    Changing initc.enabled from false to true means that every deployment using the default node-data-broker values (without explicitly overriding global.provider.name) will now attempt to run the init container. Since global.provider.name is not defined in this sub-chart's own values.yaml, the rendered arg will be --provider= (empty string), causing the init container to exit with "must set provider".

    In Kubernetes, a failing init container puts the pod into a CrashLoopBackOff restart loop and prevents the main (data-broker) container from ever starting. Any existing deployment that:

    1. deploys node-data-broker standalone (without the parent topograph chart), or
    2. uses the parent chart but does not set global.provider.name

    ...will be broken after upgrading to this chart version.

    Consider either keeping the default as false (opt-in), or adding an explicit guard in the template (e.g. only render the init container when both initc.enabled and a non-empty provider are present):

    initc:
      enabled: false   # keep opt-in; users must set global.provider.name first
  2. charts/topograph/values.k8s-ib-example.yaml, line 22-25 (link)

    extraArgs mirror the hardcoded defaults

    The two extraArgs entries added here (gpu-operator-namespace=gpu-operator and device-plugin-daemonset=nvidia-device-plugin-daemonset) are identical to defaultGpuOperatorNamespace and defaultDevicePluginDaemonSet in pkg/providers/infiniband/k8s.go. Passing them explicitly has no functional effect and may mislead readers into thinking these values are required (or that omitting them would change behaviour).

    If the intent is to document the available keys, a comment would be clearer:

      initc:
        # extraArgs override the device-plugin discovery defaults:
        #   gpu-operator-namespace=<namespace>   (default: gpu-operator)
        #   device-plugin-daemonset=<name>       (default: nvidia-device-plugin-daemonset)
        extraArgs: []

Last reviewed commit: "fix(k8s): make NVIDI..."

Comment thread pkg/providers/infiniband/k8s.go Outdated
Comment thread pkg/providers/infiniband/k8s_test.go Outdated
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 45.45455% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.00%. Comparing base (e233962) to head (7c3853d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
cmd/node-data-broker-initc/main.go 44.82% 16 Missing ⚠️
pkg/providers/infiniband/k8s.go 46.66% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
+ Coverage   66.98%   67.00%   +0.01%     
==========================================
  Files          82       82              
  Lines        4649     4676      +27     
==========================================
+ Hits         3114     3133      +19     
- Misses       1424     1432       +8     
  Partials      111      111              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dmitsh dmitsh force-pushed the ds-data-broker branch 3 times, most recently from ee00bcd to f28c760 Compare March 20, 2026 15:46
@dmitsh dmitsh requested a review from ravisoundar March 20, 2026 15:47
@dmitsh dmitsh force-pushed the ds-data-broker branch 3 times, most recently from 1fc8073 to c74b24e Compare March 20, 2026 18:54
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
@dmitsh dmitsh merged commit 31c17c7 into main Mar 21, 2026
4 checks passed
@dmitsh dmitsh deleted the ds-data-broker branch March 21, 2026 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant