Feature request: add --collector.infiniband.device-exclude option #3583

@sara4dev

Problem

On nodes with mixed InfiniBand hardware (e.g. 8× ConnectX-7 NDR 400G + 4× ConnectX-7 HDR 100G behind a BlueField-3 DPU), some devices expose sysfs counters that return EINVAL when read. This causes the entire infiniband collector to fail on every scrape, producing zero node_infiniband_* metrics — even for the healthy devices.

Example error (every 30s on 16 out of 23 GPU nodes):

level=ERROR source=collector.go:168 msg="collector failed" name=infiniband duration_seconds=0.354
  err="error obtaining InfiniBand class info: failed to read file
  \"/host/sys/class/infiniband/mlx5_6/ports/1/counters/VL15_dropped\": invalid argument"

The problematic devices are mlx5_6 through mlx5_9 (HDR HCAs), while mlx5_0 through mlx5_5 and mlx5_10/mlx5_11 (NDR HCAs) work fine.

There is currently no way to exclude specific InfiniBand devices from collection. The only workaround is --no-collector.infiniband, which disables all IB metrics entirely.

Requested Feature

Add --collector.infiniband.device-exclude (and optionally --collector.infiniband.device-include) flags, consistent with the pattern already used by other collectors:

  • --collector.netdev.device-exclude
  • --collector.ethtool.device-exclude
  • --collector.qdisc.device-exclude

The flag should accept a regexp pattern to skip matching device names in /sys/class/infiniband/.

Example usage

--collector.infiniband.device-exclude=mlx5_[6-9]

This would skip mlx5_6, mlx5_7, mlx5_8, mlx5_9 and collect metrics from all other IB devices.
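The filtering behaviour described above can be sketched in Go roughly as follows. This is an illustration only, not node_exporter's actual code: the `deviceFilter` type and the `newDeviceFilter`, `ignored`, and `keptDevices` names are hypothetical, and anchoring the pattern to the whole device name is an assumption made here so that `mlx5_[6-9]` matches exactly the four devices listed.

```go
package main

import (
	"fmt"
	"regexp"
)

// deviceFilter is a hypothetical helper for this sketch; it is not
// node_exporter's actual implementation.
type deviceFilter struct {
	exclude *regexp.Regexp // nil means "exclude nothing"
	include *regexp.Regexp // nil means "include everything"
}

// compileAnchored wraps the pattern so it must match the whole device name.
// Anchoring is an assumption of this sketch, not a documented flag behaviour.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	if pattern == "" {
		return nil, nil
	}
	return regexp.Compile("^(?:" + pattern + ")$")
}

// newDeviceFilter builds a filter from the exclude and include patterns.
func newDeviceFilter(excludePattern, includePattern string) (deviceFilter, error) {
	var f deviceFilter
	var err error
	if f.exclude, err = compileAnchored(excludePattern); err != nil {
		return f, err
	}
	if f.include, err = compileAnchored(includePattern); err != nil {
		return f, err
	}
	return f, nil
}

// ignored reports whether a device should be skipped: either it matches the
// exclude pattern, or an include pattern is set and the device does not match it.
func (f deviceFilter) ignored(name string) bool {
	if f.exclude != nil && f.exclude.MatchString(name) {
		return true
	}
	return f.include != nil && !f.include.MatchString(name)
}

// keptDevices returns the device names that survive the filter, in order.
func keptDevices(f deviceFilter, names []string) []string {
	var kept []string
	for _, name := range names {
		if !f.ignored(name) {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	f, err := newDeviceFilter("mlx5_[6-9]", "")
	if err != nil {
		panic(err)
	}
	devices := []string{"mlx5_0", "mlx5_5", "mlx5_6", "mlx5_9", "mlx5_10", "mlx5_11"}
	fmt.Println(keptDevices(f, devices)) // prints [mlx5_0 mlx5_5 mlx5_10 mlx5_11]
}
```

In a real collector, the filter would presumably be applied to the device names found under /sys/class/infiniband/ before any counter files are read, so that an unreadable counter on an excluded device can no longer fail the whole scrape.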

Environment

  • node_exporter version: v1.9.1
  • OS: Ubuntu (Kubernetes nodes, B200 GPU cluster)
  • Hardware:
    • 8× Mellanox ConnectX-7 NDR 400G (mlx5_0 through mlx5_5, mlx5_10 and mlx5_11): work fine
    • 4× Mellanox ConnectX-7 HDR 100G (mlx5_6 through mlx5_9): VL15_dropped and symbol_error counters return EINVAL
    • 1× BlueField-3 DPU

Additional Context

A secondary benefit of this flag is allowing users to reduce cardinality on nodes with many IB devices by collecting metrics only from devices actively carrying traffic.

Other collectors already follow this pattern, so adding device-exclude/device-include to the infiniband collector would be a consistent and low-risk enhancement.
