Fix JACCL GID index for multi-node Thunderbolt 5 setups #3389

Open
qubitcontracting wants to merge 1 commit into ml-explore:main from qubitcontracting:jaccl-gid-fix

@qubitcontracting

Summary

Fixes the hardcoded sgid_index=1 in JACCL connection setup, which prevents 3-node Thunderbolt 5 RDMA clusters from working.

Problem

Connection::info() and queue_pair_rtr() both hardcode GID index 1 when querying the InfiniBand GID table and setting up queue pairs. On some Thunderbolt 5 interfaces (observed on Mac mini M4 Pro with rdma_en3/rdma_en4), the valid IPv4-mapped GID is at index 2, not 1. With the hardcoded index pointing at the wrong entry, the queue pair (QP) transition to the RTR (ready-to-receive) state fails with errno 22 (EINVAL).

2-node setups often work because both interfaces happen to have a valid GID at index 1. When a third node whose valid GID is at index 2 is added, the connection fails.

Fix

Scan GID indices 1-3 for a valid entry (non-zero interface_id), falling back to index 0 if none is found. The same scan is applied in both Connection::info() and queue_pair_rtr().
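
For reviewers who want the selection logic in isolation, here is a minimal, hardware-independent sketch of the scan described above. It is not the code from this PR: `Gid`, `GidQuery`, `has_nonzero_interface_id`, and `select_gid_index` are hypothetical names, and the GID table lookup is abstracted behind a callback so the logic can run without an RDMA device. It assumes the usual 16-byte GID layout (subnet prefix in bytes 0-7, interface ID in bytes 8-15).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-in for a 16-byte GID:
// bytes 0-7 hold the subnet prefix, bytes 8-15 the interface ID.
using Gid = std::array<std::uint8_t, 16>;

// Hypothetical lookup callback: returns the GID stored at `index`,
// or std::nullopt if the query for that index fails.
using GidQuery = std::function<std::optional<Gid>(int index)>;

// An entry counts as valid here when its interface ID is non-zero.
bool has_nonzero_interface_id(const Gid& gid) {
  for (std::size_t i = 8; i < gid.size(); ++i) {
    if (gid[i] != 0) {
      return true;
    }
  }
  return false;
}

// Scan indices 1-3 in order and return the first valid one;
// fall back to index 0 if none of them qualifies.
int select_gid_index(const GidQuery& query) {
  for (int index = 1; index <= 3; ++index) {
    std::optional<Gid> gid = query(index);
    if (gid && has_nonzero_interface_id(*gid)) {
      return index;
    }
  }
  return 0;
}
```

In the real connection code the callback would presumably wrap the GID table query, and the returned index would be used wherever index 1 was previously hardcoded. Both call sites need to agree on the result, which is why the scan is applied in both places.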

Test plan

  • 3-node TB5 mesh: Mac Studio M3 Ultra + 2x Mac mini M4 Pro
  • Before fix: 2-node works, 3-node fails with errno 22
  • After fix: 3-node works reliably (tested with Qwen3-235B pipeline parallel)
  • 2-node operation unaffected

The JACCL connection code hardcodes sgid_index=1 when querying the
GID table and setting up queue pairs. On some Thunderbolt 5 RDMA
interfaces, the valid IPv4-mapped GID is at index 2 (not 1), causing
3-node setups to fail with "errno 22" (EINVAL) during QP transition
to RTR state.

This changes both Connection::info() and queue_pair_rtr() to scan
GID indices 1-3 for a valid entry (non-zero interface_id), falling
back to index 0 if none found.

Tested on a 3-node Thunderbolt 5 mesh:
- Mac Studio M3 Ultra (rdma_en3, rdma_en4)
- Mac mini M4 Pro #1 (rdma_en4, rdma_en3), GID at index 2
- Mac mini M4 Pro #2 (rdma_en4, rdma_en3), GID at index 2

Before: 2-node works, 3-node fails with errno 22
After: 3-node works reliably

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>