Fix JACCL GID index for multi-node Thunderbolt 5 setups #3389
Open
qubitcontracting wants to merge 1 commit into ml-explore:main from
The JACCL connection code hardcodes sgid_index=1 when querying the GID table and setting up queue pairs. On some Thunderbolt 5 RDMA interfaces, the valid IPv4-mapped GID is at index 2 (not 1), causing 3-node setups to fail with "errno 22" (EINVAL) during QP transition to RTR state.

This changes both Connection::info() and queue_pair_rtr() to scan GID indices 1-3 for a valid entry (non-zero interface_id), falling back to index 0 if none found.

Tested on a 3-node Thunderbolt 5 mesh:
- Mac Studio M3 Ultra (rdma_en3, rdma_en4)
- Mac mini M4 Pro #1 (rdma_en4, rdma_en3), GID at index 2
- Mac mini M4 Pro #2 (rdma_en4, rdma_en3), GID at index 2

Before: 2-node works, 3-node fails with errno 22
After: 3-node works reliably

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes hardcoded sgid_index=1 in JACCL connection setup that prevents 3-node Thunderbolt 5 RDMA clusters from working.
Problem
Connection::info() and queue_pair_rtr() both hardcode GID index 1 when querying the InfiniBand GID table and setting up queue pairs. On some Thunderbolt 5 interfaces (observed on Mac mini M4 Pro with rdma_en3/rdma_en4), the valid IPv4-mapped GID is at index 2, not 1. This causes the QP transition to RTR state to fail with errno 22 (EINVAL).

2-node setups often work because both interfaces happen to have GID at index 1. When a third node with GID at index 2 is added, the connection fails.
Fix
Scan GID indices 1-3 for a valid entry (non-zero interface_id), falling back to index 0 if none found. Applied to both Connection::info() and queue_pair_rtr().
Test plan
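The scan described under Fix can be sketched against a mock GID table, which also lets the selection logic be checked without RDMA hardware. This is an illustrative sketch, not the code in this PR's diff: the names Gid, GidQuery, find_gid_index, and mock_table are hypothetical, and Gid is a local stand-in for the verbs union ibv_gid so the example compiles without RDMA headers.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <optional>

// Local stand-in for the 16-byte `union ibv_gid`: an 8-byte subnet prefix
// followed by an 8-byte interface ID. (Assumption: defined here only so the
// sketch builds without the verbs headers.)
struct Gid {
  std::uint64_t subnet_prefix = 0;
  std::uint64_t interface_id = 0;
};

// Hypothetical query callback: returns the GID stored at `index`, or
// std::nullopt when the lookup itself fails.
using GidQuery = std::function<std::optional<Gid>(int index)>;

// Scan indices 1..3 and return the first whose interface_id is non-zero.
// Fall back to index 0 when none of them qualifies.
int find_gid_index(const GidQuery& query) {
  for (int index = 1; index <= 3; ++index) {
    std::optional<Gid> gid = query(index);
    if (gid && gid->interface_id != 0) {
      return index;
    }
  }
  return 0;
}

// Builds a query over a fixed 4-entry table, for exercising the scan.
GidQuery mock_table(std::array<Gid, 4> table) {
  return [table](int index) -> std::optional<Gid> {
    if (index < 0 || index >= static_cast<int>(table.size())) {
      return std::nullopt;
    }
    return table[index];
  };
}
```

In real connection code the callback would presumably wrap ibv_query_gid(context, port_num, index, &gid), and the returned index would be used both when reporting the local GID and as sgid_index in the address handle attributes passed for the RTR transition. Using one helper for both places keeps the two call sites from disagreeing about which index is valid.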