
transport/libfabric: Optimize progress path by removing counters and inline AMO ack processing#71

Open
amitrad-aws wants to merge 11 commits into NVIDIA:devel from amitrad-aws:progress_opt

Conversation

@amitrad-aws

This PR optimizes the libfabric transport progress and quiet paths with two changes:

  1. Replace counter-based completion tracking with CQ-based tracking
    Remove fi_cntr usage and track operation completions entirely through CQ
    entries. The endpoint now maintains a completed_ops counter incremented in
    the CQ completion handler, and quiet polls the CQ instead of reading or
    waiting on a counter. This also unifies the staged-atomics and
    non-staged-atomics quiet paths into a single CQ-driven polling loop, and
    consolidates error reporting into the CQ error path.

  2. Process AMO acks inline in CQ completion handler
    Move AMO ack processing out of the batched op_queue drain and into the CQ
    completion handler directly. When an ACK recv completion arrives, the ack
    is processed and the recv buffer is re-posted immediately, eliminating the
    intermediate queue and the separate per-progress-call drain loop.

Together these changes reduce overhead in the progress hot path by removing
counter reads and an extra queue drain pass per progress call.
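A minimal sketch of the CQ-driven quiet loop described in change 1. The structure and names here are illustrative stand-ins, not the PR's actual types; the point is that quiet spins on CQ entries bumping a `completed_ops` counter rather than reading an fi_cntr:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a libfabric endpoint; names are illustrative,
// not the PR's actual structures.
struct sim_endpoint {
    uint64_t submitted_ops = 0;   // ops posted to the NIC
    uint64_t completed_ops = 0;   // bumped by the CQ completion handler
    std::deque<int> cq;           // pending CQ entries (simulated)
};

// CQ completion handler: each entry read bumps completed_ops.
static int poll_cq(sim_endpoint& ep) {
    int drained = 0;
    while (!ep.cq.empty()) {
        ep.cq.pop_front();
        ep.completed_ops++;   // replaces the old counter-based tracking
        drained++;
    }
    return drained;
}

// quiet: poll the CQ until every submitted op has completed,
// instead of reading or waiting on an fi_cntr.
static void quiet(sim_endpoint& ep) {
    while (ep.completed_ops < ep.submitted_ops)
        poll_cq(ep);
}
```

Since both the staged-atomics and non-staged-atomics paths reduce to this single loop, error handling can live entirely in the CQ error path.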

a-szegel and others added 11 commits March 5, 2026 14:43
CUDA 11.3 released the cuFlushGPUDirectRDMAWrites API, which takes the place
of the host transport enforce_cst API. NVSHMEM no longer supports CUDA
11, so these legacy APIs can be removed.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
The variable previously named is_proxy actually holds a qp_index. Rename it
everywhere for consistency.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Attempt to request FI_PROGRESS_AUTO to see if the libfabric provider
supports it; if it doesn't, fall back to FI_PROGRESS_MANUAL.
FI_PROGRESS_AUTO means that we do not need to call into the progress
engine for submitted operations to complete.  This means that we can
remove the host endpoint from the progress call, and we only need to
progress the host endpoint when the user calls nvshmem_quiet() from the
host. This allows us to set the threading model to FI_THREAD_COMPLETION
because the host only progresses the host EP, and the proxy only
progresses the proxy EP, leading to compliance with FI_THREAD_COMPLETION.
An edge case exists here where the user calls nvshmem_quiet() on the
host QP_IDX from a GPU kernel, but this is illegal because the user
shouldn't be calling QP APIs on QPs not provided to them via the QP
creation APIs.

This patch should offer a performance improvement because it reduces the
number of EPs that are progressed in the critical path, and it allows
the libfabric provider to reduce locking because of the
FI_THREAD_COMPLETION threading model.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
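The negotiation described above can be sketched as a simple decision function. In the real transport this is an fi_getinfo() probe against the provider; the boolean parameter here stands in for the provider's answer, and all names are illustrative:

```cpp
#include <cassert>
#include <utility>

enum progress_mode { PROGRESS_AUTO, PROGRESS_MANUAL };
enum thread_model  { THREAD_COMPLETION, THREAD_SAFE };

// Illustrative negotiation: try FI_PROGRESS_AUTO first; if the provider
// rejects it, fall back to FI_PROGRESS_MANUAL. The threading model follows:
// auto progress lets each thread own its endpoint (FI_THREAD_COMPLETION),
// while manual progress has both threads touching all EPs (FI_THREAD_SAFE).
static std::pair<progress_mode, thread_model>
negotiate_progress(bool provider_supports_auto) {
    if (provider_supports_auto)
        return {PROGRESS_AUTO, THREAD_COMPLETION};
    return {PROGRESS_MANUAL, THREAD_SAFE};
}
```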
This change implements multi-rail support for the libfabric host proxy
transport. The transport changes from having 1 domain with 2 EPs to
having 1 host domain on NIC 1 and one proxy domain per NIC. Splitting
the host EP and proxy EP into separate domains was done for simplicity
of the code. Every domain resource (including the AV) was bound on a 1-1
basis per EP, so this change should be a functional no-op. In the future,
when one implements the QP API on the libfabric host proxy transport, N
EPs per domain can easily be built on top of this. This code uses a
round-robin load balancer to assign messages to NICs. One NIC will be
used for the entire operation call into the libfabric transport
(including put-signal), but not including messages that are segmented
due to size or MR boundaries.

The number of NICs (domains) per PE is limited by the size of the struct
nvshmemt_libfabric_mem_handle_t. A new env variable,
NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE, controls the max number of NICs per PE.

Thank you Justin for contributing an initial implementation of multi-rail
which I built on top of.

Co-authored-by: Justin Chui <justchiu@amazon.com>
Signed-off-by: Seth Zegelstein <szegel@amazon.com>
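The per-operation round-robin rail selection can be sketched as follows. One NIC is chosen per operation call into the transport and reused for every message of that call (e.g. both halves of a put-signal), matching the description above; names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative round-robin rail selector. pick() is called once per
// operation call into the transport; all messages within that call
// (except size/MR-boundary segments) use the returned NIC.
struct rail_balancer {
    uint32_t num_nics;
    uint32_t next = 0;
    uint32_t pick() {
        uint32_t nic = next;
        next = (next + 1) % num_nics;
        return nic;
    }
};
```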
The current topology code has the problem that NIC-GPU load balancing is
applied only to PEs. If an application initializes NVSHMEM such that
there is only one PE per node per rank, load balancing will not be
applied between GPUs. Then, PEs/GPUs from separate ranks will share
NICs, leading to contention and reduced performance.

This patch fixes the issue by iterating over all GPUs in the system,
listed in `/sys/bus/pci/drivers/nvidia`, and load balancing over all
of them.

Signed-off-by: Eric Raut <eraut@amazon.com>
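One possible shape of the GPU-wide assignment: spread all GPUs enumerated on the node evenly across the available NICs, so two single-PE ranks on the same node land on different NICs. The exact policy in the patch may differ; this proportional block mapping is only an illustration:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mapping: GPU i (as enumerated from
// /sys/bus/pci/drivers/nvidia) gets NIC floor(i * num_nics / num_gpus),
// distributing GPUs evenly across NICs regardless of how many PEs
// each rank runs on this node.
static uint32_t nic_for_gpu(uint32_t gpu_index, uint32_t num_gpus,
                            uint32_t num_nics) {
    return static_cast<uint32_t>(
        static_cast<uint64_t>(gpu_index) * num_nics / num_gpus);
}
```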
Cache counter value to avoid potentially expensive counter read
operation if no new operations have been submitted since the last call
to fence/quiet.

Signed-off-by: Eric Raut <eraut@amazon.com>
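The caching idea above can be sketched as: skip the (potentially expensive) counter read when nothing new has been submitted since the last fence/quiet. The structure and the `reads_performed` instrumentation are illustrative, not the PR's code:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative cache around a completion counter read.
struct cached_counter {
    uint64_t submitted = 0;       // ops submitted so far
    uint64_t last_completed = 0;  // cached completion count
    uint64_t reads_performed = 0; // instrumentation for this sketch only

    // Pretend the hardware counter says everything submitted has completed.
    uint64_t read_hw_counter() { reads_performed++; return submitted; }

    bool all_complete() {
        if (last_completed == submitted)
            return true;           // cache hit: no counter read
        last_completed = read_hw_counter();
        return last_completed == submitted;
    }
};
```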
All signals initiated by the same peer PE will be applied in-order based
on sequence number.

This change also orders signals with respect to puts, so that signal
delivery guarantees previous puts have arrived. For this, puts are sent
with a sequence number that orders them with the signals. A periodic ack
is sent for puts to prevent sequence overflow errors.

Ordering is applied across all rails, per peer PE.

Signed-off-by: Eric Raut <eraut@amazon.com>
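The receiver side of the in-order delivery described above can be sketched as a small per-peer reorder buffer: signals (and the puts they fence) carry a sequence number, and anything arriving early is buffered until its turn. The periodic put-ack that prevents sequence overflow is omitted here; names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative per-peer ordering: deliver payloads strictly in
// sequence-number order, buffering out-of-order arrivals.
struct peer_ordering {
    uint64_t expected = 0;
    std::map<uint64_t, int> pending;   // seq -> payload arrived early
    std::vector<int> delivered;

    void arrive(uint64_t seq, int payload) {
        pending[seq] = payload;
        // Drain every consecutive entry starting at the expected seq.
        while (!pending.empty() && pending.begin()->first == expected) {
            delivered.push_back(pending.begin()->second);
            pending.erase(pending.begin());
            expected++;
        }
    }
};
```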
…odel

Introduce conditional_mutex that skips locking when FI_THREAD_COMPLETION
is active (FI_PROGRESS_AUTO path). With auto progress, the provider
guarantees thread-safe completion processing per endpoint, and the host
thread and proxy thread operate on separate endpoints (eps[0] vs
eps[1+]), so their op_queues are disjoint and no synchronization is
needed.

With FI_PROGRESS_MANUAL, we fall back to FI_THREAD_SAFE where
manual_progress() iterates all EPs from both threads, requiring full
mutex synchronization.

Signed-off-by: Xuan Jiang <xuanj@amazon.com>
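A minimal sketch of the conditional_mutex idea: lock only when the transport fell back to FI_PROGRESS_MANUAL / FI_THREAD_SAFE, and make the lock a no-op on the FI_PROGRESS_AUTO / FI_THREAD_COMPLETION path where each thread owns its endpoint. This is not the PR's exact class; the acquisition counter is instrumentation for the sketch:

```cpp
#include <cassert>
#include <mutex>

// Illustrative conditional mutex: a real lock under FI_THREAD_SAFE,
// a no-op under FI_THREAD_COMPLETION.
struct conditional_mutex {
    std::mutex mtx_;
    bool locking_enabled_;
    int acquisitions = 0;  // instrumentation for this sketch only

    explicit conditional_mutex(bool enabled) : locking_enabled_(enabled) {}
    void lock()   { if (locking_enabled_) { mtx_.lock(); acquisitions++; } }
    void unlock() { if (locking_enabled_) mtx_.unlock(); }
};
```

With locking disabled, lock()/unlock() pairs cost nothing and cannot deadlock, which is safe precisely because the host and proxy threads touch disjoint endpoints and op_queues on that path.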
Cap the three queue-draining loops in the proxy progress path
(CQ poll, AMO ack processing, AMO processing) to a maximum number
of items per iteration, preventing any single queue from starving
the others.

The limit is controlled by the NVSHMEM_LIBFABRIC_PROXY_REQUEST_BATCH_MAX
environment variable (default: 32).

Signed-off-by: Anshuman Goswami <anshumgo@amazon.com>
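The capped drain described above can be sketched as: each pass processes at most `batch_max` items (default 32, per NVSHMEM_LIBFABRIC_PROXY_REQUEST_BATCH_MAX), so one busy queue cannot starve the other two loops in the same progress call. Names are illustrative:

```cpp
#include <cassert>
#include <deque>

// Illustrative capped drain loop: returns how many items were processed,
// never more than batch_max per call.
static int drain_capped(std::deque<int>& q, int batch_max) {
    int processed = 0;
    while (!q.empty() && processed < batch_max) {
        q.pop_front();          // "process" one item
        processed++;
    }
    return processed;
}
```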
…Q-based tracking

Remove fi_cntr usage from the libfabric transport and track operation
completions entirely through completion queue (CQ) entries. The endpoint
now maintains a completed_ops counter that is incremented in the CQ
completion handler, and quiet polls the CQ instead of reading or waiting
on a counter.

This eliminates the overhead of counter reads in the progress path.

The error handling path is also moved from a counter-error pre-check
into the existing CQ error path (fi_cq_readerr), consolidating error
reporting.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Move AMO ack processing out of the batched op_queue drain
(nvshmemt_libfabric_gdr_process_amos_ack) and into the CQ completion
handler directly. When an ACK recv completion arrives, the ack is
processed and the recv buffer is re-posted immediately, eliminating the
intermediate queue and the separate per-progress-call drain loop.

This removes a level of indirection and the associated queue
put/get overhead from the AMO ack path, and removes the
nvshmemt_libfabric_gdr_process_amos_ack call from the main progress
function.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
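The inline handling described above can be sketched as: when the CQ handler sees an AMO ACK recv completion, it processes the ack and re-posts the recv buffer on the spot, instead of queueing the ack for a later per-progress-call drain pass. The completion encoding and names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative completion kinds; the real transport distinguishes
// completions via its own CQ entry context.
enum cq_kind { CQ_SEND_DONE, CQ_AMO_ACK };

struct sim_transport {
    uint64_t acks_processed = 0;
    uint64_t recvs_posted = 0;

    // CQ completion handler with inline ack processing.
    void handle_completion(cq_kind kind) {
        if (kind == CQ_AMO_ACK) {
            acks_processed++;   // was: enqueue, then drain in progress()
            recvs_posted++;     // re-post the recv buffer immediately
        }
    }
};
```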