transport/libfabric: Optimize progress path by removing counters and inline AMO ack processing (#71)
Open
amitrad-aws wants to merge 11 commits into NVIDIA:devel from
Conversation
CUDA 11.3 introduced the cuFlushGPUDirectRDMAWrites API, which takes the place of the host transport enforce_cst API. NVSHMEM no longer supports CUDA 11, so these legacy APIs can be removed. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
The previous is_proxy variable equals qp_index. Change the name everywhere for consistency. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Attempt to request FI_PROGRESS_AUTO to see if the libfabric provider supports it; if it doesn't, fall back to FI_PROGRESS_MANUAL. FI_PROGRESS_AUTO means that we do not need to call into the progress engine for submitted operations to complete. This means we can remove the host endpoint from the progress call, and we only need to progress the host endpoint when the user calls nvshmem_quiet() from the host. This allows us to set the threading model to FI_THREAD_COMPLETION, because the host only progresses the host EP and the proxy only progresses the proxy EP, which complies with FI_THREAD_COMPLETION. An edge case exists where the user calls nvshmem_quiet() on the host QP_IDX from a GPU kernel, but this is illegal because the user shouldn't be calling QP APIs on QPs not provided to them via the QP creation APIs. This patch should offer a performance improvement because it reduces the number of EPs progressed in the critical path, and it allows the libfabric provider to reduce locking under the FI_THREAD_COMPLETION threading model. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
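The selection logic described above can be sketched as follows. This is a minimal, self-contained model, not the transport code: the enum values mirror the libfabric names, and `select_models()` plus its boolean argument (standing in for whether a real `fi_getinfo()` probe with FI_PROGRESS_AUTO in the hints succeeds) are illustrative assumptions.

```c
#include <stdbool.h>

/* Enum values mirror the libfabric FI_PROGRESS_* / FI_THREAD_* names. */
typedef enum { PROGRESS_AUTO, PROGRESS_MANUAL } progress_model_t;
typedef enum { THREAD_COMPLETION, THREAD_SAFE } threading_model_t;

typedef struct {
    progress_model_t progress;
    threading_model_t threading;
} transport_config_t;

/* provider_supports_auto models whether the fi_getinfo() probe requesting
 * FI_PROGRESS_AUTO succeeded. */
transport_config_t select_models(bool provider_supports_auto) {
    transport_config_t cfg;
    if (provider_supports_auto) {
        /* Provider progresses submitted ops itself: the host EP is dropped
         * from the proxy progress loop, each thread touches only its own
         * EP, which is exactly what FI_THREAD_COMPLETION requires. */
        cfg.progress = PROGRESS_AUTO;
        cfg.threading = THREAD_COMPLETION;
    } else {
        /* Manual progress: both threads may progress all EPs, so the
         * stricter FI_THREAD_SAFE model is kept. */
        cfg.progress = PROGRESS_MANUAL;
        cfg.threading = THREAD_SAFE;
    }
    return cfg;
}
```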
This change implements multi-rail support for the libfabric host proxy transport. The transport changes from having 1 domain with 2 EPs to having 1 host domain on NIC 1 and one proxy domain per NIC. Splitting the host EP and proxy EP into separate domains was done for simplicity of the code. Every domain resource (including the AV) was bound on a 1-1 basis per EP, so this change should be a functional no-op. In the future, when the QP API is implemented on the libfabric host proxy transport, N EPs per domain can easily be built on top of this. This code uses a round-robin load balancer to assign messages to NICs. One NIC is used for an entire operation call into the libfabric transport (including put-signal), but not for messages that are segmented due to size or MR boundaries. The number of NICs (domains) per PE is limited by the size of the struct nvshmemt_libfabric_mem_handle_t. A new env variable NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE controls the max number of NICs per PE. Thank you Justin for contributing an initial implementation of multi-rail which I built on top of. Co-authored-by: Justin Chui <justchiu@amazon.com> Signed-off-by: Seth Zegelstein <szegel@amazon.com>
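A minimal sketch of the round-robin NIC load balancer described above: one NIC is chosen at the start of each transport-level operation and reused for every message belonging to that operation (e.g. a put-signal pair). The struct and function names here are illustrative, not the transport's actual identifiers.

```c
/* Round-robin NIC selection, one pick per transport-level operation. */
typedef struct {
    int num_nics;   /* bounded by NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE */
    int next_nic;   /* round-robin cursor */
} nic_balancer_t;

/* Called once at the start of each operation call into the transport;
 * all messages of that operation then use the returned NIC (except
 * segments split by size or MR boundaries, which pick again). */
int nic_balancer_pick(nic_balancer_t *b) {
    int nic = b->next_nic;
    b->next_nic = (b->next_nic + 1) % b->num_nics;
    return nic;
}
```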
The current topology code has the problem that NIC-GPU load balancing is applied only to PEs. If an application initializes NVSHMEM such that there is only one PE per node per rank, load balancing will not be applied between GPUs. PEs/GPUs from separate ranks will then share NICs, leading to contention and reduced performance. This patch fixes the issue by iterating over all GPUs in the system, listed in `/sys/bus/pci/drivers/nvidia`, and load balancing over all of them. Signed-off-by: Eric Raut <eraut@amazon.com>
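To illustrate the effect of the fix, here is a toy mapping that spreads NICs over every GPU found in the system rather than only over local PEs. The simple block mapping below is an assumption about the balancing policy for illustration only; the actual topology code discovers GPUs from `/sys/bus/pci/drivers/nvidia` and may balance differently.

```c
/* Hypothetical block mapping of GPUs to NICs across the whole system.
 * With 8 GPUs and 4 NICs: GPUs 0-1 -> NIC 0, GPUs 2-3 -> NIC 1, etc.
 * Because every GPU participates, two ranks that each own one GPU no
 * longer collapse onto the same NIC. */
int nic_for_gpu(int gpu_index, int num_gpus, int num_nics) {
    return gpu_index * num_nics / num_gpus;
}
```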
Cache counter value to avoid potentially expensive counter read operation if no new operations have been submitted since the last call to fence/quiet. Signed-off-by: Eric Raut <eraut@amazon.com>
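The cached-counter optimization above can be sketched as a simple submitted-vs-quiesced comparison: fence/quiet skip the (potentially expensive) counter read when nothing has been submitted since the last call. The field and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-endpoint bookkeeping for the cached-counter check. */
typedef struct {
    uint64_t submitted_ops;      /* incremented on every submission */
    uint64_t last_quiesced_ops;  /* snapshot taken at the last fence/quiet */
} ep_state_t;

/* Returns true only when the counter must actually be read, i.e. when
 * new operations were submitted since the previous fence/quiet. */
bool quiet_needs_counter_read(ep_state_t *ep) {
    if (ep->submitted_ops == ep->last_quiesced_ops)
        return false;                      /* nothing new: skip the read */
    ep->last_quiesced_ops = ep->submitted_ops;
    return true;
}
```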
All signals initiated by the same peer PE will be applied in-order based on sequence number. This change also orders signals with respect to puts, so that signal delivery guarantees previous puts have arrived. For this, puts are sent with a sequence number that orders them with the signals. A periodic ack is sent for puts to prevent sequence overflow errors. Ordering is applied across all rails, per peer PE. Signed-off-by: Eric Raut <eraut@amazon.com>
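A sketch of the per-peer sequencing described above: puts and signals share one sequence space per peer PE, and an ack is periodically requested so the unacked window stays bounded and the sequence number cannot overflow at the receiver. The ack interval, struct, and function names are illustrative assumptions, not the transport's values.

```c
#include <stdint.h>
#include <stdbool.h>

#define ACK_INTERVAL 1024  /* request an ack every this many ops (assumed) */

/* Per-peer-PE ordering state, shared across all rails for that peer. */
typedef struct {
    uint32_t next_seq;      /* next sequence number for a put or signal */
    uint32_t unacked_ops;   /* ops sent since the last ack was requested */
} peer_seq_state_t;

/* Stamps an outgoing put/signal with its sequence number; returns true
 * when this op should also request an ack from the peer, preventing the
 * unacked window from growing until the sequence wraps. */
bool stamp_and_maybe_request_ack(peer_seq_state_t *p, uint32_t *seq_out) {
    *seq_out = p->next_seq++;
    if (++p->unacked_ops >= ACK_INTERVAL) {
        p->unacked_ops = 0;
        return true;
    }
    return false;
}
```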
…odel
Introduce conditional_mutex that skips locking when FI_THREAD_COMPLETION is active (the FI_PROGRESS_AUTO path). With auto progress, the provider guarantees thread-safe completion processing per endpoint, and the host thread and proxy thread operate on separate endpoints (eps[0] vs eps[1+]), so their op_queues are disjoint and no synchronization is needed. With FI_PROGRESS_MANUAL, we fall back to FI_THREAD_SAFE, where manual_progress() iterates all EPs from both threads, requiring full mutex synchronization. Signed-off-by: Xuan Jiang <xuanj@amazon.com>
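The conditional_mutex idea can be sketched as a mutex whose lock/unlock become no-ops when FI_THREAD_COMPLETION is in effect, since host and proxy threads then operate on disjoint endpoints. This is a minimal model with illustrative names, assuming the enable flag is fixed at init time.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mtx;
    bool enabled;  /* false when FI_THREAD_COMPLETION is in effect */
} conditional_mutex_t;

/* need_locking is true on the FI_PROGRESS_MANUAL / FI_THREAD_SAFE path,
 * where both threads iterate all EPs and must synchronize. */
void cond_mutex_init(conditional_mutex_t *m, bool need_locking) {
    pthread_mutex_init(&m->mtx, NULL);
    m->enabled = need_locking;
}

void cond_mutex_lock(conditional_mutex_t *m) {
    if (m->enabled) pthread_mutex_lock(&m->mtx);
}

void cond_mutex_unlock(conditional_mutex_t *m) {
    if (m->enabled) pthread_mutex_unlock(&m->mtx);
}
```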
Cap the three queue-draining loops in the proxy progress path (CQ poll, AMO ack processing, AMO processing) to a maximum number of items per iteration, preventing any single queue from starving the others. The limit is controlled by the NVSHMEM_LIBFABRIC_PROXY_REQUEST_BATCH_MAX environment variable (default: 32). Signed-off-by: Anshuman Goswami <anshumgo@amazon.com>
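The capped drain can be sketched as a loop that takes at most `batch_max` items from a queue per progress iteration, so one busy queue cannot starve the other two. `queue_pop()` is an illustrative stand-in for the real op_queue accessors.

```c
/* Toy queue: depth counts pending items. */
typedef struct { int depth; } queue_t;

static int queue_pop(queue_t *q) {   /* returns 1 if an item was taken */
    if (q->depth == 0) return 0;
    q->depth--;
    return 1;
}

/* Drains up to batch_max items (the NVSHMEM_LIBFABRIC_PROXY_REQUEST_BATCH_MAX
 * cap, default 32); returns how many were processed this iteration. */
int drain_capped(queue_t *q, int batch_max) {
    int n = 0;
    while (n < batch_max && queue_pop(q))
        n++;   /* process the item here */
    return n;
}
```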
…Q-based tracking
Remove fi_cntr usage from the libfabric transport and track operation completions entirely through completion queue (CQ) entries. The endpoint now maintains a completed_ops counter that is incremented in the CQ completion handler, and quiet polls the CQ instead of reading or waiting on a counter. This eliminates the overhead of counter reads in the progress path. The error handling path is also moved from a counter-error pre-check into the existing CQ error path (fi_cq_readerr), consolidating error reporting. Signed-off-by: Amit Radzi <amitrad@amazon.com>
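CQ-based completion tracking can be sketched as below: the CQ handler bumps completed_ops, and quiet polls the CQ until completed_ops catches up with submitted_ops, with no fi_cntr read. `poll_cq()` is an illustrative stand-in for `fi_cq_read()` that simply retires one pending operation per call; the field names are assumptions.

```c
#include <stdint.h>

typedef struct {
    uint64_t submitted_ops;  /* incremented when an op is posted */
    uint64_t completed_ops;  /* incremented in the CQ completion handler */
} ep_t;

static void on_cq_completion(ep_t *ep) { ep->completed_ops++; }

/* Model of fi_cq_read(): retire one CQ entry per call if any remain. */
static void poll_cq(ep_t *ep) {
    if (ep->completed_ops < ep->submitted_ops)
        on_cq_completion(ep);
}

/* quiet: poll the CQ until every submitted operation has completed,
 * instead of reading or waiting on an fi_cntr. */
void quiet(ep_t *ep) {
    while (ep->completed_ops < ep->submitted_ops)
        poll_cq(ep);
}
```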
Move AMO ack processing out of the batched op_queue drain (nvshmemt_libfabric_gdr_process_amos_ack) and into the CQ completion handler directly. When an ACK recv completion arrives, the ack is processed and the recv buffer is re-posted immediately, eliminating the intermediate queue and the separate per-progress-call drain loop. This removes a level of indirection and the associated queue put/get overhead from the AMO ack path, and removes the nvshmemt_libfabric_gdr_process_amos_ack call from the main progress function. Signed-off-by: Amit Radzi <amitrad@amazon.com>
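The inline AMO ack path can be sketched as a CQ completion handler that, on an ACK recv completion, processes the ack and re-posts the recv buffer immediately rather than queueing the ack for a later drain pass. The completion kinds, stats struct, and the process/repost helpers are illustrative stand-ins.

```c
typedef enum { COMP_SEND, COMP_AMO_ACK_RECV } comp_kind_t;

typedef struct {
    int acks_processed;
    int recvs_posted;
    int sends_completed;
} handler_stats_t;

static void process_amo_ack(handler_stats_t *s)    { s->acks_processed++; }
static void repost_recv_buffer(handler_stats_t *s) { s->recvs_posted++; }

/* CQ completion handler: no intermediate ack queue, no separate
 * per-progress-call drain loop. */
void on_completion(handler_stats_t *s, comp_kind_t kind) {
    if (kind == COMP_AMO_ACK_RECV) {
        process_amo_ack(s);      /* ack handled inline... */
        repost_recv_buffer(s);   /* ...and the buffer goes straight back */
    } else {
        s->sends_completed++;
    }
}
```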
This PR optimizes the libfabric transport progress and quiet paths with two changes:
Replace counter-based completion tracking with CQ-based tracking
Remove fi_cntr usage and track operation completions entirely through CQ
entries. The endpoint now maintains a completed_ops counter incremented in
the CQ completion handler, and quiet polls the CQ instead of reading or
waiting on a counter. This also unifies the staged-atomics and
non-staged-atomics quiet paths into a single CQ-driven polling loop, and
consolidates error reporting into the CQ error path.
Process AMO acks inline in CQ completion handler
Move AMO ack processing out of the batched op_queue drain and into the CQ
completion handler directly. When an ACK recv completion arrives, the ack
is processed and the recv buffer is re-posted immediately, eliminating the
intermediate queue and the separate per-progress-call drain loop.
Together these changes reduce overhead in the progress hot path by removing
counter reads and an extra queue drain pass per progress call.