Skip to content

Backend: Add BackendWithFallback to retry IO#215

Open
riley-dixon wants to merge 9 commits intodevelopfrom
rildixon/io-retry-compat
Open

Backend: Add BackendWithFallback to retry IO#215
riley-dixon wants to merge 9 commits intodevelopfrom
rildixon/io-retry-compat

Conversation

@riley-dixon
Copy link
Collaborator

@riley-dixon riley-dixon commented Mar 11, 2026

Motivation

Modifies Backend & adds a new BackendWithFallback interface to re-attempt an IO request that fails with a different Backend. For example, if a Fastpath IO request fails, the Fallback path can automatically retry the IO.

Technical Details

  • Derived BackendWithFallback backends can optionally register another Backend as eligible to retry failed IO requests.
  • hipFile will resubmit the failed IO request if a fallback backend is registered and would otherwise accept the original IO request.
  • Backend has been reworked for how IO requests are processed. A new virtual function _io_impl() has been added which handles only the logic required to submit the IO request to the kernel/driver. io() now acts as the entry point and can handle coordinating other tasks such as updating stats or the auto fallback mechanism. (Part of the rationale for this change is so the auto fallback mechanism only retries the IO if the IO failed and not something else.)

Test Plan

Added unit tests and Fastpath integration level tests. System level tests will be added separately as it is a non-trivial addition.

Reviewer Note

When reviewing the first couple of commits, you may wish to be aware of certain changes made later on.

AIHIPFILE-100

@riley-dixon riley-dixon self-assigned this Mar 11, 2026
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch 4 times, most recently from 0234797 to 8b8ccfa Compare March 18, 2026 19:06
@riley-dixon riley-dixon changed the title Backend: Add RetryableBackend to retry IO Backend: Add BackendWithFallback to retry IO Mar 18, 2026
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from 8b8ccfa to 4025ee3 Compare March 18, 2026 19:29
@riley-dixon riley-dixon requested a review from Copilot March 18, 2026 19:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Backend IO pathway to centralize common logic (stats updates + fallback retry orchestration) and introduces BackendWithFallback so a backend (e.g., Fastpath) can automatically retry failed IO through an alternate backend (e.g., Fallback).

Changes:

  • Added BackendWithFallback and refactored Backend to route IO through a new _io_impl() hook, with io() acting as the common entry point.
  • Wired Fastpath to optionally register and use a fallback backend when _io_impl() throws and the fallback backend accepts the request.
  • Added/updated unit + integration tests and build plumbing for the new backend behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/amd_detail/backend.h Introduces _io_impl(), stats update hooks, and new BackendWithFallback interface.
src/amd_detail/backend.cpp Implements the shared Backend::io() flow and fallback retry logic for BackendWithFallback::io().
src/amd_detail/backend/fastpath.h / src/amd_detail/backend/fastpath.cpp Migrates Fastpath to BackendWithFallback, moves IO submission into _io_impl(), adds stats hook implementations.
src/amd_detail/backend/fallback.h / src/amd_detail/backend/fallback.cpp Adds stats hook implementations and _io_impl(); adjusts IO helper structure for chunking.
src/amd_detail/state.cpp Creates/registers a shared Fallback backend instance and registers it as Fastpath’s fallback when enabled.
src/amd_detail/CMakeLists.txt Adds new backend.cpp to the library build.
test/amd_detail/backend.cpp New unit tests for fallback retry behavior at the Backend layer.
test/amd_detail/fastpath.cpp Adds Fastpath integration tests covering retry + rejection behaviors via fallback.
test/amd_detail/mbackend.h Extends backend mocks to support the new hooks (_io_impl, stats updates) and adds a BackendWithFallback mock.
test/amd_detail/CMakeLists.txt Adds new backend test file to the test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +23 to +34
ssize_t nbytes = _io_impl(type, file, buffer, size, file_offset, buffer_offset);
switch (type) {
case (IoType::Read):
update_read_stats(nbytes);
break;
case (IoType::Write):
update_write_stats(nbytes);
break;
default:
break;
}
return nbytes;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add the guard to check for nbytes >= 0 , however I agree we may want to look at what we are returning. To my knowledge we only throw exceptions on errors at this moment in time.

@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from 4025ee3 to 756a18f Compare March 18, 2026 21:39
@riley-dixon
Copy link
Collaborator Author

@copilot

Copy link

Copilot AI commented Mar 18, 2026

@riley-dixon I've opened a new pull request, #227, to work on those changes. Once the pull request is ready, I'll request review from you.

@riley-dixon riley-dixon requested a review from Copilot March 18, 2026 21:43
One challenge here was ensuring that the right stats counter gets
incremented when an IO is retried, or trying to prevent an exception
being raised by the stats module from causing the IO to be retried.

Parts of RetryableBackend may want to be brought up to Backend,
like the update_stats functions.
Adds tests for the behaviour of the retry mechanism, as well as
any default behaviour from the RetryableBackend.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an automatic IO retry mechanism by adding a BackendWithFallback interface and refactoring backend IO submission so failures from one backend (e.g., Fastpath) can be retried via a registered fallback backend (e.g., Fallback).

Changes:

  • Refactor Backend IO flow by introducing _io_impl() and centralizing stats updates in Backend::io().
  • Add BackendWithFallback and wire Fastpath to optionally retry failed IO using a registered fallback backend.
  • Add unit and integration tests covering fallback retry behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/amd_detail/backend.h Defines new Backend::io() entry point, _io_impl(), stats hooks, and BackendWithFallback.
src/amd_detail/backend.cpp Implements shared IO/stats behavior and fallback retry logic.
src/amd_detail/backend/fastpath.h Converts Fastpath to BackendWithFallback and moves IO into _io_impl().
src/amd_detail/backend/fastpath.cpp Implements _io_impl() plus Fastpath stats update hooks.
src/amd_detail/backend/fallback.h Adds stats hooks and _io_impl() plumbing for the fallback backend.
src/amd_detail/backend/fallback.cpp Routes IO through _io_impl() and ensures stats update behavior remains correct.
src/amd_detail/state.cpp Creates a shared fallback backend instance and registers it with Fastpath.
src/amd_detail/CMakeLists.txt Adds new production source backend.cpp to the build.
test/amd_detail/mbackend.h Updates mocks for new backend interface and adds MBackendWithFallback.
test/amd_detail/backend.cpp Adds unit tests validating fallback eligibility and retry behavior.
test/amd_detail/fastpath.cpp Adds integration-level tests for Fastpath retrying via Fallback.
test/amd_detail/CMakeLists.txt Adds new test source backend.cpp to the test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

This change passes in the original parameters of the IO request
for `is_retryable()` to process. By default, we now check that the
fallback IO engine will accept the IO request prior to submitting
it. This is useful in cases where the Fallback backend is available,
but for some reason the request is still invalid.

It also removes a hack that was used for testing this scenario
with "optionally" mocking `is_retryable()`.
Note:
retryable_io() -> _io_impl()
Formatter moved the order of some functions around.

Previously, a "RetryableBackend" would use io() to wrap the fallback
mechanism before calling retryable_io() to actually perform the
request.

Now, every Backend will use _io_impl() which will be responsible
for handling the IO request, leaving io() as the public front-end.
"Retryable" was considered to be a misnomer in this context. Instead of
retrying the IO with the same backend, we were resubmitting the IO to
a different Backend.

While "BackendWithFallback" is not exactly a great name, its something
we can change later.
This was originally written to BackendWithFallback. However,
this could be further generalized to all Backends. _io_impl()
is then only concerned with issuing the IO, letting the io()
method do any additional processing.
@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch 2 times, most recently from 517341b to 4f131e6 Compare March 18, 2026 22:21
@riley-dixon riley-dixon marked this pull request as ready for review March 18, 2026 22:21
@riley-dixon riley-dixon requested a review from zbyrne as a code owner March 18, 2026 22:21
Copilot AI review requested due to automatic review settings March 18, 2026 22:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an automatic IO retry mechanism via a new BackendWithFallback interface, refactoring backend IO submission so a failing backend (e.g., Fastpath) can transparently retry with a registered fallback backend (e.g., Fallback).

Changes:

  • Adds BackendWithFallback with built-in retry logic and refactors Backend::io() to centralize shared concerns (stats, retry coordination) while pushing submission into _io_impl().
  • Updates Fastpath and Fallback backends to the new _io_impl() + update_*_stats() model and wires Fastpath→Fallback registration in DriverState.
  • Adds unit/integration tests covering successful fallback retry and exception propagation when fallback is ineligible.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/amd_detail/backend.h Refactors Backend API and introduces BackendWithFallback interface.
src/amd_detail/backend.cpp Implements new shared IO entry points and fallback retry logic.
src/amd_detail/backend/fastpath.h Converts Fastpath to BackendWithFallback and moves IO to _io_impl().
src/amd_detail/backend/fastpath.cpp Implements _io_impl() and stat update hooks for Fastpath.
src/amd_detail/backend/fallback.h Adds stat update hooks and _io_impl() plumbing for Fallback.
src/amd_detail/backend/fallback.cpp Routes IO through _io_impl() and updates stats on success.
src/amd_detail/state.cpp Registers a fallback backend and attaches it to Fastpath when enabled.
src/amd_detail/CMakeLists.txt Adds new backend.cpp compilation unit to the library build.
test/amd_detail/mbackend.h Extends mocks to support _io_impl() and stats update hooks; adds MBackendWithFallback.
test/amd_detail/backend.cpp Adds unit tests for fallback eligibility and retry behavior.
test/amd_detail/fastpath.cpp Adds integration-level tests validating Fastpath retry via fallback.
test/amd_detail/CMakeLists.txt Adds the new backend unit test file to the internal test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

throw;
}
return nbytes;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar in nature to one of the previous CoPilot suggestions.

@riley-dixon riley-dixon force-pushed the rildixon/io-retry-compat branch from 4f131e6 to 34d8412 Compare March 18, 2026 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants