
[release/2.11] Skip linalg.eig tests when MAGMA is not available #3072

Closed

ethanwee1 wants to merge 11 commits into ROCm:release/2.11 from ethanwee1:rocm-skip-eig-no-magma-2.11

Conversation

@ethanwee1

@ethanwee1 ethanwee1 commented Mar 16, 2026

Skip test_linalg_eig_stride_consistency_cuda and test_torch_return_types_returns_cuda, which incorrectly run when MAGMA is not available.

jithunnair-amd and others added 9 commits February 26, 2026 22:34
…for py3.9;

upgrade tensorboard compatible with numpy 2

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
This PR fixes the unit test,

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

The error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which leaves
torch.cuda.max_memory_reserved() at an inflated value when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward. To avoid this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.
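The failure mode can be illustrated with a toy stand-in for the caching allocator's peak counter. The `Allocator` class below is hypothetical, not PyTorch API; only the `empty_cache`/`reset_peak_memory_stats` names mirror the real `torch.cuda` calls:

```python
# Toy model of why a stale peak-memory statistic from one test can poison
# the next one. `Allocator` is a hypothetical stand-in, not PyTorch API.

class Allocator:
    def __init__(self):
        self.reserved = 0       # bytes currently reserved
        self.peak_reserved = 0  # high-water mark, like max_memory_reserved()

    def alloc(self, nbytes):
        self.reserved += nbytes
        self.peak_reserved = max(self.peak_reserved, self.reserved)

    def free(self, nbytes):
        self.reserved -= nbytes

    def empty_cache(self):
        self.reserved = 0

    def reset_peak_memory_stats(self):
        self.peak_reserved = self.reserved


alloc = Allocator()

# A test with a huge numel reserves (then frees) a large block ...
alloc.alloc(8 * 2**30)
alloc.free(8 * 2**30)

# ... but the peak statistic survives into the next test:
stale_peak = alloc.peak_reserved

# The fix: clean up the state between tests.
alloc.empty_cache()
alloc.reset_peak_memory_stats()
clean_peak = alloc.peak_reserved
```

A later test that budgets memory as `fraction * total - peak_reserved` can go negative off the stale peak, which is consistent with the negative-dimension error in the traceback above.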

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…d_memory_with_allocator (pytorch#2811)

Use try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.

Fixes ROCm/TheRock#2118.
…ersistent reduction and no_x_dim removal (pytorch#2454)

Cherry-pick of ROCm#2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit eb47158)

[release/2.9][ROCm][inductor] Add ROCm specific persistent reduction config. (pytorch#2861)

In support of
[SWDEV-566103](https://ontrack-internal.amd.com/browse/SWDEV-566103)

[release/2.10] Fix Inductor Triton Heuristics (pytorch#2931)

The ROCm release/2.10 branch was created by applying 15 commits to
upstream release/2.10 branch.
(See
pytorch/pytorch@release/2.10...ROCm:pytorch:release/2.10)

This PR fixes the issue with the missing disable_pointwise_autotuning
function.

There are three commits in this PR:

First commit is a revert:
1c96f23 - Autotuning support for
persistent reduction

since it is already available in upstream release/2.10 and is not
needed. (It reintroduced disable_pointwise_autotuning function.)

The second commit (b9facd0) is needed
for provenance, so I can apply the third commit:
e5eee74 - Heuristics improvements for
reduction kernels

which was reverted at the last minute before the release/2.10 cutoff and then
re-landed shortly after the cutoff date, but with a minor change.

---------

Co-authored-by: Pandya, Vivek Vasudevbhai <vpandya@qti.qualcomm.com>
[AUTOGENERATED] release/2.11_IFU_20260224
…CL race condition (pytorch#3054)

Cherry-pick of ROCm#3043 

Co-authored-by: tom.jen <tomjen12@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…orch#3057)

Removing the need for fences in the normalization kernel by converting the
stores into atomics+return. This is crucial for performance on architectures
with split caches (e.g. MI300), where fences are inherently costly. This
change speeds up the `batch_norm_stats` function for tensors in
`channels_last` format.

### Performance result on MI300:
![batchnorm_latency_comparison](https://github.com/user-attachments/assets/dee39088-9f55-499a-a39b-b170805416bb)

**Particular Example:**
Before:
Avg time for shape (20, 896, 59, 91): **1102.39 us**

After:
Avg time for shape (20, 896, 59, 91): **122.94 us**

Reproducer:
```python
import torch

shapes = [(20, 896, 59, 91)]
eps = 1e-5

for shape in shapes:
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)
    # Warm-up iterations before timing
    for _ in range(20):
        _ = torch.batch_norm_stats(x, eps)
    torch.cuda.synchronize()

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.batch_norm_stats(x, eps)
    end_evt.record()
    torch.cuda.synchronize()
    # elapsed_time is in ms; convert per-iteration average to microseconds
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```

Related fix which is released:
pytorch#161180

Pull Request resolved: pytorch#175286
Approved by: https://github.com/amd-hhashemi,
https://github.com/jerrymannil, https://github.com/jeffdaily
torch.linalg.eig requires MAGMA on ROCm (hipsolver does not support eig).
Add skipCUDAIfNoMagma to test_linalg_eig_stride_consistency in
test_torchinductor.py and test_torch_return_types_returns in test_vmap.py
to match the skip pattern used by linalg eig tests in test_linalg.py.
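The decorator mechanism can be sketched with a simplified stand-in (the real skipCUDAIfNoMagma lives in PyTorch's device-type test framework; `skip_cuda_if_no_magma` and `HAS_MAGMA` below are illustrative only):

```python
import functools
import unittest

HAS_MAGMA = False  # stand-in for torch.cuda.has_magma on a no-MAGMA build

def skip_cuda_if_no_magma(fn):
    """Simplified sketch of skipCUDAIfNoMagma. Note that it consults
    self.device_type, which only exists on device-generic test classes."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if self.device_type == "cuda" and not HAS_MAGMA:
            raise unittest.SkipTest("MAGMA library not found")
        return fn(self, *args, **kwargs)
    return wrapper

class TestLinalgCUDA(unittest.TestCase):
    device_type = "cuda"  # normally provided by DeviceTypeTestBase

    @skip_cuda_if_no_magma
    def test_linalg_eig_stride_consistency(self):
        pass  # would exercise torch.linalg.eig here

result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestLinalgCUDA).run(result)
n_skipped, n_errors = len(result.skipped), len(result.errors)
```

On a class that provides `device_type`, the test is reported as skipped rather than failed when MAGMA is absent.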
@ethanwee1 ethanwee1 changed the title [ROCm] Skip linalg.eig tests when MAGMA is not available [relesae/2.11] Skip linalg.eig tests when MAGMA is not available Mar 16, 2026
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 16, 2026

Jenkins build for 173556af377911e6b652276b641ce6cd84936048 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@ethanwee1 ethanwee1 changed the title [relesae/2.11] Skip linalg.eig tests when MAGMA is not available [release/2.11] Skip linalg.eig tests when MAGMA is not available Mar 17, 2026
@ethanwee1 ethanwee1 marked this pull request as ready for review March 17, 2026 15:43
@jithunnair-amd jithunnair-amd requested a review from Copilot March 17, 2026 21:53
@jithunnair-amd
Collaborator

@ethanwee1 The test_linalg_eig_stride_consistency_cuda tests are failing in the CI with what looks to be a syntax error:
https://ml-ci-internal.amd.com/job/pytorch/job/pytorch-ci-pipeline/job/PR-3072/1/testReport/


Copilot AI left a comment


Pull request overview

This PR targets CUDA test stability on builds where MAGMA is not available by skipping two MAGMA-dependent test cases that currently run (and fail) despite missing MAGMA.

Changes:

  • Adds MAGMA-based CUDA skip coverage for an Inductor stride-consistency test that exercises torch.linalg.eig.
  • Adds MAGMA-based CUDA skip coverage for a functorch vmap return-type test that exercises torch.linalg.eig.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
test/inductor/test_torchinductor.py Adds @skipCUDAIfNoMagma to the Inductor linalg.eig stride-consistency test.
test/functorch/test_vmap.py Adds @skipCUDAIfNoMagma to the vmap return-types test that includes torch.linalg.eig.


return res

test(self, op, tuple(inputs), in_dims=tuple(in_dims))

Comment on lines 5054 to 5058

@skipCUDAIfNoMagma
def test_torch_return_types_returns(self, device):
t = torch.randn(3, 2, 2, device=device)
self.assertTrue(
Collaborator


@ethanwee1 Your thoughts on this comment?

Comment thread test/inductor/test_torchinductor.py Outdated
reference_in_float=False,
)

@skipCUDAIfNoMagma
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 17, 2026

Jenkins build for aa25ee586d8f0a83f22c770641ab7b6ed1b52bbe commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

skipCUDAIfNoMagma uses skipCUDAIf, which accesses self.device_type, but
GPUTests inherits from TestCase (not DeviceTypeTestBase) and doesn't
have device_type. Use unittest.skipIf, which works without device_type.
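The replacement can be sketched like this (`HAS_MAGMA` is a stand-in for `torch.cuda.has_magma`; `GPUTests` here is a plain unittest.TestCase with no `device_type` attribute, mirroring the class in test_torchinductor.py):

```python
import unittest

HAS_MAGMA = False  # stand-in for torch.cuda.has_magma

class GPUTests(unittest.TestCase):
    # Plain TestCase: no device_type attribute, so a decorator that reads
    # self.device_type would raise AttributeError here. unittest.skipIf
    # evaluates its condition at class-definition time instead.

    @unittest.skipIf(not HAS_MAGMA, "MAGMA not available")
    def test_linalg_eig_stride_consistency(self):
        self.fail("would need MAGMA to run")

result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(GPUTests).run(result)
n_skipped, n_failures = len(result.skipped), len(result.failures)
```

Because the condition is checked by unittest itself, the skip works on any TestCase subclass, not just device-generic ones.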
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 17, 2026

Jenkins build for aa25ee586d8f0a83f22c770641ab7b6ed1b52bbe commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
