
[release/2.11] Skip linalg.eig tests when MAGMA is not available #3072

Closed

ethanwee1 wants to merge 11 commits into ROCm:release/2.11 from ethanwee1:rocm-skip-eig-no-magma-2.11

Conversation

@ethanwee1

@ethanwee1 ethanwee1 commented Mar 16, 2026

Skip test_linalg_eig_stride_consistency_cuda and test_torch_return_types_returns_cuda, which incorrectly run when MAGMA is not available.

jithunnair-amd and others added 9 commits February 26, 2026 22:34
…for py3.9;

upgrade tensorboard compatible with numpy 2

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
This PR fixes the unit test,

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

The error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which leaves
torch.cuda.max_memory_reserved() at an inflated value when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward. To avoid this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.
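The failure mode can be illustrated with a toy stand-in for the caching allocator's peak counter. The `Allocator` class below is hypothetical, not PyTorch API; only the `empty_cache`/`reset_peak_memory_stats` names mirror the real `torch.cuda` calls:

```python
# Toy model of why a stale peak-memory statistic from one test can poison
# the next one. `Allocator` is a hypothetical stand-in, not PyTorch API.

class Allocator:
    def __init__(self):
        self.reserved = 0       # bytes currently reserved
        self.peak_reserved = 0  # high-water mark, like max_memory_reserved()

    def alloc(self, nbytes):
        self.reserved += nbytes
        self.peak_reserved = max(self.peak_reserved, self.reserved)

    def free(self, nbytes):
        self.reserved -= nbytes

    def empty_cache(self):
        self.reserved = 0

    def reset_peak_memory_stats(self):
        self.peak_reserved = self.reserved


alloc = Allocator()

# A test with a huge numel reserves (then frees) a large block ...
alloc.alloc(8 * 2**30)
alloc.free(8 * 2**30)

# ... but the peak statistic survives into the next test:
stale_peak = alloc.peak_reserved

# The fix: clean up the state between tests.
alloc.empty_cache()
alloc.reset_peak_memory_stats()
clean_peak = alloc.peak_reserved
```

A later test that budgets memory as `fraction * total - peak_reserved` can go negative off the stale peak, which is consistent with the negative-dimension error in the traceback above.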

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…d_memory_with_allocator (pytorch#2811)

Use try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.

Fixes ROCm/TheRock#2118.
…ersistent reduction and no_x_dim removal (pytorch#2454)

Cherry-pick of ROCm#2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit eb47158)

[release/2.9][ROCm][inductor] Add ROCm specific persistent reduction config. (pytorch#2861)

In support of
[SWDEV-566103](https://ontrack-internal.amd.com/browse/SWDEV-566103)

[release/2.10] Fix Inductor Triton Heuristics (pytorch#2931)

The ROCm release/2.10 branch was created by applying 15 commits to
upstream release/2.10 branch.
(See
pytorch/pytorch@release/2.10...ROCm:pytorch:release/2.10)

This PR fixes the issue with the missing disable_pointwise_autotuning
function.

There are three commits in this PR:

First commit is a revert:
1c96f23 - Autotuning support for
persistent reduction

since it is already available in upstream release/2.10 and is not
needed. (It reintroduced disable_pointwise_autotuning function.)

The second commit (b9facd0) is needed
for provenance, so I can apply the third commit:
e5eee74 - Heuristics improvements for
reduction kernels

which was reverted at the last minute before the release/2.10 cutoff and then
re-landed shortly after the cutoff date, but with a minor change.

---------

Co-authored-by: Pandya, Vivek Vasudevbhai <vpandya@qti.qualcomm.com>
[AUTOGENERATED] release/2.11_IFU_20260224
…CL race condition (pytorch#3054)

Cherry-pick of ROCm#3043 

Co-authored-by: tom.jen <tomjen12@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…orch#3057)

Removing the need for fences in the normalization kernel by converting the
stores into atomics+return. This is crucial for performance on architectures
with split caches (e.g. MI300), where fences are inherently costly. This
change speeds up the `batch_norm_stats` function for tensors in
`channels_last` format.

### Performance result on MI300:
![batchnorm_latency_comparison](https://github.com/user-attachments/assets/dee39088-9f55-499a-a39b-b170805416bb)

**Particular Example:**
Before:
Avg time for shape (20, 896, 59, 91): **1102.39 us**

After:
Avg time for shape (20, 896, 59, 91): **122.94 us**

Reproducer:
```python
import torch

shapes = [(20, 896, 59, 91)]
eps = 1e-5

for shape in shapes:
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)
    # Warm-up iterations before timing
    for _ in range(20):
        _ = torch.batch_norm_stats(x, eps)
    torch.cuda.synchronize()

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.batch_norm_stats(x, eps)
    end_evt.record()
    torch.cuda.synchronize()
    # elapsed_time is in ms; convert per-iteration average to microseconds
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```

Related fix which is released:
pytorch#161180

Pull Request resolved: pytorch#175286
Approved by: https://github.com/amd-hhashemi,
https://github.com/jerrymannil, https://github.com/jeffdaily
torch.linalg.eig requires MAGMA on ROCm (hipsolver does not support eig).
Add skipCUDAIfNoMagma to test_linalg_eig_stride_consistency in
test_torchinductor.py and test_torch_return_types_returns in test_vmap.py
to match the skip pattern used by linalg eig tests in test_linalg.py.
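The decorator mechanism can be sketched with a simplified stand-in (the real skipCUDAIfNoMagma lives in PyTorch's device-type test framework; `skip_cuda_if_no_magma` and `HAS_MAGMA` below are illustrative only):

```python
import functools
import unittest

HAS_MAGMA = False  # stand-in for torch.cuda.has_magma on a no-MAGMA build

def skip_cuda_if_no_magma(fn):
    """Simplified sketch of skipCUDAIfNoMagma. Note that it consults
    self.device_type, which only exists on device-generic test classes."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if self.device_type == "cuda" and not HAS_MAGMA:
            raise unittest.SkipTest("MAGMA library not found")
        return fn(self, *args, **kwargs)
    return wrapper

class TestLinalgCUDA(unittest.TestCase):
    device_type = "cuda"  # normally provided by DeviceTypeTestBase

    @skip_cuda_if_no_magma
    def test_linalg_eig_stride_consistency(self):
        pass  # would exercise torch.linalg.eig here

result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestLinalgCUDA).run(result)
n_skipped, n_errors = len(result.skipped), len(result.errors)
```

On a class that provides `device_type`, the test is reported as skipped rather than failed when MAGMA is absent.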
@ethanwee1 ethanwee1 changed the title [ROCm] Skip linalg.eig tests when MAGMA is not available [relesae/2.11] Skip linalg.eig tests when MAGMA is not available Mar 16, 2026
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 16, 2026

Jenkins build for 173556af377911e6b652276b641ce6cd84936048 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@ethanwee1 ethanwee1 changed the title [relesae/2.11] Skip linalg.eig tests when MAGMA is not available [release/2.11] Skip linalg.eig tests when MAGMA is not available Mar 17, 2026
@ethanwee1 ethanwee1 marked this pull request as ready for review March 17, 2026 15:43
@jithunnair-amd jithunnair-amd requested a review from Copilot March 17, 2026 21:53
@jithunnair-amd
Collaborator

@ethanwee1 The test_linalg_eig_stride_consistency_cuda tests are failing in the CI with what looks to be a syntax error:
https://ml-ci-internal.amd.com/job/pytorch/job/pytorch-ci-pipeline/job/PR-3072/1/testReport/


Copilot AI left a comment


Pull request overview

This PR targets CUDA test stability on builds where MAGMA is not available by skipping two MAGMA-dependent test cases that currently run (and fail) despite missing MAGMA.

Changes:

  • Adds MAGMA-based CUDA skip coverage for an Inductor stride-consistency test that exercises torch.linalg.eig.
  • Adds MAGMA-based CUDA skip coverage for a functorch vmap return-type test that exercises torch.linalg.eig.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
test/inductor/test_torchinductor.py Adds @skipCUDAIfNoMagma to the Inductor linalg.eig stride-consistency test.
test/functorch/test_vmap.py Adds @skipCUDAIfNoMagma to the vmap return-types test that includes torch.linalg.eig.


return res

test(self, op, tuple(inputs), in_dims=tuple(in_dims))

Comment on lines 5054 to 5058

@skipCUDAIfNoMagma
def test_torch_return_types_returns(self, device):
t = torch.randn(3, 2, 2, device=device)
self.assertTrue(
Collaborator


@ethanwee1 Your thoughts on this comment?

Comment thread test/inductor/test_torchinductor.py Outdated
reference_in_float=False,
)

@skipCUDAIfNoMagma
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 17, 2026

Jenkins build for aa25ee586d8f0a83f22c770641ab7b6ed1b52bbe commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

skipCUDAIfNoMagma uses skipCUDAIf, which accesses self.device_type, but
GPUTests inherits from TestCase (not DeviceTypeTestBase) and doesn't
have device_type. Use unittest.skipIf, which works without device_type.
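The replacement can be sketched like this (`HAS_MAGMA` is a stand-in for `torch.cuda.has_magma`; `GPUTests` here is a plain unittest.TestCase with no `device_type` attribute, mirroring the class in test_torchinductor.py):

```python
import unittest

HAS_MAGMA = False  # stand-in for torch.cuda.has_magma

class GPUTests(unittest.TestCase):
    # Plain TestCase: no device_type attribute, so a decorator that reads
    # self.device_type would raise AttributeError here. unittest.skipIf
    # evaluates its condition at class-definition time instead.

    @unittest.skipIf(not HAS_MAGMA, "MAGMA not available")
    def test_linalg_eig_stride_consistency(self):
        self.fail("would need MAGMA to run")

result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(GPUTests).run(result)
n_skipped, n_failures = len(result.skipped), len(result.failures)
```

Because the condition is checked by unittest itself, the skip works on any TestCase subclass, not just device-generic ones.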
@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 17, 2026

Jenkins build for aa25ee586d8f0a83f22c770641ab7b6ed1b52bbe commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
