From 897d03d1f23139d30373dd9ed9ff77f2267051ac Mon Sep 17 00:00:00 2001
From: Spencer Bryngelson <sbryngelson@gmail.com>
Date: Mon, 23 Feb 2026 20:47:28 -0500
Subject: [PATCH] Fix GPU example, compiler matrix, and AMD flang consistency

- Replace misleading GPU_PARALLEL+GPU_LOOP example with real
  GPU_PARALLEL_LOOP pattern (750+ uses in codebase); add warning
  that GPU_LOOP emits empty directives on Cray/AMD compilers
- Mark Intel ifx and GNU gfortran as Experimental for --gpu mp
  (CMake has code paths, but they are not CI-tested or fully supported)
- Clarify AMD flang as additionally supported but not CI-gated,
  consistently across CLAUDE.md, common-pitfalls.md, gpu-and-mpi.md
- Clarify GPU_PARALLEL is for scalar reductions, not spatial loops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .claude/rules/common-pitfalls.md |  4 ++--
 .claude/rules/gpu-and-mpi.md     | 41 ++++++++++++++++++++------------
 CLAUDE.md                        |  5 ++--
 3 files changed, 31 insertions(+), 19 deletions(-)

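Reviewer note (not part of the commit): gpu-and-mpi.md states that the precheck
source lint fails on raw `!$acc` / `!$omp` directives. The sketch below shows
roughly what such a check could look like. It is not MFC's actual precheck: the
function name `find_raw_directives`, the regex, and the sample snippet are all
assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not MFC's real lint rule): a raw
# "!$acc" or "!$omp" sentinel at the start of a line, with optional leading
# whitespace, matched case-insensitively.
RAW_DIRECTIVE = re.compile(r"^\s*!\$(acc|omp)\b", re.IGNORECASE)


def find_raw_directives(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each raw directive found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if RAW_DIRECTIVE.match(line):
            hits.append((lineno, line.strip()))
    return hits


# Illustrative Fypp/Fortran snippet: the macro lines pass, the raw
# directive on line 3 is flagged.
sample = """\
$:GPU_PARALLEL_LOOP(private='[j,k,l]', collapse=3)
do l = 0, p
    !$acc loop seq
    do k = 0, n
    end do
end do
$:END_GPU_PARALLEL_LOOP()
"""

for lineno, text in find_raw_directives(sample):
    print(f"line {lineno}: raw directive: {text}")
```

A check of this shape only looks at line prefixes, so it would not notice a
`GPU_LOOP` nested inside `GPU_PARALLEL`; that pattern is covered by the WARNING
added in this patch rather than by the lint.
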
diff --git a/.claude/rules/common-pitfalls.md b/.claude/rules/common-pitfalls.md
index 1e5f58aa6b..9861f24fbe 100644
--- a/.claude/rules/common-pitfalls.md
+++ b/.claude/rules/common-pitfalls.md
@@ -36,10 +36,10 @@
 - Boundary condition symmetry requirements must be maintained
 
 ## Compiler-Specific Issues
-- Code must compile on gfortran, nvfortran, Cray ftn, and Intel ifx
+- CI-gated compilers (must always pass): gfortran, nvfortran, Cray ftn, and Intel ifx
+- AMD flang is additionally supported for `--gpu mp` builds but not in the CI matrix
 - Each compiler has different strictness levels and warning behavior
 - Fypp macros must expand correctly for both GPU and CPU builds
-- GPU builds only work with nvfortran, Cray ftn, and AMD flang
 
 ## Test System
 - Tests are generated **programmatically** in `toolchain/mfc/test/cases.py`, not standalone files
diff --git a/.claude/rules/gpu-and-mpi.md b/.claude/rules/gpu-and-mpi.md
index ad84067b67..47aa93d0e9 100644
--- a/.claude/rules/gpu-and-mpi.md
+++ b/.claude/rules/gpu-and-mpi.md
@@ -38,20 +38,27 @@ Inline macros (use `$:` prefix):
 - `$:GPU_WAIT()` — Synchronization barrier.
 
 Block macros (use `#:call`/`#:endcall`):
-- `GPU_PARALLEL(...)` — GPU parallel region wrapping a code block.
+- `GPU_PARALLEL(...)` — GPU parallel region (used for scalar reductions like `maxval`/`minval`).
 - `GPU_DATA(copy=..., create=..., ...)` — Scoped data region.
 - `GPU_HOST_DATA(use_device_addr=[...])` — Host code with device pointers.
 
-Block macro usage:
+Typical GPU loop pattern (used 750+ times in the codebase):
 ```
-#:call GPU_PARALLEL(copyin='[var1]', copyout='[var2]')
-  $:GPU_LOOP(collapse=N)
-  do k = 0, n; do j = 0, m
-    ! loop body
-  end do; end do
-#:endcall GPU_PARALLEL
+$:GPU_PARALLEL_LOOP(private='[i,j,k,l]', collapse=3)
+do l = idwbuff(3)%beg, idwbuff(3)%end
+    do k = idwbuff(2)%beg, idwbuff(2)%end
+        do j = idwbuff(1)%beg, idwbuff(1)%end
+            ! loop body
+        end do
+    end do
+end do
+$:END_GPU_PARALLEL_LOOP()
 ```
 
+WARNING: Do NOT use `GPU_PARALLEL` wrapping `GPU_LOOP` for spatial loops. `GPU_LOOP`
+emits empty directives on Cray and AMD compilers, causing silent serial execution.
+Use `GPU_PARALLEL_LOOP` / `END_GPU_PARALLEL_LOOP` for all parallel spatial loops.
+
 NEVER write raw `!$acc` or `!$omp` directives. Always use `GPU_*` Fypp macros.
 The precheck source lint will catch raw directives and fail.
 
@@ -67,13 +74,17 @@ The precheck source lint will catch raw directives and fail.
 - These compile only for Cray (`_CRAYFTN`); other compilers skip them
 
 ### Compiler-Backend Matrix
-| Compiler        | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP) | CPU-only |
-|-----------------|----------------------|---------------------|----------|
-| GNU gfortran    | No                   | No                  | Yes      |
-| NVIDIA nvfortran| Yes (primary)        | Yes                 | Yes      |
-| Cray ftn (CCE)  | Yes                  | Yes (primary)       | Yes      |
-| Intel ifx       | No                   | No                  | Yes      |
-| AMD flang       | No                   | Yes                 | Yes      |
+
+CI-gated compilers (must always pass): gfortran, nvfortran, Cray ftn, Intel ifx.
+AMD flang is additionally supported for `--gpu mp` builds but not in the CI matrix.
+
+| Compiler        | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP)    | CPU-only |
+|-----------------|----------------------|------------------------|----------|
+| GNU gfortran    | No                   | Experimental (AMD GCN) | Yes      |
+| NVIDIA nvfortran| Yes (primary)        | Yes                    | Yes      |
+| Cray ftn (CCE)  | Yes                  | Yes (primary)          | Yes      |
+| Intel ifx       | No                   | Experimental (SPIR64)  | Yes      |
+| AMD flang       | No                   | Yes                    | Yes      |
 
 ## Preprocessor Defines (`#ifdef` / `#ifndef`)
 
diff --git a/CLAUDE.md b/CLAUDE.md
index 2b6a4a6d2d..38918a2091 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -3,7 +3,8 @@
 MFC is an exascale multi-physics CFD solver written in modern Fortran 2008+ with Fypp
 preprocessing. It has three executables (pre_process, simulation, post_process), a Python
 toolchain for building/running/testing, and supports GPU acceleration via OpenACC and
-OpenMP target offload. It must compile with gfortran, nvfortran, Cray ftn, and Intel ifx.
+OpenMP target offload. It must compile with gfortran, nvfortran, Cray ftn, and Intel ifx (CI-gated).
+AMD flang is additionally supported for OpenMP target offload GPU builds.
 
 ## Commands
 
@@ -167,4 +168,4 @@ When reviewing PRs, prioritize in this order:
 4. MPI correctness (halo exchange, buffer sizing, GPU_UPDATE calls)
 5. GPU code (GPU_* Fypp macros only, no raw pragmas)
 6. Physics consistency (pressure formula matches model_eqns)
-7. Compiler portability (all four compilers)
+7. Compiler portability (4 CI-gated compilers + AMD flang for GPU)