Add Gemma 4 FLOPs & fix sliding window flops computations#3592
Conversation
🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.
Force-pushed from e46459c to 038d9e9.
Force-pushed from 038d9e9 to ccbcf03.
This Pull Request introduces Gemma 4 FLOPs calculations and significantly improves the accuracy of existing FLOPs math, particularly for sliding window attention and mixed attention architectures. The addition of a comprehensive test suite covering multiple model families is a major highlight and ensures the reliability of these critical metrics.
🔍 General Feedback
- Great Test Coverage: The new `maxtext_utils_flops_test.py` is excellent. It uses a robust `6 * params * tokens` verification strategy that provides high confidence in the computed TFLOPs across various architectures.
- Improved Accuracy: The fixes for sliding window area and vision encoder scaling (backward pass) are well-timed and correct.
- Inconsistency in Shared KV Projections: There is a potential logic error in how `share_kv_projections` is applied to mixed attention models in the main caller. One unit test specifically assumes local layers do not share KV projections even when the flag is True, but the code currently applies it to both.
- MoE Fallback Logic: The fallback for MoE layer detection is now more generalized, which is good, but might be too broad for future hybrid architectures.
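The `6 * params * tokens` verification strategy mentioned above can be sketched as follows (our own illustration, not code from the PR; the function name and example numbers are hypothetical):

```python
def approx_train_flops(num_params: int, tokens_per_step: int) -> int:
    """Classic 6*N*D estimate: ~2*N*D for the forward pass and ~4*N*D
    for the backward pass per training step of a dense transformer."""
    return 6 * num_params * tokens_per_step

# Example: a 1B-parameter model processing 8 * 2048 tokens per step.
print(approx_train_flops(1_000_000_000, 8 * 2048) / 1e12)  # ~98.3 TFLOPs/step
```

A test can then compare the utility's computed per-step TFLOPs against this estimate within a tolerance, since attention FLOPs add a sequence-length-dependent term on top of 6ND.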
```python
config.decoder_block = maxtext_utils.DecoderBlockType.DEEPSEEK
config.per_device_batch_size = 1
config.max_target_length = 2048
config.emb_dim = 1024
```
```suggestion
config.emb_dim = 1024
# dense ffn matmul (silu: 2 * mlp_dim)
```
```python
    total_ffn_flops if is_ffn_flops_already_total else total_ffn_flops * config.num_decoder_layers
)

# Attention flops
```
According to the field description in types.py, `share_kv_projections` only applies to global attention. Applying it to `qkv_flops` here will incorrectly reduce the FLOPs estimate for local attention layers if `share_kv_projections` is True.
Consider calculating `qkv_flops` for local layers without this multiplier, or pass a separate multiplier for global layers.

```python
# Attention flops
qkv_flops = (
    2
    * config.per_device_batch_size
    * config.max_target_length
    * config.emb_dim
    * (config.num_query_heads + 2 * config.num_kv_heads)
    * config.head_dim
)
```
```python
elif config.decoder_block == DecoderBlockType.QWEN3_NEXT:
    num_moe_layers = config.num_decoder_layers
    num_dense_layers = 0
else:
```

It might be safer to check for specific model families or rely on a more explicit config flag for layer interleaving if available.
```python
config.num_experts = 4
config.mlp_dim = 2048
config.moe_mlp_dim = 1024
config.shared_experts = 1
```

```suggestion
config.shared_experts = 1
# moe ffn matmul
```
Force-pushed from ccbcf03 to ffd741b.
This pull request significantly improves the accuracy of FLOPs and MFU (Model Flops Utilization) calculations across multiple architectures, with a focus on Gemma 4 and corrected sliding window logic. The implementation is thorough, including a new comprehensive test suite that validates calculations for 12 different model configurations.
🔍 General Feedback
- Accuracy Improvements: The switch to a precise triangular overlap formula for sliding window attention and the inclusion of backward pass FLOPs for vision encoders are excellent updates that prevent MFU over-estimation.
- Architectural Coverage: The addition of Gemma 4 specific logic and the generalization of MoE layer detection make the utilities much more robust for future model support.
- Testing: The new `maxtext_utils_flops_test.py` is a great addition, providing clear manual-calculation-based verification for various architectures.
- Suggestions: I've provided a few suggestions to further generalize the MoE layer detection and ensure consistent dimension usage in MoE FFN calculations.
```python
    learnable_weight_flops += 2 * vision_embedder_flops  # only projector is learnable, add fwd+optimizer
else:
    learnable_weight_flops *= 3  # multiply by 3 for fwd + bwd + optimizer
    total_attn_flops *= 3  # multiply by 3 for fwd + bwd pass
```

🟠 High - Including the attention FLOPs in the vision encoder backward pass is correct when the encoder is unfrozen. This ensures that the total TFLOPs and MFU are not under-estimated during full fine-tuning.
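The frozen-versus-unfrozen scaling discussed here can be sketched as a small helper (our own illustration; the function and parameter names are assumptions, not the PR's API):

```python
def scale_for_training(fwd_flops: float, projector_flops: float,
                       encoder_frozen: bool) -> float:
    """Scale vision-encoder forward FLOPs for a training step."""
    if encoder_frozen:
        # Encoder runs forward only; just the projector incurs backward +
        # optimizer work, hence the extra 2x on its forward cost.
        return fwd_flops + 2 * projector_flops
    # Full fine-tuning: every forward FLOP is roughly tripled (fwd + bwd).
    return 3 * fwd_flops
```

With `fwd_flops=10.0` and `projector_flops=2.0`, the frozen case yields 14.0 while full fine-tuning yields 30.0, which is why omitting the backward multiplier would understate TFLOPs.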
```diff
 * config.per_device_batch_size
 * config.max_target_length
-* min(config.sliding_window_size, config.max_target_length)
+* (config.max_target_length * window - 0.5 * window**2)
```

🟢 Good catch on the causal sliding window FLOPs calculation. The new formula `(config.max_target_length * window - 0.5 * window**2)` is more accurate than the previous approximation.
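As a quick numeric sanity check of the triangular-overlap idea (our own check, not from the PR): counting exact (query, key) pairs for causal attention with window `w` over length `T` agrees with the closed form `T*w - 0.5*w**2` up to an O(w) boundary term, whose exact value depends on whether the window is counted as including the current token.

```python
def exact_pairs(T: int, w: int) -> int:
    """Exact attended (query, key) pairs: query i sees keys max(0, i-w+1)..i."""
    return sum(min(i + 1, w) for i in range(T))

T, w = 2048, 512
approx = T * w - 0.5 * w**2
# Closed form matches the exact count up to a small boundary term (<= w).
print(exact_pairs(T, w), approx)
```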
```python
    num_dense_layers = 0
else:
    raise ValueError("Currently we only support DeepSeek, Llama4, and Qwen3-Next calculation.")
if config.num_experts > 1:
```

🟡 Medium - This fallback logic for MoE layer detection can be made more general by respecting `config.first_num_dense_layers` for any MoE architecture that doesn't match the specific interleaved logic of Llama 4.

```suggestion
if config.num_experts > 1:
    num_dense_layers = config.first_num_dense_layers
    num_moe_layers = config.num_decoder_layers - config.first_num_dense_layers
else:
```
```diff
 gate_flops = 2 * config.per_device_batch_size * config.max_target_length * config.emb_dim * config.num_experts
 total_ffn_flops = (
-    gate_flops + calculate_ffn_mamtul_tflops_per_device(config, config.mlp_dim) * config.num_experts_per_tok
+    gate_flops + calculate_ffn_mamtul_tflops_per_device(config, config.moe_mlp_dim) * config.num_experts_per_tok
```

🟠 High - Correcting this to use `config.moe_mlp_dim` is essential for accurate FLOPs in MoE models, as MoE layers typically use a different intermediate dimension than dense layers.
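The MoE FFN accounting discussed here can be sketched as follows (a hedged illustration under our own assumptions: a gated-MLP/SwiGLU-style expert with three matmuls of shape `emb_dim x mlp_dim`, and per-token top-k routing; function names are ours, not MaxText's):

```python
def ffn_matmul_flops(tokens: int, emb_dim: int, mlp_dim: int) -> int:
    # 2 FLOPs per multiply-accumulate, 3 matmuls (gate, up, down projections).
    return 2 * tokens * 3 * emb_dim * mlp_dim

def moe_ffn_flops(tokens, emb_dim, moe_mlp_dim, num_experts, experts_per_tok):
    gate = 2 * tokens * emb_dim * num_experts  # router matmul over all experts
    # Each token only executes its top-k experts, sized by moe_mlp_dim,
    # which is why using the dense mlp_dim here would misestimate FLOPs.
    return gate + experts_per_tok * ffn_matmul_flops(tokens, emb_dim, moe_mlp_dim)
```

For example, with 2048 tokens, `emb_dim=1024`, `moe_mlp_dim=1024`, 4 experts, and top-2 routing, the router contributes roughly 16.8 MFLOPs while the expert matmuls dominate at about 25.8 GFLOPs.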
```python
    learnable_weight_flops += 2 * projector_flops  # only projector is learnable, add fwd+optimizer
else:
    learnable_weight_flops *= 3  # multiply by 3 for fwd + bwd + optimizer
    total_attn_flops *= 3  # multiply by 3 for fwd + bwd pass
```

🟠 High - Including the attention FLOPs in the vision encoder backward pass is correct when the encoder is unfrozen. This ensures that the total TFLOPs and MFU are not under-estimated during full fine-tuning.
```diff
 * config.max_target_length
 * config.emb_dim
-* (config.num_query_heads + 2 * config.num_kv_heads)
+* (config.num_query_heads + kv_multiplier * config.num_kv_heads)
```

🟡 Factoring in `share_kv_projections` for QKV FLOPs ensures accuracy for models that share key and value projections.
```diff
 base_moe_mlp_dim: int = Field(7168, description="Intermediate dimension at MoE layer (DeepSeek style).")
 first_num_dense_layers: NonNegativeInt = Field(0, description="Number of initial dense layers in the model.")
-shared_experts: PositiveInt = Field(1, description="Number of shared experts.")
+shared_experts: NonNegativeInt = Field(0, description="Number of shared experts.")
```

🟢 Allowing `shared_experts` to be 0 is necessary for models that do not use shared experts.
I love the test! I am not sure how we have gotten this far without testing our tflops calculation...
```diff
@@ -0,0 +1,516 @@
+# Copyright 2023–2026 Google LLC
```

In MaxText we have flop_calculation_test.py in https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/unit/flop_calculation_test.py. Seems replicated?
Description
Adds TFLOPs calculations for the Gemma 4 architecture (including MoE) and fixes several inaccuracies in existing FLOPs math (sliding window overlap, vision encoder scaling, and shared KV projections).
- MoE FFN now uses `moe_mlp_dim`, with generalized MoE layer detection (`num_experts > 1`).
- Corrected causal sliding window FLOPs with the triangular overlap formula (`max_target_length * window - 0.5 * window**2`).
- Factored in `share_kv_projections` for accurate QKV FLOPs.

Tests
Added `maxtext_utils_flops_test.py` to validate FLOPs calculations across 12 model architectures.

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.