
[DO NOT SUBMIT] EP-TP proof of concept (only supports EP + DP) #3600

Draft

gobbleturk wants to merge 4 commits into main from mattdavidow-ep-tp

Conversation

@gobbleturk (Collaborator) commented Apr 8, 2026

Description

Support EP-TP (EP acts like TP for attention/shared expert) for the AG-RS (ring_of_experts) path.

This is helpful for small token counts, e.g. autoregressive inference and small prefills. This has been hacked together as a proof of concept.

There are huge problems this PR does not solve: the token sorting and some other ops still happen on the fully all-gathered tokens (worst-case size), which are by far the longest ops for EP>=4 (b/496676734). We need significant code changes and kernels to support ragged performance (i.e. ops that grow as O(routed tokens), as opposed to our current O(worst case)).
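The padded-vs-ragged gap above can be illustrated with some back-of-the-envelope arithmetic. This is a hedged sketch, not MaxText code: the function names and the balanced-routing assumption are mine, and real routing is rarely perfectly balanced.

```python
# Hypothetical illustration (names are not from MaxText): compare the
# worst-case padded token count the AG-RS path operates on with the
# number of tokens a shard would touch under balanced ragged routing.

def padded_tokens(per_device_batch, seq_len, ep_degree):
    # The all-gather collects every shard's tokens, so downstream
    # sorting and elementwise ops see the full gathered size on every
    # device, regardless of how many tokens are actually routed here.
    return per_device_batch * seq_len * ep_degree

def routed_tokens(per_device_batch, seq_len, ep_degree, top_k):
    # Assuming perfectly balanced routing: each token hits top_k
    # experts, and experts are sharded ep_degree ways, so a shard's
    # local expert work is a 1/ep_degree slice of all token-expert pairs.
    total_pairs = per_device_batch * seq_len * ep_degree * top_k
    return total_pairs // ep_degree

# Numbers loosely based on the smoke command below (ep=4, top_k=2):
worst = padded_tokens(per_device_batch=2, seq_len=2048, ep_degree=4)
ragged = routed_tokens(per_device_batch=2, seq_len=2048, ep_degree=4, top_k=2)
print(worst, ragged)  # 16384 8192: padded work is ep_degree/top_k = 2x larger
```

With top_k fixed, the padded-to-ragged ratio grows linearly in the EP degree, which is consistent with the observation that the O(worst case) ops dominate for EP>=4.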

Example command on my v6e-8 devbox:

alias smoke_train='python3 -m MaxText.train maxtext/configs/base.yml run_name=mattdavidow-train-base base_output_directory=gs://maxtext-experiments-multipod dataset_path=gs://max-datasets-rogue dataset_type=synthetic steps=5 enable_checkpointing=False enable_goodput_recording=False'

alias smoke_moe='smoke_train decoder_block=mixtral num_experts=4 num_experts_per_tok=2 sparse_matmul=True megablox=True per_device_batch_size=4 base_num_decoder_layers=4'

smoke_moe ici_data_parallelism=2 ici_expert_parallelism=4 use_ring_of_experts=True custom_mesh_and_rule=ep-tp num_experts=8 per_device_batch_size=2 expert_shard_attention_option=tp profiler=xplane

profile

Tests

Ran the above command and generated an xprof profile, which looks as expected (not great, due to sorting + elementwise ops on worst-case-size tensors).

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [ ] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [ ] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [ ] I have run end-to-end tests and provided workload links above if applicable.
  • [ ] I have made, or will make, corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 61.53846% with 5 lines in your changes missing coverage. Please review.

File with missing lines     Patch %   Lines
src/maxtext/layers/moe.py   58.33%    3 Missing and 2 partials ⚠️


src/maxtext/layers/moe.py:

  else:
-   input_partition_pspec = self._logical_to_mesh_axes((batch_logical_axis, "activation_norm_length_moe", None))
+   # This is terrible =(
+   input_partition_pspec = self._logical_to_mesh_axes((batch_logical_axis, "activation_norm_length_moe", "activation_embed"))
Collaborator:
Can you explain why this is terrible?

@gobbleturk (Collaborator, Author) commented Apr 8, 2026:
There are a lot of things wrong with our shardings here. Any _moe rule should only be used deep inside the MoE layer, after tokens have been routed. At this point in the model/code, things should still be sharded like attention (tokens are not routed yet), so we should not use any _moe rule. Additionally, the weights below use "activation" logical axis rules, when "activation" should only be used for activations.

@gobbleturk (Collaborator, Author):

The resultant physical specs are probably what we want, but we got there in a very poor way that is hard to read and maintain.
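The maintainability concern above can be illustrated with a toy rule table. This is a hedged sketch: the rule table, axis names, and `logical_to_mesh_axes` helper below are stand-ins of my own, not MaxText's actual implementation, and the tuple result stands in for a real `PartitionSpec`.

```python
# Illustrative only: logical axis names resolve to physical mesh axes
# through a rule table. Reusing a "_moe" rule before routing means
# pre-routing activations silently pick up the expert mesh axis, and
# the physical outcome depends on a table defined far from this code.

# Toy logical->mesh axis rules, keyed by logical axis name (hypothetical).
RULES = {
    "activation_batch": "data",
    "activation_norm_length_moe": "expert",  # intended for post-routing tensors
    "activation_embed": None,
}

def logical_to_mesh_axes(logical_axes):
    """Resolve logical axis names to mesh axes (stand-in for a PartitionSpec)."""
    return tuple(RULES.get(axis) for axis in logical_axes)

# Pre-routing activations annotated with the _moe length rule end up
# split over the "expert" mesh axis even though no routing has happened:
spec = logical_to_mesh_axes(
    ("activation_batch", "activation_norm_length_moe", "activation_embed")
)
print(spec)  # ('data', 'expert', None)
```

The physical spec may be the desired one, but nothing at the call site says so; the correctness lives entirely in the rule table, which is the readability problem the comment describes.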
