NucleusMoE-Image#13317

Open
sippycoder wants to merge 19 commits into huggingface:main from sippycoder:main

Conversation

@sippycoder

What does this PR do?

This PR introduces NucleusMoE-Image series into the diffusers library.

NucleusMoE-Image is a 17B-parameter model with 2B active parameters per token, trained with efficiency at its core. Our novel architecture highlights the scalability of sparse MoE architectures for image generation. The technical report will be released soon.

@sippycoder
Author

cc: @sayakpaul @IlyasMoutawwakil

@sayakpaul requested review from dg845 and yiyixuxu on March 24, 2026 04:08
Comment on lines +545 to +546
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
Collaborator

It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
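A minimal sketch of the behavior under discussion (standalone toy tensors, not the model's actual gating code): clamping to [-2.0, 2.0] before `tanh` barely changes the forward values, since `tanh` is already near saturation at |x| = 2, but it does zero the gradient exactly for inputs outside the clamp range, whereas plain `tanh` keeps a small nonzero gradient there.

```python
import torch

# Toy gate logits spanning the clamp range and beyond.
gate = torch.tensor([-5.0, -1.0, 0.5, 5.0], requires_grad=True)

# Variant from the PR: clamp first, then squash with tanh.
clamped_then_tanh = torch.tanh(gate.clamp(min=-2.0, max=2.0))

# Plain tanh for comparison; forward values differ only slightly,
# because tanh(2) is already about 0.964.
plain_tanh = torch.tanh(gate.detach())
print(clamped_then_tanh.detach())
print(plain_tanh)

# The real difference is in the backward pass: clamp has zero gradient
# outside [-2, 2], so those positions receive exactly zero gradient.
clamped_then_tanh.sum().backward()
print(gate.grad)  # zero at positions where |gate| > 2, nonzero elsewhere
```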

Author


I agree it's weird. :) I used it to stabilize gradients when the tanh gates saturate during training. I will evaluate the model's performance without it and get back to you!

Collaborator

@dg845 left a comment


Thanks for the PR! Left an initial review :). @yiyixuxu, could you also take a look at the text KV cache code in src/diffusers/hooks/text_kv_cache.py?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +380 to +391
self.experts = nn.ModuleList(
[
FeedForward(
dim=hidden_size,
dim_out=hidden_size,
inner_dim=moe_intermediate_dim,
activation_fn="swiglu",
bias=False,
)
for _ in range(num_experts)
]
)
Member


You would need the projections to be in a packed/contiguous format, shaped (num_experts, dim_in, dim_out), for torch.grouped_mm support. @sayakpaul, is that possible? In Transformers we use the inline weight converter.

Member


Not at the moment, because MoEs are still a bit of a special case in this part of the world.

Author


I can pack the MoE weights. That's how I originally trained the model with Expert Parallel.
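The packing being discussed might look like the following minimal sketch (all names and shapes are illustrative, not the diffusers API): the per-expert weight matrices from an nn.ModuleList are stacked into a single contiguous (num_experts, dim_in, dim_out) tensor, after which one batched matmul replaces the Python loop over experts. torch.bmm stands in here for a fused grouped-mm kernel.

```python
import torch
import torch.nn as nn

num_experts, dim_in, dim_out = 4, 8, 16

# Unpacked layout: one linear projection per expert, as in an nn.ModuleList.
experts = nn.ModuleList(
    nn.Linear(dim_in, dim_out, bias=False) for _ in range(num_experts)
)

# Packed layout: stack the transposed per-expert weights into a single
# contiguous (num_experts, dim_in, dim_out) tensor.
packed = torch.stack([e.weight.t() for e in experts]).contiguous()
assert packed.shape == (num_experts, dim_in, dim_out)

# With tokens already grouped per expert, one batched matmul replaces
# the per-expert loop (a grouped-mm kernel would fuse this further and
# also handle uneven token counts per expert).
tokens_per_expert = 3
x = torch.randn(num_experts, tokens_per_expert, dim_in)
packed_out = torch.bmm(x, packed)

# Sanity check: the batched result matches the per-expert loop.
loop_out = torch.stack([experts[i](x[i]) for i in range(num_experts)])
torch.testing.assert_close(packed_out, loop_out)
```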

sippycoder and others added 9 commits March 25, 2026 09:34
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
…mage.py

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
6 participants