NucleusMoE-Image#13317

Open
sippycoder wants to merge 19 commits into huggingface:main from sippycoder:main

Conversation

@sippycoder

What does this PR do?

This PR introduces NucleusMoE-Image series into the diffusers library.

NucleusMoE-Image is a 17B-parameter model with 2B active parameters per token, trained with efficiency at its core. Our novel architecture highlights the scalability of sparse MoE architectures for image generation. The technical report will be released soon.

@sippycoder
Author

cc: @sayakpaul @IlyasMoutawwakil

@sayakpaul requested review from dg845 and yiyixuxu on March 24, 2026 04:08
Comment on lines +545 to +546
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
Collaborator

It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
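A minimal sketch of the behavior under discussion (standalone toy tensors, not the model's actual gating code): clamping to [-2.0, 2.0] before `tanh` barely changes the forward values, since `tanh` is already near saturation at |x| = 2, but it does zero the gradient exactly for inputs outside the clamp range, whereas plain `tanh` keeps a small nonzero gradient there.

```python
import torch

# Toy gate logits spanning the clamp range and beyond.
gate = torch.tensor([-5.0, -1.0, 0.5, 5.0], requires_grad=True)

# Variant from the PR: clamp first, then squash with tanh.
clamped_then_tanh = torch.tanh(gate.clamp(min=-2.0, max=2.0))

# Plain tanh for comparison; forward values differ only slightly,
# because tanh(2) is already about 0.964.
plain_tanh = torch.tanh(gate.detach())
print(clamped_then_tanh.detach())
print(plain_tanh)

# The real difference is in the backward pass: clamp has zero gradient
# outside [-2, 2], so those positions receive exactly zero gradient.
clamped_then_tanh.sum().backward()
print(gate.grad)  # zero at positions where |gate| > 2, nonzero elsewhere
```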

Author


I agree it's weird. :) I used it to stabilize gradients when the tanh gates saturate during training. I will evaluate the model's performance without it and get back to you!

Collaborator

@dg845 left a comment


Thanks for the PR! Left an initial review :). @yiyixuxu, could you also take a look at the text KV cache code in src/diffusers/hooks/text_kv_cache.py?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +380 to +391
self.experts = nn.ModuleList(
[
FeedForward(
dim=hidden_size,
dim_out=hidden_size,
inner_dim=moe_intermediate_dim,
activation_fn="swiglu",
bias=False,
)
for _ in range(num_experts)
]
)
Member


You would need the projections to be in a packed/contiguous format, shaped (num_experts, dim_in, dim_out), for torch.grouped_mm support. @sayakpaul, is that possible? In Transformers we use the inline weight converter.

Member


Not at the moment, because MoEs are still a bit of a special case in this part of the world.

Author


I can pack the MoE weights. That's how I originally trained the model with Expert Parallel.
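The packing being discussed might look like the following minimal sketch (all names and shapes are illustrative, not the diffusers API): the per-expert weight matrices from an nn.ModuleList are stacked into a single contiguous (num_experts, dim_in, dim_out) tensor, after which one batched matmul replaces the Python loop over experts. torch.bmm stands in here for a fused grouped-mm kernel.

```python
import torch
import torch.nn as nn

num_experts, dim_in, dim_out = 4, 8, 16

# Unpacked layout: one linear projection per expert, as in an nn.ModuleList.
experts = nn.ModuleList(
    nn.Linear(dim_in, dim_out, bias=False) for _ in range(num_experts)
)

# Packed layout: stack the transposed per-expert weights into a single
# contiguous (num_experts, dim_in, dim_out) tensor.
packed = torch.stack([e.weight.t() for e in experts]).contiguous()
assert packed.shape == (num_experts, dim_in, dim_out)

# With tokens already grouped per expert, one batched matmul replaces
# the per-expert loop (a grouped-mm kernel would fuse this further and
# also handle uneven token counts per expert).
tokens_per_expert = 3
x = torch.randn(num_experts, tokens_per_expert, dim_in)
packed_out = torch.bmm(x, packed)

# Sanity check: the batched result matches the per-expert loop.
loop_out = torch.stack([experts[i](x[i]) for i in range(num_experts)])
torch.testing.assert_close(packed_out, loop_out)
```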

sippycoder and others added 9 commits March 25, 2026 09:34
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
…mage.py

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
6 participants