NucleusImage - text kv caching
src/diffusers/models/transformers/transformer_nucleusmoe_image.py
gate1 = gate1.clamp(min=-2.0, max=2.0)
gate2 = gate2.clamp(min=-2.0, max=2.0)
It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp again by squashing with the tanh function below. Is this intended?
I agree it's weird. :) I used it to stabilize the gradients when the tanh gates get saturated during training. I will evaluate the model performance without it and get back to you!
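For concreteness, here is a minimal sketch of the pattern under discussion, assuming the gate is applied to a residual branch (the helper name and shapes are hypothetical, not from the PR). Note that `tanh(2.0) ≈ 0.964`, so the clamp bounds the pre-activation rather than the output range:

```python
import torch

def gated_residual(hidden, branch_out, gate_logits):
    # Hypothetical helper mirroring the pattern discussed above:
    # clamp the raw gate logits to [-2, 2], then squash with tanh.
    # Since tanh(2.0) ~= 0.964, the output is already nearly saturated
    # at the clamp boundary, which is why the double squashing looks
    # redundant; the stated motivation is training-time gradient behavior.
    gate = torch.tanh(gate_logits.clamp(min=-2.0, max=2.0))
    return hidden + gate * branch_out

hidden = torch.randn(2, 16, 64)
branch = torch.randn(2, 16, 64)
logits = torch.randn(2, 16, 1) * 5.0  # some logits land far outside [-2, 2]
out = gated_residual(hidden, branch, logits)
print(out.shape)  # torch.Size([2, 16, 64])
```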
src/diffusers/pipelines/nucleusmoe_image/pipeline_nucleusmoe_image.py
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
self.experts = nn.ModuleList(
    [
        FeedForward(
            dim=hidden_size,
            dim_out=hidden_size,
            inner_dim=moe_intermediate_dim,
            activation_fn="swiglu",
            bias=False,
        )
        for _ in range(num_experts)
    ]
)
You would need the projections to be in packed/contiguous format for torch.grouped_mm support, i.e. (num_experts, dim_in, dim_out). @sayakpaul, is that possible? In Transformers we use the inline weight converter.
Not at the moment because MoEs are still a bit of a special case in this part of world.
I can pack the MoE weights. That's how I originally trained the model with Expert Parallel.
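A minimal sketch of the packed layout being proposed, assuming tokens have already been sorted by expert assignment; `torch.bmm` stands in here for a grouped GEMM, the dimension names are illustrative, and the real SwiGLU FFN would also carry a separate gate projection:

```python
import torch

num_experts, dim, inner = 4, 32, 64

# Packed layout discussed above: one contiguous tensor per projection,
# shaped (num_experts, dim_in, dim_out), instead of a ModuleList of
# separate FeedForward modules.
w_up = torch.randn(num_experts, dim, inner)
w_down = torch.randn(num_experts, inner, dim)

# Tokens pre-sorted per expert: (num_experts, tokens_per_expert, dim).
tokens_per_expert = 8
x = torch.randn(num_experts, tokens_per_expert, dim)

# Each expert's weight matrix multiplies only its own token group;
# a grouped GEMM kernel would fuse these per-expert matmuls.
h = torch.nn.functional.silu(torch.bmm(x, w_up))
y = torch.bmm(h, w_down)
print(y.shape)  # torch.Size([4, 8, 32])
```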
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
What does this PR do?
This PR introduces the NucleusMoE-Image series into the diffusers library.
NucleusMoE-Image is a 17B-parameter model with 2B active parameters, trained with efficiency at its core. Our novel architecture highlights the scalability of sparse MoE architectures for image generation. The technical report will be released very soon.
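The "2B active of 17B total" figure comes from sparse top-k routing: each token activates only k of the experts, so only a fraction of the parameters run per forward pass. A minimal routing sketch (the function name, k, and dimensions are illustrative, not the model's actual configuration):

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, router_weight, k=2):
    # Router logits per token, then keep only the top-k experts.
    logits = hidden @ router_weight          # (tokens, num_experts)
    weights, idx = logits.topk(k, dim=-1)    # each token picks k experts
    weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k
    return weights, idx

tokens, dim, num_experts = 16, 32, 8
hidden = torch.randn(tokens, dim)
router = torch.randn(dim, num_experts)
w, idx = topk_route(hidden, router)
# Only k / num_experts of the expert parameters run per token, which is
# how a large-total-parameter MoE keeps its active parameter count small.
print(w.shape, idx.shape)  # torch.Size([16, 2]) torch.Size([16, 2])
```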