Prerequisites
Before submitting your question, please ensure the following:
Question Details
In the PowerInfer paper, the authors state that the model falls back to dense mode during the prefill stage, which is why PowerInfer can achieve prefill latency comparable to llama.cpp.
However, in SmallThinker the MoE/FFN weights are offloaded to disk. I have some questions about the prefill stage:
- I think the "DP-Groups Global Load Balance Loss" can mitigate this for short inputs, since it explicitly encourages neighboring tokens to route to the same experts. But for long sequences, will all experts end up being loaded into memory because different groups require different experts?
- Sparsity seems to apply only during decoding. Is the whole FFN used in the prefill stage, i.e., is there effectively no sparsity during prefill?
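To illustrate the concern in the first question, here is a toy simulation (not SmallThinker's actual router, which is locality-biased by the load-balance loss): if each token independently activates `top_k` random experts, the union of experts touched during prefill grows quickly with sequence length, so a long prompt ends up pulling nearly every expert off disk. The expert count and `top_k` below are illustrative assumptions.

```python
import random

def union_of_active_experts(seq_len, num_experts=64, top_k=4, seed=0):
    """Toy model: count distinct experts touched by a prefill pass when
    each token independently activates top_k experts at random.
    (Hypothetical router; real routing may be far more clustered.)"""
    rng = random.Random(seed)
    active = set()
    for _ in range(seq_len):
        # Each token routes to top_k distinct experts.
        active.update(rng.sample(range(num_experts), top_k))
    return len(active)

# The union of activated experts approaches num_experts as the prompt grows,
# which is exactly the long-sequence case asked about above.
for n in (1, 8, 64, 512):
    print(n, union_of_active_experts(n))
```

Under unbiased routing the union saturates at all 64 experts well before 512 tokens; the open question is how much the locality induced by the load-balance loss slows this saturation in practice.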