Prerequisites
Before submitting your question, please ensure the following:
Question Details
In the PowerInfer paper, the authors state that the model falls back to dense mode during the prefill stage, which is why PowerInfer can achieve prefill latency comparable to llama.cpp.
However, in SmallThinker the MoE/FFN weights are offloaded to disk. I have some questions about the prefill stage:
- I think the "DP-Groups Global Load Balance Loss" can mitigate this for short inputs, since it explicitly encourages neighboring tokens to route to the same experts. But for long sequences, will all experts end up being loaded into memory because different groups require different experts?
- Sparsity seems to apply only during decoding. Is the whole FFN used in the prefill stage, i.e., is there effectively no sparsity during prefill?
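To illustrate the concern in the first question, here is a toy simulation (not SmallThinker's actual router, which is locality-biased by the load-balance loss): if each token independently activates `top_k` random experts, the union of experts touched during prefill grows quickly with sequence length, so a long prompt ends up pulling nearly every expert off disk. The expert count and `top_k` below are illustrative assumptions.

```python
import random

def union_of_active_experts(seq_len, num_experts=64, top_k=4, seed=0):
    """Toy model: count distinct experts touched by a prefill pass when
    each token independently activates top_k experts at random.
    (Hypothetical router; real routing may be far more clustered.)"""
    rng = random.Random(seed)
    active = set()
    for _ in range(seq_len):
        # Each token routes to top_k distinct experts.
        active.update(rng.sample(range(num_experts), top_k))
    return len(active)

# The union of activated experts approaches num_experts as the prompt grows,
# which is exactly the long-sequence case asked about above.
for n in (1, 8, 64, 512):
    print(n, union_of_active_experts(n))
```

Under unbiased routing the union saturates at all 64 experts well before 512 tokens; the open question is how much the locality induced by the load-balance loss slows this saturation in practice.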