Fix music generation token stopping #2057
Open
dysangel wants to merge 3 commits into LostRuins:concedo
In Phase 1 lyrics mode, the FSM transitions to the CODES state after TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was not reliably generating TOKEN_IM_END to stop, so generation continued until it hit the 8192 token limit. This fix forces TOKEN_IM_END to be generated immediately after TOKEN_THINK_END in lyrics mode, ensuring clean completion of the planning phase without excessive token generation. Testing shows generation now completes in ~500ms instead of 80+ seconds with timeout errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END, only force it when we've reached the token limit. This allows the model to generate lyrics after the thinking block while still preventing KV cache exhaustion.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem
I was trying to get the music UI working on macOS. bf16 inference is not hardware accelerated on Metal, so I switched to quantized models.
The 'Plan' button would not work for me: Music Phase 1 planning generation always kept going until it hit the KV cache limit.
Root Cause
After the FSM guides the model through the metadata fields (bpm, caption, duration, keyscale, language, timesignature) and forces TOKEN_THINK_END, it transitions to the CODES state and disables itself. The model should then generate TOKEN_IM_END on its own to stop, but in some cases (especially with quantized models) it does not do so reliably, and generation continues until the KV cache is exhausted.
Solution
Add a safety check that forces TOKEN_IM_END when generation reaches the token limit. This prevents KV cache exhaustion while still allowing the model to generate normally (including any lyrics after the thinking block).
Changes
- otherarch/acestep/ace-qwen3.cpp: add a safety check before adding each token
- Force TOKEN_IM_END when gen_tokens.size() >= max_new_tokens - 1
Impact
Co-Authored-By: GLM-5
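For reviewers, the safety check described under Changes can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual code in ace-qwen3.cpp: the function generate_with_limit, the sample_next callback, and the placeholder value of TOKEN_IM_END are all assumptions made for the example. Only the condition gen_tokens.size() >= max_new_tokens - 1 is taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Placeholder id (assumption): the real TOKEN_IM_END value is defined in the
// project source and is not shown in this PR description.
constexpr int TOKEN_IM_END = -1;

// Hypothetical generation loop. `sample_next` stands in for whatever produces
// the model's next token; it is not a real function from the project.
std::vector<int> generate_with_limit(const std::function<int()>& sample_next,
                                     std::size_t max_new_tokens) {
    assert(max_new_tokens >= 1);  // keeps max_new_tokens - 1 from wrapping around
    std::vector<int> gen_tokens;
    while (gen_tokens.size() < max_new_tokens) {
        int tok = sample_next();
        // Safety check before adding each token: once only one slot is left,
        // replace the sampled token with the end-of-turn token.
        if (gen_tokens.size() >= max_new_tokens - 1) {
            tok = TOKEN_IM_END;
        }
        gen_tokens.push_back(tok);
        if (tok == TOKEN_IM_END) {
            break;  // generation ends cleanly, whether forced or natural
        }
    }
    return gen_tokens;
}
```

With a sampler that never emits TOKEN_IM_END, the sketch returns exactly max_new_tokens tokens and the last one is TOKEN_IM_END. A sampler that stops on its own earlier is left untouched, which matches the intent of letting the model generate lyrics normally after the thinking block.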