Fix music generation token stopping by dysangel · Pull Request #2057 · LostRuins/koboldcpp

dysangel · 2026-03-21T10:27:03Z

Problem

I was trying to get the music UI working on macOS. bf16 inference is not hardware accelerated on Metal, so I switched to quantised models.

The 'Plan' button would not work for me, because Music Phase 1 planning generation always kept going until it hit the KV cache limit.

Root Cause

After the FSM guides the model through the metadata fields (bpm, caption, duration, keyscale, language, timesignature) and forces TOKEN_THINK_END, it transitions to the CODES state and disables itself. From that point the model is expected to generate TOKEN_IM_END on its own to stop. In some cases (especially with quantized models) it does not emit that token promptly, so generation continues until the KV cache is exhausted.
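The hand-off described above can be pictured with a heavily simplified sketch. It is not the FSM from the repository: the `PlanFsm` type, the `State` names, and the idea of forcing one string per field are assumptions made for illustration. Only the field order, TOKEN_THINK_END, and the CODES state come from the description.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical state names; the real FSM constrains individual tokens,
// not whole field names.
enum class State { Fields, ThinkEnd, Codes };

struct PlanFsm {
    std::vector<std::string> fields{"bpm", "caption", "duration",
                                    "keyscale", "language", "timesignature"};
    std::size_t next_field = 0;
    State state = State::Fields;
    bool enabled = true;

    // Returns what the FSM forces next, or std::nullopt once it is disabled.
    std::optional<std::string> step() {
        if (!enabled) return std::nullopt;  // model is unconstrained from here on
        if (state == State::Fields) {
            std::string f = fields[next_field++];
            if (next_field == fields.size()) state = State::ThinkEnd;
            return f;
        }
        // state == State::ThinkEnd: force the end of the thinking block,
        // move to CODES, and switch the FSM off.
        state = State::Codes;
        enabled = false;
        return std::string("TOKEN_THINK_END");
    }
};
```

Once `step()` has returned TOKEN_THINK_END, every later call returns `std::nullopt`, so in this sketch stopping depends entirely on the model sampling TOKEN_IM_END by itself.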

Solution

Add a safety check that forces TOKEN_IM_END when the generation reaches the token limit. This prevents KV cache exhaustion while still allowing the model to generate normally (including any lyrics after the thinking block).

Changes

Modified otherarch/acestep/ace-qwen3.cpp to add a safety check before adding each token
Forces TOKEN_IM_END when gen_tokens.size() >= max_new_tokens - 1
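The safety check listed above could look roughly like this. It is a minimal sketch and not the code from otherarch/acestep/ace-qwen3.cpp: the `generate` function, the `sample_next` callback, and the numeric value of `TOKEN_IM_END` are hypothetical stand-ins, and only `gen_tokens`, `max_new_tokens`, and the token name come from the description.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical ID for illustration; the real value comes from the model vocabulary.
constexpr int TOKEN_IM_END = 2;

// Generation loop with a forced stop. sample_next stands in for the model's sampler.
std::vector<int> generate(const std::function<int()>& sample_next,
                          std::size_t max_new_tokens) {
    std::vector<int> gen_tokens;
    while (gen_tokens.size() < max_new_tokens) {
        int tok = sample_next();
        // Safety check before adding each token: on the last available slot,
        // override the sampled token with TOKEN_IM_END. Written as size() + 1 to
        // avoid unsigned underflow; equivalent to size() >= max_new_tokens - 1.
        if (gen_tokens.size() + 1 >= max_new_tokens) {
            tok = TOKEN_IM_END;
        }
        gen_tokens.push_back(tok);
        if (tok == TOKEN_IM_END) break;  // normal or forced stop
    }
    return gen_tokens;
}
```

With a sampler that never produces TOKEN_IM_END, the output still ends with TOKEN_IM_END at exactly `max_new_tokens` tokens. A sampler that stops earlier is unaffected by the check.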

Impact

Prevents KV cache exhaustion errors
Makes music generation more robust across all model types

Co-Authored-By: GLM-5

In Phase 1 lyrics mode, the FSM transitions to CODES state after TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was not efficiently generating TOKEN_IM_END to stop the generation, causing it to continue until hitting the 8192 token limit. This fix forces TOKEN_IM_END to be generated immediately after TOKEN_THINK_END in lyrics mode, ensuring clean completion of the planning phase without excessive token generation. Testing shows generation now completes in ~500ms instead of 80+ seconds with timeout errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END, only force it when we've reached the token limit. This allows the model to generate lyrics after the thinking block while still preventing KV cache exhaustion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

dysangel changed the title from "Fix music generation token stopping for quantized models" to "Fix music generation token stopping" on Mar 21, 2026

dysangel and others added 2 commits March 21, 2026 10:33

Clarify comment - fix applies to all models, not just quantized

630cfd1

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

dysangel marked this pull request as draft March 21, 2026 10:52

dysangel marked this pull request as ready for review March 21, 2026 11:01

dysangel wants to merge 3 commits into LostRuins:concedo from dysangel:fix/music-gen-token-stopping