Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb9c17005d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```csharp
int chunkSize = ResolvePrefillChunkSize(_backend, tokens.Count);
if (chunkSize >= tokens.Count)
    return _model.ForwardRefill(CopyTokenRange(tokens, 0, tokens.Count));
```
Disable CUDA prefill chunking for multimodal prompts
ForwardPromptPrefill now chunks any CUDA prompt larger than 5120 tokens, but multimodal models store image/audio embedding insertion indices relative to the full prompt token sequence. With chunked prefill, ForwardRefill is called on partial token windows while those absolute positions are still applied, which can push an embedding insertion past the chunk bounds (triggering Tensor.Narrow range exceptions) or place embeddings in the wrong segment. This breaks long multimodal inference on CUDA: either chunking should be skipped for multimodal requests, or the embedding insertion positions must be rebased per chunk.
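The per-chunk rebasing option could be sketched roughly as follows. This is a minimal illustration, not the repository's actual API: `EmbeddingInsertion`, `PlanChunks`, and the tuple shape are hypothetical names invented here; only the idea (filter insertions to the current window and shift them to chunk-relative offsets) reflects the suggestion above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: an embedding to splice in at an absolute prompt position.
public sealed record EmbeddingInsertion(int Position, float[] Embedding);

public static class ChunkedPrefill
{
    // For each prefill window, keep only the insertions whose absolute position
    // falls inside that window, rebased to chunk-relative offsets so they can
    // never exceed the chunk bounds.
    public static IEnumerable<(int Start, int Length, List<EmbeddingInsertion> Insertions)>
        PlanChunks(int promptLength, int chunkSize, IReadOnlyList<EmbeddingInsertion> insertions)
    {
        for (int start = 0; start < promptLength; start += chunkSize)
        {
            int length = Math.Min(chunkSize, promptLength - start);
            var rebased = insertions
                .Where(ins => ins.Position >= start && ins.Position < start + length)
                .Select(ins => ins with { Position = ins.Position - start })
                .ToList();
            yield return (start, length, rebased);
        }
    }
}
```

Disabling chunking for multimodal requests is the simpler fix (e.g. forcing `chunkSize = tokens.Count` when the request carries embeddings), at the cost of losing the memory benefit of chunked prefill for those prompts.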
Useful? React with 👍 / 👎.