Document SSE/chunked streaming roadmap scope and status #4196
Open
Conversation
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Implement streaming token output for SSE and chunked response" to "Document SSE/chunked streaming roadmap scope and status" on Mar 13, 2026
Copilot stopped work on behalf of makr-code due to an error — March 13, 2026 19:41
Pull request overview
Updates the LLM module’s future-enhancement documentation to clearly scope and describe the “Streaming Token Output (SSE / Chunked Response)” roadmap item, including constraints, interfaces, phases, and test status.
Changes:
- Replaces the prior streaming roadmap blurb with a structured section (scope/constraints/interfaces/phases/test strategy/perf/security).
- Documents thread-safety/ownership expectations for `InferenceRequest::stream_callback` and `llm::StreamingHandler`.
- Records test execution as blocked due to an unrelated build failure, to keep status transparent.
Comment on lines +77 to +94
### Scope
- Deliver OpenAI-style streaming for LLM responses via SSE framing and HTTP chunked responses.
- Expose token-level callbacks through `InferenceRequest::stream_callback` for both engines while keeping engines output-format agnostic.
- Provide reusable formatting helpers in `llm::StreamingHandler` for SSE events, the `[DONE]` sentinel, and chunked-transfer frames.

### Design Constraints
- SSE payloads must be valid JSON per RFC 8259 with control-character escaping; framing must end with `\n\n`.
- Terminal events must emit the canonical `data: [DONE]\n\n` sentinel; chunked responses must end with the zero-length chunk `0\r\n\r\n`.
- Streaming callbacks run on worker threads and must respect cancellation/deadlines before emitting tokens.
- Deduplication caching must be skipped for streaming requests to avoid serving partial cached content.
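The framing rules above (RFC 8259 escaping, `\n\n` SSE framing, the `[DONE]` sentinel, and the zero-length terminal chunk) can be sketched as small pure helpers. This is an illustrative sketch only, assuming plausible shapes for the `StreamingHandler` helpers; the actual signatures in the codebase may differ.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Escape control characters so the SSE payload stays valid RFC 8259 JSON.
std::string escapeJson(const std::string& s) {
    std::ostringstream out;
    for (char c : s) {
        switch (c) {
            case '"':  out << "\\\""; break;
            case '\\': out << "\\\\"; break;
            case '\n': out << "\\n";  break;
            case '\r': out << "\\r";  break;
            case '\t': out << "\\t";  break;
            default:
                if (static_cast<unsigned char>(c) < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x",
                                  static_cast<unsigned>(static_cast<unsigned char>(c)));
                    out << buf;
                } else {
                    out << c;
                }
        }
    }
    return out.str();
}

// One SSE event: "data: <json>\n\n" (double-newline framing).
std::string formatSseEvent(const std::string& token) {
    return "data: {\"token\":\"" + escapeJson(token) + "\"}\n\n";
}

// Terminal sentinel required by the OpenAI-style streaming protocol.
std::string formatDoneEvent() { return "data: [DONE]\n\n"; }

// One HTTP chunked-transfer frame: hex length, CRLF, payload, CRLF.
std::string formatChunkedData(const std::string& payload) {
    std::ostringstream out;
    out << std::hex << payload.size() << "\r\n" << payload << "\r\n";
    return out.str();
}

// Zero-length terminal chunk that ends a chunked response.
std::string finalChunk() { return "0\r\n\r\n"; }
```

Keeping these helpers pure and stateless is what lets the engines stay output-format agnostic: the HTTP layer decides whether to wrap each token as an SSE event, a chunked frame, or both.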
### Required Interfaces

| Interface | Consumer | Notes |
|-----------|----------|-------|
| `InferenceRequest::stream_callback` | `AsyncInferenceEngine`, `InferenceEngineEnhanced`, HTTP SSE writers | Serial invocation on the producing worker thread; sink must be thread-safe when sharing state. |
| `llm::StreamingHandler::{formatSseEvent, formatDoneEvent, formatChunkedData, makeStreamCallback}` | HTTP layer (SSE endpoints, OpenAI compat adapter) | Static, reentrant helpers; atomic index for single-producer streams. |
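The "atomic index for single-producer streams" note in the table could look roughly like the following. This is an assumed shape for `makeStreamCallback`, not the real `llm::StreamingHandler` API; tokens are assumed to be pre-escaped JSON string content.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <string>

using WriteFn = std::function<void(const std::string&)>;

// Wraps a raw write function so each emitted SSE frame carries a
// monotonically increasing chunk index. The shared_ptr keeps the counter
// alive for the lambda's lifetime; the atomic is defensive — the documented
// contract is a single producer invoking the callback serially.
WriteFn makeStreamCallback(WriteFn write) {
    auto index = std::make_shared<std::atomic<int>>(0);
    return [write = std::move(write), index](const std::string& token) {
        int i = index->fetch_add(1);
        write("data: {\"index\":" + std::to_string(i) +
              ",\"token\":\"" + token + "\"}\n\n");
    };
}
```

Because the returned closure owns its counter, each stream gets independent numbering even when the helper itself is static and reentrant.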
### Implementation Phases
Comment on lines +83 to +97

### Implementation Phases
- **Phase 1 — Design / API Contract**
  - [x] Expose `InferenceRequest::stream_callback` (`include/llm/llm_plugin_interface.h`) as `std::function<void(const std::string&)>`, invoked serially on the worker thread; sinks must be thread-safe when sharing state and must handle abrupt stop (no further callbacks, possibly without a terminal marker) without throwing.
  - [x] Define the SSE/chunked framing surface via `StreamingHandler` (JSON escaping, `[DONE]` sentinel, zero-length terminal chunk) to keep engines output-format agnostic.
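The Phase 1 callback contract (serial invocation, thread-safe sinks, tolerance of abrupt stop) could be exercised with a minimal sink like the one below. The `InferenceRequest` struct here is a hypothetical stand-in mirroring the documented field, not the actual type from `llm_plugin_interface.h`.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the documented request type; only the
// stream_callback field matters for this sketch.
struct InferenceRequest {
    std::string prompt;
    std::function<void(const std::string&)> stream_callback;
};

// A sink that satisfies the contract: it only mutates state under a lock,
// never throws from the callback path, and is left in a consistent state
// if the stream stops abruptly without a terminal marker.
class TokenSink {
public:
    void append(const std::string& token) {
        std::lock_guard<std::mutex> lock(mu_);
        tokens_.push_back(token);
    }
    std::vector<std::string> snapshot() const {
        std::lock_guard<std::mutex> lock(mu_);
        return tokens_;
    }
private:
    mutable std::mutex mu_;
    std::vector<std::string> tokens_;
};

// Wire the sink into a request; the callback runs serially on the producing
// worker thread, while snapshot() may be called from the HTTP writer thread.
InferenceRequest makeStreamingRequest(std::string prompt, TokenSink& sink) {
    InferenceRequest req;
    req.prompt = std::move(prompt);
    req.stream_callback = [&sink](const std::string& token) { sink.append(token); };
    return req;
}
```

The sink must outlive the request here, since the lambda captures it by reference; an owning capture (e.g. `std::shared_ptr<TokenSink>`) would be the safer choice when lifetimes are less clear.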
Description
- Updates `src/llm/FUTURE_ENHANCEMENTS.md` with scoped goals, constraints, interfaces, phase tasks, performance targets, and security considerations.
- Documents thread-safety and ownership expectations for `stream_callback` and `StreamingHandler`, adds dedup-cache bypass rationale, and captures cancellation/deadline behavior and terminal markers.
- Notes that test execution is blocked by an unrelated `llm_deployment_plugin.cpp` incomplete-type build failure, keeping status transparent.

Type of Change
Testing
📚 Research & Knowledge (if applicable)
- Created under `/docs/research/`?
- Entered in `/docs/research/implementation_influence/`?
- Relevant sources:
Checklist
Original prompt