
Document SSE/chunked streaming roadmap scope and status#4196

Open
Copilot wants to merge 4 commits into develop from copilot/streaming-token-output-sse-chunked-response

Conversation

Contributor

Copilot AI commented Mar 13, 2026

Description

  • Clarifies the “Streaming Token Output (SSE / Chunked Response)” roadmap detail in src/llm/FUTURE_ENHANCEMENTS.md with scoped goals, constraints, interfaces, phase tasks, performance targets, and security considerations.
  • Specifies thread-safety/ownership for stream_callback and StreamingHandler, adds dedup-cache bypass rationale, and captures cancellation/deadline behavior and terminal markers.
  • Marks streaming tests as blocked due to existing llm_deployment_plugin.cpp incomplete-type build failure, keeping status transparent.

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Other:

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed

📚 Research & Knowledge (if applicable)

  • Is this PR based on scientific paper(s) or best practices?
    • If YES: research files created in /docs/research/?
    • If YES: linked in the module README under "Scientific Foundations"?
    • If YES: recorded in /docs/research/implementation_influence/?

Relevante Quellen:

  • Paper:
  • Best Practice:
  • Architecture Decision:

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation updated (if needed)
  • No new warnings introduced
Original prompt

This section details the original issue you should resolve

<issue_title>Streaming Token Output (SSE / Chunked Response)</issue_title>
<issue_description>### Context

This issue implements the roadmap item 'Streaming Token Output (SSE / Chunked Response)' for the llm domain. It is sourced from the consolidated roadmap under 🟠 High Priority — Near-term (v1.5.0 – v1.8.0) and targets milestone v1.7.0.

Primary detail section: Detail section not found

Goal

Deliver the scoped changes for Streaming Token Output (SSE / Chunked Response) in src/llm/ and complete the linked detail section in a release-ready state for v1.7.0.

Detailed Scope

No matching section was found for anchor 'streaming-token-output-sse--chunked-response' in src/llm/FUTURE_ENHANCEMENTS.md.

Use this issue to restore roadmap/detail consistency and implement the scope described in the roadmap title.

Acceptance Criteria

  • Implement the scoped changes described in the linked detail section.
  • Add or update tests that verify the intended behaviour.

Relationships

References

  • src/ROADMAP.md
  • src/llm/FUTURE_ENHANCEMENTS.md#streaming-token-output-sse--chunked-response
  • Source key: roadmap:76:llm:v1.7.0:streaming-token-output-sse-chunked-response

Generated from the consolidated source roadmap. Keep the roadmap and issue in sync when scope changes.

</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits March 13, 2026 18:27
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Updates the LLM module’s future-enhancement documentation to clearly scope and describe the “Streaming Token Output (SSE / Chunked Response)” roadmap item, including constraints, interfaces, phases, and test status.

Changes:

  • Replaces the prior streaming roadmap blurb with a structured section (scope/constraints/interfaces/phases/test strategy/perf/security).
  • Documents thread-safety/ownership expectations for InferenceRequest::stream_callback and llm::StreamingHandler.
  • Records test execution as blocked due to an unrelated build failure, to keep status transparent.


Comment on lines +77 to +94
### Scope
- Deliver OpenAI-style streaming for LLM responses via SSE framing and HTTP chunked responses.
- Expose token-level callbacks through `InferenceRequest::stream_callback` for both engines while keeping engines output-format agnostic.
- Provide reusable formatting helpers in `llm::StreamingHandler` for SSE events, `[DONE]` sentinel, and chunked-transfer frames.

### Design Constraints
- SSE payloads must be valid JSON per RFC 8259 with control-character escaping; framing must end with `\n\n`.
- Terminal events must emit the canonical `data: [DONE]\n\n` sentinel; chunked responses must end with the zero-length chunk `0\r\n\r\n`.
- Streaming callbacks run on worker threads and must respect cancellation/deadlines before emitting tokens.
- Deduplication caching must be skipped for streaming requests to avoid serving partial cached content.
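The framing rules above can be sketched as small formatting helpers. This is an illustrative sketch only, not the actual `llm::StreamingHandler` implementation; the helper names here mirror the constraints but are assumptions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Escape JSON control characters per RFC 8259 (minimal subset shown).
std::string escapeJson(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (static_cast<unsigned char>(c) < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += c;
                }
        }
    }
    return out;
}

// SSE event: "data: " prefix, JSON payload, blank-line terminator (\n\n).
std::string formatSseEvent(const std::string& token) {
    return "data: {\"token\":\"" + escapeJson(token) + "\"}\n\n";
}

// Canonical terminal sentinel for OpenAI-style streams.
std::string formatDoneEvent() { return "data: [DONE]\n\n"; }

// HTTP/1.1 chunked-transfer frame: hex length, CRLF, payload, CRLF.
std::string formatChunkedData(const std::string& body) {
    char len[16];
    std::snprintf(len, sizeof(len), "%zx", body.size());
    return std::string(len) + "\r\n" + body + "\r\n";
}

// Zero-length chunk that terminates a chunked response.
std::string chunkedTerminator() { return "0\r\n\r\n"; }
```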

### Required Interfaces
| Interface | Consumer | Notes |
|-----------|----------|-------|
| `InferenceRequest::stream_callback` | `AsyncInferenceEngine`, `InferenceEngineEnhanced`, HTTP SSE writers | Serial invocation on the producing worker thread; sink must be thread-safe when sharing state. |
| `llm::StreamingHandler::{formatSseEvent, formatDoneEvent, formatChunkedData, makeStreamCallback}` | HTTP layer (SSE endpoints, OpenAI compat adapter) | Static, reentrant helpers; atomic index for single-producer streams. |
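The table's `makeStreamCallback` row can be illustrated with a minimal sketch: build a callback that frames each token as an SSE event and forwards it to a writer, keeping a shared atomic index as the single-producer counter noted above. The payload shape and writer signature are assumptions, not the real API.

```cpp
#include <atomic>
#include <functional>
#include <memory>
#include <string>

// Callback type matching InferenceRequest::stream_callback.
using StreamCallback = std::function<void(const std::string&)>;

// Wraps a writer (e.g. an HTTP SSE sink) in a token callback. The atomic
// index mirrors the "atomic index for single-producer streams" note and
// stays correct even if the producing worker thread changes between tokens.
StreamCallback makeStreamCallback(std::function<void(const std::string&)> write) {
    auto index = std::make_shared<std::atomic<int>>(0);
    return [write, index](const std::string& token) {
        int i = index->fetch_add(1);
        // Token is assumed to be pre-escaped JSON text in this sketch.
        write("data: {\"index\":" + std::to_string(i) +
              ",\"token\":\"" + token + "\"}\n\n");
    };
}
```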

### Implementation Phases
- **Phase 1 — Design / API Contract**
- [x] Expose `InferenceRequest::stream_callback` (`include/llm/llm_plugin_interface.h`) as `std::function<void(const std::string&)>`, invoked serially on the worker thread; sinks must be thread-safe when sharing state and must handle abrupt stop (no further callbacks, possibly without a terminal marker) without throwing.
- [x] Define SSE/chunked framing surface via `StreamingHandler` (JSON escaping, `[DONE]` sentinel, zero-length terminal chunk) to keep engines output-format agnostic.
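The producer-side contract from Phase 1 — check cancellation and the deadline before each token, and on abrupt stop simply cease invoking the callback (no terminal marker is guaranteed) — can be sketched as follows. Everything except `stream_callback` is an illustrative name.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-request state; the real engine's types may differ.
struct StreamContext {
    std::atomic<bool> cancelled{false};
    std::chrono::steady_clock::time_point deadline;
};

// Emits tokens serially on the calling worker thread. Returns true if the
// full sequence was streamed; false on cancellation or deadline expiry, in
// which case no further callbacks occur and no [DONE] sentinel is sent.
bool streamTokens(const std::vector<std::string>& tokens,
                  const std::function<void(const std::string&)>& stream_callback,
                  StreamContext& ctx) {
    for (const auto& tok : tokens) {
        if (ctx.cancelled.load() ||
            std::chrono::steady_clock::now() >= ctx.deadline) {
            return false;  // abrupt stop: sink must tolerate this silently
        }
        stream_callback(tok);  // serial invocation, must not throw through here
    }
    return true;
}
```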

Development

Successfully merging this pull request may close these issues.

Streaming Token Output (SSE / Chunked Response)