
fix serve_multiplexed_model_id #120

Draft
Yunnglin wants to merge 7 commits into main from fix_request

Conversation

@Yunnglin
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the Twinkle server's handling of the serve_multiplexed_model_id header, which was being inconsistently interpreted due to HTTP proxy normalization. The core solution involves introducing a robust patching mechanism for Ray Serve to correctly process these headers, alongside standardizing the header's naming convention across the codebase. Additionally, it includes minor bug fixes for model output access and adds request lifecycle hooks to various API endpoints for better extensibility and monitoring.

Highlights

  • Ray Serve Header Normalization: Implemented a patching mechanism for Ray Serve to correctly handle HTTP header normalization, specifically converting hyphens to underscores for the Serve-Multiplexed-Model-Id header, which was previously causing issues with proxies.
  • Consistent Header Naming: Standardized the serve_multiplexed_model_id header key to Serve-Multiplexed-Model-Id across both the server gateway and the HTTP client utilities for improved consistency and compatibility.
  • Ray Initialization with Patches: Updated the Ray initialization process to ensure that the newly introduced Ray Serve patches are applied not only in the main process but also in all worker processes via runtime_env configuration.
  • Robust Model Output Handling: Enhanced the transformers_model backend to safely access 'logits' from model outputs using .get() with a None check, preventing potential KeyError exceptions.
  • API Request Lifecycle Hooks: Integrated _on_request_start calls into numerous API endpoints within twinkle_handlers.py to allow for consistent pre-request processing.
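The header-normalization fix in the first highlight can be sketched as a small helper. This is an illustrative sketch, not the PR's actual Ray Serve patch: the function name and the plain-dict header representation are assumptions, while the underscore key `serve_multiplexed_model_id` is the form the discussion says Ray Serve looks up internally.

```python
# Illustrative sketch (not the PR's patch): proxies often rewrite HTTP
# header names, e.g. Serve-Multiplexed-Model-Id, so a server-side shim
# can map any casing/hyphenation variant back to the underscore key.

RAY_SERVE_HEADER = "serve_multiplexed_model_id"

def normalize_multiplex_header(headers: dict) -> dict:
    """Return a copy of `headers` with any variant of the multiplexed
    model-id header renamed to the canonical underscore form."""
    normalized = {}
    for key, value in headers.items():
        canonical = key.lower().replace("-", "_")
        if canonical == RAY_SERVE_HEADER:
            normalized[RAY_SERVE_HEADER] = value
        else:
            normalized[key] = value
    return normalized

print(normalize_multiplex_header({"Serve-Multiplexed-Model-Id": "lora-42"}))
# {'serve_multiplexed_model_id': 'lora-42'}
```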

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue with serve_multiplexed_model_id by correcting the HTTP header name to the canonical Serve-Multiplexed-Model-Id format and introducing a patch for Ray Serve to handle header normalization from proxies. The changes also improve robustness by using .get() for dictionary access and add a request start hook to several endpoints. My review focuses on improving maintainability by refactoring duplicated code in the endpoint handlers. The core fix and the patching mechanism appear solid.
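The `.get()`-based robustness fix mentioned above amounts to the following pattern. This is a minimal sketch: `extract_logits` and the dict-shaped `outputs` are hypothetical, not the backend's actual code.

```python
# Illustrative pattern for the robustness fix: access 'logits' with
# .get() and an explicit None check instead of outputs['logits'],
# which would raise KeyError when the key is absent.

def extract_logits(outputs: dict):
    """Return the 'logits' entry, failing with a clear error if missing."""
    logits = outputs.get("logits")
    if logits is None:
        raise ValueError("model output does not contain 'logits'")
    return logits
```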

@Yunnglin
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant set of changes, primarily focused on fixing an HTTP header normalization issue with serve_multiplexed_model_id by patching Ray Serve. It also includes a major and beneficial refactoring of the server state management and resource lifecycle from synchronous to asynchronous, which improves the architecture.

My review has identified a critical bug in the new lifecycle management for processors that will cause a TypeError at runtime. I've also pointed out a high-severity issue regarding a potential resource leak in the sampler application due to the removal of adapter lifecycle management. Additionally, I've suggested a medium-severity refactoring to reduce code duplication in the model endpoint handlers. Please address the critical and high-severity issues before merging.

Comment on lines +127 to +135
```python
def _on_processor_expired(self, processor_id: str) -> None:
    """Hook called when a processor's session expires.

    Must be overridden by inheriting classes.

    Raises:
        NotImplementedError: If not overridden.
    """
    raise NotImplementedError(f'_on_processor_expired must be implemented by {self.__class__.__name__}')
```
Contributor


critical

The abstract method _on_processor_expired is defined as a synchronous function, but it is being awaited in the base SessionResourceMixin's cleanup loop (await self._on_resource_expired(resource_id), which in turn calls await self._on_processor_expired(resource_id)). This will lead to a TypeError: object NoneType can't be used in 'await' expression at runtime when a processor expires. The implementation in src/twinkle/server/processor/app.py is also synchronous. Both the abstract method definition here and its implementation should be async def.

Suggested change

```diff
-def _on_processor_expired(self, processor_id: str) -> None:
+async def _on_processor_expired(self, processor_id: str) -> None:
     """Hook called when a processor's session expires.

     Must be overridden by inheriting classes.

     Raises:
         NotImplementedError: If not overridden.
     """
     raise NotImplementedError(f'_on_processor_expired must be implemented by {self.__class__.__name__}')
```
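The failure mode described in this comment can be reproduced in isolation. A minimal sketch with illustrative class and method names (not the PR's actual code): awaiting a plain function's return value of None raises TypeError at runtime.

```python
import asyncio

# Minimal repro of the bug described above: a synchronous hook being
# awaited from an async cleanup path. Names here are illustrative.

class Mixin:
    def _on_processor_expired(self, processor_id: str) -> None:  # sync: the bug
        pass

    async def cleanup(self, processor_id: str) -> None:
        # Awaiting the None returned by a plain function raises TypeError.
        await self._on_processor_expired(processor_id)

try:
    asyncio.run(Mixin().cleanup("proc-1"))
except TypeError as exc:
    print(exc)  # object NoneType can't be used in 'await' expression
```

Declaring the hook `async def` (as the suggestion above does) makes the `await` valid.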



```diff
-class SamplerManagement(TaskQueueMixin, AdapterManagerMixin):
+class SamplerManagement(TaskQueueMixin):
```
Contributor


high

The AdapterManagerMixin has been removed from SamplerManagement, and the corresponding _init_adapter_manager and _on_adapter_expired logic has also been removed. This means that LoRA adapters added to the sampler via the /twinkle/add_adapter_to_sampler endpoint are no longer session-bound or automatically cleaned up upon session expiration. This could lead to a resource leak on the sampler replicas as adapters accumulate over time without being unloaded. Was this intentional? If not, AdapterManagerMixin (or the new SessionResourceMixin) should be restored to manage the adapter lifecycle on the sampler.
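For reference, the kind of session-bound, TTL-based cleanup the comment asks to restore could look roughly like this. All names here (`AdapterLifecycle`, `touch`, `unload_adapter`, the TTL default) are hypothetical; the real mechanism would live in `AdapterManagerMixin` or the new `SessionResourceMixin`.

```python
import asyncio
import time

# Hypothetical sketch of session-bound adapter cleanup: adapters are
# tracked with a last-seen timestamp and unloaded once idle past a TTL.

class AdapterLifecycle:
    def __init__(self, ttl: float = 1800.0):
        self.ttl = ttl
        self._last_seen: dict = {}

    def touch(self, adapter_id: str) -> None:
        """Record activity for an adapter, keeping its session alive."""
        self._last_seen[adapter_id] = time.monotonic()

    def unload_adapter(self, adapter_id: str) -> None:
        print(f"unloading {adapter_id}")  # would call into the sampler replica

    async def reap_expired(self) -> list:
        """Unload and forget every adapter idle for longer than the TTL."""
        now = time.monotonic()
        expired = [a for a, t in self._last_seen.items() if now - t > self.ttl]
        for adapter_id in expired:
            self.unload_adapter(adapter_id)
            del self._last_seen[adapter_id]
        return expired

lifecycle = AdapterLifecycle(ttl=5.0)
lifecycle.touch("lora-a")
lifecycle._last_seen["lora-a"] -= 10  # simulate an idle session
print(asyncio.run(lifecycle.reap_expired()))  # ['lora-a']
```

Without something like this on the sampler side, adapters added via /twinkle/add_adapter_to_sampler accumulate until the replica restarts.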

```python
@app.post('/twinkle/create', response_model=types.CreateResponse)
async def create(request: Request, body: types.CreateRequest,
                 self: ModelManagement = Depends(self_fn)) -> types.CreateResponse:
    await self._on_request_start(request)
```
Contributor


medium

The call await self._on_request_start(request) is repeated at the beginning of almost every endpoint in this file. This introduces code duplication and makes the handlers more verbose. To improve maintainability, you could refactor this into a FastAPI dependency. This would centralize the request startup logic and clean up the endpoint definitions.
For example:

```python
async def startup_dependency(request: Request, self: ModelManagement = Depends(self_fn)):
    await self._on_request_start(request)
    return self

@app.post('/twinkle/create', response_model=types.CreateResponse)
async def create(request: Request, body: types.CreateRequest,
                 self: ModelManagement = Depends(startup_dependency)) -> types.CreateResponse:
    # No need for the explicit await here anymore
    return types.CreateResponse()
```

