
fix serve_multiplexed_model_id #120

Draft
Yunnglin wants to merge 7 commits into main from fix_request

Conversation

@Yunnglin
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information that belongs to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the Twinkle server's handling of the serve_multiplexed_model_id header, which was being inconsistently interpreted due to HTTP proxy normalization. The core solution involves introducing a robust patching mechanism for Ray Serve to correctly process these headers, alongside standardizing the header's naming convention across the codebase. Additionally, it includes minor bug fixes for model output access and adds request lifecycle hooks to various API endpoints for better extensibility and monitoring.

Highlights

  • Ray Serve Header Normalization: Implemented a patching mechanism for Ray Serve to correctly handle HTTP header normalization, specifically converting hyphens to underscores for the Serve-Multiplexed-Model-Id header, which was previously causing issues with proxies.
  • Consistent Header Naming: Standardized the serve_multiplexed_model_id header key to Serve-Multiplexed-Model-Id across both the server gateway and the HTTP client utilities for improved consistency and compatibility.
  • Ray Initialization with Patches: Updated the Ray initialization process to ensure that the newly introduced Ray Serve patches are applied not only in the main process but also in all worker processes via runtime_env configuration.
  • Robust Model Output Handling: Enhanced the transformers_model backend to safely access 'logits' from model outputs using .get() with a None check, preventing potential KeyError exceptions.
  • API Request Lifecycle Hooks: Integrated _on_request_start calls into numerous API endpoints within twinkle_handlers.py to allow for consistent pre-request processing.
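The header-normalization fix in the first highlight can be sketched as a small helper. This is an illustrative sketch, not the PR's actual Ray Serve patch: the function name and the plain-dict header representation are assumptions, while the underscore key `serve_multiplexed_model_id` is the form the discussion says Ray Serve looks up internally.

```python
# Illustrative sketch (not the PR's patch): proxies often rewrite HTTP
# header names, e.g. Serve-Multiplexed-Model-Id, so a server-side shim
# can map any casing/hyphenation variant back to the underscore key.

RAY_SERVE_HEADER = "serve_multiplexed_model_id"

def normalize_multiplex_header(headers: dict) -> dict:
    """Return a copy of `headers` with any variant of the multiplexed
    model-id header renamed to the canonical underscore form."""
    normalized = {}
    for key, value in headers.items():
        canonical = key.lower().replace("-", "_")
        if canonical == RAY_SERVE_HEADER:
            normalized[RAY_SERVE_HEADER] = value
        else:
            normalized[key] = value
    return normalized

print(normalize_multiplex_header({"Serve-Multiplexed-Model-Id": "lora-42"}))
# {'serve_multiplexed_model_id': 'lora-42'}
```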

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue with serve_multiplexed_model_id by correcting the HTTP header name to the canonical Serve-Multiplexed-Model-Id format and introducing a patch for Ray Serve to handle header normalization from proxies. The changes also improve robustness by using .get() for dictionary access and add a request start hook to several endpoints. My review focuses on improving maintainability by refactoring duplicated code in the endpoint handlers. The core fix and the patching mechanism appear solid.
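The `.get()`-based robustness fix mentioned above amounts to the following pattern. This is a minimal sketch: `extract_logits` and the dict-shaped `outputs` are hypothetical, not the backend's actual code.

```python
# Illustrative pattern for the robustness fix: access 'logits' with
# .get() and an explicit None check instead of outputs['logits'],
# which would raise KeyError when the key is absent.

def extract_logits(outputs: dict):
    """Return the 'logits' entry, failing with a clear error if missing."""
    logits = outputs.get("logits")
    if logits is None:
        raise ValueError("model output does not contain 'logits'")
    return logits
```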

@Yunnglin
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant set of changes, primarily focused on fixing an HTTP header normalization issue with serve_multiplexed_model_id by patching Ray Serve. It also includes a major and beneficial refactoring of the server state management and resource lifecycle from synchronous to asynchronous, which improves the architecture.

My review has identified a critical bug in the new lifecycle management for processors that will cause a TypeError at runtime. I've also pointed out a high-severity issue regarding a potential resource leak in the sampler application due to the removal of adapter lifecycle management. Additionally, I've suggested a medium-severity refactoring to reduce code duplication in the model endpoint handlers. Please address the critical and high-severity issues before merging.

Comment on lines +127 to +135
```python
def _on_processor_expired(self, processor_id: str) -> None:
    """Hook called when a processor's session expires.

    Must be overridden by inheriting classes.

    Raises:
        NotImplementedError: If not overridden.
    """
    raise NotImplementedError(f'_on_processor_expired must be implemented by {self.__class__.__name__}')
```
Contributor


critical

The abstract method _on_processor_expired is defined as a synchronous function, but it is being awaited in the base SessionResourceMixin's cleanup loop (await self._on_resource_expired(resource_id), which in turn calls await self._on_processor_expired(resource_id)). This will lead to a TypeError: object NoneType can't be used in 'await' expression at runtime when a processor expires. The implementation in src/twinkle/server/processor/app.py is also synchronous. Both the abstract method definition here and its implementation should be async def.

Suggested change

```diff
-def _on_processor_expired(self, processor_id: str) -> None:
+async def _on_processor_expired(self, processor_id: str) -> None:
     """Hook called when a processor's session expires.

     Must be overridden by inheriting classes.

     Raises:
         NotImplementedError: If not overridden.
     """
     raise NotImplementedError(f'_on_processor_expired must be implemented by {self.__class__.__name__}')
```
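The failure mode described in this comment can be reproduced in isolation. A minimal sketch with illustrative class and method names (not the PR's actual code): awaiting a plain function's return value of None raises TypeError at runtime.

```python
import asyncio

# Minimal repro of the bug described above: a synchronous hook being
# awaited from an async cleanup path. Names here are illustrative.

class Mixin:
    def _on_processor_expired(self, processor_id: str) -> None:  # sync: the bug
        pass

    async def cleanup(self, processor_id: str) -> None:
        # Awaiting the None returned by a plain function raises TypeError.
        await self._on_processor_expired(processor_id)

try:
    asyncio.run(Mixin().cleanup("proc-1"))
except TypeError as exc:
    print(exc)  # object NoneType can't be used in 'await' expression
```

Declaring the hook `async def` (as the suggestion above does) makes the `await` valid.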



```diff
-class SamplerManagement(TaskQueueMixin, AdapterManagerMixin):
+class SamplerManagement(TaskQueueMixin):
```
Contributor


high

The AdapterManagerMixin has been removed from SamplerManagement, and the corresponding _init_adapter_manager and _on_adapter_expired logic has also been removed. This means that LoRA adapters added to the sampler via the /twinkle/add_adapter_to_sampler endpoint are no longer session-bound or automatically cleaned up upon session expiration. This could lead to a resource leak on the sampler replicas as adapters accumulate over time without being unloaded. Was this intentional? If not, AdapterManagerMixin (or the new SessionResourceMixin) should be restored to manage the adapter lifecycle on the sampler.
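For reference, the kind of session-bound, TTL-based cleanup the comment asks to restore could look roughly like this. All names here (`AdapterLifecycle`, `touch`, `unload_adapter`, the TTL default) are hypothetical; the real mechanism would live in `AdapterManagerMixin` or the new `SessionResourceMixin`.

```python
import asyncio
import time

# Hypothetical sketch of session-bound adapter cleanup: adapters are
# tracked with a last-seen timestamp and unloaded once idle past a TTL.

class AdapterLifecycle:
    def __init__(self, ttl: float = 1800.0):
        self.ttl = ttl
        self._last_seen: dict = {}

    def touch(self, adapter_id: str) -> None:
        """Record activity for an adapter, keeping its session alive."""
        self._last_seen[adapter_id] = time.monotonic()

    def unload_adapter(self, adapter_id: str) -> None:
        print(f"unloading {adapter_id}")  # would call into the sampler replica

    async def reap_expired(self) -> list:
        """Unload and forget every adapter idle for longer than the TTL."""
        now = time.monotonic()
        expired = [a for a, t in self._last_seen.items() if now - t > self.ttl]
        for adapter_id in expired:
            self.unload_adapter(adapter_id)
            del self._last_seen[adapter_id]
        return expired

lifecycle = AdapterLifecycle(ttl=5.0)
lifecycle.touch("lora-a")
lifecycle._last_seen["lora-a"] -= 10  # simulate an idle session
print(asyncio.run(lifecycle.reap_expired()))  # ['lora-a']
```

Without something like this on the sampler side, adapters added via /twinkle/add_adapter_to_sampler accumulate until the replica restarts.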

```python
@app.post('/twinkle/create', response_model=types.CreateResponse)
async def create(request: Request, body: types.CreateRequest,
                 self: ModelManagement = Depends(self_fn)) -> types.CreateResponse:
    await self._on_request_start(request)
```
Contributor


medium

The call await self._on_request_start(request) is repeated at the beginning of almost every endpoint in this file. This introduces code duplication and makes the handlers more verbose. To improve maintainability, you could refactor this into a FastAPI dependency. This would centralize the request startup logic and clean up the endpoint definitions.
For example:

```python
async def startup_dependency(request: Request, self: ModelManagement = Depends(self_fn)):
    await self._on_request_start(request)
    return self

@app.post('/twinkle/create', response_model=types.CreateResponse)
async def create(request: Request, body: types.CreateRequest,
                 self: ModelManagement = Depends(startup_dependency)) -> types.CreateResponse:
    # No need for the explicit await here anymore
    return types.CreateResponse()
```

