[WIP] feat: add watchdog for runtime session handling (#187)
Conversation
Pull request overview
Adds a session “watchdog” mechanism to the CLI runtime so it can periodically detect and clean up unhealthy sessions (dead tasks / idle sessions), and introduces session health reporting to support that monitoring.
Changes:
- Introduce `SessionWatchdog` background task with periodic session health checks and removal hooks.
- Add `SessionHealthInfo` and activity tracking to session servers, and integrate the watchdog lifecycle into `UiPathMcpRuntime`.
- Add unit tests for watchdog behavior and update `.gitignore` for common IDE folders.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/test_watchdog.py` | Adds unit tests covering watchdog detection logic and lifecycle. |
| `src/uipath_mcp/_cli/_runtime/_watchdog.py` | New watchdog implementation and default timeout/interval constants. |
| `src/uipath_mcp/_cli/_runtime/_session.py` | Adds `SessionHealthInfo`, transport identifiers, and last-activity tracking. |
| `src/uipath_mcp/_cli/_runtime/_runtime.py` | Implements watchdog provider methods and starts/stops the watchdog with the runtime lifecycle. |
| `.gitignore` | Ignores `.idea/`, `.vscode/`, and `.claude/`. |
```python
tracer = trace.get_tracer(__name__)


WATCHDOG_CHECK_INTERVAL = 30  # seconds
SESSION_IDLE_TIMEOUT = 60  # 1 minute for testing purposes - to be changed to 1 hour
```
SESSION_IDLE_TIMEOUT is set to 60 seconds with a note "for testing purposes" but this value is used as the default in production (UiPathMcpRuntime constructs SessionWatchdog(self) without overriding idle_timeout). This will aggressively terminate idle sessions after 1 minute. Please either change the default to the intended production timeout (e.g., 1 hour), or make the timeout configurable and ensure runtime passes the correct value.
Suggested change:

```python
SESSION_IDLE_TIMEOUT = 3600  # 1 hour default idle timeout for production
```
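One way to reconcile the two needs is to keep a production default while letting tests pass a shorter value explicitly or via an environment variable. A minimal sketch, assuming a `provider` argument and a hypothetical `UIPATH_SESSION_IDLE_TIMEOUT` variable (neither is the PR's actual API):

```python
import os

WATCHDOG_CHECK_INTERVAL = 30   # seconds
SESSION_IDLE_TIMEOUT = 3600    # 1 hour production default

class SessionWatchdog:
    """Sketch only: resolve the idle timeout from an explicit argument,
    a hypothetical environment override, or the production default."""

    def __init__(self, provider, idle_timeout=None):
        self._provider = provider
        if idle_timeout is None:
            # UIPATH_SESSION_IDLE_TIMEOUT is an assumed variable name, so
            # tests can shorten the timeout without patching constants.
            idle_timeout = float(
                os.environ.get("UIPATH_SESSION_IDLE_TIMEOUT", SESSION_IDLE_TIMEOUT)
            )
        self.idle_timeout = idle_timeout
```

With this shape, `UiPathMcpRuntime` can keep constructing `SessionWatchdog(self)` with defaults while tests pass `idle_timeout=60` directly.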
```python
session_server = self._session_servers.pop(session_id, None)
if session_server is not None:
    logger.warning(
        f"Removing session {session_id}: {reason}"
    )
    try:
        await session_server.stop()
    except Exception:
        logger.error(
            f"Error stopping session {session_id} during watchdog removal",
            exc_info=True,
        )
    await self._close_session_on_server(session_id)
```
remove_session() pops the session from _session_servers before awaiting session_server.stop() and the server-side DELETE. While those awaits are in-flight, _handle_signalr_message() can observe the session_id missing and create/start a new session server for the same session_id, racing with the close and potentially causing flapping. Consider preventing re-creation while a session is being removed (e.g., keep an entry with a 'closing' marker, or guard with an async lock, or only pop after the stop/close completes).
Suggested change:

```python
session_server = self._session_servers.get(session_id)
if session_server is not None:
    logger.warning(
        f"Removing session {session_id}: {reason}"
    )
    try:
        try:
            await session_server.stop()
        except Exception:
            logger.error(
                f"Error stopping session {session_id} during watchdog removal",
                exc_info=True,
            )
        await self._close_session_on_server(session_id)
    finally:
        # Only remove the session from the registry after shutdown completes
        self._session_servers.pop(session_id, None)
```
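The review also mentions a "closing marker" as an alternative guard. A rough sketch of that idea, with all names illustrative rather than taken from the PR: the session id stays in the registry, mapped to a sentinel, until teardown completes, so a concurrent message handler sees the id as occupied and does not start a new server for it.

```python
import asyncio

_CLOSING = object()  # sentinel kept in the registry while teardown is in flight

class RuntimeSketch:
    """Sketch of the 'closing marker' alternative; not the PR's actual code."""

    def __init__(self):
        self._session_servers = {}

    def can_create_session(self, session_id):
        # A message handler would check this before creating a server;
        # it is False both for live sessions and sessions being closed.
        return session_id not in self._session_servers

    async def remove_session(self, session_id, reason):
        server = self._session_servers.get(session_id)
        if server is None or server is _CLOSING:
            return  # already gone, or another task is closing it
        self._session_servers[session_id] = _CLOSING  # block re-creation
        try:
            await server.stop()
        finally:
            self._session_servers.pop(session_id, None)
```

Compared with popping only after the stop completes, the sentinel also makes concurrent `remove_session` calls for the same id idempotent.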
```python
async def _cleanup(self) -> None:
    """Clean up all resources."""
    if self._cleanup_done:
        return
    self._cleanup_done = True

    await self._on_runtime_abort()

    if self._token_refresher:
        await self._token_refresher.stop()

    if self._keep_alive_task:
        self._keep_alive_task.cancel()
        try:
            await self._keep_alive_task
        except asyncio.CancelledError:
            pass

    if self._watchdog:
        await self._watchdog.stop()
        self._watchdog = None
```
In _cleanup(), the watchdog is stopped only after awaiting _on_runtime_abort(), token refresher shutdown, and keep-alive cancellation. During those awaits, the watchdog can still run and concurrently mutate/stop sessions while cleanup is also stopping sessions, creating races and noisy logs. Consider stopping the watchdog immediately after setting _cleanup_done (or at least before awaiting long-running calls) to make cleanup deterministic.
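The suggested ordering can be sketched as follows — a minimal model with illustrative names, not the PR's actual `_cleanup`: stop the watchdog immediately after the reentrancy guard, before any other await, so nothing else mutates sessions while the rest of cleanup runs.

```python
import asyncio

class RuntimeSketch:
    """Sketch of the suggested cleanup ordering."""

    def __init__(self, watchdog, events):
        self._cleanup_done = False
        self._watchdog = watchdog
        self._events = events  # records ordering for the demo

    async def _cleanup(self):
        if self._cleanup_done:
            return
        self._cleanup_done = True

        # Stop the watchdog first: after this await returns, it can no
        # longer race the session teardown below.
        if self._watchdog:
            await self._watchdog.stop()
            self._watchdog = None

        self._events.append("rest of cleanup")
```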
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.
```python
WATCHDOG_CHECK_INTERVAL = 30  # seconds
SESSION_IDLE_TIMEOUT = 60  # 1 minute for testing purposes - to be changed to 1 hour
```
SESSION_IDLE_TIMEOUT defaults to 60 seconds (and UiPathMcpRuntime starts the watchdog with defaults). That will aggressively tear down sessions in production after 1 minute of inactivity and can also prematurely cancel sandboxed runtimes via remove_session(). Please bump the default to the intended value (e.g., 3600s) and/or make it configurable via runtime/server config so tests can still use a shorter timeout without affecting real runs.
```python
health = s.get_health_info()
runtime_sessions[sid] = {
    "task_done": health.task_done,
    "active_requests": len(s._active_requests),
```
_on_keep_alive_response() reaches into BaseSessionServer's private _active_requests to log active request counts. This tightly couples the runtime to session internals and makes refactors risky (especially since SessionHealthInfo was introduced for health reporting). Consider exposing an official API for this (e.g., add an active_request_count field back to SessionHealthInfo or add a public method/property on BaseSessionServer) and use that instead of accessing _active_requests directly.
Suggested change:

```python
"active_requests": health.active_request_count,
```
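The public-API shape the comment asks for could look like this — a minimal sketch assuming illustrative field and class names, not the PR's actual definitions: the count travels in the health snapshot, so the only code touching `_active_requests` lives inside the session server itself.

```python
from dataclasses import dataclass

@dataclass
class SessionHealthInfo:
    """Sketch: carry the request count in the health snapshot so callers
    never reach into session internals."""
    task_done: bool
    active_request_count: int

class BaseSessionServerSketch:
    def __init__(self):
        self._active_requests = {}  # private: request id -> in-flight state

    def get_health_info(self) -> SessionHealthInfo:
        # The only reader of _active_requests; everyone else consumes
        # the public snapshot.
        return SessionHealthInfo(
            task_done=False,
            active_request_count=len(self._active_requests),
        )
```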
Added `SessionWatchdog` that periodically monitors session health and cleans up dead or idle sessions. Sessions accumulate locally after the server removes them (e.g. when an agent finishes); the watchdog detects completed tasks and idle timeouts, then tears down the local session.

Also did the following refactors:

- Unified cleanup into a single `remove_session` method, replacing 4 inconsistent cleanup paths (`_handle_signalr_session_closed`, `remove_session`, `_cleanup` loop, `_monitor_http_server_process`)
- Removed `active_request_count` from `SessionHealthInfo`
- Added a `_cleanup_done` guard to `_handle_signalr_session_closed` to prevent racing with shutdown
- Removed the `_run_server` finally block cancelling `_keep_alive_task` redundantly (now only cancelled by `_cleanup`)
- Extracted the `on_keep_alive_response` closure into an `_on_keep_alive_response` method
- Added a `_get_server_env` helper to deduplicate env var merging (also fixed `_register` mutating `self._server.env` by not copying)
- Fixed the `_on_runtime_abort` docstring and log message incorrectly referring to "SignalR" (it's an HTTP POST)
- Replaced the bare `ValueError` in `_wait_for_http_server_ready` with `UiPathMcpRuntimeError` for consistency
- Changed `{str(e)}` to `{e}` in f-strings

Development Package

Use `uipath pack --nolock` to get the latest dev build from this PR (requires version range).
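The watchdog behavior described above can be sketched roughly as a background task that wakes up on an interval, asks each session for its health, and removes sessions whose task finished or that have sat idle too long. The provider protocol below (a `sessions` mapping plus `remove_session`) is an assumption for illustration, not the PR's actual interface.

```python
import asyncio
import time

class SessionWatchdogSketch:
    """Rough sketch of the described watchdog loop; names are illustrative."""

    def __init__(self, provider, check_interval=30, idle_timeout=3600):
        self._provider = provider
        self._interval = check_interval
        self._idle_timeout = idle_timeout
        self._task = None

    async def start(self):
        self._task = asyncio.create_task(self._run())

    async def stop(self):
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

    async def _run(self):
        while True:
            await asyncio.sleep(self._interval)
            await self.check_sessions()

    async def check_sessions(self):
        now = time.monotonic()
        # Copy the items so remove_session can mutate the mapping safely.
        for sid, server in list(self._provider.sessions.items()):
            health = server.get_health_info()
            if health.task_done:
                await self._provider.remove_session(sid, "task completed")
            elif now - health.last_activity > self._idle_timeout:
                await self._provider.remove_session(sid, "idle timeout")
```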