Skip to content

[WIP] feat: add watchdog for runtime session handling#187

Open
radugheo wants to merge 3 commits intomainfrom
feat/add-watchdog-for-sessions
Open

[WIP] feat: add watchdog for runtime session handling#187
radugheo wants to merge 3 commits intomainfrom
feat/add-watchdog-for-sessions

Conversation

@radugheo
Copy link
Collaborator

@radugheo radugheo commented Mar 23, 2026

Added SessionWatchdog that periodically monitors session health and cleans up dead or idle sessions. Sessions accumulate locally after the server removes them (e.g. when an agent finishes); the watchdog detects completed tasks and idle timeouts, then tears down the local session.
Also did the following refactors:

  • Unified session removal into a single remove_session method, replacing 4 inconsistent cleanup paths (_handle_signalr_session_closed, remove_session, _cleanup loop, _monitor_http_server_process)
  • Removed unused active_request_count from SessionHealthInfo
  • Added _cleanup_done guard to _handle_signalr_session_closed to prevent racing with shutdown
  • Fixed _run_server finally block cancelling _keep_alive_task redundantly (now only cancelled by _cleanup)
  • Extracted on_keep_alive_response closure into _on_keep_alive_response method
  • Extracted _get_server_env helper to deduplicate env var merging (also fixed _register mutating self._server.env by not copying)
  • Fixed _on_runtime_abort docstring and log message incorrectly referring to "SignalR" (it's an HTTP POST)
  • Replaced _wait_for_http_server_ready bare ValueError with UiPathMcpRuntimeError for consistency
  • Normalized {str(e)} to {e} in f-strings

Development Package

  • Use uipath pack --nolock to get the latest dev build from this PR (requires version range).
  • Add this package as a dependency in your pyproject.toml:
[project]
dependencies = [
  # Exact version:
  "uipath-mcp==0.1.4.dev1001870321",

  # Any version from PR
  "uipath-mcp>=0.1.4.dev1001870000,<0.1.4.dev1001880000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath-mcp = { index = "testpypi" }

[tool.uv]
override-dependencies = [
    "uipath-mcp>=0.1.4.dev1001870000,<0.1.4.dev1001880000",
]

@radugheo radugheo marked this pull request as ready for review March 23, 2026 13:45
Copilot AI review requested due to automatic review settings March 23, 2026 13:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a session “watchdog” mechanism to the CLI runtime so it can periodically detect and clean up unhealthy sessions (dead tasks / idle sessions), and introduces session health reporting to support that monitoring.

Changes:

  • Introduce SessionWatchdog background task with periodic session health checks and removal hooks.
  • Add SessionHealthInfo + activity tracking to session servers, and integrate watchdog lifecycle into UiPathMcpRuntime.
  • Add unit tests for watchdog behavior and update .gitignore for common IDE folders.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_watchdog.py Adds unit tests covering watchdog detection logic and lifecycle.
src/uipath_mcp/_cli/_runtime/_watchdog.py New watchdog implementation and default timeout/interval constants.
src/uipath_mcp/_cli/_runtime/_session.py Adds SessionHealthInfo, transport identifiers, and last-activity tracking.
src/uipath_mcp/_cli/_runtime/_runtime.py Implements watchdog provider methods and starts/stops watchdog with runtime lifecycle.
.gitignore Ignores .idea/, .vscode/, and .claude/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tracer = trace.get_tracer(__name__)

WATCHDOG_CHECK_INTERVAL = 30 # seconds
SESSION_IDLE_TIMEOUT = 60 # 1 minute for testing purposes - to be changed to 1 hour
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SESSION_IDLE_TIMEOUT is set to 60 seconds with a note "for testing purposes" but this value is used as the default in production (UiPathMcpRuntime constructs SessionWatchdog(self) without overriding idle_timeout). This will aggressively terminate idle sessions after 1 minute. Please either change the default to the intended production timeout (e.g., 1 hour), or make the timeout configurable and ensure runtime passes the correct value.

Suggested change
SESSION_IDLE_TIMEOUT = 60 # 1 minute for testing purposes - to be changed to 1 hour
SESSION_IDLE_TIMEOUT = 3600 # 1 hour default idle timeout for production

Copilot uses AI. Check for mistakes.
Comment on lines +137 to +149
session_server = self._session_servers.pop(session_id, None)
if session_server is not None:
logger.warning(
f"Removing session {session_id}: {reason}"
)
try:
await session_server.stop()
except Exception:
logger.error(
f"Error stopping session {session_id} during watchdog removal",
exc_info=True,
)
await self._close_session_on_server(session_id)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_session() pops the session from _session_servers before awaiting session_server.stop() and the server-side DELETE. While those awaits are in-flight, _handle_signalr_message() can observe the session_id missing and create/start a new session server for the same session_id, racing with the close and potentially causing flapping. Consider preventing re-creation while a session is being removed (e.g., keep an entry with a 'closing' marker, or guard with an async lock, or only pop after the stop/close completes).

Suggested change
session_server = self._session_servers.pop(session_id, None)
if session_server is not None:
logger.warning(
f"Removing session {session_id}: {reason}"
)
try:
await session_server.stop()
except Exception:
logger.error(
f"Error stopping session {session_id} during watchdog removal",
exc_info=True,
)
await self._close_session_on_server(session_id)
session_server = self._session_servers.get(session_id)
if session_server is not None:
logger.warning(
f"Removing session {session_id}: {reason}"
)
try:
try:
await session_server.stop()
except Exception:
logger.error(
f"Error stopping session {session_id} during watchdog removal",
exc_info=True,
)
await self._close_session_on_server(session_id)
finally:
# Only remove the session from the registry after shutdown completes
self._session_servers.pop(session_id, None)

Copilot uses AI. Check for mistakes.
Comment on lines 347 to +367
async def _cleanup(self) -> None:
"""Clean up all resources."""
if self._cleanup_done:
return
self._cleanup_done = True

await self._on_runtime_abort()

if self._token_refresher:
await self._token_refresher.stop()

if self._keep_alive_task:
self._keep_alive_task.cancel()
try:
await self._keep_alive_task
except asyncio.CancelledError:
pass

if self._watchdog:
await self._watchdog.stop()
self._watchdog = None
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _cleanup(), the watchdog is stopped only after awaiting _on_runtime_abort(), token refresher shutdown, and keep-alive cancellation. During those awaits, the watchdog can still run and concurrently mutate/stop sessions while cleanup is also stopping sessions, creating races and noisy logs. Consider stopping the watchdog immediately after setting _cleanup_done (or at least before awaiting long-running calls) to make cleanup deterministic.

Copilot uses AI. Check for mistakes.
@radugheo radugheo added the build:dev Create a dev build from the pr label Mar 23, 2026
@radugheo radugheo requested a review from Copilot March 23, 2026 17:09
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +13 to +15
WATCHDOG_CHECK_INTERVAL = 30 # seconds
SESSION_IDLE_TIMEOUT = 60 # 1 minute for testing purposes - to be changed to 1 hour

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SESSION_IDLE_TIMEOUT defaults to 60 seconds (and UiPathMcpRuntime starts the watchdog with defaults). That will aggressively tear down sessions in production after 1 minute of inactivity and can also prematurely cancel sandboxed runtimes via remove_session(). Please bump the default to the intended value (e.g., 3600s) and/or make it configurable via runtime/server config so tests can still use a shorter timeout without affecting real runs.

Copilot uses AI. Check for mistakes.
health = s.get_health_info()
runtime_sessions[sid] = {
"task_done": health.task_done,
"active_requests": len(s._active_requests),
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_on_keep_alive_response() reaches into BaseSessionServer's private _active_requests to log active request counts. This tightly couples the runtime to session internals and makes refactors risky (especially since SessionHealthInfo was introduced for health reporting). Consider exposing an official API for this (e.g., add an active_request_count field back to SessionHealthInfo or add a public method/property on BaseSessionServer) and use that instead of accessing _active_requests directly.

Suggested change
"active_requests": len(s._active_requests),
"active_requests": health.active_request_count,

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build:dev Create a dev build from the pr

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants