…cesses (akin to JupyterHub)
When the pool is at capacity and all workers are busy, _evict_lru() returns without evicting. Previously, get_worker() would unconditionally spawn a new worker anyway, silently exceeding the limit. Now it raises a WorkerError(503) so the caller gets a proper "try again later" instead of unbounded subprocess growth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
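The fixed control flow can be sketched as follows. The names `WorkerError`, `_evict_lru()`, and `get_worker()` follow the commit message; everything else is an illustrative stand-in, not the actual implementation:

```python
import time

class WorkerError(Exception):
    """Error carrying an HTTP-style status code back to the caller."""
    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code

class WorkerPool:
    def __init__(self, max_workers: int = 50):
        self.max_workers = max_workers
        self.workers: dict[str, dict] = {}  # username -> {"busy", "last_used"}

    def _evict_lru(self) -> None:
        # Evict the least-recently-used *idle* worker; if every worker
        # is busy, return without evicting anything.
        idle = [(w["last_used"], u) for u, w in self.workers.items() if not w["busy"]]
        if idle:
            _, lru_user = min(idle)
            del self.workers[lru_user]

    def get_worker(self, username: str) -> dict:
        if username in self.workers:
            self.workers[username]["last_used"] = time.monotonic()
            return self.workers[username]
        if len(self.workers) >= self.max_workers:
            self._evict_lru()
        if len(self.workers) >= self.max_workers:
            # Previously a new worker was spawned anyway, silently
            # exceeding the limit; now the caller gets "try again later".
            raise WorkerError(503, "worker pool at capacity")
        worker = {"busy": False, "last_used": time.monotonic()}
        self.workers[username] = worker
        return worker
```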
WorkerPool.__init__ was hardcoding max_workers=50 and idle_timeout=300 instead of reading from Settings.worker_pool_max_workers and worker_pool_idle_timeout. The eviction loop sleep is now also capped at the idle_timeout so shorter timeouts are honored promptly.
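In sketch form (the `Settings` dataclass is a stand-in for the project's actual settings object; the field names come from the commit message, the 60s poll interval is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Settings:
    worker_pool_max_workers: int = 50
    worker_pool_idle_timeout: float = 300.0

class WorkerPool:
    EVICT_LOOP_SLEEP = 60.0  # illustrative default poll interval

    def __init__(self, settings: Settings):
        # Read limits from settings instead of hardcoding 50 / 300
        self.max_workers = settings.worker_pool_max_workers
        self.idle_timeout = settings.worker_pool_idle_timeout
        # Cap the eviction-loop sleep at the idle timeout so that
        # short timeouts are honored promptly
        self.evict_sleep = min(self.EVICT_LOOP_SLEEP, self.idle_timeout)
```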
Two fd leak paths: (1) if an exception occurs mid-receive, any fds already collected from SCM_RIGHTS ancillary data were never closed; (2) if multiple fds arrived (unlikely but possible), only fds[0] was wrapped and the rest leaked. Now all fds are closed on error, and extra fds beyond the first are closed on success.
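The leak-safe receive pattern looks roughly like this (the function name and framing are assumptions for illustration, not the project's actual API; the SCM_RIGHTS parsing follows the standard `socket.recvmsg` recipe):

```python
import array
import os
import socket

def recv_fd(sock: socket.socket, msglen: int = 4096):
    """Receive a message plus at most one fd via SCM_RIGHTS.

    All collected fds are closed if an error interrupts the receive,
    and any fds beyond the first are closed on success.
    """
    fds: list[int] = []
    try:
        msg, ancdata, flags, addr = sock.recvmsg(
            msglen, socket.CMSG_LEN(4 * array.array("i").itemsize)
        )
        for cmsg_level, cmsg_type, cmsg_data in ancdata:
            if cmsg_level == socket.SOL_SOCKET and cmsg_type == socket.SCM_RIGHTS:
                a = array.array("i")
                # Truncate to a whole number of ints before parsing
                a.frombytes(cmsg_data[: len(cmsg_data) - (len(cmsg_data) % a.itemsize)])
                fds.extend(a)
    except Exception:
        for fd in fds:
            os.close(fd)  # leak path (1): close anything already collected
        raise
    for fd in fds[1:]:
        os.close(fd)  # leak path (2): close unexpected extra fds
    return msg, (fds[0] if fds else None)
```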
These handlers were missing try/except for FileNotFoundError and PermissionError, unlike all neighboring file action handlers. Exceptions would propagate to the worker main loop's generic handler and return 500 instead of proper 404/403. Also add status_code to _get_filestore error responses for consistency with other handlers.
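The error-mapping pattern the neighboring handlers use can be sketched like this (the handler name and response shape are illustrative, not the actual code):

```python
def handle_read_file(path: str) -> dict:
    """Map filesystem errors to proper status codes instead of
    letting them bubble up to the generic 500 handler."""
    try:
        with open(path, "rb") as f:
            return {"status_code": 200, "content": f.read()}
    except FileNotFoundError:
        return {"status_code": 404, "error": f"File not found: {path}"}
    except PermissionError:
        return {"status_code": 403, "error": f"Permission denied: {path}"}
```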
_spawn_worker() was using grp.getgrall() to build the child process's supplementary groups. On LDAP/NSS-backed systems, getgrall() may not enumerate all effective groups, causing workers to lose access to files that the user can normally reach. os.getgrouplist() queries NSS directly and matches the approach used by user_context.py.
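A sketch of the fixed approach (the helper name is illustrative): `os.getgrouplist()` asks NSS for the user's full group membership, including LDAP-backed groups, whereas scanning `grp.getgrall()` only sees whatever local group enumeration happens to return.

```python
import os
import pwd

def supplementary_gids(username: str) -> list[int]:
    """Build the gid list for the child process via NSS."""
    primary_gid = pwd.getpwnam(username).pw_gid
    # Includes primary_gid plus all supplementary groups NSS knows about
    return os.getgrouplist(username, primary_gid)
```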
Bare except: catches SystemExit and KeyboardInterrupt, preventing clean shutdown. Use except Exception: instead.
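The difference matters because `SystemExit` and `KeyboardInterrupt` derive from `BaseException`, not `Exception`. A minimal illustration (the helper is hypothetical):

```python
def safe_call(fn):
    """Run fn, swallowing ordinary errors only.

    `except Exception` does not catch SystemExit or KeyboardInterrupt,
    so shutdown signals still propagate; a bare `except:` would swallow
    them too.
    """
    try:
        return fn()
    except Exception:
        return None
```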
The json.loads(model.model_dump_json()) pattern serializes to a JSON string then immediately parses it back to a dict. model_dump(mode='json') produces the same dict directly without the string round-trip, which matters for directory listings with many files.
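With Pydantic v2, the two spellings produce the same JSON-compatible dict (the model here is illustrative; the real listings use the project's own schemas):

```python
import json
from datetime import datetime

from pydantic import BaseModel

class FileInfo(BaseModel):
    name: str
    modified: datetime

info = FileInfo(name="a.txt", modified=datetime(2024, 1, 1))

# Old pattern: dict -> JSON string -> dict again
via_string = json.loads(info.model_dump_json())

# Direct: JSON-compatible dict, no intermediate string
direct = info.model_dump(mode="json")

assert via_string == direct
```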
In dev/test mode (no worker pool), exceptions from action handlers would propagate unhandled to the generic exception handler. In production, the worker subprocess catches these and returns error dicts. Now dev mode also catches and logs action handler exceptions, matching the production error behavior more closely.
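A sketch of the dev-mode dispatch (the handler registry and response shape are assumptions based on the summary, not the actual code):

```python
import logging

logger = logging.getLogger(__name__)

def dispatch_dev(handlers: dict, name: str, params: dict) -> dict:
    """Run an action handler in-process, but mirror the worker's error
    contract: catch, log, and return an error dict instead of letting
    the exception reach the generic exception handler."""
    try:
        return handlers[name](**params)
    except Exception as e:
        logger.exception("Action %s failed", name)
        return {"error": str(e), "status_code": 500}
```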
_action_get_job_file_paths and _action_get_service_url were calling get_settings() to get the db_url while every other handler uses ctx.db_url. Use ctx.db_url for consistency with the rest of the worker action handlers.
The worker socket had no timeout, so if a worker hung mid-response the run_in_executor thread would block indefinitely. Add a 120s timeout so the socket raises and the request fails rather than silently tying up a thread forever.
This is an experimental branch which refactors Fileglancer to use user-specific worker processes for all setuid-based access, including all file operations and cluster tasks. It abandons the use of seteuid in the Uvicorn workers, which has been problematic: any async/await code can break that model, and it's easy to introduce hard-to-debug issues by accident. The new model is closer to JupyterHub (as we had in the beginning), except that workers are managed automatically and don't require users to manually start a server.
I've gone through code reviews with both Claude and Codex (they caught different issues). I've also tested it with a personal dev deployment on fileglancer-dev, but the code is quite complex and low-level and needs more testing and validation.
Claude PR Summary
Refactor all user-scoped file access to run on per-user persistent worker subprocesses. The main Uvicorn process never calls seteuid/setegid/setgroups; instead, each user gets a long-lived worker subprocess that runs with their real UID/GID/groups, communicating over a Unix socketpair with length-prefixed JSON. File descriptors for opened files are passed back to the main process via SCM_RIGHTS so streaming responses work without copying data through the worker.

- `worker_pool.py`: Manages per-user workers with LRU eviction, idle timeout, configurable pool size, and graceful shutdown. Workers are spawned on demand and cleaned up automatically.
- `user_worker.py`: Handles all user-scoped operations (file I/O, cluster jobs, git ops, SSH keys, S3 proxy) in a synchronous request/response loop. Action handlers are registered in a dispatch table.
- `server.py`: All endpoints that previously used `EffectiveUserContext` now dispatch through `_worker_exec()`. In dev/test mode (no `use_access_flags`), actions run directly in-process with no subprocess overhead.
- `filestore.py`: New `_stream_contents` and `_stream_range` static methods that stream from an already-opened file handle, used by the server to stream file content from worker-passed file descriptors.
- `test_worker.py`: IPC protocol tests, UserWorker integration tests, async execute tests (including concurrency serialization), action handler tests, and worker main loop tests.

Notable fixes included in this branch

- `max_workers` cap: raise 503 when the pool is full instead of unbounded subprocess growth
- `os.getgrouplist()` for supplementary groups (fixes missing groups on LDAP/NSS systems)

@JaneliaSciComp/fileglancer @StephanPreibisch
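The length-prefixed JSON framing described above can be sketched as follows (helper names are illustrative, not the project's actual API):

```python
import json
import socket
import struct

def send_msg(sock: socket.socket, obj: dict) -> None:
    """Frame a JSON message as a 4-byte big-endian length plus payload."""
    payload = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, raising if the peer closes mid-message."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("worker socket closed")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> dict:
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length))
```

Length-prefixing keeps message boundaries explicit on a stream socket, so the worker's synchronous request/response loop never has to guess where one JSON document ends and the next begins.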