Added systemd oom handling and tests #227
Open
mishushakov wants to merge 9 commits into main from systemd-oom-restart
+273 −16
9 commits
2f588c6 added systemd oom handling and tests (mishushakov)
2f9e6e3 format (mishushakov)
ab704c2 reduce restart loop to 1s (mishushakov)
d924184 explicitly stdout/stderr (mishushakov)
8f46d6b sudo (mishushakov)
ba25ccc move health check to avoid deadlocking process (mishushakov)
2f56d04 added changeset (mishushakov)
5a06e0d updated as per comments (mishushakov)
3797662 removed burst in jupyter.service too (mishushakov)

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --- | ||
| '@e2b/code-interpreter-template': patch | ||
| --- | ||
|
|
||
| added systemd to handle process restarts |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| import { expect } from 'vitest' | ||
| import { sandboxTest, wait } from './setup' | ||
|
|
||
| async function waitForHealth(sandbox: any, maxRetries = 10, intervalMs = 100) { | ||
| for (let i = 0; i < maxRetries; i++) { | ||
| try { | ||
| const result = await sandbox.commands.run( | ||
| 'curl -s -o /dev/null -w "%{http_code}" http://0.0.0.0:49999/health' | ||
| ) | ||
| if (result.stdout.trim() === '200') { | ||
| return true | ||
| } | ||
| } catch { | ||
| // Connection refused or other error, retry | ||
| } | ||
| await wait(intervalMs) | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| sandboxTest('restart after jupyter kill', async ({ sandbox }) => { | ||
| // Verify health is up initially | ||
| const initialHealth = await waitForHealth(sandbox) | ||
| expect(initialHealth).toBe(true) | ||
|
|
||
| // Kill the jupyter process as root | ||
| // The command handle may get killed too (since killing jupyter cascades to code-interpreter), | ||
| // so we catch the error. | ||
| try { | ||
| await sandbox.commands.run("kill -9 $(pgrep -f 'jupyter server')", { | ||
| user: 'root', | ||
| }) | ||
| } catch { | ||
| // Expected — the kill cascade may terminate the command handle | ||
| } | ||
|
|
||
| // Wait for systemd to restart both services | ||
| const recovered = await waitForHealth(sandbox, 60, 500) | ||
| expect(recovered).toBe(true) | ||
|
|
||
| // Verify code execution works after recovery | ||
| const result = await sandbox.runCode('x = 1; x') | ||
| expect(result.text).toEqual('1') | ||
| }) | ||
|
|
||
| sandboxTest('restart after code-interpreter kill', async ({ sandbox }) => { | ||
| // Verify health is up initially | ||
| const initialHealth = await waitForHealth(sandbox) | ||
| expect(initialHealth).toBe(true) | ||
|
|
||
| // Kill the code-interpreter process as root | ||
| try { | ||
| await sandbox.commands.run("kill -9 $(pgrep -f 'uvicorn main:app')", { | ||
| user: 'root', | ||
| }) | ||
| } catch { | ||
| // Expected — killing code-interpreter may terminate the command handle | ||
| } | ||
|
|
||
| // Wait for systemd to restart it and health to come back | ||
| const recovered = await waitForHealth(sandbox, 60, 500) | ||
| expect(recovered).toBe(true) | ||
|
|
||
| // Verify code execution works after recovery | ||
| const result = await sandbox.runCode('x = 1; x') | ||
| expect(result.text).toEqual('1') | ||
| }) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| import asyncio | ||
|
|
||
| from e2b_code_interpreter.code_interpreter_async import AsyncSandbox | ||
|
|
||
|
|
||
| async def wait_for_health(sandbox: AsyncSandbox, max_retries=10, interval_ms=100): | ||
| for _ in range(max_retries): | ||
| try: | ||
| result = await sandbox.commands.run( | ||
| 'curl -s -o /dev/null -w "%{http_code}" http://0.0.0.0:49999/health' | ||
| ) | ||
| if result.stdout.strip() == "200": | ||
| return True | ||
| except Exception: | ||
| pass | ||
| await asyncio.sleep(interval_ms / 1000) | ||
| return False | ||
|
|
||
|
|
||
| async def test_restart_after_jupyter_kill(async_sandbox: AsyncSandbox): | ||
| # Verify health is up initially | ||
| assert await wait_for_health(async_sandbox) | ||
|
|
||
| # Kill the jupyter process as root | ||
| # The command handle may get killed too (killing jupyter cascades to code-interpreter), | ||
| # so we catch the error. | ||
| try: | ||
| await async_sandbox.commands.run( | ||
| "kill -9 $(pgrep -f 'jupyter server')", user="root" | ||
| ) | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Wait for systemd to restart both services | ||
| assert await wait_for_health(async_sandbox, 60, 500) | ||
|
|
||
| # Verify code execution works after recovery | ||
| result = await async_sandbox.run_code("x = 1; x") | ||
| assert result.text == "1" | ||
|
|
||
|
|
||
| async def test_restart_after_code_interpreter_kill(async_sandbox: AsyncSandbox): | ||
| # Verify health is up initially | ||
| assert await wait_for_health(async_sandbox) | ||
|
|
||
| # Kill the code-interpreter process as root | ||
| try: | ||
| await async_sandbox.commands.run( | ||
| "kill -9 $(pgrep -f 'uvicorn main:app')", user="root" | ||
| ) | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Wait for systemd to restart it and health to come back | ||
| assert await wait_for_health(async_sandbox, 60, 500) | ||
|
|
||
| # Verify code execution works after recovery | ||
| result = await async_sandbox.run_code("x = 1; x") | ||
| assert result.text == "1" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| import time | ||
|
|
||
| from e2b_code_interpreter.code_interpreter_sync import Sandbox | ||
|
|
||
|
|
||
| def wait_for_health(sandbox: Sandbox, max_retries=10, interval_ms=100): | ||
| for _ in range(max_retries): | ||
| try: | ||
| result = sandbox.commands.run( | ||
| 'curl -s -o /dev/null -w "%{http_code}" http://0.0.0.0:49999/health' | ||
| ) | ||
| if result.stdout.strip() == "200": | ||
| return True | ||
| except Exception: | ||
| pass | ||
| time.sleep(interval_ms / 1000) | ||
| return False | ||
|
|
||
|
|
||
| def test_restart_after_jupyter_kill(sandbox: Sandbox): | ||
| # Verify health is up initially | ||
| assert wait_for_health(sandbox) | ||
|
|
||
| # Kill the jupyter process as root | ||
| # The command handle may get killed too (killing jupyter cascades to code-interpreter), | ||
| # so we catch the error. | ||
| try: | ||
| sandbox.commands.run("kill -9 $(pgrep -f 'jupyter server')", user="root") | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Wait for systemd to restart both services | ||
| assert wait_for_health(sandbox, 60, 500) | ||
|
|
||
| # Verify code execution works after recovery | ||
| result = sandbox.run_code("x = 1; x") | ||
| assert result.text == "1" | ||
|
|
||
|
|
||
| def test_restart_after_code_interpreter_kill(sandbox: Sandbox): | ||
| # Verify health is up initially | ||
| assert wait_for_health(sandbox) | ||
|
|
||
| # Kill the code-interpreter process as root | ||
| try: | ||
| sandbox.commands.run("kill -9 $(pgrep -f 'uvicorn main:app')", user="root") | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Wait for systemd to restart it and health to come back | ||
| assert wait_for_health(sandbox, 60, 500) | ||
|
|
||
| # Verify code execution works after recovery | ||
| result = sandbox.run_code("x = 1; x") | ||
| assert result.text == "1" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| #!/bin/bash | ||
| # Custom health check for Jupyter Server | ||
| # Verifies the server is responsive via the /api/status endpoint | ||
|
|
||
| MAX_RETRIES=50 | ||
| RETRY_INTERVAL=0.2 | ||
|
|
||
| for i in $(seq 1 $MAX_RETRIES); do | ||
| status_code=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:8888/api/status") | ||
|
|
||
| if [ "$status_code" -eq 200 ]; then | ||
| echo "Jupyter Server is healthy" | ||
| exit 0 | ||
| fi | ||
|
|
||
| if [ $((i % 10)) -eq 0 ]; then | ||
| echo "Waiting for Jupyter Server to become healthy... (attempt $i/$MAX_RETRIES)" | ||
| fi | ||
| sleep $RETRY_INTERVAL | ||
| done | ||
|
|
||
| echo "Jupyter Server health check failed after $MAX_RETRIES attempts" | ||
| exit 1 |
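
To get a feel for how long the script above waits before it exits with status 1, the retry budget can be worked out from its two constants. This is only an illustrative sketch: the function name and the `per_request` estimate are assumptions introduced here, not part of the PR.

```python
def healthcheck_budget_seconds(
    max_retries: int, retry_interval: float, per_request: float = 0.0
) -> float:
    """Approximate time the retry loop runs before giving up.

    per_request is a hypothetical estimate of how long each curl call takes.
    The script passes no timeout flag to curl, so this term is not bounded
    by the script itself.
    """
    return max_retries * (retry_interval + per_request)


# Constants from the script: MAX_RETRIES=50, RETRY_INTERVAL=0.2
print(healthcheck_budget_seconds(50, 0.2))       # sleep time only, about 10 s
print(healthcheck_budget_seconds(50, 0.2, 1.0))  # if each request took ~1 s
```

With the values in the script, the sleeps alone add up to roughly ten seconds per health-check run; slow responses from the endpoint would add to that.
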
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,22 +1,16 @@ | ||
| #!/bin/bash | ||
|
|
||
| - function start_jupyter_server() { | ||
| - counter=0 | ||
| - response=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:8888/api/status") | ||
| - while [[ ${response} -ne 200 ]]; do | ||
| - let counter++ | ||
| - if ((counter % 20 == 0)); then | ||
| - echo "Waiting for Jupyter Server to start..." | ||
| - sleep 0.1 | ||
| - fi | ||
| - | ||
| - response=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:8888/api/status") | ||
| - done | ||
| + function start_code_interpreter() { | ||
| + /root/.jupyter/jupyter-healthcheck.sh | ||
| + if [ $? -ne 0 ]; then | ||
| + echo "Jupyter Server failed to start, aborting." | ||
| + exit 1 | ||
| + fi | ||
|
|
||
| cd /root/.server/ | ||
| .venv/bin/uvicorn main:app --host 0.0.0.0 --port 49999 --workers 1 --no-access-log --no-use-colors --timeout-keep-alive 640 | ||
| } | ||
|
|
||
| echo "Starting Code Interpreter server..." | ||
| - start_jupyter_server & | ||
| + start_code_interpreter & | ||
| MATPLOTLIBRC=/root/.config/matplotlib/.matplotlibrc jupyter server --IdentityProvider.token="" >/dev/null 2>&1 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| [Unit] | ||
| Description=Code Interpreter Server | ||
| Documentation=https://github.com/e2b-dev/code-interpreter | ||
| Requires=jupyter.service | ||
| After=jupyter.service | ||
| PartOf=jupyter.service | ||
| StartLimitBurst=0 | ||
|
|
||
| [Service] | ||
| Type=simple | ||
| WorkingDirectory=/root/.server | ||
| ExecStartPre=/root/.jupyter/jupyter-healthcheck.sh | ||
| ExecStart=/root/.server/.venv/bin/uvicorn main:app --host 0.0.0.0 --port 49999 --workers 1 --no-access-log --no-use-colors --timeout-keep-alive 640 | ||
| Restart=on-failure | ||
| RestartSec=1 | ||
| StandardOutput=journal | ||
| StandardError=journal |
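
The tests in this PR kill the processes with `kill -9`, and the unit above sets `Restart=on-failure`. As a rough mental model of why that combination leads to a restart, the sketch below encodes the default rules described in systemd.service(5): an exit code of 0 and the signals SIGHUP, SIGINT, SIGTERM, and SIGPIPE count as a clean exit, and anything else counts as a failure. The function name is hypothetical, and the model deliberately ignores timeouts, watchdog events, and overrides such as `SuccessExitStatus=`.

```python
import signal
from typing import Optional

# Signals that systemd treats as a clean exit by default (systemd.service(5)).
CLEAN_SIGNALS = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGPIPE}


def restarts_on_failure(exit_code: Optional[int], killed_by: Optional[int]) -> bool:
    """Simplified model of Restart=on-failure for a single process exit.

    exit_code: the exit status if the process exited on its own, else None.
    killed_by: the signal number if the process was terminated by a signal.
    """
    if killed_by is not None:
        # Terminated by a signal: a failure unless it is one of the clean signals.
        return killed_by not in CLEAN_SIGNALS
    # Exited on its own: only a non-zero status counts as a failure.
    return exit_code != 0


print(restarts_on_failure(None, signal.SIGKILL))  # kill -9 or the OOM killer
print(restarts_on_failure(0, None))               # clean exit
print(restarts_on_failure(None, signal.SIGTERM))  # plain `kill`
```

Under this model, a SIGKILL (what `kill -9` and the kernel OOM killer send) counts as a failure, while a plain SIGTERM or an exit status of 0 does not.
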
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| [Unit] | ||
| Description=Jupyter Server | ||
| Documentation=https://jupyter-server.readthedocs.io | ||
| Wants=code-interpreter.service | ||
| StartLimitBurst=0 | ||
|
|
||
| [Service] | ||
| Type=simple | ||
| Environment=MATPLOTLIBRC=/root/.config/matplotlib/.matplotlibrc | ||
| Environment=JUPYTER_CONFIG_PATH=/root/.jupyter | ||
| ExecStart=/usr/local/bin/jupyter server --IdentityProvider.token="" | ||
| ExecStartPost=-/usr/bin/systemctl reset-failed code-interpreter | ||
| Restart=on-failure | ||
| RestartSec=1 | ||
| StandardOutput=null | ||
| StandardError=journal | ||
mishushakov marked this conversation as resolved.
Before, there was an infinite loop. Won't this cause an issue if the sandbox is too slow?
I am worried that there would be no way to signal that all attempts at checking the endpoint health have been exhausted and that it should just try restarting the service again. Basically, I don't want it to get stuck in the health-check loop forever if it should just restart the service.
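
The trade-off discussed in these two comments can be sketched in a few lines. The Python below is not code from the PR; it mirrors the shape of `jupyter-healthcheck.sh` with a hypothetical `probe` callable standing in for the curl request, and an injectable `sleep` so the behavior can be exercised without waiting. The point is that a bounded loop ends with a non-zero exit code that a supervisor can act on, whereas an unbounded loop never reports failure. According to systemd.service(5), `Restart=` also covers processes started via `ExecStartPre=`, so a failing pre-start check should lead to another start attempt after `RestartSec=`.

```python
import time
from typing import Callable


def healthcheck_exit_code(
    probe: Callable[[], bool],
    max_retries: int = 50,
    retry_interval_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() a bounded number of times and return a shell-style exit code."""
    for _ in range(max_retries):
        try:
            if probe():
                return 0  # healthy: the dependent service may start
        except Exception:
            pass  # treat probe errors (e.g. connection refused) as "not ready yet"
        sleep(retry_interval_s)
    return 1  # attempts exhausted: report failure instead of looping forever


# The probe succeeds on the third attempt, so the check passes.
attempts = iter([False, False, True])
print(healthcheck_exit_code(lambda: next(attempts), sleep=lambda _: None))  # prints 0
```

A slow sandbox then costs extra restart cycles rather than an indefinite hang; whether 50 attempts at 0.2 s is a large enough budget is the open question the first comment raises.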