Possible k8s OOM Kill prevention pill 2 - rlimit #1370

taylordowns2000 wants to merge 4 commits into `main`
Conversation
Gosh there's a lot of stuff here, and I have no idea what any of it does. I'll take a close look at it (probably tomorrow).
josephjclark left a comment:
I'm a lot more comfortable with this approach than with the cgroup stuff. This feels more direct and targeted at the use case we need.
I still want to read a little more about it, and I'll make a few changes to the PR.
The big thing I'm concerned about is: will the engine be able to set the correct exit condition when rlimit kills the process? I need to test that locally.
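One quick way to probe that locally: check the `(code, signal)` pair the parent sees on child exit. When allocation fails under `RLIMIT_AS`, the child typically dies on a signal (often SIGABRT from V8) rather than exiting with a code. A rough sketch — `classifyExit` is a hypothetical helper for illustration, not the engine's real API:

```ts
import { spawnSync } from 'node:child_process';

// Hypothetical helper: map a child's (exit code, signal) pair to an
// engine-style exit condition. A child killed under rlimit usually
// terminates on a signal, not with an exit code.
type ExitKind = 'ok' | 'error' | 'killed';

function classifyExit(code: number | null, signal: string | null): ExitKind {
  if (signal !== null) return 'killed'; // e.g. SIGABRT when V8 cannot allocate
  return code === 0 ? 'ok' : 'error';
}

// Simulate a child aborting, as V8 often does on allocation failure:
const r = spawnSync(process.execPath, [
  '-e',
  "process.kill(process.pid, 'SIGABRT')",
]);
console.log(classifyExit(r.status, r.signal)); // on Linux/macOS: "killed"
```

If the parent only ever inspects the exit code today, a signal-terminated child (status `null`) is exactly the case that would need a new mapping to an OOM-style exit reason.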
```ts
const hasPrlimit = detectPrlimitSupport(logger);

if (hasPrlimit && options.maxWorkerMemoryMb) {
  logger.info(
```
I would append this to the previous startup log - it'll be way more useful there
```ts
type WorkerOptions = {
  maxWorkers?: number;
  maxWorkerMemoryMb?: number; // kernel-level memory limit per child process (cgroup v2)
```
I would just re-use the existing `memoryLimitMb` option.
How that limit gets used in cgroups or heap memory settings or whatever is an implementation detail. The admin just needs to say "don't let any given job take more than 500mb".
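For what it's worth, that could look something like this — a sketch only, with names and the default value assumed for illustration, not taken from the actual worker code:

```ts
// Sketch: one user-facing limit from which both the V8 heap flag and the
// kernel limit are derived internally. Names and defaults are assumptions.
type WorkerOptions = {
  maxWorkers?: number;
  memoryLimitMb?: number; // the only knob the admin sets
};

function deriveLimits(opts: WorkerOptions) {
  const mb = opts.memoryLimitMb ?? 500;
  return {
    // heap-level limit: let Node manage its own memory and exit cleanly
    nodeArgs: [`--max-old-space-size=${mb}`],
    // kernel-level limit, in bytes, for prlimit/setrlimit
    rlimitBytes: mb * 1024 * 1024,
  };
}

console.log(deriveLimits({ memoryLimitMb: 500 }).nodeArgs[0]);
// logs: --max-old-space-size=500
```

The point is just that the admin never sees two knobs; whether the number feeds a cgroup, an rlimit, or a V8 flag stays internal.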
```ts
allWorkers[child.pid!] = child;

if (hasPrlimit && options.maxWorkerMemoryMb) {
  // RLIMIT_AS counts virtual address space, not RSS.
```
This comment doesn't belong here. We should just pass the mb limit into the `applyMemoryLimit` function.
I think I'd also say: take the memory limit used in the run, add 10% (20%?), and set that as the hard process limit.
I don't really know why - I just feel like we should let Node control the exit itself, and use the limit as a hard fallback.
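That layering — Node's own heap limit as the primary control, the kernel limit as a backstop with headroom — could be sketched like this (the 20% figure and the function name are placeholders, not decisions):

```ts
// Sketch: hard process limit = run's memory limit plus headroom, so
// --max-old-space-size normally triggers first and rlimit only fires
// if the process somehow escapes it. Headroom fraction is a guess.
function hardLimitBytes(runLimitMb: number, headroom = 0.2): number {
  return Math.ceil(runLimitMb * (1 + headroom)) * 1024 * 1024;
}

console.log(hardLimitBytes(500)); // 600mb in bytes: 629145600
```

With this shape, tuning the headroom later is a one-argument change rather than a config migration.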
```ts
execFileSync('prlimit', [
  '--pid',
  String(pid),
  `--as=${limitBytes}:${limitBytes}`,
```
I need to look into:
- should the soft and hard limit be the same?
- should we be setting address space or RSS? or both?
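For reference while digging into that: `prlimit --as` takes a `<soft>:<hard>` pair, so experimenting with a split limit is just an argument change. A sketch of building those args (helper name is mine):

```ts
// Sketch: build prlimit args with distinct soft and hard address-space
// limits. For RLIMIT_AS the soft limit is what's enforced on allocations;
// the hard limit is the ceiling the process could raise its soft limit to.
function prlimitAsArgs(
  pid: number,
  softBytes: number,
  hardBytes: number
): string[] {
  return ['--pid', String(pid), `--as=${softBytes}:${hardBytes}`];
}

console.log(prlimitAsArgs(1234, 524288000, 629145600).join(' '));
// --pid 1234 --as=524288000:629145600
```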
```diff
@@ -0,0 +1,51 @@
import test from 'ava';
```
Yeah I don't know what these tests are going to tell us. Maybe this is something to do at the integration test level. Maybe it's more appropriate that we don't test this at all?
Given the early success of "pill 1", I'm happy to close this. Whatcha think?
@taylordowns2000 I'm still interested in this but do need to test and probe further. Memory was still spiking super high even after the fix. It might be that runs are consuming 1.5gb of memory before getting killed by the process - which is still 500mb over the allowed limit. Two of those workflows running at once would kill the worker.

Using rlimit and enforcing memory constraints in the OS should give us far stricter control of memory, which would mean we can kill the job the moment it allocates outside of its bounds. I'm very interested in that.

If it works well, we could even think more seriously about dynamic memory allocation: claiming a run with a 512mb memory limit only if you have at least that much available (say, 800mb free), for example. Capacity would be determined purely by memory availability, not by an arbitrary number, and we could totally trust it.

I've always been a bit stuck on cgroups (not that I really understand them) because they're designed to restrict a set of processes. But that's not really what I want, I don't think.
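The dynamic-claiming idea in concrete terms — a toy sketch, where the function name, the reserve, and the numbers are all made up for illustration:

```ts
// Toy sketch of memory-based claiming: only claim a run if its full memory
// limit fits in what's currently free, keeping a reserve for the worker
// process itself. All names and numbers are illustrative.
function canClaimRun(
  freeMb: number,
  runLimitMb: number,
  reserveMb = 100
): boolean {
  return freeMb - reserveMb >= runLimitMb;
}

console.log(canClaimRun(800, 512)); // true: 800 - 100 >= 512
console.log(canClaimRun(600, 512)); // false: 600 - 100 < 512
```

The trust argument only holds if the rlimit is genuinely hard — i.e. a run can never exceed the limit it was claimed under — which is exactly what kernel enforcement would buy over heap-level limits.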
Background: It took me about 20 seconds to crash a staging worker in Kubernetes:
The above run will show up as "lost" in the next 30 minutes.
This PR uses `prlimit` to set `RLIMIT_AS` on each forked child process, capping virtual address space so a runaway run crashes itself instead of OOM-killing the pod.

It's opt-in by detection: active when `prlimit` (from util-linux) is available on Linux; a no-op on macOS / local dev. It also adds `util-linux` to the worker Docker image so it's available.

Testing on staging
AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy.
Release branch checklist
Delete this section if this is not a release PR.
If this IS a release branch:
- Run `pnpm changeset version` from root to bump versions
- Run `pnpm install`
- Run `pnpm changeset tag` to generate tags
- Run `git push --tags`

Tags may need updating if commits come in after the tags are first generated.