increase chan size to fix oom event lost #386
ningmingxiao wants to merge 1 commit into containerd:main
Conversation
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
  func (c *Manager) EventChan() (<-chan Event, <-chan error) {
- 	ec := make(chan Event, 1)
+ 	ec := make(chan Event, 16)
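For context, a minimal standalone sketch (not containerd's actual producer code) of why the buffer size matters, assuming the producer does a non-blocking send: with a one-slot buffer, any event that arrives while the consumer is busy is simply dropped. The Event type here is a stand-in.

package main

import (
	"fmt"
	"time"
)

// Event is a stand-in for the cgroup Event type in the diff above.
type Event struct{ OOM uint64 }

func main() {
	ec := make(chan Event, 1) // one-slot buffer, as on the old line

	// Producer: non-blocking send; if the buffer is full the event is lost.
	go func() {
		for i := 1; i <= 3; i++ {
			select {
			case ec <- Event{OOM: uint64(i)}:
			default:
				fmt.Println("dropped event", i)
			}
		}
		close(ec)
	}()

	// Slow consumer: by the time it drains, only the first event survived.
	time.Sleep(100 * time.Millisecond)
	for e := range ec {
		fmt.Println("received event", e.OOM)
	}
}

A larger buffer (16 in this PR) only widens that window; it does not remove it.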
It is an empirical value; I tested it many times. I'm not sure. @mikebrow
or we can try:

go func() {
	ec <- Event{
		Low:     out["low"],
		High:    out["high"],
		Max:     out["max"],
		OOM:     out["oom"],
		OOMKill: out["oom_kill"],
	}
}()
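One caveat with a goroutine per event: it keeps the producer from blocking, but delivery order is no longer guaranteed and goroutines can pile up if the channel is never drained. A rough alternative sketch, keeping the send in the producer but bounding how long it waits; the Event type, helper name, and timeout value are illustrative, not part of this PR.

package main

import (
	"fmt"
	"time"
)

type Event struct{ Low, High, Max, OOM, OOMKill uint64 }

// sendWithTimeout keeps the send in the producer goroutine (so event order is
// preserved) but gives up after a bounded wait instead of blocking forever.
func sendWithTimeout(ec chan<- Event, e Event, timeout time.Duration) bool {
	select {
	case ec <- e:
		return true
	case <-time.After(timeout):
		return false // nobody is reading; the event is dropped (and could be logged)
	}
}

func main() {
	ec := make(chan Event, 1)
	ec <- Event{OOM: 1} // buffer already full
	ok := sendWithTimeout(ec, Event{OOM: 2}, 50*time.Millisecond)
	fmt.Println("second send delivered:", ok) // false: consumer never drained
}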
I think the event loss is not caused by the channel size; it's a race condition on the shim side.
The shim sets GOMAXPROCS=4, and in the CI action critest runs 8 cases in parallel, so, ideally, 8 pods run at the same time on a node with 4 CPU cores. If the shim gets the exit event before we read the OOM event, the OOM event can be lost.
I am thinking we should drain the select-oom-event goroutine before sending the exit event on the shim side. Let me do some performance or density tests for that; I will update it later.
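Roughly what that drain could look like; the types and the publish hook below are illustrative stand-ins, not the shim's real API:

package main

import "fmt"

// Event and ExitEvent are stand-ins for the shim's OOM and task-exit events.
type Event struct{ OOM uint64 }
type ExitEvent struct{ Pid int }

// publishExitAfterDrain forwards any OOM events already buffered on oomCh
// before publishing the exit event, so consumers see the OOM reason first.
func publishExitAfterDrain(oomCh <-chan Event, exit ExitEvent, publish func(interface{})) {
	for {
		select {
		case e := <-oomCh:
			publish(e)
		default:
			publish(exit)
			return
		}
	}
}

func main() {
	oomCh := make(chan Event, 1)
	oomCh <- Event{OOM: 1} // an OOM event is already pending when the task exits
	publishExitAfterDrain(oomCh, ExitEvent{Pid: 123}, func(ev interface{}) {
		fmt.Printf("published %T %+v\n", ev, ev)
	})
}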
fair.. there may be another window where we get the task exit event on the CRI side first, then a status request comes in and we report the container as exited with an error reason.. then we receive the OOM event and update the container's exit reason.. so if they ask again we give the status with the OOM reason? The two events go through two checkpoints, both protected by the same container store global mutex, and between that lock and the storage locks the window for the racing OOM test is "tight".
Thinking we might want to check whether an exit reason is already queued up before reporting status while the container is exiting.. same for the generateAndSendContainerEvent() sent to the kubelet at the bottom of the task exit handler.. if we don't get the OOM event first, we have a window where we report the task exit with no reason.
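One possible shape for that check, with hypothetical names and a purely illustrative grace period; the real fix would live in the CRI status/event paths:

package main

import (
	"fmt"
	"sync"
	"time"
)

// ContainerStatus is a stand-in for the CRI-side status record.
type ContainerStatus struct {
	ExitCode int32
	Reason   string
}

// reasonForExit prefers a reason that is already recorded (for example
// "OOMKilled" written by the OOM event handler) and only invents one after a
// short grace period for the racing OOM event. The grace period is
// illustrative, not a proposed value.
func reasonForExit(get func() ContainerStatus, grace time.Duration) string {
	deadline := time.Now().Add(grace)
	for {
		s := get()
		if s.Reason != "" {
			return s.Reason
		}
		if time.Now().After(deadline) {
			if s.ExitCode == 0 {
				return "Completed"
			}
			return "Error"
		}
		time.Sleep(5 * time.Millisecond)
	}
}

func main() {
	var mu sync.Mutex
	status := ContainerStatus{ExitCode: 137}

	// Simulate the OOM event handler recording the reason shortly after exit.
	go func() {
		time.Sleep(20 * time.Millisecond)
		mu.Lock()
		status.Reason = "OOMKilled"
		mu.Unlock()
	}()

	get := func() ContainerStatus {
		mu.Lock()
		defer mu.Unlock()
		return status
	}
	fmt.Println(reasonForExit(get, 100*time.Millisecond)) // prints "OOMKilled"
}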
worse.. when we get the container status:
func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
	meta := container.Metadata
	status := container.Status.Get()
	reason := status.Reason
	if status.State() == runtime.ContainerState_CONTAINER_EXITED && reason == "" {
		if status.ExitCode == 0 {
			reason = completeExitReason
		} else {
			reason = errorExitReason
		}
	}
	...
If the container has exited but the reason is empty, we invent an exit reason based on the exit code (an error reason when it is non-zero).
I added some logging and found that if I increase the channel size there is less chance of losing the OOM event. The OOM event really happened, but containerd failed to catch it.
The last time I read from memory.events, the value was:
{Low:0x0, High:0x0, Max:0x1, OOM:0x0, OOMKill:0x0} c.id cad5a1e0473717dd873246764c857ecc7dbb3630516bba093c5700b246b6a282]
@fuweid @mikebrow
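For anyone reproducing this, those counters come from the cgroup v2 memory.events file, which holds one "key value" pair per line (low, high, max, oom, oom_kill). A minimal sketch of reading it; the cgroup path is only an example, not the one used by containerd:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMemoryEvents parses a cgroup v2 memory.events file into counter values.
func readMemoryEvents(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := make(map[string]uint64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		out[fields[0]] = v
	}
	return out, s.Err()
}

func main() {
	out, err := readMemoryEvents("/sys/fs/cgroup/system.slice/example.scope/memory.events")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("max=%d oom=%d oom_kill=%d\n", out["max"], out["oom"], out["oom_kill"])
}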
Fixes containerd/containerd#12681 (CI failure); reduces the probability of losing the OOM event.