How to categorize this issue?
/area control-plane
/kind enhancement
/priority 3
What would you like to be added:
Background
Currently, the machine-controller-manager moves Running machines to Unknown phase in case of errors and then to Failed phase after the configured machine-health-timeout. Failed machines are swiftly moved to the Terminating phase, the node is drained and the machine object deleted.
Need
There is a need to preserve the VMs corresponding to Machines so that the operator/support/SRE can analyze and diagnose the root cause of a failure. However, there should be a limit on the number of machines that are preserved per worker pool. There should also be a configurable timeout beyond which the MCM goes ahead with Machine termination.
We propose enhancing MachineConfiguration with FailedMachineTimeout *metav1.Duration and the MachineDeploymentSpec with the FailedMachinePreserveMax *int32. (Exact field names/locations are subject to change after design)
In addition, in a separate gardener PR, we will enhance the machineControllerManager settings in the shoot spec so that operators can configure the above fields for the worker pool:
machineControllerManager:
  failedMachinePreserveMax: 2
  failedMachinePreserveTimeout: 3h
- The MCM will annotate all preserved failed machines with node.machine.sapcloud.io/preserve-when-failed=true.
- The user/operator can also explicitly mark a Machine or its associated Node with the annotation node.machine.sapcloud.io/preserve-when-failed=true.
- If the current count of preserved Failed machines equals or exceeds failedMachinePreserveMax, the annotation will not be accepted (it will be deleted).
- If the current count of preserved Failed machines equals or exceeds failedMachinePreserveMax, any Unknown Machines that move to the Failed phase will not be preserved and will be terminated.
- failedMachinePreserveMax MUST be set in the shoot spec; otherwise the annotation node.machine.sapcloud.io/preserve-when-failed=true added by operator/support to a Machine has no effect.
- Preserved failed machines can be removed before failedMachinePreserveTimeout expires by setting the annotation node.machine.sapcloud.io/preserve-when-failed=false on the machine.
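The rules above can be sketched as a single decision function. This is only an illustration of the proposed behaviour, not MCM code: the types FailedMachine and PreservePolicy, the function ShouldPreserve, and the constant exampleTimeout are hypothetical names, and the sketch assumes that the count passed in excludes the machine being evaluated and that the timeout is a positive duration.

```go
package main

import "time"

// PreserveAnnotation is the annotation key named in the proposal.
const PreserveAnnotation = "node.machine.sapcloud.io/preserve-when-failed"

// exampleTimeout mirrors the 3h value from the shoot spec example (illustrative only).
const exampleTimeout = 3 * time.Hour

// FailedMachine is a hypothetical, stripped-down view of a Machine in the Failed phase.
type FailedMachine struct {
	Annotations map[string]string
	FailedFor   time.Duration // time elapsed since the machine entered the Failed phase
}

// PreservePolicy is a hypothetical holder for the two proposed settings.
// A nil Max models "failedMachinePreserveMax is not set in the shoot spec".
type PreservePolicy struct {
	Max     *int32
	Timeout time.Duration
}

// ShouldPreserve reports whether a Failed machine should be kept rather than terminated.
// othersPreserved is the number of other Failed machines currently preserved
// (assumption: the machine under evaluation is not included in this count).
func ShouldPreserve(m FailedMachine, othersPreserved int32, p PreservePolicy, rollingUpdate bool) bool {
	switch {
	case rollingUpdate:
		// Preservation is not honored during rolling updates.
		return false
	case p.Max == nil:
		// Without a configured maximum, the annotation has no effect.
		return false
	case m.Annotations[PreserveAnnotation] == "false":
		// Explicit opt-out releases the machine before the timeout.
		return false
	case m.FailedFor >= p.Timeout:
		// Timeout reached: the MCM goes ahead with termination.
		return false
	}
	// Preserve only while the limit has not been reached.
	return othersPreserved < *p.Max
}
```

With a maximum of 2 and a timeout of 3h, a machine that has been Failed for one hour would be preserved while one other machine is preserved, and terminated once two others already are.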
Limitations
- During rolling updates we will NOT honor preserving Machines. A Machine that moves to the Failed phase will be replaced with a healthy one; otherwise the logic becomes overly complicated.
- Since a gardener worker pool can correspond to 1..N MachineDeployments depending on the number of zones, we will need to distribute failedMachinePreserveMax across the N machine deployments, so the number should be chosen appropriately.
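The issue does not specify how the pool-level limit would be split across zones. One possible scheme, shown here purely as an assumption, is an even split with the remainder going to the first deployments. The function name DistributePreserveMax is hypothetical.

```go
package main

// DistributePreserveMax splits a worker-pool-level failedMachinePreserveMax
// across n MachineDeployments (one per zone). Each deployment gets total/n,
// and the first total%n deployments get one extra.
// This even-split strategy is an assumption, not part of the proposal.
func DistributePreserveMax(total int32, n int) []int32 {
	if n <= 0 || total < 0 {
		return nil
	}
	shares := make([]int32, n)
	base := total / int32(n)
	rem := total % int32(n)
	for i := range shares {
		shares[i] = base
		if int32(i) < rem {
			shares[i]++
		}
	}
	return shares
}
```

Under this scheme, a limit of 2 across three zones yields shares of 1, 1 and 0, so one zone would preserve no failed machines. This is why the number should be chosen with the zone count in mind.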
Why is this needed:
For operator/support/SRE diagnosis of VMs/Nodes.