This repository was archived by the owner on Apr 23, 2026. It is now read-only.
people can shoot themselves in the foot:
any alerting threshold that sits right around the "sweet spot" (henceforth referred to as "sour spot") that your data dances around can cause a lot of alert notifications: the state keeps flipping between critical and ok, because at each evaluation the data has changed just enough to land on the other side of the threshold. ("flapping", in nagios terms)
our current default settings put a lot of people right in the sour spot, but even with adjusted defaults, the problem is there.
people's typical data might be outside of the sour spot, but during a service degradation the number of additional failures might be just enough to push their data into the sour spot, and they still become flap victims. (for example, they configured "alert if 6 out of 20 collectors return errors" and they normally have <2 errors at the same time, but a service degradation brings them to 5~7 erroring collectors)
frankly, if our collectors suffer from subtle issues that we can't easily detect or remediate in time, they might also contribute to the user's data going into the sour spot.
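to make the collector example above concrete, here is a minimal sketch. the sample error counts are made up for illustration, and treating "6 out of 20" as `errors >= 6` is an assumption about how the threshold is evaluated:

```python
def alert_states(error_counts, threshold):
    """Map each check's number of erroring collectors to 'critical' or 'ok'."""
    return ["critical" if n >= threshold else "ok" for n in error_counts]


def count_transitions(states):
    """Count state changes; each one would normally trigger a notification."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical samples: erroring collectors (out of 20) at consecutive checks.
healthy = [0, 1, 0, 1, 1, 0, 1, 0]    # normal operation, always < 2 errors
degraded = [5, 6, 5, 7, 5, 6, 7, 5]   # degradation hovering around the threshold
threshold = 6                          # assumed: critical when errors >= 6

healthy_flips = count_transitions(alert_states(healthy, threshold))
degraded_flips = count_transitions(alert_states(degraded, threshold))
print(healthy_flips, degraded_flips)
```

with these made-up numbers the healthy series never changes state, while the degraded series changes state on almost every check.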
in all these cases, we can't just keep sending emails to people.
I believe other systems have a hard limit on max emails per hour/day.
I think we should do something similar, but also send people an email when we start suppressing notifications, explaining what is happening, what we did, and that they might want to change their settings.
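a rough sketch of what such a limit could look like, not a description of any existing implementation. the class name, the `decide` method, the default limits and the sliding-window approach are all hypothetical choices for illustration:

```python
from collections import deque


class NotificationLimiter:
    """Hypothetical per-user limiter: at most max_per_window emails per window."""

    def __init__(self, max_per_window=5, window_seconds=3600):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.sent = deque()        # timestamps of delivered alert emails
        self.suppressing = False   # True once the suppression notice went out

    def decide(self, now):
        """Return 'send', 'send_suppression_notice', or 'drop' for an alert at `now`."""
        # Forget deliveries that have fallen out of the sliding window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_per_window:
            self.sent.append(now)
            self.suppressing = False
            return "send"
        if not self.suppressing:
            # First alert over the limit: explain once, then stay quiet.
            self.suppressing = True
            return "send_suppression_notice"
        return "drop"


# Example: limit of 3 per hour, timestamps in seconds.
limiter = NotificationLimiter(max_per_window=3, window_seconds=3600)
decisions = [limiter.decide(t) for t in [0, 60, 120, 180, 240, 3700]]
print(decisions)
```

the suppression notice is emitted only once per suppression period, so the explanation email itself can't become another source of spam. normal delivery resumes once older emails age out of the window.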
Wednesday Sep 09, 2015 at 13:05 GMT
Originally opened as raintank/grafana#458