Forum Discussion

Oreoluwa
Dec 24, 2019

TOO MANY TRIGGERED EMAIL ALERTS BY HEALTH MONITOR

Hi guys,

 

I have successfully set up email alerts on my F5 BIG-IP in my production environment, and they work, but there is an issue. For every second that the monitor sees the node as down, an email alert is sent to the configured addresses. This floods our mailboxes with alerts for negligible node downtimes. What can I do to correct this?

 

 

  • Hi Oreoluwa,

     

    You can change the monitor's interval/timeout values to 8/25 or 10/31.

     

    You must verify that the monitor settings are properly defined for your environment. F5 recommends that in most cases the timeout value should be equal to three times the interval value, plus one. For example, the default interval/timeout setting is 5/16 (three times 5 plus one equals 16). This setting prevents the monitor from marking the node as down before sending the last check.

    REF: https://support.f5.com/csp/article/K12531
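The arithmetic behind the suggested pairs can be checked with a short sketch. The function name `recommended_timeout` is hypothetical and only illustrates the "three times the interval, plus one" guideline quoted above; it is not part of any F5 tooling.

```python
def recommended_timeout(interval: int) -> int:
    """Timeout in seconds under the 3n+1 guideline: three intervals plus one second."""
    if interval <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return 3 * interval + 1


# The default pair (5/16) and the two pairs suggested in this reply (8/25, 10/31).
for interval in (5, 8, 10):
    print(f"interval={interval}s -> timeout={recommended_timeout(interval)}s")
```

Each pair mentioned above follows the same rule, so changing the interval and recomputing the timeout keeps the ratio consistent.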

  • SWJO

    It seems that your server settings are not right.

    If you are using an HTTP monitor, add a "Connection: Close" header to the send string so the session is closed after each check.

    Otherwise, in most cases, the root cause is the server's TCP-related kernel settings.

    • Oreoluwa

      I don't understand this, SWJO. Could you explain it in more detail, please? I am interested in this.

      • SWJO

        Like this:

        GET /test.html HTTP/1.1\r\nUser-Agent: \r\nHost: 127.0.0.1\r\nConnection: Close\r\n\r\n
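For readers unsure how the send string above is put together, here is a minimal sketch that rebuilds it line by line. It assumes the `\r\n` sequences in the monitor string stand for CRLF line endings; the helper name `build_send_string` is hypothetical, and the path and host are simply the values from the example above.

```python
def build_send_string(path: str, host: str) -> str:
    """Return an HTTP/1.1 GET request that asks the server to close the connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        "User-Agent: ",
        f"Host: {host}",
        "Connection: Close",
    ]
    # Each line ends with CRLF; one extra CRLF terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_send_string("/test.html", "127.0.0.1")
# Escape CR and LF so the result can be compared with the string as typed in the monitor.
escaped = request.replace("\r", "\\r").replace("\n", "\\n")
print(escaped)
```

The `Connection: Close` header is the part that asks the server to end the session after it responds, instead of holding the connection open between checks.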

  • Hi

     

    As @eaa mentioned, the key is probably your monitor settings. What are the current interval/timeout values?

     

    Yoann

    • Oreoluwa

      Hi, my current interval/timeout values are 5/16.

      I am planning to change them to 5/300, or should I make it 90/300? The client wants 5 minutes of repeated failed checks before the server is considered down. We believe the server is currently marked down after as little as 4 seconds of downtime, so we think 5/300 or 90/300 will prevent that. What do you think?

  • Hi,

     

    It needs to be tested in your environment, but yes, on paper that should do the trick.

     

    If you set 5/300, then you will send 60 requests MAX over 5 minutes if the server is not responding.

    If you set 90/300, then you will send 3 requests MAX over 5 minutes if the server is not responding.

     

    So it really depends on how insistent you want to be with your backend.

     

    Yoann
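The request counts above come from dividing the timeout by the interval. A rough sketch of that estimate follows; it assumes checks fire exactly once per interval after the last successful response, and the function name `max_failed_checks` is hypothetical.

```python
def max_failed_checks(interval: int, timeout: int) -> int:
    """Estimate how many checks fit inside the timeout window (simplified model)."""
    if interval <= 0 or timeout <= 0:
        raise ValueError("interval and timeout must be positive numbers of seconds")
    return timeout // interval


# The current setting and the two candidates discussed in this thread.
for interval, timeout in ((5, 16), (5, 300), (90, 300)):
    checks = max_failed_checks(interval, timeout)
    print(f"{interval}/{timeout}: up to {checks} checks before the node is marked down")
```

Under this model both 5/300 and 90/300 wait the same 5 minutes before marking the server down; the difference is only how often the backend is probed in the meantime.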

  •  

     

    The best practice is 3n+1: the timeout should be three times the interval, plus one.

    You really should not adjust the timeout and interval just to reduce the number of flaps or to suppress alerts. If you do, you won't have a stable infrastructure.

     

    If the flapping is continuous, you should identify the cause and resolve it, or configure a monitor that properly fits the application.

    Simply increasing the timeout and interval is not the right approach, I'd say.

     

    Keep us posted if you need more help.