werner_v_113449
Jul 23, 2015

ltm unit running at 100% after upgrade to 11.5.3

Last weekend (19/7) we upgraded our HA pair from version 11.4.1 to version 11.5.3, mainly because we had hit a memory leak caused by source NAT (https://support.f5.com/kb/en-us/solutions/public/15000/000/sol15010.html).

 

After the upgrade everything seemed OK, but we then noticed that the GUI (HTTPS) was not always responding. Sometimes it responded, and at other moments it simply timed out. The second unit in the HA pair did not show this behavior. Checking CPU stats with the top command, we saw that the CPU on the first unit hits 100% from time to time. When this occurs, the GUI becomes slow or even unresponsive. On the 2nd unit CPU usage has also increased, but it stays at a lower level, so we do not see the same behavior there.

 

Has anybody seen similar behavior recently after upgrading to 11.5.3 (HF1)? We have opened a case with F5 and are awaiting feedback.

 

4 Replies

  • Some output:

    Tasks: 401 total, 20 running, 381 sleeping, 0 stopped, 0 zombie
    Cpu(s): 68.7%us, 26.2%sy, 1.3%ni, 3.0%id, 0.2%wa, 0.1%hi, 0.6%si, 0.0%st
    Mem: 8189868k total, 8092768k used, 97100k free, 103412k buffers
    Swap: 1048504k total, 262332k used, 786172k free, 447220k cached

      PID USER PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    18691 root RT  0 6327m 151m 133m S 30.9  1.9   1770:09 tmm
    18690 root RT  0 6327m 151m 133m S 28.3  1.9   1855:55 tmm
     7595 root 20  0  183m  83m  15m R 19.8  1.0 873:24.36 mcpd
     8889 root 20  0 40712  22m  12m R 11.4  0.3 879:25.49 bigd
     7113 root 25  5 94348  38m  35m R  5.2  0.5 284:34.64 merged
    21036 root 20  0  151m  75m  24m S  3.9  0.9  11:53.55 iControlPortal.

     

    Typically both tmm processes take the most CPU (we are running on the 6900 platform), and we see this continuously. On the 2nd unit tmm consumes less CPU, and other processes such as mcpd and bigd sometimes take more CPU than tmm. On the first unit this is never the case: both tmm processes always take the most CPU.

     

    We are also working with traffic groups, so traffic passes through both units in the HA group. The 2nd unit also shows raised CPU, but it is not hitting critical values. The 1st unit is more impacted and slows down significantly. The main difference between the two is that on the 1st unit we run LTM setups with virtual servers and pool members in directly connected subnets, whereas on the 2nd unit most setups have pool members in routed subnets that are not directly connected (we use SNAT for these setups). It is the 1st unit that mainly suffers from the increased CPU.

     

    We have already started cleaning up in order to reduce up/down events in the log files, but this had no impact. On the advice of F5 we also adjusted some idle timeout timers on the SNAT translation addresses, but this did not have any impact either.

     

  • Did you ever get an answer? I'm about to perform the same upgrade and would really like to know.

     

  • Yes,

    we were using an external monitor, and this monitor was causing the CPU to rise to nearly 100%. I've copied it below, but basically it is a curl command.

    Removing the monitor caused the CPU to drop to normal values. Originally we used the monitor with an interval of 5 seconds and a timeout of 16 seconds, and multiple instances of the monitor were present in the process queue.

    We now run it with an interval of 1 minute and a timeout of 3 minutes. Even then we see an elevated level of CPU, but it is under control.

    So there was no real problem with the upgrade; it was a consequence of the external monitor we used (the same external monitor script previously ran on version 11.2.1 without issues):

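    #!/bin/sh
    # BIG-IP passes the node address as $1 (IPv4-mapped IPv6 form) and the port as $2.
    IP=`echo ${1} | sed 's/::ffff://'`
    PORT=${2}
    MONITOR_NAME=external-http-udp-healthcheck
    PIDFILE="/var/run/$MONITOR_NAME.${IP}_${PORT}.pid"

    # If a previous instance for this member left a PID file behind, log it and kill it.
    if [ -f "$PIDFILE" ]
    then
        echo "EAV exceeded runtime needed to kill ${IP}:${PORT}" | logger -p local0.error
        kill -9 `cat "$PIDFILE"` > /dev/null 2>&1
    fi
    echo "$$" > "$PIDFILE"

    # Fetch the health-check page and look for "UP" (case-insensitive) in the body.
    curl -fNs "http://${IP}/healthcheck/udp.php?port=${PORT}" | grep -i "UP" > /dev/null 2>&1
    if [ $? -eq 0 ]
    then
        # Success: clean up and print UP so the member is marked up.
        rm -f "$PIDFILE"
        echo "UP"
    else
        # Failure: clean up and print nothing.
        rm -f "$PIDFILE"
    fi
    exit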
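
Since multiple instances of the monitor were piling up, one possible mitigation is to let curl give up on its own before the next interval fires. The following is only a sketch under assumptions, not something tested in this thread: the 4-second budget and the helper names normalize_ip, body_is_up and run_check are made up for illustration, while --max-time is a standard curl option.

```shell
#!/bin/sh
# Sketch of a time-bounded variant of the external monitor above.
# Assumption: MAX_SECONDS and the function names are illustrative only.

MAX_SECONDS=4   # assumed budget; keep it below the monitor interval

# Strip the IPv4-mapped IPv6 prefix from the node address.
normalize_ip() {
    echo "$1" | sed 's/::ffff://'
}

# Succeed if the response body on stdin contains "UP" (case-insensitive).
body_is_up() {
    grep -qi "UP"
}

# Probe the health-check URL; curl aborts by itself after MAX_SECONDS.
run_check() {
    ip=$(normalize_ip "$1")
    port=$2
    if curl -fNs --max-time "$MAX_SECONDS" \
        "http://${ip}/healthcheck/udp.php?port=${port}" | body_is_up
    then
        echo "UP"   # same success signal as the original script
    fi
}

# Run the check only when called with a node address and a port.
if [ "$#" -ge 2 ]; then
    run_check "$1" "$2"
fi
```

Because curl is bounded by --max-time, a hung health-check page should no longer leave a monitor process running into the next interval, so the PID file and kill -9 logic might become unnecessary. Whether this actually lowers CPU on 11.5.3 would need to be verified on the affected unit.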