11-Jul-2011 00:48
We have two LTMs in a HA config. Not sure if the devices are actually failing over or not....
Mon Jul 11 11:48:24 EST 2011 notice local/LB003 sod[5151] 01140045 HA reports tmm NOT ready.
Mon Jul 11 11:48:24 EST 2011 notice local/LB003 sod[5151] 010c0050 Sod requests links down.
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160016 Failover event detected. (Switchboard failsafe disabled while offline)
Mon Jul 11 11:48:25 EST 2011 err local/LB003 bcm56xxd[4993] 012c0010 Failover event detected. Marking external interfaces down. bsx.c(3276)
Mon Jul 11 11:48:25 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 1.15 is DOWN
Mon Jul 11 11:48:25 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 1.16 is DOWN
Mon Jul 11 11:48:25 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 3.1 is DOWN
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160016 Interface 1.15, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160010 Link 1.15 removed from aggregation
Mon Jul 11 11:48:25 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 3.2 is DOWN
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160016 Interface 1.16, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160010 Link 1.16 removed from aggregation
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160016 Interface 3.1, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160010 Link 3.1 removed from aggregation
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160016 Interface 3.2, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mon Jul 11 11:48:25 EST 2011 info local/LB003 lacpd[5133] 01160010 Link 3.2 removed from aggregation
Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 01140029 HA daemon_heartbeat tmm5 fails action is go offline down links and restart.
Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 010c003e Offline
Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 01140044 HA reports tmm ready.
Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 010c0018 Standby
Mon Jul 11 11:48:36 EST 2011 info local/LB003 lacpd[5133] 01160016 Connected to failover service.
Mon Jul 11 11:48:36 EST 2011 info local/LB003 bcm56xxd[4993] 012c0012 Connected to failover service.
Mon Jul 11 11:48:36 EST 2011 notice local/LB003 sod[5151] 010c0048 Bcm56xxd and lacpd connected - links up.
Mon Jul 11 11:48:38 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 3.1 is UP
Mon Jul 11 11:48:38 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 3.2 is UP
Mon Jul 11 11:48:39 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 1.15 is UP
Mon Jul 11 11:48:39 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 1.16 is UP
Mon Jul 11 11:48:40 EST 2011 info local/LB003 lacpd[5133] 01160009 Link 1.15 added to aggregation
Mon Jul 11 11:48:40 EST 2011 info local/LB003 lacpd[5133] 01160009 Link 1.16 added to aggregation
Mon Jul 11 11:48:40 EST 2011 info local/LB003 lacpd[5133] 01160009 Link 3.1 added to aggregation
Mon Jul 11 11:48:40 EST 2011 info local/LB003 lacpd[5133] 01160009 Link 3.2 added to aggregation
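To see whether the unit changed state, it can help to strip the excerpt down to the sod messages only. Below is a minimal sketch, assuming the line layout seen in the lines quoted above (timestamp, severity, host, daemon[pid], message ID, text). The helper name `sod_timeline` and the regular expression are illustrative, not part of any F5 tool.

```python
import re

# Layout assumed from the excerpt above only:
# "<weekday month day time zone year> <severity> <host> <daemon>[<pid>] <msgid> <text>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ [\d:]+ \w+ \d{4}) "
    r"(?P<severity>\w+) (?P<host>\S+) "
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\] "
    r"(?P<msgid>[0-9a-f]{8}) (?P<text>.*)$"
)


def sod_timeline(lines):
    """Return (timestamp, msgid, text) for every sod message, in log order."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("daemon") == "sod":
            events.append(
                (match.group("ts"), match.group("msgid"), match.group("text"))
            )
    return events


# A few lines copied from the excerpt above.
SAMPLE = [
    "Mon Jul 11 11:48:24 EST 2011 notice local/LB003 sod[5151] 01140045 HA reports tmm NOT ready.",
    "Mon Jul 11 11:48:25 EST 2011 info local/LB003 bcm56xxd[4993] 012c0015 Link: 1.15 is DOWN",
    "Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 010c003e Offline",
    "Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 010c0018 Standby",
]

if __name__ == "__main__":
    for ts, msgid, text in sod_timeline(SAMPLE):
        print(ts, msgid, text)
```

The excerpt never shows an "Active" message, so this timeline alone cannot tell whether LB003 was the active unit before it went offline. The peer unit's log would be needed for that.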
11-Jul-2011 03:04
Can you get the logs from your units?
At one point in the logs local/LB003 went to standby, but I would have to check your logs and configs to see whether it was originally the standby or the active unit.
I'm also concerned about the "HA reports tmm NOT ready." message. Can you check whether there are any core files in /var/core?
Once you have the serial number of the unit, can you open a ticket by sending an email to support@f5.com?
Send the following too:
File: Standard qkview output
Command to Generate: qkview
Location of Output file: /var/tmp/$HOSTNAME.tgz
Files: Host log tarball
Command to Generate: tar zcvf /var/tmp/$HOSTNAME-logs.tar.gz /var/log/*
Location of Output file: /var/tmp/$HOSTNAME-logs.tar.gz
hth..
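The core file check mentioned above can be sketched as a small script that lists the files in a directory, newest first, so their timestamps can be compared with the "HA reports tmm NOT ready." log times. This is only a sketch: the /var/core path comes from the post above, the function name `list_core_files` is hypothetical, and nothing is assumed about how the core files are named.

```python
import os
from datetime import datetime


def list_core_files(directory="/var/core"):
    """Return (name, size_in_bytes, modified_time) tuples, newest first."""
    entries = []
    for entry in os.scandir(directory):
        if entry.is_file():
            info = entry.stat()
            entries.append(
                (entry.name, info.st_size, datetime.fromtimestamp(info.st_mtime))
            )
    # Sort by modification time so the most recent core file comes first.
    entries.sort(key=lambda item: item[2], reverse=True)
    return entries


if __name__ == "__main__":
    # Only run the listing when the directory exists on this machine.
    if os.path.isdir("/var/core"):
        for name, size, modified in list_core_files():
            print(f"{modified:%Y-%m-%d %H:%M:%S}  {size:>12}  {name}")
```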
11-Jul-2011 16:14
Yes there are numerous files in /var/core
I will log a case today
12-Jul-2011 01:33
28-Oct-2014 05:02
04-Mar-2015 17:49
Hi, I had exactly the same logs, and there was a failover. The logs start with the "HA reports tmm NOT ready." line. I'm running an HA pair on 11.5.1 HF7. A couple of core files were created. I would appreciate any light on this.
Thank you.
05-Mar-2015 06:14
Have you checked whether your switch is flapping? It looks like you might be monitoring your bcm56xxd[4993] port, and it could be flapping.
10-Mar-2015 06:18 - last edited on 03-Jun-2023 09:14 by JimmyPackets
Opened a case with support; it's a bug in 11.5.1 HF7, resolved in 11.5.2 and 11.6.0 HF4.
/var/log/ltm:
Mar 4 21:29:55 slot1/bigip notice sod[6191]: 01140045:5: HA reports tmm NOT ready.
Mar 4 21:29:55 slot1/bigip notice sod[6191]: 010c0050:5: Sod requests links down.
Mar 4 21:29:55 slot1/bigip info lacpd[7970]: 01160016:6: Failover event detected. (Switchboard failsafe disabled while offline)
Mar 4 21:29:55 slot1/bigip err bcm56xxd[8241]: 012c0010:3: Failover event detected. Marking external interfaces down. bsx.c(3988)
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.6 is DOWN
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.5 is DOWN
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160016:6: Interface 1/1.6, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160010:6: Link 1/1.6 removed from aggregation
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.4 is DOWN
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.3 is DOWN
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.2 is DOWN
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/1.1 is DOWN
Mar 4 21:29:56 slot1/bigip info bcm56xxd[8241]: 012c0015:6: Link: 1/9.1 is DOWN
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160016:6: Interface 1/1.3, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160010:6: Link 1/1.3 removed from aggregation
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160016:6: Interface 1/1.2, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160010:6: Link 1/1.2 removed from aggregation
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160016:6: Interface 1/1.1, link admin status: enabled, link status: down, duplex mode: half, lacp operation state: down
Mar 4 21:29:56 slot1/bigip info lacpd[7970]: 01160010:6: Link 1/1.1 removed from aggregation
Mar 4 21:29:59 slot1/bigip notice chmand[8246]: 012a0005:5: Interface: 1/8.1 is DOWN
Mar 4 21:29:59 slot1/bigip notice clusterd[6184]: 013a0006:5: Bumping this blade's revision and saving cluster config
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm fails action is go offline down links and restart.
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm1 fails action is go offline down links and restart.
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm2 fails action is go offline down links and restart.
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm3 fails action is go offline down links and restart.
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 010c0054:5: Offline for traffic group /Common/traffic-group-1.
Mar 4 21:30:08 slot1/bigip notice sod[6191]: 010c003e:5: Offline
Mar 4 21:30:08 slot1/bigip err clusterd[6184]: 013a0018:3: Blade 1 turned RED: Run, HA TABLE offline
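One detail worth pulling out of an excerpt like this is which tmm instances missed their heartbeat: the 2011 log at the top of the thread names only tmm5, while this one names every instance on the blade. Below is a minimal sketch that extracts the names from the "HA daemon_heartbeat ... fails" message text as it is quoted in this thread. The function name `failed_tmm_instances` is hypothetical.

```python
import re

# Matches the sod message text quoted in this thread, for example
# "HA daemon_heartbeat tmm1 fails action is go offline down links and restart."
HEARTBEAT_RE = re.compile(r"HA daemon_heartbeat (tmm\d*) fails")


def failed_tmm_instances(lines):
    """Return the tmm instance names that failed heartbeat, in first-seen order."""
    seen = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Lines copied from the two excerpts in this thread.
LOG_2011 = [
    "Mon Jul 11 11:48:35 EST 2011 notice local/LB003 sod[5151] 01140029 HA daemon_heartbeat tmm5 fails action is go offline down links and restart.",
]
LOG_2015 = [
    "Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm fails action is go offline down links and restart.",
    "Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm1 fails action is go offline down links and restart.",
    "Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm2 fails action is go offline down links and restart.",
    "Mar 4 21:30:08 slot1/bigip notice sod[6191]: 01140029:5: HA daemon_heartbeat tmm3 fails action is go offline down links and restart.",
    "Mar 4 21:30:08 slot1/bigip notice sod[6191]: 010c003e:5: Offline",
]

if __name__ == "__main__":
    print("2011 excerpt:", failed_tmm_instances(LOG_2011))
    print("2015 excerpt:", failed_tmm_instances(LOG_2015))
```

Because the pattern looks only at the message text, it works on both log layouts quoted in this thread.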
13-Aug-2015 17:17
I have a 4200 pair in active-standby and had the same problem a few times after upgrading from 11.4 to 11.5. After 11.6 HF4 the problem didn't appear anymore until today, and I'm using 11.6.0 HF4... I'm really trying to understand what happened this time... but I know one thing: I'm really fed up with this...
27-Aug-2015 07:55
Hello, same here.
I have a Big-IP 2000 pair in active-standby running the latest 11.6.0 HF5, and have been running into the same issue for several days. Mostly in the afternoon, presumably when more users are connected to it (around 50), it logs the following:
ltm:Aug 27 15:06:00 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.
ltm:Aug 27 15:18:02 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.
ltm:Aug 27 15:30:32 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.
ltm:Aug 27 15:41:54 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.
ltm:Aug 27 15:53:23 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.
Seems like a regression of the fix: https://support.f5.com/kb/en-us/solutions/public/15000/700/sol15713.html
I will try to disable TCP Segmentation Offload, but I don't like that... F5, please help!
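The five events quoted above look roughly evenly spaced, which is easier to judge with the gaps written out. Below is a minimal sketch that computes the time between consecutive "HA reports tmm NOT ready." lines. The function name `not_ready_gaps` is hypothetical, and because these log lines carry no year, the year 2015 is assumed from the date of this post.

```python
import re
from datetime import datetime

# Matches the timestamp of a sod 01140045 line as quoted above; search() is
# used so a leading "ltm:" prefix (grep-style file name) is skipped.
NOT_READY_RE = re.compile(
    r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ \w+ sod\[\d+\]: 01140045"
)


def not_ready_gaps(lines, year=2015):
    """Return the gaps in seconds between consecutive 'tmm NOT ready' events.

    The year is an assumption, since the log lines do not include one.
    """
    times = []
    for line in lines:
        match = NOT_READY_RE.search(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# The five lines quoted in the post above.
SAMPLE = [
    "ltm:Aug 27 15:06:00 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.",
    "ltm:Aug 27 15:18:02 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.",
    "ltm:Aug 27 15:30:32 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.",
    "ltm:Aug 27 15:41:54 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.",
    "ltm:Aug 27 15:53:23 ssl-vpn-p notice sod[8619]: 01140045:5: HA reports tmm NOT ready.",
]

if __name__ == "__main__":
    gaps = not_ready_gaps(SAMPLE)
    print(gaps)  # [722, 750, 682, 689]
    print(sum(gaps) / len(gaps))  # 710.75
```

For these five lines the gaps are all between 11 and 13 minutes. That regularity is only an observation about this sample and does not by itself point to a cause.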
27-Aug-2015 08:04
I opened a case with F5 and they said that the problem was as below:
"The cause of the crash was because TMM encountered a segmentation fault after parsing a malformed URL."
This is fixed in HF5.
So I upgraded my systems and am waiting to see whether it is really solved!
27-Aug-2015 08:36
That's a good lead, still I daresay it's NOT fixed in HF5 since I run this very one. Do you have the slightest idea what your "malformed URL" might have been? Something on the possibly customized portal? I'm currently looking in the logs for "url" or "parse".
Thanks for your news, they're of great help! Best regards,
01-Sep-2015 02:10
Hello Eduardo,
Any update from you? The only option to somewhat "stabilize" the failovers was to run "bigstart restart sod", sod being the heartbeat process responsible for failover. But this morning I had 7 failovers already, and I can't find any root cause. I'm considering rolling back to 11.6HF2 since my issues came after upgrading to 11.6HF5.
Please let me know if you have anything at all. Regards,
Adrien Restaut
01-Sep-2015 08:11
It is working for me; I have had no problems since we installed HF5. I think you had better open a support case with F5 so they can look into this. I read about a lot of probable causes for this and concluded that it is a bug. The only thing I know is that this problem started (for me) after I began using ASM... but if you don't use ASM I really have no idea.
14-Jan-2022 07:09
Had the same issue today while busy with ASM
01-Sep-2015 09:00
Good news for you. I'm running 11.6HF5 with the issue 😕 What is ASM, by the way? Regards, Adrien
01-Sep-2015 09:06
02-Sep-2015 06:50
Thanks Eduardo. Yesterday was just a nightmare: failovers literally every 2 minutes, impossible for remote users to work. Early this morning we rolled back to 11.6HF2. So far, no failovers. BIG ISSUE HERE! Anyway, glad we don't have the problem anymore, so far. Regards.
22-Jan-2016 17:48
Hello, I know it's a bit late but have you tried this SOL article? https://support.f5.com/kb/en-us/solutions/public/17000/100/sol17155.html
25-Jan-2016 12:17
I opened a case with F5 and they said that the problem was as below:
"The cause of the crash was because TMM encountered a segmentation fault after parsing a malformed URL."
This is fixed in HF5.
So I upgraded my systems and am waiting to see whether it is really solved!
25-Jan-2016 12:26
I can concur - I've seen HF5 resolve this issue as well in the field.
30-Aug-2016 13:39
Hello, I have 11.6.0 Hotfix 6 and today I had the same failover problem with the same error. Does anyone know the cause of this failover issue? It also happened a few months ago. I opened a case then and was told it is probably a networking issue, which I am sure it is not. I opened a new case today and am waiting for an answer. Thanks
13-Nov-2020 06:30 - last edited on 04-Jun-2023 21:11 by JimmyPackets
Hello,
I'm confused, but we had a failover on 15.1.0.4 today in which the F5 switched twice, and the symptoms are very similar:
HA reports tmm NOT ready.
Sod requests links down.
Failover event detected. (Switchboard failsafe disabled while offline)
Bringing down interfaces on links-up change
Can you please try to remember the support team's answer? ASM is running on the BIG-IP.
31-Aug-2016 00:11
Hello Hajar, For your information, the problem was never solved for us. We upgraded to 12.0.0 and the failover issue reappeared. F5 provided us with an Engineering Hotfix, HF0-ENG42, which works. The fact is that more and more of our users have Windows 10, and 12.0.0 does NOT support Windows 10. I tried to upgrade to the then latest version, 12.0.0HF2, and the same failover issue reappeared after a few days. The only way to have a working situation is to stay with the Engineering Hotfix, but with no Windows 10 support. We are moving away from F5 since we have neither proper support nor trust anymore. Too bad. Good luck and please keep us up to date with your current case 🙂
31-Aug-2016 05:49
Since I upgraded to version 11.6 HF5, I have had no problems. I want to upgrade to 12.0, but I'm really afraid the problem will come back.