Forum Discussion
Sync-failover group doesn't sync properly
Hello,
I need some help with a basic Active/Standby setup where I can't get the two nodes to sync their configuration. This is the error I end up with: "did not receive last sync successfully".
VLANs are configured like this:
VLAN   | Tag | Tagged interface
Client | 11  | 1.1
HA     | 13  | 1.3
Server | 12  | 1.2
Self IPs and routes are as follows:
[root@bigip1:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt metric 4096
10.11.11.0/24 dev Client proto kernel scope link src 10.11.11.111
10.12.12.0/24 dev Server proto kernel scope link src 10.12.12.121
10.13.13.0/24 dev HA proto kernel scope link src 10.13.13.131
127.1.1.0/24 dev tmm proto kernel scope link src 127.1.1.254
127.7.0.0/16 via 127.1.1.253 dev tmm
127.20.0.0/16 dev tmm_bp proto kernel scope link src 127.20.0.254
192.168.159.0/24 dev mgmt proto kernel scope link src 192.168.159.129
[root@bigip2:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt metric 4096
10.11.11.0/24 dev Client proto kernel scope link src 10.11.11.112
10.12.12.0/24 dev Server proto kernel scope link src 10.12.12.122
10.13.13.0/24 dev HA proto kernel scope link src 10.13.13.132
127.1.1.0/24 dev tmm proto kernel scope link src 127.1.1.254
127.7.0.0/16 via 127.1.1.253 dev tmm
127.20.0.0/16 dev tmm_bp proto kernel scope link src 127.20.0.254
192.168.159.0/24 dev mgmt proto kernel scope link src 192.168.159.130
Floating IPs on both devices are set to:
- Client: 10.11.11.110
- Server: 10.12.12.120
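For reference, this is roughly the tmsh equivalent of the VLAN / self IP / floating IP setup above. It is only a sketch; object names like Client-self or HA-float are placeholders, not necessarily what I used.
# VLANs (same on both units)
tmsh create net vlan Client interfaces add { 1.1 { tagged } } tag 11
tmsh create net vlan Server interfaces add { 1.2 { tagged } } tag 12
tmsh create net vlan HA interfaces add { 1.3 { tagged } } tag 13
# Non-floating self IPs (bigip1 shown; bigip2 uses .112/.122/.132)
tmsh create net self Client-self address 10.11.11.111/24 vlan Client allow-service default
tmsh create net self Server-self address 10.12.12.121/24 vlan Server allow-service default
tmsh create net self HA-self address 10.13.13.131/24 vlan HA allow-service default
# Floating self IPs (shared via config sync, owned by traffic-group-1)
tmsh create net self Client-float address 10.11.11.110/24 vlan Client traffic-group traffic-group-1 allow-service default
tmsh create net self Server-float address 10.12.12.120/24 vlan Server traffic-group traffic-group-1 allow-service default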
Both devices have device certificates, time is in sync via NTP, and they run the same version, 17.1.0.2 Build 0.0.2 (provisioned from the same OVA), with the same license.
Config sync is set to: HA self IPs
Failover network is: HA + Management
Mirroring: HA + Server
BigIP1 is Online and BigIP2 is Forced Offline before I start building the cluster.
The hosts are connected via VMware Workstation LAN segments, so no filtering is applied. I double-checked that I can see packets with "tcpdump -nn -i <interface>" on each of the Client/Server/HA interfaces when, for example, establishing an SSH connection from the other host to the self IP of the interface being watched.
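In case it helps, a minimal sketch of the checks I mean (the peer addresses are bigip2's self IPs; run each capture on the host that owns the watched interface while opening SSH from the other host):
tcpdump -nn -i Client host 10.11.11.112 and tcp port 22
tcpdump -nn -i Server host 10.12.12.122 and tcp port 22
tcpdump -nn -i HA host 10.13.13.132 and tcp port 22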
Then I add the device trust. Soon both devices show as "In Sync" in the device_trust_group.
Then I create a sync-failover device group with the two devices, using automatic incremental sync and a max incremental sync size of 10240 KB. After this, the sync statuses are as follows:
- device_trust_group = In Sync
- Sync-Failover-Group = Awaiting Initial Sync
If I run "tcpdump -nn -i any tcp" I mostly see packets on HA network for ports 1029 and 4343
If I run "tcpdump -nn -i any udp" I mostly see packets on HA network for port 1026
tmm log:
Sep 1 22:39:29 bigip1.sq.cloud notice mcpd[7261]: 01071436:5: CMI listener established at 10.13.13.131 port 6699
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:32 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:39:34 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 22:44:48 bigip1.sq.cloud notice mcpd[7261]: 01071038:5: Master Key updated by user %cmi-mcpd-peer-10.13.13.132
Sep 1 22:52:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:57:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Lastly, I push the configuration from the device that is in the Online state to the Sync-Failover-Group.
Then the sync status reads as quoted at the beginning of this message ("did not receive last sync successfully"). The suggested sync actions (pushing either device's configuration to the group) do not help. I have already looked through K63243467 and K13946.
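For completeness, these are roughly the CLI equivalents of the sync actions I tried from the GUI (a sketch; the device group name is the one above):
tmsh show cm sync-status                                                     # overall sync state
tmsh run cm config-sync to-group Sync-Failover-Group                         # push this device's config to the group
tmsh run cm config-sync from-group Sync-Failover-Group                       # or pull the group's config instead
tmsh run cm config-sync force-full-load-push to-group Sync-Failover-Group    # force a full (non-incremental) push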
I'd appreciate any suggestions that can resolve this or help me properly push/pull the config. Thank you!
- OvovAltostratus
Thank you for the hints! I followed some of the actions described in ID882609, though it wasn't exactly my situation. Specifically, one of the devices failed to restart tmm correctly after "bigstart restart tmm"; it started spawning the following message every two seconds: "Re-starting mcpd".
I restarted that second device and ran "tail -f /var/log/tmm" on both hosts.
First device:
Sep 2 13:55:11 bigip2.xx.yyyy notice mcpd[6967]: 01b00004:5: There is an unfinished full sync already being sent for device group /Common/Sync-Failover-Group on connection 0xea1726c8, delaying new sync until current one finishes.
The second device, the one with the sync issues, kept logging the end_transaction message timeout:
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
That error message led me to K25064172 and K10142141. Even though I'm not running in AWS, my VMware Workstation VMs use the vmxnet3 driver, so I tried switching to the sock driver as suggested in those articles.
[root@bigip1:Standby:Not All Devices Synced] config # lspci -nn | grep -i eth
03:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
13:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
1b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
[root@bigip1:Standby:Not All Devices Synced] config # tmctl -d blade tmm/device_probed
pci_bdf      pseudo_name type      available_drivers     driver_in_use
------------ ----------- --------- --------------------- -------------
0000:03:00.0             F5DEV_PCI xnet, vmxnet3, sock,
0000:13:00.0 1.2         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
0000:0b:00.0 1.1         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
0000:1b:00.0 1.3         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
The fix for VMware is:
echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl
After I restarted both nodes, I finally saw the desired "In Sync" status.
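To summarize the workaround as I applied it (a sketch; run on both units, and note that I rebooted rather than relying on "bigstart restart tmm", which misbehaved for me):
echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl   # tell TMM to use the sock driver for the VMXNET3 NICs
reboot                                                                   # restart the unit
tmctl -d blade tmm/device_probed                                         # afterwards, confirm driver_in_use shows sock
tmsh show cm sync-status                                                 # and check the sync state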
Interestingly enough, I hit this issue on two separate computers running the same VMware Workstation version. I also reinstalled three different BIG-IP versions and always got the same result. Another odd thing: if I created a Sync-Only group instead of a Sync-Failover group, there were no issues at all. I think it must be some compatibility issue.
- shenoudaeNimbostratus
Hello!
I had the exact same problem, in almost exactly the same situation.
I had been trying different things for a week and nothing worked, until I followed your solution and it worked.
OMG, I finally see the devices In Sync.
Thank you very much!
- ragunath154Cirrostratus
Check the connectivity between the BIG-IPs via the HA self IPs:
10.13.13.131 and 10.13.13.132
Also check the port lockdown settings for the HA self IP.
Make sure the HA interface tagging (tagged or untagged) is consistent on both devices.
Do a telnet to port 4353 between the BIG-IPs on the HA self IPs. See the sketch below.
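For example (a rough sketch; the self IP and VLAN names are placeholders, and since telnet may not be installed on the BIG-IP, a bash /dev/tcp test works too):
ping -c 3 10.13.13.132                                              # reachability over the HA VLAN
tmsh list net self HA-self allow-service                            # port lockdown on the HA self IP
tmsh list net vlan HA interfaces                                    # tagged vs. untagged membership
timeout 3 bash -c '</dev/tcp/10.13.13.132/4353' && echo "4353 open" # TCP test to the config-sync port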
- OvovAltostratus
Thank you for the suggestion.
However, I haven't found the issue:
- Port lockdown for the HA self IPs is set to "Allow All" on both devices
- Both HA interfaces are tagged with the same VLAN 13
- The connection on 4353 is working fine; I can see packets travelling both ways on both hosts. Checked with: tcpdump -nn -i HA tcp port 4353
First host:
09:39:39.272348 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 71446:72894, ack 0, win 9018, options [nop,nop,TS val 1419664648 ecr 1419664639], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
09:39:39.272436 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [.], ack 72894, win 65535, options [nop,nop,TS val 1419664647 ecr 1419664648], length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
09:39:39.283026 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [.], seq 72894:74342, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
09:39:39.283110 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 74342:74400, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 58 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
09:39:39.793529 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [P.], seq 1:203, ack 1, win 12316, length 202 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
09:39:39.793643 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 203, win 16189, length 0 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
09:39:39.811879 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [P.], seq 1:76, ack 203, win 16189, length 75 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
09:39:39.813850 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 76, win 12391, length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
09:39:39.824753 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [P.], seq 0:202, ack 72894, win 65535, options [nop,nop,TS val 1419665200 ecr 1419664648], length 202 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
Second host:
09:41:24.654511 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 39154:40551, ack 1, win 6565, options [nop,nop,TS val 1419770029 ecr 1419770026], length 1397 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:24.658487 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 40551, win 65535, options [nop,nop,TS val 1419770030 ecr 1419770029], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:24.658558 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 40551:42079, ack 1, win 6565, options [nop,nop,TS val 1419770033 ecr 1419770030], length 1528 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:25.189243 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 3575478456, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
09:41:25.190545 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
09:41:25.190633 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 1, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
09:41:25.191423 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
09:41:25.658648 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [.], seq 40551:41999, ack 1, win 6565, options [nop,nop,TS val 1419771033 ecr 1419770030], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:25.764044 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 41999, win 65535, options [nop,nop,TS val 1419771136 ecr 1419771033], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:25.764175 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 41999:42079, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 80 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
09:41:25.764206 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 42079:43527, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
- ragunath154Cirrostratus
Looks like a connectivity issue.
Have you checked the link below?
https://cdn.f5.com/product/bugtracker/ID882609.html