Javascript injecting systems effect on web application end users - a scenario review
Hello! ArvinF is back to share a scenario review where JavaScript-injecting systems affected end users of a web application, on both web and mobile clients.

Problem

Users were failing to log in to a web application protected by BIG-IP ASM/Adv WAF and Shape Security Defense. The site owner noted that authentication was failing for an unknown reason. Responses carried an ASM Support ID and an error asking the user to enable JavaScript:

Please enable JavaScript to view the page’s content. Your support ID is: xxxxxxxxxxxx

Troubleshooting

To understand the cause of the authentication failure, we gathered HTTP traffic with an HTTP sniffer. We used httpwatch and gathered HAR (HTTP Archive) files. The site was protected with both on-premise BIG-IP ASM/Adv WAF bot defense and, back then, Shape Security Defense (now F5 Distributed Cloud Bot Defense). After reviewing the HAR file in httpwatch, the following were noted:

ASM blocked a request to a URL related to authentication, with a Support ID in the response. The response also included JavaScript code that references https[:]//s[.]go-mpulse[.]net/boomerang/.

The authentication attempt failed with an error in the HTTP response: ...unable to process your request. Please try again later...

HTTP cookies from various BIG-IP ASM/Adv WAF features were present: the TSPD_101* cookie from Bot Defense client-side challenges, and other TS cookies, which could also come from Bot Defense, DoS profile, and security policy configurations.

There were also HTTP cookies coming from BIG-IP AVR: the f5_cspm cookie was present. The Application Visibility and Reporting (AVR) module provides detailed charts and graphs to give you more insight into the performance of web applications, with detailed views on HTTP and TCP stats, as well as system performance (CPU, memory, etc.).
https://clouddocs.f5.com/training/community/analytics/html/index.html https://clouddocs.f5.com/api/irules/AVR_CSPM_INJECTION.html Seeing the javascript code referencing "/boomerang/" included in the ASM blocking response was interesting. Reviewing the HAR file, there were several instances of this "/boomerang/". This finding was inquired with the site owner and they noted that there is another system that is in the path between the end users and their web application - a CDN. The traffic flow is as follows: End user web browser / mobile application >>> CDN >>> FW >>> BIG-IP >>> web application On the BIG-IP Virtual Server that fronts the web application, F5 AVR profile, ASM/Adv WAF Bot defense, and security policy and Shape Security defense iRule are configured. From the F5 side, these were the products with features that may insert Javascript in the client-side response. As part of troubleshooting, to isolate the feature that might be causing the failing authentication for the web application, the bot defense profile was removed from the site's Virtual Server and the Shape Security iRule and AVR profile were left untouched. Site owner noted that the authentication works after this change. Shape Security Defense was implemented using an iRule to protect specific URIs. When the iRule was removed from the Virtual Server and the Bot defense and AVR profile were left on, the VS, Site owner noted that the authentication works after this change. But if both ASM/Adv WAF Bot defense and Shape Security Defense iRule is configured on the VS, the site's authentication fails. Per the site owner, there were no changes in the Bot Defense or Shape Security Defense iRule configurations prior to the incident and that these configurations were in place way before the incident. Site owners shared the findings with their respective internal teams for their review. 
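The manual HAR review described in this section can be partly automated. What follows is a minimal sketch, assuming a standard HAR export (a JSON document with log.entries, each holding request.url and response.content.text, optionally base64-encoded). The function name find_markers and the two marker strings are illustrative choices taken from the findings above; they are not part of HTTPWatch or BIG-IP.

```python
import base64

# Marker strings are assumptions drawn from the findings in this review.
MARKERS = ("/boomerang/", "Your support ID is")


def find_markers(har, markers=MARKERS):
    """Return (url, status, marker) for each HAR entry whose request URL or
    response body contains one of the marker strings."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        response = entry.get("response", {})
        content = response.get("content", {})
        body = content.get("text") or ""
        # HAR allows response bodies to be stored base64-encoded.
        if body and content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        for marker in markers:
            if marker in url or marker in body:
                hits.append((url, response.get("status"), marker))
    return hits
```

To use it, load the exported file with json.load() and pass the resulting dictionary to find_markers(). Entries that match both markers at once (an ASM blocking page that also carries the injected script) are the ones worth opening first.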
Resolution

Afterwards, the site owner shared that their site now works as expected and authentication works for the web application, with no changes made to either the ASM/Adv WAF Bot Defense or the Shape Security Defense iRule on the site's VS. The cause of the authentication failure was undetermined. One theory is that another system was inserting JavaScript code in the responses, and that this affected the authentication process of the web application by preventing that portion of the site from loading.

Additional Troubleshooting Notes

The data gathered during troubleshooting were the qkview and the HTTPWatch capture (HAR files). It would have helped if a packet capture had been taken along with the HTTPWatch capture while the issue was happening, to have a full view of the issue. Decrypt the packet capture to observe the HTTP exchanges and correlate them with the HTTPWatch capture events. The corresponding BIG-IP ASM/Adv WAF application event logs and Bot Defense or DoS protection logs are also helpful in the correlation. Having a visual idea of how the Security Policy, Bot Defense, or DoS protection profile is configured is also helpful, so it's good to have screenshots of these. Analysis is easier when the data is complete. Gathering the asmqkview with report and traffic data, and the corresponding ASM and AVR db dumps, helps in the analysis.
asmqkview -s0 --add-request-log --include-traffic-data -f /var/tmp/`/bin/hostname`_asmqkview_`date +%Y%m%d%H%M%S`.tgz
# mysqldump -uroot -p`perl -I/ts/packages -MPassCrypt -nle 'print PassCrypt::decrypt_password($_)' /var/db/mysqlpw` DCC | gzip -9 > /shared/tmp/dcc.dump.gz
# mysqldump -uroot -p`perl -I/ts/packages -MPassCrypt -nle 'print PassCrypt::decrypt_password($_)' /var/db/mysqlpw` PLC | gzip -9 > /shared/tmp/plc.dump.gz
# mysqldump -uroot -p`perl -I/ts/packages -MPassCrypt -nle 'print PassCrypt::decrypt_password($_)' /var/db/mysqlpw` PRX | gzip -9 > /shared/tmp/prx.dump.gz
# mysqldump -uroot -p`perl -I/ts/packages -MPassCrypt -nle 'print PassCrypt::decrypt_password($_)' /var/db/mysqlpw` logdb | gzip -9 > /shared/tmp/logdb.dump.gz

It also helps to know which systems sit in the path of the web application and whether they have features that may interfere with the features of BIG-IP ASM/Adv WAF or Shape Security Defense. Per the findings, there was a CDN injecting JavaScript code in the HTTP response, and it may have contributed to the authentication failure for the end users. Isolate potentially conflicting features by removing them one at a time and observing the HTTP responses. Per the reference configuration, BIG-IP ASM/Adv WAF, Shape Security Defense, and BIG-IP AVR worked well prior to the incident.

boomerang

The injected JavaScript code noted in the ASM blocking page response was loaded from https[:]//s[.]go-mpulse[.]net/boomerang/. Checking this reference, it was related to https://github.com/akamai/boomerang. boomerang is a JavaScript library that measures the page load time experienced by real users, commonly called RUM (Real User Measurement). It has the ability to send this data back to your server for further analysis. With boomerang, you find out exactly how fast your users think your site is.
In BIG-IP, the similar product is BIG-IP AVR (Application Visibility and Reporting), which collects "performance of web applications, with detailed views on HTTP and TCP stats, as well as system performance (CPU, memory, etc.)." Organizations may have specific needs for the data they collect from their site or web application, and a customizable solution such as boomerang can help.

That's It For Now

I hope this scenario review of the effect of JavaScript-injecting systems on web application end users will be helpful in your next troubleshooting session, and that it gives you guidance on what data to gather, what to look for, and which troubleshooting options to try. The F5 SIRT creates security-related content posted here in DevCentral, sharing the team’s security mindset and knowledge. Feel free to view the articles that are tagged with the following: F5 SIRT series-F5SIRT-this-week-in-security TWIS

9. SYN Cookie Troubleshooting: Logs
Introduction In this last article I will add the last piece of information you can check when troubleshooting TCP SYN Cookie attacks, logs. With this information together with all that you have learned until now you should be able to understand how SYN Cookie is behaving and decide if there is any change you should do in your configuration to improve it. Use cases LTM SYN Cookie at Global context Logs when Global SYN Check Threshold or Default Per Virtual Server SYN Check Threshold has been exceeded are similar, so in order to know in which context was SYN Cookie activated you need to focus on specific text in logs. For example, by having below config: turboflex profile feature => adc tmsh list sys db pvasyncookies.enabled => true tmsh list ltm global-settings connection default-vs-syn-challenge-threshold => 1500 <= tmsh list ltm global-settings connection global-syn-challenge-threshold => 2050 <= tmsh list ltm profile fastl4 syn-cookie-enable => enabled You will get logs similar to the ones below if Global SYN cache has been reached: Dec 7 03:03:02 B12050-R67-S8 warning tmm9[5507]: 01010055:4: Syncookie embryonic connection counter 2051 exceeded sys threshold 2050 Dec 7 03:03:02 B12050-R67-S8 warning tmm5[5507]: 01010055:4: Syncookie embryonic connection counter 2051 exceeded sys threshold 2050 Dec 7 03:03:02 B12050-R67-S8 notice tmm5[5507]: 01010240:5: Syncookie HW mode activated, server name = /Common/syncookie_test server IP = 10.10.20.212:80, HSB modId = 1 Dec 7 03:03:02 B12050-R67-S8 notice tmm9[5507]: 01010240:5: Syncookie HW mode activated, server name = /Common/syncookie_test server IP = 10.10.20.212:80, HSB modId = 2 As you can notice there are two different messages, the first one informs about Software SYN Cookie being activated at Global context, and the second one tells us that Hardware is offloading SYN Cookie from Software. 
Since there is a minimum delay before Hardware starts to offload SYN Cookie, it is expected to see a non-zero value for the Current SYN Cache stats counter. See the article in this SYN Cookie series about stats for more information. The Global SYN cache value is configured per TMM, so the log shows that the 2050 threshold has been exceeded in a TMM, and therefore SYN Cookie is activated globally in the device. In this specific example the device has two HSBs, and BIG-IP decided that tmm9 and tmm5 would each activate one of them; this is why we see the logs repeated.

LTM SYN Cookie at Virtual context

For the same configuration example shown above, you will see logs similar to the ones below if the Virtual SYN cache threshold has been reached:

Oct 18 02:26:32 I7800-R68-S7 warning tmm[15666]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 18 02:26:32 I7800-R68-S7 notice tmm[15666]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1
Oct 18 02:26:32 I7800-R68-S7 notice tmm[15666]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 2

The Virtual SYN cache value is configured globally, meaning that the configured value must be divided among TMMs to know when SYN cookie will be enabled on each TMM. Run the command below to see the physical number of cores:

tmsh sho sys hard | grep core

In this example the device has 6 TMMs, so 1500/6 is 250. Note that you will see a warning message entry per TMM (I removed 3 log entries in the above example in order to summarize) and per HSB ID. The log does not always show the VIP’s IP; it depends on the type of VIP.
For example in below case:

Oct 17 04:04:54 I7800-R68-S7 warning tmm2[22805]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 17 04:04:54 I7800-R68-S7 warning tmm3[22805]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 17 04:04:55 I7800-R68-S7 notice tmm2[22805]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1
Oct 17 04:05:51 I7800-R68-S7 notice tmm2[22805]: 01010241:5: Syncookie HW mode exited,server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1 from HSB

There is no virtual server configured with destination IP 10.10.20.212. In fact the traffic is handled by a wildcard VIP listening on 0.0.0.0/0; the logged IP is the destination IP:Port of the request that triggered SYN Cookie. You can consider this IP the most probable target, since it was the one that exceeded the threshold, so you can assume there are more attacks to this IP. However, the attack could also be aimed at random destination IPs.

Important: The Per-Virtual SYN Cookie threshold MUST be lower than the Global threshold. If you configure the Virtual Server threshold higher than Global, or 0, then internally BIG-IP will set the SYN Cookie Global threshold equal to the Per-Virtual SYN Cookie threshold.
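Telling the Global and Virtual contexts apart in /var/log/ltm comes down to the message ID and wording. The following is a small sketch, assuming the log wording matches the samples in this article (other TMOS versions may differ). The helper names classify and per_tmm_threshold are hypothetical, and the integer division is only an assumption that reproduces the 1500/6 = 250 arithmetic above.

```python
import re

# Patterns modelled on the sample log lines in this article.
GLOBAL_RE = re.compile(
    r"01010055:4: Syncookie embryonic connection counter (\d+) "
    r"exceeded sys threshold (\d+)"
)
VIRTUAL_RE = re.compile(
    r"01010038:4: Syncookie counter (\d+) exceeded vip threshold (\d+) "
    r"for virtual = (\S+)"
)


def classify(line):
    """Return the context that activated SYN Cookie for a log line, or None."""
    m = GLOBAL_RE.search(line)
    if m:
        return {"context": "global", "counter": int(m.group(1)),
                "threshold": int(m.group(2))}
    m = VIRTUAL_RE.search(line)
    if m:
        return {"context": "virtual", "counter": int(m.group(1)),
                "threshold": int(m.group(2)), "destination": m.group(3)}
    return None


def per_tmm_threshold(configured, tmm_count):
    """Split the device-wide per-virtual threshold across TMMs (assumed
    integer division, matching the worked example in the text)."""
    return configured // tmm_count
```

Keep in mind that for a wildcard virtual server the "destination" value is the destination of the request that crossed the threshold, not a configured virtual address.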
LTM SYN Cookie at VLAN context

Configuration example for triggering LTM SYN Cookie at VLAN context:

turboflex profile feature => adc
tmsh list sys db pvasyncookies.enabled => true
tmsh list ltm global-settings connection vlan-syn-cookie => enabled
tmsh list net vlan hardware-syncookie => [vlan external: 2888]
tmsh list ltm global-settings connection default-vs-syn-challenge-threshold => 0
tmsh list ltm global-settings connection global-syn-challenge-threshold => 2500

When SYN cookie is triggered you get this log:

Oct 17 10:27:23 I7800-R68-S7 notice tmm[15666]: 01010292:5: Hardware syncookie protection activated on VLAN 1160 (syncache:2916 syn flood pkt rate:0)

In this case the information for virtual servers on this VLAN will show SYN cookie as ‘not activated’, because protection is at VLAN context:

# tmsh show ltm virtual | grep ' status ' -i
Status not-activated
Status not-activated

If you configure SYN Cookie per VLAN but Turboflex adc/security is not provisioned, then you will get:

Oct 17 04:39:52 I7800-R68-S7.sin.pslab.local warning mcpd[7643]: 01071859:4: Warning generated : This platform supports Neuron-based Syncookie protection on per VS basis (including wildcard virtual). Please use that feature instead

AFM SYN Cookie at Global context

The main difference in the AFM default log is that you will not get a message telling you that the threshold has been exceeded; instead, the log informs you directly about the context that detected the attack.
Configuration example for triggering AFM SYN Cookie at global context: turboflex profile feature=> security tmsh list ltm global-settings connection vlan-syn-cookie=> enabled tmsh list net vlan hardware-syncookie[not compatible with DoS device] tmsh list sys db pvasyncookies.enabled=> true tmsh list ltm global-settings connection default-vs-syn-challenge-threshold.=> 0 tmsh list ltm global-settings connection global-syn-challenge-threshold=> 2500 tmsh list security dos device-config default-internal-rate-limit (tcp-half-open)=> >2500 tmsh list security dos device-config detection-threshold-pps (tcp-half-open)=> 2500 tmsh list ltm profile fastl4 syn-cookie-enable=> enabled AFM Device DoS has preference over LTM Global SYN Cookie, so in above configuration AFM tcp half open vector will be triggered: Oct 19 02:23:41 I7800-R68-S7 err tmm[23288]: 01010252:3: A Enforced Device DOS attack start was detected for vector TCP half open, Attack ID 1213152658. Oct 19 02:29:23 I7800-R68-S7 notice tmm[23288]: 01010253:5: A Enforced Device DOS attack has stopped for vector TCP half open, Attack ID 1213152658. In the example above you can see that there are logs warning you about an attack that started and stopped, but there is not any log showing if attack is mitigated. This is because you have not configured AFM to log to local-syslog (/var/log/ltm). In this situation DoS logs are basic. If you want to see packets dropped or allowed you need to configure specific security log profile. Be aware that when SYN Cookie is active because Device TCP half open DoS vector’s threshold has been reached then you will not see any Virtual Server showing that SYN Cookie has been activated, as it happens when SYN Cookie VLAN is activated: SYN Cookies Statusnot-activated This is slightly different to LTM Global SYN Cookie, when LTM Global SYN Cookie is enabled BIG-IP will show which specific VIP has SYN Cookie activated (Status Full Hardware/Software). 
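Even with only the basic start and stop messages, the duration of each attack can be worked out by pairing them on the Attack ID. This is a minimal sketch, assuming the message wording shown in the samples above; the function name attack_durations is hypothetical. Syslog timestamps carry no year, so the code pins a fixed year and therefore assumes start and stop fall in the same calendar year.

```python
import re
from datetime import datetime

EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*?"
    r"0101025[23]:\d: A (?P<scope>.+?) DOS attack "
    r"(?P<event>start was detected|has stopped) "
    r"for vector (?P<vector>.+?), Attack ID (?P<attack_id>\d+)"
)


def attack_durations(lines):
    """Pair start/stop messages by Attack ID; return seconds per attack."""
    started = {}
    durations = {}
    for line in lines:
        m = EVENT_RE.search(line.strip())
        if not m:
            continue
        # Prefix an arbitrary leap year so the timestamp parses unambiguously.
        ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S")
        attack_id = m["attack_id"]
        if m["event"] == "start was detected":
            started[attack_id] = ts
        elif attack_id in started:
            delta = ts - started.pop(attack_id)
            durations[attack_id] = int(delta.total_seconds())
    return durations
```

Feeding it the two sample lines for Attack ID 1213152658 gives the time between 02:23:41 and 02:29:23. An attack that has started but not yet stopped simply does not appear in the result.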
In case you have configured logging for DoS then you would get logs like these: Oct 23 03:56:15 I7800-R68-S7 err tmm[21638]: 01010252:3: A Enforced Device DOS attack start was detected for vector TCP half open, Attack ID 69679369. Oct 23 03:56:15 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:15","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Started","None","0","0","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS" Oct 23 03:56:16 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:16","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Sampled","Drop","3023","43331","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS" Oct 23 03:56:16 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:16","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Sampled","Drop","3017","69710","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS” The meaning of below fields shown in above logs: "Drop","3023","43331","0000000000000000" "Drop","3017","69710","0000000000000000" Are as below: {action} {dos_packets_received} {dos_packets_dropped} {flow_id} Where: {dos_packets_received} - It counts the number of TCP SYNs received for which you have not received the ACK. Also called embryonic SYNs. {dos_packets_dropped} - It counts the number of TCP syncookies that you have sent for which you have not received valid ACK. If you see that {dos_packets_received}are high, but {dos_packets_dropped} are 0 or low, then it just means that AFM is responding to SYN packets with SYN cookies and it is receiving correct ACKs from client. Therefore, AFM is not dropping anything. So this could mean that this is not an attack but a traffic peak. 
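The received and dropped counters can be pulled out of these records programmatically. The sketch below assumes the quoted, comma-separated layout of the sample lines above and locates the counters relative to the event field ("Attack Started" or "Attack Sampled") rather than by fixed position. The function name parse_dos_record and the dictionary keys are illustrative, not an F5 schema.

```python
import csv
import io

# Event names as they appear in the sample log lines.
EVENTS = {"Attack Started", "Attack Sampled"}


def parse_dos_record(line):
    """Extract vector, attack ID, action and packet counters from one
    DoS log line, or return None if the line has no recognised event."""
    start = line.find('"')
    if start < 0:
        return None
    # skipinitialspace handles the ', "Enforced"' style separators.
    fields = next(csv.reader(io.StringIO(line[start:]), skipinitialspace=True))
    for i, value in enumerate(fields):
        if value in EVENTS and i >= 2 and i + 4 < len(fields):
            return {
                "vector": fields[i - 2],
                "attack_id": fields[i - 1],
                "event": value,
                "action": fields[i + 1],
                "received": int(fields[i + 2]),
                "dropped": int(fields[i + 3]),
                "flow_id": fields[i + 4],
            }
    return None
```

Comparing "received" with "dropped" across consecutive "Attack Sampled" records is then the check described above: a high received count with a dropped count at or near zero points to a traffic peak rather than an attack.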
It can happen that you configure a mitigation threshold lower than detection threshold, although you will get a message warning you, you could not realise about it. If this is the case you will not see any log informing you about that there is an attack. This will happen for example with below configuration: tmsh list ltm global-settings connection global-syn-challenge-threshold=> 3400 tmsh list security dos device-config default-internal-rate-limit (tcp-half-open)=> 3000 tmsh list security dos device-config detection-threshold-pps (tcp-half-open)=> 3900 tmsh list ltm profile fastl4 syn-cookie-enable=> disabled Due to this you will see in /var/log/ltm something like: Oct 23 03:38:12 I7800-R68-S7.sin.pslab.local warning mcpd[10516]: 01071859:4: Warning generated : DOS attack data (tcp-half-open): Since drop limit is less than detection limit, packets dropped below the detection limit rate will not be logged. AFM SYN Cookie at Virtual context All information provided in previous use case applies in here, so for below configuration example: tmsh list ltm global-settings connection global-syn-challenge-threshold=> 3400 tmsh list security dos device-config default-internal-rate-limit (tcp-half-open)=> 3000 tmsh list security dos device-config detection (tcp-half-open)=> 3900 list security dos profile SYNCookie dos-network default-internal-rate-limit (tcp-half-open)=> 2000 list security dos profile SYNCookie dos-network detection-threshold-pps (tcp-half-open)=> 1900 tmsh list ltm profile fastl4 <name>=> enabled AFM device SYN Cookie is activated for specific virtual server with security profile applied: Oct 23 04:10:26 I7800-R68-S7 notice tmm[21638]: 01010240:5: Syncookie HW mode activated, server = 0.0.0.0:0, HSB modId = 1 Oct 23 04:10:26 I7800-R68-S7 notice tmm5[21638]: 01010240:5: Syncookie HW mode activated, server = 0.0.0.0:0, HSB modId = 2 Oct 23 04:10:26 I7800-R68-S7 err tmm3[21638]: 01010252:3: A NETWORK /Common/SYNCookie_Test DOS attack start was detected for 
vector TCP half open, Attack ID 2147786126. Oct 23 04:10:28 I7800-R68-S7 info tmm[16005]: 23003156 "10.200.68.7","I7800-R68-S7.sin.pslab.local","Virtual Server","/Common/SYNCookie_Test","Cryptographic SYN Cookie","16973","0","0","0", Oct 23 04:10:57 I7800-R68-S7 notice tmm5[21638]: 01010253:5: A NETWORK /Common/SYNCookie_Test DOS attack has stopped for vector TCP half open, Attack ID 2147786126. Oct 23 04:12:46 I7800-R68-S7 notice tmm[21638]: 01010241:5: Syncookie HW mode exited,server = 0.0.0.0:0, HSB modId = 1 from HSB Oct 23 04:12:47 I7800-R68-S7 notice tmm5[21638]: 01010241:5: Syncookie HW mode exited,server = 0.0.0.0:0, HSB modId = 2 from HSB Troubleshooting steps When you need to troubleshoot how device is working against a SYN flood attack there are some steps you can follow. Check configuration to make a global idea of what should happen when SYN flood occurs: tmsh show sys turboflex profile feature tmsh list ltm global-settings connection vlan-syn-cookie tmsh list net vlan hardware-syncookie tmsh list sys db pvasyncookies.enabled tmsh list ltm global-settings connection default-vs-syn-challenge-threshold tmsh list ltm global-settings connection global-syn-challenge-threshold tmsh list ltm profile fastl4 syn-cookie-enable tmsh list ltm profile tcp all-properties | grep -E 'profile|syn-cookie' tmsh list ltm profile fastl4 all-properties| grep -E 'profile|syn-cookie' list security dos device-config syn-cookie-whitelist syn-cookie-dsr-flow-reset-by tscookie-vlans tmsh list security dos device-config dos-device-config | grep -A23 half tmsh list security dos profile dos-network {<profile> { network-attack-vector { tcp-half-open } } } *I can miss some commands since I cannot know specific configuration you are using, but above list can give you a good idea about what you have actually configured in your system. Are you using Hardware or Software SYN cookie? Are you using CMP or vCMP? Is device a Neuron platform? 
Is SYN cookie configured/working in AFM, in LTM or in both? Is SYN cookie enabled at Device, VLAN or Virtual Server context? If the issue is at virtual server context, which virtual servers are affected? Is the problem happening in a Standard or FastL4 VIP, …? Check logs (date/times) and stats to confirm what has really happened and since when. Take captures to confirm your findings. Is this an attack? Were there other attacks at the same time (TCP BAD ACK, TCP RST maybe)? Are thresholds correctly configured for the expected amount of traffic? If clients are hidden by a proxy, maybe you could save resources by configuring Challenge and remember. If this is a Neuron platform, is there any error in /var/log/neurond? Check published IDs related to SYN Cookie for specific TMOS versions and/or platforms.

Conclusion

Now you have enough information to start troubleshooting your own BIG-IP devices if any issue happens and, maybe more importantly, you have the tools to create the most appropriate configuration for your specific platform and traffic patterns. So you can start to take advantage of your knowledge to improve the performance of your device when under TCP SYN flood attack.

Packet Analysis with Scapy and tcpdump: Checking Compatibility with F5 SSL Orchestrator
In this guide I want to demonstrate how you can use Scapy (https://scapy.net/) and tcpdump for reproducing and troubleshooting layer 2 issues with F5 BIG-IP devices. Just in case you get into a finger-pointing situation...

Starting situation

This is a quite recent story from the trenches: My customer uses a Bypass Tap to forward or mirror data traffic to inline tools such as IDS/IPS, WAF or threat intelligence systems. This Bypass Tap offers a feature called Network Failsafe (also known as Fail-to-Wire). This is a fault tolerance feature that protects the flow of data in the event of a power outage and/or system failure. It allows traffic to be rerouted while the inline tools (IDS/IPS, WAF or threat intelligence systems) are shutting down, restarting, or unexpectedly losing power (see the red line named Fallback in the picture below).

Since the Bypass Tap itself does not support SSL decryption and re-encryption, an F5 BIG-IP SSL Orchestrator shall be introduced as an inline tool in a Layer 2 inbound topology. Tools directly connected to the Bypass Tap will be connected to the SSL Orchestrator for better visibility. To check the status of the inline tools, the Bypass Tap sends health checks through the inline tools. What is sent on one interface must be seen on the other interface and vice versa. So if all is OK (health check is green), traffic will be forwarded to the SSL Orchestrator, decrypted and sent to the IDS/IPS and the TAP, and then re-encrypted and sent back to the Bypass Tap. If the Bypass Tap detects that the SSL Orchestrator is in a failure state, it will just forward the traffic to the switch. This is the traffic flow of the health checks:

Target topology

This results in the following topology:

Problem description

During commissioning of the new topology, it turned out that the health check packets are not forwarded through the vWire configured on the BIG-IP.
A packet analysis with Wireshark revealed that the manufacturer uses ARP-like packets with opcode 512 (HEX 02 00). This opcode is not defined in the RFC that describes ARP (https://datatracker.ietf.org/doc/html/rfc826); the RFC only describes the opcodes Request (1 or HEX 00 01) and Reply (2 or HEX 00 02).

NOTE: Don't get confused that you see ARP packets on port 1.1 and 1.2. They are not passing through; the Bypass Tap just sends those packets from both sides of the vWire, as explained above. The source MACs on port 1.1 and 1.2 are different.

Since the Bypass Tap is located right behind the customer's edge firewall, lengthy and time-consuming tests on the live system are not an option, since they would result in a massive service interruption. Therefore, a BIG-IP i5800 (the same model as the customer's) was set up as SSL Orchestrator and a vWire configuration was built in my employer's lab. The vWire configuration can be found in this guide (https://clouddocs.f5.com/sslo-deployment-guide/chapter2/page2.7.html).

INFO: For those not familiar with vWire: "Virtual wire … creates a layer 2 bridge across the defined interfaces. Any traffic that does not match a topology listener will pass across this bridge."

Lab Topology

The following topology was used for the lab:

I built a vWire configuration on the SSL Orchestrator, as in the customer's environment. A Linux system with Scapy installed was connected to Interface 1.1. With Scapy, TCP, UDP and ARP packets can be crafted, or sent as a replay from a Wireshark capture. Interface 1.3 was connected to another Linux system that should receive the ARP packets. All tcpdumps were captured on the F5 and analyzed on the admin system (not plotted).

Validating vWire Configuration

To check the functionality of the F5 and the vWire configuration, two tests were performed: a replay of the Healthcheck packets from the Bypass Tap and a test with RFC-compliant ARP requests.
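The difference between the two test packets sits in a single 16-bit field. The sketch below uses only the Python standard library (not Scapy, and it sends nothing on the wire) to build an untagged Ethernet + ARP frame and read the opcode back. The helper names and the MAC/IP values are made up for illustration, and the classification says nothing about how BIG-IP itself decides what to forward.

```python
import struct

ETHERTYPE_ARP = 0x0806
ARP_OPCODES = {1: "request", 2: "reply"}  # the two opcodes described in RFC 826


def build_arp_frame(opcode, src_mac, src_ip, dst_mac=b"\xff" * 6,
                    target_mac=b"\x00" * 6, target_ip=b"\x00" * 4):
    """Build a raw, untagged Ethernet + ARP (IPv4 over Ethernet) frame."""
    ether = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, then opcode.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)
    arp += src_mac + src_ip + target_mac + target_ip
    return ether + arp


def arp_opcode(frame):
    """Return the ARP opcode of an untagged Ethernet frame, or None."""
    if len(frame) < 22:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    # 14-byte Ethernet header + 6 bytes of ARP header precede the opcode.
    (opcode,) = struct.unpack("!H", frame[20:22])
    return opcode


def describe(frame):
    """Name the opcode, flagging values RFC 826 does not describe."""
    opcode = arp_opcode(frame)
    if opcode is None:
        return "not ARP"
    return ARP_OPCODES.get(opcode, f"non-RFC-826 opcode {opcode} (0x{opcode:04x})")
```

Building one frame with opcode 1 and one with opcode 512 and passing both to describe() shows the distinction the two lab tests rely on. On the wire, 512 is the byte pair 02 00, which is the byte-swapped form of the Reply opcode 00 02.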
Use Scapy to resend the faulty packets

First, I used Wireshark to extract a single packet from the packet analysis we took in the customer environment and saved it to a pcap file. I replayed this pcap file to the F5 with Scapy. The sendp() function works at layer 2; here it is given rdpcap() (the location of the pcap file for replay) and iface (which interface it shall use for sending).

webserverdude@tux480:~$ sudo scapy -H
WARNING: IPython not available. Using standard Python shell instead.
AutoCompletion, History are disabled.
Welcome to Scapy (2.5.0)
>>> sendp(rdpcap("/home/webserverdude/cusomter-case/bad-example.pcap"),iface="enp0s31f6")
.
Sent 1 packets.

This test confirmed the behavior that was observed in the customer's environment. The F5 BIG-IP does not forward this packet.

Use PING and Scapy to send RFC-compliant ARP packets

To create RFC-compliant ARP requests, I first sent an ARP request (opcode 1) through the vWire via the PING command. As expected, this was sent through the vWire. To ensure that this also works with Scapy, I also resent this packet with Scapy.

>>> sendp(rdpcap("/home/webserverdude/cusomter-case/good-example.pcap"),iface="enp0s31f6")
.
Sent 1 packets.

In the Wireshark analysis it can be seen that this packet is incoming on port 1.1 and then forwarded to port 1.3 through the vWire.

Solving the issue with the help of the vendor

It became evident that the BIG-IP was dropping ARP packets that failed to meet RFC compliance, rendering the Bypass Tap from this particular vendor seemingly incompatible with the BIG-IP. Following my analysis, the vendor was able to develop and provide a new firmware release addressing this issue. To verify that the issue was resolved in this firmware release, my customer's setup, the exact same model of the Bypass Tap and a BIG-IP i5800, was deployed in my lab, where the new firmware underwent thorough testing.
With this approach I could test the functionality and compatibility of the systems under controlled conditions. In this Wireshark analysis it can be seen that the Healthcheck packets are incoming on port 1.1 and then forwarded to port 1.3 through the vWire (marked in green) and also the other way round, coming in on port 1.3 and then forwarded to port 1.1 (marked in pink). You can also now see that the packet is a proper gratuitous ARP reply (https://wiki.wireshark.org/Gratuitous_ARP). Because the Healthcheck packets were no longer dropped by the BIG-IP but forwarded through the vWire, the Bypass Tap subsequently marked the BIG-IP as healthy and available. The new firmware resolved the issue. Consequently, my customer could confidently proceed with this project, free from the constraints imposed by the compatibility issue.

What CLI tools are in your toolbox for testing or troubleshooting application deployments on BIG-IP?
In trying to nail down a rate limiting solution, I reached back for my trusty apache-bench utility, ab, to push some load with concurrency at the virtual server. But... in today's SSL-everywhere world, I discovered that ab does not have a flag to ignore SSL errors, and I don't have valid certs in my local lab, so I couldn't use that to push load. In a quick search I found hey, installed that on my ubuntu test box, and it seems to be a reasonable replacement (see below). But that got me thinking... what cli tools are out there for test/troubleshooting that I'm missing out on?

jrahm@unuc:~$ hey -n 10000 -c 5 https://172.16.101.50/

Summary:
  Total: 8.2474 secs
  Slowest: 0.0510 secs
  Fastest: 0.0023 secs
  Average: 0.0041 secs
  Requests/sec: 1212.5047
  Total data: 20000 bytes
  Size/request: 2 bytes

Response time histogram:
  0.002 [1] |
  0.007 [9956] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.012 [36] |
  0.017 [2] |
  0.022 [0] |
  0.027 [0] |
  0.032 [0] |
  0.036 [0] |
  0.041 [0] |
  0.046 [0] |
  0.051 [5] |

Latency distribution:
  10% in 0.0032 secs
  25% in 0.0036 secs
  50% in 0.0040 secs
  75% in 0.0045 secs
  90% in 0.0050 secs
  95% in 0.0055 secs
  99% in 0.0065 secs

Kerberos is Easy - Part 2
Yes my friends, this post has been long overdue. Life, work and all the other good excuses got in the way. However, there is nothing like a friend calling you out with a “WTF I need part 2” to get the motivation and kerberos mana flowing again. So where did we leave off? In part 1 we discussed some of the most common issues with Kerberos authentication and the necessity to break the problem down into client-side vs server-side authentication. In part 2 we will look at some of the first troubleshooting steps I take to determine why nothing is working - yes, this happens to me too.

ADTest is Your New BF4L

You know those people that open a web browser to see if their Internet works… yea, don’t be that person. Open a terminal, check if you can ping your gateway, public DNS server, etc. and then, only then, open a browser. ADTest is your equivalent of ping. Don’t assume that because you configured an Active Directory AAA object, authentication is just going to work. Please open a console and verify with ADTest that Kerberos authentication against the AD server is working. Check out my APM Troubleshooting with ADTest for more information.

Time is not on Your Side

If you have worked with Kerberos before you know it is super picky about time drift. If ADTest just won’t work and you can’t figure out why, ensure the BIG-IP’s time matches the KDC. If you need to adjust the BIG-IP, follow F5 SOL3381.

These Are Not the KDCs You’re Looking For

Ever been in a multi-domain environment where the AD admin swears the KDC you’re talking to is the correct one, never mind the “Kerberos Principal Unknown” error you keep getting? If ADTest doesn’t work, then we need to ensure the AD server we’re talking to is a KDC for the expected realm. Now, if someone can RDP into the server this can be ruled out pretty quickly, but when in life is anything that easy.
So it’s helpful to use nslookup to find all KDCs for the intended domain and ensure the IP you were given is in this list - example below:

C:\Users\user>nslookup -type=SRV _kerberos._tcp.dc._msdcs.f5demo.com
Server:  ad1.f5demo.com
Address:  10.1.10.2

_kerberos._tcp.dc._msdcs.f5demo.com    SRV service location:
          priority     = 0
          weight       = 100
          port         = 88
          svr hostname = ad1.f5demo.com
ad1.f5demo.com    internet address = 10.1.10.2
ad1.f5demo.com    internet address = 10.1.1.3

Playing Go Fish with SPNs
You have to admire how Microsoft took something as complicated as Kerberos and made it trivial to deploy and manage inside of Active Directory. Having managed the MIT and MS versions of Kerberos myself, I felt a little guilty after setting up my first AD server without hours of troubleshooting issues. However, with that ease of use Microsoft also made it easy to shoot your own foot off if you have no basic understanding of Kerberos. Think of the KDC as a key, value pair database. The KDC will let you store multiple keys of the same value even though you shouldn’t do that. So if you have multiple SPN entries in AD, you are not guaranteed that a request for a ticket will return the value you’re looking for. This typically presents itself in APM as authentication working one time and not the next. An easy way to check this is to log into a domain machine and issue:

setspn -X

This will print out any duplicate SPNs in your KDC. If the SPN you are working with appears in this list then you need to correct this issue. The easiest way I find is to delete the service account you have created for APM and just use the service account the web server application pool is using.

Cached Tickets
APM caches Kerberos tickets for both client side Kerberos authentication and server side Kerberos SSO. If you’re troubleshooting Kerberos be sure to clear these caches after you’ve made modifications.
For Kerberos AAA:
bigstart restart apd rba

For Kerberos SSO:
bigstart restart websso

I’ll Just Do It Myself
If you are working with Kerberos SSO, you sometimes have to determine whether the issue is with APM obtaining the token or with the web server not accepting the token. If you have your WebSSO logging set to debug and Kerberos SSO is working, you should see:

S4U=====>OK

If not, there are a few CLI commands you can use to simulate the request APM makes to the KDC.

First, remove all Kerberos tickets:
kdestroy

Second, obtain a Kerberos ticket as the AD delegation account (if this works you won’t get a response):
kinit -f <SPN of AD delegation account>

Finally, test if the ticket you obtained has delegation capabilities:
kvno -C -U <username> <SPN of AD delegation account>

If you receive a key version number from the kvno command then everything is working, and it proves that ASREQ and RSREQ work. So the issue is more than likely on the web server side and not the F5 - more on this in part 3.

irule Troubleshooting:
Hi guys, once I try to assign the following iRule to a VS I receive this error message:

01070151:3: Rule [/Common/rule_errorpage_test] error: Unable to find pool (sorry_pool) referenced at line 21: [pool sorry_pool]

when RULE_INIT {
    set static::sorry_pool "/Common/pool_wwbtest_sorry"
    set static::wartung_pool "/Common/pool_wwbtest_wartung"
}
when HTTP_REQUEST {
    log local0. "Incoming URI = [HTTP::uri]"
    if { [catch { [active_members [LB::server pool]] < 1}] } {
        if { [catch {active_members $static::wartung_pool}] > 0 } {
            pool wartung_pool
            set mypick [LB::select]
            log local0. "The LB choice is: $mypick"
            set newuri [string map {/wartung.html/ /} [HTTP::uri]]
            log local0. "The new URI is = $newuri"
            HTTP::uri $newuri
            eval "GET [HTTP::uri $newuri] HTTP/1.1\r\nHost: $mypick\r\n\r\n"
            log local0. "[HTTP::uri] [LB::server addr]"
        } elseif { [catch {active_members $static::sorry_pool}] > 0 } {
            pool sorry_pool
            set mypick [LB::select]
            log local0. "The LB choice is: $mypick"
            set newuri [string map {/sorry.html/ /} [HTTP::uri]]
            log local0. "The new URI is = $newuri"
            HTTP::uri $newuri
            eval "GET [HTTP::uri $newuri] HTTP/1.1\r\nHost: $mypick\r\n\r\n"
            log local0. "[HTTP::uri] [LB::server addr]"
        } else {
            HTTP::respond 200 content "Lieber Kunde,\r\ndie Seite befindet sich aktuell in der Wartung.\r\n\r\nDear Customer,\r\nthis site is down for maintenance. We are sorry for the inconvience.\r\n" noserver connection close
        }
    }
}

I am really puzzled and clueless as to what the problem might be. If you have any ideas to solve this I would be grateful 🙂

Best regards
David

Load Balancing to Only One Pool Member
I have an issue where all traffic in a pool is going to the fifth of 5 pool members. We are using cookie persistence and I know there can be issues with that. We are also using both an http and a OneConnect profile. From the sols and DC articles I've read, it seems like the problem will arise if you do NOT have a OneConnect profile assigned. I have tried pulling the http/oneconnect/tcp profiles on and off, and using default profiles. The only thing I haven't tried is falling back to source address persistence. That is a last resort I do not want to use: due to the application architecture, source address persistence will definitely result in uneven load. All the "custom" profiles are built from an F5 guide for the application (Epic HyperSpace Web; Link to Guide). That being said, this config seems pretty straightforward, but any help would be very much appreciated. We have captured traffic and seen the cookies present in the sessions. Below is the config.

ltm virtual /PARTITION/v_80 {
    destination /PARTITION/10.10.1.1:80
    ip-protocol tcp
    mask 255.255.255.255
    persist {
        /PARTITION/Custom-cookie {
            default yes
        }
    }
    pool /PARTITION/pool_80
    profiles {
        /PARTITION/Custom-OneConnect { }
        /PARTITION/Custom-http { }
        /PARTITION/Custom-lan-optimized { }
    }
    source 0.0.0.0/0
    source-address-translation {
        type automap
    }
    translate-address enabled
    translate-port enabled
}
ltm pool /PARTITION/Pool_80 {
    description "HTTP Pool"
    load-balancing-mode least-connections-member
    members {
        /PARTITION/001:80 {
            address 10.1.1.1
        }
        /PARTITION/002:80 {
            address 10.1.1.2
        }
        /PARTITION/003:80 {
            address 10.1.1.3
        }
        /PARTITION/004:80 {
            address 10.1.1.4
        }
        /PARTITION/005:80 {
            address 10.1.1.5
        }
    }
    monitor /Common/http_head_f5
    service-down-action reselect
}

The only change to the custom oneconnect profile is that the mask is 255.255.255.255.
The only change to the tcp-lan-optimized profile is that the idle timeout is set to 1200s.
The only change to the http profile is that 'Redirect Rewrite' is set to Matching.
Custom cookie persistence uses default settings.

8. SYN Cookie: Troubleshooting tcpdump
Introduction
In this troubleshooting article, I will explain what traffic you should expect to see in a tcpdump capture during an ongoing attack. This way you can compare it with your own environment and understand why you see what you see.

Use cases
In this section I will show you the most common SYN Cookie configurations and how they affect the traffic that reaches BIG-IP. For each use case I add a diagram explaining the role of each party, so you can use it as a reference when reviewing captures. Note that, as a rule of thumb, in tcpdump captures you will only see the traffic that reaches the red TMM arrow in the diagrams.

FastL4 + Hardware SYN Cookie
Fig15. FastL4 + HW SYN Cookie
When enabling SYN Cookie hardware offloading in FastL4 you must take into account that, regardless of whether the FastL4 profile is configured as EST or SYN for ‘PVA TCP Offload State’, the capture will be similar to the one I paste below. Also note that the colours in the captures match the diagram above, that is, blue (on the left) for client, red (next to blue and green) for BIG-IP and green (on the right) for server.
BLUE: 10:55:30.650614 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460,sackOK,TS val 257012221 ecr 0,nop,wscale 6], length 0 10:55:30.651084 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 2770916048, ack 1486117990, win 4380, options [mss 1460], length 0 10:55:30.651115 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 10:55:30.651271 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167 RED: 10:48:27.666478 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 2770916049, win 5840, length 0 in slot1/tmm3 lis= 10:48:27.666793 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 0:167, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 in slot1/tmm3 lis=/Common/fwd 10:48:27.667073 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd 10:48:27.667245 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 3169243726, ack 1486117990, win 64240, options [mss 1460], length 0 in slot1/tmm3 lis=/Common/fwd 10:48:27.667586 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 out slot1/tmm3 lis=/Common/fwd 10:48:27.667588 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 out slot1/tmm3 lis=/Common/fwd GREEN: 09:54:39.236239 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460], length 0 09:54:39.236264 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 3169243726, ack 1486117990, win 64240, options [mss 1460], length 0 09:54:39.236692 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 09:54:39.236707 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 09:54:39.236728 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [.], ack 168, win 
64073, length 0

There are three important things you can observe in the above captures:
- TCP options are lost in this example. This is because SYN Cookie is offloaded to hardware, and in this mode currently only the MSS value is transferred from client side to server side, with some limitations. Check the article dedicated to 'Hardware Vs Software' in this article series.
- In BIG-IP you cannot see the first two packets of the TCP 3WHS (SYN and SYN/ACK). This is because hardware is in charge of validating the SYN Cookie; since tcpdump captures traffic in TMM, you only see the final ACK packet.
- The SYN sequence number sent by the client to BIG-IP is the same as the one BIG-IP sends to the server.

FastL4 + Software SYN Cookie
Fig16. FastL4 + SW SYN Cookie
In this case, although you are using an accelerated profile (FastL4), software is handling the TCP 3WHS, which allows you to see all packets when you run a traffic capture. Blue is on the left, red in the middle, green on the right.

BLUE:
17:32:15.588706 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460,sackOK,TS val 262963455 ecr 0,nop,wscale 6], length 0
17:32:15.588973 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 2739000888, ack 2986408499, win 4380, options [mss 1460], length 0
17:32:15.588996 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0
17:32:15.589119 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167

RED:
17:25:11.566071 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460,sackOK,TS val 262963455 ecr 0,nop,wscale 6], length 0 in slot1/tmm3 lis=
17:25:11.566095 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 2739000888, ack 2986408499, win 4380, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.566321 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 in slot1/tmm3 lis=
17:25:11.566451 IP 10.10.10.10.34527 >
10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1 in slot1/tmm3 lis=/Common/fwd
17:25:11.566834 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.567058 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 4131626625, ack 2986408499, win 64240, options [mss 1460], length 0 in slot1/tmm3 lis=/Common/fwd
17:25:11.567344 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.567346 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1 out slot1/tmm3 lis=/Common/fwd

GREEN:
16:31:24.339490 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460], length 0
16:31:24.339551 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 4131626625, ack 2986408499, win 64240, options [mss 1460], length 0
16:31:24.339987 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0
16:31:24.340003 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1

As in the previous example, BIG-IP does not ACK the HTTP request (PSH) sent by the client because a FastL4 profile is being used, which means that BIG-IP acts as a transparent TCP proxy between server and client. In other words, once the TCP handshake has completed correctly, BIG-IP forwards the rest of the connection's traffic directly to the server.

Standard + Hardware SYN Cookie
Fig17. Standard + Hardware SYN Cookie
The main difference in this case is that for a Standard virtual server BIG-IP must act as a TCP proxy between client and server for the whole duration of the connection. This means that BIG-IP sends an ACK in response to the PSH packet sent by the client before this PSH is actually received by the server. This image has the same color scheme as the first image.
BLUE: 14:02:03.606740 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2828374030, win 5840, options [mss 1460,sackOK,TS val 259810460 ecr 0,nop,wscale 6], length 0 14:02:03.606861 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 2971392569, ack 2828374031, win 4380, options [mss 1460,sackOK,TS val 870079512 ecr 259810460], length 0 14:02:03.606885 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 0 14:02:03.607176 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 167 RED: 13:55:00.128183 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 2971392570, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 0 in slot1/tmm4 lis= 13:55:00.128272 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1607698719 ecr 0], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128496 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 0:167, ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 167: HTTP: GET / HTTP/1.1 in slot1/tmm4 lis=/Common/std 13:55:00.128501 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [.], ack 167, win 14767, options [nop,nop,TS val 870079512 ecr 259810460], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128559 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 1953609172, ack 2348737574, win 65160, options [mss 1460,sackOK,TS val 1991641654 ecr 1607698719,nop,wscale 7], length 0 in slot1/tmm4 lis=/Common/std 13:55:00.128564 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128567 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 167: HTTP: GET / HTTP/1.1 out 
slot1/tmm4 lis=/Common/std

GREEN:
13:01:12.268012 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1607698719 ecr 0], length 0
13:01:12.268103 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 1953609172, ack 2348737574, win 65160, options [mss 1460,sackOK,TS val 1991641654 ecr 1607698719,nop,wscale 7], length 0
13:01:12.268347 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 0
13:01:12.268355 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 167: HTTP: GET / HTTP/1.1

Note that the SYN sequence number sent by the client to BIG-IP is different from the one BIG-IP sends to the server (2828374030 vs 2348737573). This is because we are using a Standard virtual server, so there are two independent TCP connections, one between client and BIG-IP and the second between BIG-IP and server. Again, since Hardware SYN Cookie is used in this example, we cannot see the first two TCP handshake packets in the capture.

Standard + Software SYN Cookie
Fig18. Standard + Software SYN Cookie
Apparently there is no difference in the captured traffic of a client request whether SYN Cookie is activated or disabled in this case. The only difference is that BIG-IP creates a SYN Cookie challenge and adds it as the sequence number in the SYN/ACK, but this change is not clearly visible in the capture. We need to compare with logs and stats to confirm that SYN Cookie is in fact enabled; depending on the config, we can also check the TCP options passed to the server side. The color scheme here is the same as the second image.
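The sequence-number comparison made in the use cases above (same SYN sequence number on both sides for FastL4, different for Standard) can also be scripted when the captures are saved as tcpdump text output. The following is only a minimal sketch in Python; the function names are hypothetical and the sample lines are the client-side and server-side SYN packets copied from the captures in this article.

```python
import re

# Picks out the flags field and the first sequence number that follows it
# in a tcpdump text line such as "... Flags [S], seq 1486117989, win ...".
SYN_RE = re.compile(r"Flags \[(?P<flags>[^\]]+)\], seq (?P<seq>\d+)")


def initial_syn_seq(capture_lines):
    """Return the sequence number of the first pure SYN (Flags [S]), or None."""
    for line in capture_lines:
        match = SYN_RE.search(line)
        if match and match.group("flags") == "S":
            return int(match.group("seq"))
    return None


def same_syn_seq(client_side, server_side):
    """True when the client's SYN sequence number reaches the server unchanged."""
    client_seq = initial_syn_seq(client_side)
    server_seq = initial_syn_seq(server_side)
    return client_seq is not None and client_seq == server_seq


# FastL4 + Hardware SYN Cookie: BLUE (client) and GREEN (server) SYN packets.
FASTL4_CLIENT = [
    "10:55:30.650614 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], "
    "seq 1486117989, win 5840, options [mss 1460,sackOK,TS val 257012221 "
    "ecr 0,nop,wscale 6], length 0",
]
FASTL4_SERVER = [
    "09:54:39.236239 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], "
    "seq 1486117989, win 5840, options [mss 1460], length 0",
]

# Standard + Hardware SYN Cookie: BLUE (client) and GREEN (server) SYN packets.
STANDARD_CLIENT = [
    "14:02:03.606740 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], "
    "seq 2828374030, win 5840, options [mss 1460,sackOK,TS val 259810460 "
    "ecr 0,nop,wscale 6], length 0",
]
STANDARD_SERVER = [
    "13:01:12.268012 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], "
    "seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,"
    "TS val 1607698719 ecr 0], length 0",
]

if __name__ == "__main__":
    print("FastL4 keeps the SYN sequence number:",
          same_syn_seq(FASTL4_CLIENT, FASTL4_SERVER))
    print("Standard keeps the SYN sequence number:",
          same_syn_seq(STANDARD_CLIENT, STANDARD_SERVER))
```

The regular expression deliberately ignores SYN/ACK packets (Flags [S.]) and plain ACKs, so only the first SYN in each direction of interest is compared.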
BLUE: 14:32:58.966168 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 3344434108, win 5840, options [mss 1460,sackOK,TS val 260274299 ecr 0,nop,wscale 6], length 0 14:32:58.966358 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 2686010241, ack 3344434109, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 3273182234 ecr 260274299], length 0 14:32:58.966398 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 92, options [nop,nop,TS val 260274299 ecr 3273182234], length 0 14:32:58.966649 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 92, options [nop,nop,TS val 260274300 ecr 3273182234], length 167 RED: 14:25:55.405917 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 3344434108, win 5840, options [mss 1460,sackOK,TS val 260274299 ecr 0,nop,wscale 6], length 0 in slot1/tmm1 lis= 14:25:55.405931 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 2686010241, ack 3344434109, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 3273182234 ecr 260274299], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406048 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 92, options [nop,nop,TS val 260274299 ecr 3273182234], length 0 in slot1/tmm1 lis= 14:25:55.406355 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 92, options [nop,nop,TS val 260274300 ecr 3273182234], length 167: HTTP: GET / HTTP/1.1 in slot1/tmm1 lis=/Common/std 14:25:55.406359 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [.], ack 168, win 14767, options [nop,nop,TS val 3273182234 ecr 260274300], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406131 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 2235159701, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1609553998 ecr 0], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406524 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 3205875957, ack 2235159702, win 65160, options [mss 1460,sackOK,TS val 1993497007 ecr 1609553998,nop,wscale 
7], length 0 in slot1/tmm1 lis=/Common/std
14:25:55.406527 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 0 out slot1/tmm1 lis=/Common/std
14:25:55.406531 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 167: HTTP: GET / HTTP/1.1 out slot1/tmm1 lis=/Common/std

GREEN:
13:32:07.645733 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 2235159701, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1609553998 ecr 0], length 0
13:32:07.645820 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 3205875957, ack 2235159702, win 65160, options [mss 1460,sackOK,TS val 1993497007 ecr 1609553998,nop,wscale 7], length 0
13:32:07.646011 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 0
13:32:07.646022 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 167: HTTP: GET / HTTP/1.1

Note that, as in the previous example, the SYN sequence number sent by the client to BIG-IP is different from the one BIG-IP sends to the server (3344434108 vs 2235159701).

Conclusion
At this point only one part is left to complete the picture of how BIG-IP's SYN Cookie countermeasure works, so in the next article I finish this series with details about the logs to expect when SYN Cookie is working on our device.

Source address persistence troubleshooting
Hi, I am looking for a way to find out if source address persistence is working correctly. There are two VS (Standard TCP IP:any and Standard UDP IP:any, both with pools pointing to the same nodes) with this profile attached:

ltm persistence source-addr lamp_persist_match_vs {
    app-service none
    defaults-from source_addr
    description none
    hash-algorithm default
    map-proxies enabled
    map-proxy-address none
    map-proxy-class none
    mask none
    match-across-pools disabled
    match-across-services disabled
    match-across-virtuals enabled
    mirror disabled
    override-connection-limit disabled
    partition Common
    timeout 32400
}

match-across-virtuals is enabled, so after the first client connection to either VS, the following connections should go to the same node for at least 32400 s. I am looking for a way to find out if it is working like that. So I need to figure out whether a given client IP is directed ONLY to the same IP for at least the defined timeout. In other words, I need to catch the exception when a given client IP is rebalanced to another IP during the timeout period. Any ideas how to do that? I was thinking about using an iRule with iStats, but I am not so good with iStats stuff. My first approach was to create code like this:

when LB_SELECTED {
    set node_sel [LB::server addr]
    log local0. "Selected node is $node_sel"
    ISTATS::incr "ltm.virtual [virtual name] node $node_sel client.ip [IP::client_addr] c balanced" 1
    ISTATS::incr "ltm.virtual [virtual name] c [IP::client_addr]-${node_sel}" 1
    ISTATS::incr "ltm.virtual [virtual name] c count_it" 1
    log local0. "Currents counter is: [ISTATS::get "node $node_sel client.ip [IP::client_addr] c balanced"]"
}

I tried different syntax but I am not really sure if that's the way to go. The goal is to be able to collect all nodes a given client IP connected to during the timeout period. If there is no persistence issue, just one entry should be created listing the client IP and the selected node; if there is an issue, I expect that two entries will be created (the pool contains two pool members).
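One way to express the check described above, independent of how the data is collected, is to group selections by client IP and flag any client that lands on a different node while its persistence record should still be alive. The following is only a rough sketch in Python, not a tested BIG-IP solution: it assumes the (timestamp, client IP, node IP) tuples have already been extracted from logs by some other means, the function name is hypothetical, and it treats the 32400 s timeout as measured from the client's most recent connection, which is an assumption.

```python
def find_persistence_breaks(events, timeout=32400):
    """Return (client, old_node, new_node, timestamp) for every rebalance.

    events: iterable of (timestamp_seconds, client_ip, node_ip) tuples.
    A break is reported when a client is sent to a different node less than
    `timeout` seconds after its previous connection (assumed idle timeout).
    """
    last_seen = {}  # client_ip -> (timestamp, node_ip) of the latest selection
    breaks = []
    for timestamp, client, node in sorted(events):
        previous = last_seen.get(client)
        if previous is not None:
            previous_ts, previous_node = previous
            if node != previous_node and timestamp - previous_ts < timeout:
                breaks.append((client, previous_node, node, timestamp))
        last_seen[client] = (timestamp, node)
    return breaks


# Hypothetical sample data: one client is rebalanced inside the timeout,
# the other only changes node after its record would have expired.
SAMPLE_EVENTS = [
    (0, "192.0.2.10", "10.1.1.1"),
    (100, "192.0.2.10", "10.1.1.1"),
    (200, "192.0.2.10", "10.1.1.2"),
    (50, "192.0.2.20", "10.1.1.2"),
    (40000, "192.0.2.20", "10.1.1.1"),
]

if __name__ == "__main__":
    for client, old_node, new_node, when in find_persistence_breaks(SAMPLE_EVENTS):
        print(f"{client} moved from {old_node} to {new_node} at t={when}s")
```

With working persistence the returned list stays empty; each entry in it is one of the "second entries" the post expects to see when a client is rebalanced.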
Piotr

Troubleshooting high CPU utilisation on BIG-IP systems
Introduction
This is not really a step-by-step troubleshooting guide. What I'm sharing here is the result of reverse engineering the kind of knowledge that led me to succeed at troubleshooting CPU issues during the time I worked for the Engineering Services department at F5. Here's what I'll cover, in order, with a mix of what we should know and where to find the problem:

Know what HyperThreading (HT) is
Know how HT is used within F5
Find out if F5 box supports HyperThreading (HT)
Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption
    Confirm if the problem is TMM or another daemon
    Where to look further when TMM CPU is high
    What if it's a control plane daemon?
Learn how to interpret graphs
    High CPU in non-HT boxes
    High CPU in HT+ boxes
Use scripts when necessary to collect real time data

1. Know what HyperThreading (HT) is
A physical core, as the name implies, is a physical CPU core connected to the motherboard's socket.
A physical CPU core has several execution units (modules) capable of performing different tasks, e.g. one for basic integer maths, another for more advanced maths, others for loading and storing data from/to memory, etc.
HT uses 2 or more logical CPU cores to make use of execution units that are not being utilised by process A, so process B can use them if needed.
When 2 programs want to use the same part of the physical core, it's inevitable that one of them will have to wait.
The Operating System (OS) scheduler decides which process gets execution priority in this case.
This is when 2 (or more) actual physical cores would perform better, as this limitation is not present, i.e. 2 physical cores would be able to perform tasks concurrently using their own execution units.

2. Know how HT is used within F5
Before BIG-IP v11.5.0, on systems with HyperThreading (HT) Technology, we would have:
1 TMM per logical core
Each logical core processes both data plane (TMM) and control plane (Linux) tasks

v11.5.0+ (affects only processors with HT Technology):
Data plane (TMM) resides in even-numbered cores (0, 2, 4, etc)
Control plane (Linux) resides in odd-numbered cores (1, 3, 5, etc)
When TMM reaches 80% of actual CPU utilisation, odd-numbered cores limit control plane tasks so they can only use up to 20% of CPU capacity, allowing the remainder to be used by the overloaded forwarding plane (TMM).
The vCMP host must also be running v11.5.0 or newer in order for guests to use HTSplit technology.
We can disable it manually by issuing the following command:

3. Find out if your box supports HyperThreading (HT)
The hardware boxes listed with HT+ in K14358 all support HyperThreading technology. Here's how to check the number of cores in a given BIG-IP box (this is a VIPRION C2200 chassis with a 2250 blade installed):
The above box is able to run 2 threads per physical core (Thread(s) per core) with a total of 10 physical cores (Core(s) per socket) and a total of 20 (logical) cores (CPU(s)).
Here's the same output from a 3900 series box that does not support HT:
The above box is able to run 1 thread per physical core (Thread(s) per core) with a total of 4 physical cores (Core(s) per socket) and a total of 4 cores (CPU(s)).

4. Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption
4.1 Confirming if it's TMM or Linux
BIG-IP's forwarding plane is TMM. TMM is a daemon/process within Linux space. If tmm CPU usage is high, then we know high CPU utilisation is a forwarding plane issue. The other daemons are part of BIG-IP's control plane (e.g. bigd - monitoring daemon). In this example, both tmm (102.3%) and bigd (51.8%) are high:
If TMM CPU utilisation is high, we will need to troubleshoot CPU usage of internal TMM components.
For other daemons, there are different places to look. For example, for bigd (monitoring daemon), we need to check BIG-IP's monitors. AskF5 has a nice how-to guide here. Here's a list of BIG-IP daemons.

4.2 TMM CPU utilisation or forwarding plane CPU utilisation
Check tmsh show ltm virtual <virtual server name> to confirm if there is a particular virtual server eating up tmm CPU cycles:
Check iRules
Check tmsh show sys tmm-info to see the breakdown of TMM CPU utilisation per tmm:

4.3 Linux CPU utilisation or control plane CPU utilisation
For anything else apart from TMM, top output is your best friend for confirming which daemon is the culprit. tmsh show sys proc-info is another command we can use to gather process-specific CPU information. Here I'm checking bigd's monitoring daemon information:

5. Learn how to interpret graphs
5.1 High CPU in non-HT boxes
The below graph is just an example taken from a 3900 box that doesn't have HT split.
Because graphs are generated based on average CPU utilisation, we can assume that CPU utilisation is very high at times.
Because there is no HT split, the below CPU utilisation can be due either to TMM or to some other Linux daemon.
We can confirm using the top command.
In the below graph it was due to both tmm and bigd.
To confirm normal usage, we always try to match with other numbers in the graph (e.g. active connections, etc).

Note: this is a graph as seen in qkview (clicking on System > Support), which takes a snapshot of the system. It can then be uploaded to iHealth and is mostly used for sharing snapshots of BIG-IP systems with F5 support. However, the graph here is used for illustrative purposes, to understand CPU utilisation as seen in graphs.
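The top-based check from sections 4.1 and 4.3 can also be scripted when top is run in batch mode (top -b -n 1) and its output saved. The following is only a sketch in Python, assuming the standard procps column header that contains %CPU and COMMAND; the function name is hypothetical, and every value in the sample other than the tmm and bigd CPU percentages quoted above is made up for illustration.

```python
def busy_processes(top_output, threshold=50.0):
    """Return [(command, cpu_percent)] at or above threshold, busiest first."""
    lines = top_output.splitlines()
    # Locate the column header so the summary lines above it are skipped.
    header_index = next(
        (i for i, line in enumerate(lines) if "%CPU" in line.split()), None
    )
    if header_index is None:
        return []
    header = lines[header_index].split()
    cpu_col = header.index("%CPU")
    cmd_col = header.index("COMMAND")
    busy = []
    for line in lines[header_index + 1:]:
        fields = line.split()
        if len(fields) <= cmd_col:
            continue  # blank or truncated row
        cpu = float(fields[cpu_col])
        if cpu >= threshold:
            busy.append((fields[cmd_col], cpu))
    return sorted(busy, key=lambda item: item[1], reverse=True)


# Illustrative sample only: PIDs, memory and time values are invented.
SAMPLE_TOP = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2002 root      20   0  200000  50000  10000 S  51.8  1.0  10:00.00 bigd
 1001 root      RT   0 2500000 150000 100000 S 102.3 10.0 100:00.00 tmm
 3003 root      20   0  100000  20000   5000 S   0.3  0.1   0:10.00 mcpd
"""

if __name__ == "__main__":
    for command, cpu in busy_processes(SAMPLE_TOP):
        plane = "forwarding plane" if command.startswith("tmm") else "control plane"
        print(f"{command}: {cpu}% ({plane})")
```

If tmm heads the list, the next step is the forwarding plane checks in 4.2; any other daemon points to the control plane checks in 4.3.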
5.2 High CPU in HT+ boxes
This other graph was taken from a 4200 series box which has HT split enabled.
Notice that CPU cores 0, 2, 4 and 6 (tmm/data plane) show CPU at about 60%.
Cores 1, 3, 5 and 7 show very minimal CPU utilisation with some spikes.
Spikes can be due to the AVR/ASM daemons described in K16469 and K15606,
or because TMM has reached 80% of CPU utilisation and is now using the control plane's cores.
This is an example of mostly normal/regular CPU utilisation.
When HT is enabled and TMM cores use less than 80% of CPU, the control plane cores remain mostly 'quiet'.

6. Use scripts when necessary to collect real time data
Sometimes just looking at the graphs and commands is not enough to determine why CPU is high. Here's an example of a script that collects real-time TMM/Linux CPU stats on BIG-IP every 60 seconds and copies the output to /var/log/cpu-average.log. The top command output is also copied to /var/log/top-output.log:
Output should be similar to this:
The number after "Counter64" is the percentage value representing how busy each CPU core is. For example, TMM0.0 and TMM0.1 are both at 1% of capacity. We can add H to the top command (e.g. top -Hcbn 1) in the script above to show the individual threads of a process, including TMM threads. When opening a support case with F5, it may be useful to include the full tmctl table, as it contains roughly all the raw data about everything we can possibly find on a BIG-IP system. The below is an example of a script that collects all tmctl information every 5 seconds:
Apart from knowing where to look, understanding the CPU usage pattern of our own organisation's production traffic is really important. It enables us to compare, for example, the number of active connections with a spike in CPU in the graphs, to understand whether the spike is related to a sudden and sharp increase in traffic.
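The HT split reading rules from sections 2 and 5.2 (even-numbered cores carry TMM, odd-numbered cores carry the control plane, and the control plane is limited once TMM reaches 80%) can be sketched as a small helper. This is only an illustration in Python under stated assumptions: the per-core percentages are supplied by hand rather than read from a BIG-IP, the function name is hypothetical, and applying the 80% threshold to the busiest even core (rather than to an average) is my own simplification.

```python
def summarise_ht_split(core_usage, tmm_threshold=80.0):
    """Summarise per-core CPU percentages on an HT split system.

    core_usage: dict mapping core number -> CPU utilisation in percent.
    Even cores are treated as data plane (TMM), odd cores as control plane.
    """
    data_plane = [pct for core, pct in core_usage.items() if core % 2 == 0]
    control_plane = [pct for core, pct in core_usage.items() if core % 2 == 1]
    if not data_plane or not control_plane:
        raise ValueError("expected at least one even and one odd core")
    return {
        "data_plane_avg": sum(data_plane) / len(data_plane),
        "control_plane_avg": sum(control_plane) / len(control_plane),
        # Assumption: the limit applies as soon as any TMM core hits the threshold.
        "control_plane_limited": max(data_plane) >= tmm_threshold,
    }


# Values modelled on the 4200 series graph described in section 5.2.
SAMPLE_USAGE = {0: 60.0, 1: 5.0, 2: 60.0, 3: 5.0, 4: 60.0, 5: 5.0, 6: 60.0, 7: 5.0}

if __name__ == "__main__":
    print(summarise_ht_split(SAMPLE_USAGE))
```

A summary with a data plane average near 60% and an idle control plane matches the "mostly normal" pattern from section 5.2; once control_plane_limited turns true, spikes on the odd cores are more likely TMM borrowing capacity than a misbehaving daemon.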