troubleshooting
Troubleshooting high CPU utilisation on BIG-IP systems
Introduction

This is not really a step-by-step troubleshooting guide. What I'm sharing here is the result of reverse engineering the kind of knowledge that led me to succeed in troubleshooting CPU issues during the time I worked for the Engineering Services department at F5.

Here's what I'll cover sequentially, with a mix of what we should know and where to find the problem:

Know what HyperThreading (HT) is
Know how HT is used within F5
Find out if your F5 box supports HyperThreading (HT)
Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption
Confirm if the problem is TMM or another daemon
Where to look further when TMM CPU is high
What if it's a control plane daemon?
Learn how to interpret graphs
High CPU in non-HT boxes
High CPU in HT+ boxes
Use scripts when necessary to collect real time data

1. Know what HyperThreading (HT) is

A physical core, as the name implies, is a physical CPU core connected to the motherboard's socket.
A physical CPU core has several execution units (modules) capable of performing different tasks, e.g. one for basic integer maths, another for more advanced maths, another for loading and storing data from/to memory, etc.
HT exposes 2 or more logical CPU cores per physical core, so that execution units that are not being utilised by process A can be used by process B if needed.
When 2 programs want to use the same part of the physical core, it's inevitable that one of them will have to wait.
The Operating System (OS) scheduler decides which process gets execution priority in this case.
This is when 2 (or more) actual physical cores would perform better, as this limitation is not present, i.e. 2 physical cores would be able to concurrently perform tasks using their own execution units.

2. Know how HT is used within F5

Before BIG-IP v11.5.0, on systems with HyperThreading (HT) Technology, we would have:
1 TMM per logical core
Each logical core processes both data plane (TMM) and control plane (Linux) tasks

v11.5.0+ (affects only processors with HT Technology):
Data plane (TMM) resides in even-numbered cores (0, 2, 4, etc)
Control plane (Linux) resides in odd-numbered cores (1, 3, 5, etc)
When TMM reaches 80% of actual CPU utilisation, odd-numbered cores limit control plane tasks so they can only use up to 20% of CPU capacity, allowing the remainder to be used by the overloaded forwarding plane (TMM).
The vCMP host must also be running v11.5.0 or newer in order for guests to use HTSplit technology.
We can disable it manually by issuing the following command:

3. Find out if your box supports HyperThreading (HT)

The hardware boxes listed with HT+ in K14358 all support HyperThreading technology. Here's how to check the number of cores in a given BIG-IP box (this is a VIPRION C2200 chassis with a 2250 blade installed):

The above box is able to run 2 threads per physical core (Thread(s) per core) with a total of 10 physical cores (Core(s) per socket) and a total of 20 (logical) cores (CPU(s)). Here's the same output from a 3900 series box that does not support HT:

The above box is able to run 1 thread per physical core (Thread(s) per core) with a total of 4 physical cores (Core(s) per socket) and a total of 4 cores (CPU(s)).

4. Know the difference between Forwarding plane (TMM) vs Control plane (Linux) CPU consumption

4.1 Confirming if it's TMM or Linux

BIG-IP's forwarding plane is TMM. TMM is a daemon/process within Linux space. If tmm CPU usage is high, then we know high CPU utilisation is a forwarding plane issue. The other daemons are part of BIG-IP's control plane (e.g. bigd - monitoring daemon). In this example, both tmm (102.3%) and bigd (51.8%) are high:

If TMM CPU utilisation is high, we will need to troubleshoot CPU usage of internal TMM components.
For other daemons, there are different places to look. For example, for bigd (monitoring daemon), we need to check BIG-IP's monitors. AskF5 has a nice how-to guide here. Here's a list of BIG-IP daemons.

4.2 TMM CPU utilisation or forwarding plane CPU utilisation

Check tmsh show ltm virtual <virtual server name> to confirm if there is a particular virtual server eating up tmm CPU cycles:

Check iRules

Check tmsh show sys tmm-info to see the breakdown of TMM CPU utilisation per tmm:

4.3 Linux CPU utilisation or control plane CPU utilisation

For anything else apart from TMM, top output is your best friend for confirming which daemon is the culprit. tmsh show sys proc-info is another command we can use to gather process-specific CPU information. Here I'm checking the bigd monitoring daemon's information:

5. Learn how to interpret graphs

5.1 High CPU in non-HT boxes

The below graph is just an example taken from a 3900 box that doesn't have HT split
Because graphs are generated based on average CPU utilisation, we can assume that CPU utilisation is very high at times
Because there is no HT split, the below CPU utilisation can be either due to TMM or due to some other Linux daemon
We can confirm using the top command
In the below graph it was due to both tmm and bigd
To confirm normal usage we always try to match with other numbers in the graph (e.g. active connections, etc)

Note: this is a graph as seen in qkview (clicking on System > Support), which takes a snapshot of the system. It can then be uploaded to iHealth and is mostly used to share snapshots of BIG-IP systems with F5 support. However, the graph here is used for illustrative purposes to understand CPU utilisation as seen in graphs.
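As a rough illustration of the top check described above, the following Python sketch picks out the busy daemons from a saved batch-mode top output. The sample text is made up for illustration (only the 102.3% tmm and 51.8% bigd figures come from the example earlier in this article), and the 50% threshold and the name busy_processes are my own assumptions, not something BIG-IP provides.

```python
def busy_processes(top_output, threshold=50.0):
    """Return (command, cpu_percent) pairs at or above threshold, busiest first."""
    cpu_col = None
    cmd_col = None
    rows = []
    for line in top_output.splitlines():
        fields = line.split()
        # Locate the columns from the header row instead of hard-coding positions.
        if "%CPU" in fields and "COMMAND" in fields:
            cpu_col = fields.index("%CPU")
            cmd_col = fields.index("COMMAND")
            continue
        if cpu_col is None or len(fields) <= cmd_col:
            continue  # summary lines before the header, or short lines
        try:
            cpu = float(fields[cpu_col])
        except ValueError:
            continue
        if cpu >= threshold:
            # Only the first word of the command is kept (top -c may add arguments).
            rows.append((fields[cmd_col], cpu))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative sample: PIDs, memory and time values are invented.
SAMPLE = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12345 root      RT   0   2400m   1.1g   120m R 102.3 14.0 100:00.00 tmm
 6789 root      20   0    150m    40m    10m S  51.8  0.5  20:00.00 bigd
 4321 root      20   0     90m    20m     8m S   1.0  0.2   1:00.00 mcpd
"""

for command, cpu in busy_processes(SAMPLE):
    print(command, cpu)
```

If tmm shows up in the result, the next place to look is the forwarding plane (section 4.2); any other daemon points to the control plane (section 4.3).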
5.2 High CPU in HT+ boxes

This other graph here was taken from a 4200 series box which has HT split enabled
Notice that CPU cores 0, 2, 4 and 6 (tmm/data plane) show CPU at about 60%
Cores 1, 3, 5 and 7 show very minimal CPU utilisation with some spikes
Spikes can be due to AVR/ASM daemons described in K16469 and K15606
Or because TMM has reached 80% of CPU utilisation and is now using control plane's cores
This is an example of mostly normal/regular CPU utilisation
When HT is enabled and TMM cores use less than 80% of CPU, then control-plane cores remain mostly 'quiet'.

6. Use scripts when necessary to collect real time data

Sometimes just looking at the graphs and command output is not enough to determine why CPU is high. Here's an example of a script to collect real-time TMM/Linux CPU stats on BIG-IP every 60 seconds and copy the output to /var/log/cpu-average.log. The top command output is also copied to /var/log/top-output.log:

Output should be similar to this:

The number after "Counter64" is the percentage value representing how busy each CPU core is. For example, TMM0.0 and TMM0.1 are both at 1% of capacity. We can add H to the top command (e.g. top -Hcbn 1) in the script above to show the individual threads of a process, including TMM threads.

When opening a support case with F5, it may be useful to include the full tmctl table as it contains roughly all raw data about everything we can possibly find on a BIG-IP system. The below is an example of a script that collects all tmctl information every 5 seconds:

Apart from knowing where to look, understanding the CPU usage pattern of our own organisation's production traffic is really important. It enables us to compare, for example, the number of active connections with a spike in CPU in the graphs to understand if the spike is related to a sudden and sharp increase in traffic.

8. SYN Cookie: Troubleshooting tcpdump
Introduction

In this troubleshooting article, I will explain what traffic you should expect to collect in a tcpdump capture during an ongoing attack. This way you can compare it with your own environment and understand why you see what you see.

Use cases

In this section I will show you the most common SYN Cookie configurations and how they affect the traffic that reaches BIG-IP. For each use case I add a diagram explaining the role of each party, so you can use it as a reference when reviewing captures. Note that, as a rule of thumb, in tcpdump captures you will see only the traffic reaching the red TMM arrow in the diagrams.

FastL4 + Hardware SYN Cookie

Fig15. FastL4 + HW SYN Cookie

When enabling SYN Cookie hardware offloading in FastL4 you must take into account that, regardless of whether the FastL4 profile is configured as EST or SYN for ‘PVA TCP Offload State’, the capture will be similar to the one I paste below. Also note that colours in captures match the diagram above, that is, blue (on the left) for client, red (next to blue and green) for BIG-IP and green (on the right) for server.
BLUE: 10:55:30.650614 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460,sackOK,TS val 257012221 ecr 0,nop,wscale 6], length 0 10:55:30.651084 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 2770916048, ack 1486117990, win 4380, options [mss 1460], length 0 10:55:30.651115 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 10:55:30.651271 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167 RED: 10:48:27.666478 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 2770916049, win 5840, length 0 in slot1/tmm3 lis= 10:48:27.666793 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 0:167, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 in slot1/tmm3 lis=/Common/fwd 10:48:27.667073 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd 10:48:27.667245 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 3169243726, ack 1486117990, win 64240, options [mss 1460], length 0 in slot1/tmm3 lis=/Common/fwd 10:48:27.667586 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 out slot1/tmm3 lis=/Common/fwd 10:48:27.667588 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 out slot1/tmm3 lis=/Common/fwd GREEN: 09:54:39.236239 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460], length 0 09:54:39.236264 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [S.], seq 3169243726, ack 1486117990, win 64240, options [mss 1460], length 0 09:54:39.236692 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 09:54:39.236707 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET /zarapito HTTP/1.1 09:54:39.236728 IP 10.10.20.212.80 > 10.10.10.10.33669: Flags [.], ack 168, win 
64073, length 0

There are three important things you can observe in the above captures:

TCP options are lost in this example. This is because SYN Cookie is offloaded in hardware and in this mode currently only the MSS value is transferred from client side to server side, with some limitations. Check the article dedicated to 'Hardware Vs Software' in this article series.
In BIG-IP you cannot see the first two packets of the TCP 3WHS (SYN and SYN/ACK). This is because hardware is in charge of validating SYN Cookie; since tcpdump is capturing traffic in TMM, you only see the last ACK packet.
Also note that the SYN sequence number sent by the client to BIG-IP is the same as the one BIG-IP sends to the server.

FastL4 + Software SYN Cookie

Fig16. FastL4 + SW SYN Cookie

In this case, although you are using an accelerated profile (FastL4), software is handling the TCP 3WHS, which allows you to see all packets when you run a traffic capture. Blue is on the left, red in the middle, green on the right.

BLUE:
17:32:15.588706 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460,sackOK,TS val 262963455 ecr 0,nop,wscale 6], length 0
17:32:15.588973 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 2739000888, ack 2986408499, win 4380, options [mss 1460], length 0
17:32:15.588996 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0
17:32:15.589119 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167
RED:
17:25:11.566071 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460,sackOK,TS val 262963455 ecr 0,nop,wscale 6], length 0 in slot1/tmm3 lis=
17:25:11.566095 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 2739000888, ack 2986408499, win 4380, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.566321 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 in slot1/tmm3 lis=
17:25:11.566451 IP 10.10.10.10.34527 >
10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1 in slot1/tmm3 lis=/Common/fwd
17:25:11.566834 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460], length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.567058 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 4131626625, ack 2986408499, win 64240, options [mss 1460], length 0 in slot1/tmm3 lis=/Common/fwd
17:25:11.567344 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0 out slot1/tmm3 lis=/Common/fwd
17:25:11.567346 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1 out slot1/tmm3 lis=/Common/fwd
GREEN:
16:31:24.339490 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [S], seq 2986408498, win 5840, options [mss 1460], length 0
16:31:24.339551 IP 10.10.20.212.80 > 10.10.10.10.34527: Flags [S.], seq 4131626625, ack 2986408499, win 64240, options [mss 1460], length 0
16:31:24.339987 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [.], ack 1, win 5840, length 0
16:31:24.340003 IP 10.10.10.10.34527 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, length 167: HTTP: GET / HTTP/1.1

As in the previous example, BIG-IP does not ACK the HTTP request (PSH) sent by the client because a FastL4 profile is being used, which means that BIG-IP acts as a transparent TCP proxy between server and client. In other words, once a correct TCP handshake has completed, BIG-IP forwards the rest of this connection's traffic directly to the server.

Standard + Hardware SYN Cookie

Fig17. Standard + Hardware SYN Cookie

The main difference in this case is that for a Standard virtual server BIG-IP must act as a TCP proxy between client and server for the whole duration of the connection. This means that BIG-IP sends an ACK in response to the PSH packet sent by the client before this PSH is in fact received by the server. This image has the same color scheme as the first image.
BLUE: 14:02:03.606740 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2828374030, win 5840, options [mss 1460,sackOK,TS val 259810460 ecr 0,nop,wscale 6], length 0 14:02:03.606861 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 2971392569, ack 2828374031, win 4380, options [mss 1460,sackOK,TS val 870079512 ecr 259810460], length 0 14:02:03.606885 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 0 14:02:03.607176 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 167 RED: 13:55:00.128183 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 2971392570, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 0 in slot1/tmm4 lis= 13:55:00.128272 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1607698719 ecr 0], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128496 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 0:167, ack 1, win 5840, options [nop,nop,TS val 259810460 ecr 870079512], length 167: HTTP: GET / HTTP/1.1 in slot1/tmm4 lis=/Common/std 13:55:00.128501 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [.], ack 167, win 14767, options [nop,nop,TS val 870079512 ecr 259810460], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128559 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 1953609172, ack 2348737574, win 65160, options [mss 1460,sackOK,TS val 1991641654 ecr 1607698719,nop,wscale 7], length 0 in slot1/tmm4 lis=/Common/std 13:55:00.128564 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 0 out slot1/tmm4 lis=/Common/std 13:55:00.128567 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 167: HTTP: GET / HTTP/1.1 out 
slot1/tmm4 lis=/Common/std
GREEN:
13:01:12.268012 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1607698719 ecr 0], length 0
13:01:12.268103 IP 10.10.20.212.80 > 10.10.10.10.34522: Flags [S.], seq 1953609172, ack 2348737574, win 65160, options [mss 1460,sackOK,TS val 1991641654 ecr 1607698719,nop,wscale 7], length 0
13:01:12.268347 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 0
13:01:12.268355 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1607698719 ecr 1991641654], length 167: HTTP: GET / HTTP/1.1

Note that the SYN sequence number sent by the client to BIG-IP is different from the one BIG-IP sends to the server (2828374030 vs 2348737573). This is because we are using a Standard virtual server, so there are two independent TCP connections, one between client and BIG-IP and the second between BIG-IP and server. Again, since Hardware SYN Cookie is used in this example, we cannot see the first two TCP handshake packets in the capture.

Standard + Software SYN Cookie

Fig18. Standard + Software SYN Cookie

Apparently there is no difference in the capture of a client request whether SYN Cookie is activated or disabled in this case. The only difference is that BIG-IP creates a SYN Cookie challenge and adds it as the sequence number in the SYN/ACK, but this change is not clearly visible when looking at the capture. We need to compare with logs and stats to confirm that SYN Cookie is in fact enabled; depending on the configuration we can also check the TCP options passed to the server side. Color scheme here is the same as the second image.
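The sequence-number observations made for the FastL4 and Standard + Hardware captures above can also be checked mechanically. The following is only a minimal Python sketch: the regular expression and the function names are my own assumptions, and the sample lines are the client-side (blue) and server-side (green) SYN packets copied from those captures.

```python
import re

# Matches only a pure SYN ("Flags [S]"), never a SYN/ACK ("Flags [S.]").
SYN_RE = re.compile(r"Flags \[S\], seq (\d+)")


def first_syn_seq(capture_text):
    """Return the sequence number of the first SYN in the text, or None."""
    match = SYN_RE.search(capture_text)
    return int(match.group(1)) if match else None


def is_same_isn(client_side, server_side):
    """True if both sides carry the same SYN sequence number, None if a SYN is missing."""
    client_seq = first_syn_seq(client_side)
    server_seq = first_syn_seq(server_side)
    if client_seq is None or server_seq is None:
        return None
    return client_seq == server_seq


FASTL4_CLIENT = "10:55:30.650614 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460,sackOK,TS val 257012221 ecr 0,nop,wscale 6], length 0"
FASTL4_SERVER = "09:54:39.236239 IP 10.10.10.10.33669 > 10.10.20.212.80: Flags [S], seq 1486117989, win 5840, options [mss 1460], length 0"
STD_CLIENT = "14:02:03.606740 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2828374030, win 5840, options [mss 1460,sackOK,TS val 259810460 ecr 0,nop,wscale 6], length 0"
STD_SERVER = "13:01:12.268012 IP 10.10.10.10.34522 > 10.10.20.212.80: Flags [S], seq 2348737573, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1607698719 ecr 0], length 0"

print("FastL4 same sequence number:", is_same_isn(FASTL4_CLIENT, FASTL4_SERVER))
print("Standard same sequence number:", is_same_isn(STD_CLIENT, STD_SERVER))
```

A matching sequence number on both sides is what FastL4 produces in these examples, while a Standard virtual server opens its own server-side connection with a new one.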
BLUE: 14:32:58.966168 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 3344434108, win 5840, options [mss 1460,sackOK,TS val 260274299 ecr 0,nop,wscale 6], length 0 14:32:58.966358 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 2686010241, ack 3344434109, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 3273182234 ecr 260274299], length 0 14:32:58.966398 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 92, options [nop,nop,TS val 260274299 ecr 3273182234], length 0 14:32:58.966649 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 92, options [nop,nop,TS val 260274300 ecr 3273182234], length 167 RED: 14:25:55.405917 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 3344434108, win 5840, options [mss 1460,sackOK,TS val 260274299 ecr 0,nop,wscale 6], length 0 in slot1/tmm1 lis= 14:25:55.405931 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 2686010241, ack 3344434109, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 3273182234 ecr 260274299], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406048 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 92, options [nop,nop,TS val 260274299 ecr 3273182234], length 0 in slot1/tmm1 lis= 14:25:55.406355 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 92, options [nop,nop,TS val 260274300 ecr 3273182234], length 167: HTTP: GET / HTTP/1.1 in slot1/tmm1 lis=/Common/std 14:25:55.406359 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [.], ack 168, win 14767, options [nop,nop,TS val 3273182234 ecr 260274300], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406131 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 2235159701, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1609553998 ecr 0], length 0 out slot1/tmm1 lis=/Common/std 14:25:55.406524 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 3205875957, ack 2235159702, win 65160, options [mss 1460,sackOK,TS val 1993497007 ecr 1609553998,nop,wscale 
7], length 0 in slot1/tmm1 lis=/Common/std
14:25:55.406527 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 0 out slot1/tmm1 lis=/Common/std
14:25:55.406531 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 167: HTTP: GET / HTTP/1.1 out slot1/tmm1 lis=/Common/std
GREEN:
13:32:07.645733 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [S], seq 2235159701, win 14600, options [mss 1460,nop,wscale 0,sackOK,TS val 1609553998 ecr 0], length 0
13:32:07.645820 IP 10.10.20.212.80 > 10.10.10.10.34525: Flags [S.], seq 3205875957, ack 2235159702, win 65160, options [mss 1460,sackOK,TS val 1993497007 ecr 1609553998,nop,wscale 7], length 0
13:32:07.646011 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [.], ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 0
13:32:07.646022 IP 10.10.10.10.34525 > 10.10.20.212.80: Flags [P.], seq 1:168, ack 1, win 14600, options [nop,nop,TS val 1609553998 ecr 1993497007], length 167: HTTP: GET / HTTP/1.1

Note that, as in the previous example, the SYN sequence number sent by the client to BIG-IP is different from the one BIG-IP sends to the server (3344434108 vs 2235159701).

Conclusion

At this point only one part is left to have a complete picture of how BIG-IP's SYN Cookie countermeasure works, so in the next article I finish this series by giving details about the logs to expect when SYN Cookie is working on our device.

9. SYN Cookie Troubleshooting: Logs
Introduction

In this last article I will add the last piece of information you can check when troubleshooting TCP SYN Cookie attacks: logs. With this information, together with all that you have learned so far, you should be able to understand how SYN Cookie is behaving and decide if there is any change you should make in your configuration to improve it.

Use cases

LTM SYN Cookie at Global context

Logs when Global SYN Check Threshold or Default Per Virtual Server SYN Check Threshold has been exceeded are similar, so in order to know in which context SYN Cookie was activated you need to focus on specific text in the logs. For example, with the config below:

turboflex profile feature => adc
tmsh list sys db pvasyncookies.enabled => true
tmsh list ltm global-settings connection default-vs-syn-challenge-threshold => 1500 <=
tmsh list ltm global-settings connection global-syn-challenge-threshold => 2050 <=
tmsh list ltm profile fastl4 syn-cookie-enable => enabled

You will get logs similar to the ones below if Global SYN cache has been reached:

Dec 7 03:03:02 B12050-R67-S8 warning tmm9[5507]: 01010055:4: Syncookie embryonic connection counter 2051 exceeded sys threshold 2050
Dec 7 03:03:02 B12050-R67-S8 warning tmm5[5507]: 01010055:4: Syncookie embryonic connection counter 2051 exceeded sys threshold 2050
Dec 7 03:03:02 B12050-R67-S8 notice tmm5[5507]: 01010240:5: Syncookie HW mode activated, server name = /Common/syncookie_test server IP = 10.10.20.212:80, HSB modId = 1
Dec 7 03:03:02 B12050-R67-S8 notice tmm9[5507]: 01010240:5: Syncookie HW mode activated, server name = /Common/syncookie_test server IP = 10.10.20.212:80, HSB modId = 2

As you can notice there are two different messages: the first one informs about Software SYN Cookie being activated at Global context, and the second one tells us that Hardware is offloading SYN Cookie from Software.
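When /var/log/ltm is busy, it can help to pull these two message types out programmatically. The following Python sketch is only an illustration under my own assumptions: the regular expression, the MEANING table and the function name classify are not part of BIG-IP, and the descriptions simply restate what the two messages above mean.

```python
import re

# tmm name, 8-digit message ID, then the free-text part of the log entry.
LOG_RE = re.compile(r"(?P<tmm>tmm\d*)\[\d+\]: (?P<msgid>\d{8}):\d: (?P<text>.*)")

# Only the two message IDs shown in the logs above.
MEANING = {
    "01010055": "software SYN Cookie activated at Global context",
    "01010240": "hardware is offloading SYN Cookie from software",
}


def classify(line):
    """Return (tmm, message_id, meaning) for a log line, or None if it does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    msgid = match.group("msgid")
    return match.group("tmm"), msgid, MEANING.get(msgid, "unknown")


LINES = [
    "Dec 7 03:03:02 B12050-R67-S8 warning tmm9[5507]: 01010055:4: Syncookie embryonic connection counter 2051 exceeded sys threshold 2050",
    "Dec 7 03:03:02 B12050-R67-S8 notice tmm5[5507]: 01010240:5: Syncookie HW mode activated, server name = /Common/syncookie_test server IP = 10.10.20.212:80, HSB modId = 1",
]

for line in LINES:
    print(classify(line))
```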
Since there is a minimum delay before Hardware starts to offload SYN Cookie, it is expected to see a non-zero value for the counter Current SYN Cache stats. See the article in this SYN Cookie series for more information about stats. The Global SYN cache value is configured per TMM, so you see in the log that the 2050 threshold has been exceeded in the TMM, and therefore SYN Cookie is activated globally on the device. In this specific example the device has two HSBs and BIG-IP decided that tmm9 and tmm5 would each activate one of them; this is why we see the logs repeated.

LTM SYN Cookie at Virtual context

For the same configuration example I showed above you will see logs similar to the ones below if Virtual SYN cache has been reached:

Oct 18 02:26:32 I7800-R68-S7 warning tmm[15666]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 18 02:26:32 I7800-R68-S7 notice tmm[15666]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1
Oct 18 02:26:32 I7800-R68-S7 notice tmm[15666]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 2

The Virtual SYN cache value is configured globally, meaning that the configured value must be divided among TMMs to know when SYN Cookie will be enabled on each TMM. Run the command below to see the physical number of cores:

tmsh sho sys hard | grep core

In this example the device has 6 TMMs, so 1500/6 is 250. Note that you will see a warning message entry per TMM (I removed 3 log entries in the above example in order to summarize) and per HSB ID. The log does not always show the VIP’s IP; it depends on the type of VIP.
For example in the case below:

Oct 17 04:04:54 I7800-R68-S7 warning tmm2[22805]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 17 04:04:54 I7800-R68-S7 warning tmm3[22805]: 01010038:4: Syncookie counter 251 exceeded vip threshold 250 for virtual = 10.10.20.212:80
Oct 17 04:04:55 I7800-R68-S7 notice tmm2[22805]: 01010240:5: Syncookie HW mode activated, server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1
Oct 17 04:05:51 I7800-R68-S7 notice tmm2[22805]: 01010241:5: Syncookie HW mode exited,server name = /Common/wildcardCookie server IP = 10.10.20.212:80, HSB modId = 1 from HSB

There is no virtual server configured with destination IP 10.10.20.212. In fact the traffic is handled by a wildcard VIP listening on 0.0.0.0/0; the logged IP is the destination IP:Port in the request that triggered SYN Cookie. You can consider this IP the most probable attacked IP, since it was the one that exceeded the threshold, so you can assume there are more attacks to this IP. However, the attack could be targeting random destination IPs.

Important: The Per-Virtual SYN Cookie threshold MUST be lower than the Global threshold. If you configure the Virtual Server threshold higher than Global, or 0, then internally BIG-IP will set the SYN Cookie Global threshold equal to the Per-Virtual SYN Cookie threshold.
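The per-TMM arithmetic used in the Virtual context (1500/6 = 250) can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, and the use of integer division for values that do not divide evenly is my assumption, since the article only shows an exact division.

```python
def per_tmm_vs_threshold(configured_threshold, tmm_count):
    """Share of the per-virtual-server SYN threshold that applies to each TMM."""
    if tmm_count <= 0:
        raise ValueError("tmm_count must be positive")
    # Assumption: the configured value is split evenly, rounding down.
    return configured_threshold // tmm_count


def exceeds_vs_threshold(counter, configured_threshold, tmm_count):
    """True when a TMM's counter is above its share of the threshold."""
    return counter > per_tmm_vs_threshold(configured_threshold, tmm_count)


# Values from the example: default-vs-syn-challenge-threshold 1500 on 6 TMMs,
# and the logged counter 251 against the vip threshold 250.
print(per_tmm_vs_threshold(1500, 6))
print(exceeds_vs_threshold(251, 1500, 6))
```

This matches the "counter 251 exceeded vip threshold 250" text in the logs above.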
LTM SYN Cookie at VLAN context

Configuration example for triggering LTM SYN Cookie at VLAN context:

turboflex profile feature => adc
tmsh list sys db pvasyncookies.enabled => true
tmsh list ltm global-settings connection vlan-syn-cookie => enabled
tmsh list net vlan hardware-syncookie => [vlan external: 2888]
tmsh list ltm global-settings connection default-vs-syn-challenge-threshold => 0
tmsh list ltm global-settings connection global-syn-challenge-threshold => 2500

When SYN Cookie is triggered you get this log:

Oct 17 10:27:23 I7800-R68-S7 notice tmm[15666]: 01010292:5: Hardware syncookie protection activated on VLAN 1160 (syncache:2916 syn flood pkt rate:0)

In this case you will see that the information related to virtual servers on this VLAN shows SYN Cookie as ‘not activated’, because protection is at VLAN context:

#tmsh show ltm virtual | grep ' status ' -i
Status not-activated
Status not-activated

If you configure SYN Cookie per VLAN but Turboflex adc/security is not provisioned then you will get:

Oct 17 04:39:52 I7800-R68-S7.sin.pslab.local warning mcpd[7643]: 01071859:4: Warning generated : This platform supports Neuron-based Syncookie protection on per VS basis (including wildcard virtual). Please use that feature instead

AFM SYN Cookie at Global context

The main difference in the AFM default log is that you will not get a message telling you that the threshold has been exceeded; instead the log will inform you directly about the context that detected the attack.
Configuration example for triggering AFM SYN Cookie at global context:

turboflex profile feature => security
tmsh list ltm global-settings connection vlan-syn-cookie => enabled
tmsh list net vlan hardware-syncookie [not compatible with DoS device]
tmsh list sys db pvasyncookies.enabled => true
tmsh list ltm global-settings connection default-vs-syn-challenge-threshold => 0
tmsh list ltm global-settings connection global-syn-challenge-threshold => 2500
tmsh list security dos device-config default-internal-rate-limit (tcp-half-open) => >2500
tmsh list security dos device-config detection-threshold-pps (tcp-half-open) => 2500
tmsh list ltm profile fastl4 syn-cookie-enable => enabled

AFM Device DoS takes precedence over LTM Global SYN Cookie, so with the above configuration the AFM TCP half open vector will be triggered:

Oct 19 02:23:41 I7800-R68-S7 err tmm[23288]: 01010252:3: A Enforced Device DOS attack start was detected for vector TCP half open, Attack ID 1213152658.
Oct 19 02:29:23 I7800-R68-S7 notice tmm[23288]: 01010253:5: A Enforced Device DOS attack has stopped for vector TCP half open, Attack ID 1213152658.

In the example above you can see that there are logs warning you about an attack that started and stopped, but there is no log showing whether the attack is mitigated. This is because you have not configured AFM to log to local-syslog (/var/log/ltm). In this situation DoS logs are basic. If you want to see packets dropped or allowed you need to configure a specific security log profile.

Be aware that when SYN Cookie is active because the Device TCP half open DoS vector’s threshold has been reached, you will not see any Virtual Server showing that SYN Cookie has been activated, as also happens when VLAN SYN Cookie is activated:

SYN Cookies
Status not-activated

This is slightly different from LTM Global SYN Cookie: when LTM Global SYN Cookie is enabled, BIG-IP will show which specific VIP has SYN Cookie activated (Status Full Hardware/Software).
In case you have configured logging for DoS, you would get logs like these:

Oct 23 03:56:15 I7800-R68-S7 err tmm[21638]: 01010252:3: A Enforced Device DOS attack start was detected for vector TCP half open, Attack ID 69679369.
Oct 23 03:56:15 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:15","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Started","None","0","0","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS"
Oct 23 03:56:16 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:16","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Sampled","Drop","3023","43331","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS"
Oct 23 03:56:16 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:16","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Sampled","Drop","3017","69710","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS"

The fields below, shown in the logs above:

"Drop","3023","43331","0000000000000000"
"Drop","3017","69710","0000000000000000"

have the following meaning:

{action} {dos_packets_received} {dos_packets_dropped} {flow_id}

Where:

{dos_packets_received} - It counts the number of TCP SYNs received for which you have not received the ACK. Also called embryonic SYNs.
{dos_packets_dropped} - It counts the number of TCP syncookies that you have sent for which you have not received a valid ACK.

If you see that {dos_packets_received} is high but {dos_packets_dropped} is 0 or low, it just means that AFM is responding to SYN packets with SYN cookies and is receiving correct ACKs from the client. Therefore, AFM is not dropping anything, so this could be a traffic peak rather than an attack.
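To read these four fields without counting commas by hand, a short Python sketch like the one below can help. It is only an illustration: the function names and the max_dropped cut-off are my own assumptions (the article says "0 or low" without giving a number), and the sample line is the first "Attack Sampled" entry from the logs above.

```python
import re


def parse_attack_sample(line):
    """Extract action, received, dropped and flow_id from an 'Attack Sampled' DoS log line."""
    fields = re.findall(r'"([^"]*)"', line)  # every double-quoted field, in order
    if "Attack Sampled" not in fields:
        return None
    start = fields.index("Attack Sampled") + 1
    action, received, dropped, flow_id = fields[start:start + 4]
    return {
        "action": action,
        "received": int(received),
        "dropped": int(dropped),
        "flow_id": flow_id,
    }


def mostly_valid_acks(sample, max_dropped=0):
    """True when dropped is at or below max_dropped (hypothetical cut-off for 'low')."""
    return sample["dropped"] <= max_dropped


LINE = '''Oct 23 03:56:16 I7800-R68-S7 info tmm[21638]: 23003138 "Oct 23 2020 03:56:16","10.200.68.7","I7800-R68-S7.sin.pslab.local","Device","","","","","","","TCP half open","69679369","Attack Sampled","Drop","3023","43331","0000000000000000", "Enforced", "Volumetric, Aggregated across all SrcIP's, Device-Wide attack, metric:PPS"'''

sample = parse_attack_sample(LINE)
print(sample)
print("Looks like a traffic peak:", mostly_valid_acks(sample))
```

For this sample the dropped counter is well above zero, so it does not fit the traffic-peak pattern described above.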
You might configure a mitigation threshold lower than the detection threshold. You will get a warning message when you do, but it is easy to miss. In that case you will not see any log telling you that there is an attack. This will happen, for example, with the configuration below:

tmsh list ltm global-settings connection global-syn-challenge-threshold=> 3400
tmsh list security dos device-config default-internal-rate-limit (tcp-half-open)=> 3000
tmsh list security dos device-config detection-threshold-pps (tcp-half-open)=> 3900
tmsh list ltm profile fastl4 syn-cookie-enable=> disabled

Because of this you will see something like the following in /var/log/ltm:

Oct 23 03:38:12 I7800-R68-S7.sin.pslab.local warning mcpd[10516]: 01071859:4: Warning generated : DOS attack data (tcp-half-open): Since drop limit is less than detection limit, packets dropped below the detection limit rate will not be logged.

AFM SYN Cookie at Virtual context

All the information provided in the previous use case applies here too, so for the configuration example below:

tmsh list ltm global-settings connection global-syn-challenge-threshold=> 3400
tmsh list security dos device-config default-internal-rate-limit (tcp-half-open)=> 3000
tmsh list security dos device-config detection (tcp-half-open)=> 3900
list security dos profile SYNCookie dos-network default-internal-rate-limit (tcp-half-open)=> 2000
list security dos profile SYNCookie dos-network detection-threshold-pps (tcp-half-open)=> 1900
tmsh list ltm profile fastl4 <name>=> enabled

AFM device SYN Cookie is activated for the specific virtual server that has the security profile applied:

Oct 23 04:10:26 I7800-R68-S7 notice tmm[21638]: 01010240:5: Syncookie HW mode activated, server = 0.0.0.0:0, HSB modId = 1
Oct 23 04:10:26 I7800-R68-S7 notice tmm5[21638]: 01010240:5: Syncookie HW mode activated, server = 0.0.0.0:0, HSB modId = 2
Oct 23 04:10:26 I7800-R68-S7 err tmm3[21638]: 01010252:3: A NETWORK /Common/SYNCookie_Test DOS attack start was detected for
vector TCP half open, Attack ID 2147786126. Oct 23 04:10:28 I7800-R68-S7 info tmm[16005]: 23003156 "10.200.68.7","I7800-R68-S7.sin.pslab.local","Virtual Server","/Common/SYNCookie_Test","Cryptographic SYN Cookie","16973","0","0","0", Oct 23 04:10:57 I7800-R68-S7 notice tmm5[21638]: 01010253:5: A NETWORK /Common/SYNCookie_Test DOS attack has stopped for vector TCP half open, Attack ID 2147786126. Oct 23 04:12:46 I7800-R68-S7 notice tmm[21638]: 01010241:5: Syncookie HW mode exited,server = 0.0.0.0:0, HSB modId = 1 from HSB Oct 23 04:12:47 I7800-R68-S7 notice tmm5[21638]: 01010241:5: Syncookie HW mode exited,server = 0.0.0.0:0, HSB modId = 2 from HSB Troubleshooting steps When you need to troubleshoot how device is working against a SYN flood attack there are some steps you can follow. Check configuration to make a global idea of what should happen when SYN flood occurs: tmsh show sys turboflex profile feature tmsh list ltm global-settings connection vlan-syn-cookie tmsh list net vlan hardware-syncookie tmsh list sys db pvasyncookies.enabled tmsh list ltm global-settings connection default-vs-syn-challenge-threshold tmsh list ltm global-settings connection global-syn-challenge-threshold tmsh list ltm profile fastl4 syn-cookie-enable tmsh list ltm profile tcp all-properties | grep -E 'profile|syn-cookie' tmsh list ltm profile fastl4 all-properties| grep -E 'profile|syn-cookie' list security dos device-config syn-cookie-whitelist syn-cookie-dsr-flow-reset-by tscookie-vlans tmsh list security dos device-config dos-device-config | grep -A23 half tmsh list security dos profile dos-network {<profile> { network-attack-vector { tcp-half-open } } } *I can miss some commands since I cannot know specific configuration you are using, but above list can give you a good idea about what you have actually configured in your system. Are you using Hardware or Software SYN cookie? Are you using CMP or vCMP? Is device a Neuron platform? 
Is SYN cookie configured/working in AFM, in LTM or in both?
Is SYN cookie enabled at Device, VLAN or Virtual Server context?
If the issue is at virtual server context, which virtual servers are affected? Is the problem happening in a Standard or FastL4 VIP, …?
Check logs (dates/times) and stats to confirm what really happened and since when.
Take captures to confirm your findings.
Is this an attack? Were there other attacks at the same time (TCP BAD ACK, TCP RST maybe)?
Are thresholds correctly configured according to the expected amount of traffic?
If clients are hidden behind a proxy, maybe you could save resources by configuring Challenge and remember.
If this is a Neuron platform, is there any error in /var/log/neurond?
Check published IDs related to SYN Cookie for specific TMOS versions and/or platforms.

Conclusion

Now you have enough information to start troubleshooting your own BIG-IP devices if any issue happens and, maybe more importantly, you have the tools to create the most appropriate configuration for your specific platform and traffic patterns. So you can start to take advantage of your knowledge to improve the performance of your device when under a TCP SYN flood attack.

Kerberos is Easy - Part 2
Yes my friends, this post has been long overdue. Life, work and all the other good excuses got in the way. However, there is nothing like a friend calling you out with a “WTF I need part 2” to get the motivation and Kerberos mana flowing again. So where did we leave off? In part 1 we discussed some of the most common issues with Kerberos authentication and the necessity to break the problem down into client-side vs server-side authentication. In part 2 we will look at some of the first troubleshooting steps I take to determine why nothing is working - yes, this happens to me too.

ADTest is Your New BF4L

You know those people that open a web browser to see if their Internet works… yea, don’t be that person. Open a terminal, check if you can ping your gateway, public DNS server, etc. and then, only then, open a browser. ADTest is your equivalent of ping. Don’t assume that because you configured an Active Directory AAA object, authentication is just going to work. Please open a console and verify that Kerberos authentication against the AD server is working with ADTest. Check out my APM Troubleshooting with ADTest for more information.

Time is not on Your Side

If you have worked with Kerberos before you know it is super picky about time drift. If ADTest just won’t work and you can’t figure out why, ensure the BIG-IP’s time matches the KDC. If you need to adjust the BIG-IP, follow F5 SOL3381.

These Are Not the KDCs You’re Looking For

Ever been in a multi-domain environment where the AD admin swears the KDC you’re talking to is the correct one, never mind the “Kerberos Principal Unknown” error you keep getting? If ADTest doesn’t work then we need to ensure the AD server we’re talking to is a KDC for the expected realm. Now, if someone can RDP into the server this can be ruled out pretty quickly, but when in life is anything that easy.
So it’s helpful to use nslookup to find all KDCs for the intended domain and ensure the IP you were given is in this list - example below:

C:\Users\user>nslookup -type=SRV _kerberos._tcp.dc._msdcs.f5demo.com
Server: ad1.f5demo.com
Address: 10.1.10.2

_kerberos._tcp.dc._msdcs.f5demo.com SRV service location:
    priority = 0
    weight = 100
    port = 88
    svr hostname = ad1.f5demo.com
ad1.f5demo.com internet address = 10.1.10.2
ad1.f5demo.com internet address = 10.1.1.3

Playing Go Fish with SPNs

You have to admire how Microsoft took something as complicated as Kerberos and made it trivial to deploy and manage inside of Active Directory. Having managed the MIT and MS versions of Kerberos myself, I felt a little guilty after setting up my first AD server without hours of troubleshooting issues. However, with that ease of use Microsoft also made it easy to shoot your own foot off if you have no basic understanding of Kerberos. Think of the KDC as a key-value pair database. The KDC will let you store multiple keys of the same value even though you shouldn’t do that. So if you have multiple SPN entries in AD you are not guaranteed that a request for a ticket will return the value you’re looking for. This typically presents itself in APM as authentication working one time and not the next. An easy way to check this is to log into a domain machine and issue:

setspn -X

This will print out any duplicate SPNs in your KDC. If the SPN you are working with appears in this list then you need to correct this issue. The easiest way I find is to delete the service account you have created for APM and just use the service account the web server application pool is using.

Cached Tickets

APM caches Kerberos tickets for both client side Kerberos authentication and server side Kerberos SSO. If you’re troubleshooting Kerberos be sure to clear these caches after you’ve made modifications.
For Kerberos AAA:

bigstart restart apd rba

For Kerberos SSO:

bigstart restart websso

I’ll Just Do It Myself

If you are working with Kerberos SSO then you sometimes have to determine whether the issue is with APM obtaining the token or the web server not accepting the token. If you have your WebSSO logging set to debug and Kerberos SSO is working then you should see:

S4U=====>OK

If not, then there are a few CLI commands you can use to simulate the request APM makes to the KDC. First, remove all Kerberos tickets:

kdestroy

Second, obtain a Kerberos ticket as the AD delegation account (if this works you won’t get a response):

kinit -f <SPN of AD delegation account>

Finally, test if the ticket you obtained has delegation capabilities:

kvno -C -U <username> <SPN of AD delegation account>

If you receive a key version number from the kvno command then everything is working and it proves that ASREQ and RSREQ work. So the issue more than likely is on the web server side and not the F5 - more on this in part 3.

Load Balancing to Only One Pool Member
I have an issue where all traffic in a pool is going to the fifth of 5 pool members. We are using cookie persistence and I know there can be issues with that. We are also using both an http and OneConnect profile. From the sols and DC articles I've read it seems like the problem will arise if you do NOT have a oneconnect profile assigned. I have tried pulling http/oneconnect/tcp profiles on and off, using default profiles. The only thing I haven't tried is falling back to source address persistence. That is a last resort I do not want to use. Due to the application architecture source address persistence will definitely result in uneven load. All the "custom" profiles are built from an F5 guide for the application (Epic HyperSpace Web; Link to Guide). That being said this config seems pretty straight forward, but any help would be very much appreciated. We have captured traffic and seen the cookies present in the sessions. Below is the config. ltm virtual /PARTITION/v_80 { destination /PARTITION/10.10.1.1:80 ip-protocol tcp mask 255.255.255.255 persist { /PARTITION/Custom-cookie { default yes } } pool /PARTITION/pool_80 profiles { /PARTITION/Custom-OneConnect { } /PARTITION/Custom-http { } /PARTITION/Custom-lan-optimized { } } source 0.0.0.0/0 source-address-translation { type automap } translate-address enabled translate-port enabled } ltm pool /PARTITION/Pool_80 { description "HTTP Pool" load-balancing-mode least-connections-member members { /PARTITION/001:80 { address 10.1.1.1 } /PARTITION/002:80 { address 10.1.1.2 } /PARTITION/003:80 { address 10.1.1.3 } /PARTITION/004:80 { address 10.1.1.4 } /PARTITION/005:80 { address 10.1.1.5 } } monitor /Common/http_head_f5 service-down-action reselect } Only change to custom oneconnect profile is mask is 255.255.255.255. Only change to tcp-lan-optimized profile is the idle timeout is set to 1200s. Only change to http profile is that 'Redirect Rewrite' is set to Matching. 
Custom cookie persistence uses default settings.

ringdump
Problem this snippet solves: loop a tcpdump until a log message is seen Code : # Updated: 10/16/06 #!/usr/bin/perl ## VERSION v0.9b use strict; ################ # tcpdump settings ########## my %SETTINGS = ( external => { filter => "port 443" }, internal => { filter => "port 80" }, lo0 => { filter => "port 80" }, ); my $SNAPLEN = 4352; ################ # script settings ###### # free space checking my $FREE_SPACE_CHECK_INTERVAL = 1; # check free space every this number of seconds my $MIN_FREE_SPACE = 5; # minimum percent space left on parition my $CAPTURE_LOCATION = $ARGV[0]; # file rotation settings my $CAPTURES_TO_ROTATE = 4; # tcpdump capture files to rotate my $DESIRED_CAPTURE_SIZE = 15; # megabytes per capture file before rotating my $OVERLAP_DURING_ROTATE = 5; # seconds to overlap previous capture while starting a new one my $CAPTURE_CHECK_INTERVAL = 1; # how often (seconds) to check the size of capture files for rotating # trigger settings - time (run tcpdumps for x seconds) #my $TRIGGER = "time-based"; my $TIME_TO_CAPTURE = 300; # trigger settings - log-message (stop tcpdump when log message is received) my $TRIGGER = "log-message based"; my $LOG_FILE = "/var/log/messages"; my $LOG_MESSAGE = "no space in response line"; my $FOUND_MESSAGE_WAIT = 5; # how many seconds to gather tcpdumps after we match the log message # misc my $IDLE_TIMER = 5; # if ! 
receiving log entries, how long before checking if log is rotated my $MAX_ROTATED_LINES = 10000; # max lines to read from file we're re-reading because it's been rotated my $PID_FILE = "/var/run/ring_dump.pid"; my $DEBUG = 0; # 0/1 #################################################### # END OF THINGS THAT SHOULD NEED TO BE CONFIGURED #################################################### ######## # set defaults ### $SNAPLEN ||= 4352; $TRIGGER ||= "time"; $CAPTURE_LOCATION ||= "/var/tmp"; $TIME_TO_CAPTURE ||= 60; $FREE_SPACE_CHECK_INTERVAL ||= 5; $CAPTURES_TO_ROTATE ||= 3; $DESIRED_CAPTURE_SIZE ||= 10; $OVERLAP_DURING_ROTATE ||= 5; $CAPTURE_CHECK_INTERVAL ||= 5; $MIN_FREE_SPACE ||= 5; $LOG_FILE ||= "/var/log/messages"; $LOG_MESSAGE ||= "FAILED"; $FOUND_MESSAGE_WAIT ||= 5; $IDLE_TIMER ||= 5; $PID_FILE ||= "/var/run/ring_dump.pid"; $DEBUG ||= 0; unless (-d $CAPTURE_LOCATION) { print "$CAPTURE_LOCATION isn't a directory, using /mnt instead\n\n"; $CAPTURE_LOCATION = "/mnt"; } if (! -r $LOG_FILE) { die "Can't read \"$LOG_FILE\", EXIT\n"; } # insert code to find tcpdump instead of relying on path HERE: my $tcpdump = "/usr/sbin/tcpdump"; ###### # misc global variable declaration ########## my($answer, $interface, $pid, $tail_child, $F_LOG); my($current_size, $current_inode, $last_size, $last_inode); my @child_pids; my $ppid = $$; my $min_megabytes = $CAPTURES_TO_ROTATE * $DESIRED_CAPTURE_SIZE; $current_size = $current_inode = $last_size = $last_inode = 0; $|++; ########### # functions ####### # exit function that does does necessary child handling sub finish { $_ = shift(); if (defined($_) && $_ ne "") { print; } foreach $interface (keys( %SETTINGS )) { push(@child_pids, $SETTINGS{$interface}{pid}); } $DEBUG && print "INTERRUPT: sending SIGINT and SIGTERM to: ", join(" ", @child_pids), "\n"; kill(2, @child_pids); sleep(1); kill(15, @child_pids); $DEBUG && print "INTERRUPT: done, unlink pidfile and exit\n"; unlink($PID_FILE); exit(0); } $SIG{INT} = sub { finish(); }; # report 
# usage on CAPTURE_LOCATION's MB free from df
sub free_megabytes {
    my $partition = shift();
    $partition ||= $CAPTURE_LOCATION;
    my $free_megabytes;
    $DEBUG && print "free_megabytes(): capture partition is $partition\n";
    open(DF, "df $partition|");
    # discard the first line;
    $_ = <DF>;
    # parse the usage out of the second line
    $_ = <DF>;
    $free_megabytes = (split)[3];
    $free_megabytes = int($free_megabytes / 1024);
    close(DF);
    $DEBUG && print "free_megabytes(): finished reading df, output is: $free_megabytes\n";
    $free_megabytes;
}

# report usage on CAPTURE_LOCATION's % usage from df
sub free_percent {
    my $partition = shift();
    $partition ||= $CAPTURE_LOCATION;
    my $free_percent;
    $DEBUG && print "free_percent(): capture partition is $partition\n";
    open(DF, "df $partition|");
    # discard the first line;
    $_ = <DF>;
    # parse the usage out of the second line
    $_ = <DF>;
    $free_percent = (split)[4];
    chop($free_percent); ## chop off '%'
    $free_percent = (100 - $free_percent);
    close(DF);
    $DEBUG && print "free_percent(): finished reading df, output is: $free_percent\n";
    $free_percent;
}

# simple sub to send SIGHUP to syslogd
sub restart_syslogd () {
    if (-f "/var/run/syslog.pid") {
        open(PIDFILE, "</var/run/syslog.pid");
        $_ = <PIDFILE>;
        chomp;
        kill(1, ($_));
        1;
    }
}

# simple wrapper to start tcpdumps, assuming obvious globals
sub start_tcpdump {
    my $interface = shift();
    my $capture_file = shift();
    my $filter = shift();
    my @cmd = ("$tcpdump", "-s$SNAPLEN", "-i$interface", "-w$capture_file", "$filter");
    $DEBUG || open(STDERR, ">/dev/null");
    $DEBUG && print "start_tcpdump(): about to start: ", join(" ", @cmd), "\n";
    exec($cmd[0], @cmd[1..$#cmd]) || print "start_tcpdump(): FAILED to start: ", join(" ", @cmd), ", command not found\n";
    $DEBUG || close(STDERR);
    exit(1);
}

# sub to see how much space a given capture file is using (to decide to rotate or not)
sub capture_space ($) {
    my $capture_file = shift();
    my $size = ( stat($capture_file) )[7];
    $DEBUG && print "capture_space(): size of $capture_file is $size\n";
    # return size of argument in megabytes, but
    # don't divide by zero
    if ($size == 0) {
        return 0;
    } else {
        return ($size / 1048576);
    }
}

# gives user the option to create a MFS
sub create_mfs () {
    if (-d $CAPTURE_LOCATION) {
        $DEBUG && print "create_mfs(): directory $CAPTURE_LOCATION exists\n";
    } else {
        mkdir($CAPTURE_LOCATION, 0755) || die "FAILED to create $CAPTURE_LOCATION\n";
        print "Capture directory ($CAPTURE_LOCATION) did not exist, so it was created\n";
    }
    # figure out the partition CAPTURE_LOCATION is on. This is cheap... fixme
    my $partition = $CAPTURE_LOCATION;
    $partition =~ s!(/[A-z0-9]*)/{0,1}.*!$1!g;
    open(MOUNT, "mount|") || die "FAILED to run \"mount\": $!\n";
    while (<MOUNT>) {
        next unless ((split())[2] =~ /^$partition$/);
        $DEBUG && print "create_mfs(): partition: $partition is already mounted, return\n";
        # return 1 if it's already mounted
        return 1;
    }
    close(MOUNT);
    print "Mount a Memory File System (MFS) on ${CAPTURE_LOCATION}? [y/n]: ";
    my $answer = <STDIN>;
    if (lc($answer) =~ "y") {
        print "Enter size of MFS in blocks (200000 = 100M), or just press enter for 100M: ";
        chomp (my $mfs_size = <STDIN>);
        $mfs_size = 200000 if ($mfs_size eq "");
        print "Allocating $mfs_size blocks to $CAPTURE_LOCATION for MFS\n";
        system("mount_mfs -s $mfs_size $CAPTURE_LOCATION");
        if (($?
>> 8) != 0) { print "an error occurring trying to mount the MFS filesystem, exit status: $?\n"; 0; } else { print "MFS file system established\n\n"; 1; } } } sub fork_to_background ($) { my $cmd = shift(); my $pid = fork(); if ($pid == 0) { exec($cmd) || die "exec() failed: $!\n"; } else { return($pid); } } sub popen_read ($) { my $cmd = shift(); my $child; $DEBUG && print "Background: \"$cmd\"\n"; pipe(READLOG, WRITELOG); select(READLOG); $|++; select(WRITELOG); $|++; select(STDOUT); ## dup STDOUT and STDERR open(T_STDOUT, ">&STDOUT"); open(T_STDERR, ">&STDERR"); ## redir STDOUT to pipe for child open(STDOUT, ">&WRITELOG"); open(STDERR, ">&WRITELOG"); $child = fork_to_background($cmd); ## close STDOUT, STDERR and FILE close(STDOUT); close(STDERR); ## re-open STDOUT as normal and close dup open(STDOUT, ">&T_STDOUT"); close(T_STDOUT); open(STDERR, ">&T_STDERR"); close(T_STDERR); return($child, \*READLOG); } sub open_log ($$) { my $LOG_FILE = shift(); my $lines = shift(); if (defined($F_LOG) && defined(fileno($F_LOG)) ) { $DEBUG && print "Killing child before closing LOG\n"; kill(15, $tail_child); waitpid($tail_child, 0); $DEBUG && print "Closing LOG\n"; close($F_LOG); } $DEBUG && print "Opening \"$LOG_FILE\"\n"; ($tail_child, $F_LOG) = popen_read("tail -n $lines -f $LOG_FILE"); push(@child_pids, $tail_child); 1; } ## check to see if log is rotated, returns true if rotated sub is_rotated ($) { my $LOG_FILE = shift(); $DEBUG && print "enter is_rotated()\n"; ($current_inode, $current_size) = (stat($LOG_FILE))[1,7]; if (($last_size != 0) && ($last_size > $current_size)) { $DEBUG && print "File is now smaller. File must have been rotated\n"; $last_size = $current_size; $last_inode = $current_inode; open_log($LOG_FILE, $MAX_ROTATED_LINES) || die "open_log $LOG_FILE failed: $!\n"; return(1); } elsif (($last_inode != 0) && ($last_inode != $current_inode)) { $DEBUG && print "Inode changed. 
File must have been rotated\n";
        $last_inode = $current_inode;
        $last_size = $current_size;
        open_log($LOG_FILE, $MAX_ROTATED_LINES) || die "open_log $LOG_FILE failed: $!\n";
        return(1);
    }
    ($last_inode, $last_size) = ($current_inode, $current_size);
    0;
}

###########
# MAIN
########
if (free_megabytes() < $min_megabytes) {
    print "free space on $CAPTURE_LOCATION is below ${min_megabytes}MB, you must create a Memory File System or choose another location to gather tcpdumps\n";
    goto MUST_MFS;
}

######### GET USER INPUT ###############
if (free_percent() < $MIN_FREE_SPACE) {
    print "free space on $CAPTURE_LOCATION is below ${MIN_FREE_SPACE}%, you must create a Memory File System or choose another location to gather tcpdumps\n";
    MUST_MFS:
    # require the user to create a MFS if they don't have enough free space
    exit(1) unless (create_mfs());
} else {
    create_mfs();
}
if (free_percent() < $MIN_FREE_SPACE || free_megabytes() < $min_megabytes) {
    print "it appears the Memory File System is in place, but there is still insufficient space, exiting\n";
    exit(1);
}
print "capturing to $CAPTURE_LOCATION using the following interfaces and filters:\n";
foreach $interface (keys( %SETTINGS )) {
    system("ifconfig $interface >/dev/null 2>&1");
    if ( ($? >> 8) != 0) {
        $DEBUG && print "couldn't ifconfig $interface, removing from list\n";
        delete( $SETTINGS{$interface} );
    } else {
        print " $interface: $SETTINGS{$interface}{filter}\n";
    }
}
print "does this look right? [y/n]: ";
$answer = <STDIN>;
exit unless lc($answer) =~ "y";

####### DAEMONIZE #############
chdir("/");
exit unless (fork() == 0);
# kill old self, write pid file
if (-f $PID_FILE) {
    open(PIDFILE, "<$PID_FILE");
    kill(15, <PIDFILE>);
    close(PIDFILE);
}
open(PIDFILE, ">$PID_FILE");
syswrite(PIDFILE, $$);
close(PIDFILE);

########### START PROCESSING ###############
foreach $interface (keys( %SETTINGS )) {
    my $filter = $SETTINGS{$interface}{filter};
    $pid = fork();
    $SETTINGS{$interface}{rotate_number} = 1;
    if (!defined($pid)) {
        print "fork() failed!
exiting\n"; exit 1; } if ($pid == 0) { start_tcpdump( $interface, "$CAPTURE_LOCATION/${interface}.dump.$SETTINGS{$interface}{rotate_number}", $filter ); exit 1; } else { $SETTINGS{$interface}{pid} = $pid; print "started tcpdump as pid $pid on \"$interface\" filtered as \"$filter\"\n"; } } ###### # fork off a process to keep an eye on free space ######## $pid = fork(); if ($pid == 0) { while (1) { my $sleep_return = sleep($FREE_SPACE_CHECK_INTERVAL); $DEBUG && ($sleep_return != $FREE_SPACE_CHECK_INTERVAL) && print "WARN: free_percent() loop: sleep returned $sleep_return instead of $FREE_SPACE_CHECK_INTERVAL !\n"; if (free_percent() < $MIN_FREE_SPACE) { print "WARN: free space is below ${MIN_FREE_SPACE}%, killing main script\n"; kill(2, $ppid); sleep(1); kill(15, $ppid); print "WARN: sent SIGTERM to $ppid (main script), exiting\n"; exit 1; } else { $DEBUG && print "free_percent(): space is fine, continue\n"; } } } else { push(@child_pids, $pid); $DEBUG && print "started free_percent watcher as: $pid\n"; } ###### # fork off a process to rotate capture files as necessary ######## $pid = fork(); if ($pid == 0) { my $capture_file; while (1) { my $sleep_return = sleep($CAPTURE_CHECK_INTERVAL); $DEBUG && ($sleep_return != $CAPTURE_CHECK_INTERVAL) && print "WARN: start_tcpdump() loop: sleep returned $sleep_return instead of $CAPTURE_CHECK_INTERVAL !\n"; foreach $interface (keys( %SETTINGS )) { if (capture_space("$CAPTURE_LOCATION/${interface}.dump.$SETTINGS{$interface}{rotate_number}") >= $DESIRED_CAPTURE_SIZE) { if ($SETTINGS{$interface}{rotate_number} == $CAPTURES_TO_ROTATE) { print "reached maximum number of captures to rotate: $CAPTURES_TO_ROTATE, starting over at 1\n"; $SETTINGS{$interface}{rotate_number} = 1; } else { $SETTINGS{$interface}{rotate_number}++; } print "rotating capture file: ${interface}.dump, new extension .$SETTINGS{$interface}{rotate_number}\n"; $pid = fork(); if ($pid == 0) { start_tcpdump( $interface, 
"$CAPTURE_LOCATION/${interface}.dump.$SETTINGS{$interface}{rotate_number}", $SETTINGS{$interface}{filter}, ); exit 0; } push(@child_pids, $pid); # get some overlap in the two files sleep($OVERLAP_DURING_ROTATE); # kill the old tcpdump kill(2, $SETTINGS{$interface}{pid}); $DEBUG && print "sent SIGINT to $interface: $SETTINGS{$interface}{pid}, new pid $pid\n"; # record the new pid $SETTINGS{$interface}{pid} = $pid; } else { $DEBUG && print "capture file doesn't need to be rotated yet: ${interface}.dump\n"; } } # Reap any zombies from old tcpdumps $DEBUG && print "start_tcpdump() loop: \@child_pids = (", join(' ', @child_pids), ")\n"; while (1) { use POSIX ":sys_wait_h"; my $child = waitpid(-1, WNOHANG); if (defined $child and $child > 0) { # remove PID from @child_pids @child_pids = grep {$_ != $child} @child_pids; $DEBUG && print "start_tcpdump() loop: reaped child PID $child\n"; } else { # no one to reap last; } } } } else { push(@child_pids, $pid); $DEBUG && print "started capture file watcher as: $pid\n"; } ################ # watch triggers (time or log based) #################### $SIG{TERM} = sub { finish(); }; if (lc($TRIGGER) =~ /time/) { print "time-based trigger, will capture for $TIME_TO_CAPTURE seconds\n"; sleep($TIME_TO_CAPTURE); print "captured for $TIME_TO_CAPTURE seconds, stopping tcpdumps\n"; } elsif (lc($TRIGGER) =~ /log/) { print "log-based trigger, waiting for \"$LOG_MESSAGE\" in \"$LOG_FILE\"\n"; # creates global $F_LOG filehandle of $LOG_FILE open_log($LOG_FILE, 0) || finish("open_log $LOG_FILE failed: $!\n"); # flush syslogd's buffers (avoid never getting the message due to "last message repeated....") restart_syslogd() || finish("Restarting syslogd failed, EXIT\n"); # tail -f the log and wait for message while (1) { # reap any zombies during each loop my $return; while (1) { use POSIX ":sys_wait_h"; my $child = waitpid(-1, WNOHANG); if (defined $child and $child > 0) { $DEBUG && print "log trigger loop: reaped child PID $child\n"; } else { # no 
                # one to reap
                last;
            }
        }
        eval {
            $SIG{ALRM} = sub { die("ALRM\n"); };
            alarm($IDLE_TIMER);
            $_ = <$F_LOG>;
            alarm(0);
        };
        if ($@) {
            # this only occurs if we're idle for $IDLE_TIMER seconds because no new log entries are occurring
            $@ = undef;
            is_rotated($LOG_FILE);
            next;
        }
        $DEBUG && print "in LOG reading loop, current line: \"$_\"\n";
        if (/$LOG_MESSAGE/) {
            $DEBUG && print "Current line matches: \"$LOG_MESSAGE\"\n";
            last;
        }
        $DEBUG && print "no match, next\n";
    }
    print "received log message, sleeping $FOUND_MESSAGE_WAIT seconds then stopping tcpdumps\n";
    sleep($FOUND_MESSAGE_WAIT);
}

# figure out current tcpdump child_pids and push them onto the list
foreach $interface (keys( %SETTINGS )) {
    push(@child_pids, $SETTINGS{$interface}{pid});
}

# kill all tcpdumps + free space watcher + capture file rotator -- doesn't return
finish();
0;

What CLI tools are in your toolbox for testing or troubleshooting application deployments on BIG-IP?
In trying to nail down a rate limiting solution, I reached back for my trusty apache-bench utility, ab, to push some load with concurrency at the virtual server. But in today's SSL-everywhere world, I discovered that ab does not have a flag to ignore SSL errors, and I don't have valid certs in my local lab, so I couldn't use that to push load. In a quick search I found hey, installed that on my ubuntu test box, and it seems to be a reasonable replacement (see below). But that got me thinking...what cli tools are out there for test/troubleshooting that I'm missing out on?

jrahm@unuc:~$ hey -n 10000 -c 5 https://172.16.101.50/

Summary:
  Total: 8.2474 secs
  Slowest: 0.0510 secs
  Fastest: 0.0023 secs
  Average: 0.0041 secs
  Requests/sec: 1212.5047
  Total data: 20000 bytes
  Size/request: 2 bytes

Response time histogram:
  0.002 [1] |
  0.007 [9956] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.012 [36] |
  0.017 [2] |
  0.022 [0] |
  0.027 [0] |
  0.032 [0] |
  0.036 [0] |
  0.041 [0] |
  0.046 [0] |
  0.051 [5] |

Latency distribution:
  10% in 0.0032 secs
  25% in 0.0036 secs
  50% in 0.0040 secs
  75% in 0.0045 secs
  90% in 0.0050 secs
  95% in 0.0055 secs
  99% in 0.0065 secs

viewing HTTP header in tcpdump
Hi guys, is it possible to view the content of HTTP headers in a tcpdump trace file? I tried:

tcpdump -i /Common/ -vvvs 1024 -l -A host example.com

It looks as if the device is capturing packets, but when I download the capture, the file is empty or broken. Does anyone have an idea about this? I am using version 11.3 HF8.

Cheers, Merlin

irule Troubleshooting:
Hi guys, when I try to assign the following iRule to a VS I receive this error message:

01070151:3: Rule [/Common/rule_errorpage_test] error: Unable to find pool (sorry_pool) referenced at line 21: [pool sorry_pool]

when RULE_INIT {
    set static::sorry_pool "/Common/pool_wwbtest_sorry"
    set static::wartung_pool "/Common/pool_wwbtest_wartung"
}
when HTTP_REQUEST {
    log local0. "Incoming URI = [HTTP::uri]"
    if { [catch { [active_members [LB::server pool]] < 1}] } {
        if { [catch {active_members $static::wartung_pool}] > 0 } {
            pool wartung_pool
            set mypick [LB::select]
            log local0. "The LB choice is: $mypick"
            set newuri [string map {/wartung.html/ /} [HTTP::uri]]
            log local0. "The new URI is = $newuri"
            HTTP::uri $newuri
            eval "GET [HTTP::uri $newuri] HTTP/1.1\r\nHost: $mypick\r\n\r\n"
            log local0. "[HTTP::uri] [LB::server addr]"
        } elseif { [catch {active_members $static::sorry_pool}] > 0 } {
            pool sorry_pool
            set mypick [LB::select]
            log local0. "The LB choice is: $mypick"
            set newuri [string map {/sorry.html/ /} [HTTP::uri]]
            log local0. "The new URI is = $newuri"
            HTTP::uri $newuri
            eval "GET [HTTP::uri $newuri] HTTP/1.1\r\nHost: $mypick\r\n\r\n"
            log local0. "[HTTP::uri] [LB::server addr]"
        } else {
            HTTP::respond 200 content "Lieber Kunde,\r\ndie Seite befindet sich aktuell in der Wartung.\r\n\r\nDear Customer,\r\nthis site is down for maintenance. We are sorry for the inconvience.\r\n" noserver connection close
        }
    }
}

I am really puzzled and have no clue what the problem might be. If you have any ideas on how to solve this I would be grateful 🙂

Best regards, David