on 29-Jan-2020 10:23
I’m about to write a series of DDoS articles, which will hopefully help you better understand how F5 DDoS protection works and how to configure it. I plan to release them at more or less regular intervals. Feel free to send me your feedback and let me know which topics are relevant to you and should be covered.
This first article explains the BIG-IP Device DOS and per-service DOS protection (DOS profile). It covers the concepts of both approaches and explains at a high level the threshold modes “Fully Manual”, “Fully Automatic” and “Multiplier Based Mitigation”, including the principles of stress measurement. In later articles I will describe these threshold modes in more detail.
At the end of the article you will also find an explanation of the physical and logical data path of the BIG-IP.
Device DOS with static DoS vectors
The primary goal of “Device DOS” is to protect the BIG-IP itself. BIG-IP has to deal with every packet arriving on the device, regardless of whether there is a listener (Virtual Server (VS) / Protected Object (PO)), a connection entry, a NAT entry or anything else. As soon as a packet hits the BIG-IP, its CPU basically has to deal with it. Picking up each and every packet consumes CPU cycles, especially under heavy DoS conditions. The more operations each packet needs, depending on the configuration, the more CPU cycles it consumes and the higher the CPU load gets. The earlier BIG-IP can drop a packet (if the packet needs to be dropped), the better it is for the CPU load. If the drop is done in hardware (FPGA), the packet is dropped before it consumes any CPU cycles.
Using static DOS vectors helps to lower the number of packets hitting the CPU under DoS conditions. If the number of packets for a vector rises above a certain threshold (usually when a flood happens), BIG-IP rate-limits that specific vector, which keeps the CPU available because it sees fewer packets.
Figure 1: Principle of Attack mitigation with static DoS vectors
The downside of this approach is that BIG-IP cannot differentiate between "legitimate" and "attacking" packets when using a static DOS rate limit. BIG-IP (FPGA) simply drops based on the predicates it gets from the static DOS vector. Predicates are attributes of a packet, such as the protocol “TCP” or the flag “ACK” being set. Keep in mind that when BIG-IP runs stateful, "bad" packets get dropped in software anyway by the operating system (TMOS) when it identifies them as not belonging to an existing connection or as invalid/broken (for example a bad CRC checksum). This is of course different for SYN packets, because they create connection entries. This is where SYN cookies play an important role under DoS conditions; they will be explained in a later article.
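To make the idea of predicates and static rate-limiting more concrete, here is a minimal Python sketch. It is not how BIG-IP is implemented; the class name, the vector name and the packet dictionary layout are all hypothetical and only illustrate that a static vector drops purely on packet attributes and a packet count.

```python
from dataclasses import dataclass


@dataclass
class StaticVector:
    """Hypothetical model of a static DoS vector (not an F5 API)."""
    name: str
    predicates: dict          # packet attributes that must all match
    rate_limit_pps: int       # packets per second allowed for this vector
    seen_this_second: int = 0

    def matches(self, packet: dict) -> bool:
        # A packet belongs to the vector when every predicate matches.
        return all(packet.get(key) == value
                   for key, value in self.predicates.items())

    def admit(self, packet: dict) -> bool:
        # Packets that do not match the vector are not affected by it.
        if not self.matches(packet):
            return True
        # Matching packets are counted and dropped above the limit.
        # Note that there is no notion of "legitimate" vs. "attacking" here.
        self.seen_this_second += 1
        return self.seen_this_second <= self.rate_limit_pps


vector = StaticVector("tcp-ack-flood", {"protocol": "TCP", "flag": "ACK"}, 2)
ack = {"protocol": "TCP", "flag": "ACK"}
results = [vector.admit(ack) for _ in range(3)]
udp_result = vector.admit({"protocol": "UDP"})
```

With a limit of two packets, the third ACK packet is dropped even if it belongs to a legitimate connection, which is exactly the downside described above.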
I usually recommend running Behavioral DoS mitigation in conjunction with static DoS vectors. It is much more precise and able to filter out only the attack traffic. I will discuss this in more detail in one of the following articles.
Manual threshold vs. Fully Automatic
Before I start to explain the “per service DoS protection” (DoS profiles), I would like to give you a brief overview of some Threshold modes you can use per DoS vector.
There are different ways to set thresholds to activate the detection and rate-limiting on a DOS vector.
The operator can do it manually by using the “Fully Manual” option and then fine-tuning the pre-configured values on the DOS vectors. This can be challenging: besides being entirely manual work, it is usually difficult to know the right thresholds, especially because they typically depend on the time of day and the day of the week.
Figure 2: Example of a manual DoS vector configuration
That’s why most of the vectors offer the "Fully Automatic" option. It means BIG-IP learns from history and “knows” how many packets it usually “sees” at that specific time for that specific vector. This is called baselining, and it is used to calculate the detection threshold.
Figure 3: Threshold Modes
As soon as a flood hits the BIG-IP and crosses the detection threshold for a vector, BIG-IP detects it as an attack for that vector, which basically means it identifies it as an anomaly. However, it does not start to mitigate (drop) yet. That only happens once the TMM load (CPU load) is also above a certain utilization (Mitigation sensitivity: Low: 78%, Medium: 68%, High: 51%). If both conditions are true (packet rate and TMM/CPU load are too high), mitigation starts and rate-limits the number of packets for this vector going into the BIG-IP for the specific TMM.
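The two-condition logic can be sketched in a few lines of Python. This is only an illustration of the decision described above, using the sensitivity percentages quoted in this article; the function and variable names are hypothetical.

```python
# TMM load percentages per mitigation sensitivity, as quoted in the article.
SENSITIVITY_LOAD_PCT = {"low": 78, "medium": 68, "high": 51}


def should_mitigate(packet_rate: float,
                    detection_threshold: float,
                    tmm_load_pct: float,
                    sensitivity: str = "medium") -> bool:
    """Return True only when BOTH conditions hold (hypothetical helper)."""
    # Condition 1: the vector's packet rate crossed the learned threshold.
    attack_detected = packet_rate > detection_threshold
    # Condition 2: the TMM (CPU) load is above the sensitivity level.
    under_stress = tmm_load_pct > SENSITIVITY_LOAD_PCT[sensitivity]
    return attack_detected and under_stress


detect_only = should_mitigate(200_000, 100_000, tmm_load_pct=50)
mitigating = should_mitigate(200_000, 100_000, tmm_load_pct=70)
```

In the first call the anomaly is detected but the TMM is not stressed, so nothing is dropped; in the second call both conditions are true and mitigation would start.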
That means dropping packets by the DOS feature will only happen if necessary, in order to protect the BIG-IP CPU. This is a dynamic process which drops more packets when the attack becomes more aggressive and less when the attack becomes less aggressive.
For TCP traffic keep in mind the “invalid” traffic will not get forwarded anyway, which is a strong benefit of running the device stateful.
When a DOS vector is hardware supported then FPGAs drop the packets basically at the switch level of the BIG-IP. If it’s not hardware supported, then the packet is dropped at a very early stage of the life cycle of a packet “inside” a BIG-IP.
Device DOS counts ALL packets going to the BIG-IP. It doesn’t matter whether their destination is behind the BIG-IP or is the BIG-IP itself.
Because Device DOS is there to protect the BIG-IP, the thresholds you can configure in the “manual configuration” mode on the DOS vectors are per TMM! This means you configure the maximum number of packets one TMM can handle. Operators very often want to set an overall threshold of, let’s say, 10k; they then need to divide this limit by the number of TMMs. An exception is the Sweep and Flood vector: here the threshold is per BIG-IP and not per TMM.
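As a small worked example of that division, the following Python sketch converts an overall limit into the per-TMM value to configure. The helper name is hypothetical, and the TMM count of 4 is just an assumed example value.

```python
def per_tmm_threshold(overall_pps: int, tmm_count: int) -> int:
    """Split an overall packet limit into a per-TMM threshold."""
    if tmm_count < 1:
        raise ValueError("tmm_count must be at least 1")
    # Integer division rounds down, so the sum never exceeds the overall limit.
    return overall_pps // tmm_count


# Overall limit of 10k packets/sec on an assumed 4-TMM system.
threshold = per_tmm_threshold(10_000, 4)
```

For an overall limit of 10k and 4 TMMs, the value to configure on the vector would be 2,500.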
DOS profile for protected Objects
Now let’s talk about the “per Service DoS Protection (DoS profile)”. The goal is to protect the service which runs behind the BIG-IP. It is important to know that when BIG-IP runs stateful, the service is already protected by the state of BIG-IP (true for TCP traffic). That means a randomized ACK flood, for example, will never hit the service. For stateless traffic (UDP/ICMP) it is different: here the BIG-IP simply forwards packets. Fully Automatic used on a DoS profile discovers the health of the service, which works well if it’s TCP or DNS traffic. (This is done by evaluating TCP behavior such as window size, retransmissions and congestion, or by counting DNS requests vs. responses.)
If the detection rate for that service is crossed (anomaly detected) and stress on the service is identified, the mitigation starts. Again, keep in mind that for TCP this will only happen for “legitimate” traffic, because out-of-state traffic will never reach the service. It is already protected by the BIG-IP being configured stateful.
An example could be an HTTP flood. My recommendation: if you have the option to configure L7 BaDOS (Behavioral DOS running on DHD/A-WAF), then go with that one. It gives much better results and better mitigation options on HTTP floods than L3/4 DOS alone, because it also operates on L7. In that case, please do not configure the TCP DOS vectors “Push Flood”, "bad ACK Flood" and “SYN Flood” on that DOS profile, unless you use a TMOS version newer than 14.1. For other TCP services like POP3, SMTP, LDAP … you can always go with the L3/4 DOS vectors. Since TMOS version 15.0, L7 BaDOS and L3/4 DOS vectors work in conjunction.
Stateless traffic (UDP/ICMP)
For UDP traffic I would like to separate UDP into DNS traffic and non-DNS-UDP traffic.
DNS is a very good example where the health detection mechanism works great. This is done by measuring the ratio of requests vs. responses. For example, if BIG-IP sees 100k queries/sec going to the DNS server and the server sends back 100k answers/sec, the server shows it can handle the load.
If it sees 200k queries/sec going to the server and the server sends back only 150k answers/sec, this is a good indication that the server cannot handle the load. In that case the BIG-IP would start to rate-limit, provided the current rate is also above the detection rate (the rate BIG-IP expects based on the historical rates).
The 'Device UDP' vector gives you the option to exclude UDP traffic on specific ports from the UDP vector (packet counting). For example, when you exclude port 53 and UDP traffic hits that port, it will not count towards the UDP counter. In that case you would handle the DNS traffic with the DNS vectors.
Figure 4: UDP Port exclusion
Here you see an example for port 53 (DNS), 5060 (SIP), 123 (NTP)
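The DNS health check described earlier in this section can be sketched as a simple ratio comparison. This is only an illustration: the function names are hypothetical, and the 0.9 answer ratio is an assumed cut-off, not a value taken from BIG-IP.

```python
def dns_server_stressed(queries_per_sec: float,
                        answers_per_sec: float,
                        min_answer_ratio: float = 0.9) -> bool:
    """Treat the server as stressed when it answers too few queries.

    min_answer_ratio is an assumed value for illustration only.
    """
    if queries_per_sec == 0:
        return False
    return answers_per_sec / queries_per_sec < min_answer_ratio


def should_rate_limit_dns(queries_per_sec: float,
                          answers_per_sec: float,
                          detection_rate: float) -> bool:
    # Rate-limit only when the query rate is above the learned detection
    # rate AND the server shows signs of stress.
    return (queries_per_sec > detection_rate
            and dns_server_stressed(queries_per_sec, answers_per_sec))


healthy = dns_server_stressed(100_000, 100_000)    # 100k in, 100k out
stressed = dns_server_stressed(200_000, 150_000)   # 200k in, 150k out
limit = should_rate_limit_dns(200_000, 150_000, detection_rate=120_000)
```

With the numbers from the text, the first case is healthy, the second is stressed, and rate-limiting would start in the second case as long as the detection rate is also exceeded.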
Auto Detection / Multiplier Based Mitigation
If the traffic is non-DNS UDP traffic or ICMP traffic, the stress measurement does not work very accurately, so I recommend going with the “multiplier option” on the DOS profile. Here BIG-IP does the baselining (calculating the detection rate) just as in “Fully Automatic” mode, except that it kicks in (rate-limits) when it sees more than the defined multiple of the detection rate for the specific vector, regardless of the CPU load.
For example, when the calculated detection rate is 250k packets/sec and the multiplier is set to 500 (which means 5 times), the mitigation rate would be 250k x 5 = 1,250,000 packets/sec.
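The same arithmetic as a short Python sketch, assuming (as the example suggests) that the configured multiplier is expressed in percent, so that 500 means 5 times. The function name is hypothetical.

```python
def mitigation_rate(detection_rate_pps: int, multiplier_pct: int) -> int:
    """Mitigation rate = detection rate x multiplier (given in percent)."""
    return detection_rate_pps * multiplier_pct // 100


# Detection rate of 250k packets/sec with the multiplier set to 500.
rate = mitigation_rate(250_000, 500)
# Mitigation depends only on the packet rate, not on CPU or service stress.
mitigating = 1_300_000 > rate
```

Because stress is not part of the decision, a vector at 1.3 million packets/sec would be rate-limited here even if the BIG-IP and the service were idle.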
The multiplier feature gives the nice option to configure a threshold based on the multiplication of an expected rate.
Figure 5: Multiplier based mitigation
On a DOS profile the threshold configuration is per service (all packets targeted at the protected service), which means it applies to the whole BIG-IP and NOT per TMM as on the Device level. Here the goal is to set how many packets are allowed to pass the BIG-IP and reach the service. The distribution of these thresholds to the TMMs is done dynamically: every TMM gets a percentage of the configured threshold, based on the EPS (Events Per Second, which in this context means Packets Per Second) the system saw for the specific vector on this TMM during the previous second. This mechanism protects against hash-type attacks.
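A rough Python sketch of such a proportional distribution follows. The real algorithm is not documented in this article, so treat this purely as an illustration; in particular, the even split when no traffic was seen in the previous second is my own assumption.

```python
def distribute_threshold(total_threshold: float,
                         eps_last_second: list) -> list:
    """Give each TMM a share of the threshold proportional to its EPS."""
    tmm_count = len(eps_last_second)
    if tmm_count == 0:
        return []
    total_eps = sum(eps_last_second)
    if total_eps == 0:
        # Assumption: with no traffic seen, split the threshold evenly.
        return [total_threshold / tmm_count] * tmm_count
    return [total_threshold * eps / total_eps for eps in eps_last_second]


# 10k packets/sec configured for the service, four TMMs with uneven load.
shares = distribute_threshold(10_000, [100, 300, 0, 600])
```

The TMM that saw 60% of the packets for the vector in the previous second gets 60% of the configured threshold, so traffic concentrated on one TMM still counts against the per-service limit as a whole.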
Physical and logical data path
There is a physical and logical traffic path in BIG-IP. When a packet gets dropped via a DOS profile on a VS/PO, then BIG-IP will not see it on the device level counter anymore. If the threshold on the device level is reached, the packet gets dropped by the device level and will not get to the DOS profile. So, the physical path is device first and then comes the VS/PO, but the logical path is VS/PO first and then device.
That means that when a packet arrives on the device level (physically first), that counter gets incremented. If the device threshold is not reached, the packet goes on to the VS/PO (DOS profile) level. If the threshold is reached there, the packet gets dropped and the counter for the device DOS vector is decremented. If the packet does not get dropped, it counts towards both counters (Device/VS). This means that the VS/PO thresholds should always be set lower than the device thresholds (remember that on Device level you set the thresholds per TMM!). The device threshold should be the maximum the BIG-IP (TMM) can handle overall.
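The counter handling can be modelled with a small Python sketch. It is a strong simplification (a single counting window, no per-TMM split), and all names are hypothetical; it only illustrates the increment/decrement behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    device: int = 0    # packets counted by the Device DOS vector
    profile: int = 0   # packets counted by the VS/PO DOS profile vector


def process_packet(c: Counters,
                   device_threshold: int,
                   profile_threshold: int) -> str:
    # Physical path: the device level sees the packet first.
    c.device += 1
    if c.device > device_threshold:
        return "dropped_by_device"
    # Next stop: the DOS profile on the VS/PO.
    if c.profile >= profile_threshold:
        # Dropped by the profile: the device counter is decremented again.
        c.device -= 1
        return "dropped_by_profile"
    # Not dropped: the packet counts towards both counters.
    c.profile += 1
    return "forwarded"


counters = Counters()
outcomes = [process_packet(counters, device_threshold=5, profile_threshold=3)
            for _ in range(4)]
```

After four packets, three are forwarded and one is dropped by the profile, and both counters show 3, because the profile drop was taken back out of the device counter.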
It is important to note that mitigation thresholds in manual mode are “hard” numbers and do not take the stress on the service into account. In “Fully Automatic” mode on the VS/PO level, mitigation kicks in when the detection threshold is reached AND the stress on the service is too high.
Now let’s assume a protected TCP service behind the BIG-IP gets attacked by a TCP attack like a randomized Push/ACK flood. Because of the high packet rate hitting that vector on the DOS profile attached to that PO, it will go into 'detect mode'. But because it is a TCP flood and BIG-IP is configured to run stateful, the attack packets will never reach the service behind the BIG-IP, and therefore the stress on that service will never go high in this case. The flood is then handled by the session state (CPU) of the BIG-IP until the CPU comes under too much pressure; then Device DOS kicks in “upfront” and mitigates the flood.
If you use the “multiplier option”, the mitigation kicks in when the detection rate multiplied by the configured multiplier is reached. This can happen on the VS/PO level and/or on the Device level, because it is independent of stress.
“Fully Automatic” will not work properly with asymmetric traffic for VS/PO, because BIG-IP cannot identify whether the service is under stress when it only sees half of the traffic: either the request or the response, depending on where the connection is initiated.
For deployments with asymmetric traffic I recommend using the manual or “multiplier option” on VS/PO configurations. The “multiplier option” keeps the calculated detection and mitigation rates dynamic, based on the history.
“Fully Automatic” works with asymmetric traffic on Device DOS, because it measures the CPU (TMM) stress of the BIG-IP itself. Of course, when you run asymmetric traffic, BIG-IP can’t be stateful.
I recommend always starting with the Device DOS configuration to protect the BIG-IP itself against DoS floods, and then focusing on the DOS protection for the services behind the BIG-IP.
On Device DOS you will probably mostly use “Fully Manual” and “Fully Automatic” threshold modes.
To protect services, you will mostly go with “Fully Manual”, “Fully Automatic” or “Auto Detection / Multiplier Based Mitigation”.
In one of the next articles I will describe in more detail when to use what and why.
As mentioned before, in addition to the static DoS vectors I recommend enabling Behavioral DOS, which is much smarter than the static DOS vectors and can filter out specifically the attack traffic, which tremendously decreases the chance of false positives. You can enable it, like the static DoS vectors, on Device and on VS/PO level.
But this topic will be covered in another article as well.
With that said I would like to finish my first article. Let me know your feedback.
Thank you, sVen Mueller
Great detailed article explaining the nuances of DOS protection. I look forward to the next articles.
That's simply the most useful resource I have found so far for implementing anti-DDoS protection. This is a complex topic where you can easily be overwhelmed by the configuration aspects without knowing which strategy to follow. This article definitely helps. Please keep posting, I really look forward to more.
Hey MyGoul, great you like the article. Your feedback is really motivating. The next article will be online probably next week Thursday. Thanks, sVen
would you recommend running learn only for some period of time before moving into any sort of detection / mitigation phase?
Well, in production environments I prefer to run in mitigation mode immediately, at least on the Device level, just to make sure the device is protected.
Mitigation will only kick in when the BIG-IP is under too much stress, and then there is a good reason for it to kick in. Learn only is nice, but it will consider everything as legitimate traffic, which can affect the learning negatively (when an attack happens during the learning period).
From my point of view, going with mitigation or detection mode is a better approach. But keep in mind that you may need to adjust the floor value when a vector goes into detect mode just because there is more traffic than expected, which can easily happen within the first week.
I also plan to write an article about my best practices for the configuration and integration of BIG-IP. Then I will discuss it in more detail.
Thanks for your nice feedback!
I saw this KB, which states that with the new behaviour the DOS profile counts globally instead of per TMM.
Is that related to what you mention in this article about the Device DOS profile?
Thanks in advance.