on 08-Apr-2020 12:31
This article shows what effect the different threshold modes have on the Device and Per-Service (VS/PO) contexts. I will use practical examples to demonstrate those effects. You will also get to review a couple of scripts that help you run DoS flood tests and “visualize” the results on the CLI.
In my first article (https://devcentral.f5.com/s/articles/Concept-of-F5-Device-DoS-and-DoS-profiles), I talked about the concept of F5 Device DoS and Per-Service DoS protection (DoS profiles). I also covered the physical and logical data path, which explains the order of Device DoS and Per-Service DoS using the DoS profiles.
In the second article (https://devcentral.f5.com/s/articles/Explanation-of-F5-DDoS-threshold-modes), I explained how the different threshold modes work.
In this third article, I would like to show you what it means when the different modes work together.
But before I start running tests to show the behavior, I would like to give you a quick introduction to the toolset I'm using for these tests.
First of all, how do you generate floods? Various tools for this are available on the Internet. Whichever tool you prefer, just download it and run it against your Device Under Test (DUT).
If you would like to use my script you can get it from GitHub: https://github.com/sv3n-mu3ll3r/DDoS-Scripts
This script uses hping and lets you run different types of attacks.
Simply start it with: $ ./attack_menu.sh <IP> <PORT>
A menu of different attacks is presented, which you can launch against the IP and port you passed as parameters.
Figure 1: Attack Menu
To see what L3/4 DoS events are currently ongoing on your BIG-IP, simply go to the DoS Overview page.
Figure 2: DoS Overview page
I personally prefer to use the CLI to get the details I'm interested in. This way I don't need to switch between the CLI to launch my attacks and the GUI to see the results.
For that reason, I created a script which shows me what I am most interested in.
Figure 3: DoS stats via CLI
You can download that script here:
Simply run it with the “watch” command and the parameter “-c” to get a colored output (-c is only available starting with TMOS version 14.0):
What is this script showing you?
context_name: This is the context, either PO/VS or the Device in which the vector is running
vector_name: This is the name of the DoS vector
attack_detected: When it shows “1”, an attack has been detected, which means the ‘stats_rate’ is above the ‘detection-rate’.
stats_rate: Shows you the current incoming pps rate for that vector in this context
drops_rate: Shows you the rate (pps) of packets dropped in software (not FPGA) for that vector in this context
int_drops_rate: Shows you the rate (pps) of packets dropped in hardware (FPGA) for that vector in this context
ba_stats_rate: Shows you the pps rate for bad actors
ba_drops_rate: Shows you the pps rate of dropped ‘bad actors’ in HW and SW
bd_stats_rate: Shows you the pps rate for attacked destination
bd_drop_rate: Shows you the pps rate for dropped ‘attacked destination’
mitigation_curr: Shows the current mitigation rate (per tmm) for that vector in that context
detection: Shows you the current detection rate (per tmm) for that vector in that context
wl_count: Shows you the number of whitelist hits for that vector in that context
hw_offload: When it shows ‘1’ it means that FPGAs are involved in the mitigation
int_dropped_bytes_rate: Gives you the rate of bytes dropped in HW for that vector in that context
dropped_bytes_rate: Gives you the rate of bytes dropped in SW for that vector in that context
When a line is displayed in green, packets are hitting that vector, but no anomaly is detected and nothing is mitigated (dropped) via DoS.
If a line turns yellow, an attack (anomaly) has been detected, but no packets are dropped via DoS functionalities.
When the color turns red, the system is actually mitigating and dropping packets via DoS functionalities on that vector in that context.
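The color scheme above can be read as a small decision function. The following is only a sketch of that reading, not the logic of the actual script (which is not shown here); the function name row_color is hypothetical, and the arguments reuse the column names from the list above.

```python
def row_color(attack_detected: int, drops_rate: int, int_drops_rate: int) -> str:
    """Map one stats row to the color scheme described in the article.

    Assumption: 'red' is chosen whenever packets are dropped in software
    (drops_rate) or hardware (int_drops_rate); the real script may use
    different columns to decide this.
    """
    if drops_rate > 0 or int_drops_rate > 0:
        return "red"      # DoS is mitigating: packets are being dropped
    if attack_detected == 1:
        return "yellow"   # anomaly detected, but nothing is dropped
    return "green"        # packets hit the vector, no anomaly


print(row_color(0, 0, 0))          # green
print(row_color(1, 0, 0))          # yellow
print(row_color(1, 382, 437187))   # red
```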
Before we start doing some tests, let me provide you with a quick overview of my own lab setup.
I'm using a DHD (DDoS Hybrid Defender) running on an i5800 box with TMOS version 15.1.
My traffic generator sends around 5-6 Gbps legitimate (HTTP and DNS) traffic through the box which is connected in L2 mode (v-wire) to my network.
My attacking clients, driven by my DoS script, sit on the “client” side, where my clean traffic generator is located as well. On the “server” side, I run a web service and a DNS service, which I'm going to attack.
OK, now let’s run some tests to see the behavior of the box and double-check that we really understand the DDoS mitigation concept of BIG-IP.
Y-MAS flood against a protected server
Let’s start with a simple Y-MAS (all TCP flags cleared) flood. You can only configure this vector on the device context and only in manual mode. That is OK, because this TCP packet is not valid and would get dropped by the operating system (TMOS) anyway. But because I want this type of packet to be dropped in hardware (FPGA) very early, when it hits the box, mostly without touching the CPU, I set both the Mitigation Threshold EPS and the Detection Threshold EPS to ‘10’. That means that as soon as a TMM sees more than 10 pps for that vector, it gives me a log message and also rate-limits this type of packet to 10 packets per second per TMM. Everything below that threshold still reaches the operating system (TMOS) and gets dropped there.
Figure 4: Bad TCP Flags vector
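To make the effect of the two manual thresholds concrete, here is a toy per-TMM model in Python. It is a simplification for illustration only: the function name manual_threshold is hypothetical, and it assumes that the rate limit is applied exactly at the mitigation threshold.

```python
def manual_threshold(rate_pps: float, detection_eps: float, mitigation_eps: float):
    """Toy per-TMM model of a manually configured DoS vector.

    Returns a tuple (attack_detected, passed_pps, dropped_pps).
    Assumption: traffic is rate-limited exactly at mitigation_eps.
    """
    attack_detected = rate_pps > detection_eps   # triggers the log message
    passed = min(rate_pps, mitigation_eps)       # what still reaches TMOS
    dropped = rate_pps - passed                  # what the DoS vector drops
    return attack_detected, passed, dropped


# Detection and mitigation both set to 10, as in the test above.
print(manual_threshold(5, 10, 10))      # (False, 5, 0)
print(manual_threshold(1000, 10, 10))   # (True, 10, 990)
```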
I start the attack against the web service (10.103.1.50, port 80) behind the DHD, using randomized source IPs:
$ /usr/sbin/hping3 --ymas -p 80 10.103.1.50 --flood --rand-source
As soon as the attack starts, I get a log message in /var/log/ltm:
Feb 5 10:57:52 lon-i5800-1 err tmm3: 01010252:3: A Enforced Device DOS attack start was detected for vector Bad TCP flags (all cleared), Attack ID 546994598.
And, my script shows me the details on that attack in real time (the line is in ‘red’, indicating we are mitigating):
Currently 437569 pps are hitting the device. 382 pps are blocked by DDoS in SW (CPU) and 437187 are blocked in HW (FPGA).
Figure 5: Mitigating Y-Flood
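The numbers from this test add up as follows. The small Python sketch below only redoes the arithmetic with the values quoted above; the function name drop_summary and the dictionary keys are hypothetical.

```python
def drop_summary(stats_rate: int, drops_rate: int, int_drops_rate: int) -> dict:
    """Combine software and hardware drop rates for one vector."""
    total = drops_rate + int_drops_rate
    return {
        "total_dropped": total,
        "passed": stats_rate - total,
        # Share of all drops that were handled in hardware (FPGA).
        "hw_share": int_drops_rate / total if total else 0.0,
    }


# Values quoted in the article for the Y-MAS flood.
summary = drop_summary(stats_rate=437569, drops_rate=382, int_drops_rate=437187)
print(summary["total_dropped"], summary["passed"])   # 437569 0
print(f"{summary['hw_share']:.2%}")                  # 99.91%
```

In other words, at the moment of the screenshot practically all of the mitigation work was done in hardware.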
Great, that was easy. 🙂
Now, let’s do another TCP flood against my web server.
RST-flood against a protected server with Fully Manual Threshold mode:
For this test I'm using the “Fully Manual” mode, which configures the thresholds for the whole service we are protecting with the DoS profile attached to my VS/PO.
Figure 6: RST flood with manual configuration
My Detection Threshold EPS and my Mitigation Threshold EPS are both set to ‘100’. That means that as soon as we see more than 100 RST packets per second hitting the configured VS/PO on my BIG-IP for this web server, the system will start to rate-limit and send a log message.
Figure 7: Mitigating RST flood on PO level
Perfect. We see the vector in the context of my web server (/Common/PO_10.103.1.50_HTTP) is blocking (rate-limiting) as expected from the configuration. Please ignore the 'IP bad src' which is in detected mode. This is because 'hping' creates randomized IPs and not all of them are valid.
RST-flood against a protected server with Fully Automatic Threshold mode:
In this test I set the Threshold Mode for the RST vector on the DoS profile attached to my web server to ‘Fully Automatic’, which is what you would most likely do in the real world as well.
Figure 8: RST vector configuration with Fully Automatic
But, what does this mean for the test now?
I run the same flood against the same destination and my script shows me the anomaly on the VS/PO (and on the device context), but it does not mitigate! Why would this happen?
Figure 9: RST flood with Fully Automatic configuration
When we take a closer look at the screenshot, we see that the ‘stats_rate’ shows 730969 pps, while the detection rate shows 25. Where does this 25 come from? As we know, when ‘Fully Automatic’ is enabled, the system learns from history. In this case, the learned history was even lower than 25, but because I set the floor value to 100, the detection rate per TMM is 25 (floor_value / number of TMMs), which in my case is 100/4 = 25.
So, we need to recognize that the ‘stats_rate’ value represents all packets for that vector in that context, while the detection value is the per-TMM value.
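The relationship between the aggregated ‘stats_rate’ and the per-TMM detection value can be written down as a short calculation. This is a sketch under one assumption that is not stated in the article: that traffic is spread evenly across the TMMs. The function names are hypothetical.

```python
def per_tmm_detection(floor_eps: float, tmm_count: int) -> float:
    """Per-TMM detection rate derived from the floor value of a DoS profile."""
    return floor_eps / tmm_count


def per_tmm_rate(stats_rate: float, tmm_count: int) -> float:
    """Average per-TMM share of the aggregated stats_rate.

    Assumption: packets are distributed evenly across all TMMs.
    """
    return stats_rate / tmm_count


tmms = 4
print(per_tmm_detection(100, tmms))   # 25.0
print(per_tmm_rate(730969, tmms))     # 182742.25
```

Comparing like with like, each TMM sees roughly 182742 pps against a per-TMM detection rate of 25, so the attack is clearly detected.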
This explains why the system has detected an attack, but why is it not dropping via DoS functionalities? To understand this, we need to remember that mitigation in ‘Fully Automatic’ mode will ONLY kick in if the incoming rate for that vector is above the detection rate (this condition is true here) AND the stress on the service is too high. But because BIG-IP is configured as a stateful device, these randomized RST packets will never reach the web service; they are all dropped by the state engine of the BIG-IP at the latest. Therefore, the service will never experience stress caused by this flood. This is one of the nice benefits of having a stateful DoS device. So, the vector on the web server context will not mitigate here, because the web server will not be stressed by this type of TCP attack.
This also explains the Server Stress visualization in the GUI, which didn't change before and during the attack.
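The two conditions for ‘Fully Automatic’ mode can be summarized in a few lines of Python. This is a simplified sketch of the behavior described above, not the BIG-IP implementation; the function name and the boolean stress flag are hypothetical stand-ins for the internal stress measurement.

```python
def fully_automatic_state(rate_per_tmm: float, detection_per_tmm: float,
                          stressed: bool) -> str:
    """Simplified state of a vector in 'Fully Automatic' mode.

    'stressed' stands in for the stress signal of the protected service
    (VS/PO context) or of the BIG-IP CPU (device context).
    """
    if rate_per_tmm <= detection_per_tmm:
        return "normal"       # no anomaly
    if stressed:
        return "mitigating"   # above detection rate AND stress too high
    return "detected"         # anomaly is logged, but nothing is dropped


# RST flood against the web server: far above detection, but no server stress.
print(fully_automatic_state(182742.25, 25, stressed=False))   # detected
print(fully_automatic_state(182742.25, 25, stressed=True))    # mitigating
```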
Figure 10: DoS Overview in the GUI
But what happens if the attack gets stronger and stronger, or the BIG-IP is too busy dealing with all these RST packets?
This is when Device DoS kicks in, but only if you have configured it in ‘Fully Automatic’ mode as well. As soon as the BIG-IP receives more RST packets than usual (detection rate) AND the stress (CPU load) on the BIG-IP gets too high, it starts to mitigate on the device context.
This is what you can see here:
Figure 11: 'massive' RST flood with Fully Automatic configuration
The flood still goes against the same web server, but the mitigation is done on the device context, because the CPU utilization on the BIG-IP is too high.
In the screenshot below you can see that the mitigation value (mitigation_curr) is set to 5000 on the device context, which is the same as the detection value. This value results from the 'floor' value as well. It is the smallest possible value, because the detection and mitigation rates will never be below the 'floor' value. The mitigation rate is calculated dynamically based on the stress of the BIG-IP. In this test I artificially increased the stress on the BIG-IP, and therefore the mitigation rate was calculated to the lowest possible number, which is the same as the detection rate.
I will provide an explanation of how I did that later.
Figure 13: Device context configuration for RST Flood
Because this is the device config, the value you enter in the GUI is per TMM and this is reflected on the script output as well.
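The role of the 'floor' value can be illustrated with a simple clamp. How BIG-IP actually derives the dynamic mitigation rate from its stress level is not described here, so the candidate value below is just a placeholder input; both function names are hypothetical, and the TMM count of 4 is taken from the earlier 100/4 calculation.

```python
def effective_mitigation(candidate_eps: float, floor_eps: float) -> float:
    """Clamp a dynamically calculated per-TMM rate so it never drops below the floor."""
    return max(candidate_eps, floor_eps)


def aggregate_limit(per_tmm_eps: float, tmm_count: int) -> float:
    """Rough box-wide equivalent of a per-TMM value entered on the device context."""
    return per_tmm_eps * tmm_count


# Under high stress the calculated rate would go very low, but the floor wins.
print(effective_mitigation(1200, 5000))   # 5000
print(effective_mitigation(8000, 5000))   # 8000
print(aggregate_limit(5000, 4))           # 20000
```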
What does this mean for the false-positive rate?
First of all, all RST packets not belonging to an existing flow will be kicked out by the state engine. At this point we don't have any false positives. If the attack increases and the CPU can't handle the number of packets anymore, then the DoS protection on the device context kicks in. With the configuration we have done so far, it will rate-limit all RST packets hitting the BIG-IP. There is no longer any differentiation between good and bad RSTs, or whether the RST is destined for server 1 or server 2, and so on.
This means that at this point we can face false positives with this configuration.
Is this bad? Well, false positives are always bad, but at this point you can say it's better to have a few false positives than a service going down or, even more critical, your DoS device going down.
What can you do to only have false positives on the destination which is under attack?
You have probably noticed that you can also enable “Attacked Destination Detection” on the DoS vector, which makes sense on the device context and on a DoS profile that is used on a protected object (VS) covering a whole network.
Figure 14: Device context configuration for RST Flood with 'Attacked Destination Detection' enabled
If the attack now hits one or multiple IPs in that network, BIG-IP will identify them and will rate-limit only the destination(s) under attack. You could still face false positives, but at least they will be limited to the IPs under attack.
This is what we see here:
Figure 15: Device context mitigation on attacked destination(s)
The majority of the RST packet drops are done on the “bad destination” (bd_drops_rate), which is the IP under attack.
The other option you can set is “Bad Actor Detection”. When this is enabled, the system identifies the source(s) causing the load and rate-limits those IP address(es). This usually works very well for amplification attacks, where the attack packets come from ‘specific’ hosts and not from randomized sources.
Figure 16: Device context mitigation on attacked destination(s) and bad actor(s)
Here you can see the majority of the mitigation is done on ‘bad actors’. This reduces the chance of false positives massively.
Figure 17: Device context configuration for RST Flood with 'Attacked Destination Detection' and 'Bad Actor Detection' enabled
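Both detection options follow the same basic idea: group the packets of a vector by one key (destination or source) and single out the keys that stand out. The Python sketch below illustrates only this idea with a fixed threshold; it is not how BIG-IP implements it, and the function name, the packet dictionaries and the threshold are all hypothetical.

```python
from collections import Counter


def heavy_hitters(packets, key: str, threshold: int) -> dict:
    """Return the values of 'key' that appear in more than 'threshold' packets."""
    counts = Counter(packet[key] for packet in packets)
    return {value: n for value, n in counts.items() if n > threshold}


# Four RST packets against one web server, one against another host.
packets = [
    {"src": "198.51.100.7", "dst": "10.103.1.50"},
    {"src": "198.51.100.7", "dst": "10.103.1.50"},
    {"src": "198.51.100.7", "dst": "10.103.1.50"},
    {"src": "203.0.113.9", "dst": "10.103.1.50"},
    {"src": "203.0.113.9", "dst": "10.103.1.51"},
]

print(heavy_hitters(packets, "dst", 2))   # {'10.103.1.50': 4}  attacked destination
print(heavy_hitters(packets, "src", 2))   # {'198.51.100.7': 3}  bad actor
```

With randomized sources, as produced by --rand-source, no single source would stand out, which is why grouping by source works best when the attack comes from ‘specific’ hosts.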
You also have multiple additional options to deal with ‘attacked destination’ and ‘bad actors’, but this is something I will cover in another article.
Artificially increasing the BIG-IP's CPU load
Before I finish this article, I would like to give you an idea on how you could increase the CPU load of the BIG-IP for your own testing.
As we know, with “Fully Automatic” on the device context, the mitigation only kicks in if the packet rate is above the detection rate AND the CPU utilization on the BIG-IP is “too” high. This is sometimes difficult to achieve in a lab, because you may not be able to send enough packets to stress the CPUs of a hardware BIG-IP.
In this use-case I use a simple iRule, which I attach to the VS/PO that is under attack.
Figure 18: Stress iRule
When I start my attack, I send the traffic with a specific TTL for identification. This TTL is configured in my iRule so that a CPU-intensive string comparison runs on every attack packet.
Here is an example for 'hping':
$ /usr/sbin/hping3 --rst -p 80 10.103.1.50 --ttl 5 --flood --rand-source
This can easily drive the CPU load very high, so that the DDoS rate-limiting kicks in very aggressively.
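If you want to drive such floods from your own tooling instead of typing the commands by hand, the command line can be assembled programmatically. The helper below is a hypothetical sketch that only builds and prints the hping3 command; it uses just the flags that appear in this article and does not start an attack.

```python
import shlex


def hping3_flood(mode: str, target: str, port: int, ttl=None,
                 binary: str = "/usr/sbin/hping3") -> list:
    """Build the argument list for the hping3 floods used in this article.

    Only the two modes shown in the article are accepted.
    """
    if mode not in ("ymas", "rst"):
        raise ValueError(f"unsupported mode: {mode}")
    argv = [binary, f"--{mode}", "-p", str(port), target]
    if ttl is not None:
        argv += ["--ttl", str(ttl)]   # marker TTL that the stress iRule matches
    argv += ["--flood", "--rand-source"]
    return argv


print(shlex.join(hping3_flood("rst", "10.103.1.50", 80, ttl=5)))
# /usr/sbin/hping3 --rst -p 80 10.103.1.50 --ttl 5 --flood --rand-source
```

The resulting list could be handed to subprocess.run in a lab setup; only ever do this against systems you are authorized to test.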
I hope that this article gives you an even better understanding of the effect the different threshold modes have on attack mitigation.
Of course, keep in mind these are just the ‘static DoS’ vectors. In later articles I will also explain 'Behavioral DoS' and the cookie-based mitigation, which help to massively reduce the chance of false positives. But also keep in mind that the DoS vectors start to act within microseconds and are very effective for protection.
Thank you, sVen Mueller
Thanks so much for your posts and sharing your scripts.
I have a question regarding the scripts. I used them to check a PO that has manually configured thresholds, and the value shown in the detection column doesn't correspond to the mitigation value divided by the TMM count.
Specifically, it was tested on an i5800 running BIG-IP 126.96.36.199 version, DDoS 16.1.0-9.0.20 version, and the mitigation threshold value is 4000 eps, the detection that is shown running the script is 1736.
Has something changed in where the script takes this value from? Or is there another way to see the detection per TMM?
Right, the script only shows the auto-calculated values, not the manual thresholds.
Keep in mind that the detection and mitigation values shown are per TMM, while the stats_rate is an accumulated value.