Security
Concept of F5 Device DoS and DoS profiles
Dear Reader,

I am about to write a series of DDoS articles, which will hopefully help you better understand how F5 DDoS protection works and how to configure it. I plan to release them at a more or less regular frequency. Feel free to send me your feedback and let me know which topics are relevant to you and should be covered.

This first article explains BIG-IP Device DoS and Per-Service DoS protection (DoS profiles). It covers the concepts of both approaches and explains at a high level the threshold modes "Fully Manual", "Fully Automatic" and "Multiplier Based Mitigation", including the principles of stress measurement. In later articles I will describe these threshold modes in more detail. At the end of the article you will also find an explanation of the physical and logical data path of the BIG-IP.

Device DoS with static DoS vectors

The primary goal of Device DoS is to protect the BIG-IP itself. The BIG-IP has to deal with all packets arriving on the device, regardless of whether there is an existing listener (Virtual Server (VS) / Protected Object (PO)), connection entry, NAT entry or anything else. As soon as a packet hits the BIG-IP, its CPU basically has to deal with it. Picking up each and every packet consumes CPU cycles, especially under heavy DoS conditions. The more operations each packet requires, depending on the configuration, the higher the CPU load gets. The earlier the BIG-IP can drop a packet (if the packet needs to be dropped), the better it is for the CPU load. If the drop is done in hardware (FPGA), the packet is dropped before it consumes any CPU cycles.

Using static DoS vectors helps to lower the number of packets hitting the CPU while under DoS. If the number of packets rises above a certain threshold (usually when a flood happens), the BIG-IP rate-limits that specific vector, which keeps the CPU available because it sees fewer packets.

Figure 1: Principle of attack mitigation with static DoS vectors

The downside of this approach is that the BIG-IP cannot differentiate between "legitimate" and "attacking" packets when using a static DoS rate limit. The BIG-IP (FPGA) simply drops based on the predicates it gets from the static DoS vector. Predicates are attributes of a packet, such as the protocol being "TCP" or the "ACK" flag being set.

Keep in mind that when the BIG-IP runs stateful, "bad" packets get dropped in software anyway by the operating system (TMOS) once it identifies them as not belonging to an existing connection or as invalid/broken (for example, a bad checksum). This is of course different for SYN packets, because they create connection entries. This is where SYN cookies play an important role under DoS conditions; they will be explained in a later article.

I usually recommend running Behavioral DoS mitigation in conjunction with static DoS vectors. It is far more precise and able to filter out only the attack traffic, but I will also discuss this in more detail in one of the following articles.
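To make the static-vector behaviour concrete, here is a minimal sketch of the decision such a vector effectively enforces every second. It is plain shell arithmetic with made-up numbers; it is not product code and not actual BIG-IP syntax:

# Illustration only: what a static DoS vector with a fixed rate limit effectively does.
RATE_LIMIT=50000          # configured mitigation threshold for the vector (pps)
CURRENT_RATE=120000       # packets/sec currently arriving for this vector
if [ "$CURRENT_RATE" -gt "$RATE_LIMIT" ]; then
  DROPPED=$(( CURRENT_RATE - RATE_LIMIT ))
  # Note: good and bad packets are dropped alike - the vector only enforces the rate.
  echo "Vector over threshold: forwarding ${RATE_LIMIT} pps, dropping ${DROPPED} pps"
fi

The comment in the sketch is the key point: a static vector cannot tell legitimate packets from attack packets; it only enforces the configured rate.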
Manual thresholds vs. Fully Automatic

Before I start to explain Per-Service DoS protection (DoS profiles), I would like to give you a brief overview of the threshold modes you can use per DoS vector. There are different ways to set the thresholds that activate detection and rate limiting on a DoS vector.

The operator can do it manually by using the option "Fully Manual" and then fine-tuning the pre-configured values on the DoS vectors. This can be challenging, because besides doing everything by hand, it is usually difficult to know the right thresholds, especially since they typically depend on the time of day and the day of the week.

Figure 2: Example of a manual DoS vector configuration

That is why most vectors offer the option "Fully Automatic". It means the BIG-IP learns from history and "knows" how many packets it usually "sees" at that specific time for that specific vector. This is called baselining, and it calculates the detection threshold.

Figure 3: Threshold modes

As soon as a flood hits the BIG-IP and crosses the detection threshold for a vector, the BIG-IP detects it as an attack for that vector, which basically means it identifies it as an anomaly. However, it does not yet start to mitigate (drop). That only happens once the TMM (CPU) load is also above a certain utilization (mitigation sensitivity: Low: 78%, Medium: 68%, High: 51%). If both conditions are true (the packet rate and the TMM/CPU load are too high), mitigation starts and lowers (rate-limits) the number of packets for this vector going into the BIG-IP on the specific TMM. In other words, the DoS feature only drops packets when necessary, in order to protect the BIG-IP CPU. This is a dynamic process: it drops more packets when the attack becomes more aggressive and fewer when it becomes less aggressive. For TCP traffic, keep in mind that "invalid" traffic will not get forwarded anyway, which is a strong benefit of running the device stateful.

When a DoS vector is hardware supported, the FPGAs drop the packets basically at the switch level of the BIG-IP. If it is not hardware supported, the packet is dropped at a very early stage of its life cycle inside the BIG-IP.

Device DoS counts ALL packets going to the BIG-IP. It does not matter whether their destination is behind the BIG-IP or the BIG-IP itself. Because Device DoS protects the BIG-IP, the thresholds you can configure in "Fully Manual" mode on the DoS vectors are per TMM! This means you configure the maximum number of packets one TMM can handle. Operators often want to set an overall threshold of, say, 10k; in that case they need to divide this limit by the number of TMMs. An exception is the Sweep and Flood vector; here the threshold is per BIG-IP and not per TMM.

DoS profile for Protected Objects

Now let's talk about Per-Service DoS protection (DoS profiles). The goal is to protect the service that runs behind the BIG-IP. It is important to know that when the BIG-IP runs stateful, the service is already protected by the state handling of the BIG-IP (true for TCP traffic). That means a randomized ACK flood, for example, will never hit the service. Stateless traffic (UDP/ICMP) is different; here the BIG-IP simply forwards the packets.

"Fully Automatic" used on a DoS profile discovers the health of the service, which works well for TCP or DNS traffic. (This is done by evaluating TCP behavior such as window size, retransmissions and congestion, or by counting DNS requests vs. responses, and so on.) If the detection rate for that service is crossed (anomaly detected) and stress on the service is identified, mitigation starts. Again, keep in mind that for TCP this will only affect "legitimate" traffic, because out-of-state traffic will never reach the service; it is already protected by the BIG-IP being configured stateful.
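If you want to see that last point for yourself in a lab, a randomized ACK flood can be generated with hping3, the same tool used for the flood tests later in this series. This is a hedged example; the target IP and port are placeholders, and such floods should of course only be sent against your own test equipment:

$ /usr/sbin/hping3 --ack -p 80 <VS-IP> --flood --rand-source

Against a stateful BIG-IP, these packets never reach the protected service; they are dropped as out-of-state traffic (or, once thresholds and CPU load are exceeded, by the DoS vectors themselves).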
An example could be an HTTP flood. My recommendation: if you have the option to configure L7 BaDoS (Behavioral DoS running on DHD/Advanced WAF), then go with that one. It gives much better results and better mitigation options for HTTP floods than L3/4 DoS alone, because it also operates on L7. In that case, please do not configure the TCP DoS vectors "Push Flood", "Bad ACK Flood" and "SYN Flood" on that DoS profile, unless you use a TMOS version newer than 14.1. For other TCP services like POP3, SMTP, LDAP and so on you can always go with the L3/4 DoS vectors. Since TMOS version 15.0, L7 BaDoS and the L3/4 DoS vectors work in conjunction.

Stateless traffic (UDP/ICMP)

For UDP traffic I would like to separate UDP into DNS traffic and non-DNS UDP traffic. DNS is a very good example where the health detection mechanism works well. This is done by measuring the ratio of requests vs. responses. For example, if the BIG-IP sees 100k queries/sec going to the DNS server and the server sends back 100k answers/sec, the server shows it can handle the load. If it sees 200k queries/sec going to the server and the server sends back only 150k answers/sec, that is a good indication that the server cannot handle the load. In that case the BIG-IP would start to rate-limit, provided the current rate is also above the detection rate (the rate the BIG-IP expects based on the historic rates).

The "Device UDP" vector gives you the option to exclude UDP traffic on specific ports from the UDP vector (packet counting). For example, when you exclude port 53 and UDP traffic hits that port, it will not count into the UDP counter. In that case you would handle the DNS traffic with the DNS vectors.

Figure 4: UDP port exclusion. Here you see an example for ports 53 (DNS), 5060 (SIP) and 123 (NTP).

Auto Detection / Multiplier Based Mitigation

If the traffic is non-DNS UDP traffic or ICMP traffic, the stress measurement does not work very accurately, so I recommend going with the "multiplier option" on the DoS profile. Here the BIG-IP does the baselining (calculating the detection rate) similarly to "Fully Automatic" mode, except that it kicks in (rate-limits) when it sees more than the defined multiple of the detection rate for the specific vector, regardless of the CPU load. For example, when the calculated detection rate is 250k packets/sec and the multiplier is set to 500 (which means 5 times), the mitigation rate would be 250k x 5 = 1,250,000 packets/sec. The multiplier feature gives you the option to configure a threshold based on a multiple of the expected rate.

Figure 5: Multiplier based mitigation

On a DoS profile, the threshold configuration is per service (all packets targeted at the protected service), which means per BIG-IP and NOT per TMM like on the Device level. Here the goal is to set how many packets are allowed to pass the BIG-IP and reach the service. The distribution of these thresholds to the TMMs is done dynamically: every TMM gets a percentage of the configured threshold, based on the EPS (Events Per Second, which in this context means packets per second) the system saw for the specific vector on this TMM in the previous second. This mechanism protects against hash-type attacks.
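Going back to the DNS example above, the request/response health idea can be written out in a few lines of shell arithmetic. This is illustrative only; it is not an actual BIG-IP command, and the counters are made up:

# Illustration of the DNS request/response ratio used as a health indicator.
QUERIES_PER_SEC=200000      # queries the BIG-IP forwards to the DNS server
ANSWERS_PER_SEC=150000      # answers the server sends back
RATIO=$(( ANSWERS_PER_SEC * 100 / QUERIES_PER_SEC ))
if [ "$RATIO" -lt 100 ]; then
  echo "Server answers only ${RATIO}% of queries - a sign it cannot handle the load"
fi

In the product, this stress indication is only acted on when the current query rate is also above the learned detection rate.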
Physical and logical data path

There is a physical and a logical traffic path in the BIG-IP. When a packet gets dropped via a DoS profile on a VS/PO, the BIG-IP will no longer count it on the device-level counter. If the threshold on the device level is reached, the packet gets dropped at the device level and never reaches the DoS profile.

So, the physical path is device first and then the VS/PO, but the logical path is VS/PO first and then the device. That means when a packet arrives on the device level (physical first), that counter gets incremented. If the device threshold is not reached, the packet goes to the VS/PO (DoS profile) level. If the threshold is reached there, the packet gets dropped and the counter for the device DoS vector is decremented. If the packet does not get dropped, it counts into both counters (Device and VS). This means the VS/PO thresholds should always be set lower than the device thresholds (remember, on the Device level you set the thresholds per TMM!). The device threshold should be the maximum the BIG-IP (TMM) can handle overall.

It is important to note that mitigation thresholds in manual mode are "hard" numbers and do not take the stress on the service into account. In "Fully Automatic" mode on the VS/PO level, mitigation kicks in when the detection threshold is reached AND the stress on the service is too high.

Now let's assume a protected TCP service behind the BIG-IP gets attacked by a TCP attack like a randomized PUSH/ACK flood. Because of the high packet rate hitting that vector on the DoS profile attached to that PO, the vector goes into detect mode. But because it is a TCP flood and the BIG-IP is configured to run stateful, the attack packets will never reach the service behind the BIG-IP, and therefore the stress on that service will never go up in this case. The flood is then handled by the session-state handling (CPU) of the BIG-IP until this comes under too much pressure, at which point the Device DoS kicks in "upfront" and mitigates the flood. If you use the "multiplier" option, the mitigation kicks in when the detection rate times the configured multiplier is reached. This can happen on the VS/PO level and/or on the Device level, because it is independent of stress.

"Fully Automatic" will not work properly with asymmetric traffic on a VS/PO, because the BIG-IP cannot identify whether the service is under stress when it only sees half of the traffic - either the requests or the responses, depending on where the traffic is initiated. For deployments with asymmetric traffic I recommend using the manual or "multiplier" option on VS/PO configurations. The "multiplier" option keeps things dynamic with regard to the calculated detection and mitigation rates based on history. "Fully Automatic" does work with asymmetric traffic on Device DoS, because there it measures the CPU (TMM) stress of the BIG-IP itself. Of course, when you run asymmetric traffic, the BIG-IP cannot be stateful.

I recommend always starting with the Device DoS configuration to protect the BIG-IP itself against DoS floods, and then focusing on the DoS protection for the services behind the BIG-IP. On Device DoS you will mostly use the "Fully Manual" and "Fully Automatic" threshold modes. To protect services, you will mostly go with "Fully Manual", "Fully Automatic" or "Auto Detection / Multiplier Based Mitigation". In one of the next articles I will describe in more detail when to use what, and why.

As mentioned before, in addition to the static DoS vectors I recommend enabling Behavioral DoS, which is much smarter than the static DoS vectors and can filter out specifically the attack traffic, which tremendously decreases the chance of false positives. You can enable it, like the static DoS vectors, on the Device and on the VS/PO level. This topic will be covered in another article as well.
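To make the relationship between Device and per-service thresholds more tangible, here is a small sizing sketch. The numbers are made up and the shell arithmetic is purely illustrative; it is not BIG-IP syntax:

# Device DoS thresholds are configured per TMM, so an overall budget is divided by the TMM count.
OVERALL_DEVICE_BUDGET=320000   # pps the whole box should accept for this vector
TMM_COUNT=8                    # check your platform, e.g. with 'tmsh show sys tmm-info'
PER_TMM_THRESHOLD=$(( OVERALL_DEVICE_BUDGET / TMM_COUNT ))
echo "Configure the Device DoS vector at ${PER_TMM_THRESHOLD} pps (per TMM)"

# A DoS profile threshold is per service, so keep it below the overall device budget.
SERVICE_THRESHOLD=100000
[ "$SERVICE_THRESHOLD" -lt "$OVERALL_DEVICE_BUDGET" ] && echo "OK: service threshold is below the device budget"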
With that said, I would like to finish my first article. Let me know your feedback.

Thank you, sVen Mueller

Explanation of F5 DDoS threshold modes
Dear Reader,

In my article "Concept of Device DoS and DoS profiles", I recommended using the "Fully Automatic" or "Multiplier" based configuration option for some DoS vectors. In this article I would like to explain how these threshold modes work and what happens behind the scenes.

When you configure a DoS vector you have the option to choose between different threshold modes: "Fully Automatic", "Auto Detection / Multiplier Based", "Manual Detection / Auto Mitigation" and "Fully Manual".

Figure 1: Threshold modes

The two options I normally use on many vectors are "Fully Automatic" and "Auto Detection / Multiplier Based". But what do these two options do for me? Manually setting thresholds is not an easy task for some vectors. Who really knows how many PUSH/ACK packets/sec, for example, usually hit the device or a specific service? And when I do have an idea of a value, should it be a static value? Or should I rather take the maximum value I have seen so far? How many packets per second should I add on top to make sure the system does not kick in too early? When should I adjust it? Is my traffic growing?

Fully Automatic

In reality, the rate changes constantly, and during the day I will most likely see more PUSH/ACK packets/sec than during the night. What happens when there is a campaign or an event like "Black Friday" and far more users visit the webpage than usual? During these high-traffic events my suggested thresholds might no longer be correct, which could lead to "good" traffic getting dropped. All of this has to be taken into consideration when setting a threshold, and it ends up being very difficult to do manually. It is better to let the machine do it for you, and this is what "Fully Automatic" is about.

Figure 2: Expected EPS

As soon as you use this option, it leverages the learning the system has done since traffic started passing through the BIG-IP, or since you last enforced relearning, which resets everything learned so far and starts fresh. The system continuously calculates the expected rates for all vectors based on the historic traffic rates. It takes information from up to one year and weights it differently in order to know which packet rate to expect at that time and day for that specific vector in the specific context (Device, Virtual Server/Protected Object). The system then calculates a padding on top of this expected rate. The result is called the detection rate and depends on the "threshold sensitivity" you have configured:

Low sensitivity means 66% padding
Medium sensitivity means 40% padding
High sensitivity means 0% padding

Figure 3: Detection EPS

As soon as the current rate is above the detection value, the BIG-IP shows the message "Attack detected", which actually means "anomaly detected", because it sees more packets for that specific vector than expected plus the padding (the detection rate). But DoS mitigation does not start at that point!

Figure 4: Current EPS

Keep in mind that when you run the BIG-IP in stateful mode it will drop out-of-state packets anyway; this has nothing to do with the DoS functionality. But what happens when there is a serious flood and the BIG-IP CPU load rises because of the massive number of packets it has to deal with? This is when the second part of the "Fully Automatic" approach comes into play. Depending on your threshold sensitivity, the DoS mitigation starts as soon as a certain level of stress is detected on the CPU of the BIG-IP.
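Before looking at the mitigation side, the detection rate itself can be written out as the expected (learned) rate plus the sensitivity padding described above. The following is only an illustration with made-up numbers, not BIG-IP syntax:

# Detection rate = expected rate + sensitivity padding (Low=66%, Medium=40%, High=0%).
EXPECTED_RATE=100000     # learned/expected pps for the vector at this time of day
PADDING_PCT=40           # Medium sensitivity
DETECTION_RATE=$(( EXPECTED_RATE + EXPECTED_RATE * PADDING_PCT / 100 ))
echo "Attack (anomaly) is declared above ${DETECTION_RATE} pps"   # 140000 pps in this example

Crossing this rate only produces the "attack detected" state; packets are not dropped until the stress condition described next is also met.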
Figure 5: Mitigation threshold

Low sensitivity means 78.3% TMM load
Medium sensitivity means 68.3% TMM load
High sensitivity means 51.6% TMM load

Note that the mitigation is per TMM, and therefore the stress and rate per TMM are what matter. When the traffic rate for a vector is above the detection rate and the CPU of the BIG-IP (Device DoS) is "too" busy, the mitigation kicks in and rate-limits that specific vector. When a DoS vector is hardware supported, FPGAs drop the packets at the switch level of the BIG-IP. If the DoS vector is not hardware supported, the packet is dropped at a very early stage of its life cycle inside the BIG-IP. The rate at which packets are dropped (the mitigation rate) is dynamic, depending on the incoming number of packets for that vector and the CPU (TMM) stress of the BIG-IP. This allows the stress on the CPU to go down, as it has to deal with fewer packets. Once the incoming rate is again below the detection rate, the system declares the attack as ended.

Note: When an attack is detected, the packet rate during that time does not go into the calculation of expected rates for the future. This ensures that the BIG-IP does not learn from attack traffic and skew the automatic thresholds. All traffic rates below the detection rate (or below the floor value, when configured) modify the expected rate for the future, and the BIG-IP adjusts the detection rate automatically.

For most of the vectors you can configure a floor and a ceiling value. Floor means that as long as the traffic rate is below that threshold, the mitigation for that vector will never kick in, even when the CPU is at 100%. Ceiling means that mitigation always kicks in at that rate, even when the CPU is idle. With these two values, the dynamic and automatic process happens between floor and ceiling. Mitigation only gets executed when the rate is above both the Floor EPS and the Detection EPS AND stress is measured on the particular context.

Figure 6: Floor and Ceiling EPS

What is the difference when you use "Fully Automatic" on the Device level compared to the VS/PO (DoS profile) level? Everything is the same, except that on the VS or Protected Object (PO) level the relevant stress is NOT the BIG-IP device stress; it is the stress of the service you are protecting (web server, DNS server, IMAP server, network, ...). The BIG-IP can measure the stress of the service by evaluating TCP attributes like retransmissions, window size, congestion, etc. This gives a good indication of how busy a service is. It works very well for request/response protocols like TCP, HTTP and DNS. I recommend using this when the Protected Object is a single service and not a "wildcard" Protected Object covering, for example, a network or a service range.

When the Protected Object is a "wildcard" service and/or a UDP service (except DNS), I recommend using the "Auto Detection / Multiplier" option. It works the same way as "Fully Automatic" from the learning perspective, but the mitigation condition is not stress; it is a multiple of the detection rate. For example, say the detection rate for a specific vector is calculated to be 100k packets/sec. By default, the multiplier value is "500", which means 5x. Therefore, the mitigation rate is calculated to be 500k packets/sec. If that particular vector sees more than 500k packets/sec, the excess packets are dropped. The multiplier value can also be configured individually, as in the screenshot, where it is set to 8x (800).
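Written out, the mitigation rate is simply the calculated detection rate times the configured multiplier divided by 100 (so 500 means 5x and 800 means 8x). A small, purely illustrative calculation:

# Multiplier based mitigation: mitigation rate = detection rate * (multiplier / 100).
DETECTION_RATE=100000    # learned detection rate for the vector (pps)
MULTIPLIER=500           # default value, i.e. 5x
MITIGATION_RATE=$(( DETECTION_RATE * MULTIPLIER / 100 ))
echo "Rate-limit this vector at ${MITIGATION_RATE} pps"   # 500000 pps in this example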
Figure 7: Auto Detection / Multiplier Based Mitigation

The benefit of this mode is that the BIG-IP automatically learns the baseline for that vector and only starts to mitigate on a large spike. The mitigation rate is always a multiple of the detection rate, which is 5x by default but configurable.

When should I use "Fully Manual"? When you want to rate-limit a specific vector to a certain number of packets/sec, "Fully Manual" is the right choice. Very good examples of that type of vector are the "Bad Header" vector types. These packets will never be forwarded by the BIG-IP, so dropping them with a DoS vector saves CPU, which is beneficial under DoS conditions. In the screenshot below a vector is configured as "Fully Manual". Next I'll describe what each of the options means.

Figure 8: Fully Manual

Detection Threshold EPS configures the packet rate (pps) at which you get a log message (NO mitigation!).
Detection Threshold % compares the current pps rate for that vector with the configured percentage (in this example 5 for 500%) multiplied by the 1-minute average rate. If the current rate is higher, you get a log message.
Mitigation Threshold EPS rate-limits to the configured value (mitigation).

I recommend setting the Mitigation Threshold EPS to something relatively low, like '10' or '100', on 'Bad Header' types of vectors. You can also set it to '0', which means all packets hitting this vector are dropped by the DoS function, which is usually done in hardware (FPGA). With the 'Detection Threshold EPS' you set the rate at which you want to get a log message for that vector. If you do it this way, you get a warning like this one to inform you about the logging behavior:

Warning generated: DOS attack data (bad-tcp-flags-all-set): Since drop limit is less than detection limit, packets dropped below the detection limit rate will not be logged.

Another use case for "Fully Manual" is when you know the maximum number of these packets the service can handle. But here my recommendation is to still use "Fully Automatic" and set the maximum rate with the ceiling threshold, because then the protected service benefits from both threshold options.

Important: Please keep in mind that when you set manual thresholds for Device DoS, the thresholds are there to protect each TMM; the value you set is per TMM! An exception is the Sweep and Flood vector, where the threshold is per BIG-IP/service and not per TMM, as on DoS profiles. When using manual thresholds on a DoS profile of a Protected Object, the threshold configuration is per service (all packets targeted at the protected service) and NOT per TMM like on the Device level. Here the goal is to set how many packets are allowed to pass the BIG-IP and reach the service. The distribution of these thresholds to the TMMs is done dynamically: every TMM gets a percentage of the configured threshold, based on the EPS (Events Per Second, which in this context means packets per second) the system saw for the specific vector on this TMM in the previous second. This mechanism protects against hash-type attacks.
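Going back to the "Detection Threshold %" option described above, its trigger condition can be written out like this. The numbers are made up and the shell arithmetic only illustrates the logic:

# Detection Threshold %: alert when the current rate exceeds (percentage x 1-minute average).
AVG_1MIN=20000            # 1-minute average rate for the vector (pps)
DETECTION_PCT=500         # configured percentage, i.e. 5x the average
CURRENT=120000            # current rate (pps)
TRIGGER=$(( AVG_1MIN * DETECTION_PCT / 100 ))
if [ "$CURRENT" -gt "$TRIGGER" ]; then
  echo "Current rate ${CURRENT} pps exceeds ${TRIGGER} pps - log message only, no mitigation"
fi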
OK, I hope this article gives you a better understanding of how "Fully Manual", "Fully Automatic" and "Auto Detection / Multiplier Based Mitigation" work. They are important concepts to understand, especially when they work in conjunction with stress measurement. This means the BIG-IP only kicks in with DoS mitigation when the protected object (the BIG-IP itself or the service behind it) is under stress. Why risk false positives when it is not necessary?

In my next article I will demonstrate how Device DoS and the DoS profiles work together and how the stateful concept cooperates with the DoS mitigation. I will show you some DoS commands to test it, as well as commands to get details from the BIG-IP via the CLI.

Thank you, sVen Mueller

F5 Distributed Cloud - Regional Decryption with Virtual Sites
In this article we discuss how F5 Distributed Cloud can be configured to support regulatory demands for TLS termination of traffic in specific regions around the world. The article provides insight into the F5 Distributed Cloud global backbone and application delivery network (ADN). It then examines how F5 Distributed Cloud achieves these custom topologies in a multi-tenant architecture while adhering to the "rules of the internet" for route summarization. Read on to learn about the flexibility of F5's SaaS platform, providing application delivery and security solutions for your applications.
Why We CVE Background First, for those who may not already know, I should probably explain what those three letters, CVE, mean. Sure, they stand for “Common Vulnerabilities and Exposures”, but that does that mean? What is the purpose? Borrowed right from the CVE.org website: The mission of the CVE® Program is to identify, define, and catalog publicly disclosed cybersecurity vulnerabilities. There is one CVE Record for each vulnerability in the catalog. The vulnerabilities are discovered then assigned and published by organizations from around the world that have partnered with the CVE Program. Partners publish CVE Records to communicate consistent descriptions of vulnerabilities. Information technology and cybersecurity professionals use CVE Records to ensure they are discussing the same issue, and to coordinate their efforts to prioritize and address the vulnerabilities. To state it simply, the purpose of a CVE record is to provide a unique identifier for a specific issue. I’m sure many of those reading this have dealt with questions such as “Does that new vulnerability announced today affect us?” or “Do we need to worry about that TCP vulnerability?” To which the reaction is likely exasperation and a question along the lines of “Which one? Can you provide any more detail?” We certainly see a fair number of support tickets along these lines ourselves. It gets worse when trying to discuss something that isn’t brand new, but years old. Sure, you might be able to say “Heartbleed”, and many will know what you’re talking about. (And do you believe that was April 2014? That’s forever ago in infosec years.) But what about the thousands of vulnerabilities announced each year that don’t get cute names and make headlines? Remember that Samba vulnerability? You know, the one from around the same time? No, the other one, improper initialization or something? Fun, right? It is much easier to say CVE-2014-0178 and everyone knows exactly what is being discussed, or at least can immediately look it up. Heartbleed, BTW, was CVE-2014-0160. If you have the CVE ID you can look it up at CVE.org, NVD (National Vulnerability Database), and many other resources. All parties can immediately be on the same page with the same basic understanding of the fundamentals. That is, simply, the power of CVE. It saves immeasurable time and confusion. I’m not going to go into detail on how the CVE program works, that’s not the intent of this article – though perhaps I could do that in the future if there is interest. Leave a comment below. Like and subscribe. Hit the bell icon… Sorry, too much YouTube. All that’s important is to note that F5 is a CNA, or CVE Numbering Authority: An organization responsible for the regular assignment of CVE IDs to vulnerabilities, and for creating and publishing information about the Vulnerability in the associated CVE Record. Each CNA has a specific Scope of responsibility for vulnerability identification and publishing. Each CNA has a ‘Scope’ statement, which defines what the CNA is responsible for within the CVE program. This is F5’s statement: All F5 products and services, commercial and open source, which have not yet reached End of Technical Support (EoTS). All legacy acquisition products and brands including, but not limited to, NGINX, Shape Security, Volterra, and Threat Stack. F5 does not issue CVEs for products which are no longer supported. And F5’s disclosure policy is defined by K4602: Overview of the F5 security vulnerability response policy. 
F5, CVEs, and Disclosures

While CVEs have been published sporadically for F5 products since at least 2002 (CVE-1999-1550 - yes, a 1999 ID, but it was published in 2002; that's another topic), things really changed in 2016 after the creation of the F5 SIRT in late 2015. One of the first things the F5 SIRT did was to officially join the CVE program, making F5 a CNA, and to formalize F5's approach to first-party security disclosures, including CVE assignment. This was all in place by late 2016, and the F5 SIRT began coordinating F5's disclosures. I've been involved with that since very early on and have been F5's primary point of contact with the CVE program and related working groups (I participate in the AWG, QWG, and CNACWG) for a number of years now. Over time I became F5's 'vulnerability person' and have been involved in pretty much every disclosure F5 has made for years. It's my full-time role.

The question has been asked: why? Why disclose at all? Why air 'dirty laundry'? There is, I think, a natural reluctance to announce to the world when you make a mistake. You'd rather just quietly correct it and hope no one noticed, right? I'm sure we've all done that at some point in our lives. No harm, no foul. Except that doesn't work with security. I've made the argument about 'doing the right thing' for our customers in various ways over the years, but eventually it distilled down to what has become something of a personal catchphrase:

Our customers can't make informed decisions about their networks if we don't inform them.

Networks have dozens, hundreds, thousands of devices from many different vendors. It is easy to say, "Well, if everyone keeps up with the latest versions, they'll always have the latest fixes." But that's trite, dismissive, and wholly unrealistic, in my not-so-humble opinion. Resources are finite and prioritizations must be made. Do I need to install this new update, or can I wait for the next one? If I need to install it, does it have to happen today, or can it wait for the next scheduled maintenance?

We cannot, and should not, be making decisions for our customers and their networks. Customers and networks are unique, and all have different needs, attack surfaces, risk tolerance, regulatory requirements, and so on. So F5's role is to provide the information necessary for them to conduct their own analysis and make their own decisions about the actions they need to take, or not. We must support our customers, and that means admitting when we make mistakes and have security issues that impact them. This is something I believe in strongly, even passionately, and it is what guides us. Our guiding philosophy since day one, as the F5 SIRT, has been to 'do the right thing for our customers', even if that may not show F5 in the best light or may sometimes make us unpopular with others. We're there to advocate for improved security in our products, and for our customers, above all else. We never want to downplay anything, and our approach has always been to err on the side of caution. If an issue could theoretically be exploited, then it is considered a vulnerability. We don't want to cause undue alarm, or Fear, Uncertainty, and Doubt (FUD), for anyone, but in security a false negative is worse than a false positive. It is better to take action to address an issue that may not truly be a problem than to ignore an issue that is. All vendors have vulnerabilities; that's inevitable with any complex product and codebase.
Some vendors seem to never disclose any vulnerabilities, and I'm highly skeptical when I see that. I don't care for the secretive approach, personally. Some vendors may disclose issues but choose not to participate in the CVE program. I think that's unfortunate. While I'm all for disclosure, I hope those vendors come to see the value in the CVE program not only for their customers, but for themselves. It brings with it a structure and rigor that may not otherwise be present in their processes, not to mention all of the tooling designed to work with CVEs. I've been heartened to see the rapid growth in the CVE program over the past few years, and especially the past year. There has been a steady influx of new CNAs to the program. The original structure of the program was fairly 'vendor-centric', but it has been updated to welcome open-source projects, and there has been increasing participation from the FOSS community as well.

The Future

In 2022 F5 introduced a new way of handling disclosures, our Quarterly Security Notification (QSN) process, after an initial trial in late 2021. While not universal, the response has been overwhelmingly positive. You may not be able to please all the people all the time, but it seems you can please a lot of them. The QSN was primarily designed to make disclosures more predictable and less disruptive to our customers. Consolidating disclosures and decoupling them from individual software releases has allowed us to radically change our processes, introducing additional levels of review and rigor.

At the same time, independent of the QSN process, the F5 SIRT had also begun work on standardized language templates for our Security Advisories. As you might expect, there are teams of people who work on issues: engineers who do the technical evaluation, content creators, technical writers, and others. With different people working on different issues, it was only natural that they'd find different ways to say the same thing. We might disclose similar DoS issues at the same time, only to have the language in each Security Advisory (SA) differ. This could create confusion, especially as people can sometimes read a little too much into things: "These are different, there must be some significance in that." No, they're different because different people wrote them, is all. Still, confusion or uncertainty is not what you want in security documentation. We worked to create standardized templates so that similar issues will have similar language, no matter who works on the issue. I believe these efforts have resulted in a higher quality of Security Advisory, and the feedback we've received seems to support that. I hope you agree.

These efforts are ongoing. The templates are not carved in stone but are living documents. We listen to feedback and update the templates as needed. When we encounter an issue that doesn't fit an existing template, a new template is created. Over time we've introduced new features to the advisories, such as the Common Vulnerability Scoring System (CVSS) and, more recently, Common Weakness Enumeration (CWE). We continue to evaluate feedback, requests and industry trends for incorporation into future disclosures. We're currently working on internal tooling to automate some of our processes, which should improve consistency and repeatability while allowing us to expand the work we do. Frankly, I only scale so far, and the cloning vats just didn't work out. Having more tooling will allow us to do more with our resources.
Part of the plan is that the tooling will allow us to provide disclosures in multiple formats, but I don't want to say anything more about that just yet, as much work remains to be done.

So why do we CVE? For you: our customers, security professionals, and the industry in general. We assign CVEs and disclose issues not only for the benefit of our customers, but to lead by example. The more vendors who embrace openness and disclose CVEs, the more the practice is normalized, and the better the security community is for it. There isn't really any joy in being the bearer of bad news, other than the hope that it creates a better future.

Postscript

If you're still reading this, thank you for sticking with me. Vulnerability management and disclosure is certainly not the sexy side of information security, but it is a critical component. If there is interest, I'd be happy to explore different aspects further, so let us know. Perhaps I can peel back the curtain a bit more in another article and provide a look at the vulnerability management processes we use internally: how the sausage, or security advisory, is made, as it were. That could be especially useful for others looking to build their own practice. But I like my job, so I'll have to get permission before I start disclosing internal information. We welcome all feedback, positive or negative, as it helps us do a better job for you. Thank you.

A Day in the Life of a Security Engineer from Tel Aviv
October 2022 is Cybersecurity Awareness Month, so we decided to focus on the human aspect of the F5 SIRT team and share some of our day-to-day work. When I started writing this, I thought it would be trivial to capture what I do on an average day and write about it. It turned out to be a challenging task, simply because we do so much. We interact with many groups and there is always a new top priority, so bouncing back and forth between tasks is the only way to execute when you are deeply involved with security in the organization. There really is no average day, as the next security emergency is right around the corner.

First, a little background on me: I started working at F5 in 2006 as a New Products Introduction (NPI) engineer, representing the customer throughout the product life cycle. The job included attending design meetings on new features and their real-world implementation with Product Development (PD) and Product Management (PM). The deliverables were technical presentations, both online and in person, at internal F5 conferences. The feedback I got from the various departments was consolidated into a list of improvements for PD and PM, acting as a feedback loop for new features. The product I represented as subject matter expert was BIG-IP Application Security Manager (ASM), which evolved into BIG-IP Advanced WAF; it is my specialization and my favorite technical topic to this day. Then, at the end of 2016, I moved to the F5 SIRT team. The shift was beneficial, as it started a new chapter in becoming a full-time security engineer. Let me describe to you what that looks like.

Morning: coffee & emergency catch-up first

It is a 12-minute drive from my apartment to the office in Tel Aviv. Living close to the office is great, and you can see the sea from the 30th floor. Yes, I know I should/could bike, and I will do more biking now that the summer heat is going away. The first cortado is for reading the emails that piled up overnight. We are a "follow the sun" coverage group, so here is a quick time orientation: when I arrive at the office at 9 AM, it is lunch time for the Singapore guys at 13:00 (5 hours ahead) and the US guys are getting ready to sleep at 11 PM Seattle time (10 hours behind). This means I usually have a long list of emails and messages to read. I catch up on all the ongoing emergency cases that reach my time zone for monitoring or follow-up actions.

F5 SIRT is a unique group of top engineers with many years of experience with F5 security products and security in general. We are responsible for three main pillars, and the first one is assisting F5 customers when they are under attack. Since we are an emergency team, we are ready to act from the minute we come to work, and we like the excitement of solving emergency cases. The F5 security team is ready to help customers when they need us the most: when they are under attack. This is what I call the money time, as this is why people buy security products - to mitigate attacks using the F5 products. This moment has arrived. The first line of defense is the F5 SIRT specialist group, which handles the request from the customer and marks it as an emergency. If they need assistance from a security engineer, they ping us. Working in collaboration with the SIRT specialist group always feels good. It is great to have someone to trust, especially working with the EMEA F5 SIRT specialists, who always set a high standard.
When I'm needed to help a customer under attack, verbal communication is always more effective and faster than written communication. This call ensures the technical issues, the risks, and the benefits involved in mitigation are understood, so that the customer can choose the best path forward. Common actions include understanding the customer environment, the attack indications they have seen, and the severity of the incident. Once we collect the information from the customer, we create a plan that lists all the possibilities to mitigate the attack. Sometimes we simply give good advice on possible mitigations and how to proceed, but sometimes we need a full war room, where we do deep traffic analysis and provide the specific mitigation that kills the attack. We have seen many attacks, and each attack is different, but essentially we classify them into these main categories:

Graph: Distribution of the attack types over time.

Working in the SIRT requires an understanding of the different environments, the attack landscape and, above all, a deep understanding of F5 security products. These are your best friends in killing the attack. Finding the best mitigation strategies with our products, leading to successful prevention, is what we do best. It is a very good feeling to lead the way to an incident win. At the end of each incident, we create a report with recommendations for the customer, as well as an internal analysis, because if it's not documented, it doesn't exist. We have a high success rate in mitigating attacks, mostly because the F5 product suite is one of the best in the industry for mitigating network and web application attacks. We usually get a lot of warm words from customers. And with that, it's now lunch time. Time flies when you are busy.

Noon: lunch and CVEs

Deciding what to eat should be evaluated carefully: too heavy and you fall asleep, too light and you will be hungry in 2-3 hours. If time permits, I'll walk 10 minutes each way to the local food market with the local F5 employees and have fun conversations over lunch. When I get back, it is time for a black coffee, a review of the additional work that needs to be done for the day, and a decision about which items I can delay. Most of the time we define our own deadlines, so we plan ahead. This means we have no one to blame if we are late. So don't be late. This is also a good time to read some of the security industry news. If there is something notable, I will paste the link into the team's group chat. If it's my turn to write This Week in Security (TWIS), this is where I mark topics to write about. Writing TWIS can be time consuming, but it provides the ability to express yourself and keep up to date with the security industry around the globe.

Now it is CVE work time, which is our second pillar of responsibility: vulnerability management for F5 products. F5 SIRT owns vulnerability management and publishes public CVEs as part of F5's commitment to security best practices for F5 products. We have a public policy that we follow: K4602: Overview of the F5 security vulnerability response policy. CVEs can originate from internal or external sources, such as a security researcher who approached the F5 SIRT team directly. We evaluate CVEs to make sure we understand the vulnerability, both from the exploitation aspect and the relevant fix introduced by Product Engineering (PE).
After interacting with PE, and once the software fix is in place, we start writing the security advisory, which is the actual article that will be published. All CVEs are under embargo until publication day, and just before we publish, we provide a briefing to an internal audience to inform them of what to expect and which types of questions they might encounter. We work as a group to cover all the regions and keep everyone on the same page. Publication day is always a big event for us. This is where all the hard work comes to light. We constantly monitor customer inquiries about fresh CVEs and are ready to solve any challenges customers may face. We always invest a lot of time and effort, so we have created a well-defined playbook and a common language so that we can publish well-documented CVEs. Vulnerabilities and their CVEs will never run out; this is the nature of software and hardware.

Time for a ristretto and the Zero Day (0-day), aka the OMG scenario. Every now and then, a new high-profile 0-day is published. This is the start of a race to mitigation, and our playbooks are ready for these situations. We start by collecting all the information available and evaluating the situation. If F5 products are affected by the 0-day, a software fix will be issued ASAP and a customer notification will be released describing the actions that need to be taken. If we are not affected, then we want to find a mitigation to help our customers protect themselves. In both cases we write a security article on AskF5, as well as internal communication and a briefing on our findings and remediations. These include all possible mitigations, such as WAF signatures, iRules, AFM IPS signatures, LTM configuration and more. The Log4j 0-day was a good example of how a solid process works like magic: we published mitigation articles and email notifications very quickly. In such cases we work with the Security Research Team from the local Tel Aviv office, a very talented group of people who assist and collaborate with us all the time with full dedication for high-profile CVEs. This is where the power of F5 as a company shows its face. Once we have our mitigation plan for the 0-day, F5 SIRT sends a notification email to our customers and publishes information on the AskF5 site and on social media (of course). This typically increases customer inquiries about their level of exposure to the new 0-day, so publishing articles and knowledge is critical to fast mitigation for our customers. And it is afternoon already.

Afternoon: tea, knowledge sharing and projects

Technology is constantly improving, and new features, products and services are being released to confront upcoming attacks. Therefore, learning and practicing with new releases is mandatory. The more we learn, know and get our hands on, the better we can mitigate security challenges when dealing with customers under attack and with vulnerability management. This is also our third pillar: security advisor, which is about learning and building a security mindset by sharing knowledge and experience. We write knowledge base articles on AskF5, we mentor whenever we have good advice, and we answer security inquiries from both internal and external sources. This knowledge and experience translates into projects that we choose to do every quarter. My favorite project, which I was leading (and which is still very alive and relevant today), is the Attack Matrix, which is used as a set of battle cards for customers and F5 personnel.
The basic concept is to map attacks to their corresponding mitigations with F5 products. This is a very effective tool for customers and demonstrates the power of F5's security capabilities. I mostly liked doing the WAF section (remember, my favorite F5 product is BIG-IP Advanced WAF), which in my humble opinion is the best WAF technology in the industry.

Late afternoon: meet the team

You have probably already figured out that my time zone is EMEA. Together with AaronJB, we cover the three pillars of the F5 SIRT team for the EMEA region. We discuss new ideas often, and sometimes it feels like we could talk about security for weeks. So thank you, Aaron, for helping me and for being around. No matter how good you are as an individual, you must have a team to really succeed! As the day comes to an end and North America wakes up (8 AM Seattle time is 6 PM in Tel Aviv), we have sync calls for the core team and other teams. It always feels good to talk to the F5 SIRT core personnel from APCJ and NA, whom I work with every day. With our fearless leader, who established this security A-team, it is such a pleasure to work in this group.

Day report

This was a day in the life of an F5 SIRT team member, and it is totally subject to immediate change; an emergency can arrive at any time of the day. There are days where everything becomes a war room, when a worldwide high-profile security incident occurs. And there are days where I can have a cup of coffee and write an article like this. Security has become a necessity; every aspect of software and computer systems is directly affected by threats. So security mitigation is here to stay and is key to keeping it all going. There is much more to these organized and erratic workdays, and I could talk and talk, but the day has ended, so until next time... Keep it up.

F5 402 Exam reading list and notes
Disclaimer: The collection of articles and documentation is credited to the original owners. This is not an official F5 402 exam guide.

I recently passed the F5 402 - Certified Solution Expert - Cloud exam. I am pleased that I finally achieved it. Many are asking what I used to prepare for the exam.

First, be familiar with the 402 - CLOUD SOLUTIONS EXAM BLUEPRINT. It is located at K29900360: F5 certification | Exams and blueprints. https://support.f5.com/csp/article/K29900360

The prerequisite to take the F5 402 exam is that you are currently an F5 CTS for LTM (301a and 301b) and DNS (302). These exams will have already exposed you to BIG-IP LTM and DNS. However, you should also read up on the other BIG-IP modules and their functionality. The F5 402 exam blueprint already gives you the topics you will need to be familiar with.

It really helps if you have hands-on experience working in cloud environments, such as AWS and Azure, and container environments such as Kubernetes. For me, it was a bit of AWS and Kubernetes. You will need to be familiar with cloud terminology - services, features, and so on - and how it relates to cloud vendors. Familiarity with container orchestration terminology, such as in Kubernetes, will also help. Bundling these cloud/container terms and features, how they relate to BIG-IP deployments in the cloud, and mapping them to the F5 402 exam blueprint will help you organize your knowledge and prepare for the exam.

Looking back, and while preparing for the exam, here is the documentation I would start with to review and build a knowledge map. There are links in the articles that supplement the concepts described; my suggestion is to consult the F5 402 exam blueprint and see whether you need more familiarity with a topic after reading through the articles.

https://clouddocs.f5.com/cloud/public/v1/
https://clouddocs.f5.com/cloud/public/v1/aws_index.html
https://clouddocs.f5.com/cloud/public/v1/azure_index.html
https://clouddocs.f5.com/cloud/public/v1/matrix.html
https://clouddocs.f5.com/containers/latest/
https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/
https://www.f5.com/company/blog/networking-in-the-age-of-containers
https://aws.amazon.com/blogs/networking-and-content-delivery/deployment-models-for-aws-network-firewall/
https://docs.microsoft.com/en-us/azure/architecture/aws-professional/services

Good luck!

F5 Distributed Cloud - Service Policy - Header Matching Logic & Processing
Learn about the F5 Distributed Cloud service policy feature and how to apply logic (and/or) to your match criteria. Understanding the logic structures within service policies unlocks endless combinations of application security services.

Deploying WAF in production using Azure Resource Manager template with F5 NGINX App Protect
Introduction:

In production-grade deployments, it is always a challenge to give a demo in your own environment with a WAF deployment. It usually takes at least a few weeks for an average team to design and implement a production-grade WAF in a cloud environment, because each cloud deployment requires detailed analysis of virtual networking, infrastructure security, virtual machine images, auto-scaling, logging, monitoring, automation, and many more topics. To save this time and effort, we came to the conclusion that a proper WAF deployment can be templatized and automated, so a team does not need to spend time on deployment and maintenance and can use a WAF from day zero. In this article we introduce a project that implements an Azure Resource Manager template to deploy a production-grade WAF in the Azure cloud in just a few clicks. The WAF uses the official F5 NGINX App Protect WAF image, which is available in the Azure Marketplace. This eliminates the need to manually pre-build the VM image for your WAF deployment; it contains all the necessary code and packages on top of the OS of your choice. Additionally, it allows you to pay as you go for the NGINX App Protect WAF software instead of purchasing a year-long license.

Why Azure?

Globally, 90% of Fortune 500 companies use Microsoft Azure to drive their business. Using deeply integrated Azure cloud services, enterprises can rapidly build, deploy, and manage simple to complex applications with ease. Azure supports a wide range of programming languages, frameworks, operating systems, databases, and devices, allowing enterprises to leverage tools and technologies they trust. Here are some of the reasons why customers deploy their applications on Azure:

Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) capabilities
Security, scalability and elasticity
Environmental integration with other Microsoft tools
Cost efficiency and interoperability

Project:

This project implements an ARM (Azure Resource Manager) template that automatically deploys a production-grade WAF using NGINX App Protect WAF to the Azure cloud. It allows administrators to deploy, manage, and monitor Azure resources. It also allows administrators to apply access controls to all services in a resource group with role-based access control (RBAC), which is available in ARM.

Architecture:

The high-level architecture represents an Azure availability system that runs an application load balancer, a Virtual Machine Scale Set (VMSS), and a subset of virtual machines running NGINX App Protect WAF software behind it. The load balancer manages TLS certificates, receives traffic, and distributes it across all Azure Virtual Machines (VMs). The NGINX App Protect WAF VM instances inspect traffic and forward it to the application backend. The VMSS scales up the virtual machines based on the configured rules.

Major components:

ARM template (a Git repository which contains the source of the data plane and security policy configurations): The pipeline runs the ARM templates, which connect to the Azure portal and deploy the solution. A user can also log in to the Azure portal directly and run the template under Template Spec, which deploys the solution directly.

Auto-scaling (data plane based on official NGINX App Protect WAF Azure VM images): The solution uses a Virtual Machine Scale Set configured to spin up new NGINX App Protect WAF virtual machine instances based on incoming traffic volumes.
This removes operational headaches and optimizes costs, as the Scale Set adjusts the amount of compute resources and charges the user on a pay-as-you-go basis.

Visibility (dashboards displaying NGINX App Protect WAF health and security data): The template sets up a set of visibility dashboards in the Azure Dashboard service. Data plane VMs send logs and metrics to the Dashboard service, which visualizes the incoming data as a set of charts and graphs showing NGINX App Protect WAF health and security violations.

Example:

These three components form a complete NGINX App Protect WAF solution that is easy to deploy, does not impose any operational headache, and provides handy interfaces for NGINX App Protect WAF configuration and visibility right out of the box.

Automation:

The following diagram represents the end-to-end automation solution. GitHub is used as the CI/CD platform. The GitHub pipeline sets up and configures the entire system from the ground up. The first stage creates all necessary Azure resources, such as Azure AS (Analysis Service), the VMSS, virtual machines, and the load balancer. The second stage sends test traffic (including malicious requests) and verifies the solution.

Project Repository: f5devcentral/azure-waf-solution-template (github.com)

Steps:

Pre-requisites:
Azure account and credentials.
Admin privileges to your Azure resource group.
Service principal and password (follow this link to create the service principal: https://docs.microsoft.com/en-us/cli/azure/create-an-azure-service-principal-azure-cli).
Resource group created in the Azure portal.

Add the variables below under GitHub --> Secrets:
AZURE_SP --> Azure service principal
AZURE_PWD --> Azure client password

Add your resource group and other params under the Lib/azure-user-params file. Mandatory params: ResourceGroup, TenandId, SubscriptionId.

On GitHub.com, navigate to the main page of the repository and, below the repository name, click the Actions tab. In the left sidebar, select the workflow named "Resource Manager Template Deployment in Azure". Above the list of workflow runs, select Run workflow.

LOG:

Conclusion:

Using a template to deploy a cloud WAF significantly reduces the time spent on WAF deployment and maintenance. It also provides a complete and easy-to-use solution to deploy resources and verify the NGINX App Protect WAF security solution on the Azure platform in any location. Handy interfaces for configuration and visibility turn this project into a boxed solution, allowing a user to easily operate a WAF and focus on application security.
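For readers who prefer the command line, the service principal prerequisite and a manual deployment of an ARM template can also be done with the Azure CLI. This is a hedged sketch: the scope, resource group, and template file names are placeholders, and the repository's GitHub Actions workflow remains the intended deployment path.

# Create a service principal scoped to the resource group (all values are placeholders).
az ad sp create-for-rbac --name nap-waf-deployer --role Contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>

# Deploy an ARM template into the resource group (template file names are assumptions).
az deployment group create --resource-group <resource-group> \
  --template-file azuredeploy.json --parameters @azuredeploy.parameters.json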
Conclusion:
Using a template to deploy a cloud WAF significantly reduces the time spent on WAF deployment and maintenance. It also provides a complete and easy-to-use solution to deploy the resources and verify the NGINX App Protect WAF security solution on the Azure platform in any location. Handy interfaces for configuration and visibility turn this project into a boxed solution, allowing a user to easily operate a WAF and focus on application security.

Demonstration of Device DoS and Per-Service DoS protection

Dear Reader,
This article is intended to show what effect the different threshold modes have on the Device and Per-Service (VS/PO) context. I will use practical examples to demonstrate those effects, and you will get to review a couple of scripts that help you run DoS flood tests and visualize the results on the CLI. In my first article (https://devcentral.f5.com/s/articles/Concept-of-F5-Device-DoS-and-DoS-profiles), I talked about the concept of F5 Device DoS and Per-Service DoS protection (DoS profiles). I also covered the physical and logical data path, which explains the order of Device DoS and Per-Service DoS using the DoS profiles. In the second article (https://devcentral.f5.com/s/articles/Explanation-of-F5-DDoS-threshold-modes), I explained how the different threshold modes work. In this third article, I would like to show you what it means when the different modes work together. But before I start doing some tests to show the behavior, I would like to give you a quick introduction to the toolset I'm using for these tests.

Toolset
First of all, how do you generate some floods? Plenty of tools can be found on the Internet. Whatever tool you prefer, just download it and run it against your Device Under Test (DUT). If you would like to use my script, you can get it from GitHub: https://github.com/sv3n-mu3ll3r/DDoS-Scripts
This script uses hping and lets you run different types of attacks. Simply start it with:
$ ./attack_menu.sh <IP> <PORT>
A menu of different attacks will be presented, which you can launch against the IP and port given as parameters.
Figure 1: Attack Menu
To see what L3/4 DoS events are currently ongoing on your BIG-IP, simply go to the DoS Overview page.
Figure 2: DoS Overview page
I personally prefer to use the CLI to get the details I'm interested in. This way I don't need to switch between the CLI, where I launch my attacks, and the GUI, where I see the results. For that reason, I created a script that shows me what I am most interested in.
Figure 3: DoS stats via CLI
You can download that script here: https://github.com/sv3n-mu3ll3r/F5_show_DoS_stats_scripts
Simply run it with the "watch" command and the parameter "-c" to get colored output (-c is only available starting with TMOS version 14.0).
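For example, an invocation could look like this. The script file name is a placeholder on my part; use whatever the script in the repository is actually called for your TMOS version:

$ watch -n 1 './show_dos_stats.sh -c'

This refreshes the statistics output every second while the attacks are running.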
What is this script showing you?
context_name: The context, either the PO/VS or the Device, in which the vector is running
vector_name: The name of the DoS vector
attack_detected: When it shows "1", an attack has been detected, which means the 'stats_rate' is above the 'detection' rate
stats_rate: The current incoming pps rate for that vector in this context
drops_rate: The pps rate dropped in software (not FPGA) for that vector in this context
int_drops_rate: The pps rate dropped in hardware (FPGA) for that vector in this context
ba_stats_rate: The pps rate for bad actors
ba_drops_rate: The pps rate of dropped 'bad actors' in HW and SW
bd_stats_rate: The pps rate for the attacked destination
bd_drop_rate: The dropped pps rate for the attacked destination
mitigation_curr: The current mitigation rate (per TMM) for that vector in that context
detection: The current detection rate (per TMM) for that vector in that context
wl_count: The number of whitelist hits for that vector in that context
hw_offload: When it shows '1', FPGAs are involved in the mitigation
int_dropped_bytes_rate: The rate of bytes dropped in HW for that vector in that context
dropped_bytes_rate: The rate of bytes dropped in SW for that vector in that context

When a line is displayed in green, packets are hitting that vector, but no anomaly has been detected and nothing is mitigated (dropped) via DoS. If a line turns yellow, an attack (anomaly) has been detected, but no packets are dropped via DoS functionalities. When the color turns red, the system is actually mitigating and dropping packets via DoS functionalities on that vector in that context.

Start Testing
Before we start with some tests, let me give you a quick overview of my lab setup. I'm using a DHD (DDoS Hybrid Defender) running on an i5800 box with TMOS version 15.1. My traffic generator sends around 5-6 Gbps of legitimate (HTTP and DNS) traffic through the box, which is connected in L2 mode (v-wire) to my network. On the "client" side, where my clean-traffic generator is located, my attacking clients are located as well, using my DoS script. On the "server" side, I run a web service and a DNS service, which I am going to attack.
Ok, now let's do some tests to see the behavior of the box and double-check that we really understand the DDoS mitigation concept of BIG-IP.

Y-MAS flood against a protected server
Let's start with a simple Y-MAS (all TCP flags cleared) flood. You can only configure this vector on the device context and only in manual mode. That is fine, because this type of TCP packet is not valid and would get dropped by the operating system (TMOS) anyway. But because I want this type of packet dropped in hardware (FPGA) very early when it hits the box, mostly without touching the CPU, I set the Mitigation Threshold EPS to '10' and the Detection Threshold EPS to '10'. That means as soon as a TMM sees more than 10 pps for that vector, it will give me a log message and also rate-limit this type of packet to 10 packets per second per TMM. Everything below that threshold will get to the operating system (TMOS) and be dropped there.
Figure 4: Bad TCP Flags vector
As soon as I start the attack, which targets the web service (10.103.1.50, port 80) behind the DHD with randomized source IPs:
$ /usr/sbin/hping3 --ymas -p 80 10.103.1.50 --flood --rand-source
I get a log message in /var/log/ltm:
Feb 5 10:57:52 lon-i5800-1 err tmm3[15367]: 01010252:3: A Enforced Device DOS attack start was detected for vector Bad TCP flags (all cleared), Attack ID 546994598.
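If you prefer to follow these DoS events live on the CLI while an attack is running, standard Linux tools are enough; for example, the following simply filters the attack start/stop messages out of the LTM log:

$ tail -f /var/log/ltm | grep -i "dos attack"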
And my script shows me the details of that attack in real time (the line is red, indicating we are mitigating): currently 437569 pps are hitting the device, 382 pps are blocked by DDoS in SW (CPU), and 437187 pps are blocked in HW (FPGA).
Figure 5: Mitigating Y-Flood
Great, that was easy. :-) Now, let's do another TCP flood against my web server.

RST flood against a protected server with Fully Manual threshold mode
For this test I'm using the "Fully Manual" mode, which configures the thresholds for the whole service we are protecting with the DoS profile attached to my VS/PO.
Figure 6: RST flood with manual configuration
My Detection Threshold EPS and my Mitigation Threshold EPS are set to '100'. That means as soon as we see more than 100 RST packets per second hitting the configured VS/PO for this web server, the system will start to rate-limit and send a log message.
Figure 7: Mitigating RST flood on PO level
Perfect. We see the vector in the context of my web server (/Common/PO_10.103.1.50_HTTP) is blocking (rate-limiting) as expected from the configuration. Please ignore the 'IP bad src' vector, which is in detected mode. This is because hping creates randomized IPs and not all of them are valid.

RST flood against a protected server with Fully Automatic threshold mode
In this test I set the Threshold Mode for the RST vector on the DoS profile attached to my web server to 'Fully Automatic', which is what you would most likely do in the real world as well.
Figure 8: RST vector configuration with Fully Automatic
But what does this mean for the test now? I run the same flood against the same destination, and my script shows me the anomaly on the VS/PO (and on the device context), but it does not mitigate! Why is that?
Figure 9: RST flood with Fully Automatic configuration
When we take a closer look at the screenshot, we see that the 'stats_rate' shows 730969 pps, while the detection rate shows 25. Where does this 25 come from? As we know, when 'Fully Automatic' is enabled, the system learns from history. In this case the history was even lower than 25, but because I set the floor value to 100, the detection rate per TMM is 25 (floor value / number of TMMs), which in my case is 100/4 = 25.
So we need to recognize that the 'stats_rate' value represents all packets for that vector in that context, while the detection value is shown per TMM. This explains why the system has detected an attack, but why is it not dropping anything via the DoS functionalities? To understand this, we need to remember that mitigation in 'Fully Automatic' mode will ONLY kick in if the incoming rate for that vector is above the detection rate (this condition is now true) AND the stress on the service is too high. But because the BIG-IP is configured as a stateful device, these randomized RST packets will never reach the web service; they all get dropped, at the latest, by the state engine of the BIG-IP. Therefore the service will never experience stress caused by this flood. This is one of the nice benefits of having a stateful DoS device. So the vector on the web server context will not mitigate here, because the web server is not stressed by this type of TCP attack. This also explains the Server Stress visualization in the GUI, which didn't change before or during the attack.
Figure 10: DoS Overview in the GUI
But what happens if the attack gets stronger and stronger, or the BIG-IP is too busy dealing with all these RST packets?
This is when the Device DoS kicks in, but only if you have configured it in 'Fully Automatic' mode as well. As soon as the BIG-IP receives more RST packets than usual (the detection rate) AND the stress (CPU load) on the BIG-IP gets too high, it starts to mitigate on the device context. This is what you can see here:
Figure 11: 'Massive' RST flood with Fully Automatic configuration
The flood still goes against the same web server, but the mitigation is done on the device context, because the CPU utilization on the BIG-IP is too high. In the screenshot below you can see that the mitigation value (mitigation_curr) is set to 5000 on the device context, which is the same as the detection value. This value results from the 'floor' value as well. It is the smallest possible value, because the detection and mitigation rates will never be below the 'floor' value. The mitigation rate is calculated dynamically based on the stress of the BIG-IP. In this test I artificially increased the stress on the BIG-IP, and therefore the mitigation rate was calculated down to the lowest possible number, which is the same as the detection rate. I will explain how I did that later.
Figure 13: Device context configuration for RST flood
Because this is the device config, the value you enter in the GUI is per TMM, and this is reflected in the script output as well.
What does this mean for the false-positive rate? First of all, all RST packets not belonging to an existing flow will be kicked out by the state engine. At this point we don't have any false positives. If the attack increases and the CPU can't handle the number of packets anymore, the DoS protection on the device context kicks in. With the configuration we have done so far, it will rate-limit all RST packets hitting the BIG-IP. There is no differentiation anymore between good and bad RST packets, or whether an RST is destined for server 1 or server 2, and so on. This means that at this point we can face false positives with this configuration. Is this bad? Well, false positives are always bad, but at this point you can say it's better to have a few false positives than a service going down or, even more critical, your DoS device going down.
What can you do to only have false positives on the destination that is under attack? You have probably noticed that you can also enable "Attacked Destination Detection" on the DoS vector, which makes sense on the device context and on a DoS profile that is used on a protected object (VS) covering a whole network.
Figure 14: Device context configuration for RST flood with 'Attacked Destination Detection' enabled
If the attack now hits one or multiple IPs in that network, BIG-IP will identify them and will rate-limit only the destination(s) under attack. This still means that you could face false positives, but at least only on the IPs under attack. This is what we see here:
Figure 15: Device context mitigation on attacked destination(s)
The majority of the RST packet drops are done on the "bad destination" (bd_drops_rate), which is the IP under attack.
The other option you can also set is "Bad Actor Detection". When this is enabled, the system identifies the source(s) causing the load and will do the rate-limiting for those IP address(es). This usually works very well for amplification attacks, where the attack packets come from specific hosts rather than randomized sources.
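If you want to reproduce the bad-actor behavior in your own lab, you need a flood whose packets share a single (or a few) source address(es) instead of random ones. With hping you could, for example, spoof a fixed source address using the -a option. The source IP below is just an arbitrary lab address I picked for illustration:

$ /usr/sbin/hping3 --rst -p 80 10.103.1.50 -a 10.99.88.77 --flood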
Figure 16: Device context mitigation on attacked destination(s) and bad actor(s)
Here you can see that the majority of the mitigation is done on 'bad actors', which massively reduces the chance of false positives.
Figure 17: Device context configuration for RST flood with 'Attacked Destination Detection' and 'Bad Actor Detection' enabled
You also have multiple additional options for dealing with 'attacked destinations' and 'bad actors', but this is something I will cover in another article.

Artificially increasing the BIG-IP's CPU load
Before I finish this article, I would like to give you an idea of how you can increase the CPU load of the BIG-IP for your own testing. As we know, with "Fully Automatic" on the device context, mitigation only kicks in if the packet rate is above the detection rate AND the CPU utilization on the BIG-IP is "too" high. This is sometimes difficult to achieve in a lab, because you may not be able to send enough packets to stress the CPUs of a hardware BIG-IP. In this case I use a simple iRule, which I attach to the VS/PO that is under attack.
Figure 18: Stress iRule
When I start my attack, I send the traffic with a specific TTL for identification. This TTL is configured in my iRule in order to run a CPU-intensive string-compare function on every attack packet. Here is an example with hping:
$ /usr/sbin/hping3 --rst -p 80 10.103.1.50 --ttl 5 --flood --rand-source
This can easily drive the CPU very high, and the DDoS rate-limiting kicks in very aggressively.

Summary
I hope this article gives you an even better understanding of the effect the different threshold modes have on attack mitigation. Of course, keep in mind these are just the 'static DoS' vectors. In later articles I will also explain 'Behavioral DoS' and the cookie-based mitigation, which help to massively reduce the chance of false positives. But also keep in mind that the DoS vectors start to act within microseconds and are very effective for protection.
Thank you, sVen Mueller