Forum Discussion
Mike_Lowell_108
Sep 11, 2007
Historic F5 Account
Any questions? Post'em
Hi everyone,
If you have any questions or comments about the performance report or its supporting documents, please feel free to post them here.
I'm one of the engineers who helped to create the performance report, and I'll be actively monitoring this forum to answer questions.
Mike Lowell
38 Replies
- ukiran22_113041
Nimbostratus
Thanks Mike, I think I have what I needed. I don't think I'm going to tune it to that extent. Also, I realized my Ixia is set to window scaling, but I could not find an option on 8800 to enable WS. Is there a way to enable WS on 8800??
Jay
- Mike_Lowell_108
Historic F5 Account
That's great! Regarding retries, this can be reduced to almost zero even when you're at capacity if you get the concurrency/rate "just right", but it's tough. :) You should be able to eliminate timeouts entirely by reducing the load just a little bit -- timeouts suggest you're seeing quite a lot more retries than I'd expect, because it means the same flow had to have multiple retransmits. It's tough to eliminate retries entirely when you're near capacity limits (on any type of device), but it's definitely possible to minimize the impact. In a world with fast retransmits, TCP timestamps, and SACK, losing a random packet here and there doesn't have a practical impact on users/servers, and it's to be expected when you're up against the device capacity.
Some ideas to help dial it in:
1) Reduce the simuser constraint by increments that are a multiple of the number of physical ports (i.e. 12)
2) Change the congestion control algorithm on your TCP profile to "highspeed"
3) Disable rfc1323 on the TCP profile.
4) Reduce the send/recv buffer sizes on BIG-IP (or Ixia) in 8KB increments.
... this sort of tuning can take a while, but if you really want to get perfect results at the edge of capacity, it's what you need to do. :)
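The sweep suggested above can be sketched in a few lines: step the simuser constraint down in multiples of the 12 physical ports, and the buffer size down in 8KB steps. This is only an illustrative sketch of the search space, not part of the original advice; the starting values are assumptions.

```python
# Illustrative sketch (not from the post): candidate values for the tuning
# sweep described above. PORTS and the starting values are assumptions.

PORTS = 12  # physical ports, so simuser steps stay a multiple of 12

def simuser_candidates(start, steps=5):
    """Simuser constraints to try, reduced in multiples of the port count."""
    return [start - PORTS * i for i in range(steps)]

def buffer_candidates(start_kb=64, floor_kb=8):
    """Send/recv buffer sizes (KB) to try, in 8KB decrements."""
    return list(range(start_kb, floor_kb - 1, -8))

print(simuser_candidates(1020))  # [1020, 1008, 996, 984, 972]
print(buffer_candidates())       # [64, 56, 48, 40, 32, 24, 16, 8]
```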
Good luck!
Mike Lowell
- ukiran22_113041
Nimbostratus
hi Mike,
I don't have any switch in between. My Ixia cards are aggregated into 2 10G ports and I connect those ports to the 10G ports on 8800.
Thanks to your detailed suggestions, I disabled flow control on the 10G interfaces on the 8800 and my throughput went up to 6.5Gbps. I increased the number of users to 1000 and it went up to 7Gbps.
I'm happy with my throughput, but since I disabled flow control, I do see a bunch of TCP retries and timeouts on the server-side Ixia. Is the 8800 dropping packets at those throughput levels, or is Ixia simply sending more than the 8800's ~7Gbps capacity, so that anything above it is dropped by the 8800? I just wanted to understand your view of those drops.
Your suggestions have been an excellent help so far. Thanks a lot.
Jay
- Mike_Lowell_108
Historic F5 Account
Hmmm. Sounds like a pretty good setup to me. :) It's probably just a matter of tuning to get what you need.
It's likely that somewhat more than 24 simusers would still reduce throughput, but a lot more than 24 should increase it by ensuring that both Ixia and BIG-IP constantly have something to do. I definitely suggest trying 1020 (you have 1x 12-port client blade and 1x 12-port server blade, right?) since I've run similar tests in the past and had good luck with equivalent settings.
One challenge with tests that have only large responses is that it's hard to keep BIG-IP/Ixia busy with enough work. Like you say, it's only 30% utilized. :) BIG-IP/Ixia doesn't have too much more work for 1500byte packets compared to 64byte packets, but it takes ~24x longer to send/receive the bigger ones (12.1us vs 0.51us as mentioned above, the speed of ethernet). In the end it means that you need a lot more concurrency to make sure there's always a queue of work that's waiting, otherwise BIG-IP/Ixia will be underutilized. On a much bigger scale it's the same reason that throughput for WAN links is often substantially lower than the available bandwidth: latency is the killer. Ethernet is obviously worlds faster than a WAN connection across the country, but the same principle applies.
With ethernet you're often better-off to have more physically-unique clients -- this makes it easier to generate a truly concurrent workload. I can't say that I've tried a test with just 2x cards, but I'm guessing it'll still achieve the goal with some tuning, though it would be easier to ensure the needed parallelism with more blades (because of more unique ethernet clocks -- greater possibility for keeping a stream of "back-to-back" packets flowing). The smaller the number of unique ethernet interfaces on the client/server, the harder you have to push them to ensure they're generating a constant stream of traffic that'll keep BIG-IP busy.
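The latency/concurrency point above is the bandwidth-delay-product idea in miniature: enough data must be in flight to cover the link's latency, or the link idles. A rough numeric sketch (all figures are illustrative assumptions, not from the post):

```python
# Sketch of the bandwidth-delay-product reasoning above. The link speed,
# latency, and window size below are illustrative assumptions.
import math

def in_flight_bytes(link_bps, rtt_seconds):
    """Bytes that must be outstanding to keep the link fully utilized."""
    return link_bps / 8 * rtt_seconds

def concurrent_streams_needed(link_bps, rtt_seconds, window_bytes):
    """Streams needed if each stream keeps at most window_bytes in flight."""
    return math.ceil(in_flight_bytes(link_bps, rtt_seconds) / window_bytes)

# A 10 Gbps test link with 100 microseconds of end-to-end latency:
bdp = in_flight_bytes(10e9, 100e-6)                           # 125,000 bytes
streams = concurrent_streams_needed(10e9, 100e-6, 32 * 1024)  # 32KB windows
print(f"BDP: {bdp:.0f} bytes, streams needed: {streams}")
```

The same arithmetic explains the WAN analogy: as latency grows, the in-flight requirement grows with it, so fixed-size windows cap throughput well below line rate.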
The most common issues I've run into with throughput tests customers are running:
1) Not enough concurrency (i.e., see above)
2) Intermediate switch connecting test equipment to BIG-IP can't do line-rate (dropping packets...)
3) Switch and/or BIG-IP are using flow control too early (try manually disabling flow control on both sides instead of using auto)
4) Not enough client/server capacity (I don't think this could apply to you with 2x 12-port Ixia blades)
5) Bad cables/optics (rather unlikely, but not impossible, given the performance you're already getting)
6) Uneven distribution of clients/servers, causing one BIG-IP CPU or switch uplink to get overwhelmed (this typically only happens with L2 testing equipment where there's a small number of hard-coded MAC/IP/ports -- not likely to be your issue). You can check this by running "tmstat" and making sure the various links have roughly the same throughput (they're usually within 1%, but being within 10% is still not a problem in most cases)
7) Some odd bug. It's always a good idea to make sure you're running the latest version. :)
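The balance check in item 6 can be sketched as a simple tolerance test on per-link throughput readings. The link names, numbers, and the idea of feeding them in as a dict are all hypothetical; this is not tmstat's actual output format.

```python
# Hypothetical sketch of the item-6 check: flag any link whose throughput
# deviates from the mean by more than 10%. Sample values are made up and
# do not reflect real tmstat output.

def unbalanced_links(throughput_by_link, tolerance=0.10):
    mean = sum(throughput_by_link.values()) / len(throughput_by_link)
    return {link: tput for link, tput in throughput_by_link.items()
            if abs(tput - mean) / mean > tolerance}

samples = {"1.1": 1750, "1.2": 1740, "1.3": 1760, "1.4": 1200}  # Mbps, made up
print(unbalanced_links(samples))  # only the lagging link is flagged
```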
Mike Lowell
- ukiran22_113041
Nimbostratus
Thanks Mike for the explanation about throughput calculations. I always take throughput calculation adjustments into account with Ixia. Since my packets are mostly full size, I usually add about 5% to Ixia throughput numbers.
Some answers to your questions -
a) LTM version - 9.4.6
b) All four TMMs running at about 30%
c) I have set a constraint of 24 for simulated users as my response size is higher and any higher number of simulated users is bringing down my throughput.
I'm trying standard http with these profiles - TCP, http, and One connect
I tried http_lan_optimized profile and it actually brought down my throughput a little bit.
I have 10 self IPs assigned to the server-facing VLAN.
I have my buffer sizes set to 32k in Ixia. I tried 64k and it did not make much of a difference.
I'm not sure what I'm doing wrong.
Thanks again for your help
Jay
- Mike_Lowell_108
Historic F5 Account
Jay, internally we use Ixia as our primary testing platform, so I'm hopeful we can help. :)
I'm sure you're already aware, but I feel compelled to mention this anyway for other folks watching the list: Ixia only reports L7 throughput (i.e. bytes transferred over TCP, excluding TCP/IP/Ethernet headers). This means the throughput reported by Ixia is likely to be ~8% "low" for a throughput test (maybe ~12% for a conn/s test). This is an important point because the wire itself is limited to L2 throughput, so L2 throughput is what really matters from a limits-testing point of view.
Every single packet has 18 bytes of Ethernet + 20 bytes of IP + 20 bytes of TCP (IP and TCP could have more depending on options), in addition to the actual TCP data. So for a single packet that contains 100 bytes of TCP data, Ixia only reports the "100 bytes", even though the full packet size is at least 100 + 18 + 20 + 20 == 158 bytes. That's the most extreme case, but it proves the point nicely. :)
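The 158-byte arithmetic above can be checked in a few lines. This is a sketch rather than anything from the post; the full-size-packet figure assumes 1460 bytes of TCP payload in a 1518-byte frame, and it only counts headers on data packets (the ~8% figure quoted for a whole throughput test presumably also covers ACKs and other traffic).

```python
# Checking the per-packet overhead example above: L2 bytes on the wire vs
# the L7 bytes Ixia reports, using the header sizes from the post.

ETH, IP, TCP = 18, 20, 20   # bytes of Ethernet + IP + TCP headers per packet

def wire_bytes(l7_bytes_per_packet):
    return l7_bytes_per_packet + ETH + IP + TCP

assert wire_bytes(100) == 158   # the 100-byte example from the post

# For a full-size packet (1460 bytes of TCP data in a 1518-byte frame),
# the header overhead alone is about 3.8%:
overhead = (wire_bytes(1460) - 1460) / wire_bytes(1460)
print(f"{overhead:.1%}")
```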
An additional relevant detail about gigabit Ethernet is that despite the "1000Mbps" name, only about 996Mbps is technically possible with a standard 1500 byte MTU. The reason for this is simple: gigabit Ethernet takes 12.1us (microseconds) to transmit a full-size 1518 byte packet (0.51us for a 64-byte packet), and Ethernet requires a 0.096us gap between packets. This means the maximum number of full-size packets (1500 bytes + 18 bytes Ethernet) per second is 12.1us + 0.096us divided into 1 second: 1 / (12.1us + 0.096us). The maximum number of packets per second multiplied by 1518 gives you the maximum number of bytes per second, and multiplied by 8, gives you the maximum bits per second, which is roughly 996Mbps.
This means 7 Gbps line-rate is actually ~6,972Mbps, for example. This means 7 Gbps line-rate using a large-file test is likely to be reported by Ixia as ~6,455 Mbps (this is a very rough estimate).
Anyway, moving on to your actual question.... :) I have some questions to start, and some suggestions below:
1) What BIG-IP version are you using?
2) What's TMM CPU utilization?
3) How many simusers are configured? (and do you have a simuser constraint set? I suggest a constraint of 1024).
It should be possible to get > 7Gbps with an 8KB response or larger using FastHTTP with default settings, or 64KB+ with standard mode using default settings, regardless of whether you're using an HTTP profile (w/ or w/o RAMcache) and/or OneConnect.
When using FastHTTP (or anytime you're using SNATs) you'll want to make sure BIG-IP has enough self IP addresses so that it won't run out of ephemeral ports. My standard test config has 20 self IPs on the server-facing VLAN for this purpose. 20 is overkill for almost all needs, but it doesn't hurt. :) Also, this is rarely a factor for large-file throughput tests.
I also suggest trying the "tcp-lan-optimized" profile to see if that helps. It's also worth looking at your Ixia's TCP send/recv buffer settings. Depending on what version of Ixia you have, they might default to 4k, which is clearly far too small to simulate what you'd see from regular Windows/Mac/Linux boxes -- I recommend 64k to start.
Mike Lowell
- ukiran22_113041
Nimbostratus
hi Mike,
I'm trying to test an 8800 in my lab before we move it to production. Before trying our specific requirements, I was trying to baseline 8800's performance to the test report published by you. And I am unable to get 8800 to do 7Gbps of L7 throughput.
Specific cases I tried -
a) fast http mode with 24 servers. 2 Ixia 10G ASM cards. Response size 512K. Throughput is about 6.2Gbps.
b) standard http mode with round robin lb across 24 servers, profiles used oc, tcp, http. automap is set, and response size 512K. Throughput is 5.5Gbps.
Ixia is configured to maximize transactions per connection. Number of users is 24 with 2 concurrent connections per user.
Any suggestions you could provide in terms of 8800 config will be really helpful. Thanks in advance.
Jay
- hoolio
Cirrostratus
Hi Ugur,
This article from Deb is a good place to start:
iRules Optimization 101 - 5 - Evaluating iRule Performance
http://devcentral.f5.com/Default.aspx?tabid=63&articleType=ArticleView&articleId=123
Aaron - ugurtanyildiz_9
Nimbostratus
Hi Mike
As a global customer, my company uses F5.
In a current project we have a redundant LTM 340 topology.
It has been working properly, but now we need to run an iRule, and we are investigating the effect of running an iRule on performance.
Is there a report that shows the CPU usage of the F5 when running a basic iRule?
I checked the forums but could not find the answer.
I would be glad if you could help.
Thanks in advance,
ugur
- Mike_Lowell_108
Historic F5 Account
Hi Zafer,
I encourage you to work with your local field engineer to help size deployments. When a product does more work, it requires more resources, and understanding the performance of a combined feature-set (SSL + compression + ...) is a fairly involved multi-dimensional problem (lots of variables, many things to consider).
Having said that, the relative advantage of the BIG-IP product versus competitors is still strong. If BIG-IP can handle more L4, L7, SSL, compression, and so on for individual tests, that also means BIG-IP can handle more in combined tests. For example, if BIG-IP handles 6Gbps of compression and 6Gbps of SSL in separate tests, where a competitor may handle 3Gbps of compression and 3Gbps of SSL in separate tests, neither vendor will achieve both metrics at the same time, but perhaps BIG-IP will achieve 4Gbps of SSL+compression, and the competitor will achieve 2Gbps of SSL+compression -- BIG-IP's advantage is the same whether you look at individual metrics or combined metrics.
About iRules performance, I've performed extensive tests of L7 performance on both BIG-IP and competitive platforms (including Alteon, Redline, and Foundry). If you compare similar platform models and feature-sets between products, BIG-IP's performance with iRules is consistently higher. However, if you're using iRules to do something uncommon, something that the competitors are unable to do at all, then there's of course no way to compare this directly to the competitors. For the "L7" tests that vendors in our market use as the baseline, it's inspecting an HTTP URL to select between different groups of servers. All vendors support this basic L7 feature-set, so it's a good baseline comparison. The advertised L7 performance of BIG-IP is based on this same sort of test, using iRules. As you've seen in the report(s), BIG-IP's performance using iRules for this task is very competitive. The real key in judging L7 performance is to make sure there's a similar feature-set in use between platforms that are being compared. Based on my extensive testing, I'm very confident BIG-IP will come out ahead. :)
For common tasks like selecting a pool of servers based on the HTTP URI you don't even need iRules -- I suggest using the httpclass profile (HTTP Classification profile) instead. For more advanced tasks, my experience testing Alteon, Redline, Foundry, and more has shown that BIG-IP's performance is very competitive. If you're seeing 60% CPU on the BIG-IP, then you'll see more than 60% CPU on similar competitive hardware. If you see higher utilization on the BIG-IP compared to a similar competitive platform, then I suggest a different set of features must be in use (perhaps BIG-IP is acting as a full proxy, whereas the competitor is not, in which case it's appropriate to use a simpler non-full proxy mode on the BIG-IP).
Hope this helps, good luck!
Mike Lowell