Forum Discussion

Historic F5 Account

Jan 28, 2010

Unexpected TMOS utilisation hit

I have a BIG-IP 6400 platform running LTM v9.4.7.

We have an iRule to direct HTTP requests coming into the F5 from the datacentre out to the appropriate external service (for XML messaging). This is a lot cleaner than having 250 VIPs for 250 different services.

The rule matches the HTTP host to a pool defined on the server, so if we need to update a host, we can do this via the web interface, without making any changes to the iRule by hand (= lower risk, easier operation).

Slap the iRule on the virtual server and a serverssl profile on there and you've got encrypted traffic going out.

We've also got some error handling to deal with people trying to get to services they shouldn't from this particular vip, and error messages if the external service is down.

Here is the code:

   # iRule to direct HTTP requests based on hostname
   when HTTP_REQUEST {
      # Extract hostname - needs to be lower case to match pool name
      set host [string tolower [HTTP::host]]
      # Check that the hostname (and therefore the pool name the request will be sent to)
      # ends with the correct domain - prevents using this VS to hop out to a different pool
      if { ($host ends_with ".bob.com") and !($host contains ".test.") } {
         if { [catch { pool $host }] } {
            # No matching pool name - so move on to error
            HTTP::respond 404 content "Endpoint not defined"
         }
      } else {
         HTTP::respond 403 content "Invalid URL"
      }
   }

   when LB_FAILED {
      HTTP::respond 504 content "Endpoint Unavailable"
   }
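
For checking the hostname logic away from the box, the rule's decision can be modelled in a few lines of Python. This is only a sketch of the matching rules as written above: the pool names are made up for illustration, and it says nothing about how TMM resolves pools internally.

```python
def route(host_header, pool_names):
    """Mirror the iRule's decision for one request.

    Returns a short string describing the outcome, for illustration only.
    """
    # The iRule lower-cases the Host header before matching.
    host = host_header.lower()
    # Only hosts under .bob.com that are not test hosts may be routed.
    if host.endswith(".bob.com") and ".test." not in host:
        # The iRule's catch around "pool $host" becomes a membership check.
        if host in pool_names:
            return f"pool {host}"
        return "404 Endpoint not defined"
    return "403 Invalid URL"


# Hypothetical pool names; the real ones live in the BIG-IP configuration.
pools = {"orders.bob.com", "billing.bob.com"}

for candidate in ("Orders.Bob.com", "unknown.bob.com",
                  "orders.test.bob.com", "example.org"):
    print(candidate, "->", route(candidate, pools))
```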

So far, so good. We've tested, we've run iRule benchmarking and it's reasonably efficient...

config # cat /proc/cpuinfo
   model name      : AMD Opteron(tm) Processor 246   
   stepping        : 1   
   cpu MHz         : 1992.276   
      
   config # bigpipe rule Prod_Spine_Generic_Routing show all
   RULE Prod_Spine_Generic_Routing   
   +-> HTTP_REQUEST   3123 total   0 fail   0 abort   
   |   |     Cycles (min, avg, max) = (17437, 51642, 92343)   
   +-> LB_FAILED   0 total   0 fail   0 abort   
       |     Cycles (min, avg, max) = (0, 0, 0)

When that's spreadsheeted it comes back with 40,000 TPS, which is perfectly reasonable for the amount of traffic we're expecting.
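
As a rough sketch of that arithmetic: assuming the spreadsheet simply divides the clock speed by the cycles per rule execution (an assumption, the spreadsheet itself isn't shown here), the figures from the output above give an upper bound for a single core doing nothing but running the rule.

```python
# Figures copied from the cpuinfo and rule statistics output above.
CPU_HZ = 1992.276e6  # "cpu MHz : 1992.276"
CYCLES = {"min": 17437, "avg": 51642, "max": 92343}


def max_runs_per_second(cpu_hz, cycles_per_run):
    """Upper bound on rule executions per second on one otherwise idle core."""
    return cpu_hz / cycles_per_run


for label, cycles in CYCLES.items():
    print(f"{label}: {max_runs_per_second(CPU_HZ, cycles):,.0f} executions/s")
```

The average case works out at a little under 39,000 executions per second, which is in the same region as the 40,000 TPS quoted.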

So at that point we said, we've got this sorted, put it into production and carried on to the next piece of work.

We've now got people using this VIP, and during their first bulk load run they got to the dizzying heights of 12 HTTP requests per second. During this period the TMOS utilisation moved from its usual 1-2% to 25%, and when they backed off to 6 TPS it reduced to 10-12%.

Clearly something's not right here, but the timing stats (produced from production data) show it's all fine. There has to be some sort of hidden cost somewhere, but I can't see any obvious place it's coming from.
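
To put a number on the gap, here is a back-of-envelope sketch. It assumes the utilisation percentages refer to a single 1992 MHz core and takes 2% as the baseline; both are assumptions made for illustration.

```python
CPU_HZ = 1992.276e6      # clock speed from /proc/cpuinfo
AVG_RULE_CYCLES = 51642  # average HTTP_REQUEST cycles from the rule stats


def rule_cpu_fraction(requests_per_s, cycles_per_request=AVG_RULE_CYCLES,
                      cpu_hz=CPU_HZ):
    """Fraction of one core the measured iRule cycles would account for."""
    return requests_per_s * cycles_per_request / cpu_hz


def implied_cycles_per_request(busy_fraction, baseline_fraction,
                               requests_per_s, cpu_hz=CPU_HZ):
    """Cycles per request if all CPU above baseline were spent on these requests."""
    return (busy_fraction - baseline_fraction) * cpu_hz / requests_per_s


print(f"iRule share at 12 req/s: {rule_cpu_fraction(12):.4%}")
print(f"implied cost at 12 req/s: {implied_cycles_per_request(0.25, 0.02, 12):,.0f} cycles")
```

On those assumptions the rule itself explains about 0.03% of a core at 12 requests per second, while the observed jump implies something in the region of 38 million cycles per request, several hundred times what the timing stats report.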

Our reseller isn't being very helpful, so I'm between a rock and a hard place here.

Any thoughts would be most gratefully received...

14 Replies

hoolio
Cirrostratus
Jan 28, 2010
Hi Steve,

The iRule looks very straightforward, and not something that would eat up a lot of CPU cycles.

How were you measuring the CPU utilization? Ideally, you would check the TMM CPU utilization using tmstat from the CLI or less ideally using the performance graphs.

Thanks,

Aaron

Steve_Scott_873

Historic F5 Account

Jan 28, 2010

Aaron,

Indeed, the timing stats tell us it's not something that's eating many CPU cycles. There has to be an expensive operation which the iRule is causing TMOS to do, but which is not directly executed in the iRule (or it would / should show in the timings).

I'm looking at TMOS CPU utilisation from the graphs, but have also dumped the raw RRD files which seem to agree with the graphs.

Have done about 3 windows with a looped curl, and got the following tmstat output - this is not hitting it that hard

 
  CPU:   5% busy    95% idle    0% sleep                                                     Thu Jan 28 19:19:56 2010 
  
        Memory Allocated                                                   New Flow     Old Flow         Poll 
    34,248,152 / 3,472,883,712                                               14,169      441,524          135 Cycles 
   [  .  :  .  |  .  :  .  ]                                                     16          215   14,651,347 Total 
  
          Tcp            Crypto Ops       Random Class                                                    106 Timers 
           56 Open            3 (total)      407 (total)                                                    0 Stats 
           15 Accepts         0 rsa            0 Pseudo 
           13 Connects        0 full hs        0 Entropy                                         Virtual Class 
              Wait            5 record       407 Secure                                       12,349,665 (total) 
            0 Rtx             0 cipher                                                        10,485,780 mco db 
           32 Del ACK        -1 (unseen)                                                       1,399,046 ssl 
                                                                                                 258,108 tcl 
                                                                                                 206,731 (unseen) 
  
                                                                                                    Umem Class 
                                                                                                  53,519 (total) 
                                                                                                  47,632 ssl_session 
                                                                                                   1,559 listener 
                                                                                                   1,333 xfrag 
                                                                                                     727 poolmbr 
                                                                                                     582 laddr 
                                                                                                     565 pool 
                                                                                                     436 vaddr 
                                                                                                     306 packet 
                                                                                                     158 selfip 
                                                                                                      66 connflow 
                                                                                                      31 rt_entry 
                                                                                                      19 arp_entry 
                                                                                                      15 cn_key 
                                                                                                      15 proxy_ctx cac 
                                                                                                      15 ssl_profile 
                                                                                                      12 http_data 
                                                                                                      11 rtm_internal 
                                                                                                       9 lasthop 
                                                                                                       9 ssl_cn 
                                                                                                       8 ncache_entry 
                        vnic                                                                           4 CallFrame 
          118,280b rx        link       125,928b tx                                                    3 ssl_shim 
 [ . : . | . : . ]       bg       [ . : . | . : . ]                                                    1 mpi_recv_desc 
          471,792b rx  1,000 link       129,536b tx                                                    1 shaper_domain 
 [ . : . | . : . ]       bg       [ . : . | . : . ]                                                    1 ssl_hs 
           21,616b rx  1,000 link       118,496b tx                                                    1 ssl_keys 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx

Steve

Steve_Scott_873

Historic F5 Account

Jan 28, 2010

TMStat without anything going (Baseline)

 
  CPU:   0% busy   100% idle    0% sleep                                                     Thu Jan 28 19:36:14 2010 
  
        Memory Allocated                                                   New Flow     Old Flow         Poll 
    34,009,560 / 3,472,883,712                                               18,435        9,425          132 Cycles 
   [  .  :  .  |  .  :  .  ]                                                      3           29   15,048,450 Total 
  
          Tcp            Crypto Ops       Random Class                                                     64 Timers 
           30 Open            0 (total)        0 (total)                                                    0 Stats 
            2 Accepts         0 rsa            0 Pseudo 
            0 Connects        0 full hs        0 Entropy                                         Virtual Class 
              Wait            0 record         0 Secure                                       12,181,457 (total) 
            0 Rtx             0 cipher                                                        10,485,780 mco db 
            8 Del ACK         0 (unseen)                                                       1,231,264 ssl 
                                                                                                 257,682 tcl 
                                                                                                 206,731 (unseen) 
  
                                                                                                    Umem Class 
                                                                                                  53,481 (total) 
                                                                                                  47,632 ssl_session 
                                                                                                   1,559 listener 
                                                                                                   1,335 xfrag 
                                                                                                     727 poolmbr 
                                                                                                     582 laddr 
                                                                                                     565 pool 
                                                                                                     436 vaddr 
                                                                                                     306 packet 
                                                                                                     158 selfip 
                                                                                                      38 connflow 
                                                                                                      31 rt_entry 
                                                                                                      19 arp_entry 
                                                                                                      15 cn_key 
                                                                                                      15 ssl_profile 
                                                                                                      13 proxy_ctx cac 
                                                                                                      11 rtm_internal 
                                                                                                      10 http_data 
                                                                                                      10 ssl_cn 
                                                                                                       8 ncache_entry 
                                                                                                       7 lasthop 
                        vnic                                                                           1 CallFrame 
           57,232b rx        link        26,408b tx                                                    1 mpi_recv_desc 
 [ . : . | . : . ]       bg       [ . : . | . : . ]                                                    1 shaper_domain 
           18,400b rx  1,000 link        21,496b tx                                                    1 ssl_keys 
 [ . : . | . : . ]       bg       [ . : . | . : . ] 
           13,288b rx  1,000 link         9,504b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx 
 [ . : . | . : . ]                [ . : . | . : . ] 
                0b rx      0 link             0b tx

Steve_Scott_873
Historic F5 Account
Jan 28, 2010
This is a graph from the TMOS CPU graph. This is real traffic, with nobody on the F5 (CLI or otherwise)

The spike was tracked down to using the iRule.

Hopefully that will discount any stats gathering as the cause of the issue?

Have also trimmed the HTML responses
hoolio
Cirrostratus
Jan 28, 2010
Hmm.. so the TMM CPU (CPU1) usage was most likely peaking at 30%. I just did a similar test with 10 curl clients looping requests to a VIP with your rule set up. tmstat and the TMM Utilization graph both show less than 3% usage.

When you retested with three curl loops, did the TMM Utilization graph show such a high CPU usage?

Is there anything in the LTM or TMM log files from the original time period (~11:40pm - 3am)?

Aaron
Steve_Scott_873
Historic F5 Account
Jan 28, 2010
Yes, they did a test run with 2 servers running initially, the 2nd server ran into trouble so they stopped, and then ran with a single server for the remainder of the session. The first spike was 12 HTTP requests a second, the second prolonged period was 6 transactions per second. As you can see, in normal operation (it's a webapp we're hosting, so busiest during the day at ~100 TPS) the TMOS utilisation is minimal, despite there being more complex iRules running.

The three curl loops don't show the same elevation, but it's hard to tell how many requests per second they're pushing. I'll put together a counter on the while true; do curl; done loop and run that from the standby box to avoid contaminating the results.
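
One way such a counter could look, sketched in Python rather than as a shell loop. The target URL is a placeholder, and `fetch` and `measure_rate` are names invented for this example.

```python
import time
import urllib.error
import urllib.request


def fetch(url):
    """Issue one GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 403/404/504 responses from the iRule still count as completed requests.
        return err.code


def measure_rate(send, duration_s, clock=time.monotonic):
    """Call send() back to back for duration_s seconds; return calls per second."""
    start = clock()
    count = 0
    while clock() - start < duration_s:
        send()
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else 0.0
```

Usage would be something like `measure_rate(lambda: fetch("http://vip.example/"), 10)`, with the placeholder address swapped for the VIP. A single sequential loop like this measures how fast one client can go, so several would need to run in parallel to reach a chosen rate.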

What hardware / software version did you run the test on to see 3% utilisation? Did you have a serverssl profile on?

Nothing in the LTM logs. The way I tracked it down to this particular flow was 1) asking around about what we did differently overnight, and 2) the next time they repeated it, running packet captures and looking at which VIPs went busy.
hoolio
Cirrostratus
Jan 28, 2010
I tested on a 6400 running 9.4.8. I didn't test to a pool. All my requests hit the catch statement for the non-existent pool condition. I was focusing on the iRule operation as opposed to load balancing. I guess there could be something horrible happening with the SSL negotiations or serverside connections, but I've never seen something like that use so many CPU cycles.

I actually work as a consultant for the partner you're working with. I talked with Chris about the case earlier this evening. The issue is a bit of a gray area between support and sizing/performance tuning. You might try going back to the support group on this. I can talk with Chris and give him some tips for troubleshooting with you.

Thanks,

Aaron
Steve_Scott_873
Historic F5 Account
Jan 28, 2010
It could be something going wrong at the pool / SSL level - that's the sort of area I'd expect, as the iRule timings seem to indicate things aren't being held up there. I could try curling it so requests stop at the catch, and see if I can reproduce it. That would at least narrow it down to pool selection or the SSL profile.

I know it's a grey area. I'm trying to get the support ticket raised on the basis of the iRule timing functionality being way off here, which would seem to be an integral part of the device rather than something we've slapped on the side. TMOS maxed at potentially 60 transactions per second is pretty hideous when the device is rated for 15k SSL TPS and 2Gbit throughput. The timing profiles are of course an estimation tool, but something's not right here!
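
For reference, the "potentially 60" figure falls out of a straight-line extrapolation of the observed load points. The sketch below assumes utilisation scales linearly with request rate from zero and ignores the 1-2% baseline; that is a simplification and may well not hold on the real box.

```python
def saturation_rate(requests_per_s, busy_fraction):
    """Requests/s at which utilisation would reach 100% under linear scaling."""
    return requests_per_s / busy_fraction


# (requests per second, TMOS utilisation) pairs reported for the bulk load run.
observations = [(12, 0.25), (6, 0.10), (6, 0.12)]

for rate, busy in observations:
    ceiling = saturation_rate(rate, busy)
    print(f"{rate} req/s at {busy:.0%} busy -> ~{ceiling:.0f} req/s at 100%")
```

That gives a ceiling of roughly 48 to 60 requests per second, depending on which data point is used.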

Wish we had a dedicated test lab, but the F5s are too expensive for us to have for odd occasions such as this.
hoolio
Cirrostratus
Jan 28, 2010
The timing command is used to calculate the CPU cycles required to run the iRule. That data is used to extrapolate how many evaluations of the rule can be done for the platform. As far as I'm aware, timing doesn't take into account any other resource requirements. Unfortunately, there isn't a simple way to determine how many TMM CPU cycles other activities are using. So I'm not sure that your scenario shows that the iRule timing is broken. And I don't think there is a linear relationship between the requests per second and the CPU usage.

There is definitely something odd going on, but I think it's a bit early to say exactly where the problem is. I'll try to keep in touch with Chris on this to see how the troubleshooting progresses.

Aaron
Steve_Scott_873
Historic F5 Account
Jan 28, 2010
Agreed, but if it's not the iRule eating the cycles, it's not custom development eating the cycles.

I'll do my best to jerry-rig something with our HA pair of 6400s, so at least I can curl over our heartbeat LAN and avoid skewing the results. I can strip the iRule down until it's pretty much just selecting a pool, and then try removing the serverssl profile, although that's a long way from helping me sort the iRule for production.