Advanced iRules: An Abstract View of iRules with the Tcl Bytecode Disassembler
In case you didn't already know, I'm a child of the 70's. As such, my formative years were in the 80's, when the music and movies were synthesized and super cheesy. Short Circuit was one of those cheesy movies, featuring Ally Sheedy, Steve Guttenberg, and Johnny Five, the tank-treaded, laser-wielding robot with feelings and self-awareness. Oh yeah! The plot... doesn't matter at all, but once Johnny reached self-actualization, his big fear was being disassembled. Well, in this article we won't disassemble Johnny Five, but we will disassemble some Tcl code and talk about optimizations.

Tcl forms the foundation of several code environments on BIG-IP: iRules, iCall, tmsh, and iApps. The last three don't carry the performance burden that iRules do, so efficiency isn't as big a concern there. When we speak at conferences, we often commit some time to code optimization techniques because an iRule is applied to live traffic. This isn't to say the system isn't already highly tuned and optimized; it's just important not to introduce any more impact than is absolutely necessary for the rule to carry out its purpose.

In iRules, you can turn timing on to see the impact of an iRule, and in the Tcl shell (tclsh) you can use the time command. These are ultimately the best tools for gauging performance impact. But if you want to see what the Tcl interpreter is actually doing at the instruction level, you will need to disassemble the code. I've looked at bytecode in some of the Python scripts I've written, but I wasn't aware of a way to do that in Tcl. I found a thread on Stack Overflow indicating it was possible, and after probing a little further I was given a solution. This doesn't work in Tcl 8.4, which is what BIG-IP uses, but it does work in 8.5+, so if you have a Linux box with Tcl 8.5 or later, you're good to go.
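If you're not sure what your local tclsh offers, a quick check before diving in can save some head-scratching. This is a minimal sketch, assuming a stock tclsh; the proc name disassemblerAvailable is hypothetical (not part of Tcl), and the only commands it relies on are info tclversion, info patchlevel, info commands, and package vcompare.

```tcl
# Hypothetical helper (not part of Tcl): returns 1 if this interpreter
# is Tcl 8.5 or later and provides tcl::unsupported::disassemble.
proc disassemblerAvailable {} {
    # package vcompare returns -1, 0, or 1 when comparing version strings
    if {[package vcompare [info tclversion] 8.5] < 0} {
        return 0
    }
    # info commands returns the matching command names (empty if none)
    set found [info commands ::tcl::unsupported::disassemble]
    return [expr {[llength $found] > 0}]
}

puts "Tcl [info patchlevel]: disassembler available = [disassemblerAvailable]"
```

If it prints 0, the disassembly examples that follow won't run on that interpreter.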
Note that there are variances from version to version that could absolutely change the way the interpreter works, so understand that this is just an exercise in discovery.

Solution 1

Fire up tclsh and then grab a piece of code. For simplicity, I'll use two forms of a simple math problem. The first uses the expr command to evaluate 3 times 4, and the second is the same math problem with the expression wrapped in curly braces. The command that shows how the interpreter works its magic is tcl::unsupported::disassemble.

    ##
    ## unwrapped expression
    ##
    % tcl::unsupported::disassemble script { expr 3 * 4 }
    ByteCode 0x0x1e1ee20, refCt 1, epoch 16, interp 0x0x1d59670 (epoch 16)
      Source " expr 3 * 4 "
      Cmds 1, src 12, inst 14, litObjs 4, aux 0, stkDepth 5, code/src 0.00
      Commands 1:
          1: pc 0-12, src 1-11
      Command 1: "expr 3 * 4 "
        (0) push1 0     # "3"
        (2) push1 1     # " "
        (4) push1 2     # "*"
        (6) push1 1     # " "
        (8) push1 3     # "4"
        (10) concat1 5
        (12) exprStk
        (13) done

    ##
    ## wrapped expression
    ##
    % tcl::unsupported::disassemble script { expr { 3 * 4 } }
    ByteCode 0x0x1de7a40, refCt 1, epoch 16, interp 0x0x1d59670 (epoch 16)
      Source " expr { 3 * 4 } "
      Cmds 1, src 16, inst 3, litObjs 1, aux 0, stkDepth 1, code/src 0.00
      Commands 1:
          1: pc 0-1, src 1-15
      Command 1: "expr { 3 * 4 } "
        (0) push1 0     # "12"
        (2) done

Because the first expression is unwrapped, the interpreter has to build the expression string at runtime (the concat1 instruction) and then hand it to the runtime expression engine (exprStk), resulting in 4 literal objects and a stack depth of 5. With the wrapped expression, the compiler recognized a compile-time constant and used the result directly, resulting in 1 literal object and a stack depth of 1. Much thanks to Donal Fellows on Stack Overflow for the details. Using the time command in the shell, you can see that wrapping the expression is dramatically more efficient:
    % time { expr 3 * 4 } 100000
    1.02325 microseconds per iteration
    % time { expr {3*4} } 100000
    0.07945 microseconds per iteration

That's roughly 13 times faster in this run.

Solution 2

I was looking in earnest for some explanatory information on the bytecode fields displayed by tcl::unsupported::disassemble, and came across a couple of pages on the Tcl wiki, one building on the other. Combining the pertinent sections of code from each page results in this script you can paste into tclsh (it also uses tcl::unsupported::assemble, which requires Tcl 8.6):

    namespace eval tcl::unsupported {namespace export assemble}
    namespace import tcl::unsupported::assemble
    rename assemble asm
    interp alias {} disasm {} ::tcl::unsupported::disassemble

    proc aproc {name argl body args} {
        proc $name $argl $body
        set res [disasm proc $name]
        if {"-x" in $args} {
            set res [list proc $name $argl [list asm [dis2asm $res]]]
            eval $res
        }
        return $res
    }

    proc dis2asm body {
        set fstart " push -1; store @p; pop "
        set fstep  " incrImm @p +1;load @l;load @p
                     listIndex;store @i;pop
                     load @l;listLength;lt "
        set res ""
        set wait ""
        set jumptargets {}
        set lines [split $body \n]
        foreach line $lines { ;#-- pass 1: collect jump targets
            if {[regexp {\# pc (\d+)} $line -> pc]} {lappend jumptargets $pc}
        }
        set lineno 0
        foreach line $lines { ;#-- pass 2: do the rest
            incr lineno
            set line [string trim $line]
            if {$line eq ""} continue
            set code ""
            if {[regexp {slot (\d+), (.+)} $line -> number descr]} {
                set slot($number) $descr
            } elseif {[regexp {data=.+loop=%v(\d+)} $line -> ptr]} {
                #got ptr, carry on
            } elseif {[regexp {it%v(\d+).+\[%v(\d+)\]} $line -> copy number]} {
                set loopvar [lindex $slot($number) end]
                if {$wait ne ""} {
                    set map [list @p $ptr @i $loopvar @l $copy]
                    set code [string map $map $fstart]
                    append res "\n  $code ;# $wait"
                    set wait ""
                }
            } elseif {[regexp {^ *\((\d+)\) (.+)} $line -> pc instr]} {
                if {$pc in $jumptargets} {append res "\n label L$pc;"}
                if {[regexp {(.+)#(.+)} $instr -> instr comment]} {
                    set arg [list [lindex $comment end]]
                    if {[string match jump* $instr]} {set arg L$arg}
                } else {set arg ""}
                set instr0 [normalize [lindex $instr 0]]
                switch -- $instr0 {
                    concat - invokeStk {set arg [lindex $instr end]}
                    incrImm            {set arg [list $arg [lindex $instr end]]}
                }
                set code "$instr0 $arg"
                switch -- $instr0 {
                    done {
                        if {$lineno < [llength $lines]-2} {
                            set code "jump Done"
                        } else {set code ""}
                    }
                    startCommand  {set code ""}
                    foreach_start {set wait $line; continue}
                    foreach_step  {set code [string map $map $fstep]}
                }
                append res "\n  [format %-24s $code] ;# $line"
            }
        }
        append res "\n label Done;\n"
        return $res
    }

    proc normalize instr {
        regsub {\d+$} $instr "" instr ;# strip off trailing length indicator
        set instr [string map {
            loadScalar     load
            nop            ""
            storeScalar    store
            incrScalar1Imm incrImm
        } $instr]
        return $instr
    }

Now that the script source is in place, you can test the two expressions from Solution 1. The output is very similar; however, there is less diagnostic information accompanying the bytecode instructions. Still, the instructions are consistent between the two solutions. The difference here is that after "building" the proc, you can execute it, as shown below each aproc expression.

    % aproc f x { expr 3 * 4 } -x
    proc f x {asm {
      push 3                   ;# (0) push1 0 # "3"
      push { }                 ;# (2) push1 1 # " "
      push *                   ;# (4) push1 2 # "*"
      push { }                 ;# (6) push1 1 # " "
      push 4                   ;# (8) push1 3 # "4"
      concat 5                 ;# (10) concat1 5
      exprStk                  ;# (12) exprStk
                               ;# (13) done
     label Done;
    }}
    % f x
    12
    % aproc f x { expr { 3 * 4 } } -x
    proc f x {asm {
      push 12                  ;# (0) push1 0 # "12"
                               ;# (2) done
     label Done;
    }}
    % f x
    12

Deeper Down the Rabbit Hole

Will the internet explode if I switch metaphors from bad 80's movie to literary classic? I guess we'll find out. Simple comparisons are interesting, but now that we're peeling back the layers, let's look at something a little more complicated, like a for loop and a list append.
    % tcl::unsupported::disassemble script { for { $x } { $x < 50 } { incr x } { lappend mylist $x } }
    ByteCode 0x0x2479d30, refCt 1, epoch 16, interp 0x0x23ef670 (epoch 16)
      Source " for { $x } { $x < 50 } { incr x } { lappend mylist $x "
      Cmds 4, src 57, inst 43, litObjs 5, aux 0, stkDepth 3, code/src 0.00
      Exception ranges 2, depth 1:
          0: level 0, loop, pc 8-16, continue 18, break 40
          1: level 0, loop, pc 18-30, continue -1, break 40
      Commands 4:
          1: pc 0-41, src 1-56
          2: pc 0-4, src 7-9
          3: pc 8-16, src 37-54
          4: pc 18-30, src 26-32
      Command 1: "for { $x } { $x < 50 } { incr x } { lappend mylist $x }"
      Command 2: "$x "
        (0) push1 0     # "x"
        (2) loadStk
        (3) invokeStk1 1
        (5) pop
        (6) jump1 +26   # pc 32
      Command 3: "lappend mylist $x "
        (8) push1 1     # "lappend"
        (10) push1 2    # "mylist"
        (12) push1 0    # "x"
        (14) loadStk
        (15) invokeStk1 3
        (17) pop
      Command 4: "incr x "
        (18) startCommand +13 1     # next cmd at pc 31
        (27) push1 0    # "x"
        (29) incrStkImm +1
        (31) pop
        (32) push1 0    # "x"
        (34) loadStk
        (35) push1 3    # "50"
        (37) lt
        (38) jumpTrue1 -30  # pc 8
        (40) push1 4    # ""
        (42) done

You'll notice that there are four commands in this code: the for loop itself, the $x init clause (as written, this invokes the value of x as a command, which is what the invokeStk1 1 at pc 3 shows), the lappend operation, and the loop control with the incr command. There are a lot more instructions in this code, along with jumps: the init clause jumps straight to the less-than comparison at pc 32; when that test is true, jumpTrue1 jumps back to the list append at pc 8, which falls through to the incr at pc 18 and then to the comparison again.

Wrapping Up

I went through an exercise years ago to see how far I could minimize the Solaris kernel before it stopped working. I personally got down into the twenties before the system was unusable, but I think the record was somewhere south of 15. So... what's the point? Minimal for minimal's sake is not the point. Meet the functional objectives; that is job one. But then start tuning. Less is more: fewer objects, less stack depth, less instantiation. Reviewing bytecode is good for that, and it is possible with native Tcl.
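Because the disassembler is just another Tcl command, that review can be scripted. The sketch below is hypothetical (bytecodeStats and microsPerIter are names made up for this example) and assumes the header layout shown in the dumps above, where one line reports inst, litObjs, and stkDepth; since tcl::unsupported::disassemble is explicitly unsupported, that layout may change between Tcl versions. It puts the bytecode statistics next to a measured time so you can see when the two disagree.

```tcl
# Hypothetical helper: extract header statistics from the disassembly
# of a script. Relies on the unsupported "inst N, litObjs N, ...
# stkDepth N" header format, so treat it as version-specific.
proc bytecodeStats {script} {
    set dis [::tcl::unsupported::disassemble script $script]
    set stats [dict create]
    foreach field {inst litObjs stkDepth} {
        # \m anchors at a word start, e.g. "litObjs 4" -> 4
        if {[regexp "\\m$field (\\d+)" $dis -> value]} {
            dict set stats $field $value
        }
    }
    return $stats
}

# Hypothetical helper: average microseconds per iteration, taken from
# the first word of [time]'s "N microseconds per iteration" result.
# The script runs in the caller's scope, as it would at the prompt.
proc microsPerIter {script {iterations 10000}} {
    return [lindex [uplevel 1 [list time $script $iterations]] 0]
}

# Compare a few of the scripts from this article side by side.
foreach script {
    {expr 3 * 4}
    {expr {3 * 4}}
    {regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ a b c d}
    {scan 192.168.101.20 %d.%d.%d.%d a b c d}
} {
    puts [format "%-45s %-32s %8.3f us" $script \
        [bytecodeStats $script] [microsPerIter $script]]
}
```

Expect the timing column to vary from run to run and box to box; only the relative ordering is meaningful.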
However, it is still important to test code performance, because relying on bytecode object counts and stack depth alone is not a good idea. For example, if we look at the bytecode for matching an IP address, there is no discernible difference from Tcl's perspective between the two regexp versions, and very little difference between the regexp versions and the scan example. (dis here appears to be a shorthand alias for tcl::unsupported::disassemble, like disasm above.)

    % dis script { regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} 192.168.101.20 _ a b c d }
    ByteCode 0x0x24cfd30, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
      Source " regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-"
      Cmds 1, src 90, inst 19, litObjs 8, aux 0, stkDepth 8, code/src 0.00
      Commands 1:
          1: pc 0-17, src 1-89
      Command 1: "regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9"
        (0) push1 0     # "regexp"
        (2) push1 1     # "([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})"
        (4) push1 2     # "192.168.101.20"
        (6) push1 3     # "_"
        (8) push1 4     # "a"
        (10) push1 5    # "b"
        (12) push1 6    # "c"
        (14) push1 7    # "d"
        (16) invokeStk1 8
        (18) done

    % dis script { regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ a b c d }
    ByteCode 0x0x24d1730, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
      Source " regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _"
      Cmds 1, src 64, inst 19, litObjs 8, aux 0, stkDepth 8, code/src 0.00
      Commands 1:
          1: pc 0-17, src 1-63
      Command 1: "regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ "
        (0) push1 0     # "regexp"
        (2) push1 1     # "^(\d+)\.(\d+)\.(\d+)\.(\d+)$"
        (4) push1 2     # "192.168.101.20"
        (6) push1 3     # "_"
        (8) push1 4     # "a"
        (10) push1 5    # "b"
        (12) push1 6    # "c"
        (14) push1 7    # "d"
        (16) invokeStk1 8
        (18) done

    % dis script { scan 192.168.101.20 %d.%d.%d.%d a b c d }
    ByteCode 0x0x24d1930, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
      Source " scan 192.168.101.20 %d.%d.%d.%d a b c d "
      Cmds 1, src 41, inst 17, litObjs 7, aux 0, stkDepth 7, code/src 0.00
      Commands 1:
          1: pc 0-15, src 1-40
      Command 1: "scan 192.168.101.20 %d.%d.%d.%d a b c d "
        (0) push1 0     # "scan"
        (2) push1 1     # "192.168.101.20"
        (4) push1 2     # "%d.%d.%d.%d"
        (6) push1 3     # "a"
        (8) push1 4     # "b"
        (10) push1 5    # "c"
        (12) push1 6    # "d"
        (14) invokeStk1 7
        (16) done

However, if you look at the timing results for these examples, they are very different.

    % time { regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} 192.168.101.20 matched a b c d } 100000
    11.29749 microseconds per iteration
    % time { regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ a b c d } 100000
    7.78696 microseconds per iteration
    % time { scan 192.168.101.20 %d.%d.%d.%d a b c d } 100000
    1.03708 microseconds per iteration

Why is that? Bytecode is a good indicator, but it doesn't account for the inherent speed of the commands being invoked. All three versions compile to little more than a series of literal pushes and a single invokeStk1, so nearly all of the cost lives inside the invoked command. Regex is a comparatively slow operation. And within the regex engine, the second example is a simpler regex to evaluate, so it's faster (though less accurate, so make sure you are actually passing an IP address). Then, of course, scan shows off its optimized self in grand fashion.

Was this a useful exercise in understanding Tcl under the hood? Drop some feedback in the comments if you'd like more tech tips like this that don't directly cover a product feature or solution, but reveal some utility that helps you learn how things tick.

Unusually high CPU cycles on iRule
Hi,

So I am using the following iRule on all virtual servers and just enabled timing on it:

    priority 50
    when RULE_INIT {
        set static::maindatalist
        set static::debug 0
    }
    when HTTP_REQUEST {
        # return the company logo from ifile
        if { ([HTTP::uri] eq "/bip-company-logo.gif") } {
            HTTP::respond 200 content [ifile get /Common/company-logo]
            return
        }
        set vs [virtual name]
        if {($static::debug == 1)} { log local0. "VS is $vs, looking at $static::maindatalist" }
        set status [class match -value "[virtual name]" equals "maintenance-list"]
        if {($static::debug == 1)} { log local0. "Looked up status: $status" }
        if { ($status eq "1") } {
            if { ([HTTP::uri] eq "/favicon.ico") } {
                HTTP::respond 404 content ""
                return
            }
            set contact "Application Support () or your Regional Helpdesk"
            if {([virtual name] starts_with "/ITSS/")} {
                set contact "your Regional Helpdesk"
            }
            if {($static::debug == 1)} { log local0. "Showing maintenance page" }
            HTTP::respond 200 content " Down for maintenance - company Maintenance This application is currently undergoing maintenance. It should be available again within the specified time period. For any questions, please contact $contact. " Connection Close
            return
        }
    }
    when LB_FAILED {
        set contact "Application Support () or your Regional Helpdesk"
        if {([virtual name] starts_with "/ITSS/")} {
            set contact "your Regional Helpdesk"
        }
        HTTP::respond 503 content " Application Unavailable - company Application unavailable This application is currently not available. Please contact $contact. " Connection Close
        return
    }

Timing results (378 requests):

    RULE_INIT:    5900 min, 12300 avg, 12300 max
    LB_FAILED:    46000 min, 60200 avg, 167000 max
    HTTP_REQUEST: 21300 min, 44000 avg, 384900 max

I used the F5DevCentral_iRulesRuntimeCalculator to calculate the CPU usage, which comes out to 28% max CPU usage per request. This seems VERY high. Is there anything I can do about this?
I tested with a healthy node & a disabled node, no maintenance mode.

iRules Optimization for MAC Filtering with Data Groups (If/else)
Hello Everyone,

I'm trying to figure out an optimized version of the following (currently working) iRules, which use Machine Info to validate the incoming MAC addresses of different customers against a BIG-IP APM Access Policy. The iRules have been validated in versions 12.1 and 13. Any advice/recommendation will be welcome.

Here we have an example of the LTM Data Groups deployed:

    pedro.haoa@(f5chile)(cfg-sync In Sync)(Active)(/Common)(tmos)# list ltm data-group one-line
    ltm data-group internal MACGRP_1001_external_chile { records { F4:15:63:11:22:33 { } F4:15:63:11:22:34 { } F4:15:63:11:22:35 { } } type string }
    .
    .(Output Omitted)
    .
    ltm data-group internal MACGRP_1370_external_chile { records { F4:15:63:44:55:66 { } F4:15:63:44:55:67 { } F4:15:63:44:55:68 { } } type string }
    .
    .(Output Omitted)
    .
    ltm data-group internal MACGRP_2001_external_bolivia { records { 00:23:E9:22:33:44 { } 00:23:E9:22:33:44 { } 00:23:E9:22:33:44 { } } type string }
    .
    .(Output Omitted)
    .
    ltm data-group internal MACGRP_2350_external_bolivia { records { 00:23:E9:55:66:77 { } 00:23:E9:55:66:78 { } 00:23:E9:55:66:79 { } } type string }
    .
    .(Output Omitted)
    .

And here we have two iRules to validate more than 700 different Data Groups:

    # BIG-IP APM Event
    when ACCESS_POLICY_AGENT_EVENT priority 410 {
        # Access Policy Branch Filter
        if { [ACCESS::policy agent_id] eq "macgrp" } {
            # Variables for LAN/WLAN Interfaces
            set mac0 [ACCESS::session data get "session.machine_info.last.net_adapter.list.\[0\].mac_address"]
            set mac1 [ACCESS::session data get "session.machine_info.last.net_adapter.list.\[1\].mac_address"]
            # Variable to reduce data along the iRule due to the 64k limit.
            set s session.logon.custom.macgrp
            # if/else statements to validate the MAC addresses contained within each data group
            if {[class match $mac0 eq MACGRP_1001_external_chile]||[class match $mac1 eq MACGRP_1001_external_chile]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_1002_external_chile]||[class match $mac1 eq MACGRP_1002_external_chile]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_1003_external_chile]||[class match $mac1 eq MACGRP_1003_external_chile]} {
                ACCESS::session data set $s 1
            .
            .(Output Omitted)
            .
            } elseif {[class match $mac0 eq MACGRP_1369_external_chile]||[class match $mac1 eq MACGRP_1369_external_chile]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_1370_external_chile]||[class match $mac1 eq MACGRP_1370_external_chile]} {
                ACCESS::session data set $s 1
            }
        }
    }

Second iRule (Split mode):

    # BIG-IP APM Event
    when ACCESS_POLICY_AGENT_EVENT priority 420 {
        # Access Policy Branch Filter
        if { [ACCESS::policy agent_id] eq "macgrp" } {
            # Variables for LAN/WLAN Interfaces
            set mac0 [ACCESS::session data get "session.machine_info.last.net_adapter.list.\[0\].mac_address"]
            set mac1 [ACCESS::session data get "session.machine_info.last.net_adapter.list.\[1\].mac_address"]
            # Variable to reduce data along the iRule due to the 64k limit.
            set s session.logon.custom.macgrp
            # if/else statements to validate the MAC addresses contained within each data group
            if {[class match $mac0 eq MACGRP_2001_external_bolivia]||[class match $mac1 eq MACGRP_2001_external_bolivia]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_2002_external_bolivia]||[class match $mac1 eq MACGRP_2002_external_bolivia]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_2003_external_bolivia]||[class match $mac1 eq MACGRP_2003_external_bolivia]} {
                ACCESS::session data set $s 1
            .
            .(Output Omitted)
            .
            } elseif {[class match $mac0 eq MACGRP_2349_external_bolivia]||[class match $mac1 eq MACGRP_2349_external_bolivia]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACGRP_2350_external_bolivia]||[class match $mac1 eq MACGRP_2350_external_bolivia]} {
                ACCESS::session data set $s 1
            } elseif {[class match $mac0 eq MACADM_CHECK]||[class match $mac1 eq MACADM_CHECK]} {
                ACCESS::session data set $s 1
            }
        }
    }

Cheers!

AAM iSession replacement?
Hi,

I am aware per this that the AAM product suite is going away, and is not even available on current hardware models. I am currently using iSessions to create optimized tunnels for some TCP traffic between sites, which is then secured with SSL profiles. In particular, deduplication and compression are what we mainly take advantage of.

My question is, is this functionality going to ship in BIG-IP eventually? Or is iSession/Dedup/Compression simply going away?

Thanks,
Bryan

Selective Compression on BIG-IP
BIG-IP provides Local Traffic Policies that simplify the way you manage traffic associated with a virtual server. You can use a BIG-IP local traffic policy to selectively compress the types of content that benefit from compression, like HTML, XML, and CSS stylesheets. Compressing these file types can improve performance, especially across slow connections, and you can easily configure your BIG-IP system with a simple Local Traffic Policy that does so. To use a policy in BIG-IP v12, you create and configure a draft policy, publish that policy, and then associate it with a virtual server.

Alright, let’s log into a BIG-IP. The first thing you’ll need to do is create a draft policy. On the main menu, select Local Traffic > Policies > Policy List and then click the Create (+) button. This takes us to the policy creation screen. We’ll name the policy SelectiveCompression, add a description like ‘This policy compresses file types,’ and leave the Strategy at the default of Execute First matching rule, so the policy uses the first rule that matches the request. Click Create Policy, which saves the policy to the policies list.

When saved, the Rules search field appears but has no rules. Click Create under Rules. This brings us to the General Properties area of the rule. We’ll give this rule a name (CompressFiles), and then the first settings we need to configure are the conditions that must match. Click the + button to add a condition. The files we want to compress are identified by their Content-Type HTTP header, so we choose HTTP Header and select Content-Type in the Named field. Next, select ‘begins with’, type ‘text/’ for the value, and set the condition to be evaluated at ‘response’ time. We’ll add another condition to manage CPU usage effectively.
So we select CPU Usage from the list, with a duration of 1 minute, a conditional operator of ‘less than or equal to’, and a usage level of 5, again at response time. Next, under Do the following, click the create (+) button to add an action to take when those conditions are met. Here, we’ll enable compression at response time. Click Save. Now the draft policy screen appears with the General Properties and a list of rules. Here we want to click Save Draft.

Now we need to publish the draft policy and associate it with a virtual server. Select the policy and click Publish. Next, on the main menu click Local Traffic > Virtual Servers > Virtual Server List and click the name of the virtual server you’d like to associate with the policy. On the menu bar click Resources, and for Policies click Manage. Move SelectiveCompression to the Enabled list and click Finished. The SelectiveCompression policy now appears in the virtual server’s list of policies, and the virtual server will compress the content types you specified whenever the CPU condition is met. Congrats! You’ve now added a local traffic policy for selective compression! You can also watch the full video demo, thanks to our TechPubs team.

ps

Optimizing the CVE-2015-1635 iRule
A couple days ago an iRule was published that mitigates Microsoft’s HTTP.sys vulnerability described in CVE-2015-1635 and MS15-034. It’s a short rule, but it features the dreaded regex. Every time I get the chance at user groups, conferences, and webinars, I preach the evils of regex on the data plane. One of the many reasons it is not the best choice is that just calling the regex engine is the equivalent of an 8x operation, let alone actually performing the matching. So I decided to look into some optimizations.

Original Rule

The rule as published is pretty simple. And actually, the regex is clean and the string it’s matching against is not long, so although the instantiation penalty is still high, the matching penalty should not be, and this isn’t a terrible use of regex. I’ve cleaned up the rule to test only the regex itself, setting the malicious header to a variable in CLIENT_ACCEPTED so I can just slam the box with an ab request (ab -n 5000 http://testvip/) from the BIG-IP CLI.

    when CLIENT_ACCEPTED {
        set x "GET / HTTP/1.1\r\nHost: stuff\r\nRange: bytes=0-18446744073709551615\r\n\r\n"
    }
    when HTTP_REQUEST {
        if { $x matches_regex {bytes\s*=.*[0-9]{10}} } {
            HTTP::respond 200 ok
        }
    }

With this modified version, over 5000 requests my average CPU cycles landed at 45.4K on a BIG-IP VE on TMOS 11.6 running in VMware Fusion 5. Not bad at all, regex or not.

My Optimized Rule

My approach was to use scan to pull out the appropriate fields, then string commands to match the bytes and to check for a number with 10 or more digits. For a scan deep dive, see my scan revisited article in the iRules 101 series.

    when CLIENT_ACCEPTED {
        set x "GET / HTTP/1.1\r\nHost: stuff\r\nRange: bytes=0-18446744073709551615\r\n\r\n"
    }
    when HTTP_REQUEST {
        # "0" is a throwaway variable name for fields we don't need;
        # a gets the range unit (" bytes"), b gets the digits after the "-"
        scan $x {%[^:]:%[^:]:%[^=]=%[^-]-%[0-9]} 0 0 a 0 b
        if { ($a eq " bytes") && ([string length $b] > 9) } {
            HTTP::respond 200 ok
        }
    }

This is definitely faster at 42.6K average CPU cycles, but not nearly as much of an improvement as I had anticipated. Why?
Well, scan is doing a lot, and then I’m using the ends_with operator, which is equivalent to a string match, and then also performing a string length operation and finally a comparison operation. So although it is a little faster, it performs more operations than it needs to, mostly because I wasn’t clever enough to make a better string match.

Another, Way Better Optimized Rule

Which leads us to the current course leader…who I’ll leave unnamed for now, but you all know him (or her) very well in these parts. Back to a better string match. I love that 6+ years in, I’m still learning new ways to manipulate the string commands in Tcl. I suppose if I really RTFM’d the Tcl docs a little better I’d know this, but I had no idea you could match quite the way this rule does.

    when CLIENT_ACCEPTED {
        set x "GET / HTTP/1.1\r\nHost: stuff\r\nRange: bytes=0-18446744073709551615\r\n\r\n"
    }
    when HTTP_REQUEST {
        # ten consecutive [0-9] classes require a run of at least ten digits after the "="
        if { [string match -nocase {*bytes*=*[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*} $x] } {
            HTTP::respond 200 ok
        }
    }

So there you go. One simple string command. And it comes in at 31.3K average CPU cycles, roughly 31% fewer than the original regex rule. That’s a significant reduction.

(Almost Always) A Better Way

Sometimes regex is the only way. But it’s a rare day when that is the case. There is almost always a better way than regex when it comes to iRules. And since speed is the name of the game on the data plane, I highly encourage all you iRulers out there not to be like me: get yourself deep into the weeds of the Tcl string commands and arm yourself with some seriously powerful weapons. You’ve been charged. Go forth!

True or False: Application acceleration solutions teach developers to write inefficient code
It has been suggested that the use of application acceleration solutions as a means to improve application performance would result in programmers writing less efficient code. In a comment on “The House that Load Balancing Built” a reader replies:

Not only will it cause the application to grow in cost and complexity, it's teaching new and old programmers to not write efficient code and rely on other products and services on [sic] thier behalf. I.E. Why write security into the app, when the ADC can do that for me. Why write code that executes faster, the ADC will do that for me, etc., etc.

While no one can control whether a programmer writes “fast” code, the truth is that application acceleration solutions do not affect the execution of code in any way. A poorly constructed loop will run just as slowly with or without an application acceleration solution in place. Complex mathematical calculations will execute at the same speed regardless of the external systems that may be in place to help improve application performance. The answer is, unequivocally, that the presence or absence of an application acceleration solution should have no impact on the application developer, because it does nothing to affect the internal execution of written code. If you answered false, you got the answer right.

The question, then, is just what does an application acceleration solution do that improves performance? If it isn’t making the application logic execute faster, what’s the point? It’s a good question, and one that deserves an answer. Application acceleration is part of a solution we call “application delivery”.
Application delivery focuses on improving application performance through optimizing the use and behavior of transport (TCP) and application transport (HTTP/S) protocols, offloading from the application certain functions that are more efficiently handled by an external, often hardware-based, system, and accelerating the delivery of the application data.

OPTIMIZATION

Application acceleration improves performance by understanding how these protocols (TCP, HTTP/S) interact across a WAN or LAN and acting on that understanding to improve overall performance. There are a large number of performance-enhancing RFCs (standards) around TCP that are usually implemented by application acceleration solutions:

Delayed and Selective Acknowledgments (RFC 2018)
Explicit Congestion Notification (RFC 3168)
Limited and Fast Re-Transmits (RFC 3042 and RFC 2582)
Adaptive Initial Congestion Windows (RFC 3390)
Slow Start with Congestion Avoidance (RFC 2581)
TCP Slow Start (RFC 3390)
TimeStamps and Windows Scaling (RFC 1323)

All of these RFCs deal with TCP and therefore have very little to do with the code developers create. Most developers code within a framework that hides the details of TCP and HTTP connection management from them. It is the rare programmer today who writes code to interact directly with HTTP connections, and rarer still to find one coding directly at the TCP socket layer. The execution of code written by the developer takes just as long whether or not these RFCs are implemented. The application acceleration solution improves the performance of the delivery of the application data over TCP and HTTP, which increases the performance of the application as seen from the user’s point of view.

OFFLOAD

Offloading compute-intensive processing from application and web servers improves performance by reducing the CPU and memory consumed on those servers to perform those tasks.
SSL and other encryption/decryption functions (cookie security, for example) are computationally expensive and require additional CPU and memory on the server. Offloading these functions to an application delivery controller or stand-alone application acceleration solution improves application performance because it frees the server's CPU and memory to be dedicated to the application. If the application or web server does not need to perform these tasks, it saves the CPU cycles that would otherwise be spent on them. Those cycles can be used by the application instead, which increases its performance.

Also beneficial is the way in which application delivery controllers manage TCP connections made to the web or application server. Opening and closing TCP connections takes time, and the time required is not something a developer, coding within a framework, can affect. Application acceleration solutions proxy connections for the client, reducing both the number of TCP connections required on the web or application server and how often those connections need to be opened and closed. Reducing the number and frequency of connections increases application performance because the server spends less time opening and closing TCP connections, which are necessarily part of the performance equation but not directly affected by anything the developer does in his or her code.

The commenter believes that an application delivery controller implementation should be an afterthought. However, the ability of modern application delivery controllers to offload certain application logic functions, such as cookie security and HTTP header manipulation, in a centralized, optimized manner through network-side scripting can be a performance benefit as well as a way to address browser-specific quirks, and therefore should be seriously considered during the development process.
ACCELERATION

Finally, application acceleration solutions improve performance through the use of caching and compression technologies. Caching includes not just server-side caching, but the intelligent use of the client (usually the browser) cache to reduce the number of requests that must be handled by the server. By reducing the number of requests the server responds to, the web or application server is less burdened with managing TCP and HTTP sessions and state, and has more CPU cycles and memory to dedicate to executing the application.

Compression, whether using traditional industry-standard web compression (GZip) or WAN-focused data de-duplication techniques, decreases the amount of data that must be transferred from the server to the client. Decreasing traffic (bandwidth) results in fewer packets traversing the network, which results in quicker delivery to the user. This makes it appear that the application is performing faster than it is, simply because the response arrived sooner.

Of all these techniques, the only one that could possibly contribute to the delinquency of developers is caching. This is because application acceleration caching features act on HTTP caching headers that can be set by the developer, but rarely are. These headers can also be configured by the web or application server administrator, but rarely in a way that makes sense, because most content today is generated dynamically, even though individual components inside the dynamically generated page may in fact be very static (CSS, JavaScript, images, headers, footers, etc.). However, the methods through which caching (pragma) headers are set are fairly standard, and the actual code is usually handled by the framework in which the application is developed, meaning the developer ultimately cannot affect the efficiency of this method because it was developed by someone else. The point of the comment was likely broader, however.
I am fairly certain that the commenter meant to imply that if developers know the performance of the application they are developing will be accelerated by an external solution, they will not be as concerned about writing efficient code. That’s a layer 8 (people) problem that isn’t peculiar to application delivery solutions at all. If a developer is going to write inefficient code, there’s a problem, but that problem isn’t with the solutions implemented to improve the end-user experience or scalability; it’s a problem with the developer. No technology can fix that.
The Image Format Wars
In the late 1990s we experienced the browser wars between Internet Explorer and Netscape Navigator; more recently, the battle has been waged between IE, Firefox, Chrome, Safari and Opera. Now be prepared for the next battle: the image format wars. Browsers are now battling over more optimized image formats, with Microsoft, Google, and Mozilla all introducing new ones. Before we go into these new formats, though, let’s look at why this is emerging and the history behind web-based images. The most popular image formats on the web today are GIF, JPEG and PNG. The oldest is GIF, introduced in 1987, but its popularity is waning in favor of formats with more color options and better compression algorithms, like JPEG (introduced in the early 90s) and PNG (introduced in 1995). These image formats have worked for many years, so why the sudden interest in introducing new file formats? If it ain’t broke, don’t fix it. Well, it’s not that the image formats are broken; the issue is how they impact page performance. The size of the average web page is growing rapidly, and the largest contributor to that growth is images. According to the HTTP Archive, images make up 65% of a typical web page’s content, totaling just over 1MB. This presents a huge opportunity for improving the web browsing experience. Doesn’t everybody want faster-loading web pages and lower data usage on their smartphones? This has led to the search for a way to optimize images without sacrificing quality, which brings us to the new image formats and encoding algorithms that are in contention to improve the performance of images on the web. Google introduced WebP, Microsoft introduced JPEG XR in IE 9, and Mozilla announced earlier this year a new project entitled mozjpeg. Let’s start with WebP and JPEG XR, both of which provide significant savings when it comes to image size.
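Serving WebP or JPEG XR only to browsers that can display them means choosing a format per request. The Python sketch below shows one way to do that with HTTP content negotiation, assuming the client lists the image types it supports in its `Accept` request header. `choose_image_format` is a hypothetical helper, the `image/jxr` token for JPEG XR is an assumption that may not match what every IE version sends, and this is not a description of how any particular product makes the decision.

```python
from typing import Set

# Preference order for converted formats. "image/webp" is the WebP media type;
# "image/jxr" for JPEG XR is an assumption for this sketch.
CANDIDATES = [
    ("image/webp", "webp"),
    ("image/jxr", "jpegxr"),
]


def parse_accept(accept: str) -> Set[str]:
    """Return the media types in an Accept header that have a nonzero q-value."""
    accepted = set()
    for part in accept.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type and quality > 0:
            accepted.add(media_type)
    return accepted


def choose_image_format(accept: str, original: str = "jpeg") -> str:
    """Pick the first candidate format the client explicitly accepts.

    Wildcards such as */* are deliberately not treated as support, because
    browsers send them even when they cannot decode the newer formats.
    """
    offered = parse_accept(accept)
    for media_type, fmt in CANDIDATES:
        if media_type in offered:
            return fmt
    return original


if __name__ == "__main__":
    print(choose_image_format("image/webp,*/*;q=0.8"))               # webp
    print(choose_image_format("image/png,image/*;q=0.8,*/*;q=0.5"))  # jpeg
```

Because the same URL can now return different bytes to different browsers, such a response should also carry `Vary: Accept` so that shared caches do not hand a WebP image to a browser that cannot display it.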
We did some testing recently to see how much improvement can be achieved with WebP and JPEG XR. The test consisted of 417 JPEG images and 50 PNG images. The results showed that both formats provided substantial benefits, reducing file size by over 50%. If the results are this good, why aren’t more sites using these formats? The answer: browser compatibility. WebP only works in Chrome and Opera, while JPEG XR only works in IE. If you want your users to see the benefits of these new formats, you need to create 3 versions of every image, which can be time-consuming. Luckily, the image optimization functionality in BIG-IP Application Acceleration Manager lets an administrator enable, with a single check box, conversion of each image to the appropriate format on a per-client basis. This eliminates the need to create multiple versions of the same object, and all users can benefit from the optimizations. No, wait: not all users. Only clients connecting with certain versions of IE, Chrome and Opera will benefit from the new formats. This is where mozjpeg is looking to change the game. Mozjpeg’s goal is to improve the compressibility of JPEGs without sacrificing quality. The beauty of this project is that the file format would remain JPEG, meaning that every browser and user would be able to see the benefits of reduced image sizes. It will be interesting to track the progress of this project and see who wins the image wars; in the meantime, optimize the images you can for the users that you can.
F5 Synthesis: Trickle Up Performance
#sdas #webperf
New TCP algorithms and protocols in a platform net improved performance for apps
The original RFC for TCP (793) was written in September of 1981. Let's pause for a moment and reflect on that date. When TCP was introduced, applications on "smart" devices were relegated to science fiction, and the use of the Internet by your average consumer was still more than a decade away. Yet TCP remains, like IP, uncontested as "the" transport protocol of ... everything. New application architectures, networks, and devices all introduce challenges with respect to how TCP behaves. Over the years it's become necessary to tweak the protocol with new algorithms designed to address everything from congestion control to window size control to key management for the TCP MD5 signature option to header compression. There are, in fact, over 100 (that's where I stopped counting) TCP-related RFCs on the digital books. The most recent addition is RFC 6897, one of the specifications for Multipath TCP (MPTCP). I wouldn't take any bets against there being more in the future, if I were you. What most of these TCP-related RFCs have in common is that they attempt to address some issue that's holding application performance hostage. Congestion, for example, is a huge network issue that can impact TCP performance (and subsequently the performance of applications) in a very negative way. Congestion can cause lost packets and, for TCP at least, that often means retransmission. You see, TCP is very particular about the order in which it receives packets, and it is also designed to be reliable. That means if a packet is lost, you're going to hear about it (or, more accurately, your infrastructure will hear about it and need to resend it). All those retransmitted packets trying to traverse an already congested network... well, you can imagine it isn't exactly a good thing. So there are a variety of congestion control algorithms designed to better manage TCP in such situations.
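Loss-based congestion control algorithms share a basic shape: grow the congestion window while data gets through, and cut it back when a loss signals congestion. The Python sketch below traces a heavily simplified Reno-style window (slow start, additive increase, multiplicative decrease) in units of segments, one step per round trip. It is a teaching model under those assumptions, not the behavior of any real TCP stack or of F5's algorithms; for example, it does not model retransmission timeouts.

```python
from typing import Iterable, List


def simulate_cwnd(events: Iterable[str], initial_ssthresh: int = 64) -> List[int]:
    """Trace a simplified Reno-style congestion window, in segments.

    Each event is one round trip: "ack" means the whole window was
    acknowledged; "loss" means a loss was detected via duplicate ACKs.
    """
    cwnd, ssthresh = 1, initial_ssthresh
    trace = [cwnd]
    for event in events:
        if event == "loss":
            # Multiplicative decrease: halve the window and continue from there.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif event == "ack":
            if cwnd < ssthresh:
                # Slow start: the window doubles each round trip (capped here).
                cwnd = min(cwnd * 2, ssthresh)
            else:
                # Congestion avoidance: additive increase of one segment per RTT.
                cwnd += 1
        else:
            raise ValueError(f"unknown event: {event!r}")
        trace.append(cwnd)
    return trace


if __name__ == "__main__":
    events = ["ack"] * 4 + ["loss", "ack"]
    print(simulate_cwnd(events, initial_ssthresh=8))  # [1, 2, 4, 8, 9, 4, 5]
```

The sawtooth in the trace is the point: each loss costs half the window, which is why retransmissions on an already congested network hurt throughput so much, and why newer algorithms experiment with other signals such as latency.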
From TCP Reno to Vegas to Illinois to H-TCP, congestion control algorithms are the most common way in which network stacks in general deal with congestion. The important thing to remember is that performance trickles up: improvements in the TCP stack benefit the layers that reside above it, like the application.
F5 Synthesis: Faster Platforms Means Faster Apps
The F5 platforms that comprise the Synthesis High Performance Services Fabric are no different. They implement the vast majority of the available congestion control algorithms to ensure the fastest TCP stack we can offer. In the latest release of our platforms, we also added an F5-created algorithm, TCP Woodside, designed to use both loss-based and latency-based algorithms to improve, in particular, the performance of applications operating over mobile networks. Because it is implemented at the TCP (platform) layer, all applications receive the benefit of improved performance, but it's particularly noticeable in mobile environments because of the differences inherent in mobile versus fixed networks.
Also new in our latest platforms is support for MPTCP, another potential boon for mobile application users. MPTCP is designed to improve performance by enabling the use of multiple TCP subflows over a single TCP connection. Messages can then be dynamically routed across those subflows. For web applications, this can result in significantly faster retrieval of the more than 90 objects that comprise the average web page today.
Synthesis 1.5 comprises a variety of new performance and security-related features that improve the platforms that make up its High Performance Services Fabric. What that ultimately means for customers and users alike is faster apps.
For more information on Synthesis:
F5 Synthesis Site
More DevCentral articles on F5 Synthesis
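MPTCP leaves the choice of which subflow carries each segment to a scheduler. One common strategy is to send on the subflow with the lowest round-trip time that still has room in its congestion window; the Python sketch below illustrates that strategy. The `Subflow` fields and the `schedule` function are hypothetical simplifications, and this is not a description of how F5's MPTCP support schedules traffic.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Subflow:
    name: str
    srtt_ms: float      # smoothed round-trip time estimate
    cwnd: int           # congestion window, in segments
    in_flight: int = 0  # segments sent but not yet acknowledged

    def has_space(self) -> bool:
        return self.in_flight < self.cwnd


def schedule(segments: Iterable[int], subflows: List[Subflow]) -> List[Tuple[int, str]]:
    """Assign each segment to the lowest-RTT subflow with congestion-window space."""
    assignment = []
    for seg in segments:
        available = [sf for sf in subflows if sf.has_space()]
        if not available:
            # Every window is full; a real stack would wait for ACKs here.
            break
        best = min(available, key=lambda sf: sf.srtt_ms)
        best.in_flight += 1
        assignment.append((seg, best.name))
    return assignment


if __name__ == "__main__":
    wifi = Subflow("wifi", srtt_ms=20.0, cwnd=2)
    lte = Subflow("lte", srtt_ms=60.0, cwnd=3)
    print(schedule(range(6), [wifi, lte]))
```

Once the low-latency Wi-Fi subflow's window fills, traffic spills over to the LTE subflow, so both paths carry data at the same time; the sixth segment waits because both windows are full.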