Advanced iRules: An Abstract View of iRules with the Tcl Bytecode Disassembler
In case you didn't already know, I'm a child of the 70's. As such, my formative years were in the 80's, when the music and movies were synthesized and super cheesy. Short Circuit was one of those cheesy movies, featuring Ally Sheedy, Steve Guttenberg, and Johnny Five, the tank-treaded, laser-wielding robot with feelings and self-awareness. Oh yeah! The plot...doesn't at all matter, but Johnny's big fear once reaching self-actualization was being disassembled. Well, in this article, we won't disassemble Johnny Five, but we will take a look at disassembling some Tcl code and talk about optimizations.
Tcl forms the foundation of several code environments on BIG-IP: iRules, iCall, tmsh, and iApps. The latter environments don't carry the performance burden that iRules do, so efficiency isn't as big a concern. When we speak at conferences, we often commit some time to covering code optimization techniques because of the impact an iRule can have on live traffic. This isn't to say that the system isn't highly tuned and optimized already; it's just important not to introduce any more impact than is absolutely necessary to carry out your purpose.
In iRules, you can turn timing on to see the impact of an iRule, and in the Tcl shell (tclsh) you can use the time command. These are ultimately the best tools to see what the impact is going to be from a performance perspective. But if you want to see what the Tcl interpreter is actually doing from an instruction standpoint, well, you will need to disassemble the code.
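For reference, here's roughly what the iRule side looks like. The rule below is just a trivial sketch (the pool name is a placeholder, not from any real configuration); with timing enabled, per-event cycle counts show up in the iRule statistics. The tclsh time command is demonstrated a little further down.
# Sketch only: enable timing for this iRule so per-event CPU cycle
# counts appear in the rule statistics, then do some trivial work.
timing on
when HTTP_REQUEST {
    if { [HTTP::uri] starts_with "/app" } {
        pool app_pool   ;# hypothetical pool name
    }
}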
I've looked at bytecode in some of the python scripts I've written, but I wasn't aware of a way to do that in Tcl. I found a thread on Stack Overflow that indicated it was possible, and after probing a little further, I was given a solution. This doesn't work in Tcl 8.4, which is what the BIG-IP uses, but it does work on 8.5+, so if you have a Linux box with 8.5 or later you're good to go. Note that there are variances from version to version that could absolutely change the way the interpreter works, so understand that this is just an exercise in discovery.
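If you're not sure what a given machine is running, tclsh will tell you; the version shown below is just an example, and anything 8.5 or later will do.
% info patchlevel
8.6.8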
Solution 1
Fire up tclsh and then grab a piece of code. For simplicity, I'll use two forms of a simple math problem. The first is using the expr command to evaluate 3 times 4, and the second is the same math problem, but wraps the evaluation with curly brackets. The command that will show how the interpreter works its magic is tcl::unsupported::disassemble.
##
## unwrapped expression ##
##
% tcl::unsupported::disassemble script { expr 3 * 4 }
ByteCode 0x0x1e1ee20, refCt 1, epoch 16, interp 0x0x1d59670 (epoch 16)
Source " expr 3 * 4 "
Cmds 1, src 12, inst 14, litObjs 4, aux 0, stkDepth 5, code/src 0.00
Commands 1:
1: pc 0-12, src 1-11
Command 1: "expr 3 * 4 "
(0) push1 0 # "3"
(2) push1 1 # " "
(4) push1 2 # "*"
(6) push1 1 # " "
(8) push1 3 # "4"
(10) concat1 5
(12) exprStk
(13) done
##
## wrapped expression ##
##
% tcl::unsupported::disassemble script { expr { 3 * 4 } }
ByteCode 0x0x1de7a40, refCt 1, epoch 16, interp 0x0x1d59670 (epoch 16)
Source " expr { 3 * 4 } "
Cmds 1, src 16, inst 3, litObjs 1, aux 0, stkDepth 1, code/src 0.00
Commands 1:
1: pc 0-1, src 1-15
Command 1: "expr { 3 * 4 } "
(0) push1 0 # "12"
(2) done
Because the first expression is unwrapped, the interpreter has to build the expression at runtime and then call the runtime expression engine, resulting in 4 objects and a stack depth of 5. With the wrapped expression, the interpreter found a compile-time constant and used it directly, resulting in 1 object and a stack depth of 1. Many thanks to Donal Fellows on Stack Overflow for the details.
Using the time command in the shell, you can see that wrapping the expression makes the evaluation dramatically faster, roughly 13x in this trivial case.
% time { expr 3 * 4 } 100000
1.02325 microseconds per iteration
% time { expr {3*4} } 100000
0.07945 microseconds per iteration
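The payoff isn't limited to constant folding, either. The commands below aren't from the original comparison, but they make an easy follow-on experiment with variables in play: the unbraced form still concatenates a string and hands it to the runtime expression engine (exprStk) on every evaluation, while the braced form compiles down to variable loads and a single multiply instruction (and avoids the double substitution that makes unbraced expr a potential injection risk as well).
% set x 3
% set y 4
% tcl::unsupported::disassemble script { expr $x * $y }
% tcl::unsupported::disassemble script { expr { $x * $y } }
% time { expr $x * $y } 100000
% time { expr { $x * $y } } 100000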
Solution 2
I was looking in earnest for some explanatory information on the bytecode fields displayed by tcl::unsupported::disassemble, and came across a couple of pages on the Tcl wiki, one building on the other. Combining the pertinent sections of code from each page results in this script, which you can paste into tclsh:
namespace eval tcl::unsupported {namespace export assemble}
namespace import tcl::unsupported::assemble
rename assemble asm
interp alias {} disasm {} ::tcl::unsupported::disassemble

proc aproc {name argl body args} {
    proc $name $argl $body
    set res [disasm proc $name]
    if {"-x" in $args} {
        set res [list proc $name $argl [list asm [dis2asm $res]]]
        eval $res
    }
    return $res
}

proc dis2asm body {
    set fstart " push -1; store @p; pop "
    set fstep " incrImm @p +1;load @l;load @p
        listIndex;store @i;pop
        load @l;listLength;lt "
    set res ""
    set wait ""
    set jumptargets {}
    set lines [split $body \n]
    foreach line $lines { ;#-- pass 1: collect jump targets
        if [regexp {\# pc (\d+)} $line -> pc] {lappend jumptargets $pc}
    }
    set lineno 0
    foreach line $lines { ;#-- pass 2: do the rest
        incr lineno
        set line [string trim $line]
        if {$line eq ""} continue
        set code ""
        if {[regexp {slot (\d+), (.+)} $line -> number descr]} {
            set slot($number) $descr
        } elseif {[regexp {data=.+loop=%v(\d+)} $line -> ptr]} {
            #got ptr, carry on
        } elseif {[regexp {it%v(\d+).+\[%v(\d+)\]} $line -> copy number]} {
            set loopvar [lindex $slot($number) end]
            if {$wait ne ""} {
                set map [list @p $ptr @i $loopvar @l $copy]
                set code [string map $map $fstart]
                append res "\n $code ;# $wait"
                set wait ""
            }
        } elseif {[regexp {^ *\((\d+)\) (.+)} $line -> pc instr]} {
            if {$pc in $jumptargets} {append res "\n label L$pc;"}
            if {[regexp {(.+)#(.+)} $instr -> instr comment]} {
                set arg [list [lindex $comment end]]
                if [string match jump* $instr] {set arg L$arg}
            } else {set arg ""}
            set instr0 [normalize [lindex $instr 0]]
            switch -- $instr0 {
                concat - invokeStk {set arg [lindex $instr end]}
                incrImm            {set arg [list $arg [lindex $instr end]]}
            }
            set code "$instr0 $arg"
            switch -- $instr0 {
                done {
                    if {$lineno < [llength $lines]-2} {
                        set code "jump Done"
                    } else {set code ""}
                }
                startCommand  {set code ""}
                foreach_start {set wait $line; continue}
                foreach_step  {set code [string map $map $fstep]}
            }
            append res "\n [format %-24s $code] ;# $line"
        }
    }
    append res "\n label Done;\n"
    return $res
}

proc normalize instr {
    regsub {\d+$} $instr "" instr ;# strip off trailing length indicator
    set instr [string map {
        loadScalar     load
        nop            ""
        storeScalar    store
        incrScalar1Imm incrImm
    } $instr]
    return $instr
}
Now that the script source is in place, you can test the two expressions we tested in solution 1. The output is very similar; however, there is less diagnostic information to accompany the bytecode instructions. Still, the instructions are consistent between the two solutions. The difference here is that after "building" the proc, you can execute it, as shown below each aproc expression.
% aproc f x { expr 3 * 4 } -x
proc f x {asm {
push 3 ;# (0) push1 0 # "3"
push { } ;# (2) push1 1 # " "
push * ;# (4) push1 2 # "*"
push { } ;# (6) push1 1 # " "
push 4 ;# (8) push1 3 # "4"
concat 5 ;# (10) concat1 5
exprStk ;# (12) exprStk
;# (13) done
label Done;
}}
% f x
12
% aproc f x { expr { 3 * 4 } } -x
proc f x {asm {
push 12 ;# (0) push1 0 # "12"
;# (2) done
label Done;
}}
% f x
12
Deeper Down the Rabbit Hole
Will the internet explode if I switch metaphors from bad 80's movie to literary classic? I guess we'll find out. Simple comparisons are interesting, but now that we're peeling back the layers, let's look at something a little more complicated, like a for loop with a list append.
% tcl::unsupported::disassemble script { for { $x } { $x < 50 } { incr x } { lappend mylist $x } }
ByteCode 0x0x2479d30, refCt 1, epoch 16, interp 0x0x23ef670 (epoch 16)
Source " for { $x } { $x < 50 } { incr x } { lappend mylist $x "
Cmds 4, src 57, inst 43, litObjs 5, aux 0, stkDepth 3, code/src 0.00
Exception ranges 2, depth 1:
0: level 0, loop, pc 8-16, continue 18, break 40
1: level 0, loop, pc 18-30, continue -1, break 40
Commands 4:
1: pc 0-41, src 1-56 2: pc 0-4, src 7-9
3: pc 8-16, src 37-54 4: pc 18-30, src 26-32
Command 1: "for { $x } { $x < 50 } { incr x } { lappend mylist $x }"
Command 2: "$x "
(0) push1 0 # "x"
(2) loadStk
(3) invokeStk1 1
(5) pop
(6) jump1 +26 # pc 32
Command 3: "lappend mylist $x "
(8) push1 1 # "lappend"
(10) push1 2 # "mylist"
(12) push1 0 # "x"
(14) loadStk
(15) invokeStk1 3
(17) pop
Command 4: "incr x "
(18) startCommand +13 1 # next cmd at pc 31
(27) push1 0 # "x"
(29) incrStkImm +1
(31) pop
(32) push1 0 # "x"
(34) loadStk
(35) push1 3 # "50"
(37) lt
(38) jumpTrue1 -30 # pc 8
(40) push1 4 # ""
(42) done
You'll notice that there are four commands in this code: the for loop itself, the evaluation of $x in the initializer, the lappend body, and the loop control with the incr command. There are also a lot more instructions, and the control flow is handled with jumps: after the initializer, jump1 skips ahead to the loop test at pc 32 (load $x, push 50, lt), and jumpTrue1 branches back to the lappend body at pc 8 for as long as the comparison holds, with the body falling through into the incr and then back into the test.
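As a side experiment (not part of the original walkthrough), try wrapping the same loop in a proc before disassembling it. At script level, reads and increments of x go through the generic loadStk and incrStkImm paths you see above, but inside a proc body the compiler resolves local variables to indexed slots, so the same operations show up as the cheaper loadScalar and incrScalar style instructions.
% proc looper {} { for { set x 0 } { $x < 50 } { incr x } { lappend mylist $x } }
% tcl::unsupported::disassemble proc looper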
Wrapping Up
I went through an exercise years ago to see how far I could minimize the Solaris kernel before it stopped working. I personally got down into the twenties before the system was unusable, but I think the record was somewhere south of 15. So...what's the point? Minimal for minimal's sake is not the point. Meeting the functional objectives is job one, but then start tuning. Less is more: fewer objects, less stack depth, less instantiation. Reviewing bytecode is good for that, and it's possible with native Tcl tooling. However, it is still important to test the code's actual performance, because relying on bytecode object counts and stack depth alone is not a good idea. For example, if we look at the bytecode for matching an IP address, there is no discernible difference from Tcl's perspective between the two regexp versions, and very little difference between the regexp versions and the scan example.
% disasm script { regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} 192.168.101.20 _ a b c d }
ByteCode 0x0x24cfd30, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
Source " regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-"
Cmds 1, src 90, inst 19, litObjs 8, aux 0, stkDepth 8, code/src 0.00
Commands 1:
1: pc 0-17, src 1-89
Command 1: "regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9"
(0) push1 0 # "regexp"
(2) push1 1 # "([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})"
(4) push1 2 # "192.168.101.20"
(6) push1 3 # "_"
(8) push1 4 # "a"
(10) push1 5 # "b"
(12) push1 6 # "c"
(14) push1 7 # "d"
(16) invokeStk1 8
(18) done
% disasm script { regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ a b c d }
ByteCode 0x0x24d1730, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
Source " regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _"
Cmds 1, src 64, inst 19, litObjs 8, aux 0, stkDepth 8, code/src 0.00
Commands 1:
1: pc 0-17, src 1-63
Command 1: "regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ "
(0) push1 0 # "regexp"
(2) push1 1 # "^(\d+)\.(\d+)\.(\d+)\.(\d+)$"
(4) push1 2 # "192.168.101.20"
(6) push1 3 # "_"
(8) push1 4 # "a"
(10) push1 5 # "b"
(12) push1 6 # "c"
(14) push1 7 # "d"
(16) invokeStk1 8
(18) done
% disasm script { scan 192.168.101.20 %d.%d.%d.%d a b c d }
ByteCode 0x0x24d1930, refCt 1, epoch 15, interp 0x0x2446670 (epoch 15)
Source " scan 192.168.101.20 %d.%d.%d.%d a b c d "
Cmds 1, src 41, inst 17, litObjs 7, aux 0, stkDepth 7, code/src 0.00
Commands 1:
1: pc 0-15, src 1-40
Command 1: "scan 192.168.101.20 %d.%d.%d.%d a b c d "
(0) push1 0 # "scan"
(2) push1 1 # "192.168.101.20"
(4) push1 2 # "%d.%d.%d.%d"
(6) push1 3 # "a"
(8) push1 4 # "b"
(10) push1 5 # "c"
(12) push1 6 # "d"
(14) invokeStk1 7
(16) done
However, if you look at the time results from these examples, they are very different.
% time { regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} 192.168.101.20 matched a b c d } 100000
11.29749 microseconds per iteration
% time { regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} 192.168.101.20 _ a b c d } 100000
7.78696 microseconds per iteration
% time { scan 192.168.101.20 %d.%d.%d.%d a b c d } 100000
1.03708 microseconds per iteration
Why is that? Well, bytecode is a good indicator, but it doesn't account for the inherent speed of the commands being invoked. Regex is a comparatively slow operation. And within the regex engine, the second example is a simpler pattern to evaluate, so it's faster (though less accurate, so make sure you are actually passing an IP address). Then of course, scan shows off its optimized self in grand fashion.
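If you take the scan route, it's cheap to add back the validation you give up. The proc below is just one possible sketch (the parse_ip name and empty-string return are my choices, not anything from the article): scan returns the number of successful conversions, so anything other than 4, or an octet outside 0-255, means the input wasn't a usable dotted-quad address.
proc parse_ip {ip} {
    # scan returns how many %d conversions succeeded
    if {[scan $ip %d.%d.%d.%d a b c d] != 4} { return "" }
    foreach octet [list $a $b $c $d] {
        # reject anything outside the valid octet range
        if {$octet < 0 || $octet > 255} { return "" }
    }
    return [list $a $b $c $d]
}
It still won't reject trailing garbage the way the anchored regexp does, but it keeps the hot path on the much faster scan command.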
Was this a useful exercise in understanding Tcl under the hood? Drop some feedback in the comments if you'd like more tech tips like this that aren't directly covering a product feature or solution, but reveal some utility that assists in learning how things tick.
- David_Holmes_9: This is the most hardcore thing I've ever seen on DevCentral. I'm still a little lost on parts of it. Can you explain why in 'Solution 2' it performs an eval $res and then a return $res? What is the point of the eval at that stage if it isn't assigned to another variable?
- JRahm: the eval is specific to the -x argument to the aproc call, and I believe it is doing a list expansion (not my code)
- Kevin_Davies_40: I should get you and JD in a room to see what you come up with. The results would be enlightening.