Tech Tip submitted by Dave Porter, F5 Data Solutions
The effectiveness of CIFS virtualization in the ARX can be impaired by slow responses from external systems such as file servers and domain controllers, resulting in poor response to client requests and so-called WORKSLOW messages in syslog. Version 6.01.000 of the ARX adds new aids for diagnosing the underlying cause.
The problem we’re solving here is one of diagnosis: ARX users complain that ‘the ARX is slow’, and a knowledgeable technician must then try to determine the reason. Over time, we have learned that a good first step is to look for WORKSLOW messages in the system log. But then what? In the past, that has required dumping internal data structures, checking various counters, getting network traces, and so on.
In V6.01.000, we’ve attempted to provide a single command that can serve as the first point of diagnosis. Information that was previously hidden in private data structures is now exposed to view.
The new command is ‘show statistics cifs work-queues’, and some example output is shown below. The output is a little on the technical side; there's not really any way to simplify it, since slowdowns will naturally be related to internal structure. In this document, the output is broken into sections for explanation.
The command used for this example was
show statistics cifs work-queues volume-group 2
The overall output structure is: first, information about how the internal work queues are doing, followed by more detailed information that may help you determine where any slowdowns are occurring.
Terminology note: ‘dNAS’ is the name of the control plane volume manager component; ‘vcifs’ is the name of the CIFS subcomponent of dNAS. Both of these names are used in syslog as well as in the present document. The volume manager is involved in more-or-less all CIFS traffic except data reads and writes.
Warning: this example is a composite, synthesized from bits and pieces. The numbers might not be consistent.
First, the work queues (or ‘work lists’; the terms are interchangeable). Vcifs has some number of these, used for different purposes. The formal name and approximate use of each are given. The statistics show the number of items currently queued and the high-water mark on the number of items queued. The number of items completed is also given; whether the CIFS operation succeeded or failed is immaterial here, since a failed access is not a problem in our software (the user might simply have typed the wrong filename). ‘Discarded’ happens in times of stress; discarded items never even got started.
CIFS work queue pool statistics
-------------------------------
Work list "main": used for most client requests
    0 items active or waiting; highest was 4 at 06/01/2011 15:33:49 -0400
    8 items completed, 0 discarded
    Average 13 uSec queued, 1277 uSec processing
    Time waiting in queue:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      33       2      0    1     0      9       2
    Time processing on thread:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      34       1      0    1     0      9       2
Work list "auth": used for client authentication
    0 items active or waiting; highest was 1 at 06/01/2011 15:19:33 -0400
    12 items completed, 0 discarded
    Average 16 uSec queued, 1332 uSec processing
    Time waiting in queue:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      33      11      2    0     0      0       0
    Time processing on thread:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      34      10      2    0     0      0       0
Work list "setup": authentication support for data plane
    0 items active or waiting; highest was 1 at 06/01/2011 15:18:12 -0400
    2 items completed, 0 discarded
    Average 26 uSec queued, 866 uSec processing
    Time waiting in queue:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:       8       0      0    0     0      0       0
    Time processing on thread:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:       8       0      0    0     0      0       0
Work list "special": internal notifications, including client disconnect
    0 items active or waiting; highest was 1 at 06/01/2011 15:33:50 -0400
    4 items completed, 0 discarded
    Average 27 uSec queued, 151 uSec processing
    Time waiting in queue:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      14       0      0    0     0      1       0
    Time processing on thread:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:      14       0      0    0     0      1       0
Work list "no-op": processes keepalives from data plane
    0 items active or waiting; highest was 1 at 06/01/2011 15:33:51 -0400
    57 items completed, 0 discarded
    Average 22 uSec queued, 44 uSec processing
    Time waiting in queue:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:     997      10      0    0     0      0       0
    Time processing on thread:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:    1007       0      0    0     0      0       0
The average times are calculated over ‘completed’ items. Queue time is the time between being placed on the queue and being picked up by a worker thread; this will be non-zero even if worker threads were immediately available. Processing time is the time for which a worker thread was dedicated to this item of work. The vcifs implementation means that a work item ties up a work thread from the time it starts processing until the time it is completed.
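The two timings can be pictured with a minimal Python sketch (illustrative only, not ARX code): each work item records when it was queued, when a worker thread picked it up, and when it finished; queue time and processing time fall out of those three timestamps.

```python
import queue
import threading
import time

class WorkItem:
    """Tracks the two timings reported per work item (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.enqueued_at = time.monotonic()   # placed on the work list
        self.queue_time = None                # enqueue -> picked up by a thread
        self.processing_time = None           # picked up -> completed

work_list = queue.Queue()

def worker():
    while True:
        item = work_list.get()
        if item is None:                      # shutdown sentinel
            break
        started = time.monotonic()
        item.queue_time = started - item.enqueued_at
        time.sleep(0.01)                      # stand-in for the real CIFS work
        item.processing_time = time.monotonic() - started
        work_list.task_done()

t = threading.Thread(target=worker)
t.start()
item = WorkItem("open")
work_list.put(item)
work_list.join()
work_list.put(None)
t.join()
print(f"queued {item.queue_time * 1e6:.0f} uSec, "
      f"processing {item.processing_time * 1e6:.0f} uSec")
```

Note that, as in vcifs, the worker thread is tied up for the whole processing interval; queue time is non-zero even when a thread is immediately available.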
In this example, there were a handful of client requests that could be problematic: 9 requests took 10 to 100 seconds, and 2 requests took more than 100 seconds.
There are histograms showing the distribution of the queue times and processing times for the completed work items; the definitions of queue time and processing time are as above. The histogram scale is logarithmic, ranging from 1 msec to 100 secs.
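The bucketing behind those histograms can be reproduced in a few lines of Python (an illustrative assumption based on the column headings, not the ARX source):

```python
# Bucket boundaries matching the display columns: < 0.001 s up to < 100 s,
# with a final "Longer" bucket for anything past 100 seconds.
BOUNDS = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
LABELS = ["< 0.001", "< 0.01", "< 0.1", "< 1", "< 10", "< 100", "Longer"]

def bucket(seconds):
    """Return the index of the first bucket the sample fits under."""
    for i, bound in enumerate(BOUNDS):
        if seconds < bound:
            return i
    return len(BOUNDS)  # "Longer"

def histogram(samples):
    """Count samples (in seconds) into the seven display buckets."""
    counts = [0] * (len(BOUNDS) + 1)
    for s in samples:
        counts[bucket(s)] += 1
    return counts

# e.g. a 0.5 ms, a 0.5 s, a 12 s, and a 150 s sample:
print(histogram([0.0005, 0.5, 12.0, 150.0]))  # -> [1, 0, 0, 1, 0, 1, 1]
```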
At the end of all the work-list info, there is a table of status messages.
Work queue status messages:
Timestamp (UTC)      WkList  Event  Age(sec)  Operation
2011-04-08 13:43:20  main    DONE   224       CLIENT_REQ
                                              TRANSACTION2/FIND_FIRST2
                                              [in filer wait, 192.168.97.45, 223 sec]
2011-04-08 14:26:26  main    DONE   288       CLIENT_REQ
                                              TRANSACTION2/FIND_FIRST2
                                              [in filer wait, 192.168.97.45, 225 sec]
2011-04-08 14:42:07  main    SLOW   40        REQ_FILE_OPEN
                                              NT_CREATE_ANDX
                                              [in dnas wait, 40 sec]
2011-04-08 14:59:32  main    SLOW   40        TREE_CONNECT
                                              [in dnas wait, 39 sec]
There is a status entry corresponding to the WORKSLOW and WORKSTUCK log messages, with approximately the same information. The ‘event’ is SLOW, STUCK, or DONE; the last occurs when a previously SLOW or STUCK item finally completes.
Thus, a given work item may first show up as SLOW after ~30 secs on a work thread. It will be updated to STUCK at about the 5 minute mark. It may eventually be further updated to DONE, if we're lucky.
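That lifecycle can be summarized as a tiny classifier (a Python sketch; the 30-second and 5-minute thresholds are the approximate figures from the text, not exact ARX constants):

```python
SLOW_AFTER = 30    # ~seconds on a work thread before an item is flagged SLOW
STUCK_AFTER = 300  # ~5 minutes before a SLOW item escalates to STUCK

def classify(age_secs, completed=False):
    """Return the status-table event for a work item of the given age,
    or None if the item never warranted an entry."""
    if completed:
        # Only items that were already SLOW or STUCK get a DONE entry.
        return "DONE" if age_secs >= SLOW_AFTER else None
    if age_secs >= STUCK_AFTER:
        return "STUCK"
    if age_secs >= SLOW_AFTER:
        return "SLOW"
    return None

print(classify(40))                   # -> SLOW
print(classify(400))                  # -> STUCK
print(classify(224, completed=True))  # -> DONE
```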
The timestamp shows the real time at which the entry was last updated (i.e., first became SLOW, or transitioned to STUCK or DONE), and the ‘age’ is the number of seconds the item had been occupying a thread at the time of the update.
The ‘operation’ contains an internal message type indicator on the first line, and where relevant, a CIFS message type on a second line.
For currently active (SLOW/STUCK, not DONE) operations, the current run state and time in that state are shown. The state is the same state that has always been part of the log messages; the time in state is a recent addition to log messages and this display: it removes ambiguity about whether we've really been stuck in one place for most of the operation. In the example, yes, we have been stuck in dNAS all that time. Filer-wait state now shows the IP address of the filer in question.
For DONE events, the displayed run state and time in that state are the worst (longest) experienced by the item in its lifetime. Note that it's the single worst: if a given work item accessed filers twice, taking 35 seconds for one access and 40 seconds for the other, then the DONE event will show a filer-wait time of 40 secs, not 75 secs.
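In other words, the display takes the maximum over the item's state log, not the sum. A one-line Python sketch (hypothetical data, not ARX code) makes the distinction concrete:

```python
def worst_wait(state_log):
    """state_log: list of (state, seconds) stays over the item's lifetime.
    Returns the single longest stay, not the cumulative time per state."""
    return max(state_log, key=lambda entry: entry[1])

# Two filer accesses of 35 and 40 secs, plus a short dNAS wait:
log = [("filer wait", 35), ("dnas wait", 2), ("filer wait", 40)]
print(worst_wait(log))  # -> ('filer wait', 40), not a 75-sec total
```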
The last 20 such events are retained; the table may be cleared on dNAS restart.
Processing Time Per Op Type
This gives you a more detailed breakdown of what's taking so long, as measured at the 'front end' of vcifs.
Average processing time per operation type
------------------------------------------
CIFS SMB                   Count  Avg time (uSec)
Transaction2/FindFirst2        3        170709437
NTCreateAndX                   1         32007162
TreeConnectAndX               16             6783
SessionSetupAndX              30             1644
Negotiate                     16              674
TreeDisconnect                 1               46
Transaction2/QueryFSInfo       2               29
All SMBs                      69          7888467
After you have determined, from the work-queues info described above, that some operations are taking a long time, this helps you figure out whether particular SMB types are taking longer than others. Here, directory enumeration (FindFirst2) is troublesome, as is file lookup and creation (NTCreateAndX).
The processing time is the time for which a given SMB was tying up a single vcifs thread; for each SMB type, the count of those SMBs and the average processing time are both shown. The output is ordered by average processing time (longest first) and entries with zero counts are not displayed.
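The aggregation behind this table can be sketched in Python (hypothetical code, not the ARX implementation): accumulate a count and a total time per SMB type, then report averages sorted longest-first, omitting zero-count entries.

```python
from collections import defaultdict

# smb type -> [count, total uSec]
totals = defaultdict(lambda: [0, 0])

def record(smb, usec):
    """Accumulate one completed SMB's processing time."""
    totals[smb][0] += 1
    totals[smb][1] += usec

def report():
    """Rows of (smb, count, average uSec), longest average first;
    types never seen are simply absent."""
    rows = [(smb, count, total // count)
            for smb, (count, total) in totals.items() if count > 0]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

record("NTCreateAndX", 32007162)
record("TreeDisconnect", 46)
record("TreeDisconnect", 50)
for smb, count, avg in report():
    print(f"{smb:28s} {count:5d} {avg:12d}")
```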
File Server Round Trip Time Per Op Type
For operations that appear to be taking an unusually long time, this helps you determine whether the problem lies with the file server.
Average file server round-trip time per operation type
------------------------------------------------------
CIFS SMB                   Count  RTT (uSec)
Close                          1        1479
Transaction2/FindFirst2        4        1445
SessionSetupAndX              30        1356
Negotiate                     14         860
NTCreateAndX                   1         705
TreeConnectAndX               16         622
TreeDisconnect                 4         352
Transaction2/QueryFSInfo     650         297
All SMBs                     720         368
No response timeouts
No network errors
The average round-trip time is given, per CIFS operation type, for all shares in all volumes serviced by this dNAS instance. The time is measured from when vcifs transmits the request to the time at which it has received the entire response.
Cases of network error or response timeout do not contribute to the average round-trip time, but are instead counted separately.
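A minimal sketch of that bookkeeping (illustrative Python, not ARX code): successful responses feed the average, while timeouts and network errors are tallied separately and never distort it.

```python
class RttStats:
    """Per-op RTT accumulator; errors and timeouts are counted, not averaged."""
    def __init__(self):
        self.count = 0        # completed request/response round trips
        self.total_usec = 0   # sum of their round-trip times
        self.timeouts = 0     # responses that never arrived
        self.net_errors = 0   # network-level failures

    def complete(self, rtt_usec):
        self.count += 1
        self.total_usec += rtt_usec

    def timeout(self):
        self.timeouts += 1

    def error(self):
        self.net_errors += 1

    def average(self):
        """Average RTT in uSec over completed responses only."""
        return self.total_usec // self.count if self.count else None

stats = RttStats()
stats.complete(100)
stats.complete(300)
stats.timeout()
print(stats.average(), stats.timeouts)  # -> 200 1
```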
If there appear to be cases of slow file server response, the particular server or share can be isolated using the ‘show statistics filer’ command. The ‘show statistics cifs work-queues’ command presents only the overall averages.
Time Spent In Authentication
The last major section of the output shows statistics concerning authentication, particularly the time taken for authentication requests outside the ARX.
Time spent in authentication
----------------------------
Kerberos service ticket requests:
    Total calls: 2
    Elapsed time in call:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:       0       1      1    0     0      0       0
    Average time: 0.016 secs
    DC replacement requested: 0
Netlogon requests:
    Total calls: 0
    Elapsed time in call:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:       0       0      0    0     0      0       0
    Average time: none
ARX Secure Agent requests:
    Total calls: 0
    Elapsed time in call:
        Secs:  < 0.001  < 0.01  < 0.1  < 1  < 10  < 100  Longer
        Count:       0       0      0    0     0      0       0
    Average time: none
We retain separate statistics for getting Kerberos service tickets, for authenticating clients with Netlogon, and for communication with the ARX Secure Agent for client and file server logons. The results are presented as a histogram. In the example above, nothing is worrying.
In addition, there is a table showing details of the 20 longest round-trip times.
Table of longest recorded round-trip times:
Timestamp (UTC)      Type  Time (uSec)  Principal Name
2011-04-08 13:07:38  KRB         30596  Administrator@MOUTH.ORG
2011-04-08 13:07:38  KRB          1455  Administrator@MOUTH.ORG
The ‘type’ is one of KRB, ASA, or NL (the last being Netlogon). The principal name is the name of the client for whom authentication is being done.
This new command should help you with diagnosis of CIFS response-time problems, by providing the information you need to get started on figuring out what is wrong. You may still need to use other commands to further isolate the issue (for example, to determine which particular filer is slow), but we hope it presents the top-level data in a useful form.
Dave Porter is an architect with F5 Data Solutions; he’s been working on ARX file server virtualization since it was invented, and is the principal author of the ARX control-plane CIFS code.