Forum Discussion

Vratix_97086
Nimbostratus
Aug 20, 2009

Discovery/Monitoring Issue

We are seeing a discovery error in the F5 event log and Trace.log. The error from the event log is shown below. We see the F5 device in the SCOM console, but it is showing as not monitored and has been for a few hours. In the diagram view, we can see components being monitored, though. The only indication that there is an issue is the error below. Does this have to do with port configurations? I ask because the error refers to iQuery. Also, we can successfully telnet to the iQuery (port 4353) and iControl (port 443) ports on the device. Any assistance would be greatly appreciated.

Vratix

Event Type: Error
Event Source: F5 Events
Event Category: None
Event ID: 301
Date: 8/20/2009
Time: 10:51:36 AM
User: N/A
Computer:
Description:
Failed to discover device at address:
Network-related failure has occurred: [Category]SecureSocketLayer:[Type]ConnectFailure;LastError=SSL IO Error 419418928: SYSCALL:

F5Networks.Protocols.iQuery.iQueryException: [Category]SecureSocketLayer:[Type]ConnectFailure;LastError=SSL IO Error 419418928: SYSCALL:
at F5Networks.Protocols.iQuery.iQuerySocketBase.Connect()
at F5Networks.ManagementPack.Services.DeviceMonitor._CompleteDiscovery(DeviceDiscoveredEventArgs successContext)
at F5Networks.ManagementPack.Services.DeviceMonitor._DeviceDiscovered(Object sender, DeviceDiscoveredEventArgs successContext)

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

10 Replies

  • **Update**

     

     

    Literally 1 min after I posted this, the device went healthy. I should clarify that only one monitor went healthy and the other 100 still have no state.

    Also, I still see the error below with a current timestamp in the Trace.log:

    Global Error: 0 : [08/20/2009 11:56:43]Failed to discover device at address: Network-related failure has occurred: [Category]SecureSocketLayer:[Type]Connect
  • Stephen_Fisher_
    Historic F5 Account
    It looks like networking issues are causing the connection to either drop entirely or drop periodically (the SSL Connect error).

    If the device has been discovered and the connection is dropped, we will attempt to reconnect every few seconds or minutes until the reconnect is successful.

    The Management Pack uses persistent socket connections between the Management Server and the F5 devices. You may try using telnet and keeping the session open for an extended period of time (minutes or hours) to verify there aren't intermittent networking issues.

    Thanks,
    Stephen
    F5 Management Pack
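
Editor's sketch: the long-lived telnet check suggested above can be approximated with a small script that opens a TCP connection, leaves it idle, and reports how long it survives. This is only an illustration, not part of the F5 Management Pack; the function name `hold_connection` and the example address are hypothetical, and an idle connection being closed does not by itself prove a network fault.

```python
import socket
import time


def hold_connection(host, port, max_seconds=3600.0, poll_seconds=5.0):
    """Open a TCP connection, keep it idle, and report how long it survives.

    Returns (seconds_elapsed, reason), where reason is one of
    "closed by peer", "reset", or "still open".
    Connection failures at connect time are raised to the caller.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.settimeout(poll_seconds)
        while time.monotonic() - start < max_seconds:
            try:
                data = sock.recv(1)
            except socket.timeout:
                continue  # nothing arrived within the poll window; still connected
            except (ConnectionResetError, ConnectionAbortedError):
                return time.monotonic() - start, "reset"
            if data == b"":
                # An empty read means the other side closed the connection.
                return time.monotonic() - start, "closed by peer"
    return time.monotonic() - start, "still open"


# Example call (hypothetical address, iQuery port from the thread):
# print(hold_connection("192.0.2.10", 4353, max_seconds=1800))
```

Running it against both ports mentioned in the thread (443 and 4353) and from more than one machine would show whether the time-to-disconnect is consistent, which hints at an idle timeout rather than a random drop.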
  • Thank you, Stephen. We realized that a new MP just came out and decided to upgrade to the latest and greatest before spending any more time troubleshooting this. BTW, the telnet tests were disconnecting within 5-10 min.

    So I uninstalled the MP from my secondary MS and the RMS, then reinstalled the new version. We got the same error on both after installation. See below:

    ----------------------
    FATALERROR: The DataSource and ConditionDetection modules required for the F5 Monitoring Service to run could not be loaded by Operations Manager monitoring host. Check the Operations Manager Event Log for errors and manually start the F5 Monitoring Service once any issues have been resolved.
    -----------------------

    The service on the RMS was disabled (I'm guessing this is by design), and the service on the 2nd MS was set to Auto but wasn't started. I manually started it.

    Now the Trace.log has the following error:

    ----------------------
    The PerformanceDataSourceConnector connection to Operations Manager Health Service host 2nd Mgmt Server could not be established: Failed to connect to an IPC Port: The system cannot find the file specified.
    ----------------------

    I've reinstalled this MP 2 times now and tried rebooting, repairs, etc., and I don't know why we are getting this error. The installation account has SCOM, Server, and SQL admin rights, and the service account has the same except only DBO rights to the F5 DB. Please tell me if these rights are not sufficient.

    Thank you, and please help!!
  • **Update**

     

     

    The status of our MP is as follows. We have somehow resolved the installation issues for 715 (not sure how), and now we are again seeing the same discovery issues we saw in version 579. This is the current status:

    - Both F5 devices are showing in the console as Healthy.
    - In Health Explorer, only 1 monitor is showing a healthy state: the "Control Device Connection" monitor. All other monitors don't have a state and don't seem to be working.
    - We aren't getting any alerts, as I don't think monitoring is fully working.
    - We can't see any performance data/counters being collected when we launch the performance view in the console.
    - We are consistently seeing event 301 on the Mgmt server monitoring these devices. See below for the full error. The error states that it failed to discover the two devices. There is obviously something still wrong. One thing we noticed is that when you telnet to the devices on ports 443 and 4353, the connection is eventually closed after 5-10 min. We don't know if this is significant or not, and the F5 admins don't know why this is happening.

    Additional questions are:

    1. Is what we are seeing in Health Explorer by design and we just have to start enabling things, or are we in fact having issues?
    2. How are you supposed to view the performance counters collected by the MP? Is it intended that you right-click the device in the state view and open a performance view?
    3. There is no Alerts view with the MP. Is it intended that alerts be viewed from the state or diagram views?

     

     

    -----------------------------------------
    Event Type: Error
    Event Source: F5 Events
    Event Category: None
    Event ID: 301
    Date: 8/25/2009
    Time: 1:15:25 PM
    User: N/A
    Computer:
    Description:
    Failed to discover device at address:
    Network-related failure has occurred: [Category]SecureSocketLayer:[Type]ConnectFailure;LastError=SSL IO Error 465227312: SYSCALL:

    F5Networks.Protocols.iQuery.iQueryException: [Category]SecureSocketLayer:[Type]ConnectFailure;LastError=SSL IO Error 465227312: SYSCALL:
    at F5Networks.Protocols.iQuery.iQuerySocketBase.Connect()
    at F5Networks.ManagementPack.Services.DeviceMonitor._CompleteDiscovery(DeviceDiscoveredEventArgs successContext)
    at F5Networks.ManagementPack.Services.DeviceMonitor._DeviceDiscovered(Object sender, DeviceDiscoveredEventArgs successContext)

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
    -------------------------------------------
  • Julian_Balog_34
    Historic F5 Account
    To answer your questions first:

    - In the Health Explorer of the F5 device, you should see the monitors that are enabled by default (i.e., by design), and some of them should report a valid state (as you pointed out with the "Control Device Connection" monitor). You should also see at least one other state (green) for the "Global Memory Used" performance counter monitor, which is controlled by its threshold rule (again, enabled by default in the F5 Management Pack). For all the other performance counter monitors, you would override / enable the related Collection / Threshold Rules to actually 'arm' the monitor (as you correctly assumed in your post).

    - Not seeing any alerts or performance data coming in is probably fallout from your network / connectivity issue with the F5 device, which we will troubleshoot together.

    - To view the performance counters collected by the F5 Management Pack, either choose Open > Performance View in the context menu of an F5 device / object (in the diagram view / hierarchy), or select "Performance View" in the Actions pane (on the right-hand side of the SCOM Operations Console). You do have to make sure the related performance counters are enabled / collected by setting the appropriate collection rules.

    - The F5 Management Pack doesn't feature a dedicated Alert View. How you open the Alert View is a matter of preference: either from the diagram view or from the state view. There's no recommended or intended way of using it from our end.

    * * *

    To troubleshoot the network connectivity issues with the F5 device, we will need a detailed trace log, which you will have to enable for the F5 Management Pack. See the "Verbose Logging Support" section on the Troubleshooting page. With verbose logging enabled, attempt to discover a new device and wait for a drop in the device connection (as you mentioned, in about 5-10 minutes), then zip the trace.log file located at %Program Files%/F5 Networks/Management Pack/log and post it here, or email it directly to ManagementPack(at)f5.com. We can also provide FTP access for uploading it. Once we get the trace, we'll take a look and get back to you.

    Thank you,
    Julian
  • Thank you, Julian. How do I get the log to you? I have a case open with Alistair Helfer. Should I just respond to that case email with the log file attached?
  • Julian_Balog_34
    Historic F5 Account
    You can email the log directly to us: ManagementPack(at)f5.com. The case will end up in our hands anyway.

    Thanks!
    Julian
  • Julian,

    I sent it in, thank you. Also, my case is C566087 in case you need it.
  • Julian_Balog_34
    Historic F5 Account
    Thanks Marc.

    Here are my findings after thoroughly analyzing the trace logs you sent us:

    - The intermittent connection errors happen while the F5 Monitoring Service attempts to initiate an SSL connection with the F5 device during the periodic heartbeat / update / refresh cycle.
    - The actual SSL error (SSL_ERROR_SYSCALL / OpenSSL) reported by the F5 Monitoring Service (as a client) points to a 'premature' ending of the connection on the server side (big3d / F5 device).
    - The F5 Monitoring Service attempts to reconnect repeatedly until the connection to the F5 device is available again.

    That being said, I believe the problem is somewhere between the box running the F5 Monitoring Service and the F5 device, and it may involve:

    - network / routers,
    - proxies (if any, between the monitoring box and the F5 device),
    - the big3d daemon crashing (on the F5 device).

    My feeling is that the problem is related to the network, or to something on it between the monitoring box and the F5 device. Here's what I would suggest:

    - Run a traceroute from the monitoring box to the F5 device (you can send it to us at ManagementPack(at)f5.com) so we can analyze it together.
    - Check whether there are any proxies (HTTP, etc., including other BIG-IP devices) between the monitoring box and the F5 device.
    - Try to ping / telnet the F5 device from a different box / network and see if the intermittent connection failures still happen.

    Finally, just to be sure all is well on the F5 device / server side, we could also analyze the big3d logs (found in /var/log and /var/core on the BIG-IP). Just check whether /var/core has any core files (a listing would suffice) from the time when the connection drops, and if there are any, we'll take it from here.

    Let us know, and we'll investigate further.

    Thanks,
    Julian
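
Editor's sketch: the failure described in the findings above (the server side ending the connection while the client is still negotiating SSL) can be probed independently of the Management Pack with a single handshake attempt that reports how it failed. The function names `classify_tls_failure` and `probe_tls` are hypothetical. Two assumptions are worth stating loudly: certificate verification is turned off because the device is assumed to present a self-signed certificate, and a failure on the iQuery port may simply mean the device expects a client certificate or an older protocol version than a current Python build will negotiate, so a failed probe there is not conclusive.

```python
import socket
import ssl


def classify_tls_failure(exc):
    """Map an exception from a TLS connect attempt to a rough description."""
    if isinstance(exc, ssl.SSLEOFError):
        return "peer closed the connection during the handshake"
    if isinstance(exc, ssl.SSLSyscallError):
        return "system-call failure under TLS (SSL_ERROR_SYSCALL)"
    if isinstance(exc, ConnectionResetError):
        return "TCP reset received"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    if isinstance(exc, ssl.SSLError):
        return "other TLS error"
    if isinstance(exc, OSError):
        return "other network error"
    return "unknown"


def probe_tls(host, port, timeout=10.0):
    """Attempt one TLS handshake; return "ok" or a failure description."""
    context = ssl.create_default_context()
    # Assumption: self-signed device certificate, so skip verification
    # for this diagnostic only. Do not reuse this context elsewhere.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return "ok"
    except OSError as exc:
        return classify_tls_failure(exc)


# Example call (hypothetical address, iControl port from the thread):
# print(probe_tls("192.0.2.10", 443))
```

Repeating the probe every few seconds and logging the result with a timestamp would give a timeline to compare against the event 301 entries.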
  • Julian_Balog_34
    Historic F5 Account
    Thanks Marc.

    This all looks reasonable, and it's good to know that the ping to the F5 device is never disrupted. I wouldn't worry about the telnet sessions being closed on those ports after a while. That is expected behavior if there's no activity.

    From the traceroute, it looks like the only hardware between the monitoring box and the F5 device is the default gateway. Is this an F5 device (load balancer) by any chance? What kind of equipment is it? At this point I would run a network packet analyzer (tcpdump / Wireshark) to capture the packets going back and forth between the monitoring server and the F5 device for the SSL communication. It's possible that the disruption we're experiencing is a connection reset (showing up as an RST flag at the TCP packet level) issued either by the gateway or by the F5 device itself. It would be interesting to find out.

    Here's what I would suggest:

    - Find out which network interface the F5 device uses to communicate with our monitoring server (run the ifconfig command on the F5 device).
    - Based on the interface, run tcpdump (on the F5 device) to monitor the SSL communication with the remote host (monitoring server), with a syntax similar to: tcpdump -i external host x.x.x.x and port 443 > tcpdump.log
    - Monitor the traffic between the monitoring server and the F5 device for a while to capture the network disruption.
    - Zip the tcpdump.log file and send it to us (or analyze it yourself, using Wireshark for example) and look for anything that points to the RST flag and who is initiating it (and why).

    In the meantime, I'll also take a look at the Qkview dump you sent me to see if there might be anything wrong with the F5 device that I can tell (before probably submitting it to a BIG-IP expert).

    Thanks.
    Julian
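
Editor's sketch: once a text capture like the tcpdump.log described above exists, the lines carrying the RST flag can be picked out with a short script instead of by eye. This assumes tcpdump's one-line-per-packet text output and tries to handle both the newer `Flags [R.]` notation and the older bare-letter notation; the exact layout varies by tcpdump version, so the pattern may need adjusting. The names `find_resets` and `count_resets_by_source` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   10:51:36.000002 IP 192.0.2.10.443 > 192.0.2.20.51234: Flags [R.], seq 1, ...
#   10:51:37.000003 IP 192.0.2.10.443 > 192.0.2.20.51234: R 1:1(0) ack 101 win 0
_LINE = re.compile(
    r"^(?P<time>\S+)\s+IP6?\s+(?P<src>\S+)\s+>\s+(?P<dst>\S+?):\s+"
    r"(?:Flags \[(?P<new>[^\]]*)\]|(?P<old>[SFPRWE.]+)(?=\s))"
)


def find_resets(lines):
    """Yield (timestamp, source, destination) for each TCP segment with RST set."""
    for line in lines:
        match = _LINE.match(line.strip())
        if not match:
            continue  # not a TCP packet line in a recognized layout
        flags = match.group("new") or match.group("old") or ""
        if "R" in flags:
            yield match.group("time"), match.group("src"), match.group("dst")


def count_resets_by_source(lines):
    """Count RST segments per sending address.port to show who initiates them."""
    return Counter(src for _, src, _ in find_resets(lines))


# Example use:
# with open("tcpdump.log") as handle:
#     print(count_resets_by_source(handle))
```

If the capture is meant to be opened in Wireshark, writing it with tcpdump's `-w` option produces a binary capture file rather than the redirected text that this script reads.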