| Steve,
So I understand .. the rule runs for days (well, in your example 255
false evaluations) .. then appears to get stuck. Neither the counters
nor the 'time of last evaluation' change.
(Q) Does this happen on any entity other than SNMP?
If the Rule's background thread dies for some reason, the Rule Status and
Counter information will just stop changing. Or, if Alarms is waiting
on SNMP (in your example) to return with some data .. the counters and
such will appear stuck.
(Q) Do you have a lot of SNMP rules running simultaneously?
I believe if the SNMP AM runs out of sockets, it keeps trying to get one
every second till one becomes available. I don't know what would happen
if one never became available.
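
To make the concern concrete, here is a minimal sketch of a once-a-second retry loop with a deadline added. This is not the SNMP AM's code, which I have not seen; `try_acquire`, `retry_interval` and `deadline` are hypothetical names used only for illustration. Without the deadline check, a caller waiting on this loop would look exactly like a rule whose counters have stopped changing.

```python
import time


def acquire_with_deadline(try_acquire, retry_interval=1.0, deadline=30.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Retry try_acquire() every retry_interval seconds until it returns
    something other than None, or until deadline seconds have elapsed.

    try_acquire is a hypothetical callable standing in for "get a socket".
    Returns (resource, attempts). Raises TimeoutError on deadline.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        resource = try_acquire()
        if resource is not None:
            return resource, attempts
        # Without this check the loop would spin forever if no socket
        # ever became available, and the caller would appear stuck.
        if clock() - start >= deadline:
            raise TimeoutError(
                "no resource after %d attempts" % attempts)
        sleep(retry_interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.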
/keith
|
| More info and answers to question(.1):
IP transport is not UCX but TGV Multinet V3.1
(A) Only polling alarms set up are SNMP alarms, all others are event
driven.
(A) There are currently 30 SNMP polling alarms enabled, 2 of which have
stalled out.
Where do we go from here?
Steve
|
| I wish I knew a lot more about TGV's MULTINET. I am beginning to think
this may be at the core of this problem. I have asked the customer to
see if Multinet has a command similar to 'UCX SHOW COMMUNICATION',
which lists the configured number of sockets, as well as the current
and peak number used.
Any Multinet users out there have any ideas?
Steve
|
| I went on site to review the problem with the customer and found the
following two events.
Problem Summary Description:
Several of the customer's SNMP polling alarms are hanging. There is
no indication that there is any problem until the time and date of the
last evaluation is compared to the current time and date. When this
comparison is done it is seen that there can be a large difference
between the current time and the time of last evaluation, many times
the polling period.
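
The comparison described above can be sketched as a small check. This is only an illustration of the manual test, not anything MCC provides; `is_stalled` and its `tolerance` parameter are hypothetical, and the reference time is arbitrary.

```python
from datetime import datetime, timedelta


def is_stalled(last_evaluation, now, polling_period, tolerance=2.0):
    """Return True when the time since the last evaluation exceeds
    `tolerance` polling periods (tolerance is an assumed threshold)."""
    return (now - last_evaluation) > polling_period * tolerance


now = datetime(2000, 1, 4, 12, 0)  # arbitrary reference time

# A 10-minute alarm whose last evaluation was three days ago.
ten_minute_alarm = is_stalled(now - timedelta(days=3), now,
                              timedelta(minutes=10))

# A one-hour alarm last evaluated 30 minutes ago.
hourly_alarm = is_stalled(now - timedelta(minutes=30), now,
                          timedelta(hours=1))

print(ten_minute_alarm, hourly_alarm)
```

An alarm can show ENABLED and still fail this check, which is the only symptom seen at this site.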
Hardware configuration: VAXstation 3100, Model 76
Software configuration: VMS 5.5, NODNS, NORDB,
TGV Multinet V3.1, for the TCPIP Access Module, and
DECMCC-BMS V1.2
1) Polling of ALARMS stalling.
o The alarm's status is ENABLED but the time of last poll was three days
ago.
o I verified system parameters as stated by MCC and TGV, all were
configured correctly.
o Next we rebooted the system to start from a known state.
o Entered MCC/interface=windows and enabled the alarms.
o Waited until all alarms were polled at least one time; the alarms are set
for 10 minutes and one hour.
o Removed the system from the network and connected a LOOPBACK connector
to the MCC system.
o Alarms started to trigger.
o After 45 minutes we reconnected the system to the network, then waited
30 minutes before reviewing the polling status.
o Checking when the alarms were last polled, the DECnet alarms were all
working. Most of the SNMP alarms had a last polled time of 1:15 to
30 minutes ago and were NOT working -- ENABLED, but the time of last
update was not correct.
o Any SNMP alarms that were not polled while the system was disconnected
from the network were still working (an hour timed alarm, for example).
o Looking at the network with a sniffer, there were no packets on the wire
for the alarms that didn't have the correct polling time. Any of the
SNMP alarms with the correct polling time were seen on the wire.
o Looking at the mail message from one of the alarms that triggered while
the system was off the network, it contained the following exception:
Internet Communication device error %SYSTEM-W-CANCEL, operation.
This alarm was one that was NOW stuck. There was no software error
logged by MCC for alarm logging, and Multinet reported no errors; other
SNMP alarms were still working.
o We disabled and enabled the alarms that were not being updated and they
started to work.
2) Error in command file to process event alarms.
During the processing of an event alarm a data file:
SYS$SCRATCH:MCC_ALARMS_DATA_xxxxxxx.DAR
is created. The second record in this file has a record size of 1039
bytes; this record size causes an error:
%DCL-W-BUFOVF, command buffer overflow - shorten expression or
command line
The error occurs after the second read of the data file:
$ READ_LOOP:
$ !
$ read/end_of_file=endit data_file line
$ string = f$element(0, " ",line)
%DCL-W-BUFOVF, command buffer overflow - shorten expression or command line
.....
....
...
..
.
The name of the command file is MCC_ALARMS_MAIL_EXCEPTION.COM. The
text of the data record that is causing the error is,
"MANAGED_OBJECT: SNMP ISGRSE2 Interface 7"
with 1001 spaces that follow.
I don't know if the two problems are related, but this alarm did get stuck.
Not all of the alarms that got stuck had this problem. The alarm definition
statement looked normal.
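
For anyone who wants to check other data files for the same padding, here is a rough sketch that flags over-long records and shows how short they become once trailing spaces are removed. It is written in Python for illustration only and is not part of MCC_ALARMS_MAIL_EXCEPTION.COM. The function name, the `limit` of 1024, and the first sample record are assumptions; the second sample record is the text quoted above, padded out to the 1039-byte record size reported.

```python
def find_oversized_records(records, limit=1024):
    """Return (record_number, raw_length, trimmed_length) for each record
    longer than `limit` characters. Record numbers start at 1.

    `limit` is an assumed threshold, not a documented DCL value.
    """
    report = []
    for number, record in enumerate(records, start=1):
        raw = record.rstrip("\r\n")          # drop the line terminator only
        if len(raw) > limit:
            trimmed = raw.rstrip(" ")        # drop trailing space padding
            report.append((number, len(raw), len(trimmed)))
    return report


text = "MANAGED_OBJECT: SNMP ISGRSE2 Interface 7"
sample = [
    "placeholder first record\n",            # hypothetical content
    text.ljust(1039) + "\n",                 # padded to the reported size
]

for number, raw_len, trimmed_len in find_oversized_records(sample):
    print("record %d: %d bytes, %d after trimming"
          % (number, raw_len, trimmed_len))
```

If the padding is the whole problem, trimming the record before it is parsed should bring it well under any plausible buffer limit.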
Regards,
Bill Hittenmiller IOOSRV::HITTENMILLER
|