
Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

807.0. "notify & events" by JETSAM::WOODCOCK () Mon Mar 18 1991 13:30

Hi there,

I am having a reliability problem when I start combining alarms,
events, and notifications. A number of problems and anomalies are occurring
which I can't put my finger on. The process players are EVL, MCC_DNA4_EVL,
two alarm procedures in batch, and map notification. There are 50-60 alarms
combined between the two jobs, and the expressions are of the following nature:

expression=(occurs(node4 bbpk01 cir syn-* adj node * adjacency down))

	or

expression=(occurs(node4 bbpk01 cir ethernet adj node * adjacency down))

Polling alarms are also run in the evening (about 70) for circuit substates.

-1-

Often by morning, notifications (or event alarms) no longer work (although
they survived this entire weekend). All processes are running, but the following
is found in MCC_DNA4_EVL.LOG. Because all the players are running, I don't
know there is a problem until I *test* MCC. This isn't something I'd expect
to have to do periodically to ensure the tool's operation.

Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420

-2-

When the alarm batch jobs are stopped, the MCC_DNA4_EVL process has died
several times (but not every time). Why? I didn't happen to catch the
log. If I reproduce it again, I'll post it.

-3-

When the alarm batch jobs are started and the notification (map) is enabled
at the same time, I get unpredictable results. Errors range from nothing
noticeable to "invalid lock id". The map may or may not handle incoming
alarms, and the map window almost always crashes when exiting. I would have
thought these processes would be independent, no?? QARed.

-4-

The EVL and MCC_DNA4_EVL processes die when confronted with streaming events.
This has happened (and continues to happen) when a circuit bounces continuously
(2 or more events per second) or when multiple nodes are sinking the same network
event simultaneously (e.g. an Area Reachability event). I tried reproducing this
situation manually (via .com) with about 600 events within 5 minutes (async
events from a single node) and it was handled fine. I get errors in the
EVL.log file, but I can't determine whether they are actually pointing me in the
right direction. Below is a copy of the final few lines of the present log,
and everything is still running OK! Any ideas out there??

$ RUN SYS$SYSTEM:EVL
%EVL-E-OPENMON, error creating logical link to monitor process NOCMAN::"TASK=MCC
_DNA4_EVL"
-SYSTEM-F-NOSUCHOBJ, network object is unknown at remote node
%EVL-E-WRITEMON, error writing event record to monitor process MCC_DNA4_EVL
-SYSTEM-F-FILNOTACC, file not accessed on channel
 
regards,
brad...

ps. Jim/Daryl - The polling alarms have been 100% reliable since code changes!
807.1. "mcc_dna4_evl dropped" by JETSAM::WOODCOCK () Mon Mar 18 1991 20:41
The MCC_DNA4_EVL process died this afternoon with the following log:

Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420
%SYSTEM-F-LINKABORT, network partner aborted logical link
  DECMCC       job terminated at 18-MAR-1991 15:07:24.98



The following is a portion of the EVL.log. I believe that it broke down at the
(%EVL-F-NETASN, unable to assign a channel to NET) line. EVL restarted,
but of course MCC_DNA4_EVL didn't, because that has to be done manually. Could
EVL need tuning?

$ RUN SYS$SYSTEM:EVL
%EVL-E-OPENMON, error creating logical link to monitor process NOCMAN::"TASK=MCC
_DNA4_EVL"
-SYSTEM-F-NOSUCHOBJ, network object is unknown at remote node
%EVL-E-WRITEMON, error writing event record to monitor process MCC_DNA4_EVL
-SYSTEM-F-FILNOTACC, file not accessed on channel
%EVL-F-NETASN, unable to assign a channel to NET
-SYSTEM-F-PATHLOST, path to network partner node lost
$ PURGE/KEEP=3 EVL.LOG
$ LOGOUT/BRIEF
  DECNET       job terminated at 18-MAR-1991 15:07:22.94

807.2. "We'll take a look...." by TOOK::CAREY () Wed Mar 20 1991 19:37
    
    Brad,
    
    Gee, thanks.  We love the problems you bring us.  :-)
    
    I have *no idea* what could be going on.
    
    I'll get some data on the evl.log information and see if we can come up
    with a scenario for your breakdown.
    
    -Jim
    
807.3. "some clues" by JETSAM::WOODCOCK () Thu Mar 21 1991 17:58
It seems that two of these problems may be related. I reproduced problem
-2- today (alarms started in batch, notification enabled simultaneously) and
was again given "invalid lock id". When exiting the map, the DECterm window
vanished. I restarted the map (all jobs already running), but notifies
failed to work. I checked MCC_DNA4_EVL.LOG and found:

Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793

I then stopped the batch queue, and that's when MCC_DNA4_EVL died (problem -3-).
I don't seem to be able to recreate problem -3- unless problem -2-
has been encountered. A recheck of MCC_DNA4_EVL.LOG now shows a new error
line:

Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793
%LIB-F-SECINTFAI, secondary interlock failure in queue
  SYSTEM       job terminated at 21-MAR-1991 14:22:55.04


Problem -2- appears to be a real nasty one: when it occurs the processes
continue to appear to run, but in reality the user must stop all alarm jobs,
stop the MCC_DNA4_EVL process (maybe even EVL), then restart MCC_DNA4_EVL,
then restart the alarm jobs, then bring the map back up with notifications
(with WAIT statements between everything so nothing has a chance to bump
into anything else!!).
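
For the record, a rough DCL sketch of that recovery sequence might look
something like this (the queue, process, and file names here are invented,
and the WAITs are only there so nothing bumps into anything else):

$! Recovery sketch -- queue, process, and file names are placeholders.
$ stop/queue/reset ALARM_BATCH                 ! abort the alarm batch jobs
$ wait 00:01:00
$ stop MCC_DNA4_EVL                            ! stop the sink process (maybe EVL too)
$ wait 00:01:00
$ start/queue ALARM_BATCH
$ submit/queue=ALARM_BATCH MCC_COMMON:MCC_DNA4_EVL.COM    ! restart the sink first
$ wait 00:02:00
$ submit/queue=ALARM_BATCH ALARM_RULES_1.COM   ! then the two alarm jobs
$ submit/queue=ALARM_BATCH ALARM_RULES_2.COM
$ wait 00:02:00
$! ...then bring the map back up and re-enable notifications by hand.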

So Jim/Jean, being the nice guy that I am, I've just reduced the number of 
problems down to a mere 3. It just keeps getting easier every day :-).


brad...


807.4. "This COULD be an Event Manager problem" by TOOK::GUERTIN (I do this for a living -- really) Mon Mar 25 1991 17:22
    Brad,
    
    There were two potential problems with the Event Manager that we had
    for V1.1.  The first was a small window where we try to acquire the
    same lock at two different points within the same process.
    This window was so small that we only saw it on a MIPS machine while
    porting to Ultrix.  The effect of this problem is that you could see
    a "lock conversion" error.  This problem was fixed for the next release
    by combining locks and moving the acquire/release statements around.
    
    The second problem is the one you just discovered.  In order to save on
    VMS resources, we used RTL interlock calls instead of locks for enqueuing
    and dequeuing entries in the Event Pool.  Apparently, in a multiple-CPU
    environment (assuming that is what you have), the "Secondary Interlock
    Failure" is easier to get than we expected (we have a high retry
    count).  This problem was also fixed for the next release: to keep the
    code portable, we had to use locks more efficiently, which removed the
    need for calls to the RTL interlock routines.
    
    I guess what I am saying is this:  If it is the Event Manager (and it
    may not be), then most, if not all, of your problems should go away with
    the next release (whenever that is).  The workaround would be to
    spread out the load on the Event Manager over a longer period, in order
    to reduce the strain on system resources (locks and CPUs).  If this
    workaround is not acceptable, we would have to work through management to
    get you a special MCC kernel (not an easy thing to do).  But even that is
    no guarantee that your problems will go away; the Event Manager is just
    one possible cause of them.
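
    For what it's worth, one crude way to spread that startup load would be
    to stagger the submissions -- something like the lines below, where the
    queue and file names are invented:

$! Stagger the two alarm batch jobs so the Event Manager is not hit
$! with 50-60 rule startups at once (queue and file names are placeholders).
$ submit/queue=ALARM_BATCH ALARM_RULES_1.COM
$ submit/queue=ALARM_BATCH/after="+00:10:00" ALARM_RULES_2.COM   ! ten minutes later
$! ...and wait again before enabling map notification.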
    
    One way of checking whether the Event Manager is detecting the problem
    and bubbling it up to the Event Sink is to define MCC_EVENT_LOG to 1
    ($ define MCC_EVENT_LOG 1) in the same process, and see if you get any
    Internal error messages about Lock conversions or Interlock failures.
    If no Internal error messages get displayed, then the Event Manager is
    (probably) innocent.
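
    In case it is useful, here is roughly how that check might be wired into
    the sink's command procedure (only the DEFINE comes from the suggestion
    above; the SEARCH is just one way to scan the log afterwards):

$! Turn on Event Manager internal logging in the process that runs the sink.
$ define MCC_EVENT_LOG 1
$ manage/enter/presen=mcc_dna4_evl
$!
$! Later, from another process, look for lock trouble in the sink's log:
$ search MCC_DNA4_EVL.LOG "Lock conversion","Interlock"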
    
    -Matt.
807.5. "it can wait" by JETSAM::WOODCOCK () Wed Apr 24 1991 21:44
I got sidetracked for a while, but I just got a chance to re-review
this note. If you folks think the LOCK problem will be resolved in
the next release, I can wait. The workaround isn't pretty, but it
seems to work most times. I'll give you a ring next release if it's
still there. BTW, in poking through VPA reports there were a couple of
mentions of lock problems, which seems to confirm your thoughts.

As far as EVL goes, I'll probably try to spend some time getting a
better understanding of it and try to make it more robust through tuning,
although from all I've heard it's the nature of this beast to drop
out and come back up. MCC should put some effort into ensuring that
MCC_DNA4_EVL is automated to come back up with it. Whole operational
businesses may depend on these two working in harmony, and their
interaction is **very** important to many net managers even if they don't
know it today. It's the future they will move to.

best regards,
brad...    

807.6. "attempt at stability" by JETSAM::WOODCOCK () Thu Jul 18 1991 17:01
Hello,

As promised/threatened, I took a look at trying to make MCC and EVL
more stable partners. I know you folks are tight on time, and this problem
is *critical* for ease of operations. My first approach was EVL, with
which I got nowhere. It seems the EVL experts are few and far between, or
very shy. Therefore I looked toward an MCC solution. I have edited
MCC_COMMON:MCC_DNA4_EVL.COM into a bona fide hack. Basically, when EVL
drops out (which happens often when it is hit with streams of events),
MCC_DNA4_EVL drops with a fatal link abort message. I simply capture the
status code, test to see if it is a link abort, then loop back up and
restart. It seems to work OK but is not time-tested as yet. I would like
some feedback to ensure this doesn't negatively impact anything. I also
wanted to see what would happen if EVL had not yet returned: MCC_DNA4_EVL
seemed to wait for EVL after it was restarted, then made a link. This has
worked for up to 5 minutes of EVL down time so far. I'll post the hack as
the next reply. But of course I do not intend to support this very delicate
code, and NO rights are reserved :-).

cheers,
brad...
807.7. "MCC_COMMON:MCC_DNA4_EVL.COM" by JETSAM::WOODCOCK () Thu Jul 18 1991 17:11
$! This procedure replaces the original MCC_COMMON:MCC_DNA4_EVL.COM and
$! is intended to allow this process to restart when the EVL process fails
$! and causes a LINKABORT error, which would normally EXIT this procedure.
$!
$ set verify
$ start:
$ on warning then goto status          ! any error from MANAGE drops us to STATUS
$ manage/enter/presen=mcc_dna4_evl     ! runs until the sink exits or dies
$ status:
$ exit_status = $status                ! save it before later commands overwrite it
$ show symbol exit_status
$ wait 00:00:20                        ! give EVL a chance to come back first
$!
$! Check to see if the error was caused by LINKABORT and restart if true
$! (%X000020E4 = 8420, the LINKABORT status seen in the sink's log)
$!
$ if exit_status .eqs. "%X000020E4" then goto start
$!
$ exit
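$!
$! To use: submit this procedure to a batch queue as before, for example
$! (the queue name here is just a placeholder):
$!
$!   $ submit/queue=MCC_BATCH/keep mcc_common:mcc_dna4_evl.com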