
Conference azur::mcc

Title: DECmcc user notes file. Does not replace IPMT.
Notice: Use IPMT for problems. Newsletter location in note 6187
Moderator: TAEC::BEROUD
Created: Mon Aug 21 1989
Last Modified: Wed Jun 04 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 6497
Total number of notes: 27359

1149.0. "EVENTS problem" by SNOC02::MISNETWORK (Take a byte) Mon Jun 17 1991 03:46

    Haven't been able to find a similar problem so here goes. 
    
    I set up all the appropriate event logging and the MCC_DNA4_EVL task object
    so that my DECnet alarms would work. They worked fine, but all of a
    sudden things seem to have died. Lots of things look strange, and I am
    now at a loss.
    
    I tried the old reboot trick, but I had no success with a GETEVENT
    command. Below is a section of my MCC_DNA4_EVL log -
    
    $ manage/enter/presen=mcc_dna4_evl
    Network object MCC_DNA4_EVL is declared, Status = 52854793
    Waiting for the event message from EVL....
    
    but nothing happens, even though I can see the events reaching my system with
    REPL/ENA=NET. I disabled/enabled my local sink monitor twice before it
    started to work again. This is causing some pain as I try to keep a log
    of all DECnet outages with the following command file, but it is not
    reliable because something gets locked up, and events get
    lost/misplaced/unrecorded -
    
    $show time
    $today=f$cvtime("today","absolute","date")
    $hour_till_midnight=23-f$cvtime("''f$time()'","absolute","hour")
    $minutes_till_midnight=59-f$cvtime("''f$time()'","absolute","minute")
    $duration="''hour_till_midnight'"+":"+"''minutes_till_midnight'"+":00.00"
    $todays_event_file="disk$userdisk:[tassone.mcc]''today'.events"
    $todays_event_com="disk$userdisk:[tassone.mcc]''today'.com"
    $!
    $open/write command_file 'todays_event_com'
    $write command_file "$mana/enter"
    $write command_file "getevent node4 * circ * Any Events, -"
    $write command_file "for dur ''duration', to file=''todays_event_file'"
    $write command_file "exit"
    $close command_file
    $!
    $show time
    $@'todays_event_com'
    $show time
    $!
    $submit/after=tomorrow/keep/noprint/queue=mcc$batch/-
    log=disk$userdisk:[tassone]decnet_events.log -
    disk$userdisk:[tassone.mcc]decnet_events.com
    $delete 'todays_event_com';*
    $mail/sub="DECnet events for ''today'" 'todays_event_file' -
    snoc01::misnetwork
    $purg/keep=3 disk$userdisk:[tassone]decnet_events.log 
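    The duration arithmetic in the command file above can be checked in isolation. The following is a minimal sketch in Python (used here purely for illustration instead of DCL); the function name `duration_until_midnight` is hypothetical and not part of the original procedure.

```python
from datetime import datetime


def duration_until_midnight(now: datetime) -> str:
    """Mirror the DCL arithmetic: whole hours and minutes left in the day."""
    hours = 23 - now.hour        # same as 23 - f$cvtime(...,"hour")
    minutes = 59 - now.minute    # same as 59 - f$cvtime(...,"minute")
    # Same layout as the DCL symbol "duration": hours:minutes:00.00
    return f"{hours}:{minutes}:00.00"


if __name__ == "__main__":
    # Time of the base note, 03:46 on 17-JUN-1991
    print(duration_until_midnight(datetime(1991, 6, 17, 3, 46)))
```

    Because the seconds field is fixed at 00.00, a GETEVENT started at HH:MM:SS with this duration appears to end at 23:59:SS, so up to a minute before midnight is not covered until the resubmitted job starts.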
    
    Previous MCC_DNA4_EVL logs showed the following illness -
    
    Waiting for the event message from EVL.....
    The connection with EVL is established.
    ** Unable to connect to NMCC  **
    Ready to read the next event message...
    Failed to send event = 409 to MCC event manager, INSEVTPOOLMEM
    Ready to read the next event message...
    Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
    Ready to read the next event message...
    Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
    Ready to read the next event message...
    Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
    Ready to read the next event message...
    Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
    Ready to read the next event message...
    Failed to receive an event from EVL, status = 8420
    %SYSTEM-F-LINKABORT, network partner aborted logical link
      TASSONE      job terminated at 17-JUN-1991 13:55:53.01
    
    Help !
    Louis
    
1149.1. "More info - more confusion" by SNOC02::MISNETWORK (Take a byte) Tue Jun 18 1991 02:01 (86 lines)
More info. 

I know that last night my alarms worked when an event happened on one of my 
circuits, but again, today it is very much broken. 

The MCC_DNA4_EVL log showed the following -

$ set proc/priv=(all,nobypass)
$ manage/enter/presen=mcc_dna4_evl
Network object MCC_DNA4_EVL is declared, Status = 52854793
Waiting for the event message from EVL.....
The connection with EVL is established.
** Unable to connect to NMCC  **
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
.
.
.
Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420
%SYSTEM-F-LINKABORT, network partner aborted logical link
  TASSONE      job terminated at 18-JUN-1991 11:46:34.63
 
I tried the DISABLE/ENABLE trick with the local sink monitor without any 
success, again the log as follows -

$ manage/enter/presen=mcc_dna4_evl
Network object MCC_DNA4_EVL is declared, Status = 52854793
Waiting for the event message from EVL.....

I tried the DISABLE/ENABLE trick  a second time with the following results -

MCC> disab node4 sprnet local sink monitor

Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:31

Disable completed successfully.
MCC> enabl node4 sprnet local sink monitor

Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:34

Internal error in DECnet Phase IV AM.
                              VMS Error = %SYSTEM-F-DUPLNAM, duplicate name
MCC> enabl node4 sprnet local sink monitor

Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:59:59

Enable completed successfully.

Tried zeroing my counters with the following results -

MCC> getevent node4 * any event
%%%%%%%%%%%  OPCOM  18-JUN-1991 12:01:01.00  %%%%%%%%%%%
Message from user DECNET on SPRNET
DECnet event 0.9, counters zeroed
From node 59.1 (SPRNET), 18-JUN-1991 12:01:00.02
Node 59.1 (SPRNET)

%%%%%%%%%%%  OPCOM  18-JUN-1991 12:01:01.79  %%%%%%%%%%%
Message from user AUDIT$SERVER on SPRNET
Security alarm (SECURITY) and security audit (SECURITY) on SPRNET, system id: 65534
Auditable event:        Network login failure
Event time:             18-JUN-1991 12:01:01.77
PID:                    00000164
Username:               ILLEGAL
Remote nodename:        SPRNET          Remote node id:         60417
Remote username:        TASSONE
Status:                 %LOGIN-F-NOSUCHUSER, no such user
    
    NCP showed following -
    
    MCC_DNA4_EVL   0   00000163
    TASK           0                             ILLEGAL
    
                                                         
HELP!!! What is happening here. My once beloved uncomplaining fully operational 
DECmcc is sick !

Cheers,
Louis

1149.2. by TOOK::JEAN_LEE Tue Jun 18 1991 16:55 (110 lines)
    
    Hi Louis,

	Thanks for entering these reports.  Let me answer them sequentially.

1.  

>    $ manage/enter/presen=mcc_dna4_evl
>    Network object MCC_DNA4_EVL is declared, Status = 52854793
>    Waiting for the event message from EVL....
    
>    but nothing happens, I see the events reaching my system with
>    REPL/ENA=NET. I disabled/enabled my local sink monitor twice before it
>    started to work again. This is causing some pain as I try to keep a log
>    of all DECnet outages with the following command file, but it is not
>    reliable .....

	We have also experienced this.  Toggling the state of the sink
	usually clears the problem.  We will investigate further whether this 
	is expected behaviour of EVL or not.

2.  

>    Waiting for the event message from EVL.....
>    The connection with EVL is established.
>    ** Unable to connect to NMCC  **
>    Ready to read the next event message...
>    Failed to send event = 409 to MCC event manager, INSEVTPOOLMEM
>    Ready to read the next event message...
>    Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
>    Ready to read the next event message...
>    Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
>    Ready to read the next event message...
>    Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
>    Ready to read the next event message...
>    Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
>    Ready to read the next event message...
>    Failed to receive an event from EVL, status = 8420
>    %SYSTEM-F-LINKABORT, network partner aborted logical link

	This means that the MCC event manager is running out of its virtual memory.
	This problem needs further investigation.  I will report the findings
	in a future note.

3.

================================================================================
Note 1149.1                      EVENTS problem                           1 of 1
SNOC02::MISNETWORK "Take a byte"                     86 lines  17-JUN-1991 23:01
                        -< More info - more confusion >-
--------------------------------------------------------------------------------
> Ready to read the next event message...
> Ready to read the next event message...
> Failed to receive an event from EVL, status = 8420
> %SYSTEM-F-LINKABORT, network partner aborted logical link

	The logical link between EVL and the event sink can be broken for many
	reasons (node reachability change, circuit state change, line
	problem, etc.), just like any connection between two 
	nodes.  When this happens, I would check the system EVL.LOG right away,
	(not the mcc_dna4_evl.log) to find out the cause.  Depending
	on the cause, restarting the sink or EVL immediately may not always be 
	the right answer.  MCC does not control the connectivity between
	EVL and MCC sink, except using ENABLE or DISABLE to start or abort
	the sink process.  If the latter is the case, the log will tell you so.
 
4. 

>  I tried the DISABLE/ENABLE trick with the local sink monitor without any 
>  success, again the log as follows -

>  $ manage/enter/presen=mcc_dna4_evl
>  Network object MCC_DNA4_EVL is declared, Status = 52854793
>  Waiting for the event message from EVL.....

> I tried the DISABLE/ENABLE trick  a second time with the following results -

MCC> enable node4 sprnet local sink monitor

Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:34

> Internal error in DECnet Phase IV AM.
>                              VMS Error = %SYSTEM-F-DUPLNAM, duplicate name

	This means the sink monitor process is not completely gone yet.  
	Sometimes it takes a while for VMS to kill a process. 
	I would make sure the process mcc_dna4_evl is actually gone before I
	enable it.
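	The "make sure it is gone before enabling" advice amounts to polling with a timeout. Below is a minimal sketch in Python (for illustration only, not DCL). The helper name `wait_until` is hypothetical, and the `condition` callable is an assumption: on a real system it would have to check, by whatever means is available, that no process named mcc_dna4_evl exists any more.

```python
import time


def wait_until(condition, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or the timeout expires.

    Returns True if the condition became true, False on timeout.
    The sleep and clock parameters exist only so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Stand-in for "is the old sink process gone?": false twice, then true.
    answers = iter([False, False, True])
    gone = wait_until(lambda: next(answers), timeout=5, interval=0.1)
    print("safe to ENABLE" if gone else "old process still present")
```

	Only when the wait succeeds would the ENABLE be issued; on timeout it is probably better to investigate than to retry blindly.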

5.

>  Tried zeroing my counters with the following results -
>  MCC> getevent node4 * any event

>  %%%%%%%%%%%  OPCOM  18-JUN-1991 12:01:01.00  %%%%%%%%%%%
>  Message from user DECNET on SPRNET
>  DECnet event 0.9, counters zeroed
>  From node 59.1 (SPRNET), 18-JUN-1991 12:01:00.02
>  Node 59.1 (SPRNET)

	In the above OPCOM message, this event occurred on sprnet 
	and is from node sprnet.  In MCC's model, this event is considered an
	event of node4 sprnet remote node sprnet.  

	Thus, you need to use this command to get the event:

		MCC> getevent node4 sprnet remote node sprnet any event
	


1149.3. "Thanks for the info" by SNOC02::MISNETWORK (Take a byte) Tue Jun 18 1991 22:54 (22 lines)
    Thanks for the thorough reply. Good to see there are answers to some
    of my problems, if not total solutions. 
    
    I checked my EVL.LOG and only found 2, one was fine but the latest
    version showed the following -
    
    $ RUN SYS$SYSTEM:EVL
    %EVL-E-OPENMON, error creating logical link to monitor process
    SPRNET::"TASK=mcc_dna4_evl"
    -SYSTEM-F-INVLOGIN, login information invalid at remote node
    %EVL-E-WRITEMON, error writing event record to monitor process
    mcc_dna4_evl
    -SYSTEM-F-FILNOTACC, file not accessed on channel
    
    Must have been when I was turning the lights on and off. The log times
    do not correspond to the MCC_DNA4_EVL log, so I will have to remember
    next time to check the EVL.LOG when I get the network abort message.
    
    Looking forward to your findings,
    Cheers,
    Louis

1149.4. "Need more info for INSEVTPOOLMEM" by TOOK::T_HUPPER (The rest, as they say, is history.) Tue Jun 25 1991 16:06 (33 lines)
    The inquiry into the INSEVTPOOLMEM error needs further input from you. 
    Are you receiving MCC_S_EVENT_LOST in your com file log when the sink
    is reporting INSEVTPOOLMEM?  This should be the case.  If not, then
    something is either not being reported, or the event pool is so full
    that lost events cannot be delivered.
    
    A good way to create a big problem in the current type of event pool is
    to "stop" (exit handlers don't run) a DECmcc process that is receiving
    events while other DECmcc processes are still running.  The event pool
    will still contain the abandoned mcc_event_get request structures. 
    These abandoned requests will still receive all matching events, but
    will not read the event out of the event pool and free its memory.  If
    this is the case, the only way to free the memory is to exit from all
    DECmcc processes on the system and restart them.  There must be a point
    in time when there are NO DECmcc processes running.  Then the next
    DECmcc process to perform an event operation will cause the event pool
    to be recreated in its empty state.  Are you stopping any DECmcc
    processes on the system (any users) while leaving others running?
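    The failure mode described above can be pictured with a toy model. This is a minimal sketch in Python, not the DECmcc implementation: the class `EventPool`, its capacity, and the request names are all hypothetical, and the real event pool is certainly more elaborate. It only shows why one request that is never read is enough to exhaust a fixed-size pool.

```python
class EventPool:
    """Toy fixed-size pool: every outstanding request gets a copy of each event."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical number of event slots
        self.queues = {}           # request id -> undelivered events

    def request(self, req_id):
        self.queues[req_id] = []

    def used(self):
        return sum(len(q) for q in self.queues.values())

    def put(self, event):
        """Queue the event for every request; report failure when the pool is full."""
        for q in self.queues.values():
            if self.used() >= self.capacity:
                return "INSEVTPOOLMEM"
            q.append(event)
        return "OK"

    def get(self, req_id):
        """Read one event out of the pool, freeing its slot."""
        q = self.queues[req_id]
        return q.pop(0) if q else None


def events_until_full(capacity, abandon):
    """Return the number of the first event that cannot be put, or None."""
    pool = EventPool(capacity)
    pool.request("alarms")           # a live reader
    if abandon:
        pool.request("stopped")      # owner was stopped; nobody will ever get()
    for n in range(1, 1000):
        if pool.put(n) != "OK":
            return n
        pool.get("alarms")           # the live reader keeps up
    return None


if __name__ == "__main__":
    print(events_until_full(5, abandon=False))
    print(events_until_full(5, abandon=True))
```

    In this model the live reader alone never fills the pool, while a single abandoned request fills it after a number of events equal to the capacity, however fast the live reader is.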
    
    I would assume that a reboot of the system would also clean out the
    event pool nicely.  How long after the reboot did the sink report that
    the pool had INSEVTPOOLMEM?  How many events are correctly received
    before lost events or no events are received?  I would assume that
    events are correctly received for a while, then lost events are
    received, then no events are received.  This would be the case if the
    events were simply arriving too fast to be processed by the DECmcc
    system.  The event pool happens to be the most limited queue in the
    events subsystem, so that is where the problem is reported.  What is
    the arrival rate of events in the event sink that are to be processed
    by DECmcc?  Also, what type of machine are you using, so we can get an
    estimate of reasonable event throughput?
    
    Ted Hupper

1149.5. "INSEVTPOOLMEM error gone" by SNOC01::MISNETWORK (They call me LAT) Mon Jul 01 1991 02:00 (8 lines)
    The INSEVTPOOLMEM error seems to have gone away, so I will not pursue
    it at this stage. Things have been working pretty well, but I haven't
    had a chance to check all the logs, so I will start doing that again.
    
    Thanks for the advice,
    
    Cheers,
    Louis

1149.6. "still a prob" by JETSAM::WOODCOCK Mon Jul 01 1991 12:29 (62 lines)
If it's OK, I'd like to pick up and follow through on this problem. I see this
INSEVTPOOLMEM almost daily, with MCC_DNA4_EVL going south after a dozen or
two. This is hampering my confidence in using EVENTS.

>    The inquiry into the INSEVTPOOLMEM error needs further input from you. 
>    Are you receiving MCC_S_EVENT_LOST in your com file log when the sink
>    is reporting INSEVTPOOLMEM?  This should be the case.  If not, then
>    something is either not being reported, or the event pool is so full
>    that lost events cannot be delivered.
 
I'm sure I'm not starting the process exactly like base note but I don't
believe I've ever seen a MCC_S_EVENT_LOST error.
   
>    A good way to create a big problem in the current type of event pool is
>    to "stop" (exit handlers don't run) a DECmcc process that is receiving
>    events while other DECmcc processes are still running.  The event pool
>    will still contain the abandoned mcc_event_get request structures. 
>    These abandoned requests will still receive all matching events, but
>    will not read the event out of the event pool and free its memory.  If
>    this is the case, the only way to free the memory is to exit from all
>    DECmcc processes on the system and restart them.  There must be a point
>    in time when there are NO DECmcc processes running.  Then the next
>    DECmcc process to perform an event operation will cause the event pool
>    to be recreated in its empty state.  Are you stopping any DECmcc
>    processes on the system (any users) while leaving others running?
 
Usually the only reason we stop processes is because they don't work. Once
the INSEVTPOOLMEM kills MCC_DNA4_EVL we of course have to restart it. This
is typically in the morning when we check for proper processes. As far as
other MCC processes running, there probably are. It is unrealistic to stop ALL
MCC processes when we restart MCC_DNA4_EVL and the associated alarms. There
will ALWAYS be other alarm processes, recording, and exporting to take place.
We can't be restarting all MCC processes in the future when this occurs.
   
>    I would assume that a reboot of the system would also clean out the
>    event pool nicely.  

It seems to, yes.

>    How long after the reboot did the sink report that the pool had 
>    INSEVTPOOLMEM?  How many events are correctly received before lost 
>    events or no events are received?  I would assume that events are 
>    correctly received for a while, then lost events are received, then no 
>    events are received.  This would be the case if the events were simply 
>    arriving too fast to be processed by the DECmcc system.  The event pool 
>    happens to be the most limited queue in the
>    events subsystem, so that is where the problem is reported.  What is
>    the arrival rate of events in the event sink that are to be processed
>    by DECmcc?  Also, what type of machine are you using, so we can get an
>    estimate of reasonable event throughput?
 
I'm not sure how long after a reboot. Restarting MCC_DNA4_EVL seems to work
for several hours though. I haven't seen anything in the present logs which
indicate lost events. Events coming in can be anything from 1 an hour to 
5-10 per second. It depends on what is happening on the net. I'm now running
on an 8810 w/384M (it feels good to breathe again, the 3520 now only does the
display work). To say the least I should have enough fire power, and I'm gonna
let all sorts of MCC stuff **RIP** and bring us to the levels we should have
been at months ago.
   
best regards,
brad...

1149.7. "Please be careful how you kill background MCC processes" by TOOK::GUERTIN (I do this for a living -- really) Mon Jul 01 1991 14:52 (40 lines)
    If you are seeing INSEVTPOOLMEM when you look at the DNA4 EVL log file,
    then I can understand it.  It should (I assume) have some text around
    it, like "The DNA4 Event Monitor just got a INSEVTPOOLMEM from the MCC
    Event Manager!".  On the other hand, if you are seeing this signalled
    as a VMS message, then something doesn't make sense.  That CVR should
    always be trapped by the caller of the mcc_event_put() MCC kernel
    routine.
    
    In order to clean up a request of an event, the Requestor of an MCC
    event must cancel the request.  However, the Requestor cannot always
    cancel, for example, if the user hits Control-Y, the Requestor may not
    get control.  We therefore have an Exit Handler in  the Event Manager
    to capture any remaining outstanding requests.  On image exit, the
    Event Manager cleans up whatever the Event Requestors could not.  But
    if someone does a $ STOP on an MCC process which is requesting events,
    even the Exit Handlers do not get called.  There is little we can do at
    this point (being a user-mode event system).  The Event Sinks generally
    only PUT events, so stopping them (with a $ STOP) rarely (if ever)
    would cause outstanding Requests to be left in the Event Pool. 
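    The difference between an orderly image exit and a hard stop can be demonstrated by analogy on other systems. The sketch below uses Python's `atexit` as a stand-in for a VMS exit handler and a forced kill as a stand-in for `$ STOP`; this mapping is an assumption made for illustration, and the marker file merely plays the role of an outstanding event request. The name `request_left_behind` is hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Child program: registers a cleanup handler, then waits to be told to exit.
CHILD = r"""
import atexit, os, sys
marker = sys.argv[1]
atexit.register(lambda: os.remove(marker))   # the "exit handler"
open(marker, "w").close()                    # the "outstanding request"
print("ready", flush=True)
sys.stdin.readline()                         # wait for exit request (or a kill)
"""


def request_left_behind(hard_kill: bool) -> bool:
    """Run the child, end it one way or the other, report whether the marker survives."""
    fd, marker = tempfile.mkstemp()
    os.close(fd)
    with subprocess.Popen([sys.executable, "-c", CHILD, marker],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                          text=True) as proc:
        proc.stdout.readline()        # child has registered its handler
        if hard_kill:
            proc.kill()               # handlers do not get a chance to run
        else:
            proc.stdin.write("\n")    # ask the child to exit normally
            proc.stdin.flush()
        proc.wait()
    leftover = os.path.exists(marker)
    if leftover:
        os.remove(marker)             # tidy up after the demonstration
    return leftover


if __name__ == "__main__":
    print("orderly exit leaves request behind:", request_left_behind(False))
    print("hard kill leaves request behind:   ", request_left_behind(True))
```

    With an orderly exit the handler removes the marker; with a hard kill nothing in the dying process runs, so the cleanup has to be done by someone else, which is the situation the Event Pool is left in.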
    
    Ideally, Event Sinks should be stopped by issuing some sort of
    MCC> DISABLE <whatever> SINK command, which will cause a clean rundown
    of the Event Sink.  Check the Documentation for the exact command
    syntax for the Sink you want to stop.
    
    There are some MCC processes which run in the background (no user
    interface), but also do GETEVENTs.  These need to be aborted WITHOUT
    stopping them (e.g., DO NOT use the DCL $ STOP command).  An example
    might be MCC Alarms running in batch.  If you DO abort a background MCC
    Alarms process with $ STOP, it will almost always cause garbage (mostly invalid
    request information) to be left in the Event Pool.  The Putters (e.g.,
    DNA4 Event Sinks) would see these as valid requests for events, and post
    events to the Event Manager.   After awhile, the Events will flood the
    Event Pool, and you have to take fairly drastic measures (killing all
    processes using MCC) to clean things up.
    
    Do you have to kill background MCC Alarms processes?  If so, how do 
    you kill them?
    
    -Matt.

1149.8. "more info/questions" by JETSAM::WOODCOCK Mon Jul 01 1991 17:07 (54 lines)
>    If you are seeing INSEVTPOOLMEM when you look at the DNA4 EVL log file,
>    then I can understand it.  It should (I assume) have some text around
>    it, like "The DNA4 Event Monitor just got a INSEVTPOOLMEM from the MCC
>    Event Manager!".  On the other hand, if you are seeing this signalled
>    as a VMS message, then something doesn't make sense.  That CVR should
>    always be trapped by the caller of the mcc_event_put() MCC kernel
>    routine.
 
The INSEVTPOOLMEM error is indeed seen in the MCC_DNA4_EVL.LOG.
   
>    In order to clean up a request of an event, the Requestor of an MCC
>    event must cancel the request.  However, the Requestor cannot always
>    cancel, for example, if the user hits Control-Y, the Requestor may not
>    get control.  We therefore have an Exit Handler in  the Event Manager
>    to capture any remaining outstanding requests.  On image exit, the
>    Event Manager cleans up whatever the Event Requestors could not.  But
>    if someone does a $ STOP on an MCC process which is requesting events,
>    even the Exit Handlers do not get called.  There is little we can do at
>    this point (being a user-mode event system).  The Event Sinks generally
>    only PUT events, so stopping them (with a $ STOP) rarely (if ever)
>    would cause outstanding Requests to be left in the Event Pool. 
    
>    Ideally, Event Sinks should be stopped by issuing some sort of
>    MCC> DISABLE <whatever> SINK command, which will cause a clean rundown
>    of the Event Sink.  Check the Documentation for the exact command
>    syntax for the Sink you want to stop.
 
Actually, MCC_STARTUP_DNA4_EVL I think does this as a first step. In any
event, the errors and subsequent failure of MCC_DNA4_EVL doesn't come when
someone has STOPped a process. It is usually in the middle of the night
sometime. Could a STOP process cause problems later?
   
>    There are some MCC processes which run in the background (no user
>    interface), but also do GETEVENTs.  These need to be aborted WITHOUT
>    stopping them (e.g., DO NOT use the DCL $ STOP command).  An example
>    might be MCC Alarms running in batch.  If you DO abort a background MCC
>    Alarms process with $ STOP, it will almost always cause garbage (mostly invalid
>    request information) to be left in the Event Pool.  The Putters (e.g.,
>    DNA4 Event Sinks) would see these as valid requests for events, and post
>    events to the Event Manager.   After awhile, the Events will flood the
>    Event Pool, and you have to take fairly drastic measures (killing all
>    processes using MCC) to clean things up.
    
>    Do you have to kill background MCC Alarms processes?  If so, how do 
>    you kill them?
    
The only time we STOP MCC ALARMS processes is when they don't work. Sorry, I'm
a bit puzzled, if we are running ALARMS in batch what other options other than
STOP do we have to initiate a restart of the alarms? Or should the order of
things go, DISABLE SINK, STOP alarms process, ENABLE SINK, START alarms process?

thanks,
brad...

1149.9. "There are no easy answers for this problem" by TOOK::GUERTIN (I do this for a living -- really) Mon Jul 01 1991 19:07 (40 lines)
> Actually, MCC_STARTUP_DNA4_EVL I think does this as a first step. In any
> event, the errors and subsequent failure of MCC_DNA4_EVL doesn't come when
> someone has STOPped a process. It is usually in the middle of the night
> sometime. Could a STOP process cause problems later?
   
    Yes.  Once you STOP a process which is doing GETEVENTs, you have
    created a stale request, which could eventually clog up the Event
    Pool.  It may take minutes, hours, or days, depending on how often the events
    (which never get picked up) come into the Event Pool.
    
> The only time we STOP MCC ALARMS processes is when they don't work. Sorry, I'm
> a bit puzzled, if we are running ALARMS in batch what other options other than
> STOP do we have to initiate a restart of the alarms? Or should the order of
> things go, DISABLE SINK, STOP alarms process, ENABLE SINK, START alarms
> process?

    I'm sorrier than you are!  There is no elegant solution to this
    problem. The fact of the matter is that in the release notes, we state
    (for users of the MCC Kernel routines) that the MCC processes should
    not be STOPped.  End users are now realizing that it is useful to have
    Alarms running in batch, but don't know of a clean way to stop the
    batch process.  Hence, shooting it in the head seems to do the trick. 
    There are two possibilities for this awkward situation.  I recommend
    running Alarms from a window (you can iconize it).  If you want to kill
    Alarms, then just Control-Y out.  Everything should clean up correctly. 
    The other possibility is to do a "Forced Exit" of the Alarms process. 
    This is more difficult, because there is no way at DCL level to do
    this, you need to write your own program (I have one that I can post as
    a reply if you want it).  Also, since it causes the process to
    essentially call the Exit routine in the middle of execution, you may
    cause the process to go into resource waits (for example, if the
    process was in a Disable Control-Y window of execution, and you Forced
    an Exit).

    If Alarms is not working, then we need to figure out why BEFORE killing
    the Alarms process.  If we find the originator of the problems, you
    should never need to stop the Alarms process.  I think by solving one
    of your problems, you are creating bigger problems.
    
    -Matt.

1149.10. "If not Batch, then what?" by NSSG::R_SPENCE (Nets don't fail me now...) Tue Jul 02 1991 13:04 (11 lines)
    DECmcc engineering recommends running alarms in batch.
    
    No one is going to run production alarms in a window. Can't reboot the
    workstation... can't even log out to let someone else use it...
    
    Sounds like the re-engineering of alarms to a detached process
    controlled from DECmcc is a priority.
    
    What do we tell customers?
    
    s/rob

1149.11. "manageable batch alarms soon??" by JETSAM::WOODCOCK Tue Jul 02 1991 14:29 (16 lines)
I have to agree. Alarms from a window is not viable. For the reasons Rob
mentioned and also alarms run 24 hours a day. Leaving sys logged in all day/
night I'm uncomfortable with, especially considering I've set host to the
main system and this link potentially could drop occasionally creating the same
problem we're trying to avoid. Manageable alarms within batch has been LONG
stated as an area needed for change. Are there any updates as to when this
may change? As far as killing processes I'll try to walk more lightly but
what can I say. Stopping all MCC processes or rebooting a multi-application
clustered 8810 aren't pretty options. Also I'm not convinced this is the
root of all the evil, but only an irritant worsening the situation. FYI, this 
problem with the pool is probably more widespread among EVL users than known
because others have indicated they have seen it also. Considering how many are 
actually using EVL for monitoring it may be a high percentage seeing the error.

cheers,
brad...

1149.12. "We said THAT!?!?!" by TOOK::GUERTIN (I do this for a living -- really) Tue Jul 02 1991 14:47 (24 lines)
    Rob,
    
    As a member of DECmcc engineering, I'm amazed and disappointed that
    this fell through the cracks.  There is no patch that I can think of.
    
    I talked to Anil Navkal (Alarms PL) just yesterday, and thought he
    told me that they did NOT explicitly state that the user should run
    Alarms in batch.
    
    The problem is that Alarms does not have ANY detached process support.
    If it did, then we would not be in this predicament.  (This is not
    a complaint about the Alarms-FM.  The MCC-Kernel needs to provide
    generic detached process management routines.)  Other MMs have
    implemented their own private detached process support.
    
    The fact of the matter remains that you cannot kill the Alarms process
    by doing a DCL STOP on the process while Alarms is requesting Events. 
    I don't know what a DELETE/ENTRY does to a process, if it is the same
    as a STOP, then you MUST NOT do that either.
    
    Is it possible to have a command procedure disable all the Alarm Event
    rules running in batch?
    
    -Matt.

1149.13. "No can do ..." by TOOK::ORENSTEIN Tue Jul 02 1991 16:42 (13 lines)
    I too have been thinking about this problem, and I agree that
    ALARMS will be better off when it is detached.
    
    Matt, unfortunately rules are enabled within a process.  So
    a user on DCL can not see that rules are being run in batch.
    And that user on DCL can not disable the rules that are
    running in batch.
    
    In fact, ALARMS is designed so that once the rule is enabled,
    another process could delete the rule from the MIR, and it
    would keep running in the first process as if nothing happened.
    
    aud...

1149.14. "using DELETE not STOP" by JETSAM::WOODCOCK Tue Jul 02 1991 17:33 (7 lines)
Hi Matt,

For clarity, I always DELETE/ENTRY to stop the process. I never use STOP
PROCESS/ID=... I too, don't know if there is a difference. But I always
use DELETE because it's usually easier to type :-).

brad...

1149.15. "Try this instead..." by TOOK::GUERTIN (I do this for a living -- really) Tue Jul 02 1991 18:10 (102 lines)
    The following is a VAX C program which will attempt to send a Force
    Exit to another process.   You need privileges to send a Force Exit
    to a process that you do not own.
    
    If you need to abort an MCC process and cannot do it interactively,
    then please try using "FORCEX" before attempting to use the STOP or
    DELETE/ENTRY commands.  (At least until we find a better solution.)
    
    -Matt.
    
    This program is not supported by NME, MCC, or DEC in general.   No
    one is liable or responsible for this program in any way, shape or
    form.  Use at your own risk.  Etc,etc. <insert usual caveats here>
    --------------------------CUT HERE---------------------------------
/* FORCEX.C -- Force Another Process to Exit
               (by calling the $FORCEX system routine).

   $ CC FORCEX.C
   $ LINK FORCEX.OBJ, SYS$INPUT:/OPT   ! Type in image lib interactively.
   SYS$SHARE:VAXCRTL.EXE/SHARE
   ^Z                                  ! Control-Z out of input mode.
   $ COPY FORCEX.EXE                   ! Copy it to where you want it.

   from a privileged account,
   define it as a Foreign command:
   $ FORCEX:==$SYS$DISK:[]FORCEX.EXE   ! Use actual disk location.
   $ FORCEX <pid1> [<pid2> ... <pidn>] ! Use PID or Process name (quoted).

*/
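#include <stdio.h>      /* printf, scanf */
#include <stdlib.h>     /* malloc */
#include <string.h>     /* strlen */
#include <descrip.h>
#include <ssdef.h>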

int remove_quotes( p_string ) /* Remove double quotes */
char *p_string;
{
  int i;

  for (i=0;*(p_string+i) != '\0';i++)
    *(p_string+i) = *(p_string+i+1);
      
  if ((i > 1) && (*(p_string+i-2) == '"'))
    *(p_string+i-2) = '\0';

  return (strlen( p_string ));
}

main( argc, argv )
int argc;
char *argv[];
{
  int exit_code = SS$_FORCEDEXIT;
  int use_pid;
  int sstat;
  int pid;
  char *procnam_str;
  struct dsc$descriptor procnam_dsc = {0, DSC$K_DTYPE_T, DSC$K_CLASS_S, 0};
  int arg_count = 0;
  int quotes = 0; /* boolean flag: 1 = name was quoted, 0 = no quotes */
  int msg_len;
  char msg_txt[256];
  struct dsc$descriptor msg_dsc = {256, DSC$K_DTYPE_T, DSC$K_CLASS_S, msg_txt};

  procnam_str = malloc( 256 );
  do
  {
    arg_count++;
    if (argc < 2)
    {
      printf("Enter a PID in hex (or a Process Name) : ");
      scanf("%255s", procnam_str );  /* bound input to the 256-byte buffer */
      argc = 1;
    }
    else
      procnam_str = argv[arg_count];

    procnam_dsc.dsc$w_length = strlen( procnam_str );
    procnam_dsc.dsc$a_pointer = procnam_str;

    /* Quoted strings are always treated as Names */
    quotes = (*procnam_str == '"');
    if (!quotes && (ots$cvt_tz_l(&procnam_dsc, &pid, 4, 0) == SS$_NORMAL))
      sstat = sys$forcex( &pid, 0, exit_code );   /* code is passed by value */
    else
    {
      if ((quotes) && (procnam_dsc.dsc$w_length > 1))
        procnam_dsc.dsc$w_length = remove_quotes( procnam_str );
      sstat = sys$forcex( 0, &procnam_dsc, exit_code );
    }

    if (sstat == SS$_NORMAL)
      printf("\nForced Exit successfully requested for %s\n", procnam_str );
    else
    {
      printf("\nForced Exit request failed for %s\n", procnam_str);
      sys$getmsg( sstat, &msg_len, &msg_dsc, 1, 0 );
      msg_txt[msg_len] ='\0';
      printf("Reason: %s\n",msg_txt);
    }

  } while (arg_count < argc-1);

  return SS$_NORMAL;   /* give DCL a defined exit status */
}
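
The PID-versus-name rule in the loop above is easy to miss, so here is a
minimal portable sketch of it in standard C.  It is not part of Matt's
program: classify_target and target_kind are hypothetical names, and a
plain hex-digit check stands in for OTS$CVT_TZ_L, whose handling of
blanks and signs it does not try to reproduce.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum target_kind { TARGET_PID, TARGET_NAME };

/* Return 1 if text is 1 to 8 hex digits (a 32-bit PID), else 0. */
static int is_hex_pid(const char *text)
{
    size_t i, len = strlen(text);

    if (len == 0 || len > 8)
        return 0;
    for (i = 0; i < len; i++)
        if (!isxdigit((unsigned char)text[i]))
            return 0;
    return 1;
}

/* Classify one argument: an unquoted hex string is a PID, anything else
   is a process name.  For a name, surrounding double quotes are removed
   and the result is copied into name (truncated to fit name_size). */
enum target_kind classify_target(const char *arg, unsigned long *pid,
                                 char *name, size_t name_size)
{
    const char *start = arg;
    size_t len = strlen(arg);

    assert(name_size > 0);

    if (arg[0] != '"' && is_hex_pid(arg)) {
        *pid = strtoul(arg, NULL, 16);
        return TARGET_PID;
    }
    if (len > 0 && arg[0] == '"') {         /* drop the opening quote */
        start++;
        len--;
        if (len > 0 && start[len - 1] == '"')
            len--;                          /* drop the closing quote */
    }
    if (len >= name_size)
        len = name_size - 1;                /* truncate to fit */
    memcpy(name, start, len);
    name[len] = '\0';
    return TARGET_NAME;
}
```

Note that an unquoted name made up only of hex digits (DEAD, say) is
taken as a PID, which is why process names are best given in quotes.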
1149.16. "SET MODE=HACK" by WAKEME::ANIL Wed Jul 03 1991 12:21
   Thanks Matt. Will everyone out there give a good round of applause 
   to Matt for writing the real code! :-)

   While you guys are busy compiling Matt's program you may want to try 
   the following to get you out of the
   "how-to-stop-MCC-that-is-running-in-the-background" problem.

   The command procedure has all the comments. My first thought was to 
   make it a lot fancier and have it driven by a rule firing that would 
   stop the batch job. But for now I prefer it to be very simple. A little
   effort on the user's part will solve the problem. In V1.2 we may try to
   be a little more user friendly :-), no promises though!!
   



$ manage/enter
! Enable mcc 0 Alarms rule foo_1, in domain blaha
!       :                                         
!	Enable all your rules here                                                
!       :                                         
! Enable mcc 0 Alarms rule foo_n, in domain blaha
!
! The following command waits for whatever delta-time you specify.
! To stop the background MCC, find the PID of the spawned process.
! Its process name is <username>_1, and its PID (call it x) is
! generally 1 more than the batch job's PID. Do your favorite stop/id
! for PID x and the spawned process will be killed. The parent process
! then resumes with the next mcc command, which just happens to be a
! graceful exit. You may want to do SHOW MCC 0 Alarms RULE * all att
! before the exit command.
!
  spawn wait 22:00:00           
  exit
	 

1149.17. "works good" by JETSAM::WOODCOCK Wed Jul 03 1991 18:41
Hi Matt,

Thanks for the program. I've got it compiled and tested. It seems to do
the trick and hopefully it helps and/or resolves this problem.

best regards,
brad...

PS.   Anil, nice creative hack as Option B :-)	 


1149.18. "For a future version..." by MARVIN::COBB (Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917) Fri Jul 05 1991 11:31
Processes will always get stopped for many reasons.  You shouldn't ever rely
on user-mode exit handlers or ^Y interception to clean up a shared resource.
There are two fairly obvious fixes I can think of for a future version:

1) Use  a  kernel  mode  exit  handler.   Of  course  this  requires writing
privileged, inner mode code and using things like protected sharable images.

2) Take  stock  and  tidy  up  frequently.  For example every time a process
connects  to  the  global  section  have it look around and tidy up the mess
caused  by  a  process going away unexpectedly.  Or do it from a timer.  The
main  problem  here is working out who is still attached.  Fortunately there
is an easy solution to that using locks.  

You can  get as complex as you like using locks but a simple solution should
work: every process that uses the global section writes its PID somewhere in
the section where everyone else can find it.  It also takes out an exclusive
lock  called  MCC$<pid>.  If another process needs to know whether the first
process  is  still  around  (and,  more  importantly, still using the global
section!)  it  tries  to acquire lock MCC$<pid>.  If it succeeds the process
has stopped using the section and its mess should be tidied away.
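
As a rough illustration of the "try to acquire the other process's lock"
test, here is a minimal sketch using BSD flock() on a per-PID file.  This
is an analogue only: on VMS the lock would be taken through the lock
manager under the name MCC$<pid>, and the /tmp/mcc_lock.<pid> path and
both function names are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Build the (hypothetical) lock file name for one process. */
static void lock_path(char *buf, size_t size, long pid)
{
    assert(pid > 0);
    snprintf(buf, size, "/tmp/mcc_lock.%ld", pid);
}

/* Called by a process when it starts using the shared section.
   Returns a descriptor that holds the exclusive lock, or -1 on error.
   The system drops the lock when the process exits for any reason. */
int claim_liveness_lock(long pid)
{
    char path[64];
    int fd;

    lock_path(path, sizeof path, pid);
    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Called by any other process.  Returns 1 if the lock is still held
   (owner alive), 0 if it could be taken (owner gone, tidy up its mess),
   -1 on any other error. */
int owner_is_alive(long pid)
{
    char path[64];
    int fd, result;

    lock_path(path, sizeof path, pid);
    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        flock(fd, LOCK_UN);
        result = 0;
    } else {
        result = (errno == EWOULDBLOCK) ? 1 : -1;
    }
    close(fd);
    return result;
}
```

The point of the design is that nobody has to remember to release the
lock: it goes away with the process, however the process was stopped.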

Either of  those  solutions  could work.  Or, of course, something much more
specific  to the alarms module.  Whichever way it is done I think this needs
to be a high priority to fix for V1.2.

Graham
1149.19. "The future is ... "Portability"!" by TOOK::GUERTIN (I do this for a living -- really) Mon Jul 08 1991 11:50
    RE:.18
    
    Graham,
       Yes, the solutions you suggest are doable.  The problems are:
    
    1) Using a kernel mode exit handler.  This is analogous to cracking
       open a peanut with a thermonuclear device.  Yes, it will work,
       yes, it is overkill, yes, there are simpler (and more portable)
       solutions which stay in user-mode.
    
    2) Various garbage collection schemes.  Counting on things such as
       PIDs to identify a process will work until the same PID gets
       re-used.  If you look at the N-process to N-process communication
       behavior of MCC events (for example, Sinks are generally very
       long-running processes which mainly do Puts, while foreground MCC
       processes tend not to run very long, and do Gets), then you will
       notice that it may be several hours or days between when the
       process goes away and another process needs to check its PID.  I
       do not believe there is a guarantee in the VMS architecture that
       PIDs will not be reused, or at which intervals they could be
       re-used.  If you know of any statements (such as, "PIDs are always
       unique and never reused between reboots"), then please let me
       know.  Also, remember that we are not just talking about
       processes, we are also talking about threads.  For example, if a
       thread issues a Get, and then is destroyed, or hangs, the event
       request remains in the event pool.
    
    Instead, there appears to be a handful of creative, yet simple
    solutions, which provide the same end result.  Some examples:
    1) Implement a "sweeper".  Sweepers are threads which run in any
       process which calls the MCC Event Manager.  They are started
       up on Event Manager initialization, and periodically scan the
       Event Pool for garbage.  Unfortunately, this is an "active"
       as opposed to a "passive" solution, and requires the system
       to do more work based upon the load.
    2) When the Putter puts an Event, and notices that the Getter
       hasn't picked up events in a timely fashion, he issues a
       "challenge".  If the Getter accepts the challenge, then
       the Request is validated.
    3) Each Getter has a quota of the number of events it can
       have queued up in the event pool.  If the quota is reached,
       the events are "lost".  After a period of no Getter activity,
       the request itself becomes invalid.
    
    We have several others, including various combinations of the above
    schemes.
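
    To make the "sweeper" idea (scheme 1) concrete, here is a minimal
    sketch of a single sweep pass in plain C.  It is not the Event
    Manager's code: the get_request layout, the fixed-size pool array
    and the max_idle parameter are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct get_request {
    int  in_use;         /* nonzero while the slot holds a request */
    long last_pickup;    /* when the Getter last collected events, seconds */
};

/* One sweep pass over the pool: free every request whose Getter has
   been idle for more than max_idle seconds.  Returns the number of
   slots reclaimed.  A sweeper thread would call this periodically. */
size_t sweep_pool(struct get_request pool[], size_t slots,
                  long now, long max_idle)
{
    size_t i, reclaimed = 0;

    assert(pool != NULL);
    for (i = 0; i < slots; i++) {
        if (pool[i].in_use && now - pool[i].last_pickup > max_idle) {
            pool[i].in_use = 0;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

    The cost mentioned above shows up directly: every pass walks the
    whole pool whether or not there is any garbage in it.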
    
    I appreciate your interest, and your taking the time to propose
    plausible solutions.  However, the real issue is not the lack of
    solutions, but the lack of time and people resources to implement them.
    The solution we have finally come up with requires a minimum of both,
    but it still must be worked into the schedule and traded against other
    tasks (which means some other piece of functionality or some other bug
    fix will NOT get into the product in the next release).  For V1.1, we
    reluctantly settled for exit handler cleanup -- although that solution
    isn't very portable either :-).
1149.20. "Help is coming - "Real Soon"" by TOOK::T_HUPPER (The rest, as they say, is history.) Mon Jul 08 1991 14:46
    RE:.18, .19
    
    Just so everybody can feel better about "the Event Manager that can't
    clean up after itself", we have time allocated for the V1.2 release to
    implement some/all of the functionality that Matt outlined in .19.  The
    internal Event Pool cleanup mechanism has always been an integral part
    of the Event Manager, but until now, there has been NO time to
    implement it.  The tradeoffs we've had to make in many areas of DECmcc
    in order to get ANY product out the door have been severe.  We are
    allocating more time now to filling in some of the areas previously
    traded off.
    
    Ted
1149.21. "" by MARVIN::COBB (Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917) Mon Jul 08 1991 14:49
You are right that there are many possible solutions (by the way, the "lock"
approach  can be made immune to re-using the same PID but it rapidly becomes
complex).   Personally  I  would  probably  use the kernel mode exit handler
approach,  but  then  I  have been writing VMS inner mode code for almost 10
years!

I take  your  point  that  any  solution  will cost some other feature but I
wanted to add my voice to the outcry that a user mode exit handler is not an
adequate solution for V1.2.

Graham
1149.22. "INSEVTPOOLMEM is back" by JETSAM::WOODCOCK Thu Jul 18 1991 14:59
I have come back to the original problem, INSEVTPOOLMEM. I have once again
received this error today. I have been extremely careful to use 'FORCEX'
but the error has reappeared. Usually I can simply restart MCC_DNA4_EVL
and all works well for awhile but not today. Restarting it brought back
the same error within minutes. Do I always have to reset all MCC processes
when I receive this error? Please say no; that is a painful workaround.

As a side note I have been working on MCC and EVL being more robust. As
a consequence I forced EVL to go away many times yesterday which produced
a fatal link abort error in MCC_DNA4_EVL. Could this have been the prelude
to this error coming on again? It shouldn't be because EVL goes away on its
own often and can't be avoided thru normal operations. Would it help to
restart only MCC_DNA4_EVL each work day? BTW, I think I have a hack to
keep MCC_DNA4_EVL running even when EVL drops out. I'll be looking for
opinions on it but I'll post it in the appropriate note.

regards,
brad...
1149.23. "Not processing sinked events" by AUNTB::BRILEY (Are you a rock or leaf in the wind) Wed Jul 24 1991 12:42
    Did anyone ever find the cause of the initial problem that Louis
    reported, that is, MCC_DNA4_EVL not receiving/processing sinked
    events?
    
    Thanks,
    
    Rob
1149.24. "Event Mgr cleanup for killed processes?" by TAEC::MCDONALD Mon Feb 17 1992 08:16
    re .20
    >Just so everybody can feel better about "the Event Manager that can't
    >clean up after itself", we have time allocated for the V1.2 release
    >to implement some/all of the functionality that Matt outlined in .19.
    
    I am using mcc  Component Version = T1.2.4 on Ultrix. 
    Has the functionality discussed in notes .19 & .20 been implemented
    in the newer Event Manager?
    I have a background process which does an mcc_event_get for infinity.
    If this process gets killed (kill on Ultrix), then other
    processes doing mcc_event_puts still receive a status of
    Normal (as if another process has received the event, when
    in fact there are no other processes waiting for the event).
    
    If the background process does an mcc_event_get cancel before
    exiting then this does not happen.
    
    Is there a way to correct this (the mcc_event_put receives 
    MCC_S_NOEVENTREQ when the process is no longer there) ?
    
    thanks, Carol
1149.25. "Use mcc_kill rather than kill" by TOOK::MINTZ (Erik Mintz, DECmcc Development, dtn 226-5033) Mon Feb 17 1992 11:40
This does appear to be a problem (and I have seen the relevant QAR).
However, we DO NOT recommend killing DECmcc processes on ULTRIX
using "kill".  That is why we provide mcc_kill to terminate them.

-- Erik

1149.26. "what's the difference?" by TAEC::MCDONALD Mon Feb 17 1992 13:59
    what does mcc_kill do differently from "kill"?
    
    Anyway a process might exit for other reasons before doing a cancel.
1149.27. "mcc_kill allows a clean shut down" by TOOK::MINTZ (Erik Mintz, DECmcc Development, dtn 226-5033) Mon Feb 17 1992 14:16
>    what does mcc_kill do differently from "kill"?

It sends an MCC event that allows a process to shut itself down.

There are known clean-up problems when a process is abruptly
terminated.

1149.28. "Event manager cleanup has been implemented in V1.2" by TOOK::T_HUPPER (The rest, as they say, is history.) Tue Feb 18 1992 14:08
    RE .24:
    
    New functionality for V1.2:
    
    The event manager DOES cleanup when processes die.  It does NOT do so
    immediately.  The purpose is to avoid filling up the event memory pool
    with events for GETs of processes that have been killed/stopped.  The
    purpose is not to ensure to the PUT that a GET actually processed the
    event.  That is impossible for the (low-level) event manager to do.  It
    has no control over what happens to an event after it leaves the event
    manager.  

    The cleanup that is done when a process doing mcc_event_get calls dies
    is based on a timer and the queue of the mcc_event_get filling up.  The
    algorithm is as follows:

    If the event queue (settable with the MCC_EVENT_EDQ_SIZE_LIMIT
    environment variable, default is 200) for the GET fills up, after a
    timeout (settable with the environment variable
    MCC_EVENT_EDQ_TIME_LIMIT, default is 60 seconds) AND another event is
    PUT to this queue, the entire contents of the queue is converted to
    lost events.  If another event is PUT to this queue after another
    timeout (settable with the environment variable
    MCC_EVENT_LOST_TIME_LIMIT, default is 600 seconds) expires, the GET
    structures are removed from the event manager.  No further PUTs will
    see this deleted GET (they will now receive MCC_S_NOEVENTREQ).
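
    One way to read the algorithm above is as a small state machine
    driven only by PUTs.  The sketch below is a model, not the event
    manager's code.  Its assumptions: the first timer starts when the
    queue fills and the second when its contents are converted to lost
    events; a PUT to a full queue whose timer has not expired is counted
    as lost; and the PUT that triggers removal already gets the
    no-request status.  PUT_NORMAL and PUT_NOEVENTREQ stand in for
    MCC_S_NORMAL and MCC_S_NOEVENTREQ, and all type and function names
    are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum put_status { PUT_NORMAL, PUT_NOEVENTREQ };
enum get_state  { GET_ACTIVE, GET_LOST, GET_REMOVED };

struct edq_limits {
    long size_limit;       /* MCC_EVENT_EDQ_SIZE_LIMIT, default 200 */
    long time_limit;       /* MCC_EVENT_EDQ_TIME_LIMIT, default 60 s */
    long lost_time_limit;  /* MCC_EVENT_LOST_TIME_LIMIT, default 600 s */
};

struct get_queue {
    enum get_state state;
    long queued;   /* events waiting for the Getter */
    long lost;     /* events converted to, or arriving as, lost events */
    long mark;     /* time the queue filled (ACTIVE) or was converted (LOST) */
};

/* Model one PUT that matches this GET at time `now` (seconds). */
enum put_status put_event(struct get_queue *q,
                          const struct edq_limits *lim, long now)
{
    assert(q != NULL && lim != NULL);

    if (q->state == GET_REMOVED)
        return PUT_NOEVENTREQ;

    if (q->state == GET_LOST) {
        if (now - q->mark >= lim->lost_time_limit) {
            q->state = GET_REMOVED;      /* GET structures removed */
            return PUT_NOEVENTREQ;
        }
        q->lost++;
        return PUT_NORMAL;
    }

    /* GET_ACTIVE */
    if (q->queued < lim->size_limit) {
        q->queued++;
        if (q->queued == lim->size_limit)
            q->mark = now;               /* queue just filled: start timer */
        return PUT_NORMAL;
    }
    if (now - q->mark >= lim->time_limit) {
        q->lost += q->queued + 1;        /* whole queue becomes lost events */
        q->queued = 0;
        q->state = GET_LOST;
        q->mark = now;                   /* start the second timer */
    } else {
        q->lost++;                       /* full, first timer still running */
    }
    return PUT_NORMAL;
}
```

    With the default limits, a dead GET keeps answering "normal" for at
    least 60 + 600 seconds after its queue fills, and longer if PUTs are
    infrequent, since nothing happens in this model until a PUT arrives.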

    If the event pool has filled to a threshold level (not settable), it is
    not necessary to have any PUTs enqueued for the dead GET to have the
    above sequence take place.  All GETs in the event pool are checked
    against the timeouts.  Any GETs past the timeouts are deleted along
    with their posted events.

    The purpose of the above sweeping operation is to prevent the event
    manager pool from being put out of commission by dead GETs.  Note that
    because of the timeouts and/or requirement to reach a threshold of
    fullness, we cannot give instantaneous accuracy on whether or not the
    event actually went to a GET process.

    After a process with outstanding GETs dies, and before the GET
    structures are removed from the event pool, PUTs that match those GETs
    will return MCC_S_NORMAL.  After the cleanup, they will receive
    MCC_S_NOEVENTREQ.  The difference in these CVRs is whether or not the
    event was queued to a GET, not whether the event was acted upon by a
    real process.
    
    If you need to know whether an event was acted upon, then you need a
    transaction processing model.  As the event manager is only providing a
    one-way distribution of data, a single event posting cannot provide
    this capability.  An end-to-end receipt is required.  A return event
    could provide that receipt, but the model is becoming complicated.
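
    For what such a receipt scheme might look like on the PUT side, here
    is a minimal sketch of the bookkeeping only.  It assumes each event
    carries an application-chosen ID and that the receiving process
    posts a return event quoting that ID; none of this is event manager
    functionality, and every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

struct pending_event {
    int           in_use;
    unsigned long event_id;
    long          sent_at;    /* seconds */
};

/* Remember an event that has just been PUT.  Returns 0, or -1 if the
   table is full. */
int track_event(struct pending_event table[], unsigned long id, long now)
{
    size_t i;

    assert(table != NULL);
    for (i = 0; i < MAX_PENDING; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].event_id = id;
            table[i].sent_at = now;
            return 0;
        }
    }
    return -1;
}

/* A return event quoting `id` has arrived.  Returns 1 if it matched an
   outstanding event, 0 if it was unknown or already acknowledged. */
int receipt_arrived(struct pending_event table[], unsigned long id)
{
    size_t i;

    for (i = 0; i < MAX_PENDING; i++) {
        if (table[i].in_use && table[i].event_id == id) {
            table[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}

/* Count events still unacknowledged more than `deadline` seconds after
   they were sent; these are the ones nobody acted upon in time. */
size_t count_overdue(const struct pending_event table[],
                     long now, long deadline)
{
    size_t i, overdue = 0;

    for (i = 0; i < MAX_PENDING; i++)
        if (table[i].in_use && now - table[i].sent_at > deadline)
            overdue++;
    return overdue;
}
```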
    
    If knowing as quickly as possible whether a GET process has died
    (perhaps so that an automatic restart of the GET process can be done
    (but why did it die?)) is really important, we would have to test the
    existence of the GET process for each matching GET for each PUT of an
    event.  Given that the event manager cannot guarantee action on an
    event and needs to have high performance, we did not implement this
    test.
    
       Ted