Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

84.0. "Is this a Bug or is it me ?" by PILOU::BONGARTZ (Huckleberry Finn, I presume ?) Tue Mar 27 1990 05:56

T.R  Title  User  Personal Name  Date  Lines
84.1. "INFO - are you running EFT kit?" by GOSTE::CALLANDER Tue Mar 27 1990 18:49 (10 lines)
    
    Hi,
    
    You have hit upon some of the problems that we are currently working
    on. I would be interested in knowing if you are running the EFT
    kit. Especially the component version numbers of the DECnet NODE4
    Access Module, the TRM Presentation Module, and the base system.
    
    Thanks for the additional information.
    
84.2. "All T1.0.0 ..." by PILOU::BONGARTZ (Huckleberry Finn, I presume ?) Wed Mar 28 1990 09:55 (7 lines)
>    kit. Especially the component version numbers of the DECnet NODE4
>    Access Module, the TRM Presentation Module, and the base system.

	All three Component Versions are T1.0.0 ...

	( my workaround now is to exit and re-run mcc if it takes
	more than 45 seconds for a poll... )
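
A minimal sketch of the workaround above (give up on a poll after 45 seconds and start again), written in Python rather than DCL. It assumes each poll can be run as a separate command; `run_poll`, `poll_with_restart`, and the `cmd` argument are hypothetical names, and no real mcc command line is shown.

```python
import subprocess
import sys

def run_poll(cmd, limit_seconds=45.0):
    """Run one poll command; return its stdout, or None if it ran past the limit."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout, so the next call starts fresh.
        return None
    return done.stdout

def poll_with_restart(cmd, limit_seconds=45.0, attempts=2):
    """Retry a poll in a fresh process each time it exceeds the limit."""
    for _ in range(attempts):
        out = run_poll(cmd, limit_seconds)
        if out is not None:
            return out
    return None
```

The stand-in command in a quick check could be `[sys.executable, "-c", "print('ok')"]`; a real poll command would replace it.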

84.3. "if you find another goods ones..." by GOSTE::CALLANDER Wed Mar 28 1990 20:27 (12 lines)
    
    Thanks for the additional input. We will see what can be done. If
    you hit any other commands that go up at such a nice rate it would
    be useful if you posted them here. Since different commands go through
    different paths in the system, something that looks like a small
    leak on one command sometimes turns out to be something major on
    another command.
    
    jill
    
    
    
84.4. "got one! (or two?)" by PILOU::BONGARTZ (Huckleberry Finn, I presume ?) Fri Mar 30 1990 09:14 (28 lines)
>                     -< if you find another goods ones... >-

	Got another one...

	In my original polling loop, I also checked the counters on
	the local node (GABIN). Each poll created a SERVER_xxxx process,
	which apparently terminated after about 5 minutes... but as the
	commands were given in less time than that, the system filled up
	with these processes... and ended up doing nothing but paging
	and swapping.

	Another thing, though it might not be due to me, MCC or whatever
	else - "just a coincidence ?" :

	I started my poll server in the afternoon before leaving work,
	and left it running overnight, polling all the routers here
	in Valbonne. During the night, the whole network went down -
	systems crashed, etc. The last output from my server was at
	03:13, and about that time the problems occurred. Whether my
	code crashed because of the problem, or the problem occurred
	because of the polls, is not clear to me - but *if* it's due
	to MCC or my server (no privs!), we'd better make sure this
	doesn't happen on a customer network.. I'll let the thing run
	tonight and let you know if the net goes down the drain again.


		Regards,
				Marc.

84.5. "Thanks for the additional information" by PETE::BURGESS Fri Mar 30 1990 13:55 (36 lines)
    You have presented several problems to us which have been
    assigned to different engineers for resolution.
    
    1) The reserved operand fault which occurs when MCC is executed
       as a sub-process assigning sys$input/output to mail-boxes.
    
       This seems like a contained problem. I will try to reproduce
       your experiment here and diagnose the problem. Would you
       send me the exact commands which you used to create the mcc
       sub-process and the commands used for communicating with
       the sub-process?
    
       (enet:  Pete::Burgess)
    
    
    2) Virtual memory expansion.  This is probably due to "vm leaks".
       We have instrumented test versions of MCC with diagnostic
       tools for recording vm deallocation problems, and have been
       testing this problem since December, and have fixed many problems.
       Our focus has probably been on the normal successful operations,
       and the most common error paths.  My hypothesis is that
       MCC is taking some error paths without properly terminating
       its requested operations.   We will be trying to reproduce 
       this problem with our instrumented version of MCC.
    
    3) The performance problems:  The DECnet phase 4 project leader
       will be contacting you to obtain more diagnostic information.

       My first concerns relate to the large number of nml servers which
       are being created on your routing servers.
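
The hypothesis in item 2 (an error path that returns without releasing what the request allocated) can be illustrated with a toy tracking allocator. This is only a sketch in Python, not the instrumentation used on MCC; `TrackingAllocator`, `leaky_request`, and `safe_request` are made-up names.

```python
class TrackingAllocator:
    """Hands out integer handles and remembers which ones are still live."""
    def __init__(self):
        self.live = set()
        self.next_id = 0

    def alloc(self):
        self.next_id += 1
        self.live.add(self.next_id)
        return self.next_id

    def free(self, handle):
        self.live.discard(handle)

def leaky_request(heap, fail):
    buf = heap.alloc()
    if fail:
        raise RuntimeError("request failed")  # error path skips heap.free
    heap.free(buf)

def safe_request(heap, fail):
    buf = heap.alloc()
    try:
        if fail:
            raise RuntimeError("request failed")
    finally:
        heap.free(buf)  # released on success and on error alike

def outstanding_after(request, failures):
    """Run `failures` failing requests and one good one; count live handles."""
    heap = TrackingAllocator()
    for _ in range(failures):
        try:
            request(heap, True)
        except RuntimeError:
            pass
    request(heap, False)
    return len(heap.live)
```

With the leaky version, every failed request leaves one handle outstanding, so a test that only exercises the successful path never sees the growth.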
    
    
    \Pete Burgess
    
    
	       
84.6. "Reduce NETSERVER$TIMEOUT to dump processes" by TOOK::CAREY Fri Mar 30 1990 15:25 (26 lines)
    
    
    The only way we can see MCC "bringing down the network" is by applying
    huge loads on all of the routing nodes in the network.  If we put
    enough pressure on them in terms of excessive NETSERVERs, it is
    conceivable that they will be unable to perform normal network 
    communications.  As soon as that happens, the routing traffic increases
    dramatically because the routers are trying to understand the topology.
    
    If you've got an appreciable number of routers, the network degrades
    rapidly.
    
    So, the first thing to do is get rid of the excessive NETSERVERS.
    We don't know why you spawn a new server with each connection.  But 
    until we do, you can at least cut down on the number of server
    processes that are out there by setting the NETSERVER process timeout
    lower.  Do this by setting the system logical NETSERVER$TIMEOUT to just
    a few seconds instead of the default of around five minutes.  You'll
    still suffer the process creation overhead, but at least you won't get
    the swapping and paging that you're seeing.
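
To see why a shorter timeout helps, here is a small Python model of the situation described: every poll spawns a new server, and each server exits a fixed time after it was created. The 30-second poll interval and the 5-second timeout are assumed values for illustration; only the five-minute default comes from the discussion above.

```python
def live_servers(poll_interval, idle_timeout, duration):
    """Peak number of server processes alive at once.

    Assumes every poll spawns a new server and that each server exits
    idle_timeout seconds after creation. All times are in seconds.
    """
    expiry = []
    peak = 0
    t = 0
    while t <= duration:
        expiry = [e for e in expiry if e > t]  # reap servers whose timeout passed
        expiry.append(t + idle_timeout)        # this poll spawns one more
        peak = max(peak, len(expiry))
        t += poll_interval
    return peak

default_peak = live_servers(poll_interval=30, idle_timeout=300, duration=3600)
short_peak = live_servers(poll_interval=30, idle_timeout=5, duration=3600)
```

In this model the peak is roughly the timeout divided by the poll interval, so polling several nodes at once multiplies the number of idle servers each routing node has to carry.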
    
    Hope this helps, and I'll give you more on this server problem as soon
    as I can find out more.
    
    -Jim Carey
    
84.7. "We Can't Reproduce Multiple Server Problems" by TOOK::CAREY Mon Apr 02 1990 16:03 (61 lines)
    
    Marc,
    
    I had a chance to do some experimenting on our network here, and
    was unable to reproduce a situation where multiple servers were 
    spawned and weren't expected.  Any details that you could give me
    about the exact nature of your requests could help, although I can't
    imagine what might be different about them.
    
    I created and checked out the following cases:
    
    - Connecting to a remote node with Proxy Access defined.
    
    	This worked fine.  Subsequent requests connected to the spawned
    	server.
    
    - Connecting to a remote node using explicit access (BY USER = "...")
    
    	This also worked fine.  I did these close together, so the Proxy
    	Server was still out there, and a new server was created for the
    	explicit access case.  This is normal because VMS has to consider
    	them to be different processes with different rights.  As expected,
    	subsequent requests connected to the same server just spawned.
    
    - Connecting to a remote node using Default Access (no proxy, no
      explicit accounting information)
    
    	This worked as expected too.  After forming this connection, I had
    	three servers running: one for the Proxy access, one for the
    	Explicit Access, and one for the Default Access.  Subsequent
    	requests didn't spawn any new servers.
    
    In fact, once I had the three servers running, I attempted to confuse
    the system by using Proxy, Explicit, and Default Access in different
    combinations.  No problems were encountered, and no additional
    processes were spawned (by the way, connecting to an existing server
    cuts down the response to a circuit counters request from an estimated
    fifteen seconds, to two or three seconds maximum).
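
The behaviour observed in these tests (one server per kind of access, reused by later requests) can be pictured with a toy model in Python. `ServerPool` is a hypothetical name, and the model ignores timeouts and everything else the real system takes into account.

```python
class ServerPool:
    """One server per access identity; later connections reuse it."""
    def __init__(self):
        self.servers = {}   # access identity -> server id
        self.spawned = 0

    def connect(self, access):
        if access not in self.servers:
            self.spawned += 1                  # first use: spawn a server
            self.servers[access] = self.spawned
        return self.servers[access]            # otherwise reuse the existing one

pool = ServerPool()
for access in ["proxy", "explicit", "default", "proxy", "default", "explicit"]:
    pool.connect(access)
```

After the six connections, `pool.spawned` is 3. The reported problem corresponds to a pool that takes the spawn branch on every call.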
    
    We also tried to reproduce the problem on a boundary condition.  You 
    mentioned that your servers were set up to last about five minutes and
    that you were requesting counters about every five minutes.  We
    wondered if the server process could somehow get locked up if a request
    came in just as it was being stopped.
    
    Several attempts to cause this to happen were unsuccessful.  Since you
    appear to reproduce this problem at will, we don't expect that the 
    problem lies on that boundary.
    
    We still suspect that there is something funny about the NETSERVER 
    processes that you are creating and will continue to pursue that angle.
    I hope that isolating and changing the appropriate network, system, or
    account parameters will clean up these servers and get your connections
    behaving more closely to what we expect.
    
    -Jim Carey
    
    
    
    
    
84.8. "Defective Bridge responsible for Network problems" by TOOK::CAREY Tue Apr 03 1990 14:20 (11 lines)
    
    Just a little added detail:
    
    While MCC was under suspicion of "bringing down the network" it appears
    that a defective bridge was the real culprit this time.
    
    We are still investigating the problems described in this note, but
    there are no grounds to fear that DECmcc will topple your network.
    
    -Jim Carey
    
84.9. "Any progress on increasing response time problem?" by DSTEG1::MCCANN Wed May 09 1990 13:40 (6 lines)
Has the problem of the ever-increasing response times mentioned in .0
been solved, or its cause identified?  If so, will it be fixed in EFT
update?

Jack

84.10. "leaks being plugged" by GOSTE::CALLANDER Wed May 09 1990 20:10 (24 lines)
    
    There were two things at work in the problems reported. The defective
    bridge was the cause of the crash and most of the "slow down" that
    was experienced. 
    
    The other problem was due to some memory leaks (causing fragmentation
    of memory when run for extended periods of time), and the dictionary
    lookup overhead.
    
    For EFT update we have made quite a few advances in our memory
    management by implementing a local cache for the allocation and
    deallocation of temporary memory; a better caching algorithm for
    the dictionary lookups was implemented in the EFT release, and fine
    tuned for EFT update; quite a few leaks were plugged; and some of
    the slower code paths have been reviewed and condensed to provide
    a faster end user response time.
    
    So far people with early integration releases of the base system
    changes have been very pleased with the enhancements. I hope you
    are too. But we are not stopping there; work on performance and
    memory management is continuing.
    
    jill