
Conference help::decnet-osi_for_vms

Title:                   DECnet/OSI for OpenVMS
Moderator:               TUXEDO::FONSECA
Created:                 Fri Feb 22 1991
Last Modified:           Fri Jun 06 1997
Last Successful Update:  Fri Jun 06 1997
Number of topics:        3990
Total number of notes:   19027

2620.0. "%LIB-F-INVARG, invalid argument on SET HOST" by BWTEAL::W_MCGAW () Fri Jun 02 1995 13:57

Responses to 2620.0 (T.R, Title, User, Personal Name, Date, Lines):

2620.1. by EEMELI::MOSER (Orienteers do it in the bush...) Mon Jun 05 1995 06:26 -- 11 lines
2620.2. "LIB error right after set host command." by BWTEAL::W_MCGAW () Mon Jun 05 1995 16:33 -- 9 lines
2620.3. by EEMELI::MOSER (Orienteers do it in the bush...) Tue Jun 06 1995 05:48 -- 33 lines
2620.4. "test.mar (quick and dirty example prog)" by EEMELI::MOSER (Orienteers do it in the bush...) Tue Jun 06 1995 05:49 -- 69 lines
2620.5. "I had this pb..." by MOSCOW::JOUVIN (Michel Jouvin - Digital Moscow) Tue Jun 06 1995 10:37 -- 9 lines
2620.6. "Got the file for testing." by BWTEAL::W_MCGAW () Tue Jun 06 1995 14:21 -- 11 lines
2620.7. by TFOSS1::HEISER (maranatha!) Sat Sep 21 1996 18:52 -- 5 lines
2620.8. "Transport connections at max ?" by COMICS::WEIR (John Weir, UK Country Support) Mon Sep 23 1996 08:42 -- 20 lines
2620.9. "More than one possible cause..." by CSC32::D_WILDER (There's coffee in that nebula!) Mon Sep 23 1996 18:00 -- 105 lines
2620.10. "what's typical?" by TFOS02::HEISER (Maranatha!) Tue May 06 1997 19:33 -- 5 lines
    What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
    256Mb RAM?  I've set a node like this to 75000 and it still ran out.
    
    thanks,
    Mike
2620.11. by CANTH::WATTUM (Scott Wattum - FTAM/VT/OSAK Engineering) Tue May 06 1997 19:40 -- 5 lines
    You should first take a look at note 3762 and verify that you aren't
    having that problem, which has a different fix.
    
    --Scott
    
2620.12. by TFOS01::HEISER (Maranatha!) Fri May 09 1997 20:42 -- 4 lines
    No, that's not it.  These are SEPS97 machines, which all have the correct
    values for CTLPAGES and CTLIMGLIM.
    
    Mike
2620.13. "Register all your nodes ?" by COMICS::WEIR (John Weir, UK Country Support) Tue May 13 1997 07:45 -- 35 lines
Mike,

>    What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
>    256Mb RAM?  I've set a node like this to 75000 and it still ran out.

Please do not assume that all problems which produce the INVARG error are the
same problem... Generally, the problem occurs when NET$ACP runs out of
VA (virtual address space), but there might be other reasons... There are
a number of reasons (and/or bugs) which may result in NET$ACP running out
of VA. Over time, these bugs are fixed, and it is known that there are good
fixes for some of them. For example, the CTLPAGES problem is fixed for
VAX VMS V6.1 and V6.2 by VAXSYS08_062.

Typical (i.e. non-buggy) usage of pagefile quota should be under 10k. If the node
is used for very large numbers of connections, the connections are made in
bursts, or the node is a very busy DNS Server, then values over 20k are
possible, but if you get values over 25k then look for bugs. In other
words, I believe that your system is suffering from one of the bugs...
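
[A hedged aside, not part of the original reply: a minimal DCL sketch for
watching how much of its pagefile quota NET$ACP is actually consuming, so it
can be compared against the figures above. It assumes WORLD privilege and
uses only the standard F$CONTEXT / F$PID / F$GETJPI lexical functions.]

    $! Report NET$ACP's pagefile quota and how much of it is in use.
    $ ctx = ""
    $ scan = F$CONTEXT("PROCESS", ctx, "PRCNAM", "NET$ACP", "EQL")
    $ pid = F$PID(ctx)
    $ IF pid .EQS. "" THEN GOTO no_acp
    $ quota  = F$GETJPI(pid, "PGFLQUOTA")   ! total pagefile quota
    $ remain = F$GETJPI(pid, "PAGFILCNT")   ! quota still unused
    $ used = quota - remain
    $ WRITE SYS$OUTPUT "NET$ACP pgflquota = ''quota', in use = ''used'"
    $ EXIT
    $ no_acp:
    $ WRITE SYS$OUTPUT "NET$ACP process not found"
    $ EXIT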

There is a problem in V6.3 ECO-6 which shows up particularly often if you
do not register all of your nodes in the namespace. The frequency of onset of
this problem is drastically reduced if you ensure:

	a) that all nodes in your network are registered correctly in the
	   naming service

	b) that you increase the "Sess Control Naming Cache Timeout" to
	   something which greatly exceeds the anticipated fix time for the
	   bug ;-)
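
[A hedged illustration of suggestion (b), not part of the original reply: the
attribute name is the one quoted above, but the exact delta-time syntax and
the 1000-day value are assumptions -- check them against your own system
before use.]

    $ MCR NCL SET SESSION CONTROL NAMING CACHE TIMEOUT 1000-00:00:00.000
    $ MCR NCL SHOW SESSION CONTROL NAMING CACHE TIMEOUT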

Regards,

	John

2620.14. "net$acp exhausted pgflquota (6.3 eco 6)" by PRSSOS::MAGENC () Wed May 21 1997 13:59 -- 31 lines
                                              
    
    			Hello !
    
    John , in your previous reply , you say :
    
    <<There is a problem in V6.3 ECO-6 which shows up particularly often 
      if you do not register all of your nodes in the namespace. >>
    
    Could you please provide more info about this problem ? 
    (IPMT case etc ?) 
    
    Here in Easynet France, this problem has been experienced twice in
    two weeks (cluster: OpenVMS VAX 6.1, DNVOSI 6.3 ECO 6, directory
    service: DECdns, local).
    Having all the nodes registered in the namespace (DEC:) is nearly
    impossible.
    We checked that it's not a "CTLPAGES" problem.
    When this problem occurred for the second time (20 May 97), we
    changed "session control Naming Cache Timeout" to 1000 days,
    then rebooted. It's a "production" cluster called EVTISA.
    
    This problem occurred once on EVTV10 and once on EVTIS6, with
    a pgflquota value of 75000 for net$acp!
    Under "normal" circumstances, the pgflquota used is between 10000
    and 15000; NSP and OSI TRANSPORT both have maximum connections = 500.
    
    	What else could be done?
    	Thanks in advance, and best regards, Michele.


2620.15. "could it be max transport connections??" by CSC32::J_RYER (MCI Mission Critical Support Team) Wed May 21 1997 15:37 -- 11 lines
    I just escalated a case for MCI (sorry, don't have a cfs number yet,
    as CHAMP/CSC seems to be slow passing things to IPMT) on a similar 
    problem on a system running OSI V6.3 ECO-6.  In their case, we think 
    the memory leak was triggered by bumping up against OSI Transport 
    Maximum Transport Connections (due to a bug in application code written
    by the user).  See note 2990.1 in this conference; John Weir
    escalated the problem as IPMT case CFS.27302, but it's not evident
    that a fix has been issued as of ECO-6.
    
    Jane Ryer
    MCI Mission Critical Support Team
2620.16. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 17:52 -- 6 lines
    The node that forced me to bring this issue up just exhausted a 100K
    pgflquota in 2 weeks.  It only took a few days to do 75K.  The strange
    thing is that the other node in the same production cluster is fine
    with 75K (has been for the 2 months since the upgrade to ECO6).
    
    Mike
2620.17. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 17:58 -- 12 lines
|Typical (ie non-buggy) usage of pagefile quota should be under 10k. If the node
|is used for very large numbers of connections, where the connections are made
|in bursts, or where the node is a very busy DNS Server then values over 20k
|are possible, but if you get values over 25k then look for bugs. In other
|words, I believe that your system is suffering from one of the bugs...
    
    John, I find this interesting.  Did you know that on CCS production
    clusters 50K is a "standard" value?  These are usually heavily
    loaded clusters (i.e., several hundred users).
    
    later,
    Mike
2620.18. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 18:13 -- 11 lines
    I just adjusted max cache timeout to 1000.  I got this vague error when
    trying to adjust max connections.  This is with 100k pgflquota exhausted.
    
    $ mcr ncl set osi transport maximum transport connect 250
    
    Node 0 OSI Transport
    at 1997-05-21-11:13:20.466-07:00I1.576
    
    command failed due to:
     process failure
    
2620.19. "V6.3 ECO-6 CDI bug with "lost" lookups" by COMICS::WEIR (John Weir, UK Country Support) Thu May 22 1997 09:51 -- 132 lines
Hi,

There are well-known and long-standing problems if you reach the "Maximum
Transport Connections" limits for either NSP or OSI Transport. These are NOT
the issues that I referred to earlier.

Briefly, the "Maximum Transport Connections" problem is well known, and
you avoid it by either a) fixing your application so it does not beat on the
limit, or b) increasing the limit. Before you increase the limit, you have to
increase "Maximum Remote NSAPs" to be at least one greater than the intended
new value for "Maximum Transport Connections". This problem is annoying, but
appears unlikely to be fixed. It's rather like beating your head on a brick
wall -- if it hurts, then don't do it!
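
[A hedged sketch of the order described above, not part of the original reply.
The numbers are only examples, and on some versions these characteristics can
only be raised while OSI Transport is disabled (which drops existing
connections), so treat this as an outline rather than a recipe.]

    $ MCR NCL SHOW OSI TRANSPORT ALL CHARACTERISTICS
    $ MCR NCL SET OSI TRANSPORT MAXIMUM REMOTE NSAPS 501
    $ MCR NCL SET OSI TRANSPORT MAXIMUM TRANSPORT CONNECTIONS 500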

The problem that I referred to exists in V6.3 ECO-6, and presumably in V7.1.
You will not see it in any later versions, because Engineering will fix it
before the next ECO and/or version (sic ;-)). You are unlikely to see it in
earlier versions or ECOs. I believe (although I am not sure whether Engineering
agree) that the underlying bug may have existed in DECnet/OSI since V6.0 SSB,
but that it has not shown up until implementation of the ECO-6 version of
the dynamic CDI cache. There was an earlier "dynamic CDI cache" kit, which
some people installed as an optional addition to their systems. I believe
that this earlier kit did not include the "CDI meltdown" fix, which was
bundled into the V6.3 ECO-6 CDI and which exposed the earlier bugs ...

Do you follow me so far ?

Just to summarise the terminology:

Dynamic CDI cache: The original CDI cache design was a fixed size file.
Unfortunately, the original size was too small for busy systems, so it was
increased. As every member of a cluster has the same sized cache file, this
meant that several hundred thousand blocks of system disk could be consumed
in a large cluster, even though most systems were satellites and only required
small cache files. The solution was dynamic CDI cache, which dynamically
increased the cache based on demand. This was implemented as an "early release"
kit, and in V6.3 ECO-6.

"CDI meltdown": A phrase coined by Bob Watson -- but he coins so many that he
can probably no longer remember ;-) A feature of the original CDI design
was that if several lookups for the same nodename (or backtranslation) occur
at about the same time, and if the name/backtranslation is not in the CDI
cache, then CDI will do several DNS lookups in parallel instead of optimising
and doing just one DNS lookup to satisfy all requests. The enhancement 
(included in the V6.3 ECO-6 CDI) was to detect this condition. If several
CDI lookups are done for the same name/backtranslation which is not in the
cache, then the first lookup triggers a real DNS lookup, while the others
are queued up to await completion of the first lookup. You can see this
on a CDI trace under V6.3 ECO-6, where you will see the first lookup
recorded as "parent" and queued lookups recorded as "child". BTW: Just
for completeness of the description, this change is a nice optimisation in
most cases, but it actually solved a very serious problem on DNS Servers.
Specifically, all nodes from time to time lose their own CDI cache entry.
(The default is 30 days, or hardcoded at 7 days on reboot...) Whenever a
DNS Server loses its own CDI cache entry, there is a severe risk that
disaster will strike! When the DNS Server loses its CDI entry then CDI will
use the DNS Clerk to do a lookup on its name--- This involves a DECnet link
back to itself (ie Clerk and Server are on the same node) and the incoming
connect must be backtranslated requiring a lookup on its name requiring
another logical link from Clerk to Server requiring another backtranslation
lookup of its name and so on in a loop until something runs out of resources
and fails. Maybe the DNS Server runs out of memory or some other resource.
Maybe NSP or OSI Transport runs out of "Maximum Transport Connections". Maybe
you like that last one in particular ?? It links together this problem with
the otherwise totally unrelated "Maximum Transport Connections" problem that
I dismissed at the start of this reply.

CDI "sticky" bit: Given the severity of problems which might occur when a
node loses its own CDI cache entry (particularly DNS Server nodes) Engineering
have enhanced the CDI design, yet again, so that the CDI cache entry for
a node's own name and that of its Cluster Alias are not timed out and
therefore are not periodicly removed from the CDI cache. This enhancement
has been implemented susequent to V6.3 ECO-6 and will appear in V6.3 ECO-7.

CDI 7-day hardcoded timeout: CDI up to and including V6.3 ECO-6 has a hardcoded
timeout of 7 days (which can only be overridden by the logical name
CDI_CACHE_TTL). When the SEARCHPATH .NCL is executed, this 7-day timeout is
overridden by the value specified in the .NCL. But, during boot there is
a 20-second period between the startup of NET$ACP and the execution of
the .NCL when the timeout is not easily controllable by the System Manager.
Subsequent to V6.3 ECO-6, Engineering have "fixed" this so that the timeout
is "infinite" during this timing window, and is then controlled by the .NCL.
Also, if you use the .NCL to set the timeout to 0, it sets it to "infinite".
(Previously, setting the timeout to 0 would set it to 7 days, again!!)

That's the preamble completed -- who's still with me?

The long-standing bug, which has only shown up with V6.3 ECO-6, is that from
time to time backtranslation operations may get "lost" in CDI. (At least,
the current theory is that the bug is in CDI, although it might be elsewhere.)
I have only seen these problems for incoming NSP connections. The problems
may well occur for incoming OSI Transport connections, but I have just not
seen them. Also, I thought I heard that a variation of the problem may occur
for outgoing connections, although I have no idea what the symptoms might be.

For an incoming NSP connection, if the backtranslation is not in the CDI cache
then CDI has to do a DNS lookup. Sometimes, the CDI/DNS lookup just gets
"lost", and in this case the incoming NSP connection just "hangs". At this
point in time there is no timer on the incoming NSP port, so the port just
remains on the system "for-ever" and consumes one of the "NSP Maximum
Transport Connections". (Of course, if the CDI/DNS lookup completes
successfully then everything is OK. Also, if the CDI/DNS lookup fails, then
the failure status is used when continuing to process the incoming connection,
and the incoming connection appears to come from node 12345:: instead of
DEC:.XYZ.FRED:: -- ie you get a backtranslation failure, but a successful
connection.) "Losing" a CDI/DNS lookup is a rare event -- on a very busy system
it might occur once a week, and at that rate (prior to V6.3 ECO-6) it would
take 4 years without reboot to consume all of your 200 (default) "NSP Maximum
Transport Connections".
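
[A hedged aside, not part of the original reply: one way to look for the stuck
incoming ports described above, using standard NCL show commands. The output
still needs human interpretation -- a port count creeping towards the maximum
is the symptom to watch for.]

    $ MCR NCL SHOW NSP ALL COUNTERS
    $ MCR NCL SHOW NSP PORT * ALL STATUS
    $ MCR NCL SHOW OSI TRANSPORT ALL COUNTERS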

The problem is that V6.3 ECO-6 includes the "CDI meltdown" fix. (Remember,
with this fix, CDI lookups for the same name are queued until the first
lookup completes ?) The problem with this fix is that if the CDI/DNS lookup
at the head of the queue (ie the "parent") gets "lost" then it does not
complete, and none of the queued "child" lookups will complete. Furthermore,
all subsequent lookups of the same name/backtranslation will find that
there is a lookup in progress (ie the outstanding "parent") and they will
also be queued with no chance of ever completing. Thus, every incoming
connection from that name/backtranslation will be queued in the same way,
and will consume NSP ports until you run out. Each outstanding connection
on the queue consumes a significant amount of NET$ACP VA, so you will
both run out of transport connections and run out of NET$ACP VA, and
it is just a race to see which happens first. The only solution is a reboot.

The problem of "lost" CDI/DNS connections is expected to be fixed in V6.3
ECO-7 and in V7.1 ECO-1.

Regards,

	John

2620.20. "THANKS" by PRSSOS::MAGENC () Fri May 23 1997 14:32 -- 12 lines
    
    
    			Wow !!!!
    
    	What a WONDERFUL answer!
    
    	Thanks a lot for such details: John, you're a REAL GURU!
        Your explanations are very clear and useful.
        That's great!
    
    	Best regards, Michele.
    	
2620.21. by TFOS02::HEISER (Maranatha!) Fri May 23 1997 16:08 -- 1 line
    John, do you have an estimated date yet for ECO7?
2620.22. "soft restart?" by PHXSS1::HEISER (Maranatha!) Tue May 27 1997 20:58 -- 7 lines
    Is there any way to shutdown the network and recreate NET$ACP without
    rebooting?  NET$SHUTDOWN doesn't recreate the process.  This is starting 
    to impact business production clusters (especially since we are 
    approaching fiscal year end).
    
    thanks,
    Mike
2620.23. "Wait days, or else use IPMT" by COMICS::WEIR (John Weir, UK Country Support) Wed May 28 1997 07:39 -- 20 lines
	No, I do not know of any way to stop and restart NET$ACP.

	I suspect that even if you did something devious to get rid of
	NET$ACP you would not be able to restart it, as there is almost
	certainly some initialisation of the NET$ACP/NET$DRIVER interface
	which would not survive any such tampering ;-)

	Engineering have produced a fix -- at this stage it survives
	lab tests (and previously I could reproduce the problem in under
	30 seconds) -- although none of my Customers have installed it yet.

	So, it looks as though the fix will be on general distribution within
	days, but you know the rules -- if you have a business-critical issue,
	you use the IPMT system, not notesfiles.

	Regards,

		John

2620.24. by PHXSS1::HEISER (Maranatha!) Wed May 28 1997 15:16 -- 1 line
    Well, I've downgraded to ECO5 in the meantime.
2620.25. "CSC patch kits" by PHXSS1::HEISER (Maranatha!) Fri May 30 1997 21:51 -- 5 lines
    Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
    pool expansion problem?
    
    thanks,
    Mike
2620.26. "ECO kits fix problems they were intended to fix" by COMICS::WEIR (John Weir, UK Country Support) Mon Jun 02 1997 12:30 -- 24 lines
Mike,

>    Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
>    pool expansion problem?
    
These kits have been proven to fix the problems that they fix -- period.

VAXSYS08_062 fixes a leak of process alloc region, which only shows up if
you set CTLPAGES higher than the SYSGEN default. If CTLPAGES is 128 or less,
then there is no way that you could suffer the problem. Thus, if CTLPAGES
is 128 or less, and you have a problem, VAXSYS08_062 will not fix that problem.
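
[A hedged aside, not part of the original reply: checking where CTLPAGES sits
relative to the 128 threshold mentioned above, using the standard VAX VMS
SYSGEN utility. USE ACTIVE shows the values the running system booted with.]

    $ MCR SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SHOW CTLPAGES
    SYSGEN> EXIT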

VAXSHAD09_061 fixes whatever NPAGEDYN leaks it is documented to fix... I
can't remember.

The DECnet/OSI V6.3 ECO-6 CDI problems are not resolved by either of these, but
will be resolved by ECO-7. Engineering have proved that they have a good fix.
I can confirm the fix is good.

Regards,

	John