
Conference help::decnet-osi_for_vms

Title:                   DECnet/OSI for OpenVMS
Moderator:               TUXEDO::FONSECA
Created:                 Fri Feb 22 1991
Last Modified:           Fri Jun 06 1997
Last Successful Update:  Fri Jun 06 1997
Number of topics:        3990
Total number of notes:   19027

2620.0. "%LIB-F-INVARG, invalid argument on SET HOST" by BWTEAL::W_MCGAW () Fri Jun 02 1995 13:57

Responses to 2620.0 (T.R, Title, User, Personal Name, Date, Lines):

2620.1. by EEMELI::MOSER (Orienteers do it in the bush...) Mon Jun 05 1995 06:26 -- 11 lines
2620.2. "LIB error right after set host command." by BWTEAL::W_MCGAW () Mon Jun 05 1995 16:33 -- 9 lines
2620.3. by EEMELI::MOSER (Orienteers do it in the bush...) Tue Jun 06 1995 05:48 -- 33 lines
2620.4. "test.mar (quick and dirty example prog)" by EEMELI::MOSER (Orienteers do it in the bush...) Tue Jun 06 1995 05:49 -- 69 lines
2620.5. "I had this pb..." by MOSCOW::JOUVIN (Michel Jouvin - Digital Moscow) Tue Jun 06 1995 10:37 -- 9 lines
2620.6. "Got the file for testing." by BWTEAL::W_MCGAW () Tue Jun 06 1995 14:21 -- 11 lines
2620.7. by TFOSS1::HEISER (maranatha!) Sat Sep 21 1996 18:52 -- 5 lines
2620.8. "Transport connections at max ?" by COMICS::WEIR (John Weir, UK Country Support) Mon Sep 23 1996 08:42 -- 20 lines
2620.9. "More than one possible cause..." by CSC32::D_WILDER (There's coffee in that nebula!) Mon Sep 23 1996 18:00 -- 105 lines
2620.10. "what's typical?" by TFOS02::HEISER (Maranatha!) Tue May 06 1997 19:33 -- 5 lines
    What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
    256Mb RAM?  I've set a node like this to 75000 and it still ran out.
    
    thanks,
    Mike
2620.11. by CANTH::WATTUM (Scott Wattum - FTAM/VT/OSAK Engineering) Tue May 06 1997 19:40 -- 5 lines
    You should first take a look at note 3762 and verify that you aren't
    having that problem, which has a different fix.
    
    --Scott
    
2620.12. by TFOS01::HEISER (Maranatha!) Fri May 09 1997 20:42 -- 4 lines
    No, that's not it.  These are SEPS97 machines, which all have the correct
    values for CTLPAGES and CTLIMGLIM.
    
    Mike
2620.13. "Register all your nodes ?" by COMICS::WEIR (John Weir, UK Country Support) Tue May 13 1997 07:45 -- 35 lines
Mike,

>    What's a typical value for VMS 6.2, DECnet/OSI 6.3 ECO6 on a 6540 with
>    256Mb RAM?  I've set a node like this to 75000 and it still ran out.

Please do not assume that all problems which produce the INVARG error are the
same problem... Generally, the problem occurs when NET$ACP runs out of
VA (virtual address space), but there might be other reasons... There are
a number of reasons (and/or bugs) which may result in NET$ACP running out
of VA. Over time, these bugs are fixed, and it is known that there are good
fixes for some of them. For example, the CTLPAGES problem is fixed for
VAX VMS V6.1 and V6.2 by VAXSYS08_062.

Typical (i.e. non-buggy) usage of pagefile quota should be under 10k. If the node
is used for very large numbers of connections, the connections are made in
bursts, or the node is a very busy DNS Server, then values over 20k are
possible, but if you get values over 25k then look for bugs. In other
words, I believe that your system is suffering from one of the bugs...
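
[A hedged aside, not part of the original reply: a minimal DCL sketch for
watching how much of its pagefile quota NET$ACP is actually consuming, so it
can be compared against the figures above. It assumes WORLD privilege and
uses only the standard F$CONTEXT / F$PID / F$GETJPI lexical functions.]

    $! Report NET$ACP's pagefile quota and how much of it is in use.
    $ ctx = ""
    $ scan = F$CONTEXT("PROCESS", ctx, "PRCNAM", "NET$ACP", "EQL")
    $ pid = F$PID(ctx)
    $ IF pid .EQS. "" THEN GOTO no_acp
    $ quota  = F$GETJPI(pid, "PGFLQUOTA")   ! total pagefile quota
    $ remain = F$GETJPI(pid, "PAGFILCNT")   ! quota still unused
    $ used = quota - remain
    $ WRITE SYS$OUTPUT "NET$ACP pgflquota = ''quota', in use = ''used'"
    $ EXIT
    $ no_acp:
    $ WRITE SYS$OUTPUT "NET$ACP process not found"
    $ EXIT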

There is a problem in V6.3 ECO-6 which shows up particularly often if you
do not register all of your nodes in the namespace. The frequency of onset of
this problem is drastically reduced if you ensure:

	a) that all nodes in your network are registered correctly in the
	   naming service

	b) that you increase the "Sess Control Naming Cache Timeout" to
	   something which greatly exceeds the anticipated fix time for the
	   bug ;-)
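
[A hedged illustration of suggestion (b), not part of the original reply: the
attribute name is the one quoted above, but the exact delta-time syntax and
the 1000-day value are assumptions -- check them against your own system
before use.]

    $ MCR NCL SET SESSION CONTROL NAMING CACHE TIMEOUT 1000-00:00:00.000
    $ MCR NCL SHOW SESSION CONTROL NAMING CACHE TIMEOUT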

Regards,

	John

2620.14. "net$acp exhausted pgflquota (6.3 eco 6)" by PRSSOS::MAGENC () Wed May 21 1997 13:59 -- 31 lines
                                              
    
    			Hello !
    
    John , in your previous reply , you say :
    
    <<There is a problem in V6.3 ECO-6 which shows up particularly often 
      if you do not register all of your nodes in the namespace. >>
    
    Could you please provide more info about this problem ? 
    (IPMT case etc ?) 
    
    Here in Easynet France, this problem has been experienced twice in
    two weeks (cluster: OpenVMS VAX 6.1, DNVOSI 6.3 ECO 6, directory
    service: DECdns, local).
    Having all the nodes registered in the namespace (DEC:) is nearly
    impossible.
    We checked that it's not a "CTLPAGES" problem.
    When this problem occurred for the second time (20 May 97), we
    changed "session control Naming Cache Timeout" to 1000 days,
    then rebooted. It's a "production" cluster called EVTISA.
    
    This problem occurred once on EVTV10 and once on EVTIS6, with
    a pgflquota value of 75000 for net$acp!
    Under "normal" circumstances, the pgflquota used is between 10000
    and 15000; NSP and OSI TRANSPORT both have maximum connections = 500.
    
    	What else could be done?
    	Thanks in advance, and best regards, Michele.


2620.15. "could it be max transport connections??" by CSC32::J_RYER (MCI Mission Critical Support Team) Wed May 21 1997 15:37 -- 11 lines
    I just escalated a case for MCI (sorry, don't have a cfs number yet,
    as CHAMP/CSC seems to be slow passing things to IPMT) on a similar 
    problem on a system running OSI V6.3 ECO-6.  In their case, we think 
    the memory leak was triggered by bumping up against OSI Transport 
    Maximum Transport Connections (due to a bug in application code written
    by the user).  See note 2990.1 in this conference; John Weir
    escalated the problem as IPMT case CFS.27302, but it's not evident
    that a fix has been issued as of ECO-6.
    
    Jane Ryer
    MCI Mission Critical Support Team
2620.16. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 17:52 -- 6 lines
    The node that forced me to bring this issue up just exhausted a 100K
    pgflquota in 2 weeks.  It only took a few days to do 75K.  The strange
    thing is that the other node in the same production cluster is fine
    with 75K (has been for the 2 months since the upgrade to ECO6).
    
    Mike
2620.17. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 17:58 -- 12 lines
|Typical (ie non-buggy) usage of pagefile quota should be under 10k. If the node
|is used for very large numbers of connections, where the connections are made
|in bursts, or where the node is a very busy DNS Server then values over 20k
|are possible, but if you get values over 25k then look for bugs. In other
|words, I believe that your system is suffering from one of the bugs...
    
    John, I find this interesting.  Did you know that on CCS production
    clusters 50K is a "standard" value?  These are usually heavily
    loaded clusters (i.e., several hundred users).
    
    later,
    Mike
2620.18. by TFOS02::HEISER (Maranatha!) Wed May 21 1997 18:13 -- 11 lines
    I just adjusted max cache timeout to 1000.  I got this vague error when
    trying to adjust max connections.  This is with 100k pgflquota exhausted.
    
    $ mcr ncl set osi transport maximum transport connect 250
    
    Node 0 OSI Transport
    at 1997-05-21-11:13:20.466-07:00I1.576
    
    command failed due to:
     process failure
    
2620.19. "V6.3 ECO-6 CDI bug with "lost" lookups" by COMICS::WEIR (John Weir, UK Country Support) Thu May 22 1997 09:51 -- 132 lines
Hi,

There are well-known and long-standing problems if you reach the "Maximum
Transport Connections" limits for either NSP or OSI Transport. These are NOT
the issues that I referred to earlier.

Briefly, the "Maximum Transport Connections" problem is well known, and
you avoid it by either a) fixing your application so it does not beat on the
limit, or b) increasing the limit. Before you increase the limit, you have to
increase "Maximum Remote NSAPs" to be at least one greater than the intended
new value for "Maximum Transport Connections". This problem is annoying, but
appears unlikely to be fixed. It's rather like beating your head on a brick
wall -- if it hurts, then don't do it!
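
[A hedged sketch of the order described above, not part of the original reply.
The numbers are only examples, and on some versions these characteristics can
only be raised while OSI Transport is disabled (which drops existing
connections), so treat this as an outline rather than a recipe.]

    $ MCR NCL SHOW OSI TRANSPORT ALL CHARACTERISTICS
    $ MCR NCL SET OSI TRANSPORT MAXIMUM REMOTE NSAPS 501
    $ MCR NCL SET OSI TRANSPORT MAXIMUM TRANSPORT CONNECTIONS 500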

The problem that I referred to exists in V6.3 ECO-6, and presumably in V7.1.
You will not see it in any later versions, because Engineering will fix it
before the next ECO and/or version (sic ;-)). You are unlikely to see it in
earlier versions or ECOs. I believe (although I am not sure whether Engineering
agree) that the underlying bug may have existed in DECnet/OSI since V6.0 SSB,
but that it has not shown up until implementation of the ECO-6 version of
the dynamic CDI cache. There was an earlier "dynamic CDI cache" kit, which
some people installed as an optional addition to their systems. I believe
that this earlier kit did not include the "CDI meltdown" fix, which was
bundled into the V6.3 ECO-6 CDI and which exposed the earlier bugs ...

Do you follow me so far ?

Just to summarise the terminology:

Dynamic CDI cache: The original CDI cache design was a fixed size file.
Unfortunately, the original size was too small for busy systems, so it was
increased. As every member of a cluster has the same sized cache file, this
meant that several hundred thousand blocks of system disk could be consumed
in a large cluster, even though most systems were satellites and only required
small cache files. The solution was dynamic CDI cache, which dynamically
increased the cache based on demand. This was implemented as an "early release"
kit, and in V6.3 ECO-6.

"CDI meltdown": A phrase coined by Bob Watson -- but he coins so many that he
can probably no longer remember ;-) A feature of the original CDI design
was that if several lookups for the same nodename (or backtranslation) occur
at about the same time, and if the name/backtranslation is not in the CDI
cache, then CDI will do several DNS lookups in parallel instead of optimising
and doing just one DNS lookup to satisfy all requests. The enhancement 
(included in the V6.3 ECO-6 CDI) was to detect this condition. If several
CDI lookups are done for the same name/backtranslation which is not in the
cache, then the first lookup triggers a real DNS lookup, while the others
are queued up to await completion of the first lookup. You can see this
on a CDI trace under V6.3 ECO-6, where you will see the first lookup
recorded as "parent" and queued lookups recorded as "child". BTW: Just
for completeness of the description, this change is a nice optimisation in
most cases, but it actually solved a very serious problem on DNS Servers.
Specifically, all nodes from time to time lose their own CDI cache entry.
(The default is 30 days, or hardcoded at 7 days on reboot...) Whenever a
DNS Server loses its own CDI cache entry, there is a severe risk that
disaster will strike! When the DNS Server loses its CDI entry then CDI will
use the DNS Clerk to do a lookup on its name--- This involves a DECnet link
back to itself (ie Clerk and Server are on the same node) and the incoming
connect must be backtranslated requiring a lookup on its name requiring
another logical link from Clerk to Server requiring another backtranslation
lookup of its name and so on in a loop until something runs out of resources
and fails. Maybe the DNS Server runs out of memory or some other resource.
Maybe NSP or OSI Transport runs out of "Maximum Transport Connections". Maybe
you like that last one in particular ?? It links together this problem with
the otherwise totally unrelated "Maximum Transport Connections" problem that
I dismissed at the start of this reply.

CDI "sticky" bit: Given the severity of problems which might occur when a
node loses its own CDI cache entry (particularly DNS Server nodes) Engineering
have enhanced the CDI design, yet again, so that the CDI cache entry for
a node's own name and that of its Cluster Alias are not timed out and
therefore are not periodicly removed from the CDI cache. This enhancement
has been implemented susequent to V6.3 ECO-6 and will appear in V6.3 ECO-7.

CDI 7-day hardcoded timeout: CDI up to and including V6.3 ECO-6 has a hardcoded
timeout of 7 days (which can only be overridden by the logical name
CDI_CACHE_TTL). When the SEARCHPATH .NCL is executed, this 7-day timeout is
overridden by the value specified in the .NCL. But, during boot there is
a 20-second period between the startup of NET$ACP and the execution of
the .NCL when the timeout is not easily controllable by the System Manager.
Subsequent to V6.3 ECO-6, Engineering have "fixed" this so that the timeout
is "infinite" during this timing window, and is then controlled by the .NCL.
Also, if you use the .NCL to set the timeout to 0, it sets it to "infinite".
(Previously, setting the timeout to 0 would set it to 7 days, again!!)

That's the preamble completed -- who's still with me?

The long-standing bug, which has only shown up with V6.3 ECO-6, is that from
time to time backtranslation operations may get "lost" in CDI. (At least,
the current theory is that the bug is in CDI, although it might be elsewhere.)
I have only seen these problems for incoming NSP connections. The problems
may well occur for incoming OSI Transport connections, but I have just not
seen them. Also, I thought I heard that a variation of the problem may occur
for outgoing connections, although I have no idea what the symptoms might be.

For an incoming NSP connection, if the backtranslation is not in the CDI cache
then CDI has to do a DNS lookup. Sometimes, the CDI/DNS lookup just gets
"lost", and in this case the incoming NSP connection just "hangs". At this
point in time there is no timer on the incoming NSP port, so the port just
remains on the system "for-ever" and consumes one of the "NSP Maximum
Transport Connections". (Of course, if the CDI/DNS lookup completes
successfully then everything is OK. Also, if the CDI/DNS lookup fails, then
the failure status is used when continuing to process the incoming connection,
and the incoming connection appears to come from node 12345:: instead of
DEC:.XYZ.FRED:: -- ie you get a backtranslation failure, but a successful
connection.) "Losing" a CDI/DNS lookup is a rare event -- on a very busy system
it might occur once a week, and at that rate (prior to V6.3 ECO-6) it would
take 4 years without reboot to consume all of your 200 (default) "NSP Maximum
Transport Connections".
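
[A hedged aside, not part of the original reply: one way to look for the stuck
incoming ports described above, using standard NCL show commands. The output
still needs human interpretation -- a port count creeping towards the maximum
is the symptom to watch for.]

    $ MCR NCL SHOW NSP ALL COUNTERS
    $ MCR NCL SHOW NSP PORT * ALL STATUS
    $ MCR NCL SHOW OSI TRANSPORT ALL COUNTERS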

The problem is that V6.3 ECO-6 includes the "CDI meltdown" fix. (Remember,
with this fix, CDI lookups for the same name are queued until the first
lookup completes ?) The problem with this fix is that if the CDI/DNS lookup
at the head of the queue (ie the "parent") gets "lost" then it does not
complete, and none of the queued "child" lookups will complete. Furthermore,
all subsequent lookups of the same name/backtranslation will find that
there is a lookup in progress (ie the outstanding "parent") and they will
also be queued with no chance of ever completing. Thus, every incoming
connection from that name/backtranslation will be queued in the same way,
and will consume NSP ports until you run out. Each outstanding connection
on the queue consumes a significant amount of NET$ACP VA, so you will
both run out of transport connections and run out of NET$ACP VA, and
it is just a race to see which happens first. The only solution is a reboot.

The problem of "lost" CDI/DNS connections is expected to be fixed in V6.3
ECO-7 and in V7.1 ECO-1.

Regards,

	John

2620.20. "THANKS" by PRSSOS::MAGENC () Fri May 23 1997 14:32 -- 12 lines
    
    
    			Wow !!!!
    
    	What a WONDERFUL answer!
    
    	Thanks a lot for such details: John, you're a REAL GURU!
        Your explanations are very clear and useful.
        That's great!
    
    	Best regards, Michele.
    	
2620.21. by TFOS02::HEISER (Maranatha!) Fri May 23 1997 16:08 -- 1 line
    John, do you have an estimated date yet for ECO7?
2620.22. "soft restart?" by PHXSS1::HEISER (Maranatha!) Tue May 27 1997 20:58 -- 7 lines
    Is there any way to shutdown the network and recreate NET$ACP without
    rebooting?  NET$SHUTDOWN doesn't recreate the process.  This is starting 
    to impact business production clusters (especially since we are 
    approaching fiscal year end).
    
    thanks,
    Mike
2620.23. "Wait days, or else use IPMT" by COMICS::WEIR (John Weir, UK Country Support) Wed May 28 1997 07:39 -- 20 lines
	No, I do not know of any way to stop and restart NET$ACP.

	I suspect that even if you did something devious to get rid of
	NET$ACP you would not be able to restart it, as there is almost
	certainly some initialisation of the NET$ACP/NET$DRIVER interface
	which would not survive any such tampering ;-)

	Engineering have produced a fix -- at this stage it survives
	lab tests (and previously I could reproduce the problem in under
	30 seconds) -- although none of my Customers have installed it yet.

	So, it looks as though the fix will be on general distribution within
	days, but you know the rules -- if you have a business-critical issue,
	you use the IPMT system, not notesfiles.

	Regards,

		John

2620.24. by PHXSS1::HEISER (Maranatha!) Wed May 28 1997 15:16 -- 1 line
    Well, I've downgraded to ECO5 in the meantime.
2620.25. "CSC patch kits" by PHXSS1::HEISER (Maranatha!) Fri May 30 1997 21:51 -- 5 lines
    Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
    pool expansion problem?
    
    thanks,
    Mike
2620.26. "ECO kits fix problems they were intended to fix" by COMICS::WEIR (John Weir, UK Country Support) Mon Jun 02 1997 12:30 -- 24 lines
Mike,

>    Have patch kits VAXSHAD09_061 and VAXSYS08_062 been proven to fix the
>    pool expansion problem?
    
These kits have been proven to fix the problems that they fix -- period.

VAXSYS08_062 fixes a leak of process alloc region, which only shows up if
you set CTLPAGES higher than the SYSGEN default. If CTLPAGES is 128 or less,
then there is no way that you could suffer the problem. Thus, if CTLPAGES
is 128 or less, and you have a problem, VAXSYS08_062 will not fix that problem.
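
[A hedged aside, not part of the original reply: checking where CTLPAGES sits
relative to the 128 threshold mentioned above, using the standard VAX VMS
SYSGEN utility. USE ACTIVE shows the values the running system booted with.]

    $ MCR SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SHOW CTLPAGES
    SYSGEN> EXIT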

VAXSHAD09_061 fixes whatever NPAGEDYN leaks it is documented to fix... I
can't remember.

The DECnet/OSI V6.3 ECO-6 CDI problems are not resolved by either of these, but
will be resolved by ECO-7. Engineering have proved that they have a good fix.
I can confirm the fix is good.

Regards,

	John