
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5314.0. "Once again on "Lost connection to node..."" by MLNCSC::CAREMISE (and then they were ...four !) Tue May 20 1997 15:21

Hi everybody.

I want to raise again the problem of lost connections. Even though it has
already been discussed in various entries in this conference, and although
I have read all of them and implemented a lot of tuning/setup changes, I'm
still having the problem.

Briefly, here is what is happening at my site:

We have two clusters that were running on the same Ethernet backbone, and
everything worked fine until we moved some satellites of both clusters to
another building, about 100 meters away from the computer room where the
boot nodes still reside.

The network connection:
All systems are connected to DECrepeater 900TP modules in DEChub 900MS hubs,
which are linked through an FDDI ring via two DECswitch 900EF bridges (one
for each hub).

The main cluster is composed of 3 VAX 6000-510s (boot nodes) and 8
satellites (mainly MicroVAX 3100-76s); every satellite has its own root on a
shadowed system disk on CI, and its own page/swap files on a local disk.

The other cluster is pure NI, with a VAX 4200 as boot node and 12
satellites (5 are V4VLCs, and 7 are MicroVAX 3100-40s and -76s).

Both clusters are running VAX/VMS 5.5-2.

All satellites of both clusters that were moved to the other site show the
problem, which occurs almost every 3-4 minutes, between them and another
node on the previous backbone
(no VAXcluster transitions are evidenced):

EXAMPLE:

%%%%%%%%%%%  OPCOM  20-MAY-1997 15:31:20.76  %%%%%%%%%%%    (from node 
FAMV37 at 20-MAY-1997 15:31:52.62)
15:31:52.59 Node FAMV37 (csid 00010042) lost connection to node FAMV09

%%%%%%%%%%%  OPCOM  20-MAY-1997 15:31:20.79  %%%%%%%%%%%    (from node 
FAMV37 at 20-MAY-1997 15:31:56.13)
15:31:56.09 Node FAMV37 (csid 00010042) re-established connection to node 
FAMV09


This causes: 1. cluster slowdown due to random shadow system disk rebuilds
	      2. abnormal OPERATOR.LOG growth
	      3. abnormal ERRORLOG.SYS growth
	      4. potential cluster hang due to system disk filling

ACTIONS TAKEN:
	1. Tuned all systems' pool and cluster parameters
	2. Applied patch VAXLAVC04_U2055
	3. Raised RECNXINTERVAL to 180 (nonsense in my opinion)
	4. Tried connecting some satellites to a PEswitch on a
	   DEChub ONE MX, directly on FDDI (nothing changed)
	5. Raised the PIPELINE quota on all DECnet executors to 14400
	   (it was 10000 before)
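[Editorial note: the timing parameters mentioned above can be inspected on a
running system with SYSGEN; a minimal sketch, assuming the V5.5-2 parameter
names and units (RECNXINTERVAL is in seconds, TIMVCFAIL in hundredths of a
second):]

```dcl
$ ! Inspect the cluster reconnection/timeout parameters on this node.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW RECNXINTERVAL   ! seconds allowed for VC reconnection
SYSGEN> SHOW TIMVCFAIL       ! VC failure detect time, 1/100 s units
SYSGEN> EXIT
```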
 		
EVIDENCE : All stations and boot nodes log hundreds of errors on PEA0
	   with the same (for me) undecipherable description:


 ******************************* ENTRY     336. ***************************
 ERROR SEQUENCE 6256.                            LOGGED ON:    SID 0B000006
 DATE/TIME 20-MAY-1997 10:54:55.82                        SYS_TYPE 02400101
 SYSTEM UPTIME: 21 DAYS 02:22:50
 SCS NODE: FAMV09                                            VAX/VMS V5.5-2

 ERL$LOGMESSAGE  KA64A  CPU FW REV# 6.  CONSOLE FW REV# 4.0
                 XMI NODE # 1.

 NI-SCS SUB-SYSTEM, _FAMV09$PEA0:

       PORT HAS CLOSED VIRTUAL CIRCUIT

       LOCAL STATION ADDRESS, FFFFFFFFFF00(X)
       LOCAL SYSTEM ID, 000000002014(X)

       REMOTE STATION ADDRESS, 0000000000D5(X)
       REMOTE SYSTEM ID, 000000002157(X)

       UCB$B_ERTCNT          32
                                       50. RETRIES REMAINING
       UCB$B_ERTMAX          32
                                      50. RETRIES ALLOWABLE
       UCB$W_ERRCNT        00E0
                                       224. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00


FINAL CONSIDERATIONS and QUESTIONS :
We strongly suspect something related to network-introduced delay.
(The same satellite that fails works fine when moved back to the computer room.)

Are there any SCS/SYSGEN parameters that can be impacted where LAVC traffic
must pass through more than two switches/bridges?

Does anyone have any idea why UNKNOWN OPCODE is the reason for closing
the virtual circuit?

Are there any NCP executor/circuit/line parameters impacted?

    Thanks in advance to anyone who can help, and apologies for the
    long entry ....
    
    Sergio.
    
5314.1UTRTSC::utoras-198-48-113.uto.dec.com::JurVanDerBurgChange mode to Panic!Tue May 20 1997 15:429
First of all, the DECnet PIPELINE quota has nothing to do with cluster
communications, and raising it may make things worse if there's also
heavy DECnet traffic.

I would suggest investigating the network load over the DEChub.
This is purely a network connection problem.

Jur.

5314.2Suggestions...XDELTA::HOFFMANSteve, OpenVMS EngineeringTue May 20 1997 17:1219
   I'd simplify the various LAN segments involved, and I'd start looking
   for cabling faults.  You'll need to check the latency of those LAN
   widgets, as well.  Do you have access to a LAN monitor?

   You'll also want to reAUTOGEN all nodes with FEEDBACK, and reboot.
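   [Editorial note: a sketch of the usual AUTOGEN invocation for this, with
   phase names as documented for AUTOGEN; SAVPARAMS through REBOOT runs all
   phases, and FEEDBACK tells AUTOGEN to use the recorded load data:]

```dcl
$ ! Re-tune this node from collected feedback data, then reboot.
$ @SYS$UPDATE:AUTOGEN SAVPARAMS REBOOT FEEDBACK
```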

   Are there patterns to the messages?  (eg: is node FAMV37 regularly
   involved?)  If so, concentrate on the patterns.

   What does DECamds have to say about the configuration?

   Seriously consider an upgrade from V5.5-2, as V7.1 is current.  (And
   we have rewritten shadowing, mount, and a number of other areas...)

   (NCP and PIPELINE settings are entirely unrelated to the VMScluster
   communications -- DECnet is involved only during the satellite node
   download operation, and is not involved thereafter.)

5314.3More info ...MLNCSC::CAREMISEand then they were ...four !Wed May 21 1997 08:0138
    
    
    Thank you guys for your feedbacks !
    
    Just a couple of things to clarify the situation :
    
    Could you be more specific about how to check latency?
    
    NETWORK LOAD: We have put a sniffer on the computer room's backbone;
    the load on the Ethernet is under 25%, and the error rate (CRC,
    short, runt, etc.) is very low.
    
    What is DECamds? Is it a tool available on the net?
    
    An upgrade to VMS 7.1 is impossible right now: the customer's
    applications are related to telephony devices that are strictly linked
    to VMS 5.5-2, and we see no way out on this matter until the
    third-party application is rewritten for a higher VMS version.....
    
    If you suspect shadowing, MOUNT, or other areas are involved, please
    provide patch info.
    
    PATTERNS : the only thing I noted is that 2 stations have an error rate
            on PEA0 double that of the remaining stations; I will check
            whether it is possible to remove them from the cluster.
    
    Anyway, I'll suggest that the system manager re-run AUTOGEN on all
    nodes with feedback enabled, even if I'm a bit skeptical about this.
    I still tend to believe something involving the network is at work.
    To that end I'll try to get more info on LAVC$FAILURE_ANALYSIS.
    What do you think about it?
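    [Editorial note: a rough sketch of the LAVC$FAILURE_ANALYSIS setup, as
    outlined in the VMScluster manual; the file names below are the ones
    shipped in SYS$EXAMPLES:, but the exact build steps should be checked
    against the manual for your VMS version:]

```dcl
$ ! Describe the physical LAN (nodes, adapters, bridges) by editing
$ ! a copy of the supplied MACRO-32 template, then build and run it
$ ! on each node to enable PEdriver's failure analysis reporting.
$ COPY SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR SYS$MANAGER:
$ ! ... edit SYS$MANAGER:LAVC$FAILURE_ANALYSIS.MAR here ...
$ MACRO SYS$MANAGER:LAVC$FAILURE_ANALYSIS.MAR
$ LINK LAVC$FAILURE_ANALYSIS
$ RUN LAVC$FAILURE_ANALYSIS    ! loads the network description
```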
    
    Do you have any advice on the meaning of the errorlog entry?
    
    Thanks again and Ciao !
    
    Sergio (MCS Milano, Italia)
    
5314.4More InfoXDELTA::HOFFMANSteve, OpenVMS EngineeringWed May 21 1997 13:4860
:    Could you be more specific on how to check latency . 

   Confirm the path.  Confirm the number of devices.  Confirm that
   the devices are suited for this application.  More than a few
   customers will tell you that their network configuration is `X',
   and when you actually look, you find `Y'.  Also confirm that the
   configuration of the network is valid -- more than a few sites
   have seen a misused "T" connector or an unterminated LAN segment,
   as specific examples.

   Also see what DTS/DTR show for throughput on the link -- these
   are DECnet tools, but these can load up a network nicely.

   I'd expect that specific round-trip measurements would require
   a LAN monitor.
    
:    NETWK LOAD : We have put a sniffer on the Computer room's backbone
:    and the load on Ethernet is under 25% , and the error rate (CRC,
:    SHORT,RUNT...ecc) is very low.

   What counters are increasing in DECnet, if any?  (Zero the counters,
   run the DTS/DTR tools to generate a load, and see what happens to the
   DECnet line and circuit counters.)
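   [Editorial note: in NCP terms that is roughly the following; the line
   name SVA-0 is only an example -- use whatever NCP SHOW KNOWN LINES
   reports on your systems:]

```dcl
$ MCR NCP
NCP> ZERO EXECUTOR COUNTERS
NCP> ZERO LINE SVA-0 COUNTERS
NCP> EXIT
$ ! ... generate load with DTS/DTR, then re-read the counters ...
$ MCR NCP SHOW EXECUTOR COUNTERS
$ MCR NCP SHOW LINE SVA-0 COUNTERS
```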
    
:    What is DECamds : is it an 'available on the net' tool ?

   It's part of OpenVMS.  A very valuable part for managing a network
   of nodes, or a VMScluster, too.

:    Upgrade to Vms 7.1 is now impossible : customer's application are
:    related to Telephonic devices that are strictly linked to VMS 5.5-2
:    and we don't see way out on this matter, until the 3rd party appli-
:    cation will be re-written for an higher VMS version.....

   I will assume the customer has a "prior version support" contract.
    
:    If you suspect shadow or mount or other areas involved, pls. provide
:    patches info.

   Check http://www.service.digital.com -- I'd look, but the link is
   down right now.
    
:    PATTERNS : the only thing I noted is that 2 station has an error rate
:            on PEA0 double than on other remaining station : I would check
:            if it is possible to remove it from cluster.

   Make sure there are not overlapping cluster groups or a bad cluster
   password involved here -- and what are the other errors that are in
   the error log?  (The entry listed in .0 is rather nondescript...)

:    Anyway I'll suggest to system manager to re-run Autogen on all nodes
:    with feedback enabled, even if I'm a bit skeptic on this. I tend to
:    beleive more in a something involved with network too.
:    At this purpose I'll try to get more info on LAVC$FAILURE_ANALYSIS
:    What do you think about it ?

   When one node has out-of-whack SYSGEN parameters, the whole VMScluster
   can encounter problems when that node gets "backed up".  (This is why
   I asked you to check for any common patterns in the error messages.)
    
5314.5TIMVCFAIL is a step forwardMLNCSC::CAREMISEand then they were ...four !Mon May 26 1997 15:2418
    The problem on the cluster with the VAX 6000s and satellites has been
    solved by downsizing the TIMVCFAIL parameter from 1600 to 800.
    Unfortunately, the same change had no effect on the other cluster
    (pure NI).
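    [Editorial note: a sketch of how such a change is made on a running
    system; TIMVCFAIL is a dynamic SYSGEN parameter in units of hundredths
    of a second, so 800 = 8 seconds:]

```dcl
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET TIMVCFAIL 800    ! 8 seconds; was 1600 (16 seconds)
SYSGEN> WRITE ACTIVE         ! apply to the running system
SYSGEN> WRITE CURRENT        ! preserve across reboots
SYSGEN> EXIT
```

    (Adding the value to MODPARAMS.DAT as well keeps AUTOGEN from
    undoing it on the next run.)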
    
    
    We still get a lot of "system buffer unavailable" errors on the DECnet
    lines, and also some receive errors (frame too long). We tried
    enlarging the lines' receive buffers (from 10 to 20) with no
    appreciable results, except for 3 stations that now run OK.
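    [Editorial note: the receive-buffer change itself is the standard NCP
    line parameter; the line name SVA-0 below is just an example:]

```dcl
$ MCR NCP
NCP> SET LINE SVA-0 RECEIVE BUFFERS 20     ! volatile database
NCP> DEFINE LINE SVA-0 RECEIVE BUFFERS 20  ! permanent database
NCP> EXIT
```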
    
    Could an oversized NPAGEDYN be a problem?  (We noted on many stations
    a value over 8 million, with an effective usage around 1 million.)
    
    The other pool parameters look OK.
    
    Any comments ?
    
    
5314.6UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurgChange mode to Panic!Tue May 27 1997 05:0817
Lowering TIMVCFAIL is not a solution but a workaround for your network problems.

A lot of "system buffer unavailable" errors means that the network is so busy
that the system has a hard time keeping up, and drops packets.

>    Can be a problem of NPAGEDYN oversized ?  ( we noted on many station
>    a value over 8 million, with an effective usage around 1 million.)

That means there has been a peak usage of NPP, which may be attributed
to network broadcast storms.

I would seriously take a good look at the network load and check whether you
can do something about it, like adding bridges, etc. Or check for other
bad things. A network trace can do wonders.

Jur.

5314.7LAVc Troubleshooting, Key patches??STAR::BOAENLANclusters/VMScluster Tech. OfficeTue May 27 1997 19:0128
VERIFY YOU HAVE THE PEDRIVER PATCHES:
Before doing anything else, make certain that the nodes have the following TIMA kit:
	PEDRIVER V5.5-2 VOID/TIMA kit: VAXLAVC03_U2055;
	or CSC patch kit CSCPAT_1081

It's been out for several years, but if you don't have it installed, it's the
first thing to do. We significantly improved PEdriver's ability to deal with
network congestion and delay variations somewhere around V6.0. This kit
back-ports those changes to V5.5-2.

READ THE MANUAL:
	The "Troubleshooting the NISCA Protocol" appendix to the 
V6.1 & higher versions of the VMScluster Systems manual shows how 
to use SDA to get & interpret counters & delay information from 
PEdriver's port, VC, & Channel data structures. This should help
identify why PEdriver is closing VCs. I suspect that a channel is
getting listen timeouts because packets are being lost due to network
congestion or (less likely) faulty network HW.

00x = UNRECOGNIZED OPCODE:
The errorlog analyzer doesn't understand that some errorlog entries don't
have a message buffer attached; it always assumes that the message buffer
fields are there. In this case there isn't any message, and these fields are
all 0s. The opcode value of 00x is undefined for PEdriver, so this part of
the errorlog report is misleading...

'Gards, Verell


5314.8updateMLNCSC::CAREMISEand then they were ...four !Wed May 28 1997 08:5028
Probably I wasn't clear in my .0.

It cannot be a problem of a loose connection, because the cabling is TP,
which means a direct connection to the repeater. No ThinWire involved.

All ports are switched (which means 'bridged', in my opinion...), and every
repeater is in turn switched onto the backbone through a 3Com port switch.

The boot node is directly connected to the other DECswitch (which IS a
bridge), so everything is ALREADY bridged.

The traffic percentages on the two networks are below 25%, so there is no
Ethernet congestion.

The VAXLAVC patch has already been installed (read my .0).

The only thing I can agree with is the chance that we have a lot of
broadcast storms, and I will now investigate this.
It seems that these storms come from Sun stations.

Can someone tell me something about how to deal with storms, especially
on non-DEC systems?

And how can storms be the cause of these disconnections?

Thanks again. Sergio.

    
5314.9UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurgChange mode to Panic!Wed May 28 1997 10:2914
>Can someone tell me something on how to work on storms, expecially
>on non DEC systems ?

Start measuring with a sniffer, and if non-DEC systems are causing
storms, contact the system managers for those systems and let them
find out what's wrong.

>And how storms can be the cause of this disconnections ?

Heavy broadcast storms can cause severe packet loss, and if that happens
frequently enough, SCS will time out.

Jur.

5314.10Look for Listen TimeoutsSTAR::BOAENLANclusters/VMScluster Tech. OfficeThu May 29 1997 15:1637
To determine whether connections are being lost because NISCA multicast
packets are being lost, use SDA to examine the PEdriver channels between the
two nodes. Do the following on each node from a privileged account:

$ ANALYZE/SYSTEM
$ SHOW PORT
$ SHOW PORT/CH/VC=VC_nodename

This will get you the PEdriver internal counters.
Look at the channel errors section of each channel to see whether
listen timeouts are occurring:
SDA>

    VMScluster data structures
    --------------------------
 -- Active Channel (CH:812F00C0) for Virtual Circuit (VC:8126ABC0) ZAPNOT --
State: 0004 open                Status: 0B path,open,rmt_hwa_valid
BUS: 8123D100  (FXA)  Lcl Device: FX_DEMFA  Lcl LAN Address: 08-00-2B-3B-15-85
Rmt Name: FXA         Rmt Device: FX_DEMFA  Rmt LAN Address: 08-00-2B-29-E1-


Rmt Seq #: 0001   Open:21-MAY-1997 07:33:44.70  Closed:21-MAY-1997 07:31:05.77
------- Transmit ------  ------- Receive -------  ----- Channel Selection ----
Lcl CH Seq #       0008  Msg Rcv         3161273  Average Xmt Time    00314521
Msg Xmt              19    Mcast Msgs    3161263  Remote Buffer Size      4382
  Ctrl Msgs          14    Mcast Bytes 309803774  Max Buffer Size         4382
  Ctrl Bytes       1372    Ctrl Msgs          10  Best Channel               8
Bytes Xmt          1822    Ctrl Bytes        980  Preferred Channel          5
Rmt Ring Size        31  Bytes Rcv     309804754  Retransmit Penalty         2
---------------  Channel Errors  ---------------  Xmt Error Penalty          0
Handshake TMO         0  Short CC Msgs         0  ------- Channel Timer ------
Listen TMO            7  Incompat Chan         0  Timer Entry Flink   81204D40
Bad Authorize         0  No MSCP Srvr          0              Blink   8124F540
Bad ECO               0  Disk Not Srvd         0  Last Ring Index           10
Bad Multicast         0  Old TR Msgs           0  Protocol               1.4.0
Topology Change       0                           Supported Services  00000000