
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5314.0. "Once again on "Lost connection to node..."" by MLNCSC::CAREMISE (and then they were ...four !) Tue May 20 1997 15:21

Hi everybody.

I want to raise again the problem of lost connections. Even though it has
already been discussed in various entries in this conference, and although
I have read all of them and implemented a lot of tuning/setup changes, I'm
still having the problem.

Briefly, here is what is happening at my site:

We have two clusters that were running on the same Ethernet backbone, and
everything worked fine until we moved some satellites of both clusters to
another building, about 100 meters away from the computer room where the
boot nodes still reside.

The network connection:
All systems are connected to DECrepeater 900TP modules in DEChub 900MS hubs,
which are linked through an FDDI ring via two DECswitch 900EF bridges (one
for each hub).

The main cluster is composed of 3 VAX 6000-510s (boot nodes) and 8
satellites (mainly MicroVAX 3100-76s); every satellite has its own root on a
shadowed system disk on CI, and its own page/swap files on a local disk.

The other cluster is pure NI, with a VAX 4200 as boot node and 12
satellites (5 are V4VLCs, and 7 are MicroVAX 3100-40s and -76s).

Both clusters are running VAX/VMS 5.5-2.

All satellites of both clusters that were moved to the other site show the
problem, which occurs almost every 3-4 minutes, between them and another
node on the previous backbone
(no VAXcluster transitions are evidenced):

EXAMPLE:

%%%%%%%%%%%  OPCOM  20-MAY-1997 15:31:20.76  %%%%%%%%%%%    (from node 
FAMV37 at 20-MAY-1997 15:31:52.62)
15:31:52.59 Node FAMV37 (csid 00010042) lost connection to node FAMV09

%%%%%%%%%%%  OPCOM  20-MAY-1997 15:31:20.79  %%%%%%%%%%%    (from node 
FAMV37 at 20-MAY-1997 15:31:56.13)
15:31:56.09 Node FAMV37 (csid 00010042) re-established connection to node 
FAMV09


This causes: 1. cluster slowdown due to random shadow system disk rebuilds
	      2. abnormal OPERATOR.LOG growth
	      3. abnormal ERRORLOG.SYS growth
	      4. potential cluster hang due to system disk filling

ACTIONS TAKEN:
	1. Tuned all systems' pool and cluster parameters
	2. Applied patch VAXLAVC04_U2055
	3. Raised RECNXINTERVAL to 180 (nonsense in my opinion)
	4. Tried connecting some satellites to a PEswitch on a
	   DEChub ONE MX, directly on FDDI (nothing changed)
	5. Raised the PIPELINE quota on all DECnet executors to 14400
	   (it was 10000 before)
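[Editorial note: the timing parameters mentioned above can be inspected on a
running system with SYSGEN; a minimal sketch, assuming the V5.5-2 parameter
names and units (RECNXINTERVAL is in seconds, TIMVCFAIL in hundredths of a
second):]

```dcl
$ ! Inspect the cluster reconnection/timeout parameters on this node.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW RECNXINTERVAL   ! seconds allowed for VC reconnection
SYSGEN> SHOW TIMVCFAIL       ! VC failure detect time, 1/100 s units
SYSGEN> EXIT
```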
 		
EVIDENCE : All stations and boot nodes log hundreds of errors on PEA0
	   with the same (for me) undecipherable description:


 ******************************* ENTRY     336. ***************************
 ERROR SEQUENCE 6256.                            LOGGED ON:    SID 0B000006
 DATE/TIME 20-MAY-1997 10:54:55.82                        SYS_TYPE 02400101
 SYSTEM UPTIME: 21 DAYS 02:22:50
 SCS NODE: FAMV09                                            VAX/VMS V5.5-2

 ERL$LOGMESSAGE  KA64A  CPU FW REV# 6.  CONSOLE FW REV# 4.0
                 XMI NODE # 1.

 NI-SCS SUB-SYSTEM, _FAMV09$PEA0:

       PORT HAS CLOSED VIRTUAL CIRCUIT

       LOCAL STATION ADDRESS, FFFFFFFFFF00(X)
       LOCAL SYSTEM ID, 000000002014(X)

       REMOTE STATION ADDRESS, 0000000000D5(X)
       REMOTE SYSTEM ID, 000000002157(X)

       UCB$B_ERTCNT          32
                                       50. RETRIES REMAINING
       UCB$B_ERTMAX          32
                                      50. RETRIES ALLOWABLE
       UCB$W_ERRCNT        00E0
                                       224. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00


FINAL CONSIDERATIONS and QUESTIONS :
We strongly suspect something related to network-introduced delay.
(The same satellite that fails works fine when moved back to the computer room.)

Are there any SCS/SYSGEN parameters that can be impacted where LAVC traffic
must pass through more than two switches/bridges?

Does anyone have any idea why UNKNOWN OPCODE is the reason for closing
the virtual circuit?

Are there any NCP executor/circuit/line parameters impacted?

    Thanks in advance to anyone who can help, and apologies for the
    long entry ....
    
    Sergio.
    
5314.1UTRTSC::utoras-198-48-113.uto.dec.com::JurVanDerBurgChange mode to Panic!Tue May 20 1997 15:429
First of all, the DECnet PIPELINE quota has nothing to do with cluster
communications, and raising it may make things worse if there's also
heavy DECnet traffic.

I would suggest investigating the network load over the DEChub.
This is purely a network connection problem.

Jur.

5314.2Suggestions...XDELTA::HOFFMANSteve, OpenVMS EngineeringTue May 20 1997 17:1219
   I'd simplify the various LAN segments involved, and I'd start looking
   for cabling faults.  You'll need to check the latency of those LAN
   widgets, as well.  Do you have access to a LAN monitor?

   You'll also want to reAUTOGEN all nodes with FEEDBACK, and reboot.
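   [Editorial note: a sketch of the usual AUTOGEN invocation for this, with
   phase names as documented for AUTOGEN; SAVPARAMS through REBOOT runs all
   phases, and FEEDBACK tells AUTOGEN to use the recorded load data:]

```dcl
$ ! Re-tune this node from collected feedback data, then reboot.
$ @SYS$UPDATE:AUTOGEN SAVPARAMS REBOOT FEEDBACK
```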

   Are there patterns to the messages?  (eg: is node FAMV37 regularly
   involved?)  If so, concentrate on the patterns.

   What does DECamds have to say about the configuration?

   Seriously consider an upgrade from V5.5-2, as V7.1 is current.  (And
   we have rewritten shadowing, mount, and a number of other areas...)

   (NCP and PIPELINE settings are entirely unrelated to the VMScluster
   communications -- DECnet is involved only during the satellite node
   download operation, and is not involved thereafter.)

5314.3More info ...MLNCSC::CAREMISEand then they were ...four !Wed May 21 1997 08:0138
    
    
    Thank you guys for your feedbacks !
    
    Just a couple of things to clarify the situation :
    
    Could you be more specific about how to check latency?
    
    NETWORK LOAD: We have put a sniffer on the computer room's backbone;
    the load on the Ethernet is under 25%, and the error rate (CRC,
    short, runt, etc.) is very low.
    
    What is DECamds? Is it a tool available on the net?
    
    An upgrade to VMS 7.1 is impossible right now: the customer's
    applications are related to telephony devices that are strictly linked
    to VMS 5.5-2, and we see no way out on this matter until the
    third-party application is rewritten for a higher VMS version.....
    
    If you suspect shadowing, MOUNT, or other areas are involved, please
    provide patch info.
    
    PATTERNS : the only thing I noted is that 2 stations have an error rate
            on PEA0 double that of the remaining stations; I will check
            whether it is possible to remove them from the cluster.
    
    Anyway, I'll suggest that the system manager re-run AUTOGEN on all
    nodes with feedback enabled, even if I'm a bit skeptical about this.
    I still tend to believe something involving the network is at work.
    To that end I'll try to get more info on LAVC$FAILURE_ANALYSIS.
    What do you think about it?
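    [Editorial note: a rough sketch of the LAVC$FAILURE_ANALYSIS setup, as
    outlined in the VMScluster manual; the file names below are the ones
    shipped in SYS$EXAMPLES:, but the exact build steps should be checked
    against the manual for your VMS version:]

```dcl
$ ! Describe the physical LAN (nodes, adapters, bridges) by editing
$ ! a copy of the supplied MACRO-32 template, then build and run it
$ ! on each node to enable PEdriver's failure analysis reporting.
$ COPY SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR SYS$MANAGER:
$ ! ... edit SYS$MANAGER:LAVC$FAILURE_ANALYSIS.MAR here ...
$ MACRO SYS$MANAGER:LAVC$FAILURE_ANALYSIS.MAR
$ LINK LAVC$FAILURE_ANALYSIS
$ RUN LAVC$FAILURE_ANALYSIS    ! loads the network description
```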
    
    Do you have any advice on the meaning of the errorlog entry?
    
    Thanks again and Ciao !
    
    Sergio (MCS Milano, Italia)
    
5314.4More InfoXDELTA::HOFFMANSteve, OpenVMS EngineeringWed May 21 1997 13:4860
:    Could you be more specific on how to check latency . 

   Confirm the path.  Confirm the number of devices.  Confirm that
   the devices are suited for this application.  More than a few
   customers will tell you that their network configuration is `X',
   and when you actually look, you find `Y'.  Also confirm that the
   configuration of the network is valid -- more than a few sites
   have seen a misused "T" connector or an unterminated LAN segment,
   as specific examples.

   Also see what DTS/DTR show for throughput on the link -- these
   are DECnet tools, but these can load up a network nicely.

   I'd expect that specific round-trip measurements would require
   a LAN monitor.
    
:    NETWK LOAD : We have put a sniffer on the Computer room's backbone
:    and the load on Ethernet is under 25% , and the error rate (CRC,
:    SHORT,RUNT...ecc) is very low.

   What counters are increasing in DECnet, if any?  (Zero the counters,
   run the DTS/DTR tools to generate a load, and see what happens to the
   DECnet line and circuit counters.)
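   [Editorial note: in NCP terms that is roughly the following; the line
   name SVA-0 is only an example -- use whatever NCP SHOW KNOWN LINES
   reports on your systems:]

```dcl
$ MCR NCP
NCP> ZERO EXECUTOR COUNTERS
NCP> ZERO LINE SVA-0 COUNTERS
NCP> EXIT
$ ! ... generate load with DTS/DTR, then re-read the counters ...
$ MCR NCP SHOW EXECUTOR COUNTERS
$ MCR NCP SHOW LINE SVA-0 COUNTERS
```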
    
:    What is DECamds : is it an 'available on the net' tool ?

   It's part of OpenVMS.  A very valuable part for managing a network
   of nodes, or a VMScluster, too.

:    Upgrade to Vms 7.1 is now impossible : customer's application are
:    related to Telephonic devices that are strictly linked to VMS 5.5-2
:    and we don't see way out on this matter, until the 3rd party appli-
:    cation will be re-written for an higher VMS version.....

   I will assume the customer has a "prior version support" contract.
    
:    If you suspect shadow or mount or other areas involved, pls. provide
:    patches info.

   Check http://www.service.digital.com -- I'd look, but the link is
   down right now.
    
:    PATTERNS : the only thing I noted is that 2 station has an error rate
:            on PEA0 double than on other remaining station : I would check
:            if it is possible to remove it from cluster.

   Make sure there are not overlapping cluster groups or a bad cluster
   password involved here -- and what are the other errors that are in
   the error log?  (The entry listed in .0 is rather nondescript...)

:    Anyway I'll suggest to system manager to re-run Autogen on all nodes
:    with feedback enabled, even if I'm a bit skeptic on this. I tend to
:    beleive more in a something involved with network too.
:    At this purpose I'll try to get more info on LAVC$FAILURE_ANALYSIS
:    What do you think about it ?

   When one node has out-of-whack SYSGEN parameters, the whole VMScluster
   can encounter problems when that node gets "backed up".  (This is why
   I asked you to check for any common patterns in the error messages.)
    
5314.5TIMVCFAIL is a step forwardMLNCSC::CAREMISEand then they were ...four !Mon May 26 1997 15:2418
    The problem on the cluster with the VAX 6000s and satellites has been
    solved by downsizing the TIMVCFAIL parameter from 1600 to 800.
    Unfortunately, the same change had no effect on the other cluster
    (pure NI).
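    [Editorial note: a sketch of how such a change is made on a running
    system; TIMVCFAIL is a dynamic SYSGEN parameter in units of hundredths
    of a second, so 800 = 8 seconds:]

```dcl
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET TIMVCFAIL 800    ! 8 seconds; was 1600 (16 seconds)
SYSGEN> WRITE ACTIVE         ! apply to the running system
SYSGEN> WRITE CURRENT        ! preserve across reboots
SYSGEN> EXIT
```

    (Adding the value to MODPARAMS.DAT as well keeps AUTOGEN from
    undoing it on the next run.)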
    
    
    We still get a lot of "system buffer unavailable" errors on the DECnet
    lines, and also some receive errors (frame too long). We tried
    enlarging the lines' receive buffers (from 10 to 20) with no
    appreciable results, except for 3 stations that now run OK.
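    [Editorial note: the receive-buffer change itself is the standard NCP
    line parameter; the line name SVA-0 below is just an example:]

```dcl
$ MCR NCP
NCP> SET LINE SVA-0 RECEIVE BUFFERS 20     ! volatile database
NCP> DEFINE LINE SVA-0 RECEIVE BUFFERS 20  ! permanent database
NCP> EXIT
```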
    
    Could an oversized NPAGEDYN be a problem?  (We noted on many stations
    a value over 8 million, with an effective usage around 1 million.)
    
    The other pool parameters look OK.
    
    Any comments ?
    
    
5314.6UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurgChange mode to Panic!Tue May 27 1997 05:0817
Lowering TIMVCFAIL is not a solution but a workaround for your network problems.

A lot of "system buffer unavailable" errors means that the network is so busy
that the system has a hard time keeping up, and drops packets.

>    Can be a problem of NPAGEDYN oversized ?  ( we noted on many station
>    a value over 8 million, with an effective usage around 1 million.)

That means there has been a peak usage of NPP, which may be attributed
to network broadcast storms.

I would seriously take a good look at the network load and check whether you
can do something about it, like adding bridges, etc. Or check for other
bad things. A network trace can do wonders.

Jur.

5314.7LAVc Troubleshooting, Key patches??STAR::BOAENLANclusters/VMScluster Tech. OfficeTue May 27 1997 19:0128
VERIFY YOU HAVE THE PEDRIVER PATCHES:
Before doing anything else, make certain that the nodes have the following TIMA kit:
	PEDRIVER V5.5-2 VOID/TIMA kit: VAXLAVC03_U2055;
	or CSC patch kit CSCPAT_1081

It's been out for several years, but if you don't have it installed, it's the
first thing to do. We significantly improved PEdriver's ability to deal with
network congestion and delay variations somewhere around V6.0. This kit
back-ports those changes to V5.5-2.

READ THE MANUAL:
	The "Troubleshooting the NISCA Protocol" appendix to the 
V6.1 & higher versions of the VMScluster Systems manual shows how 
to use SDA to get & interpret counters & delay information from 
PEdriver's port, VC, & Channel data structures. This should help
identify why PEdriver is closing VCs. I suspect that a channel is
getting listen timeouts because packets are being lost due to network
congestion or (less likely) faulty network HW.

00x = UNRECOGNIZED OPCODE:
The errorlog analyzer doesn't understand that some errorlog entries don't
have a message buffer attached; it always assumes that the message buffer
fields are there. In this case there isn't any message, and these fields are
all 0s. The opcode value of 00x is undefined for PEdriver, so this part of
the errorlog report is misleading...

'Gards, Verell


5314.8updateMLNCSC::CAREMISEand then they were ...four !Wed May 28 1997 08:5028
Probably I wasn't clear in my .0.

It cannot be a problem of a loose connection, because the cabling is TP,
which means a direct connection to the repeater. No ThinWire involved.

All ports are switched (which means 'bridged', in my opinion...), and every
repeater is in turn switched onto the backbone through a 3Com port switch.

The boot node is directly connected to the other DECswitch (which IS a
bridge), so everything is ALREADY bridged.

The traffic percentages on the two networks are below 25%, so there is no
Ethernet congestion.

The VAXLAVC patch has already been installed (read my .0).

The only thing I can agree with is the chance that we have a lot of
broadcast storms, and I will now investigate this.
It seems that these storms come from Sun stations.

Can someone tell me something about how to deal with storms, especially
on non-DEC systems?

And how can storms be the cause of these disconnections?

Thanks again. Sergio.

    
5314.9UTRTSC::jgoras-197-2-3.jgo.dec.com::JurVanDerBurgChange mode to Panic!Wed May 28 1997 10:2914
>Can someone tell me something on how to work on storms, expecially
>on non DEC systems ?

Start measuring with a sniffer, and if non-DEC systems are causing
storms, contact the system managers for those systems and let them
find out what's wrong.

>And how storms can be the cause of this disconnections ?

Heavy broadcast storms can cause severe packet loss, and if that happens
frequently enough, SCS will time out.

Jur.

5314.10Look for Listen TimeoutsSTAR::BOAENLANclusters/VMScluster Tech. OfficeThu May 29 1997 15:1637
To determine whether connections are being lost because NISCA multicast
packets are being lost, use SDA to examine the PEdriver channels between the
two nodes. Do the following on each node from a privileged account:

$ ANALYZE/SYSTEM
$ SHOW PORT
$ SHOW PORT/CH/VC=VC_nodename

This will get you the PEdriver internal counters.
Look at the channel errors section of each channel to see whether
listen timeouts are occurring:
SDA>

    VMScluster data structures
    --------------------------
 -- Active Channel (CH:812F00C0) for Virtual Circuit (VC:8126ABC0) ZAPNOT --
State: 0004 open                Status: 0B path,open,rmt_hwa_valid
BUS: 8123D100  (FXA)  Lcl Device: FX_DEMFA  Lcl LAN Address: 08-00-2B-3B-15-85
Rmt Name: FXA         Rmt Device: FX_DEMFA  Rmt LAN Address: 08-00-2B-29-E1-


Rmt Seq #: 0001   Open:21-MAY-1997 07:33:44.70  Closed:21-MAY-1997 07:31:05.77
------- Transmit ------  ------- Receive -------  ----- Channel Selection ----
Lcl CH Seq #       0008  Msg Rcv         3161273  Average Xmt Time    00314521
Msg Xmt              19    Mcast Msgs    3161263  Remote Buffer Size      4382
  Ctrl Msgs          14    Mcast Bytes 309803774  Max Buffer Size         4382
  Ctrl Bytes       1372    Ctrl Msgs          10  Best Channel               8
Bytes Xmt          1822    Ctrl Bytes        980  Preferred Channel          5
Rmt Ring Size        31  Bytes Rcv     309804754  Retransmit Penalty         2
---------------  Channel Errors  ---------------  Xmt Error Penalty          0
Handshake TMO         0  Short CC Msgs         0  ------- Channel Timer ------
Listen TMO            7  Incompat Chan         0  Timer Entry Flink   81204D40
Bad Authorize         0  No MSCP Srvr          0              Blink   8124F540
Bad ECO               0  Disk Not Srvd         0  Last Ring Index           10
Bad Multicast         0  Old TR Msgs           0  Protocol               1.4.0
Topology Change       0                           Supported Services  00000000