
Conference netcad::hub_mgnt

Title: DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice: Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator: NETCAD::COLELLADT
Created: Wed Nov 13 1991
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4455
Total number of notes: 16761

2268.0. "DB900EF performance measure and FDDI wrap time?" by TROFS::WEBSTER (NIS, London, Canada) Wed May 10 1995 17:59

My customer has a large LAVC setup with all disks in the cluster being MSCP
served. As their LAVC grew, the network became slower. They were recently
upgraded to an FDDI ring, with all nodes remaining on enet segments off 
DECswitch 900EF ports. They have about 40-60 LAVC nodes which range from
MicroVAX IIs to AlphaServer 2100s. (They will be migrating to FDDI attached
2100s and VAXs...but that is another issue with the DEFPA card...)

Last week, they moved 2 2100s, a VAX 3100 mod 90 and 2 MVIIs onto one segment.
The 3100 is a Pathworks server for about 30 PCs. One 2100 started logging
PEA0 virtual circuit closures (up and down). This in turn caused the network
to slow down drastically as the cluster reconfigured. They moved the 2100 to
another 900 port, but the problem remained much the same. The 2100 was
then moved to another floor on another 900EF port. This seemed to resolve the 
problem.

Is there any method to find out how many packets are going through the 900EF?
I have a sniffer, but can only measure one port at a time. I'd like to find
out if they are pushing the bridge to its 66,000 pps limit. Checking with
HUBwatch shows that the deferred packet count on various ports is increasing
rapidly. The nodes show the same problem in their "line" counters. Lots of
single and multiple collisions occur too.
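
For scale, here is a rough sketch of that arithmetic in Python (the counter
values are made up, not real HUBwatch readings): snapshot the per-port
out-packet counters twice, divide the deltas by the interval, and compare the
sum against the 66,000 pps forwarding limit.

# Rough aggregate packets-per-second estimate for a 7-port DECswitch 900EF.
# The counter values below are placeholders; substitute two readings of the
# per-port "out packets" counters taken from HUBwatch (or an SNMP poll).

INTERVAL_SECONDS = 60          # time between the two counter snapshots
FORWARDING_LIMIT_PPS = 66000   # 900EF aggregate forwarding limit

# out-packet counter per port: first snapshot, then second snapshot
snapshot_1 = [1200000, 950000, 400000, 0, 2100000, 880000, 1500000]
snapshot_2 = [1420000, 990000, 460000, 0, 2600000, 930000, 1720000]

def aggregate_pps(before, after, interval):
    """Sum the per-port counter deltas and convert to packets per second."""
    return sum(b - a for a, b in zip(before, after)) / interval

pps = aggregate_pps(snapshot_1, snapshot_2, INTERVAL_SECONDS)
print("Estimated forwarding rate: %.0f pps (%.1f%% of the 66,000 pps limit)"
      % (pps, 100.0 * pps / FORWARDING_LIMIT_PPS))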

The current config is


	+-----------------------------------------------+
	|						|
     DB900EF					   +-DB900EF	DH900MX
	|					   |		backplane
     DB900EF					   +-DC900MX	FDDI
	|					   |		config
     DB900EF					   +-DC900MX
	|						|
	+-----------------------------------------------+

FDDI nodes will be connected to the DC900MXs when the FDDI cards are installed.

The customer also has a question.

1. If the FDDI ring were to break (DB900EF A/B or DC900MX A/B port failure or
   fiber failure), will the cluster nodes exhibit any problems? What is the
   FDDI wrap time? They are concerned with cluster state transitions, as it
   really impacts their ability to work. (Digital has suggested they try to
   de-cluster and move to more powerful file and compute servers... the
   customer builds software-based process simulators and trainers... and could
   rely on X terminals and compute/file servers. Code is stored all over the
   cluster and is compiled and linked on various nodes also.)

2268.1. "The cluster from hell. 8^)" by CGOS01::DMARLOWE (Wow! Reality, what a concept!) Wed May 10 1995 18:31
>> They moved the 2100 to 
>> another 900 port but the problem remained relatively the same. The 2100 was 
>> then moved to another floor on another 900EF port. This seemed to resolve the 
>> problem.

    How many nodes are on the port that the 2100 is now on?
    How many on the ports before?
    
>> Is there any method to find out how many packets are going thru the 900EF?
    
    Double-click on a 900EF.  The first view shows the 7 ports with
    packet in/out counts for each port.
    
>> Checking with 
>> HubWatch shows the deferred packet count, on various ports, is increasing 
>> rapidly. The nodes have the same problem in their "line" counters. Lot of
>> single and multiple collisions occur too.

    Deferred means traffic was already on the wire, so the packet was held
    back.  Collisions are another matter, especially multiple collisions.
    It sounds like you have too much traffic.  Also, are there any nodes that
    may not be fully complying with the 9.6 us IPG time?
    

>> If the FDDI ring was to break (DB900EF A/B or DC900MX A/B port failure of 
>> fiber failure), will the cluster nodes exhibit any problems? What is 
>> FDDI wrap time?
    
    Wrap time is quite small.  It can take as little as 10 ms to beacon and
    wrap (heal), but maybe someone can provide a closer answer.  This is far
    faster than an STP reconfig, however.  The only thing that will happen is
    that any packets on the ring at the time of the wrap will be lost.  Those
    packets will have to be retransmitted based on timers in the upper
    protocol stack.
    
    dave

2268.2. by NPSS::WADE (Network Systems Support) Wed May 10 1995 19:30
    
    Are you seeing any resets on the 900EF?
    
    Any other bridges on the net?
    
    Any STP topology changes on the E-LAN?
    
    I assume you have RECNXINTERVAL set to more than the default 20 seconds
    on all cluster nodes?
    
    Bill
    

2268.3. by NETCAD::ANIL Thu May 11 1995 00:55
    The sum of out packets on all ports is the total number of packets
    forwarded -- this includes spanning tree hellos and SNMP management
    packets, but these will be a negligible percentage of actual
    traffic.
    
    Deferred frames, single, and multiple collisions are normal in
    an active network.  The thing to look at is excessive collisions
    as an indication of high levels of congestion on the Ethernet.
    
    Anil
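
A quick way to put a number on "excessive" is to compare the collision
counters against the transmit attempts over the same interval. The sketch
below (Python) uses made-up counter values and rule-of-thumb thresholds, not
DECswitch specifications or HUBwatch output.

# Congestion check from one interval's worth of Ethernet transmit counters.
# Counter values and thresholds are illustrative, not HUBwatch output.

def congestion_check(tx_ok, deferred, single_coll, multi_coll, excess_coll):
    attempts = tx_ok + excess_coll   # frames the station tried to send
    if attempts == 0:
        return "no transmit activity in this interval"
    collision_rate = float(single_coll + multi_coll) / attempts
    report = ("deferred=%d  collision rate=%.1f%%  excessive=%d"
              % (deferred, 100.0 * collision_rate, excess_coll))
    # Deferrals and ordinary collisions are normal on a busy half-duplex
    # segment; excessive collisions (16 retries, frame dropped) are the
    # real sign of saturation.
    if excess_coll > 0:
        return report + "  -> segment is saturated, frames are being dropped"
    if collision_rate > 0.10:
        return report + "  -> heavy contention, consider splitting the segment"
    return report + "  -> looks like normal load"

print(congestion_check(tx_ok=50000, deferred=4200,
                       single_coll=3100, multi_coll=900, excess_coll=0))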

2268.4. "Update on Cluster From Hell :-(" by TROFS::WEBSTER (NIS, London, Canada) Thu May 11 1995 19:28
I put a sniffer on their net today. Multicast/broadcast storms seem to be
common on all segments that have more than 1 LAVC node. More than 3 LAVC
nodes causes LAN overload conditions. One segment I measured had sustained
averages of >50% utilization, lasting >10 seconds.
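
For reference, a utilization figure like that can be cross-checked from raw
byte counts: on 10 Mb/s Ethernet, utilization is just bits seen divided by
10,000,000 per second. A small Python sketch of the arithmetic, using made-up
per-second byte counts rather than the actual capture:

# Cross-check a sustained-utilization figure on a 10 Mb/s Ethernet segment.
# The per-second byte counts are made up for illustration.

LINK_BITS_PER_SECOND = 10000000

bytes_per_second = [710000, 680000, 655000, 700000, 690000,
                    720000, 640000, 660000, 705000, 695000]

samples = [8.0 * b / LINK_BITS_PER_SECOND for b in bytes_per_second]
sustained = sum(samples) / len(samples)

print("per-second utilization: "
      + ", ".join("%.0f%%" % (100 * u) for u in samples))
print("average over %d s window: %.0f%%" % (len(samples), 100 * sustained))
if sustained > 0.50:
    print("-> over 50% sustained on half-duplex Ethernet: expect heavy "
          "collisions and deferrals")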

There are no resets on the bridges (uptime exceeds 63 days on all bridges...
about the time of the last power failure we had in the building; Digital is
actually in the same building as the customer... our downsizing has given them
more floor space).

There are only the 4 DB900EFs on the net and 1 Cisco 2500 router. There are
many IP nodes, but not a lot of traffic from them. The PC count is now over 50,
most are running Workgroups, and there is 1 NetWare server, 1 remote mail
server, 1 internal mail server and 1 FAXserver (all PC based).

RECNXINTERVAL is at the default of 20. They have not adjusted this, so I
assume it is the same on all nodes.

I have some concerns about the performance of the DETTR, which is an Allied
Telesis 10BaseT-to-10Base2 repeater. The customer's wiring is all coax. We
replaced their DECrepeater 90Cs with the DB900EFs, so DETTRs and DECXMs were
supplied to connect the coax. The sniffer found CRC errors and runt packets on
several segments, even ones that were not excessively busy. On one segment, we
started shutting down the nodes one by one to isolate the source, but the
errors persisted. Segments measured off the DECXMs did not have these errors.

>Any STP topology changes on the E-LAN?

	What do you mean by this, Bill?

MCS did get the DEFPA working today on the 2100 server running OpenVMS 6.1,
so the LAVC traffic is now on the FDDI for this one node. The other 2100 will
be upgraded next week (MCS has to make sure all the console code and hardware
are at the proper revs) and 4 other nodes will be added shortly after (a mix
of VAX 3100s and Alpha 3000s). That should help reduce enet traffic.

-Larry


2268.5. by NETCAD::ANIL Thu May 11 1995 23:47
    If a large percentage of the traffic is broadcast or multicast, you can
    turn on "rate limiting" for the specific addresses in the switches to
    stop them from propagating; in any case the reason still needs to be
    figured out.  Note that the Sniffer tends to report "storms" falsely
    if its trigger for such detection is set low.  If a large percentage
    is error traffic, that would indicate a physical (MAU/repeater) problem.
    I would also check to make sure that no configuration rules, such
    as the limit on repeaters in series, are being violated.  I've seen these
    cause the kind of network slowdown you're describing.
    
    Anil
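
One way to sanity-check a reported "storm" is to look at what fraction of the
frames in a reasonable capture window are broadcast or multicast, rather than
trusting the analyzer's trigger. The Python sketch below uses placeholder
counts, and the 20% threshold is only a rule of thumb:

# Sanity-check a "broadcast storm" report: what share of the frames captured
# in the window were broadcast/multicast?  Counts and the 20% threshold are
# placeholders, not analyzer or HUBwatch settings.

def storm_check(total_frames, bcast_frames, mcast_frames, window_seconds):
    if total_frames == 0:
        return "no traffic captured"
    share = float(bcast_frames + mcast_frames) / total_frames
    rate = (bcast_frames + mcast_frames) / float(window_seconds)
    if share > 0.20:
        verdict = ("sustained broadcast/multicast load -- find the source "
                   "(rate limiting can contain it)")
    else:
        verdict = "probably a low analyzer trigger rather than a real storm"
    return ("broadcast+multicast = %.0f%% of frames, %.0f frames/s -> %s"
            % (100 * share, rate, verdict))

# e.g. a 30-second window on a segment with several LAVC nodes (made-up counts)
print(storm_check(total_frames=180000, bcast_frames=9000,
                  mcast_frames=55000, window_seconds=30))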

2268.6. by NPSS::WADE (Network Systems Support) Fri May 12 1995 13:06
    Off track for this conference, but RECNXINTERVAL = 20 seconds is
    the default for a CI cluster.  Increasing this to 60-90 (on all nodes)
    should stop the PEDRIVER errors while you fix the problem on the E-LAN.
    It is advised to leave it at 60-90 for a cluster that includes NI
    nodes.
    
    Bill
     

2268.7. "Cluster transitions when DC900MX removed." by TROFS::WEBSTER (NIS, London, Canada) Mon Aug 21 1995 20:53
	Back to one of the original questions regarding FDDI wrap time and
	cluster state transitions.

	Last week we had the privilege of removing a concentrator 900 from the
	ring to change a PMD. This unit had no SASs connected, as the nodes
	were just being installed (the old problem of DEFTA-UAs not being
	supported by VMS < version 6.2, so UTP PMDs were changed to MMF PMDs).

	The DC900MX was in the middle of the ring created in the backplane
	(see the diagram in note .0).
	Using HUBwatch, we pulled the B port off the channel. As soon as we
	did this, the cluster went into a state transition. Any node that was
	talking to the 2 FDDI nodes on the other DC900MX in the backplane was
	affected.

	Will raising the RECNXINTERVAL timer to the 60-90 second value, as
	mentioned in reply .6, resolve this problem?

	The customer was not too happy about this, and I would like to go back
	and say, "We told you to increase your counters and you didn't, so
	that's why there were transitions!".

	-Larry

2268.8. by NETCAD::DOODY (Michael Doody) Tue Aug 22 1995 13:29
    I think you are seeing a problem, not with FDDI wrap time but rather
    with the bridge ports going into pre-forwarding state. 
    
    When you removed the concentrator from the ring, the adjacent modules'
    ports were connected to each other to heal the ring. When this
    disconnect/reconnect happens, one or more of the bridge FDDI ports go
    into pre-forwarding state as they learn the new topology (Anil could
    give a better answer). During the 30-second pre-forwarding period, no
    packets are forwarded between FDDI and Ethernet.
    
    So clearly a RECNXINTERVAL of 20 will expire before the bridge ports
    come back.
    
    -Mike
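
Putting rough numbers on that: the cluster only rides out a wrap if
RECNXINTERVAL exceeds the wrap time plus the pre-forwarding delay. The Python
sketch below uses the 10 ms wrap and 30-second pre-forwarding figures quoted
earlier in this topic; they are illustrative, not measurements from this hub.

# Will the cluster ride out a ring wrap?  Compare the worst-case outage
# (FDDI wrap plus bridge pre-forwarding) against RECNXINTERVAL.
# Figures are the ones quoted in this topic, used here for illustration.

FDDI_WRAP_SECONDS = 0.010        # ~10 ms to beacon and wrap (reply .1)
PRE_FORWARDING_SECONDS = 30.0    # bridge FDDI port pre-forwarding (reply .8)

def survives_wrap(recnxinterval):
    outage = FDDI_WRAP_SECONDS + PRE_FORWARDING_SECONDS
    ok = recnxinterval > outage
    return ("RECNXINTERVAL=%2ds vs ~%.1fs outage -> %s"
            % (recnxinterval, outage,
               "virtual circuits survive" if ok
               else "VC closures / cluster state transition"))

for setting in (20, 60, 90):     # default, and the 60-90 s suggested in .6
    print(survives_wrap(setting))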