
Conference netcad::hub_mgnt

Title: DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice: Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator: NETCAD::COLELLADT
Created: Wed Nov 13 1991
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4455
Total number of notes: 16761

2268.0. "DB900EF performance measure and FDDI wrap time?" by TROFS::WEBSTER (NIS, London, Canada) Wed May 10 1995 17:59

My customer has a large LAVC setup with all disks in the cluster being MSCP
served. As their LAVC grew, the network became slower. They were recently
upgraded to an FDDI ring, with all nodes remaining on enet segments off 
DECswitch 900EF ports. They have about 40-60 LAVC nodes which range from
MicroVAX IIs to AlphaServer 2100s. (They will be migrating to FDDI attached
2100s and VAXs...but that is another issue with the DEFPA card...)

Last week, they moved 2 2100s, a VAX 3100 mod 90 and 2 MVIIs onto one segment.
The 3100 is a Pathworks server for about 30 PCs. One 2100 started logging
PEA0 virtual circuit closures (up and down). This in turn caused the network
to slow down drastically as the cluster reconfigured. They moved the 2100 to
another 900 port, but the problem remained much the same. The 2100 was
then moved to another floor on another 900EF port. This seemed to resolve the 
problem.

Is there any method to find out how many packets are going through the 900EF?
I have a sniffer, but can only measure one port at a time. I'd like to find
out if they are pushing the bridge to its 66,000 pps limit. Checking with
HUBwatch shows that the deferred packet count on various ports is increasing
rapidly. The nodes show the same problem in their "line" counters. Lots of
single and multiple collisions occur too.
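
For scale, here is a rough sketch of that arithmetic in Python (the counter
values are made up, not real HUBwatch readings): snapshot the per-port
out-packet counters twice, divide the deltas by the interval, and compare the
sum against the 66,000 pps forwarding limit.

# Rough aggregate packets-per-second estimate for a 7-port DECswitch 900EF.
# The counter values below are placeholders; substitute two readings of the
# per-port "out packets" counters taken from HUBwatch (or an SNMP poll).

INTERVAL_SECONDS = 60          # time between the two counter snapshots
FORWARDING_LIMIT_PPS = 66000   # 900EF aggregate forwarding limit

# out-packet counter per port: first snapshot, then second snapshot
snapshot_1 = [1200000, 950000, 400000, 0, 2100000, 880000, 1500000]
snapshot_2 = [1420000, 990000, 460000, 0, 2600000, 930000, 1720000]

def aggregate_pps(before, after, interval):
    """Sum the per-port counter deltas and convert to packets per second."""
    return sum(b - a for a, b in zip(before, after)) / interval

pps = aggregate_pps(snapshot_1, snapshot_2, INTERVAL_SECONDS)
print("Estimated forwarding rate: %.0f pps (%.1f%% of the 66,000 pps limit)"
      % (pps, 100.0 * pps / FORWARDING_LIMIT_PPS))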

The current config is


	+-----------------------------------------------+
	|						|
     DB900EF					   +-DB900EF	DH900MX
	|					   |		backplane
     DB900EF					   +-DC900MX	FDDI
	|					   |		config
     DB900EF					   +-DC900MX
	|						|
	+-----------------------------------------------+

FDDI nodes will be connected to the DC900MXs when the FDDI cards are installed.

The customer also has a question.

1. If the FDDI ring were to break (DB900EF A/B or DC900MX A/B port failure or
   fiber failure), will the cluster nodes exhibit any problems? What is the
   FDDI wrap time? They are concerned with cluster state transitions, as it
   really impacts their ability to work. (Digital has suggested they try to
   de-cluster and move to more powerful file and compute servers... the
   customer builds software-based process simulators and trainers... and could
   rely on X terminals and compute/file servers. Code is stored all over the
   cluster and is compiled and linked on various nodes also.)

2268.1. "The cluster from hell. 8^)" by CGOS01::DMARLOWE (Wow! Reality, what a concept!) Wed May 10 1995 18:31
>> They moved the 2100 to 
>> another 900 port but the problem remained relatively the same. The 2100 was 
>> then moved to another floor on another 900EF port. This seemed to resolve the 
>> problem.

    How many nodes are on the port that the 2100 is now on?
    How many on the ports before?
    
>> Is there any method to find out how many packets are going thru the 900EF?
    
    Double-click on a 900EF.  The first view shows the 7 ports with
    packet in/out counts for each port.
    
>> Checking with 
>> HubWatch shows the deferred packet count, on various ports, is increasing 
>> rapidly. The nodes have the same problem in their "line" counters. Lot of
>> single and multiple collisions occur too.

    Deferred means traffic was already on the wire, so the packet was held
    back.  Collisions are another matter, especially multiple collisions.
    It sounds like you have too much traffic.  Also, are there any nodes that
    may not be fully complying with the 9.6 us IPG time?
    

>> If the FDDI ring was to break (DB900EF A/B or DC900MX A/B port failure of 
>> fiber failure), will the cluster nodes exhibit any problems? What is 
>> FDDI wrap time?
    
    Wrap time is quite small.  It can take as little as 10 ms to beacon and
    wrap (heal), but maybe someone can provide a closer answer.  This is far
    faster than an STP reconfig, however.  The only thing that will happen is
    that any packets on the ring at the time of the wrap will be lost.  Those
    packets will have to be retransmitted based on timers in the upper
    protocol stack.
    
    dave

2268.2. by NPSS::WADE (Network Systems Support) Wed May 10 1995 19:30
    
    Are you seeing any resets on the 900EF?
    
    Any other bridges on the net?
    
    Any STP topology changes on the E-LAN?
    
    I assume you have RECNXINTERVAL set to more than the default 20 seconds
    on all cluster nodes?
    
    Bill
    

2268.3. by NETCAD::ANIL Thu May 11 1995 00:55
    The sum of out packets on all ports is the total number of packets
    forwarded -- this includes spanning tree hellos and SNMP management
    packets, but these will be a negligible percentage of actual
    traffic.
    
    Deferred frames, single, and multiple collisions are normal in
    an active network.  The thing to look at is excessive collisions
    as an indication of high levels of congestion on the Ethernet.
    
    Anil
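
A quick way to put a number on "excessive" is to compare the collision
counters against the transmit attempts over the same interval. The sketch
below (Python) uses made-up counter values and rule-of-thumb thresholds, not
DECswitch specifications or HUBwatch output.

# Congestion check from one interval's worth of Ethernet transmit counters.
# Counter values and thresholds are illustrative, not HUBwatch output.

def congestion_check(tx_ok, deferred, single_coll, multi_coll, excess_coll):
    attempts = tx_ok + excess_coll   # frames the station tried to send
    if attempts == 0:
        return "no transmit activity in this interval"
    collision_rate = float(single_coll + multi_coll) / attempts
    report = ("deferred=%d  collision rate=%.1f%%  excessive=%d"
              % (deferred, 100.0 * collision_rate, excess_coll))
    # Deferrals and ordinary collisions are normal on a busy half-duplex
    # segment; excessive collisions (16 retries, frame dropped) are the
    # real sign of saturation.
    if excess_coll > 0:
        return report + "  -> segment is saturated, frames are being dropped"
    if collision_rate > 0.10:
        return report + "  -> heavy contention, consider splitting the segment"
    return report + "  -> looks like normal load"

print(congestion_check(tx_ok=50000, deferred=4200,
                       single_coll=3100, multi_coll=900, excess_coll=0))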

2268.4. "Update on Cluster From Hell :-(" by TROFS::WEBSTER (NIS, London, Canada) Thu May 11 1995 19:28
I put a sniffer on their net today. Multicast/broadcast storms seem to be
common on all segments that have more than 1 LAVC node. More than 3 LAVC
nodes causes LAN overload conditions. One segment I measured had sustained
averages of >50% utilization, lasting >10 seconds.
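
For reference, a utilization figure like that can be cross-checked from raw
byte counts: on 10 Mb/s Ethernet, utilization is just bits seen divided by
10,000,000 per second. A small Python sketch of the arithmetic, using made-up
per-second byte counts rather than the actual capture:

# Cross-check a sustained-utilization figure on a 10 Mb/s Ethernet segment.
# The per-second byte counts are made up for illustration.

LINK_BITS_PER_SECOND = 10000000

bytes_per_second = [710000, 680000, 655000, 700000, 690000,
                    720000, 640000, 660000, 705000, 695000]

samples = [8.0 * b / LINK_BITS_PER_SECOND for b in bytes_per_second]
sustained = sum(samples) / len(samples)

print("per-second utilization: "
      + ", ".join("%.0f%%" % (100 * u) for u in samples))
print("average over %d s window: %.0f%%" % (len(samples), 100 * sustained))
if sustained > 0.50:
    print("-> over 50% sustained on half-duplex Ethernet: expect heavy "
          "collisions and deferrals")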

There are no resets on the bridges (uptime exceeds 63 days on all bridges...
about the time of the last power failure we had in the building; Digital is
actually in the same building as the customer... our downsizing has given them
more floor space).

There are only the 4 DB900EFs on the net and 1 Cisco 2500 router. There are
many IP nodes, but not a lot of traffic from them. The PC count is now over 50,
most are running Workgroups, and there is 1 NetWare server, 1 remote mail
server, 1 internal mail server and 1 FAXserver (all PC based).

RECNXINTERVAL is at the default of 20. They have not adjusted this, so I
assume it is the same on all nodes.

I have some concerns about the performance of the DETTR, which is an Allied
Telesis 10BaseT-to-10Base2 repeater. The customer's wiring is all coax. We
replaced their DECrepeater 90Cs with the DB900EFs, so DETTRs and DECXMs were
supplied to connect the coax. The sniffer found CRC errors and runt packets on
several segments, even ones that were not excessively busy. On one segment, we
started shutting down the nodes one by one to isolate the source, but the
errors persisted. Segments measured off the DECXMs did not have these errors.

>Any STP topology changes on the E-LAN?

	What do you mean by this, Bill?

MCS did get the DEFPA working today on the 2100 server running OpenVMS 6.1,
so the LAVC traffic is now on the FDDI for this one node. The other 2100 will
be upgraded next week (MCS has to make sure all the console code and hardware
are at the proper revs) and 4 other nodes will be added shortly after (a mix
of VAX 3100s and Alpha 3000s). That should help reduce enet traffic.

-Larry


2268.5. by NETCAD::ANIL Thu May 11 1995 23:47
    If a large percentage of the traffic is broadcast or multicast, you can
    turn on "rate limiting" for the specific addresses in the switches to
    stop them from propagating; in any case the reason still needs to be
    figured out.  Note that the Sniffer tends to report "storms" falsely
    if its trigger for such detection is set low.  If a large percentage
    is error traffic, that would indicate a physical (MAU/repeater) problem.
    I would also check to make sure that no configuration rules, such
    as the limit on repeaters in series, are being violated.  I've seen these
    cause the kind of network slowdown you're describing.
    
    Anil
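
One way to sanity-check a reported "storm" is to look at what fraction of the
frames in a reasonable capture window are broadcast or multicast, rather than
trusting the analyzer's trigger. The Python sketch below uses placeholder
counts, and the 20% threshold is only a rule of thumb:

# Sanity-check a "broadcast storm" report: what share of the frames captured
# in the window were broadcast/multicast?  Counts and the 20% threshold are
# placeholders, not analyzer or HUBwatch settings.

def storm_check(total_frames, bcast_frames, mcast_frames, window_seconds):
    if total_frames == 0:
        return "no traffic captured"
    share = float(bcast_frames + mcast_frames) / total_frames
    rate = (bcast_frames + mcast_frames) / float(window_seconds)
    if share > 0.20:
        verdict = ("sustained broadcast/multicast load -- find the source "
                   "(rate limiting can contain it)")
    else:
        verdict = "probably a low analyzer trigger rather than a real storm"
    return ("broadcast+multicast = %.0f%% of frames, %.0f frames/s -> %s"
            % (100 * share, rate, verdict))

# e.g. a 30-second window on a segment with several LAVC nodes (made-up counts)
print(storm_check(total_frames=180000, bcast_frames=9000,
                  mcast_frames=55000, window_seconds=30))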

2268.6. by NPSS::WADE (Network Systems Support) Fri May 12 1995 13:06
    Off track for this conference, but RECNXINTERVAL = 20 seconds is
    the default for a CI cluster.  Increasing this to 60-90 (on all nodes)
    should stop the PEDRIVER errors while you fix the problem on the E-LAN.
    It is advised to leave it at 60-90 for a cluster that includes NI
    nodes.
    
    Bill
     

2268.7. "Cluster transitions when DC900MX removed." by TROFS::WEBSTER (NIS, London, Canada) Mon Aug 21 1995 20:53
	Back to one of the original questions regarding FDDI wrap time and
	cluster state transitions.

	Last week we had the privilege of removing a concentrator 900 from the
	ring to change a PMD. This unit had no SASs connected, as the nodes
	were just being installed (the old problem of DEFTA-UAs not being
	supported by VMS < version 6.2, so UTP PMDs were changed to MMF PMDs).

	The DC900MX was in the middle of the ring created in the backplane
	(see the diagram in note .0).
	Using HUBwatch, we pulled the B port off the channel. As soon as we
	did this, the cluster went into a state transition. Any node that was
	talking to the 2 FDDI nodes on the other DC900MX in the backplane was
	affected.

	Will raising the RECNXINTERVAL timer to the 60-90 second value, as
	mentioned in reply .6, resolve this problem?

	The customer was not too happy about this, and I would like to go back
	and say, "We told you to increase your counters and you didn't, so
	that's why there were transitions!".

	-Larry

2268.8. by NETCAD::DOODY (Michael Doody) Tue Aug 22 1995 13:29
    I think you are seeing a problem, not with FDDI wrap time but rather
    with the bridge ports going into pre-forwarding state. 
    
    When you removed the concentrator from the ring, the adjacent modules'
    ports were connected to each other to heal the ring. When this
    disconnect/reconnect happens, one or more of the bridge FDDI ports go
    into pre-forwarding state as they learn the new topology (Anil could
    give a better answer). During the 30-second pre-forwarding period, no
    packets are forwarded between FDDI and Ethernet.
    
    So clearly a RECNXINTERVAL of 20 will expire before the bridge ports
    come back.
    
    -Mike
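
Putting rough numbers on that: the cluster only rides out a wrap if
RECNXINTERVAL exceeds the wrap time plus the pre-forwarding delay. The Python
sketch below uses the 10 ms wrap and 30-second pre-forwarding figures quoted
earlier in this topic; they are illustrative, not measurements from this hub.

# Will the cluster ride out a ring wrap?  Compare the worst-case outage
# (FDDI wrap plus bridge pre-forwarding) against RECNXINTERVAL.
# Figures are the ones quoted in this topic, used here for illustration.

FDDI_WRAP_SECONDS = 0.010        # ~10 ms to beacon and wrap (reply .1)
PRE_FORWARDING_SECONDS = 30.0    # bridge FDDI port pre-forwarding (reply .8)

def survives_wrap(recnxinterval):
    outage = FDDI_WRAP_SECONDS + PRE_FORWARDING_SECONDS
    ok = recnxinterval > outage
    return ("RECNXINTERVAL=%2ds vs ~%.1fs outage -> %s"
            % (recnxinterval, outage,
               "virtual circuits survive" if ok
               else "VC closures / cluster state transition"))

for setting in (20, 60, 90):     # default, and the 60-90 s suggested in .6
    print(survives_wrap(setting))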