
Conference netcad::hub_mgnt

Title: DEChub/HUBwatch/PROBEwatch CONFERENCE
Notice: Firmware -2, Doc -3, Power -4, HW kits -5, firm load -6&7
Moderator: NETCAD::COLELLADT
Created: Wed Nov 13 1991
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4455
Total number of notes: 16761

1114.0. "900TM "module not responding"" by UTRTSC::DENIJS () Tue Jun 14 1994 13:40

Hello,

We have a strange problem with a DEChub 900 with 6 repeaters (900 TM) in it.
Since the installation, one of the repeaters has been resetting every few minutes.
First we see a message on the terminal connected to the hub:

"	status: module not responding	"

Then the repeater is reset. This is seen on all repeaters.
During troubleshooting we found that this happens less frequently when the
network is quiet and more frequently when there is more load.
We hooked up a LAN analyzer and found nothing to cause this behaviour, except
that we could see all repeaters sending out a broadcast every minute. Since I
assumed these were bootp requests for fetching the IP address, we gave all
repeaters an IP address by hand. That did indeed stop the bootp requests and,
as a side effect, we no longer saw the resets. As soon as we removed the IP
addresses from the repeaters, the resets were back.
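
For readers unfamiliar with BOOTP: the once-a-minute broadcasts described above are consistent with a device that has no IP address sending BOOTREQUEST packets (RFC 951), normally from UDP port 68 to UDP port 67 on the IP broadcast address. The following is a minimal Python sketch of that fixed 300-byte packet layout. The MAC address and transaction id are made-up values, `build_bootp_request` is a hypothetical name, and nothing here is taken from the repeater firmware.

```python
import struct

# Fixed 300-byte BOOTP packet layout from RFC 951:
# op, htype, hlen, hops, xid, secs, unused,
# ciaddr, yiaddr, siaddr, giaddr, chaddr, sname, file, vend
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s64s"
BOOTREQUEST = 1      # op code for a client request
HTYPE_ETHERNET = 1   # hardware type for 10 Mb/s Ethernet


def build_bootp_request(mac: bytes, xid: int) -> bytes:
    """Build a BOOTREQUEST for a client that does not yet know its IP address."""
    zero_ip = bytes(4)  # 0.0.0.0: the client has no address to report
    return struct.pack(
        BOOTP_FORMAT,
        BOOTREQUEST, HTYPE_ETHERNET, len(mac), 0,   # op, htype, hlen, hops
        xid, 0, 0,                                  # xid, secs, unused
        zero_ip, zero_ip, zero_ip, zero_ip,         # ciaddr, yiaddr, siaddr, giaddr
        mac,                                        # chaddr (zero-padded to 16 bytes)
        b"", b"", b"",                              # sname, file, vend (all zero)
    )


# Illustrative values only (assumption): a made-up MAC and transaction id.
example_mac = bytes.fromhex("08002b000001")
packet = build_bootp_request(example_mac, xid=0x12345678)
print(len(packet))
```

Because the client has no address yet, any reply has to be matched on the hardware address and transaction id, which is why a device in this state must listen to more than frames addressed to its own IP.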

Questions:

1) Has anybody seen this behaviour before?

2) Are there any known problems in this area?

3) Is there any way we can stop the repeaters from sending bootp requests
   (other than giving the repeaters an IP address)?

Best regards,
Peter de Nijs.
1114.1. "Need more info" by LEVERS::PAGLIARO (Rich Pagliaro, Hub Products Group) Tue Jun 14 1994 15:38
    Before I can answer this question I need some more information. Can you
    tell me what version of firmware the DECrepeater 900TMs are running?
    It would also help a great deal if you dumped the error log (via the
    setup menu) of the repeaters which reset and sent me the results.
    
    Thanks,
    
    Rich
1114.2. "Here is some info" by UTRTSC::DENIJS () Wed Jun 15 1994 07:41
    I have the versions for you so here goes:
    
    Hub manager:	HW rev. F
    			ROM V1.1.6
    			Firmware V2.2.1
    
    Repeaters 900 TM	slot	HW    	RO	SW
    			1	V2	V1	V1.0G
    			2	V2	V1	V1.0G
    			3	V1	V1	V1.0G
    			4	V2	V1	V1.0G
    			5	V2	V1	V1.0G
    			6	V1	V1	V1.0F	
    
    
    I have no errorlog info but will try to get it ASAP.
    
    Thanks for your help,
    
    Peter.
1114.3. "errorlog info" by UTRTSC::DENIJS () Thu Jun 16 1994 13:35
Below is the errorlog from the repeater, as requested. You see the entries
of one repeater but all repeaters have the same entries, only the entry
numbers differ.

I have one more question. Since the repeaters are SNMP manageable, they must
be able to look at MAC addresses and take frames from the Ethernet. Will they
only look for their own addresses? In other words, how will the repeaters
react to broadcasts?


==============================================================================

                           Enter selection :  9

DECrepeater 900TM
=============================================================================

                                  ERROR LOG



        Entry        = 45
        Time Stamp   = 0 0
        Reset Count  = 0
        Fatal error: Line 611, File enet.c


Dump another entry y/[n]? y

        Entry        = 44
        Time Stamp   = 0 0
        Reset Count  = 0
        Fatal error: Line 611, File enet.c


Dump another entry y/[n]? y

        Entry        = 43
        Time Stamp   = 0 0
        Reset Count  = 0
        Fatal error: Line 611, File enet.c


Dump another entry y/[n]? y

        Entry        = 42
        Time Stamp   = 0 0
        Reset Count  = 0
        Fatal error: Line 611, File enet.c


Dump another entry y/[n]? y

Regards, 

Peter.
1114.4. "" by LEVERS::PAGLIARO (Rich Pagliaro, Hub Products Group) Mon Jun 20 1994 18:08
    Peter,
    
    Thanks for the extra info. The behaviour you are experiencing appears
    to be somewhat similar to something we've seen here in our lab. In our 
    testing environment the behavior occurs very infrequently and we have
    had difficulty determining the root cause of the problem.
    
    You mentioned that your problem occurs less frequently when there is
    less traffic. Do you know approximately what your traffic level was
    when you started to experience the problem? Do you know what percentage
    of that traffic was multicast/broadcast?
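
The two figures asked for here can be derived from an analyzer capture. The following is a rough Python sketch, assuming a 10 Mb/s Ethernet segment and a capture already reduced to (destination MAC, frame length in bytes) pairs. It ignores preamble and inter-frame gap, so it slightly understates the real load. The names `summarize_capture` and `LINK_BITS_PER_SECOND` are illustrative only.

```python
LINK_BITS_PER_SECOND = 10_000_000  # assumption: 10 Mb/s Ethernet


def is_group_address(dst_mac: bytes) -> bool:
    """Multicast and broadcast addresses have the low bit of the first octet set."""
    return bool(dst_mac[0] & 0x01)


def summarize_capture(frames, interval_seconds):
    """Return (load percentage, broadcast/multicast share of bytes in percent).

    frames: iterable of (destination MAC as bytes, frame length in bytes)
    interval_seconds: duration of the capture
    """
    total_bytes = 0
    group_bytes = 0
    for dst_mac, length in frames:
        total_bytes += length
        if is_group_address(dst_mac):
            group_bytes += length
    load_percent = total_bytes * 8 * 100 / (LINK_BITS_PER_SECOND * interval_seconds)
    group_percent = group_bytes * 100 / total_bytes if total_bytes else 0.0
    return load_percent, group_percent


# Made-up one-second capture: 200 unicast and 50 broadcast frames of 1000 bytes.
unicast = bytes.fromhex("08002b000001")
broadcast = bytes.fromhex("ffffffffffff")
sample = [(unicast, 1000)] * 200 + [(broadcast, 1000)] * 50
load, group_share = summarize_capture(sample, interval_seconds=1)
print(f"load {load:.1f}%, broadcast/multicast share {group_share:.1f}%")
```

Counting by bytes gives the share of bandwidth; counting frames instead would show how many receive buffers the group traffic consumes, which may matter more for a small management agent.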
    
    To answer some of your other questions:
    
    There is no way to stop the repeater from transmitting bootp requests
    other than to assign the repeater an IP address.
    
    The repeaters "look for" frames with their own unicast destination
    addresses as well as frames with multicast and broadcast destination
    addresses. That is, a repeater will receive frames with
    multicast/broadcast destination addresses and process them. The actual
    processing, of course, depends upon the received message.
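
The receive rule described above can be sketched in a few lines of Python. This is an illustration of standard Ethernet address classification, not the repeater's actual filter. The function names and the example MAC address are hypothetical.

```python
BROADCAST_MAC = bytes([0xFF] * 6)


def classify_destination(dst_mac: bytes, own_mac: bytes) -> str:
    """Classify a destination MAC from the point of view of one station."""
    if dst_mac == BROADCAST_MAC:
        return "broadcast"
    if dst_mac[0] & 0x01:  # group bit set: multicast
        return "multicast"
    if dst_mac == own_mac:
        return "unicast-to-me"
    return "not-for-me"


def agent_receives(dst_mac: bytes, own_mac: bytes) -> bool:
    """True if the management agent would take the frame for processing."""
    return classify_destination(dst_mac, own_mac) != "not-for-me"


# Illustrative address only (assumption).
repeater_mac = bytes.fromhex("08002b000001")
for dst in ("ffffffffffff", "01005e000001", "08002b000001", "08002b000002"):
    print(dst, classify_destination(bytes.fromhex(dst), repeater_mac))
```

The practical consequence is that every broadcast and multicast frame on the segment reaches the agent's receive path, whether or not it is meant for the repeater.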
    
    
    Regards,
    
    Rich
1114.5. "Some more info...." by UTRTSC::DENIJS () Tue Jun 21 1994 11:18
Rich,

Some more background info:
**************************

DECHUB 900 with 6 decrepeater 900 TM

  Slot:	 1	 2	 3	 4	 5	 6
	+-+	+-+	+-+	+-+	+-+	+-+
	| |  	| |	| |	| |	| |	| |
    -----*-------*-------*-------*-------*----------- Thin bus
	| |	| |	| |	| |	| |	| |
	| |	| |	| |	| |	| | 	| |
	| |	| |	| |	| |	| |	| |
	+-+	+-+	+-+	+-+	+-+	+-+
                 |
		 |      Repeater 1-5 are connected to the thinwire bus,
  +-------+      |      and not to any flex bus.
  |DECNIS |      |      They have about 70 PC's connected (PCSA) to 3
  |       |      |      servers (4000), 7 VXT's.
  |TCP/IP +------+      
  | LAT   |		Repeater 6 is not connected to any internal bus,
  |       |             it just sits in the backplane for power and is 
  +---+---+             part of a different network.
      |
      / To outside world ( 128Kb ) 

If we set up the terminal, connected to the hub, for displaying events we 
see intermittent messages like " status: module not responding " every minute
or so on repeaters 1 to 5, 6 seems to be ok. This message is often followed by
a reset of the repeater (probably done by the hub manager because the module
is not responding). When we accidentally did an init of the hub, causing
repeater 6 to be connected to the thin bus as well, we also saw the messages
and resets for slot 6.

During troubleshooting we found two ways to get rid of the problems:

	1) Disconnect the DECNIS from the hub or disconnect the link to
	   the outside world on the DECNIS.

	2) Give the repeaters an ip address by hand.

1) Since we were wondering why the problems stopped when we disconnected the
  link on the DECNIS to the outside world, we monitored the packets going
  to and coming from the MAC addresses of the repeaters. The only thing we saw
  was broadcasts from all repeaters (bootp asking for an IP address?).
  The repeaters stopped sending those broadcasts when an IP address was
  supplied by hand. Nothing was sent back in response to these broadcasts.
  The only thing we did not monitor was whether there were broadcasts/multicasts
  in response to the bootp requests, hence my question about how the repeaters
  would react to broadcasts/multicasts.
  The only other reason I can think of why disconnecting the DECNIS could stop
  the problems is that it takes away part of the Ethernet load.
  The total Ethernet load during the problems was about 20-25%; we have not
  measured the broadcasts/multicasts.

2) As a workaround we have now given the repeaters an IP address by hand.
  


  At the moment i see two possible scenarios:

	a) The bootp requests from the repeaters initiate some sort of
	   broadcast storm, keeping the repeaters too busy to respond to
	   the hub manager.

        b) The repeaters send out a bootp request, listen for the response,
           and in combination with an Ethernet load >25% are too busy to
           respond to the hub manager.

	Any other ideas ?

I think it would be possible for us to go back to the old situation and do
some measurements. However, this would take place during the evenings and I am
not sure we will see the problems then. But we can try.

Best regards,

Peter.
1114.6. "" by NACAD2::SLAWRENCE () Tue Jun 21 1994 15:17
    The reset is not initiated by the hub in response to the 'module not
    responding'; what's happening is that the module has either hung or
    crashed and the hub notices it before the self test is started on the
    module.  The reset is the cause and the 'not responding' is an effect,
    not the other way around.
    
    You can prevent your slot 6 repeater from being added to the backplane
    on a reset by creating an Ethernet backplane net (IMB) and connecting
    it to that by itself.
1114.7. "ideas for measurement with analyzer ???" by UTRTSC::DENIJS () Mon Jun 27 1994 09:54
    OK, so the repeaters hang or crash. We are planning to go back onsite
    to do some measurements with the LAN analyzer. The first thing we will
    look for is broadcast storms.
    
    Any other ideas on what we can look for?
    
    Peter.
1114.8. "We have our best people working on it..." by LEVERS::PAGLIARO (Rich Pagliaro, Hub Products Group) Mon Jun 27 1994 15:51
    Peter,
    
    We are still investigating this problem here on our end. We had another
    failure occur last week and it correlated to what you have observed
    regarding IP addresses.  That is, repeaters assigned IP addresses did
    not crash while repeaters without IP addresses did crash. Because of
    this we suspect something is broken in the repeater's bootP protocol
    processing.
    
    The people actually doing the investigation here told me you might want
    to look for bogus bootP responses or bogus ICMP messages.
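
One way to go hunting for such frames in a capture is to sort each Ethernet frame by IP protocol and UDP destination port. The Python sketch below is a minimal illustration, assuming Ethernet II framing, unfragmented IPv4, and the well-known UDP ports 67/68 (BOOTP) and 161 (SNMP). The function names are hypothetical, `make_udp_frame` exists only to build test input, and the sketch does not decide whether a reply is actually bogus.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
PROTO_ICMP = 1
PROTO_UDP = 17
# Well-known UDP destination ports of interest.
UDP_PORT_NAMES = {67: "bootp-request", 68: "bootp-reply", 161: "snmp"}


def classify_frame(frame: bytes) -> str:
    """Label an Ethernet II frame by its IP protocol / UDP destination port."""
    if len(frame) < 14:
        return "truncated"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4:
        return "non-ip"
    ip = frame[14:]
    if len(ip) < 20:
        return "truncated"
    header_len = (ip[0] & 0x0F) * 4  # IHL field, in 32-bit words
    protocol = ip[9]
    if protocol == PROTO_ICMP:
        return "icmp"
    if protocol != PROTO_UDP:
        return "other-ip"
    if header_len < 20 or len(ip) < header_len + 4:
        return "truncated"
    (dst_port,) = struct.unpack("!H", ip[header_len + 2:header_len + 4])
    return UDP_PORT_NAMES.get(dst_port, "other-udp")


def make_udp_frame(dst_mac: bytes, dst_port: int) -> bytes:
    """Test helper (hypothetical): minimal frame with only the parsed fields set."""
    ethernet = dst_mac + bytes(6) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = bytes([0x45]) + bytes(8) + bytes([PROTO_UDP]) + bytes(10)
    udp = struct.pack("!HHHH", 12345, dst_port, 8, 0)
    return ethernet + ip + udp


broadcast = bytes([0xFF] * 6)
for port in (68, 161, 9999):
    print(port, classify_frame(make_udp_frame(broadcast, port)))
```

Tallying these labels separately for broadcast and unicast destinations would show quickly whether anything other than BOOTP replies is reaching repeaters that have no IP address.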
    
    I'll post results here when we find something definitive.
    
    Regards,
    
    Rich 
1114.9. "Thanks for the update" by UTRTSC::DENIJS () Tue Jun 28 1994 12:00
    Hi Rich,
    
    Thanks for the update. As of next week I will be on vacation, but one of
    our other engineers (Ted Paehlig) will set up a session on site to do
    some measurements. If he finds something interesting he will post it
    here.
    
    Best regards,
    
    Peter.
    
1114.10. "Update...Problem solved!" by NACAD::PAGLIARO (Rich Pagliaro, Hub Products Group) Thu Jul 07 1994 16:56
    Peter,
    
    Good News! The bug causing the problem you witnessed has been found and
    fixed. Apparently there is a bug in the repeater's UDP layer
    processing. The UDP layer is designed to accept bootp responses when
    the repeater has not been assigned an IP address. The problem occurs
    when the repeater receives other types of IP frames when the repeater
    is not assigned an IP address.  The UDP layer will not process the
    frame but it will also not relinquish the frame's buffer. Hence a
    memory leak exists. Eventually the repeater runs out of buffers and
    crashes.
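
The failure pattern described above (a frame is dropped but its buffer is never returned) can be shown with a toy Python simulation. This is only a sketch of the general bug class, not the repeater's code: the pool size, the class and function names, and the `release_on_drop` switch are all assumptions made for illustration.

```python
class BufferPool:
    """Fixed pool of receive buffers, tracked only by a free count."""

    def __init__(self, size: int):
        self.free = size

    def allocate(self) -> None:
        if self.free == 0:
            raise MemoryError("out of receive buffers")
        self.free -= 1

    def release(self) -> None:
        self.free += 1


def udp_receive(pool, has_ip_address, is_bootp_reply, release_on_drop):
    """Take a buffer for an arriving frame; return True if the frame was processed."""
    pool.allocate()
    if has_ip_address or is_bootp_reply:
        # Normal path: the frame is handled and its buffer is returned.
        pool.release()
        return True
    # Drop path: no IP address assigned and the frame is not a bootp reply.
    if release_on_drop:
        pool.release()  # corrected behaviour: give the buffer back
    return False        # leaky behaviour: the buffer is never returned


def frames_until_exhaustion(pool_size, release_on_drop, limit=1000):
    """Feed non-bootp frames to an agent without an IP address.

    Returns the number of the frame that finds the pool empty, or None if the
    pool survives `limit` frames.
    """
    pool = BufferPool(pool_size)
    for count in range(1, limit + 1):
        try:
            udp_receive(pool, has_ip_address=False, is_bootp_reply=False,
                        release_on_drop=release_on_drop)
        except MemoryError:
            return count
    return None


print(frames_until_exhaustion(pool_size=8, release_on_drop=False))
print(frames_until_exhaustion(pool_size=8, release_on_drop=True))
```

In this model, assigning an IP address avoids the drop path entirely, which matches the observation that repeaters with addresses did not crash. It also suggests why heavier traffic made the resets more frequent: more stray IP frames per minute empty the pool sooner.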

    We experienced this problem here in our lab when some station on the
    network decided to send out SNMP requests to the broadcast address.

    For what it's worth, this UDP code is shared by all of the repeaters as
    well as the DECconcentrator 900 and DECbridge 900. The repeaters will
    crash once they run out of buffers. I personally do not know how this
    memory leak affects the behavior of the concentrator and bridge.

    The fix to this bug will be available in the soon to be released
    "60-day upgrade".
                                
    Regards,
    
    Rich
1114.11. "Great News!" by IJSAPL::PAEHLIG (Ted Paehlig _ Amsterdam) Fri Jul 08 1994 10:15
    We won't have to bother our customer with additional measurement visits
    and can tell him the good news instead!

    Please keep us posted on the availability of the fix(es).

    Thanks for your prompt attention to this matter.

    Peter (proxy) & Ted
    
1114.12. "effect on concentrator & bridge" by LEVERS::SLAWRENCE () Mon Jul 11 1994 14:50
    
    This problem causes the DECconcentrator 900MX to 'go silent'; it
    doesn't crash, but it stops doing any management communications
    (including the FDDI management frames used to create a ring map).  It
    does continue to pass FDDI frames normally, but won't respond to
    anything for its own MAC address or as an IP server.
    
    It too will have the fix in the 60 day upgrade.
    
    I don't believe that this problem affects the DECbridge 900MX because
    of the way the low-level bridging code passes frames up to the
    management stack, but in any event it too will have the fix.
    
1114.13. "When will the 60 day upgrade be available" by ZUR01::SCHNEIDERR () Thu Jul 21 1994 14:40
When will the 60-day upgrade be available?


Roland
1114.14. "Early August" by NACAD2::PAGLIARO (Rich Pagliaro, Hub Products Group) Thu Jul 21 1994 14:44
    I believe some time during the first week of August.
    
    -Rich
1114.15. "Customer kits available later" by NAC::FORREST () Mon Jul 25 1994 18:20
	To clarify, Rich is talking about online availability. It will be 
	available to customers with HUBwatch V3.1 when V3.1 ships, hopefully 
	by the end of September.

	You may run into problems if you upgrade only one module and
	not the MAM or HUBwatch.
1114.16. "Can I upgrade" by ZUR01::SCHNEIDERR () Tue Aug 02 1994 09:08
We have different problems out in the field and we are waiting for this upgrade.

If the upgrade is available (this week, isn't it?), can we upgrade the MAM and
the modules and use HUBwatch V3.0? Or do we really have to wait until HUBwatch
V3.1?


Roland
1114.17. "V3.0 is ok in the short run" by NACAD2::HAROKOPUS () Wed Aug 03 1994 16:22
    Although officially you need HUBwatch V3.1 with MAM V3.1, I don't
    anticipate any problems using V3.0 until V3.1 ships.
    
    However, there are some new modules shipping soon that are not
    supported by V3.0, and V3.1 has many bug fixes, so you will want
    to upgrade to V3.1 as soon as it is available.
    
    Regards,
    
    Bob