
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

2022.0. "kzpsa and mc errors at console & boot time" by SANITY::PCUMMINGS (The perfect democracy) Thu Apr 24 1997 23:41

    We're trying to bring up a 2-node 4100 TruCluster with local system &
    page disk and StorageWorks w/4 HSZs and several RZ29s, plus redundant
    Memory Channel hubs.  DU is V3.2g, which we're tied to because of
    3rd-party app software.
    
    mc_diag runs ok
    mc_cable runs ok - maps things but is not interruptible via Ctrl-C.
    Have to press reset to clear.
    
    show device and test commands at the console level produce errors 
    associated with kzpsa2 and kzpsa3 which are connected to HSZ50 disks
    in SW800 cab.  This is the case from both system consoles.
    
    Errors seen from "show device"
    
    kzpsa2 slot 4 bus 0 .....
    	bad SCSI status (28) received from PKD0.n.n.n.
    	"	"			"	"
    	"	"	"		"	"
    	4 entries I think
    kzpsa slot n bus n .....
    	bad SCSI status (28) received from PKD0.n.n.n.
        "       "                       "       "
        "       "       "               "       "
        4 entries I think
    
    On top of this, one of the 4100s appears to boot up: we can log in and
    things look basically okay.  The other 4100, though, dies early in the
    boot process.  This seems to happen using /vmunix or /genvmunix, single
    user mode, etc.
    
    Messages on console;
    
    Created FRU table configuration binary log packet
    lvm0 completed
    lvm1 completed
    lvm2 completed
    Starting secondary CPU
    load of /etc/init failed, errno 2
    load program?
    
    	**** what is it asking for?
    
    Sometimes it gets past this point and appears to have Memory Channel
    problems.....
    
    init prog: LOST CONNECTION WITH HUB (primary adapter)
    rmerror_int: error_count = 2 unit = 0 Err_reg Node = 0 
    panic cpu0 rmerror_int: fatal error and no alternate mc to failover
    
    	**** strange considering it sees both MC units with mc_diag/mc_cable
    
    Ideas anyone?
    
    thanx!
    /Paul
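On the "errno 2" question above: on Unix systems errno 2 is conventionally ENOENT ("no such file or directory"), so the console is saying it could not load /etc/init. The number alone does not say why the load failed. A minimal sketch for decoding such a number, assuming a Python interpreter on some other Unix host (not the 4100 console); describe_errno is a made-up helper name:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno, e.g. the 2 in 'errno 2'."""
    # errno.errorcode maps numbers to symbolic names on the host platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Decode the value reported by "load of /etc/init failed, errno 2".
print(describe_errno(2))
```

The mapping is taken from whatever host runs the script, so it is only a hint about what the console meant.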
    
    
T.R  Title  User  Personal Name  Date  Lines
2022.1. "more info" by BACHUS::DEVOS (Manu Devos NSIS Brussels 856-7539) Fri Apr 25 1997 09:04 (10 lines)
    Hi Paul,
    
    Please, give us more information on your hardware configuration.
    
    Did you change the SCSI ID of the controllers ? Is the local disk
    really local or did you try to place it on the shared bus ????
    
    A drawing of the config is welcome ...
    
    Manu.
2022.2. "check basic-dma-window-size" by FOUNDR::STRICKER (Enterprise Systems Engineering, Salem NH) Fri Apr 25 1997 10:56 (20 lines)
Paul,
    
>    show device and test commands at the console level produce errors 
>    associated with kzpsa2 and kzpsa3 which are connected to HSZ50 disks
>    in SW800 cab.  This is the case from both system consoles.

   Did you check the HSZ50s' status? Are the batteries in a GOOD state,
meaning are they fully charged?
    
>    load of /etc/init failed, errno 2
>    load program?
>   	**** what is it asking for?

   Did anyone change the sysconfigtab file? This looks like the same problem
that I was seeing on 4100s with >1GB memory when I tried to bootup UNIX
with basic-dma-window-size=512. The workaround was to change this value to
either 0 (zero) or 3072 (number of MB of memory).
    
-Jerry
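A rough sketch of the check described above, for anyone who wants to scan a copy of the file before rebooting. It assumes the attribute appears as a plain "name=value" line, as quoted in this reply. The stanza name "example_subsys" and both function names are made up for illustration, and the "0 or memory size in MB" test encodes only the workaround reported here, not a documented rule.

```python
import re
from typing import List

ATTR = "basic-dma-window-size"


def find_attribute(text: str, name: str = ATTR) -> List[int]:
    """Return every integer value assigned to `name` in sysconfigtab-style text."""
    pattern = re.compile(
        rf"^\s*{re.escape(name)}\s*=\s*(\d+)\s*$", re.MULTILINE
    )
    return [int(m.group(1)) for m in pattern.finditer(text)]


def matches_workaround(value: int, memory_mb: int) -> bool:
    """True if the value is 0 or equals the memory size in MB (the reported workaround)."""
    return value == 0 or value == memory_mb


# Hypothetical file contents; the stanza name is an assumption.
sample = """\
example_subsys:
    basic-dma-window-size=512
"""

for value in find_attribute(sample):
    print(value, matches_workaround(value, memory_mb=3072))
```

Run against the real file contents, anything that prints False for this attribute would be worth a second look on a 4100 with more than 1 GB of memory.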
    
2022.3. "Also slot restrictions" by NNTPD::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Fri Apr 25 1997 17:01 (9 lines)
Also search this notesfile using the keyword '4100' for a discussion
on the proper slots to use for memory channel in 4100 for DU 3.2x.

One piece of advice is: don't expect your system to perform reliably
if the console is complaining about your storage devices.  This
should not be ignored.  

Dave
[Posted by WWW Notes gateway]
2022.4. "Could be this..." by NNTPD::"mcdonald@decatl.alf.dec.com" (John McDonald) Mon Apr 28 1997 19:15 (86 lines)
Paul,

This may or may not be your problem, but it's worth looking into:

Author                    : MARCI R POTTER
 User type                 : DBA 
 Location                  : USTIMA
 Vaxmail address           : CSC32::POTTER       
 
 Copyright (c) Digital Equipment Corporation 1997. All rights reserved.
 
 +---------------------------+TM
 |   |   |   |   |   |   |   |
 | d | i | g | i | t | a | l |      TIME   DEPENDENT   BLITZ
 |   |   |   |   |   |   |   |      
 +---------------------------+
 
 
    
       BLITZ TITLE: Alpha Server 4100 - SCSI Bus and Related Errors
  
                                                 DATE: April 24, 1997
       AUTHOR:Ted Gent                          TD #: 2274
       DTN:223-6530         
       ENET:POBOXA::GENT                        CROSS REFERENCE #'s:
       DEPARTMENT:SBU Engineering               (PRISM/TIME/CLD#'s) 
                                                 
          
                                                         
       INTENDED AUDIENCE: U.S/EUROPE/GIA                PRIORITY LEVEL: 2
                          
                                                  (1=TIME CRITICAL,
                                                   2=NON-TIME CRITICAL)
       =====================================================================
 
 Subject : Alpha Server 4100 - SCSI Bus and Related Errors
           AlphaServer 4100 - Errors on SCSI Busses
 
 1. Problem:
 
 Test coverage  on  the  B3040  'Horse'  module  has  been  found marginal in
 identifying  problems  in  the 'Scatter-Gather' mapping.  The result is that
 some modules have been shipped which may have latent defects.
 
 2. Susceptibility:
 
 Scatter-Gather mode  use  varies  depending on the adapter type.  The NCR810
 based  controllers  -  KZPAA and the On-Board SCSI controller for the CD-ROM
 always  use  Scatter-Gather  mode.   Unix uses Scatter-Gather mode for other
 SCSI controllers if the memory configuration exceeds 1 GByte.
 
 3. Symptoms:
 
 Symptoms of the problem will vary, some of the problems seen include:
 
 a) Excessive SCSI errors on NCR810 based devices
 b) SCSI (CAM) errors reported during the system boot
 c) System crashes (panics) with Bus Faults
 d) Performance degradation accessing disk subsystems or CDROM
 e) Occasional Power Up Self Test failures indicating an IOD failure
 f) Failure to initialise graphics correctly
 g) Moving all of the PCI adapters to one (of the) PCI bus makes the system
    function properly.
 
 4. Fault Finding:
 
 If you  are involved in trouble shooting a system where you are experiencing
 any  of  the symptoms listed above, the B3040 may be the problem module. 
 
 5. Prevention:
 
 A screen  is  being  implemented  in  Stage  1  and Stage 2 Manufacturing to
 remove defective B3040 modules.  The screen went into effect in April.  This
 means that any systems currently in the field may have this problem.
 
 6. Severity:
 
 Screening indicates  that  this problem may be seen on as many as 10% of the
 population of B3040s.
 
 tg 4/22/97
 
 John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]
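The "Susceptibility" paragraph of the Blitz boils down to a two-part rule, sketched below. This is only a restatement of the Blitz text as code: the adapter labels and the function name are illustrative, and "exceeds 1 GByte" is assumed to mean more than 1024 MB.

```python
# Adapters the Blitz describes as NCR810-based (labels are illustrative).
NCR810_BASED = {"KZPAA", "onboard-cdrom-scsi"}


def uses_scatter_gather(adapter: str, memory_mb: int) -> bool:
    """Restate the Blitz rule for when Scatter-Gather mode is in use under Unix."""
    if adapter in NCR810_BASED:
        # NCR810-based controllers always use Scatter-Gather mode.
        return True
    # Other SCSI controllers use it only when memory exceeds 1 GByte.
    return memory_mb > 1024


print(uses_scatter_gather("KZPAA", 512))
print(uses_scatter_gather("KZPSA", 3072))
```

By this reading, a 3072 MB system with KZPSA adapters, like the one in this topic, falls into the Scatter-Gather case.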
2022.5. "fixed- Thanx Jerry!" by SANITY::PCUMMINGS (The perfect democracy) Tue Apr 29 1997 13:10 (16 lines)
    The no-boot problem was fixed by booting the DU O/S CD in sys mgmt mode,
    which brings you to the Unix shell prompt, mounting the SWXCR HW RAID
    root disk as /mnt, and using 'ed' to edit the /etc/sysconfigtab file.
    The problem was that reinstalling the TCR*100 subsets modifies the
    sysconfigtab file with a bogus basic-dma-window-size=512.  Once you
    reboot with this value, you're hosed.  Using 'ed' we reset the value to
    3072; the init and Memory Channel hub errors went away and the system
    booted fine.
    
    The CAM & HSZ50 errors during reboot remained until we powered down the
    SW800 disk cab (which has HSZs too).  Strange thing is, the other
    cluster member didn't get these CAM & HSZ errors.
    
    thanx
    /paul
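The edit itself was done by hand with 'ed' on the mounted root disk. For reference, the same substitution can be sketched in Python on the file's text. This works on a string only and does not touch any file; the stanza name "example_subsys" and the function name are made up for illustration.

```python
import re


def set_attribute(text: str, name: str, value: int) -> str:
    """Replace the integer assigned to `name` in sysconfigtab-style text."""
    pattern = re.compile(
        rf"^(\s*{re.escape(name)}\s*=\s*)\d+", re.MULTILINE
    )
    # Keep the original "name=" prefix (and its indentation), swap the number.
    return pattern.sub(lambda m: f"{m.group(1)}{value}", text)


# Hypothetical file contents; only the attribute line comes from this topic.
before = "example_subsys:\n    basic-dma-window-size=512\n"
after = set_attribute(before, "basic-dma-window-size", 3072)
print(after)
```

Here 3072 is this system's memory size in MB, per the earlier reply; another system would need its own value (or 0).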