
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

2022.0. "kzpsa and mc errors at console & boot time" by SANITY::PCUMMINGS (The perfect democracy) Thu Apr 24 1997 23:41

    We're trying to bring up a 2 node 4100 TruCluster with local system &
    page disk and StorageWorks w/4 HSZs and several rz29s.  Redundant
    Memory Channel hubs.  DU is V3.2g, which we're tied to because of
    3rd party app software.
    
    mc_diag runs ok
    mc_cable runs ok - maps things but is not interruptible via Ctrl-C.
    Have to press reset to clear.
    
    show device and test commands at the console level produce errors 
    associated with kzpsa2 and kzpsa3 which are connected to HSZ50 disks
    in SW800 cab.  This is the case from both system consoles.
    
    Errors seen from "show device"
    
    kzpsa2 slot 4 bus 0 .....
    	bad SCSI status (28) received from PKD0.n.n.n.
    	"	"			"	"
    	"	"	"		"	"
    	4 entries I think
    kzpsa slot n bus n .....
    	bad SCSI status (28) received from PKD0.n.n.n.
        "       "                       "       "
        "       "       "               "       "
        4 entries I think
    
    On top of this one of the 4100's appears to boot up - we can login.
    things look basically okay.  The other 4100 though dies early in the
    boot process.  This seems to happen using /vmunix or /genvmunix, single
    user mode, etc.
    
    Messages on console;
    
    Created FRU table configuration binary log packet
    lvm0 completed
    lvm1 completed
    lvm2 completed
    Starting secondary CPU
    load of /etc/init failed, errno 2
    load program?
    
    	**** what is it asking for?
    
    Sometimes it gets past this point and appears to have memory channel 
    problems.....
    
    init prog: LOST CONNECTION WITH HUB (primary adapter)
    rmerror_int: error_count = 2 unit = 0 Err_reg Node = 0 
    panic cpu0 rmerror_int: fatal error and no alternate mc to failover
    
    	**** strange considering it see both MC units with mc_diag/mc_cable
    
    Ideas anyone?
    
    thanx!
    /Paul

T.R	Title	User	Personal Name	Date	Lines
2022.1	more info	BACHUS::DEVOS	Manu Devos NSIS Brussels 856-7539	`Fri Apr 25 1997 09:04`	10
	Hi Paul,

	Please give us more information on your hardware configuration. Did you
	change the SCSI ID of the controllers? Is the local disk really local, or
	did you try to place it on the shared bus? A drawing of the config is
	welcome ...

	Manu.
2022.2	check basic-dma-window-size	FOUNDR::STRICKER	Enterprise Systems Engineering, Salem NH	`Fri Apr 25 1997 10:56`	20
	Paul,

	> show device and test commands at the console level produce errors
	> associated with kzpsa2 and kzpsa3 which are connected to HSZ50 disks
	> in SW800 cab. This is the case from both system consoles.

	Did you check the HSZ50s status? Are the batteries in a GOOD state, meaning
	are they fully charged?

	> load of /etc/init failed, errno 2
	> load program?
	> **** what is it asking for?

	Did anyone change the sysconfigtab file? This looks like the same problem
	that I was seeing on 4100s with >1GB memory when I tried to bootup UNIX with
	basic-dma-window-size=512. The workaround was to change this value to either
	0 (zero) or 3072 (number of MB of memory).

	-Jerry
2022.3	Also slot restrictions	NNTPD::"cherkus@buff.zk3.dec.com"	Dave Cherkus	`Fri Apr 25 1997 17:01`	9
	Also search this notesfile using the keyword '4100' for a discussion on the
	proper slots to use for memory channel in 4100 for DU 3.2x.

	One piece of advice is: don't expect your system to perform reliably if the
	console is complaining about your storage devices. This should not be
	ignored.

	Dave
	[Posted by WWW Notes gateway]
2022.4	Could be this...	NNTPD::"mcdonald@decatl.alf.dec.com"	John McDonald	`Mon Apr 28 1997 19:15`	86
	Paul,

	This may or may not be your problem, but it's worth looking into:

	Author          : MARCI R POTTER
	User type       : DBA
	Location        : USTIMA
	Vaxmail address : CSC32::POTTER

	Copyright (c) Digital Equipment Corporation 1997. All rights reserved.

	digital (TM)   TIME DEPENDENT BLITZ

	BLITZ TITLE: Alpha Server 4100 - SCSI Bus and Related Errors

	DATE: April 24, 1997
	AUTHOR: Ted Gent
	TD #: 2274
	DTN: 223-6530
	ENET: POBOXA::GENT
	CROSS REFERENCE #'s: (PRISM/TIME/CLD#'s)
	DEPARTMENT: SBU Engineering
	INTENDED AUDIENCE: U.S/EUROPE/GIA
	PRIORITY LEVEL: 2 (1=TIME CRITICAL, 2=NON-TIME CRITICAL)
	=====================================================================

	Subject : Alpha Server 4100 - SCSI Bus and Related Errors

	AlphaServer 4100 - Errors on SCSI Busses

	1. Problem:
	   Test coverage on the B3040 'Horse' module has been found marginal in
	   identifying problems in the 'Scatter-Gather' mapping. The result is
	   that some modules have been shipped which may have latent defects.

	2. Susceptibility:
	   Scatter-Gather mode use varies depending on the adapter type. The
	   NCR810 based controllers - KZPAA and the On-Board SCSI controller for
	   the CD-ROM always use Scatter-Gather mode. Unix uses Scatter-Gather
	   mode for other SCSI controllers if the memory configuration exceeds
	   1 GByte.

	3. Symptoms:
	   Symptoms of the problem will vary, some of the problems seen include:
	   a) Excessive SCSI errors on NCR810 based devices
	   b) SCSI (CAM) errors reported during the system boot
	   c) System crashes (panics) with Bus Faults
	   d) Performance degradation accessing disk subsystems or CDROM
	   e) Occasional Power Up Self Test failures indicating an IOD failure
	   f) Failure to initialise graphics correctly
	   g) Moving all of the PCI adapters to one (of the) PCI bus makes the
	      system function properly.

	4. Fault Finding:
	   If you are involved in trouble shooting a system where you are
	   experiencing any of the symptoms listed above, the B3040 may be the
	   problem module.

	5. Prevention:
	   A screen is being implemented in Stage 1 and Stage 2 Manufacturing to
	   remove defective B3040 modules. The screen went into effect in April.
	   This means that any systems currently in the field may have this
	   problem.

	6. Severity:
	   Screening indicates that this problem may be seen on as high as 10% of
	   the population of B3040's

	tg 4/22/97

	John McDonald
	Atlanta CSC
	[Posted by WWW Notes gateway]
2022.5	fixed- Thanx Jerry!	SANITY::PCUMMINGS	The perfect democracy	`Tue Apr 29 1997 13:10`	16
	The no-boot problem was fixed by booting the DU O/S CD in sys mgmt mode,
	bringing us to the Unix shell prompt, mounting the SWXCR HW raid root disk
	as /mnt and using 'ed' to edit the /etc/sysconfigtab file.

	The problem was that reinstalling the TCR*100 subsets modifies the
	sysconfigtab file with a bogus basic-dma-window-size=512. Once you reboot
	with this value, you're hozed. Using 'ed' we reset the value to 3072; the
	init and Memory Channel Hub errors went away and the system booted fine.

	The CAM & HSZ50 errors during reboot remained until we powered down the
	SW800 disk cab (which has HSZs too). Strange thing is the other cluster
	member didn't get these CAM & HSZ errors.

	thanx
	/paul
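
For anyone who wants to check a sysconfigtab copy for this setting before rebooting, here is a minimal Python sketch. It is not a DEC tool. It assumes the file uses the usual stanza layout (a "subsystem:" line followed by "attribute = value" lines), and the stanza name example_subsys, the sample text, and the helper names find_attr and is_suspect are all made up for illustration. The "suspect" rule only encodes what reply .2 reports: on a machine with more than 1 GB of memory, 0 or the memory size in MB worked and 512 did not.

```python
import re

ATTR = "basic-dma-window-size"

# Hypothetical sample; the real stanza name for this attribute is not given in the thread.
SAMPLE = """\
example_subsys:
    basic-dma-window-size = 512
vm:
    ubc-maxpercent = 100
"""


def find_attr(text, attr=ATTR):
    """Return (stanza, value) for every line in text that sets attr."""
    hits = []
    stanza = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        header = re.match(r"^([A-Za-z0-9_-]+):$", line)
        if header:
            stanza = header.group(1)
            continue
        setting = re.match(r"^([A-Za-z0-9_-]+)\s*=\s*(\S+)$", line)
        if setting and setting.group(1) == attr:
            hits.append((stanza, setting.group(2)))
    return hits


def is_suspect(value, memory_mb):
    """Assumed rule from reply .2: with >1 GB of memory, only 0 or memory_mb are known good."""
    size = int(value)
    return memory_mb > 1024 and size not in (0, memory_mb)


if __name__ == "__main__":
    for stanza, value in find_attr(SAMPLE):
        flag = "SUSPECT" if is_suspect(value, memory_mb=3072) else "ok"
        print(f"{stanza}: {ATTR} = {value} ({flag})")
```

To run it against the mounted root disk from the fix above, read /mnt/etc/sysconfigtab into a string and pass it to find_attr in place of SAMPLE.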