Conference mvblab::sable

Title:SABLE SYSTEM PUBLIC DISCUSSION
Moderator:COSMIC::PETERSON
Created:Mon Jan 11 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2614
Total number of notes:10244

2600.0. "URGENT - Unexplained crashes on AlphaServer 2100 5/250 running OpenVMS V6.2-1H3" by ROBSON::WARNE () Mon May 19 1997 10:30

Customer has two AlphaServer 2100 5/250s running OpenVMS V6.2-1H3 in a cluster, connected to six SW300 cabinets
with HSZ40s, via KZPSA controllers (three in each Alpha). Each system also has a KZESC RAID controller.

A couple of weeks ago (when the system load was increased due to more users being brought online) they had a
problem when one of the Alphas crashed for no apparent reason. Nothing is written to the SYSDUMP file (though
DUMPFILE and DUMPBUG are set to one), no errors are logged, and all that's written to the console is the
following ...

 HALTED CPU 0
 KERNAL STACK NOT VALID HALT
 PC = FFFFFFFF80029050
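
As a rough aid to reading the halt message, the sketch below checks whether the reported PC lies in system space and how far it is from the start of it. It assumes that OpenVMS Alpha system space begins at FFFFFFFF80000000 (the sign-extended 32-bit base); the constant and function names are illustrative only, and the result says nothing about which image or routine the PC actually belongs to.

```python
# Assumption: system space on OpenVMS Alpha begins at this sign-extended
# 32-bit address. The names below are illustrative, not from any DEC tool.
SYSTEM_SPACE_BASE = 0xFFFFFFFF80000000


def describe_pc(pc_text):
    """Return (in_system_space, offset_from_base) for a hex PC string."""
    pc = int(pc_text, 16)
    in_system_space = pc >= SYSTEM_SPACE_BASE
    # The offset is only meaningful when the PC is inside system space.
    offset = pc - SYSTEM_SPACE_BASE if in_system_space else None
    return in_system_space, offset


in_sys, offset = describe_pc("FFFFFFFF80029050")
print(in_sys, hex(offset))  # prints: True 0x29050
```

Under that assumption, the PC sits 0x29050 bytes into system space, which points at code loaded early in the system address range rather than at a user process.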


This happened again at the weekend, on the other cluster node! Unfortunately, they've got AUTOACTION set to HALT,
so we haven't got a post-crash dump either. I know this all but rules out a definitive answer to the problem, but
I'd be grateful if anyone could give me a pointer as to what the problem MIGHT be.

CPU and config details are as follows:


$ SHOW CPU/FUL

TSLV13, a AlphaServer 2100 5/250
Multiprocessing is DISABLED. Uniprocessing synchronization image loaded.
Minimum multiprocessing revision levels: CPU = 1

System Page Size = 8192
System Revision Code =
System Serial Number = ay52507111
Default CPU Capabilities:
        QUORUM RUN
Default Process Capabilities:
        QUORUM RUN

PRIMARY CPU = 00

CPU 00 is in RUN state
Current Process: _RTA2:          PID = 206009E7
Serial Number:
Revision:
VAX floating point operations supported.
IEEE floating point operations and data types supported.
Processor is Primary Eligible.
PALCODE: Revision Code = 1.18
         PALcode Compatibility = 1
         Maximum Shared Processors = 4
         Memory Space:  Physical address = 00000000 00000000
                        Length = 0
         Scratch Space: Physical address = 00000000 00000000
                        Length = 0
Capabilities of this CPU:
        PRIMARY QUORUM RUN
Processes which can only execute on this CPU:
        *** None ***



SDA> clue config

System Configuration:
---------------------
System Information:
System Type    AlphaServer 2100 5/250                 Primary CPU ID 00
Cycle Time     4.0 nsec (250 MHz)                     Pagesize       8192 Byte

Memory Configuration:
Cluster    PFN Start    PFN Count         Range (MByte)       Usage
 #03             0          256         0.0 MB -    2.0 MB    Console
 #04           256       130815         2.0 MB - 1023.9 MB    System
 #05        131071            1      1023.9 MB - 1024.0 MB    Console

Per-CPU Slot Processor Information:
CPU ID         00                        CPU State    rc,pa,pp,cv,pv,pmv,pl
CPU Type       EV5  Pass 4 (21164)       Halt PC      00000000 20000000
PAL Code       1.18-1                    Halt PS      00000000 00001F00
CPU Revision   ....                      Halt Code    00000000 00000000
Serial Number  ..........                             Bootstrap or Powerfail
Console Vers   V4.5-55



Adapter Configuration:
----------------------
TR  Adapter Name (Address)  Hose  Bus          Node  Device Name       HW-Id/SW
--  ----------------------  ----  -----------  ----  ----------------  --------
 1  KA0905      (80D84080)     0  CBUS
                                                  0  KA0902_CPU        00000017
                                                  4  KA0902_MEM        00000018
                                                  5  KA0902_MEM        00000018
                                                  8  KA0902_IIO        00000019
 2  PCI         (80D84480)     0  PCI
                                         EWA:     0  TULIP             00021011
                                         PKA:     1  NCR53C810         00011000
                                                  2  MERCURY           04828086
                                         PKB:     6  KZPSA             00081011
                                         PKC:     7  KZPSA             00081011
                                         PKD:     8  KZPSA             00081011
 3  EISA        (80D84B40)     1  EISA
                                                  0                    012AA310
                                         GQA:     2  CPQ3011           1130110E
                                         FRA:     4  DEFEA_2           0230A310
                                         DRA:     7  MLX0075           75009835

Adapter Configuration:
----------------------
TR  Adapter Name (Address)  Hose  Bus          Node  Device Name       HW-Id/SW
--  ----------------------  ----  -----------  ----  ----------------  --------
 4  XBUS        (80D85040)     0  XBUS
                                                  0  EISA_SYSTEM_BOAR  00000016
                                         DVA:     1  AHA1742A_FLOPPY   504F4C46
                                         LRA:     2  VTI82C106_PP      00000015
                                         TTA:     3  NS16450           00016450
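
As a sanity check on the Memory Configuration table in the CLUE CONFIG output above, the following sketch recomputes each cluster's MByte range from its PFN start, PFN count and the 8192-byte page size. The figures are copied from the listing; the helper names are hypothetical and exist only for this illustration.

```python
PAGE_SIZE = 8192  # bytes, from "System Page Size = 8192"

# (cluster, PFN start, PFN count), copied from the Memory Configuration table
CLUSTERS = [("#03", 0, 256), ("#04", 256, 130815), ("#05", 131071, 1)]


def mbytes(pages):
    """Convert a page count (or page frame number) to megabytes."""
    return pages * PAGE_SIZE / (1024 * 1024)


def cluster_ranges(clusters):
    """Return (name, start_mb, end_mb) for each memory cluster."""
    return [(name, mbytes(start), mbytes(start + count))
            for name, start, count in clusters]


for name, lo, hi in cluster_ranges(CLUSTERS):
    print(f"{name}: {lo:.2f} MB - {hi:.2f} MB")

# Total physical memory described by the three clusters.
total_mb = mbytes(sum(count for _, _, count in CLUSTERS))
print(f"total: {total_mb:.1f} MB")
```

The three clusters are contiguous and add up to 131072 pages, i.e. 1024 MB, which agrees with the table; the 1023.9 MB boundary in the listing corresponds to 1023.99 MB at two decimal places.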




This system serves the customer's sites throughout the UK, so it's critical the problem is sorted asap.


many thanks,

Chris Warne

2600.1. "Need crash dumpfile for more info" by STAR::jacobi.zko.dec.com::jacobi (Paul A. Jacobi - OpenVMS Systems Group) Mon May 19 1997 17:22

I don't see any obvious problems.

Be sure the console environment variable AUTO_ACTION is set to
RESTART, so a crash dump will be generated the next time the problem
occurs.

>>>set auto_action restart
>>>init


						-Paul

2600.2. "Crash dumpfile" by ROBSON::WARNE () Tue May 20 1997 07:42

I was afraid you'd say that! As I say, AUTO_ACTION is currently set to HALT, and they don't want to shut down
their systems unless they absolutely have to. So it means it will have to crash twice more before I get any
answers - not good.

So, has nobody come across a similar problem, or got any idea of the sort of area the problem might be stemming from? I
need something to take into a customer meeting, and "wait 'til it crashes, then set a console variable, and I
might be able to tell you something when it crashes next time ... " isn't really what I had in mind!

Chris