
Conference ssdevo::hsj40_product

Title:HSJ30/40 Product Conference
Moderator:SSDEVO::EDMONDS
Created:Tue Jul 13 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1264
Total number of notes:4958

1206.0. "1 Shelf died 5 Raidsets lived." by SUBSYS::DUVAL () Mon Feb 10 1997 12:25

    Hi.
    Over this weekend we lost an entire shelf. I had 5 Raidsets set up
    downward using 1 drive per shelf per Raidset, so losing this shelf
    affected all 5 Raidsets. I am happy and pleased to say that the
    HSJ software worked great: 2 Raidsets picked up my 2 spares
    and reconstructed themselves, and the other 3 Raidsets reduced and
    went on functioning with, so far, no apparent loss of data. I do have a
    few questions, however:
    
    1. What happened, what caused this shelf to bail out? The fix this
       morning via CSC was to "del and add" that entire shelf, the 3
       reduced raidsets then reconstructed. No one can tell me the cause
       though. Is there a way to track this down?
    
    2. From the VMS level, all 5 raidsets went into mount-verify and
       were inaccessible. I thought that while all that great stuff was
       going on at the HSJ level, it would not hurt VMS. The entire cluster
       was reloaded to solve the mount verify issue.
    
    I did find the following entry in my VMS error log. Do I need an
    HSJ or Battery replaced?
    
    thanks,
    roger d.
    
 ******************************* ENTRY    4799. *******************************
 ERROR SEQUENCE 13931.                           LOGGED ON:        SID 17000202
 DATE/TIME  8-FEB-1997 14:37:16.59                            SYS_TYPE 01430001
 SYSTEM UPTIME: 19 DAYS 13:51:04
 SCS NODE: LEDS                                                VAX/VMS V6.1

 TIME STAMP KA7AA-AA  CPU FW REV# 2.  CONSOLE FW REV# 4.3

 V A X / V M S        SYSTEM ERROR REPORT         COMPILED 10-FEB-1997 10:10:13
                                                                      PAGE   1.

 ******************************* ENTRY    4800. *******************************
 ERROR SEQUENCE 13932.                           LOGGED ON:        SID 17000202
 DATE/TIME  8-FEB-1997 14:38:50.91                            SYS_TYPE 01430001
 SYSTEM UPTIME: 19 DAYS 13:52:38
 SCS NODE: LEDS                                                VAX/VMS V6.1

 ERL$LOGMSCP KA7AA-AA  CPU FW REV# 2.  CONSOLE FW REV# 4.3

       MESSAGE TYPE        000B
                                       DATAGRAM FOR NON-EXISTING "UCB"
       CLASS DRIVER    4B534944
                                       /DISK/
       CDDB$Q_CNTRLID  54111653
                       01280009
                                       UNIQUE IDENTIFIER, 000954111653(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       CDDB$B_SYSTEMID 10021520
                           4200
       MSLG$L_CMD_REF  00000000
       MSLG$W_SEQ_NUM      00C7
                                       SEQUENCE #199.
       MSLG$B_FORMAT         01
                                       HOST/CNTRLR MEMORY ACCESS LOG
       MSLG$B_FLAGS          00
                                       UNRECOVERABLE ERROR
       MSLG$W_EVENT        012A
                                       CONTROLLER ERROR
                                       CNTRLR MEMORY ERROR
       MSLG$Q_CNT_ID   54111653
                       01280009
                                       UNIQUE IDENTIFIER, 000954111653(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       MSLG$B_CNT_SVR        27
                                       CONTROLLER SOFTWARE VERSION #39.
       MSLG$B_CNT_HVR        4D
                                       CONTROLLER HARDWARE REVISION #77.
       MSLG$L_BUS_ADDR 00000000
                                       MAPPING REGISTER #0 SELECTED

 FIB DEPENDENT DATA

 BACKUP BATTERY PACKET

       INSTANCE        02062301
                                       COMPONENT ID = VALUE ADDED SERVICE

                                       EVENT NUMBER = 06(X)
                                       Processor interrupt by CACHEA1 DRAB. -
                                       CACHEA1 backup battery failure

                                       REPAIR ACTION = 23(X)

                                       NR THRESHOLD = 01(X)
                                       NR CLASSIFICATION = IMMEDIATE



       TEMPL                 12
       TDISIZE               00
       EVENT TIME      01F90BB0
                       00000000
                                       9194. HRS, 4. MINS, 32. SECS
ANAL/ERR/SINCE=8-FEB-1997 00:00:00.00/BEFORE=9-FEB-1997 00:00:00.00/OUT=X.LIS
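
The hex fields in the listing above can be cross-checked against the decoded annotations that ANAL/ERR prints beside them. A minimal sketch, assuming the low longword of EVENT TIME is a count of seconds (an inference from the "9194. HRS, 4. MINS, 32. SECS" annotation, not something the listing states); the function and variable names are illustrative only:

```python
def seconds_to_hms(total_seconds):
    """Split a count of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


# Hex values copied verbatim from the error log entry 4800.
event_time = int("01F90BB0", 16)   # EVENT TIME, low longword (assumed seconds)
seq_num = int("00C7", 16)          # MSLG$W_SEQ_NUM
sw_version = int("27", 16)         # MSLG$B_CNT_SVR
hw_revision = int("4D", 16)        # MSLG$B_CNT_HVR

print(seconds_to_hms(event_time))        # (9194, 4, 32)
print(seq_num, sw_version, hw_revision)  # 199 39 77
```

The results agree with the listing's own annotations (sequence #199, software version #39, hardware revision #77), which suggests the raw fields themselves are being read correctly.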
    
1206.1. "roger d., retry the errorlog with decevent.." by SUBSYS::VIDIOT::PATENAUDE (Ask your boss for ARRAY's...) Mon Feb 10 1997 13:52 6 lines

$diag/since=08-feb-1997

ANAL/ERR is giving you bogus info ("controller memory error").

roger p.

1206.2. "DIAG?" by SUBSYS::DUVAL () Mon Feb 10 1997 14:59 10 lines
    
    $diag/since=08-feb-1997 ?
    
    $DIAG not a DCL command?
    
    Is DECevent something layered on VMS?
    
    thanks,
    roger d.
    
1206.3. "Yup could be battery" by SSDEVO::RMCLEAN () Mon Feb 10 1997 17:23 4 lines

The instance code says the cache battery is bad.  It might be that in the
middle of all of this you lost the cache batteries.  Did you have a failover
occur also?  The failover would have caused inaccessibility.  A bad battery
would have caused that too.
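
For reference, the instance code from the BACKUP BATTERY PACKET in .0 can be split into its bytes. This is a minimal sketch, assuming the four bytes map, high to low, to component ID, event number, repair action, and NR threshold. That layout is inferred from the annotations ANAL/ERR printed (EVENT NUMBER = 06, REPAIR ACTION = 23, NR THRESHOLD = 01) and is not confirmed anywhere in this thread; the function name is illustrative.

```python
def split_instance(code_hex):
    """Split a 32-bit instance code into four byte-wide fields.

    The field order is an assumption inferred from the error log
    annotations in .0, not a documented layout.
    """
    value = int(code_hex, 16)
    return {
        "component_id": (value >> 24) & 0xFF,
        "event_number": (value >> 16) & 0xFF,
        "repair_action": (value >> 8) & 0xFF,
        "nr_threshold": value & 0xFF,
    }


fields = split_instance("02062301")
for name, byte in fields.items():
    # Print each field in hex to match the (X) notation in the listing.
    print(f"{name} = {byte:02X}(X)")
```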

1206.4. "" by SUBSYS::VIDIOT::PATENAUDE (Ask your boss for ARRAY's...) Mon Feb 10 1997 18:34 7 lines

Roger,

Yes, it's layered, but it's not on LEDS (VAX). I tried it on MSGAXP; being a
lowly user I have no privs to the errorlog ;^) but the command did take.

Roger.

1206.5. "2.7 Patch problem now..." by SUBSYS::DUVAL () Tue Feb 11 1997 13:14 22 lines

    Yes. One controller's battery goes to FAILED every few days. I do a
    restart and it comes back to "GOOD". FS pointed me to a patch for
    2.7 that may help this situation. I pulled the patch this morning,
    applied it to all 4 controllers, and now I'm in REAL trouble. After
    the patch I did a restart, and ever since then I have not been able
    to get them to come to life again. I tried reloading the card and
    it appears to try: it sequences, then the green light flashes off
    and on just as the other does, but then it stays out and never comes
    back. Nothing on the console.
    
    I tried a known good card, with the same result. Where did I go wrong?
    There were no errors or problems applying the patches. I pulled the
    patch from the CSC32:: location. I've applied patches in the past to
    2.5 without a problem.
    
    All my disks are now attached to my 2 controllers that I dare not
    restart until I can get this resolved.
    
    Really need help now,
    thanks,
    roger d.
    
1206.6. "Try this..." by SSDEVO::RMCLEAN () Tue Feb 11 1997 14:34 124 lines

  I haven't seen this but...  Take the patch out of the controllers that are
currently running.  Then at least you won't have a problem if they crash.
Next I would strongly suggest you get good batteries.  The patch only helps
when you have certain types of batteries that are good but exhibit interesting
properties at certain temperatures.  I would log an IPMT case and see if you
can get other insight there....

To ensure you have the right patch I will include a copy below:


Patch Title:    Fix for Battery Test failures 
HSOF Version:   V27J
Patch Number:   1
Date:           24-SEP-1996
Engineer:       

I.  Patch Description:

The current Periodic Cache Battery Test algorithm does not provide
sufficient test coverage for the controller to properly detect good/bad
Cache Batteries. This patch improves the test coverage provided by the
Periodic Cache Battery Test.

II.  Patch in text form:

Extract the following text to PATCH_V27J-1.txt and send to customers entering
the patch by hand.

----- Begin text -----

Title:	Fix for Battery Test failures 

Version:	V27J   
Length: 	128 
Patch Type:	0 
Patch Number:	1 

 Count:  	27 
 Address:	20108310 
 Value[  0]:	90383000 
 Value[  1]:	20171CC8 
 Value[  2]:	58E8198B 
 Value[  3]:	90A9E008 
 Value[  4]:	8C803000 
 Value[  5]:	0027BC86 
 Value[  6]:	09EF7EAC 
 Value[  7]:	58A01989 
 Value[  8]:	92A1E008 
 Value[  9]:	8C803000 
 Value[ 10]:	0013DE43 
 Value[ 11]:	09EF7E98 
 Value[ 12]:	90A1E008 
 Value[ 13]:	305D202C 
 Value[ 14]:	598E5E03 
 Value[ 15]:	5947DE10 
 Value[ 16]:	5936DE10 
 Value[ 17]:	65320A80 
 Value[ 18]:	8C803000 
 Value[ 19]:	20142AE0 
 Value[ 20]:	0901ED50 
 Value[ 21]:	90A1E008 
 Value[ 22]:	8C91D000 
 Value[ 23]:	375D2010 
 Value[ 24]:	8C91D000 
 Value[ 25]:	58840090 
 Value[ 26]:	305D6038 

 Count:  	0 

Verification:	30C80AA1 
----- End text -----
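
Before entering a patch by hand, it may be worth confirming that the text form is internally consistent, for example that each Count matches the number of Value lines that follow it. The sketch below only parses the "Key: value" layout shown above and checks that one property. It does not compute or validate the Verification value, since the algorithm behind it is not described in this note. The function names are illustrative, not part of any HSOF tool.

```python
# Patch words copied verbatim from the text form above, in order.
VALUES = """
90383000 20171CC8 58E8198B 90A9E008 8C803000 0027BC86 09EF7EAC
58A01989 92A1E008 8C803000 0013DE43 09EF7E98 90A1E008 305D202C
598E5E03 5947DE10 5936DE10 65320A80 8C803000 20142AE0 0901ED50
90A1E008 8C91D000 375D2010 8C91D000 58840090 305D6038
""".split()

# Rebuild the text form so the example is self-contained.
PATCH_TEXT = "\n".join(
    ["Title:\tFix for Battery Test failures", "", "Version:\tV27J",
     "Length:\t128", "Patch Type:\t0", "Patch Number:\t1", "",
     " Count:\t27", " Address:\t20108310"]
    + [f" Value[{i:3d}]:\t{v}" for i, v in enumerate(VALUES)]
    + ["", " Count:\t0", "", "Verification:\t30C80AA1"]
)


def parse_patch_text(text):
    """Return (header fields, list of Count/Address/Value blocks)."""
    header, blocks, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank line or separator
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Count":
            current = {"count": int(value), "address": None, "values": []}
            blocks.append(current)
        elif key == "Address" and current is not None:
            current["address"] = value
        elif key.startswith("Value[") and current is not None:
            current["values"].append(value)
        else:
            header[key] = value
    return header, blocks


def counts_match(blocks):
    """True when every block holds exactly as many values as its Count."""
    return all(b["count"] == len(b["values"]) for b in blocks)


header, blocks = parse_patch_text(PATCH_TEXT)
print(header["Version"], header["Patch Number"], counts_match(blocks))
```

A check like this only catches transcription slips such as a dropped or duplicated line. It says nothing about whether the patch suits the firmware image it is applied to.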

III.  Patch Installation Script for use with HSDSA-SCRIPT.EXE:

Extract the following script to PATCH_V27J-1.script and execute it using the
HSDSA-SCRIPT program.

----- Begin Script -----
!
!Fix for Battery Test failures
! 
run clcp
2
1
y
V27J   
128 
0 
1 
27 
20108310 
90383000 
20171CC8 
58E8198B 
90A9E008 
8C803000 
0027BC86 
09EF7EAC 
58A01989 
92A1E008 
8C803000 
0013DE43 
09EF7E98 
90A1E008 
305D202C 
598E5E03 
5947DE10 
5936DE10 
65320A80 
8C803000 
20142AE0 
0901ED50 
90A1E008 
8C91D000 
375D2010 
8C91D000 
58840090 
305D6038 
0 
30C80AA1 
3
0
----- End Script -----
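
The script above appears to be the same fields as the text form, one per line, wrapped in a fixed sequence of CLCP responses. A minimal sketch of generating those lines from the patch fields follows. The responses "2", "1", "y" before the data and "3", "0" after it are copied verbatim from the script; what each menu choice means is not stated in this note, so treat them as opaque. The function name and parameters are illustrative, and this is not the HSDSA-SCRIPT program itself.

```python
VALUES = """
90383000 20171CC8 58E8198B 90A9E008 8C803000 0027BC86 09EF7EAC
58A01989 92A1E008 8C803000 0013DE43 09EF7E98 90A1E008 305D202C
598E5E03 5947DE10 5936DE10 65320A80 8C803000 20142AE0 0901ED50
90A1E008 8C91D000 375D2010 8C91D000 58840090 305D6038
""".split()


def build_script(title, version, length, patch_type, number,
                 address, values, verification):
    """Return the script lines in the order shown in the note above."""
    lines = ["!", "!" + title, "!"]          # comment header
    lines += ["run clcp", "2", "1", "y"]     # CLCP responses, copied as-is
    lines += [version, str(length), str(patch_type), str(number)]
    lines += [str(len(values)), address]     # count, then load address
    lines += list(values)                    # the patch words
    lines += ["0", verification]             # terminating count, check value
    lines += ["3", "0"]                      # closing responses, copied as-is
    return lines


script = build_script("Fix for Battery Test failures", "V27J", 128, 0, 1,
                      "20108310", VALUES, "30C80AA1")
print("\n".join(script))
```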

1206.7. "Backed out the v27j-1 patch" by SUBSYS::DUVAL () Wed Feb 12 1997 17:23 13 lines

    Well, I did remove the patch from my one surviving HSJ before it
    rebooted. It did turn out to be the patch that caused it. I had
    to boot from a v25j card (which ignored the v27j-1 code patch) and
    then go in and delete the v27j-1 patch. I did that to 3 controllers
    and rebooted them with the original v27j cards. Then I had to do
    a "clear lost_data" for all my units. CSC helped me with all that.
    
    The patch and log included in .6 are exactly what I had done; each
    log was clean, with no indication of a problem.
    
    thanks,
    roger d.
    
1206.8. "Status /" by GIDDAY::HOBBS (Andy Hobbs. Sydney CSC. -730 5964) Tue Feb 18 1997 00:48 10 lines
    
     Made any progress in understanding this situation, Roger ?
    
    I've not seen it myself and I've pushed this patch onto a few
    controllers (Australia and UK) without any hitches before, during
    or after.
    
    How did you perform the post-patch reboots ?
    
    A/.

1206.9. "bad v27j.img file for patching anyway..." by SUBSYS::DUVAL () Fri Mar 07 1997 11:51 14 lines

    Hi,
    Yes. This one has been understood. I had an unofficial v27j-0 release
    to start with. I received 4 HSJ controllers with no cards last year,
    and never could convince them to send me any, so I had to beg. We had
    a blaster here in SHR, so all I needed was the v27j.img file. The one
    I got was a bit off for applying a patch (although the patch utility
    did not flag a problem). The patch was corrupting the OS, and so a
    "restart" would just hang.
    
    I re-blasted all my cards and the patch works fine now...
    
    thanks,
    roger d.