
Conference ssdevo::hsz40_product

Title:HSZ40 Product Conference
Moderator:SSDEVO::EDMONDS
Created:Mon Apr 11 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:902
Total number of notes:3319

819.0. "HSZ40 error -> ADVFS panic!" by ATZIS1::PUTZENLECHNE (wherever is fun, there's always ALPHA) Thu Mar 20 1997 13:20

Hi!

Can somebody have a look at this :

I experienced a system outage after a single error on a MIRRORED
disk on the HSZ40 led to a follow-on AdvFS domain panic.

The customer asked me why a disk error is causing a problem if he uses
mirrorsets. I don't know the answer.

crossposted in HSZ40 notes-file and ADVFS_SUPPORT notes-file

Please look at the following description:

I have the following configuration:

An Alphaserver 8200 with a HSZ40 (dual redundant) connected to scsi5.
On the HSZ40 there is a unit D100 configured which is a stripeset.
This stripeset (STRIPE3) consists of 6 mirrorsets (MIRR31 - MIRR36).
Mirrorset MIRR32 consists of DISK210 and DISK110.

The unit D100 is the UNIX-device /dev/rza41c which is used by the
ADVFS domain named ora_dat1.
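
For readers matching the device name to the error log below, here is a minimal sketch. It assumes the usual Digital UNIX rz naming convention (unit number = 8 * bus + target, an optional letter after `rz` selecting the LUN with `a` = LUN 0, and a trailing partition letter). The bit layout used for the log's "Unit Number" field is only inferred from the single entry in this note, and both function names are hypothetical.

```python
import re


def decode_rz_name(name):
    """Split a name such as 'rza41c' into (bus, target, lun, partition)."""
    # An optional leading 'r' covers raw-device names such as 'rrza41c'.
    m = re.fullmatch(r"r?rz([a-h]?)(\d+)([a-h]?)", name)
    if m is None:
        raise ValueError(f"not an rz device name: {name!r}")
    lun_letter, unit, partition = m.groups()
    lun = ord(lun_letter) - ord("a") if lun_letter else 0
    # Assumed convention: unit = 8 * bus + target.
    bus, target = divmod(int(unit), 8)
    return bus, target, lun, partition


def decode_unit_number(value):
    """Assumed layout (inferred from this log only): bus<<6 | target<<3 | lun."""
    return value >> 6, (value >> 3) & 0x7, value & 0x7


print(decode_rz_name("rza41c"))
print(decode_unit_number(0x0148))
```

Both calls give bus 5, target 1, LUN 0, which agrees with "connected to scsi5" and with the Target/LUN values printed in the log entry.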

On 13th March the unit D100 logged the following error, which
can be decoded as a command timeout to DISK110:


******************************** ENTRY    4 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            10.
Timestamp of occurrence              13-MAR-1997 05:32:38
Host name                            sapfddi4

System type register      x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x0000000D

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      199. CAM SCSI Event Type


------- Unit Info -------
Bus Number                        5.
Unit Number                   x0148  Target =   1.  <--- this is rza41c
                                     LUN =   0.          UNIT D100 on
------- CAM Data -------				 the HSZ40
Class                           x00  Disk
Subsystem                       x00  Disk
Number of Packets                10.

------ Packet Type ------       258. Module Name String

Routine Name                         cdisk_check_sense

------ Packet Type ------       256. Generic String

                                     Event - Unit Attention

------ Packet Type ------       262. Info Error String

Error Type                           Information Message Detected (recovered)

------ Packet Type ------       257. Device Name String

Device Name                          DEC     HSZ4

------ Packet Type ------       256. Generic String

                                     Active CCB at time of error

------ Packet Type ------       256. Generic String

                                     CCB request completed with an error

------ Packet Type ------         1. SCSIh I/O Request CCB(CCB_SCSIIO)
Packet Revision                  37.

CCB Address               xFFFFFC005D4B7B28
CCB Lengt                    x00C0
XPT Function Code               x01  Execute requested SCSI I/O
Cam Status                      x84  CCB Request Completed WITH Error
                                     Autosense Data Valid for Target
Path ID                           5.
Target ID                         1.
Target LUN                        0.
Cam Flags                 x00000482  SIM Queue Actions are Enabled
                                     Data Direction (10: DATA OUT)
                                     Disable the SIM Queue Frozen State
*pdrv_ptr                 xFFFFFC005D4B7828
*next_ccb                 x0000000000000000
*req_map                  xFFFFFC007B13F400
void (*cam_cbfcnp)()      xFFFFFC00004A5460
*data_ptr                 xFFFFFFFFC6428000
Data Transfer Length          16384.
*sense_ptr                xFFFFFC005D4B7850
Auotsense Byte Length           160.
CDB Length                       10.
Scatter/Gather Entry Cnt          0.
SCSI Status                     x02  Check Condition
Autosense Residue Length        x00
Transfer Residue Length   x00004000
(CDB) Command & Data Buf

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:              00000000  0000C037  B301002A   *    *...7... ...*

Timeout Value             x0000003C
*msg_ptr                  x0000000000000000
Message Length                    0.
Vendor Unique Flags           x4000
Tag Queue Actions               x20  Tag for Simple Queue

------ Packet Type ------       256. Generic String

                                     Error, exception, or abnormal condition

------ Packet Type ------       256. Generic String

                                     UNIT ATTENTION - Medium changed or target
                                     reset

------ Packet Type ------       768. SCSI Sense Data
Packet Revision                   0.

------- HSZ Data -------
Instance Code             x031A4002  Command timeout.

                                     Component ID =   Device Services.
                                     Event Number =   x0000001A
                                     Repair Action =   x00000040
                                     NR Threshold =   x00000002
Template Type                   x51  Disk Transfer Error.
Template Flags                  x00  HCE =   0, Event did not occur during Host
                                             Command Execution.
Ctrl Serial #                              ZG60606525
Ctrl Software Revision               V30Z
RAIDSET State                   x00  NORMAL. All members present and
                                     reconstructed, IF LUN is configured as a
                                     RAIDSET.

Error Count                       1.
Retry Count                       0.
Most Recent ASC                 xB0
Most Recent ASCQ                x00
Next Most Recent ASC            x00
Next Most Recent ASCQ           x00
Device Locator              x000101  Port    =   1.
                                     Target  =   1.
                                     LUN     =   0.    <--- DISK110
Drive Software Revision              0007
Drive Product Name                   RZ29B    (C) DEC
Device Type                     x00  Direct Access Device.
Sense Data Qualifier            x00  Buf Mode =   0, The target shall not
                                                  report GOOD Status on write
                                                  commands until the data
                                                  blocks are actually written
                                                  on the medium.
                                     UWEUO =   0, not defined.
                                     MSBD =   0, not defined.
                                     FBW =   0, not defined.
                                     IDSD =   0, Valid Device Sense Data
                                              fields.
                                     DSSD =   0, Device Sense Data fields
                                              supplied by the controller.
-- Standard Sense Data --

Error Code                      x70  Current Error
Segment #                       x00
Information Byte 3              x00
            Byte 2              x00
            Byte 1              x00
            Byte 0              x00
Sense Key                       x06  Unit Attention
Additional Sense Length         x98
CMD Specific Info Byte 3        x00
                  Byte 2        x00
                  Byte 1        x00
                  Byte 0        x00

ASC & ASCQ                    xB000  ASC  =   x00B0
                                     ASCQ =   x0000
                                     Command timeout.

FRU Code                        x00
Sense Key Specific Byte 0       x00  Sense Key Data NOT Valid
                   Byte 1       x00
                   Byte 2       x00

-- Device Sense Data --

Error Code                      x00  Error Code not decoded
Segment #                       x00
Information Byte 3              x00
            Byte 2              x00
            Byte 1              x00
            Byte 0              x00
Sense Key                       x04  Hardware Error
Additional Sense Length         x00
CMD Specific Info Byte 3        x00
                  Byte 2        x00
                  Byte 1        x00
                  Byte 0        x00

ASC & ASCQ                    xB000  ASC  =   x00B0
                                     ASCQ =   x0000
                                     Command timeout.

FRU Code                        x00
Sense Key Specific Byte 0       x00  Sense Key Data NOT Valid
                   Byte 1       x00
                   Byte 2       x00


******************************** ENTRY    5 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            11.
Timestamp of occurrence              13-MAR-1997 05:32:40
Host name                            sapfddi4

System type register      x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x0000000D

Event validity                    1. O/S claims event is valid
Event severity                    3. High Priority
Entry type                      199. CAM SCSI Event Type


------- Unit Info -------
Bus Number                        5.
Unit Number                   x0148  Target =   1.
                                     LUN =   0.
------- CAM Data -------
Class                           x00  Disk
Subsystem                       x00  Disk
Number of Packets                 4.

------ Packet Type ------       258. Module Name String

Routine Name                         cdisk_reset_rec_err

------ Packet Type ------       256. Generic String

                                     Recovery failed

------ Packet Type ------       260. Hardware Error String

Error Type                           Hard Error Detected

------ Packet Type ------       257. Device Name String

Device Name                          DEC     HSZ4
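
The Instance Code in ENTRY 4 can be split into the four fields that the log prints beneath it. This is a minimal sketch: the one-byte-per-field layout is an assumption inferred from the decoded values in that entry, not taken from HSOF documentation, and the function name is hypothetical.

```python
def decode_instance_code(code):
    """Split a 32-bit HSZ instance code into four byte-wide fields.

    Assumed layout, high byte to low byte:
    component ID, event number, repair action, NR threshold.
    """
    return {
        "component_id": (code >> 24) & 0xFF,
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


fields = decode_instance_code(0x031A4002)
for name, value in fields.items():
    print(f"{name:14s} x{value:02X}")
```

With x031A4002 this gives event number x1A, repair action x40 and NR threshold x02, the same values the log shows; component x03 is the one the log labels "Device Services".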



At the same time the domain "ora_dat1" panicked, and Oracle stopped.
These are the entries from /var/adm/messages:

Mar 13 05:32:40 sapfddi4 vmunix: advfs I/O error: setId 0x3171fd89.000554e0.fffffffe.0000  tag 0xfffffff7.0000u  page 474
Mar 13 05:32:40 sapfddi4 vmunix:        vd 1  blk 28522432  blkCnt 32
Mar 13 05:32:40 sapfddi4 vmunix:        write error = 5
Mar 13 05:32:40 sapfddi4 vmunix:
Mar 13 05:32:40 sapfddi4 vmunix: bs_osf_complete: metadata write failed
Mar 13 05:32:40 sapfddi4 vmunix: AdvFS Domain Panic; Domain ora_dat1 Id 0x3171fd89.000554e0
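
The CDB dumped in ENTRY 4 can be cross-checked against the AdvFS message. The sketch below relies only on the standard 10-byte SCSI CDB layout (opcode in byte 0, big-endian logical block address in bytes 2-5) and on the byte order stated in the dump header. The 512-byte block size and the function name are assumptions.

```python
def decode_cdb10(cdb):
    """Return (opcode, logical block address) of a 10-byte SCSI CDB."""
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcode = cdb[0]
    lba = int.from_bytes(cdb[2:6], "big")
    return opcode, lba


# Words exactly as printed in the dump, highest-numbered bytes first.
words = ["00000000", "0000C037", "B301002A"]
# Reverse so that CDB byte 0 comes first, then keep the 10 bytes in use.
raw = bytes.fromhex("".join(words))[::-1][:10]
opcode, lba = decode_cdb10(raw)

BLOCK_BYTES = 512  # assumed block size
blocks = 16384 // BLOCK_BYTES  # "Data Transfer Length" from the CCB

print(hex(opcode), lba, blocks)
```

Opcode x2A is WRITE(10), and the decoded address 28522432 together with 32 blocks matches "blk 28522432  blkCnt 32" in the AdvFS message. The failed metadata write is therefore the same request that completed with the error in ENTRY 4.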



DISK110 was not marked as failed after this. I wonder why such a "soft" error
causes such a severe failure. The system is built with redundant controllers
and mirrored disks to prevent system-down situations in case of hardware
errors on disks or controller boards, but in this case that did not work.

Can anybody help me explain what really happened?

Thanks for any input.

Helmut

T.R  Title  User  Personal Name  Date  Lines
819.1  not enough information  SSDEVO::RMCLEAN  Thu Mar 20 1997 19:03  3
Which version of the HSOF software are you running, and at what patch
level?  The error logs don't tell us this, nor do they tell us what
configuration you have.
819.2  Configuration HSZ40  ATZIS2::PUTZENLECHNE  wherever is fun, there's always ALPHA  Tue Apr 01 1997 06:33  119
Hi!

I'm sorry for the delay, I had to go out of the office last week.

The HSZ40 is connected to an Alphaserver 8200 via a KZPSA in a
DWLPA. The UNIX version was V3.2d-1 and has now been upgraded to 3.2G.

Here is the relevant part of the HSZ40 configuration:

HSZ03> sho this full

Controller:
        HSZ40 ZG60606525 Firmware V30Z-2, Hardware  B03
        Configured for dual-redundancy with ZG60506190
            In dual-redundant configuration
        SCSI address 6
        Time: 20-MAR-1997 17:07:08
Host port:
        SCSI target(s) (1, 2, 3, 4), Preferred target(s) (1, 3)
        TRANSFER_RATE_REQUESTED = 10MHZ
Cache:
        32 megabyte write cache, version 2
        Cache is GOOD
        Battery is GOOD
        Unflushed data in cache
        CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
        CACHE_POLICY = B
        Host Functionality Mode = A
Licensing information:
        RAID (RAID Option) is ENABLED, license key is VALID
        WBCA (Writeback Cache Option) is ENABLED, license key is VALID
        MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
Extended information:
        Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
        Operation control: 00000004  Security state code: 76193
        Configuration backup enabled on 16 devices


HSZ03> sho unit

    LUN                                      Uses
--------------------------------------------------------------

  D100                                       STRIPE3
  D101                                       MIRR11
  D200                                       STRIPE2
  D300                                       STRIPE5
  D400                                       STRIPE4

The affected unit was D100:

HSZ03> sho d100

    LUN                                      Uses
--------------------------------------------------------------

  D100                                       STRIPE3
        Switches:
          RUN                    NOWRITE_PROTECT        READ_CACHE
          WRITEBACK_CACHE
          MAXIMUM_CACHED_TRANSFER_SIZE = 1024
        State:
          ONLINE to this controller
          Not reserved
          PREFERRED_PATH = THIS_CONTROLLER
        Size: 50265168 blocks

HSZ03> sho stripe3

Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------

STRIPE3       stripeset                      MIRR31           D100
                                             MIRR32
                                             MIRR33
                                             MIRR34
                                             MIRR35
                                             MIRR36
        Switches:
          CHUNKSIZE = 256 blocks
        State:
          NORMAL
          MIRR31    (member  0) is NORMAL
          MIRR32    (member  1) is NORMAL
          MIRR33    (member  2) is NORMAL
          MIRR34    (member  3) is NORMAL
          MIRR35    (member  4) is NORMAL
          MIRR36    (member  5) is NORMAL
        Size: 50265168 blocks
HSZ03> sho mirr32

Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------

MIRR32        mirrorset                      DISK110          STRIPE3
                                             DISK210
        Switches:
          NOPOLICY (for replacement)
          COPY (priority) = NORMAL
          READ_SOURCE = LEAST_BUSY
          MEMBERSHIP = 2, 2 members present
        State:
          NORMAL
          DISK210   (member  0) is NORMAL
          DISK110   (member  1) is NORMAL  <--- disk with error
        Size: 8377528 blocks
HSZ03> sho disk110

Name          Type                      Port Targ  Lun        Used by
------------------------------------------------------------------------------

DISK110       disk                         1    1    0        MIRR32
          DEC      RZ29B    (C) DEC 0016
        Switches:
          NOTRANSPORTABLE
          TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
        Size: 8377528 blocks
        Configuration being backed up on this container
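
The failing block number from the AdvFS message (blk 28522432) can be mapped onto the stripeset shown above. This is a minimal sketch that assumes plain round-robin striping starting at member 0, with no reserved area at the start of the unit. The actual HSOF on-disk layout is not stated in this thread, and the function name is hypothetical.

```python
def locate_in_stripeset(block, chunk_blocks, members):
    """Map a unit block number to (member index, block within that member)."""
    chunk, offset = divmod(block, chunk_blocks)
    row, member = divmod(chunk, members)
    return member, row * chunk_blocks + offset


CHUNKSIZE = 256        # from "sho stripe3"
MEMBERS = 6            # MIRR31 .. MIRR36
MEMBER_SIZE = 8377528  # from "sho mirr32"
UNIT_SIZE = 50265168   # from "sho d100"

member, member_block = locate_in_stripeset(28522432, CHUNKSIZE, MEMBERS)
offset_in_chunk = 28522432 % CHUNKSIZE

print(member, member_block, offset_in_chunk)
print(MEMBERS * MEMBER_SIZE == UNIT_SIZE)
```

Under that assumption the write lands on member 1, which "sho stripe3" lists as MIRR32, the mirrorset containing DISK110. The 32-block write starts at offset 192 inside its 256-block chunk, so it stays within one member. This agreement is consistent with the assumed layout but does not prove it.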

819.3  You need -3 patch  SSDEVO::RMCLEAN  Tue Apr 01 1997 14:47  25
>>        HSZ40 ZG60606525 Firmware V30Z-2, Hardware  B03


You should be running V30Z-3.  It corrects some problems in this area.


I.  Patch Description:

  This mirrorset repair/fast buffer problem may be encountered with HSOF
  V3.0Z, V5.0Z and V5.0J.  Mirroring (with or without striping) must be
  in use on the controller.  Data transfers greater than the value
  specified in the controller parameter MAXIMUM_CACHED_TRANSFER_SIZE
  must be taking place.  The default parameter value is 32 blocks
  (16KB).  An unrecoverable error from a device must initiate a Mirror
  repair.


  When the above conditions take place, the controller improperly
  de-allocates buffers, contaminating the Fast Buffer pool and the Cache
  Buffer pool.  Subsequently, when a mix of transfers greater than the
  MAXIMUM_CACHED_TRANSFER_SIZE (using Fast buffers) and less than the
  MAXIMUM_CACHED_TRANSFER_SIZE (using Cache Buffers) occurs, the
  double-allocated buffers will be used and a data integrity problem is
  stimulated.
                         
819.4  OK - but....?  ATZIS2::PUTZENLECHNE  wherever is fun, there's always ALPHA  Wed Apr 02 1997 06:19  14
    Thanks!
    
    I did not really understand the Patch Description, but I will
    install the patch and hope this helps.
    
    What I do not understand is whether this patch fixes two different
    problems:
    1.) transfer size > MAXIMUM_CACHED_TRANSFER_SIZE
    2.) unrecoverable error from device must initiate mirror repair
    
    Are these things independent of each other, or is there a
    relationship between 1.) and 2.)?
    
    Helmut
819.5  KERNEL::LOANE  Comfortably numb!!  Wed Apr 02 1997 11:26  3
    What it really says is that you are susceptible to the problem IF
    ALL the points in the reply are valid, i.e. you have mirrorsets
    .AND. you have errors .AND. ... etc.
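
Read this way, the patch description is a conjunction of conditions. Here is a minimal sketch of that logic. The function and parameter names are hypothetical, the transfer sizes in the two calls are illustrative values rather than measurements, and it assumes that the controller's "V30Z" corresponds to the "V3.0Z" named in the patch text.

```python
AFFECTED_HSOF = {"V3.0Z", "V5.0Z", "V5.0J"}  # versions named in the patch text


def susceptible(hsof, mirroring, transfer_blocks, max_cached_blocks,
                repair_triggered):
    """Return True only if every condition from the patch description holds."""
    return (
        hsof in AFFECTED_HSOF
        and mirroring
        and transfer_blocks > max_cached_blocks
        and repair_triggered
    )


# Illustrative: a 64-block transfer against the default limit of 32 blocks.
default_case = susceptible("V3.0Z", True, 64, 32, True)
# Illustrative: a 32-block transfer against a limit raised to 1024 blocks.
tuned_case = susceptible("V3.0Z", True, 32, 1024, True)

print(default_case, tuned_case)
```

The first call returns True and the second False, because the transfer no longer exceeds MAXIMUM_CACHED_TRANSFER_SIZE once the limit is raised.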
819.6  Yes, it's not sure  ATZIS2::PUTZENLECHNE  wherever is fun, there's always ALPHA  Thu Apr 03 1997 11:11  6
    My point exactly - I fear it is not the same problem, because I think
    we had already implemented the "early fix" (setting
    MAXIMUM_CACHED_TRANSFER_SIZE to 1024, and changing the HSZ40 entry
    in cam_data.c) when the error occurred.
    
    Helmut