
Conference ssdevo::hsz40_product

Title: HSZ40 Product Conference
Moderator: SSDEVO::EDMONDS
Created: Mon Apr 11 1994
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 902
Total number of notes: 3319

901.0. "RZ29B - fault latched?" by TROFS::S_OSSOWSKI (STEPHEN OSSOWSKI @TRO) Wed Jun 04 1997 15:04

    Is it possible (or allowed) for a disk (RZ29B u0007) to fail with its orange
    (fault) light on and yet appear to continue functioning, without being moved
    automatically to the failedset (HSZ40 V3.1)? It was a mirror member within a
    stripeset (0+1). This caused confusion for the customer and for the service
    engineer who went in for the repair. The host system (Unix 3.2g) did
    log an event (UNIT ATTENTION - Medium changed or target reset). The
    customer noticed that an application would not run anymore (I don't have
    more detail), which led them to open the SW cab and discover the
    fault light.
    
    Why would the fault light remain lit while the drive appears active (green flash)?
    Was it perhaps normalizing? The mirrorset status info was not recorded, which is
    why I'm asking what's possible. Is this a power management issue related
    to f/w 0007 on RZ29Bs? Incidentally, V3.1 of HSOF had been installed just
    2 weeks prior, whether that means anything or not.
    
    Quick reply appreciated.
    
    Steve Ossowski 
T.R  Title  User  Personal Name  Date  Lines
901.1  Locate command????  SSDEVO::RMCLEAN  Wed Jun 04 1997 15:51  1
Did you do a locate command with the CLI?
901.2  did not need locate.  TROFS::S_OSSOWSKI  STEPHEN OSSOWSKI @TRO  Wed Jun 04 1997 15:55  5
    The fault light was on, so there was no need to do a LOCATE. The problem is
    that the disk was not moved to the failedset and was still being accessed
    with the yellow LED on.
    
    
901.3  locate clarification  TROFS::S_OSSOWSKI  STEPHEN OSSOWSKI @TRO  Wed Jun 04 1997 16:12  5
    OK. I know I can do a LOCATE CANCEL to turn off the LED. In this
    situation it was clear from UERF that there was a loggable event and
    that something was happening to the mirrored pair.
    
    
901.4  SSDEVO::T_GONZALES  Wed Jun 04 1997 21:06  5
    Was the orange LED blinking or lit continuously? What type of shelf
    were you using: BA350, or BA356 with I/O module? If the light was on
    solid, then the fault LED could have become falsely latched.
    
    
901.5  ba350  TROFS::S_OSSOWSKI  STEPHEN OSSOWSKI @TRO  Thu Jun 05 1997 11:15  15
    It was in a BA350 within a SW500 (i.e. no personality module).
    
    How would you define a blinking fault LED? Since I was not on site
    myself, I am relying on second-hand info about LED status. I am told ON
    solid. My concern is again: what type of fault condition (besides the LOCATE
    CLI command) would NOT put the disk in the failedset? You say "falsely
    latched"; of what consequence is this for the mirrorset? I believe the host
    should not be impacted at all (besides logging an event).
    
    This is important because the first sign of a problem from the host's
    perspective was that one Unix application would not run (I don't have details
    yet), hence the discovery of one faulted drive in a two-drive mirror.
    The system had to be rebooted to allow the app to run (cache flush - Unix or
    HSZ???). Then the disk was swapped by MCS.
    
901.6  SSDEVO::T_GONZALES  Thu Jun 05 1997 13:29  5
    A blinking fault light is set either by the LOCATE command or
    because the HSZ has put the device into the failedset. If the
    fault light is on steady and not blinking, the device may
    never have become ready to the HSZ. Did this condition occur after
    a power-up or a reboot of the HSZ?
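The LED interpretation given in 901.4 and 901.6 can be summarized as a small lookup, sketched here in Python. It only restates what those replies say and is not a description of HSOF internals. The names FaultLed, POSSIBLE_CAUSES and possible_causes are hypothetical, chosen purely for illustration.

```python
from enum import Enum
from typing import Dict, List


class FaultLed(Enum):
    """Observed state of the shelf fault (orange) LED. Hypothetical names."""
    OFF = "off"
    BLINKING = "blinking"
    STEADY = "steady"


# Candidate explanations taken only from replies 901.4 and 901.6.
# Assumption: this is a troubleshooting aid, not an exhaustive or
# authoritative list of controller behavior.
POSSIBLE_CAUSES: Dict[FaultLed, List[str]] = {
    FaultLed.OFF: [],  # the thread gives no cause for an unlit LED
    FaultLed.BLINKING: [
        "LOCATE command issued from the CLI",
        "HSZ has put the device into the failedset",
    ],
    FaultLed.STEADY: [
        "device may never have become ready to the HSZ",
        "fault LED may have become falsely latched",
    ],
}


def possible_causes(led: FaultLed) -> List[str]:
    """Return the candidate explanations for the given LED state."""
    return list(POSSIBLE_CAUSES[led])


if __name__ == "__main__":
    for cause in possible_causes(FaultLed.STEADY):
        print(cause)
```

For the case in this topic (LED reported ON solid, disk not in the failedset), the lookup returns the two steady-state candidates, which is why the power-up or reboot question matters.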
901.7  customer confidence  TROFS::S_OSSOWSKI  STEPHEN OSSOWSKI @TRO  Thu Jun 05 1997 14:07  14
    The fault was discovered by the customer because they were having some
    sort of application failure and decided to go look at the arrays.
    
    The exact timing of this fault is in question. I believe that if it
    occurred shortly before the application failure (no more than a few
    seconds or minutes prior), it just might be related; if not, then it is
    a prior failure with nothing to do with the app. That reasoning is based
    on the fact that it was NOT in the failedset and no one looked at the
    mirror status info.
    
    The customer at the moment believes that this 0+1 setup did not provide
    them with DATA RELIABILITY. I am trying to calm the storm.
    
    Does this make sense?