Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

1906.0. "failover question" by KYOSS1::GREEN () Fri Feb 28 1997 13:12

    I have a question regarding failover.
    	3.2g with patches
    	ase 1.3 with patches
    	2 8400s
    
    	We are doing some testing for a customer, and here is the scenario:
    	dual-redundant HSZ40s (v3.0-2).
    	HSZ A mirrors HSZ B.
    
    	We pull power on one HSZ redundant pair. We expected LSM to fail
    out the mirrors; instead we appear to get into a condition
    where ASE fails over to the other machine and everything comes
    back fine.
    
    	We think what might have confused things was that the customer
    plugged the HSZ back in too soon.

1906.1. by CSC32::KIRK () Fri Feb 28 1997 16:17 (23 lines)

    Dick,
    
    I have also seen the same thing happen with v3.2g/ase1.3 with patches
    and 2-8200s.
    
    
    sysA                  sysB
    kzpsa0------hsz40-----kzpsa0
    
    kzpsa1------hsz40-----kzpsa1
    
    The disks are LSM mirrored across the SCSI buses.
    
    We have seen this once when pulling the power on the hsz40 shelf. We
    see this every time the y-cable is pulled off the kzpsa to try and
    simulate a scsi bus failure.
    
    What we see in the daemon.log file is lsm_lv_action times out and then
    ase tries to shut down and relocate the service. During the shutdown
    lsm_dg_action times out and the system is rebooted which fails the
    service over to the other system.
    
    

1906.2. ".0 and .1 are different, need daemon.log for .0" by NETRIX::"myrdal@zk3.dec.com" (Gregory P. Myrdal) Mon Mar 03 1997 14:02 (31 lines)

It appears to me that the problems in notes .0 and .1 are
different.

.1  The scsi cable was pulled off a system in which I/O was
   going to a drive off of this bus.  Since the cable was
   pulled the scsi is now in an unterminated state.  We are
   at the mercy of the layers below us (i.e. cam, device drivers,
   hardware, etc).  The scsi engineers tell me that you get
   unexpected results when you are dealing with an unterminated
   bus.  In this case we timed out when we were trying to 
   determine if we should relocate the system by asking if
   all disks within this service were mirrored.  This requires
   access to the drives on the scsi bus which hung up. ASE did 
   the correct thing by eventually relocating the service (via 
   a force method) to the other system to keep it running.

.0 This is a better test of hardware failure as it does not
   unterminate the scsi bus.  ASE should not have failed this
   service over in a correctly configured environment.  If
   you include (or email me) the daemon.log during the time
   in which you turn off the power from the hsz40 I might be
   able to give you an idea what is going on or if we have a
   problem.  Please make sure informational logging is turned
   on first.

   What happens if the customer does not power the hsz40 back
   on for a long time?

-- Greg 

[Posted by WWW Notes gateway]
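
The sequence described for .1 above (ask whether all disks in the service are mirrored, time out because the bus is hung, then relocate by force) can be sketched roughly as follows. This is only an illustration in Python, not ASE's code: the function names, the action strings, and the idea that a single command answers the "all mirrored?" question are assumptions made for the sketch. The "stay" and "relocate" branches are one reading of the note, which only states that the check is used to decide whether to relocate.

```python
import subprocess
import sys


def all_disks_mirrored(cmd, timeout_s):
    """Run a (hypothetical) mirror-status command.

    Returns True if it exits 0, False if it exits non-zero,
    and None if it does not finish within timeout_s seconds.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode == 0


def decide(mirrored):
    """Map the check result to an action (names are illustrative only)."""
    if mirrored is None:
        return "force-relocate"  # check hung, storage state is unknown
    if mirrored:
        return "stay"            # assumed: surviving mirrors carry the service
    return "relocate"


if __name__ == "__main__":
    # Stand-in for a status query that hangs on a dead bus.
    hung = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(decide(all_disks_mirrored(hung, timeout_s=0.5)))
```

In this sketch a larger timeout_s only delays the None result if the underlying query never returns; it does not change the outcome.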

1906.3. by USCTR1::ASCHER (Dave Ascher) Mon Mar 03 1997 18:17 (44 lines)

re:     <<< Note 1906.2 by NETRIX::"myrdal@zk3.dec.com" "Gregory P. Myrdal" >>>
              -< .0 and .1 are different, need daemon.log for .0 >-

It appears to me that the problems in notes .0 and .1 are
different.

    Yes, they are... 
    
.1  The scsi cable was pulled off a system in which I/O was
   going to a drive off of this bus.  Since the cable was
   pulled the scsi is now in an unterminated state.  We are
   at the mercy of the layers below us (ie. cam, device drivers,
   hardware, etc).  The scsi engineers tell me that you get
   unexpected results when you are dealing with an unterminated
   bus.  In this case we timed out when we were trying to 
   determine if we should relocate the system by asking if
   all disks within this service were mirrored.  This requires
   access to the drives on the scsi bus which hung up. ASE did 
   the correct thing by eventually relocating the service (via 
   a force method) to the other system to keep it running.

    I agree that you are 'at the mercy of the layers below', and the
    scenarios are also different due to the fact that in .0 there is
    still a path between the two systems over this scsi while in .1
    there is not. 
    
    However, the problem is that your current logic is not robust
    enough to deal with this situation. Without ASE, a system can keep
    working fine with the cable pulled out of a KZPSA. With ASE, the
    same should be true. If it is a matter of longer timeouts required
    on the lsm_lv_action script or timeouts on the vold show diskgroup
    and voldisk list commands within that script, then that's what
    needs to be done. ASE should not be forcing a failover when one is
    not necessary... 
    
    Assuming that this once worked, perhaps changes in the behavior of
    the HSZ or LSM have not been responded to by ASE yet? 
    
    btw we also tried this test with a terminator stuck onto the
    kzpsa. That made no difference.
    
    An IPMT is on the way.
    
    dave

1906.4. "clarification of .0" by KYOSS1::GREEN () Mon Mar 03 1997 18:37 (7 lines)

    	The problem reported in .0 was pulling power on the HSZ box.
    	We did this twice. The first time we left the HSZ down and there
    was NO FAILOVER.
    	During the second test (different pair of HSZs, same firmware),
    the power was re-applied to the HSZs (possibly prematurely) and
    the service failed over.
    			dick

1906.5. "Timeouts do not help .... pulling scsi cables were NEVER a supported failure case" by NETRIX::"myrdal@zk3.dec.com" (Gregory P. Myrdal) Mon Mar 03 1997 19:56 (59 lines)

Note .3 reads:

    I agree that you are 'at the mercy of the layers below', and the
    scenarios are also different due to the fact that in .0 there is
    still a path between the two systems over this scsi while in .1
    there is not. 
    
    However, the problem is that your current logic is not robust
    enough to deal with this situation. Without ASE, a system can keep
    working fine with the cable pulled out of a KZPSA. With ASE, the
    same should be true. If it is a matter of longer timeouts required

If you do not agree with the fact that ASE decides to reboot the system
that is fine.  The reason does not always lie in the hands of ASE.  For
example, a common case for this is the umount command failing in this
situation.  If we had a forced umount we could have kept the system
available and relocated the service.  ASE engineering has worked for about
2 years to get a forced umount to avoid things like this. 

We are continuing to work harder with the base such that we can act
correctly when we get a failure.  This process is always slower than
any of us like.

    same should be true. If it is a matter of longer timeouts required
    on the lsm_lv_action script or timeouts on the vold show diskgroup
    and voldisk list commands within that script, then that's what
    needs to be done. ASE should not be forcing a failover when one is
    not necessary... 

Ah, no this is not a matter of timeouts.  I already tried that.  It might
actually work, however, not in all cases.  I am not a scsi engineer, so
when I asked them about this they could not tell me exactly how long the
timeout should be.  It is nondeterministic.
    
    Assuming that this once worked, perhaps changes in the behavior of
    the HSZ or LSM have not been responded to by ASE yet? 

It is not clear to me what once worked.  If this is a regression of
behavior that we support, then please enter a QAR.  This will be
fixed in the next release.

If something like this worked in the past it's not because of our changes.
It would have been because of the base.  Once the QAR is entered we can
determine what should be done with it (i.e. which group owns it).

    btw we also tried this test with a terminator stuck onto the
    kzpsa. That made no difference.

I heard about this.  Someone would have to explain this to a scsi/cam
engineer and I am sure they could tell you what is going on at that
layer.  Of course, putting a terminator back into the kzpsa is not a
real life example of a hardware failure. 

    An IPMT is on the way.

Thank you.

-- Greg
[Posted by WWW Notes gateway]

1906.6. by USCTR1::ASCHER (Dave Ascher) Mon Mar 03 1997 21:01 (28 lines)

Of course, putting a terminator back into the kzpsa is not a
real life example of a hardware failure. 

    I don't want to waste a lot of time trying to verify that ASE
    is able to help systems survive problems that it cannot actually
    help with... or worse, finding that it makes systems less
    available than they would be without ASE.
    
    How can I find out what you guys consider 'legitimate' failure
    conditions so we can use those as a base for our testing in
    the field? If pulling the scsi cable out of a KZPSA is not
    a good simulation of a kzpsa failure (or of a cable failure)
    then what is? what do you use for testing?
    
    Clearly there are all kinds of conditions that can arise that can
    make it impossible for the system to survive and for which
    rebooting is the only possible alternative for attempting to get
    the application available on another node. I imagine there are
    failure modes in a kzpsa that would play havoc with scsi - and
    others that would play havoc with PCI. This particular scenario
    doesn't seem all that complex or obscure - in fact it was the very
    first failure that I observed on a real site over 2 years ago when
    one of our 'suits' tripped over a bundle of cables and they got
    pulled out of their scsi interface cards. Fortunately, the
    connectors were not secured and fortunately there was LSM
    mirroring. Also fortunately, I guess, there was no ASE. 

    d

1906.7. "Try our QA group" by NETRIX::"myrdal@zk3.dec.com" (Gregory P. Myrdal) Tue Mar 04 1997 15:29 (12 lines)

To get information about what our test group does please contact
someone in that group.  If someone from that group reads this notes
file, maybe you can post a pointer to tests.  Note: they may actually 
pull cables for some test cases, however, keep in mind when they do
this they are looking for specific results (which may cause the
system to reboot).

Cheers,

-- Greg
[Posted by WWW Notes gateway]

1906.8. "Same kind of test ... same problems" by NNTPD::"LopezJO@mail.dec.com" (Jose Ignacio Lopez) Tue May 06 1997 09:03 (12 lines)

Hello,

Using the same configuration and the same tests, we got the same results.
The customer needs to test a single SCSI failure in a redundant scenario
(2 SCSIs, mirrored with LSM); when only one SCSI is pulled, ASE shouldn't
fail over the service to the other machine.

Is there any way to avoid the timeout in the lsm_lv_action script?
Thanks
Jose Ignacio

[Posted by WWW Notes gateway]

1906.9. "use a DWZZA" by SMURF::MYRDAL () Thu May 08 1997 13:19 (5 lines)

    Put a DWZZA on the scsi bus and turn it off.  This will cause a path
    failure.
    
    -- Greg