Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

1906.0. "failover question" by KYOSS1::GREEN () Fri Feb 28 1997 13:12

    I have a question regarding failover.
    	3.2g with patches
    	ase 1.3 with patches
    	2 8400s
    
    	We are doing some testing for a customer, and here is the scenario:
    	dual-redundant HSZ40s (v3.0-2).
    	HSZ A mirrors HSZ B.
    
    	We pulled power on one redundant HSZ pair. We expected LSM to fail
    out the mirrors; instead we appear to get into a condition
    where ASE fails the service over to the other machine, and
    everything comes back fine.
    
    	We think what might have confused things was that the customer
    plugged the HSZ back in too soon.

T.R	Title	User	Personal Name	Date	Lines
1906.1		CSC32::KIRK		`Fri Feb 28 1997 16:17`	23
	Dick,

	I have also seen the same thing happen with v3.2g/ase1.3 with patches and 2 8200s.

	    sysA                  sysB
	    kzpsa0------hsz40-----kzpsa0
	    kzpsa1------hsz40-----kzpsa1

	The disks are LSM mirrored across the SCSI buses.

	We have seen this once when pulling the power on the hsz40 shelf. We see this every time the y-cable is pulled off the kzpsa to try to simulate a scsi bus failure.

	What we see in the daemon.log file is that lsm_lv_action times out and then ase tries to shut down and relocate the service. During the shutdown lsm_dg_action times out and the system is rebooted, which fails the service over to the other system.
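	The sequence reported in .1 can be written down as a small decision model. This is only an illustrative sketch in Python, not ASE source code: the function name, the `Outcome` labels, and the assumption that the service stays put when the mirror check answers in time are all hypothetical. Only the two timeouts and the resulting reboot come from the note.

```python
from enum import Enum


class Outcome(Enum):
    STAY = "service stays on the current member"
    RELOCATE = "service relocated without a reboot"
    REBOOT = "member rebooted, service fails over"


def model_failover(lv_check_timed_out: bool, dg_stop_timed_out: bool) -> Outcome:
    """Toy model of the sequence described in .1 (hypothetical, not ASE code)."""
    if not lv_check_timed_out:
        # Assumption: when the lsm_lv_action check answers in time,
        # no relocation is triggered in this scenario.
        return Outcome.STAY
    # lsm_lv_action timed out, so a shutdown and relocation is attempted.
    if dg_stop_timed_out:
        # lsm_dg_action also timed out during shutdown: the member is
        # rebooted, which moves the service to the other system.
        return Outcome.REBOOT
    return Outcome.RELOCATE
```

	In this model the behavior seen in .1 is `model_failover(True, True)`, which yields `Outcome.REBOOT`.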
1906.2	.0 and .1 are different, need daemon.log for .0	NETRIX::"myrdal@zk3.dec.com"	Gregory P. Myrdal	`Mon Mar 03 1997 14:02`	31
	It appears to me that the problems in notes .0 and .1 are different.

	.1
	The scsi cable was pulled off a system in which I/O was going to a drive off of this bus. Since the cable was pulled, the scsi bus is now in an unterminated state. We are at the mercy of the layers below us (i.e. cam, device drivers, hardware, etc). The scsi engineers tell me that you get unexpected results when you are dealing with an unterminated bus. In this case we timed out when we were trying to determine if we should relocate the system by asking if all disks within this service were mirrored. This requires access to the drives on the scsi bus which hung up. ASE did the correct thing by eventually relocating the service (via a force method) to the other system to keep it running.

	.0
	This is a better test of hardware failure as it does not unterminate the scsi bus. ASE should not have failed this service over in a correctly configured environment. If you include (or email me) the daemon.log covering the time at which you turned off the power to the hsz40, I might be able to give you an idea of what is going on or whether we have a problem. Please make sure informational logging is turned on first.

	What happens if the customer does not power the hsz40 back on for a long time?

	-- Greg
	[Posted by WWW Notes gateway]
1906.3		USCTR1::ASCHER	Dave Ascher	`Mon Mar 03 1997 18:17`	44
	re: <<< Note 1906.2 by NETRIX::"myrdal@zk3.dec.com" "Gregory P. Myrdal" >>>
	    -< .0 and .1 are different, need daemon.log for .0 >-

	> It appears to me that the problems in notes .0 and .1 are different.

	Yes, they are...

	> .1
	> The scsi cable was pulled off a system in which I/O was going to a drive off of this bus. Since the cable was pulled, the scsi bus is now in an unterminated state. We are at the mercy of the layers below us (i.e. cam, device drivers, hardware, etc). The scsi engineers tell me that you get unexpected results when you are dealing with an unterminated bus. In this case we timed out when we were trying to determine if we should relocate the system by asking if all disks within this service were mirrored. This requires access to the drives on the scsi bus which hung up. ASE did the correct thing by eventually relocating the service (via a force method) to the other system to keep it running.

	I agree that you are 'at the mercy of the layers below', and the scenarios are also different due to the fact that in .0 there is still a path between the two systems over this scsi while in .1 there is not. However, the problem is that your current logic is not robust enough to deal with this situation. Without ASE, a system can keep working fine with the cable pulled out of a KZPSA. With ASE, the same should be true. If it is a matter of longer timeouts required on the lsm_lv_action script or timeouts on the vold show diskgroup and voldisk list commands within that script, then that's what needs to be done. ASE should not be forcing a failover when one is not necessary...

	Assuming that this once worked, perhaps changes in the behavior of the HSZ or LSM have not been responded to by ASE yet?

	btw we also tried this test with a terminator stuck onto the kzpsa. That made no difference.

	An IPMT is on the way.

	dave
1906.4	clarification of .0	KYOSS1::GREEN		`Mon Mar 03 1997 18:37`	7
	The problem reported in .0 was pulling power on the HSZ box. We did this twice. The first time we left the HSZ down and there was NO FAILOVER. During the second test (different pair of HSZs, same firmware), the power was re-applied to the HSZs (possibly prematurely) and the service failed over.

	dick
1906.5	Timeouts do not help .... pulling scsi cables were NEVER a supported failure case	NETRIX::"myrdal@zk3.dec.com"	Gregory P. Myrdal	`Mon Mar 03 1997 19:56`	59
	Note .3 reads:

	> I agree that you are 'at the mercy of the layers below', and the scenarios are also different due to the fact that in .0 there is still a path between the two systems over this scsi while in .1 there is not. However, the problem is that your current logic is not robust enough to deal with this situation. Without ASE, a system can keep working fine with the cable pulled out of a KZPSA. With ASE, the same should be true.

	If you do not agree with ASE's decision to reboot the system, that is fine. The reason does not always lie in the hands of ASE. For example, a common cause of this is the umount command failing in this situation. If we had a forced umount we could have kept the system available and relocated the service. ASE engineering has worked for about 2 years to get a forced umount to avoid things like this. We are continuing to work harder with the base so that we can act correctly when we get a failure. This process is always slower than any of us would like.

	> If it is a matter of longer timeouts required on the lsm_lv_action script or timeouts on the vold show diskgroup and voldisk list commands within that script, then that's what needs to be done. ASE should not be forcing a failover when one is not necessary...

	Ah, no, this is not a matter of timeouts. I already tried that. It might actually work, but not in all cases. I am not a scsi engineer, and when I asked the scsi engineers about this they could not tell me exactly how long the timeout should be. It is nondeterministic.

	> Assuming that this once worked, perhaps changes in the behavior of the HSZ or LSM have not been responded to by ASE yet?

	It is not clear to me what once worked. If this is a regression in behavior (that we support), then please enter a QAR. This will be fixed in the next release. If something like this worked in the past, it's not because of our changes. It would have been because of the base. Once the QAR is entered we can determine what should be done with it (i.e. which group owns it).

	> btw we also tried this test with a terminator stuck onto the kzpsa. That made no difference.

	I heard about this. Someone would have to explain this to a scsi/cam engineer, and I am sure they could tell you what is going on at that layer. Of course, putting a terminator back into the kzpsa is not a real life example of a hardware failure.

	> An IPMT is on the way.

	Thank you.

	-- Greg
	[Posted by WWW Notes gateway]
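	The point in .5 that a longer timeout "might actually work, but not in all cases" can be illustrated with a generic timeout wrapper. This is a minimal Python sketch under stated assumptions, not how the ASE action scripts are implemented: `run_with_timeout` and `stand_in` are hypothetical helpers, and the sleeping child process merely stands in for a query that is stuck on a hung bus.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Return True if the command finished within timeout_s seconds, else False."""
    try:
        subprocess.run(
            argv,
            timeout=timeout_s,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return False


def stand_in(hang_s):
    """Hypothetical stand-in for a disk query: a child process that only sleeps."""
    return [sys.executable, "-c", f"import time; time.sleep({hang_s})"]
```

	A call such as `run_with_timeout(stand_in(0.1), 10.0)` completes, while `run_with_timeout(stand_in(30), 0.5)` gives up. Whatever fixed value is chosen for the timeout, a hang that lasts longer still trips it, which is why no single "correct" timeout could be named.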
1906.6		USCTR1::ASCHER	Dave Ascher	`Mon Mar 03 1997 21:01`	28
	> Of course, putting a terminator back into the kzpsa is not a real life example of a hardware failure.

	I don't want to waste a lot of time trying to verify that ASE is able to help systems survive problems that it cannot actually help with... or worse, finding that it makes systems less available than they would be without ASE. How can I find out what you guys consider 'legitimate' failure conditions so we can use those as a base for our testing in the field? If pulling the scsi cable out of a KZPSA is not a good simulation of a kzpsa failure (or of a cable failure), then what is? What do you use for testing?

	Clearly there are all kinds of conditions that can arise that make it impossible for the system to survive, and for which rebooting is the only possible way of getting the application available on another node. I imagine there are failure modes in a kzpsa that would play havoc with scsi, and others that would play havoc with PCI. This particular scenario doesn't seem all that complex or obscure. In fact it was the very first failure that I observed on a real site over 2 years ago, when one of our 'suits' tripped over a bundle of cables and they got pulled out of their scsi interface cards. Fortunately, the connectors were not secured, and fortunately there was LSM mirroring. Also fortunately, I guess, there was no ASE.

	d
1906.7	Try our QA group	NETRIX::"myrdal@zk3.dec.com"	Gregory P. Myrdal	`Tue Mar 04 1997 15:29`	12
	To get information about what our test group does, please contact someone in that group. If someone from that group reads this notes file, maybe you can post a pointer to the tests.

	Note: they may actually pull cables for some test cases; however, keep in mind that when they do this they are looking for specific results (which may cause the system to reboot).

	Cheers,
	-- Greg
	[Posted by WWW Notes gateway]
1906.8	Same kind of test ... same problems	NNTPD::"LopezJO@mail.dec.com"	Jose Ignacio Lopez	`Tue May 06 1997 09:03`	12
	Hello,

	Using the same configuration and the same tests, we got the same results.

	The customer needs to test a single SCSI failure in a redundant scenario (2 SCSI buses, mirrored with LSM). When only one SCSI cable is pulled, ASE shouldn't fail the service over to the other machine.

	Is there any way to avoid the timeout in the lsm_lv_action script?

	Thanks,
	Jose Ignacio
	[Posted by WWW Notes gateway]
1906.9	use a DWZZA	SMURF::MYRDAL		`Thu May 08 1997 13:19`	5
	Put a DWZZA on the scsi bus and turn it off. This will cause a path failure. -- Greg