Conference smurf::ase

Title: ase
Moderator: SMURF::GROSSO
Created: Thu Jul 29 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2114
Total number of notes: 7347

1982.0. "ASE not detecting simulated failure?" by NETRIX::"mcdonald@decatl.alf.dec.com" (John McDonald) Thu Apr 03 1997 20:27

I have a customer that's trying to demo an ASE config to their customer. It's
two 8400s running ASE v1.4 with 6 shared SCSI busses on KZPSAs. They've
set up a Sybase service that accesses all 6 of the shared busses, which have
HSZs on them. No LSM mirroring.

They are trying to demo ASE's failover capability, and one of the tests
they're doing is to disconnect the SCSI cable from one of the KZPSAs
to simulate a KZPSA failure. However, when they pull the cable, nothing
happens. No errors show up in the daemon.log file and the service
doesn't fail over. They mentioned that the service may not actually be
doing anything (there aren't any clients yet), so I suggested that they
pull the cable and enter a 'disklabel -r' command for one device on the
shared bus they disconnected, to simulate a disk access. Still nothing
happens.

They waited several minutes after pulling the cable and still nothing
showed up. What could cause ASE NOT to see a failure?

John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]
T.R | Title | User | Personal Name | Date | Lines
1982.1 | Humph ..... worked on my systems | NETRIX::"myrdal@zk3.dec.com" | Gregory P. Myrdal | Thu Apr 03 1997 20:45 | 36
John,

Not sure what to say.  I agree that what you did should have worked.
Note, however, that the aseagent registers itself to get I/O errors.
Thus, if nothing is going on, the service will not be moved.  But
disklabel reads from the physical disk, so I gave it a try on my system
(running post V1.4) and it worked OK for me.

Did they run the disklabel command against a disk within the HSZ that
ASE is not aware of?  I.e., if you gave drive rz17 to ASE, do a disklabel
on it.  The agent will be registered for I/O errors on this drive.
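
To make the point above concrete, here is a toy model in Python. It is not ASE code: the class, its methods, and the device name rz99 are all made up for illustration. It only shows why a pulled cable can go unnoticed: no alert is raised unless an I/O actually fails, and only for a device the agent registered for.

```python
# Toy model only: not ASE code. Class and method names are hypothetical.
class ToyAgent:
    def __init__(self, service_devices):
        # Devices that belong to a service; the agent watches only these.
        self.registered = set(service_devices)
        self.alerts = []

    def on_io_error(self, device):
        """Called when an I/O to `device` fails. Returns True if an alert was raised."""
        if device not in self.registered:
            return False  # error on a device the agent never registered for
        self.alerts.append(f"device access failure on {device}")
        return True


agent = ToyAgent(["rz17"])
# Cable pulled, but no I/O issued: on_io_error is never called, so no alert.
idle_alerts = list(agent.alerts)
# I/O to a device outside any service fails (rz99 is a made-up name): ignored.
ignored = agent.on_io_error("rz99")
# I/O to a registered device fails: the agent raises an alert.
raised = agent.on_io_error("rz17")
```

In this model the only path to an alert is a failed I/O on rz17, which is why the disklabel test has to target a device that is part of a service.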

You could also just create a filesystem and make a change to a file
on it.

Following is the output of my test from daemon.log after doing a
disklabel -r rz17.

-- Greg

Apr  3 16:34:07 greg2 ASE: fgreg1 Agent ***ALERT: device access failure on /dev/rz17a from fgreg1
Apr  3 16:34:10 greg2 ASE: fgreg1 Agent Error: can't unreserve device
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent Warning: AM can't ping /dev/rz17a
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent Warning: can't reach device '/dev/rz17a'
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent Info: exec'ing with pipe: /var/ase/sbin/ase_run_sh 15583
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent ***ALERT: possible device failure: /dev/rz17a
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent Error: can't unreserve device /dev/rz17a
Apr  3 16:34:13 greg2 ASE: fgreg1 Agent Notice: can't unreserve disk's devices, stopping it anyway

[Posted by WWW Notes gateway]
1982.2 | I'll give it a shot... | NETRIX::"mcdonald@decatl.alf.dec.com" | John McDonald | Thu Apr 03 1997 22:55 | 17
Greg,

thanx for the reply. I'm not able to get direct access to the system, so
I have to rely on what I'm told. I'll double check tomorrow that
the device they did the disklabel on really was part of a service.

BTW - I want to double check something. I'm under the impression that
as long as ase can ping other members over at least 1 SCSI bus, it won't
generate an alert, even if the other 5 break. That's the behavior I've
seen in the past and that's what they saw here.

Once again, Thanx.

John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]
1982.3 | rz40 was part of a service | NETRIX::"garman@mail.dec.com" | Clair Garman | Thu Apr 03 1997 23:05 | 16
I am the customer (DEC employee at AOL) for which John posted the note.

Sybase is using raw disks.  rz40b, rz40c, rz40d are raw partitions
being used by one disk service.  We altered the default partitions.

The service is running on dec02.  A disklabel to rz40 works fine.
We ran a script that issues disklabel commands to rz40 in a loop,
then disconnected the KZPSA cable to that bus.  The disklabel command
stalled with no output.  Neither daemon.log nor DECevent shows any
notice of the disconnection.

I aborted the disklabel command and tried a dd command from rz40.
It stalled as well.
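
A test like this is easier to read if each probe is timed, so a stall shows up as an explicit result rather than a hung terminal. Below is a minimal Python sketch of that idea, not the script used at the site. The function name probe and the stand-in commands are assumptions; on the real system the command would be something like disklabel -r on rz40.

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s):
    """Run cmd once; return (status, elapsed_seconds).

    status is "ok", "error" (non-zero exit), or "stalled" (no result in time).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout_s)
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        status = "stalled"
    return status, time.monotonic() - start


# Stand-in commands so the sketch runs anywhere; they are not disk commands.
fast = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

fast_status, _ = probe(fast, timeout_s=20)
slow_status, slow_elapsed = probe(slow, timeout_s=1)
```

One caveat: a process blocked in the kernel on an outstanding SCSI command may not exit when killed, so against a truly hung device the probe itself could block. The sketch only makes the stall and its duration visible; it does not make detection any faster.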

Clair Garman
[Posted by WWW Notes gateway]
1982.4 | Problem solved. | NETRIX::"mcdonald@decatl.alf.dec.com" | John McDonald | Fri Apr 04 1997 15:57 | 13
Problem solved. It turns out that they weren't waiting long enough for
ASE to detect the failure; it took almost 2 minutes for the error
to show up. Since the system is going to be demo'd to a customer,
I suggested that they consider modifying the timeout values using
/etc/hsm.conf, with the usual warning about possible false alerts
showing up.

Thanx for the replies.

John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]
1982.5 |  | XIRTLU::schott | Eric R. Schott USG Product Management | Fri Apr 04 1997 16:23 | 8
Hi

The timeout problem may be in the CAM driver, not in ASE.  You may
find that changing /etc/hsm.conf won't fix this.  You may need to
QAR/IPMT this...

I would not close it quite yet...

1982.6 |  | dust.zk3.dec.com::Marshall | Rob Marshall USEG | Fri Apr 04 1997 17:18 | 11
Hi,

Eric is right, the timeouts are in the CAM layer, and there is
nothing in hsm.conf that you can change that will help.  Plus,
there are changes being made (not sure, but they *may* be in 
PTmin - 4.0c) that will fail a device that is not answering 
much more quickly (somewhere around 15 seconds).  But, don't
quote me on the version for this change.

Rob

1982.7 | Confusion | NETRIX::"mcdonald@decatl.alf.dec.com" | John McDonald | Fri Apr 04 1997 20:55 | 12
Eric & Rob,

I'm confused. Are you saying that the changes in /etc/hsm.conf will have
no effect at all, or that they won't have any significant effect in this
case? The reason I'm confused is that I've used hsm.conf before, and it
can make a difference. Also, according to the source, HSM replaces its
internal values with those specified by hsm.conf.

John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]
1982.8 | things are improving? | namix.fno.dec.com::jpt | FIS and Chips | Mon Apr 07 1997 08:20 | 12
	As previous replies state, the problem may not be the ASE timeout
	itself, but the underlying SCSI CAM driver layer, which seems not to
	notice the error soon enough. And until CAM sees the problem, ASE
	can do absolutely nothing to solve it!!!

	I'm glad to hear that someone has put some effort into this, as a
	similar problem was first reported almost two years ago, and
	again one year later with both LSM and ASE. This will solve some
	issues we've been fighting in a couple of customer cases.

		-jari
1982.9 |  | SMURF::KNIGHT | Fred Knight | Tue Apr 08 1997 17:39 | 32
The exact failure code path followed in the CAM driver is
very dependent on exactly what the failure is.  Removing
a device, for example, may be similar to disconnecting a
cable, but then again, it may not.  It depends on what
else is going on out on the SCSI bus at the time, on
what adapter is being used, and on a number of other
items.

Consider a device that is removed from an idle bus and had NEVER
been used.  When you first access it, we will notice fairly quickly.
Now take a device that is in use, and remove it immediately after
a command has been sent to it.  We sent a command, so we wait
for the command to complete.  For some devices it is legal to
take 60 seconds to complete some commands.  So, if after
60 seconds it isn't done, we abort the command and try again
(and we do this several times).  You then end up with a
several-minute detection time for the removal of that
particular device.
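
The arithmetic behind that several-minute figure can be sketched in Python as follows. This is a simplified model, not the CAM driver's actual logic: it assumes every attempt waits out the full command timeout, and the retry counts are illustrative guesses because "several times" is not quantified here.

```python
def worst_case_detection_s(command_timeout_s, retries):
    """Seconds until the driver gives up: the first attempt plus each retry
    waits out a full command timeout."""
    return command_timeout_s * (1 + retries)


# 60 s is the per-command allowance mentioned above; the retry counts are
# assumptions chosen only to show how the total grows.
one_retry = worst_case_detection_s(60, retries=1)
three_retries = worst_case_detection_s(60, retries=3)
```

Under this model a single retry already gives 120 seconds, which is in the same range as the "almost 2 minutes" reported in 1982.4, though nothing in this thread confirms the real retry count.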

The basic problem is that failure detection is not predictable.

The goal of our future work is to make it more predictable.
It will never be 100%, but it will be more predictable than
it is today.

Why will it never be 100%?  Consider a device that is broken
in such a way that it accepts commands but NEVER executes them.
I think it unlikely that a device would break in such a way,
but if it does, it will take us a long time to figure it out.

	Fred Knight