
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

1994.0. "Oracle and AdvFs failover problem" by HERON::LAFORGUE (It works better when you plug it!) Wed Apr 09 1997 09:37

Oracle failover with ASE and AdvFS file systems

While demonstrating Oracle failover during a benchmark, we encountered a problem
with AdvFS: if the start script runs too long and exceeds the time-out
(60s by default), ASE tries to stop the service. It removes the domain entries
in /etc/fdmns but cannot dismount the file systems that are in use. This leaves
AdvFS in a strange state from which you can recover, but only if you make no
mistakes.

The configuration is a TruCluster Available Server (ASE) of 2 machines:
- a database server (deion) 8400 6CPU, 4GB
- an application server (laerte) 4100 4CPU, 4GB
Digital UNIX V4.0B
TCR -ASE 1.4

Failover of the database server
The database service is a disk service on AdvFS file systems with a start script
and a stop script.
These scripts start and stop the database and modify the /etc/tnsnames.ora file
to redirect the client-server queries to the right machines.

The first time we tried it, the database was set up with a huge checkpoint
interval. This means that when the database was stopped abruptly along with the
server (Ctrl/P on the console), it needed a long recovery time (about 15
minutes). The start script therefore timed out while the database was
recovering, and ASE decided to stop the service. It tried to dismount the AdvFS
file systems (which failed because they were busy) but still managed to remove
the corresponding entries in /etc/fdmns. At this point we had a database in its
recovery phase on AdvFS file systems that were supposed not to exist but were
actually mounted.
The correct thing to do then would have been to wait for the end of the
recovery, shut down the database, dismount the file systems and then start the
service again.
But, following Murphy's law, I made a second mistake: thinking that the service
was stopped, I decided to restart the other machine, just to get it ready
for the next tests. That was a very bad idea: this machine, the database server,
was the favored member of the service.
So when it came up, it tried to start the service and mount the AdvFS file
systems that were currently being written by the recovering database.
The result was an AdvFS panic on the rebooted machine when it tried to mount
disks in use by the other machine, and a corrupted table in the database that
was recovering.
Trying to fix the problem, I shut down the database on the backup machine and
dismounted the AdvFS file systems, and then I could not mount them anywhere.
I had to recreate the entries in /etc/fdmns manually, run verify on each
AdvFS file system and then dismount them. Finally I corrected the start script
of the service to start the database in the background, so that it never hits
the time-out, and then it was OK.

Conclusion:
You need to be a very experienced AdvFS user to recover from this kind of
problem. The highest risk is that the system manager panics, makes a bad
decision and corrupts the file system or just the consistency of the database.

Some suggestions:
- do not restart the crashed machine before being sure that the service is OK
on the failover machine. If you restart it too early you enter an unexpected
situation with a high risk for the consistency of your data.
- check that the file systems are correctly mounted or dismounted.
- for AdvFS, you must be able to recreate the entries in /etc/fdmns manually.
- if you have a complex AdvFS setup, for example several domains or
multi-volume domains, make a tar of the corresponding directories once they
are active.
- a suggestion for engineering: ASE should not remove the domain entries from
/etc/fdmns if it cannot dismount the file systems. When this happens,
you still have a live, mounted file system, but you cannot do any maintenance
operation on it, such as showfdmn, because the domain entry no longer exists in
/etc/fdmns.
- in the start script, if you cannot guarantee that the startup operation will
finish within the time-out, start it in the background.
This was an easy workaround for a demo, but it is not satisfactory:
what should we do if the startup fails after a long time? For ASE the service
is started as soon as the start script finishes successfully, but for the users
it is not.
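
The suggestion above about keeping a tar of the /etc/fdmns directories could
look roughly like the sketch below. It is only an illustration: the function
names save_fdmns and restore_fdmns are hypothetical, and it assumes that each
domain directory under /etc/fdmns holds symbolic links to the volume devices,
which tar stores as links by default. The archive path must be absolute.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ASE or AdvFS): keep a copy of the
# AdvFS domain directories so they can be put back if they get removed.
FDMNS_DIR=${FDMNS_DIR:-/etc/fdmns}

# save_fdmns ARCHIVE DOMAIN...
# ARCHIVE must be an absolute path because we cd into FDMNS_DIR first.
save_fdmns() {
    archive=$1
    shift
    # Without -h, tar records the symbolic links themselves,
    # not the devices they point to.
    (cd "$FDMNS_DIR" && tar cf "$archive" "$@")
}

# restore_fdmns ARCHIVE
# Recreates the saved domain directories and their links under FDMNS_DIR.
restore_fdmns() {
    (cd "$FDMNS_DIR" && tar xf "$1")
}
```

Run save_fdmns once the domains are active, on every member that can run the
service, and keep the archive outside the failing-over disks.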

So here we have a design problem: what should we do with startup operations
that take an unpredictable time to complete?
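
This does not answer the design question, but one way to make a backgrounded
startup at least observable is to record its process ID so that other scripts
can ask whether it is still running. The sketch below is an assumption, not
something ASE provides: the helper names and the PID file convention are
hypothetical. When the command is wrapped in su, the recorded PID is that of
su, not of the database, and a PID can be reused after the process exits.

```shell
#!/bin/sh
# Hypothetical helpers for a start action script.

# start_in_background PIDFILE LOGFILE COMMAND [ARGS...]
# Launches COMMAND in the background and returns immediately, so the
# caller can exit 0 well within the ASE time-out.
start_in_background() {
    pidfile=$1
    logfile=$2
    shift 2
    "$@" > "$logfile" 2>&1 &
    echo $! > "$pidfile"
}

# startup_running PIDFILE
# Succeeds while the process recorded in PIDFILE still exists.
startup_running() {
    [ -s "$1" ] || return 1
    pid=`cat "$1"`
    kill -0 "$pid" 2>/dev/null
}
```

The start script would call start_in_background with the su command line and
then exit 0; whether the database actually came up still has to be checked
separately, for example by reading the log file.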

Here is my startup script:
# *****************************************************************
# * *
# * Copyright (c) Digital Equipment Corporation, 1991, 1996 *
# * *
# * All Rights Reserved. Unpublished rights reserved under *
# * the copyright laws of the United States. *
# * *
# * The software contained on this media is proprietary to *
# * and embodies the confidential technology of Digital *
# * Equipment Corporation. Possession, use, duplication or *
# * dissemination of the software and media is authorized only *
# * pursuant to a valid written license from Digital Equipment *
# * Corporation. *
# * *
# * RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure *
# * by the U.S. Government is subject to restrictions as set *
# * forth in Subparagraph (c)(1)(ii) of DFARS 252.227-7013, *
# * or in FAR 52.227-19, as applicable. *
# * *
# *****************************************************************
# @(#)$RCSfile: startAction.sh,v $ $Revision: 1.2.2.2 $ (DEC) $Date: 1995/01/27 22:53:32 $

#
# A skeleton example of a start action script.
# original path: /var/opt/TCR140/ase/lib/startAction

PATH=/sbin:/usr/sbin:/usr/bin
export PATH
ASETMPDIR=/var/ase/tmp

if [ $# -gt 0 ]; then
        svcName=$1              # Service name to start
else
        svcName=
fi

# in the case where the command must be run by a particular user:
# for example the command startup is in the oracle path
# (note the space before the backslash: without it the shell would read
# "startup0.log2>&1" and stderr would not be redirected)
su - oracle -c /usr/users/oracle/bin/startup > /usr/users/oracle/startup0.log \
        2>&1 &
# if the command is run as the user root, give the full pathname:
# /usr/myase/beispiel.start
#
# Any non zero exit will be considered a failure.
#
exit 0

T.R	Title	User	Personal Name	Date	Lines
1994.1		KITCHE::schott	Eric R. Schott USG Product Management	`Thu Apr 10 1997 17:11`	3
	The other option is to have the start script complete more quickly by backgrounding the application start if possible.
1994.2	also stop script must be changed	TRN02::ALMONDO	Quid ut UNIX ?	`Fri Apr 11 1997 20:30`	7
	One more thing: in the stop script you will have to check whether the background portion of the start script is still running.
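
A minimal sketch of such a check, assuming the start script wrote the PID of
its background job to a file (the start script shown in 1994.0 does not do
this, so the PID file and the function name wait_for_pidfile are hypothetical):

```shell
#!/bin/sh
# wait_for_pidfile PIDFILE MAX_SECONDS
# Returns 0 once the process recorded in PIDFILE has exited (or if there
# is no PID file at all), and 1 if it is still alive after MAX_SECONDS.
wait_for_pidfile() {
    pidfile=$1
    remaining=$2
    [ -s "$pidfile" ] || return 0
    pid=`cat "$pidfile"`
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=`expr "$remaining" - 1`
    done
    rm -f "$pidfile"
    return 0
}
```

A stop script could call this before shutting down the database and exit
non-zero when it returns 1. MAX_SECONDS has to stay below the time-out that
ASE applies to the stop script itself.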