
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

2075.0. "ftx_bfmeta_rec crash kills three-node cluster" by WASHDC::mccain.cop.dec.com::sarasin () Tue May 20 1997 17:11

Hello all, I cross-posted this to the AdvFS conference as well.
Does anyone know of a patch for this?

I just got off the phone with a customer whose entire ASE had crashed. 
Each system had crashed with this error:
ftx_bfmeta_rec_redo: got bmt page N1 instead of N2
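For anyone who wants to check saved console output or crash logs for this panic, here is a minimal sketch that pulls the two page numbers out of the message. The message layout is taken from the string quoted above. The helper name, the sample lines, and the assumption that N1 and N2 are printed as plain decimal integers are mine, not anything confirmed by the logs.

```python
import re

# Layout copied from the panic string quoted in this note.
# Assumption: N1 and N2 appear as unsigned decimal integers.
PANIC_RE = re.compile(r"ftx_bfmeta_rec_redo: got bmt page (\d+) instead of (\d+)")


def find_bmt_mismatches(lines):
    """Return a (got_page, expected_page) tuple for every matching line."""
    hits = []
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    "some unrelated console output",
    "ftx_bfmeta_rec_redo: got bmt page 17 instead of 42",
]
print(find_bmt_mismatches(sample))
```

Feeding it the lines of each member's log would show whether all three nodes reported the same pair of pages.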

I found a note in comet that related this to a bug in simport. 
As far as I can tell from the note, simport is the driver for the KZPSA.  
The note also indicated there was a patch for this. I found a patch for 
every version except 4.0b, which is the version we are using. 
The comet entry led me to suspect a corrupted domain, which I found. 
I deleted and recreated the domain and was able to get the cluster up. 
As best I can tell, this is what happened. 

Node one corrupted the domain and crashed. The ASE tried to fail the service 
over to node two. Node two tried to mount the domain and crashed. 
ASE then switched the service to node three, which crashed as well. There 
you have it: three dead cluster members. 
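The cascade described above can be sketched as a toy model. This is not how ASE is actually implemented; the function name and node names are hypothetical, and the model makes one simplifying assumption: any node that tries to mount the corrupted domain crashes.

```python
def fail_over(nodes, domain_corrupted):
    """Toy model of a service moving from node to node.

    Returns (crashed_nodes, serving_node). serving_node is None when
    every node has crashed and nothing is left to run the service.
    """
    crashed = []
    for node in nodes:
        if domain_corrupted:
            # Assumption: mounting the corrupted domain panics the node,
            # so the service is handed to the next member in the list.
            crashed.append(node)
            continue
        return crashed, node
    return crashed, None


nodes = ["node1", "node2", "node3"]
print(fail_over(nodes, domain_corrupted=True))
print(fail_over(nodes, domain_corrupted=False))
```

With a corrupted domain the loop runs through every member and returns no surviving node, which matches what the customer saw. With a healthy domain the first node simply keeps the service.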

I have also noted a large number of CAM errors in the members' logs 
that may or may not be related. I am not sure, because the HSZs 
had to be reset after the system crashes before a node could see the 
shared disks. 

Here is the setup: three 4100s with 3 KZPSAs connected to three HSZ50s,
one on each KZPSA. TruCluster Available Server 1.4. Digital UNIX 4.0b. 
As best I can tell, all systems have the current versions of firmware.

