
Conference decwet::advfs_support

Title:AdvFS Support/Info/Questions Notefile
Notice:note 187 is Freq Asked Questions;note 7 is support policy
Moderator:DECWET::DADDAMIO
Created:Wed Jun 02 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1077
Total number of notes:4417

1061.0. "ftx_bfmeta_rec crash kills 3 node cluster" by WASHDC::mccain.cop.dec.com::sarasin () Mon May 19 1997 22:50

Hello All, 

I just got off the phone with a customer whose entire ASE had crashed. 
Each system had crashed with this error:
ftx_bfmeta_rec_redo: got bmt page N1 instead of N2

I found a note in comet that related this to a bug in simport. 
As far as I can tell from the note, simport is the driver for the KZPSA.  
The note also indicated there was a patch for this. I found a patch for 
all versions except 4.0b, which is the version we are using. 
The comet entry led me to suspect a corrupted domain, which I found. 
I deleted and recreated the domain and was able to get the cluster up. 
As best I can tell this is what happened. 

Node 1 corrupted the domain and crashed. The ASE tried to fail the service 
to node two. Node two tried to mount the domain and crashed. 
ASE then switched the service to node three. There you have it: three dead 
cluster members. 
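
The cascade described above can be sketched as a toy model. This is only an illustration of the sequence as reported, not ASE or AdvFS code: the names Node, mount, fail_over and domain_corrupt are hypothetical, and the model treats node 1's crash the same as the later mount-time crashes.

```python
# Toy model of the reported failover cascade.
# All names here are hypothetical; none of this is an ASE or AdvFS interface.

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True

    def mount(self, domain_corrupt):
        # Assumption taken from the report: touching the corrupted
        # domain panics the member instead of failing the mount cleanly.
        if domain_corrupt:
            self.up = False
        return self.up


def fail_over(nodes, domain_corrupt):
    """Offer the service to each member in turn; return the members that crashed."""
    crashed = []
    for node in nodes:
        if node.mount(domain_corrupt):
            break  # service is now running on this member
        crashed.append(node.name)
    return crashed


members = [Node("node1"), Node("node2"), Node("node3")]
print(fail_over(members, domain_corrupt=True))
```

With a clean domain the loop stops at the first member; with a corrupted one, every member that is offered the service goes down in turn.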

I have also noted a large number of CAM errors in the members' logs 
that may or may not be related. I am not sure because the HSZs 
had to be reset after the system crashes for a node to see the 
shared disks. 

Here is the setup. Three 4100s with 3 KZPSAs connected to three HSZ50s,
one on each KZPSA. TruCluster Available Server 1.4. Digital UNIX 4.0b. 
As best I can tell, all systems have the current versions of firmware.

T.R | Title | User | Personal Name | Date | Lines
1061.1 | How can we help you? | UNIFIX::HARRIS | Juggling has its ups and downs | Wed May 21 1997 10:22 | 22
    I'm not sure if there is a question in this post or if you are just
    passing along information on the problems that broken hardware can
    cause?
    
    If you are asking for assistance on analyzing the ftx_bfmeta_rec_redo
    panic, then I would suggest that you submit the crash-data file(s) to
    the CANASTA server to see if there are any existing cases, IPMTs, or
    patches related to your problem (see note 8919 in TURRIS::UNIX_DIGITAL
    notes conference).
    
    Also look at the CAM errors.  If hardware has corrupted the data,
    there is not a lot that software can do.
    
    However, it would be nice if AdvFS didn't panic the entire system if it
    could not mount a file set in a domain (assuming the file set is not
    the root file system).
    
    While there is nothing AdvFS can do if the hardware corrupts the data,
    you may want to open an IPMT case asking that the corrupted domain not
    panic the entire system when it is failed over in an ASE environment. 
    
    					Bob Harris
1061.2 | I was trying to see if the simport patch was in 4.0b | WASHDC::mccain.cop.dec.com::sarasin | | Thu May 22 1997 21:35 | 13
What I was really looking for is to see if the simport patch 
was included in 4.0b. I do not have access to the crash data, 
nor could I send it in if I did. The site is a secure facility
and would not allow it. From what I can tell by looking on comet,
this has happened on several other nodes running 4.0a, and the fix
was this patch. I am working on getting them to give me the crash
data file, but so far I have not been successful. I am also sending
them the 4.0b patches, as several AdvFS patches are included. I did
not, however, find any mention of a simport.o patch in the readme, which
is what led me to post. I would also like to hear from anyone else
who may have had to deal with this crash, as the customer is looking at 
buying 1.5 TB more of disks if I can keep him happy :)

1061.3 | it's not in 4.0b yet | RHETT::MOORE | | Fri May 23 1997 08:55 | 10
    The simport patch has not been officially released in any patch
    kit for 4.0b.  You can pick it up from a number of sites that
    have been mentioned in this NOTES file.  The only one I know
    off the top of my head (since it's my site :) is
    
    	decatl.alf.dec.com:/patches/misc/simport_patches/simport_v40b.tar
    
    Martin Moore
    Digital UNIX Support Group
    Atlanta CSC
1061.4 | thanks for the pointer | WASHDC::mccain.cop.dec.com::sarasin | | Wed May 28 1997 11:39 | 6
Thanks for the pointer. I will send it down with the official patch kit. 
I am also very close to getting them to give me the crash data file.
I will post it to CANASTA to see if this is a new crash or not.

Thanks again for the pointer, 
Sam