
Conference aosg::lsm

Title: LSM
Moderator: SMURF::SHIDERLY
Created: Mon Jan 17 1994
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 803
Total number of notes: 2852

764.0. "Failed LSM disk freezes UNIX" by BPSOF::TELEKI (Laszlo Teleki) Tue Mar 11 1997 15:04

One of our customers has two A4100s in ASE configuration. The O/S is Digital
UNIX v3.2G, the application is SAP on Oracle. The database is installed on
mirrored LSM (1.2A) volumes. These volumes are created from RZ29B pairs hanging
on KZPSAs (A10).

The following strange situation occurred:

An RZ29B failed in a mirrored volume comprised of two RZ29Bs. This disk error
resulted in a system hang. (No process could be started, but the system could
be pinged from the network and probably from the SCSI buses, because the other
ASE member saw the services running on the frozen machine as online.) The only
thing I could do was to press the halt button. (Here I should've forced a crash
dump but I forgot to do so.) After a little struggling I could replace the
failing disk and restart both systems.

I've read in the notesfiles that during reads/writes LSM waits for the
underlying SCSI driver to give back status. If no status comes back then LSM
hangs. I think the SCSI driver should've timed out to let LSM push the bad plex
out of the volume and carry on normal operation with a reduced volume. This
whole thing should have been invisible for the users (except a warning message
in the syslog).
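
To make the behaviour I expected concrete, here is a minimal sketch in Python.
It only illustrates the timeout-and-detach idea described above; it is not LSM
or SCSI driver code. The class names, the plex names and the timeout value are
all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class Plex:
    """One side of a mirror. `hung` simulates a disk that never returns status."""

    def __init__(self, name, hung=False):
        self.name = name
        self.hung = hung
        self.blocks = {}
        self.release = threading.Event()  # lets the demo unblock the hung write

    def write(self, block, data):
        if self.hung:
            self.release.wait()  # no status comes back until released
            raise IOError(f"{self.name}: device failed")
        self.blocks[block] = data
        return True


class MirroredVolume:
    """Writes to every active plex; a plex that times out is detached."""

    def __init__(self, plexes, timeout=0.2):
        self.active = list(plexes)
        self.detached = []
        self.timeout = timeout  # hypothetical value, in seconds
        self.pool = ThreadPoolExecutor(max_workers=8)

    def write(self, block, data):
        pending = [(p, self.pool.submit(p.write, block, data)) for p in self.active]
        for plex, future in pending:
            try:
                future.result(timeout=self.timeout)
            except (FutureTimeout, IOError):
                # Timed out or failed: drop the plex, carry on with a reduced volume.
                self.active.remove(plex)
                self.detached.append(plex)
                print(f"warning: plex {plex.name} detached from volume")
        if not self.active:
            raise IOError("no plexes left in volume")
        return len(self.active)


# Demo: one healthy plex, one that hangs on every write.
good = Plex("vol01-01")
bad = Plex("vol01-02", hung=True)
vol = MirroredVolume([good, bad])
remaining = vol.write(0, b"data")         # bad plex times out and is detached
remaining_after = vol.write(1, b"more")   # later writes only touch the good plex
bad.release.set()                         # unblock the simulated hang for cleanup
vol.pool.shutdown()
```

In this sketch the first write costs one timeout period, after which the volume
keeps running on the surviving plex and the only visible sign is the warning
line.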

The customer says he paid for ASE and LSM to have a highly available computing
environment and to avoid situations like the one above (rightly so). He wants
us to give him a position statement about it and a guarantee that it will not
recur.

I know it is quite hard to say anything (especially without crash dump) but any
suggestion would be highly appreciated. Thank you in advance.

Regards,

Laszlo
T.R  Title  User  Personal Name  Date  Lines
764.1  LEXSS1::GINGER  Ron Ginger  Tue Mar 18 1997 13:19  16
    My customer had a similar situation. One of the Y cables was bad in
    such a way that one system was prevented from reaching a shared bus. It
    would get into a reset/retry mode which would hang the other system. We
    were able to force a crash dump on the system that was hung, and all
    analysis showed it seemed to be fine. We had this as an IPMT, but were
    never able to solve it. When the cable was proven to be bad the case
    was closed.
    
    There are ways for one member of an ASE pair to keep resetting the bus
    such that the other member will not get any work done. It won't crash,
    and it won't log any errors, and if the failing machine ever stops
    resetting the bus it will resume work. It could easily just drop that
    one plex from LSM but it never tries.
    
    I gave up trying to get anyone in engineering interested in solving
    this. 
764.2  Re: .1  NETRIX::"srn@rio.zk3.dec.com"  Tue Mar 18 1997 15:57  8
> I gave up trying to get anyone in engineering interested in solving
> this.

If you haven't already, I would suggest filing a QAR on gorge.
See http://www-notes.lkg.dec.com/aosg/lsm/165.0 for more details.


[Posted by WWW Notes gateway]