
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

2099.0. "ASE BEHAVIOUR IN CASE OF SCSI BUS FAILURE" by SOSGPX::FIORINI () Thu May 29 1997 15:10

    Hi all,
     
    I have installed a DECsafe configuration with two ALPSRV 5/400, 
    and the Customer involved is one of the biggest in Italy.
    The operating system is Digital UNIX 4.0B with DECsafe 1.4.
    The hardware configuration is the following:
    
    
    SYSTEM 1                                       SYSTEM2
    +----------+            +-----+               +----------+
    |A1000  |K |            |BA356|               |A1000  |K |
    |5/400  |Z |            |     |               |5/400  |Z |
    |       |P |            |     |               |       |P |
    |       |S |            |     |               |       |S |
    |       |A |            |     |               |       |A |
    +----------+            +-----+               +----------+
             |                | |                          |
             |                | |                          |
             +----------------+ +--------------------------+
    
    
    In the BA356 there are two disks (the application disk and its
    mirror).
    The mirroring is done with LSM.
    
    During the system acceptance test the Customer performed some tests to
    observe the ASE behaviour in case of system failures.
    
    Assume that SYSTEM 1 is running the service and SYSTEM 2 is in
    stand-by.
    If the SCSI cable is disconnected from the KZPSA of SYSTEM 1 (the SCSI
    bus is still terminated via the Y cable), the AM notifies the HSM that the
    ping over the SCSI bus has timed out.
    THE SERVICE IS NOT REALLOCATED TO SYSTEM 2, WHICH CAN STILL ACCESS THE
    SHARED DEVICES, AND THE APPLICATION HANGS.
    If the cable is reconnected (after two minutes), the AM notifies the HSM
    that the ping over the SCSI bus is OK, and the application is
    automatically restarted.
     
    My conclusion is that the ASE does not react correctly in case of
    a SCSI BUS failure.
     
    Any idea if it is possible to change this unacceptable behaviour?
    
    Thanks to everybody who can help me.
    
    Regards 
    
    Moreno Fiorini
    
T.R  Title  User  Personal Name  Date  Lines
2099.1. "any disk access failures ???" by BACHUS::DEVOS (Manu Devos NSIS Brussels 856-7539) Thu May 29 1997 22:08, 16 lines
    
    My first reaction to your note is that you have a single point of
    failure in your drawing, as there is only one SCSI bus.
    
    Now, concerning the described behaviour, you must provide us with more
    information. As you correctly mentioned, the AM detects when the
    "other" member is no longer responding to SCSI "pings". BUT, this is
    NOT enough to cause a failover. After all, maybe this bus is not used
    by any service running on that host. A failover is only started if an
    access to shared data fails, and only if that shared data is not
    accessible from another disk (i.e. from another plex of the LSM
    volume). So, you should tell us whether you also receive a notification
    that ASE is not able to access a specific disk.
    
    Regards, Manu
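
    The rule described in this reply can be written down as a small model.
    This is only a sketch in Python of the rule as stated above, not ASE
    code: the names Plex and should_fail_over are hypothetical, and the
    real ASE decision logic may take other inputs into account.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plex:
    """One plex (mirror copy) of an LSM volume. Hypothetical model type."""
    name: str
    accessible: bool  # True if I/O to the disk behind this plex succeeds


def should_fail_over(scsi_ping_ok: bool, io_failed: bool,
                     plexes: List[Plex]) -> bool:
    """Model of the rule in this reply (an assumption, not ASE's code).

    A SCSI ping timeout alone never triggers a failover, so scsi_ping_ok
    is deliberately ignored. A failover starts only when an access to
    shared data has failed AND no other plex can still serve the data.
    """
    if not io_failed:
        return False
    return not any(p.accessible for p in plexes)
```

    With both plexes on the same disconnected bus, the model only returns
    True once an I/O has actually failed, which matches the point that a
    ping timeout by itself is not sufficient.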
    
2099.2. "ASE AND SCSI BUS PARTITION" by SOSGPX::FIORINI () Tue Jun 03 1997 10:35, 38 lines
    
    Hi Manu,
    thanks for the answer to my note.
    
    As you have noticed, there is a single point of failure, because each
    system has only one SCSI interface, and the two shared disks (one is
    used as the mirror) are on that interface.
    I cannot change the configuration right now, but I have informed the
    Customer, and I think the configuration will be changed shortly.
    
    In the current configuration, only one service (which uses both disks)
    has been defined, and the first system provides the service.
    The second system is in "stand by" and will take over the service if
    the first system fails.
    If the SCSI cable is disconnected from the system that provides the
    service, that system is not able to access any data on the shared bus.
    AM detects the failure and reports that there is a SCSI bus partition.
    There is no entry in the errorlog regarding access to the
    disks (the logging severity level has been set to log notices, warnings
    and errors).
    The Customer's application is run via the start action script.
    The script contains two lines that call two other scripts in order to
    start the database (Oracle) and the application.
    Once the application is started, it accesses the data on the shared
    disks continuously.
    I think that ASE does not reallocate the service because it does not
    ping the disks itself, and does not know that the application is unable
    to reach the data.
    Is there any way to solve the problem?
    
    
    Regards,  Moreno
    
    
     
    
    
2099.3. "How is the service defined?" by HERON::BLOMBERG (Trapped inside the universe) Tue Jun 03 1997 12:23, 1 line
    
2099.4. ".0: It seems that ASE1.4 can't handle SCSI single point failure" by EPS::NGUYEN (Without fools there would be no wisdom.) Tue Jun 03 1997 14:17, 18 lines
Hi there,

I have the same problem: my application service does not fail over when 
the SCSI bus is disconnected.  My software versions are the same as those in 
.0, and I use an HSZ40 instead of the BA box.  I configured the service to
favor members "system1,system2" and NOT to fail over to the more highly
favored member automatically.

To trigger the failover in the old version, ASE1.3, I used to "cd" into 
the directory on the shared disk and run "ls", but that no longer seems to 
work: the application cannot get to the data on the shared disk and just 
hangs there without failing over to the other system.

Any recommendations/suggestions are highly appreciated.

Regards,
Gina Nguyen
2099.5. "" by COMICS::CORNEJ (What's an Architect?) Tue Jun 03 1997 15:24, 12 lines
    >To trigger the failover in the old version, ASE1.3, I used to "cd"
    >into the directory on the shared disk and run "ls", but that no longer
    >seems to work: the application cannot get to the data on the shared
    >disk and just hangs there without failing over to the other system.
    
    The service will not fail over if you "cd" to the filesystem on the
    shared disk (look in daemon.log - it will show the umount failing
    because the device is still busy).
    
    Jc
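
    The "device is still busy" case above arises when some process has its
    working directory inside the shared filesystem. A minimal sketch of a
    check for that, in Python; the function name cwd_is_under is
    hypothetical and nothing here is part of ASE. It only compares paths
    and does not inspect other processes.

```python
import os
from typing import Optional


def cwd_is_under(mount_point: str, cwd: Optional[str] = None) -> bool:
    """Return True if cwd (default: this process's working directory)
    is the mount point itself or a directory below it.

    A process sitting in such a directory keeps the filesystem busy,
    so an unmount of that filesystem would fail.
    """
    mount = os.path.realpath(mount_point)
    here = os.path.realpath(cwd if cwd is not None else os.getcwd())
    # commonpath compares whole path components, so /data2 is not
    # mistaken for a subdirectory of /data.
    return os.path.commonpath([mount, here]) == mount
```

    A script that touches the shared disk could call this first and change
    to a directory outside the mount point before doing anything else.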
    
    
2099.6. "SCSI full partition changed in 1.4 ?" by BACHUS::DEVOS (Manu Devos NSIS Brussels 856-7539) Wed Jun 04 1997 20:34, 20 lines
    Hi Jc,
    
>     The service will not fail over if you "cd" to the filesystem on the
>     shared disk (look in daemon.log - it will show the umount failing
>     because the device is still busy).
  
    
    I think .4 wanted to say that after disconnecting the SCSI bus
    cable, a "cd" followed by an "ls" was needed to cause an I/O on the
    disconnected disk and so trigger the failover. This simply confirms my
    answer that a cable disconnection is not sufficient to cause a
    failover; an I/O is also needed.
    
    But the interesting information in .0, .3 and .4 seems to be that this
    no longer works with version 1.4...
    
    Is there any change in version 1.4 concerning the SCSI bus full
    partition compared with version 1.3?
    
    Manu.
2099.7. ":-)" by COMICS::CORNEJ (What's an Architect?) Thu Jun 05 1997 11:45, 5 lines
    Ooops!  Sorry about that.  It is what comes from doing what I wrote in
    .4 too often that day myself:-)
    
    Jc
    
2099.8. "ScSi fail over for less than 1-minute" by EPS::NGUYEN (Without fools there would be no wisdom.) Thu Jun 05 1997 13:29, 23 lines
Hello there,

Thank you very much for the discussion of this case.
    
What .6 has written is precisely what I meant.  Since ASE has already 
discovered that there is a failure on the SCSI bus, it won't hurt to "cd" 
into the directory (after all, that directory is just an empty bucket on 
which to mount the shared disk). It is only under normal conditions that 
this does not work, as .5 pointed out.

Well, after some more extensive testing, I've found that the 
system will fail over if any I/O happens quickly enough (I mean within 
less than 2 minutes or so) around the time of the SCSI failure.  The longer 
the delay, the more difficult it is for the system to do anything.
Again, what I mean is that after about 1 minute, one should expect the 
system to hang. However, I believe that customers who buy our 
product expect it to work unless we state 
otherwise.  What if the SCSI bus fails at night, or at some other time 
when there is not much I/O traffic?

Any suggestions from the product team?

Gina Nguyen
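
One way to avoid depending on application traffic is a small probe that
issues a read against the shared disk at regular intervals and gives up
after a timeout. The sketch below is in Python and is only an illustration
under assumptions: the name probe_read is hypothetical, it is not part of
ASE, and whether a failed probe makes ASE 1.4 relocate the service is
exactly the open question in this topic. A read of a small file may also be
satisfied from the buffer cache without touching the disk.

```python
import threading


def probe_read(path: str, timeout_s: float) -> bool:
    """Return True if a short read of path completes within timeout_s.

    The read runs in a daemon thread so that a hung I/O cannot block
    the caller; after the timeout the thread is simply abandoned.
    """
    result = {"ok": False}

    def worker() -> None:
        try:
            with open(path, "rb") as f:
                f.read(512)
            result["ok"] = True
        except OSError:
            # Covers a missing file as well as an I/O error on the device.
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    # Still alive means the read did not return in time.
    return (not t.is_alive()) and result["ok"]
```

Run from a directory outside the shared filesystem, a loop calling
probe_read every few seconds would at least generate I/O close to the
moment of a bus failure and could log or raise an alarm when it returns
False.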
2099.9. "" by KITCHE::schott (Eric R. Schott USG Product Management) Thu Jun 05 1997 17:15, 9 lines
Hi

  If you have behavior that you think is incorrect (or different between
releases), I suggest you file a QAR or CLD/IPMT.

 The system should not hang...so I think this is a serious problem.  I
think to get the attention this deserves, you should escalate as
required by your customer.