
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

2068.0. "Member hang during shutdown" by MEOC02::LEE () Mon May 19 1997 07:55

    I am working on a site with 2 AS8400s configured via DECsafe.
    The systems HX11 & HX12 are running DU3.2g and DECsafe V1.3.
    
    We had DECsafe disabled on HX12 because it caused HX11 to panic
    whenever HX12 was brought online.
    
    The DECsafe patches solved the problem with the machine
    panics. But we encountered another DECsafe problem. With
    DECsafe turned on on both HX11 and HX12, we cannot shut down
    HX12 cleanly via an 'init 0'. It appears to hang and requires
    a physical reset. The local LSM disks on HX12 need to resync on
    the next reboot.
    
    Any suggestions are appreciated......
    
    We applied the following DECsafe V1.3 patches:-
            ASE130-005
            ASE130-013
            ASE130-014
            ASE130-015
    When an 'init 0' is issued on HX12, the machine appears
    to hang. At this point in time the console is locked and
    requires a physical reset to reboot the machine.
    
    After physically resetting HX12, the following
    messages were logged in daemon.log:-
    
       May 17 18:49:52 hx12 DECsafe: local Asemgr Error:
        Can't connect to HSM
       May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
        msgSvcOpenChannel: Agent not in target's port map
       May 17 18:49:52 hx12 DECsafe: deregister_fd: not registered
       May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
        can't connect to local agent, retrying...
    
       May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
        msgSvcOpenChannel: Agent not in target's port map
       May 17 18:49:57 hx12 DECsafe: deregister_fd: not registered
       May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
        can't connect to local agent, retrying...
    
       The last 3 messages were repeated at 5-second intervals.
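
    To check the retry interval from the log itself, something like the
    following sketch could be used. The function names are made up for
    illustration, and the log file path has to be supplied as an argument
    since its location is not given here.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; pass the path to daemon.log.

# Count how many times the local Asemgr logged a retry.
count_agent_retries() {
    grep -c "can't connect to local agent, retrying" "$1"
}

# Print the time stamp of each retry so the spacing can be read off.
# A DECsafe header line carries the time in field 3; the retry text may
# sit on the same line or on the continuation line that follows it.
retry_times() {
    awk '
        / DECsafe: / { stamp = $3 }
        /can.t connect to local agent, retrying/ { print stamp }
    ' "$1"
}
```

    Running retry_times against the excerpt above would print 18:49:52 and
    18:49:57, which matches the 5-second spacing.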
    
    
T.R  Title  User  Personal Name  Date  Lines
2068.1  more info  BACHUS::DEVOS  Manu Devos NSIS Brussels 856-7539  Mon May 19 1997 13:59  25 lines
    Hi,
    
    We need more information to help you.
    
    When the two systems are up and running, can you
    
    1) Stop the service ?
    2) Start the service ?
    3) Relocate the service to the other member ?
    
    Is there a "device busy" message in the daemon.log ?
    
    When you try "init 0", do you wait for a long time before doing the reset ?
    (ASE should stop every service running on that host and it can take
    time. Also, if something is going wrong, the ASE internal timeout is
    quite long).
    
    Try to explain to us whether it has worked before, whether something has
    been changed, or what you think triggered the problem.
    
    What is working and what is not working ?
    
    Manu.
    
    
2068.2  BACHUS::DEVOS  Manu Devos NSIS Brussels 856-7539  Mon May 19 1997 14:26  18 lines
     Hi again,
    
    I just read the note 1975.2 which seems to concern the same customer
    and saw that it is a SAP config. Did you finally resolve the original
    problem ? 
    
    If the start/stop of the service is NOT working when the two members
    are connected, then you can try disconnecting the shared SCSI bus
    cables from the other node and repeating the test. If it works, then you
    likely have either a mis-configured system (same SCSI id for the
    controllers ?) or a hardware problem on one of the SCSI buses.
    
    Take a progressive approach from a working starting point until you
    see the problem appear. The last step could give you the beginning of
    the solution.
    
    Manu.
    
2068.3  MEOC02::LEE  Tue May 20 1997 07:07  96 lines
    Hi,
    
    We (...the local CSS, TSC, NSIS) have had a long and painful saga with
    this particular DECsafe installation. The cluster was put in about
    18 months ago. When it was first installed, DECsafe did work to a
    certain extent. It would fail over the disks from HX11 to HX12 but
    there were some bugs in the ASE SAP scripts that prevented the SAP
    R/3 application from starting on HX12. The customer employs about 70
    SAP contractors and did not want to hand the machines over to test
    the scripts then.
    
    Six months down the track, we got a call from their Sys Admin. He tried
    adding some disks to the cluster and it somehow got into a hung state.
    They eventually got HX11 up and running by disabling DECsafe on HX12.
    HX11 will panic whenever HX12 is rebooted with DECsafe turned on.
    
    Since then, DECsafe on HX12 has been turned off. Despite the problems,
    the machines have been upgraded:-
            DU3.2C  ->      DU3.2G
            SAP 3.0C->      SAP 3.0D+
            HX12, which was an AS2100, was replaced with an AS8400.
            (we preserved the LSM/DECsafe environment from the AS2100 by
             moving the system disks across and the placement of the SCSI
             controllers. The new HX12 came up without a problem...with
             DECsafe turned off, that is).
    
    During the Easter Weekend, we turned on DECsafe on HX12 and of course,
    it caused HX11 to panic. We also got ourselves into a knot...as described
    in note 1975.
    
    Last weekend, we applied the DECsafe V1.3 patches, then deleted and
    re-added HX12. This fixed the problems with the machine panics. However,
    we tripped over again...with another problem.
    
    This is what happened....
    
       The DECsafe patches solved the problem with the machine
       panics. But we encountered another DECsafe problem. With
       DECsafe turned on on both HX11 and HX12, we cannot shut down
       HX12 cleanly via an 'init 0'. It appears to hang and requires
       a physical reset. The LSM disks on HX12 need to resync on
       the next reboot.
    
       Test Results
       When Chris & I finally got the nod, we did the following:-
       1) applied the following DECsafe V1.3 patches:-
            ASE130-005
            ASE130-013
            ASE130-014
            ASE130-015
       2) Rebuilt the kernels for HX11 and HX12.
       3) Shut down HX11 and HX12
       4) Boot HX11, disable the ASE service, sapdb
       5) Boot HX12, turn on and reinitialize DECsafe.
       6) On HX11, delete member HX12 and re-add HX12.
          Everything seems OK at this point. We now have
          DECsafe up and running on both HX11 and HX12.
       7) On HX11, enable the ASE service. This mounted
          the shared File Systems, started ORACLE and the
          SAP R/3 Application.
       8) We decided to reboot HX11 and HX12 to check if
          they will both come up without any problems.
          We issued an 'init 0' on HX12 and waited, waited
          and waited........... (20 minutes)
    
          After physically resetting HX12, the following
          messages were logged in daemon.log:-
    
          May 17 18:49:52 hx12 DECsafe: local Asemgr Error:
           Can't connect to HSM
          May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
           msgSvcOpenChannel: Agent not in target's port map
          May 17 18:49:52 hx12 DECsafe: deregister_fd: not registered
          May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
           can't connect to local agent, retrying...
    
          May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
           msgSvcOpenChannel: Agent not in target's port map
          May 17 18:49:57 hx12 DECsafe: deregister_fd: not registered
          May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
           can't connect to local agent, retrying...
    
          The last 3 messages were repeated at 5-second intervals.
       9) HX12 will still hang even if we shut down DECsafe first via the commands
          /sbin/init.d/asemember stop
          /sbin/init.d/aseam stop
          before the 'init 0'
       10) With DECsafe turned off on HX12, it will reboot cleanly.
    
    I re-checked the SCSI controllers, cabling and disks this morning
    and I am certain that they are correctly configured. The shared
    disks are on SCSI buses 2,3,4 & 5 on both machines.
    
    Thanks for the replies....and please keep them coming.
    
         
2068.4  Some comments and suggestions ...  BACHUS::DEVOS  Manu Devos NSIS Brussels 856-7539  Tue May 20 1997 10:58  122 lines
<    Hi,
<   
<    We (...the local CSS, TSC, NSIS) have a long and painful saga with
<    this particular DECsafe installation. The cluster was put in about
<    18 months ago. When it was first installed, DECsafe did work to a
<    certain extent. It will failover the disks from HX11 to HX12 but
<    there were some bugs in the ASE SAP scripts that prevented the SAP
<    R/3 application from starting on HX12. The customer employs about 70
<    SAP contractors and did not want to hand the machines over to test
<    the scripts then.

So, you are still using the original (wrong) scripts ? or did you change them?
   
<    Six months down the track, we got a call from their Sys Admin. He tried
<    adding some disks to the cluster and it somehow got into a hung state.

As the stop/start scripts are needed for each service modification (you know the
famous sequence: stopping-deleting-adding-starting the service), a non-working
stop script can lead to an apparent "hung" state (which generally exits after a
very long time [36']).

<    They eventually got HX11 up and running by disabling DECsafe on HX12.
<    HX11 will panic whenever HX12 is rebooted with DECsafe turned on.

If HX12 causes HX11 to panic when it boots, the cause is either a
mis-configured SCSI controller-id/termination OR an ASE database that is OUT of
sync with the running system. They can thus each try to start a director and
consequently to start the service.
   
<    Since then, DECsafe on HX12 has been turned off. Despite the problems,
<    the machines have been upgraded:-
<            DU3.2C  ->      DU3.2G
<            SAP 3.0C->      SAP 3.0D+
<            HX12 which was a AS2100 was replaced with an AS8400.

This is not a criticism (I too know the pressure a customer can place on us!),
but consecutive changes to a NOT-WORKING environment can only make the problems
virtually impossible to cure. A step-by-step (and thus lengthy) process is
the only valid approach to solving complicated problems.

<            (we preserved the LSM/DECsafe environment from the AS2100 by
<             moving the system disks across and the placement of the SCSI
<             controllers. The new HX12 came up without a problem...with
<             DECsafe turned off, that is).

??? Very strange to me !!! It is so simple to delete the member from the
cluster and then to re-add it. You are then sure that the latest (most up to date)
version of the ASE database is used on the new (added) system. But, maybe you
had tried that because ASE had been disabled on that system ???

<    
<    During the Easter Weekend, we turned on DECsafe on HX12 and of course,
<    it caused HX11 to panic. We also got ourselves into a knot...as described
<    in note 1975.
<    
<    Last weekend, we applied the DECsafe V1.3 patches, deleted, re-added
<    HX12. This fixed the problems with the machine panics. However, we tripped
<    over again...with another problem.
<    
<    This is what happened....
<    
<       The DECsafe patches solved the problem with the machine
<       panics. But we encountered another DECsafe problem. With
<       DECsafe turned on on both HX11 and HX12, we cannot shut down
<       HX12 cleanly via an 'init 0'. It appears to hang and requires
<       a physical reset.
<       The LSM disks on HX12 need to resync on the next reboot.

The resync is absolutely normal and expected. An LSM mirrored volume which is not
stopped (init 0 should stop it, but you reset the system before it reached that
point) will always resynchronize its mirrors.

So, your last problem is 'init 0' not finishing in a reasonable time.
To debug that problem, I suggest you modify the /sbin/rc0 script so that you
can monitor the progress of 'init 0' at the console. Edit the file to place
an 'echo $f' just before the invocation of the stop shell script, as in the
following example:

	if [ -d /sbin/rc0.d ]; then
		# KILL procedure
		for f in /sbin/rc0.d/K*
		do
			if [ -s $f ]; then
				echo "Running $f stop ..."	# <--- added line
				/sbin/sh $f stop
			fi
		done
		...

You can now try an init 0 and see on the console the name of each stop script
just before it is executed. Once you find the blocking script, insert a "set -x"
in that script to be able to monitor its execution at the console. I am sure you
can find the offending command/problem...
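
As a rough illustration of the "set -x" step, a stop script with tracing
switched on could look like the sketch below. The function name and the echo
line are hypothetical stand-ins for the real script body; only the placement of
"set -x" matters.

```shell
#!/bin/sh
# Sketch only: run_stop_script and its body are illustrative, not a real
# DECsafe or SAP script.
run_stop_script() {
    case "$1" in
    stop)
        set -x                           # from here on, each command is echoed to stderr
        echo "stopping example service"  # stand-in for the real stop commands
        set +x                           # switch tracing off again
        ;;
    esac
}
```

With tracing on, the last command shown before the console goes quiet is the
one that is blocking.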
			
<  
<       9) HX12 will still hang even if we shut down DECsafe first via the commands
<          /sbin/init.d/asemember stop
<          /sbin/init.d/aseam stop
<          before the 'init 0'

This proves that DECsafe is not a player here !!!

<       10)With DECsafe turned off on HX12, it will reboot cleanly.

What do you mean by "reboot cleanly" ? Do you mean that 'init 0' works when
you have booted the system with ASE=off and does NOT work when booted with ASE=on,
or that you can boot "cleanly" when ASE=off ?

<    
<    I re-checked the SCSI controllers, cabling and disks this morning
<    and I am certain that they are correctly configured. The shared
<    disks are on SCSI buses 2,3,4 & 5 on both machines.
<    
<    Thanks for the replies....and please keep it coming.
<    
<         

Finally, try to run all your tests in a script session, so you can give us all
the evidence.

Manu.

2068.5  Let me explain...  MEOC02::LEE  Wed May 21 1997 00:00  39 lines
    Hi,
    
    I have another 12 hour window this weekend for further tests. I am
    trying to get as many ideas as I can to help pinpoint or solve the 
    problem.
    
    <<So, you are still using the original (wrong) scripts ? or did you
    <<change them?
    
    Yes, we made some minor changes to rc_service and db_service at this
    point in time. The script manually maintains a list of file systems to
    mount. We intend to make some additional changes later.
    
    <<What do you mean by "reboot cleanly" ? Do you mean that 'init 0' is
    <<working when
    <<you booted the system with ASE=off and NOT working when booted with
    <<ASE=on or
    <<that you can boot "cleanly" when ASE=off ?
    
    <<This proves that DECsafe is not a player here !!!
    
    When HX12 is booted with ASE=off, init 0 will halt the system. The chevron
    prompt '>>> ' is displayed at the console. HX12 will boot and shutdown
    without a problem.
    
    If I now set ASE=on by booting HX12 into single-user mode and editing
    rc.config, HX12 will start up fine, with no resync of the local LSM disk.
    If an 'init 0' is issued, HX12 will appear hung during the execution of
    the K* scripts. (I will work out which one this weekend). I had been quick
    to jump to the conclusion that it was DECsafe related, as the 'init 0'
    executes the same K* scripts. Every 5 seconds, the LED on the system disk
    flashes, and this seems to go on and on....until I hit <CONTROL P> at the
    console. This activity corresponds with the entries in daemon.log.
    
    Rebooting at this point (ASE=on) will bring HX12 up and LSM needs to
    resync the disk, as expected. An 'init 0' now gets HX12 into the same
    hung state.
    
    Thanks...for the patience
2068.6  Further clarification  MEOC02::JANKOWSKI  Wed May 21 1997 11:49  55 lines
    Hi,
    
    I am the other guy working on the system with Kay Lee - the author of
    .0
    
    I would like to provide the following clarification.
    
    We only have one service - sapdb.
    This service is normally working on HX11 as this is the preferred
    member. 
    The status of the service is that it cleanly brings up SAP when it
    starts and it cleanly stops SAP when it stops.
    
    We have *not* got to the stage yet that we actually tested the failover
    to HX12. This is our objective but we need to get to clean state first and
    be very careful - this is a production system with lots of storage.
    
    At the moment we have stabilised the system - we have DECsafe running on
    both machines and the service comes up fine on HX11 on startup.
    
    To progress further we need to be able to shut down HX12 cleanly.
    If we cannot do it we will have 200Gb of LSM disks resyncing
    and it takes about 8hrs to do.
    
    Note that our window is 12 hours.
    
    Just to summarize:
    
    The current problem is:
    
    HX12 will not complete - init 0 - with the errors as per .0
    with *no* service running on HX12.
    
    However, if we disable DECsafe by setting ASE=off in rc.config
    then after the next reboot the machine will shut down cleanly.
    
    Also note that just shutting down the daemons by running asemember stop
    and aseam stop does not remove the problem.
    
    This is strange.
    
    
    Our current plan for our 12hr window is:
    
    0. Activate HX12 (ASE=ON)
    1. delete HX12 from ASE configuration.
    2. Remove ASE subsets from HX12
    3. reinstall ASE on HX12 and apply patches.
    4. add HX12 to existing ASE configuration
    
    Any comments?
    
    Chris Jankowski
    Melbourne Australia
                       
2068.7  BACHUS::DEVOS  Manu Devos NSIS Brussels 856-7539  Wed May 21 1997 21:13  29 lines
    Hi again Chris and Kay,
    
    I don't know if this can help you, but I strongly recommend the
    following approach:
    
    As your time window is only 12 hours, I suggest that you stop the
    application by calling the DECsafe stop script (outside of ASE, of
    course), and if it is successful, that you stop the ASE service with
    asemgr. Then you can try "init 0" and check if it works. If it is
    hanging, at least your LSM volumes will not have to be resynchronized.
    
    Then, you can narrow the problem by modifying the /sbin/rc0 script as
    described some replies before.
    
    By the way, I was just wondering whether "init 0" on HX12 was done when
    the sapdb service was running on HX12 itself or on HX11 ??? If the service
    is running on HX11, doing "init 0" on HX12 and resetting it SHOULD NOT
    CAUSE the service to resynchronize its mirrors !!!
    
    Also, some time ago I faced a serious problem debugging an ASE 1.2A
    cluster to which someone had applied the ASE 1.3 patches !!! Do the
    patches you have installed apply to your ASE version ?
    
    Did you rebuild the kernel after the patches ?
    
    Forgive my naive questions, but they are only intended to help you !
    
    Manu.
    
2068.8  Further clarification.  MEOC02::JANKOWSKI  Thu May 22 1997 00:26  16 lines
    Re .7
    The ASE is V1.3 and the patches are for V1.3
    Kernel has been rebuilt.
    
    The sapdb service runs on HX11.
    An (unsuccessful) shutdown of HX12 causes only the local system disks
    to be resynchronized after a forced halt.
    
    However, our test plan calls for failover of the service to HX12.
    We would prefer to do this when the machine can be shut down cleanly,
    as otherwise we may be left with those disks there and have
    to resynchronize them.
    
    Cheers,
    
    Chris
2068.9  broken script  GIDDAY::SCHWARZ  Mon May 26 1997 04:35  20 lines
    
    I went to site on Saturday to help isolate the shutdown problems.
    Thanks to Manu for his suggestions - they were great. It turned out
    that one of the SAP shutdown scripts contained a call to asemgr.
    Unfortunately the SAP script was called AFTER asemember stop had been
    run. Thus, with the aseagent not running, asemgr just sat there and the
    shutdown did not complete. This call to asemgr was supposed to be
    made only during boot, not during shutdown. Modifying the script to
    reflect this allowed the system to shut down cleanly.
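
    A minimal sketch of that kind of fix, assuming an init.d-style script
    that takes start or stop as its argument. The names sap_rc and query_ase
    are hypothetical, and query_ase only stands in for the real asemgr call,
    whose arguments are not given in this note:

```shell
#!/bin/sh
# Sketch only: query_ase is a placeholder for the real asemgr invocation.
query_ase() {
    echo "asemgr would be called here"
}

sap_rc() {
    case "$1" in
    start)
        query_ase              # intended for boot time only
        echo "starting SAP"
        ;;
    stop)
        # No asemgr call on this path: asemember stop has already run by
        # now, so asemgr would wait for an agent that is no longer there.
        echo "stopping SAP"
        ;;
    *)
        echo "usage: sap_rc {start|stop}" >&2
        return 1
        ;;
    esac
}
```

    Keeping the asemgr call inside the start branch means the stop path can
    no longer block on a missing agent.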
    
    Lessons to learn:
    1) separate your boot and shutdown scripts
    2) check the order in which things happen before assuming your scripts
    work.
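
    For the second lesson, one way to check the ordering is to walk the K*
    scripts in the same glob order that rc0 uses and flag any script that
    mentions asemgr after the ASE stop script has run. This is only a sketch:
    the function name is hypothetical, and the name of the ASE stop link has
    to be supplied by the caller.

```shell
#!/bin/sh
# Sketch only. $1 = directory holding the K* scripts,
#              $2 = file name of the script that stops the ASE member.
# Prints every K* script that runs after $2 and mentions asemgr.
late_asemgr_callers() {
    dir=$1
    ase_stop=$2
    seen=no
    for f in "$dir"/K*; do
        [ -f "$f" ] || continue
        if [ "$(basename "$f")" = "$ase_stop" ]; then
            seen=yes            # everything after this point runs without the agent
            continue
        fi
        if [ "$seen" = yes ] && grep -q asemgr "$f"; then
            basename "$f"
        fi
    done
}
```

    For example, "late_asemgr_callers /sbin/rc0.d K10asemember" would list
    the offenders, where K10asemember stands for whatever link name the ASE
    member stop script has on the system in question.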
    
    
    Kym Schwarz
    Unix Support 
    CSC Sydney
    
2068.10  other results of the recent debugging session.  MEOC02::JANKOWSKI  Thu May 29 1997 08:58  16 lines
    As per .9 the immediate problem of not being able to shut down is
    solved. Thanks to Manu for his excellent suggestions in .4.
    
    The fact that the machine would shut down cleanly with ASE disabled
    was what put us on the wrong track.
    
    Anyway, we also made excellent progress on debugging and testing
    of the start and stop scripts.
    At the moment we can fail over manually from HX11 to HX12 and back
    reliably. We also get correct actions when we boot and shut down
    machines in all combinations of situations and order.
    
    Regards,
    
    Chris Jankowski
    Melbourne Australia
2068.11  are you using the "official" scripts  BACHUS::DEVOS  Manu Devos NSIS Brussels 856-7539  Thu May 29 1997 21:04  19 lines
    Hi Chris and the team ...
    
    First, I am glad to hear good news. Notes conferences are of great help
    for all of us!
    
    > Anyway, we also made excellent progress on debugging and testing
    > of the start and stop scripts.
    > At the moment we can failover manually from HX11 to HX12 and back
    > reliably. We also get correct actions when we boot and shutdown
    > machines in all combinations of situations and order.
    
    
    Do you know that the DIGITAL-SAP team in Walldorf (Germany) produced
    "official" and "supported" start and stop scripts for DECsafe ? I think
    the team leader is Thomas Heinz. I don't have his e-mail address here
    at home, but I am sure that, if you mail the question to
    "Marc Dubois @BRO", he can answer.
    
    Regards, Manu.