
Conference orarep::nomahs::rdb_60

Title: Oracle Rdb - Still a strategic database for DEC on Alpha AXP!
Notice: RDB_60 is archived, please use RDB_70..
Moderator: NOVA::SMITHISON
Created: Fri Mar 18 1994
Last Modified: Fri May 30 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 5118
Total number of notes: 28246

5029.0. "Deadlock on page not resolved?" by NLVMS2::VVISSER (Vincent Visser, Oracle Rdb Support, The Netherlands) Fri Feb 14 1997 13:19

    Hello,
    
    A few days ago we had a strange situation on a customer's production
    database.
    The application stopped working and nobody received any errors.
    With RMU/SHOW LOCK/MODE=BLOCKING we found that there were
    deadlocks. See the output below.
    A few minutes before they detected the problems, one of the two nodes
    crashed.
    Why are these deadlocks not resolved? I always thought that a deadlock on
    a page was resolved by Rdb.
    The application does check for deadlocks, and it didn't get any
    deadlock error at that moment. Stopping the application solved the
    problem.
    Two weeks ago they also had a hang, but no system crash was
    involved. Some processes were reporting DEADLOCK ON FREEZE
    errors. In that case they had to kill a process to get things going again.
    
    Customer is using Oracle Rdb V6.1-04 on VAX/VMS V6.1.
    Dynamic lock remastering is disabled (PE1 was set to 25 and 50)
    
    Why is Rdb not solving the deadlock in this particular case?
    
    Regards,
    Vincent
    
================================================================================
SHOW LOCKS/MODE=BLOCKING Information
================================================================================

--------------------------------------------------------------------------------
Resource: page 1

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20413CF1  ACMS00BSP016002     25C3002A  00090002  PW        CR
Blocker:  20411EF3  ACMS00BSP018002     26006AB3  00090002  PR        PR
Blocker:  204138F2  ACMS00BSP017002     04B60053  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: freeze

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411EF3  ACMS00BSP018002     7B0020D6  00090002  CW        NL
Blocker:  20411E31  RDM_RB_1.......     2E00CF86  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: freeze

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  204138F2  ACMS00BSP017002     2F004D72  00090002  CW        NL
Blocker:  20411E31  RDM_RB_1.......     2E00CF86  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: page 1166

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411138  ACMS00BSP013002     3A00A956  00090002  PR        NL
Blocker:  204146F0  ACMS00ASP001000     47006061  00090002  PW        PW

--------------------------------------------------------------------------------
Resource: page 1

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411E31  RDM_RB_1.......     4600F91C  00090002  PW        NL
Blocker:  20413CF1  ACMS00BSP016002     25C3002A  00090002  PW        CR
Blocker:  20411EF3  ACMS00BSP018002     26006AB3  00090002  PR        PR
Blocker:  204138F2  ACMS00BSP017002     04B60053  00090002  PR        PR
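
For readers decoding the Requested and Granted columns in the listing above, the sketch below encodes the standard OpenVMS lock-mode compatibility rules (NL, CR, CW, PR, PW, EX) and checks a few waiter/blocker pairs taken from the output. The helper name `blocks` and the `observed` list are illustrative only and are not part of RMU or the lock manager.

```python
# OpenVMS lock modes, weakest to strongest.
MODES = ("NL", "CR", "CW", "PR", "PW", "EX")

# For each mode, the set of modes that may be granted alongside it.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def blocks(granted: str, requested: str) -> bool:
    """Return True if a lock held in `granted` mode blocks a `requested` mode."""
    return requested not in COMPATIBLE[granted]


# (resource, mode requested by the waiter, mode granted to a blocker),
# copied from the listing above.
observed = [
    ("page 1", "PW", "PR"),
    ("freeze", "CW", "PR"),
    ("page 1166", "PR", "PW"),
]

for resource, requested, granted in observed:
    print(f"{resource}: {requested} blocked by {granted}? {blocks(granted, requested)}")
```

Note that `blocks("CR", "PW")` is false, so the CR-holding process listed as a blocker of RDM_RB_1 on page 1 is not explained by its granted mode alone. It may be listed because of its own queued PW conversion, but the output does not say so.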

T.R  Title  User  Personal Name  Date  Lines
5029.1  M5::JHAYTER  Fri Feb 14 1997 17:41  9
>    Dynamic lock remastering is disabled (PE1 was set to 25 and 50)

Try using 1 or -1 (%xFFFFFFFF).
    
>    Why is Rdb not solving the deadlock in this particular case?

Rdb does not detect deadlocks.  The VMS lock manager does and it notifies
Rdb.
5029.2  NOVA::R_ANDERSON  Oracle Corporation (603) 881-1935  Sat Feb 15 1997 11:04  4
    Also, Rdb handles "page" deadlocks internally - they are not normally
    returned to the application.
    
    Rick
5029.3  How does it solve the deadlock?  NLVMS2::VVISSER  Vincent Visser, Oracle Rdb Support, The Netherlands  Mon Feb 17 1997 07:53  17
    
    >Also, Rdb handles "page" deadlocks internally - they are not normally
    >returned to the application.
    >
    >Rick
    
    This is exactly what it should do. The application is not getting
    any deadlock error, but when you look at the RMU/SHOW LOCK/MODE=BLOCKING  
    output there are deadlocks. 
    It looks like Rdb doesn't correctly handle "page" deadlocks
    internally.
    How does it solve a deadlock with a page lock and a freeze lock
    involved? Who will be chosen as the victim? 
    
    Regards,
    Vincent 
    
5029.4  ukvms3.uk.oracle.com::PJACKSON  Oracle UK Rdb Support  Mon Feb 17 1997 08:08  15
>    This is exactly what it should do. The application is not getting
>    any deadlock error, but when you look at the RMU/SHOW LOCK/MODE=BLOCKING  
>    output there are deadlocks. 
    
    This shows that VMS has not chosen one of the lock requests to abort.
    When it does the $ENQ returns an error and the request will no longer
    be outstanding.
    
>    How does it solve a deadlock with a page lock and a freeze lock
>    involved? Who will be chosen as the victim? 
         
    VMS does the choosing (based on a value supplied by Rdb). Until VMS
    chooses one Rdb can do nothing.
    
    Peter 
5029.5  NOVA::R_ANDERSON  Oracle Corporation (603) 881-1935  Mon Feb 17 1997 09:13  6
Check your DEADLOCK_WAIT sysgen parameter.  

I like to have it set to "1" or "2" (the default is "10" seconds, which is
horrendous for any real-world application).

Rick
5029.6  Gotta know your system and the OpenVMS lock manager...  BOUVS::OAKEY  I'll take Clueless for $500, Alex  Mon Feb 17 1997 14:31  31
~~Note 5029.4              Deadlock on page not resolved?                   4 of 5
~~ukvms3.uk.oracle.com::PJACKSON "Oracle UK Rdb Suppo" 15 lines  17-FEB-1997 05:08
~~
~~    This shows that VMS has not chosen one of the lock requests to abort.
~~    When it does the $ENQ returns an error and the request will no longer
~~    be outstanding.

Not quite true.  When OpenVMS detects a deadlock, it signals the victim but 
does nothing to the pending request.  It is up to the victim to $ENQ to a 
lesser lock mode (or $DEQ) to remove the request from the appropriate 
pending/conversion queue.
    



~~Note 5029.5              Deadlock on page not resolved?                   5 of 5
~~NOVA::R_ANDERSON "Oracle Corporation (603) 881-1935"  6 lines  17-FEB-1997 06:13
~~
~~I like to have it set to "1" or "2" (the default is "10" seconds, which is
~~horrendous for any real-world application).

Here is where I might disagree a bit.  DEADLOCK_WAIT is a SYSGEN parameter. 
Tweaking it affects the entire system. Setting it to 1 or 2 will help 
quickly identify true deadlocks.  However, you may be causing the system to 
check an excessive number of potential deadlocks in the deadlock queue that 
aren't really deadlocks, just pending lock requests.  You should evaluate 
your system to make sure that you aren't waiting an excessive amount of 
time to find real deadlocks but also to make sure you're not checking too
quickly and using up system resources checking for potential deadlocks that 
aren't.
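
The trade-off can be put in rough numbers. The sketch below assumes a simplified model in which every lock request that waits at least DEADLOCK_WAIT seconds triggers one deadlock search; the wait times are made up for illustration and none of them represents a real deadlock except possibly the longest.

```python
def searches_started(wait_times, deadlock_wait):
    """Count lock requests that wait long enough to trigger a deadlock search.

    Simplified model (an assumption): one search per request whose wait
    reaches `deadlock_wait` seconds.
    """
    return sum(1 for t in wait_times if t >= deadlock_wait)


# Hypothetical wait times, in seconds, for lock requests on a busy system.
waits = [0.2, 0.5, 1.5, 2.5, 3.0, 4.0, 12.0]

for setting in (1, 2, 10):
    print(f"DEADLOCK_WAIT={setting}: {searches_started(waits, setting)} searches")
```

With a low setting most of the searches in this toy data set are spent on requests that were merely slow, which is the cost being weighed against faster detection of real deadlocks.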

5029.7  138.3.209.29::PJACKSON  Oracle UK Rdb Support  Mon Feb 17 1997 15:05  9
>Not quite true.  When OpenVMS detects a deadlock, it signals the victim but 
>does nothing to the pending request.  It is up to the victim to $ENQ to a 
>lesser lock mode (or $DEQ) to remove the request from the appropriate 
>pending/conversion queue.
    
    That's not what my VMS internals manual says. It says the lock request
    fails.
    
    Peter
5029.8  I think we said the same thing :)  BOUVS::OAKEY  I'll take Clueless for $500, Alex  Mon Feb 17 1997 15:25  16
~~      <<< Note 5029.7 by 138.3.209.29::PJACKSON "Oracle UK Rdb Support" >>>

~~    That's not what my VMS internals manual says. It says the lock request
~~    fails.

Which doesn't really disagree with what I said.  When you request a lock 
with WAIT and the request is not immediately granted, you're placed in 
either the waiting or conversion queue (depending on the previous state of 
the lock) *and* the timeout queue.  When you've been in the timeout queue 
deadlock_wait length of time, OpenVMS will check to see if your lock 
request participates in a deadlock.  If so, then one of the deadlock 
participators is signalled as the victim and their lock request returns a 
deadlock error.  That doesn't mean they're removed from the waiting or 
conversion queue, you've got to $ENQ to a more permissive mode for that to 
happen. 
    
5029.9  NOVA::GODFRIND  Oracle Rdb Engineering  Mon Feb 17 1997 15:46  49
>~~    This shows that VMS has not chosen one of the lock requests to abort.
>~~    When it does the $ENQ returns an error and the request will no longer
>~~    be outstanding.
>
>Not quite true.  When OpenVMS detects a deadlock, it signals the victim but 
>does nothing to the pending request.  It is up to the victim to $ENQ to a 
>lesser lock mode (or $DEQ) to remove the request from the appropriate 
>pending/conversion queue.

Ahem. I beg to disagree (and agree with Peter). The lock request for which the
deadlock error gets reported does get removed from the queue it was waiting in
(and put back in its prior state if necessary).

However, the other locks that the victim process may have (and that are
blocking the other processes, causing the deadlock condition) do NOT get
removed automatically. It is up to the application to do the right thing
(usually rollback the current transaction).

>~~I like to have it set to "1" or "2" (the default is "10" seconds, which is
>~~horrendous for any real-world application).
>
>Here is where I might disagree a bit.  DEADLOCK_WAIT is a SYSGEN parameter. 
>Tweaking it affects the entire system. Setting it to 1 or 2 will help 
>quickly identify true deadlocks.  However, you may be causing the system to 
>check an excessive number of potential deadlocks in the deadlock queue that 
>aren't really deadlocks, just pending lock requests.  You should evaluate 
>your system to make sure that you aren't waiting an excessive amount of 
>time to find real deadlocks but also to make sure you're not checking too
>quickly and using up system resources checking for potential deadlocks that 
>aren't.

I beg to agree. Deadlock searches are pretty costly - not so much that they use
CPU, but that they use kernel mode CPU at elevated IPL (IPL8), which may disturb
other system functions.

I tend to think that setting deadlock wait to a low number provides fast
relief, but does not cure the real problem. It acts like a pain killer, but you
still need to see the doctor. A large number of deadlocks (even if they are
handled internally by Rdb) is bad and needs investigating. 

That said, we are straying away from the base problem. From the look of it, 
two ACMS servers were waiting for the freeze lock, held by a recovery process,
which itself was waiting for a page (page #1 in some area), held by those two
processes.

I am not sure what should have happened. The DBR should have a deadlock
priority lower than the monitor but higher than all user processes, so any
deadlock error should have been reported to the ACMS servers (probably a
"deadlock on freeze" error).
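
The cycle described above can be made explicit with a small wait-for graph. This is a sketch only: the edges are transcribed from the RMU/SHOW LOCKS listing in 5029.0, the numeric priorities are invented to reflect "DBR above user processes", and `find_cycle` and `choose_victim` are illustrative helpers, not lock manager routines.

```python
# Edges read "waiter -> blockers", transcribed from the listing in 5029.0.
waits_for = {
    "RDM_RB_1": ["ACMS00BSP016002", "ACMS00BSP018002", "ACMS00BSP017002"],  # page 1
    "ACMS00BSP016002": ["ACMS00BSP018002", "ACMS00BSP017002"],              # page 1
    "ACMS00BSP018002": ["RDM_RB_1"],                                        # freeze
    "ACMS00BSP017002": ["RDM_RB_1"],                                        # freeze
    "ACMS00BSP013002": ["ACMS00ASP001000"],                                 # page 1166
}


def find_cycle(graph, start):
    """Depth-first search; return the nodes of a cycle through `start`, or None."""
    stack = [(start, [start])]
    visited = set()
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                return path
            if nxt not in visited:
                visited.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None


def deadlock_priority(name):
    """Hypothetical priorities: the recovery process (DBR) outranks user processes."""
    return 2 if name.startswith("RDM_RB") else 1


def choose_victim(cycle):
    """Pick the participant with the lowest deadlock priority."""
    return min(cycle, key=deadlock_priority)


cycle = find_cycle(waits_for, "RDM_RB_1")
print("cycle:", cycle)
print("victim:", choose_victim(cycle))
print("page 1166 waiter in a cycle:", find_cycle(waits_for, "ACMS00BSP013002"))
```

Under these assumptions the victim is one of the ACMS servers rather than the recovery process, and the waiter on page 1166 is simply blocked, not deadlocked.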
5029.10  ukvms3.uk.oracle.com::PJACKSON  Oracle UK Rdb Support  Mon Feb 17 1997 15:52  31
>~~    That's not what my VMS internals manual says. It says the lock request
>~~    fails.
>
>Which doesn't really disagree with what I said.  
    
    It does as I read it.
    
>When you request a lock
>with WAIT and the request is not immediately granted, you're placed in 
>either the waiting or conversion queue (depending on the previous state of 
>the lock) *and* the timeout queue.  When you've been in the timeout queue 
>deadlock_wait length of time, OpenVMS will check to see if your lock 
>request participates in a deadlock.  If so, then one of the deadlock 
>participators is signalled as the victim and their lock request returns a 
>deadlock error.  That doesn't mean they're removed from the waiting or 
>conversion queue, you've got to $ENQ to a more permissive mode for that to 
>happen. 
    
    If the request is still queued then it has not failed - it may yet
    succeed. 
    
    Two sentences earlier the manual says 'VMS resolves deadlocks by
    choosing a participant in the deadlock cycle and refusing that
    participant's lock request', which also seems incompatible with the
    request remaining queued.
    
    It may be that the manual is wrong. I haven't been able to find
    anything more recent than 1989 - some manuals went missing in the last
    office move :-(
    
    Peter
5029.11  ukvms3.uk.oracle.com::PJACKSON  Oracle UK Rdb Support  Mon Feb 17 1997 15:56  9
>I tend to think that setting deadlock wait to a low number provides fast
>relief, but does not cure the real problem. It acts like a pain killer, but you
>still need to see the doctor. A large number of deadlocks (even if they are
>handled internally by Rdb) is bad and needs investigating. 
    
    I normally consider deadlocks to be a side effect of a locking problem.
    Fix the locking problem and the deadlocks go away by themselves.
    
    Peter
5029.12  Small nit  HOTRDB::PMEAD  Paul, pmead@us.oracle.com, 719-577-8032  Mon Feb 17 1997 16:23  4
    I don't want to lead things off on a big tangent, but it is possible
    for a user process doing a rollback to have deadlock priority higher
    than DBR.  This can occur for brief periods on page deadlocks. 
    Rollbacks proceed regardless of whether DBRs are running.
5029.13  Back to the real question.  NLVMS2::VVISSER  Vincent Visser, Oracle Rdb Support, The Netherlands  Mon Feb 17 1997 17:55  12
    Back to the real question.
    Suppose that, because of the deadlock priority, VMS chooses the
    pagelock as the victim and gives a deadlock error back to Rdb. 
    How does it solve this deadlock? When two pagelocks are involved it can
    release all the pagelocks, but can Rdb decide to release the
    freeze lock? This is the only way to get out of this situation when the
    page lock has been chosen as victim.
    Could it be that this is the problem why it didn't get out of the
    situation?
    
    Regards,
    Vincent
5029.14  HOTRDB::PMEAD  Paul, pmead@us.oracle.com, 719-577-8032  Mon Feb 17 1997 19:45  12
    Any process that gets a deadlock on a page will flush any modified
    buffers and reduce the remaining page locks to the minimum required
    level to indicate that the process is still looking at a page.  It then
    temporarily boosts its deadlock priority to a high enough level that it
    will almost always win in any deadlock conflict (even with a DBR). 
    This activity can iterate forever until all processes involved in the
    deadlock have unmarked all of their buffers and minimized all of their
    page locks.  At some point there should no longer be a conflict.
    
    As far as I know unmarking all buffers is always enough to allow the
    competing process (such as a DBR) to get a copy of the page in question
    and thus resolve the deadlock.  
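
As a toy model of the sequence described above (write out modified buffers, demote page locks, boost deadlock priority), the sketch below may help. Everything in it is hypothetical: the `Process` class, the choice of CR as the "minimum required level", and the `BOOSTED` value are assumptions for illustration and say nothing about how Rdb implements this.

```python
from dataclasses import dataclass, field

MINIMUM_MODE = "CR"  # assumption: stand-in for the "minimum required level"
BOOSTED = 100        # assumption: any value above the DBR's deadlock priority


@dataclass
class Process:
    name: str
    priority: int
    page_locks: dict = field(default_factory=dict)  # page number -> lock mode
    marked_pages: set = field(default_factory=set)  # pages with unwritten changes


def on_page_deadlock(proc, write_page):
    """React to a page deadlock as outlined in the note above."""
    # 1. Flush every modified buffer so no page is left marked.
    for page in sorted(proc.marked_pages):
        write_page(page)
    proc.marked_pages.clear()
    # 2. Demote the remaining page locks to the minimum mode.
    for page in proc.page_locks:
        proc.page_locks[page] = MINIMUM_MODE
    # 3. Temporarily raise the deadlock priority before retrying.
    proc.priority = BOOSTED


written = []
server = Process("ACMS_SERVER", priority=1,
                 page_locks={1: "PW", 1166: "PR"}, marked_pages={1})
on_page_deadlock(server, written.append)
print(written, server.page_locks, server.priority)
```

Once every participant has gone through these steps, no process in the model holds a page lock stronger than the minimum mode, which is the point at which the conflict is expected to disappear.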
5029.15  ukvms3.uk.oracle.com::PJACKSON  Oracle UK Rdb Support  Tue Feb 18 1997 07:24  17
>    Back to the real question. 
>    Suppose that, because of the deadlock priority, VMS chooses the
>    pagelock as the victim and gives a deadlock error back to Rdb. 
>    How does it solve this deadlock? When two pagelocks are involved it can
>    release all the pagelocks, but can Rdb decide to release the
>    freeze lock? This is the only way to get out of this situation when the
>    page lock has been chosen as victim.
>    Could it be that this is the problem why it didn't get out of the
>    situation?
    
    No, because VMS has not given a deadlock back to Rdb. If it had, you
    would not be able to see the deadlock situation using rmu/show locks
    (assuming that Albert and I are correct).
    If what you are suggesting had happened there would be no process waiting
    for the page lock, and that lock request would have been rejected.
    
    Peter
5029.16  another 'deadlock'....  NLVMS3::ADRIEL  Thu Feb 20 1997 15:03  46
	Oracle Rdb V6.1-04 VAX/VMS V6.1

	Hi,

	The same customer encountered a hang condition again last night, which
	could only be resolved by killing one of the processes.
	An operator is warned when the (7x24) application 'hangs' for more than
	30 minutes.
	After that he has to 'solve' the problem as quickly as possible.

	Below is the RMU output from just before the ACMS process was killed.

	This is the third time in a few weeks that such a 'deadlock' condition
	has occurred.
	We'll try to collect as much information as possible, but that's
	difficult afterwards and with almost no time available to analyze
	on-line.

	Any further ideas, for example is this related to previous events?

	Adri 

    
================================================================================
SHOW LOCKS/MODE=BLOCKING Information
================================================================================

--------------------------------------------------------------------------------
Resource: page 1905

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  00207639  ACMS001SP001000     579B0050  00090002  PR        NL
Blocker:  0020824C  BATCH_30.......     3B0007BB  00100001  PW        PW
.
.
.
--------------------------------------------------------------------------------
Resource: nowait signal

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  0020824C  BATCH_30.......     0C001666  00090002  CW        PR
Blocker:  00207639  ACMS001SP001000     66003D3A  00100001  PR        PR
...
..
5029.17  HOTRDB::PMEAD  Paul, pmead@us.oracle.com, 719-577-8032  Thu Feb 20 1997 16:06  10
    That one looks familiar.  A deadlock on the nowait lock.  The nowait
    lock is one of the special "no deadlock search" locks.
    
    I could swear someone reported that in this notesfile a year or so ago.
    If my fuzzy memory serves me right I believe we asked to have the
    problem reported.
    
    Is your customer using fast commit?  Do they use nowait txns?  If so,
    they might want to stop doing one or the other if this problem is
    causing them a lot of grief -- at least until it can be fixed.