
Conference orarep::nomahs::rdb_60

Title:	Oracle Rdb - Still a strategic database for DEC on Alpha AXP!
Notice:	RDB_60 is archived, please use RDB_70..
Moderator:	NOVA::SMITHISON

Created:	Fri Mar 18 1994
Last Modified:	Fri May 30 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5118
Total number of notes:	28246

5029.0. "Deadlock on page not resolved?" by NLVMS2::VVISSER (Vincent Visser, Oracle Rdb Support, The Netherlands) Fri Feb 14 1997 13:19

    Hello,
    
    A few days ago we had a strange situation on a customer's production
    database.
    The application stopped working and nobody received any errors.
    With RMU/SHOW LOCK/MODE=BLOCKING we found that there were
    deadlocks. See the output below.
    A few minutes before they detected the problems, one of the two nodes
    crashed.
    Why are these deadlocks not resolved? I always thought that a deadlock on
    a page was resolved by Rdb.
    The application does check for deadlocks, and it did not get any
    deadlock error at that moment. Stopping the application solved the
    problem.
    Two weeks ago they also had a hang, but then there was no system
    crash involved. Some processes were reporting DEADLOCK ON FREEZE
    errors. In that case they had to kill a process to get things going again.
    
    Customer is using Oracle Rdb V6.1-04 on VAX/VMS V6.1.
    Dynamic lock remastering is disabled (PE1 was set to 25 and 50)
    
    Why is Rdb not solving the deadlock in this particular case?
    
    Regards,
    Vincent
    
================================================================================
SHOW LOCKS/MODE=BLOCKING Information
================================================================================

--------------------------------------------------------------------------------
Resource: page 1

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20413CF1  ACMS00BSP016002     25C3002A  00090002  PW        CR
Blocker:  20411EF3  ACMS00BSP018002     26006AB3  00090002  PR        PR
Blocker:  204138F2  ACMS00BSP017002     04B60053  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: freeze

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411EF3  ACMS00BSP018002     7B0020D6  00090002  CW        NL
Blocker:  20411E31  RDM_RB_1.......     2E00CF86  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: freeze

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  204138F2  ACMS00BSP017002     2F004D72  00090002  CW        NL
Blocker:  20411E31  RDM_RB_1.......     2E00CF86  00090002  PR        PR

--------------------------------------------------------------------------------
Resource: page 1166

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411138  ACMS00BSP013002     3A00A956  00090002  PR        NL
Blocker:  204146F0  ACMS00ASP001000     47006061  00090002  PW        PW

--------------------------------------------------------------------------------
Resource: page 1

          ProcessID Process Name        Lock ID   System ID Requested Granted  
          --------- ---------------     --------- --------- --------- -------
Waiting:  20411E31  RDM_RB_1.......     4600F91C  00090002  PW        NL
Blocker:  20413CF1  ACMS00BSP016002     25C3002A  00090002  PW        CR
Blocker:  20411EF3  ACMS00BSP018002     26006AB3  00090002  PR        PR
Blocker:  204138F2  ACMS00BSP017002     04B60053  00090002  PR        PR

T.R	Title	User	Personal Name	Date	Lines
5029.1		M5::JHAYTER		`Fri Feb 14 1997 17:41`	9
	> Dynamic lock remastering is disabled (PE1 was set to 25 and 50)

	try using 1 or (%xFFFFFFFF) -1

	> Why is Rdb not solving the deadlock in this particular case?

	Rdb does not detect deadlocks. The VMS lock manager does and it notifies
	Rdb.
5029.2		NOVA::R_ANDERSON	Oracle Corporation (603) 881-1935	`Sat Feb 15 1997 11:04`	4
	Also, Rdb handles "page" deadlocks internally - they are not normally
	returned to the application.

	Rick
5029.3	How does it solve the deadlock?	NLVMS2::VVISSER	Vincent Visser, Oracle Rdb Support, The Netherlands	`Mon Feb 17 1997 07:53`	17
	>Also, Rdb handles "page" deadlocks internally - they are not normally
	>returned to the application.
	>
	>Rick

	This is exactly what it should do. The application is not getting
	any deadlock error, but when you look at the RMU/SHOW LOCK/MODE=BLOCKING
	output there are deadlocks.
	It looks like Rdb doesn't correctly handle "page" deadlocks
	internally.

	How does it solve a deadlock with a page lock and a freeze lock
	involved? Who will be chosen as the victim?

	Regards,
	Vincent
5029.4		ukvms3.uk.oracle.com::PJACKSON	Oracle UK Rdb Support	`Mon Feb 17 1997 08:08`	15
	> This is exactly what it should do. The application is not getting
	> any deadlock error, but when you look at the RMU/SHOW LOCK/MODE=BLOCKING
	> output there are deadlocks.

	This shows that VMS has not chosen one of the lock requests to abort.
	When it does the $ENQ returns an error and the request will no longer
	be outstanding.

	> How does it solve a deadlock with a page lock and a freeze lock
	> involved? Who will be chosen as the victim?

	VMS does the choosing (based on a value supplied by Rdb). Until VMS
	chooses one Rdb can do nothing.

	Peter
5029.5		NOVA::R_ANDERSON	Oracle Corporation (603) 881-1935	`Mon Feb 17 1997 09:13`	6
	Check your DEADLOCK_WAIT sysgen parameter. I like to have it set to "1" or
	"2" (the default is "10" seconds, which is horrendous for any real-world
	application).

	Rick
5029.6	Gotta know your system and the OpenVMS lock manager...	BOUVS::OAKEY	I'll take Clueless for $500, Alex	`Mon Feb 17 1997 14:31`	31
	~~Note 5029.4 Deadlock on page not resolved? 4 of 5
	~~ukvms3.uk.oracle.com::PJACKSON "Oracle UK Rdb Suppo" 15 lines 17-FEB-1997 05:08
	~~
	~~ This shows that VMS has not chosen one of the lock requests to abort.
	~~ When it does the $ENQ returns an error and the request will no longer
	~~ be outstanding.

	Not quite true. When OpenVMS detects a deadlock, it signals the victim but
	does nothing to the pending request. It is up to the victim to $ENQ to a
	lesser lock mode (or $DEQ) to remove the request from the appropriate
	pending/conversion queue.

	~~Note 5029.5 Deadlock on page not resolved? 5 of 5
	~~NOVA::R_ANDERSON "Oracle Corporation (603) 881-1935" 6 lines 17-FEB-1997 06:13
	~~
	~~I like to have it set to "1" or "2" (default the "10" seconds, which is
	~~horrendous for any real-world application).

	Here is where I might disagree a bit. DEADLOCK_WAIT is a SYSGEN parameter.
	Tweaking it affects the entire system. Setting it to 1 or 2 will help
	quickly identify true deadlocks. However, you may be causing the system to
	check an excessive number of potential deadlocks in the deadlock queue that
	aren't really deadlocks, just pending lock requests. You should evaluate
	your system to make sure that you aren't waiting an excessive amount of
	time to find real deadlocks but also to make sure you're not checking too
	quickly and using up system resources checking for potential deadlocks that
	aren't.
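The trade-off described in 5029.6 can be illustrated with a small counting model. This is only a sketch under simplifying assumptions: each waiting lock request is assumed to trigger at most one deadlock search, once it has waited DEADLOCK_WAIT seconds, and the wait times below are invented for illustration rather than measured on any system. The function name `searches_triggered` is hypothetical.

```python
def searches_triggered(wait_durations, deadlock_wait):
    """Count lock waits that last long enough to trigger a deadlock search.

    Simplified model: a request that is still waiting after `deadlock_wait`
    seconds causes one search, whether or not it is part of a real deadlock.
    """
    return sum(1 for seconds in wait_durations if seconds >= deadlock_wait)


# Hypothetical lock wait times in seconds (invented, not measured).
waits = [0.2, 0.5, 1.5, 3.0, 4.0, 12.0]

for setting in (1, 2, 10):
    print(f"DEADLOCK_WAIT={setting}: {searches_triggered(waits, setting)} searches")
```

In this toy sample only the longest wait could be a genuine deadlock, so a lower setting finds it sooner while also paying for searches on ordinary, short-lived lock waits.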
5029.7		138.3.209.29::PJACKSON	Oracle UK Rdb Support	`Mon Feb 17 1997 15:05`	9
	>Not quite true. When OpenVMS detects a deadlock, it signals the victim but
	>does nothing to the pending request. It is up to the victim to $ENQ to a
	>lesser lock mode (or $DEQ) to remove the request from the appropriate
	>pending/conversion queue.

	That's not what my VMS internals manual says. It says the lock request
	fails.

	Peter
5029.8	I think we said the same thing :)	BOUVS::OAKEY	I'll take Clueless for $500, Alex	`Mon Feb 17 1997 15:25`	16
	~~ <<< Note 5029.7 by 138.3.209.29::PJACKSON "Oracle UK Rdb Support" >>>
	~~ That's not what my VMS internals manual says. It says the lock request
	~~ fails.

	Which doesn't really disagree with what I said. When you request a lock
	with WAIT and the request is not immediately granted, you're placed in
	either the waiting or conversion queue (depending on the previous state of
	the lock) and the timeout queue. When you've been in the timeout queue
	deadlock_wait length of time, OpenVMS will check to see if your lock
	request participates in a deadlock. If so, then one of the deadlock
	participators is signalled as the victim and their lock request returns a
	deadlock error. That doesn't mean they're removed from the waiting or
	conversion queue, you've got to $ENQ to a more permissive mode for that to
	happen.
5029.9		NOVA::GODFRIND	Oracle Rdb Engineering	`Mon Feb 17 1997 15:46`	49
	>~~ This shows that VMS has not chosen one of the lock requests to abort.
	>~~ When it does the $ENQ returns an error and the request will no longer
	>~~ be outstanding.
	>
	>Not quite true. When OpenVMS detects a deadlock, it signals the victim but
	>does nothing to the pending request. It is up to the victim to $ENQ to a
	>lesser lock mode (or $DEQ) to remove the request from the appropriate
	>pending/conversion queue.

	Ahem. I beg to disagree (and agree with Peter). The lock request for which
	the deadlock error gets reported does get removed from the queue it was
	waiting in (and put back in its prior state if necessary). However, the
	other locks that the victim process may have (and that are blocking the
	other processes, causing the deadlock condition) do NOT get removed
	automatically. It is up to the application to do the right thing (usually
	roll back the current transaction).

	>~~I like to have it set to "1" or "2" (default the "10" seconds, which is
	>~~horrendous for any real-world application).
	>
	>Here is where I might disagree a bit. DEADLOCK_WAIT is a SYSGEN parameter.
	>Tweaking it affects the entire system. Setting it to 1 or 2 will help
	>quickly identify true deadlocks. However, you may be causing the system to
	>check an excessive number of potential deadlocks in the deadlock queue that
	>aren't really deadlocks, just pending lock requests. You should evaluate
	>your system to make sure that you aren't waiting an excessive amount of
	>time to find real deadlocks but also to make sure you're not checking too
	>quickly and using up system resources checking for potential deadlocks that
	>aren't.

	I beg to agree. Deadlock searches are pretty costly - not so much that they
	use CPU, but that they use kernel mode CPU at elevated IPL (IPL8), which may
	disturb other system functions.

	I tend to think that setting deadlock wait to a low number provides fast
	relief, but does not cure the real problem. It acts like a pain killer, but
	you still need to see the doctor. A large number of deadlocks (even if they
	are handled internally by Rdb) is bad and needs investigating.

	That said, we are straying away from the base problem. From the look of it,
	two ACMS servers were waiting for the freeze lock, held by a recovery
	process, which itself was waiting for a page (page #1 in some area), held by
	those two processes.

	I am not sure what should have happened. The DBR should have a deadlock
	priority lower than the monitor but higher than all user processes, so any
	deadlock error should have been reported to the ACMS servers (probably a
	"deadlock on freeze" error).
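The reading of the listing given in 5029.9 can be checked mechanically by turning each "Waiting/Blocker" entry into wait-for edges and searching for a cycle. The following is a minimal sketch, not anything Rdb or VMS actually runs: the entries are transcribed by hand from the first listing in 5029.0 (with the padding dots dropped from the process name), and the numeric values in `DEADLOCK_PRIORITY` are hypothetical, chosen only so that the recovery process ranks above the user processes as 5029.9 says it should.

```python
from collections import defaultdict

# (resource, waiter, blockers), transcribed from the listing in 5029.0.
BLOCKING = [
    ("page 1", "ACMS00BSP016002", ["ACMS00BSP018002", "ACMS00BSP017002"]),
    ("freeze", "ACMS00BSP018002", ["RDM_RB_1"]),
    ("freeze", "ACMS00BSP017002", ["RDM_RB_1"]),
    ("page 1166", "ACMS00BSP013002", ["ACMS00ASP001000"]),
    ("page 1", "RDM_RB_1",
     ["ACMS00BSP016002", "ACMS00BSP018002", "ACMS00BSP017002"]),
]

# Hypothetical priorities: higher means less likely to be picked as victim.
DEADLOCK_PRIORITY = {"RDM_RB_1": 10}


def wait_for_graph(entries):
    """Map each waiting process to the set of processes blocking it."""
    graph = defaultdict(set)
    for _resource, waiter, blockers in entries:
        graph[waiter].update(blockers)
    return graph


def find_cycle(graph):
    """Return one wait-for cycle as a list of process names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in sorted(graph.get(node, ())):
            if colour[nxt] == GREY:          # back edge closes a cycle
                return stack[stack.index(nxt):]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in sorted(graph):
        if colour[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


def pick_victim(cycle):
    """Pick the cycle member with the lowest (hypothetical) priority."""
    return min(cycle, key=lambda name: DEADLOCK_PRIORITY.get(name, 0))


graph = wait_for_graph(BLOCKING)
cycle = find_cycle(graph)
print("cycle:", cycle)
print("victim:", pick_victim(cycle))
```

Under these assumptions the cycle runs through the recovery process and ACMS servers, the page 1166 wait is not part of it, and the victim chosen is an ACMS server rather than the DBR.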
5029.10		ukvms3.uk.oracle.com::PJACKSON	Oracle UK Rdb Support	`Mon Feb 17 1997 15:52`	31
	>~~ That's not what my VMS internals manual says. It says the lock request
	>~~ fails.
	>
	>Which doesn't really disagree with what I said.

	It does as I read it.

	>When you request a lock
	>with WAIT and the request is not immediately granted, you're placed in
	>either the waiting or conversion queue (depending on the previous state of
	>the lock) and the timeout queue. When you've been in the timeout queue
	>deadlock_wait length of time, OpenVMS will check to see if your lock
	>request participates in a deadlock. If so, then one of the deadlock
	>participators is signalled as the victim and their lock request returns a
	>deadlock error. That doesn't mean they're removed from the waiting or
	>conversion queue, you've got to $ENQ to a more permissive mode for that to
	>happen.

	If the request is still queued then it has not failed - it may yet
	succeed.

	Two sentences earlier the manual says 'VMS resolves deadlocks by choosing
	a participant in the deadlock cycle and refusing that participant's lock
	request', which also seems incompatible with the request remaining
	queued.

	It may be that the manual is wrong. I haven't been able to find
	anything more recent than 1989 - some manuals went missing in the last
	office move :-(

	Peter
5029.11		ukvms3.uk.oracle.com::PJACKSON	Oracle UK Rdb Support	`Mon Feb 17 1997 15:56`	9
	>I tend to think that setting deadlock wait to a low number provides fast
	>relief, but does not cure the real problem. It acts like a pain killer, but you
	>still need to see the doctor. A large number of deadlocks (even if they are
	>handled internally by Rdb) is bad and needs investigating.

	I normally consider deadlocks to be a side effect of a locking problem.
	Fix the locking problem and the deadlocks go away by themselves.

	Peter
5029.12	Small nit	HOTRDB::PMEAD	Paul, pmead@us.oracle.com, 719-577-8032	`Mon Feb 17 1997 16:23`	4
	I don't want to lead things off on a big tangent, but it is possible for a user process doing a rollback to have deadlock priority higher than DBR. This can occur for brief periods on page deadlocks. Rollbacks proceed regardless of whether DBRs are running.
5029.13	Back to the real question.	NLVMS2::VVISSER	Vincent Visser, Oracle Rdb Support, The Netherlands	`Mon Feb 17 1997 17:55`	12
	Back to the real question.
	Suppose that, because of the deadlock priority, VMS chooses the
	page lock as the victim and gives a deadlock error back to Rdb.
	How does Rdb solve this deadlock? When two page locks are involved it can
	release all the page locks, but can Rdb decide to release the
	freeze lock? This is the only way to get out of this situation when the
	page lock has been chosen as victim.
	Could this be the reason why it didn't get out of the
	situation?

	Regards,
	Vincent
5029.14		HOTRDB::PMEAD	Paul, pmead@us.oracle.com, 719-577-8032	`Mon Feb 17 1997 19:45`	12
	Any process that gets a deadlock on a page will flush any modified buffers and reduce the remaining page locks to the minimum required level to indicate that the process is still looking at a page. It then temporarily boosts its deadlock priority to a high enough level that it will almost always win in any deadlock conflict (even with a DBR). This activity can iterate forever until all processes involved in the deadlock have unmarked all of their buffers and minimized all of their page locks. At some point there should no longer be a conflict. As far as I know unmarking all buffers is always enough to allow the competing process (such as a DBR) to get a copy of the page in question and thus resolve the deadlock.
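The back-off described in 5029.14 can be pictured with a toy model. This is a sketch for illustration only and does not reflect Rdb's actual data structures: the class `Process`, the functions `handle_page_deadlock` and `page_available`, the placeholder lock mode `MINIMUM_MODE`, and the value of `BOOSTED_PRIORITY` are all hypothetical names and numbers introduced here.

```python
from dataclasses import dataclass, field

MINIMUM_MODE = "minimum"   # placeholder for "still looking at the page"
BOOSTED_PRIORITY = 1000    # hypothetical "almost always wins" priority


@dataclass
class Process:
    name: str
    deadlock_priority: int
    marked_pages: set = field(default_factory=set)   # modified buffers
    page_locks: dict = field(default_factory=dict)   # page number -> mode


def handle_page_deadlock(proc):
    """Apply the three steps from 5029.14 to one process."""
    flushed = sorted(proc.marked_pages)
    proc.marked_pages.clear()                 # 1. flush modified buffers
    for page in list(proc.page_locks):
        proc.page_locks[page] = MINIMUM_MODE  # 2. demote remaining page locks
    proc.deadlock_priority = BOOSTED_PRIORITY  # 3. temporary priority boost
    return flushed


def page_available(page, processes):
    """True once nobody holds the page above the minimum mode."""
    return all(p.page_locks.get(page, MINIMUM_MODE) == MINIMUM_MODE
               for p in processes)


servers = [
    Process("ACMS00BSP018002", 0, marked_pages={1}, page_locks={1: "PR"}),
    Process("ACMS00BSP017002", 0, page_locks={1: "PR"}),
]
print(page_available(1, servers))   # page 1 is still held in PR
for server in servers:
    handle_page_deadlock(server)
print(page_available(1, servers))   # every holder has backed off
```

In this model the competing process (such as a DBR) can obtain the page only after every holder has gone through the back-off, which matches the remark that the activity may need to iterate over all processes involved.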
5029.15		ukvms3.uk.oracle.com::PJACKSON	Oracle UK Rdb Support	`Tue Feb 18 1997 07:24`	17
	> Back to the real question.
	> Suppose that, because of the deadlock priority, VMS chooses the
	> pagelock as the victim and gives a deadlock error back to Rdb.
	> How does it solve this deadlock? When two pagelocks are involved it can
	> release all the pagelocks, but can Rdb decide to release the
	> freeze lock? This is the only way to get out of this situation when the
	> page lock has been chosen as victim.
	> Could it be that this is the problem why it didn't get out of the
	> situation?

	No, because VMS has not given a deadlock back to Rdb. If it had, you
	would not be able to see the deadlock situation using rmu/show locks
	(assuming that Albert and I are correct). If what you are suggesting had
	happened there would be no process waiting for the page lock, and that
	lock request would have been rejected.

	Peter
5029.16	another 'deadlock'....	NLVMS3::ADRIEL		`Thu Feb 20 1997 15:03`	46
	Oracle Rdb V6.1-04
	VAX/VMS V6.1

	Hi,

	The same customer again encountered a hang condition last night, which
	could only be resolved by killing one of the processes. An operator is
	warned when the (7x24) application 'hangs' for more than 30 minutes, after
	which he has to 'solve' the problem as quickly as possible.
	Below is the RMU output from just before the ACMS process was killed.
	This is the third time in a few weeks that such a 'deadlock' condition has
	occurred. We'll try to collect as much information as possible, but that is
	difficult afterwards and with almost no time available to analyze on-line.

	Any further ideas? For example, is this related to the previous events?

	Adri

================================================================================
SHOW LOCKS/MODE=BLOCKING Information
================================================================================

--------------------------------------------------------------------------------
Resource: page 1905

          ProcessID Process Name        Lock ID   System ID Requested Granted
          --------- ---------------     --------- --------- --------- -------
Waiting:  00207639  ACMS001SP001000     579B0050  00090002  PR        NL
Blocker:  0020824C  BATCH_30.......     3B0007BB  00100001  PW        PW
. . .

--------------------------------------------------------------------------------
Resource: nowait signal

          ProcessID Process Name        Lock ID   System ID Requested Granted
          --------- ---------------     --------- --------- --------- -------
Waiting:  0020824C  BATCH_30.......     0C001666  00090002  CW        PR
Blocker:  00207639  ACMS001SP001000     66003D3A  00100001  PR        PR
... ..
5029.17		HOTRDB::PMEAD	Paul, pmead@us.oracle.com, 719-577-8032	`Thu Feb 20 1997 16:06`	10
	That one looks familiar. A deadlock on the nowait lock. The nowait lock is one of the special "no deadlock search" locks. I could swear someone reported that in this notesfile a year or so ago. If my fuzzy memory serves me right I believe we asked to have the problem reported. Is your customer using fast commit? Do they use nowait txns? If so, they might want to stop doing one or the other if this problem is causing them a lot of grief -- at least until it can be fixed.