
Conference vaxaxp::vmsnotes

Title: VAX and Alpha VMS
Notice: This is a new VMSnotes, please read note 2.1
Moderator: VAXAXP::BERNARDO
Created: Thu Jan 23 1997
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 703
Total number of notes: 3722

342.0. "JOB_CONTROL hung in LEF state" by DECPRG::ZVONAR () Tue Mar 18 1997 07:53

The following problem recurs 2-3 times per week on a customer system (OpenVMS 
Alpha 6.1, single AS2100, shadowed system disk):

A couple of days after boot, batch jobs hang in the starting state; sometimes 
one batch job hangs in the executing state during exit. Restarting the queue 
manager does not solve the problem: every job still hangs in the starting 
state. The only way to recover the queue manager is to reboot. After the 
reboot everything works, but the problem reappears later.

I found that the JOB_CONTROL process hangs in LEF state with a busy channel 
open on SYS$SYSROOT:[SYSMGR]ACCOUNTNG.DAT. The situation appears to affect 
only batch and print jobs. Interactive logins continue without problems (but 
nothing is logged to ACCOUNTNG.DAT).

Installed ECOs: ALPSHAD09_061, AXPSCSI01_061, ALPQMAN03_070, AXPDRIV02_061.

It is the same problem as described in Note 626 in VMSNOTES_V12, which has no 
final solution.

The problem does not depend on the job entry number or the submit time. Disk 
free space is sufficient and fragmentation is acceptable.


Thanks for any info or hints,
Karel

------------------------------------------------------------------------------
The JOB_CONTROL process looks like this:

Process index: 000A   Name: JOB_CONTROL   Extended PID: 0000008A

Process status:        00141003  RES,DELPEN,WAKEPEN,PHDRES,LOGIN
Required capabilities: 0000000C  QUORUM,RUN

PCB address              80A29900    JIB address              80A29B80
PHD address              81972000    Swapfile disk address    00000000
Master internal PID      0001000A    Subprocess count                0
Internal PID             0001000A    Creator internal PID     00000000
Extended PID             0000008A    Creator extended PID     00000000
State                       LEF      Termination mailbox          0000
Previous CPU Id          00000000    Current CPU Id           00000000
Previous ASNSEQ  0000000000013DC4    Previous ASN     000000000000001B
Current priority               13    # of threads     0000000000000000
Initial process priority        8    Delete pending count         0
Base priority                   8    AST's active                 NONE
UIC                [00001,000004]    AST's remaining               295
Mutex count                     0    Buffered I/O count/limit      198/200
Waiting EF cluster              0    Direct I/O count/limit        199/200
Abs time of last event   02035FA1    BUFIO byte count/limit    1637440/1637696
Event flag wait mask     BFFFFFFF    # open files allowed left     197
Swapped copy of LEFC0    00000000    Timer entries allowed left    298
Swapped copy of LEFC1    00000000    Active page table count         0
Global cluster 2 pointer 00000000    Process WS page count         130
Global cluster 3 pointer 00000000    Global WS page count            0


                            Process active channels
                            -----------------------

Channel  Window           Status        Device/file accessed
-------  ------           ------        --------------------
  0010  00000000                        DSA0:
  0020  809F6E40                        DSA0:[VMS$COMMON.SYSEXE]JBC$JOB_CONTROL*
  0030  00000000             Busy       MBA1:
  0050  809EDE00                        DSA0:[VMS$COMMON.SYSEXE]QMAN$MASTER.DAT*
  0060  80B64B00             Busy       DSA0:[SYS0.SYSMGR]ACCOUNTNG.DAT;2
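
For anyone triaging several dumps, the busy channels can be pulled out of the display above with a small script. This is only a sketch: the function name `busy_channels` and the regular expression are illustrative, and it assumes the column layout shown here (channel, window, optional "Busy" status, device/file). Other status values that SDA may print are not handled.

```python
import re

# Rows copied from the "Process active channels" display above.
SDA_CHANNELS = """\
  0010  00000000                        DSA0:
  0020  809F6E40                        DSA0:[VMS$COMMON.SYSEXE]JBC$JOB_CONTROL*
  0030  00000000             Busy       MBA1:
  0050  809EDE00                        DSA0:[VMS$COMMON.SYSEXE]QMAN$MASTER.DAT*
  0060  80B64B00             Busy       DSA0:[SYS0.SYSMGR]ACCOUNTNG.DAT;2
"""

# channel (4 hex digits), window (8 hex digits), optional "Busy", device/file
ROW = re.compile(r"^\s*([0-9A-F]{4})\s+([0-9A-F]{8})\s+(Busy\s+)?(\S+)\s*$")


def busy_channels(text):
    """Return (channel, device/file) pairs for rows whose status is Busy."""
    result = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(3):
            result.append((m.group(1), m.group(4)))
    return result


if __name__ == "__main__":
    for chan, target in busy_channels(SDA_CHANNELS):
        print(chan, target)
```

Run against the rows above, it reports channel 0030 (the mailbox) and channel 0060 (ACCOUNTNG.DAT), matching the two "Busy" entries in the display.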


SDA> show call
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 8008F8B0              SYS$WAITFR_C
        Return address on stack = FFFFFFFF 801E135C     RMS_NPRO+1535C

Registers saved on stack
------------------------
7FA5F9F0  FFFFFFFF 8086C480  Saved R2     RMS_NPRW+00080
7FA5F9F8  00000000 0005036C  Saved R3
7FA5FA00  FFFFFFFF 836C3150  Saved R13    EXE$PRCDELMSG+00048
7FA5FA08  00000000 7FA5FA10  Saved R29
SDA> show call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 801E12D0              RMS_NPRO+152D0
        Return address on stack = FFFFFFFF 801E210C     SYS$FLUSH_C+0009C

Registers saved on stack
------------------------
7FA5FA20  FFFFFFFF 8086C720  Saved R2     SYS$FLUSH
7FA5FA28  00000000 00018001  Saved R3
7FA5FA30  00000000 0005031C  Saved R4
7FA5FA38  00000000 0005036C  Saved R5
7FA5FA40  00000000 7FA5FA50  Saved R29
SDA> show call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 801E2070              SYS$FLUSH_C
        Return address on stack = 00000000 00034768

Registers saved on stack
------------------------
7FA5FA68  00000000 00010BB8  Saved R2     INI$LNM_OBJECT_REGISTRATION+00888
7FA5FA70  00000000 00050120  Saved R3
7FA5FA78  00000000 7FA5FA80  Saved R29
SDA> show call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: 00000000 00034688
        Return address on stack = 00000000 00033120

Registers saved on stack
------------------------
7FA5FA90  00000000 00010128  Saved R2     SYS$K_VERSION_16+000E8
7FA5FA98  00000000 0005091C  Saved R3
7FA5FAA0  00000000 00000000  Saved R4
7FA5FAA8  00000000 00050000  Saved R5
7FA5FAB0  00000000 7FA5FAC0  Saved R29
SDA> show call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: 00000000 000330A8
        Return address on stack = 00000000 00032268

Registers saved on stack
------------------------
7FA5FAD0  00000000 000102E0  Saved R2     SYS$K_VERSION_16+002A0
7FA5FAD8  00000000 00000004  Saved R3
7FA5FAE0  00000000 7FA5FB20  Saved R29
SDA> show call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: 00000000 00031C20
        Handler at FFFFFFFF 8081CB60, Data = 00000000 00000018
        Return address on stack = FFFFFFFF 836B3A24     EXE$PROC_IMGACT_C+003A4

Registers saved on stack
------------------------
7FA5FB50  00000000 7FFBF87C  Saved R2     MMG$IMGHDRBUF+0007C
7FA5FB58  00000000 7FFBF960  Saved R3     MMG$IMGHDRBUF+00160
7FA5FB60  FFFFFFFF 80A28B00  Saved R4     PCB
7FA5FB68  00000000 7FF84000  Saved R5
7FA5FB70  FFFFFFFF 8322ADB0  Saved R6
7FA5FB78  00000000 7FA5FBA0  Saved R29


-----------------------------------------------------------------------



342.1. "" by MOVIES::WIDDOWSON (Rod) Tue Mar 18 1997 08:33, 4 lines
    It might be interesting to see whether the XQP is active.  
    
    SDA> Show  proc/lock		! and
    SDA> CLUE XQP/ACT/FULL 

342.2. "No active XQP processes" by DECPRG::ZVONAR () Tue Mar 18 1997 11:03, 17 lines
Today I can check only the forced crash dump file. JOB_CONTROL is currently 
running; the customer rebooted the system yesterday.

 SDA> CLUE XQP/ACT/FULL
%CLUE-I-NOACTIVE, there are no active XQP processes

 SDA> Show  proc/lock
	looks the same as on the running system.

I have the crash dump file and some SDA output taken from the running system 
after the problem appeared.

Any further tips, please?

Thanks in advance,
Karel

342.3. "Check for "lost" Kmode AST" by GIDDAY::GILLINGS (a crucible of informative mistakes) Tue Mar 18 1997 20:19, 19 lines
  Karel,
    It's possible to "lose" K mode ASTs, often leaving a process in LEF
  state. Typically there's a busy channel to a disk. Format the PCB of
  the process and check the AST queue:

80B99528   PCB$L_ASTQFL_K                  80B99528     PCB+00028
80B9952C   PCB$L_ASTQBL_K                  80B99528     PCB+00028

  Here the QFL and QBL are the same (both point back at the queue header,
  PCB+00028) => empty queue. If they're different you're probably seeing
  the problem described. For some reason this problem seems to show up
  more frequently if ALPSHAD09 is installed.

  Solution is to install patch ALPSYS17_061. Indeed, I'd recommend you
  make sure your system has all of the following patches installed:

        ALPLIBR05_070, ALPF11X03_070, ALPSYS08_070, ALPRMS04_061,
        ALPSYS17_061, ALPSMUP01_070, ALPSHAD09_061, ALPSHAD12_061

						John Gillings, Sydney CSC
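
The check described in this reply can be sketched as a small script. It assumes the three-column layout of the formatted PCB lines quoted in this thread (field address, field name, contents) and treats the queue as empty when both links point back at the queue header, that is, at the address of PCB$L_ASTQFL_K itself. The function name `kernel_ast_queue_empty` is illustrative, not an SDA facility.

```python
import re

# Lines from this reply (empty queue) and the values posted in reply .4.
EMPTY = """\
80B99528   PCB$L_ASTQFL_K                  80B99528     PCB+00028
80B9952C   PCB$L_ASTQBL_K                  80B99528     PCB+00028
"""
NOT_EMPTY = """\
> 80A28B28   PCB$L_ASTQFL_K                  809C9E58
> 80A28B2C   PCB$L_ASTQBL_K                  80996E80     SISR+00A38
"""

# Optional quote prefix, field address, field name, field contents.
LINE = re.compile(r"^\W*([0-9A-F]{8})\s+(PCB\$L_ASTQ[FB]L_K)\s+([0-9A-F]{8})")


def kernel_ast_queue_empty(text):
    """True when both kernel AST queue links point at the queue header."""
    fields = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            addr, name, value = m.groups()
            fields[name] = (int(addr, 16), int(value, 16))
    header, flink = fields["PCB$L_ASTQFL_K"]  # header is the FL field's address
    _, blink = fields["PCB$L_ASTQBL_K"]
    return flink == header and blink == header


if __name__ == "__main__":
    print(kernel_ast_queue_empty(EMPTY))      # True
    print(kernel_ast_queue_empty(NOT_EMPTY))  # False
```

A False result on a process stuck in LEF with a busy disk channel is the pattern to look for; it is an indicator, not proof.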

342.4. "QFL and QBL are not the same" by DECPRG::ZVONAR () Wed Mar 19 1997 09:01, 15 lines
John,

QFL and QBL are not the same:

> 80A28B28   PCB$L_ASTQFL_K                  809C9E58	
> 80A28B2C   PCB$L_ASTQBL_K                  80996E80	SISR+00A38

ECOs currently installed:
ALPSHAD09_061, AXPSCSI01_061, ALPQMAN03_070, AXPDRIV02_061.

As the next step I will install the ECOs from .3.

Thanks for your help,
Karel

342.5. "" by VIRKE::GULLNAS () Sat Mar 22 1997 14:01, 14 lines
To be really sure that you have the KAST-disabled problem you
should also check the AST{SR/EN} register in SDA>show proc/ind=nn/reg.

This only works on crash dumps. ana/system always shows the
KASTs as enabled, even if they are disabled. When using
ana/system, a non-empty kernel AST queue for a process
is a good indicator.

The circumstances leading to KAST being disabled involve a kernel
mode AST returning at an elevated IPL. OpenVMS Posix code does
this quite frequently. Perhaps the patched shadowing code does
the same.

	Olof
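
A minimal sketch of how the AST{SR/EN} value might be decoded. It assumes the Alpha hardware PCB layout in which ASTEN occupies bits 3:0 and ASTSR bits 7:4, with bit 0 of each field being kernel mode, then executive, supervisor and user. That layout is an assumption here and should be checked against the Alpha architecture documentation. The function name `decode_astsr_asten` and the input values are illustrative.

```python
MODES = ("kernel", "executive", "supervisor", "user")


def decode_astsr_asten(value):
    """Split a combined AST{SR/EN} value into enabled and pending modes.

    Assumed layout: ASTEN in bits 3:0, ASTSR in bits 7:4, bit 0 = kernel.
    """
    asten = value & 0xF
    astsr = (value >> 4) & 0xF
    return {
        "enabled": [m for i, m in enumerate(MODES) if asten & (1 << i)],
        "pending": [m for i, m in enumerate(MODES) if astsr & (1 << i)],
    }


if __name__ == "__main__":
    # Two illustrative values: all modes enabled, and kernel ASTs disabled.
    for raw in (0x1F, 0x1E):
        print(f"{raw:08X}", decode_astsr_asten(raw))
```

Under that assumption, a value whose low bit is clear in a crash dump would mean kernel ASTs are disabled, which is the condition described above.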

342.6. "Problem fixed" by DECPRG::ZVONAR () Tue Mar 25 1997 13:59, 10 lines
Hello,

I installed the ECOs recommended in .3. The system has now been running for 
6 days without problems (before the ECOs were installed the problem occurred 
every 2-3 days).

Thanks to all for the help,
Karel

PS:   AST{SR/EN}    = 0000001F