
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANE
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1219.0. "SCHEDULER goes into DEP_WAIT state" by REFDV1::DAVIES () Tue Feb 18 1997 10:49

I logged the following call with the hotline, but thought someone in here
    might have some insight.  Ever since we upgraded CASPRO to SEPS97,
    we have been encountering the following scheduler problem on a daily basis.
    
    Anyone have any ideas...
    
    tks,
    Judy
From:	REFDV1::DAVIES       12-FEB-1997 16:46:20.33
To:	NIOPS::FLYNN,PENUTS::EMOTTOLO
CC:	DAVIES
Subj:	DSPS scheduler problem

Bob/Evelyn

I wanted to follow up on the SCHEDULER problem that occurred last night with
DSPS jobs going into a DEP_WAIT state.  Bob, in our phone conversation you
mentioned logging a call to Colorado to see if they have heard of the
problem.  I wanted to document exactly what happened so you can tell them
or add it to the current open target call.  I don't have the log #.

Evelyn, I am copying you on this mail message to make you aware of the problem
also.  Until these problems with the DECscheduler are fixed, it is critical
that you (and everyone you work with) ensure these jobs finish.

The details of the problem...

As Susan mentioned in the attached mail message, after PP61_DSPSPROD 
completed successfully it went into a DEP_WAIT state.  Because it was in
a DEP_WAIT state it didn't kick off PP70_DSPSPROD.  When John Healey 
and I looked at the scheduler entries, PP61_DSPSPROD didn't have any
local jobs depending on it.  (It should have had PP70_DSPSPROD).
The dependency link appeared to be broken.

Below is the state it was in prior to us "RESYNCHing" the 2 jobs.
        casv05::ref_support> sched sh job/full pp60*=dspsprod

	Job Name             Entry    User_name    State      Next Run Time 
	--------             -----    ---------    -----      -------------
	PP61_DSPSPROD        166      DSPSPROD     Scheduled  12-FEB-1997 19:20
	VMS_Command : @DSPS$COMMAND:PP61_DSPSPROD
	Group : DSPSPROD                           Type : DAILY
	Comment : PRIO 1..REFERENCE SUPPORT..;CALL SUPPORT IF JOB RUNNING @11PM..Table f
	ile
	Last Start Time   : 11-FEB-1997 19:30
	Last Finish Time  : 11-FEB-1997 19:37      Last Exit Status : SUCCESS 
	Schedule Interval : None                   Mode   : Detached
	Mail to           : CASV05::DSPSPROD (Always)    
	Days              : None                                
	Output File       : DSPS$LOGS:PP61_DSPSPROD.LOG
	Cluster_CPU       : Default                Notify user upon completion
	Run Priority      : Default
	Max_Time Warning  : None                   Job Always retained  
	Stall Notify      : None                   No Retry on Error
	Success Count     : 941                    Failure Count : 250
	Owner UIC         : [24,1]                 No Restart on Crash
	Send Opcom Completion Message
	No Pre or Post Function for this job
	No local jobs depend upon this job.
	All dependencies must successfully complete after: 12-FEB-1997 14:25:18.32
	Job Dependencies: (APR_ACMS_STOP)

	casv05::ref_support> sched sh job/full pp70*=dspsprod
	
	Job Name             Entry    User_name    State      Next Run Time 
	--------             -----    ---------    -----      -------------
	PP70_DSPSPROD        174      DSPSPROD     Scheduled  12-FEB-1997 19:30
	VMS_Command : @DSPS$COMMAND:PP70_DSPSPROD.COM
	Group : DSPSPROD                           Type : DAILY
	Comment : PRIO 1..REFERENCE SUPPORT..DSPS Extract/Copy
	Last Start Time   : 12-FEB-1997 12:12
	Last Finish Time  : 12-FEB-1997 12:13      Last Exit Status : SUCCESS 
	Schedule Interval : None                   Mode   : Detached
	Mail to           : CASV05::DSPSPROD (Always)    
	Days              : None                                
	Output File       : DSPS$LOGS:PP70_DSPSPROD.LOG
	Cluster_CPU       : Default                Notify user upon completion
	Run Priority      : Default
	Max_Time Warning  : None                   Job Always retained  
	Stall Notify      : None                   No Retry on Error
	Success Count     : 942                    Failure Count : 0
	Owner UIC         : [24,1]                 No Restart on Crash
	Send Opcom Completion Message
	No Pre or Post Function for this job
	No local jobs depend upon this job.
	All dependencies must successfully complete after: 12-FEB-1997 14:25:20.82
	Job Dependencies: (PP61_DSPSPROD)

We then decided to try to add back the dependency and it didn't work.  So
we took off the dependency and added it back on.  Interestingly enough,
we got the warning below:

	casv05::ref_support> sched mod/synch=(pp61_dspsprod=dspsprod) 
			     pp70_dspsprod=dspsprod
	%NSCHED-I-NOMODS, Job PP70_DSPSPROD - No fields modified
	casv05::ref_support> sched mod/nosynch pp70_dspsprod=dspsprod
	%NSCHED-I-RQSTSUCCSS, Job PP70_DSPSPROD - Modified
	%NSCHED-W-NOSCHED, No scheduler available to service request

But it put the dependency back on anyway.  So, there seem to be some
strange things going on with DECSCHEDULER post SEPS97.  Hopefully it
will be o.k. for tonight's processing.

This is the way PP61_DSPSPROD now looks...
DSPS_Judy> sched sho job pp61_dspsprod/full

	Job Name             Entry    User_name    State      Next Run Time 
	--------             -----    ---------    -----      -------------
	PP61_DSPSPROD        166      DSPSPROD     Scheduled  12-FEB-1997 19:20
	VMS_Command : @DSPS$COMMAND:PP61_DSPSPROD
	Group : DSPSPROD                           Type : DAILY
	Comment : PRIO 1..REFERENCE SUPPORT..;CALL SUPPORT IF JOB RUNNING @11PM..Table f
	ile
	Last Start Time   : 11-FEB-1997 19:30
	Last Finish Time  : 11-FEB-1997 19:37      Last Exit Status : SUCCESS 
	Schedule Interval : None                   Mode   : Detached
	Mail to           : CASV05::DSPSPROD (Always)    
	Days              : None                                
	Output File       : DSPS$LOGS:PP61_DSPSPROD.LOG
	Cluster_CPU       : Default                Notify user upon completion
	Run Priority      : Default
	Max_Time Warning  : None                   Job Always retained  
	Stall Notify      : None                   No Retry on Error
	Success Count     : 941                    Failure Count : 250
	Owner UIC         : [24,1]                 No Restart on Crash
	Send Opcom Completion Message
	No Pre or Post Function for this job
	This job has 1 local job(s) that depend upon it:
	(PP70_DSPSPROD)
	All dependencies must successfully complete after: 12-FEB-1997 14:25:18.32
	Job Dependencies: (APR_ACMS_STOP)
	DSPS_Judy> 
Tks,
Judy
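
The broken link described above (PP70_DSPSPROD lists PP61_DSPSPROD under "Job Dependencies", yet PP61_DSPSPROD reports "No local jobs depend upon this job.") can be checked mechanically. The following is only a minimal sketch in Python, assuming the listing has been captured to text and that the layout matches the `sched sh job/full` output shown in this note. The function names and the abbreviated sample are illustrative and are not part of DECscheduler.

```python
import re

def names(text):
    """Return the job names found inside the first (...) group of a line."""
    m = re.search(r"\(([^)]*)\)", text)
    if not m:
        return set()
    return {n.strip(" []") for n in m.group(1).split(",") if n.strip(" []")}

def parse_jobs(listing):
    """Parse captured listing text into {job: {"depends_on", "dependents"}}."""
    jobs = {}
    current = None
    lines = [ln.strip() for ln in listing.splitlines()]
    for i, line in enumerate(lines):
        # Assumption: the job row is the line right after the dashed rule.
        if i > 0 and lines[i - 1].startswith("--------") and line:
            current = line.split()[0]
            jobs[current] = {"depends_on": set(), "dependents": set()}
        elif current is None:
            continue
        elif line.startswith("Job Dependencies:"):
            jobs[current]["depends_on"] |= names(line)
        elif "local job(s) that depend upon it" in line and i + 1 < len(lines):
            # The dependent job names follow on the next line.
            jobs[current]["dependents"] |= names(lines[i + 1])
    return jobs

def find_broken_links(jobs):
    """List (upstream, downstream) pairs where the reverse link is missing."""
    broken = []
    for job in sorted(jobs):
        for dep in sorted(jobs[job]["depends_on"]):
            # Only jobs present in the listing can be cross-checked.
            if dep in jobs and job not in jobs[dep]["dependents"]:
                broken.append((dep, job))
    return broken

# Abbreviated copy of the "before" listings from this note.
SAMPLE = """
Job Name             Entry    User_name    State      Next Run Time
--------             -----    ---------    -----      -------------
PP61_DSPSPROD        166      DSPSPROD     Scheduled  12-FEB-1997 19:20
No local jobs depend upon this job.
Job Dependencies: (APR_ACMS_STOP)

Job Name             Entry    User_name    State      Next Run Time
--------             -----    ---------    -----      -------------
PP70_DSPSPROD        174      DSPSPROD     Scheduled  12-FEB-1997 19:30
No local jobs depend upon this job.
Job Dependencies: (PP61_DSPSPROD)
"""

if __name__ == "__main__":
    print(find_broken_links(parse_jobs(SAMPLE)))
```

Dependencies on jobs that do not appear in the captured listing (APR_ACMS_STOP here) are skipped, since the reverse side cannot be inspected.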


From:	REFDV1::VINCENT      "mach nicht" 12-FEB-1997 10:28:00.79
To:	USOPS::MPR
CC:	DAVIES,MURPHY,HEALEY,VINCENT
Subj:	Please log *URGENT* call for problem with DECScheduler on CASPRO

Hello,

   Refer to attached mail message for description of events that occurred
last night with production on cluster CASPRO.  This is apparently not the first
time since the upgrade on Thursday that sched jobs have gone into the black 
hole of DEP_WAIT state.  At a minimum, data center nightly support should
begin looking for / anticipating this problem and should page reference support
asap if this happens.  Someone should also be looking into why the scheduler
is doing this.  Please have assigned person call me as soon as possible.

Susan Vincent, 227-3776

From:	REFDV1::VINCENT      "mach nicht" 12-FEB-1997 08:56:18.29
To:	DASSS1::NORTON
CC:	VINCENT
Subj:	bunch of dsps jobs didn't run last night, they are in dep_wait state

Hi Joy,

   Fyi, half of dsps's nightly production didn't run last night.  All of the
jobs that did run, ran successfully.  I've checked out the dependencies based
on the schedule produced yesterday and everything is right.  pp70_dspsprod
was the next thing that was supposed to run, but it is in dep_wait, even though
the job it was dependent upon (pp61_dspsprod) completed successfully.  I was 
on beeper last night and didn't get paged, but I guess no one would page me 
since no jobs that actually ran failed.  I'm working with pricing ops to do 
what is necessary to reschedule jobs and run jobs that need to run today, but 
I'm not sure what to do about the real problem -- why are these jobs still in 
dep_wait?  Any ideas?

Susan

1219.1. "corrupt DEPENDENCIES.DAT file" by REFDV1::DAVIES () Wed Feb 19 1997 14:36 (26 lines)

I'll answer my own note: Colorado found that a corrupt file was causing
    our scheduler problem.
    
From:	NIOPS::FLYNN        "Bob Flynn - CCS Platform Management - DTN 264-7632" 18-FEB-1997 14:39:32.24
To:	REFER3::VINCENT,REFER3::MURPHY
CC:	REFER3::DAVIES,REFDV1::HEALEY
Subj:	DECSCHEDULER on CASPRO


	Hi Folks,

	I finally was contacted by the CSC regarding the Scheduler
	problem. The rep believes that the problem is a corrupted
	"dependencies.dat" file. 

	The solution to that is to shut down Scheduler, rename the file
	and restart it on one node. That startup looks for the file and
	if it doesn't find it, it uses the VSS.DAT file to rebuild it.

	I did that at approx. 2:30 this afternoon. Scheduler is back
	up on both nodes and the file is brand new. Let's see what
	tonight brings us.

	Thanks,

	Bob
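
The order of operations in the fix above (stop the Scheduler, set the file aside, restart on one node) can be sketched as follows. This is a hedged illustration in Python rather than the actual DCL procedure: the commands used to stop and start the Scheduler are not given in this note, so they are passed in as caller-supplied callables, and the function name and backup naming scheme are assumptions.

```python
from datetime import datetime
from pathlib import Path

def set_aside_and_restart(dep_file, stop_scheduler, start_scheduler):
    """Rename the dependencies file between a scheduler stop and start.

    stop_scheduler and start_scheduler are hypothetical callables supplied
    by the caller; this sketch does not know the real commands.
    Returns the backup path, or None if the file was not present.
    """
    dep_file = Path(dep_file)
    stop_scheduler()  # step 1: the scheduler must not be running
    backup = None
    if dep_file.exists():
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup = dep_file.with_name(f"{dep_file.name}.{stamp}.bak")
        dep_file.rename(backup)  # step 2: rename rather than delete
    start_scheduler()  # step 3: startup finds no file and rebuilds it
    return backup
```

Renaming instead of deleting keeps the old file available for comparison if some dependency information turns out to be missing after the rebuild.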

1219.2. "make sure network dependencies are ok !" by RUMOR::FALEK (ex-TU58 King) Fri Feb 21 1997 17:48 (10 lines)

    Be careful - when you delete the DEPENDENCIES.DAT file and then have the
    scheduler create a new one, some of the network dependency information
    is lost.  If you have network job dependencies, you will have to
    carefully hand-create them again.   However, all the local job
    dependency info is preserved.
    
    By the way, in the $ sched show job/full display, the user interface
    shows the dependencies that are already satisfied in [ ], so if
    there are multiple jobs that a job depends on, you can see which
    specific ones it is still waiting for.
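
As a rough illustration of reading that bracket notation, the Python sketch below picks out the dependencies a job is still waiting for from a "Job Dependencies:" line. The exact layout of a line with several dependencies (comma-separated, satisfied ones wrapped in [ ]) is an assumption based on the description above, and JOB_A, JOB_B and JOB_C are made-up names.

```python
import re

def pending_dependencies(line):
    """Return dependency names that are not wrapped in [ ] (still pending)."""
    m = re.search(r"\((.*)\)", line)
    if not m:
        return []
    pending = []
    for item in m.group(1).split(","):
        item = item.strip()
        # Assumed convention: [NAME] means the dependency is satisfied.
        if item and not (item.startswith("[") and item.endswith("]")):
            pending.append(item)
    return pending

if __name__ == "__main__":
    # Hypothetical line with three dependencies, two already satisfied.
    print(pending_dependencies("Job Dependencies: ([JOB_A], JOB_B, [JOB_C])"))
```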