
Conference humane::scheduler

Title:SCHEDULER
Notice:Welcome to the Scheduler Conference on node HUMANE
Moderator:RUMOR::FALEK
Created:Sat Mar 20 1993
Last Modified:Tue Jun 03 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1240
Total number of notes:5017

1234.0. "Default node question" by TAV02::GALIA (Galia Reznik, Israel Software Support) Wed Apr 02 1997 13:12

    Hi,
    
    Our customer has the following configuration:
    Cluster of 2 * VAX V6.1 SCHED V2.1A  +  Alpha V6.1 SCHED V2.1B-1.
    The database for the mixed cluster is configured OK.
    
    When RUNning a job to an Alpha queue, an entry is created, and
    $ SHOW PROC/CONT/ID=xxx shows that it executes SCHEDULER$DO_COMMAND.EXE
    and the process is in LEF state. It never ends. The Scheduler, though,
    thinks it has ended.
    The above situation happens when the Alpha is NOT the default node.
    When they set the Alpha to be the default node, the job runs OK.
    But then the VAX jobs don't run.
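    
    For reference, a minimal DCL sketch of how the stuck entry can be observed;
    the queue name, entry number and PID below are placeholders, not values
    from this site:
    
    $ SHOW QUEUE/BATCH/ALL_JOBS ALPHA_BATCH    ! list entries on the Alpha batch queue (placeholder name)
    $ SHOW ENTRY/FULL 999                      ! inspect the hung entry (placeholder entry number)
    $ SHOW PROC/CONT/ID=2040010A               ! watch the job process - it sits in LEF running SCHEDULER$DO_COMMAND.EXE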
    
    1. Is the above a supported configuration?
    2. If so, what should they do?
    
    Thanks,
    Galia Reznik,
    MCS, Israel.
1234.1. "Are all batch queues accessible ?" by RUMOR::FALEK (ex-TU58 King) Wed Apr 02 1997 19:53
    Are you talking here about "batch mode" jobs (i.e. using batch queues)?
    Or "detached" mode?
    
    If batch mode, make sure that the batch queues on both VAX and Alpha
    are properly visible to the scheduler. Are detached mode jobs (no batch
    queues) OK? 
    
    So far as I know, this is supposed to work.
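    
    A quick DCL sketch of these checks, assuming the SCHED verb used elsewhere
    in this topic:
    
    $ SHOW QUEUE/BATCH/FULL     ! on the VAX: all cluster batch queues and their execution nodes
    $ SHOW QUEUE/BATCH/FULL     ! on the Alpha: the same queues should be visible there too
    $ SCHED SHOW STATUS         ! the scheduler's own view of the cluster nodes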
1234.2. by TAV02::GALIA (Galia Reznik, Israel Software Support) Mon Apr 07 1997 09:06
    Hi,
    
    I am talking here about batch queues. They don't have detached-mode jobs.
    The queues are visible (in SCHEDULER) from VAX to Alpha and from Alpha
    to VAX. As I mentioned in .0, they CAN send a job to a queue, and it
    starts executing, but it never ends. It only ends when the Alpha is the
    default node.
    I am attaching part of the log file in DEBUG mode, which may help
    to trace the problem. This job, 3912, was sent from the VAX to an Alpha
    queue. Please note that in each place where the Alpha's name should
    appear, there is an empty space. Of course, the Alpha's nodename in VMS is
    defined OK - in SCS and in NCP. Where should it be defined in SCHEDULER
    when the node is not the default one?
    And even though the SCHEDULER claims the entry ended, it never ends in
    the queue; it hangs executing an image (pls see .0).
    
    Thanks,
    Galia.
    
    DEBUG log-file:
    ----------------
    
    we woke up!
    got mbx msg 'BS3912    '
    Job start message for job#  3912
    queue_job lock returned  1  lock id=1800728C
    CLUSTER_BROADCAST: node=      msg=B+                <----------
    told        to update count                         <----------
    got term mbx msg  'BE-785'
    job end message for pid FFFFFCEF
    CLUSTER_BROADCAST: node=      msg=B-                <----------
    told        to adjust count                         <----------
    job status of ended batch job is Q
    job end status= 196609
    DEQ_JOB_LOCK returned  1
    job #  3912  finished.... count=  1
     0  remote nodes care about job  3912
    10:18 AM processing record #  3912  status= S  request=
     Now= 2-Apr-1997 10:18:49.88  job_sched_time= 2-Apr-1997 14:18:45.81
    job  3912  is scheduled for the future
    10:18 AM updated   record #  3912  status= S   request=
    Found 0 local jobs depending on :: 3912
    timer flag was clear
    timer not expired. No earlier event to set.
    sleeping
1234.3. "a problem with logical SYS$NODE ?" by RUMOR::FALEK (ex-TU58 King) Tue Apr 08 1997 00:52
    Aha !  The nodename being missing is certainly related to this problem!
    The message telling the "default" scheduler that the job ended is
    getting lost.
    
    When NSCHED.EXE starts, it finds out its nodename by translating the
    logical name SYS$NODE.  If you do a $ sched show status
    on that cluster, do the nodenames appear in the display ? 
    
    Make sure that logical SYS$NODE is properly defined on both machines.
    There is probably a logical name (or else NSCHED wouldn't start) but it
    may have the wrong stuff in it.
    If this is DECnet phase V, make sure the logical has the phase 4 alias 
    (6 characters, maximum).
    
    Then stop and restart the schedulers on both nodes.
    Is the problem still occurring ?
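    
    A sketch of the SYS$NODE checks above; ALPHA1 is a placeholder nodename,
    and on a running system SYS$NODE is normally defined by the DECnet startup
    rather than by hand, so the DEFINE is shown only for completeness:
    
    $ SHOW LOGICAL SYS$NODE                             ! should translate to the Phase IV name, e.g. "ALPHA1::"
    $ DEFINE/SYSTEM/EXECUTIVE_MODE SYS$NODE "ALPHA1::"  ! only if the translation is wrong (placeholder name)
    $ SCHED STOP/ALL                                    ! then restart the schedulers with the site's startup procedure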
1234.4. by TAV02::GODOVNIK (Haim Godovnik) Thu Apr 10 1997 12:52

Hi,

I am stepping in for Galia as she is on vacation.

The scheduler starts after DECnet, and the logical name SYS$NODE is defined
correctly on all nodes.

They do not use DECnet Phase V. They have restarted the scheduler on all
nodes, but nothing changed. They also defined the scheduler object in NCP.

What else should we check? Did the scheduler database change between version
2.1A and 2.1B?

Thanks for your help,

Haim G.
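
For what it's worth, a sketch of how the NCP object definition can be checked;
the scheduler object's name and number depend on the installation, so look for
it in the output rather than assuming a particular entry:

$ MCR NCP SHOW KNOWN OBJECTS    ! volatile database - the scheduler's object should appear here
$ MCR NCP LIST KNOWN OBJECTS    ! permanent database - so it survives a DECnet restart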
1234.5. "one additional question" by RUMOR::FALEK (ex-TU58 King) Thu Apr 10 1997 20:51
    If you do a $ sched show status
    
    do you see a good display (i.e. all the nodenames are shown)?
1234.6. by TAV02::GODOVNIK (Haim Godovnik) Sun Apr 13 1997 06:36

Hi,


On the sched sho stat display he sees both the Alpha and VAX nodes. He added
another Alpha to the cluster, and between the Alphas everything works fine.
The problem occurs only between VAX and Alpha.

Thanks,

Haim G.
1234.7. "plan of attack" by RUMOR::FALEK (ex-TU58 King) Mon Apr 14 1997 22:33
    Ok, I'm running out of ideas... Let's summarize what we know.
    
    The problem occurs when the "default" scheduler is running on an Alpha,
    but the batch execution queue is on a VAX.
    
    The job actually runs and completes, but the scheduler system never 
    detects that fact - so it "thinks" it is still running.
    
    The debug info shows that the "batch end" (BE) message is being
    broadcast to a scheduler with a null node name. Actually, on
    reflection, I now think this might actually be normal, since the "default"
    scheduler hears all the messages, and is supposed to react to ones
    where no nodename is specified - no node means the "default". So we may
    have been barking up the wrong tree with the "no node name" thing.
    
    We know a "batch end" message is getting sent when the job completes...
    So the question is, does the default scheduler actually GET this
    message, and if so, what does it do with it ?
    
    To answer this question you could
    
    1. put all scheduler jobs that are likely to run accidentally during the
    experiment on hold.  Preferably, wait for all running jobs to finish.
    
    2. Stop all schedulers in the cluster  $ sched stop/all
    
    
    3. On a hardcopy terminal or a screen where you can watch the output,
    on the Alpha system, $ run nsched$:nsched.exe  It will print a lot of
    stuff as it reads through all the jobs and then it will print
    "Sleeping..."
    
    4. Start the scheduler on the VAX
    
    (The Alpha NSCHED will notice it started, you will see some output
    and then it will print "Sleeping..." again)
    
    5. Run a Scheduler batch-mode job on the VAX. You will see some stuff
    print on the Alpha when the job starts. Then the Alpha will print
    "Sleeping...".  When the job completes on the VAX, watch CAREFULLY what
    (if anything) the scheduler on the Alpha prints.
    
    If it doesn't print anything at all, then the BE message isn't being
    processed (valuable information).  If it does print something, then that
    will tell exactly what's going on - WHAT DOES IT SAY ?
    (A consolidated sketch of steps 1-5 follows below.)
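    
    A rough consolidation of steps 1-5 into commands, using only the SCHED verb
    and NSCHED$ logical that appear elsewhere in this topic; the VAX scheduler
    restart is left as a comment because the startup procedure is site-specific:
    
    $ SCHED STOP/ALL              ! 2. stop all schedulers in the cluster (after holding jobs as in step 1)
    $ RUN NSCHED$:NSCHED.EXE      ! 3. on the Alpha, on a watchable terminal - prints job info, then "Sleeping..."
    $ ! 4. on the VAX, restart the scheduler with the site's normal startup procedure
    $ ! 5. run a Scheduler batch-mode job on the VAX and watch what the Alpha NSCHED prints when it completes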
1234.8. by TAV02::GODOVNIK (Haim Godovnik) Tue Apr 15 1997 11:27

Hi,

Thank you for your help.

I have asked the customer to do the tests you described in .7.
After the job completes, he gets nothing on the screen after the last
"Sleeping..." message, which means that the BE message is not being processed.

He also tried DETACHED mode jobs and everything worked fine. The problem seems
to be only in BATCH mode.

Haim G. 
1234.9. "likely a bug that must be escalated" by RUMOR::FALEK (ex-TU58 King) Tue Apr 15 1997 18:07
    It's probably a bug then, and most likely not a known one, though I can't
    be sure. Batch jobs in heterogeneous VMSclusters are supposed to work! 
    
    You've already gathered information that shows approximately what step
    in the job processing mechanism is failing. I hoped it would be
    something simple, like a queue file or logical name problem that I could
    suggest a fix for. I suspect this might be a bug that requires a patch.
    Unfortunately, I'm not a member of product engineering. However, the
    information you've already gathered should be very useful to them. 
        
    You need to escalate this through official support channels.
    They need to search through their database to see if this is
    a known problem. They (the support org.) need to figure out
    exactly why this is broken at your site and supply a fix !
1234.10. by ZEKE::BURTON (Jim Burton, DTN 381-6470) Tue Apr 15 1997 19:51
If you need to know the official escalation channel in your area, please
contact Curtis Chase @OGO.

Jim
Scheduler Product Manager
1234.11. "Problem solved" by TAV02::GODOVNIK (Haim Godovnik) Thu Apr 17 1997 09:23

Hi,

Before escalating the problem, I asked the customer to upgrade the VAX
from 2.1A to 2.1B. After the installation everything works fine. I do not
know whether something related to this was fixed in 2.1B or the reinstall
simply solved it.

Thank you very much for all your help,


Haim Godovnik,


CSC Israel