
Conference decwet::winnt-clusters

Title: WinNT-Clusters
Notice: Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator: DECWET::CAPPELLOF
Created: Thu Oct 19 1995
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 863
Total number of notes: 3478

707.0. "Secondary server doesn't bring group online by itself" by CIVPR1::SIMMONS (Mike Simmons (301) 918-5597) Fri Mar 21 1997 12:00

We saw the following problem with Clusters v1.1 on Alpha 1000A.  We think it's
related to the installation of another product, but we need some help tracking
this down.  Here's what we see:

If both servers are shut down with the drives on the primary server, and only
the secondary is rebooted, the drives don't come online, and Cluster
Administrator doesn't open if you try to start it.  All cluster services are
started.  Eventually, the Cluster Failover Manager service stops.  If you
restart it, everything works normally.  It seems like a timing problem.

A small log file is produced:

Digital Clusters for Windows NT(TM) 1.1-1222 Release (Build 1222)
Digital Equipment Corporation
Windows NT(TM) is a trademark of Microsoft Corporation.
Cluster Failover Manager Trace File
Opened on cluster ISHCLU1 node ISHDHC2 at 3/20/97 4:29:54 PM
by program C:\Program Files\Digital\Cluster\fmcore.exe

03/20/1997 16:29:55.257 tid=109 trace started
16:29:55.445 tid=109 Cfmd Server is not ready, waiting...
16:30:25.875 tid=109 Cfmd Server is not ready, waiting...
16:30:56.210 tid=109 Cfmd Server is not ready, waiting...
16:31:26.273 tid=109 Cfmd Server is not ready, waiting...
16:31:56.328 tid=109 Cfmd Server is not ready, waiting...
16:32:26.640 tid=109 Cfmd Server is not ready, waiting...
16:32:56.718 tid=109 Error opening object FMType\FMDisk
16:32:56.820 tid=109 Cfmd server not running.
    File: E:\CluBuild\src\fm\fmlib\fmdbobj.c    Line: 1102
16:32:56.984 tid=109 TypeData package failed to initialize
16:32:57.109 tid=95  trace stopped
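
The retry pattern in the trace above can be tallied with a short script. This is only a sketch for reading the log, not anything shipped with the cluster software: the function names are hypothetical, the timestamped lines are copied from the trace, and the un-timestamped "File:" line is left out. It counts the "not ready" messages and measures how long the Failover Manager waited before reporting the error.

```python
import re
from datetime import datetime

# Timestamped lines copied from the trace file in this note.
TRACE = """
03/20/1997 16:29:55.257 tid=109 trace started
16:29:55.445 tid=109 Cfmd Server is not ready, waiting...
16:30:25.875 tid=109 Cfmd Server is not ready, waiting...
16:30:56.210 tid=109 Cfmd Server is not ready, waiting...
16:31:26.273 tid=109 Cfmd Server is not ready, waiting...
16:31:56.328 tid=109 Cfmd Server is not ready, waiting...
16:32:26.640 tid=109 Cfmd Server is not ready, waiting...
16:32:56.718 tid=109 Error opening object FMType\\FMDisk
16:32:56.820 tid=109 Cfmd server not running.
16:32:56.984 tid=109 TypeData package failed to initialize
16:32:57.109 tid=95  trace stopped
"""

# Matches "HH:MM:SS.mmm tid=N message"; a leading date, if present, is ignored.
LINE_RE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) tid=(\d+)\s+(.*)")


def parse_trace(text):
    """Return a list of (time, thread id, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, int(match.group(2)), match.group(3).strip()))
    return events


def retry_summary(events):
    """Count the 'not ready' waits, the gaps between them, and the total wait."""
    waits = [t for t, _, msg in events if "not ready" in msg]
    failure = next(t for t, _, msg in events if msg.startswith("Error"))
    gaps = [(b - a).total_seconds() for a, b in zip(waits, waits[1:])]
    total = (failure - waits[0]).total_seconds()
    return len(waits), gaps, total


events = parse_trace(TRACE)
count, gaps, total = retry_summary(events)
print(f"{count} waits, about {sum(gaps) / len(gaps):.1f} s apart")
print(f"{total:.1f} s from first wait to the error")
```

Read this way, the trace shows six waits roughly 30 seconds apart, with the error logged about three minutes after the first wait.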

T.R   Title   User   Personal Name   Date   Lines

707.1. "I am looking in to it.." by LJSRV1::GOODMAN Fri Mar 21 1997 13:09 (5 lines)

What is the other product installed?  I'm in today, Mike.
I was at a customer demo yesterday.


Donna

707.2. "OpenM" by CIVPR1::cop-dhcp-2-98.cop.dec.com::WHITEDA Sun Mar 23 1997 18:15 (6 lines)

The product is called OpenM, by InterSystems. It's basically a database used
by the VA hospitals for their patient records.


Dale M. White (the other half of Mike S.)

707.3. "" by LJSRV1::GOODMAN Mon Mar 24 1997 12:41 (3 lines)

Try stopping the CFMD and Cluster Failover Manager services before you install
the application. I have seen instances where installing or upgrading an
application after the cluster software is installed causes some of the
service .exe files to get deleted.

707.4. "May have to do with too many trust relationships" by CIVPR1::SIMMONS (Mike Simmons (301) 918-5597) Tue Apr 01 1997 16:19 (21 lines)

> Try stopping the CFMD and Cluster Failover Manager services before you
> install the application. I have seen instances where installing or upgrading
> an application after the cluster software is installed causes some of the
> service .exe files to get deleted.

I don't think the problem is due to a deleted *.EXE file, because (1) I see the
same behavior after re-installing Clusters and (2) everything seems to work
after restarting the "Cluster Failover Manager" (which stops by itself).
As I stated initially, it looks like a timing problem.

I don't think it's an interaction with any other software on the system, because
we've installed the same configuration at several sites.  At some sites we see
this problem; at others we don't.  What is different is the domain environment.
I suspect that it has something to do with domain and/or server browsing taking
a long time.  We also have a product called UltraBac that checks for clients to
back up when the application starts.  Typically, this takes 10-15 seconds.  At
the sites having the problem, this takes ~4 minutes.  Also, I noticed that
User Manager for Domains takes longer than usual.  The sites that see this
behavior have several trusted and trusting domains across the country, as well
as a rather large distributed WINS database.  Other sites, not seeing this
behavior, are using an isolated domain.

707.5. "More of the Same..." by CIVPR1::WHITEDA Wed Apr 09 1997 19:36 (31 lines)

    Donna (and others),

    It seems that this problem is hard to reproduce as well. We have
    installed about 10 sites thus far, and 5 of these sites have this
    problem. We have tried to reproduce the problem in our lab (where we
    have identical equipment) with no success.

    Also, we have found that this problem starts as soon as Clusters is
    installed. The problem is the same: if a server comes online while
    its partner server is powered off, it goes into the loop Mike
    mentioned in the first note.

    Even if the only thing in the failover group is the disks, the server
    will not pick them up. On a working cluster, it gives the waiting message,
    but then continues with the system loadup.

    Also, at our Muskogee, OK site we found that when we logged into NT as
    soon as the prompt appeared, the first server up picked up the disks and
    services. If the systems were booted and just left to sit for a few
    minutes, neither system would pick up the disks (and services).

    Could this possibly be due to bad terminators on the SCSI cables?
    This is about the only thing I can guess, but still we can't reproduce
    it. Is there a way to keep Clusters from searching for the other server
    and have it start the loadup automatically?
    
    TIA
    
    Dale M. White
    Systems Integrations VA Team
    Whiteda@mail.dec.com                          

707.6. "Isolating private network prevents problem" by CIVPR1::SIMMONS (Mike Simmons (301) 918-5597) Tue May 06 1997 12:06 (19 lines)

We've installed about 30 sites so far, and we've seen this problem at about half
of them.  All the sites' installations are the same.  What is different is the
existing domain and network configuration.

On a fairly recent install, we had seen the problem with the BDC and PDC on the
existing network.  We moved the BDC to the same hub as the cluster servers, put
it in the same subnet, and removed it from the rest of the network.  When we did
this, the problem went away.

We were finally able to reproduce the problem in our lab.  We have the WINS
client on both network cards, running TCP/IP on the same subnet, both connected
to the public network.  I noticed that the WINS binding order under Server was
the opposite of that for Workstation (one was 1,2; the other was 2,1).  When I
made the order 1,2 for both, I started seeing the problem.

Once I was seeing the problem consistently, I could avoid it by preventing the
two network cards from seeing each other, in other words, by isolating the
private network.  I could do this either by physically disconnecting the 2nd hub
from the public network, or by putting the 2nd cards on a separate subnet and
not providing a gateway.

We are now in the process of putting NetBEUI only on the 2nd (private) cards and
TCP/IP + LAT on the 1st cards.  We need to make this change on the next wave of
sites to verify that it universally fixes the problem.
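
The isolation condition described in this note (2nd cards on a separate subnet, with no gateway) can be written down as a quick check. This is a minimal sketch using Python's standard ipaddress module; the function names and all addresses are hypothetical and are not taken from any of the sites.

```python
import ipaddress


def same_subnet(iface_a, iface_b):
    """True if two 'address/prefix' strings fall in the same IP network."""
    a = ipaddress.ip_interface(iface_a)
    b = ipaddress.ip_interface(iface_b)
    return a.network == b.network


def private_link_isolated(public_if, private_if, private_gateway=None):
    """True if the private card is on its own subnet and has no gateway."""
    return not same_subnet(public_if, private_if) and private_gateway is None


# Hypothetical addressing: both cards on the public subnet (the failing setup).
print(private_link_isolated("10.1.1.10/24", "10.1.1.11/24"))

# Hypothetical addressing: 2nd card on a separate subnet with no gateway.
print(private_link_isolated("10.1.1.10/24", "192.168.100.1/24"))
```

The first call prints False and the second prints True. The check only covers the TCP/IP addressing; it says nothing about the WINS binding order or about whether the two hubs are physically connected.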