Conference decwet::winnt-clusters

Title:	WinNT-Clusters
Notice:	Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator:	DECWET::CAPPELLOF

Created:	Thu Oct 19 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	863
Total number of notes:	3478

707.0. "Secondary server doesn't bring group online by itself" by CIVPR1::SIMMONS (Mike Simmons (301) 918-5597) Fri Mar 21 1997 12:00

We saw the following problem with Clusters v1.1 on Alpha 1000A.  We think it's
related to the installation of another product, but we need some help tracking
this down.  Here's what we see:

If both servers are shut down with the drives on the primary server, and only
the secondary is rebooted, the drives don't come online, and Cluster
Administrator doesn't open if you try.  All cluster services are started.
Eventually, the Cluster Failover Manager service stops.  If you restart it,
everything works normally.  It seems like a timing problem.

A small log file is produced:

Digital Clusters for Windows NT(TM) 1.1-1222 Release (Build 1222)
Digital Equipment Corporation
Windows NT(TM) is a trademark of Microsoft Corporation.
Cluster Failover Manager Trace File
Opened on cluster ISHCLU1 node ISHDHC2 at 3/20/97 4:29:54 PM
by program C:\Program Files\Digital\Cluster\fmcore.exe

03/20/1997 16:29:55.257 tid=109 trace started
16:29:55.445 tid=109 Cfmd Server is not ready, waiting...
16:30:25.875 tid=109 Cfmd Server is not ready, waiting...
16:30:56.210 tid=109 Cfmd Server is not ready, waiting...
16:31:26.273 tid=109 Cfmd Server is not ready, waiting...
16:31:56.328 tid=109 Cfmd Server is not ready, waiting...
16:32:26.640 tid=109 Cfmd Server is not ready, waiting...
16:32:56.718 tid=109 Error opening object FMType\FMDisk
16:32:56.820 tid=109 Cfmd server not running.
    File: E:\CluBuild\src\fm\fmlib\fmdbobj.c    Line: 1102
16:32:56.984 tid=109 TypeData package failed to initialize
16:32:57.109 tid=95  trace stopped
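
To put numbers on the "timing problem" guess, the waits in the trace above can
be measured. The following is only a rough sketch: the function names are
made up for illustration, the trace lines are copied from the log above, and
it assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Trace lines copied from the log above (raw string keeps the backslash literal).
TRACE = r"""
16:29:55.445 tid=109 Cfmd Server is not ready, waiting...
16:30:25.875 tid=109 Cfmd Server is not ready, waiting...
16:30:56.210 tid=109 Cfmd Server is not ready, waiting...
16:31:26.273 tid=109 Cfmd Server is not ready, waiting...
16:31:56.328 tid=109 Cfmd Server is not ready, waiting...
16:32:26.640 tid=109 Cfmd Server is not ready, waiting...
16:32:56.718 tid=109 Error opening object FMType\FMDisk
"""

def parse_trace(text):
    """Return (timestamp, message) pairs for lines that start with HH:MM:SS.mmm."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # timestamp, tid=..., rest of the message
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped trace line
        events.append((stamp, parts[2]))
    return events

def wait_summary(text):
    """Count the 'not ready' retries, the gaps between them, and the total wait."""
    events = parse_trace(text)
    waits = [t for t, msg in events if "not ready" in msg]
    gaps = [(b - a).total_seconds() for a, b in zip(waits, waits[1:])]
    # The first non-waiting event after the last retry marks where fmcore gave up.
    giveup = next(t for t, msg in events
                  if t > waits[-1] and "not ready" not in msg)
    total = (giveup - waits[0]).total_seconds()
    return len(waits), gaps, total

retries, gaps, total = wait_summary(TRACE)
print(f"{retries} retries, gaps {gaps}, gave up after {total:.1f} s")
```

On this trace, the retries come roughly 30 seconds apart and the Failover
Manager gives up about three minutes after the first one. Running the same
measurement on a trace from a working cluster would show how long Cfmd
normally takes to become ready.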

T.R	Title	User	Personal Name	Date	Lines
707.1	I am looking in to it..	LJSRV1::GOODMAN		`Fri Mar 21 1997 13:09`	5
	What is the other product that was installed? I'm in today, Mike. I was at a customer demo yesterday. Donna
707.2	OpenM	CIVPR1::cop-dhcp-2-98.cop.dec.com::WHITEDA		`Sun Mar 23 1997 18:15`	6
	The product is called OpenM, by Intersystems. It's basically a database used by the VA hospitals for their patient records. Dale M. White (the other half of Mike S.)
707.3		LJSRV1::GOODMAN		`Mon Mar 24 1997 12:41`	3
	Try stopping the CFMD and Cluster Failover Manager services before you install the application. I have seen instances where installing or upgrading an application after the cluster software is installed causes some of the service .exe files to get deleted.
707.4	May have to do with too many trust relationships	CIVPR1::SIMMONS	Mike Simmons (301) 918-5597	`Tue Apr 01 1997 16:19`	21
	> Try stopping the CFMD and Cluster failover manager before you install the
	> application. I have seen instances where installing or upgrading and
	> application after cluster software is installed causes some of the
	> service .exe files get deleted.

	I don't think the problem is due to a deleted *.EXE file, because (1) I see the same behavior after re-installing Clusters and (2) everything seems to work after restarting the "Cluster Failover Manager" service (which stops by itself). As I stated initially, it looks like a timing problem.

	I don't think it's an interaction with any other software on the system, because we've installed the same configuration at several sites. At some sites we see this problem; at others we don't. What is different is the domain environment.

	I suspect that it has something to do with domain and/or server browsing taking a long time. We also have a product called UltraBac that checks for clients to back up when the application starts. Typically, this takes 10-15 seconds. At the sites having the problem, it takes ~4 minutes. I also noticed that User Manager for Domains takes longer than usual to open.

	The sites that see this behavior have several trusted and trusting domains across the country, as well as a rather large distributed WINS database. The other sites, which do not see this behavior, are using an isolated domain.
707.5	More of the Same...	CIVPR1::WHITEDA		`Wed Apr 09 1997 19:36`	31
	Donna (and others),

	It seems that this problem is hard to reproduce as well. We have installed about 10 sites so far, and 5 of them have this problem. We have tried to reproduce the problem in our lab (where we have identical equipment) with no success. We have also found that the problem starts as soon as Clusters is installed.

	The problem is the same: if a server comes online while its partner server is powered off, it goes into the loop Mike mentioned in the first note. Even if the only thing in the failover group is the disks, the server will not pick them up. On a working cluster, it gives the waiting message, but then continues with the system loadup.

	Also, at our Muskogee, OK site we found that logging into NT as soon as the prompt appeared allowed the first server up to pick up the disks and services. If the systems were booted and just left to sit for a few minutes, neither system would pick up the disks (and services).

	Could this possibly be due to bad terminators on the SCSI cables? This is about the only thing I can guess, but we still can't reproduce it.

	Is there a way to keep Clusters from searching for the other server and have it start the loadup automatically?

	TIA
	Dale M. White
	Systems Integrations VA Team
	Whiteda@mail.dec.com
707.6	Isolating private network prevents problem	CIVPR1::SIMMONS	Mike Simmons (301) 918-5597	`Tue May 06 1997 12:06`	19
	We've installed about 30 sites so far, and we've seen this problem at about half of them. All the sites' installations are the same. What is different is the existing domain and network configuration.

	On a fairly recent install, we had seen the problem with the BDC and PDC on the existing network. We moved the BDC to the same hub as the cluster servers, put it in the same subnet, and removed it from the rest of the network. When we did this, the problem went away.

	We finally were able to reproduce the problem in our lab. We have the WINS client on both network cards, running TCP/IP on the same subnet, both connected to the public network. We started to see the problem after I noticed that the WINS binding order under Server was the opposite of that under Workstation (one was 1,2; the other was 2,1). When I made the order 1,2 for both, I started seeing the problem.

	After consistently seeing the problem, I could avoid it by preventing the two network cards from seeing each other, in other words, by isolating the private network. I could do this either by physically disconnecting the 2nd hub from the public network, or by putting the 2nd cards on a separate subnet and not providing a gateway.

	We are now in the process of putting NetBEUI only on the 2nd (private) cards and TCP/IP + LAT on the 1st cards. We need to make this change on the next wave of sites to verify that this universally fixes the problem.
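
	For anyone checking their own sites, the "both cards on the same subnet" condition described in 707.6 can be tested from the address and mask of each card. This is only a sketch using Python's standard ipaddress module: the function name and every address below are invented for illustration and are not taken from any of the sites in this thread.

```python
import ipaddress

def same_subnet(addr_a, mask_a, addr_b, mask_b):
    """Return True if the two adapter configurations share (or overlap) a subnet."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{mask_a}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{mask_b}").network
    # overlaps() also catches the case where the masks differ but one
    # network contains the other.
    return net_a.overlaps(net_b)

# Hypothetical addresses: public card and private card on the same subnet,
# which is the layout that showed the problem in the lab.
shared = same_subnet("10.1.1.11", "255.255.255.0",
                     "10.1.1.12", "255.255.255.0")

# Hypothetical addresses: private card moved to its own subnet (no gateway),
# which is one of the two workarounds described above.
isolated = same_subnet("10.1.1.11", "255.255.255.0",
                       "192.168.100.1", "255.255.255.0")

print("same subnet:", shared, "/ after isolating:", isolated)
```

	This only checks the IP layout. It says nothing about the WINS binding order or about whether the two hubs are physically connected, both of which also mattered in the lab.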