
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

1977.0. "Memory Channel Problems?" by ZPOVC::JUSTIN () Wed Apr 02 1997 05:28

Hello,

	We are experiencing some strange behaviour with our UNIX systems 
and would like to find out why.

	Our config (ALL NEW) consists of 3 A8400s (10 CPUs, 5/440, 4GB Mem, 
KFTIA with FDDI, PCI bus with 1 KZPSA and Memory Channel card). All 
three systems are connected via the KFTIA based FDDI controller to a 
GIGAswitch/FDDI. The 3 systems are also connected via memory channel 
to a MC-HUB. The KZPSAs are also connected between the systems, together 
with a pair of HSZ50s (dual redundant) with write-back cache (2X128). UNIX 
ver 4.0B. Latest dupatch kit applied. KZPSA IDs are 7, 6, 5 for the 
3 A8400s and 0, 1, 2, 3 for the HSZ50s. KZPSAs are on PCI slot 7, MCs on 
slot 0. The MC IDs are 1 (master) and 4, 6 for the clients. All systems 
use the FWD ISP controllers for the local wide disks, SCSI-2 for the RRD45
and TLZ09s. The master node uses another ISP bus for a TZ887 and TSZ07.
The only shared bus is the KZPSA.

	As the systems will be serving many third party applications
(which have hardware-linked licenses), ASE is not being used. One of
the systems acts as the NIS master which serves the two clients. This 
node also mounts the file systems via the HSZ50s and NFS serves them 
to the clients via the MC. Users access the systems via the FDDI.

	When installed using FDDI all worked as expected. However, after
installing the MC drivers (TCR140) MCA-UA, we mounted the NFS 
file systems via the MC path. Up to this point all worked fine. Now 
if we reboot any of the clients, the system hangs at the NFS mount with 
RPC timeouts. If we reboot the master, it hangs at the memory channel 
startup waiting for node 1 etc. The only way to bring up the systems 
is to shut down all three systems, power off the MC-HUB, boot the 
master node, then power on the MC-HUB followed by the two client nodes. 
Once up, everything works fine.

	We have also noted that, intermittently during boot up, there are
many CAM errors with the ISP controllers, similar to those reported
in note 5707 in the digital_unix conference.

	So are we doing anything wrong here? Is the memory channel
behaviour normal? Would appreciate any comments/ideas to resolve this
problem.

Regards

Justin 
    
T.R    Title    User    Personal Name    Date    Lines

1977.1. "Shared scsi not supported without ASE" by NETRIX::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Wed Apr 02 1997 18:14 (16 lines)

I think the root cause is when one node leaves your 'cluster' it
issues a SCSI bus reset, and because the ASE code is not installed
on the other nodes they don't know what to do about it.

The NFS hangs are probably due to IOs that will never complete
because of this.

Shared SCSI is not supported without the ASE product installed.

Shutting down everything of course clears the problem.

Why are the two client nodes on the shared scsi if they aren't
serving the data?

Getting them off the bus will prove or disprove my theory.
[Posted by WWW Notes gateway]

1977.2. "No-shared bus, same behaviour" by ZPOVC::JUSTIN () Thu Apr 03 1997 07:50 (14 lines)

    Hello,
    
    	I have tried what you suggested. However, the symptoms remain.
    We also had shared SCSI when using FDDI for the interconnect, and it did 
    not have these problems. However, I do see that having a shared bus 
    without ASE (which I understand prevents dual mounting?) is potentially
    unsafe.
    
    	I am in the process of getting a full TCR-UA license to test if it
    will still behave the same. 
    
    	Are there any other options that I can try? 
    
    Justin

1977.3. "more info" by ZPOVC::JUSTIN () Thu Apr 03 1997 10:02 (102 lines)

    
    
Hello,
	Here is some additional info. When the master system is booted 
(others shut down), this is what we get:

>>> boot
.
.
.
Dual TLEP at node 4
Dual TLEP at node 3
Dual TLEP at node 2
Dual TLEP at node 1
Dual TLEP at node 0
monitorBoot: doing it...
Cluster Memory Channel primary adaptor is online.
  Rev 14 adaptor is the primary channel (pci bus 1, slot 0)
  connected to virtual hub (VH1) as node 1.
dli: configured
clubase: configured
skipping test/delay for VH0/VH1 system
drd: configured.
dlmsl: configured
cnxagent: configured
dlm: configured.
memory channel thread init
checking for existing memory channel nodes
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
cam_logger: CAM_ERROR packet
cam_logger: bus 0 target 1 lun 0
ss_perform_timeout
timeout on disconnected request
cam_logger: CAM_ERROR packet
cam_logger: bus 0 target 1 lun 0
isp_termio_abort_bdr
Failed to abort specified IO - scheduling chip reinit
cam_logger: CAM_ERROR packet
cam_logger: bus 0
isp_reinit
Begining Adaptor/Chip reinitialization
cam_logger: CAM_ERROR packet
cam_logger: bus 0
isp_cam_bus_reset_tmo
SCSI Bus Reset performed
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
unresponsive mc nodes - waiting for node mask 1
crashing unresponsive node 0


	It then hangs here forever. If the memory channel hub is turned off,
then this is the boot up sequence:

.
.
.
Dual TLEP at node 4
Dual TLEP at node 3
Dual TLEP at node 2
Dual TLEP at node 1
Dual TLEP at node 0
monitorBoot: doing it...
Cluster Memory Channel primary adaptor is online.
  Rev 14 adaptor is the primary channel (pci bus 1, slot 0)
  connected to virtual hub (VH1) as node 1.
dli: configured
clubase: configured
skipping test/delay for VH0/VH1 system
drd: configured.
dlmsl: configured
cnxagent: configured
dlm: configured.
memory channel thread init
checking for existing memory channel nodes
booting as primary memory channel node on mc0
memory channel software inited - node 1 on mc0
ccomsub: configured
mcnet: configured
Starting secondary cpu 1
Starting secondary cpu 2
Starting secondary cpu 3
Starting secondary cpu 4
Starting secondary cpu 5
Starting secondary cpu 6
Starting secondary cpu 7
Starting secondary cpu 8
Starting secondary cpu 9
.
.
.
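
The two console logs above diverge right after "checking for existing memory channel nodes": the hung boot repeats "unresponsive mc nodes", while the good boot prints "booting as primary memory channel node". Below is a minimal Python sketch, assuming the console output has been captured to a text string, that tells the two cases apart. The function name and the summary strings are hypothetical; only the matched message texts come from the logs above.

```python
import re

# Message texts copied from the console logs in this note.
WAITING = re.compile(r"unresponsive mc nodes - waiting for node mask (\d+)")
PRIMARY = re.compile(r"booting as primary memory channel node on (\w+)")


def classify_mc_boot(console_text):
    """Summarise how Memory Channel startup went (hypothetical helper)."""
    waits = WAITING.findall(console_text)
    primary = PRIMARY.search(console_text)
    if primary:
        # The node got past the membership check and came up.
        return f"up as primary on {primary.group(1)}"
    if waits:
        # The node is still polling for other members.
        return f"stuck waiting ({len(waits)} retries, node mask {waits[-1]})"
    return "no memory channel startup messages found"
```

Feeding it the first log would report a stuck boot, and the second log would report the node coming up as primary on mc0.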


    

1977.4. "Bad MC jumper settings" by NETRIX::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Thu Apr 03 1997 11:57 (6 lines)

Ah! You are using a real hub, yet your MC board is jumpered
for virtual hub.  The MC board should have come with a manual
explaining how to change this.  If not, let me know and I'll
vector you to a web page that explains it.


1977.5. "pin 1-2 jumpered?" by ZPOVC::JUSTIN () Thu Apr 03 1997 12:04 (6 lines)

    Hi,
    	We've double checked it with the manual; the jumper is across
    pins 1 and 2 of the 3 pins. It was also the factory default. This is 
    the line card on the PCI bus that we are talking about, right?
    
    Justin

1977.6. "Bad board?" by NETRIX::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Thu Apr 03 1997 18:36 (7 lines)

According to my info, you are correct, so if Digital UNIX is
still reporting the MC board is in virtual hub mode I would
suspect a defective board.  It will never work till UNIX 
reports a STD (real hub) setting instead of VH0 or VH1.

Dave

1977.7. "BTW..." by NETRIX::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Thu Apr 03 1997 18:39 (5 lines)

...your printout says VH1, which is the 'no jumper installed' setting.
I really suspect a defective board or jumper, or a misinstalled jumper.

Dave
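
The check described in the last two replies (a board attached to a real hub must not report VH0 or VH1) can be sketched in Python. This is only an illustration: the function name, its return shape, and the `using_real_hub` flag are hypothetical. The regular expression matches the "connected to virtual hub (VH1) as node 1" line from the console output in 1977.3. The wording UNIX prints for the STD setting is not shown in this topic, so the sketch treats the absence of a virtual hub line as "unknown" rather than guessing.

```python
import re

# Matches the console line seen in 1977.3, e.g.
#   "connected to virtual hub (VH1) as node 1."
HUB_LINE = re.compile(r"connected to virtual hub \((VH\d)\) as node (\d+)")


def check_hub_mode(console_text, using_real_hub=True):
    """Return the reported virtual hub mode, or None if no VH line is present."""
    match = HUB_LINE.search(console_text)
    if match is None:
        # No virtual hub message found; the STD message format is not
        # known from this topic, so nothing more can be concluded here.
        return None
    return {
        "mode": match.group(1),
        "node": int(match.group(2)),
        # Any virtual hub mode is a mismatch when a physical hub is cabled.
        "mismatch": using_real_hub,
    }
```

A result with `"mismatch": True` points at the jumper or the board, as suggested above.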

1977.8. "bad jumper setting" by ZPOVC::JUSTIN () Fri Apr 04 1997 03:59 (10 lines)

    Hello,
    
    	Yes, the jumper on the line card on the master node was not
    inserted properly, hence virtual hub. Once it is properly inserted 
    everything works fine, including the NFS/NIS. We will disconnect the 
    clients from the shared FWD-SCSI bus for safety reasons.
    
    	Thanks Dave for your help.
    
    Justin

1977.9. "You're welcome." by NETRIX::"cherkus@buff.zk3.dec.com" (Dave Cherkus) Tue Apr 08 1997 12:53 (6 lines)

> Thanks Dave for your help.

No problem.  Glad things are working fine now.

Dave