
Conference ssdevo::hsj40_product

Title: HSJ30/40 Product Conference
Moderator: SSDEVO::EDMONDS
Created: Tue Jul 13 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1264
Total number of notes: 4958

1245.0. "instance code=01cc3002/0122330a ?" by MANM01::NOELGESMUNDO () Mon May 05 1997 06:46

    
    
    
    
    
    
    
 Hello!
The other day, we installed a new HSJ40C in a separate BA350-MB in Smart's 
SW800. The controller is connected to 5 BA356 bays; the last cable is left 
unconnected. The controller has 32 MB of cache but no battery, so we set it to 
read cache only pending the arrival and installation of the battery. We 
installed the third member of each of the 3 shadow sets, plus 2 TZ87 tape 
drives, on one of the bays. 

On the early morning of 29 April, we installed the said controller and upgraded 
the firmware of all the disks to the latest version. On the evening of 30 April 
(20:04), the customer called to report that the shadow members residing on 
the second HSJ40 controller had suddenly become 'online' instead of the usual 
'member of DSAx'. The tape drives, though, were still available, and a 
standalone disk on the same controller remained 'mounted'. They manually 
remounted the members into their shadow sets and no more errors were 
encountered.

This morning, I investigated the problem and found the following information:

1. OPERATOR.LOG

%%%%%%%%%%% OPCOM 30-apr-1997 20:04:33.77 %%%%%%%%%%
Message from user INTERnet on SMART
TELNET Login Request from remote Host: 31.0.1.2    Port:1066

%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.40 %%%%%%%%%%
DSA3: shadow set has changed state.
Mount verification in progress

%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.40 %%%%%%%%%%
DSA1: shadow set has changed state.
Mount verification in progress

%%%%%%%%%%% OPCOM 30-apr-1997 20:04:51.43 %%%%%%%%%%
DSA2: shadow set has changed state.
Mount verification in progress

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA12:  (HSJ410) has been removed from shadow set.

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA9:  (HSJ410) has been removed from shadow set.

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.54 %%%%%%%%%%
$6$DUA6:  (HSJ410) has been removed from shadow set.

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA3:

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA1:

%%%%%%%%%%% OPCOM 30-apr-1997 20:05:11.59 %%%%%%%%%%
Mount verification has completed for device DSA2:

2. ERROR LOG:

*************************ENTRY 36292. **************************
error sequence 2165.				logged on: cpu_type 00000005
date/time: 30-apr-1997 20:04:51.39			   SYS_TYPE 0000000C
SYSTEM UPTIME : 8 DAYS 14:39:09
SCS NODE: SMART					OPENVMS AXP 6.2
HW_MODEL : 0000044E HARDWARE MODEL = 1102

ERL$LOGMESSAGE ALPHASERVER 8400 5/300
CIXCD SUB-SYSTEM _SMART$PNA0:
	PORT HAS CLOSED VIRTUAL CIRCUIT

 	LOCAL STATION ADDRESS,   6(X)
	LOCAL SYSTEM ID,       408(X)

	REMOTE STATION ADDRESS,  4(X)
	REMOTE SYSTEM ID, 4200100400122(X)
	.
	.
	.



*************************ENTRY 36296. **************************
error sequence 2169.				logged on: cpu_type 00000005
date/time: 30-apr-1997 20:05:21.73			   SYS_TYPE 0000000C
SYSTEM UPTIME : 8 DAYS 14:39:09
SCS NODE: SMART					OPENVMS AXP 6.2
HW_MODEL : 0000044E HARDWARE MODEL = 1102

ERL$LOGMESSAGE ALPHASERVER 8400 5/300

	MESSAGE TYPE	0000B		DATAGRAM FOR NON-EXISTING "UCB"
	CLASS DRIVER	4B534944	/DISK/
	.
	.				UNIQUE IDENTIFIER, 964401993(X)
					MASS STORAGE CONTROLLER
					MODEL = 40


					SEQUENCE #11
					CONTROLLER LOG
					NON-ERROR/INFORMATIONAL EVENT
					CONTROLLER ERROR
					DEVICE INTERFACE HW ERROR

					CONTROLLER VERSION #39
					CONTROLLER HARDWARE VERSION #11



Above message appears several times after this entry.

3. DECEVENT:

***************************** ENTRY 36292 ***********************

TIMESTAMP			30-APR-1997 20:04:51
SYSTEM UPTIME IN SECONDS	743949
FLAGS	X0001			DYNAMIC RECOGNITION PRESENT

----DEVICE PROFILE----
PRODUCT NAME			CIMNA XMI TO CI PORT
UNIT NAME			SMART$PNA
UNIT NUMBER			0
DEVICE CLASS			CONTROLLER


***************************** ENTRY 36293 ************************

 				SOFTWARE PARAMETERS
				HSX01 MSCP VIRTUAL DISK
				HSJ410$DUA
UNIT NUMBER			12.

UCB$X_STS	X08001010	ONLINE
				UNLOAD AT DISMOUNT
				UNIT SUPPORTS THE EXTENDED FUNCTION BIT

***************************** ENTRY 36294 ************************

 				SOFTWARE PARAMETERS
				HSX01 MSCP VIRTUAL DISK
				HSJ410$DUA
UNIT NUMBER			9.

UCB$X_STS	X08001010	ONLINE
				UNLOAD AT DISMOUNT
				UNIT SUPPORTS THE EXTENDED FUNCTION BIT

***************************** ENTRY 36295 ************************

 				SOFTWARE PARAMETERS
				HSX01 MSCP VIRTUAL DISK
				HSJ410$DUA
UNIT NUMBER			6.

UCB$X_STS	X08001010	ONLINE
				UNLOAD AT DISMOUNT
				UNIT SUPPORTS THE EXTENDED FUNCTION BIT

***************************** ENTRY 36296 ************************

				LOGGED MSCP MESSAGE
				FM DEVICE CLASS NOT DEFINED
				NO UNIT IN DATAGRAM MESSAGE
LOGGED MESSAGE FORMAT	0	CONTROLLER ERROR
MSCP FLAGS		X02	INFORMATIONAL
MSCP EVENT CODE		X016A	MAJOR EVENT = CONTROLLER ERROR
				SUB-EVENT = DRIVE INTERFACE HARDWARE ERROR
INSTANCE CODE	X03F40064	DEVICE SERVICES HAD TO RESET THE PORT TO
				CLEAR A BAD CONDITION. NOTE THAT IN THIS
				INSTANCE THE ASSOCIATED TARGET, ASSOCIATED
				ASC, AND ASSOCIATED ASCQ FIELDS ARE UNDEFINED
				COMPONENT ID = DEVICE SERVICES
				EVENT NUMBER = X000000F4
				REPAIR ACTION = X000000
				NR THRESHOLD = X 00000064
TEMPLATE		X41	DEVICE NON-TRANSFER ERROR

****************************ENTRY 36297 ***************************

INSTANCE CODE	X01010302	AN UNRECOVERABLE HARDWARE DETECTED FAULT 
				OCCURRED
				COMPONENT ID = EXECUTIVE SERVICES
				EVENT NUMBER = X0000001
				REPAIR ACTION = X000003
				NR THRESHOLD = X 0000002
TEMPLATE		X01	LAST FAILURE EVENT
LAST FAILURE CODE X018B2580	COMPONENT ID = EXECUTIVE SERVICES
				EVENT NUMBER = X0000008B
				REPAIR ACTION = X000025
				FLAG = 1, HARDWARE DETECTED FAULT.
				RESTART CODE = FULL FIRMWARE RESTART
				PARAMETER COUNT = 0.

				AN NMI INTERRUPT WAS GENERATED WITH AN
				INDICATION THAT A MEMORY SYSTEM PROBLEM
				OCCURRED.

**************************** ENTRY 36298 **********************************

INSTANCE CODE	X01CC3002	THE CACHE10 DRAB DETECTED A WRITE DATA
				PARITY ERROR DURING A HOST PORT ATTEMPT
				COMPONENT ID = EXECUTIVE SERVICES
				EVENT NUMBER = X000000CC
				REPAIR ACTION = X0000030
				NR THRESHOLD = X 0000002

**************************** ENTRY 36299 **********************************

INSTANCE CODE	X0122330A	AN ERROR CONDITION DETECTED BY ONE 
				COMPONENT ID = EXECUTIVE SERVICES
				EVENT NUMBER = X00000022
				REPAIR ACTION = X0000032
				NR THRESHOLD = X 000000A
TEMPLATE		X14	MEMORY SYSTEM FAILURE
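
For anyone reading the instance codes in the DECevent entries above, the four sub-fields printed under each code appear to be simply its four bytes, taken from most significant to least significant. The sketch below shows that split. The byte layout is inferred only from the listing in this note, not from HSJ documentation, and the names `InstanceCode` and `decode_instance_code` are made up for illustration.

```python
from typing import NamedTuple


class InstanceCode(NamedTuple):
    """Four one-byte fields, in the order DECevent prints them (assumed layout)."""
    component_id: int
    event_number: int
    repair_action: int
    nr_threshold: int


def decode_instance_code(code: int) -> InstanceCode:
    """Split a 32-bit instance code into bytes, most significant first."""
    return InstanceCode(
        component_id=(code >> 24) & 0xFF,
        event_number=(code >> 16) & 0xFF,
        repair_action=(code >> 8) & 0xFF,
        nr_threshold=code & 0xFF,
    )


# Instance codes copied from entries 36296, 36297 and 36298 above.
for code in (0x03F40064, 0x01010302, 0x01CC3002):
    f = decode_instance_code(code)
    print(
        f"X{code:08X}: component={f.component_id:02X} "
        f"event={f.event_number:02X} repair={f.repair_action:02X} "
        f"threshold={f.nr_threshold:02X}"
    )
```

In the listing, component 01 is shown as EXECUTIVE SERVICES and 03 as DEVICE SERVICES. Applied to X0122330A, the same split gives a third byte of 33, whereas entry 36299 shows REPAIR ACTION = X0000032, so one of the two values may have been mistyped when the log was copied.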


We surfed the web on COMET and found some suggestions:
1 - replace cache module
2 - replace controller
3 - upgrade to 16-port star coupler

I could not see the logic for upgrading to a 16-port coupler, but it seemed to 
have worked for the similar problems reported there. The thing is, only one of 
the controllers (the new one without a battery) is having problems. The other 
controller seems okay.

Any help would be appreciated.

Thanks.

    Noel Gesmundo
    MCS/Digital Equipment Filipinas Inc.



1245.1. "16 node should fix things for you" by SSDEVO::RMCLEAN () Mon May 05 1997 13:54 (3 lines)
The 16 node coupler does usually help.  It's all black magic!  It seems that
the different coupler changes the characteristics of the CI load and makes
things work better/different.

1245.2. "Shadow Member Timeout too low to "ride through"" by MSE1::BURKE () Mon May 05 1997 15:00 (10 lines)
    Hi,
    
    As Ron stated, the incidence of this problem may well go down or go
    away with a 16 node coupler. However, the loss of shadow members is
    likely to have been due to the setting of the SYSGEN parameter
    SHADOW_MBR_TMO. From your console entries it looks as though this is
    set to 20 seconds. For Shadowing to "ride through" events like this,
    the recommended setting when HSJs are used is 120 seconds.
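
    The 20-second figure can be checked against the OPERATOR.LOG excerpt in
    .0 by subtracting the time mount verification started from the time the
    members were removed. The following is only a minimal sketch of that
    arithmetic; the helper name `opcom_time` is made up for illustration, and
    the month-name parsing assumes an English locale.

```python
from datetime import datetime

# OPCOM timestamp layout as it appears in note 1245.0,
# e.g. "30-apr-1997 20:04:51.40".
FMT = "%d-%b-%Y %H:%M:%S.%f"


def opcom_time(stamp: str) -> datetime:
    """Parse one OPCOM timestamp (hypothetical helper for this sketch)."""
    return datetime.strptime(stamp, FMT)


# "Mount verification in progress" for DSA1: and DSA3:
mount_verify_start = opcom_time("30-apr-1997 20:04:51.40")
# "$6$DUA12: (HSJ410) has been removed from shadow set."
member_removed = opcom_time("30-apr-1997 20:05:11.54")

elapsed = (member_removed - mount_verify_start).total_seconds()
print(f"members dropped {elapsed:.2f} s after mount verification began")
```

    The gap comes out at just over 20 seconds, which is consistent with a
    20-second member timeout expiring before the controller came back.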
    
     
1245.3. "any connection to battery?" by MANM01::NOELGESMUNDO () Tue May 06 1997 09:34 (9 lines)
    Hi!
    
    Thank you for the input. What about the fact that the HSJ40 has no
    battery? Could this cause the memory failure and affect the shadow
    members? Would a battery have prevented this problem?
    
    Thanks again.
    
    Noel

1245.4. "Nope" by SSDEVO::RMCLEAN () Tue May 06 1997 12:31 (3 lines)
Battery will make no difference in this case.  Battery only allows you to 
have writeback which will improve performance.  Setting the timeout value
will do more than anything for you.