
Conference pamsrc::decmessageq

Title:	NAS Message Queuing Bus
Notice:	KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator:	PAMSRC::MARCUSEN

Created:	Wed Feb 27 1991
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2898
Total number of notes:	12363

2837.0. "link listener exiting without apparent cause" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Thu Apr 03 1997 19:59

    My customer has encountered another link driver problem in their
    production system that forced them to shut down and restart five
    backend servers. I need to find out whether this is a known problem,
    and would welcome any hints as to the cause.

    Questions:
    
    1.  Any ideas as to the cause of the problem described below?
    
    2.  Can the link listener be restarted from the command line without
        bouncing the group?  If so, what should be done with the existing
        senders and receivers?

    Scenario:

    5 back end server groups on several machines.  These are NOT the link
    initiators.  Cross-group verify is on.

    About 20-30 client groups on the same number of machines.  These
    client groups are the link initiators.  Some of these groups are not
    supposed to link with the back ends, but are configured to do so
    anyway.

    There are several other server groups that are interconnected with the
    5 server groups of interest; these do not appear to be a factor here.

    At the same time in all five back end groups, almost simultaneously two
    xgroup connection requests come in from groups not in the xgroup table.
    There are multiple messages in the log, including "link listener is
    exiting".  After that point, new connections no longer work.  However,
    there are a few more messages in the log file after the "is exiting"
    message from the link listener pid, all with the same time stamp.
    
    For roughly two hours up to the link listener exit, two valid remote
    groups who had lost their connections (one was shut down, the other just
    lost its connection) could no longer connect with the five groups.

    The groups continue to run; the "link listener is exiting" messages
    were not caused by group shutdowns.

    I've gotten copies of the back end log files (I asked for an extract of
    a half-hour window around the event), and the back end group.init
    files, in whos01""::sympt1.txt;2.  The extract of the log for one of
    the groups is appended below.

    The backend groups are running on Solaris.

T.R	Title	User	Personal Name	Date	Lines
2837.1	forgot to tack this on the end....	WHOS01::ELKIND	Steve Elkind, Digital SI @WHO	`Thu Apr 03 1997 20:26`	84
	Oops - forgot to add it on the end of my note----

	******************* ***************** group710.log *****************
	******** dmqld (20997) 02-APR-1997 08:34:22 ******** ld, link receiver for group 710 has lost connection to group 1159
	******** dmqld (20997) 02-APR-1997 08:34:22 ******** ld, link receiver for group 710 from group 1159 is exiting
	******** dmqld (26040) 02-APR-1997 08:34:28 ******** ld, link sender for group 710 has lost connection to group 1159
	******** dmqld (26040) 02-APR-1997 08:34:28 ******** ld, link sender for group 710 to group 1159 is exiting
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 from group 1105 is running
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, Remote node az07ae6s not found in local address data base
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 from group 1165 is running
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, operation failed to complete
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link listener for group 710 is exiting
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 is connected to group 1105
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 from group 1155 is running
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, Remote node tx14ie6s not found in local address data base
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 is connected to group 1165
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 is connected to group 1155
	******** dmqld (6232) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1105 is running
	******** dmqld (6232) 02-APR-1997 09:51:08 ******** ld, operation failed to complete
	******** dmqld (6232) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1105 is exiting
	******** dmqld (6233) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1165 is running
	******** dmqld (6233) 02-APR-1997 09:51:08 ******** ld, operation failed to complete
	******** dmqld (6233) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1165 is exiting
	******** dmqld (6234) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1155 is running
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 has lost connection to group 1105
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 from group 1105 is exiting
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 has lost connection to group 1165
	******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 from group 1165 is exiting
	******** dmqld (6234) 02-APR-1997 09:52:08 ********** ld, link sender for group 710 is connected to group 1155
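For anyone who wants to scan a log like the one above for the listener exit, and for what the same pid logged afterwards, here is a minimal Python sketch. It assumes every entry has the header shape seen in this excerpt (a run of asterisks, process name, pid in parentheses, date and time, another run of asterisks) followed by the message text. The names `LogEntry`, `parse_log`, `is_listener_exit` and `same_pid_after_exit` are made up for illustration and are not part of the product.

```python
import re
from collections import namedtuple

# Hypothetical record type for one log entry; not a product data structure.
LogEntry = namedtuple("LogEntry", "process pid timestamp message")

# Header shape as it appears in the excerpt, e.g.
#   ******** dmqld (20997) 02-APR-1997 08:34:22 ********
# Assumption: every entry in the file follows this shape.
HEADER = re.compile(
    r"\*{8,}\s+(?P<process>\w+)\s+\((?P<pid>\d+)\)\s+"
    r"(?P<timestamp>\d{2}-[A-Z]{3}-\d{4}\s+\d{2}:\d{2}:\d{2})\s+\*{8,}"
)


def parse_log(text):
    """Split raw log text into entries; a message runs up to the next header."""
    matches = list(HEADER.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[m.end():end].split())
        timestamp = " ".join(m["timestamp"].split())
        entries.append(LogEntry(m["process"], int(m["pid"]), timestamp, message))
    return entries


def is_listener_exit(entry):
    """True for messages such as 'ld, link listener for group 710 is exiting'."""
    return "link listener" in entry.message and entry.message.endswith("is exiting")


def same_pid_after_exit(entries):
    """Return the first listener-exit entry and later entries from the same pid."""
    for i, entry in enumerate(entries):
        if is_listener_exit(entry):
            return entry, [e for e in entries[i + 1:] if e.pid == entry.pid]
    return None, []


# Four entries copied from the group710.log excerpt.
SAMPLE = (
    "******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, operation failed to complete "
    "******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link listener for group 710 is exiting "
    "******** dmqld (20997) 02-APR-1997 09:51:08 ******** ld, link receiver for group 710 is connected to group 1105 "
    "******** dmqld (6232) 02-APR-1997 09:51:08 ******** ld, link sender for group 710 to group 1105 is running"
)

if __name__ == "__main__":
    exit_entry, later = same_pid_after_exit(parse_log(SAMPLE))
    if exit_entry is not None:
        print("listener exit:", exit_entry.pid, exit_entry.timestamp)
        for e in later:
            print("  later from same pid:", e.message)
```

The message is taken as everything up to the next header, so the sketch works whether each entry sits on one line or the header and message are on separate lines.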
2837.2		XHOST::SJZ	Kick Butt In Your Face Messaging !	`Thu Apr 03 1997 20:45`	9
	I noticed you strategically left out the version number. Given the format of the log entries it is V3.2 or earlier. Regardless of what version they are running I would suggest upgrading to V3.2A-1 and seeing if the problem goes away.

	_sjz.
2837.3	sorry, the info was accidentally chopped off	WHOS01::ELKIND	Steve Elkind, Digital SI @WHO	`Fri Apr 04 1997 03:01`	31
	Actually, leaving out the version number was accidental, not a stratagem. An earlier draft stated that the back end is using v3.0c running on Solaris, the front ends a mixture of 3.0c on HP-UX 9.04 and 3.2A-eco1 on HP-UX v10.20.

	They cannot upgrade the backends for another 3 months or so, as their software is built with libraries based on 3.0B; the version based on 3.2x has just entered development (and will deploy in time for us to tell them "upgrade to v4.0" 8^{ ). It will be at least a three month development/system test/acceptance test/integration test cycle before it is allowed into production - possibly as much as six months.

	The front ends are in the middle of upgrading to v3.2A-eco1, as the front end clients are built with the client library and so somewhat divorced from the queueing engine version (most of the front end applications are still built with v3.0B, and my guess is that they will continue to be so until at least the fall).

	I gather from another source that we cannot start up the link listener from the command line with v3.x, so we have no workaround. The customer would like to get some idea of the cause to see if there is anything he can do to avoid this event happening again (or perhaps to detect it before it hits him again). He is doing some limited testing using both 3.2A-eco1 and 3.0C on his test machines to see if he can re-create the problem on either one (and is also checking for symptoms of a memory leak in the listener), but I suspect that he may not be able to duplicate the conditions in that environment.

	The customer is not looking for a fix, they know they won't get one, all they want is any information that may be available.
2837.4		XHOST::SJZ	Kick Butt In Your Face Messaging !	`Fri Apr 04 1997 03:21`	12
	it isn't clear from the description or the logs what is happening. and we have never had such a report in the past. if it's reproducible then we have something to go on, but it is not.

	as for starting up the link listener on its own the answer is no. we have special code that explicitly prevents that.

	_sjz.
2837.5	possible explanation?	WHOS01::ELKIND	Steve Elkind, Digital SI @WHO	`Fri Apr 04 1997 15:35`	13
	The customer has found in testing that his link listener process size grows at about 4-5 blocks per hour when it is hit repeatedly with invalid cross-group connect requests from multiple sources. The customer's current theory is that this is a long-term memory leak problem, which can be solved by getting the invalid remote groups "off the air". Neither I nor the customer I work for buys this fully (the group had been up for only about 4 days), but we will live with this explanation for a while.

	They haven't started their testing of 3.2A yet, other than to discover that if they kill -9 the link listener all cross-group communication stops (with 3.0C, currently running receivers and senders continue to work).
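The growth figure quoted above can be checked against periodic size readings with a least-squares slope. The following is only a sketch: it assumes the readings have already been collected as (hours elapsed, size in blocks) pairs by whatever tool is in use, and the function names and the sample numbers are hypothetical, not the customer's data.

```python
def growth_per_hour(samples):
    """Least-squares slope of size (blocks) against time (hours).

    samples: list of (hours_elapsed, size_in_blocks) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    return cov / var_t


def projected_growth(rate_per_hour, hours):
    """Total growth in blocks if the rate stays constant for the given hours."""
    return rate_per_hour * hours


# Hypothetical readings, for illustration only.
EXAMPLE_SAMPLES = [(0, 1000), (1, 1004), (2, 1009), (3, 1013)]

if __name__ == "__main__":
    rate = growth_per_hour(EXAMPLE_SAMPLES)
    print("estimated growth (blocks/hour):", rate)
    print("projected growth over 4 days (blocks):", projected_growth(rate, 4 * 24))
```

At the reported 4-5 blocks per hour, four days of uptime (96 hours) works out to roughly 384-480 blocks of growth, which is the figure to weigh against the process limits when judging the leak theory.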
2837.6		XHOST::SJZ	Kick Butt In Your Face Messaging !	`Fri Apr 04 1997 18:24`	11
	The memory leak you describe is a known problem with that version. Upgrade to V3.2A-1.

	>other than to discover that if they kill -9 the link listener all
	>cross-group communication stops.

	no duh. please tell this customer to refrain from using our product. they are soiling it.

	_sjz.
2837.7	thank you for the information	WHOS01::ELKIND	Steve Elkind, Digital SI @WHO	`Sat Apr 05 1997 02:55`	17
	>>other than to discover that if they kill -9 the link listener all
	>>cross-group communication stops.
	>
	>no duh. please tell this customer to refrain from using our product.
	>they are soiling it.

	Actually, they quite reasonably wanted to know how 3.2A would react (versus 3.0) if it were to lose the link listener - obviously, not as well.

	Then again, if it is a memory leak that caused the problem, then they need not worry as much about losing the link listener. I'll pass that on to them - thanks.

	Maybe they are soiling the product, but at least they're buying it in large quantities (RICH philistines!), and like it well enough to stake their core business operations on its reliability. They just want to squeeze out that last erg of reliability (and they're a pain in my neck too at times).
2837.8	yeah right - reasonable	XHOST::SJZ	Kick Butt In Your Face Messaging !	`Sun Apr 06 1997 03:51`	21
	>they quite reasonably wanted to know how 3.2A would react
	>if it were to lose the link listener - obviously not as
	>well.

	actually it works better. in the V3.0 derivative they are using if you lose your link listener the system is left in some weird quasi state where non-determinism runs rampant and the product is anything but reliable. the later versions detect the component failure and try to shutdown the subsystem associated with that component. this provides a deterministic behavior with which people can work.

	maybe they should find out how their operating system will react when they do a kill -9 on the init process. make sure they log in as root then have them execute the following command.

	# kill -9 1

	_sjz.