
Conference pamsrc::decmessageq

Title: NAS Message Queuing Bus
Notice: KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator: PAMSRC::MARCUSEN
Created: Wed Feb 27 1991
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2898
Total number of notes: 12363

2837.0. "link listener exiting without apparent cause" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Thu Apr 03 1997 19:59

    My customer has encountered another link driver problem in their
    production system, one that forced them to shut down and restart five
    backend servers.  I need to find out whether this is a known problem,
    and get any hints as to the cause.

    Questions:
    
    1.  Any ideas as to the cause of the problem described below?
    
    2.  Can the link listener be restarted from the command line without
        bouncing the group?  If so, what should be done with the existing
        senders and receivers?

    Scenario:

    5 back end server groups on several machines.  These are NOT the link
    initiators.  Cross-group verify is on.

    About 20-30 client groups on the same number of machines.  These
    client groups are the link initiators.  Some of these groups are not
    supposed to link with the back ends, but are configured to do so
    anyway.

    There are several other server groups that are interconnected with the
    5 server groups of interest; these do not appear to be a factor here.

    At the same time in all five back end groups, almost simultaneously two
    xgroup connection requests come in from groups not in the xgroup table.
    There are multiple messages in the log, including "link listener is
    exiting".  After that point, new connections no longer work.  However,
    there are a few more messages in the log file after the "is exiting"
    message from the link listener pid, all with the same time stamp.
    
    For roughly two hours up to the link listener exit, two valid remote
    groups that had lost their connections (one was shut down, the other
    just lost its connection) could no longer connect with the five groups.

    The groups continue to run; the "link listener is exiting" messages
    were not caused by group shutdowns.

    I've gotten copies of the back end log files (I asked for an extract of
    a half-hour window around the event), and the back end group.init
    files, in whos01""::sympt1.txt;2.  The extract of the log for one of
    the groups is appended below.

    The backend groups are running on Solaris

2837.1. "forgot to tack this on the end...." by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Thu Apr 03 1997 20:26 (84 lines)

Oops - forgot to add it on the end of my note----

*********************
*********************	group710.log
*********************

************ dmqld (20997) 02-APR-1997 08:34:22 ************
ld, link receiver for group 710 has lost connection to group 1159

************ dmqld (20997) 02-APR-1997 08:34:22 ************
ld, link receiver for group 710 from group 1159 is exiting

************ dmqld (26040) 02-APR-1997 08:34:28 ************
ld, link sender for group 710 has lost connection to group 1159

************ dmqld (26040) 02-APR-1997 08:34:28 ************
ld, link sender for group 710 to group 1159 is exiting

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1105 is running

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, Remote node az07ae6s not found in local address data base

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1165 is running

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, operation failed to complete

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link listener for group 710 is exiting

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1105

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1155 is running

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, Remote node tx14ie6s not found in local address data base

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1165

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 is connected to group 1155

************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1105 is running

************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, operation failed to complete

************ dmqld (6232) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1105 is exiting

************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1165 is running

************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, operation failed to complete

************ dmqld (6233) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1165 is exiting

************ dmqld (6234) 02-APR-1997 09:51:08 ************
ld, link sender for group 710 to group 1155 is running

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 has lost connection to group 1105

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1105 is exiting

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 has lost connection to group 1165

************ dmqld (20997) 02-APR-1997 09:51:08 ************
ld, link receiver for group 710 from group 1165 is exiting

************ dmqld (6234) 02-APR-1997 09:52:08 ************
ld, link sender for group 710 is connected to group 1155
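
For anyone wanting to spot this event automatically, here is a minimal Python sketch that scans a log in the format shown above and reports every "link listener ... is exiting" entry. It assumes each entry is a starred header line (program name, pid in parentheses, timestamp) followed by one message line, which is all this extract shows; the function names are made up for illustration and are not part of the product.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the log (parsed by hand so the
# result does not depend on the process locale).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

# Header format assumed from the extract, e.g.
# ************ dmqld (20997) 02-APR-1997 09:51:08 ************
HEADER = re.compile(
    r"^\*+\s+(\S+)\s+\((\d+)\)\s+"
    r"(\d{2})-([A-Z]{3})-(\d{4})\s+(\d{2}):(\d{2}):(\d{2})\s+\*+\s*$"
)
LISTENER_EXIT = re.compile(r"link listener for group (\d+) is exiting")


def parse_entries(lines):
    """Yield (timestamp, pid, message) for each header/message pair."""
    pending = None  # (timestamp, pid) of a header still waiting for its message
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            _prog, pid, day, mon, year, hh, mm, ss = match.groups()
            stamp = datetime(int(year), MONTHS[mon], int(day),
                             int(hh), int(mm), int(ss))
            pending = (stamp, int(pid))
        elif pending and line:
            yield pending[0], pending[1], line
            pending = None


def listener_exits(lines):
    """Return (timestamp, pid, group) for every link listener exit message."""
    found = []
    for stamp, pid, message in parse_entries(lines):
        match = LISTENER_EXIT.search(message)
        if match:
            found.append((stamp, pid, int(match.group(1))))
    return found
```

Passing an open log file (or any list of lines) to `listener_exits` returns one tuple per exit message, which a cron job or monitoring script could turn into an alert. Banner lines such as the `group710.log` separator are skipped because they do not match the header pattern.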


2837.2. by XHOST::SJZ (Kick Butt In Your Face Messaging !) Thu Apr 03 1997 20:45 (9 lines)
    
    I noticed you strategically left out the version number.
    Given the format of the log entries it is V3.2 or earlier.
    
    Regardless of what version they are running I would suggest
    upgrading to V3.2A-1 and see if the problem goes away.
    
    _sjz.

2837.3. "sorry, the info was accidentally chopped off" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Fri Apr 04 1997 03:01 (31 lines)

    Actually, leaving out the version number was accidental, not a
    stratagem.  An earlier draft stated that the back end is using v3.0c
    running on Solaris, the front ends a mixture of 3.0c on HP-UX 9.04 and
    3.2A-eco1 on HP-UX v10.20.
    
    They cannot upgrade the back ends for another 3 months or so, as their
    software is built with libraries based on 3.0B; the version based on
    3.2x has just entered development (and will deploy in time for us to
    tell them "upgrade to v4.0" 8^{   ).  It will be at least a three month
    development/system test/acceptance test/integration test cycle before
    it is allowed into production - possibly as much as six months.
    
    The front ends are in the middle of upgrading to v3.2A-eco1, as the
    front end clients are built with the client library and so somewhat
    divorced from the queueing engine version (most of the front end
    applications are still built with v3.0B, and will continue to be so
    until at least the fall is my guess).
    
    I gather from another source that we cannot start up the link listener
    from the command line with v3.x, so we have no workaround.  The
    customer would like to get some idea of the cause to see if there is
    anything he can do to avoid this event happening again (or perhaps to
    detect it before it hits him again).  He is doing some limited testing
    using both 3.2A-eco1 and 3.0C on his test machines to see if he can
    re-create the problem (and is also checking for symptoms of a memory
    leak in the listener) on either one, but I suspect that he may not be
    able to duplicate the conditions in that environment.
    
    The customer is not looking for a fix; they know they won't get one.
    All they want is any information that may be available.

2837.4. by XHOST::SJZ (Kick Butt In Your Face Messaging !) Fri Apr 04 1997 03:21 (12 lines)
    
    it isn't clear from the description or the logs what is
    happening.  and we have never had such a report in the
    past.  if it's reproducible then we have something to
    go on, but it is not.
    
    as for starting up the link listener on its own the answer
    is no.  we have special code that explicitly prevents that.
    
    _sjz.
    

2837.5. "possible explanation?" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Fri Apr 04 1997 15:35 (13 lines)

    The customer has found in testing that his link listener process size
    grows at about 4-5 blocks per hour, when being hit repeatedly with
    invalid cross-group connect requests from multiple sources.  The
    customer's current theory is that this is a long-term memory leak
    problem, which can be solved by getting the invalid remote groups "off
    the air".  Neither I nor the customer I work for fully buys this
    (the group had been up for only about 4 days), but we will live with
    this explanation for a while.
    
    They haven't started their testing of 3.2A yet, other than to discover
    that if they kill -9 the link listener all cross-group communication
    stops (with 3.0C, currently running receivers and senders continue to
    work).
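
To put a number on a leak like this, the growth rate can be estimated from a handful of process-size samples. The Python sketch below is only an illustration: it fits a least-squares line through (elapsed seconds, size) pairs and projects how long until a chosen size limit is reached. How the samples are collected is an assumption left to the reader (for example, periodically recording the output of something like `ps -o vsz= -p <pid>`), and the function names and the limit are hypothetical.

```python
def growth_per_hour(samples):
    """Least-squares slope of size versus time, in size units per hour.

    samples is a list of (seconds_since_first_sample, size) pairs; the size
    unit (blocks, KB, pages) is whatever the sampling tool reports.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must be taken at different times")
    cov = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    return cov / var_t * 3600.0


def hours_until(limit, current_size, rate_per_hour):
    """Hours until size reaches limit at the given rate, or None if not growing."""
    if rate_per_hour <= 0:
        return None
    return max(0.0, (limit - current_size) / rate_per_hour)
```

For scale: at the observed 4-5 blocks per hour, a listener that has been up for about 4 days (96 hours) would have grown by roughly 380-480 blocks, which is the figure to weigh against whatever limit the process is suspected of hitting.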

2837.6. by XHOST::SJZ (Kick Butt In Your Face Messaging !) Fri Apr 04 1997 18:24 (11 lines)
    
    The memory leak you describe is a known problem with that
    version.  Upgrade to V3.2A-1.
    
    >other than to discover that if they kill -9 the link listener all
    >cross-group communication stops.
    
    no duh.  please tell this customer to refrain from using our
    product.  they are soiling it.
    
    _sjz.

2837.7. "thank you for the information" by WHOS01::ELKIND (Steve Elkind, Digital SI @WHO) Sat Apr 05 1997 02:55 (17 lines)

    >>other than to discover that if they kill -9 the link listener all
    >>cross-group communication stops.
    >
    >no duh.  please tell this customer to refrain from using our
    >product.  they are soiling it.
    
    Actually, they quite reasonably wanted to know how 3.2A would react
    (versus 3.0) if it were to lose the link listener - obviously, not as
    well.  Then again, if it is a memory leak that caused the problem, then
    they need not worry as much about losing the link listener.  I'll pass
    that on to them - thanks.
    
    Maybe they are soiling the product, but at least they're buying it in
    large quantities (RICH philistines!), and like it well enough to stake
    their core business operations on its reliability.  They just want to
    squeeze out that last erg of reliability (and they're a pain in my neck
    too at times).

2837.8. "yeah right - reasonable" by XHOST::SJZ (Kick Butt In Your Face Messaging !) Sun Apr 06 1997 03:51 (21 lines)
    
    >they quite reasonably wanted to  know how 3.2A would react
    >if it were to lose the link  listener - obviously  not  as
    >well.
    
    actually it works better.  in the V3.0 derivative they are
    using, if you lose your link listener the system is left in
    some weird quasi state where non-determinism runs rampant
    and the product is anything but reliable.  the later versions
    detect the component failure and try to shut down the
    subsystem associated with that component.  this provides a
    deterministic behavior with which people can work.
    
    maybe they should find out how their operating  system will
    react when they do a kill -9 on the init process. make sure
    they log in as root then have them  execute  the  following
    command.
    
    # kill -9 1
    
    _sjz.