Conference noted::dnu_osi

Title:DECnet/OSI for {ULTRIX,OSF/1}
Notice:Indicate version and platform when writing...see #2 for kits
Moderator:BULEAN::CARR
Created:Wed Sep 25 1991
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2187
Total number of notes:10469

2142.0. "dnascd node unreachable" by COMICS::HESS () Wed Mar 12 1997 11:05

Hello,
	I have a site where dlogin to a UNIX 3.2d-2 node with OSI 3.2a-0 fails
with a "node unreachable" error. We have noted that a dnascd core dump is
produced, and the daemon.log indicates a connect from a node for application
cterm under user dcs_sh; the process then exits with a segmentation fault.
	The site was recovering by restarting DECnet, but this no longer clears
the problem. On the last occurrence they had to kill off every DECnet process
and then restart DECnet.
	Is this a known issue, and if so what is the resolution? If it needs to
be escalated, what information is required?
     	
Many Thanks
Pete.
T.R    Title    User    Personal Name    Date    Lines
2142.1. "Xref" by VAXCPU::michaud (Jeff Michaud - ObjectBroker) Wed Mar 12 1997 13:47 (2 lines)
   441  GIDDAY::CHONG        17-DEC-1992     3  dnascd core dumps
  1140  WPOPTH::ZAMBOTTI     14-JUL-1994    10  dnascd core dumps on any remote login (conflict with ENHANCED security)
2142.2. "" by UPSAR::WALLACE (Digital: A Dilbertian Company) Wed Mar 12 1997 14:02 (4 lines)
    There have been several fixes for dnascd core dumps, which I believe
    are in DECnet V3.2B.  You should install 3.2B, or at least grab
    the dnascd executable from a 3.2B system.  --  Vince
    
2142.3. "" by COMICS::HESS () Wed Mar 12 1997 14:10 (8 lines)
    Hi,
    	Thanks for the replies. I did check the 3.2B release notes but did
    not see anything relevant, and the site has already indicated that they
    are not willing to upgrade on the off chance that it will fix the issue.
    However, as you indicate there are fixes in this area, I will attempt to
    persuade them otherwise.
    
    Pete
2142.4. "" by COMICS::HESS () Wed Mar 12 1997 14:49 (4 lines)
    Is the dnascd executable all that is needed, or is anything else
    required? I think they will try that before going to 3.2B, as it will
    take some time for them to upgrade due to change constraints.
    	Pete
2142.5. "" by UPSAR::WALLACE (Digital: A Dilbertian Company) Thu Mar 13 1997 16:36 (4 lines)
    I can't think of anything that changed from 3.2A -> 3.2B that
    would cause incompatibilities for dnascd.  You should be able
    to just upgrade that one executable.  --  Vince
    
2142.6. "" by COMICS::HESS () Wed Apr 23 1997 16:33 (8 lines)
    Well, the upgrade to 3.2B has been done, but the problem still
    remains. They have noticed, though, that when the problem occurs free
    memory is down to 0. Restarting DECnet enables logins again. I am
    awaiting more information but have advised them to reduce ubc-maxpercent
    from 100 to 60.
    	Any comments?
    Thanks
    Pete
2142.7. "" by COMICS::HESS () Tue May 13 1997 11:04 (11 lines)
    Hi,
    	With ubc-maxpercent now at 40 the problem is still there, although
    it is less frequent. The main issue is having to restart DECnet each
    time to clear the problem; as this is a production environment, this
    is a real issue for this customer. I will raise this as an IPMT. I have
    extracts from the daemon.log showing the segmentation faults. Is there
    anything else that would be useful to provide, e.g. will we need to
    run dnascd in debug?
    
    Thanks for any advice.
    Pete.
2142.8. "some other resource??" by KITCHE::schott (Eric R. Schott USG Product Management) Tue May 13 1997 12:55 (13 lines)
Hi

 I don't know what ubc-maxpercent would have to do with this. I suggest
you put it back. Are they running out of swap space, or some other
kernel resource?

Have you run sys_check on the system?


see

http://www-unix.zk3.dec.com/tuning/tools/sys_check/sys_check.html

2142.9. "" by DRAGNS::WALLACE () Tue May 13 1997 17:02 (21 lines)
    Hi,
    
    You really need to do some more problem isolation at the customer
    site.  This problem does not sound familiar, and we have done
    some fairly rigorous load tests on DECnet.
    
    Are you monitoring system resources, like memory and swap space?
    
    You say you restart DECnet.  I assume that means running
    decnetshutdown/decnetstartup.  What about less drastic
    measures, i.e. just cycling various parts of DECnet, e.g.
    OSI transport, routing, session control, etc.?
    
    Can you establish outgoing connections from the system ?
    
    Do incoming connections other than dlogin work (eg dcp) ?
    
    Is x25 being used on the system ?
    
    Vince
         
2142.10. "" by COMICS::HESS () Wed May 21 1997 13:50 (57 lines)
Thanks for the input; however, things have got worse. The POLYCENTER
performance monitor indicates that memory utilisation is around 70-80 percent
with no swapping. So far only shutting down DECnet (decnetshutdown) seems to
clear the problem. Trying to log in to the system from anywhere gives
"Login information invalid at remote node"
(we are no longer seeing "node unreachable").
dcp fails.
Outgoing works fine.
Incoming using an explicit user and password also fails.
X25 is not being used on this system.
	The dna processes are in the following states:
dnansd IW
dnalimd Iw
dnaevld I
dnaksd IW
dnascd S
dnanoded IW

	There are no core dumps produced. The error as seen from the remote
node is shown first, followed by an extract of the DECnet entries in the daemon log.
LNFSL2 >copy/log login.com lnxcm3"dcs_sh XXXXXXXXXX"::
%COPY-E-OPENOUT, error opening LNXCM3"dcs_sh password"::[]LOGIN.COM;3 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-F-INVLOGIN, login information invalid at remote node
%COPY-W-NOTCOPIED, SYS$SYSDEVICE:[OPS.OPS_HOPKINS]LOGIN.COM;3 not copied
LNFSL2 >set host lnxcm3
%SYSTEM-F-INVLOGIN, login information invalid at remote node
LNFSL2 >

Looking in the daemon.log, here are the last DECnet entries in it:

May 15 19:56:32 LNGIBX0010G fal[15904]: DIRECTORY access from LOCAL:.LNFSL2::uic=[0,0]LED_OPER, user=g1_copy, directory=/disk1/pickbackup, filename=/pickbackup/RC*.*
May 15 19:56:32 LNGIBX0010G fal[23332]: DIRECTORY access from LOCAL:.LNFSL2::uic=[0,0]LED_OPER, user=g1_copy, directory=/disk1/pickbackup, filename=/pickbackup/G1PROUT.DAT
May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 23332).
May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 15904).
May 15 19:56:32 LNGIBX0010G fal[27096]: DIRECTORY access from LOCAL:.LNFSL2::uic=[0,0]LED_OPER, user=g1_copy, directory=/disk1/pickbackup, filename=/pickbackup/G1CCNTRY.???
May 15 20:20:50 LNGIBX0010G netacl[28304]: permit host=hbsltw0002.btco.com/138.93.213.204 service=telnetd execute=/usr/sbin/telnetd
   
                        

	I am really at a loss as to what to check next. Any Ideas ?
Thanks
Pete
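
When sifting a daemon.log like the extract in this note, it can help to pull out only the dnascd entries. The following is a minimal sketch, not a supported tool: it assumes every record starts with a BSD-syslog style prefix ("Mon DD HH:MM:SS host daemon[pid]: message") as in the lines quoted above, and it skips any line that does not start that way. The names `parse_line`, `entries_for` and `Entry` are hypothetical helpers introduced for illustration.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed record layout, taken from the daemon.log lines quoted in this note:
#   "May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 23332)."
SYSLOG_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<daemon>[^\[\s:]+)\[(?P<pid>\d+)\]: "
    r"(?P<message>.*)$"
)


class Entry(NamedTuple):
    stamp: str
    host: str
    daemon: str
    pid: int
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Return an Entry for a line with a syslog prefix, else None."""
    m = SYSLOG_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return Entry(
        m.group("stamp"),
        m.group("host"),
        m.group("daemon"),
        int(m.group("pid")),
        m.group("message"),
    )


def entries_for(lines, daemon: str) -> List[Entry]:
    """Keep only the parsed entries logged by the named daemon."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.daemon == daemon]


# Sample lines copied from the extract above.
SAMPLE = [
    "May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 23332).",
    "May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 15904).",
    "May 15 20:20:50 LNGIBX0010G netacl[28304]: permit",
]

for entry in entries_for(SAMPLE, "dnascd"):
    print(entry.stamp, entry.pid, entry.message)
```

Passing an open file object instead of `SAMPLE` should work the same way, since the function only iterates over lines.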
2142.11. "" by DRAGNS::WALLACE () Wed May 21 1997 18:00 (35 lines)
    Use ncl to look at session:
      ncl> show session control all
      ncl> show session control appl fal all
    The counters might help indicate what the problem is.
    
    If that doesn't help you could try running dnascd with some
    debug options:
    
      1) Edit /etc/cml.conf and change the line
    		8       lim     /usr/sbin/dnascd
    	 to
    		8 /usr/sbin/dnascd
    
      2) Send a hangup signal to dnalimd, ie "kill -HUP #" where '#'
    	 is the pid of dnalimd
    
      3) Kill the current dnascd process
    
      4) Manually start dnascd.  You might first try
    	   /usr/sbin/dnascd -logfile /tmp/log
         and if nothing in the log file helps, kill dnascd again & try
    	   /usr/sbin/dnascd -debug -verbose
         This second version will print lots of messages to your terminal
    
      5) In either case, after manually starting dnascd you have to
         issue the necessary ncl commands to start session:
    	   ncl create session control
    	   ncl enable session control
    
    Hopefully either the logging or debug messages will give enough
    information to figure out why the connect is failing.
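
Step 1 above is a one-line edit to /etc/cml.conf. As an illustration only, the sketch below applies that same change to the text of the file without touching the file itself. It assumes nothing about the cml.conf format beyond the single line quoted in step 1; `drop_lim_field` is a hypothetical helper name, and any real edit should be made by hand on a backed-up copy.

```python
def drop_lim_field(conf_text: str, program: str = "/usr/sbin/dnascd") -> str:
    """Rewrite '<id> lim <program>' lines as '<id> <program>'.

    Only lines with exactly three whitespace-separated fields whose
    second field is 'lim' and whose third is `program` are changed;
    every other line is passed through untouched.
    """
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] == "lim" and fields[2] == program:
            line = f"{fields[0]} {program}"
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")


# The 'before' line is the one quoted in step 1.
before = "8       lim     /usr/sbin/dnascd\n"
print(drop_lim_field(before), end="")
```

Working on a string rather than the file keeps the sketch side-effect free, so the result can be compared with the original before anything is written back.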
    
    Vince
    
                                                     
2142.12. "" by COMICS::HESS () Fri May 23 1997 09:16 (14 lines)
Vince, thanks for the reply.
        The NCL counters did not reveal any problems. Issuing the hangup to
dnalimd caused dnascd to core dump; is this expected? dnascd is now started
with logging enabled, but nothing is being written to the log file.
Will this only log errors?
	After restarting dnascd there was no problem.
It seems the next step is to run dnascd in debug as you suggest. As the problem
seems to occur more at night (but not always), they will direct the output to a
file. Does this produce a lot of data?
	Also, this time they saw segmentation faults again, but these do not
always occur.

	 
Pete
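
One way to check the impression that the problem occurs more at night is to count matching daemon.log entries per hour of day. This is a rough sketch under stated assumptions: lines are assumed to start with a "Mon DD HH:MM:SS " timestamp as in the extract in 2142.10, and the text to search for is passed in as `needle` because the exact wording of the segmentation fault entries is not shown in this topic. `events_per_hour` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the hour from a leading "Mon DD HH:MM:SS " timestamp.
STAMP_RE = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} (\d{2}):\d{2}:\d{2} ")


def events_per_hour(lines, needle: str) -> Counter:
    """Count lines containing `needle`, keyed by hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        m = STAMP_RE.match(line)
        if m and needle in line:
            counts[int(m.group(1))] += 1
    return counts


# Sample lines copied from the daemon.log extract in 2142.10.
SAMPLE = [
    "May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 23332).",
    "May 15 19:56:32 LNGIBX0010G dnascd[21279]: Process exit (PID 15904).",
    "May 15 20:20:50 LNGIBX0010G netacl[28304]: permit",
]

for hour, count in sorted(events_per_hour(SAMPLE, "dnascd").items()):
    print(f"{hour:02d}:00  {count}")
```

With a real log, `needle` would be whatever text the segmentation fault entries actually contain.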
2142.13. "" by DRAGNS::WALLACE () Fri May 23 1997 17:36 (13 lines)
    Hi,
    
    Sorry, it looks like you also have to specify -debug & -verbose when
    you specify a log file.  It should be recording various operations
    that it performs in response to connect requests or other events.
    
    It shouldn't core dump when you send the HUP signal to dnalimd.
    
    For clarification, are you saying that the problem went away
    when you restarted dnascd?
    
    Vince
         
2142.14. "" by COMICS::HESS () Tue May 27 1997 12:20 (7 lines)
    Vince,
    Thanks. To confirm: yes, restarting dnascd resolved the issue
    temporarily. I guess we will have to make the logging permanent to
    capture this?
    
    Pete
    
2142.15. "" by DRAGNS::WALLACE () Tue May 27 1997 17:22 (13 lines)
    Hi,
    
    Well, it seems clear enough that dnascd is getting screwed up
    somehow.  The following invocation on my system produces output
    both to the log file and to the terminal from which I run dnascd:
    
    	./dnascd -debug -verbose -logfile /tmp/log
    
    
    BTW, did you open an IPMT case on this (CFS.51442) ?
    
    Vince
    
2142.16. "" by COMICS::HESS () Wed May 28 1997 08:18 (5 lines)
    Vince,
    	Yes, I have raised an IPMT; this is now getting critical for the
    customer. I will get the information you requested on the case.
    
    Pete