
Conference azur::mcc

Title: DECmcc user notes file. Does not replace IPMT.
Notice: Use IPMT for problems. Newsletter location in note 6187
Moderator: TAEC::BEROUD
Created: Mon Aug 21 1989
Last Modified: Wed Jun 04 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 6497
Total number of notes: 27359

6165.0. "IP Poller Value Questions" by CUJO::BROWN (Dave Brown) Thu Nov 10 1994 16:51

    
    	I've been working with a customer on the IP Poller and we have come
    up with some questions that we cannot find the answers to:
    
    	1) Is there a way to see what polling values are currently in use?
    
    	2) When one runs the poller enable procedure and does not specify
    	   values, the variables are written to the action file with no
    	   values. If the poller reads an action file with no values
    	   associated with the variables, what effect will this have on the
    	   poller? 
    
    		- None?
    
    		- Reset the polling values to default?
    		
    	3) The polling values we have come up with that minimize false
    	   alarms of IP non-reachability are:
    
    		Interval - 90
    		Retry    -  5
    		Timeout	 - 15
    
    	   Are these reasonable? Are there recommended values for polling 
    	   300+ systems?
                            
    
    
    	Thanks,
    
    	Dave
T.R    Title    User    Personal Name    Date    Lines
6165.1. "IP Poller Value Questions" by TAEC::IRIBE Mon Nov 14 1994 13:19 (34 lines)
    	1) Is there a way to see what polling values are currently in use?

>>>>>> No, there is no way to display the polling values currently in use.
    
    	2) When one runs the poller enable procedure and does not specify
    	   values, the variables are written to the action file with no
    	   values. If the poller reads an action file with no values
    	   associated with the variables, what effect will this have on the
    	   poller? 
    
    		- None?
    
    		- Reset the polling values to default?

>>>>> You are right: the polling values are reset to their defaults.
>>>>> But you can modify these values: re-enable the poller with new values.
    		
    	3) The polling values we have come up with that minimize false
    	   alarms of IP non-reachability are:
    
    		Interval - 90
    		Retry    -  5
    		Timeout	 - 15
    
    	   Are these reasonable? Are there recommended values for polling 
    	   300+ systems?
                            
>>>> The following explains how to tune the polling period according to the
>>>> polling values:

>>>> Polling_period_maxi = Retry * Timeout * number_of_systems
>>>> With your values (5 * 15s * 300) we would need a polling period equal to
>>>> 22500s, i.e. 6 hours and 15 minutes.
>>>> A better choice would be, for instance:
>>>>         Retries=2 * Timeout=5s * machines=300 = polling period = 3000s (50 min).

Ciao, JMI.
6165.2. "Customer HAS to poll < every 2 minutes" by CUJO::BROWN (Dave Brown) Mon Nov 14 1994 17:16 (29 lines)
    
    
    	Given the formula:
    
    	(Quantity of systems)*(Retries)*(Timeout) = Interval
    
    	I see no way to be able to poll each system < every 2 minutes. It
    is very time-critical for the customer to know if a system goes down
    and they want to know about it no later than 2 minutes after the event.
    The poller values I mentioned in the base note seem to be working OK.
    
    	Question 1 - What are the implications of sticking with the values:
    
    		Interval - 90
    		Retry    -  5
    		Timeout  - 15
    
    	Question 2 - Is there a better way to check the IP reachability of
    300+ systems every < 2 minutes? The customer bought MCC and a fleet of
    VAXstation 4000 model 90s with the intent of performing IP Polling and
    they are not going to like it if they are told that they cannot get
    their < 2 minute resolution.
    
        Any ideas would be greatly appreciated.
    
    	Thank You,
    
    	Dave
    	
6165.3. "Help!" by CUJO::BROWN (Dave Brown) Wed Nov 16 1994 14:19 (22 lines)
    
    
    	DEC faces a major embarrassment over the poller issue as documented
    in this note. The customer has had it with MCC and is seriously
    considering going over to HP Openview. I would implore those who read
    this note to consider methods by which we can truly poll ~300 systems
    every 2 minutes. Otherwise, we may have to throw in the towel.
    
    	Additionally, the customer is complaining that the MCC poller is
    slow in reporting an IP unreachability. This makes sense seeing as how
    we have the polling interval set to 90 seconds and, according to my
    understanding, the IP Poller will only poll 50 machines per polling
    interval. So with a 90 second polling interval, every machine gets
    polled once every 6*90 seconds = 540 seconds = 9 minutes. Am I correct?
    When we let the polling interval default to 30 and let the retry and
    timeout values go to their defaults, the IP Poller continuously reports
    false IP reachability problems.
    
    	Any help/advice would be greatly appreciated.
    
    
    	Dave
6165.4. "Poller Performance" by CUJO::BROWN (Dave Brown) Wed Nov 16 1994 16:28 (24 lines)
    
    
        The customer's ~300 polled machines are members of two domains, one a
    child of the other. Given that situation, the customer has asked more
    questions regarding the poller:
    
    	1) When the poller is enabled/disabled for the parent domain, is
    	   the poller also enabled/disabled for the child domain? 
    
    	2) Is it possible to have two poller processes going at the same
    	   time, one enabled for each domain? The reason this would be
    	   considered is to share the load and improve the MCC response
    	   time should a node become unreachable. Currently, they are
    	   getting notification up to 6+ minutes after a node goes down.
    
    	3) What would be the effect if multiple polling domains were set up
    	   and the poller/pollers were individually enabled for each
    	   domain? Will the poller poll 50 systems per domain per polling 
    	   interval or just 50 per poller process per polling interval?
    
    
    	Thank You,                                                  
    
    	Dave  
6165.5. (untitled) by TAEC::IRIBE Fri Nov 18 1994 14:40 (48 lines)
    	1) When the poller is enabled/disabled for the parent domain, is
    	   the poller also enabled/disabled for the child domain? 

>>> Yes you are right.

    
    	2) Is it possible to have two poller processes going at the same
    	   time, one enabled for each domain? The reason this would be
    	   considered is to share the load and improve the MCC response
    	   time should a node become unreachable. Currently, they are
    	   getting notification up to 6+ minutes after a node goes down.

>>>> No, it is not possible to have 2 pollers.
    
    	3) What would be the effect if multiple polling domains were set up
    	   and the poller/pollers were individually enabled for each
    	   domain? Will the poller poll 50 systems per domain per polling 
    	   interval or just 50 per poller process per polling interval?



>>>> You talk about 50 machines polled per polling period. In fact we
can poll ((nb_machines)/50 * timeout * retry) per polling period.

I would like to add that I have run tests (with the OSF version). Here in
Valbonne I have ~360 machines, and I set the polling period to 18s, with 1 retry
and a 1s timeout. Everything works correctly, with no inconsistent IP
reachability reports.
I tried to ping some machines in the US; it took ~300ms. Suppose the poller
takes 2*300ms for the most distant location (an ICMP ping takes more time
than an IP ping).
I don't think you will have any problem polling your 300 machines.

So you can define the following parameters:
polling period = 120s
retry = 1
timeout = 2s



Another idea (only if there is no other solution, because I don't think it
would be very clean):

Maybe you can write a little program that polls some IP machines (using the
ping command) and, if a machine is unreachable, sends an event to the
collection_am with mcc_evc_send.


JMI
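
The fallback idea in .5 (a small program that pings machines and raises an
event for each unreachable one) might be sketched roughly as follows. This is
only an illustration in Python, not MCC code: ping_once, is_reachable, sweep
and the on_unreachable callback are hypothetical names, the "ping -c 1" flags
assume a Unix-style ping, and the callback merely marks the place where a real
program would call mcc_evc_send, whose interface is not shown in this topic.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ping_once(host, timeout_s=2):
    """Return True if a single ping to host succeeds (Unix-style flags assumed)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # enforce the timeout ourselves; ping flags vary
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def is_reachable(host, retries=1, probe=ping_once):
    """Probe once plus up to `retries` more times; any answer counts as reachable."""
    return any(probe(host) for _ in range(retries + 1))


def sweep(hosts, retries=1, probe=ping_once, on_unreachable=print, workers=50):
    """Probe all hosts in parallel and report each unreachable one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda h: is_reachable(h, retries, probe), hosts))
    down = [host for host, ok in zip(hosts, results) if not ok]
    for host in down:
        on_unreachable(host)  # hypothetical hook: a real tool would send the event here
    return down
```

Passing a fake probe, for example sweep(hosts, probe=lambda h: h != "b"),
exercises the logic without touching the network.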
6165.6. "How many nodes does the Poller really poll?" by CUJO::BROWN (Dave Brown) Mon Nov 21 1994 16:28 (35 lines)

	Thank you for the response to my previous questions. There is still 
something I don't understand. How many systems in the domain get polled per
polling interval? Do all of them get polled per polling interval or does
only a subset get polled?

	The following extract from .5 suggests that only a portion get polled:

>>>> You talk about 50 machines polled per polling period. In fact we
>>>> can poll ((nb_machines)/50 * timeout * retry) per polling period.
	
	If we have 300 machines, and timeout was 2 and retry was 1, the 
quantity of machines polled each polling interval would be:

	300/50*2*1 = 12

	Does this mean that only 12 machines get polled per polling interval?!
If this is so and the polling interval is 120 and our total machines are 300, we
cycle through the entire list of 300 machines every (25*120) = 3000 seconds
or 50 minutes. This would mean a worst case of a 50 minute latency between a 
node becoming unreachable and us getting an MCC notification. 

	Is this how it works?

		*- OR -*

	Does the Poller poll *ALL* the machines in one polling interval, 
therefore making it possible to get a < 2 minute worst case notification 
latency for *-ANY-* IP unreachability?

	Thank you,


	Dave
6165.7. (untitled) by MOLAR::MOLAR::BOSE Tue Nov 29 1994 14:47 (32 lines)
	Dave,
		First let me tell you that even HP Openview cannot solve
	your polling problem. I worked on the IP Poller originally and now
	I am working with Netview (based on HP Openview), and I can assure 
	you that Netview will not report on the status of unreachable nodes 
	any faster.
	
		Now, let's get the math straight. The poller has no limit
	on the number of nodes it can poll. But the time taken to poll all
	the nodes will vary depending on how many nodes you are trying to
	poll. So, in the worst case, when all the nodes are unreachable,
	the time taken to poll 300 nodes with 10 sec timeout, and 2 retries
	will be

		300/50 * 10 * (2+1) = 180 sec = 3 min.

		The math is pretty straightforward. In the worst case all
	nodes are unreachable, so there will be timeout and retries for each
	node. So to determine a node is unreachable it will take 10 seconds
	* (2 + 1). A retry value of two means the ping request is sent
	out a total of three times. So, it takes 30 seconds to know that a node
	is unreachable. For 300 nodes, the time taken is 300 * 30 sec. Since the
	poller sends out 50 ping requests in one shot, the actual time taken
	will be 300 * 30 / 50 = 180 sec.

		However, if only 20% of the nodes are unreachable, the time
	taken to poll 300 nodes would be less than a minute with 10 s timeout
	and 2 retries.

	Rahul.
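
The worst-case arithmetic in .7 can be written as a small helper. This is a
sketch only: the function name is made up for illustration, the batch size of
50 is taken from the note, and rounding a partial batch up to a full one is an
assumption, since the note only covers node counts that divide evenly by 50.

```python
import math


def worst_case_sweep_seconds(nodes, timeout_s, retries, batch_size=50):
    """Time to poll every node when none of them answers.

    Each batch waits timeout_s for the first request plus timeout_s for
    each retry, and batches are sent one after another.
    """
    batches = math.ceil(nodes / batch_size)  # assumption: a partial batch costs a full wait
    return batches * timeout_s * (retries + 1)


print(worst_case_sweep_seconds(300, 10, 2))  # 180, the figure from .7
```

With the values suggested in .5 (timeout 2s, retry 1) the same formula gives
6 * 2 * 2 = 24 seconds for 300 nodes.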
		
6165.8. "Fine detail poller actions" by CUJO::BROWN (Dave Brown) Wed Nov 30 1994 15:33 (91 lines)
    
    
    	Rahul,
    
    	What we're trying to establish is how the polls operate within the
    TIMEOUT*(RETRY+1) time period. I understand that systems are polled in
    groups of 50 until all the systems in the domain have been polled. The
    question is, what is the frequency of the successive 50 system poll
    groups?
    
    	By your example in .7 - 
    
    	Systems = 300
    	RETRY = 2
    	TIMEOUT = 10
        POLL_INTERVAL = 200
    
     Is this how it works?:
    
                        Cumulative
              Systems   Systems   Action
    Second    Polled    Polled    Taken
    ------    -------   --------  ------
       0        50         50     First 50 polled. Waits TIMEOUT*(RETRY+1) sec.
      30        50        100     Second 50 polled. Waits TIMEOUT*(RETRY+1) sec.
      60        50        150     Third 50 polled. Waits TIMEOUT*(RETRY+1) sec.
      90        50        200     Fourth 50 polled. Waits TIMEOUT*(RETRY+1) sec.
     120        50        250     Fifth 50 polled. Waits TIMEOUT*(RETRY+1) sec.
     150        50        300     Final 50 polled. Waits for next POLL_INTERVAL.
     200        50         50     Next POLL_INTERVAL; starts over.
    
    	According to my understanding, this is why:
    
    	Total Systems
        ------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
             50
    	
    	If it is not, you will have poll overrun: the condition where polls
    are still occurring from the last POLL_INTERVAL when another
    POLL_INTERVAL starts.
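
    	As a rough illustration of the timing model being asked about here
    (every group of 50 waits the full TIMEOUT*(RETRY+1) before the next
    group is sent), the schedule and the overrun condition could be computed
    as below. The function names are invented for this sketch, and the model
    is the worst-case one from the question, not confirmed poller behaviour.

```python
import math


def batch_start_times(nodes, timeout_s, retries, batch_size=50):
    """Second at which each batch is sent if every batch waits TIMEOUT*(RETRY+1)."""
    wait = timeout_s * (retries + 1)
    batches = math.ceil(nodes / batch_size)
    return [i * wait for i in range(batches)]


def has_overrun(nodes, timeout_s, retries, poll_interval_s, batch_size=50):
    """True when the worst-case sweep is not shorter than POLL_INTERVAL."""
    wait = timeout_s * (retries + 1)
    sweep = math.ceil(nodes / batch_size) * wait
    return sweep >= poll_interval_s


print(batch_start_times(300, 10, 2))  # [0, 30, 60, 90, 120, 150]
print(has_overrun(300, 10, 2, 200))   # False: 180 s fits inside 200 s
print(has_overrun(300, 15, 5, 90))    # True: 540 s does not fit inside 90 s
```

    	The last call uses the values from the base note (Interval 90, Retry 5,
    Timeout 15), which under this model would overrun in the worst case.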
    
    	.7 implies that my example above is correct for worst case only. IF
    that IS true, what I don't understand then is how it works if some
    percentage, up to all, of the nodes are IP Reachable. 
    
    	Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
    but kick off immediately, only after all of the current group of 50 have 
    responded? 
    
    	Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
    immediately respond, is another 30 polls released at this time thereby
    keeping the level of unacknowledged polls at 50?
    
    	Or within the (TIMEOUT*(RETRY+1)) time period, if all 50 nodes
    respond within TIMEOUT seconds, are the next 50 polls released at the
    next TIMEOUT second mark? Example - 
    
    		All nodes IP Reachable -
    		TIMEOUT = 10
    		RETRY = 2
    		POLL_INTERVAL = 200
    		Total Systems = 300

    		                                     Total
    		Second       Systems Polled      Systems Polled
    		------       --------------      --------------
    		   0               50                  50
    		  10               50                 100
    		  20               50                 150
    		  30               50                 200
    		  40               50                 250
    		  50               50                 300
    		  60                0                 300
    		   .                .                   .
    		   .                .                   .
    		 200               50                  50
    		 210               50                 100
    		   .                .                   .
    		   .                .                   .
    
    
    	Is this how it works? These questions are all brought up by my
    customer who has a very inquiring mind and who would like to explain
    the actions that they have witnessed the poller take given a certain
    percentage of IP reachable nodes.
    
    	Any help would be appreciated.
    
    	Thanks!
    
    	Dave 
             
6165.9. (untitled) by MOLAR::MOLAR::BOSE Fri Dec 02 1994 13:09 (34 lines)
   
>>    	According to my understanding, this is why:
    
>>    	Total Systems
>>      ------------- * TIMEOUT * (RETRY+1) must be less than POLL_INTERVAL
>>           50
    	
>>    	If it is not, you will have poll overrun: the condition where polls
>>    are still occurring from the last POLL_INTERVAL when another
>>    POLL_INTERVAL starts.

	If the time taken to poll all the nodes is greater than the polling
	interval, then that time is regarded as the new polling interval. So,
	if the polling interval is too small, then there might be continuous 
	polling of the nodes.
    

	    
>>    	Does the next group of 50 polls not wait for (TIMEOUT*(RETRY+1)) seconds
>>    but kick off immediately, only after all of the current group of 50 have 
>>    responded? 

	Yes. So if all your nodes respond immediately, polling all the nodes
	will take next to no time.
    
>>    	Or within the (TIMEOUT*(RETRY+1)) time period, if 30 out of 50
>>    immediately respond, is another 30 polls released at this time thereby
>>    keeping the level of unacknowledged polls at 50?

	Before the next batch of ICMP requests are sent out, retries are
	attempted on the 20 nodes which didn't respond. Only when all the
	retries are exhausted do we send out the next batch of 50.
    
	Rahul.
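
The batching behaviour described in .9 can be approximated with a small
simulation. This is a simplified sketch under stated assumptions, not the
poller's actual code: reachable nodes are assumed to answer the first request
within rtt_s, a batch containing any silent node is assumed to last the full
timeout * (retries + 1), and the function names are invented for illustration.

```python
def sweep_seconds(reachable, timeout_s, retries, batch_size=50, rtt_s=0.0):
    """Estimate the time to poll every node once.

    reachable: list of booleans, one per node, in polling order.
    """
    total = 0.0
    for start in range(0, len(reachable), batch_size):
        batch = reachable[start:start + batch_size]
        if all(batch):
            total += rtt_s  # every node answered the first request
        else:
            # retries on the silent nodes must be exhausted before the next batch
            total += timeout_s * (retries + 1)
    return total


def effective_interval(poll_interval_s, sweep_s):
    """A sweep longer than the configured interval becomes the new interval."""
    return max(poll_interval_s, sweep_s)
```

Under these assumptions, 300 reachable nodes cost next to nothing, a single
silent node in the first batch costs 30 seconds with timeout 10 and 2 retries,
and 300 silent nodes cost 180 seconds, which would replace a configured
interval of 45 seconds.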
6165.10. "Things are getting weirder..." by CUJO::BROWN (Dave Brown) Fri Dec 02 1994 20:06 (27 lines)
    Rahul,
    
    Thanks for the update, I'll pass it on to the customer. Meanwhile, the
    customer would like me to explain to him why, when he took the polling
    interval from 45 seconds to 180, they got some IP unreachabilities
    followed by IP reachabilities on a continuous cyclical basis. When they
    raised RETRY from 2 to 3, the complaints of the nodes stopped.
    
    Then they started playing with different values of POLLING_INTERVAL
    and noticed that the nodes that would take the IP unreachable/IP
    reachable hits were nodes that were physically adjacent in the DNS
    namespace extension. Raise the value of POLL_INTERVAL a little bit and
    the group of IP unreachable/IP reachable nodes would change to another
    group which was physically adjacent in DNS, a bit further down the
    extension. They will be providing me with a cause and effect matrix
    which I will place here.
    
    As you can tell, this customer is quite inquiring. They change the
    poller variables, observe the unfavorable result and then look in the 
    namespace to try to find out what is happening. The customer would not
    be doing this if the Poller actions were stable (in their mind). They
    (and I) are having quite a bit of difficulty establishing the root
    reasons behind the cause and effect we are seeing with just slightly
    changing the Poller variables. We're trying to find variables which
    will stabilize the Poller.
    
    Dave 
6165.11. (untitled) by MOLAR::MOLAR::BOSE Tue Dec 06 1994 20:09 (18 lines)
>>    interval from 45 seconds to 180, they got some IP unreachabilities
>>    followed by IP reachabilities on a continuous cyclical basis. When they

	Can't explain why that would happen.


>>  When they
>>  raised RETRY from 2 to 3, the complaints of the nodes stopped.


	When a network is flaky, packets may be lost. Increasing the number
	of retries will cause the behaviour to stabilise. There was also a 
	problem in the poller where the socket buffer was too small and
	packets were being lost. But I fixed that problem and it should
	have been available as a patch on V1.3.

	Rahul.
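
One way to see why an extra retry can quiet false alarms on a lossy network is
a back-of-the-envelope estimate. The sketch below assumes each probe to a
healthy node is lost independently with the same probability, which is an
idealisation and not something measured in this topic; the 10% loss rate is an
invented example value and the function names are hypothetical.

```python
def false_alarm_probability(loss_rate, retries):
    """Chance that a healthy node is declared unreachable.

    The node is flagged only if the first probe and every retry are all lost.
    """
    return loss_rate ** (retries + 1)


def expected_false_alarms(nodes, loss_rate, retries):
    """Average number of healthy nodes wrongly flagged in one sweep."""
    return nodes * false_alarm_probability(loss_rate, retries)


for retries in (2, 3):
    print(retries, round(expected_false_alarms(300, 0.10, retries), 3))
```

With 300 nodes and an assumed 10% loss per probe, going from 2 to 3 retries
cuts the expected number of false alarms per sweep from about 0.3 to about
0.03.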
6165.12. "Stopping the Poller Process" by CUJO::BROWN (Dave Brown) Wed Dec 07 1994 20:31 (26 lines)
    
    
    	How about this question - what is the approved method of stopping
    the IP Poller process?
    
    	The reason the question is asked is that once the MCC_POLLER_ENABLE
    procedure is run, the poller process is up and remains for as long as the
    system is booted, regardless of whether MCC is shut down or not. These
    folks do some interesting MCC maintenance actions, such as propagating
    the MCC dictionaries, and like to make sure that all MCC processes are
    off the system before doing things like this. Historically, they have
    been ridding themselves of the MCC_IP_POLLER process by doing a STOP/ID
    on it, but they have recently correlated doing this with many
    IP reachability problems once MCC is restarted and the poller reenabled.
    The only way they can clear this problem, once the poller has been
    STOP/ID'd, is to reboot.
    
    	We tried making an action.dat file with only an "exit" statement in
    it hoping the poller would get the idea we wanted it to go away; it did
    not.
    
    	So is there a way to make the poller process exit gracefully?
    
    	Thanks,
    
    	Dave 
6165.13. ".12??" by CUJO::BROWN (Dave Brown) Wed Dec 14 1994 16:00 (11 lines)
    
    	No response to .12 would suggest it's a good question. The customer
    still wants an answer, should anyone have one. Is there anything we
    can put in the action.dat file to cause the poller to exit gracefully?
    As stated in .12, doing a STOP/ID on the poller causes the next
    poller process to work improperly, the only apparent fix being a
    reboot.
    
    	Thanks!
    
    	Dave