
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

509.0. "integer square root algorithms" by EAGLEA::BEST (R. D. Best, 32 bit sys. arch. & A.D., VAXBI) Thu Jun 12 1986 19:18

  I'd like to reserve this note for implementations / discussion of
efficient algorithms for taking integer ( 16/32/64 bit ) square roots.
Implementations can be in any language ( this includes VAX/Macro, Macro-11,
Z-80 assembly ) as long as the source language is specified.

  Hardware implementations are also welcomed.

  Two questions to start off:

(1)	Can anyone explain to me how the multifunction pipeline
	described in sec. 2.3.3. of chap. 2 of 'The Architecture of
	Pipelined Computers' by Kogge computes square roots ?  What
	algorithm is being implemented ?

(2)	Does anyone know of any efficient divide and conquer algorithms for
	integer square roots ?

{ Pascal 32 bit integer square root; works well for larger #'s (>100) }
{ Does not require multiply or divide }

program square_root3( input, output );

var j, k, l, s: integer;

{ written by R. Best/ 6-may-86 } 
{ inspiration: EDN magazine, 17-apr-86, p. 246 } 
{ Improved integer square root function based on expanding a recurrence
derived from: 1 + 3 + 5 + .. + (2k-1) = k^2.  Three successive
applications of (1) q' := q + 2; p' := p - q - 2 result in
(2) p''' = p - 3q - 12, q''' = q + 6.  These expressions can typically
be evaluated more rapidly than the three successive applications of set (1).
The recurrence is backed up one or two cycles by p := p' + q; q := q' - 2 when
p goes negative.  The resulting compiled VAX code runs from 2% to 70%
faster than the basic 'subtract successive odd integers' method for n >=
about 120.  A hybrid algorithm of the form ' if x < 120 then sqr else sqr3 '
would probably run even faster.  The main program can be used to test
for the crossover x value.  Implementation note: clock is a VAX Pascal
function of 0 args that returns the current value of a process associated
millisecond timer.  The value is the number of milliseconds since the image
was first activated.  The granularity is 10 ms. } 

{ Performance for smaller integers suffers with respect to the naive
algorithm, but not by much.  Performance advantage improves almost monotonically
with increasing n.  The recurrence can be expanded further at the cost
of reduced performance on small integers.  Another optimisation that might
be applied is to expand the recurrence back-up cycles; for example, if we jump
forward by 5 applications per cycle, we could perform three steps of backup
automatically and then retest to see whether to continue backing-up or
to proceed forward, rather than the single back-ups performed in sqr3. } 

function sqr3( x: integer ): integer;

var q, p : integer;

begin
   if x = 0
      then
         sqr3 := 0
      else
         begin
            p := x;
            q := -1;
            repeat
               p := p - q;
               q := q + 6;
               p := p - q;
               p := p - q
            until p <= 0;
            repeat
               p := p + q;
               q := q - 2
            until p > 0;
            q := q + 3;
            sqr3 := q div 2
         end
end;

{ The basic 'subtract successive odd integers' algorithm. } 

function sqr1( x: integer ): integer;

var q, p: integer;

begin
   if x = 0
      then
         sqr1 := 0
      else
         begin
            p := x;
            q := -1;
            repeat
               q := q + 2;
               p := p - q
            until p <= 0;
            q := q + 1;
            sqr1 := q div 2
         end
end;

begin
   readln( s );
   j := clock;
   for k := 1 to 10000 do
      l := sqr3( s );
   j := clock - j;
   writeln( 'elapsed time sqr3= ', j, ' ms' );
   j := clock;
   for k := 1 to 10000 do
      l := sqr1( s );
   j := clock - j;
   writeln( 'elapsed time sqr1= ', j, ' ms' )
end.
509.1. "CROSS REF" by WGBH::APPELLOF (Carl J. Appellof) Thu Jun 12 1986 20:01
    Try looking at note 485 (especially .5 and .6).  What exactly do
    you mean by an "integer square root"?  Does it mean "the integer
    that, when squared, is closest to the original value"?
    
    Interested in a PDP-10 MACRO version of an integer square root which
    uses a binary adaptation of the old-fashioned grade school method?
    
509.2. "Comparison" by WBZ::APPELLOF (Carl J. Appellof) Thu Jun 12 1986 20:47
    It looks like your method is proportional to the size of the number.
    If you trim off leading zeroes, the old fashioned way is proportional
    to the number of digits in the number, which is roughly proportional
    to the log of the number.  If you don't trim leading zeroes, then
    the old-fashioned way is just proportional to the word length.
    
509.3. "replies to .1 and .2; a performance question" by EAGLEA::BEST (R. D. Best, 32 bit sys. arch. & A.D., VAXBI) Thu Jun 12 1986 21:52
In response to .1 and .2:

  Regarding the definition:  It seems that both Hemenway's and my
algorithm produce the smallest positive integer >= sqrt( x ).  The
'jump points' ( where the function output changes from n to n+1 ) occur when
the domain value changes from a_perfect_square to a_perfect_square+1.

( e.g. sqr3( 25 ) = 5, sqr3( 26 ) = 6 )

If there are methods that produce 'largest integer <= sqrt( x )'
or 'closest integer to sqrt( x )' that run faster, I'm interested.

  On the PDP-10 macro routine: sure, let's put it in.  Is it already in
a file somewhere ?  I may need an interpretation though; I don't know the
instruction set of the PDP-10.

  I was aware that there are faster algorithms; for example, scanning down
from the MSB two bits at a time until you find a non-zero bit is a good
way to determine the MSB of the square root:

bit_position( MSB( sqrt( X ) ) ) ~= bit_position( MSB( X ) ) div 2.

From there, I suppose that I could use a table to look up the odd integer
to start subtracting with.  I didn't bother trying to do that in 
the program in .0 because I couldn't think of a real neat way to write
an efficient bit_position function in Pascal ( I guess the scanning technique
IS sort of taking a rough logarithm; well how about a real fast integer LOG
fcn ?)
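
The MSB-scanning idea can be sketched in C (a hedged sketch, not code from this conference; `msb_pos` and `isqrt_seeded` are illustrative names).  It seeds the odd-integer subtraction at the power of two at or below the root, preserving the 'smallest n with n^2 >= x' result of sqr1/sqr3:

```c
/* Rough integer log: bit position of the most significant set bit.
   A linear scan is shown; scanning two bits at a time also works. */
int msb_pos(int x) /* assumes x > 0 */
{
    int i;
    for (i = 30; i >= 0; i--)
        if (x & (1 << i))
            return i;
    return -1;
}

/* Seeded 'subtract successive odd integers': start at base = 2^(msb/2),
   which is guaranteed <= sqrt(x), instead of counting up from 1. */
int isqrt_seeded(int x) /* assumes x >= 0 */
{
    int base, p, q;
    if (x == 0)
        return 0;
    base = 1 << (msb_pos(x) / 2);
    p = x - base * base;   /* odds 1..2*base-1 already accounted for */
    q = 2 * base - 1;      /* last odd integer 'subtracted' so far */
    if (p == 0)
        return base;       /* x is a perfect square */
    do {                   /* same loop as sqr1 from here on */
        q += 2;
        p -= q;
    } while (p > 0);
    return (q + 1) / 2;    /* smallest n with n*n >= x */
}
```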

  Hmmm.  I think that even if you don't chop leading zeroes, the time
complexity is on the order of sqrt( x ).  You effectively count up to
the square root by fixed increments.  Is this what you mean by
proportional to the word length ?

  Since the expanded recurrence version ( sqr3 ) does not work as well
as sqr1 for small numbers, I would like to find a way to improve it
without hacking up the code with a lot of special case tests.

  There is one thing about the algorithm in .0 that puzzles me.
I ran it on both a 785 and an 8600.  The crossover value ( i.e.
the value of x at which the naive algorithm and the expanded recurrence version
performed equally well ) was consistently smaller on the 8600 than the 785.
Does anyone know why this happens ?
509.4. "Here's one" by WGBH::APPELLOF (Carl J. Appellof) Fri Jun 13 1986 18:47
    I was going to put in the MACRO-10 routine, since it was my native
    language for a long time, but it would have taken me longer to
    explain the instructions than writing it up in MACRO-32, so here's
    a FORTRAN-callable MACRO-32 integer square root routine.  It works
    with 32-bit integers, but there is no check for negative args.
    
	.TITLE	ISQRT

;
; Integer square root
; Input: 4(AP) contains address of integer argument
; Output: R0 contains the greatest integer less than or equal
;	to the square root of the argument
;
ISQRT::	.word	^M<R2,R3>

	CLRL	R0		;zap place for normalized word
	MOVL	@4(AP),R1	;Is the argument 0?
	BEQL	DONE		;Yes, done
	CLRQ	R2		;zero R2 and R3 also
1$:	ASHQ	#-2,R0,R0	;Shift out 2 bits
	INCL	R2		;Count digit pairs
	TSTL	R1		;Done all digits yet?
	BNEQ	1$		;No, continue
;
; At this point: R1 contains 0
;		R0 contains normalized argument
;		R2 contains number of digit pairs
;
LOOP:	ASHQ	#2,R0,R0	;Shift 2 bits of argument into R1
	ASHL	#2,R3,R3	;One more digit of result
	INCL	R3		;and add 1
	CMPL	R3,R1		;Will it fit?
	BGTR	2$		;No
	SUBL2	R3,R1		; and subtract
	INCL	R3		;Yes, next result digit is a 1
2$:	ASHL	#-1,R3,R3	;Readjust result
	SOBGTR	R2,LOOP		;Go do next digit pair
	MOVL	R3,R0		;Put answer in the right place
DONE:	RET			;DONE

	.END


    The runtime of this routine is proportional to the number of
    bits in the argument, which is proportional to log(2) of the number.
    That is a better runtime than one which is proportional to the square
    root of the argument.
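
For readers without MACRO-32, the same digit-pair algorithm can be transliterated into C (a sketch, not part of the original reply; like ISQRT, it returns the greatest integer less than or equal to the square root):

```c
/* Shift-and-subtract integer square root, as in ISQRT above:
   bring down the argument two bits at a time, MSB first, and at
   each step test whether the next result bit can be a 1. */
unsigned isqrt_bits(unsigned x)
{
    unsigned root = 0, rem = 0;
    int i;
    for (i = 30; i >= 0; i -= 2) {
        unsigned trial;
        rem  = (rem << 2) | ((x >> i) & 3); /* next digit pair */
        root <<= 1;                         /* one more result bit */
        trial = (root << 1) | 1;            /* 'will it fit?' value */
        if (trial <= rem) {
            rem -= trial;                   /* yes: subtract */
            root |= 1;                      /* and set the result bit */
        }
    }
    return root; /* greatest integer <= sqrt(x) */
}
```

As with ISQRT, the loop runs a fixed number of times (one pass per bit pair), so the time is proportional to the word length rather than to the root itself.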
    
509.5. "Newton Rules, OK!?" by SHEILA::PUCKETT (Fortran will Never Die) Fri Jun 20 1986 06:45
    I played about with ISQRT (.4) and a Fortran version of SQR3 (.0)
    and found that these both rise as LOG(n), as expected. This gets
    quite slow as n gets large; Newton's method is much faster for large
    n (i.e. multiprecision) PROVIDED THAT you start with a good initial
    guess.
    
    My initial guess works as follows: if the number n has k bits and
    k is even, the guess is 2**(k/2).             (top bit set)
    If k is odd the guess is 2**(k/2)+2**(k/2-1)  (top 2 bits set)
    
    Typically you only need 4-5 divisions to get from this to an
    accurate integer square root on a 50 digit number. The number of
    divisions grows much more slowly than LOG(n) but I don't know the
    theory behind this. Maybe it is LOG(LOG(n))?
    
    - Giles
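
For a single machine word, the initial guess and Newton iteration just described might look like this in C (a sketch; the original was multiprecision Fortran, not shown here):

```c
/* Newton's method with the bit-count initial guess from above:
   for a k-bit n, guess 2^(k/2) if k is even (top bit set), else
   2^(k/2) + 2^(k/2-1) (top two bits set).  Either guess is
   >= sqrt(n), so the iteration decreases to floor(sqrt(n)). */
unsigned isqrt_newton(unsigned n)
{
    unsigned x, y;
    int k = 0;
    if (n < 2)
        return n;
    for (y = n; y != 0; y >>= 1)
        k++;                     /* k = number of bits in n */
    if (k % 2 == 0)
        x = 1u << (k / 2);
    else
        x = (1u << (k / 2)) | (1u << (k / 2 - 1));
    y = (x + n / x) / 2;         /* one Newton step */
    while (y < x) {              /* iterate while still decreasing */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;                    /* floor of sqrt(n) */
}
```

Each iteration roughly doubles the number of correct digits, which is consistent with needing only a handful of divisions even on a 50 digit number.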
509.6. "Another LOG(n) method" by SHEILA::PUCKETT (Fortran will Never Die) Fri Jun 20 1986 06:50
    Oops, I forgot to mention another LOG(n) method I tried: binary
    chop, with initial limits 2**(k/2-1) and 2**(k/2+1) for a k-digit
    number n. This is very fast for small n (5-7 digits) but the comments
    in .-1 apply to it just like ISQRT and SQR3.
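
The binary chop can be sketched in C as well (illustrative code, not from the thread; it uses the initial limits quoted above and returns the floor of the root):

```c
/* Binary chop for the integer square root, bracketing the answer
   between 2^(k/2-1) and 2^(k/2+1) for a k-bit argument. */
unsigned isqrt_chop(unsigned n)
{
    unsigned lo, hi, mid;
    int k = 0;
    if (n < 2)
        return n;
    for (mid = n; mid != 0; mid >>= 1)
        k++;                       /* k = number of bits in n */
    lo = 1u << (k / 2 - 1);
    hi = 1u << (k / 2 + 1);
    /* invariant: lo <= floor(sqrt(n)) < hi */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (mid <= n / mid)        /* mid*mid <= n, overflow-safe */
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```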
509.7. "How does fast division work?" by TOOK::APPELLOF (Carl J. Appellof) Fri Jun 20 1986 13:21
    I still believe that SQR3 time rises as the square root of the
    argument.  This gets painfully slow as N rises.  ISQRT goes up as
    LOG(n).  The easiest way to fake up a Newton's method rooter for
    a 32 bit number is the following FORTRAN function:
    
    	INTEGER FUNCTION FSQRT(N)
    	FSQRT=INT(DSQRT(DFLOAT(N)))
    	RETURN
    	END
    
    The description of the method used for the initial approximation
    is in the VMS RTL manual description of MTH$xSQRT.
    This method only requires 3 iterations of the Newton-Raphson procedure,
    and therefore the time is constant (if division is cheap).  It beats
    ISQRT all the time.
    
    On the other hand, we might consider hardware implementations. 
    The algorithm used in ISQRT is very similar to that used in ordinary
    integer division (shift and subtract).  I do not know if there is
    a better method to do division in hardware.  If there is not, then
    an ISQRT-like hardware algorithm would beat a Newton's method, since
    Newton uses division, which is also proportional to LOG(n) using
    the shift-and-subtract method.
    
509.8. "Newtons is LOG(LOG(n)) ??" by SHEILA::PUCKETT (An Indestructible Metallic Alloy) Mon Jun 30 1986 07:12
    RE: .5,.6
    It appears using multi-precision routines on 80-120 digit nos.
    that Newton's method DOUBLES the precision with each iteration.
    This is a whole heck of a lot better than LOG(n), it is in fact
    approximately LOG(LOG(n)). 
    
    Anyone up on the theory of these optimisation methods?
    
    - Giles
509.9. "Let them read Macro-10" by GALLO::AMARTIN (Alan H. Martin) Mon Jul 07 1986 17:19
Re .7:

Oh, splurge, Carl.  Stick the stupid Macro-10 routine in the conference.
Posterity ought to count for something, and for all I know, one version
or the other might get a boost from the difference in architectures.
If someone can't read it, that is their loss.  Don't bother prettying
it up - I am not trying to make work for you.
				/AHM/THX
509.10. "exit" by TOOK::APPELLOF (Carl J. Appellof) Wed Jul 16 1986 14:05
Alright, here is the stupid MACRO-10 routine.  The only architectural
    advantage the -10 has is the JFFO instruction, which finds the
    LEFT-most 1-bit in a word (as opposed to the VAX FFO instruction,
    which finds the RIGHT-most 1-bit, which is useless for normalizing
    numbers).  It IS possible to fake a similar operation on a VAX
    by using a CVTLF instruction and extracting the resulting exponent
    field.  The rest of the algorithm is exactly the same on a VAX or
    a -10.  Of course, all of this is useless for a multiprecision version.
    I'd like to see the FAST multiprecision SQRT which someone mentioned
    in another note.
    
	TITLE	ISQRT

	AP=16			;ARGUMENT POINTER AC
	P=17			;Stack pointer

	ENTRY	ISQRT
;
; Call:
;	I = ISQRT(N)
;
ISQRT:	PUSH	P,1		;Save all ACs we use
	PUSH	P,2
	PUSH	P,3

	MOVE	0,@(16)		;Get argument from arg list
	JFFO	0,.+2		;Find leftmost 1 bit in word
	  JRST	SQREND		;None found, answer is 0
	MOVE	2,0		;Save the argument
	TRZ	1,1		;Get nearest even number of leading zeroes
	LSH	2,(1)		;Normalize the argument
	HRROI	3,-^D36(1)	;Calculate Negative number of digits
	ASH	3,-1		;number of pairs of digits
	SETZB	0,1		;ZERO the answer and accumulator

LOOP:	LSHC	1,2		;Slide in the next digit pair
	LSH	0,2		;gives us one more digit of answer
	TRO	0,1		;In this position of the answer
	CAMLE	0,1		;Will it fit?
	JRST	LOOP1		;Nope
	SUB	1,0		;Yup, subtract from accumulator
	TRO	0,2		;Set answer digit
LOOP1:	LSH	0,-1		;Readjust accumulator
	AOJL	3,LOOP		;Go for next digit pair

SQREND:	POP	P,3		;Restore saved ACs
	POP	P,2
	POP	P,1
	POPJ	P,		;Return with answer in AC0

	END

509.11. "Premature optimization is the root of all evil" by NOBUGS::AMARTIN (Alan H. Martin) Thu Oct 30 1986 22:06
Re .10:

Thanks.  The routine even works as written under extended addressing!

(However, you can trash AC1 and AC16 with impunity, thus eliminating two
PUSH/POP pairs, and changing "JRST SQREND" to "POPJ P," in the process!)

By the way,


>(as opposed to the VAX FFO instruction,
>    which finds the RIGHT-most 1-bit which is useless for normalizing
>    numbers.)

On a 2's complement machine with a "count the number of leading zeros"
instruction, anyone can find the number of trailing zeros with a
(small) constant additional amount of work.

Conversely, on a 2's complement machine with a "count the number of
trailing zeros" instruction, I am aware of no fast method of counting
leading zeros.

I pose the former problem as an exercise to the reader.  Can anyone
give a solution to the latter problem?
				/AHM/THX
509.12. "Is speedup possible?" by TOOK::APPELLOF (Carl J. Appellof) Fri Oct 31 1986 12:29
    Thanks, Alan, for pointing out what I had forgotten: that AC1 is
    fair game in FORTRAN-10 function routines.
    
    I still have a nagging feeling that there is a version of this
    algorithm which would operate on more than 2 bits at a time.  After
    all, it's merely a binary adaptation of the grade school paper and
    pencil method, and that one looks at a pair of decimal digits on
    each pass.  Can you think of a simple way of examining 4 bits at
    a time?
509.13. "Who needs divisions?" by AKQJ10::YARBROUGH Wed Apr 22 1987 18:46
    There is a recent article in IEEE Proceedings, Vol 75 #2, p. 262,
    about sqrts with a Digital Signal Processor. It may be worth a look.
    
    Lynn
509.14. "use a trick from DEC-10 processor reference" by EAGLE1::BEST (R D Best, Systems architecture, I/O) Tue Jun 16 1987 18:03
re .11:

> On a 2's complement machine with a "count the number of leading zeros"
> instruction, anyone can find the number of trailing zeros with a
> (small) constant additionaly amount of work.

Let the word size of the machine be N.  Let the argument be A.
If the "count the number of leading zeros" operator is leading(x), then
the number of trailing zeros is:

	trailing( A ) = N - leading( and( A, -A ) ) - 1

This is because and( A, -A ) yields a bit mask with a single 1 in the position
of the rightmost set bit in the argument.
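
The identity can be checked directly in C, where negating an unsigned word gives its two's-complement (here N = 32, and `leading` is an illustrative software stand-in for a hardware count-leading-zeros instruction):

```c
/* Software stand-in for a 'count leading zeros' instruction. */
int leading(unsigned x)
{
    int n = 0;
    if (x == 0)
        return 32;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* The identity from .14: A & -A isolates the lowest set bit, so
   its position is N - leading(A & -A) - 1.  Assumes a != 0. */
int trailing(unsigned a)
{
    return 32 - leading(a & -a) - 1;
}
```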