
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

509.0. "integer square root algorithms" by EAGLEA::BEST (R. D. Best, 32 bit sys. arch. & A.D., VAXBI) Thu Jun 12 1986 19:18

  I'd like to reserve this note for implementations / discussion of
efficient algorithms for taking integer ( 16/32/64 bit ) square roots.
Implementations can be in any language ( this includes VAX/Macro, Macro-11,
Z-80 assembly ) as long as the source language is specified.

  Hardware implementations are also welcomed.

  Two questions to start off:

(1)	Can anyone explain to me how the multifunction pipeline
	described in sec. 2.3.3. of chap. 2 of 'The Architecture of
	Pipelined Computers' by Kogge computes square roots ?  What
	algorithm is being implemented ?

(2)	Does anyone know of any efficient divide and conquer algorithms for
	integer square roots ?

{ Pascal 32 bit integer square root; works well for larger #'s (>100) }
{ Does not require multiply or divide }

program square_root3( input, output );

var j, k, l, s: integer;

{ written by R. Best/ 6-may-86 } 
{ inspiration: EDN magazine, 17-apr-86, p. 246 } 
{ Improved integer square root function based on expanding a recurrence
derived from: 1 + 3 + 5 + .. + (2k-1) = k^2.  Three successive
applications of (1) q' := q + 2; p' := p - q - 2 result in
(2) p''' = p - 3q - 12, q''' = q + 6.  These expressions can typically
be evaluated more rapidly than the three successive applications of set (1).
The recurrence is backed up one or two cycles by p := p' + q; q := q' - 2 when
p goes negative.  The resulting compiled VAX code runs from 2% to 70%
faster than the basic 'subtract successive odd integers' method for n >=
about 120.  A hybrid algorithm of the form ' if x < 120 then sqr else sqr3 '
would probably run even faster.  The main program can be used to test
for the crossover x value.  Implementation note: clock is a VAX Pascal
function of 0 args that returns the current value of a process associated
millisecond timer.  The value is the number of milliseconds since the image
was first activated.  The granularity is 10 ms. } 

{ Performance for smaller integers suffers with respect to the naive
algorithm, but not by much.  Performance advantage improves almost monotonically
with increasing n.  The recurrence can be expanded further at the cost
of reduced performance on small integers.  Another optimisation that might
be applied is to expand the recurrence back-up cycles; for example, if we jump
forward by 5 applications per cycle, we could perform three steps of backup
automatically and then retest to see whether to continue backing-up or
to proceed forward, rather than the single back-ups performed in sqr3. } 

function sqr3( x: integer ): integer;

var q, p : integer;

begin
   if x = 0
      then
         sqr3 := 0
      else
         begin
            p := x;
            q := -1;
            repeat
               p := p - q;
               q := q + 6;
               p := p - q;
               p := p - q
            until p <= 0;
            repeat
               p := p + q;
               q := q - 2
            until p > 0;
            q := q + 3;
            sqr3 := q div 2
         end
end;

{ The basic 'subtract successive odd integers' algorithm. } 

function sqr1( x: integer ): integer;

var q, p: integer;

begin
   if x = 0
      then
         sqr1 := 0
      else
         begin
            p := x;
            q := -1;
            repeat
               q := q + 2;
               p := p - q
            until p <= 0;
            q := q + 1;
            sqr1 := q div 2
         end
end;

begin
   readln( s );
   j := clock;
   for k := 1 to 10000 do
      l := sqr3( s );
   j := clock - j;
   writeln( 'elapsed time sqr3= ', j, ' ms' );
   j := clock;
   for k := 1 to 10000 do
      l := sqr1( s );
   j := clock - j;
   writeln( 'elapsed time sqr1= ', j, ' ms' )
end.
509.1. "CROSS REF" by WGBH::APPELLOF (Carl J. Appellof) Thu Jun 12 1986 20:01
    Try looking at note 485 (especially .5 and .6).  What exactly do
    you mean by an "integer square root"?  Does it mean "the integer
    that, when squared, is closest to the original value"?
    
    Interested in a PDP-10 MACRO version of an integer square root which
    uses a binary adaptation of the old-fashioned grade school method?
    
509.2. "Comparison" by WBZ::APPELLOF (Carl J. Appellof) Thu Jun 12 1986 20:47
    It looks like your method is proportional to the size of the number.
    If you trim off leading zeroes, the old fashioned way is proportional
    to the number of digits in the number, which is roughly proportional
    to the log of the number.  If you don't trim leading zeroes, then
    the old-fashioned way is just proportional to the word length.
    
509.3. "replies to .1 and .2; a performance question" by EAGLEA::BEST (R. D. Best, 32 bit sys. arch. & A.D., VAXBI) Thu Jun 12 1986 21:52
In response to .1 and .2:

  Regarding the definition:  It seems that both Hemenway's and my
algorithm produce the smallest positive integer >= sqrt( x ).  The
'jump points' ( where the function output changes from n to n+1 ) occur when
the domain value changes from a_perfect_square to a_perfect_square+1.

( e.g. sqr3( 25 ) = 5, sqr3( 26 ) = 6 )

If there are methods that produce 'largest integer <= sqrt( x )'
or 'closest integer to sqrt( x )' that run faster, I'm interested.

  On the PDP-10 macro routine: sure, let's put it in.  Is it already in
a file somewhere ?  I may need an interpretation though; I don't know the
instruction set of the PDP-10.

  I was aware that there are faster algorithms; for example, scanning down
from the MSB two bits at a time until you find a non-zero bit is a good
way to determine the MSB of the square root:

bit_position( MSB( sqrt( X ) ) ) ~= bit_position( MSB( X ) ) div 2.

From there, I suppose that I could use a table to look up the odd integer
to start subtracting with.  I didn't bother trying to do that in 
the program in .0 because I couldn't think of a real neat way to write
an efficient bit_position function in Pascal ( I guess the scanning technique
IS sort of taking a rough logarithm; well how about a real fast integer LOG
fcn ?)
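
The MSB-scanning idea can be sketched in C (a hedged sketch, not code from this conference; `msb_pos` and `isqrt_seeded` are illustrative names).  It seeds the odd-integer subtraction at the power of two at or below the root, preserving the 'smallest n with n^2 >= x' result of sqr1/sqr3:

```c
/* Rough integer log: bit position of the most significant set bit.
   A linear scan is shown; scanning two bits at a time also works. */
int msb_pos(int x) /* assumes x > 0 */
{
    int i;
    for (i = 30; i >= 0; i--)
        if (x & (1 << i))
            return i;
    return -1;
}

/* Seeded 'subtract successive odd integers': start at base = 2^(msb/2),
   which is guaranteed <= sqrt(x), instead of counting up from 1. */
int isqrt_seeded(int x) /* assumes x >= 0 */
{
    int base, p, q;
    if (x == 0)
        return 0;
    base = 1 << (msb_pos(x) / 2);
    p = x - base * base;   /* odds 1..2*base-1 already accounted for */
    q = 2 * base - 1;      /* last odd integer 'subtracted' so far */
    if (p == 0)
        return base;       /* x is a perfect square */
    do {                   /* same loop as sqr1 from here on */
        q += 2;
        p -= q;
    } while (p > 0);
    return (q + 1) / 2;    /* smallest n with n*n >= x */
}
```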

  Hmmm.  I think that even if you don't chop leading zeroes, the time
complexity is on the order of sqrt( x ).  You effectively count up to
the square root by fixed increments.  Is this what you mean by
proportional to the word length ?

  Since the expanded recurrence version ( sqr3 ) does not work as well
as sqr1 for small numbers, I would like to find a way to improve it
without hacking up the code with a lot of special case tests.

  There is one thing about the algorithm in .0 that puzzles me.
I ran it on both a 785 and an 8600.  The crossover value ( i.e.
the value of x at which the naive algorithm and the expanded recurrence version
performed equally well ) was consistently smaller on the 8600 than the 785.
Does anyone know why this happens ?
509.4. "Here's one" by WGBH::APPELLOF (Carl J. Appellof) Fri Jun 13 1986 18:47
    I was going to put in the MACRO-10 routine, since it was my native
    language for a long time, but it would have taken me longer to
    explain the instructions than writing it up in MACRO-32, so here's
    a FORTRAN-callable MACRO-32 integer square root routine.  It works
    with 32-bit integers, but there is no check for negative args.
    
	.TITLE	ISQRT

;
; Integer square root
; Input: 4(AP) contains address of integer argument
; Output: R0 contains the greatest integer less than or equal
;	to the square root of the argument
;
ISQRT::	.word	^M<R2,R3>

	CLRL	R0		;zap place for normalized word
	MOVL	@4(AP),R1	;Is the argument 0?
	BEQL	DONE		;Yes, done
	CLRQ	R2		;zero R2 and R3 also
1$:	ASHQ	#-2,R0,R0	;Shift out 2 bits
	INCL	R2		;Count digit pairs
	TSTL	R1		;Done all digits yet?
	BNEQ	1$		;No, continue
;
; At this point: R1 contains 0
;		R0 contains normalized argument
;		R2 contains number of digit pairs
;
LOOP:	ASHQ	#2,R0,R0	;Shift 2 bits of argument into R1
	ASHL	#2,R3,R3	;One more digit of result
	INCL	R3		;and add 1
	CMPL	R3,R1		;Will it fit?
	BGTR	2$		;No
	SUBL2	R3,R1		; and subtract
	INCL	R3		;Yes, next result digit is a 1
2$:	ASHL	#-1,R3,R3	;Readjust result
	SOBGTR	R2,LOOP		;Go do next digit pair
	MOVL	R3,R0		;Put answer in the right place
DONE:	RET			;DONE

	.END


    The runtime of this routine is proportional to the number of
    bits in the argument, which is proportional to log(2) of the number.
    That is a better runtime than one which is proportional to the square
    root of the argument.
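
For readers without MACRO-32, the same digit-pair algorithm can be transliterated into C (a sketch, not part of the original reply; like ISQRT, it returns the greatest integer less than or equal to the square root):

```c
/* Shift-and-subtract integer square root, as in ISQRT above:
   bring down the argument two bits at a time, MSB first, and at
   each step test whether the next result bit can be a 1. */
unsigned isqrt_bits(unsigned x)
{
    unsigned root = 0, rem = 0;
    int i;
    for (i = 30; i >= 0; i -= 2) {
        unsigned trial;
        rem  = (rem << 2) | ((x >> i) & 3); /* next digit pair */
        root <<= 1;                         /* one more result bit */
        trial = (root << 1) | 1;            /* 'will it fit?' value */
        if (trial <= rem) {
            rem -= trial;                   /* yes: subtract */
            root |= 1;                      /* and set the result bit */
        }
    }
    return root; /* greatest integer <= sqrt(x) */
}
```

As with ISQRT, the loop runs a fixed number of times (one pass per bit pair), so the time is proportional to the word length rather than to the root itself.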
    
509.5. "Newton Rules, OK!?" by SHEILA::PUCKETT (Fortran will Never Die) Fri Jun 20 1986 06:45
    I played about with ISQRT (.4) and a Fortran version of SQR3 (.0)
    and found that these both rise as LOG(n), as expected. This gets
    quite slow as n gets large; Newton's method is much faster for large
    n (i.e. multiprecision) PROVIDED THAT you start with a good initial
    guess.
    
    My initial guess works as follows: if the number n has k bits and
    k is even, the guess is 2**(k/2).             (top bit set)
    If k is odd the guess is 2**(k/2)+2**(k/2-1)  (top 2 bits set)
    
    Typically you only need 4-5 divisions to get from this to an
    accurate integer square root on a 50 digit number. The number of
    divisions grows much more slowly than LOG(n) but I don't know the
    theory behind this. Maybe it is LOG(LOG(n))?
    
    - Giles
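
For a single machine word, the initial guess and Newton iteration just described might look like this in C (a sketch; the original was multiprecision Fortran, not shown here):

```c
/* Newton's method with the bit-count initial guess from above:
   for a k-bit n, guess 2^(k/2) if k is even (top bit set), else
   2^(k/2) + 2^(k/2-1) (top two bits set).  Either guess is
   >= sqrt(n), so the iteration decreases to floor(sqrt(n)). */
unsigned isqrt_newton(unsigned n)
{
    unsigned x, y;
    int k = 0;
    if (n < 2)
        return n;
    for (y = n; y != 0; y >>= 1)
        k++;                     /* k = number of bits in n */
    if (k % 2 == 0)
        x = 1u << (k / 2);
    else
        x = (1u << (k / 2)) | (1u << (k / 2 - 1));
    y = (x + n / x) / 2;         /* one Newton step */
    while (y < x) {              /* iterate while still decreasing */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;                    /* floor of sqrt(n) */
}
```

Each iteration roughly doubles the number of correct digits, which is consistent with needing only a handful of divisions even on a 50 digit number.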
509.6. "Another LOG(n) method" by SHEILA::PUCKETT (Fortran will Never Die) Fri Jun 20 1986 06:50
    Oops, I forgot to mention another LOG(n) method I tried: binary
    chop, with initial limits 2**(k/2-1) and 2**(k/2+1) for a k-digit
    number n. This is very fast for small n (5-7 digits) but the comments
    in .-1 apply to it just like ISQRT and SQR3.
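
The binary chop can be sketched in C as well (illustrative code, not from the thread; it uses the initial limits quoted above and returns the floor of the root):

```c
/* Binary chop for the integer square root, bracketing the answer
   between 2^(k/2-1) and 2^(k/2+1) for a k-bit argument. */
unsigned isqrt_chop(unsigned n)
{
    unsigned lo, hi, mid;
    int k = 0;
    if (n < 2)
        return n;
    for (mid = n; mid != 0; mid >>= 1)
        k++;                       /* k = number of bits in n */
    lo = 1u << (k / 2 - 1);
    hi = 1u << (k / 2 + 1);
    /* invariant: lo <= floor(sqrt(n)) < hi */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (mid <= n / mid)        /* mid*mid <= n, overflow-safe */
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```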
509.7. "How does fast division work?" by TOOK::APPELLOF (Carl J. Appellof) Fri Jun 20 1986 13:21
    I still believe that SQR3 time rises as the square root of the
    argument.  This gets painfully slow as N rises.  ISQRT goes up as
    LOG(n).  The easiest way to fake up a Newton's method rooter for
    a 32 bit number is the following FORTRAN function:
    
    	INTEGER FUNCTION FSQRT(N)
    	FSQRT=INT(DSQRT(DFLOAT(N)))
    	RETURN
    	END
    
    The description of the method used for the initial approximation
    is in the VMS RTL manual description of MTH$xSQRT.
    This method only requires 3 iterations of the Newton-Raphson procedure,
    and therefore the time is constant (if division is cheap).  It beats
    ISQRT all the time.
    
    On the other hand, we might consider hardware implementations. 
    The algorithm used in ISQRT is very similar to that used in ordinary
    integer division (shift and subtract).  I do not know if there is
    a better method to do division in hardware.  If there is not, then
    an ISQRT-like hardware algorithm would beat a Newton's method, since
    Newton uses division, which is also proportional to LOG(n) using
    the shift-and-subtract method.
    
509.8. "Newtons is LOG(LOG(n)) ??" by SHEILA::PUCKETT (An Indestructible Metallic Alloy) Mon Jun 30 1986 07:12
    RE: .5,.6
    It appears using multi-precision routines on 80-120 digit nos.
    that Newton's method DOUBLES the precision with each iteration.
    This is a whole heck of a lot better than LOG(n), it is in fact
    approximately LOG(LOG(n)). 
    
    Anyone up on the theory of these optimisation methods?
    
    - Giles
509.9. "Let them read Macro-10" by GALLO::AMARTIN (Alan H. Martin) Mon Jul 07 1986 17:19
Re .7:

Oh, splurge, Carl.  Stick the stupid Macro-10 routine in the conference.
Posterity ought to count for something, and for all I know, one version
or the other might get a boost from the difference in architectures.
If someone can't read it, that is their loss.  Don't bother prettying
it up - I am not trying to make work for you.
				/AHM/THX
509.10. "exit" by TOOK::APPELLOF (Carl J. Appellof) Wed Jul 16 1986 14:05
Alright, here is the stupid MACRO-10 routine.  The only architectural
    advantage the -10 has is the JFFO instruction, which finds the
    LEFT-most 1-bit in a word (as opposed to the VAX FFO instruction,
    which finds the RIGHT-most 1-bit, which is useless for normalizing
    numbers).  It IS possible to fake a similar operation on a VAX
    by using a CVTLF instruction and extracting the resulting exponent
    field.  The rest of the algorithm is exactly the same on a VAX or
    a -10.  Of course, all of this is useless for a multiprecision version.
    I'd like to see the FAST multiprecision SQRT which someone mentioned
    in another note.
    
	TITLE	ISQRT

	AP=16			;ARGUMENT POINTER AC
	P=17			;Stack pointer

	ENTRY	ISQRT
;
; Call:
;	I = ISQRT(N)
;
ISQRT:	PUSH	P,1		;Save all ACs we use
	PUSH	P,2
	PUSH	P,3

	MOVE	0,@(16)		;Get argument from arg list
	JFFO	0,.+2		;Find leftmost 1 bit in word
	  JRST	SQREND		;None found, answer is 0
	MOVE	2,0		;Save the argument
	TRZ	1,1		;Get nearest even number of leading zeroes
	LSH	2,(1)		;Normalize the argument
	HRROI	3,-^D36(1)	;Calculate Negative number of digits
	ASH	3,-1		;number of pairs of digits
	SETZB	0,1		;ZERO the answer and accumulator

LOOP:	LSHC	1,2		;Slide in the next digit pair
	LSH	0,2		;gives us one more digit of answer
	TRO	0,1		;In this position of the answer
	CAMLE	0,1		;Will it fit?
	JRST	LOOP1		;Nope
	SUB	1,0		;Yup, subtract from accumulator
	TRO	0,2		;Set answer digit
LOOP1:	LSH	0,-1		;Readjust accumulator
	AOJL	3,LOOP		;Go for next digit pair

SQREND:	POP	P,3		;Restore saved ACs
	POP	P,2
	POP	P,1
	POPJ	P,		;Return with answer in AC0

	END

509.11. "Premature optimization is the root of all evil" by NOBUGS::AMARTIN (Alan H. Martin) Thu Oct 30 1986 22:06
Re .10:

Thanks.  The routine even works as written under extended addressing!

(However, you can trash AC1 and AC16 with impunity, thus eliminating two
PUSH/POP pairs, and changing "JRST SQREND" to "POPJ P," in the process!)

By the way,


>(as opposed to the VAX FFO instruction,
>    which finds the RIGHT-most 1-bit which is useless for normalizing
>    numbers.)

On a 2's complement machine with a "count the number of leading zeros"
instruction, anyone can find the number of trailing zeros with a
(small) constant additional amount of work.

Conversely, on a 2's complement machine with a "count the number of
trailing zeros" instruction, I am aware of no fast method of counting
leading zeros.

I pose the former problem as an exercise to the reader.  Can anyone
give a solution to the latter problem?
				/AHM/THX
509.12. "Is speedup possible?" by TOOK::APPELLOF (Carl J. Appellof) Fri Oct 31 1986 12:29
    Thanks, Alan, for pointing out what I had forgotten: that AC1 is
    fair game in FORTRAN-10 function routines.
    
    I still have a nagging feeling that there is a version of this
    algorithm which would operate on more than 2 bits at a time.  After
    all, it's merely a binary adaptation of the grade school paper and
    pencil method, and that one looks at a pair of decimal digits on
    each pass.  Can you think of a simple way of examining 4 bits at
    a time?
509.13. "Who needs divisions?" by AKQJ10::YARBROUGH Wed Apr 22 1987 18:46
    There is a recent article in IEEE Proceedings, Vol 75 #2, p. 262,
    about sqrts with a Digital Signal Processor. It may be worth a look.
    
    Lynn
509.14. "use a trick from DEC-10 processor reference" by EAGLE1::BEST (R D Best, Systems architecture, I/O) Tue Jun 16 1987 18:03
re .11:

> On a 2's complement machine with a "count the number of leading zeros"
> instruction, anyone can find the number of trailing zeros with a
> (small) constant additionaly amount of work.

Let the word size of the machine be N.  Let the argument be A.
If the "count the number of leading zeros" operator is leading(x), then
the number of trailing zeros is:

	trailing( A ) = N - leading( and( A, -A ) ) - 1

This is because and( A, -A ) yields a bit mask with a single 1 in the position
of the rightmost set bit in the argument.
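
The identity can be checked directly in C, where negating an unsigned word gives its two's-complement (here N = 32, and `leading` is an illustrative software stand-in for a hardware count-leading-zeros instruction):

```c
/* Software stand-in for a 'count leading zeros' instruction. */
int leading(unsigned x)
{
    int n = 0;
    if (x == 0)
        return 32;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* The identity from .14: A & -A isolates the lowest set bit, so
   its position is N - leading(A & -A) - 1.  Assumes a != 0. */
int trailing(unsigned a)
{
    return 32 - leading(a & -a) - 1;
}
```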