
Conference wonder::turbolaser

Title:TurboLaser Notesfile - AlphaServer 8200 and 8400 systems
Notice:Welcome to WONDER::TURBOLASER in its new home shortly
Moderator:LANDO::DROBNER
Created:Tue Dec 20 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1218
Total number of notes:4645

1148.0. "Help me refute this posting." by HGOVC::JOELBERMAN () Wed Mar 26 1997 00:31

    I just read the following.  You may have to answer some of these
    questions.  Perhaps someone can help us tear this following post to
    shreds.
    
    Path:
    pa.dec.com!news1.digital.com!data.ramona.vix.com!sonysjc!su-news-hub1.bbnplanet.com!cpk-news-hub1.bbnplanet.com!news.bbnplanet.com!news.sprintlink.net!news-peer.sprintlink.net!howland.erols.net!news.sgi.com!news.corp.sgi.com!frakir.asd.sgi.com!mccalpin
    From: mccalpin@frakir.asd.sgi.com (John McCalpin)
    Newsgroups: comp.benchmarks
    Subject: Re: Alpha performance
    Date: 25 Mar 1997 16:37:28 GMT
    Organization: Silicon Graphics, Inc., Mountain View, CA
    Lines: 143
    Message-ID: <5h8v08$fu3@murrow.corp.sgi.com>
    References: <3337DCC5.4087@aries.scs.uiuc.edu>
    Reply-To: mccalpin@asd.sgi.com
    NNTP-Posting-Host: frakir.asd.sgi.com
    
    In article <3337DCC5.4087@aries.scs.uiuc.edu>,
    J. D. McDonald <mcdonald@aries.scs.uiuc.edu> wrote:
    >Yesterday I noticed four new SGI Origins appearing in the
    >office across the hall from mine. I was talking to the
    >owner and a colleague who is another big computer time user,
    >which I have not been since 1974 (when I was the absolute biggest,
    >literally). I asked why everybody buys SGI instead of
    >something else, specifically Alpha.
    >
    >The answer was "Alpha can only do matrix multiplies, nothing else".
    >(We're talking FP here, of course.) 
    >
    >Is this fact clearly visible in benchmarks?
    
    I love the quote!
    
    It is not true, of course, but it is a great quote....
    
    
    
    There are lots of ways to look at machine performance, not all
    of which are equally useful.
    
    For single-processor "number-crunching" there are typically two
    important axes to consider: computation rate and memory bandwidth.
    
    
    One way to model computation rate is by the LINPACK benchmark or by
    matrix multiplication.  These algorithms can be re-arranged to require
    very small amounts of main memory traffic.  Similar results can be
    obtained for small codes that are entirely cache-contained.   I am not
    convinced that these are particularly good measures, but they have the
    advantage of being simple.
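    (As a concrete picture of that "re-arranged" point, here is a rough
    sketch of cache blocking for matrix multiply; it is illustrative only,
    not the tuned kernel any vendor actually ships.  Keeping a small tile
    of each matrix in cache lets every element fetched from main memory be
    reused many times, so memory traffic per flop falls with the block size.)

        # Rough sketch of a cache-blocked matrix multiply (C += A*B).
        # Illustrative only; real LINPACK/DGEMM kernels are far more tuned.
        def blocked_matmul(A, B, C, n, blk):
            for i0 in range(0, n, blk):
                for j0 in range(0, n, blk):
                    for k0 in range(0, n, blk):
                        # work on one blk x blk tile at a time, so operands
                        # stay cache-resident while they are reused
                        for i in range(i0, min(i0 + blk, n)):
                            for j in range(j0, min(j0 + blk, n)):
                                s = C[i][j]
                                for k in range(k0, min(k0 + blk, n)):
                                    s += A[i][k] * B[k][j]
                                C[i][j] = s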
    
    More detailed analyses need to include things like pipeline depth
    (which is deep on the Alphas and shallow on the slower-clocked machines
    like SGI's).  Machines with deeper pipelines are harder to generate
    code for, and generally attain a lower fraction of "peak" performance
    than machines with shorter pipelines/latencies.
    
    
    Memory bandwidth has a couple of degrees of freedom, but the most
    important one is the unit-stride bandwidth, measured by the STREAM
    benchmark:
    
    	http://www.cs.virginia.edu/stream/
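    (The core of STREAM is a few long unit-stride vector loops; the "triad"
    kernel is roughly the sketch below.  The real benchmark at that URL adds
    timing, repetition and verification, so treat this purely as a picture
    of what is being measured.)

        # Sketch of the STREAM "triad" kernel: a(i) = b(i) + q*c(i), with
        # arrays sized well beyond any cache.  Bandwidth is the bytes moved
        # (read b, read c, write a) divided by the loop time.  An interpreted
        # version like this will not measure real hardware bandwidth; it
        # only shows the shape of the kernel.
        import time

        N = 2_000_000
        q = 3.0
        a = [0.0] * N
        b = [1.0] * N
        c = [2.0] * N

        t0 = time.perf_counter()
        for i in range(N):
            a[i] = b[i] + q * c[i]
        t1 = time.perf_counter()

        bytes_moved = 3 * N * 8          # three 8-byte words per iteration
        print("triad ~ %.1f MB/s" % (bytes_moved / (t1 - t0) / 1e6))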
    
    It is very hard to make comparisons, since I do not know the DEC
    product line in any detail, but my interpretation is that SGI offers
    better bandwidth at most price points.  One architectural issue that
    hurts the DEC bandwidth is the need for a third-level cache (since the
    on-chip L1 + L2 caches provide only 96kB of data cache).  STREAM
    results
    for DEC machines with no L3 cache are better, but the application 
    performance typically drops, since big caches help most applications.
    
    
    
    Many benchmarks can be interpreted as some combination of computation-
    limited work and bandwidth-limited work.  For example, on SGI machines
    with large caches (1-4 MB), the SPECfp95 suite appears to be (in very
    rough terms) about 75% computation limited and about 25% bandwidth
    limited.  I am not sure how these numbers look for other vendors.
    (Actually, a big chunk of the 75% "computation limited" time is
    probably limited by external cache bandwidth, but it is tricky to
    come up with reasonable estimates of how big that chunk is.)
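    (A crude way to use such a split is to treat total time as a
    computation-limited piece plus a bandwidth-limited piece and scale each
    independently.  A sketch, using the rough 75%/25% figures above:)

        # Two-component model: part of the time scales with computation
        # speed, the rest with memory bandwidth.
        def predicted_speedup(compute_frac, compute_speedup, bw_speedup=1.0):
            bw_frac = 1.0 - compute_frac
            return 1.0 / (compute_frac / compute_speedup + bw_frac / bw_speedup)

        # Doubling compute speed with unchanged memory bandwidth, assuming
        # the rough 75%/25% split quoted above:
        print(predicted_speedup(0.75, 2.0))   # ~1.6x, not 2x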
    
    High-end DEC and SGI machines have similar SPECfp numbers, with DEC
    leading by a few percent with a 500 MHz machine, and at parity for 
    a couple of 466 MHz systems.  Our 195 MHz Origin systems are faster
    than the rest of the DECs, and our 180's are faster than all the
    DEC machines below 400 MHz.  Except for the slowest machines, these
    differences are probably not worth worrying about, since they are
    mostly less than 10%.
    
    
    
    DEC makes some very nice computers, with tremendous performance for
    "integer" and/or cache-friendly codes, but it is all too easy for
    customers to make judgements of performance based on their clock rates,
    rather than their performance on applications.  SGI's Origin machines
    are shipping with 180 and 195 MHz R10000 cpus, compared to the 350-500
    MHz cpus that DEC is shipping.  What we see across a wide variety of
    large floating-point benchmarks is that the SGI machines at 195 MHz
    deliver application performance better than the 375 MHz Alphas, and the
    same as (or a bit lower than) the 437 MHz Alphas (in the 8400 series
    machines).
    
    
    Parallel scalability is another issue.  DEC builds bus-based machines,
    with buses that cannot meet the demands of their very high-speed
    processors.  For example, even their top-of-the line 8400 series
    provides a bus that can only provide full-speed access to about 4
    of the 12 cpus.  Running bandwidth-limited codes on the machine
    quickly runs into scaling limitations as the bus saturates.  The DEC
    8400 with 437 MHz cpus currently holds the record as the most
    unbalanced SMP system -- with a single memory reference from a 
    unit-stride data stream costing as much as 64 floating-point 
    operations. (http://www.cs.virginia.edu/stream/standard/Balance.html)
    (I believe that this is for a cluster of 7 8-cpu systems, so the 
    result for 12-cpu systems would be 50% worse, since the bus is
    already saturated at 8 cpus.)
    
    This does not mean that the DEC 8400 is a "bad" machine, but it does
    mean that you are not likely to get as high a fraction of peak
    performance for large, number-crunching codes as you would on a machine
    with better balance.  (For comparison, on the Origin 2000, a single
    memory reference from a unit-stride data stream costs as much as
    15 floating-point operations, independent of system size.)
    
    
    
    The SGI Origin machines, on the other hand, use a distributed memory
    arrangement so that each node's memory is local and accessed through a
    local, non-blocking crossbar.  Local memory bandwidth scales linearly
    with system size.  Similarly, since we use a hypercube interconnect
    between the nodes, the bisection bandwidth (for non-local memory
    references) also grows linearly with machine size.   We have many
    applications that are showing speedups of 25x on 32 cpus, and a growing
    list that are obtaining speedups of ~90x on 128 cpus -- without
    requiring message passing!
    
    Users who cannot afford the larger machines may still wish to take
    advantage of doing program development on a smaller Origin, then do
    large-scale calculations on something like the 128-cpu array of SGI
    Origin2000's at NCSA.
    
    
    
    The bottom line is that DEC and SGI machines have rather different
    performance equations, and it requires a considerable level of
    experience to estimate application performance from the externally
    visible machine parameters.  Some benchmarks hint at the differences,
    but it is hard to get accurate application performance estimates from
    first principles and/or microbenchmarks.  Instead, we look at
    application performance, and *then* build the models that map from
    machine parameters to application performance.  Your mileage may
    vary....
    
    
    -- 
    --
    John D. McCalpin, Ph.D.     Supercomputing Performance Analyst
    Scalable Systems Group      http://reality.sgi.com/employees/mccalpin
    Silicon Graphics, Inc.      mccalpin@sgi.com  415-933-7407
1148.1. "It may not shred too easily" by PERFOM::HENNING () Wed Mar 26 1997 17:32
    I don't think it shreds well.  John McCalpin is reasonably responsible
    in what he says.
    
    BUT there is a matter of emphasis.  John M. emphasizes that Alpha will
    rarely obtain "peak" performance, whereas other vendors may obtain
    their peak more readily.
    
    The other way to look at this is to realize that it is rather nice to
    *have* a peak.  That is, some applications gain great advantages from
    Alpha's raw CPU power.
    
    Would you rather have a car that always does 100 kph, and never any
    better, or a car that on the right kind of road can do 200 kph?  If you
    say the latter, how much does it bother you that on many roads it only
    does 100 kph?
    
    	/john henning
    
    PS A different approach to John M's posting would be to attack SGI for
    late dates, operating system incoherence, etc.  Bill Licea-Kane would
    be a better spokesperson in that regard - Mr. Bill, the floor is
    yours...
1148.2. "Or perhaps build a better road..." by SAVAGE::MCGEE (At this point, we don't know.) Wed Mar 26 1997 17:45
    >Would you rather have a car that always does 100 kph, and never any
    >better, or a car that on the right kind of road can do 200 kph?  If you
    >say the latter, how much does it bother you that on many roads it only
    >does 100 kph?
    
    Of course, having the engine begs the question: what are we doing to
    build a better road?
    
    SGI seems to have no trouble doing it; what are we doing on this front?
    
    
1148.3. by HPCGRP::MANLEY () Wed Mar 26 1997 20:03
>    I just read the following.  You may have to answer some of these
>    questions.  Perhaps someone can help us tear this following post to
>    shreds.
 
Well, I don't think we can "tear this ... to shreds" unless we lie. John
McCalpin may now work for a competitor, but he apparently remains basically
fair-minded. The simple fact is, we no longer have the tremendous advantage
we had before SGI introduced the Origin systems. Back then we beat them on
everything! We're not so lucky now.
    
>    For single-processor "number-crunching" there are typically two
>    important axes to consider: computation rate and memory bandwidth.
>  
>    One way to model computation rate is by the LINPACK benchmark or by
>    matrix multiplication.  These algorithms can be re-arranged to require
>    very small amounts of main memory traffic.  Similar results can be
>    obtained for small codes that are entirely cache-contained.   I am not
>    convinced that these are particularly good measures, but they have the
>    advantage of being simple.
...
>    Memory bandwidth has a couple of degrees of freedom, but the most
>    important one is the unit-stride bandwidth, measured by the STREAM
>    benchmark:
>    
>    	http://www.cs.virginia.edu/stream/
>    
>    It is very hard to make comparisons, since I do not know the DEC
>    product line in any detail, but my interpretation is that SGI offers
>    better bandwidth at most price points.  One architectural issue that
>    hurts the DEC bandwidth is the need for a third-level cache (since the
>    on-chip L1 + L2 caches provide only 96kB of data cache).  STREAM
>    results
>    for DEC machines with no L3 cache are better, but the application 
>    performance typically drops, since big caches help most applications.

O.K. So LINPACK performance is at one extreme and memory STREAMing bandwidth
is at the other. We compute faster than they do and they stream data through
memory faster than we do. Big Deal! Most applications fall somewhere between
the two extremes, and Alphas certainly hold their own in that middle ground.
If we price our products competitively, they will be competitive.
    
>    More detailed analyses need to include things like pipeline depth
>    (which is deep on the Alphas and shallow on the slower-clocked machines
>    like SGI's).  Machines with deeper pipelines are harder to generate
>    code for, and generally attain a lower fraction of "peak" performance
>    than machines with shorter pipelines/latencies.  

The floating point pipeline depth of the EV5 is four cycles, down from six
cycles on the EV6. I don't know what the floating point pipeline depth is
on the R10000, but I doubt that it's one cycle. For the sake of the argument
that follows, let's assume the R10000's pipe is two deep (I think it is).

Now consider a completely serial floating point code. On such a code, a
437 MHz EV5 Alpha retires one result every 4/.437 = 9.15 ns and a 195 MHz
R10000 retires one result every 2/.195 = 10.26 ns. Alpha wins! And a
622 MHz Alpha will win BIG!
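Spelling out that arithmetic as a formula (the EV5 depth of four is from
above; the R10000 depth of two is the assumption made above, not a number
I have verified):

    # Time per result for a completely serial FP dependence chain: each
    # result must wait out the full pipeline, so
    #     ns per result = pipeline depth (cycles) / clock rate (GHz)
    def ns_per_result(pipe_depth_cycles, clock_mhz):
        return pipe_depth_cycles / (clock_mhz / 1000.0)

    print(ns_per_result(4, 437))   # 437 MHz EV5 Alpha          -> ~9.15 ns
    print(ns_per_result(2, 195))   # 195 MHz R10000 (assumed 2) -> ~10.26 ns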

DO NOT ever let the % Peak argument become an issue!!!! The argument is
completely bogus. A customer buys a machine to do work. The machine that
does the most work in a fixed period of time is the winner. PERIOD!!!!

>    Many benchmarks can be interpreted as some combination of computation-
>    limited work and bandwidth-limited work.  For example, on SGI machines
>    with large caches (1-4 MB), the SPECfp95 suite appears to be (in very
>    rough terms) about 75% computation limited and about 25% bandwidth
>    limited.  I am not sure how these numbers look for other vendors.
>    (Actually, a big chunk of the 75% "computation limited" time is
>    probably limited by external cache bandwidth, but it is tricky to
>    come up with reasonable estimates of how big that chunk is.)
>    
>    High-end DEC and SGI machines have similar SPECfp numbers, with DEC
>    leading by a few percent with a 500 MHz machine, and at parity for 
>    a couple of 466 MHz systems.  Our 195 MHz Origin systems are faster
>    than the rest of the DECs, and our 180's are faster than all the
>    DEC machines below 400 MHz.  Except for the slowest machines, these
>    differences are probably not worth worrying about, since they are
>    mostly less than 10%.
    
The SPECfp/MHz argument is irrelevant - as bogus as the % Peak argument.
Who cares how performance is delivered? The only thing that matters is
who delivers it and how much they deliver. Very shortly you'll have SPECfp
numbers you can once again crow about.
    
>    DEC makes some very nice computers, with tremendous performance for
>    "integer" and/or cache-friendly codes, but it is all too easy for
>    customers to make judgements of performance based on their clock rates,
>    rather than their performance on applications.  SGI's Origin machines
>    are shipping with 180 and 195 MHz R10000 cpus, compared to the 350-500
>    MHz cpus that DEC is shipping.  What we see across a wide variety of
>    large floating-point benchmarks is that the SGI machines at 195 MHz
>    deliver application performance better than the 375 MHz Alphas, and the
>    same (or bit lower) than the 437 MHz Alphas (in the 8400 series
>    machines).

Once again, as new systems with much higher clock rates and larger board
level caches are announced, we will reclaim some of the threatened application
space.

>    Parallel scalability is another issue.  DEC builds bus-based machines,
>    with buses that cannot meet the demands of their very high-speed
>    processors.  For example, even their top-of-the line 8400 series
>    provides a bus that can only provide full-speed access to about 4
>    of the 12 cpus.  Running bandwidth-limited codes on the machine
>    quickly runs into scaling limitations as the bus saturates.  The DEC
>    8400 with 437 MHz cpus currently holds the record as the most
>    unbalanced SMP system -- with a single memory reference from a 
>    unit-stride data stream costing as much as 64 floating-point 
>    operations. (http://www.cs.virginia.edu/stream/standard/Balance.html)
>    (I believe that this is for a cluster of 7 8-cpu systems, so the 
>    result for 12-cpu systems would be 50% worse, since the bus is
>    already saturated at 8 cpus.)

Another argument that falls flat on its face. System balance only matters
if we actually get beaten, especially if we get beaten by a system with
equal or fewer cpus, or by a system that costs less. System balance falls
into the same category as % Peak and SPECfp/MHz - it's irrelevant.
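For what it's worth, the "balance" figure on McCalpin's page is, as I read
it, just peak floating-point rate divided by sustained STREAM bandwidth
expressed in 8-byte words per second. A sketch with illustrative inputs
(not measured data):

    # Machine balance as I understand the STREAM page to define it:
    # peak FLOPs per second divided by sustained memory words per second.
    def balance(peak_mflops, triad_mb_per_s):
        words_per_s = triad_mb_per_s / 8.0    # 8-byte words
        return peak_mflops / words_per_s      # FLOPs per memory reference

    # Illustrative numbers only: a cpu with 874 peak MFLOPS (two FP ops per
    # cycle at 437 MHz) sustaining ~109 MB/s of triad bandwidth per cpu
    # works out to roughly the quoted 64 FLOPs per memory reference.
    print(balance(874.0, 109.0))              # ~64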

>    This does not mean that the DEC 8400 is a "bad" machine, but it does
>    mean that you are not likely to get as high a fraction of peak
>    performance for large, number-crunching codes as you would on a machine
>    with better balance.  (For comparison, on the Origin 2000, a single
>    memory reference from a unit-stride data stream costs as much as
>    15 floating-point operations, independent of system size.)

Back to the % Peak argument - second time around and it's still nonsense.

>    The SGI Origin machines, on the other hand, use a distributed memory
>    arrangement so that each node's memory is local and accessed through a
>    local, non-blocking crossbar.  Local memory bandwidth scales linearly
>    with system size.  Similarly, since we use a hypercube interconnect
>    between the nodes, the bisection bandwidth (for non-local memory
>    references) also grows linearly with machine size.   We have many
>    applications that are showing speedups of 25x on 32 cpus, and a growing
>    list that are obtaining speedups of ~90x on 128 cpus -- without
>    requiring message passing!

Right now, our answer to this is a TruCluster system and HPF Fortran.
HPF Fortran obviates the need for explicit message passing. TruCluster
systems are very competitive for some applications. A 32 CPU, 8 node
TruCluster of 4100s outperforms a 32 CPU Origin 2000 running (non message
passing) LINPACK by a fairly wide margin. Also, early results for the MPI
based NAS 2 parallel benchmarks show TruCluster'd 437 MHz 8400's to be
very competitive. TruCluster hardware is also very inexpensive. We can
connect an eight node TruCluster (up to 96 cpus) for somewhere around
$25-30K ... very inexpensive compared to 

SGI has some scaling cost issues not addressed here. Their interconnect
is expensive and interconnect costs grow super-linearly as nodes are added.
I expect interconnect accounts for a good part of the cost of a 128 node
Origin 2000.

Also not addressed here is the issue of access latency for non-local
memory. As nodes are added, and the interconnect infrastructure is forced
to expand, memory latency grows. And latency keeps growing as nodes are
added, especially for applications with poor locality of memory reference.
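To put a rough number on that growth: in a hypercube of N nodes the longest
path is log2(N) router hops, and each hop adds router and link delay on top
of the local memory latency. The per-hop and local numbers below are
placeholders, not Origin measurements:

    # Rough model of worst-case remote latency in a hypercube of N nodes.
    # local_ns and per_hop_ns are placeholder values for illustration.
    from math import log2

    def worst_case_latency_ns(nodes, local_ns=300.0, per_hop_ns=100.0):
        return local_ns + log2(nodes) * per_hop_ns

    for n in (8, 32, 128):
        print(n, worst_case_latency_ns(n))   # grows as the machine grows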

1148.4. "Some stuff...." by PERFOM::LICEA_KANE (when it's comin' from the left) Thu Mar 27 1997 20:12
    Actually, Burkhard is doing a fair job at defending us.
    
    I do dislike the general hand-wave of "at *real* application benchmarks
    (wink wink nudge nudge) SGI wins."
    
    I also dislike the "look at the details" line when he then repeatedly
    draws conclusions from the single metric SPECfp95.
    
    
    But let's ask a few questions.  First, 
    
    	"YEsterday I noticed four new SGI Origins appearing in the
    	office across the hall from mine."
    
    OK, Doug McDonald works in the UCIC Super Computer Center.  But he
    didn't say they showed up in the lab, he said in the office.
    
    In most offices, four Origin 2000's won't fit.  Four Origin 200's will,
    but not four Origin 2000's.
    
    A 180MHz Origin 200, which is up to two processors in a single box,
    where *ONLY* two boxes can be linked together, is a good deal different
    from the Origin 2000 128 CPU several million dollar system McCalpin then
    goes on to describe.
    
    
    Then again, I rather dislike McCalpin's complaint that people don't
    look at the details, when he himself concludes so much on the basis of
    just *three* metrics: SPECfp95, STREAM, and his derived "peak MFLOPS"
    (and Burkhard was quite right about that last one: look at SGI, *THEY*
    quote "peak MFLOPS" all over their literature, we don't).
    
    
    But the devil is in the details.  The AlphaStation 500/500 vs. the
    Origin 200, looking at the *individual* SPEC ratios of the CFP95 suite:
    
    	Origin 200 is up to 30% faster (Gosh, McCalpin is right)
    	But it's also up to 45% slower (Gosh, McCalpin is wrong.)
    
    More to the point, at only 3 of the 20 ratios is the Origin 200 95% or
    greater than the AlphaStation 500/500.
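    A toy illustration of why a single composite number can hide that kind
    of spread (the ratios below are invented, not real SPEC results):

        # Two hypothetical machines with essentially the same geometric mean
        # (which is how SPECfp95 is summarized) but very different per-test
        # ratios.  These numbers are invented for illustration only.
        from math import prod

        uniform = [10.0, 10.0, 10.0, 10.0]    # steady across the suite
        spread  = [ 6.0, 13.0,  7.5, 17.1]    # wide spread, similar mean

        def geomean(xs):
            return prod(xs) ** (1.0 / len(xs))

        print(geomean(uniform), geomean(spread))   # both ~10.0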
    
    
    For a machine that's "only good at one thing" the AlphaStation 500/500
    does rather well.  And we're leading with our chin here, since we
    *know* the AlphaStation 500/500 has a lower memory bandwidth than
    the Origin 200.  (But the Personal Workstation 500a has comparable
    bandwidth - and it's quite a bit cheaper.  McCalpin will probably
    point out that that's NT only.)
    
    The problem with STREAM is that while it's good at measuring how fast
    you can shuffle bytes in and out, it's not good at telling you whether
    anything substantial can be done with those bytes once they're shuffled
    in and out.
    
    								-mr. bill
1148.5. "Minor nit" by KAMPUS::NEIDECKER (EUROMEDIA: Distributed Multimedia Archives) Tue Apr 01 1997 09:24
    Minor nit re .3 (probably a typo):
    
    "The floating point pipeline depth of the EV5 is four cycles, down from
    six cycles on the EV6."
    
    That should be EV4 (which has 6 cycles), EV6 is 4 cycles as well.
1148.6. by HPCGRP::MANLEY () Wed Apr 02 1997 14:22
Re: .5

Yes, it's a typo. Thanks for correcting it.