
Conference wonder::turbolaser

Title:TurboLaser Notesfile - AlphaServer 8200 and 8400 systems
Notice:Welcome to WONDER::TURBOLASER in its new home shortly
Moderator:LANDO::DROBNER
Created:Tue Dec 20 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1218
Total number of notes:4645

1148.0. "Help me refute this posting." by HGOVC::JOELBERMAN () Wed Mar 26 1997 00:31

    I just read the following.  You may have to answer some of these
    questions.  Perhaps someone can help us tear this following post to
    shreds.
    
    Path:
    pa.dec.com!news1.digital.com!data.ramona.vix.com!sonysjc!su-news-hub1.bbnplanet.com!cpk-news-hub1.bbnplanet.com!news.bbnplanet.com!news.sprintlink.net!news-peer.sprintlink.net!howland.erols.net!news.sgi.com!news.corp.sgi.com!frakir.asd.sgi.com!mccalpin
    From: mccalpin@frakir.asd.sgi.com (John McCalpin)
    Newsgroups: comp.benchmarks
    Subject: Re: Alpha performance
    Date: 25 Mar 1997 16:37:28 GMT
    Organization: Silicon Graphics, Inc., Mountain View, CA
    Lines: 143
    Message-ID: <5h8v08$fu3@murrow.corp.sgi.com>
    References: <3337DCC5.4087@aries.scs.uiuc.edu>
    Reply-To: mccalpin@asd.sgi.com
    NNTP-Posting-Host: frakir.asd.sgi.com
    
    In article <3337DCC5.4087@aries.scs.uiuc.edu>,
    J. D. McDonald <mcdonald@aries.scs.uiuc.edu> wrote:
    >Yesterday I noticed four new SGI Origins appearing in the
    >office across the hall from mine. I was talking to the
    >owner and a colleague who is another big computer time user,
    >which I have not been since 1974 (when I was the absolute biggest,
    >literally). I asked why everybody buys SGI instead of
    >something else, specifically Alpha.
    >
    >The answer was "Alpha can only do matrix multiplies, nothing else".
    >(We're talking FP here, of course.) 
    >
    >Is this fact clearly visible in benchmarks?
    
    I love the quote!
    
    It is not true, of course, but it is a great quote....
    
    
    
    There are lots of ways to look at machine performance, not all
    of which are equally useful.
    
    For single-processor "number-crunching" there are typically two
    important axes to consider: computation rate and memory bandwidth.
    
    
    One way to model computation rate is by the LINPACK benchmark or by
    matrix multiplication.  These algorithms can be re-arranged to require
    very small amounts of main memory traffic.  Similar results can be
    obtained for small codes that are entirely cache-contained.   I am not
    convinced that these are particularly good measures, but they have the
    advantage of being simple.
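    (As a concrete picture of that "re-arranged" point, here is a rough
    sketch of cache blocking for matrix multiply; it is illustrative only,
    not the tuned kernel any vendor actually ships.  Keeping a small tile
    of each matrix in cache lets every element fetched from main memory be
    reused many times, so memory traffic per flop falls with the block size.)

        # Rough sketch of a cache-blocked matrix multiply (C += A*B).
        # Illustrative only; real LINPACK/DGEMM kernels are far more tuned.
        def blocked_matmul(A, B, C, n, blk):
            for i0 in range(0, n, blk):
                for j0 in range(0, n, blk):
                    for k0 in range(0, n, blk):
                        # work on one blk x blk tile at a time, so operands
                        # stay cache-resident while they are reused
                        for i in range(i0, min(i0 + blk, n)):
                            for j in range(j0, min(j0 + blk, n)):
                                s = C[i][j]
                                for k in range(k0, min(k0 + blk, n)):
                                    s += A[i][k] * B[k][j]
                                C[i][j] = s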
    
    More detailed analyses need to include things like pipeline depth
    (which is deep on the Alphas and shallow on the slower-clocked machines
    like SGI's).  Machines with deeper pipelines are harder to generate
    code for, and generally attain a lower fraction of "peak" performance
    than machines with shorter pipelines/latencies.
    
    
    Memory bandwidth has a couple of degrees of freedom, but the most
    important one is the unit-stride bandwidth, measured by the STREAM
    benchmark:
    
    	http://www.cs.virginia.edu/stream/
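    (The core of STREAM is a few long unit-stride vector loops; the "triad"
    kernel is roughly the sketch below.  The real benchmark at that URL adds
    timing, repetition and verification, so treat this purely as a picture
    of what is being measured.)

        # Sketch of the STREAM "triad" kernel: a(i) = b(i) + q*c(i), with
        # arrays sized well beyond any cache.  Bandwidth is the bytes moved
        # (read b, read c, write a) divided by the loop time.  An interpreted
        # version like this will not measure real hardware bandwidth; it
        # only shows the shape of the kernel.
        import time

        N = 2_000_000
        q = 3.0
        a = [0.0] * N
        b = [1.0] * N
        c = [2.0] * N

        t0 = time.perf_counter()
        for i in range(N):
            a[i] = b[i] + q * c[i]
        t1 = time.perf_counter()

        bytes_moved = 3 * N * 8          # three 8-byte words per iteration
        print("triad ~ %.1f MB/s" % (bytes_moved / (t1 - t0) / 1e6))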
    
    It is very hard to make comparisons, since I do not know the DEC
    product line in any detail, but my interpretation is that SGI offers
    better bandwidth at most price points.  One architectural issue that
    hurts the DEC bandwidth is the need for a third-level cache (since the
    on-chip L1 + L2 caches provide only 96kB of data cache).  STREAM
    results
    for DEC machines with no L3 cache are better, but the application 
    performance typically drops, since big caches help most applications.
    
    
    
    Many benchmarks can be interpreted as some combination of computation-
    limited work and bandwidth-limited work.  For example, on SGI machines
    with large caches (1-4 MB), the SPECfp95 suite appears to be (in very
    rough terms) about 75% computation limited and about 25% bandwidth
    limited.  I am not sure how these numbers look for other vendors.
    (Actually, a big chunk of the 75% "computation limited" time is
    probably limited by external cache bandwidth, but it is tricky to
    come up with reasonable estimates of how big that chunk is.)
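    (A crude way to use such a split is to treat total time as a
    computation-limited piece plus a bandwidth-limited piece and scale each
    independently.  A sketch, using the rough 75%/25% figures above:)

        # Two-component model: part of the time scales with computation
        # speed, the rest with memory bandwidth.
        def predicted_speedup(compute_frac, compute_speedup, bw_speedup=1.0):
            bw_frac = 1.0 - compute_frac
            return 1.0 / (compute_frac / compute_speedup + bw_frac / bw_speedup)

        # Doubling compute speed with unchanged memory bandwidth, assuming
        # the rough 75%/25% split quoted above:
        print(predicted_speedup(0.75, 2.0))   # ~1.6x, not 2x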
    
    High-end DEC and SGI machines have similar SPECfp numbers, with DEC
    leading by a few percent with a 500 MHz machine, and at parity for 
    a couple of 466 MHz systems.  Our 195 MHz Origin systems are faster
    than the rest of the DECs, and our 180's are faster than all the
    DEC machines below 400 MHz.  Except for the slowest machines, these
    differences are probably not worth worrying about, since they are
    mostly less than 10%.
    
    
    
    DEC makes some very nice computers, with tremendous performance for
    "integer" and/or cache-friendly codes, but it is all too easy for
    customers to make judgements of performance based on their clock rates,
    rather than their performance on applications.  SGI's Origin machines
    are shipping with 180 and 195 MHz R10000 cpus, compared to the 350-500
    MHz cpus that DEC is shipping.  What we see across a wide variety of
    large floating-point benchmarks is that the SGI machines at 195 MHz
    deliver application performance better than the 375 MHz Alphas, and the
    same as (or a bit lower than) the 437 MHz Alphas (in the 8400 series
    machines).
    
    
    Parallel scalability is another issue.  DEC builds bus-based machines,
    with buses that cannot meet the demands of their very high-speed
    processors.  For example, even their top-of-the line 8400 series
    provides a bus that can only provide full-speed access to about 4
    of the 12 cpus.  Running bandwidth-limited codes on the machine
    quickly runs into scaling limitations as the bus saturates.  The DEC
    8400 with 437 MHz cpus currently holds the record as the most
    unbalanced SMP system -- with a single memory reference from a 
    unit-stride data stream costing as much as 64 floating-point 
    operations. (http://www.cs.virginia.edu/stream/standard/Balance.html)
    (I believe that this is for a cluster of 7 8-cpu systems, so the 
    result for 12-cpu systems would be 50% worse, since the bus is
    already saturated at 8 cpus.)
    
    This does not mean that the DEC 8400 is a "bad" machine, but it does
    mean that you are not likely to get as high a fraction of peak
    performance for large, number-crunching codes as you would on a machine
    with better balance.  (For comparison, on the Origin 2000, a single
    memory reference from a unit-stride data stream costs as much as
    15 floating-point operations, independent of system size.)
    
    
    
    The SGI Origin machines, on the other hand, use a distributed memory
    arrangement so that each node's memory is local and accessed through a
    local, non-blocking crossbar.  Local memory bandwidth scales linearly
    with system size.  Similarly, since we use a hypercube interconnect
    between the nodes, the bisection bandwidth (for non-local memory
    references) also grows linearly with machine size.   We have many
    applications that are showing speedups of 25x on 32 cpus, and a growing
    list that are obtaining speedups of ~90x on 128 cpus -- without
    requiring message passing!
    
    Users who cannot afford the larger machines may still wish to take
    advantage of doing program development on a smaller Origin, then do
    large-scale calculations on something like the 128-cpu array of SGI
    Origin2000's at NCSA.
    
    
    
    The bottom line is that DEC and SGI machines have rather different
    performance equations, and it requires a considerable level of
    experience to estimate application performance from the externally
    visible machine parameters.  Some benchmarks hint at the differences,
    but it is hard to get accurate application performance estimates from
    first principles and/or microbenchmarks.  Instead, we look at
    application performance, and *then* build the models that map from
    machine parameters to application performance.  Your mileage may
    vary....
    
    
    -- 
    --
    John D. McCalpin, Ph.D.     Supercomputing Performance Analyst
    Scalable Systems Group      http://reality.sgi.com/employees/mccalpin
    Silicon Graphics, Inc.      mccalpin@sgi.com  415-933-7407
1148.1. "It may not shred too easily" by PERFOM::HENNING () Wed Mar 26 1997 17:32
    I don't think it shreds well.  John McCalpin is reasonably responsible
    in what he says.
    
    BUT there is a matter of emphasis.  John M. emphasizes that Alpha will
    rarely obtain "peak" performance, whereas other vendors may obtain
    their peak more readily.
    
    The other way to look at this is to realize that it is rather nice to
    *have* a peak.  That is, some applications gain great advantages from
    Alpha's raw CPU power.
    
    Would you rather have a car that always does 100 kph, and never any
    better, or a car that on the right kind of road can do 200 kph?  If you
    say the latter, how much does it bother you that on many roads it only
    does 100 kph?
    
    	/john henning
    
    PS A different approach to John M's posting would be to attack SGI for
    late dates, operating system incoherence, etc.  Bill Licea-Kane would
    be a better spokesperson in that regard - Mr. Bill, the floor is
    yours...
1148.2. "Or perhaps build a better road..." by SAVAGE::MCGEE (At this point, we don't know.) Wed Mar 26 1997 17:45
    >Would you rather have a car that always does 100 kph, and never any
    >better, or a car that on the right kind of road can do 200 kph?  If you
    >say the latter, how much does it bother you that on many roads it only
    >does 100 kph?
    
    Of course, having the engine begs the question: what are we doing to
    build a better road?
    
    SGI seems to have no trouble doing it; what are we doing on this front?
    
    
1148.3. by HPCGRP::MANLEY () Wed Mar 26 1997 20:03
>    I just read the following.  You may have to answer some of these
>    questions.  Perhaps someone can help us tear this following post to
>    shreds.
 
Well, I don't think we can "tear this ... to shreds" unless we lie. John
McCalpin may now work for a competitor, but he apparently remains basically
fair-minded. The simple fact is, we no longer have the tremendous advantage
we had before SGI introduced the Origin systems. Back then we beat them on
everything! We're not so lucky now.
    
>    For single-processor "number-crunching" there are typically two
>    important axes to consider: computation rate and memory bandwidth.
>  
>    One way to model computation rate is by the LINPACK benchmark or by
>    matrix multiplication.  These algorithms can be re-arranged to require
>    very small amounts of main memory traffic.  Similar results can be
>    obtained for small codes that are entirely cache-contained.   I am not
>    convinced that these are particularly good measures, but they have the
>    advantage of being simple.
...
>    Memory bandwidth has a couple of degrees of freedom, but the most
>    important one is the unit-stride bandwidth, measured by the STREAM
>    benchmark:
>    
>    	http://www.cs.virginia.edu/stream/
>    
>    It is very hard to make comparisons, since I do not know the DEC
>    product line in any detail, but my interpretation is that SGI offers
>    better bandwidth at most price points.  One architectural issue that
>    hurts the DEC bandwidth is the need for a third-level cache (since the
>    on-chip L1 + L2 caches provide only 96kB of data cache).  STREAM
>    results
>    for DEC machines with no L3 cache are better, but the application 
>    performance typically drops, since big caches help most applications.

O.K. So LINPACK performance is at one extreme and memory STREAMing bandwidth
is at the other. We compute faster than they do and they stream data through
memory faster than we do. Big Deal! Most applications fall somewhere between
the two extremes, and Alphas certainly hold their own in that middle ground.
If we price our products competitively, they will be competitive.
    
>    More detailed analyses need to include things like pipeline depth
>    (which is deep on the Alphas and shallow on the slower-clocked machines
>    like SGI's).  Machines with deeper pipelines are harder to generate
>    code for, and generally attain a lower fraction of "peak" performance
>    than machines with shorter pipelines/latencies.  

The floating point pipeline depth of the EV5 is four cycles, down from six
cycles on the EV6. I don't know what the floating point pipeline depth is
on the R10000, but I doubt that it's one cycle. For the sake of the argument
that follows, let's assume the R10000's pipe is two deep (I think it is).

Now consider a completely serial floating point code. On such a code, a
437 MHz EV5 Alpha retires one result every 4/.437 = 9.15 ns and a 195 MHz
R10000 retires one result every 2/.195 = 10.26 ns. Alpha wins! And a
622 MHz Alpha will win BIG!
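Spelling out that arithmetic as a formula (the EV5 depth of four is from
above; the R10000 depth of two is the assumption made above, not a number
I have verified):

    # Time per result for a completely serial FP dependence chain: each
    # result must wait out the full pipeline, so
    #     ns per result = pipeline depth (cycles) / clock rate (GHz)
    def ns_per_result(pipe_depth_cycles, clock_mhz):
        return pipe_depth_cycles / (clock_mhz / 1000.0)

    print(ns_per_result(4, 437))   # 437 MHz EV5 Alpha          -> ~9.15 ns
    print(ns_per_result(2, 195))   # 195 MHz R10000 (assumed 2) -> ~10.26 ns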

DO NOT ever let the % Peak argument become an issue!!!! The argument is
completely bogus. A customer buys a machine to do work. The machine that
does the most work in a fixed period of time is the winner. PERIOD!!!!

>    Many benchmarks can be interpreted as some combination of computation-
>    limited work and bandwidth-limited work.  For example, on SGI machines
>    with large caches (1-4 MB), the SPECfp95 suite appears to be (in very
>    rough terms) about 75% computation limited and about 25% bandwidth
>    limited.  I am not sure how these numbers look for other vendors.
>    (Actually, a big chunk of the 75% "computation limited" time is
>    probably limited by external cache bandwidth, but it is tricky to
>    come up with reasonable estimates of how big that chunk is.)
>    
>    High-end DEC and SGI machines have similar SPECfp numbers, with DEC
>    leading by a few percent with a 500 MHz machine, and at parity for 
>    a couple of 466 MHz systems.  Our 195 MHz Origin systems are faster
>    than the rest of the DECs, and our 180's are faster than all the
>    DEC machines below 400 MHz.  Except for the slowest machines, these
>    differences are probably not worth worrying about, since they are
>    mostly less than 10%.
    
The SPECfp/MHz argument is irrelevant - as bogus as the % Peak argument.
Who cares how performance is delivered? The only thing that matters is
who delivers it and how much they deliver. Very shortly you'll have SPECfp
numbers you can once again crow about.
    
>    DEC makes some very nice computers, with tremendous performance for
>    "integer" and/or cache-friendly codes, but it is all too easy for
>    customers to make judgements of performance based on their clock rates,
>    rather than their performance on applications.  SGI's Origin machines
>    are shipping with 180 and 195 MHz R10000 cpus, compared to the 350-500
>    MHz cpus that DEC is shipping.  What we see across a wide variety of
>    large floating-point benchmarks is that the SGI machines at 195 MHz
>    deliver application performance better than the 375 MHz Alphas, and the
>    same (or bit lower) than the 437 MHz Alphas (in the 8400 series
>    machines).

Once again, as new systems with much higher clock rates and larger board
level caches are announced, we will reclaim some of the threatened application
space.

>    Parallel scalability is another issue.  DEC builds bus-based machines,
>    with buses that cannot meet the demands of their very high-speed
>    processors.  For example, even their top-of-the line 8400 series
>    provides a bus that can only provide full-speed access to about 4
>    of the 12 cpus.  Running bandwidth-limited codes on the machine
>    quickly runs into scaling limitations as the bus saturates.  The DEC
>    8400 with 437 MHz cpus currently holds the record as the most
>    unbalanced SMP system -- with a single memory reference from a 
>    unit-stride data stream costing as much as 64 floating-point 
>    operations. (http://www.cs.virginia.edu/stream/standard/Balance.html)
>    (I believe that this is for a cluster of 7 8-cpu systems, so the 
>    result for 12-cpu systems would be 50% worse, since the bus is
>    already saturated at 8 cpus.)

Another argument that falls flat on its face. System balance only matters
if we actually get beaten, especially if we get beaten by a system with
equal or fewer cpus, or by a system that costs less. System balance falls
into the same category as % Peak and SPECfp/MHz - it's irrelevant.
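For what it's worth, the "balance" figure on McCalpin's page is, as I read
it, just peak floating-point rate divided by sustained STREAM bandwidth
expressed in 8-byte words per second. A sketch with illustrative inputs
(not measured data):

    # Machine balance as I understand the STREAM page to define it:
    # peak FLOPs per second divided by sustained memory words per second.
    def balance(peak_mflops, triad_mb_per_s):
        words_per_s = triad_mb_per_s / 8.0    # 8-byte words
        return peak_mflops / words_per_s      # FLOPs per memory reference

    # Illustrative numbers only: a cpu with 874 peak MFLOPS (two FP ops per
    # cycle at 437 MHz) sustaining ~109 MB/s of triad bandwidth per cpu
    # works out to roughly the quoted 64 FLOPs per memory reference.
    print(balance(874.0, 109.0))              # ~64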

>    This does not mean that the DEC 8400 is a "bad" machine, but it does
>    mean that you are not likely to get as high a fraction of peak
>    performance for large, number-crunching codes as you would on a machine
>    with better balance.  (For comparison, on the Origin 2000, a single
>    memory reference from a unit-stride data stream costs as much as
>    15 floating-point operations, independent of system size.)

Back to the % Peak argument - second time around and it's still nonsense.

>    The SGI Origin machines, on the other hand, use a distributed memory
>    arrangement so that each node's memory is local and accessed through a
>    local, non-blocking crossbar.  Local memory bandwidth scales linearly
>    with system size.  Similarly, since we use a hypercube interconnect
>    between the nodes, the bisection bandwidth (for non-local memory
>    references) also grows linearly with machine size.   We have many
>    applications that are showing speedups of 25x on 32 cpus, and a growing
>    list that are obtaining speedups of ~90x on 128 cpus -- without
>    requiring message passing!

Right now, our answer to this is a TruCluster system and HPF Fortran.
HPF Fortran obviates the need for explicit message passing. TruCluster
systems are very competitive for some applications. A 32 CPU, 8 node
TruCluster of 4100s outperforms a 32 CPU Origin 2000 running (non message
passing) LINPACK by a fairly wide margin. Also, early results for the MPI
based NAS 2 parallel benchmarks show TruCluster'd 437 MHz 8400's to be
very competitive. TruCluster hardware is also very inexpensive. We can
connect an eight node TruCluster (up to 96 cpus) for somewhere around
$25-30K ... very inexpensive compared to 

SGI has some scaling cost issues not addressed here. Their interconnect
is expensive and interconnect costs grow super-linearly as nodes are added.
I expect interconnect accounts for a good part of the cost of a 128 node
Origin 2000.

Also not addressed here is the issue of access latency for non-local
memory. As nodes are added, and the interconnect infrastructure is forced
to expand, memory latency grows. And latency keeps growing as nodes are
added, especially for applications with poor locality of memory reference.
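To put a rough number on that growth: in a hypercube of N nodes the longest
path is log2(N) router hops, and each hop adds router and link delay on top
of the local memory latency. The per-hop and local numbers below are
placeholders, not Origin measurements:

    # Rough model of worst-case remote latency in a hypercube of N nodes.
    # local_ns and per_hop_ns are placeholder values for illustration.
    from math import log2

    def worst_case_latency_ns(nodes, local_ns=300.0, per_hop_ns=100.0):
        return local_ns + log2(nodes) * per_hop_ns

    for n in (8, 32, 128):
        print(n, worst_case_latency_ns(n))   # grows as the machine grows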

1148.4. "Some stuff...." by PERFOM::LICEA_KANE (when it's comin' from the left) Thu Mar 27 1997 20:12
    Actually, Burkhard is doing a fair job at defending us.
    
    I do dislike the general hand-wave of "at *real* application benchmarks
    (wink wink nudge nudge) SGI wins."
    
    I also dislike the "look at the details" line when he then repeatedly
    draws conclusions from the single metric SPECfp95.
    
    
    But let's ask a few questions.  First, 
    
    	"YEsterday I noticed four new SGI Origins appearing in the
    	office across the hall from mine."
    
    OK, Doug McDonald works in the UCIC Super Computer Center.  But he
    didn't say they showed up in the lab, he said in the office.
    
    In most offices, four Origin 2000's won't fit.  Four Origin 200's will,
    but not four Origin 2000's.
    
    A 180MHz Origin 200, which is up to two processors in a single box,
    where *ONLY* two boxes can be linked together, is a good deal different
    from the Origin 2000 128 CPU several million dollar system McCalpin then
    goes on to describe.
    
    
    Then again, I rather dislike McCalpin's complaint that people don't
    look at the details, when he himself concludes so much on the basis of
    just *three* metrics: SPECfp95, STREAM, and his derived "peak MFLOPS"
    (and Burkhard was quite right about that last one: look at SGI, *THEY*
    quote "peak MFLOPS" all over their literature, we don't).
    
    
    But the devil is in the details.  The AlphaStation 500/500 vs. the
    Origin 200, looking at the *individual* SPEC ratios of the CFP95 suite:
    
    	Origin 200 is up to 30% faster (Gosh, McCalpin is right)
    	But it's also up to 45% slower (Gosh, McCalpin is wrong.)
    
    More to the point, at only 3 of the 20 ratios is the Origin 200 95% or
    greater than the AlphaStation 500/500.
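    A toy illustration of why a single composite number can hide that kind
    of spread (the ratios below are invented, not real SPEC results):

        # Two hypothetical machines with essentially the same geometric mean
        # (which is how SPECfp95 is summarized) but very different per-test
        # ratios.  These numbers are invented for illustration only.
        from math import prod

        uniform = [10.0, 10.0, 10.0, 10.0]    # steady across the suite
        spread  = [ 6.0, 13.0,  7.5, 17.1]    # wide spread, similar mean

        def geomean(xs):
            return prod(xs) ** (1.0 / len(xs))

        print(geomean(uniform), geomean(spread))   # both ~10.0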
    
    
    For a machine that's "only good at one thing" the AlphaStation 500/500
    does rather well.  And we're leading with our chin here, since we
    *know* the AlphaStation 500/500 has a lower memory bandwidth than
    the Origin 200.  (But the Personal Workstation 500a has comparable
    bandwidth - and it's quite a bit cheaper.  McCalpin will probably
    point out that that's NT only.)
    
    The problem with STREAM is that while it's good at measuring how fast
    you can shuffle bytes in and out, it's not good at telling you whether
    anything substantial can be done with those bytes once they're shuffled
    in and out.
    
    								-mr. bill
1148.5. "Minor nit" by KAMPUS::NEIDECKER (EUROMEDIA: Distributed Multimedia Archives) Tue Apr 01 1997 09:24
    Minor nit re .3 (probably a typo):
    
    "The floating point pipeline depth of the EV5 is four cycles, down from
    six cycles on the EV6."
    
    That should be EV4 (which has 6 cycles), EV6 is 4 cycles as well.
1148.6. by HPCGRP::MANLEY () Wed Apr 02 1997 14:22
Re: .5

Yes, it's a typo. Thanks for correcting it.