
Conference ricks::dechips

Title:Hudson VLSI
Notice:For Digital Chip Data - CHIPBZ::PRODUCTION$:[DS_INFO...]
Moderator:RICKS::PHIPPS
Created:Wed Feb 12 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:701
Total number of notes:4658

673.0. "memory 'bandwidth' questions" by RDGENG::WILLIAMS_A () Wed Apr 02 1997 19:33

    Do we have the following info available anywhere ?
    
    What I want is the effective memory bandwidth between L1, L2 and L3
    cache, then from L3 to main memory for:
    
    8400 /440 (620MHz too)
    
    4100 /466
    
    
    Oh, and if anyone has calculated similar for HP K460 and Sun UE6000, or
    can hazard a plausible estimate, then that would be great too.
    
    Mr Henning, you know what I am up to.....
    
    
    AW
    
T.R  Title  User  Personal Name  Date  Lines
673.1  L1 & L2 don't depend on the platform  WIBBIN::NOYCE  Pulling weeds, pickin' stones  Wed Apr 02 1997 20:09  2
For the on-chip caches, see 538.14, though you'll have to
recalculate based on the clock speeds you care about.
673.1  L1 & L2 don't depend on the platform  WIBBIN::NOYCE  Pulling weeds, pickin' stones  Thu Apr 03 1997 12:18  4
For the on-chip caches, see 528.14, though you'll have to
recalculate based on the clock speeds you care about.

<A previous version of this reply had the wrong note number>
673.2  Pointers to existing stuff...  PERFOM::HENNING  Thu Apr 03 1997 15:50  10
    Re: main memory bandwidth -  I don't believe 440 changed from 350, nor
    466 from 400 - though if  minor differences of a percent or two are
    crucial to you let me know and I'll try to remeasure.
    
    It's been a few months since I updated my bandwidth web page, but if
    you've not seen it, check out
    http://tlg-www.zko.dec.com/~henning/Mem_bw.html and also the less
    pleasant http://tlg-www.zko.dec.com/~henning/Mem_bw_internal.html
    
    Both the above need to be updated for the Miata good news.
673.3  How difficult to hack McCalpin to measure?  BBPBV1::WALLACE  john wallace @ bbp. +44 860 675093  Fri Apr 04 1997 07:09  8
    Could we *measure* these figures given McCalpin source with suitably
    modified (i.e. small) arrays (and suitable hardware to run the test
    on)? Or does it not work like that?
    
    Would that help?
    
    regards
    john
673.4  TKOV50::NAKANO  Fri Apr 04 1997 13:56  7
    The memory bandwidth of Miata without cache is better than with 2mb
    cache. We have seen this on several applications. Also, the Miata 433a
    is better than the ASt500 400. Could you tell me the reason?
    
    Regards
    mamoru
    
673.5  cache actually hurts when it's too small  WRKSYS::SCHUMANN  Fri Apr 04 1997 15:32  7
An application that does not fit in the cache frequently runs faster
on a machine without cache. Each access that misses in the cache must first
make a cache access to find out about the miss, and these "wasted" cache 
accesses increase the effective memory access time, and use 20-30% of
the bus bandwidth.

--RS
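
A rough way to see the effect described in .5 is a two-level access-time model. This is only a sketch: it assumes every access probes the cache first, that a miss pays the probe plus the full memory access, and that the memory access itself costs the same with or without the cache in the path. The function names and the timing values are made up for illustration and are not measurements of any Digital system.

```python
def access_time_with_cache(hit_rate, t_cache_ns, t_mem_ns):
    """Average access time when every access probes the cache first."""
    # Hits cost one cache access; misses pay the probe and then the memory access.
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * (t_cache_ns + t_mem_ns)


def break_even_hit_rate(t_cache_ns, t_mem_ns):
    """Hit rate below which the cache makes the average access slower."""
    # t_cache + (1 - h) * t_mem < t_mem  holds only when  h > t_cache / t_mem.
    return t_cache_ns / t_mem_ns


if __name__ == "__main__":
    T_CACHE_NS = 30.0   # hypothetical board-cache probe time
    T_MEM_NS = 150.0    # hypothetical memory access time with no cache in the path
    for hit_rate in (0.0, 0.1, 0.2, 0.5, 0.9):
        t = access_time_with_cache(hit_rate, T_CACHE_NS, T_MEM_NS)
        print(f"hit rate {hit_rate:.1f}: {t:6.1f} ns with cache, "
              f"{T_MEM_NS:6.1f} ns without")
    print("break-even hit rate:", break_even_hit_rate(T_CACHE_NS, T_MEM_NS))
```

Under these assumptions the cache only pays off once the hit rate exceeds the ratio of probe time to memory time; an application that streams through data far larger than the cache sits below that line.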
673.6  DECCXL::OUELLETTE  crunch  Fri Apr 04 1997 16:59  10
> (i.e. small) arrays

The benchmark measures memory bandwidth.
You must choose array sizes larger than the largest cache.
That's the deal...  otherwise you haven't run McCalpin's benchmark.

Prof McCalpin's field of study at the U. of Delaware (I think; he's at
SGI now) was reservoir simulations and weather modeling.
None of his data stays in cache long  ...  that's why he (and people
like him) care  ...  that's why he wrote the benchmark.
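
The sizing rule in .6 can be turned into a small helper. This is a sketch under two assumptions: the arrays hold 8-byte (double-precision) elements, and each array should be several times the largest cache rather than just barely larger. The default factor of 4 follows the sizing guidance I recall from later STREAM distributions; it is not stated in this note, and `min_elements` is a hypothetical name.

```python
BYTES_PER_ELEMENT = 8  # assumed double-precision (REAL*8) elements


def min_elements(largest_cache_bytes, factor=4):
    """Smallest per-array element count that is `factor` times the largest cache."""
    needed_bytes = factor * largest_cache_bytes
    # Ceiling division, so the array is never smaller than the target size.
    return -(-needed_bytes // BYTES_PER_ELEMENT)


if __name__ == "__main__":
    for cache_mb in (2, 4, 8):
        n = min_elements(cache_mb * 2**20)
        print(f"{cache_mb} MB cache: at least {n} elements per array")
```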
673.8  On-chip measurements; b-cache vs. no b-cache  PERFOM::HENNING  Mon Apr 07 1997 01:47  106
[repost with corrected total size calculations]

Yes one can measure bandwidth for on-chip caches and between b-cache and
CPU with a suitably modified McCalpin Streams bmark - you use rpcc and 
small array sizes.  But trying to get close to the theoretical peak values
is tricky - Bob Nix wrote a memory test a while back in which he made the
comment:
    
*
* Code quality:  These tests are sensitive to the quality of code
* generated by the compiler.  The test results also include the loop control
* overhead.  This overhead can be factored out of the latency tests by
* simply subtracting off a known latency, the L1 latency on a cache-line
* sized stride, from all results.  The bandwidth results can't be adjusted
* in such a simple way -- these tests simply require a good optimizing compiler

* I've successfully compiled the bandwidth tests to run at ~80% of hardware
* peak on
* Alpha OSF, Alpha NT, HP Snake Unix, IBM Power1 Unix, and Pentium NT; but
* getting that performance required fiddling with source, switches, and careful
* checking against expected results.
*/

Running an rpcc'd McCalpin bmark with array sizes of 250 (~2k for each of
the three arrays, so all three should fit in the Dcache) just now with
Digital Fortran 77 V4.1-92-33BL on an EV5 @ 300 MHz using 
f77 -O5 -tune ev4 gave:

Function     Rate (MB/s)
Assignment: 2096.3372
Scaling   : 1564.0641
Summing   : 2080.6025
SAXPYing  : 1848.0018

f77 -O5 -tune ev5 gave:

Assignment: 2296.7518
Scaling   : 1871.0299
Summing   : 2509.4733
SAXPYing  : 2438.1621

and f77 -tune ev5 (i.e. dropping the -O5) gave:

Assignment: 2204.0389
Scaling   : 1562.0302
Summing   : 2052.1661
SAXPYing  : 1682.3546

Taking the best of the above (-O5 -tune ev5) and varying the array size
to 1000 (~8kb each array, ~24kb total, fits in one bank of S-cache) gave:

Assignment: 1714.1651
Scaling   : 1749.7468
Summing   : 1393.5049
SAXPYing  : 1574.6575

Changing the array size to 4000 (which means ~32kb for each of the three
arrays, ~96kb total) gives:

Assignment: 1364.9044
Scaling   :  825.7452
Summing   :  972.5669
SAXPYing  : 1058.3659

which is quite a drop - one suspects that not all 96kb was stored in the 
s-cache.  Switching to 20,000 elements (~1/2 mb total) drops down to:

Assignment:  410.1500
Scaling   :  401.9950
Summing   :  418.6805
SAXPYing  :  436.3363

Finally, changing the array size to 1,000,000 (~24mb total) drops the
bandwidth down to main memory speed:

Assignment:   98.3563
Scaling   :  105.1684
Summing   :  108.7296
SAXPYing  :  118.0426
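
For anyone rechecking the sizes quoted with the runs above, the arithmetic is just elements times bytes per element times three arrays. The sketch below assumes 8-byte elements, which is what "250 elements is ~2k per array" implies; `footprint` is a name chosen here for illustration.

```python
BYTES_PER_ELEMENT = 8   # assumed REAL*8, consistent with "250 -> ~2k" per array
NUM_ARRAYS = 3          # the benchmark works on three arrays


def footprint(n_elements):
    """Return (bytes per array, bytes for all three arrays)."""
    per_array = n_elements * BYTES_PER_ELEMENT
    return per_array, per_array * NUM_ARRAYS


if __name__ == "__main__":
    for n in (250, 1000, 4000, 20000, 1000000):
        per_array, total = footprint(n)
        print(f"{n:>9} elements: {per_array:>9} bytes per array, "
              f"{total:>9} bytes total")
```

The totals come out at 6,000, 24,000, 96,000, 480,000 and 24,000,000 bytes, which matches the ~24kb, ~96kb, ~1/2 mb and ~24mb figures quoted with each run.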

CAUTION #1: The on-chip measurements above are for DIGITAL INTERNAL USE
ONLY. The on-chip figures listed in 528.14 are the right ones to quote
externally, not these, because of their extreme sensitivity to code quality
and the fact that  there is no way to ensure a "level playing field"
between one vendor's on-chip vs another's figures.  

CAUTION #2: You also should not quote the final figures above, the
bandwidth to main memory, because my available EV5 happened to be a
system that does not have stellar main memory bandwidth.  If the
customer cares about main memory bandwidth, point out Turbolaser,
Rawhide, or Miata.  See published results at 

   http://www.cs.virginia.edu/stream/standard/Bandwidth.html 

Machine ID                ncpus    COPY    SCALE      ADD    TRIAD
DEC_8400_5-350               1    215.7    207.5    219.6    234.2
DEC_4100_5-400-              1    247.8    243.9    264.9    268.0
DEC_433a-(0MB_L3)            1    292.6    292.6    323.4    341.3
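
For readers comparing the two sets of column labels: as far as I know, the older Fortran labels (Assignment, Scaling, Summing, SAXPYing) correspond to COPY, SCALE, ADD and TRIAD in the published tables. The sketch below shows how a rate in MB/s falls out of the array length and the elapsed time, assuming the usual STREAM accounting of 8-byte words, two words moved per iteration for copy and scale, three for add and triad, and 1 MB = 10^6 bytes. The function and dictionary names are illustrative, not taken from the benchmark source.

```python
BYTES_PER_WORD = 8

# Words read plus words written per loop iteration:
#   copy : a(i) = b(i)            -> 2
#   scale: a(i) = q * b(i)        -> 2
#   add  : a(i) = b(i) + c(i)     -> 3
#   triad: a(i) = b(i) + q * c(i) -> 3
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_rate_mb_s(kernel, n_elements, seconds):
    """Bandwidth in MB/s (10^6 bytes) for one pass of a kernel over the arrays."""
    bytes_moved = WORDS_PER_ITERATION[kernel] * BYTES_PER_WORD * n_elements
    return bytes_moved / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical timing: one pass over 1,000,000 elements taking 0.1 s.
    for kernel in ("copy", "scale", "add", "triad"):
        print(kernel, stream_rate_mb_s(kernel, 1_000_000, 0.1))
```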

As to B-cache or no B-cache, right, sometimes the B-cache will actually slow
you down.  A little note on that subject is at 

   http://tlg-www.zko.dec.com/~henning/bcache.html

/John Henning
 CSD Performance Group
673.9  ..and.  RDGENG::WILLIAMS_A  Mon Apr 07 1997 13:21  3
    and the 'on-chip' stuff gets 'wider' with a faster clock right ?
    
    AW
673.10  OK, a more contemporary chip  I4GET::HENNING  Mon Apr 07 1997 16:11  60
Per Adrian's request, a follow-on to .7 with a faster system.

The following data is for Digital Internal Use Only.  It uses an Alpha
21164 near or at the MHz limits recently announced (see 
http://www.digital.com/PR00SK) which has been incorporated into an as-yet
unannounced system.  Actual mileage may vary.  The precise system that I 
am using may never be announced; product definition is subject to change. 
Insert additional qualifiers here.  

Anyway, with array size 250, Fortran 77 V4.1-92-33BL, f77 -O5 -tune ev5,
yes the achieved on-chip bandwidth goes up substantially to the dcache:

Function     Rate (MB/s)
Assignment: 4579.7506
Scaling   : 3736.6764
Summing   : 5003.9197
SAXPYing  : 4848.6024

And to the S-cache (Array size 1000)

Assignment: 3419.2856
Scaling   : 3491.5592
Summing   : 2779.2028
SAXPYing  : 3140.5720

Array size 4000

Assignment: 2722.6025
Scaling   : 1814.7246
Summing   : 2047.1877
SAXPYing  : 2201.5637

Here's an array size big enough to hit the board cache (20000) - note that
the system under test here has more than double the bw to the cache of the
system in .7:

Assignment:  928.6953
Scaling   :  925.8906
Summing   :  975.0332
SAXPYing  : 1016.0961

And in fact this bandwidth to the (8mb) cache holds up fairly well even as
the array sizes are increased to 300,000 elements (~2.4mb each, ~7.2mb total):

Assignment:  890.7892
Scaling   :  740.7456
Summing   :  998.8359
SAXPYing  :  918.9741

Finally, here's the main memory speed, with an array size of 3,000,000 
elements (~72mb total)

Assignment:  266.0516
Scaling   :  251.8598
Summing   :  271.3901
SAXPYing  :  277.2678

CAUTION: these numbers are for Digital Internal Use Only, as explained
         in reply .7
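
To put a number on "goes up substantially", here is a sketch that divides the figures in this reply by the 300 MHz EV5 figures (the -O5 -tune ev5 runs) for the array sizes both systems were run at. The data is copied from the two replies; the dictionary and function names are illustrative. The clock of the newer system is not stated here, so the script reports ratios only and does not try to infer a clock speed.

```python
# MB/s as (Assignment, Scaling, Summing, SAXPYing), keyed by array size.
EV5_300 = {
    250:   (2296.7518, 1871.0299, 2509.4733, 2438.1621),
    1000:  (1714.1651, 1749.7468, 1393.5049, 1574.6575),
    4000:  (1364.9044,  825.7452,  972.5669, 1058.3659),
    20000: ( 410.1500,  401.9950,  418.6805,  436.3363),
}
FASTER_SYSTEM = {
    250:   (4579.7506, 3736.6764, 5003.9197, 4848.6024),
    1000:  (3419.2856, 3491.5592, 2779.2028, 3140.5720),
    4000:  (2722.6025, 1814.7246, 2047.1877, 2201.5637),
    20000: ( 928.6953,  925.8906,  975.0332, 1016.0961),
}


def speedups(n_elements):
    """Per-kernel ratio of the faster system's rate to the 300 MHz EV5's rate."""
    old = EV5_300[n_elements]
    new = FASTER_SYSTEM[n_elements]
    return tuple(n / o for n, o in zip(new, old))


if __name__ == "__main__":
    for size in sorted(EV5_300):
        ratios = ", ".join(f"{r:.2f}" for r in speedups(size))
        print(f"array size {size:>6}: {ratios}")
```

The Dcache and S-cache sizes (250 and 1000) come out just under 2x across all four kernels, which is what one would expect if on-chip bandwidth simply tracks the clock. The 20000-element case comes out above 2.2x, in line with the "more than double" remark about the board cache.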