
Conference 7.286::fddi

Title:FDDI - The Next Generation
Moderator:NETCAD::STEFANI
Created:Thu Apr 27 1989
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2259
Total number of notes:8590

655.0. "Protocols for Low Latency User Proc's?" by LEMAN::MBROWN () Thu Jul 23 1992 17:51

I received an interesting question from a customer the other day.  I
don't really know the correct place to ask the question, but this one
can't be wrong since the question is about FDDI.

If you have two stations on an FDDI ring (this probably applies to any LAN),
and you want a user process to send a low-latency signal to another user
process on the other system, what would be the best protocol (standard
or non-standard)?

And, if you did this, could you also send standard DECnet and TCP/IP
packets across the network using the same adapter?  The real target
would be Alpha/OpenOSF, but Alpha/OpenVMS would also be useful, and
any system/OS could be used for development.

The customer may be willing to write his own device driver to make this
happen.  The goal is to minimize the number of instructions used by
both the sending and receiving nodes, and therefore minimize the latency.
The limitation is that there may be a moderate number of processes on
a node (less than 100) which want to send or receive using this mechanism.
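
For reference, here is a rough sketch of what the minimal "standard" path
looks like from the user process: one connectionless UDP datagram pushed
through the socket interface.  The address and port are just placeholders;
the point is where the per-message work goes (system call, buffer copies,
UDP/IP processing, driver, adapter), since that is exactly what the customer
wants to shrink.

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    char msg[100];                             /* the small "signal" payload  */
    struct sockaddr_in peer;
    int s = socket(AF_INET, SOCK_DGRAM, 0);    /* one-time setup cost         */

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(6000);                    /* placeholder port       */
    peer.sin_addr.s_addr = inet_addr("16.0.0.2");   /* placeholder address    */

    /* per-message cost: system call, copy into a kernel buffer, UDP/IP
       processing, driver, and adapter queuing -- and the mirror image of
       all of that on the receiving node */
    sendto(s, msg, sizeof msg, 0,
           (struct sockaddr *)&peer, sizeof peer);
    return 0;
}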

Pointers to other conferences or technical articles would be appreciated.

Thanks in advance,

Michael Brown
European HEP Group
655.1. by KONING::KONING (Paul Koning, A-13683) Thu Jul 23 1992 22:06 (4 lines)
How low is low?  There are a lot of different answers that make sense depending
on whether you mean 1 microsecond, 1 millisecond, or 100 milliseconds.

	paul
655.2. by MSBCS::KALKUNTE (Ram Kalkunte 293-5139) Fri Jul 24 1992 00:28 (35 lines)
    As in .1, I would like to know what target latency your customer
    has in mind. But generally ....
    
>>If one has two stations on an FDDI ring (probably should be any LAN),
>>and you want a user process to send a low-latency signal to another user
>>process on the other system, what would be the best protocol (standard
>>or non-standard)?

    You can obviously get better performance with customized, light-weight
    protocols (I am assuming this is what you mean by non-standard).
    
>>And, if you did this, could you also send standard DECnet and TCP/IP
>>packets across the network using the same adaptor?  The real target
>>would be Alpha/OpenOSF, but Alpha/OpenVMS would also be useful, and
>>any system/OS could be used for development.

    Definitely possible. 
    
>>The customer may be willing to write his own device driver to make this
>>happen.  
    
    It's not the device driver that he should plan to write, it is the
    application (with comm protocol). There is not much fat that you can
    remove by writing your own device driver. 
    
>>The goal is to minimize the number of instructions used by
>>both the sending and receiving nodes, and therefore minimize the latency.
>>The limitation is that there may be a moderate number of processes on
>>a node (less than 100) which want to send or receive using this mechanism.

    This cannot be answered without a complete set of requirements for this
    application. It may not be a limitation if the application is designed
    correctly. 
    
    Ram
655.3. "10-30 microsecond CPU overhead" by LEMAN::MBROWN () Fri Jul 24 1992 11:35 (37 lines)
Sorry, I should have been more specific.

What is desired is system-induced latency (as opposed to transmission latency)
of about 10 microseconds for the combination of send and receive overhead.
That figure is for an Alpha desktop system; the equivalent on a DS5000-240
would be about 30 microseconds.

The reason for the low number is to use the spare workstation cycles as a 
low-cost MPP during the evening hours.  I have talked to the MPSG group,
but their efforts are not directly relevant, at least not now.

IBM is currently pushing RS6000's with PVM (Parallel Virtual Machine)
software from ORNL plus Ultranet as an interconnect.  Many applications
will not work using standard PVM over TCP/IP because the communication
latency for signals and small data packets is too long (multiple milliseconds).

What I am looking for is the 1) lightest-weight, 2) closest-to-standard,
3) easiest-to-implement [or better yet, already implemented or prototyped]
protocol that is available.  Time to customer is more important than
going from 35 to 30 microseconds.

>    This cannot be answered without a complete set of requirements for this
>    application. It may not be a limitation if the application is designed
>    correctly.

The problem is that it isn't a single application, but the use of a series
of systems to run different applications distributed over the whole group.
Several of these applications (I said 100 before, but probably usually 10)
may be running at a time.  As mentioned in -.1, I made a mistake in that it
probably isn't a driver that needs to be written, but a simple high-level
interface with a little bit of logic to do a few-to-few mapping.
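
To make "simple high-level interface" a bit more concrete, something along
the lines of the sketch below is what I have in mind.  Every name here is
hypothetical; nothing like this exists today, and the library underneath it
could sit on whatever protocol turns out to be fast enough.

/* Hypothetical interface: the library owns the few-to-few mapping from a
 * local handle to a (remote node, remote process) pair, so the application
 * never touches the driver or the wire protocol directly. */
typedef struct {
    int node;                                   /* which workstation          */
    int process;                                /* which process on that node */
} ll_endpoint_t;

int ll_open(const ll_endpoint_t *self);                  /* register a local endpoint */
int ll_connect(int handle, const ll_endpoint_t *peer);   /* pick the remote end       */
int ll_send(int handle, const void *buf, int len);       /* small, low-latency send   */
int ll_recv(int handle, void *buf, int maxlen);          /* blocking receive          */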

Anyway, any suggestions of things on the shelf would be useful.

Thanks,

Michael
655.4. "Re-examine your application!" by KONING::KONING (Paul Koning, A-13683) Mon Jul 27 1992 18:52 (26 lines)
655.5. "Simulating nuclear collisions" by BONNET::LISSDANIELS () Tue Jul 28 1992 11:38 (22 lines)
Paul,

I believe they are gearing up to simulate what happened in a nuclear
collision, like an experiment in the CERN collider. HEP stands for
High Energy Physics... They may e.g. want to track the paths of the resulting
particles...

If they throw enough Alpha workstations at the problem it should
be a cinch ;-)

As for the network - maybe this is THE job for GIGAswitch???

In full duplex mode you would not have to wait for a token;
the GIGAswitch is the only "station" between sender and receiver.
So the distance would then be the only variable for the network
delay - provided the traffic is well spread between the participating
CPUs...

So that brings us back to the initial question -
any good, reliable, but lightweight protocols out there?

Comments anyone ?
655.6. by KONING::KONING (Paul Koning, A-13683) Tue Jul 28 1992 15:02 (15 lines)
What I meant is: what properties of the application require this sort of latency?
Compute-intensive simulation is an obvious application for a high BANDWIDTH
network, but it does not impose a low latency requirement.  So I'm still
looking for an explanation.  It may well be that the requester is confused
and we simply need to straighten out the requirement.  It may also be that
the requirement is valid, but it's a lot easier to answer a requirement if
there is a clear definition of the background that justifies it, and there
hasn't been.

Yes, Gigaswitch seems like the only interconnect technology that would meet
the numbers quoted.  But keep in mind you also have to get the data through
the adapter, across the bus, and through the software (that's probably the
list of increasing order of slowness...).  

	paul
655.7. "Wow! What are they willing to spend?" by MSBCS::KALKUNTE (Ram Kalkunte 293-5139) Tue Jul 28 1992 16:27 (32 lines)
    Well, for some of the reasons outlined earlier, FDDI (asynchronous)
    was never the right choice for such applications (even though I am still
    having a hard time figuring out what exactly this application is).
    
    The ideal protocol for such communication would do its own flow control
    and would be engineered to work with a given network. In any case, the
    latency goal of 10 usec seems unreasonable with existing technology;
    DEMFA, the fastest FDDI adapter to date, takes ~6 usec (best case) to
    deliver the smallest FDDI packet from the fiber to memory. An average
    case will also include queuing delays in the adapter and in memory, and
    will depend on packet size. I do not know what your average packet size
    would be (?) or what your average system will be doing (?), so I cannot
    comment on what your end-to-end transmission latency will be with FDDI.
    And this latency does not include even a single CPU instruction to
    process the packet.
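
    Just to put the pieces together, here is a back-of-the-envelope budget
    that uses only the numbers quoted in this topic, plus one guess (that
    the transmit side of the adapter costs about the same as the receive
    side).  It is a sketch, not measured data.

#include <stdio.h>

int main(void)
{
    double budget_us  = 30.0;   /* .3: send+receive overhead target, DS5000-240 class  */
    double adapter_us = 6.0;    /* .7: DEMFA best case, fiber to memory, per direction
                                   (transmit side assumed comparable -- a guess)        */
    double wire_us    = (100.0 * 8.0) / 100.0;  /* 100 data bytes at 100 Mbit/s = 8 us,
                                                   ignoring FDDI framing and token time */

    printf("left for software on both hosts: %.1f us\n",
           budget_us - 2.0 * adapter_us - wire_us);
    return 0;
}

    That leaves roughly 10 us of the 30 us budget for everything the
    software on both ends has to do.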
    
    Also, I/O-bound tasks behave differently than compute-bound tasks, and
    the bottom line is that the CPI you get for I/O programs is typically
    much worse than for CPU-bound programs. I mention this so that people
    will be careful when estimating how many instructions there should be
    in the run-time loop for this application.
    
    Since the kind of beast that you are looking for hasn't evolved yet
    (methinks), don't waste time looking for it. If there are considerable
    bucks on the line to make this happen, it would be a good idea to
    write your own application. But this has to be with a revised
    expectation of latency. If you need an estimate of what is achievable
    (much better than IBM's millisecond range), I will need to understand
    your application. Either you can post the details here or we can discuss
    offline.
    
    Ram
655.8. "Setup=wasted instructions" by RDVAX::MCCABE () Mon Aug 24 1992 15:32 (35 lines)
    Maybe I can offer some help with the low latency requirement.
    
    Distributed compiler technology provides automatic parallelism for
    array-based operations.  The result is that a data movement to another
    processor can use the CPU cycles of many other processors.  However,
    the cost to initiate a send/receive pair equates to instructions that
    could be used locally to process the data.

    A 50-microsecond latency on a MIPS workstation is on the order of 1200
    instructions.  If the compiler does not have a good idea of how
    long the remote processing step is going to take, it becomes quite
    possible to spend more on the communication than the local processing
    would take.
    
    As the numbers move up in magnitude, the cost of the remote processing
    becomes relatively expensive due to the latency.  Hence less
    distribution is more efficient.  
    
    Granted, there are many coarse-grained applications that can still
    benefit even when the latency is accounted for, but the total set
    of applications is reduced.
    
    Matrix reductions, distributed AXPYs, even SUM operations can be
    done very quickly in parallel when communication is cheap.  When it
    is not, the addition of processors to a given problem can result in
    longer, not shorter, execution times.
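
    A toy model makes the point; every number below is invented for
    illustration (it is not MPSG's cost model, and not a measurement):

#include <stdio.h>

int main(void)
{
    double work_us    = 2000.0;  /* assumed serial compute time per step     */
    double latency_us = 50.0;    /* assumed cost of one send/receive pair,
                                    as in the 50-microsecond example above   */
    int n;

    /* total time per step: the work split across n nodes, plus one
       synchronizing exchange with each of the other nodes */
    for (n = 1; n <= 16; n *= 2)
        printf("%2d nodes: %7.1f us per step\n",
               n, work_us / n + (n - 1) * latency_us);
    return 0;
}

    With these made-up numbers the step time bottoms out around 8 nodes and
    then gets worse again, which is exactly the "longer, not shorter" effect.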
    
    GIGAswitch does indeed look like a good mechanism for this kind of
    distribution.
    
    -Kevin McCabe
     Engineering Manager, MPSG
    
    P.S.  We may indeed be quite interested in what you are doing ...
    
655.9. "Thanks and more details" by LEMAN::MBROWN () Tue Sep 29 1992 08:45 (46 lines)
I apologize for not getting back to this sooner.  We have been swamped with
Alpha activity, several big conferences, and MPP work.

I will get in touch with Kevin and Ram independently, but let me say that
Torbjorn and Kevin are 100% on target.  We are planning on using GIGAswitch
as the interconnect, and 10 uS is still an interesting target number.

Actually, I would go farther than Kevin and say that setup time equates to
wasted instructions on MANY systems.  And it isn't just setup time.  It is
the time required for copying data from one buffer into another, into another,
and finally into user buffers.

The applications are not constant.  Some will have large transfers, some
will have small transfers, and most will have a mix.  However, from
my experience in other parallel processing environments, synchronization
latency (small packets) is the most critical issue.

There will likely be two or three modes of operation, and this might equate
more directly to Ram's request for "application information".  

The first mode is 10 Alpha workstations acting as a batch compute engine.  
Uninteresting for special communication protocols.

The second mode is using a "data flow" programming model like PVM (Parallel
Virtual Machine) developed by Jack Dongarra and promoted by IBM (and hopefully
Digital) as a way of using workstations to solve medium-to-fine grained
parallel problems.  Among other things, PVM provides a programming library 
that hides details of the location of program modules and the communication
between them.
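
For anyone who hasn't seen PVM, an exchange looks roughly like the sketch
below.  The calls are from the PVM 3 interface (pvm3.h) and may differ in
detail from the version being distributed today; "worker" stands in for the
name of whatever worker executable gets spawned.  Every pack/unpack and
message hop in here is where the latency we care about shows up.

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int val = 42;
    int ptid;

    pvm_mytid();                  /* enroll this process in PVM              */
    ptid = pvm_parent();          /* tid of the task that spawned us, if any */

    if (ptid == PvmNoParent) {
        /* master side: spawn one worker task and wait for its reply */
        int tid;
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &tid);
        pvm_recv(tid, 1);                 /* block for message tag 1         */
        pvm_upkint(&val, 1, 1);
        printf("got %d back\n", val);
    } else {
        /* worker side: pack one integer and send it to the master */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&val, 1, 1);
        pvm_send(ptid, 1);                /* message tag 1                   */
    }
    pvm_exit();
    return 0;
}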

Dongarra's graduate students developed a special program library for efficient
Ethernet communication; the same is needed for an FDDI GIGAswitch environment.
IBM has done this for their version of PVM (called PVM/e) over Fibre Channel
connections.

The third mode of operation is where High Performance Fortran applications
are automatically distributed across multiple "workstations", and they are
linked together via a high-speed network.  FDDI is probably too slow, but
it is the best we have right now.

The shortest term need is for the PVM style support, but the HPF style support
will be very close behind.  I expect that Kevin is already working on it.

Thanks for the help.  More later when it becomes available.

Michael
655.10. by KONING::KONING (Paul Koning, A-13683) Tue Sep 29 1992 14:09 (7 lines)
I don't see anything in that list that suggests severe latency requirements,
certainly nothing anywhere near as tight as 10 microseconds.  So I'm still
wondering how you came to the conclusion that such performance was needed.
(Never mind whether it's achievable with any hardware available from anyone
today.)

	paul
655.11. "Missouri Requirement <show me>" by LEMAN::MBROWN () Tue Sep 29 1992 14:56 (20 lines)
Paul,

You are right that there isn't a requirement that 100% of all latency be
under 10 microseconds.  The original number I used in note .3 was a 30-
microsecond delay from the time the application on system 1 begins the
transmission of a small packet (say, 100 bytes of useful data) until the
application on system 2 has the data in its buffer.  There should be a
reasonable confidence level that the transmission will complete in this
amount of time.

Until I see otherwise, I will assume that this cannot be done using standard
UDP packets or transparent or non-transparent DECnet.

Paul, if you or anyone else can show how long this takes using standard 
protocols, I would love to see the data and be proved wrong.  This would be 
using GIGAswitch, so some of the default assumptions about token availability 
are not valid.  Tests on 2 node rings would be of high value.

Regards,

Michael
655.12. by KONING::KONING (Paul Koning, A-13683) Tue Sep 29 1992 17:33 (46 lines)
I don't know how long this takes with standard protocols.  Actually, that's
a fairly meaningless question; the more meaningful question is how long it
takes on a given implementation.  (The properties of the particular
implementation are what determine the answer, not really any property common
to all implementations of a given protocol.)

Something is backwards here.  Requirements are supposed to be derived from
the application's needs.  If you can determine what the application needs
(and I'm NOT referring to a number such as "30 microseconds" unless it comes
with some explanation of how it was derived from parameters observable by
users of the system) then you can determine whether a particular implementation
of some particular protocol will do the job.  Tests of implementations will
validate performance claims for them and will give you confidence that they
will meet the requirements.  But I'm getting the impression that you're looking
for performance data as a way to determine what the performance requirements
should be, and that's not the way to do it.

Looking back at .9:

mode 1 (batch compute engines) -- sounds like bulk data transfer (similar to
file transfer).  Requires high throughput, but does not impose any significant
latency requirement.

mode 2 (fine grained parallelism) -- how fine is "fine"?  I know this sort of
stuff has been done in academic R&D.  To use it in commercial applications
requires picking grain sizes that aren't so small that most of the time
spent is overhead.  As far as I know, remote procedure call or similar
approaches for doing this sort of thing currently have overheads measured
in milliseconds, not microseconds.  Even if the actual network overhead
were zero, there's the application layer overhead (argument marshalling)
which can be quite substantial.  So if "fine grained" refers to operations
that take a second or so, using thousands but not millions of bytes per second,
again you have no special requirements.  If your grains complete in a few
milliseconds, you're not going to get much efficiency.

mode 3 (distribution of high performance fortran apps) -- that sounds similar
to mode 1, and again involves no significant latency requirements.  How much
data has to be moved?  You didn't mention, and that's the real question.

So to summarize: one of the three application modes you mentioned MAY 
justify low latency requirements.  You'll need to learn more about those
applications to find out the actual numbers.  The other two applications
have no latency requirements (beyond the modest ones needed for good
throughput, which any reasonable implementation already meets).

	paul
655.13. "Another Low-latency Application" by JULIET::HATTRUP_JA (Jim Hattrup, Santa Clara, CA) Thu Feb 24 1994 16:02 (14 lines)
    
    I am looking for a 'reflective memory' type solution for a real-time
    application.  I am wondering if a low-latency FDDI solution (perhaps
    using the Gigaswitch) would work.
    
    A configuration would be 3 to 10 systems that need to update 50 Kbytes
    of data among themselves (all of them) 30 times/sec.  This is
    1.5 Mbytes/sec, and delays in updates would cause problems.  They have
    33 milliseconds for computation and I/O (30 frames/sec), and can't miss
    this window.
    
    Is FDDI a workable solution?  (SYSTRAN SCRAMnet is an alternative, but
    we don't have VME-based mmap support on the VAX 7000.  The likely config
    is a VAX 7000 M620 and 2 to 4 SGI systems.)
655.14. by KONING::KONING (Paul Koning, B-16504) Thu Feb 24 1994 19:04 (9 lines)
Doesn't sound like a big deal.  The throughput you need is a small fraction
of that available on a single FDDI ring, so you don't even need Gigaswitch;
just hang the nodes on a private ring.  Given the low load, there is absolutely
no channel access delay problem, and the adapters won't introduce any 
significant delay either.

Judging by the numbers, you could ALMOST do this on Ethernet... :-)

	paul
655.15. by UFP::LARUE (Jeff LaRue: U.S. Network Resource Center) Fri Feb 25 1994 17:32 (15 lines)
    re: reflective memory
    
    This is exactly the kind of thing that we are in the process of
    creating for an air space management system here at Westinghouse.
    
    I have architected a solution that relies on multicast addressing
    in order to allow every node to be seen by every other node, etc.
    We were required to use Ada for the implementation of this capability.
    
    To date, we have found that the bandwidth of a private FDDI ring
    is sufficient to handle multiple tens of Alphas with an aggregate
    transmission of 10+ Mb/sec.  Additionally, the latency is more than low
    enough to meet the needs of the program.
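
    For anyone curious what that looks like in socket terms, the sketch
    below is roughly the shape of it (our real implementation is in Ada,
    and the group address, port, and chunk size here are placeholders):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define CHUNK_SZ 1400    /* one datagram's worth of shared state */

int main(void)
{
    char chunk[CHUNK_SZ];
    struct sockaddr_in local, group;
    struct ip_mreq mreq;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    /* listen on the shared port so we see every other node's updates */
    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_port = htons(5000);                    /* placeholder port   */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&local, sizeof local);

    /* join the (placeholder) multicast group */
    mreq.imr_multiaddr.s_addr = inet_addr("224.1.1.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    /* our own updates go to the same group */
    memset(&group, 0, sizeof group);
    group.sin_family = AF_INET;
    group.sin_port = htons(5000);
    group.sin_addr.s_addr = inet_addr("224.1.1.1");

    /* one slice of an update cycle: send our chunk, read one incoming
       chunk; a real node loops over all its chunks each cycle and drains
       everything the other nodes sent */
    memset(chunk, 0, sizeof chunk);
    sendto(s, chunk, sizeof chunk, 0,
           (struct sockaddr *)&group, sizeof group);
    recvfrom(s, chunk, sizeof chunk, 0, NULL, NULL);
    return 0;
}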
    
    -Jeff