
Conference 7.286::fddi

Title:FDDI - The Next Generation
Moderator:NETCAD::STEFANI
Created:Thu Apr 27 1989
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2259
Total number of notes:8590

655.0. "Protocols for Low Latency User Proc's?" by LEMAN::MBROWN () Thu Jul 23 1992 17:51

I received an interesting question from a customer the other day.  I
don't really know the correct place to ask the question, but this one
can't be wrong since the question is about FDDI.

If you have two stations on an FDDI ring (this probably applies to any LAN),
and you want a user process to send a low-latency signal to another user
process on the other system, what would be the best protocol (standard
or non-standard)?

And, if you did this, could you also send standard DECnet and TCP/IP
packets across the network using the same adapter?  The real target
would be Alpha/OpenOSF, but Alpha/OpenVMS would also be useful, and
any system/OS could be used for development.

The customer may be willing to write his own device driver to make this
happen.  The goal is to minimize the number of instructions used by
both the sending and receiving nodes, and therefore minimize the latency.
The limitation is that there may be a moderate number of processes on
a node (less than 100) which want to send or receive using this mechanism.
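
For reference, here is a rough sketch of what the minimal "standard" path
looks like from the user process: one connectionless UDP datagram pushed
through the socket interface.  The address and port are just placeholders;
the point is where the per-message work goes (system call, buffer copies,
UDP/IP processing, driver, adapter), since that is exactly what the customer
wants to shrink.

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    char msg[100];                             /* the small "signal" payload  */
    struct sockaddr_in peer;
    int s = socket(AF_INET, SOCK_DGRAM, 0);    /* one-time setup cost         */

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(6000);                    /* placeholder port       */
    peer.sin_addr.s_addr = inet_addr("16.0.0.2");   /* placeholder address    */

    /* per-message cost: system call, copy into a kernel buffer, UDP/IP
       processing, driver, and adapter queuing -- and the mirror image of
       all of that on the receiving node */
    sendto(s, msg, sizeof msg, 0,
           (struct sockaddr *)&peer, sizeof peer);
    return 0;
}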

Pointers to other conferences or technical articles would be appreciated.

Thanks in advance,

Michael Brown
European HEP Group
655.1. by KONING::KONING (Paul Koning, A-13683) Thu Jul 23 1992 22:06 (4 lines)
How low is low?  There are a lot of different answers that make sense depending
on whether you mean 1 microsecond, 1 millisecond, or 100 milliseconds.

	paul
655.2. by MSBCS::KALKUNTE (Ram Kalkunte 293-5139) Fri Jul 24 1992 00:28 (35 lines)
    As in .1, I would like to know what target latency your customer
    has in mind. But generally ....
    
>>If one has two stations on an FDDI ring (probably should be any LAN),
>>and you want a user process to send a low-latency signal to another user
>>process on the other system, what would be the best protocol (standard
>>or non-standard)?

    You can obviously get better performance with customized, light-weight
    protocols (I am assuming this is what you mean by non-standard).
    
>>And, if you did this, could you also send standard DECnet and TCP/IP
>>packets across the network using the same adaptor?  The real target
>>would be Alpha/OpenOSF, but Alpha/OpenVMS would also be useful, and
>>any system/OS could be used for development.

    Definitely possible. 
    
>>The customer may be willing to write his own device driver to make this
>>happen.  
    
    It's not the device driver that he should plan to write, it is the
    application (with comm protocol). There is not much fat that you can
    remove by writing your own device driver. 
    
>>The goal is to minimize the number of instructions used by
>>both the sending and receiving nodes, and therefore minimize the latency.
>>The limitation is that there may be a moderate number of processes on
>>a node (less than 100) which want to send or receive using this mechanism.

    This cannot be answered without a complete set of requirements for this
    application. It may not be a limitation if the application is designed
    correctly. 
    
    Ram
655.3. "10-30 microsecond CPU overhead" by LEMAN::MBROWN () Fri Jul 24 1992 11:35 (37 lines)
Sorry, I should have been more specific.

What is desired is system-induced latency (as opposed to transmission latency)
of about 10 microseconds for the combination of send and receive overhead.
That figure is for an Alpha desktop system; the equivalent on a DS5000-240
would be about 30 microseconds.

The reason for the low number is to use the spare workstation cycles as a 
low-cost MPP during the evening hours.  I have talked to the MPSG group,
but their efforts are not directly relevant, at least not now.

IBM is currently pushing RS6000's with PVM (Parallel Virtual Machine)
software from ORNL plus Ultranet as an interconnect.  Many applications
will not work using standard PVM over TCP/IP because the communication
latency for signals and small data packets is too long (multiple milliseconds).

What I am looking for is the 1) lightest-weight, 2) closest-to-standard,
3) easiest-to-implement [or better yet, already implemented or prototyped]
protocol that is available.  Time to customer is more important than
going from 35 to 30 microseconds.

>    This cannot be answered without a complete set of requirements for this
>    application. It may not be a limitation if the application is designed
>    correctly.

The problem is that it isn't a single application, but the use of a series
of systems to run different applications distributed over the whole group.
Several of these applications (I said 100 before, but probably usually 10)
may be running at a time.  As mentioned in -.1, I made a mistake in that it
probably isn't a driver that needs to be written, but a simple high-level
interface with a little bit of logic to do a few-to-few mapping.
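
To make "simple high-level interface" a bit more concrete, something along
the lines of the sketch below is what I have in mind.  Every name here is
hypothetical; nothing like this exists today, and the library underneath it
could sit on whatever protocol turns out to be fast enough.

/* Hypothetical interface: the library owns the few-to-few mapping from a
 * local handle to a (remote node, remote process) pair, so the application
 * never touches the driver or the wire protocol directly. */
typedef struct {
    int node;                                   /* which workstation          */
    int process;                                /* which process on that node */
} ll_endpoint_t;

int ll_open(const ll_endpoint_t *self);                  /* register a local endpoint */
int ll_connect(int handle, const ll_endpoint_t *peer);   /* pick the remote end       */
int ll_send(int handle, const void *buf, int len);       /* small, low-latency send   */
int ll_recv(int handle, void *buf, int maxlen);          /* blocking receive          */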

Anyway, any suggestions of things on the shelf would be useful.

Thanks,

Michael
655.4. "Re-examine your application!" by KONING::KONING (Paul Koning, A-13683) Mon Jul 27 1992 18:52 (26 lines)
655.5. "Simulating nuclear collisions" by BONNET::LISSDANIELS () Tue Jul 28 1992 11:38 (22 lines)
Paul,

I believe they are gearing up to simulate what happened in a nuclear
collision, like an experiment in the CERN collider. HEP stands for
High Energy Physics... They may e.g. want to track the paths of the resulting
particles...

If they throw enough Alpha workstations at the problem it should
be a cinch ;-)

As for the network - maybe this is THE job for GIGAswitch???

In full duplex mode you would not have to wait for a token;
the GIGAswitch is the only "station" between sender and receiver.
So the distance would then be the only variable for the network
delay - provided the traffic is well spread between the participating
CPUs...

So that brings us back to the initial question -
any good, reliable, but lightweight protocols out there?

Comments anyone ?
655.6. by KONING::KONING (Paul Koning, A-13683) Tue Jul 28 1992 15:02 (15 lines)
What I meant is: what properties of the application require this sort of latency?
Compute-intensive simulation is an obvious application for a high BANDWIDTH
network, but it does not impose a low latency requirement.  So I'm still
looking for an explanation.  It may well be that the requester is confused
and we simply need to straighten out the requirement.  It may also be that
the requirement is valid, but it's a lot easier to answer a requirement if
there is a clear definition of the background that justifies it, and there
hasn't been.

Yes, Gigaswitch seems like the only interconnect technology that would meet
the numbers quoted.  But keep in mind you also have to get the data through
the adapter, across the bus, and through the software (that's probably the
list of increasing order of slowness...).  

	paul
655.7. "Wow! What are they willing to spend?" by MSBCS::KALKUNTE (Ram Kalkunte 293-5139) Tue Jul 28 1992 16:27 (32 lines)
    Well, for some of the reasons outlined earlier, FDDI (asynchronous)
    was never the right choice for such applications (even though I am still
    having a hard time figuring out what exactly this application is).
    
    The ideal protocol for such communication would do its own flow control
    and would be engineered to work with a given network. In any case, the
    latency goal of 10 usec seems unreasonable with existing technology;
    DEMFA, the fastest FDDI adapter to date, takes ~6 usec (best case) to
    deliver the smallest FDDI packet from the fiber to memory. An average
    case will also include queuing delays in the adapter and in memory, and
    will depend on packet size. I do not know what your average packet size
    would be (?) or what your average system will be doing (?), so I cannot
    comment on what your end-to-end transmission latency will be with FDDI.
    And this latency does not include even a single CPU instruction to
    process the packet.
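
    Just to put the pieces together, here is a back-of-the-envelope budget
    that uses only the numbers quoted in this topic, plus one guess (that
    the transmit side of the adapter costs about the same as the receive
    side).  It is a sketch, not measured data.

#include <stdio.h>

int main(void)
{
    double budget_us  = 30.0;   /* .3: send+receive overhead target, DS5000-240 class  */
    double adapter_us = 6.0;    /* .7: DEMFA best case, fiber to memory, per direction
                                   (transmit side assumed comparable -- a guess)        */
    double wire_us    = (100.0 * 8.0) / 100.0;  /* 100 data bytes at 100 Mbit/s = 8 us,
                                                   ignoring FDDI framing and token time */

    printf("left for software on both hosts: %.1f us\n",
           budget_us - 2.0 * adapter_us - wire_us);
    return 0;
}

    That leaves roughly 10 us of the 30 us budget for everything the
    software on both ends has to do.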
    
    Also, I/O-bound tasks behave differently than compute-bound tasks, and
    the bottom line is that the CPI you get for I/O programs is typically
    much worse than for CPU-bound programs. I mention this so that people
    will be careful when estimating how many instructions there should be
    in the run-time loop for this application.
    
    Since the kind of beast that you are looking for hasn't evolved yet
    (methinks), don't waste time looking for it. If there are considerable
    bucks on the line to make this happen, it would be a good idea to
    write your own application. But this has to be with a revised
    expectation of latency. If you need an estimate of what is achievable
    (much better than IBM's millisecond range), I will need to understand
    your application. Either you can post the details here or we can discuss
    offline.
    
    Ram
655.8. "Setup=wasted instructions" by RDVAX::MCCABE () Mon Aug 24 1992 15:32 (35 lines)
    Maybe I can offer some help with the low latency requirement.
    
    Distributed compiler technology provides automatic parallelism for
    array-based operations.  The result is that a data movement to another
    processor can use the CPU cycles of many other processors.  However,
    the cost to initiate a send/receive pair equates to instructions that
    could be used locally to process the data.

    A 50-microsecond latency on a MIPS workstation is on the order of 1200
    instructions.  If the compiler does not have a good idea of how
    long the remote processing step is going to take, it becomes quite
    possible to spend more on the communication than the local processing
    would take.
    
    As the numbers move up in magnitude, the cost of the remote processing
    becomes relatively expensive due to the latency.  Hence less
    distribution is more efficient.  
    
    Granted, there are many coarse-grained applications that can still
    benefit even when the latency is accounted for, but the total set
    of applications is reduced.
    
    Matrix reductions, distributed AXPYs, even SUM operations can be
    done very quickly in parallel when communication is cheap.  When it
    is not, the addition of processors to a given problem can result in
    longer, not shorter, execution times.
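
    A toy model makes the point; every number below is invented for
    illustration (it is not MPSG's cost model, and not a measurement):

#include <stdio.h>

int main(void)
{
    double work_us    = 2000.0;  /* assumed serial compute time per step     */
    double latency_us = 50.0;    /* assumed cost of one send/receive pair,
                                    as in the 50-microsecond example above   */
    int n;

    /* total time per step: the work split across n nodes, plus one
       synchronizing exchange with each of the other nodes */
    for (n = 1; n <= 16; n *= 2)
        printf("%2d nodes: %7.1f us per step\n",
               n, work_us / n + (n - 1) * latency_us);
    return 0;
}

    With these made-up numbers the step time bottoms out around 8 nodes and
    then gets worse again, which is exactly the "longer, not shorter" effect.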
    
    GIGAswitch does indeed look like a good mechanism for this kind of
    distribution.
    
    -Kevin McCabe
     Engineering Manager, MPSG
    
    P.S.  We may indeed be quite interested in what you are doing ...
    
655.9. "Thanks and more details" by LEMAN::MBROWN () Tue Sep 29 1992 08:45 (46 lines)
I apologize for not getting back to this sooner.  We have been swamped with
Alpha activity, several big conferences, and MPP work.

I will get in touch with Kevin and Ram independently, but let me say that
Torbjorn and Kevin are 100% on target.  We are planning on using GIGAswitch
as the interconnect, and 10 uS is still an interesting target number.

Actually, I would go farther than Kevin and say that setup time equates to
wasted instructions on MANY systems.  And it isn't just setup time.  It is
the time required for copying data from one buffer into another, into another,
and finally into user buffers.

The applications are not constant.  Some will have large transfers, some
will have small transfers, and most will have a mix.  However, from
my experience in other parallel processing environments, synchronization
latency (small packets) is the most critical issue.

There will likely be two or three modes of operation, and this might equate
more directly to Ram's request for "application information".  

The first mode is 10 Alpha workstations acting as a batch compute engine.  
Uninteresting for special communication protocols.

The second mode is using a "data flow" programming model like PVM (Parallel
Virtual Machine) developed by Jack Dongarra and promoted by IBM (and hopefully
Digital) as a way of using workstations to solve medium-to-fine grained
parallel problems.  Among other things, PVM provides a programming library 
that hides details of the location of program modules and the communication
between them.
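
For anyone who hasn't seen PVM, an exchange looks roughly like the sketch
below.  The calls are from the PVM 3 interface (pvm3.h) and may differ in
detail from the version being distributed today; "worker" stands in for the
name of whatever worker executable gets spawned.  Every pack/unpack and
message hop in here is where the latency we care about shows up.

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int val = 42;
    int ptid;

    pvm_mytid();                  /* enroll this process in PVM              */
    ptid = pvm_parent();          /* tid of the task that spawned us, if any */

    if (ptid == PvmNoParent) {
        /* master side: spawn one worker task and wait for its reply */
        int tid;
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &tid);
        pvm_recv(tid, 1);                 /* block for message tag 1         */
        pvm_upkint(&val, 1, 1);
        printf("got %d back\n", val);
    } else {
        /* worker side: pack one integer and send it to the master */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&val, 1, 1);
        pvm_send(ptid, 1);                /* message tag 1                   */
    }
    pvm_exit();
    return 0;
}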

Dongarra's graduate students developed a special program library for efficient
Ethernet communication; the same is needed for an FDDI GIGAswitch environment.
IBM has done this for their version of PVM (called PVM/e) over Fibre Channel
connections.

The third mode of operation is where High Performance Fortran applications
are automatically distributed across multiple "workstations", and they are
linked together via a high-speed network.  FDDI is probably too slow, but
it is the best we have right now.

The shortest term need is for the PVM style support, but the HPF style support
will be very close behind.  I expect that Kevin is already working on it.

Thanks for the help.  More later when it becomes available.

Michael
655.10. by KONING::KONING (Paul Koning, A-13683) Tue Sep 29 1992 14:09 (7 lines)
I don't see anything in that list that suggests severe latency requirements,
certainly nothing anywhere near as tight as 10 microseconds.  So I'm still
wondering how you came to the conclusion that such performance was needed.
(Never mind whether it's achievable with any hardware available from anyone
today.)

	paul
655.11. "Missouri Requirement <show me>" by LEMAN::MBROWN () Tue Sep 29 1992 14:56 (20 lines)
Paul,

You are right that there isn't a requirement that 100% of all latency be
under 10 microseconds.  The original number I used in note .3 was a 30-
microsecond delay from the time the application on system 1 begins the
transmission of a small packet (say, 100 bytes of useful data) until the
application on system 2 has the data in its buffer.  There should be a
reasonable confidence level that the transmission will complete in this
amount of time.

Until I see otherwise, I will assume that this cannot be done using standard
UDP packets or transparent or non-transparent DECnet.

Paul, if you or anyone else can show how long this takes using standard 
protocols, I would love to see the data and be proved wrong.  This would be 
using GIGAswitch, so some of the default assumptions about token availability 
are not valid.  Tests on 2 node rings would be of high value.

Regards,

Michael
655.12. by KONING::KONING (Paul Koning, A-13683) Tue Sep 29 1992 17:33 (46 lines)
I don't know how long this takes with standard protocols.  Actually, that's
a fairly meaningless question; the more meaningful question is how long it
takes on a given implementation.  (The properties of the particular
implementation are what determine the answer, not really any property common
to all implementations of a given protocol.)

Something is backwards here.  Requirements are supposed to be derived from
the application's needs.  If you can determine what the application needs
(and I'm NOT referring to a number such as "30 microseconds" unless it comes
with some explanation of how it was derived from parameters observable by
users of the system) then you can determine whether a particular implementation
of some particular protocol will do the job.  Tests of implementations will
validate performance claims for them and will give you confidence that they
will meet the requirements.  But I'm getting the impression that you're looking
for performance data as a way to determine what the performance requirements
should be, and that's not the way to do it.

Looking back at .9:

mode 1 (batch compute engines) -- sounds like bulk data transfer (similar to
file transfer).  Requires high throughput, but does not impose any significant
latency requirement.

mode 2 (fine grained parallelism) -- how fine is "fine"?  I know this sort of
stuff has been done in academic R&D.  To use it in commercial applications
requires picking grain sizes that aren't so small that most of the time
spent is overhead.  As far as I know, remote procedure call or similar
approaches for doing this sort of thing currently have overheads measured
in milliseconds, not microseconds.  Even if the actual network overhead
were zero, there's the application layer overhead (argument marshalling)
which can be quite substantial.  So if "fine grained" refers to operations
that take a second or so, using thousands but not millions of bytes per second,
again you have no special requirements.  If your grains complete in a few
milliseconds, you're not going to get much efficiency.

mode 3 (distribution of high performance fortran apps) -- that sounds similar
to mode 1, and again involves no significant latency requirements.  How much
data has to be moved?  You didn't mention, and that's the real question.

So to summarize: one of the three application modes you mentioned MAY 
justify low latency requirements.  You'll need to learn more about those
applications to find out the actual numbers.  The other two applications
have no latency requirements (beyond the modest ones needed for good
throughput, which any reasonable implementation already meets).

	paul
655.13. "Another Low-latency Application" by JULIET::HATTRUP_JA (Jim Hattrup, Santa Clara, CA) Thu Feb 24 1994 16:02 (14 lines)
    
    I am looking for a 'reflective memory' type solution for a real-time
    application.  I am wondering if a low-latency FDDI solution (perhaps
    using the Gigaswitch) would work.
    
    A configuration would be 3 to 10 systems that need to update 50 Kbytes
    of data among themselves (all of them) 30 times/sec.  This is
    1.5 Mbytes/sec, and delays in updates would cause problems.  They have
    33 milliseconds for computation and I/O (30 frames/sec), and can't miss
    this window.
    
    Is FDDI a workable solution?  (SYSTRAN SCRAMnet is an alternative, but
    we don't have VME-based mmap support on the VAX 7000.  The likely config
    is a VAX 7000 M620 and 2 to 4 SGI systems.)
655.14. by KONING::KONING (Paul Koning, B-16504) Thu Feb 24 1994 19:04 (9 lines)
Doesn't sound like a big deal.  The throughput you need is a small fraction
of that available on a single FDDI ring, so you don't even need Gigaswitch;
just hang the nodes on a private ring.  Given the low load, there is absolutely
no channel access delay problem, and the adapters won't introduce any 
significant delay either.

Judging by the numbers, you could ALMOST do this on Ethernet... :-)

	paul
655.15. by UFP::LARUE (Jeff LaRue: U.S. Network Resource Center) Fri Feb 25 1994 17:32 (15 lines)
    re: reflective memory
    
    This is exactly the kind of thing that we are in the process of
    creating for an air space management system here at Westinghouse.
    
    I have architected a solution that relies on multicast addressing
    in order to allow every node to be seen by every other node, etc.
    We were required to use Ada for the implementation of this capability.
    
    To date, we have found that the bandwidth of a private FDDI ring
    is sufficient to handle multiple tens of Alphas with an aggregate
    transmission of 10+ Mb/sec.  Additionally, the latency is more than low
    enough to meet the needs of the program.
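
    For anyone curious what that looks like in socket terms, the sketch
    below is roughly the shape of it (our real implementation is in Ada,
    and the group address, port, and chunk size here are placeholders):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define CHUNK_SZ 1400    /* one datagram's worth of shared state */

int main(void)
{
    char chunk[CHUNK_SZ];
    struct sockaddr_in local, group;
    struct ip_mreq mreq;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    /* listen on the shared port so we see every other node's updates */
    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_port = htons(5000);                    /* placeholder port   */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&local, sizeof local);

    /* join the (placeholder) multicast group */
    mreq.imr_multiaddr.s_addr = inet_addr("224.1.1.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    /* our own updates go to the same group */
    memset(&group, 0, sizeof group);
    group.sin_family = AF_INET;
    group.sin_port = htons(5000);
    group.sin_addr.s_addr = inet_addr("224.1.1.1");

    /* one slice of an update cycle: send our chunk, read one incoming
       chunk; a real node loops over all its chunks each cycle and drains
       everything the other nodes sent */
    memset(chunk, 0, sizeof chunk);
    sendto(s, chunk, sizeof chunk, 0,
           (struct sockaddr *)&group, sizeof group);
    recvfrom(s, chunk, sizeof chunk, 0, NULL, NULL);
    return 0;
}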
    
    -Jeff