Network Working Group                                       M. Lambert
Request for Comments: 1030      M.I.T. Laboratory for Computer Science
                                                         November 1987


          On Testing the NETBLT Protocol over Divers Networks


STATUS OF THIS MEMO

   This RFC describes the results gathered from testing NETBLT over
   three networks of differing bandwidths and round-trip delays.  While
   the results are not complete, the information gathered so far has
   been very promising and supports RFC-998's assertion that NETBLT
   can provide very high throughput over networks with very different
   characteristics.  Distribution of this memo is unlimited.

1. Introduction

   NETBLT (NETwork BLock Transfer) is a transport level protocol
   intended for the rapid transfer of a large quantity of data between
   computers.  It provides a transfer that is reliable and flow
   controlled, and is designed to provide maximum throughput over a wide
   variety of networks.  The NETBLT protocol is specified in RFC-998;
   this document assumes an understanding of the specification as
   described in RFC-998.

   Tests over three different networks are described in this document.
   The first network, a 10 megabit-per-second Proteon Token Ring, served
   as a "reference environment" to determine NETBLT's best possible
   performance.  The second network, a 10 megabit-per-second Ethernet,
   served as an access path to the third network, the 3 megabit-per-
   second Wideband satellite network.  Determining NETBLT's performance
   over the Ethernet allowed us to account for Ethernet-caused behaviour
   in NETBLT transfers that used the Wideband network.  Test results for
   each network are described in separate sections.  The final section
   presents some conclusions and further directions of research.  The
   document's appendices list test results in detail.

2. Acknowledgements

   Many thanks are due Bob Braden, Stephen Casner, and Annette DeSchon
   of ISI for the time they spent analyzing and commenting on test
   results gathered at the ISI end of the NETBLT Wideband network tests.
   Bob Braden was also responsible for porting the IBM PC/AT NETBLT
   implementation to a SUN-3 workstation running UNIX.  Thanks are also
   due Mike Brescia, Steven Storch, Claudio Topolcic and others at BBN
   who provided much useful information about the Wideband network, and



M. Lambert                                                      [Page 1]


RFC 1030              Testing the NETBLT Protocol          November 1987


   helped monitor it during testing.

3. Implementations and Test Programs

   This section briefly describes the NETBLT implementations and test
   programs used in the testing.  Currently, NETBLT runs on three
   machine types: Symbolics LISP machines, IBM PC/ATs, and SUN-3s.  The
   test results described in this paper were gathered using the IBM
   PC/AT and SUN-3 NETBLT implementations.  The IBM and SUN
   implementations are very similar; most differences lie in timer and
   multi-tasking library implementations.  The SUN NETBLT implementation
   uses UNIX's user-accessible raw IP socket; it is not implemented in
   the UNIX kernel.

   The test application performs a simple memory-to-memory transfer of
   an arbitrary amount of data.  All data are actually allocated by the
   application, given to the protocol layer, and copied into NETBLT
   packets.  The results are therefore fairly realistic and, with
   appropriately large amounts of buffering, could be attained by disk-
   based applications as well.

   The test application provides several parameters that can be varied
   to alter NETBLT's performance characteristics.  The most important of
   these parameters are:


        burst interval  The number of milliseconds from the start of one
                        burst transmission to the start of the next burst
                        transmission.


        burst size      The number of packets transmitted per burst.


        buffer size     The number of bytes in a NETBLT buffer (all
                        buffers must be the same size, save the last,
                        which can be any size required to complete the
                        transfer).


        data packet size
                        The number of bytes contained in a NETBLT DATA
                        packet's data segment.


        number of outstanding buffers
                        The number of buffers which can be in
                        transmission/error recovery at any given moment.
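   For illustration only, these parameters can be collected into a
   single configuration record.  The type and field names below are our
   own invention and do not appear in any of the NETBLT implementations
   described here:

```python
# Hypothetical grouping of the tunable NETBLT test parameters described
# above; the type and field names are illustrative, not taken from any
# NETBLT implementation.
from dataclasses import dataclass

@dataclass
class NetbltTestParams:
    burst_interval_ms: int    # start of one burst to start of the next
    burst_size: int           # packets transmitted per burst
    buffer_size: int          # bytes per NETBLT buffer (last may be smaller)
    data_packet_size: int     # bytes in a DATA packet's data segment
    outstanding_buffers: int  # buffers in transmission/recovery at once

    def packets_per_buffer(self) -> int:
        # Each buffer is carried in a whole number of data packets; the
        # final packet of the final buffer may be short.
        return -(-self.buffer_size // self.data_packet_size)  # ceiling

# Example: the fastest Proteon Token Ring test case (section 5).
ring = NetbltTestParams(40, 5, 19900, 1990, 1)
```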



   The protocol's throughput is measured in two ways.  First, the "real
   throughput" is throughput as viewed by the user: the number of bits
   transferred divided by the time from program start to program finish.
   Although this is a useful measurement from the user's point of view,
   another throughput measurement is more useful for analyzing NETBLT's
   performance.  The "steady-state throughput" is the rate at which data
   is transmitted as the transfer size approaches infinity.  It does not
   take into account connection setup time and, more importantly, does
   not take into account the time spent recovering from packet-loss
   errors that occur after the last buffer in the transmission is sent
   out.  For NETBLT transfers using networks with long round-trip delays
   (and consequently with large numbers of outstanding buffers), this
   "late" recovery phase can add large amounts of time to the
   transmission, time which does not reflect NETBLT's peak transmission
   rate.  The throughputs listed in the test cases that follow are all
   steady-state throughputs.
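   The distinction between the two measurements can be sketched as
   follows; the transfer size and timing breakdown below are invented
   for the example, not measured values:

```python
def real_throughput(bits_transferred, total_seconds):
    # Throughput as seen by the user: program start to program finish.
    return bits_transferred / total_seconds

def steady_state_throughput(bits_transferred, total_seconds,
                            setup_seconds, late_recovery_seconds):
    # Excludes connection setup and the "late" error-recovery phase
    # that follows transmission of the last buffer.
    useful = total_seconds - setup_seconds - late_recovery_seconds
    return bits_transferred / useful

# Invented example: a 2,000,000-byte transfer taking 12 seconds overall,
# with 0.5 seconds of setup and 1.5 seconds of late recovery.
bits = 2_000_000 * 8
print(real_throughput(bits, 12.0))                    # ~1.33 megabits/sec
print(steady_state_throughput(bits, 12.0, 0.5, 1.5))  # 1.6 megabits/sec
```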

4. Implementation Performance

   This section describes the theoretical performance of the IBM PC/AT
   NETBLT implementation on both the transmitting and receiving sides.
   Theoretical performance was measured on two LANs: a 10 megabit-per-
   second Proteon Token Ring and a 10 megabit-per-second Ethernet.
   "Theoretical performance" is defined to be the performance achieved
   if the sending NETBLT did nothing but transmit data packets, and the
   receiving NETBLT did nothing but receive data packets.

   Measuring the send-side's theoretical performance is fairly easy,
   since the sending NETBLT does very little more than transmit packets
   at a predetermined rate.  There are few, if any, factors which can
   influence the processing speed one way or another.

   Using a Proteon P1300 interface on a Proteon Token Ring, the IBM
   PC/AT NETBLT implementation can copy a maximum-sized packet (1990
   bytes excluding protocol headers) from NETBLT buffer to NETBLT data
   packet, format the packet header, and transmit the packet onto the
   network in about 8 milliseconds.  This translates to a maximum
   theoretical throughput of 1.99 megabits per second.

   Using a 3COM 3C500 interface on an Ethernet LAN, the same
   implementation can transmit a maximum-sized packet (1438 bytes
   excluding protocol headers) in 6.0 milliseconds, for a maximum
   theoretical throughput of 1.92 megabits per second.
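   Both figures follow directly from the per-packet processing times
   quoted above; the small sketch below simply restates the arithmetic:

```python
def theoretical_throughput_mbps(packet_bytes, per_packet_ms):
    # One maximum-sized data packet every per_packet_ms milliseconds,
    # expressed in megabits per second.
    return packet_bytes * 8 / (per_packet_ms / 1000.0) / 1_000_000

# Proteon P1300: 1990 data bytes in about 8 ms -> ~1.99 megabits/sec
print(theoretical_throughput_mbps(1990, 8.0))
# 3COM 3C500:   1438 data bytes in about 6 ms -> ~1.92 megabits/sec
print(theoretical_throughput_mbps(1438, 6.0))
```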

   Measuring the receive-side's theoretical performance is more
   difficult.  Since all timer management and message ACK overhead is
   incurred at the receiving NETBLT's end, the processing speed can be
   slightly slower than the sending NETBLT's processing speed (this does
   not even take into account the demultiplexing overhead that the
   receiver incurs while matching packets with protocol handling
   functions and connections).  In fact, the amount by which the two
   processing speeds differ is dependent on several factors, the most
   important of which are: length of the NETBLT buffer list, the number
   of data timers which may need to be set, and the number of control
   messages which are ACKed by the data packet.  Almost all of this
   added overhead is directly related to the number of outstanding
   buffers allowable during the transfer.  The fewer the outstanding
   buffers, the shorter the NETBLT buffer list, the faster a scan
   through that list, and the shorter the list of unacknowledged
   control messages.

   Assuming a single-outstanding-buffer transfer, the receiving-side
   NETBLT can DMA a maximum-sized data packet from the Proteon Token
   Ring into its network interface, copy it from the interface into a
   packet buffer and finally copy the packet into the correct NETBLT
   buffer in 8 milliseconds: the same speed as the sender of data.

   Under the same conditions, the implementation can receive a maximum-
   sized packet from the Ethernet in 6.1 milliseconds, for a maximum
   theoretical throughput of 1.89 megabits per second.

5. Testing on a Proteon Token Ring

   The Proteon Token Ring used for testing is a 10 megabit-per-second
   LAN supporting about 40 hosts.  The machines on either end of the
   transfer were IBM PC/ATs using Proteon P1300 network interfaces.  The
   Token Ring provides high bandwidth with low round-trip delay and
   negligible packet loss, making it a good debugging environment;
   packet loss, packet reordering, and long round-trip times would all
   hinder debugging.
   (maximum 2046 bytes) network MTU.  The larger packets take somewhat
   longer to transmit than do smaller packets (8 milliseconds per 2046
   byte packet versus 6 milliseconds per 1500 byte packet), but the
   lessened per-byte computational overhead increases throughput
   somewhat.

   The fastest single-outstanding-buffer transmission rate was 1.49
   megabits per second, and was achieved using a test case with the
   following parameters:

      transfer size   2-5 million bytes


      data packet size
                      1990 bytes


      buffer size     19900 bytes


      burst size      5 packets


      burst interval  40 milliseconds.  The timer code on the IBM PC/AT
                      is accurate to within 1 millisecond, so a 40
                      millisecond burst can be timed very accurately.

   Allowing only one outstanding buffer reduced the protocol to running
   "lock-step" (the receiver of data sends a GO, the sender sends data,
   the receiver sends an OK, followed by a GO for the next buffer).
   Since the lock-step test incurred one round-trip-delay's worth of
   overhead per buffer (between transmission of a buffer's last data
   packet and receipt of an OK for that buffer/GO for the next buffer),
   a test with two outstanding buffers (providing essentially constant
   packet transmission) should have resulted in higher throughput.

   A second test, this time with two outstanding buffers, was performed,
   with the above parameters identical save for an increased burst
   interval of 43 milliseconds.  The highest throughput recorded was
   1.75 megabits per second.  This represents 95% efficiency (5 1990-
   byte packets every 43 milliseconds gives a maximum theoretical
   throughput of 1.85 megabits per second).  The increase in throughput
   over a single-outstanding-buffer transmission occurs because, with
   two outstanding buffers, there is no round-trip-delay lag between
   buffer transmissions and the sending NETBLT can transmit constantly.
   Because the P1300 interface can transmit and receive concurrently, no
   packets were dropped due to collision on the interface.
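   The 95% efficiency figure can be reproduced from the burst
   parameters; the calculation below is our own restatement of the
   arithmetic in the preceding paragraph:

```python
def burst_rate_mbps(packet_bytes, burst_size, burst_interval_ms):
    # Maximum sending rate: burst_size packets every burst_interval_ms.
    bits_per_burst = packet_bytes * 8 * burst_size
    return bits_per_burst / (burst_interval_ms / 1000.0) / 1_000_000

max_rate = burst_rate_mbps(1990, 5, 43)    # ~1.85 megabits/sec
achieved = 1.75                            # measured steady-state rate
print(round(achieved / max_rate * 100))    # 95 (percent efficiency)
```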

   As mentioned previously, the minimum transmission time for a
   maximum-sized packet on the Proteon Ring is 8 milliseconds.  One
   would expect, therefore, that the maximum throughput for a double-
   buffered transmission would occur with a burst interval of 8
   milliseconds times 5 packets per burst, or 40 milliseconds.  This
   would allow the sender of data to transmit bursts with no "dead time"
   in between bursts.  Unfortunately, the sender of data must take time
   to process incoming control messages, which typically forces a 2-3
   millisecond gap between bursts, lowering the throughput.  With a
   burst interval of 43 milliseconds, the incoming packets are processed
   during the 3 millisecond-per-burst "dead time", making the protocol
   more efficient.

6. Testing on an Ethernet

   The network used in performing this series of tests was a 10 megabit
   per second Ethernet supporting about 150 hosts.  The machines at
   either end of the NETBLT connection were IBM PC/ATs using 3COM 3C500
   network interfaces.  As with the Proteon Token Ring, the Ethernet
   provides high bandwidth with low delay.  Unfortunately, the
   particular Ethernet used for testing (MIT's infamous Subnet 26) is
   known for being somewhat noisy.  In addition, the 3COM 3C500 Ethernet
   interfaces are relatively unsophisticated, with only a single
   hardware packet buffer for both transmitting and receiving packets.
   This gives the interface an annoying tendency to drop packets under
   heavy load.  The combination of these factors made protocol
   performance analysis somewhat more difficult than on the Proteon
   Ring.

   The fastest single-outstanding-buffer transmission rate was 1.45
   megabits per second, and was achieved using a test case with the
   parameters:

      transfer size   2-5 million bytes


      data packet size
                      1438 bytes (maximum size excluding protocol
                      headers).


      buffer size     14380 bytes


      burst size      5 packets


      burst interval  30 milliseconds (6.0 milliseconds x 5 packets).

   A second test, this one with parameters identical to the first save
   for number of outstanding buffers (2 instead of 1) resulted in
   substantially lower throughput (994 kilobits per second), with a
   large number of packets retransmitted (10%).  The retransmissions
   occurred because the 3COM 3C500 network interface has only one
   hardware packet buffer and cannot hold a transmitting and receiving
   packet at the same time.  With two outstanding buffers, the sender of
   data can transmit constantly; this means that when the receiver of
   data attempts to send a packet, its interface's receive hardware goes
   deaf to the network and any packets being transmitted at the time by
   the sender of data are lost.  A symmetrical problem occurs with
   control messages sent from receiver of data to sender of data, but
   the number of control messages sent is small enough and the
   retransmission algorithm redundant enough that little performance
   degradation occurs due to control message loss.

   When the burst interval was lengthened from 30 milliseconds per 5
   packet burst to 45 milliseconds per 5 packet burst, a third as many
   packets were dropped, and throughput climbed accordingly, to 1.12
   megabits per second.  Presumably, the longer burst interval allowed
   more dead time between bursts and less likelihood of the receiver of
   data's interface being deaf to the net while the sender of data was
   sending a packet.  An interesting note is that, when the same test
   was conducted on a special Ethernet LAN with the only two hosts
   attached being the two NETBLT machines, no packets were dropped once
   the burst interval rose above 40 milliseconds/5 packet burst.  The
   improved performance was doubtless due to the absence of extra
   network traffic.

7. Testing on the Wideband Network

   The following section describes results gathered using the Wideband
   network.  The Wideband network is a satellite-based network with ten
   stations competing for a raw satellite channel bandwidth of 3
   megabits per second.  Since the various tests resulted in substantial
   changes to the NETBLT specification and implementation, some of the
   major changes are described along with the results and problems that
   forced those changes.

   The Wideband network has several characteristics that make it an
   excellent environment for testing NETBLT.  First, it has an extremely
   long round-trip delay (1.8 seconds).  This provides a good test of
   NETBLT's rate control and multiple-buffering capabilities.  NETBLT's
   rate control allows the packet transmission rate to be regulated
   independently of the maximum allowable amount of outstanding data,
   providing flow control as well as very large "windows".  NETBLT's
   multiple-buffering capability enables new data to be transmitted
   while earlier data await retransmission and subsequent data are
   being prepared for transmission.  On a network with a long round-
   trip delay, the alternative "lock-step" approach would impose a 1.8
   second gap between successive buffer transmissions, degrading
   performance.
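   The number of outstanding buffers needed to avoid lock-step delays
   on such a path can be estimated from the bandwidth-delay product.
   The sketch below combines figures quoted elsewhere in this document
   (roughly 1 megabit per second of usable bandwidth, a 1.8 second
   round trip, and 19900-byte buffers); it is a rough estimate, not a
   measured result:

```python
def min_outstanding_buffers(bandwidth_bps, rtt_seconds, buffer_bytes):
    # Buffers that must be "in flight" to keep the channel full: the
    # bandwidth-delay product divided by the buffer size, rounded up.
    in_flight_bits = int(bandwidth_bps * rtt_seconds)
    return -(-in_flight_bits // (buffer_bytes * 8))  # ceiling division

# ~1 megabit/sec usable bandwidth, 1.8 second round trip, 19900-byte
# buffers (the Token Ring buffer size used elsewhere in this document):
print(min_outstanding_buffers(1_000_000, 1.8, 19900))   # 12
```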

   Another interesting characteristic of the Wideband network is its
   throughput.  Although its raw bandwidth is 3 megabits per second, at
   the time of these tests fully 2/3 of that was consumed by low-level
   network overhead and hardware limitations.  (A detailed analysis of
   the overhead appears at the end of this document.)  This reduces the
   available bandwidth to just over 1 megabit per second.  Since the
   NETBLT implementation can run substantially faster than that, testing
   over the Wideband net allows us to measure NETBLT's ability to
   utilize very high percentages of available bandwidth.

   Finally, the Wideband net has some interesting packet reorder and
   delay characteristics that provide a good test of NETBLT's ability to
   deal with these problems.

   Testing progressed in several phases.  The first phase involved using
   source-routed packets in a path from an IBM PC/AT on MIT's Subnet 26,
   through a BBN Butterfly Gateway, over a T1 link to BBN, onto the
   Wideband network, back down into a BBN Voice Funnel, and onto ISI's
   Ethernet to another IBM PC/AT.  Testing proceeded fairly slowly, due
   to gateway software and source-routing bugs.  Once a connection was
   finally established, we recorded a best throughput of approximately
   90K bits per second.

   Several problems contributed to the low throughput.  First, the
   gateways at either end were forwarding packets onto their respective
   LANs faster than the IBM PC/ATs could accept them (the 3COM 3C500
   interface would not have time to re-enable input before another
   packet would arrive from the gateway).  Even with bursts of size 1,
   spaced 6 milliseconds apart, the gateways would aggregate groups of
   packets coming from the same satellite frame, and send them faster
   than the PC could receive them.  The obvious result was many dropped
   packets, and degraded performance.  Also, the half-duplex nature of
   the 3COM interface caused incoming packets to be dropped when packets
   were being sent.

   The number of packets dropped on the sending NETBLT side due to the
   long interface re-enable time was reduced by packing as many control
   messages as possible into a single control packet (rather tha