Some Gigabit TCP Experiments

Introduction

This page discusses the results of some TCP experiments regarding the use of NPRs that have port rates near 1 Gbps. Note that the default NPR configuration your experiment uses may exclude some of the behavior described on this page. Furthermore, some experiments were run with special privileges not available to ordinary users (e.g., using ethtool to change the number of TX descriptors used by a NIC). The experiments show the following:

  • An NPR port can only support about 970 Mbps, not 1 Gbps.
    • You may see packet drops at an NPR's TX block if you set the port rate to 1 Gbps.
  • A pc1core host can send packets to the NIC at a rate of about 1.8 Gbps.
    • During slow-start, tcptrace shows duplicate packets called Hardware Duplicates (HDs) being sent to the NIC when the NIC begins to run out of TX descriptors.
    • At first, the effect is to halve the effective input rate to 900 Mbps during the tail of each slow-start burst.
    • But because the user-space iperf sender is still sending at 1.8 Gbps, the link-layer qdisc queue overflows, causing the sender to exit slow-start, enter the Congestion Window Reduced (CWR) congestion avoidance state, and then move to the normal congestion avoidance state.
  • Increasing the number of TX descriptors delays the onset of HDs.
    • The default number of TX and RX descriptors is 256 each. The maximum for each is 16,384.
  • A better approach to ensuring that slow-start doesn't end prematurely is to increase the size of the qdisc (by increasing txqueuelen at the sender) and to increase the receive buffer space at the receiver (by increasing netdev_max_backlog).
  • The BIC congestion avoidance algorithm may have higher throughput but lower fairness than Reno for multiple flows.
    • BIC may be too aggressive in the congestion avoidance phase.

This page also shows:

  • How to better understand the underlying behavior of TCP through the use of the tools tcpdump/tcptrace, ethtool, ifconfig and tc and the script runCwndMon.pl.
  • How a measurement tool such as tcpdump can drastically alter the behavior of TCP at gigabit speeds because of the overhead it introduces.
  • How the host parameters txqueuelen and netdev_max_backlog affect packet drops.

In the process, the page also describes the buffers in the Linux TCP stack and how Linux packet capture is done.

The Configuration

gigabit-tcp-dumbbell.png

The configuration:

  • Dumbbell configuration
  • Senders' packets go through a 50 msec delay plugin before going to queue 64 at port n1p1
  • Queue 64 capacity: 6,225,000 bytes = 4,167 packets = 1 BDP
  • Bottleneck port: n1p1, 1 Gbps (or 83,333 pkts/sec, 12 usec per pkt)
  • TCP iperf senders on the left, and TCP iperf receivers on the right
    • The top (middle, bottom) left host sends maximum-sized segments (1,496 bytes) to the top (middle, bottom) right host
  • ACKs take the reverse path
  • Charts use a 0.25 second monitoring period
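
As a rough sanity check, the 1-BDP queue capacity follows directly from the bottleneck rate and the plugin delay; a sketch using bc (the exact packet size used to provision queue 64 appears to be slightly under 1,500 bytes, which accounts for the 4,167-packet figure above):

echo "10^9 * 0.050 / 8" | bc     # 1 Gbps x 50 msec of delay = 6,250,000 bytes, about one BDP
echo "6250000 / 1500" | bc       # about 4,166 maximum-sized packets, roughly the 4,167-packet queue capacity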


Gigabit TCP Reno

This section shows that an NPR output port cannot support a full 1 Gbps. Our experiment shows that it can support 960 Mbps, while experiments by ONL staff have shown that drops will occur in the NPR's TX block for any input rate higher than 970 Mbps.

  • Three iperf TCP flows with 2-second staggered starting times
  • The sending host is also running a tcpdump process and a runCwndMon.pl process, which add extra load and reduce the capacity of the sending host.
  • Other host (sender and receiver) parameters
    • txqueuelen is 1000.
    • netdev_max_backlog is 300.
    • The number of NIC TX descriptors and RX descriptors is 256 each.
  • BDP for 50 msec delay is about 4,167 MSSs
  • Linux TCP in Reno congestion avoidance (CA) increases cwnd for each ACK
    • Delayed ACKs mean that CA period is about 2 x 4,167 x 50 msec = 417 sec
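
For reference, the three staggered flows can be generated with commands along these lines (a sketch only; the host name placeholder, test duration and report interval are assumptions, not the exact values used in the experiment):

# On each right-hand host, start an iperf server:
iperf -s
# On each left-hand host, start an iperf client aimed at the matching right-hand host,
# delaying each start by 2 seconds relative to the previous one:
iperf -c <matching-right-hand-host> -t 60 -i 1
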
gigabit-tcp-1Gbps-iperf.png
gigabit-tcp-1Gbps-bw.png

The iperf output (left) shows that the aggregate goodput of the three flows is around 226 Mbps, well below the 1 Gbps capacity of the bottleneck link.

The bandwidth chart (right) shows the aggregate bandwidth (including retransmissions) in red, labeled "data". The bandwidths of the individual flows are labeled 1.x, where x is the input port (0, 4, 3, or black, blue, green). A closer examination of the bandwidths of the individual flows shows the 2-second staggered starting times. The strange part of the bandwidth chart is that after time 538, the flows appear to suffer packet losses (bandwidth decreases) even though the aggregate bandwidth is well below 1 Gbps.


gigabit-tcp-1Gbps-qlen.png
gigabit-tcp-1Gbps-drops.png

The queue length chart (left) shows no significant queueing.

The drops chart confirms that the Queue Manager doesn't drop any packets (Pkt Drops); i.e., there are no packet drops from queue 64, the common queue of the three flows. However, the chart shows that the TX block of the NPR drops about 40 and then 50 packets. Data for the TX Drops chart is obtained by monitoring register counter 34, which records the number of packets dropped by the TX block because it cannot queue packets out of the output port.


Unexplained behavior:

  • The bandwidth chart of the third flow (green)
    • The slope during congestion avoidance should be identical to the other two flows but it appears to be larger
    • The bandwidth appears to drop around T = 554 even though the Drops chart doesn't show any packet drops at that time
  • The bandwidth chart of the first flow (black)
    • The bandwidth appears to drop around T = 544 even though the Drops chart doesn't show any packet drops at that time
  • The bandwidth chart of the second flow (blue)
    • The bandwidth appears to drop around T = 546 even though the Drops chart doesn't show any packet drops at that time

These bandwidth drops did not appear in five later repetitions of this same experiment.

960 Mbps Bottleneck

The charts below show that the system behaves as expected for a bottleneck rate of 960 Mbps.

gigabit-tcp-960Mbps-iperf.png
gigabit-tcp-960Mbps-bw.png

The iperf output (left) shows that the aggregate goodput of the three flows is now around 585 Mbps, well above the 226 Mbps we obtained before when the bottleneck rate was set to 1 Gbps.

The bandwidth chart (right) shows that the aggregate bandwidth (including retransmissions), in red and labeled "data", now hovers around 480 Mbps and peaks at 960 Mbps. Furthermore, the three flows get about the same bandwidth once all flows settle down into congestion avoidance.


gigabit-tcp-960Mbps-qlen.png
gigabit-tcp-960Mbps-drops.png

The queue length chart (left) shows significant queueing with three large peaks. The largest peak shows over 3 MB (about half of the capacity) of queueing when the second flow is in its slow-start phase.

The drops chart now shows Queue Manager packet drops (Pkt Drops); i.e., drops from queue 64, the common queue of the three flows. These drops appear to begin when the second flow begins its slow-start phase. And there are no TX drops.


Unexplained behavior:

  • The bandwidth chart of the third flow (green)
    • The bandwidth appears to drop around T = 302 even though the Drops chart doesn't show any packet drops at that time

The Linux Network Stack

The unexpected drops in TCP bandwidth shown earlier are usually a sign of congestion (packet drops) along a flow's network path. Typically, the drops occur at one of the queues at the NPR output port with the smallest relative capacity. But you will find that as you increase the NPR output port rate, the congestion moves to the sending or receiving host. Furthermore, if you attempt to examine the microscopic behavior of a TCP flow using a tool such as tcpdump, the details may not make sense unless you understand how the host buffers can affect TCP's behavior.

The Linux Transmit Path

gigabit-tcp-network-stack.png

As you increase the output port capacities of the NPRs, you will find that congestion and packet drops move into resources located at the sending and receiving hosts. To illustrate how this can happen, this section points out where congestion can occur in Linux's network stack by following the path of packets as they travel through the stack. It also describes how packets are captured by tcpdump so that you can better understand the displays in the next section.

This is how data is injected into the network by an application such as iperf for a Linux 2.6 kernel:

  • An iperf TCP sender (a user-space application) inserts data into an octet stream by calling write (send or sendmsg).
  • The TCP layer copies the data from user-space into sk_buff (SKB) structures in the kernel.
    • Each SKB has extra space for the IP and TCP headers and usually contains a maximum segment size (MSS) number of octets.
    • The data is stored in the TCP send socket buffers until it is acknowledged by the receiver.
    • The sender application will be blocked when the sender runs out of socket buffer space.
    • The total size of the socket buffer space is related to the iperf -w option and the value of /proc/sys/net/ipv4/tcp_wmem (max).
  • After TCP and IP headers have been added to an SKB (in the functions tcp_transmit_skb() and ip_queue_xmit() respectively), the kernel function dev_queue_xmit() inserts the SKB into the device queue (qdisc) for transmission if there is room.
    • The qdisc can hold txqueuelen SKBs and will drop a segment if it is full.
  • Eventually, qdisc_run() is called to dequeue an SKB for transmission.
    • The function dev_queue_xmit_nit() is called to make an SKB copy for tcpdump;
    • And the function dev->hard_start_xmit() will be called to issue one or more commands to the network device for scheduling transfer of the buffer to the NIC.
  • If the NIC's TX ring buffer is full (NETDEV_TX_BUSY), the SKB will be requeued on the device queue for another transmission attempt.
  • (Not shown) Eventually, the NIC will generate an interrupt so that resources associated with the transmission can be released.
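
Each buffer along this path can be inspected from the shell; the commands below are a sketch (assuming the data interface is named data0, as elsewhere on this page):

/sbin/sysctl net.ipv4.tcp_wmem         # socket buffer limits used by the TCP layer (min, default, max bytes)
/sbin/ifconfig data0                   # the qdisc length appears as txqueuelen in the interface summary
/sbin/ethtool -g data0                 # current and maximum NIC TX/RX ring buffer sizes
/sbin/tc -s -d qdisc show dev data0    # qdisc statistics, including drops and requeues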


Note that tcpdump records show transmission traffic as it passes from the qdisc to the TX ring buffer. This perspective means:

  • It does not see packets that are dropped between the socket buffer space and the qdisc.
  • Packets (SKBs) that are requeued because of a full TX ring buffer appear as duplicate packets (the tcptrace utility marks these duplicates as Hardware Duplicates).

Note also that tcpdump drops packets if it runs out of buffers. These drops appear as holes in the transmission trace and typically occur near the end of slow-start, when the traffic rate is highest.
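
When capture losses are a problem, the usual mitigations are to shorten the captured snapshot length and, if the installed tcpdump supports it, to enlarge the capture buffer; a sketch (the interface name and sizes are assumptions):

tcpdump -i data0 -s 96 -B 4096 -w sender.pcap tcp
# -s 96    capture only the headers, reducing per-packet copying
# -B 4096  enlarge the capture buffer (in KiB) so slow-start bursts are less likely to overflow it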

The Linux Receive Path

The receive path has components analogous to the transmit path:

  • There is an RX ring buffer used by the NIC for incoming packets which can be configured with the ethtool utility (analogous to the TX ring buffer).
  • There is an incoming device queue which can hold netdev_max_backlog packets (analogous to qdisc which can hold txqueuelen packets).
  • And there are receive socket buffers (analogous to send socket buffers).

In our experiments, iperf clients send packets to iperf servers which receive data packets and then return ACK packets. So, the sender's receive path processes arriving ACKs which trigger the transmission of buffered data packets.

Thus, drops and blocking can occur at the following locations in the network stack:

  • Blocking can occur at a sender's TCP socket buffers, whose size is related to the sender's -w iperf option and the value of /proc/sys/net/ipv4/tcp_wmem (max).
  • Dropping can occur at a sender's qdisc device queue, whose size txqueuelen can be set by the ifconfig command.
  • Blocking can occur at the sending NIC's TX ring buffer, which can be sized by the ethtool command.
  • Dropping can occur at the receiving NIC's RX ring buffer, which can also be sized by the ethtool command.
  • Dropping can occur at the receiver's device queue, whose size netdev_max_backlog can be set by the sysctl command.
  • Blocking can occur because of the size of the receiver's socket buffer space, which is related to the receiver's -w iperf option and the value of /proc/sys/net/ipv4/tcp_rmem (max).
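
The socket buffer sizes in the first and last items above are typically requested per flow with iperf's -w option; a sketch (the receiver host name is a placeholder, and the 16 MB figure matches the socket buffer size used in the single-flow experiments later on this page):

# Receiver: request a 16 MB receive socket buffer
iperf -s -w 16M
# Sender: request a 16 MB send socket buffer (the kernel may allocate up to twice this amount)
iperf -c <receiver-host> -w 16M -t 60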

More Information

For more information, see:

  • The Linux network source code at /mnt/kernel_source/64bit/linux-2.6.24.7/net/ mounted on the filesystem of every ONL node.
    • The main subdirectories of interest are ipv4/{tcp,tcp_output,tcp_input}.c and core/dev.c.
  • Pasi Sarolahti and Alexey Kuznetsov, "Congestion Control in Linux TCP", Proc. Usenix 2002, Monterey, California, USA, June 2002.
    • A good high-level description of Linux 2.4 TCP features (also, mostly applicable to Linux 2.6).
  • Helali Bhuiyan, Mark McGinley, Tao Li, and Malathi Veeraraghavan, "TCP Implementation in Linux: A Brief Tutorial", University of Virginia.
    • A good high-level description of the Linux 2.6 network stack implementation.
  • http://www.linuxfoundation.org/collaborate/workgroups/networking/kernelflow, "kernel_flow", Linux Foundation and Arnout Vandercappelle
    • A detailed summary of the Linux 2.6.20 network source code
  • http://vger.kernel.org/~davem/tcp_output.html, "The Linux TCP Output Engine", Dave Miller
    • A concise description of TCP's output engine.
  • http://vger.kernel.org/~davem/skb_sk.html, "How SKB Socket Accounting Works", Dave Miller
    • Includes a description of how a user process is resumed after it runs out of socket buffers.
  • http://www.linuxjournal.com/article/1312, "Network Buffers and Memory Management", Alan Cox
    • A description of SKBs in Linux 2.4.
  • TCP/IP Illustrated, Volume 1: The Protocols (Second Edition), Kevin R. Fall and W. Richard Stevens, Addison-Wesley, 2012
    • Section 16.5 is a detailed Linux TCP example that includes a discussion of topics relevant to our examples.

NIC TX Ring Buffer Sizing And Hardware Duplicates

This section describes how the size of the TX ring buffer affects TCP behavior as viewed by tcpdump. It explains how Hardware Duplicates (HDs) arise. The HDs arise during the slow-start phase in experiments that use a 50 msec delay (or any large delay) but can be eliminated by increasing the number of TX descriptors. It also explains some details in tcpdump output that may be confusing. For example, TCP bandwidth mysteriously decreases when there appear to be no packet drops (and no retransmissions). But in fact, the packet drops are invisible to tcpdump because they occur at the qdisc, which sits in front of the packet capture point in the kernel.
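
Displays like the ones below can be reproduced by post-processing a sender-side tcpdump capture with tcptrace and viewing the result with xplot.org; a sketch of the workflow (the capture and graph file names are illustrative):

tcptrace -G sender.pcap     # -G generates all graph files, including the time-sequence graph
xplot.org a2b_tsg.xpl       # view the time-sequence graph for the first connection (labeled a2b by tcptrace)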

  • Default #NIC TX descriptors: 256
  • An ONL (pc1core) host can send at 1.8 Gbps, almost twice the NIC's transmission rate
  • HDs (Hardware Duplicates) in tcptrace output when the NIC runs out of free TX descriptors
  • During slow-start, octets are ACKed at the speed of the bottleneck which is 960 Mbps, and packets are sent through the sender's kernel at about 1.8 Gbps.
    • So, NIC TX descriptors are consumed at a rate of about 0.84 Gbps during slow-start bursts
  • When the NIC runs out of TX descriptors, the NIC signals device congestion (NETDEV_TX_BUSY) and the packet (SKB) has to be requeued in the qdisc.
    • This requeueing shows up as Hardware Duplicates in a tcpdump file because the requeued packet appears as a duplicate transmission in the file.
    • The requeueing also slows down the goodput of the sender. This slowdown can be as much as 50%, reducing the transmission rate into the network to under 900 Mbps.
  • Once the TX ring buffer is full, SKBs back up into the qdisc.
  • When a new packet arrives to a full qdisc, it gets dropped, and TCP eventually enters Congestion Avoidance (CA).

The following subsections discuss the tcptrace-xplot.org displays of our single Reno flow when the NIC is using the default 256-slot TX ring buffer and what happens when the number of TX descriptors is increased to 16,384.
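
In the second case, the number of TX descriptors is changed with ethtool, which requires the special privileges mentioned in the introduction; a sketch, assuming the data interface is data0:

/sbin/ethtool -G data0 tx 16384    # increase the number of TX descriptors (verify with: /sbin/ethtool -g data0)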

Slow-Start

gigabit-tcp-HDs.png
  • The sequence number plot has sequence number on the y-axis and time on the x-axis.
  • The flow begins in slow-start and ends at about 650 msec (since there is a 50 msec delay plugin) when it enters Congestion Avoidance.
  • Slow-Start has 12 rounds each lasting 50 msec except for the last round.
    • Round K has 2^K packets for K = 2, 3, ... , 11 and 1 packet in Round 1 (counting from 1). Round 12 is shortened because TCP detects congestion (actually a qdisc drop).
  • The white line shows the transmission progress. We'll call the slope of the white line the data rate; it is the bandwidth as seen at the output of the qdisc.
    • A detailed examination of rounds 2-7 shows a rate of 1.88 Gbps if we exclude the idle periods; i.e., the injection rate into the NIC is about 1.8 Gbps.
    • Rounds 10-12 show a rate of 1.88 Gbps for the first 512 packets in each round, but the rate then drops to about 960 Mbps for the remainder of each round. (This behavior is more apparent in xplot.org charts where we zoom into these regions, but those charts are not shown here.)
    • In the Congestion Avoidance period, the rate appears to be about half of the ACK rate or 480 Mbps.
  • The green line is the ACK line, which shows which octets have been ACKed. We'll call the slope of the green line the ACK rate; it is the bandwidth at which octets are being ACKed, which we expect to be the bottleneck rate of 960 Mbps.
    • The ACK rate during non-idle slow-start periods is 960 Mbps.
    • The ACK rate drops to about 480 Mbps during Congestion Avoidance.
  • The red indicates Hardware Duplicates which are more clearly shown in the next chart.


gigabit-tcp-HDs-details.png

The figure (right) shows the transition region in Round 11 when HDs begin to reappear:

  • The HDs are labeled in red with HD and appear as blue segments instead of white ones.
  • Each HD segment appears about 12 usec after the segment it duplicates.
    • Note that the transmission time of a 1500-byte (12,000-bit) packet at 1 Gbps is 12 usec.
  • The 1,500-byte (actually 1,496-byte) packet that follows the HD often appears at about the same time.
  • Although not shown in this figure, sequence gaps often appear during periods with HDs since these periods place the greatest stress on the packet capture subsystem and tcpdump runs out of buffer space.


Round 12

gigabit-tcp-HDs-round12.png
gigabit-tcp-HDs-round12-end.png

These two figures of Round 12 illustrate observations that were made earlier:

  • Hardware duplicates begin to appear after about the first 512 packets have been transmitted.
  • The data rate before the HDs appear is about 1.88 Gbps, but drops to about 960 Mbps after they appear.
    • Note how the white (data rate) line parallels the ACK line, indicating that transmission is proceeding at the same rate that octets are being ACKed.
  • Sequence number gaps appear near the end of Round 12 when tcpdump runs out of buffers.
    • We know that the packets are not dropped by TCP because they are ACKed.
  • Once HDs appear, an HD and a new packet go out every 12 usec, indicating that they are driven by the NIC transmitting a packet and freeing a slot in the TX ring buffer.
  • The data rate right after Slow-Start ends (Round 13) shown in the far right part of the left figure is about 480 Mbps, about half of the data rate (and ACK rate) during the end of Slow-Start.
    • Congestion Avoidance doesn't begin until Round 14 (not shown).
    • This behavior illustrates Linux's Congestion Window Reduced (CWR) congestion state and not Congestion Avoidance. How CWR works will be explained more fully in the example in the section Hardware Duplicates.


Hardware Duplicates

Looking again at the preceding figure, we can see that the flow enters Congestion Avoidance in an unusual way: there is no sign of a packet drop and retransmission. So, how do we know that the HDs do not cause TCP to enter Congestion Avoidance? And what does cause TCP to enter Congestion Avoidance during Round 12?

The output of the following tc command gives us a clue to the bandwidth drop mystery:

/sbin/tc -s -d qdisc show dev data0

The output of the tc command shows the following detailed (-d) traffic control statistics (-s) for the data0 device queue (qdisc):

qdisc pfifo_fast 0: root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1051128118 bytes 694309 pkt (dropped 1, overlimits 0 requeues 4329)
 rate 0bit 0pps backlog 0b 0p requeues 4329

The end of the second line indicates that the data0 qdisc had one (1) packet drop and 4329 requeues. The requeues are associated with the HDs, and the one packet drop at the qdisc is the one that ends Slow-Start. Some observations from the tcptrace-xplot.org displays and some rough calculations add further evidence for this interpretation.

Observation 1: No Hardware Duplicates (HDs) appear in Rounds 1-9 since there are enough slots in the TX ring buffer to prevent requeueing packets to the qdisc. For example, there are only 512 packets sent in Round 9 in response to the 256 ACKs. These packets arrive at the device queue in pairs because the slow-start algorithm increases cwnd by one for each ACK, allowing the transmission of two more packets. The tcptrace-xplot.org displays show that the packets arrive at 1.88 Gbps and the octets are being ACKed at about 930 Mbps (the rate is 960 Mbps if we include the IP and TCP headers). Since the default number of TX ring buffer packet descriptors is 256, the ring buffer allows half of the 512 packets in Round 9 to be queued while the other half are transmitted.

Observation 2: Rounds 10, 11 and 12 each begin with the transmission of about 512 packets before the Hardware Duplicates (HDs) begin. In Rounds 10, 11 and 12, the queueing backs up into the qdisc, but because the qdisc can hold 1,000 packets by default, it does not overflow until Round 12. For example, there are only 2,048 packets in Round 11. If the relative queueing ratio is 1 (i.e., one packet is queued for each packet transmitted, so the transmission rate equals the queueing rate), the TX ring buffer in combination with the qdisc should be able to handle 2,512 packets before the qdisc overflows. The HDs appear in the tcpdump file because the TX ring buffer is saturated after the 512th packet in each of these rounds, causing each packet going to the TX ring buffer to be sent back to the qdisc for requeueing.

Observation 3: The 4,329 requeues reported by the tc command now make sense. Close examination of the tcptrace-xplot.org charts indicates that about 2,800 packets are transmitted in Round 12 before an overflow occurs and the flow enters Congestion Avoidance. This means that in Rounds 10-12, 5,872 (= 1,024 + 2,048 + 2,800) packets are transmitted, but 1,536 (= 3 x 512) of them do not have HDs. Thus, the number of requeues should be about 4,336 (= 5,872 - 1,536), which is only 7 more than the 4,329 reported by the tc command.

Observation 4: The one packet drop at the qdisc is treated as a local congestion event. Linux responds to local congestion by exiting Slow-Start and entering the Congestion Window Reduced (CWR) state. While in CWR, Linux does several things:

  • Sets ssthresh to half the current cwnd and sets cwnd to min(cwnd, flight size + 1)
  • Reduces cwnd by one for every other ACK until cwnd reaches ssthresh, at which point TCP enters the standard Congestion Avoidance state

This is a rate-halving algorithm whose intended effect is to gradually reduce cwnd to its proper Congestion Avoidance value while continuing to send packets at half the ACK rate.

In our example, TCP exits CWR at the end of Round 13 when cwnd has been reduced to ssthresh. (In general, Linux TCP can also exit CWR due to a network loss event.)

Increasing the Size of the TX Ring Buffer

gigabit-tcp-noHDs.png

The figure (right) shows the sequence number chart after increasing the size of the sender's TX ring buffer from 256 to 16,192. This change seems to eliminate all of the HDs. We expect that increasing the size of the TX ring buffer will reduce the number of HDs because it will take longer to fill the ring buffer. But should the change eliminate all of the HDs?

Four pieces of evidence yield an explanation:

  • The number of octets inflight (unacknowledged) at time 1.05 seconds is about 16 MB.
  • The iperf sender was run so that there was 16 MB of socket buffer space.
    • Linux actually allocates twice that amount or 32 MB to be used for buffer space and overhead. But about 16 MB of the 32 MB will be used for the outgoing packets.
  • The 16,192 TX ring buffer descriptors can handle about 24 MB of packets.
  • The output of the tc command indicated that there were 0 requeues and 0 drops at the qdisc.

These observations mean that the socket buffers may become exhausted before the TX descriptors are. When the socket buffers become depleted, the application will stall before the TX ring buffer becomes full.

A close examination of the chart reveals that no data is transmitted for about 10 msec at around 1 second. The 10 msec idle period could be due to an application stall associated with depleted socket buffer space.

The idle period is then followed by a 150 msec period in which the transmission rate has decreased to 400 Mbps from 1.86 Gbps. This behavior is strange in that it doesn't appear to be due to slow-start, congestion avoidance or rate halving. We suspect that it may be due to tcpdump consuming enough CPU cycles to negatively impact iperf performance. We repeated the experiment while running the script ~kenw/bin/runCwndMon.pl at the sending host to capture the values of cwnd and ssthresh. The output showed that cwnd peaked at about 10,000 packets, went through a short reduction period and then settled down to an almost constant value.

When we repeated the experiment several times without running tcpdump on the sending host, the iperf throughput jumped to over 900 Mbps. That is, tcpdump distorted the behavior of TCP at high sending rates because it reduced the CPU cycles enough to significantly affect the processing of outgoing and incoming packets. The packet drop that appears at the far right side of the chart may also be another product of tcpdump interference.


The Host Parameters txqueuelen And netdev_max_backlog

Increasing the TX ring buffer had two effects:

  • It allowed cwnd to increase to at least the BDP during slow-start by delaying the time when the qdisc would overflow.
  • It eliminated the requeueing overhead (and hardware duplicates).

The first point can also be achieved by increasing the size of the qdisc by increasing the value of the txqueuelen parameter associated with the network interface using the ifconfig command:

/sbin/ifconfig data0 txqueuelen 8192

The benefit of this approach is that the TCP state variables give a more accurate reflection of the flow dynamics. The traffic seen coming out of the qdisc will approximate the line rate instead of appearing to be twice the line rate. And the number of packets perceived to be in the network (inflight) will be more accurate since the TX ring won't have more than 256 packets queued.

The analogous change on the receive side is to increase the size of the incoming device queue using the sysctl command:

/sbin/sysctl -e -w net.core.netdev_max_backlog=8192

  • txqueuelen is the number of slots in the sender's outgoing device queue (qdisc) and should be large enough to allow slow-start to reach and maintain cwnd near the BDP.
  • net.core.netdev_max_backlog is the number of slots in the receiver's incoming device queue and should be large enough to buffer incoming data packets at the receiver during moments when iperf (or a similar application) loses the use of the CPU.
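
Both settings can be checked after the fact (a quick sketch):

/sbin/ifconfig data0 | grep txqueuelen
/sbin/sysctl net.core.netdev_max_backlog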


Expected Behavior

If txqueuelen and net.core.netdev_max_backlog are large enough for TCP to fill the network pipeline and if the bottleneck NPR port rate is about 1 Gbps, one packet should be sent for every packet ACKed. When this happens, cwnd will continue to increase past the BDP because TCP will still be in slow-start. But since Linux TCP supports CWV (Congestion Window Validation) by default, it will decay cwnd so that it is approximately equal to the number of packets in flight. If no packet drops occur in the network or at the receiver, and the sender runs for a sufficient amount of time, cwnd will appear to be constant even though the sender is still in slow-start.

Repeated 20-second experiments without tcpdump running at the sending host confirm that this is the behavior most of the time. But occasionally (in about one out of five experiments), one short packet (usually 100 octets) gets dropped. It is unclear why the short packet is sent and where it actually gets dropped.

Single-Flow Experiments

            >>>>> RERUN EXPERIMENTS BUT WITHOUT TCPDUMP RUNNING <<<<<

The table below compares the bandwidth, queue length, drops and iperf output for four single-flow experiments. The experiments labeled NIC are those in which the number of NIC TX descriptors has been increased from the default value of 256 to 16,382. The ones labeled NIC+Host have, in addition, the txqueuelen and netdev_max_backlog host parameters increased to 16,382. ( NOTE: Click on an image to get an enlarged version of the figure. )
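
The congestion avoidance algorithm shown in the CA/Mods column is selected system-wide with a sysctl; a sketch, assuming the bic and reno modules are available on the ONL hosts:

/sbin/sysctl net.ipv4.tcp_congestion_control            # show the algorithm currently in use
/sbin/sysctl -w net.ipv4.tcp_congestion_control=bic     # switch to BIC (use reno to switch back)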

Exp# | CA/Mods | Bandwidth (BW) | Queue Length (QL) | Drops | Iperf
1 | Reno | Gigabit-tcp-reno-1flow-bw.png | Gigabit-tcp-reno-1flow-qlen.png | Gigabit-tcp-reno-1flow-drops.png | Gigabit-tcp-reno-1flow-iperf.png
2 | Reno, NIC | Gigabit-tcp-reno-1flow+nic-bw.png | Gigabit-tcp-reno-1flow+nic-qlen.png | Gigabit-tcp-reno-1flow+nic-drops.png | Gigabit-tcp-reno-1flow+nic-iperf.png
3 | Reno, NIC+Host | Gigabit-tcp-reno-1flow+nic+host-bw.png | Gigabit-tcp-reno-1flow+nic+host-qlen.png | Gigabit-tcp-reno-1flow+nic+host-drops.png | Gigabit-tcp-reno-1flow+nic+host-iperf.png
4 | BIC, NIC+Host | Gigabit-tcp-bic-1flow-bw.png | Gigabit-tcp-bic-1flow-qlen.png | Gigabit-tcp-bic-1flow-drops.png | Gigabit-tcp-bic-1flow-iperf.png

Observations:

  • XXXXX



Three-Flow Experiments

            >>>>> RERUN EXPERIMENTS BUT WITHOUT TCPDUMP RUNNING <<<<<

XXXXX

( NOTE: Click on the image to get an enlarged version of the figure. )

Exp# | CA/Mods | Bandwidth (BW) | Queue Length (QL) | Drops | Iperf
5 | Reno | Gigabit-tcp-reno-3flows-bw.png | Gigabit-tcp-reno-3flows-qlen.png | Gigabit-tcp-reno-3flows-drops.png | Gigabit-tcp-reno-3flows-iperf.png
6 | BIC | Gigabit-tcp-bic-3flows-default-bw.png | Gigabit-tcp-bic-3flows-default-qlen.png | Gigabit-tcp-bic-3flows-default-drops.png | Gigabit-tcp-bic-3flows-default-iperf.png
7 | BIC, NIC+Host | Gigabit-tcp-bic-3flows-bw.png | Gigabit-tcp-bic-3flows-qlen.png | Gigabit-tcp-bic-3flows-drops.png | Gigabit-tcp-bic-3flows-iperf.png

Observations:

  • XXXXX



To Do

Refs:

  • BIC paper: [1]
  • BIC TCP paper: [2]
  • Experiments with txqueuelen: [[1]]

References

  1. Lisong Xu, Khaled Harfoush and Injong Rhee, "Binary Increase Congestion Control (BIC) for Fast Long-Distance Networks", IEEE Infocom 2004, pp. XX-XX.
  2. Yee-Ting Li and Doug Leith, "Bic TCP Implementation in Linux Kernels", Feb. 15, 2004