NSP Filters, Queues and Bandwidth
A Quick Tour of Filters, Queues and Bandwidth
The Remote Laboratory Interface provides the user access to advanced features of the NSP hardware such as packet classification, queueing and redirection, bandwidth sharing and configurable parameters (e.g., link capacity). This section describes a simple experiment in which UDP flows from multiple sources passing through a bottleneck link are given different shares of the link bandwidth. The real-time display capability is used to verify that the system behaves as expected. The architectural features of Network Service Processors (NSPs) mentioned throughout this section are described more fully in the companion section NSP Architecture.
As shown in Fig. 1, the experiment uses a two-NSP topology and the iperf utility to send UDP traffic
from the three hosts n2p2, n2p3 and n2p4 to hosts n1p2, n1p3
and n1p4 through the bottleneck link joining port 6 of NSP 2
to port 7 of NSP 1.
We examine three cases in which the three flows are mapped to:
- datagram queues,
- a single reserved flow queue, and
- three separate reserved flow queues.
The default behavior at an egress port is to place packets in FIFO
datagram queues based on a hash function
computed over parts of the IP packet header.
However, the FPX (NSP_Architecture#Packet_Processing_in_the_FPX) has
three parallel lookup tables at each port (Fig. 2): 1) a
Route Table that uses longest prefix matching, 2) a
Flow Table that uses Exact Match (EM) filters, and 3) a
Filter Table that uses General Match (GM) filters.
Both EM and GM filters match on five fields of a packet's IP
header: the source and destination
IP address fields, the source and destination
transport layer port fields, and the protocol field.
But GM filters differ from EM filters in two respects: GM filters
allow wildcarding of any of the fields, and each GM filter
has an assignable priority.
When a packet matches multiple filters, the
highest priority entry is chosen.
In order to give special treatment to the three flows,
we use the Filter Table in the FPX to redirect the
three flows to separate reserved queues.
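The selection rule can be sketched in a few lines of shell. This is only an illustration and not the FPX implementation: the `pick_match` helper and the entry format are hypothetical, and the priority values (60 for a Route Table entry, 56 for an EM filter, 50 for a GM filter) are borrowed from examples elsewhere in this tutorial, where a lower number means a higher priority.

```shell
#!/bin/sh
# Hypothetical helper: read "priority table action" lines on stdin and
# print the entry with the numerically lowest priority value, since a
# lower number means a higher priority.
pick_match() {
    sort -n -k1,1 | head -n 1
}

# Three entries that all match the same packet (illustrative values).
printf '%s\n' \
    '60 route forward-to-port-6' \
    '56 em forward-to-port-7' \
    '50 gm forward-to-port-6' | pick_match
```

With these values the GM entry wins, so the packet would be handled as the GM filter specifies.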
Fig. 3 shows the GM filters used to direct the three flows to queues 300-302 respectively at egress port 6. Each source address/mask field matches the interface of one of the three sending hosts, and the destination address/mask fields match packets going toward the subnets associated with NSP 1. We have wildcarded the application port fields and the protocol field (even though we will be sending UDP traffic in most cases).
Fig. 4 shows the configuration parameters for the queues
at port 6 of NSP 2.
Since each port handles traffic in both directions, the
parameters for ingress and egress sides are shown.
In this experiment, the egress link capacity was entered as
300 Mbps, but the commit process set the actual link rate to
299.987 Mbps (explained later).
Since the internal switch capacity (not shown) has
been set to 600 Mbps, there is a 2:1 switch speed advantage (see NSP_Architecture#Gigabit_Router_Architecture).
The link bandwidth can be set to any rate up to 1 Gb/s.
The VOQs table contains the eight VOQs on the ingress side. Recall that packets in VOQ k are destined for output k. The threshold column indicates the discard threshold; i.e., the queue level (in bytes) above which arriving packets for that queue are discarded. The threshold has been set to the default value of 1,000,000 bytes. The rates column indicates the rate at which packets are transmitted from the VOQ when there is a backlog in the VOQ. The rate for each VOQ has been set to the default value of 600 Mbps. As described in a later section, it is also possible to have a Distributed Queueing (DQ) algorithm automatically determine the sending rate of the VOQs so as to avoid output link overload while minimizing output link underload.
The Egress Queue table shows the three reserved flow qids (300-302) and the datagram queues. There are actually 64 datagram queues. If a packet arrives at an egress port and does not match an EM or GM filter, it will be placed in one of these 64 datagram queues based on a hash function. However, EM and GM filters can be used to place packets in one of the reserved flow queues (qids 256-439). In this example, the flows are placed in queues 300-302. The relative values of the entries in the quantum field for the three reserved flow qids indicate the bandwidth shares of a Weighted Deficit Round Robin (WDRR) packet scheduling algorithm. The desired bandwidth ratios of queues 300-302 are approximately 1:2:3 (i.e., 2048:4000:6000), which gives the most bandwidth to queue 302. If there were packets also in the datagram queues, the quantum fields in all non-empty queues would determine the share of the 300 Mbps link that each queue would receive.
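As a rough check on the expected shares, the following sketch divides the link rate among backlogged queues in proportion to their quantum values. The `wdrr_shares` helper is hypothetical and assumes that every listed queue always has packets waiting and that the nominal 300 Mbps link rate applies; it does not model the scheduler itself.

```shell
#!/bin/sh
# Hypothetical helper: wdrr_shares LINK_MBPS QUANTUM...
# Prints each backlogged queue's expected share of the link in Mbps,
# assuming bandwidth is split in proportion to the quantum values.
wdrr_shares() {
    link=$1
    shift
    echo "$@" | awk -v link="$link" '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++)
            printf "quantum %d -> %.1f Mbps\n", $i, link * $i / total
    }'
}

# Queues 300-302 with the quantum values used in this experiment.
wdrr_shares 300 2048 4000 6000
```

Running `wdrr_shares 300 4000 6000` gives the split when only queues 301 and 302 are backlogged.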
Fig. 5 shows three plots. The top plot shows the bandwidths in incremental form. Specifically, the first solid curve shows the bandwidth entering the bottleneck link coming from the first flow, the second solid curve shows the bandwidth contributed by the first two flows and the third shows the total bandwidth contributed by all three flows. The dashed curves show the bandwidth leaving the bottleneck link. Note that the three sources are sending at an aggregate rate of over 750 Mbps, well over the 300 Mbps capacity of the bottleneck. The dashed curves indicate that the three UDP flows are receiving bandwidth in the proportion 1:2:3 when all three flows are active (middle section) and 2:3 (right end) when only qids 301 and 302 have packets. The middle plot shows the queue lengths of the three reserved flow queues, which are in the ratio 3:4:5 as required by the threshold settings. The bottom plot shows the number of packets discarded at egress port 2.6 due to overflows of queues 300-302.
Filters can also be used to re-direct individual flows or flow aggregates to different outgoing links than those specified by the routing tables. GM filters can also be configured to replicate matching packets and direct the copies to a different location which is useful for passive monitoring of a flow.
Flow Tables and Filter Tables
Now we will demonstrate the basic features of flow tables and filter tables; i.e., Exact Match (EM) filters and General Match (GM) filters. Once again we use the 2-NSP configuration from the preceding example (Fig. 6). Note that NSP 2 is on your left, and NSP 1 is on your right. Recall that the routing tables at every port have default entries except that there are extra entries for paths between NSPs. For example, at port 2.2 (NSP 2, port 2), there is a route entry of (192.168.1.0/24, 6) which will route all packets from host n2p2 through the top link 2.6-1.7; i.e., all packets destined for any interface on NSP 1 will go out port 2.6 (NSP 2, port 6) to port 1.7 (NSP 1, port 7). We will show how to override the route table entries using first an EM filter and then a GM filter.
The Flow Table and Exact Match (EM) Filters
We will override the route table entries with an EM filter so that ping packets from port 2.2 will flow over the bottom link 2.7-1.6 instead of the top link 2.6-1.7.
In Fig. 7, we select port 2.2 => Ingress Filters to open up a panel for the tables at port 2.2. The NSP2:port2 window will appear showing the NSP2:port2 Ingress Filters panel similar to the one shown to the rear of Fig. 8. The other menu entries in Fig. 7 provide access to the Route Table, Egress Filters, Queue Tables and the Plugin Table which will be described later. To also show the Routing Table, select NSP2:port2 => Tables => Routing Table.
We want to add an EM filter entry to the Flow Table that will match all ping packets from port 2.2 going to any interface at NSP 1 and force them to go through port 2.7. To do this, we must:
- Add a default EM filter;
- Enter a 5-tuple that will match the ping packet header fields;
- Specify that these matching packets should be forwarded to port 7 (instead of port 6); and
- Override any route entry that might also match the ping packets.
The top panel in Fig. 8 shows the result of the following operations:
- Add a default EM filter: Select Edit => Add Exact Match Filter.
- Enter the IP header fields: The fields of this 5-tuple are the full source IP address, the source port number, the full destination IP address, the destination port number, and the protocol. Since ping uses the ICMP protocol (protocol 1), which is handled by the OS kernel rather than by an application, set the port number fields to 0 and the protocol field to 1.
- Enter the forward to voq field: Port 7.
- Set the priority field to a value lower than 60, which is the priority of Route Table entries. Note that a lower number indicates a higher priority. The default value of 56 is usually sufficient, but Fig. 8 shows that it was changed to 58.
Note that the 5-tuple (src addr, src port, dst addr, dst port, proto) is (192.168.2.48, 0, 192.168.1.48, 0, 1), which will match any ICMP (ping) packet from host n2p2 to host n1p2. These packets will be forwarded through port 2.7 over the bottom link 2.7-1.6. Note also that we accept the default for the spc qid field of no plugin.
We commit the changes (select File => Commit) and proceed to test the filter. We will test our filter using one ping traffic generator that sends traffic from n2p2 to n1p2 (ping recipe: Recipes.html#ping-recipe) while monitoring the incoming and outgoing traffic at ports 2.6 and 2.7 (monitoring recipe: Recipes.html#monitoring-recipe).
If the EM filter were not installed, the ICMP echo request packets would travel along the top link 2.6-1.7 to the right and the returning ICMP echo reply packets would travel along the bottom link to the left. With the EM filter installed, we should see the ICMP echo request packets travel along the bottom link 2.7-1.6 to the right, while the returning ICMP echo reply packets still travel along the bottom link to the left, since the filter only affects the ICMP echo request packets.
Fig. 9 shows the traffic plot when the EM filter is installed at around T = 352. Prior to filter installation, traffic appears on the top to right and the bottom to left plots. But when the filter is installed, we still see return traffic on the bottom to left plot, but now we see traffic on the bottom to right plot instead of the top to right plot. ICMP echo request traffic is now flowing to the right along the bottom 2.7-1.6 link, and the ICMP echo reply traffic is still flowing to the left along the bottom 1.6-2.7 link. There is still no traffic flowing along the top-to-left link.
The Filter Table and General Match (GM) Filters
Now, we will show how to override the EM filter entry with a GM filter so that ping packets from port 2.2 will again flow over the top 2.6-1.7 link instead of the bottom 2.7-1.6 link.
In a similar fashion as before, Fig. 10 shows the GM filter:
- We use the NSP2:port2 Ingress Filters panel to define the GM filter.
- Select Edit => Add General Match Filter.
- Modify the default 5-tuple fields: Set the source IP address/mask and destination IP address/mask fields to match the ICMP echo request packets, and leave the default value of '*' in both port number fields and the protocol field. We have taken advantage of the fact that GM filters allow fields to be "wild-carded" (don't cares). Furthermore, note that we have chosen the source IP address/mask so that any packet in which the first 16 bits (2 octets) are 192.168 will match the source IP address field; i.e., any packet in the testbed network. We have also chosen the destination IP address/mask so that any packet in which the first 24 bits (3 octets) are 192.168.1 will match the destination IP address field; i.e., any packet in the testbed network destined for NSP 1.
- Enter the forward to voq field: Port 6.
- Set the priority field to a value lower than the priority of the EM filter (56 by default), so that the GM filter takes precedence. Note that each GM filter can have a different priority, whereas all EM filters share the same (but settable) priority.
Again, File => Commit to install the filter.
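The address/mask comparison used by the GM filter can be illustrated with a small sketch. The helpers `ip_to_int` and `prefix_match` are hypothetical names introduced here, and the code only mimics the prefix comparison described above; it says nothing about how the FPX performs the lookup. The host addresses are the ones used for n2p2 and n1p2 in the EM filter example.

```shell
#!/bin/sh
# Hypothetical helpers for checking an address against an address/mask pair.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# prefix_match ADDR NETWORK PREFIX_LEN: succeed if the first PREFIX_LEN
# bits of ADDR equal those of NETWORK. A length of 0 matches everything.
prefix_match() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    len=$3
    if [ "$len" -eq 0 ]; then
        return 0
    fi
    # 4294967295 is 2^32 - 1, i.e. 32 one bits.
    mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

prefix_match 192.168.2.48 192.168.0.0 16 && echo "source matches 192.168.0.0/16"
prefix_match 192.168.1.48 192.168.1.0 24 && echo "destination matches 192.168.1.0/24"
prefix_match 192.168.2.48 192.168.1.0 24 || echo "192.168.2.48 is outside 192.168.1.0/24"
```

An echo request from n2p2 to n1p2 therefore satisfies both address fields, while the reply travelling in the opposite direction fails the destination test.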
Fig. 11 shows that the GM filter was installed around T = 385
and has overridden the EM filter.
The traffic is again flowing to the right along the top link
(top to right) after T = 385.
Replicating Packets
A GM filter can also be configured to replicate every packet it matches. Replication might be useful in creating a duplicate traffic stream that is sent to a host or plugin for further analysis. In this section, we only illustrate how to duplicate packets.
Fig. 12 shows that a GM filter can be configured to
duplicate the ICMP echo request packets by checking
the aux box which defines an auxiliary GM filter.
The term auxiliary is meant to indicate that more
than one packet may be forwarded.
In this example, a Route Table (RT) entry and the EM filter
also match the ICMP echo request packets.
But since the EM filter has higher priority than the RT,
it will determine the disposition of the other copy of the
packet.
The end effect is that one copy of the packet will go to
port 2.6 because of the auxiliary GM filter, and the other
copy will go to port 2.7 because of the EM filter.
Fig. 13 shows the result of checking the aux box.
(You should compare Fig. 13 to the other traffic plots on this page.)
At T = 810, the GM filter
was installed, and now, traffic flows to the right on both
the top 2.6-1.7 link (top to right) and the bottom
2.7-1.6 link (bottom to right) between the two NSPs.
(The bottom to right plot is hidden behind the
top to right plot.)
Furthermore, note that the traffic volume on the return
path over the bottom link (bottom to left)
is about 6 Kb/s or twice the return traffic volume as before.
That is because there are now twice as many ICMP
echo reply packets as before: one for each original packet
and one for each copy.
This situation is shown in Fig. 14.
Generating Traffic With Iperf
Basic Usage
Iperf is a traffic generation tool that allows the user to experiment with different TCP and UDP parameters to see how they affect network performance. Full documentation can be found at the iperf documentation page (http://www.onl.wustl.edu/restricted/iperf.html). We give an overview of some of its basic features here.
The typical way that iperf is used is to first start one iperf process
running in server mode as the traffic receiver, and then start another
iperf process running in client mode on another host as the traffic sender.
In order to send a single UDP stream from n2p2 to n1p2 as
shown in Fig. 15, we would run iperf in server mode on n1p2 and iperf in
client mode on n2p2.
The iperf executable is located at /usr/local/bin/iperf on every ONL host.
Since the directory /usr/local/bin by default is located in every user's
PATH environment variable, you should be able to run iperf by entering
the iperf command followed by any command-line arguments.
For example, to reproduce the setup shown in Fig. 15 we would enter the following:
| Window | Command | Description |
|---|---|---|
| Window 1 | source ~/.topology | Define environment variables such as $n1p2. Use ~onl/.topology.csh if using a C-shell. (See The ~/.topology File) |
| | ssh $n1p2 | ssh to the iperf server host |
| | iperf -s -u | Run iperf as a UDP server: (-s) run as server; (-u) UDP |
| Window 2 | source ~/.topology | Define host environment variables |
| | ssh $n2p2 | ssh to the client host |
| | iperf -c n1p2 -u -b 200m -t 30 | Run iperf as a UDP client: (-c n1p2) run as client with server on n1p2; (-u) UDP; (-b 200m) 200 Mbps bandwidth; (-t 30) for 30 seconds |
Fig. 16 shows the resulting ssh windows.
In the server window (in back), the server reports that it is
listening on UDP port 5001, will receive 1470 byte datagrams and
uses the default 64 KByte UDP buffer.
In the client window (in front), the client reports that it is
connecting to UDP port 5001 and is using the same parameters as
the server.
It also shows that in the 30 second interval, it transferred 725 MBytes
at an average rate of 203 Mb/s.
This output is followed by the Server Report that shows
the same statistics along with some additional ones.
The fifth field "0.002 ms" is the jitter.
The next field "0/517244" is the lost/total datagram count: 517244 datagrams
were sent and 0 were lost, which results in a 0%
datagram loss.
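The reported totals can be cross-checked against the datagram count. The sketch below uses a hypothetical `iperf_totals` helper and assumes that iperf counts only the 1470-byte payload of each datagram, reports MBytes in units of 2^20 bytes, and reports Mb/s in units of 10^6 bits per second.

```shell
#!/bin/sh
# Hypothetical helper: iperf_totals DATAGRAMS PAYLOAD_BYTES SECONDS
# Prints the transferred volume in MBytes (2^20 bytes, assumed) and the
# mean payload rate in Mb/s (10^6 bits per second, assumed).
iperf_totals() {
    awk -v n="$1" -v size="$2" -v secs="$3" 'BEGIN {
        bytes = n * size
        printf "%.0f MBytes, %.0f Mb/s\n", bytes / 1048576, bytes * 8 / secs / 1000000
    }'
}

# Values from the report in Fig. 16: 517244 datagrams of 1470 bytes in 30 s.
iperf_totals 517244 1470 30
```

Under these assumptions the result agrees with the 725 MBytes and 203 Mb/s shown in the client window.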
At the end of a UDP traffic session, the iperf client sends out a special application layer FIN packet signalling the end of transmission. The server responds to the FIN with a reply containing the statistics for the session. The client will continue to send this FIN packet every 250 milliseconds until either the server responds with its statistics or a total of 10 FIN packets have been sent. Since the server never closes the receive socket, it is possible for the server to receive FIN packets from a preceding session thinking that they belong to a new session.
Fig. 17 shows the resulting traffic chart when the traffic flows to
the right over the top link at 200 Mb/s.
Notice that the measured traffic rate is over 200 Mb/s because the
chosen measurement points were inside the NSP where the physical
data units are ATM cells which contain an additional 10% cell
header overhead.
Note also that there is no measurable return traffic since the only
returning traffic is the server's report datagram after the client
has sent the FIN packet.
Fig. 17 shows another RLI feature: the ability to show a chart coordinate. This can be done by selecting View => Show Values and then clicking on a point in the chart.
The table below shows examples of useful variants of iperf. In UDP bandwidth specifications, "10m" represents 10 Mb/s. 'M' could also be used to signify 1,000,000. Similarly, 'K' and 'k' both indicate 1,000. The -n argument (number of bytes) in conjunction with -l (length of each datagram) is used to send a fixed number of datagrams of a specified length and is often used with small values when some forms of debugging are desired.
| Command | Description |
|---|---|
| iperf -h | Help: show usage information |
| iperf -s -u | Run an iperf UDP server |
| iperf -c n1p3 -u -b 10m -t 10 | Run an iperf UDP client with n1p3 as the server. Send at 10 Mb/s to n1p3 for 10 sec (1470-byte packets + 28 bytes of header) |
| iperf -c n1p3 -u -b 10m -l 1000 -n 8000 | Run an iperf UDP client with n1p3 as the server. Send at 10 Mb/s to n1p3. Send a total of 8000 bytes, 1000 bytes per packet |
| iperf -s -w 16m | Run an iperf TCP server with a 16 MB receive window |
| iperf -c n1p3 -t 10 | Run an iperf TCP client with n1p3 as the server for 10 sec |
Coordinated UDP Streams
We can produce coordinated UDP streams by writing two shell scripts: one launches multiple iperf servers, and the other launches the corresponding iperf clients. For example, the following Bourne shell script run-uservers will launch iperf UDP servers:

    #!/bin/sh
    # Usage: run-uservers
    # Example: ssh onl.arl; run-uservers
    # Note: Clients (Servers) are NSP2 (NSP1) hosts
    #
    # Define env vars $n1p2, ... ('.' is the portable sh form of 'source').
    . ~/.topology
    # Start one UDP server per receiving host, each in the background.
    ssh $n1p2 /usr/local/bin/iperf -s -u &
    ssh $n1p3 /usr/local/bin/iperf -s -u &
    ssh $n1p4 /usr/local/bin/iperf -s -u &

The command run-uservers will start three ssh processes running in the background. These processes will, in turn, each remotely run the iperf command on the three hosts defined by the values of the three environment variables $n1p2, $n1p3, $n1p4. Note that these hostnames are the external host names and not the internal host names (e.g., n1p2) since the ssh commands will be sent over the control network and not the internal testbed network. Alternatively, these names can be obtained by right clicking on the host icons shown in Fig. 18. But by using the .topology file, we have streamlined the launching of the servers and made the script transparent to experiment restarts. Note also that if /usr/local/bin is in your PATH environment variable, /usr/local/bin/iperf can be shortened to just iperf. Our objective is to run the iperf servers on n1p2, n1p3 and n1p4, and run the iperf clients on n2p2, n2p3 and n2p4 as shown in Fig. 18.
After the iperf servers have been started, the clients can be started with another script. For example, the following Bourne shell script run-uclients will launch iperf UDP clients:

    #!/bin/sh
    # Usage: run-uclients
    # Example: run-uclients
    # Note: Clients (Servers) are NSP2 (NSP1) hosts
    #
    # Define env vars $n2p2, ... ('.' is the portable sh form of 'source').
    . ~/.topology
    # Start the three clients 6 seconds apart, each sending for 30 seconds.
    ssh $n2p2 /usr/local/bin/iperf -c n1p2 -u -b 250m -t 30 &
    sleep 6
    ssh $n2p3 /usr/local/bin/iperf -c n1p3 -u -b 250m -t 30 &
    sleep 6
    ssh $n2p4 /usr/local/bin/iperf -c n1p4 -u -b 250m -t 30 &

The run-uclients script operates like the run-uservers
script except that it starts iperf clients on the hosts $n2p2, $n2p3,
and $n2p4, each with a 6 second staggered starting time.
Note that the server names are specified using the internal host
names n1p2, n1p3 and n1p4 since we want the traffic to go over
the testbed network and not the control network.
Fig. 19 shows the resulting traffic chart. The 2.6 output bw line shows the total traffic from n2p2-n2p4 going to the right over the top link out of port 2.6. The 1.7 input bw line shows the total traffic going through port 1.7. In the middle part of the plot, we can see that the total traffic going out of port 2.6 is around 850 Mb/s, but the total traffic going into port 1.7 is only about 680 Mb/s. That is because the capacity of the 2.6-1.7 link is only 600 Mb/s, the default link rate. Note again that both traffic rates in this figure include the ATM overhead of about 10%.
Monitoring an Ingress Port
Packets arriving at an ingress port are placed into separate Virtual Output Queues (VOQs) based on their forwarding port. Since each NSP has eight ports, there are eight VOQs at each input port. VOQ k contains packets that are to be forwarded to egress port k. The use of VOQs avoids the head-of-the-line (HOL) blocking problem that occurs when there is only a single queue. The section NSP Tutorial => Filters, Queues and Bandwidth => Queues describes in detail the NSP queues that are encountered as a packet travels from one host to another through an NSP.
We describe how the VOQs at ports 2.2 through 2.4 and port 1.7 of our two-NSP configuration can be monitored in a meaningful manner when there are three UDP flows. We also introduce the Add Formula dialogue box that allows us to define incremental bandwidth plots and other similar graphical compositions.
VOQ (Virtual Output Queue) Bandwidth
Fig. 20 shows the menus involved in monitoring the traffic volume passing through VOQ 6 at ingress port 2.2 (the left NSP). The recipe is:
- RLI: Port 2.2 => Ingress => Bandwidth to OPP. The Add Parameter dialogue box appears, allowing you to enter the VOQ and the polling rate.
- Add Parameter: The to port field indicates VOQ 6, and the Enter Polling Rate field indicates that the bandwidth will be determined every one second. One second is a typical choice since it is frequent enough to get accurate results in most cases yet not so often that it will overly tax the monitoring system. It is sometimes possible to monitor effectively as often as every 0.2 sec.
The bandwidth of VOQ 6 at ingress port 2.2 is computed by the RLI which receives the byte count at VOQ 6 at ingress port 2.2 every one second from a daemon running on the CP of NSP 2.
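The conversion from polled byte counts to a bandwidth value can be sketched as follows. The `counts_to_mbps` helper is hypothetical, and the sketch assumes that each poll returns a cumulative byte counter; the format actually reported by the daemon is not described here, so treat this as an illustration of the arithmetic only.

```shell
#!/bin/sh
# Hypothetical helper: counts_to_mbps INTERVAL_SECONDS
# Reads successive cumulative byte counts on stdin (one per poll, an
# assumed format) and prints the bandwidth in Mbps for each interval.
counts_to_mbps() {
    awk -v interval="$1" '
        NR > 1 { printf "%.1f\n", ($1 - prev) * 8 / interval / 1000000 }
        { prev = $1 }
    '
}

# Illustrative samples (not measured data): 25,000,000 bytes in one
# second corresponds to 200 Mbps.
printf '%s\n' 0 25000000 50000000 62500000 | counts_to_mbps 1
```

A shorter polling interval gives finer time resolution but multiplies the number of requests the monitoring system must serve.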
Fig. 21 shows the bandwidth chart of the following six UDP flows:
2.n-to-2.6 and 1.7-to-1.n for n=2, 3, 4.
But this chart is difficult to read because most of the lines
fall on top of each other.
Incremental Bandwidth (Add Formula)
Fig. 22 shows plots which are easier to read because they show the
incremental bandwidths; i.e., the 2.3-to-2.6 bandwidth
(labeled +2.3 to 2.6)
is "stacked" on top of the 2.2-to-2.6 bandwidth, and the
2.4-to-2.6 bandwidth (labeled +2.4 to 2.6)
is stacked on top of the sum of the 2.2-to-2.6 and 2.3-to-2.6 bandwidths.
Similarly, the 1.7-to-1.n bandwidth lines are stacked on top of each other.
We now describe how to construct an incremental bandwidth chart.
In Fig. 23, we laid down the base 2.2-to-2.6 plot in the standard manner.
Now, we want to add two more lines each stacked on top of the
previous line and then repeat the process for the 1.7-to-1.n lines.
In order to display the bandwidth in incremental format, we select the Parameter => Add Formula menu item in the display panel, and the Add Formula dialogue box appears.
The Add Formula feature acts just like a calculator. Suppose that the three plots of 2.n-to-2.6 (n = 2, 3, 4) are represented by the three time series X[], Y[] and Z[] where X[] represents the sequence X[0], X[1], .... In effect, we would like to display X[i], X[i]+Y[i] and X[i]+Y[i]+Z[i] for i = 0, 1, 2, .... In the RLI, however, we specify each time series by selecting the corresponding monitoring item rather than by naming X[], Y[] and Z[].
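The stacking itself is just a running sum across the three series. A minimal sketch, using a hypothetical `stack_series` helper and made-up sample values rather than measured data:

```shell
#!/bin/sh
# Hypothetical helper: each input line holds one sample of X, Y and Z.
# Prints X, X+Y and X+Y+Z, the three curves of an incremental chart.
stack_series() {
    awk '{ print $1, $1 + $2, $1 + $2 + $3 }'
}

# Illustrative samples in Mbps: the second and third flows start later.
printf '%s\n' '250 0 0' '250 250 0' '250 250 250' | stack_series
```

Each output column corresponds to one curve, so the gap between adjacent curves is the contribution of a single flow.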
The recipe for creating the +2.3 to 2.6 plot (after the
2.2 to 2.6 plot) is shown in Fig. 24 and in the table below.
| Window/Panel | Selection/Entry | Explanation |
|---|---|---|
| incremental bw | Parameter => Add Formula | Open Add Formula panel |
| Add Formula | name: +2.3 to 2.6 | Enter the name (label) of the curve |
| Main RLI | Port 2/2 => Bandwidth to OPP | Opens Add Parameter window |
| Add Parameter | to port: 6 | VOQ (output port) number is 6. "M(2.2 to 2.6)" label appears on Add Formula panel |
| | Enter Polling Rate 1 | 1 second polling rate |
| Add Formula | Select + button | We want to add another time series |
| Main RLI | Port 2/3 => Bandwidth to OPP | Opens Add Parameter window |
| Add Parameter | to port: 6 | VOQ (output port) number is 6. Label in Add Formula panel becomes "M(2.2 to 2.6)+M(2.3 to 2.6)" |
| | Enter Polling Rate 1 | 1 second polling rate |
| Add Formula | Select Add button | Add formula to the incremental bw chart |
The traffic chart will look like Fig. 25.
We repeat the above procedure to add the other incremental
bandwidth plots.
Note that when we set up the +2.4 to 2.6 plot we will
have to specify three monitoring points: the Bandwidth to OPP
(VOQ bandwidth) for the three traffic bandwidths flowing from
2.n to 2.6 (n=2, 3, 4).
The result will be Fig. 22.
Modifying Link Rate
By default, the capacity of each link leaving an egress (output) port is set to 600 Mbps. This link rate can be modified through the Queue Tables menu item at each port (Fig. 26).
In this example, we want to set the link rate to 300 Mbps instead
of its default 600 Mbps.
Fig. 27 shows three versions of the Queue Table panel for port 2.6.
The leftmost panel shows that the default link bandwidth is 600 Mbps.
We have set the desired link bandwidth to 300 Mbps in the middle panel.
The actual link bandwidth is 299.987 Mbps (shown in the rightmost panel)
after committing.
Three comments are in order:
- The bandwidth is controlled by a token bucket traffic regulator which is described in the page NSP Tutorial => Filters, Queues and Bandwidth => Link Rate.
- The difference between the desired and actual link rate is due to the precision of the token bucket implementation which is 61 Kbps; the actual rate is an integer multiple of 61 Kbps.
- The traffic rate is the IP packet traffic rate and does not include any overhead other than the IP header.
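The rounding in the second comment can be reproduced numerically. The sketch below rests on an assumption that is not stated in this tutorial: that the 61 Kbps step is a rounded form of 10^9 / 2^14 bps (about 61.035 Kbps) and that the requested rate is rounded down to a whole number of steps. The `quantize_rate` helper is hypothetical and needs a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# quantize_rate REQUESTED_BPS: round the requested rate down to a whole
# multiple of the step size and print the result in bits per second.
# ASSUMPTION: the step is exactly 10^9 / 2^14 bps (about 61.035 Kbps);
# the tutorial only gives the rounded figure of 61 Kbps.
quantize_rate() {
    steps=$(( $1 * 16384 / 1000000000 ))
    echo $(( steps * 1000000000 / 16384 ))
}

quantize_rate 300000000
```

Under that assumption a 300 Mbps request becomes 299,987,792 bps, which is consistent with the 299.987 Mbps displayed after the commit.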
The Port 6 Queues panel in the Tables window shown in Fig. 27 also contains a number of other parameters related to the ingress and egress queues (these parameters will be described later):
- The 8 VOQs at ingress port 6 have threshold and quantum parameters associated with the scheduling of packets going from input port 6 to all of the output ports (0 to 7). The quantum values are only user definable if the Distributed Queueing (DQ) algorithm is disabled. If DQ is enabled, the fields will appear grey, and the user will not be able to modify them.
- The datagram queues have threshold and quantum parameters associated with the scheduling of packets exiting output port 6.
The threshold parameter indicates the maximum queue length. For example, the threshold for each of the VOQs is 1,000,000 bytes. All queues use a tail drop discard policy, so an arriving packet will be dropped if adding it to the queue would exceed the queue threshold.
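The tail drop rule amounts to a single comparison per arriving packet. A minimal sketch, using a hypothetical `tail_drop` helper with all sizes in bytes:

```shell
#!/bin/sh
# tail_drop QUEUE_BYTES PACKET_BYTES THRESHOLD_BYTES
# Prints "enqueue" if the packet fits, or "drop" if adding it would push
# the queue length past the threshold.
tail_drop() {
    if [ $(( $1 + $2 )) -gt "$3" ]; then
        echo drop
    else
        echo enqueue
    fi
}

tail_drop 998000 1500 1000000   # 999,500 bytes after adding: fits
tail_drop 999000 1500 1000000   # 1,000,500 bytes after adding: too long
```

Packets already in the queue are never removed to make room; only the new arrival is affected.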
We set the link rate at port 2/6 to 300 Mbps using the following recipe (Fig. 27):
| Window/Panel | Selection/Entry | Explanation |
|---|---|---|
| Main RLI | Port 2/6 => Queue Tables | Opens Tables window for port 2.6 |
| Port 6 Queues | Link Bandwidth(Mbps): 300 | Define link rate to be 300 Mbps |
| Main RLI | File => Commit | Actually set the link rate |
Fig. 28 shows the effect of reducing the capacity of the
2.6-to-1.7 link from 600 Mbps to 300 Mbps.
The solid lines show the traffic rates of the UDP
flows measured at input ports 2.2 through 2.4.
The dashed lines show that the traffic rates of the flows
at port 1.7 have been reduced to the point where the
aggregate rate (the +1.7 to 1.4 curve) is 300 Mbps,
the capacity of the bottleneck link (NSP 2's egress port 6).
Fig. 28 shows that at around T = 259, the traffic from port 2.3 is competing with the one from port 2.2, and the total traffic exceeds 300 Mbps. At T = 267, the traffic from port 2.4 starts, and the three flows now compete for the limited bandwidth at port 2.6. But at T = 282, the traffic from port 2.2 stops leaving only two flows to compete for the link.
Mapping Flows to a Single Queue
Earlier, we used a General Match (GM) filter at an ingress port to map flows to VOQs and therefore forward packets to an egress port different from the one specified by the Route Table. This page shows how we can give special treatment to some flows at an egress port through the use of GM filters. This can affect the scheduling of the packets since queues in an egress port are scheduled by a weighted DRR (Deficit Round Robin) scheduler.
Fig. 29 shows the port table at egress port 2.6
obtained by selecting Port 2.6 => Egress Filters in the
main RLI window; i.e.,
it shows the egress side Flow Table (EM Filters) and
Filter Table (GM Filters).
The GM Filter Table was defined by selecting
Edit => Add General Match Filter in the Egress Filters
panel for the three filters.
The three filters at egress port 2.6 match packets going to any destination (0.0.0.0/0) in NSP 1 from n2p2, n2p3 and n2p4 respectively. Note that the port and protocol fields in all three filters are wild-carded so that they don't care about these header fields. All three filters map their matching packets to QID 300, one of the reserved flow QIDs (256-439). The priority has been set to 50 for all three filters to give them higher priority than the Route Table. Our remaining task is to define the queueing characteristics of queue 300.
Fig. 29 also shows the Queue Table obtained by selecting Tables => Queue Table and then adding an entry for QID 300 by entering Edit => Add Egress Queue in the Port 6 Queues panel. We have accepted the default threshold value (32,000 bytes) and quantum value (2,048) for now. Since the 64 datagram queues and reserved queue 300 all have a quantum of 2048, they will all get an equal share of the link bandwidth. We could give a greater share of the bandwidth to the packets matching the GM filters by increasing the quantum field for queue 300. We enter File => Commit in the main RLI window to install the new parameters.
Fig. 30 shows how to monitor queue 300 at egress port 2.6 to
demonstrate that queueing is occurring.
We select Port 6 => Egress => Qlength in the main RLI
window, and fill out the Add Parameter dialogue box to
monitor queue 300 every 1 second.
Fig. 31 shows both the incremental bw and the
Queue Lengths charts aligned to show the correspondence
between the traffic and queue length.
The traffic chart is still showing the traffic volume through
VOQ 6 at ingress ports 2.2, 2.3 and 2.4 and through VOQs 2, 3 and 4
at ingress port 1.7 when the outgoing link at port 2.6 has a
capacity of 300 Mbps.
At T = 1,541, queue 300 develops a backlog since the traffic
coming into egress port 2.6 is 500 Mbps or 200 Mbps over the
link capacity.
This backlog remains until T = 1,573 when only the one 250 Mbps
traffic source from n2p4 is still transmitting.
Note also that the peak queue length is near 32,000 bytes,
the threshold value of queue 300.
A Larger Threshold for Queue 300
Fig. 32 shows that we have now increased the threshold
to 1,000,000 bytes allowing a greater maximum queue backlog
before discarding packets.
As expected, Fig. 33 shows that queue 300 at egress port 2.6 now has a peak queue backlog of 1,000,000 bytes instead of 32,000 bytes.
Packet Drops
We can verify that the threshold for queue 300 is in fact
forcing packet drops by monitoring the packet drops at egress
port 2.6.
Fig. 34 shows how to set up this monitoring by selecting
Port 2.6 => Egress => FPX General Counter and electing
to watch FPX Counter 66
(see
FPX Counters for other drop counters).
Fig. 35 shows that indeed egress port 2.6 is dropping
around 17,000 packets every 3 seconds when two flows are
competing for the egress link, and around 39,000 packets every
3 seconds when all three flows are competing for the same link.
A rough calculation confirms that the Discards chart above reflects the correct behavior. When there are two 250 Mbps flows (beginning around T = 2,770), the aggregate input rate to port 2.6 is 500 Mbps or 200 Mbps over the 300 Mbps egress link rate; that is, the +2.3 to 2.6 plot is around 500 Mbps, and the +1.7 to 1.3 plot is around 300 Mbps. For 1,500 byte (12,000 bit) packets, 200 Mbps translates into about 16,700 pps (packets per second). This compares well to the 17,000 discards in the Discards chart. When there are three 250 Mbps flows, the aggregate input rate to port 2.6 is 750 Mbps or 450 Mbps over the 300 Mbps output link rate. Since this excess rate of 450 Mbps is 2.25 times the 200 Mbps excess rate for two flows, we expect the number of discards to be about 2.25 times 16,700 pps or 37,575 pps. This compares well to the 38,000 discards in the Discards chart.
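The rough calculation can be reproduced with integer shell arithmetic. This is only a sketch: the function name excess_pps is hypothetical, and it assumes that every packet is 1,500 bytes and that all traffic above the link rate is discarded.

```sh
#!/bin/sh
# excess_pps FLOWS FLOW_MBPS LINK_MBPS PKT_BYTES
# Hypothetical helper: expected discards per second when FLOWS equal-rate
# flows share one link, assuming everything above the link rate is dropped.
excess_pps() {
    excess_mbps=$(( $1 * $2 - $3 ))
    [ "$excess_mbps" -gt 0 ] || excess_mbps=0
    # Mbps -> bits/s, then divide by bits per packet (integer division).
    echo $(( excess_mbps * 1000000 / ($4 * 8) ))
}

excess_pps 2 250 300 1500   # two flows: 200 Mbps over the link rate
excess_pps 3 250 300 1500   # three flows: 450 Mbps over the link rate
```

The two calls print 16666 and 37500, in line with the rounded 16,700 pps and 37,575 pps figures used in the text.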
Mapping Flows to Separate Queues
Instead of mapping all three UDP flows to queue 300 at egress port 2.6, we could map them to separate queues so that we can give guaranteed bandwidth to each flow. We will still use the Queue Table to define the packet scheduling characteristics, but now we will map each of the three UDP flows to its own queue among queues 300-302.
In Fig. 36, the qid fields in the Egress Filters table
have been changed so that the packets from
n2p3 and n2p4 will now go to queues 301 and 302 respectively
instead of queue 300.
And, entries for queues 300-302 have been added to the
Port 6 Queues table
(select Tables => Queue Table in the NSP2:port6
window to show the table).
The packet scheduling parameters for queues 300-302
have been chosen so that the packet discard thresholds are
different (30,000, 40,000, 50,000) but the quantum fields are the
same (2,048).
Fig. 37 shows the effect on the incremental bw traffic
chart and the Queue Lengths chart.
Because the quantum parameters for all three queues are
equal to 2,048, they will all get an equal share of the link
bandwidth.
But because the threshold parameters are in the ratio
3:4:5, their maximum queue lengths will be in that
ratio with the largest queue topping out at 50,000 bytes.
Fig. 38 shows the new Queue Table where we have changed the
quantum parameters for queues 300-302 so that the bandwidth
shares of these queues are now in the approximate ratio of 1:2:3.
Now, queue 301 should get about
twice as much bandwidth as queue 300, and queue 302 should get
about 1.5 times as much bandwidth as queue 301 under full load.
Fig. 39 shows the effect of these changes on the bandwidth, queues
and discards.
The five time periods exhibit the expected bandwidth behavior:
- Before T=986: Since n2p2-n1p2 is the only flow active, it gets the entire link bandwidth and transmits at 250 Mbps; there is no queueing; and there are no packet discards.
- Between T=986 and T=994: Now flow n2p3-n1p3 joins in and there are two competing flows with an aggregate demand of 500 Mbps which is 200 Mbps over the 300 Mbps link bandwidth. The incremental bandwidth chart shows that the bandwidth ratio for these two flows is 1:2; i.e., queue 300 gets 100 Mbps, and queue 301 gets 200 Mbps. Because the link is oversubscribed, backlogs develop for both queue 300 and 301, and packets are discarded.
- Between T=994 and T=1,008: Flow n2p4-n1p4 becomes active, and there are three competing flows with an aggregate demand of 750 Mbps which is 450 Mbps over the 300 Mbps link bandwidth. The incremental bandwidth chart shows that the bandwidth ratio for the three flows is 1:2:3; i.e., queue 300 gets 50 Mbps, queue 301 gets 100 Mbps, and queue 302 gets 150 Mbps. Because the link is still oversubscribed, there are still backlogs at all three queues, and packets are discarded.
- Between T=1,008 and T=1,015: The two flows n2p3-n1p3 and n2p4-n1p4 are active and share the bottleneck link in a 2:3 ratio (i.e., 120 Mbps and 180 Mbps).
- After T=1,015: Since only the n2p4-n1p4 flow is active, all of its 250 Mbps traffic is transmitted through the bottleneck link.
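The per-period bandwidth figures in the list above follow from splitting the 300 Mbps link in proportion to the quantum values of the backlogged queues. The sketch below assumes an exact 1:2:3 quantum ratio (the text gives the ratio only approximately) and assumes each listed flow offers more traffic than its share; wdrr_shares is a hypothetical helper name.

```sh
#!/bin/sh
# wdrr_shares LINK_MBPS WEIGHT...
# Hypothetical helper: split LINK_MBPS among backlogged queues in
# proportion to their quantum weights (integer Mbps).
wdrr_shares() {
    link=$1; shift
    total=0
    for q in "$@"; do total=$(( total + q )); done
    out=
    for q in "$@"; do out="$out $(( link * q / total ))"; done
    echo $out   # unquoted on purpose: drops the leading space
}

wdrr_shares 300 1 2      # queues 300 and 301 backlogged
wdrr_shares 300 1 2 3    # queues 300, 301 and 302 backlogged
wdrr_shares 300 2 3      # queues 301 and 302 backlogged
```

The three calls print "100 200", "50 100 150" and "120 180", matching the shares quoted for the three congested periods.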
During the periods of congestion, the Discards chart is nearly identical to the one described in the page Mapping Flows to a Single Queue because the excess input traffic rate is the same as before. The only difference is that the flows are now mapped to separate queues whose total amount of buffering is less than before, so the onset of packet drops should be earlier than before (which is hard to see in the chart).
Using Exact Match Filters
Exact Match (EM) filters can also be used in a manner similar to GM filters. We will use an EM filter to override the GM filter associated with queue 300. But when using EM filters with UDP flows, we need to know all of the IP header fields in the filter, which requires that we use the Unix netstat command.
Fig. 40 shows how to use the netstat command to determine the
port number fields in the EM filter.
The command netstat -a displays all of the listening and
non-listening sockets, and grep n2p2 grabs only those
lines containing the string n2p2 which is the name of the iperf
client host.
The output here shows that the client is using port 32784 (an
ephemeral port assigned by n2p2's operating system)
and the server is using port 5001 (the default iperf server port).
Fig. 41 shows that we have added an EM filter that will send
packets from n2p2 (192.168.2.48) to queue 304.
The first GM filter will also match UDP packets from n2p2, but
its priority (50) is lower than the EM filter priority of 40
(a lower priority number indicates a higher priority).
Fig. 42 shows that queue 304 (2.6 Q304 line) is now getting
a backlog, but queue 300 (2.6 Q300 line) no longer has a backlog.
Using Iperf With TCP
Using iperf to generate TCP traffic is not much different from generating UDP traffic, except that the receiver's maximum window size can have a significant impact on the throughput. The default server (receiver) window size is 64 KB or about 42 full-size TCP segments. If we insert a 100 msec delay along a TCP flow's path (discussed later), a 64 KB receiver buffer will limit the throughput to about 5 Mbps even if the link rates are 600 Mbps. All of the ONL hosts allow receiver windows to be up to 20 MB if the user so chooses.
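The 5 Mbps figure comes from the fact that a sender can have at most one receiver window in flight per round trip. As a rough sketch, assuming the inserted 100 msec delay is the whole round-trip time and taking 64 KB as 65,536 bytes, the bound is window divided by round-trip time (window_limit_kbps is a hypothetical helper name):

```sh
#!/bin/sh
# window_limit_kbps WINDOW_BYTES RTT_MS
# Upper bound on TCP throughput when at most one receiver window can be
# in flight per round trip: window / RTT, reported in Kbps.
window_limit_kbps() {
    # bytes * 8 gives bits; bits per millisecond equals Kbps.
    echo $(( $1 * 8 / $2 ))
}

window_limit_kbps 65536 100      # 64 KB window, 100 msec round trip
window_limit_kbps 4194304 100    # 4 MB window, 100 msec round trip
```

The first call prints 5242 (about 5 Mbps); the second prints 335544 (about 336 Mbps), which shows why a larger receiver window matters on a long-delay path.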
#!/bin/sh
# Usage: run-tservers
# Example: ssh onl.arl; run-tservers
# Note: Clients (Servers) are NSP2 (NSP1) hosts
#
source ~/.topology # define env vars $n1p2, ...
ssh $n1p2 /usr/local/bin/iperf -s -w 4M &
ssh $n1p3 /usr/local/bin/iperf -s -w 4M &
ssh $n1p4 /usr/local/bin/iperf -s -w 4M &
Fig. 43. TCP Server Script.
#!/bin/sh
# Usage: run-tclients
# Example: run-tclients
# Note: Clients (Servers) are NSP2 (NSP1) hosts
#
source ~/.topology # define env vars $n2p2, ...
while true; do
ssh $n2p2 /usr/local/bin/iperf -c n1p2 -t 30 &
sleep 10
ssh $n2p3 /usr/local/bin/iperf -c n1p3 -t 30 &
sleep 10
ssh $n2p4 /usr/local/bin/iperf -c n1p4 -t 30 &
sleep 40
done
Fig. 44. TCP Client Script.
Fig. 43 and 44 show two scripts: a TCP iperf server script, and a TCP iperf client script. These two scripts are typically run on the ONL login host to remotely start the iperf TCP clients and servers and are similar to the corresponding iperf UDP scripts described in Generating Traffic With Iperf. The two main differences are that the client script here continuously loops, and both scripts use TCP-specific parameters.
The run-tservers script launches three servers in the background, all with a receiver window size of 4 MB, allowing roughly 2,800 full-size (1,500-byte) TCP segments to be in transit. The command 'run-tservers' will launch iperf servers running in the background on ONL hosts $n1p2, $n1p3, and $n1p4. Note that we use the host interface names on the control network and not the internal network.
The run-tclients script launches three remote clients but with start times that are staggered by 10 seconds to give an offset to the traffic charts. The command 'run-tclients' will launch three iperf clients in the background. The -c command-line argument indicates where the server is running and uses the internal network interface name (e.g., n1p2, n1p3, n1p4). The -w option that defines the receiver window size is unnecessary in this example since we are using bulk transfer only to the server (i.e., there is no bulk transfer in the reverse path). The final flag -t specifies the length of time (seconds) that the client should send traffic. The sleep 40 command at the end of the while loop is used to allow all of the flows to finish before having the iperf clients repeat the whole demonstration again.
Fig. 45 shows the traffic and queue length charts in
the two-NSP configuration we used before but with the
three UDP flows replaced by TCP flows.
In that configuration, GM filters placed the three
flows into reserved flow queues 300-302 which were
assigned equal bandwidth shares (quantum).
As before with UDP flows, we see in the traffic chart that
the three flows receive equal bandwidth out of egress
port 2.6.
But now, we see the familiar sawtooth shape in the
queue length charts due to TCP congestion control.
Filtering With TCP Flags
There are six TCP flag bits in the TCP header which can be used to identify packets that have any of these bits ON:
- A connection is starting (SYN).
- A connection is ending (FIN).
- A connection is being reset (RST).
- The packet is an acknowledgement (ACK).
- Pass all outstanding packets at the receiver to the application as soon as possible (PSH).
- There is urgent data in the packet (URG).
For example, we can count the number of TCP connection attempts by
matching all TCP packets with the SYN bit on if we know the
sender's IP address.
Or, we can simulate a SYN flood attack by dropping all TCP packets
from a sender except for its SYN packet.
You can filter on the TCP flag fields (SYN, ACK, FIN, RST, PSH, URG)
using a GM (General Match) filter (Fig. 46).
Fig. 46 shows an egress GM filter that could be used to count TCP connection
attempts; i.e., it will match a TCP SYN packet coming from anywhere and
going anywhere and forward it in the usual manner.
The main things to remember are:
- The TCP flag fields can have only one of three values: * (don't care), 0, or 1.
- The only time a TCP flag field can be a value other than don't care (*) is if the protocol field is tcp.
- A packet matches a GM filter if it matches the 5-tuple portion of the filter (src address, src port, dst address, dst port, protocol) and all of the TCP flag fields.
- An auxiliary filter cannot drop packets.
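The three-valued flag rule in the list above can be illustrated with a small matcher. This is a sketch, not the FPX implementation: the six-character string encoding and the order SYN, ACK, FIN, RST, PSH, URG are assumptions made only for this example, and flags_match is a hypothetical name.

```sh
#!/bin/sh
# flags_match PATTERN BITS
# PATTERN: six characters, each '*' (don't care), '0' or '1'.
# BITS:    six characters, each '0' or '1', taken from the packet.
# Returns 0 on a match, 1 on a mismatch, 2 on malformed input.
flags_match() {
    pattern=$1; bits=$2
    [ "${#pattern}" -eq 6 ] && [ "${#bits}" -eq 6 ] || return 2
    while [ -n "$pattern" ]; do
        p=${pattern%"${pattern#?}"}   # first character of the pattern
        b=${bits%"${bits#?}"}         # first character of the packet bits
        case $p in
            '*') ;;                               # wildcard matches anything
            0|1) [ "$p" = "$b" ] || return 1 ;;   # must equal the packet bit
            *) return 2 ;;
        esac
        pattern=${pattern#?}; bits=${bits#?}
    done
    return 0
}

# SYN must be on, everything else is don't care.
flags_match '1*****' '100000' && echo "SYN packet matches"
# SYN on and ACK off: a SYN+ACK packet does not match.
flags_match '10****' '110000' || echo "SYN+ACK packet does not match"
```

A full GM filter would also have to match the 5-tuple portion; only the flag comparison is sketched here.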
The Using TCP Flags example demonstrates how to use the TCP-flags feature to:
- Count TCP connection attempts
- Count TCP flow completions
- Simulate a SYN flood attack
A new version of the SYN Attack Mitigation Demo will also demonstrate how TCP flag filtering can be used.
Sampling Filters
Sampling filters allow you to probabilistically select a fraction of the packets that match an auxiliary filter and are typically used for monitoring packet flows. You may want to skip this section for now and return to it after you have read about how to use router plugins. Although a sampling filter can be used without a plugin, it becomes much more useful in conjunction with a plugin that can process the sampled packets.
The main things to remember are:
- The filter must be an auxiliary filter.
  Remember that an auxiliary filter is denoted by selecting the aux checkbox in a GM filter. This means that a duplicate packet will be generated if there is another matching filter or route table entry, and it is this duplicate packet that may be selected by sampling. Of course, if the auxiliary filter is the only matching filter, a duplicate packet is not made.
- The sampling probability menu offers you only four choices: 100%, 50%, 25% or 12.5%.
  For example, if you choose 25%, about 25% of the matching packets will be selected and about 75% will be dropped.
- The selection is probabilistic.
  This means that every time there is a matching packet, the FPX flips a coin based on the sampling probability to determine if the packet should be dropped or not. If n packets are considered for sampling when the sampling percentage is 25%, N packets will be selected where N follows a binomial distribution (the sum of n Bernoulli trials) with a success probability of 0.25. For example, if 100 packets are matched, the actual number selected by sampling at the 25% level might not be 25.
For example, suppose we use the one-NSP configuration shown in Fig. 47 and send traffic from n1p2 to n1p3. We install an auxiliary filter (shown in Fig. 48) at egress port 3 that will sample at the 25% level and send all selected packets to a stats plugin. The filter shows that selected packets are sent to SPC queue 8 which has a QID of 136. The stats plugin counts the number of ICMP, TCP, and UDP packets and then drops the packet. It is summarized in Predefined Plugins and its use is described in Monitoring With A Plugin.
The plugin table at port 3 is shown in Fig. 49. It is linked to the filter via the SPC QID of 8. That is, packets selected by the filter are sent to the SPC via SPC QID 8, and the plugin reads packets from SPC QID 8. To test the capabilities of the sampling filter, we send four consecutive flows through port 3:
- Flow 1: 32 ping packets
- Flow 2: 32 ping packets
- Flow 3: 32 ping packets
- Flow 4: 500 UDP packets
We send a command to the stats plugin requesting that it display the number of ICMP, TCP and UDP packets that it has seen to show how many packets the plugin has been sent by the filter. We do this before each flow and then after the last flow. The response from the plugin is shown in Fig. 50.
The output shows that the numbers of ICMP (ping) packets seen by the stats plugin for the first three flows are 9, 13, and 10. Note that the count is not exactly 25% of the 32 packets sent, and the count is different in each case. In the case of the UDP flow, the plugin counted 122 packets. Again, the count is not exactly 25% of the 500 packets sent; i.e., not 125.
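As a rough check on these counts, assume that each matching packet is selected independently with probability p = 0.25 (an assumption about the sampler, consistent with the coin-flip description). The number selected, N, out of n packets then has mean and standard deviation:

```latex
\[
E[N] = np, \qquad \sigma_N = \sqrt{np(1-p)}
\]
\[
n = 32:\quad E[N] = 8,\quad \sigma_N = \sqrt{6} \approx 2.4
\qquad\qquad
n = 500:\quad E[N] = 125,\quad \sigma_N = \sqrt{93.75} \approx 9.7
\]
```

Under this assumption the counts 9, 10 and 122 lie within one standard deviation of the mean, and 13 is about two standard deviations above it, which is uncommon but not implausible for a single run.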
Recipes (Flow Tables and Filter Tables)
These recipes are referenced from the main pages of this section.
Below are the recipes for monitoring and sending ping traffic:
- RLI: Port 2.6 => Egress => Port Bandwidth
  An Add Parameter dialog box appears.
- Add Parameter: Enter 0.3 seconds (select the 1 sec default if you wish).
  A monitoring window appears with the label OPPBW-2.
- Select the OPPBW-2 label.
  A label dialog box appears.
- Enter top to right as the new label.
- Repeat for the other three monitoring points:
  - Port 1.6 => Egress => Port Bandwidth, Label: bottom to left
  - Port 1.7 => Egress => Port Bandwidth, Label: top to left
  - Port 2.7 => Egress => Port Bandwidth, Label: bottom to right
- Enter: source ~/.topology
  Define ONL environment variables. Use ~/.topology.csh if your shell is a C-shell instead of the bash shell.
- ssh to $n2p2 from the ONL login host (note that $n2p2 is the external interface name)
- Enter: ping n1p2
  You should see the default ping output of one line every second showing how many bytes were sent and the round-trip time.
Queues
Packets are held in queues as they move from one NSP port to another. Fig. 51 shows 256 of the 512 FPX queues that a packet can encounter as it moves from n1p2 to n1p7. The 256 queues associated with the SPC are not shown and will be described in the section on router plugins.
When a packet enters the ingress side of an NSP, it will be placed
in one of eight VOQs (Virtual Output Queues) as it awaits transmission
to the appropriate output port.
All packets destined for output port k are placed in VOQ k and are
serviced in FCFS order.
This arrangement prevents head-of-the-line blocking that is found in
systems using only a single queue since packets destined for output port k
do not have to wait on packets going to any other output ports.
The eight VOQs correspond to internal queues numbered from 504 to 511.
The VOQs are serviced by a Distributed Queueing (DQ) algorithm if DQ is
turned ON or by rate controlled token buckets if DQ is turned OFF.
After the packet gets to the egress side (output port), it will normally be placed in one of 64 datagram queues numbered from 440 to 503. The specific datagram queue is determined by a hash function. However, the user can install a filter that directs the packet to one of 184 reserved flow queues numbered from 256 to 439. The user can give preferential treatment to a reserved flow queue by setting its quantum field in the Queue Table higher than another flow's quantum value.
The egress queues are serviced by a Weighted Deficit Round Robin (WDRR) algorithm which replenishes each non-empty queue's deficit counter with its quantum value. For example, suppose we set the quantum value for reserved flow queues 300 and 301 to 2,048 (the default) and 4,096 respectively. These settings will result in flows to queue 301 getting double the service rate given to queue 300. While each queue's quantum determines each queue's relative service rate, a token bucket (described in the Link Rate page) controls the transmission rate of the link.
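The effect of the quantum on service rate can be seen in a simplified deficit round robin loop. This sketch is not the FPX scheduler: it assumes the queue is always backlogged with 1,500-byte packets, that a queue may send as many packets per visit as its deficit counter allows, and that the link token bucket never stalls the scheduler. The name drr_packets is hypothetical.

```sh
#!/bin/sh
# drr_packets QUANTUM PKT_BYTES ROUNDS
# Count the packets one always-backlogged queue sends over ROUNDS
# scheduler visits under simplified deficit round robin.
drr_packets() {
    quantum=$1; pkt_len=$2; rounds=$3
    deficit=0; sent=0; i=0
    while [ "$i" -lt "$rounds" ]; do
        # Each visit replenishes the deficit counter by the quantum.
        deficit=$(( deficit + quantum ))
        # Send while the deficit covers the next packet.
        while [ "$deficit" -ge "$pkt_len" ]; do
            deficit=$(( deficit - pkt_len ))
            sent=$(( sent + 1 ))
        done
        i=$(( i + 1 ))
    done
    echo "$sent"
}

drr_packets 2048 1500 1500   # queue 300 with the default quantum
drr_packets 4096 1500 1500   # queue 301 with double the quantum
```

Over 1,500 visits the calls print 2048 and 4096 packets, so the queue with the doubled quantum receives twice the service.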
A user can easily monitor the length of any of these queues through the RLI by selecting the port and then either Ingress => VOQ Length for a VOQ length or Egress => Queue Length for an egress queue length. It is easy to monitor reserved flow queues because the user selects a queue in the range 256 to 439 and defines a filter that will match the desired packets and queue them appropriately.
Datagram Hash Function
The datagram queue (one of the 64) that a packet goes to is determined by the following algorithm:
- Let sa[9:8] be bits 9 through 8 of the source IP address (counting from 0 with low-order bit to the right). Similarly define sa[6:5] and da[6:5] where da is the destination IP address.
- Compute the hash function H where H = sa[9:8] sa[6:5] da[6:5], the concatenation of the three fields.
- Then, the datagram queue id is QID = 440 + decimal(H) where decimal(H) is the base 10 value of H.
For example, suppose that the source and destination IP addresses are 192.168.2.48 and 192.168.1.48 respectively; i.e., the n2p2 and n1p2 interfaces. The computation goes as follows:
Extract Address Components:
sa[] = c0 a8 02 30 (hex) = sa[31:24] sa[23:16] sa[15:8] sa[7:0]
sa[9:8] = 2 (hex) = 10 (binary) since sa[11:8] = 2 (hex)
sa[6:5] = 01 (binary) since sa[7:0] = 30 (hex)
da[6:5] = 01 (binary) since da[] = c0 a8 01 30 (hex)
Compute H:
H = 10 01 01 (binary) = 32 + 4 + 1 = 37 (decimal)
Compute QID:
QID = 440 + 37 = 477
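The same computation can be written as a short shell function. This is a sketch of the algorithm as described above, taking dotted-decimal IPv4 addresses; datagram_qid is a hypothetical helper name and no input validation beyond the octet count is attempted.

```sh
#!/bin/sh
# datagram_qid SRC_IP DST_IP
# QID = 440 + H, where H = sa[9:8] sa[6:5] da[6:5] (concatenated bits).
datagram_qid() {
    old_ifs=$IFS
    IFS=.
    set -- $1 $2          # split both addresses: $1..$4 source, $5..$8 dest
    IFS=$old_ifs
    [ "$#" -eq 8 ] || return 1
    sa_98=$(( $3 & 3 ))           # bits 9:8 = low two bits of the third octet
    sa_65=$(( ($4 >> 5) & 3 ))    # bits 6:5 of the source's last octet
    da_65=$(( ($8 >> 5) & 3 ))    # bits 6:5 of the destination's last octet
    h=$(( (sa_98 << 4) | (sa_65 << 2) | da_65 ))
    echo $(( 440 + h ))
}

datagram_qid 192.168.2.48 192.168.1.48   # n2p2 to n1p2
```

For the example addresses this gives H = 37 and prints 477, i.e., QID = 440 + 37.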
Link Rate
The rate at which IP datagrams leave an egress port is controlled by a token bucket regulator in the FPX. When a user enters a rate r in the Queue Table for the link rate, the user is defining the token bucket that controls the effective transmission rate of the link attached to the egress port.
In Fig. 52, R is the maximum rate of the link (1 Gbps in our
testbed).
A token bucket operates as follows. A bucket is filled at a rate of r tokens per second. If the bucket is full, an arriving token is discarded. When a packet of length L arrives at the token bucket, it is allowed to proceed and L tokens are removed from the bucket if the bucket contains at least L tokens. Otherwise, the packet is queued. Thus, a token bucket is defined by two parameters:
- r, the token fill rate, and
- b, the bucket depth (size).
The token fill rate determines the long-term transmission rate during periods when a queue is backlogged. The bucket depth limits the traffic burst size; i.e., the amount of traffic that can be immediately passed through.
Note that when the packet is selected for transmission, it will go out at the maximum rate of the link (1 Gbps) independent of the user's link rate setting. A link rate set to r means that there will be a 1 Gbps burst followed by enough idle time so that the transmission rate computed over the combined transmission and idle periods is r.
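To make the burst-then-idle pattern concrete, the sketch below computes the idle time that must follow each packet so that the average rate equals the configured link rate. It assumes a steady stream of equal-sized packets, a 1 Gbps physical link, and an empty bucket at the start of each packet (so no two-packet bursts); idle_ns is a hypothetical helper name.

```sh
#!/bin/sh
# idle_ns PKT_BYTES RATE_KBPS [LINK_KBPS]
# Idle time (nanoseconds) after each packet so that the long-term rate is
# RATE_KBPS while each packet is sent at the physical link speed.
idle_ns() {
    bits=$(( $1 * 8 ))
    link_kbps=${3:-1000000}                        # default: 1 Gbps
    tx_ns=$(( bits * 1000000 / link_kbps ))        # time on the wire
    period_ns=$(( bits * 1000000 / $2 ))           # spacing for the set rate
    echo $(( period_ns - tx_ns ))
}

idle_ns 1500 300000   # 1,500-byte packets, link rate set to 300 Mbps
```

With a 300 Mbps setting, each 12 microsecond transmission is followed by 28,000 nsec of idle time, giving one packet every 40 microseconds.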
In the ONL testbed, the token bucket depth is set to two MTUs (Maximum Transmission Units) or 4,096 bytes. If a token bucket has been idle long enough to fill the bucket, this setting can result in possibly two unexpected behaviors when a packet stream is examined closely:
- Two back-to-back, maximum-sized packets will be passed through; and
- Two MTUs worth of back-to-back, minimum-sized packets will be passed through.
In both cases, a measurement of the interpacket times will seem to indicate that the link rate is 1 Gbps rather than the value set by the user. But this is only true over the short time period in which the packets are allowed to burst and not over a longer time period.
The unexpected behaviors described above are invisible for the most part because most experiments measure traffic at large enough time intervals where the behavior is not apparent. But a project that uses a technique such as packet pair will encounter this behavior since it involves measuring interpacket times.
The actual link rate may also differ from the user's desired link rate because of the FPX's link rate implementation algorithm. In fact, the actual link rate will be:
Max { 61, 61 * k } Kbps
where k is the largest integer such that 61*k is no larger than the desired link rate in Kbps. In the actual implementation, tokens are added at each FPX clock tick (not continuously), and each token represents some number of bytes.
Packet Scheduling
In Fig. 52, we showed the token bucket regulating a single queue. In reality, the token bucket cycles through all of the egress queues in round-robin fashion. As it visits each active queue (a queue with at least one packet), it adds the queue's quantum to the queue's deficit counter and transmits a packet from the queue if the deficit counter is at least as large as the next packet's length and the token bucket has enough tokens. The token bucket repeats this process until it no longer has enough tokens to send a packet.
Link Rate Implementation
If a user specifies that a link rate should be R Kbps, the actual link rate r will likely differ from R (although not substantially) because the link rate algorithm implemented in the FPX uses integer arithmetic. In fact, the actual link rate will be:
r = Max { 61, 61 * k } Kbps
where k is the largest integer such that 61*k is no larger than R; i.e., the actual link rate will be an integer multiple of 61 Kbps and will always be at least 61 Kbps, even if the desired rate is set to 0. For example, if R = 100 Mbps or 100,000 Kbps, then r = 99975.6 Kbps, a value that is within 0.025% of the desired rate. On the other hand, if R = 100 Kbps, the actual rate will be 61 Kbps, a value that is 39% different from the desired rate. This page describes the link rate algorithm at a conceptual level.
The FPX uses a token bucket to determine when a packet should begin transmission. The token bucket receives new tokens at the beginning of each FPX clock period. The FPX has a clock frequency of F = 62.5 MHz which means that tokens are replenished every 16 nsec.
When a user enters a desired sending rate of R Kbps that is at least 61 Kbps, the FPX adds z tokens to the token bucket at the beginning of each clock tick where z is the largest integer that satisfies the following relationship:
r = z*v*F <= R if R >= 61 Kbps
where F is the FPX clock frequency (62.5 MHz), and v is the value of a single token. Because of the number of bits used in the token computation algorithm, each token represents 1/1024 bits; i.e., the value of one token is 1/1024 bits. Replacing F and v with their values leads to:
r = 62,500 * z / 1024 <= R Kbps if R >= 61 Kbps
With integer division, the relationship becomes:
r = 61 * z <= R Kbps
The link rate algorithm computes z, the number of tokens to place in the token bucket, for all values of R by effectively using the following equation:
z = Max { 1, Floor(R/61) }
where R is the desired link rate in Kbps. For R less than 61 Kbps, z is 1. This means that the actual link rate r is given by:
r = 61 * z Kbps
or
r = 61 * Max { 1, Floor(R/61) } Kbps
The key parameters are summarized in the following table:
| Parameter | Description |
|---|---|
| F = 62.5 MHz | Clock Rate (ticks/sec) |
| R | Desired Sending Rate (Kbps) |
| z = Max{ 1, Floor(R/61) } | Token Rate (tokens/tick) |
| v = 1/8192 | Token Value (bytes/token) |
| r = 61 * Max{ 1, Floor(R/61) } | Bucket Fill Rate (Kbps) |
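The quantization can be tried out with integer shell arithmetic. The first function below implements the formula r = 61 * Max{ 1, Floor(R/61) } as stated. The second is an assumption of this sketch: it computes z from the unrounded token value 62,500/1024 (about 61.035) Kbps per token, which appears to be the basis of the 99975.6 Kbps example figure. Both function names are hypothetical.

```sh
#!/bin/sh
# actual_rate_kbps R_KBPS
# The text's formula: r = 61 * Max{ 1, Floor(R/61) } Kbps.
actual_rate_kbps() {
    z=$(( $1 / 61 ))
    [ "$z" -ge 1 ] || z=1
    echo $(( 61 * z ))
}

# tokens_exact R_KBPS
# Assumption: z computed with the unrounded per-token rate 62,500/1024 Kbps.
tokens_exact() {
    z=$(( $1 * 1024 / 62500 ))
    [ "$z" -ge 1 ] || z=1
    echo "$z"
}

actual_rate_kbps 100       # desired 100 Kbps
actual_rate_kbps 0         # desired 0 Kbps still yields the minimum rate
actual_rate_kbps 100000    # desired 100 Mbps
tokens_exact 100000        # tokens per tick for 100 Mbps, unrounded value
```

The calls print 61, 61, 99979 and 1638. With z = 1638 and the unrounded token value, the rate is 1638 * 62,500 / 1024, or about 99975.6 Kbps.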
Relationship to Packet Scheduling
To be precise, the RLI sends the link rate R to the appropriate SPC which then computes the z value. The SPC then sends the z value to the FPX to parameterize the link rate token bucket. Recall that z is just the token bucket fill rate; i.e., the number of tokens to be added in one FPX clock tick. At first glance, it may seem odd that the RLI doesn't just directly communicate with the FPX. But the SPC gets involved because of the potential for running the Distributed Queueing (DQ) algorithm. The DQ algorithm uses the link rate (and the switch rate) to compute VOQ rates so as to avoid both underflow and overflow at the egress ports. The RLI does send quantum and threshold values directly to the FPX since these control packet scheduling on the egress side.
When DQ is OFF, the user sets the VOQ rates through the Queue Table. The RLI translates these rates to z values and sends them directly to the FPX to configure the token buckets associated with packet transmission from the VOQs.

