SPP Datapath Software



Introduction

In this section, we describe the software components that implement packet processing in the Line Card and NPE.

Line Card

The Line Card is part of the SPP substrate. That is, it implements packet processing functions that are common to all applications running on the SPP and performs no application-specific packet processing. However, elements of the SPP are configured for particular applications and we describe those elements and how they can be configured.

Line Card Data Path

The figure above shows the software components that implement the Line Card’s packet processing functionality. The software is organized into two pipelines, one that processes ingress traffic (that is, traffic arriving on an external interface and passing through to the chassis switch) and one that processes egress traffic. Each of these pipelines is mapped onto a different NP subsystem, as there is little interaction between the two. Each block in the diagram indicates the number of MicroEngines (MEs) that are used to implement that component. In some cases, the multiple MEs simply implement finer-grained pipeline stages; in other cases, they operate in parallel. Successive pipeline elements are separated by buffers, which are not shown in the diagram. Some of these are implemented using the Next Neighbor Rings; others use the shared on-chip SRAM.
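
To make the handoff between pipeline stages concrete, the sketch below models one of these inter-stage buffers as a small software ring carrying per-packet buffer handles. All of the names (stage_ring, ring_put, ring_get) are invented for illustration; the actual Line Card stages hand packets off through the IXP's hardware Next Neighbor Rings and shared on-chip SRAM rings rather than through a software queue like this.

  /* Minimal sketch of an inter-stage handoff buffer, assuming each stage
   * passes a small per-packet handle (a buffer reference) to its successor.
   * Hypothetical names; the real pipeline uses hardware NN/SRAM rings. */
  #include <stdint.h>
  #include <stdio.h>

  #define RING_SLOTS 128              /* must be a power of two */

  typedef struct {
      uint32_t slot[RING_SLOTS];      /* each slot holds a buffer handle */
      uint32_t head, tail;            /* producer and consumer positions */
  } stage_ring;

  /* Producer side: returns 0 if the ring is full (back-pressure). */
  static int ring_put(stage_ring *r, uint32_t buf_handle)
  {
      if (r->head - r->tail == RING_SLOTS)
          return 0;
      r->slot[r->head++ & (RING_SLOTS - 1)] = buf_handle;
      return 1;
  }

  /* Consumer side: returns 0 if the ring is empty. */
  static int ring_get(stage_ring *r, uint32_t *buf_handle)
  {
      if (r->head == r->tail)
          return 0;
      *buf_handle = r->slot[r->tail++ & (RING_SLOTS - 1)];
      return 1;
  }

  int main(void)
  {
      stage_ring r = {{0}, 0, 0};
      uint32_t h;
      ring_put(&r, 0x1000);           /* upstream stage enqueues a buffer */
      if (ring_get(&r, &h))           /* downstream stage dequeues it */
          printf("passed buffer handle 0x%x\n", h);
      return 0;
  }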

For most pipeline components, per-packet processing overhead is the dominant performance concern, so the software was largely engineered around the case of minimum size packets. For the SPP, the minimum packet size is determined by the minimum Ethernet frame length (effectively 88 bytes once the VLAN tag, start-of-frame flag, preamble and inter-packet gap are all accounted for). If all ten physical interfaces are operating at full rate, each of the pipelines must process a packet every 70 ns. Consequently, a pipeline stage that is implemented with a single ME has fewer than 100 processor cycles to process each packet. Since an access to external memory has a latency of 150 cycles for SRAM and 300 cycles for DRAM, such a stage can access external memory only 2-4 times per packet, even if it uses all eight hardware thread contexts to mask the memory latency. This is the fundamental challenge that must be met to maintain high performance.
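
The arithmetic behind these budgets can be made explicit. The sketch below assumes a 1.4 GHz ME clock (the rate of the IXP 2800 family) and ten 1 Gb/s interfaces, which is consistent with the 70 ns figure above; it is a back-of-the-envelope illustration, not code from the SPP.

  /* Back-of-the-envelope cycle budget for a single-ME pipeline stage,
   * assuming a 1.4 GHz ME clock and 10 Gb/s of minimum size packets
   * (88 bytes on the wire per packet). */
  #include <stdio.h>

  int main(void)
  {
      double line_rate_bps  = 10e9;            /* assumed: ten 1 Gb/s interfaces */
      double pkt_bits       = 88 * 8;          /* 704 bits per minimum size packet */
      double me_clock_hz    = 1.4e9;           /* assumed ME clock rate */

      double pkt_time_ns    = pkt_bits / line_rate_bps * 1e9;    /* ~70 ns */
      double cycles_per_pkt = pkt_time_ns * 1e-9 * me_clock_hz;  /* <100 cycles */

      /* With all 8 thread contexts hiding memory latency, each packet can
       * absorb roughly 8x the per-packet cycle budget in latency.  In
       * practice, instruction issue and other overheads reduce the usable
       * number of accesses to the 2-4 quoted above. */
      double latency_budget = 8 * cycles_per_pkt;

      printf("one packet every %.1f ns, %.0f cycles per packet\n",
             pkt_time_ns, cycles_per_pkt);
      printf("latency budget fits at most %d SRAM (150 cyc) or %d DRAM (300 cyc) accesses\n",
             (int)(latency_budget / 150), (int)(latency_budget / 300));
      return 0;
  }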

We start by describing the ingress pipeline. The RxIn block transfers packets from the IO interface of the IXP chip into memory. This involves reassembling SPI cells into packets, which are placed in DRAM buffers of 2 KB each. The RxIn block also allocates a buffer descriptor which is initialized and stored in SRAM. RxIn passes several pieces of metadata to the next pipeline element. This includes the physical interface on which the packet was received, its Ethernet frame length, several status flags and a reference to its buffer.
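
The information RxIn hands to the next stage can be pictured as a small record like the one below. The field names and widths are hypothetical, chosen only to mirror the items listed above; the actual descriptor and metadata layouts are defined by the SPP microcode.

  /* Hypothetical shape of the per-packet state produced by RxIn.
   * Field names and widths are illustrative, not the SPP's actual layout. */
  #include <stdint.h>

  /* Buffer descriptor kept in SRAM; the 2 KB packet buffer itself is in DRAM. */
  struct buf_descriptor {
      uint32_t dram_buf_addr;     /* base address of the 2 KB DRAM buffer */
      uint16_t data_offset;       /* where the frame starts within the buffer */
      uint16_t data_length;       /* bytes of packet data in the buffer */
  };

  /* Metadata passed from RxIn down the ingress pipeline. */
  struct rxin_metadata {
      uint32_t buf_handle;        /* reference to the buffer and its descriptor */
      uint16_t frame_length;      /* Ethernet frame length */
      uint8_t  rx_interface;      /* physical interface the packet arrived on */
      uint8_t  status_flags;      /* e.g. reassembly or error indications */
  };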

The Key Extract block extracts selected header fields from the buffer. These include the IP packet length, the value of the IP protocol field, and the destination IP address and port number. These are added to the metadata passed down the pipeline. The Lookup block performs a lookup in the shared TCAM. This is an exact match lookup, based on the interface number, protocol, destination IP address and port number. The result of the lookup specifies the queue the packet is to be placed in, a translated port number and a VLAN number, which is used within the chassis switch to determine where the packet goes next. It also includes a Statistics Index, which identifies the traffic counters to be updated for the given packet. The Header Format block makes any required changes to the packet headers in the DRAM buffer. This includes rewriting the TCP or UDP destination port number and rewriting the MAC address to reflect the address of the component to which the packet is to be forwarded next. The Lookup block can send selected packets to the xScale (as determined by the lookup results), and the xScale can insert packets into the pipeline through the Header Format block.
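
The exact-match key and lookup result described above might look roughly like the following; the structures and field names are illustrative, not the actual TCAM record formats.

  /* Illustrative layout of the ingress TCAM lookup; names are hypothetical. */
  #include <stdint.h>

  /* Exact-match key assembled from the extracted header fields. */
  struct ingress_key {
      uint8_t  rx_interface;      /* physical interface number */
      uint8_t  ip_protocol;       /* IP protocol field (e.g. TCP or UDP) */
      uint32_t dst_ip;            /* destination IP address */
      uint16_t dst_port;          /* destination TCP/UDP port */
  };

  /* Result returned on a TCAM hit. */
  struct ingress_result {
      uint16_t queue_id;          /* queue the packet is to be placed in */
      uint16_t new_dst_port;      /* translated port number */
      uint16_t vlan;              /* steers the packet through the chassis switch */
      uint16_t stats_index;       /* identifies the traffic counters to update */
      uint8_t  to_xscale;         /* nonzero if the packet should go to the xScale */
  };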

The Queue Manager (QM) is the most complex of the components in the datapath. It is implemented using six MEs. Four of these implement the actual queueing functions; one distributes packets received from the Header Format block across the four queueing engines, and the other provides a similar interface function on output. Each of the four queueing engines manages a separate set of linked-list packet queues that are stored in external DRAM. The IXP provides low-level hardware support for managing such queues, making the basic list operations highly efficient. However, maintaining high throughput is still challenging, as maintaining a queue inherently requires at least one memory access per packet, and each such access takes 150 cycles (the SRAM memory latency). Each queueing engine also implements five separate packet schedulers, which can be individually rate controlled. Each of these schedulers has its own list of queues and implements a Weighted Deficit Round Robin scheduling policy. In the Line Card, queues are assigned to schedulers based on their destinations. In particular, all queues assigned to a particular scheduler share a common “next hop” (the CP, GPE1, GPE2 or the NPE). Multiple schedulers can share a common destination, and this is used to enable higher overall throughput, as there are performance limits associated with both individual MEs and individual schedulers.
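
The scheduling policy itself is standard. The sketch below shows a minimal weighted deficit round robin dequeue loop of the kind each scheduler implements; the data structures and names are invented for illustration, and the real queueing engines use the IXP's hardware queue support rather than plain C linked lists.

  /* Minimal weighted deficit round robin (WDRR) round over one scheduler's
   * queues.  All structures and names are illustrative. */
  #include <stddef.h>
  #include <stdint.h>

  struct pkt {
      struct pkt *next;
      uint32_t    length;            /* packet length in bytes */
  };

  struct wdrr_queue {
      struct pkt *head, *tail;       /* linked list of queued packets */
      uint32_t    quantum;           /* bytes earned per round; encodes the weight */
      uint32_t    deficit;           /* bytes this queue may still send */
  };

  #define NQUEUES 8

  /* One scheduling round: visit each active queue and send while it has credit. */
  static void wdrr_round(struct wdrr_queue q[NQUEUES],
                         void (*transmit)(struct pkt *))
  {
      for (int i = 0; i < NQUEUES; i++) {
          if (q[i].head == NULL)
              continue;                      /* skip inactive queues */
          q[i].deficit += q[i].quantum;      /* earn credit for this round */
          while (q[i].head != NULL && q[i].head->length <= q[i].deficit) {
              struct pkt *p = q[i].head;     /* dequeue the head packet */
              q[i].head = p->next;
              if (q[i].head == NULL)
                  q[i].tail = NULL;
              q[i].deficit -= p->length;     /* spend credit on this packet */
              transmit(p);
          }
          if (q[i].head == NULL)
              q[i].deficit = 0;              /* idle queues do not bank credit */
      }
  }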

The TxIn block transfers packets from the DRAM buffers to the IO interface. This involves segmenting packets into SPI cells. The Statistics block is used by most of the other components to record traffic statistics. It has an input FIFO (not shown), which is implemented using the on-chip SRAM accessible to all MEs. The IXP hardware provides low-level support that lets multiple MEs write to such FIFOs without interference and without explicit software concurrency control. This allows them to issue update requests for statistics counters without interrupting their main packet processing flow. The Statistics block processes these requests and maintains the traffic counters in external memory. This delegates the memory access overheads associated with updating the statistics counters to a separate ME, minimizing the impact on the MEs that process packets. A typical statistics request updates both a packet counter and a byte counter, allowing effective monitoring of both packets processed and aggregate bandwidth. There is one other ME that is not shown in the diagrams. It maintains the free space list of packet buffers; it is also shared by multiple MEs and accepts “buffer recycling requests” from them through a similar input FIFO.
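
The kind of request an ME drops into the Statistics block's input FIFO, and the counter update the Statistics ME performs in response, can be sketched as follows; the structures and names are hypothetical.

  /* Illustrative statistics-update request and the update it triggers.
   * On the hardware, the request is written to the shared on-chip FIFO
   * and the counters live in external memory. */
  #include <stdint.h>

  struct stats_request {
      uint32_t stats_index;       /* which packet/byte counter pair to update */
      uint32_t byte_count;        /* size of the packet just processed */
  };

  struct stats_counters {
      uint64_t packets;
      uint64_t bytes;
  };

  /* The Statistics ME drains the FIFO and applies each request off the
   * packet processing fast path. */
  static void stats_apply(struct stats_counters *table,
                          const struct stats_request *req)
  {
      table[req->stats_index].packets += 1;                /* packet counter */
      table[req->stats_index].bytes   += req->byte_count;  /* byte counter */
  }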

The egress pipeline is similar to the ingress pipeline, but there are a few differences. First, the egress pipeline includes a Flow Statistics module that maintains information about outgoing packet flows. This data is collected to allow for accountability of outgoing packets. Since the SPP is a shared platform used by researchers to carry out networking experiments, it is possible for users of the platform to use it to send packets to Internet destinations that don’t want to receive those packets (this can happen inadvertently or maliciously). This can lead the users of the computers at those destinations to complain about the unwanted traffic. When this happens, it’s important for SPP operators to be able to determine the individual user whose experiment is generating the unwanted traffic. The Flow Stats module provides the low-level data collection needed to support this. The requirement for flow statistics originates with the PlanetLab testbed, and the data collected is compatible with the data collected for conventional PlanetLab nodes. The Flow Stats module maintains its data in an external SRAM that can be read by the xScale control processor. Software running on the xScale aggregates the data produced by the Flow Stats module and periodically transfers it to the Control Processor, which stores it on disk and makes it available to system administrators.

All the other components of the egress pipeline are similar to their counterparts in the ingress pipeline, although there are small differences. For example, the header fields used as the lookup key in the Lookup module are different, and include the VLAN on which the packet arrived and its source IP address and port number. The Queue Manager is the same as in the ingress pipeline, but it is configured differently. Each of the Line Card’s outgoing interfaces is assigned a distinct packet scheduler, and each experiment running on the SPP (more precisely, each slice) that is using an interface has a queue on that interface, with a weight that reflects its assigned share of the interface bandwidth. In the simplest case, each of the outgoing interfaces is a separate physical interface, but it is possible to define multiple virtual interfaces on a given physical interface and associate a separate scheduler with each virtual interface. This makes it possible to provision each virtual interface with a specific share of the physical interface bandwidth. The external interfaces each have an associated IP address, and the GPEs associate these same IP addresses with their own virtual interfaces. So, a GPE sends a packet out of the SPP on a particular external interface by giving it the source IP address associated with that interface. This allows the GPE software to be largely oblivious to the fact that it is operating within an SPP.
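
One way to picture the resulting configuration is sketched below, with one record per (virtual) interface scheduler and one per slice queue; the structures and field names are hypothetical stand-ins for whatever state the control software actually keeps.

  /* Hypothetical configuration records for the egress Queue Manager. */
  #include <stdint.h>

  /* One packet scheduler per (virtual) interface, rate limited to its
   * provisioned share of the underlying physical interface. */
  struct egress_scheduler {
      uint8_t  phys_interface;    /* physical port this interface maps onto */
      uint32_t rate_kbps;         /* provisioned share of the port bandwidth */
  };

  /* One queue per slice on each interface the slice uses. */
  struct slice_queue {
      uint16_t slice_id;
      uint16_t scheduler_id;      /* which egress scheduler serves this queue */
      uint32_t quantum;           /* WDRR weight: the slice's bandwidth share */
      uint32_t max_length;        /* configured effective queue length */
  };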

Because the SPP supports multiple processing engines, the TCP and UDP port numbers associated with the external interfaces must be shared by the CP, the GPEs and the NPE. Since the CP and GPE operating systems each control the port numbers that they associate with sockets, port numbers selected by different components can conflict with one another. This requires a form of Network Address Translation (NAT) on the part of the Line Card. More specifically, the Line Card must translate the port numbers at the SPP’s endpoint of TCP and UDP connections that originate within the SPP (i.e. client connections). That is, it must translate the source port number for packets leaving the SPP and the destination port number for packets arriving at the SPP. The port number translation is implemented by packet filters inserted in the Line Card’s TCAM. All packets forwarded by the Line Card must match such a filter. If there is no filter for a given packet, that packet is directed to the xScale, where a NAT daemon determines whether the packet belongs to a new outgoing connection; if so, it assigns the connection a port number and installs a TCAM filter to implement the translation of subsequent packets. For TCP connections, an additional filter is installed to send copies of TCP SYN and FIN packets to the NAT daemon, so it can track connection setup, detect the closing of the connection, remove the associated filters and de-allocate the assigned port number. Traffic on UDP connections is monitored continuously, and the associated port numbers are freed after an extended period of inactivity. The NAT daemon also performs translations on outgoing ICMP echo packets, allowing applications on the GPEs to send ping packets and receive the corresponding replies.
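
The slow-path decision the NAT daemon makes when a packet matches no filter can be sketched roughly as follows. The port range, table shape and function names are invented for illustration; the real daemon runs on the xScale and installs its translations as TCAM filters rather than entries in a C array.

  /* Sketch of the NAT daemon's handling of an outgoing packet that matched
   * no existing filter.  Everything here is illustrative. */
  #include <stdint.h>

  struct nat_binding {
      uint8_t  ip_protocol;       /* TCP or UDP */
      uint32_t internal_src_ip;   /* address used by the CP or GPE inside the SPP */
      uint16_t internal_src_port; /* source port chosen by the CP/GPE OS */
      uint16_t external_src_port; /* port allocated by the NAT daemon */
      uint8_t  in_use;
  };

  #define NAT_PORT_BASE  20000    /* hypothetical pool of translated ports */
  #define NAT_PORT_COUNT 8192

  static struct nat_binding bindings[NAT_PORT_COUNT];

  /* Allocate an external port for a new outgoing connection and return it,
   * or 0 if the pool is exhausted.  A real implementation would also install
   * the ingress and egress TCAM filters that translate subsequent packets. */
  static uint16_t nat_new_connection(uint8_t proto, uint32_t src_ip,
                                     uint16_t src_port)
  {
      for (uint32_t i = 0; i < NAT_PORT_COUNT; i++) {
          if (!bindings[i].in_use) {
              bindings[i].ip_protocol       = proto;
              bindings[i].internal_src_ip   = src_ip;
              bindings[i].internal_src_port = src_port;
              bindings[i].external_src_port = (uint16_t)(NAT_PORT_BASE + i);
              bindings[i].in_use            = 1;
              return bindings[i].external_src_port;
          }
      }
      return 0;
  }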

Network Processing Engine Software – version 1

The organization of the first version of the NPE software and its mapping onto MEs is shown in the figure below. This software uses one of the two NP subsystems on the Radisys 7010 blades. Our original plan was to instantiate the same software on both NP subsystems, allowing them to be used as largely independent NPEs. Unfortunately, limitations of the 7010’s input/output layer made this infeasible, so in the initial deployment of the SPP only one of the two NP subsystems is available to users. The next section describes another version of the software that is under development and that will replace the first version as soon as it has been completed.

NPE Software Structure

As in the Line Card, the software components that implement the NPE are organized as a pipeline. Packets received from the chassis switch are copied into DRAM buffers on arrival by the Receive (Rx) block, which also passes a reference to the packet buffer through the main packet processing pipeline. Information contained in the packet header can be retrieved from DRAM by subsequent blocks as needed, but no explicit copying of the packet takes place in the processing pipeline. At the end of the pipeline, the Transmit (Tx) block forwards the packet to the output. Buffer references (and other information) are passed along the pipeline primarily using FIFOs linking adjacent MEs. Pipeline elements typically process eight packets concurrently using the hardware thread contexts. The Substrate Decapsulation block determines which slice the packet belongs to by doing a lookup in a table stored in one of the SRAMs. It also effectively strips the outer header from the packet by adjusting a pointer to the packet’s buffer before passing it along the pipeline.
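
The pointer-adjustment trick used by Substrate Decapsulation can be illustrated as follows; the descriptor fields and names are hypothetical, but the point is that only an offset in the descriptor changes, and no packet bytes are moved.

  /* Sketch of "stripping" the outer header without copying: the block only
   * advances the offset recorded for the packet.  Names are illustrative. */
  #include <stdint.h>

  struct npe_buf_descriptor {
      uint32_t dram_buf_addr;     /* packet data lives in a DRAM buffer */
      uint16_t data_offset;       /* start of the current header layer */
      uint16_t data_length;       /* bytes from data_offset to end of packet */
      uint16_t slice_id;          /* filled in from the SRAM table lookup */
  };

  static void substrate_decap(struct npe_buf_descriptor *d,
                              uint16_t outer_hdr_len, uint16_t slice)
  {
      d->slice_id     = slice;            /* which slice the packet belongs to */
      d->data_offset += outer_hdr_len;    /* skip the outer header in place */
      d->data_length -= outer_hdr_len;    /* no bytes are copied or moved */
  }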

The Parse block includes slice-specific program segments. More precisely, Parse includes program segments that define a preconfigured set of Code Options. Slices are configured to use one of the available code options, and each slice has a block of memory in SRAM that it can use for slice-specific data. Currently, code options have been implemented for IPv4 forwarding and for the Internet Indirection Infrastructure (I3) [ST02]. New code options are fairly easy to add, but doing so requires familiarity with the NP programming environment and must be done with care to ensure that new code options do not interfere with the operation of the other components. The primary role of Parse is to examine the slice-specific header and use it, along with other information, to form a lookup key, which is passed to the Lookup block.

The Lookup block provides a generic lookup capability, using the TCAM. It treats the lookup key provided by Parse as an opaque 112-bit string. It augments this bit string with a slice identifier before performing the TCAM lookup. The slice’s control software can insert packet filters into the TCAM. These filters can include up to 112 bits for the lookup key and 112 bits of mask information. Software in the Management Processor augments the slice-defined filters with the appropriate slice id before inserting them into the TCAM. This gives each slice the illusion of a dedicated TCAM. The position of filter entries in the TCAM determines their lookup priority, so the data associated with the first filter in the TCAM matching a given lookup key is returned. The number of entries assigned to different slices is entirely flexible, but the total number of entries is 128K.
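
The slice-scoped lookup can be modeled roughly as below, with a linear scan standing in for the TCAM's parallel match and its position-based priority; the structures and names are illustrative only.

  /* Illustrative model of the slice-scoped TCAM lookup: the 112-bit key from
   * Parse is qualified with the slice id, and the first matching filter
   * (lowest index) wins. */
  #include <stdint.h>

  #define KEY_WORDS    4              /* 4 x 32 bits covers the 112-bit key */
  #define TCAM_ENTRIES (128 * 1024)   /* total filter capacity */

  struct slice_filter {
      uint16_t slice_id;              /* added by the Management Processor */
      uint32_t key[KEY_WORDS];        /* slice-supplied key value */
      uint32_t mask[KEY_WORDS];       /* slice-supplied mask: 1 = compare this bit */
      uint32_t result;                /* opaque lookup result */
      uint8_t  valid;
  };

  static struct slice_filter tcam[TCAM_ENTRIES];

  /* Return the result of the first (highest priority) matching filter,
   * or -1 if no filter belonging to this slice matches the key. */
  static long tcam_lookup(uint16_t slice_id, const uint32_t key[KEY_WORDS])
  {
      for (uint32_t i = 0; i < TCAM_ENTRIES; i++) {
          if (!tcam[i].valid || tcam[i].slice_id != slice_id)
              continue;
          int match = 1;
          for (int w = 0; w < KEY_WORDS && match; w++)
              if ((key[w] & tcam[i].mask[w]) != (tcam[i].key[w] & tcam[i].mask[w]))
                  match = 0;
          if (match)
              return (long)tcam[i].result;
      }
      return -1;
  }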

The Header Formatter, which follows Lookup, makes any necessary changes to the slice-specific packet header, based on the result of the lookup and the semantics of the slice. It also formats the required outer packet header used to forward the packet either to the next PlanetLab node or to its ultimate destination.

The Queue Manager (QM) implements a configurable collection of queues. More specifically, it provides 20 distinct packet schedulers, each with a configurable output rate and an associated set of queues. Separate schedulers are needed for each external interface supported by the Line Cards. The number of distinct schedulers that each ME can support is limited by the need to reserve some of the ME’s local memory for each one. Each scheduler implements the weighted deficit round robin scheduling policy, allowing different shares to be assigned to different queues. When a slice’s control software inserts a new filter, it specifies a slice-specific queue id. The filter insertion software remaps this to a physical queue id, which is added, as a hidden field, to the filter result. A slice can configure which external interface each of its queues is associated with, the effective length of each queue and its share of the interface bandwidth.
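
The remapping from a slice-local queue id to a physical queue id might be pictured as follows; the table shapes and limits are hypothetical.

  /* Sketch of the queue id remapping done at filter insertion time: the slice
   * names its queues with small local ids, and the control software fills in
   * the physical queue id as a hidden field of the filter result. */
  #include <stdint.h>

  #define MAX_SLICES       64         /* hypothetical limits */
  #define QUEUES_PER_SLICE 32

  /* Per-slice mapping, set up when the slice's queues are configured. */
  static uint16_t phys_queue[MAX_SLICES][QUEUES_PER_SLICE];

  struct filter_result {
      uint32_t slice_data;            /* slice-visible part of the result */
      uint16_t phys_queue_id;         /* hidden field, invisible to the slice */
  };

  static void remap_queue(uint16_t slice_id, uint16_t slice_qid,
                          struct filter_result *res)
  {
      res->phys_queue_id = phys_queue[slice_id][slice_qid];
  }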

The Statistics module maintains a variety of counts on behalf of slices. These can be accessed by slices through the xScale, to enable computation of performance statistics. The counting function is separated from the main processing pipeline to keep the associated memory accesses from slowing down the forwarding of packets, and to facilitate optimizations designed to overcome the effects of memory latency. The counts maintained by the Statistics module are kept in one of the external SRAMs and can be directly read by the xScale.

Network Processing Engine Software – version 2

The figure below shows the organization of version 2 of the NPE software. There are two primary objectives for this version. The first is to take advantage of both NP subsystems on the Radisys 7010 blade, to increase the amount of traffic that can be handled from 5 Gb/s to 10 Gb/s. The second is to provide support for packet replication, allowing applications to more easily implement services that require multicast.

Version 2 of the NPE Software Structure

In this version, the packet processing pipeline is distributed across both NPs. On NPU A (the top one in the figure), eight MEs are assigned to the Parse block, allowing for more extensive processing of user packets. These will operate in parallel and will take advantage of all eight thread contexts, allowing up to 64 packets to be processed concurrently. This allows for up to 780 instructions to be executed per minimum size packet, and for up to 20 DRAM references or 41 SRAM references per packet.
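
The same back-of-the-envelope arithmetic used for the Line Card gives these figures, again assuming a 1.4 GHz ME clock and one minimum size packet every 70 ns at 10 Gb/s; the sketch is illustrative only, and small differences from the quoted figures are rounding.

  /* Rough instruction and memory-reference budget for the eight-ME Parse
   * block, assuming a 1.4 GHz ME clock and one packet every ~70 ns. */
  #include <stdio.h>

  int main(void)
  {
      double pkt_interval_ns = 70.0;       /* minimum size packet at 10 Gb/s */
      double me_clock_ghz    = 1.4;        /* assumed ME clock rate */
      int    parse_mes       = 8;          /* MEs assigned to Parse on NPU A */
      int    threads_per_me  = 8;          /* hardware contexts per ME */

      double cycles_per_pkt  = pkt_interval_ns * me_clock_ghz;   /* ~98 */
      double instr_budget    = cycles_per_pkt * parse_mes;       /* ~780 */
      double latency_budget  = instr_budget * threads_per_me;    /* ~6270 */

      printf("%d packets processed concurrently\n", parse_mes * threads_per_me);
      printf("~%d instructions per minimum size packet\n", (int)instr_budget);
      printf("up to %d DRAM (300 cyc) or %d SRAM (150 cyc) references per packet\n",
             (int)(latency_budget / 300), (int)(latency_budget / 150));
      return 0;
  }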

The result returned by the Lookup block on NPU A includes a Result Index, which is passed to NPU B in a shim header added to the packet by the AddShim block. The Lookup and Copy block on NPU B uses the result index to select an entry from an SRAM-resident table that specifies one or more queues in which the packet is to be placed. For multicast packets, a Header Buffer is created for each copy. The associated buffer descriptor includes a reference to the original packet buffer, which is now referred to as the Payload Buffer. For each copy, the IP packet header information needed to forward that copy to its next destination is placed in the header buffer, and a reference to the header buffer is passed to the Queue Manager, along with the appropriate queue and packet scheduler information. A reference count is also placed in the descriptor of the payload buffer, so that it can be deallocated when the last header buffer has been processed.
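
The header-buffer and payload-buffer relationship, and the reference counting that goes with it, can be sketched as follows. The structures, sizes and allocation calls are illustrative; the real descriptors live in SRAM and the buffers in DRAM, managed by the NPE's buffer freelist rather than malloc and free.

  /* Illustrative header/payload buffer pairing for multicast copies. */
  #include <stdint.h>
  #include <stdlib.h>

  struct payload_buffer {
      uint8_t  data[2048];            /* the original packet contents */
      uint32_t refcount;              /* header buffers still referencing it */
  };

  struct header_buffer {
      uint8_t  header[128];           /* per-copy outgoing packet header */
      struct payload_buffer *payload; /* payload shared by every copy */
  };

  /* Create one per-copy header buffer and account for it in the payload. */
  static struct header_buffer *make_copy(struct payload_buffer *p)
  {
      struct header_buffer *h = malloc(sizeof *h);
      if (h != NULL) {
          h->payload = p;
          p->refcount++;
      }
      return h;
  }

  /* Called once a copy has been transmitted; the payload is released along
   * with the last header buffer that refers to it. */
  static void release_copy(struct header_buffer *h)
  {
      if (--h->payload->refcount == 0)
          free(h->payload);
      free(h);
  }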

The Header Format block includes slice-specific code that formats the header of the slice’s outgoing packets, by writing the header information in the header buffer. Of course, the slice-specific code may also make changes to the payload of a packet by writing to the payload buffer. The Header Format block passes the reference to the header buffer to the TxB block, which formats the outgoing packet by reading the header from the header buffer, and the payload from the payload buffer.