Scalable Overlay Hosting Platforms

The purpose of an Overlay Hosting Platform (OHP) is to provide resources for use by individual overlays. We seek to provide the greatest possible flexibility to overlays while maintaining appropriate separation among them. This includes separation of their use of memory and mass storage, and also of their use of network bandwidth and processing capacity.

High Level System Organization

Consideration of a conventional router or switch leads naturally to an OHP architecture in which line cards are replaced by virtualized line cards that include a substrate portion and generic processing resources that can be assigned to different virtual line cards. The substrate supports configuration of the generic processing resources so that different virtual line cards can co-exist without interference. On receiving data from the physical link, the substrate first determines which virtual line card the data belongs to and delivers it there. Virtual line cards pass data back to the substrate, which forwards it through the shared switch fabric (on input) or to the outgoing link (on output).

Virtualized Line Card Architecture

One issue with this architecture concerns how to provide generic processing resources at a line card, in a way that allows the resources to be shared by different overlays. Conventional line cards are often implemented using Network Processors (NP), programmable devices that include high performance IO and multiple processor cores to enable high throughput processing. It seems natural to take such a device and divide its internal processing resources among multiple overlays. For example, an NP with 16 processor cores could be used by up to 16 different overlays, by simply assigning processor cores. Unfortunately, current NPs are not designed to be shared. All processing cores have unprotected access to the same physical memory, making it difficult to ensure that different overlays don’t interfere with one another.

Also, each processor core has a fairly small program store. This is not a serious constraint in conventional applications, since processing can be pipelined across the different cores, allowing each to store only the program it needs for its part of the processing. However, a core implementing all the processing for one overlay must store the programs for all of that overlay's processing steps. The underlying issue raised by this discussion is that efficient implementation of an architecture based on virtualized line cards requires components that support fine-grained virtualization, and conventional NPs do not.

The virtualized line card approach is also problematic in other respects. Because it associates processing resources with physical links, it lacks the flexibility to support overlays with a wide range of processing needs. Some overlays may require more processing per unit IO bandwidth than NPs provide, and this is difficult to support with a virtualized line card approach. The virtualized line card approach also does not easily accommodate alternate implementation approaches for overlay nodes (such as configurable logic).

The above discussion leads us to the processing pool architecture, in which the processing resources used by overlays are separated from the physical link interfaces. This allows a more flexible allocation of processing resources and reduces the need for fine-grained virtualization. This architecture, illustrated at right, provides a pool of Processing Engines (PEs) that are accessed through the switch fabric.

Processing Pool Architecture

The line cards that terminate the physical links forward packets to PEs through the switch fabric, but do no processing that is specific to a particular overlay. There may be different types of PEs, including some implemented using network processors, others implemented using conventional microprocessors and still others implemented using FPGAs. The NP and FPGA based PEs are most appropriate for high throughput packet processing, while conventional processors are better suited to control functions that require more complex software or to overlays with a high ratio of processing to IO. An overlay node may be implemented using a single PE or multiple PEs. In the case of a single PE, data will pass through the physical switch fabric twice, once on input, once on output. In a node that uses multiple PEs to obtain higher performance, packets may have to pass through the switch fabric a third time.
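The pass counts described above can be written down as a small helper. This is only an illustrative sketch: the function name `fabric_passes` is hypothetical, and it assumes that every hop between two PEs of the same node costs exactly one additional traversal of the switch fabric.

```python
def fabric_passes(pes_on_path: int) -> int:
    """Count switch-fabric traversals for one packet.

    Assumption (not from the source): each PE-to-PE hop inside a node
    costs exactly one extra traversal of the fabric.
    """
    if pes_on_path < 1:
        raise ValueError("a packet must be processed by at least one PE")
    # One pass from the input line card to the first PE, one pass for each
    # hop between consecutive PEs, and one pass from the last PE to the
    # output line card.
    return 1 + (pes_on_path - 1) + 1


single_pe_node = fabric_passes(1)  # input pass + output pass
two_pe_node = fabric_passes(2)     # adds one PE-to-PE pass
```

With one PE on the path the helper gives two passes, and with two PEs it gives three, matching the cases described in the text.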

The primary drawback of the processing pool architecture is that it requires multiple passes through the switch fabric, increasing delay and increasing the switch capacity needed to support a given total IO bandwidth. The increase in delay is not a serious concern in wide area network contexts, since switching delays are typically 10 <math>\mu</math>s or less. The increase in capacity does add to system cost, but since a well-designed switch fabric represents a relatively small part of the cost of a conventional router (typically 10-20%), we can double, or even triple the capacity without a proportionally large increase in the overall system cost. Also, since OHPs can be expected to have a higher ratio of processing to IO than conventional network routers, the impact of the higher bandwidth use is significantly reduced.
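The cost argument can be checked with a rough calculation. The sketch below is an assumption-laden estimate, not a cost model from the source: it assumes that fabric cost grows linearly with fabric capacity and that all other costs stay fixed. The function name `extra_system_cost` is hypothetical.

```python
def extra_system_cost(fabric_cost_fraction: float,
                      capacity_multiplier: float) -> float:
    """Fractional increase in total system cost when fabric capacity grows.

    Assumptions (illustrative only): fabric cost scales linearly with
    capacity, and the cost of everything else is unchanged.
    """
    if not 0.0 <= fabric_cost_fraction <= 1.0:
        raise ValueError("fabric_cost_fraction must be between 0 and 1")
    if capacity_multiplier < 1.0:
        raise ValueError("capacity_multiplier must be at least 1")
    # Only the fabric's share of the cost is multiplied; the rest is fixed.
    return fabric_cost_fraction * (capacity_multiplier - 1.0)


# Fabric share of 10% and 20%, with capacity doubled and tripled.
estimates = {
    (share, mult): extra_system_cost(share, mult)
    for share in (0.10, 0.20)
    for mult in (2.0, 3.0)
}
```

Under these assumptions, doubling the fabric capacity adds roughly 10-20% to the system cost and tripling it adds roughly 20-40%, far less than the 100-200% increase in fabric capacity itself. A fabric whose cost grows faster than linearly with capacity would shift these figures upward.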

The main advantage of the processing pool architecture is that it greatly reduces the need for fine-grained virtualization within NP and FPGA-based subsystems, for which such virtualization is difficult. Because the processing pool architecture brings together the traffic for each individual overlay node, there is much less need for PEs to be shared among multiple nodes. The one exception to this is nodes with such limited processing needs that they cannot justify the use of even one complete PE. Such nodes can still be accommodated by implementing them on a general purpose processor, running a conventional operating system that supports a virtual machine environment. Later, we discuss another approach that allows such nodes to share an NP-based PE for fast path forwarding, while relying on a virtual machine running within a general purpose processor to handle exception cases.

Another advantage of the processing pool architecture is that it simplifies sharing of the switch fabric. The switch fabric must maintain traffic isolation among the different overlay nodes. One way to ensure this is to constrain the traffic flows entering the switch fabric so as to eliminate the possibility of internal congestion. This is difficult to do in all cases. In particular, nodes consisting of multiple PEs should be allowed to use their “share” of the switch fabric capacity in a flexible fashion, without having to constrain the pair-wise traffic flows among the PEs. However, allowing this flexibility makes it possible for several PEs in a given metarouter to forward traffic to another PE at a rate that exceeds the bandwidth of the interface between the switch fabric and the destination PE.
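The overload scenario at the end of this paragraph can be illustrated with a small check. The function name, the PE labels and the rates below are hypothetical values chosen for illustration; the sketch assumes every PE has the same interface bandwidth to the fabric.

```python
def overloaded_destinations(flows, port_bw_gbps):
    """Return destination PEs whose offered load exceeds their fabric port.

    flows: dict mapping (src_pe, dst_pe) -> offered rate in Gb/s.
    port_bw_gbps: interface bandwidth between the fabric and each PE
                  (assumed identical for all PEs in this sketch).
    """
    load = {}
    for (_src, dst), rate in flows.items():
        # Sum the offered traffic arriving at each destination PE.
        load[dst] = load.get(dst, 0.0) + rate
    return {dst: total for dst, total in load.items() if total > port_bw_gbps}


# Hypothetical node with four PEs: three of them send 4 Gb/s each to "pe3".
flows = {
    ("pe0", "pe3"): 4.0,
    ("pe1", "pe3"): 4.0,
    ("pe2", "pe3"): 4.0,
    ("pe3", "pe0"): 2.0,
}
hot_spots = overloaded_destinations(flows, port_bw_gbps=10.0)
```

Each sender stays well below its own 10 Gb/s port, yet the combined 12 Gb/s arriving at `pe3` exceeds that PE's interface, which is the situation pair-wise constraints would otherwise be needed to prevent.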

There is a straightforward solution to this problem in the processing pool architecture. To simplify the discussion, we separate the handling of traffic between line cards and PEs from the traffic among PEs in a common metarouter. In the first case, we can treat the traffic as a set of point-to-point streams that are rate-limited when they enter the fabric. Rate-limiting these flows follows naturally from the fact that they are logical extensions of traffic flows on the external links. Because the external link flows must be rate limited to provide traffic isolation on the external links, the internal flows within the switch fabric can be configured to eliminate the possibility of congestion.
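The text does not say how the point-to-point streams are rate-limited. One common technique is a token bucket, sketched below purely as an illustration; the class name `TokenBucket`, its parameters and the example rates are assumptions, not part of the described platform.

```python
class TokenBucket:
    """Illustrative token-bucket rate limiter for one line-card-to-PE stream.

    The source only states that streams are rate-limited on entry to the
    fabric; the token bucket is a stand-in technique chosen for this sketch.
    """

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0   # convert bits/s to bytes/s
        self.burst_bytes = float(burst_bytes)    # bucket depth
        self.tokens = float(burst_bytes)         # start with a full bucket
        self.last_time = 0.0                     # time of the last update (s)

    def admit(self, now: float, packet_bytes: int) -> bool:
        """Return True if the packet may enter the fabric at time `now`."""
        elapsed = now - self.last_time
        # Refill at the configured rate, never beyond the bucket depth.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


# Hypothetical stream limited to 8 kb/s (1000 bytes/s), 1500-byte burst.
bucket = TokenBucket(rate_bps=8_000, burst_bytes=1500)
first = bucket.admit(now=0.0, packet_bytes=1500)   # uses the initial burst
second = bucket.admit(now=0.5, packet_bytes=1000)  # only 500 bytes refilled
third = bucket.admit(now=1.0, packet_bytes=1000)   # 1000 bytes available
```

If the configured rates of all streams destined for one PE sum to no more than that PE's interface bandwidth, the streams cannot congest the fabric port on their own, which is the property the paragraph relies on.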

For PE-to-PE traffic, we cannot simply limit the traffic entering the switch, since it’s important to let PEs communicate freely with other PEs in the same node, without constraint. However, because entire PEs are allocated to nodes in the processing pool architecture, it’s possible to obtain good traffic isolation in a straightforward way for this case as well. In general, we need two properties from the switch fabric. First, it must support constrained routing, so that traffic from one node cannot be sent to PEs belonging to another node. Second, we need to ensure that congestion within one node does not affect traffic within another node. The emergence of 10G Ethernet as a backplane switching technology provides the first property. Such switches support VLAN-based routing that can be used to separate the traffic from different nodes. The second property is satisfied by any switching fabric that is nonblocking at the port level. While some switch fabrics fail to fully achieve the objective of nonblocking performance, this is the standard figure of merit for switching fabrics, and most come reasonably close to achieving it.
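The constrained-routing property can be modeled in a few lines. This is a toy model of the policy only, not switch configuration: the function names, node and PE labels, and the starting VLAN ID are all hypothetical, and the sketch assumes each PE belongs to exactly one node, as the processing pool architecture allocates entire PEs to nodes.

```python
def assign_vlans(node_to_pes, first_vlan_id=100):
    """Give each overlay node its own VLAN ID and map every PE to it.

    node_to_pes: dict mapping node name -> list of PE names.
    first_vlan_id: arbitrary starting ID chosen for this illustration.
    """
    pe_vlan = {}
    for offset, (node, pes) in enumerate(sorted(node_to_pes.items())):
        for pe in pes:
            if pe in pe_vlan:
                # Entire PEs are allocated to a single node in this model.
                raise ValueError(f"{pe} is assigned to more than one node")
            pe_vlan[pe] = first_vlan_id + offset
    return pe_vlan


def may_forward(pe_vlan, src_pe, dst_pe):
    """Constrained routing: allow PE-to-PE traffic only within one VLAN."""
    return pe_vlan[src_pe] == pe_vlan[dst_pe]


# Hypothetical allocation: node A owns two PEs, node B owns one.
pe_vlan = assign_vlans({"nodeA": ["pe1", "pe2"], "nodeB": ["pe3"]})
same_node = may_forward(pe_vlan, "pe1", "pe2")   # both PEs in nodeA's VLAN
cross_node = may_forward(pe_vlan, "pe1", "pe3")  # different VLANs
```

Because the check depends only on VLAN membership, PEs within a node can exchange traffic freely while traffic toward another node's PEs is refused, without any per-pair rate constraints.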