Internet Scale Overlay Hosting
Network overlays have become a popular tool for implementing Internet applications. While content-delivery networks provide the most prominent example of the commercial application of overlays, systems researchers have developed a variety of experimental overlay applications, demonstrating that the overlay approach can be an effective method for deploying a broad range of innovative systems. An Overlay Hosting Service (OHS) is a shared infrastructure that supports multiple overlay networks and can play an important role in enabling wider-scale use of overlays, since it enables small organizations to deploy new overlay services on a global scale without the burden of having to acquire and manage their own physical infrastructure. Currently, PlanetLab is the canonical example of an overlay hosting service, and it has proven to be an effective vehicle for supporting research in distributed systems and applications. This project seeks to make PlanetLab and systems like it more capable of supporting large-scale deployments of overlay services, through the creation of more capable platforms and the control mechanisms needed to provision resources for use by different applications.
As shown at right, an OHS can be implemented using distributed data centers, comprising servers and a communications substrate that includes both L2 switching for local communication and L3 routers for communication to end users and other data centers. End users communicate with a data center using the public Internet, while data centers can communicate with each other, using either the public Internet or dedicated backbone links, allowing the OHS to support provisioned overlay links. This in turn allows overlay network providers to deliver services that require consistent performance from the networking substrate.
We have developed an experimental prototype of a system for implementing overlay hosting services. We have selected PlanetLab as our target implementation context and have dubbed the system the Supercharged PlanetLab Platform (SPP). The SPP has a scalable architecture that accommodates multiple types of processing resources, including conventional server blades and Network Processor blades based on the Intel IXP 2850. We are working to deploy five SPP nodes in Internet 2 as part of a larger prototyping effort associated with the National Science Foundation's GENI initiative. The subsequent sections describe the SPP and our plans for GENI.
Supercharged PlanetLab Platform
The SPP is designed as a high performance substitute for a conventional PlanetLab node. The typical PlanetLab node is a conventional PC running a customized version of Linux that supports multiple virtual machines, using the Linux vServer mechanism. This allows different applications to share the node's computing and network resources, while being logically isolated from the other applications running on the node. PlanetLab's virtualization of the platform is imperfect, as it requires different vServers to share a common IP address and because it provides limited support for performance isolation. Nevertheless, PlanetLab has been very successful as an experimental platform for distributed applications.
The objective of the SPP is to boost the performance of PlanetLab sufficiently to allow it to serve as an effective platform for service delivery, not just for experimentation. There are several elements to this. First, the SPP is designed as a scalable system that incorporates multiple servers, while appearing to users and application developers like a conventional PlanetLab node. Second, the SPP makes use of Network Processor blades, in addition to conventional server blades, allowing developers to take advantage of the higher performance offered by NPs. Third, the SPP provides better control over both computing and networking resources, enabling developers to deliver more consistent performance to users. PlanetLab developers can use the SPP just like a conventional PlanetLab node, but in order to obtain the greatest performance benefits, they must structure their applications to take advantage of the NP resources provided by the SPP. We have tried to make this relatively painless, by supporting a simple fastpath/slowpath application structure, in which the NP is used to implement the most performance-critical parts of an application, while the more complex aspects of the application are handled by a general-purpose server that provides a conventional software execution environment. In this section, we provide an overview of the hardware and software components that collectively implement the SPP, and describe how they can be used to implement high performance PlanetLab applications.
Hardware Components
The hardware components of our prototype SPP are shown at right. The system consists of a number of processing components that are connected by an Ethernet switching layer. From a developer's perspective, the most important components of the system are the Processing Engines that host the applications. The SPP includes two types of PEs. The General-Purpose Processing Engines (GPE) are conventional server blades running the standard PlanetLab operating system. The current GPEs are dual Xeons with a clock rate of xx, xx GB of DRAM and xx GB of on-board disk. The Network Processor Blades (NPE) include two IXP 2850s, each with 16 cores for processing packets and an xScale management processor. Each IXP has 750 MB of RDRAM plus four independent SRAM banks, and the two share an 18 Mb TCAM. The NPE also has a 10 GbE network connection and is capable of forwarding packets at 10 Gb/s.
All input and output passes through a Line Card (LC) that has ten GbE interfaces. In a typical deployment, some of these interfaces will have public IP addresses and be accessible through the Internet, while others will be used for direct connection to other SPP nodes. The LC is implemented with a Network Processor blade and handles the routing of traffic between the external interfaces and the GPEs and NPEs. This is done by configuring filters and queues within the LC. The system is managed by a Control Processor (CP) that configures application slices based on slice descriptions obtained from PlanetLab Central, a centralized database that is used to manage the global PlanetLab infrastructure. The CP also hosts a netFPGA, allowing application developers to implement processing in configurable hardware, as well as software.
Our prototype system is shown in the photograph. We are using board level components that are compatible with the Advanced Telecommunication Computing Architecture (ATCA) standards. ATCA components include the server blades, the NP blades and the chassis switch, which actually includes both a 10 GbE switch and a 1 GbE control switch. The ATCA components are augmented with an external 1 GbE switch and a conventional rack-mount server that implements the CP.
Network Processor blades are used for both the NPE and the LC. NPs are engineered for high performance on IO-intensive packet processing workloads. The dual Intel IXP 2850s used in the SPP's NP blades each contain 16 multi-threaded Micro-Engines (ME) that do the bulk of the packet processing, plus several high bandwidth memory interfaces. In typical applications, DRAM is used primarily for packet buffers, while SRAM is used for implementing lookup tables and linked list queues. There are also special-purpose on-chip memory resources, both within the MEs and shared across the MEs. An xScale Management Processor (MP) is provided for overall system control. The MP typically runs a general-purpose OS like Linux, and has direct access to all of system memory and direct control over the MEs.
As with any modern processor, the primary challenge to achieving high performance is coping with the large processor/memory latency gap. Retrieving data from off-chip memory can take 50-100 ns (or more), meaning that in the time it takes to retrieve a piece of data from memory, a processor can potentially execute over 100 instructions. The challenge for system designers is to ensure that the processor stays busy in spite of this. Conventional processors cope with the memory latency gap primarily by using caches. However, for caches to be effective, applications must exhibit locality of reference, and unfortunately, networking applications typically exhibit limited locality of reference with respect to their data.
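As a rough check on the figures above, the following sketch converts a memory latency into the number of instruction slots that pass while the access is outstanding. The 1.4 GHz clock rate and the rate of one instruction per cycle are assumptions chosen for illustration; neither is stated in the text.

```python
def instructions_lost(latency_ns: float, clock_ghz: float, ipc: float = 1.0) -> float:
    """Instruction slots that pass while one memory access is outstanding."""
    cycles = latency_ns * clock_ghz  # ns * (cycles per ns)
    return cycles * ipc


ASSUMED_CLOCK_GHZ = 1.4  # assumption for illustration, not taken from the text

for latency_ns in (50, 100):
    lost = instructions_lost(latency_ns, ASSUMED_CLOCK_GHZ)
    print(f"{latency_ns} ns stall at {ASSUMED_CLOCK_GHZ} GHz: about {lost:.0f} instruction slots")
```

Under these assumptions, a stall at the upper end of the quoted range costs well over 100 instruction slots, which is consistent with the estimate in the paragraph.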
Since caches are relatively ineffective for networking workloads, the IXP provides a different mechanism for coping with the memory latency gap: hardware multi-threading. Each of the MEs has eight separate sets of processor registers (including Program Counter), which form the ME's hardware thread contexts. An ME can switch from one context to another in 2 clock cycles, allowing it to stay busy doing useful work, even when several of its hardware threads are suspended, waiting for data to be retrieved from external memory. Multithreading can be used in a variety of ways, but there are some common usage patterns that are well-supported by hardware mechanisms. Perhaps the most commonly used (and simplest) such pattern involves a group of threads that operate in a round-robin fashion, using hardware signals to pass control explicitly from one thread to the next. Round-robin processing ensures that data items are processed in order and is straightforward to implement. It works well when the variation in processing times from one item to the next is bounded, which is commonly the case in packet processing contexts.
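The round-robin pattern can be illustrated with a small simulation. This is only a sketch: Python generators stand in for the eight hardware thread contexts, and each `yield` stands in for a context switch taken while a memory read is in flight. The function names are hypothetical and nothing here corresponds to actual ME code.

```python
from collections import deque

NUM_CONTEXTS = 8  # thread contexts per ME, as stated in the text


def context(ctx_id, inbox, outbox):
    """One thread context: take a packet, wait on memory, then emit it."""
    while inbox:
        pkt = inbox.popleft()         # take the next packet in arrival order
        yield                         # memory read in flight: pass control on
        outbox.append((ctx_id, pkt))  # data has arrived: finish the packet


def run_round_robin(packets):
    """Run the contexts in strict round-robin order until all work is done."""
    inbox, outbox = deque(packets), []
    ready = deque(context(i, inbox, outbox) for i in range(NUM_CONTEXTS))
    while ready:
        ctx = ready.popleft()
        try:
            next(ctx)          # run until the context's next memory access
            ready.append(ctx)  # rejoin the ring behind the other contexts
        except StopIteration:
            pass               # no packets left for this context
    return outbox


result = run_round_robin(list(range(20)))
print([pkt for _, pkt in result])  # packets leave in their arrival order
```

Because control always passes to the next context in the ring, packets complete in the order they arrived even though up to eight of them are in progress at once.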
Each ME has a relatively small dedicated program store, from which it executes instructions. This limits the number of different functions that can be implemented by a single ME, favoring programs that are divided into smaller pieces and organized as a pipeline. The MEs support such pipeline processing by providing dedicated FIFOs between consecutive pairs of MEs (Next Neighbor Rings). A pipelined program structure also makes it easy to use the processing power of the MEs effectively, since the parallel components of the system are largely decoupled from one another.
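The pipelined structure described above can be sketched as a chain of small stages connected by bounded FIFOs. This is a toy model under stated assumptions: the ring size of four entries, the three stage functions and the packet fields are all invented for the example, and the real Next Neighbor Rings are hardware structures rather than Python deques.

```python
from collections import deque


def run_pipeline(stages, packets, ring_size=4):
    """Push packets through the stages, with one FIFO between adjacent stages."""
    rings = [deque(packets)] + [deque() for _ in stages]
    last = len(stages) - 1
    while any(rings[:-1]):             # some packet has not reached the end yet
        for i in range(last, -1, -1):  # let later stages drain first
            has_room = i == last or len(rings[i + 1]) < ring_size
            if rings[i] and has_room:
                rings[i + 1].append(stages[i](rings[i].popleft()))
    return list(rings[-1])


# Hypothetical stages, each small enough to fit a limited program store.
def parse(pkt):
    return {**pkt, "key": pkt["dst_port"]}


def lookup(pkt):
    return {**pkt, "queue": pkt["key"] % 4}


def header_format(pkt):
    return {**pkt, "ttl": pkt["ttl"] - 1}


packets = [{"dst_port": port, "ttl": 64} for port in (80, 443, 5001)]
for pkt in run_pipeline([parse, lookup, header_format], packets):
    print(pkt)
```

Each stage only touches its input and output FIFOs, which is what keeps the stages decoupled from one another.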
The two NPs on the card share a Ternary Content Addressable Memory (TCAM). The TCAM can be configured for key sizes ranging from 72 bits up to 576 bits. In the LC, it is used to implement general packet filters based on the standard IP 5-tuple (source+destination address, protocol, source+destination port). In the NPE, it is used to implement generic lookups with application-specific semantics. The two NPs communicate through an onboard SPI switch with the backplane and with an optional IO card that has ten 1 GbE interfaces. In the SPP, the LC uses an IO card, while the NPE does not.
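To make ternary matching on a 5-tuple concrete, here is a minimal software sketch. Each entry holds a value and a mask, and a mask bit of 0 means "don't care". The field layout, the `TcamEntry` type, the helper names and the result strings are assumptions made for the example. A real TCAM compares all entries in parallel; the list order here only models entry priority.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class TcamEntry:
    value: int   # bits to compare against the key
    mask: int    # 1 = bit must match, 0 = don't care
    result: str  # stand-in for the result the lookup returns


def ip(addr: str) -> int:
    return int(ipaddress.IPv4Address(addr))


def pack_5tuple(src_ip, dst_ip, proto, src_port, dst_port) -> int:
    """Pack a 5-tuple into a 104-bit key (32 + 32 + 8 + 16 + 16 bits)."""
    return (src_ip << 72) | (dst_ip << 40) | (proto << 32) | (src_port << 16) | dst_port


def tcam_lookup(table, key):
    """Return the result of the first (highest priority) matching entry."""
    for entry in table:
        if (key & entry.mask) == (entry.value & entry.mask):
            return entry.result
    return None


UDP = 17
table = [
    # UDP packets to 10.1.2.3 port 5000, from any source address and port.
    TcamEntry(
        value=pack_5tuple(0, ip("10.1.2.3"), UDP, 0, 5000),
        mask=pack_5tuple(0, 0xFFFFFFFF, 0xFF, 0, 0xFFFF),
        result="queue 7",
    ),
    # An all-zero mask matches every key, so this acts as a default filter.
    TcamEntry(value=0, mask=0, result="default"),
]

key = pack_5tuple(ip("198.51.100.9"), ip("10.1.2.3"), UDP, 40000, 5000)
print(tcam_lookup(table, key))
```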
Network Processor Datapath Software
The software components used in the Line Card are shown at right. One of the two NPs on the LC is used to process incoming traffic (from the external interfaces to the chassis switch) and the other is used to process outgoing traffic. In both cases, the processing is structured as a pipeline in which packets flow from stage to stage through inter-stage buffers (not shown). Most stages use more than one MicroEngine (ME), typically with the load being distributed in round-robin fashion across the MEs and their individual thread contexts.
In the ingress pipeline, the RxIn block copies arriving packets from the IO interface into fixed size DRAM buffers (2KB each) and passes a pointer to the next stage in the pipeline. The Key Extract block reads fields from the packet header that are required to form a lookup key and passes these on to the next stage. The Lookup block formats the lookup key and issues queries to the external TCAM, which returns a pointer to results in an external SRAM. These results include modified values of selected header fields and a queue identifier. The lookup key includes the hardware interface on which the packet was received and the full IP header 5-tuple. Packets for which there is no configured filter are discarded and counted. The control software can avoid such discarding by including a default filter that directs any otherwise unmatched packets to the SPP's Control Processor. The lookup result includes the address of the hardware component which is to get the packet next and a queue identifier. The Header Format block rewrites selected fields in the header portion of the packet stored in DRAM. The Queue Manager block implements a general queueing subsystem. Each of the four MEs implements five independent packet schedulers. Each packet scheduler implements a Weighted Deficit Round Robin scheduling algorithm, and can be independently rate controlled, limiting its total output traffic. Each queue has a configurable weight that determines its share of its scheduler's output bandwidth, and a configurable discard threshold that limits the number of packets it can hold. The TxIn block, at the end of the pipeline, copies packets from their DRAM buffers to the output interface and returns buffers to the free list.
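The weighted deficit round robin behavior of a single packet scheduler can be sketched as follows. This is a textbook-style model, not the Queue Manager's actual code: the class and method names are hypothetical, the base quantum of 1000 bytes is an assumption, and the per-scheduler rate control mentioned above is not modeled.

```python
from collections import deque


class WdrrScheduler:
    """Toy weighted deficit round robin scheduler over packet lengths."""

    def __init__(self, base_quantum=1000):
        self.base_quantum = base_quantum  # bytes of credit per unit of weight
        self.queues = {}      # queue id -> deque of packet lengths in bytes
        self.weights = {}     # queue id -> weight
        self.thresholds = {}  # queue id -> discard threshold in packets
        self.deficits = {}    # queue id -> unused credit in bytes

    def add_queue(self, qid, weight, discard_threshold):
        self.queues[qid] = deque()
        self.weights[qid] = weight
        self.thresholds[qid] = discard_threshold
        self.deficits[qid] = 0

    def enqueue(self, qid, length):
        """Queue a packet; return False if the discard threshold drops it."""
        if len(self.queues[qid]) >= self.thresholds[qid]:
            return False
        self.queues[qid].append(length)
        return True

    def service_round(self):
        """Visit every queue once and return the (queue id, length) pairs sent."""
        sent = []
        for qid, queue in self.queues.items():
            if not queue:
                continue
            self.deficits[qid] += self.weights[qid] * self.base_quantum
            while queue and queue[0] <= self.deficits[qid]:
                length = queue.popleft()
                self.deficits[qid] -= length
                sent.append((qid, length))
            if not queue:
                self.deficits[qid] = 0  # an idle queue keeps no credit
        return sent


sched = WdrrScheduler()
sched.add_queue("a", weight=3, discard_threshold=64)
sched.add_queue("b", weight=1, discard_threshold=64)
for _ in range(40):
    sched.enqueue("a", 1000)
    sched.enqueue("b", 1000)

sent = [item for _ in range(10) for item in sched.service_round()]
print(sum(1 for qid, _ in sent if qid == "a"), "packets from a")
print(sum(1 for qid, _ in sent if qid == "b"), "packets from b")
```

While both queues stay backlogged, the packets sent from each are in proportion to the weights, which is how a weight translates into a share of the scheduler's output bandwidth.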
There are two additional MEs used in the ingress pipeline that are not shown. One manages the free space list, while the other implements statistics counters on behalf of the other blocks. We have also not shown the connections to/from the xScale. Arriving packets may be diverted to the xScale by the Lookup block and may be injected into the pipeline by the xScale at the Header Format block.
The egress pipeline is similar to the ingress pipeline, although some of the details differ. One major difference is that the egress pipeline includes a Flow Statistics block that keeps track of outgoing traffic on a per-flow basis. This supports the PlanetLab requirement that all outgoing traffic be traceable back to the specific application slice that sent it, which allows PlanetLab staff to respond to complaints when slices send unwanted traffic to destination hosts on the Internet. Traffic statistics are compiled by the Flow Stats block (which is implemented by two MEs) and passed on to the xScale. The xScale sends aggregated traffic records to the Control Processor, and these records are ultimately sent on to PlanetLab Central where they are archived.
The processing of packets by the Line Card is largely determined by TCAM filters. Many of these filters are installed in response to requests from SPP users. Filters can be created by users either explicitly or implicitly. Explicit configuration is used to configure a TCP or UDP port on an interface, and will be discussed further in a later section. Implicit configuration is done using Network Address Translation. The LC implements NAT for outgoing connections originating on the CP or GPEs. For example, if the LC receives an outgoing packet for a UDP connection for which there is no configured filter, it will forward the packet to the xScale. The NAT daemon running on the xScale will then allocate a free source port number for the outgoing interface and install a filter in the TCAM, so that subsequent packets belonging to that UDP connection will have their source port numbers translated as they pass through the egress pipeline. A filter is also inserted into the ingress pipeline so that the reverse translation is performed on arriving packets belonging to the connection. NAT is also performed for TCP and ICMP packets, but it is not implemented for higher level applications that embed port numbers in their payloads. This means, for example, that the SPP does not support outgoing FTP connections, although it can (and does) support SSH and SFTP. Also note that while port number translation is performed on both ingress and egress packets, only egress packets trigger the addition of a new filter pair.
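The implicit filter creation described above can be sketched as a small model of the NAT logic. This is an illustration only: the `NatDaemon` class, its method names, the port range and the use of dictionaries in place of TCAM filters are all assumptions, and only the source port is translated, as in the description.

```python
class NatDaemon:
    """Toy model of the slow-path NAT logic; not the SPP's actual code."""

    def __init__(self, first_port=20000, num_ports=4):
        # The external port range is an arbitrary assumption for the example.
        self.free_ports = list(range(first_port, first_port + num_ports))
        self.egress_filters = {}   # connection -> translated source port
        self.ingress_filters = {}  # reply key -> (origin host, original port)

    def egress(self, host, proto, src_port, dst_ip, dst_port):
        """Translate an outgoing packet, installing a filter pair on a miss."""
        conn = (host, proto, src_port, dst_ip, dst_port)
        if conn not in self.egress_filters:
            if not self.free_ports:
                raise RuntimeError("no free external ports")
            ext_port = self.free_ports.pop(0)
            self.egress_filters[conn] = ext_port
            self.ingress_filters[(proto, dst_ip, dst_port, ext_port)] = (host, src_port)
        return (proto, self.egress_filters[conn], dst_ip, dst_port)

    def ingress(self, proto, src_ip, src_port, dst_port):
        """Reverse-translate an arriving packet; a miss never adds a filter."""
        return self.ingress_filters.get((proto, src_ip, src_port, dst_port))


nat = NatDaemon()
out = nat.egress("GPE1", "udp", 5353, "192.0.2.7", 53)
print(out)                                          # translated outgoing packet
print(nat.ingress("udp", "192.0.2.7", 53, out[1]))  # reply maps back to GPE1
print(nat.ingress("udp", "192.0.2.7", 53, 9))       # unsolicited packet: None
```

Note that `ingress` only reads the filter table, mirroring the rule that only egress packets trigger the addition of a new filter pair.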
The figure at right shows the current version of the software that runs on the NPE. This software uses just one of the two NPs on the blade and is structured similarly to the Line Card software. The most significant difference is that the NPE is designed to implement the fastpaths of different PlanetLab applications. Since different applications may have different header formats, the NPE supports multiple code options for handling different formats. Each slice selects a particular code option for its fastpath processing, and multiple slices can use the same code option. Code options are implemented within the Parse and Header Format blocks, which precede and follow the Lookup block. The Parse block extracts the appropriate header fields and forms a 112-bit lookup key, which it passes on to the Lookup block. The Lookup block treats this as a generic bit string, which it uses to query the TCAM. The result returned by the TCAM is passed on to the Header Format block, which does whatever post-lookup processing is required, such as modifying fields in the outgoing packet headers. Each slice has its own set of filters in the TCAM (each filter includes a slice id as a hidden part of its lookup key, allowing the Lookup block to restrict lookups to the keys that are relevant for a particular packet), and has the freedom to define the semantics of the lookup key and result in whatever way is appropriate to it. Software running in a GPE on behalf of a slice can insert filters into an NPE using a generic interface that treats the filter as an unstructured bit string. We generally expect slice developers to provide more structured interfaces that are semantically meaningful to their higher level software, and implement those interfaces using the lower level generic interface provided by the SPP.
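The effect of the hidden slice id can be sketched in a few lines. In this model the slice id is prepended to the 112-bit key and is always fully masked in, so a slice's filters can only match that slice's packets. The 16-bit width of the slice id field, the `NpeFilterTable` class and its method names are assumptions; the text does not give the actual key layout or interface.

```python
KEY_BITS = 112      # lookup key width produced by the Parse block
SLICE_ID_BITS = 16  # assumed width of the hidden slice id field
SLICE_MASK = ((1 << SLICE_ID_BITS) - 1) << KEY_BITS


class NpeFilterTable:
    """Toy per-slice filter table; the list order models filter priority."""

    def __init__(self):
        self.entries = []  # (value, mask, result), slice id bits included

    def install(self, slice_id, value, mask, result):
        """Add a filter given as unstructured bits, tagged with its slice."""
        if value >> KEY_BITS or mask >> KEY_BITS:
            raise ValueError("filter does not fit in the lookup key")
        full_value = (slice_id << KEY_BITS) | (value & mask)
        full_mask = SLICE_MASK | mask  # slice id bits are never "don't care"
        self.entries.append((full_value, full_mask, result))

    def lookup(self, slice_id, key):
        """Match a key on behalf of a slice; other slices' filters never hit."""
        full_key = (slice_id << KEY_BITS) | key
        for value, mask, result in self.entries:
            if (full_key & mask) == value:
                return result
        return None


table = NpeFilterTable()
# Two slices use the same key bits but give them their own meaning.
table.install(slice_id=1, value=0xAB, mask=0xFF, result="slice 1: queue 3")
table.install(slice_id=2, value=0xAB, mask=0xFF, result="slice 2: queue 9")

print(table.lookup(1, 0xAB))
print(table.lookup(2, 0xAB))
print(table.lookup(3, 0xAB))  # slice 3 has installed no filters
```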
Our original intention had been to have each of the two NPs on the blade execute the software described above, running independently of the other. Unfortunately, the blade does not support bi-directional sharing of the switch interface by the two NPs, making it impossible to implement this approach. Consequently, we are planning a second version of the NPE software that distributes the processing tasks across the two NPs. This will give us an NPE capable of sustained throughputs up to 10 Gb/s. At the same time, we are making some additional improvements. The most significant of these is the addition of support for multicast.
As shown in the figure, the new NPE software will devote eight MEs to the parse block, substantially increasing the number of cycles available for application-specific packet processing. The results of the lookup produced in the first NP will be inserted into a shim header that will be propagated to the second NP. This will include a secondary lookup index which will be used in the second NP to access other information, such as the set of outgoing interfaces to which the packet should be forwarded. The Header Format block has been moved after the QM, allowing for the copies of multicast packets to be handled differently.
Control Software
The figure at right shows the major components of the SPP control software. The central component of the system is the System Resource Manager that runs on the CP. The SRM's functions include retrieving slice descriptions from PlanetLab Central (PLC), instantiating slices on the SPP and allocating resources (such as external port numbers, interface bandwidth and NPE resources) to slices. Each GPE includes a Resource Management Proxy that provides an interface between applications and other system components. The NPE and LC each include a Substrate Control Daemon that provides the interface for configuring the NP. These various software components communicate with one another through the SPP's control network, which is distinct from the switch used for application traffic.
To deploy an application on an SPP, a developer first defines a slice in PLC and specifies that the SPP of interest is to be included in the slice. The SRM periodically retrieves slice definitions from PLC and, on detecting a new slice definition, instantiates the slice. This involves selecting a GPE to host the slice and creating a new Linux vServer on the selected GPE. Once a slice has been instantiated, the developer can log in to the slice, giving them access to a shell running within their vServer. At this point, the only resource associated with the slice is the vServer itself.
Additional resources can be requested by the slice through the RMP running on the GPE. A set of command line utilities is provided to allow a developer to allocate resources manually or from a script. Alternatively, applications running within the slice can invoke RMP operations directly using a provided subroutine library. For example, to run a server within a slice, a developer must first acquire an external port number on which the server can listen. Since the various components of an SPP share the port number space on each of the external interfaces, this requires some system-wide coordination, and configuration of Line Card filters in order to route packets between the external interface and a specific vServer running on a specific GPE. The RMP provides the interface that slices can use to obtain such external port numbers. Slices can also use the RMP to request a share of the bandwidth on an external interface; reserved bandwidth shares are implemented using per-slice queues in the LC with the appropriate WDRR weights.
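The coordination needed to hand out external port numbers can be sketched as a shared registry. This is a hypothetical model, not the RMP's actual interface: the `PortRegistry` class, its method names, and the interface, slice and GPE labels are all invented for the example, and a dictionary stands in for the Line Card filters.

```python
class PortRegistry:
    """Toy system-wide registry of external ports and the filters they imply."""

    def __init__(self):
        self.owners = {}      # (interface, proto, port) -> owning slice
        self.lc_filters = {}  # (interface, proto, port) -> (gpe, slice)

    def reserve(self, slice_name, gpe, interface, proto, port):
        """Reserve a port for a slice; fail if any slice already holds it."""
        key = (interface, proto, port)
        if key in self.owners:
            return False
        self.owners[key] = slice_name
        # Model the Line Card filter that routes matching packets to the
        # vServer of this slice on its GPE.
        self.lc_filters[key] = (gpe, slice_name)
        return True

    def route(self, interface, proto, port):
        """Where the Line Card would send a packet arriving for this port."""
        return self.lc_filters.get((interface, proto, port))


registry = PortRegistry()
print(registry.reserve("slice_a", "GPE1", "if0", "tcp", 8080))  # granted
print(registry.reserve("slice_b", "GPE2", "if0", "tcp", 8080))  # port is taken
print(registry.reserve("slice_b", "GPE2", "if1", "tcp", 8080))  # other interface
print(registry.route("if0", "tcp", 8080))
```

Keeping the ownership table and the filter table in one place reflects the point made above: a port grant is only useful if the corresponding Line Card filter is installed along with it.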
Slices can also use the RMP to instantiate a fastpath on the NPE, associate additional NPE resources with their fastpath and configure those resources as required by the application. Assignable fastpath resources include packet filters, packet buffers, queues, memory and bandwidth. Slices can configure different logical interfaces (called meta-interfaces) for their fastpaths and associate those meta-interfaces with specific physical Line Card interfaces. Slices can associate queues with specific meta-interfaces and configure different queues for different shares of the meta-interface bandwidth. Slices can define packet filters and assign a specific meta-interface and queue for matching packets.
The actual configuration of the resources on an NPE takes place through the Substrate Control Daemon for the NPE (SCD/N), which runs on the xScale management processor within the NP. The SCD provides a messaging interface through which filters can be installed in the TCAM and through which queues and packet scheduling parameters can be configured. The SCD can also read and write all of an NP's memory, allowing it to monitor memory-resident data, such as traffic counters and other statistics maintained by the NP datapath software. A slice fastpath typically has a dedicated region of NP memory that can be used by the fastpath for slice-specific configuration data, and it can use the SCD's general memory access mechanism to read and write this memory region as appropriate to monitor and control its fastpath. The SCD/L provides similar control over the various configurable resources of the Line Card. These include LC queues, queue parameters, traffic schedulers and filters. Resources requested through the RMP often result in configuration of LC resources, not just NPE resources.
There are a number of other control software elements. The Slice Login Manager (SLM) enables login to slices by remote slice developers. Incoming SSH connection requests are routed by LC filters to the SLM, which runs on the CP. The SLM handles authentication, determines which GPE is hosting the vServer for the slice and forwards the request to the target GPE. Once the connection is established, packets flow through a tunneled connection that passes through the CP. We have already mentioned the NAT daemon that runs in the Line Card and supports network address translation for TCP and UDP connections originating from a GPE, as well as ICMP packets. Developers can use the provided NAT support to set up SFTP connections from their slice to a remote server when transferring code and data to or from the slice. This is preferred to sending that traffic through the login SSH connection, since these packets need not pass through the CP. The SPP also provides software to enable real-time performance monitoring of slice fastpaths. A program that runs within a slice's vServer polls a monitoring daemon in the NPE, which retrieves performance counter values from the NPE's memory. These are then forwarded to a remote display program that displays the data in the form of real-time charts. This mechanism is quite flexible and allows users and developers to observe what is happening "under the covers" as traffic flows through their slice fastpath.
Planned SPP Deployment
map of planned node locations
details of a typical site with connections to router and other sites
Using SPPs
Discuss how to define slices in myPLC, login to SPP nodes and do configuration. Keep the main flow at a high level, but add a page that gives a tutorial on how the GEC 4 demo is done.