Internet Scale Overlay Hosting

From ARL Wiki

Revision as of 15:49, 18 March 2009

Currently under reconstruction - check back later

Network overlays have become a popular tool for implementing Internet applications. While content-delivery networks provide the most prominent example of the commercial application of overlays, systems researchers have developed a variety of experimental overlay applications, demonstrating that the overlay approach can be an effective method for deploying a broad range of innovative systems. An Overlay Hosting Service (OHS) is a shared infrastructure that supports multiple overlay networks and can play an important role in enabling wider-scale use of overlays, since it enables small organizations to deploy new overlay services on a global scale without the burden of having to acquire and manage their own physical infrastructure. Currently, PlanetLab is the canonical example of an overlay hosting service, and it has proven to be an effective vehicle for supporting research in distributed systems and applications. This project seeks to make PlanetLab and systems like it more capable of supporting large-scale deployments of overlay services, through the creation of more capable platforms and the control mechanisms needed to provision resources for use by different applications.

Overlay Hosting Service

As shown at right, an OHS can be implemented using distributed data centers, comprising servers and a communications substrate that includes both L2 switching for local communication and L3 routers for communication to end users and other data centers. End users communicate with a data center using the public Internet, while data centers can communicate with each other, using either the public Internet or dedicated backbone links, allowing the OHS to support provisioned overlay links. This in turn allows overlay network providers to deliver services that require consistent performance from the networking substrate.

We have developed an experimental prototype of a system for implementing overlay hosting services. We have selected PlanetLab as our target implementation context and have dubbed the system the Supercharged PlanetLab Platform (SPP). The SPP has a scalable architecture that accommodates multiple types of processing resources, including conventional server blades and Network Processor blades based on the Intel IXP 2850. We are working to deploy five SPP nodes in Internet2 as part of a larger prototyping effort associated with the National Science Foundation's GENI initiative. The subsequent sections describe the SPP and our plans for GENI.

Supercharged PlanetLab Platform

The SPP is designed as a high performance substitute for a conventional PlanetLab node. The typical PlanetLab node is a conventional PC running a customized version of Linux that supports multiple virtual machines, using the Linux vServer mechanism. This allows different applications to share the node's computing and network resources, while being logically isolated from the other applications running on the node. PlanetLab's virtualization of the platform is imperfect, as it requires different vServers to share a common IP address and because it provides limited support for performance isolation. Nevertheless, PlanetLab has been very successful as an experimental platform for distributed applications.

The objective of the SPP is to boost the performance of PlanetLab sufficiently to allow it to serve as an effective platform for service delivery, not just for experimentation. There are several elements to this. First, the SPP is designed as a scalable system that incorporates multiple servers, while appearing to users and application developers like a conventional PlanetLab node. Second, the SPP makes use of Network Processor blades, in addition to conventional server blades, allowing developers to take advantage of the higher performance offered by NPs. Third, the SPP provides better control over both computing and networking resources, enabling developers to deliver more consistent performance to users. PlanetLab developers can use the SPP just like a conventional PlanetLab node, but in order to obtain the greatest performance benefits, they must structure their applications to take advantage of the NP resources provided by the SPP. We have tried to make this relatively painless, by supporting a simple fastpath/slowpath application structure, in which the NP is used to implement the most performance-critical parts of an application, while the more complex aspects of the application are handled by a general-purpose server that provides a conventional software execution environment. In this section, we provide an overview of the hardware and software components that collectively implement the SPP, and describe how they can be used to implement high performance PlanetLab applications.
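
The fastpath/slowpath split can be illustrated with a small sketch. It is not the SPP's actual interface: the class names, the table layout and the choice to have the slow path install an entry for later packets are all assumptions made for illustration. The fast path stands in for code running on the NP, and the slow path for code running on a general-purpose server.

```python
class SlowPath:
    """Stand-in for the general-purpose server side of an application."""

    def __init__(self, routes):
        self.routes = routes      # hypothetical full routing state
        self.handled = 0          # how many packets needed the slow path

    def handle(self, dest, fast_table):
        self.handled += 1
        next_hop = self.routes.get(dest, "drop")
        # Assumed policy: install the answer so later packets stay on the fast path.
        fast_table[dest] = next_hop
        return next_hop


class FastPath:
    """Stand-in for the NP side: a simple table lookup per packet."""

    def __init__(self, slow_path):
        self.table = {}
        self.slow_path = slow_path

    def forward(self, dest):
        hit = self.table.get(dest)
        if hit is not None:
            return hit                                   # common case, no server involved
        return self.slow_path.handle(dest, self.table)   # exceptional case


slow = SlowPath({"overlay-node-a": "port1"})
fast = FastPath(slow)
for dest in ["overlay-node-a", "overlay-node-a", "unknown-node"]:
    print(dest, "->", fast.forward(dest))
print("packets sent to the slow path:", slow.handled)
```

In this sketch only the first packet for each destination reaches the slow path, which is the property that makes the split worthwhile when most traffic follows a few simple rules.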

Hardware Components

Supercharged PlanetLab Platform Hardware Components

The hardware components of our prototype SPP are shown at right. The system consists of a number of processing components that are connected by an Ethernet switching layer. From a developer's perspective, the most important components of the system are the Processing Engines (PEs) that host the applications. The SPP includes two types of PEs. The General-Purpose Processing Engines (GPE) are conventional server blades running the standard PlanetLab operating system. The current GPEs are dual Xeons with a clock rate of xx, xx GB of DRAM and xx GB of on-board disk. The Network Processor Blades (NPE) include two IXP 2850s, each with 16 cores for processing packets and an XScale management processor. Each IXP has 750 MB of RDRAM plus four independent SRAM banks, and the two share an 18 Mb TCAM. The NPE also has a 10 GbE network connection and is capable of forwarding packets at 10 Gb/s.

All input and output passes through a Line Card (LC) that has ten GbE interfaces. In a typical deployment, some of these interfaces will have public IP addresses and be accessible through the Internet, while others will be used for direct connection to other SPP nodes. The LC is implemented with a Network Processor blade and handles the routing of traffic between the external interfaces and the GPEs and NPEs. This is done by configuring filters and queues within the LC. The system is managed by a Control Processor (CP) that configures application slices based on slice descriptions obtained from PlanetLab Central, a centralized database that is used to manage the global PlanetLab infrastructure. The CP also hosts a NetFPGA, allowing application developers to implement processing in configurable hardware as well as in software.

Our prototype system is shown in the photograph. We are using board-level components that are compatible with the Advanced Telecommunications Computing Architecture (ATCA) standards. ATCA components include the server blades, the NP blades and the chassis switch, which actually includes both a 10 GbE switch and a 1 GbE control switch. The ATCA components are augmented with an external 1 GbE switch and a conventional rack-mount server that implements the CP.

Network Processor Blades

Network Processor blades are used for both the NPE and the LC. NPs are engineered for high performance on IO-intensive packet processing workloads. The dual Intel IXP 2850s used in the SPP's NP blades each contain 16 multi-threaded Micro-Engines (ME) that do the bulk of the packet processing, plus several high bandwidth memory interfaces. In typical applications, DRAM is used primarily for packet buffers, while SRAM is used for implementing lookup tables and linked list queues. There are also special-purpose on-chip memory resources, both within the MEs and shared across the MEs. An XScale Management Processor (MP) is provided for overall system control. The MP typically runs a general-purpose OS like Linux, and has direct access to all of system memory and direct control over the MEs.

As with any modern processor, the primary challenge to achieving high performance is coping with the large processor/memory latency gap. Retrieving data from off-chip memory can take 50-100 ns (or more), meaning that in the time it takes to retrieve a piece of data from memory, a processor can potentially execute over 100 instructions. The challenge for system designers is to ensure that the processor stays busy in spite of this. Conventional processors cope with the memory latency gap primarily using caches. However, for caches to be effective, applications must exhibit locality of reference, and unfortunately, networking applications typically exhibit limited locality of reference with respect to their data.
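
To make the latency arithmetic concrete, the short calculation below converts a memory latency into the number of clock cycles that pass while the access is outstanding. The 1.4 GHz clock rate is an assumption chosen for illustration (the text above does not state one), and the estimate further assumes roughly one instruction per cycle.

```python
def cycles_per_access(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that elapse while one off-chip memory access completes."""
    # 1 GHz is one cycle per nanosecond, so cycles = latency (ns) * clock (GHz).
    return latency_ns * clock_ghz


ASSUMED_CLOCK_GHZ = 1.4  # assumption for illustration, not taken from the text

for latency_ns in (50, 100):
    stalled = cycles_per_access(latency_ns, ASSUMED_CLOCK_GHZ)
    print(f"{latency_ns} ns at {ASSUMED_CLOCK_GHZ} GHz: about {stalled:.0f} cycles")
```

Under these assumptions, a 100 ns access costs well over 100 cycles, which is the gap that the mechanisms described next are meant to hide.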

Since caches are relatively ineffective for networking workloads, the IXP provides a different mechanism for coping with the memory latency gap: hardware multithreading. Each of the MEs has eight separate sets of processor registers (including Program Counter), which form the ME's hardware thread contexts. An ME can switch from one context to another in 2 clock cycles, allowing it to stay busy doing useful work, even when several of its hardware threads are suspended, waiting for data to be retrieved from external memory. Multithreading can be used in a variety of ways, but there are some common usage patterns that are well-supported by hardware mechanisms. Perhaps the most commonly used (and simplest) such pattern involves a group of threads that operate in a round-robin fashion, using hardware signals to pass control explicitly from one thread to the next. Round-robin processing ensures that data items are processed in order, works well when the variation in processing times from one item to the next is bounded (which is commonly the case in packet processing contexts), and is straightforward to implement.
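
The round-robin pattern can be modeled in a few lines of Python. This is only a software analogy, not ME microcode: each thread context is a generator, each `yield` stands for a point where the thread suspends while waiting on memory, and the scheduler loop plays the role of the hardware signal that passes control to the next thread. All function names are ours.

```python
from collections import deque

NUM_CONTEXTS = 8  # eight hardware thread contexts per ME, as described above


def thread_context(ctx_id, inbox, output):
    """One thread context, modeled as a generator."""
    while inbox:
        packet = inbox.popleft()          # claim the next item while holding the turn
        yield                             # suspended: waiting on external memory
        output.append((ctx_id, packet))   # resume, finish processing, emit the result


def run_round_robin(packets):
    """Run the contexts in strict rotation and return (context, packet) pairs."""
    inbox, output = deque(packets), []
    ring = deque(thread_context(i, inbox, output) for i in range(NUM_CONTEXTS))
    while ring:
        ctx = ring.popleft()
        try:
            next(ctx)         # run this context until it suspends again
            ring.append(ctx)  # pass control to the next context in the ring
        except StopIteration:
            pass              # this context found no more work and retires
    return output


results = run_round_robin(range(10))
print([packet for _, packet in results])
```

Because every context claims its item and emits its result only when its turn comes around, the results come out in arrival order even though up to eight items are in flight at once.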

Each ME has a relatively small dedicated program store, from which it executes instructions. This limits the number of different functions that can be implemented by a single ME, favoring programs that are divided into smaller pieces and organized as a pipeline. The MEs support such pipeline processing by providing dedicated FIFOs between consecutive pairs of MEs (Next Neighbor Rings). A pipelined program structure also makes it easy to use the processing power of the MEs effectively, since the parallel components of the system are largely decoupled from one another.
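
The pipeline organization can be sketched as a chain of small stages joined by bounded FIFOs. The sketch below is a software model only: the `Ring` class stands in for a ring between neighboring stages, and the ring capacity and the stage functions are arbitrary choices for illustration.

```python
from collections import deque


class Ring:
    """Bounded FIFO standing in for a ring between two neighboring stages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def empty(self):
        return not self.items


def run_pipeline(stages, inputs, ring_capacity=4):
    """Push `inputs` through `stages` (one-argument functions) in order."""
    rings = [Ring(ring_capacity) for _ in stages]   # rings[i] feeds stage i
    pending, results = deque(inputs), []
    while pending or any(not r.empty() for r in rings):
        # Visit stages from last to first so a stage frees a slot
        # before its upstream neighbor tries to fill it.
        for i in reversed(range(len(stages))):
            if rings[i].empty():
                continue
            if i + 1 == len(stages):
                results.append(stages[i](rings[i].items.popleft()))
            elif not rings[i + 1].full():           # otherwise wait: back-pressure
                rings[i + 1].items.append(stages[i](rings[i].items.popleft()))
        if pending and not rings[0].full():
            rings[0].items.append(pending.popleft())
    return results


stages = [lambda x: x + 1, lambda x: x * 2, str]
print(run_pipeline(stages, range(6)))
```

Each stage touches only its own input ring and the next one, which mirrors the decoupling noted above: a stage can be rewritten or replicated without changing the others.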

The two NPs on the card share a Ternary Content Addressable Memory (TCAM). The TCAM can be configured for key sizes ranging from 72 bits up to 576 bits. In the LC, it is used to implement general packet filters based on the standard IP 5-tuple (source+destination address, protocol, source+destination port). In the NPE, it is used to implement generic lookups with application-specific semantics. The two NPs communicate through an onboard SPI switch with the backplane and with an optional IO card that has ten 1 GbE interfaces. In the SPP, the LC uses an IO card, while the NPE does not.
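
A ternary match on a 5-tuple can be modeled as a value/mask comparison, as in the sketch below. The key layout (104 bits, fields packed in the order shown), the result strings and the class name are assumptions for illustration and do not describe the SPP's actual filter format. A real TCAM compares the key against all entries in parallel; the loop here only reproduces the first-match result.

```python
import ipaddress


def pack_key(src, dst, proto, sport, dport):
    """Pack an IPv4 5-tuple into a 104-bit integer (layout assumed for this sketch)."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s << 72) | (d << 40) | (proto << 32) | (sport << 16) | dport


class Tcam:
    """Software model of a ternary match table; earlier entries have priority."""

    def __init__(self):
        self.entries = []  # (value, mask, result)

    def add(self, value, mask, result):
        # Mask bits set to 0 are "don't care"; clear them in the stored value too.
        self.entries.append((value & mask, mask, result))

    def lookup(self, key):
        for value, mask, result in self.entries:
            if (key & mask) == value:
                return result
        return None


tcam = Tcam()
# Any source, destination in 10.0.0.0/8, UDP (protocol 17), any source port, port 53.
tcam.add(pack_key("0.0.0.0", "10.0.0.0", 17, 0, 53),
         pack_key("0.0.0.0", "255.0.0.0", 0xFF, 0, 0xFFFF),
         "queue-7")
tcam.add(0, 0, "default")  # all bits "don't care": matches every key

print(tcam.lookup(pack_key("192.0.2.1", "10.1.2.3", 17, 4000, 53)))
print(tcam.lookup(pack_key("192.0.2.1", "10.1.2.3", 6, 4000, 80)))
```

Reusing `pack_key` to build the mask keeps the field layout in one place, so a filter's wildcards are expressed per field rather than as raw bit positions.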

Network Processor Datapath Software

Line Card Software Components

The software components used in the Line Card are shown at right. One of the two NPs on the LC is used to process incoming traffic (from the external interfaces to the chassis switch) and the other is used to process outgoing traffic. In both cases, the processing is structured as a pipeline in which packets flow from stage to stage through inter-stage buffers (not shown). Most stages use more than one MicroEngine (ME), typically with the load being distributed in round-robin fashion across the MEs and their individual thread contexts.

In the ingress pipeline, the RxIn block copies arriving packets from the IO interface into fixed size DRAM buffers (2KB each) and passes a pointer to the next stage in the pipeline. The Key Extract block reads fields from the packet header that are required to form a lookup key and passes these on to the next stage. The Lookup block formats the lookup key and issues queries to the external TCAM, which returns a pointer to results in an external SRAM. These results include modified values of selected header fields and a queue identifier.
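
The first three ingress stages can be sketched in software as follows. This is an illustration under several assumptions: packets are plain IPv4 with a TCP or UDP header, the lookup key is the 5-tuple, a Python dictionary stands in for the TCAM and the SRAM results, and the function names and the `queue_id` field are ours rather than the Line Card's.

```python
import socket
import struct

BUFFER_SIZE = 2048  # fixed-size DRAM buffers (2 KB each), as described above


def rx_in(buffers, frame):
    """Copy an arriving packet into a fixed-size buffer; return its index."""
    if len(frame) > BUFFER_SIZE:
        raise ValueError("packet does not fit in one buffer")
    buffers.append(frame.ljust(BUFFER_SIZE, b"\0"))
    return len(buffers) - 1  # the "pointer" handed to the next stage


def key_extract(buffers, ptr):
    """Read the header fields needed for the lookup key (IPv4 5-tuple assumed)."""
    buf = buffers[ptr]
    header_len = (buf[0] & 0x0F) * 4              # IPv4 header length in bytes
    proto = buf[9]
    src, dst = struct.unpack_from("!4s4s", buf, 12)
    sport, dport = struct.unpack_from("!HH", buf, header_len)
    return (socket.inet_ntoa(src), socket.inet_ntoa(dst), proto, sport, dport)


def lookup(table, key):
    """Stand-in for the TCAM query: returns the result record or None."""
    return table.get(key)


def sample_packet():
    """Build a minimal IPv4/UDP packet (checksums left at zero) for the demo."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                     socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"))
    udp = struct.pack("!HHHH", 4000, 53, 8, 0)
    return ip + udp


buffers = []
ptr = rx_in(buffers, sample_packet())
key = key_extract(buffers, ptr)
table = {key: {"queue_id": 3}}  # hypothetical result record
print(key, lookup(table, key))
```

Only the buffer index moves between stages in this sketch; the packet itself is copied once, in `rx_in`, which matches the pointer-passing structure described above.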

NPE Software Components

Planned Revision to NPE Software

Control Software

Planned SPP Deployment

map of planned node locations

details of a typical site with connections to router and other sites

Using SPPs

Discuss how to define slices in myPLC, login to SPP nodes and do configuration. Keep the main flow at a high level, but add a page that gives a tutorial on how the GEC 4 demo is done.