Internet Scale Overlay Hosting
[under construction]
Network overlays have become a popular tool for implementing Internet applications. While content-delivery networks provide the most prominent example of the commercial application of overlays, systems researchers have developed a variety of experimental overlay applications, demonstrating that the overlay approach can be an effective method for deploying a broad range of innovative systems. Rising traffic volumes in overlay networks make the performance of overlay platforms an issue of growing importance. Currently, overlay platforms are constructed using general purpose servers, often organized into a cluster with a load-balancing switch acting as a front end. This project explores more integrated and scalable architectures suitable for supporting large-scale applications with thousands to many millions of end users. In addition, we are studying various network level issues relating to the control and management of large-scale overlay hosting services.
Contents
Overview
Scalable Overlay Hosting Platforms
The purpose of the GENI Backbone Platform (GBP) is to enable multiple, diverse metanetworks to co-exist within a common shared infrastructure. To do this, it must enable sharing of backbone links and node processing resources. We expect researchers to use GENI for a wide range of different experiments, with highly diverse requirements. To enable the GBP to serve the widest possible range of objectives, it should be highly flexible and should provide sufficient resources to avoid constraining the research agendas of its users.
A GBP will host multiple metarouters belonging to distinct metanetworks (we use the terms metarouter and metanetwork rather than virtual router and virtual network, because the latter terms have been heavily overloaded and are more subject to misunderstanding). The GBP will provide resources that can be used by different metarouters and the underlying mechanisms to allow each one to operate independently of the others. The term metarouter is used here in a very generic sense to mean any network component with multiple interfaces that forwards information through a network, while possibly processing it as it passes through. It can include components whose functionality is similar to that of an IP router, or components that switch TDM circuits, or components that operate like firewalls or media gateways. A given metanetwork may include metarouters of various types. It is left to the designers of individual metanetworks to define the precise functionality of their metarouters and to distinguish among different types of metarouters as they find appropriate.
The design of the GBP is distinctly different from that of conventional routers and switches. The following paragraphs summarize some important high level objectives for the GBP.
• Scalable performance. The experimental networks developed for GENI will have a wide range of characteristics, leading to widely differing requirements for GBP resources. Metanets seeking to support high volume data transfers for e-science applications may require multiple 10 Gb/s links, while metanets designed to transfer text messages among pagers may have little use for links above 100 Mb/s. Different metarouters will also have very different processing needs. While IPv4 forwarding requires fewer than 20 instructions executed per data byte, some experimental networks may require hundreds. The GBP should enable its resources to be allocated flexibly among many metarouters, and should support configurations suitable for a variety of performance ranges.
• Stability and reliability. If GENI is to provide a useful platform for experimentation and deployment of experimental network services, it must be sufficiently reliable and stable to allow researchers to work without interference from others. Because the experimental networks that run within GENI will be the subjects of ongoing experimentation and modification, their stability will be highly variable. Nonetheless, the platform itself must be stable, so that researchers can focus on issues arising within their own experiments and not be concerned with the stability of the underlying substrate. The isolation mechanisms for metarouters (discussed below) are one element of the overall strategy for achieving reliable operation. However, it is also important that the hardware components used to implement the GBP have high intrinsic reliability and that the GBP as a whole be easy to manage and maintain, so that outages due to operational errors are kept to a minimum.
• Ease of use. Academic researchers have limited resources they can devote to development of experimental systems, making it important that it be as easy as possible for them to implement their metanetworks. There are intrinsic challenges here, since the technologies that yield the highest possible performance are often not the easiest to use. The GBP should enable use of high performance technologies, while minimizing the barriers to their use. In addition, the GBP should facilitate sharing of common software and configurable logic modules among different research groups.
• Technology diversity and adaptability. The GBP should enable the construction of metanetworks using a variety of different underlying technologies, including general purpose processors, network processor subsystems and configurable logic subsystems. This will allow different researchers to pursue different strategies for meeting their research objectives and will provide the flexibility for the system to incorporate new implementation technologies, as they become available.
• Flexible allocation of link bandwidth. Link bandwidth is a key resource. The GBP should support flexible allocation of bandwidth to different metanetworks including both reserved and shared bandwidth models. It should provide mechanisms for circuit-based management of links, allowing researchers to experiment with novel frame formats and time-domain switching techniques.
• Isolation of metarouters. The GBP must allow different metarouters to co-exist without interference. Ideally, each metarouter should have the illusion that it is operating within a dedicated environment. This means that resources like memory and disk space must be free from modification by other metarouters and that metarouters have the ability to reserve dedicated processing capacity and link bandwidth.
• Minimize constraints on metarouters. The GBP should place as few constraints as possible on the metarouters it hosts. In particular, it should not place any constraints on data formats or limit the ways in which metanetworks provide various capabilities, or constrain the way in which they use their assigned processing resources.
It should be understood that the above list is not comprehensive. It has been intentionally limited to fairly high level objectives. The remainder of this paper provides a more detailed view of the GBP.
GBP Abstractions
A GBP will host multiple metarouters that terminate metalinks connecting to other metarouters and to end systems. The metarouter and metalink are key abstractions that are implemented by the GBP. These abstractions are described in the following subsections.
Metalink abstraction
The metalink is an abstraction of a point-to-point physical link. It may have a guaranteed transmission bandwidth and a maximum bandwidth, but is not required to have either. While the backbone links connecting GBPs will typically have provisioned bandwidths, we expect at least some links connecting GBPs to user sites will be implemented using unprovisioned tunnels. In addition, not all metanetworks require provisioned bandwidth. The configuration of a metanetwork will include a specification of its metalinks. For those that are specified with provisioned bandwidth, the metanet configuration software will choose an underlying implementation that can support this. A metalink may be unidirectional or bi-directional. A bi-directional metalink may have asymmetric bandwidth provisioning.
The metalink abstraction can be extended to support links with more than two endpoints. The primary motivation for supporting a multipoint metalink is to make use of the features of multi-access subnets in the LAN environment. Since this paper is primarily concerned with the backbone environment, we omit discussion of the multipoint case here.
Metarouter abstraction
A metarouter is an abstraction of a conventional router, switch or other network component, which typically consists of three major components: line cards, a switching fabric and a control processor. The line cards terminate the physical links and implement specific processing functions that define a particular network. On input, this may include performing a table lookup of some sort, to determine where an arriving packet should be sent next and what special processing (if any) it should receive. Alternatively, it might involve identification of TDM frame boundaries, and time-division switching of timeslots within frames. On output, it might include scheduling a packet for transmission on the outgoing link, or the formatting of a TDM frame.
The switching fabric is responsible for transferring data from the line cards where they arrive, to the line cards for their outgoing links. Switching fabrics are typically designed to be nonblocking, meaning that they should handle arbitrary traffic patterns, without interference within the fabric. For circuit networks this means that any set of circuits that can be supported by the external links should be supported by the fabric. For packet networks, it also means that excessive traffic to a particular output line card should not interfere with traffic going to other line cards. The control processor in a conventional router implements various control and administrative functions (such as executing routing protocols and updating tables in the line cards). These functions are generally implemented in software running on a general-purpose microprocessor.
A metarouter has a similar structure. It consists of two types of components: MetaProcessing Engines (MPE) and a MetaSwitch (MS). MPEs can be used to implement data path functions within a metarouter or higher level control functions and may be implemented using various types of underlying processing resources. Metalinks terminate at meta-interfaces (MI) on MPEs and MPEs are connected to each other through the MS. MIs are subject to maximum bandwidth limits, as are the interfaces between MPEs and the MS. Figure 1 shows an example of a metarouter.
Metarouters with limited performance needs may use a single MPE. In this case, there is no need for a metaswitch. Another common case is a metarouter with two MPEs, one that implements the normal data forwarding path, and another that implements control functions and handles exception cases. In this case, the MIs will typically all be associated with the data path MPE (implemented on a network processor, perhaps) which will have a logical connection directly to the control/exception MPE (implemented on a general purpose processor). In this case, the MS reduces to the single logical connection between the two MPEs.
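The abstractions above can be sketched as a small data model. This is only an illustrative sketch, not part of the GBP design: the class names, field names, the "np"/"gpp" labels and the metaswitch_form helper are all hypothetical, chosen to mirror the terms used in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metalink:
    """Point-to-point link; both bandwidth figures are optional."""
    name: str
    guaranteed_mbps: Optional[float] = None  # None means unprovisioned
    max_mbps: Optional[float] = None
    bidirectional: bool = True


@dataclass
class MetaInterface:
    """Termination point of one metalink on an MPE, with a bandwidth cap."""
    link: Metalink
    max_mbps: float


@dataclass
class MPE:
    """MetaProcessing Engine; 'kind' is a hypothetical label such as 'np' or 'gpp'."""
    name: str
    kind: str
    interfaces: List[MetaInterface] = field(default_factory=list)


@dataclass
class Metarouter:
    name: str
    mpes: List[MPE]

    def metaswitch_form(self) -> str:
        """Describe what the MetaSwitch reduces to for this metarouter."""
        if len(self.mpes) <= 1:
            return "none"          # a single MPE needs no metaswitch
        if len(self.mpes) == 2:
            return "direct link"   # one logical connection between two MPEs
        return "switch"


# Two-MPE case from the text: all MIs sit on the data path MPE.
backbone = Metalink("to-peer", guaranteed_mbps=1000.0, max_mbps=10000.0)
tunnel = Metalink("to-user-site")  # unprovisioned tunnel
data_path = MPE("fast-path", "np",
                [MetaInterface(backbone, 10000.0), MetaInterface(tunnel, 100.0)])
control = MPE("control", "gpp")
router = Metarouter("example", [data_path, control])
```

Keeping the bandwidth fields optional reflects the statement that a metalink may have a guaranteed and a maximum bandwidth but is not required to have either.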
Architectural Options
This section discusses two high level system architecture options for scalable overlay hosting platforms and the issues arising from these options.
Virtualized line card architecture
Consideration of a conventional router or switch leads naturally to an architecture in which line cards are replaced by virtualized line cards that include a substrate portion and generic processing resources that can be assigned to different meta line cards (Figure 2). The substrate supports configuration of the generic processing resources so that different meta line cards can co-exist without interference. On receiving data from the physical link, the substrate first determines which meta line card it should be sent to and delivers it. Meta line cards pass data back to the substrate, in order to forward it through the shared switch fabric, on input, or to the outgoing link, on output.
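The substrate's demultiplexing step might look like the following minimal sketch. The text does not say how the substrate identifies the target meta line card, so the 2-byte metanet identifier at the front of each frame is purely an assumption, as are the Substrate class and its method names.

```python
from typing import Callable, Dict


class Substrate:
    """Delivers each arriving frame to the meta line card that owns it."""

    def __init__(self) -> None:
        self._cards: Dict[int, Callable[[bytes], None]] = {}

    def register(self, metanet_id: int, handler: Callable[[bytes], None]) -> None:
        """Attach a meta line card handler for one metanet (hypothetical API)."""
        self._cards[metanet_id] = handler

    def receive(self, frame: bytes) -> bool:
        """Return True if the frame was delivered, False if it was dropped."""
        if len(frame) < 2:
            return False
        # Assumed framing: first two bytes carry a big-endian metanet id.
        metanet_id = int.from_bytes(frame[:2], "big")
        handler = self._cards.get(metanet_id)
        if handler is None:
            return False  # no meta line card registered for this metanet
        handler(frame[2:])
        return True


substrate = Substrate()
delivered = []
substrate.register(7, delivered.append)
```

The meta line card only ever sees the payload that follows the substrate header, which is one way to keep metanets unaware of each other.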
One issue with this architecture concerns how to provide generic processing resources at a line card, in a way that allows the resources to be shared by different meta line cards. Conventional line cards are often implemented using Network Processors (NP), programmable devices that include high performance IO and multiple processor cores to enable high throughput processing. It seems natural to take such a device and divide its internal processing resources among multiple meta line cards. For example, an NP with 16 processor cores could be used by up to 16 different meta line cards, by simply assigning processor cores. Unfortunately, current NPs are not designed to be shared. All processing cores have unprotected access to the same physical memory, making it difficult to ensure that different meta line cards don’t interfere with one another. Also, each processor core has a fairly small program store. This is not a serious constraint in conventional applications, since processing can be pipelined across the different cores, allowing each to store only the program it needs for its part of the processing. However, a core implementing an entire meta line card must store the programs to implement all the processing steps for that meta line card. The underlying issue raised by this discussion is that efficient implementation of an architecture based on virtualized line cards requires components that support fine-grained virtualization, and conventional NPs do not.
The virtualized line card approach is also problematic in other respects. Because it associates processing resources with physical links, it lacks the flexibility to support metarouters with a wide range of processing needs. Some metarouters may require more processing per unit IO bandwidth than NPs provide, and this is difficult to supply when the processing resources are tied to individual line cards. The virtualized line card approach also does not easily accommodate alternate implementation approaches for metarouters (such as configurable logic).
Processing pool architecture
The processing pool architecture separates the processing resources used by metarouters from the physical link interfaces. This allows a more flexible allocation of processing resources and reduces the need for fine-grained virtualization. This architecture, illustrated in Figure 3, provides a pool of Processing Engines (PE) that are accessed through the switch fabric. The line cards that terminate the physical links forward packets to PEs through the switch fabric, but do no processing that is specific to a particular metanet. There may be different types of PEs, including some implemented using network processors, others implemented using conventional microprocessors and still others implemented using FPGAs. The NP and FPGA based PEs are most appropriate for high throughput packet processing, the conventional processor for control functions that require more complex software or for metanets with a high ratio of processing to IO. A metarouter may be implemented using a single PE or multiple PEs. In the case of a single PE, data will pass through the physical switch fabric twice, once on input, once on output. In a metarouter that uses multiple PEs to obtain higher performance, packets may have to pass through the switch fabric a third time.
The primary drawback of the processing pool architecture is that it requires multiple passes through the switch fabric, increasing delay and increasing the switch capacity needed to support a given total IO bandwidth. The increase in delay is not a serious concern in wide area network contexts, since switching delays are typically 10 µs or less. The increase in capacity does add to system cost, but since a well-designed switch fabric represents a relatively small part of the cost of a conventional router (typically 10-20%), we can double, or even triple, the capacity without a proportionally large increase in the overall system cost. In the GENI context, the switch fabric bandwidth implications of the processing pool architecture are significantly reduced, since we expect the metarouters implemented within a GBP to have a relatively high ratio of processing capacity to IO bandwidth, compared to conventional routers.
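The cost argument can be made concrete with a small calculation. This is a rough sketch under an assumption the text does not state: that fabric cost grows linearly with fabric capacity while all other costs stay fixed. The function names are illustrative only.

```python
def fabric_passes(num_pes: int) -> int:
    """Worst-case passes through the switch fabric for one packet.

    One PE: in from the line card, out to the line card (2 passes).
    Several PEs: possibly one extra PE-to-PE hop (3 passes).
    """
    if num_pes < 1:
        raise ValueError("a metarouter needs at least one PE")
    return 2 if num_pes == 1 else 3


def relative_system_cost(fabric_share: float, capacity_factor: float) -> float:
    """System cost relative to a conventional router (= 1.0).

    fabric_share: fraction of conventional router cost spent on the fabric.
    capacity_factor: fabric capacity relative to the conventional design.
    Assumes fabric cost scales linearly with capacity (an assumption).
    """
    return 1.0 + fabric_share * (capacity_factor - 1.0)


if __name__ == "__main__":
    for share in (0.10, 0.20):
        for num_pes in (1, 4):
            factor = fabric_passes(num_pes)
            cost = relative_system_cost(share, factor)
            print(f"fabric share {share:.0%}, {factor} passes: cost x{cost:.2f}")
```

Under this linear assumption, doubling the fabric capacity adds between 10% and 20% to system cost, and tripling it adds between 20% and 40%.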
The great advantage of the processing pool architecture is that it greatly reduces the need for fine-grained virtualization within NP and FPGA-based subsystems, for which such virtualization is difficult. Because the processing pool architecture brings together the traffic for each individual metarouter, there is much less need for PEs to be shared among multiple metarouters. The one exception to this is metarouters with such limited processing needs that they cannot justify the use of even one complete PE. Such metarouters can still be accommodated by implementing them on a general purpose processor, running a conventional operating system that supports a virtual machine environment. We discuss below one approach that allows such metarouters to share an NP for fast path forwarding, while relying on a virtual machine running within a general purpose processor to handle exception cases.
Another advantage of the processing pool architecture is that it simplifies sharing of the switch fabric. The switch fabric must maintain traffic isolation among the different metarouters. One way to ensure this is to constrain the traffic flows entering the switch fabric so as to eliminate the possibility of internal congestion. This is difficult to do in all cases. In particular, metarouters consisting of multiple PEs should be allowed to use their “share” of the switch fabric capacity in a flexible fashion, without having to constrain the pair-wise traffic flows among the PEs. However, allowing this flexibility makes it possible for several PEs in a given metarouter to forward traffic to another PE at a rate that exceeds the bandwidth of the interface between the switch fabric and the destination PE.
There is a straightforward solution to this problem in the processing pool architecture. To simplify the discussion, we separate the handling of traffic between line cards and PEs from the traffic among PEs in a common metarouter. In the first case, we can treat the traffic as a set of point-to-point streams that are rate-limited when they enter the fabric. Rate-limiting these flows follows naturally from the fact that they are logical extensions of traffic flows on the external links. Because the external link flows must be rate limited to provide traffic isolation on the external links, the internal flows within the switch fabric can be configured to eliminate the possibility of congestion.
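One way to read "configured to eliminate the possibility of congestion" is as an admission check over the rate-limited streams. The sketch below assumes the fabric is nonblocking, so that only the fabric ports can be oversubscribed. The stream tuple layout, the port names and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Stream = Tuple[str, str, float]  # (source port, destination port, rate limit)


def fabric_is_congestion_free(streams: Iterable[Stream],
                              port_capacity: Dict[str, float]) -> bool:
    """Check that rate-limited point-to-point streams cannot congest any port.

    Assumes a nonblocking fabric: congestion can only occur where the sum of
    rate limits entering or leaving a port exceeds that port's capacity.
    """
    ingress = defaultdict(float)  # total rate each port sends into the fabric
    egress = defaultdict(float)   # total rate each port receives from the fabric
    for src, dst, rate in streams:
        ingress[src] += rate
        egress[dst] += rate
    for load in (ingress, egress):
        for port, total in load.items():
            if total > port_capacity[port]:
                return False
    return True


# Hypothetical example: two line cards feeding one PE over 10 Gb/s ports.
capacity = {"lc1": 10.0, "lc2": 10.0, "pe1": 10.0}
ok_streams = [("lc1", "pe1", 4.0), ("lc2", "pe1", 5.0), ("pe1", "lc1", 6.0)]
bad_streams = ok_streams + [("lc2", "pe1", 2.0)]  # pe1 would receive 11.0
```

A configuration tool could run such a check each time a metalink is mapped onto the fabric and reject any stream that would fail it.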
For PE-to-PE traffic, we cannot simply limit the traffic entering the switch, since it’s important to let PEs communicate freely with other PEs in the same metarouter, without constraint. However, because entire PEs are allocated to metarouters in the processing pool architecture, it’s possible to obtain good traffic isolation in a straightforward way, for this case as well. In general, we need two properties from the switch fabric. First, it must support constrained routing, so that traffic from one metarouter cannot be sent to PEs belonging to another metarouter. Second, we need to ensure that congestion within one metarouter does not affect traffic within another metarouter. The emergence of Ethernet as a backplane switching technology provides the first property. Such switches support VLAN-based routing that can be used to separate the traffic from different metarouters. The second property is satisfied by any switching fabric that is nonblocking at the port level. While some switch fabrics fail to fully achieve the objective of nonblocking performance, this is the standard figure of merit for switching fabrics and most come reasonably close to achieving it.
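The constrained-routing property can be illustrated with a sketch that gives each metarouter its own VLAN. It relies on the point made above that whole PEs are allocated to a single metarouter. The VLAN numbering scheme and the function names are assumptions made for illustration, not a description of any particular switch's configuration interface.

```python
from typing import Dict, List


def assign_vlans(metarouters: Dict[str, List[str]],
                 first_vlan: int = 100) -> Dict[str, int]:
    """Map each PE to the VLAN of the metarouter that owns it.

    metarouters: metarouter name -> list of PE identifiers.
    Raises ValueError if a PE is claimed by more than one metarouter,
    since the scheme depends on whole PEs being allocated exclusively.
    """
    pe_vlan: Dict[str, int] = {}
    for offset, name in enumerate(sorted(metarouters)):
        for pe in metarouters[name]:
            if pe in pe_vlan:
                raise ValueError(f"PE {pe} is allocated to two metarouters")
            pe_vlan[pe] = first_vlan + offset
    return pe_vlan


def may_forward(pe_vlan: Dict[str, int], src: str, dst: str) -> bool:
    """PE-to-PE traffic is allowed only inside a single metarouter's VLAN."""
    return src in pe_vlan and dst in pe_vlan and pe_vlan[src] == pe_vlan[dst]


vlans = assign_vlans({"mr-a": ["pe1", "pe2"], "mr-b": ["pe3"]})
```

Within a VLAN the PEs remain unconstrained, so a metarouter can still use its share of the fabric flexibly, while any congestion it causes stays on its own ports if the fabric is nonblocking at the port level.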
Abstraction vs. Transparency
Scaling Up
Scaling Down
Implementation Options
Specifics for GENI and SPP.
Control of Overlay Hosting Services
General control architecture, including design of a control overlay network.
Internet Scale Overlay Applications
Network games work.
Scalable audio.
Mapping Overlays onto an OHS Infrastructure
Jing's work.
Issues for Multi-domain Overlay Hosting
Control issues and multi-domain resource mapping.
References
- [BA06]
- Bavier, A., N. Feamster, M. Huang, L. Peterson, J. Rexford. “In VINI Veritas: Realistic and Controlled Network Experimentation,” Proc. of ACM SIGCOMM, 2006.
- [BH06]
- Bharambe, A., J. Pang, S. Seshan. “Colyseus: A Distributed Architecture for Online Multiplayer Games,” In Proc. Symposium on Networked Systems Design and Implementation (NSDI), 3/06.
- [CH02]
- Choi, S., J. Dehart, R. Keller, F. Kuhns, J. Lockwood, P. Pappu, J. Parwatikar, W. D. Richard, E. Spitznagel, D. Taylor, J. Turner and K. Wong. “Design of a High Performance Dynamically Extensible Router.” In Proceedings of the DARPA Active Networks Conference and Exposition, 5/02.
- [CH03]
- Chun, B., D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman. “PlanetLab: An Overlay Testbed for Broad-Coverage Services,” ACM Computer Communications Review, vol. 33, no. 3, 7/03.
- [CI06]
- Cisco Carrier Routing System. At www.cisco.com/en/US/products/ps5763/, 2006.
- [DI02]
- Dilley, J., B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. “Globally Distributed Content Delivery,” IEEE Internet Computing, September/October 2002, pp. 50-58.
- [FO07]
- Force 10 Networks. “S2410 Data Center Switch,” http://www.force10networks.com/products/s2410.asp, 2007.
- [FR04]
- Freedman, M., E. Freudenthal and D. Mazières. “Democratizing Content Publication with Coral,” In Proc. 1st USENIX/ACM Symposium on Networked Systems Design and Implementation, 3/04.
- [GE06]
- Global Environment for Network Innovations. http://www.geni.net/, 2006.
- [HI98]
- Mike Hicks, Pankaj Kakkar, Jonathan T. Moore, Carl A. Gunter, and Scott Nettles. “PLAN, A packet language for active networks,” In Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming Languages, 1998.
- [IXP]
- Intel IXP 2xxx Product Line of Network Processors. http://www.intel.com/design/network/products/npfamily/ixp2xxx.htm.
- [KA02]
- Karlin, Scott and Larry Peterson. “VERA: An Extensible Router Architecture,” In Computer Networks, 2002.
- [KO00]
- Kohler, Eddie, Robert Morris, Benjie Chen, John Jannotti and M. Frans Kaashoek. “The Click modular router,” ACM Transactions on Computer Systems, 8/2000.
- [KO04]
- Kontothanassis, L., R. Sitaraman, J. Wein, D. Hong, R. Kleinberg, B. Mancuso, D. Shaw and D. Stodolsky. “A Transport Layer for Live Streaming in a Content Delivery Network,” Proc. of the IEEE, Special Issue on Evolution of Internet Technologies, 9/04.
- [PA03]
- Pappu, P., J. Parwatikar, J. Turner and K. Wong. “Distributed Queueing in Scalable High Performance Routers.” Proceedings of IEEE Infocom, 4/03.
- [PE02]
- Peterson, L., T. Anderson, D. Culler and T. Roscoe. “A Blueprint for Introducing Disruptive Technology into the Internet,” Proceedings of ACM HotNets-I Workshop, 10/02.
- [RA05]
- Radisys Corporation. “Promentum™ ATCA-7010 Data Sheet,” product brief, available at http://www.radisys.com/files/ATCA-7010_07-1283-01_0505_datasheet.pdf.
- [RH05]
- Rhea, S., B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica and H. Yu. “OpenDHT: A Public DHT Service and Its Uses,” Proceedings of ACM SIGCOMM, 9/2005.
- [SP01]
- Spalink, T., S. Karlin, L. Peterson and Y. Gottlieb. “Building a Robust Software-Based Router Using Network Processors,” In ACM Symposium on Operating System Principles (SOSP), 2001.
- [ST01]
- Stoica, I., R. Morris, D. Karger, F. Kaashoek and H. Balakrishnan. “Chord: A scalable peer-to-peer lookup service for internet applications.” In Proceedings of ACM SIGCOMM, 2001.
- [ST02]
- Stoica, I., D. Adkins, S. Zhuang, S. Shenker, S. Surana, “Internet Indirection Infrastructure,” Proc. of ACM SIGCOMM, 8/02.
- [TU06]
- Turner, J. “A Proposed Architecture for the GENI Backbone Platform,” In Proceedings of ACM-IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 12/2006.
- [VS06]
- Linux vServer. http://linux-vserver.org