Dev:Forest - an overlay network for distributed applications

This page describes the high level architecture for the scalable network game system, focusing on the major system components, their interfaces and functionality. We envision several different implementations of the architecture, for different deployment contexts. These contexts include the Open Network Lab, the Supercharged PlanetLab Platform, Vini and Amazon EC2. Some of these (perhaps all) will be incomplete, but we want to maintain a reasonable degree of consistency across the platforms to minimize the amount of redundant work that will be required.

System Overview

Network Games Components

The network game system is distributed across multiple sites. Each site has a known geographic location and includes a multiplicity of processing resources of various types. In particular, each site has one or more servers and routers. Servers connect to remote users or clients over the Internet and communicate with other servers through the routers. Sites are connected by metalinks which have an associated provisioned bandwidth. Multiple routers at a site may share the use of a metalink going to another site, but must coordinate their data transmissions so as not to exceed its capacity. In addition, each site has a site controller which manages the site and configures the various other components as needed. Within a site, communication takes place through a switch component that is assumed to be nonblocking. These ideas are illustrated in the figure below. In the planned SPP deployment, each site will correspond to a slice on an SPP node (including a fastpath for the routing function and a GPE portion that implements the site controller functions), plus a set of remote servers which will be connected to the SPP node over statically configured tunnels.

In addition to the components shown in the diagram, there is a web site through which clients can register, learn about game sessions in progress and either join an existing session or start a new one. When the user requests to join a session, the system will select a server and return its IP address and a session id to the client, who will use these to establish a connection to the server and join the session.

Communication in the network occurs over multicast channels. Each channel is identified by a globally unique 32 bit value and has an associated set of components and links which form a multicast tree. Channels are used most commonly to distribute state update information for game sessions. In this case packets are delivered to all servers that belong to the session. However, channels can also be used by the system to distribute control information of various sorts. Packets can be sent to specific components on a channel, or to sets of components (e.g. all site controllers). In general, the links in a channel can carry data in either direction enabling many-to-many communication. The first 1024 channels are reserved for various system control purposes and cannot be used for game sessions.

The various system components communicate by sending packets of various types. The common packet format is shown at right and the fields are described below. Specific protocols used by the various components define specific packet types and additional fields. These are described in the specific sections dealing with those protocols.

Generic Packet Format

Version (4 bits). This field defines the version of the protocol suite. The current version is 1.
Length (12 bits). Specifies the number of bytes in the packet. Must be divisible by 4 and can be no larger than 1400.
Type (8 bits). This field defines the type of the packet. Specific types are discussed in detail below.
Flags (8 bits). The currently defined flags are listed below.
- Route Needed. This flag is set by a router when it does not have a route table entry for the specific destination address in the packet and must forward the packet on all of its outgoing branches in the channel. The recipient of the packet is expected to respond with a Route Announcement packet on the same channel, addressed to all routers on the channel.
Channel id (32 bits). The channel id is a 32 bit identifier that identifies a specific communication channel on which the packet is being sent. Each channel is associated with a subset of the network links that collectively form a tree.
Source Address (32 bits). This specifies the address of the component that sent the packet.
Destination Address (32 bits). This specifies the address of the component to which the packet is being sent.
Auxiliary Information (64 bits). The interpretation of this field depends on the type of the packet. Details are given in subsequent sections.
Header Checksum (32 bits). Computed over all fields of the header. Packets with an incorrect header checksum are discarded by routers.
Payload Checksum (32 bits). Computed over the payload alone.
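As a rough illustration, the fields up to and including the auxiliary information can be packed and parsed as sketched below. The sketch assumes that the fields appear on the wire in the order listed above and in network byte order; neither assumption is confirmed by the text, and the function names are hypothetical. The two checksum fields are left out because the checksum algorithm is not specified here.

```python
import struct

# Assumption: fields are laid out in the order listed in the text, big-endian.
# Word 0 holds version (4 bits), length (12), type (8) and flags (8).
_FIXED = struct.Struct("!IIIIQ")  # word0, channel id, source, destination, aux


def pack_header(version, length, ptype, flags, channel, src, dst, aux):
    """Pack the header fields that precede the checksums (24 bytes)."""
    if length <= 0 or length % 4 != 0 or length > 1400:
        raise ValueError("length must be a positive multiple of 4, at most 1400")
    word0 = ((version & 0xF) << 28) | ((length & 0xFFF) << 16) \
        | ((ptype & 0xFF) << 8) | (flags & 0xFF)
    return _FIXED.pack(word0, channel, src, dst, aux)


def unpack_header(data):
    """Parse the same 24 bytes back into a dictionary of fields."""
    word0, channel, src, dst, aux = _FIXED.unpack_from(data)
    return {
        "version": word0 >> 28,
        "length": (word0 >> 16) & 0xFFF,
        "type": (word0 >> 8) & 0xFF,
        "flags": word0 & 0xFF,
        "channel": channel,
        "src": src,
        "dst": dst,
        "aux": aux,
    }
```

A receiver would call unpack_header on the first 24 bytes of an arriving packet and reject any packet whose length field violates the constraints above.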

All system components have addresses which serve as concise machine-readable identifiers and provide location information. The first 16 bits of each 32 bit address identify a site, and the remaining 16 bits identify a component associated with that site. The site component of the address is called the site id and the remainder is called the component number. Components with addresses include servers and routers. Clients interact with the system only through its web interface and through their IP connections to servers, so they do not require netGame addresses.

Addresses with a component number of less than 1000 are reserved for special purposes. These addresses identify certain context-specific destinations, as shown below. These destinations are interpreted within a scope defined by the site number of the packet's destination address. Packets with a site number of 0 are delivered to the designated recipients in the global scope (that is, the entire channel). Packets with a non-zero site number designate recipients within the specified site. So for example, a packet with a destination address of 0.200 is delivered to all servers anywhere on the channel, while a packet with a destination address of 2.200 is delivered to all servers within site 2.

100 - Lead controller (for specified site or entire channel).
101 - All site controllers (for specified site or entire channel).
102 - Next site lead controller.
200 - All servers.
201 - All subscribed servers.
300 - All routers.
301 - Next-hop router.
302 - All edge routers.
303 - First edge router.
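The address structure and the scoping rule for special destinations can be sketched as follows. The helper names are hypothetical; the bit layout (site id in the upper 16 bits, component number in the lower 16) and the threshold of 1000 come from the text above.

```python
def make_addr(site, comp):
    """Combine a 16-bit site id and a 16-bit component number."""
    return ((site & 0xFFFF) << 16) | (comp & 0xFFFF)


def split_addr(addr):
    """Return (site id, component number) for a 32-bit address."""
    return addr >> 16, addr & 0xFFFF


def fmt_addr(addr):
    """Render an address in the site.component notation used in the text."""
    site, comp = split_addr(addr)
    return f"{site}.{comp}"


def is_special(addr):
    """Component numbers below 1000 are reserved for special destinations."""
    return (addr & 0xFFFF) < 1000


def in_scope(dst, local_site):
    """True if a special destination covers components at local_site.

    Site number 0 means the whole channel; any other value restricts
    delivery to that one site.
    """
    site, _ = split_addr(dst)
    return site == 0 or site == local_site
```

For example, in_scope(make_addr(0, 200), 7) is true because 0.200 addresses all servers on the channel, while in_scope(make_addr(2, 200), 7) is false because 2.200 addresses only the servers in site 2.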

ALTERNATE APPROACH - Jon

Am starting to think that this approach is more structured than it needs to be and may needlessly constrain the ways that the channels are used by different game sessions. If the overlay supports different types of game sessions and virtual worlds, it's likely that there will be a variety of usage patterns. So here's an alternate addressing structure that I think we should consider.

Unicast addresses. Top bit=0, next 15 bits define a site. Lower 16 bits define an endpoint within the site. Require that the top seven bits of site number are !=127.

Channel-wide multicast address. Top bit=1, next 7 bits=127. Remaining 24 bits define a channel-wide multicast.

Site-local multicast address. Top bit=1, next 15 bits define a site. Remaining 16 bits define a site-local multicast. Require that top 7 bits of site number are !=127.

With this approach, we don't need all the special-purpose addresses <1000. We can simply configure a multicast within the channel with the appropriate set of endpoints. We might also then use multicast addresses in place of region numbers for state update distribution. This will simplify the lookup process and make things a bit more general.
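To make the proposed encoding concrete, here is a minimal sketch of a classifier for the three address types. The function name and the returned tuple shape are illustrative choices; the bit fields follow the proposal above. Treating a unicast address whose upper seven site bits equal 127 as an error is one possible reading of the stated requirement.

```python
def classify(addr):
    """Classify a 32-bit address under the alternate addressing proposal.

    Returns (kind, site, low_bits), where site is None for channel-wide
    multicast addresses.
    """
    top = (addr >> 31) & 0x1
    top7_of_site = (addr >> 24) & 0x7F   # upper 7 bits of the 15-bit site field
    site = (addr >> 16) & 0x7FFF
    if top == 0:
        if top7_of_site == 127:
            raise ValueError("unicast site numbers may not start with 127")
        return ("unicast", site, addr & 0xFFFF)
    if top7_of_site == 127:
        # The remaining 24 bits name a channel-wide multicast.
        return ("channel_multicast", None, addr & 0xFFFFFF)
    return ("site_multicast", site, addr & 0xFFFF)
```

The value 127 in the upper seven site bits is what separates channel-wide multicast from site-local multicast, which is why site numbers with that prefix are excluded.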

We could make subscription to a multicast happen as a side-effect of sending to a multicast address. Suppose a server sends a packet to a multicast address which is unknown to its first-hop overlay node. The first hop node can forward the packet to all its neighboring overlay nodes, with a "subscription-requested" flag. When the packet reaches a node that knows about that multicast address, it can respond with a multicast route packet that is used by the routers along the path to configure the filters needed to route multicast packets. To avoid lots of excess traffic for subscription requests, we can define a central subtree of the channel as the "core" nodes and require that all core nodes stay subscribed to all multicasts, so that subscription packets propagate only up to the first core node. This generalizes our ideas about how to deal with subscriptions for state updates. Needs more thought, but I think this could be a useful approach.

Packet Forwarding

The routers maintain routing tables for forwarding packets within channels. The channel routing table at a site typically contains an entry for all other sites that are on the channel, and an entry for all components within the site that are on the channel. These entries provide the information needed by the router to make local routing decisions. Routing table entries can be statically configured, or created dynamically. This section describes the basic steps involved in forwarding a packet. It is phrased in terms of a specific reference implementation, but this is meant to be illustrative, not prescriptive.

We assume that each router has a list of neighbors with which it can communicate directly. Each neighbor is identified locally by a numerical index and for each neighbor, the router has its netGames address, its node type (router, server, site controller) and whatever lower level information is needed to send netGames packets to it (e.g. Ethernet MAC address, or IP address plus port#). For each session, the router also maintains a list of those neighbors that are adjacent to it in the session. We also assume that for each arriving packet, we have the index of the neighbor from which it was received. A neighbor number of 0 is used to identify the control component of the router itself.

The node maintains a forwarding table that consists of a set of (key,value) pairs and can be implemented using a hash table data structure. The keys are 64 bits long and are formed using selected fields of the packet header. The first 32 bits are the session id, the next 8 bits identify the input interface on which the packet arrived, the next 4 bits identify the class of the packet and the last 20 bits provide auxiliary addressing information which depends on the class. The classes are defined as follows.

State update packets are assigned to class 0. These packets are recognized by their packet type (=xx). For these packets, the auxiliary address information is the region number of the state update, which is contained in the auxiliary information field of the packet header.
Packets addressed to foreign destinations (the site number part of the destination address is greater than zero and not equal to the local site number) are assigned to class 1. For such packets, the auxiliary address information is the site number portion of the destination address.
All other packets are assigned a class of 2. For these packets the auxiliary information is the component number portion of the destination address. Note that this means that routing table entries determine how the various "special" packets are forwarded.
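The key construction described above can be sketched as follows. The sketch treats the session id as the 32-bit channel id, takes the state update packet type to be 100 (the value given in the packet type list later on this page, since the text above leaves it unspecified), and takes the region number as an argument because the position of the region within the auxiliary information field is not defined here. The function names are hypothetical.

```python
STATE_UPDATE = 100  # assumption: value taken from the session packet type list


def packet_class(ptype, dst, local_site):
    """Return the forwarding class (0, 1 or 2) of a packet."""
    dst_site = dst >> 16
    if ptype == STATE_UPDATE:
        return 0
    if dst_site != 0 and dst_site != local_site:
        return 1  # foreign destination
    return 2


def lookup_key(channel, in_nbr, ptype, dst, local_site, region=0):
    """Build the 64-bit key: 32 bits id, 8 bits input, 4 bits class, 20 bits aux."""
    cls = packet_class(ptype, dst, local_site)
    if cls == 0:
        aux = region & 0xFFFFF   # region number of the state update
    elif cls == 1:
        aux = dst >> 16          # site number of the destination
    else:
        aux = dst & 0xFFFF       # component number of the destination
    return ((channel & 0xFFFFFFFF) << 32) | ((in_nbr & 0xFF) << 24) \
        | (cls << 20) | aux
```

Because class 2 keys carry the component number, an entry keyed on component number 200 determines how "all servers" packets are forwarded, which is the point made in the last item above.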

The value portion of the routing table entry includes a list of the neighbors that the packet should be forwarded to (specified using their numerical index).

To forward a packet, the router forms the lookup key from the packet header fields, finds the appropriate entry in the table and forwards the packet to the specified neighbors. If the neighbor list includes the neighbor from which the packet was received, that copy of the packet is suppressed. Packets may be forwarded even when there is no matching entry. Specifically packets addressed to foreign sites for which there is no routing table entry are forwarded to all neighboring routers. Packets addressed to a component in the local site for which there is no routing table entry are forwarded to all neighboring components within the site. In both these cases, the "Need Route" bit is set in the flags field. When a component receives a packet addressed to it with the "Need Route" bit set, it is required to respond with an "Announce Route" packet, which is distributed to all routers in the session.
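The forwarding decision itself might look like the sketch below, which models the table as a dictionary from lookup keys to neighbor index lists. The function name and arguments are hypothetical. Suppressing the copy back to the arrival neighbor in the flooding case is an assumption; the text states it only for table hits.

```python
def next_hops(table, key, in_nbr, foreign, router_nbrs, local_nbrs):
    """Return (neighbors to forward to, whether to set the Need Route flag).

    table       -- dict mapping lookup keys to lists of neighbor indices
    in_nbr      -- index of the neighbor the packet arrived from
    foreign     -- True if the destination site is not the local site
    router_nbrs -- indices of neighboring routers
    local_nbrs  -- indices of neighboring components within the site
    """
    entry = table.get(key)
    if entry is not None:
        # Normal case: forward to the listed neighbors, never back to the sender.
        return [n for n in entry if n != in_nbr], False
    # No route: flood within the appropriate scope and ask for a route.
    flood = router_nbrs if foreign else local_nbrs
    return [n for n in flood if n != in_nbr], True
```

A component that receives a flooded packet addressed to it would then answer with a route announcement, after which later packets hit the first branch.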

Network Control and Operation

This section defines packet types that are related to basic network operations.

0 - Route Announce. Packets of this type are used to announce a route to a given destination within a specific channel.

We reserve channel ids 0-999 for the use of various network control protocols. These reserved channels are described below.

0 - Bootstrap channel. This channel is used for configuration when a new node comes online. For each new node, one of its neighboring nodes is assigned as its parent node on the bootstrap channel. The root of the bootstrap channel is a Master Controller (MC) for the entire network. On receiving a Node Announcement message from a node, the MC configures the node.
1 - Hello Neighbor channel. This channel is used by nodes to communicate with the nodes at the other end of each of their links.
100-199 - Network Status Distribution. These channels are used to distribute network status information to site controllers in the network. For small to medium-sized network configurations, we expect a single channel to be sufficient for distributing status information. The additional channels are reserved to facilitate scaling and reliability for very large networks.
500-599 - Channel Configuration. These channels are used to control the configuration of channels used for game sessions. For small to medium-sized network configurations, we expect a single channel to be sufficient for this purpose. The additional channels are reserved to facilitate scaling and reliability for very large networks.
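The reserved channel assignments listed above can be captured in a small lookup, sketched here with hypothetical names. The sketch uses the 0-999 reserved range stated in this section and labels reserved ids without a listed purpose simply as "reserved".

```python
def channel_purpose(channel):
    """Map a channel id to the purpose given in the reserved channel list."""
    if channel == 0:
        return "bootstrap"
    if channel == 1:
        return "hello-neighbor"
    if 100 <= channel <= 199:
        return "network-status"
    if 500 <= channel <= 599:
        return "channel-configuration"
    if channel < 1000:
        return "reserved"       # reserved, but no purpose listed in this section
    return "game-session"
```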

Session Control and Operations

This section defines packet types used to control and manage individual game sessions.

100 - State update. This packet is used to communicate status information about dynamic objects in the game world to other servers. Packets of this type should be addressed to "all subscribed servers"; that is, the component number portion of the destination address should be 201. The auxiliary information part of the packet includes the identity of the object whose status is being updated (16 bits) and the region of the game world in which the object is currently located (20 bits). The payload includes a list of parameter/value pairs giving the current value of the specified parameters.
101 - Subscription update. This is used by servers to request changes to their region subscriptions. Packets of this type should be addressed to the "first edge-router". The payload contains two lists of up to 150 values of 20 bits each. The first specifies regions to be added to the subscription list. The second specifies regions to be dropped. The first 16 bits of the payload give the lengths of the two lists.
102 - Subscription report. This is sent periodically by routers to servers and gives the router's current view of the regions to which the server is subscribed. This is provided to allow servers to recover from lost subscription update packets. The payload simply lists all the regions (as 16 bit values) that the server is subscribed to.
103 - Object query. This is used to request the current state of a particular object in the game world. Packets of this type are addressed to a specific server.
104 - Region query. This is used to request the current state of all objects in the game world that belong to a particular region. Packets of this type should be addressed to "all subscribed servers".
105 - Query reply. This is used to respond to a prior query request. Packets of this type should be addressed to the specific server that sent the initial query.
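The recovery role of the subscription report can be illustrated with a short sketch: a server compares the regions it wants with the regions the router reports and derives the subscription updates needed to reconcile them, respecting the limit of 150 entries per list. The function name is hypothetical and the sketch works on sets of region numbers rather than on the wire format, which is not fully specified here.

```python
MAX_LIST = 150  # per-list limit for a subscription update, from the text


def subscription_updates(desired, reported):
    """Return a list of (add, drop) pairs, one per subscription update packet.

    desired  -- set of region numbers the server wants to receive
    reported -- set of region numbers from the latest subscription report
    """
    add = sorted(desired - reported)    # regions the router is missing
    drop = sorted(reported - desired)   # regions the router should forget
    updates = []
    for i in range(0, max(len(add), len(drop)), MAX_LIST):
        updates.append((add[i:i + MAX_LIST], drop[i:i + MAX_LIST]))
    return updates
```

When the two sets agree the function returns an empty list, so no update needs to be sent.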

Network Status

This section summarizes the major system level data structures. The first three items below (client data, connection data and session data) are stored in a distributed hash table, with each site responsible for a portion of the key space in the DHT. The key space covered by each site is known by the other sites, allowing one-hop access to data in the DHT.

Client data

User name (key)
User preferences
Accounting records
Current sessions

Session data

Session id (key)
Users in the session (specified in terms of client addresses)
Multicast tree - sites, inter-site links with reserved capacities
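One-hop access to the distributed hash table follows from every site knowing how the key space is divided. A minimal sketch, assuming the key space is split into contiguous ranges of a 32-bit hash and using SHA-256 as a stand-in hash function (neither choice is specified on this page, and the class and method names are hypothetical):

```python
import bisect
import hashlib


def key_hash(key):
    """Hash a string key to a 32-bit integer (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class KeySpaceMap:
    """Maps hash values to the site responsible for them."""

    def __init__(self, ranges):
        # ranges: list of (exclusive upper bound, site id) covering 0..2**32
        self._ranges = sorted(ranges)
        self._uppers = [upper for upper, _ in self._ranges]

    def site_for_hash(self, h):
        """Find the first range whose upper bound is greater than h."""
        return self._ranges[bisect.bisect_right(self._uppers, h)][1]

    def site_for(self, key):
        """Return the site that stores the record for this key."""
        return self.site_for_hash(key_hash(key))
```

A site holding a copy of this map can send a query for a user name or session id directly to the responsible site, without intermediate lookups.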

The network status information logically forms a graph, with data associated with its nodes and links. Each site maintains the authoritative information about itself and its incoming links, but each site also maintains a complete copy of the global network status. The status information is distributed across a multicast spanning tree of the network, with each node periodically sending a copy of its current status on the multicast tree. The construction of the multicast spanning tree is carried out by a distributed spanning tree maintenance algorithm.

Network Status

List of sites with up/down status, provisioned capacity (in terms of client sessions) and available capacity.
Inter-site links with up/down status, provisioned capacity and available capacity.
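The status records and the rule that each site is authoritative for itself and its incoming links can be sketched as below. The class and field names are hypothetical; the three status fields mirror the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    up: bool
    provisioned: int   # provisioned capacity
    available: int     # currently available capacity


@dataclass
class NetworkStatus:
    """A site's local copy of the global network status graph."""
    sites: dict = field(default_factory=dict)   # site id -> Status
    links: dict = field(default_factory=dict)   # (from_site, to_site) -> Status

    def apply_report(self, origin, site_status, incoming_links):
        """Merge a periodic status report sent by the site `origin`.

        Only links that terminate at the reporting site are accepted,
        since a site is authoritative only for its incoming links.
        """
        self.sites[origin] = site_status
        for (src, dst), status in incoming_links.items():
            if dst == origin:
                self.links[(src, dst)] = status
```

Each site would keep one NetworkStatus instance and call apply_report for every status packet received on the status distribution tree.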

SPP Deployment

This section discusses some specifics of the planned SPP deployment. Part of the reason to do this here is to verify that the overall framework we are creating makes sense for the SPP (and similarly for other deployment scenarios). We focus here on the fast path component. For demo purposes, we will most likely configure the fast path manually by logging into a slice on the GPE and configuring filters as needed.

We note that implementation on the SPP requires version 2 of the NPE, since version 1 does not support multicast. Implementing a fast path requires that we develop a new code option, including a parsing component and a header formatting component. The parsing component simply extracts from the netGames packet header the fields that are to be used in the lookup key. These include the protocol and type fields, plus the channel id, the destination address and, for state update packets, the region number. This is a total of 96 bits. The TCAM lookup produces a resultsIndex and a statsIndex. The resultsIndex is used in a second table lookup to obtain the forwarding information for the packet, while the statsIndex specifies a statistics counter. Each channel id that is active at an SPP will have its own statsIndex, so that we can monitor the traffic sent on the channel. For multicast packets, the result of the second table lookup includes the fanout and, for each copy, a queue id, the IP destination address of the outgoing packet, and the source and destination UDP port numbers.
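One way to arrive at the 96-bit total is sketched below. It assumes that the "protocol" field is the 4-bit version field of the generic header, so that the widths are 4 + 8 + 32 + 32 + 20 = 96; this reading, the field order and the function name are all assumptions, not something this page confirms.

```python
def spp_lookup_key(version, ptype, channel, dst, region=0):
    """Build a 12-byte (96-bit) lookup key from the parsed header fields.

    Assumed layout, most significant bits first:
    version (4) | type (8) | channel id (32) | destination (32) | region (20)
    The region is meaningful only for state update packets and is 0 otherwise.
    """
    key = version & 0xF
    key = (key << 8) | (ptype & 0xFF)
    key = (key << 32) | (channel & 0xFFFFFFFF)
    key = (key << 32) | (dst & 0xFFFFFFFF)
    key = (key << 20) | (region & 0xFFFFF)
    return key.to_bytes(12, "big")
```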

Changing the TCAM requires control interactions through the substrate. That is, code running on a GPE must request new filters through the RMP, which talks to the xScale, which changes the TCAM entry. This will be relatively slow. For region-based filtering, we would like to make things more responsive. This seems to require some special processing in the parse block, the header format block, or both, since these are the two places where we can place slice-specific code.

ONL Deployment

EC2 Deployment

Related Work

The Network_Games_Related_Work page classifies related work. This page is under construction.

List of Related work with comments

Assiotis - Netgames 2006

Baker - Transactions on Multimedia 2005

Benford - Computer-Human Interaction 1997

Bharambe - SIGCOMM 2008

A Platform for Dynamic Microcell Redeployment in Massively Multiplayer Online Games by Bruno Van Den Bossche, Tom Verdickt, Bart De Vleeschauwer, Stein Desmet, Stijn De Mulder, Filip De Turck, Bart Dhoedt and Piet Demeester. Nossdav 2006.

This paper describes a multiplayer online game system developed by the authors. The workload is distributed by dividing the gameworld into a number of small microcells and assigning multiple microcells to each server. This allows a server with a crowded microcell to handle a reduced number of microcells, improving the ability to balance the load. On the other hand, it does imply that microcells have to be easily and efficiently migrated among servers. The system is implemented in Java (specifically, using J2EE and JMS). The reported data is very limited and not particularly impressive (they report that the round-trip time for sending a 750 B message is 3.2 ms and that the bandwidth through a server appears to be under 30 Mb/s).

Carlson - Virtual Reality 1993

Chan - Netgames 2007

Chertov - Nossdav 2006

Colombo - Transactions on Multimedia 2007

Das - Virtual Reality 1997

On Consistency and Network Latency in Distributed Interactive Applications: A Survey - Part I by Declan Delaney, Tomas Ward, Seamus McLoone. Presence 2006.

This is a survey of techniques used in distributed interactive simulations. This part concentrates on maintaining consistency. Moderately useful, although somewhat superficial. No quantitative assessment of the various methods used to maintain consistency, making it difficult to evaluate the relative merits of the different approaches. Still, it does provide a useful overview of the literature; the reference list has nearly 100 papers.

Delaney - Presence 2007

Fernandes - Nossdav 2007

Frecon - DSE 1993

Frecon - VRST 1999

Frecon - Presence 2001

GE - 2003

Glinka - NetGames 2007

Glinka - IJCGT 2008

Graffi - ISM 2008

Hampel - Netgames 2006

Hogsand - Multimedia 1996

Kantawala 1996