Dev:Forest - an overlay network for distributed applications

From ARL Wiki

[major revision in progress]

This page describes the high-level architecture of Forest, an overlay network for distributed applications, such as virtual worlds and distributed cyber-physical systems, that have demanding real-time performance requirements.

Overview of Forest


Forest is an overlay network specifically targeted for distributed applications with multiple endpoints and demanding real-time performance requirements. An example of such an application is an online virtual world in which periodic reports are issued describing the current status of user avatars, and these reports are distributed to those application endpoints that need to track the status of particular avatars. The set of avatars of interest to a particular endpoint is continually changing and some avatars may be of interest to large numbers of endpoints. Report delivery must be timely and non-stop, even as the pattern of communication changes. This implies that the resources needed to ensure timely delivery must be available to the applications, regardless of the specific pattern of communication.

The core network service provided by Forest is a tree-structured channel, called a comtree, that supports both unicast and multicast packet delivery. Distributed applications use comtrees as private networks that are logically isolated and can be dynamically provisioned to meet application requirements. Application endpoints subscribe to multicast groups within their comtree using a lightweight signaling mechanism that enables endpoints to join/leave multicast groups many times per second. This enables endpoints to obtain status reports from many different sources and rapidly switch among sources. Forest is designed to be used as an overlay that can be deployed either within an enterprise’s private infrastructure or in a provisioned overlay hosting/cloud computing environment. This allows operators of Forest networks to take advantage of commodity technology, keeping costs low, while still enabling the non-stop communication services needed by applications.


Comtrees in Forest

The central primitive supported by Forest is a tree-structured communication channel, or comtree. While the network’s links will typically form a general graph, each comtree uses a subset of the links that forms a tree. Each application session using Forest is assigned its own comtree and all communication for the session takes place within it. Comtrees support both unicast and multicast packet forwarding and operate as independent logical networks. Comtrees are isolated from one another in the sense that packets from one comtree cannot “leak” into another, and they are separately provisioned, ensuring that excessive traffic within one comtree cannot disrupt the flow of packets within another.

Each comtree has its own address space and routes packets independently of other comtrees. We exploit the tree topology to simplify routing. Unicast routing information is acquired dynamically as a by-product of packet forwarding, in a way that is similar to the learning mechanism used by Ethernet LANs. In the absence of specific routing information, a Forest router forwards a packet to all of a comtree’s incident links (except of course, for the link on which the packet was received). Packets forwarded in this way are marked with a flag requesting routing information for the addressed destination, which triggers a response containing the required information. The use of tree-structured channels makes it fairly easy to support highly dynamic multicast groups, as there is no need to select routes for each group. Of course, the configuration of a comtree for a session does require the selection of a subtree joining the required endpoints with sufficient capacity on each link.
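As a rough illustration of this learning-style mechanism, the sketch below shows how a router might decide where to forward a unicast packet within one comtree. The class and method names are my own, and the shortcut of learning a route directly from each packet's source address is a simplification; the actual design obtains routing information through an explicit response triggered by the request flag.

```python
# Illustrative sketch (not the Forest implementation) of learning-style
# unicast forwarding within one comtree: flood when the destination is
# unknown, and learn the reverse route from each packet's source address.

class ComtreeRouter:
    def __init__(self, links):
        # links: identifiers of this router's incident comtree links
        self.links = set(links)
        self.routes = {}          # destination address -> outgoing link

    def receive(self, src, dst, in_link):
        """Return the links a packet should be forwarded on, plus a flag
        asking downstream nodes to send routing info for dst."""
        # Learn: src is reachable through the link the packet arrived on.
        self.routes.setdefault(src, in_link)
        if dst in self.routes:
            return [self.routes[dst]], False        # known route
        # Unknown destination: flood on all incident comtree links except
        # the arrival link, and request routing information for dst.
        return [l for l in self.links if l != in_link], True
```

Once a route to the destination has been learned, subsequent packets follow a single link; until then, the flooded copies are confined to the comtree's tree topology, so no forwarding loops can arise.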

Comtrees are identified by a numerical identifier that is included in the header of every packet. Comtree ids are flat global identifiers and imply no semantic information. Endpoints may send packets using only comtree identifiers for which they have been configured, and Forest routers discard packets received from endpoints not configured to use them. Network endpoints and routers are each assigned a unicast address for use within the comtree. These addresses implement a two level hierarchy to improve the scalability of routing information. Specifically, each unicast address has a “site” part that identifies a geographic location or region and an “endpoint” part that identifies a particular component within the site. A Forest router uses the site part of the address to reach routers in other sites and uses the endpoint part to reach components within its own site. We require that all nodes in a comtree with the same site number form a subtree within the comtree topology. This allows routers to forward unicast packets using two lookup tables, one which identifies routes to “foreign” sites and one which identifies routes to endpoints within the “local” site. Multicast is implemented using a separate set of multicast addresses. Endpoints in a comtree may send packets to any multicast address, causing the network to deliver a copy to all subscribed endpoints. Endpoints in a comtree may subscribe to any of its multicast addresses, and the subscription protocol allows endpoints to change subscriptions for many addresses with a single packet.
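The two-level lookup can be sketched as follows. The 16/16 bit split, table contents, and local site number here are assumptions chosen for illustration; the text above does not fix the exact encoding.

```python
# Hedged sketch of two-level unicast address lookup: a "site" part selects
# a route to a foreign site, an "endpoint" part selects a route within the
# local site. The 16/16 bit split is an assumption for illustration.

LOCAL_SITE = 7                       # this router's site number (assumed)

def split(addr):
    """Split a 32-bit unicast address into (site, endpoint) parts."""
    return addr >> 16, addr & 0xFFFF

site_table = {3: "link-A"}           # routes to foreign sites
endpoint_table = {42: "link-B"}      # routes to endpoints in the local site

def lookup(addr):
    site, endpoint = split(addr)
    if site == LOCAL_SITE:
        return endpoint_table.get(endpoint)   # local: use endpoint part
    return site_table.get(site)               # foreign: use site part
```

Because all nodes sharing a site number form a subtree of the comtree, the site table needs only one entry per foreign site rather than one per endpoint, which is what makes this split scalable.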

Scalable Multicast Routing

Since each session communicates over its own comtree, one way to implement multicast is simply to broadcast every multicast packet to every router on the comtree and let the routers deliver packets to their directly attached endpoints based on local subscriptions (see left panel of Figure 3). This has the advantage that each router need only keep track of the subscriptions for its attached endpoints, minimizing the required multicast routing state, minimizing the subscription processing overhead and ensuring rapid response to subscription requests. On the other hand, it does require that multicast packets be distributed to routers whose endpoints have no interest in them.

An alternate approach is to define a central “core” router in the comtree and configure each router with a “pointer” telling it which of its incident links leads to the core (see center panel of Figure 3). In this approach, all multicast packets are sent to the core router, and subscription requests are also forwarded towards the core router, while adding multicast routing state at each router along the path to the core. This largely eliminates the excessive transmission of unwanted multicast packets, while still allowing efficient subscription processing. On the other hand, it can slow down the response to subscription requests and places a larger burden for handling multicast routing state on the core router.

We have chosen a more general approach that can be used to implement either of the above options, as well as various intermediate points. In particular, we allow each multicast session to define a “core subtree” consisting of a subset of its overlay routers (see right panel of Figure 3). Each router outside the core has a pointer telling it how to reach the core, and all multicast packets are sent towards the core and distributed to all the routers in the core (note that this can be done without any multicast-specific routing state). Subscriptions flow towards the core and need never propagate any further than the first core router. We note that a small core provides the most efficient use of bandwidth at the cost of higher subscription processing overhead and slower response to subscription requests. There is a variety of ways one might select which routers to include in the core. Perhaps the simplest is based on a specified maximum “distance” between an endpoint and its nearest core node; the distance metric can be a function of both hop count and link delay. The core can then be made as small as possible, consistent with this constraint, providing a bound on the response time to subscription requests. Alternatively, the core can be adjusted dynamically, based on the subscription volume at a node.
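The distance-based core selection rule can be checked mechanically. The sketch below is my own illustration (not part of the Forest design documents): it verifies that a candidate core satisfies a maximum-distance constraint, using hop count as the distance metric for simplicity.

```python
# Sketch: check that every router in a comtree is within `bound` hops of
# some core node, via multi-source BFS from the core. Hop count stands in
# for the more general hop/delay metric mentioned above.
from collections import deque

def within_bound(adj, core, bound):
    """adj: {node: [neighbors]} for the comtree; core: iterable of nodes."""
    dist = {c: 0 for c in core}
    q = deque(core)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return all(dist.get(u, bound + 1) <= bound for u in adj)
```

A core-selection procedure could grow the core greedily until this predicate holds, yielding the smallest core consistent with the response-time bound.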

Capacity Provisioning of Comtrees

As discussed above, distributed cyber-physical systems require timely delivery of sensor reports to distributed controllers, regardless of the specific pattern of communication. Sensors can distribute reports to interested controllers using multicast, with each sensor using its own multicast address. Alternatively, a group of related sensors can share a single multicast address, allowing interested controllers to receive reports from all sensors in the group, using a single subscription. The objective of capacity provisioning is to ensure that comtrees have the resources needed to guarantee timely delivery of sensor reports, even as subscriptions change over time. The basis for this provisioning is a specified sending limit σ(u) and receiving limit ρ(u) for each endpoint u. Given such limits, we can provision all the links in a comtree, so that they have the capacity to support any traffic pattern that does not exceed the limits. It is up to the endpoints to ensure that these limits are respected, but so long as they do, they can be assured that no packets will be discarded due to insufficient network capacity. The endpoints can use whatever mechanism they choose to enforce the limits, but we note that for sensors that issue periodic reports, it’s straightforward to implement the sending limit, and that controllers can limit the number of multicast connections they subscribe to, in order to avoid exceeding the receive limits.

The problem of provisioning tree-structured communication channels with specified send/receive limits was studied in another context by Fingerhut in [FI94, FI97]. He showed that one can provision the bandwidth on a link from router x to router y as follows. First, let X be the set of endpoints on x’s side of the link and let σ(X) be the sum of the send limits for the endpoints in X. Similarly, let Y be the set of endpoints on y’s side of the link and let ρ(Y) be the sum of the receive limits for the endpoints in Y. The bandwidth required from x to y is then just the smaller of σ(X) and ρ(Y). Moreover, one can compute the required link capacities for all links in the tree, using a single tree traversal requiring O(n) time, for a tree with n nodes. To account for the use of a multicast core that receives copies of all multicast packets, we need to make a small modification to this procedure. Specifically, if there are any core routers on y’s side of the link, the required bandwidth is σ(X). Otherwise, the required bandwidth is min{σ(X), ρ(Y)}. If the links are provisioned in this way, then the comtree is guaranteed to have the capacity needed for any traffic pattern that does not exceed the specified send and receive limits.
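The single-traversal computation can be sketched in a few lines. This is a simplified illustration of the Fingerhut-style procedure, under the assumption that each tree node has its send and receive limits attached directly to it; the core-subtree modification (using σ(X) alone when core routers lie on y's side) is omitted for brevity.

```python
# Sketch: provision every link of a rooted tree with capacity
# min(sigma(X), rho(Y)), where X is the sender side and Y the receiver
# side of the link. One bottom-up pass computes subtree sums in O(n).

def provision(tree, sigma, rho, root):
    """tree: {node: [children...]} rooted at `root`; sigma/rho: per-node
    send/receive limits. Returns {(x, y): capacity} for both directions."""
    S, R = dict(sigma), dict(rho)     # subtree sums, filled bottom-up
    order = []
    def dfs(u):
        for c in tree.get(u, []):
            dfs(c)
            S[u] += S[c]
            R[u] += R[c]
        order.append(u)
    dfs(root)
    total_s, total_r = S[root], R[root]
    cap = {}
    for u in order:
        for c in tree.get(u, []):
            cap[(u, c)] = min(total_s - S[c], R[c])   # down the tree
            cap[(c, u)] = min(S[c], total_r - R[c])   # up the tree
    return cap
```

For example, a root with two children having send limits 2 and 3 and receive limits 1 and 4 needs only 1 unit of downstream capacity on the first link, since the child can never receive more than its limit.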

Selecting a Comtree Topology

Configuring a comtree for a session requires selecting a subtree of the overlay network infrastructure that has enough capacity to support arbitrary communication patterns among network endpoints. This is a special case of the constraint-based network design problem studied in [FI94, FI97, DU99]. It has been shown that in general, this problem is NP-hard, using a reduction from the Steiner tree problem. However, when the solutions are constrained to be trees, we can find optimal or near-optimal solutions in the cases most relevant to comtree configuration [FI94]. In particular, if A is the sum of all the σ() values and Z is the sum of all the ρ() values, then for A ≥ Z, the optimal solution is a shortest path tree from some “central” vertex in the overlay network to all endpoints that are to be included in the comtree. Such a tree can be constructed by computing a shortest path tree for the entire overlay network and then pruning links not used to reach endpoints required for the comtree. By trying all possible center vertices, we can find the optimal solution in O(mn + n log n) time, where m is the number of links in the overlay network infrastructure, and n is the number of nodes. If A<Z, this shortest path tree is not optimal, but is guaranteed to have a cost no more than (1+Z/A)/2 times that of the least-cost tree.

The prior work provides a solid basis for comtree configuration, but leaves several issues to be addressed. First, while reference [FI94] shows that shortest path trees are within a constant factor of optimal when A<Z, it provides no information about how to obtain better trees in this case. We find that when A is much smaller than Z (which is a realistic case in the DCPS context), other trees can substantially outperform shortest path trees. We illustrate this with results from a simple experiment, shown in Figure 4. For this experiment, we generated random trees over n (=25) points distributed uniformly over a 2x2 square centered at the origin. Trees were constructed, starting from the most central vertex in the tree, and provisioned to determine the cost. We treated each point as a Forest router, with n attached sensors and n attached controllers, and we assumed that each sensor had a send limit of 1, while each controller had a receive limit that varied from 1 to 24. The cost of each provisioned link was taken to be its provisioned capacity times its length. Each data point in the figure shows normalized average results from 50 independently generated random trees. We do not show error bars, but standard deviations were computed and were typically less than 10% of the mean values.

Results for three different trees are shown: shortest path trees, minimum spanning trees and an intermediate tree constructed using a variant of Prim’s minimum spanning tree algorithm, with a bound on the maximum allowed “stretch” with respect to distances from the tree root; we show the results when the stretch is limited to 1.2 (note that constraining the stretch to 1 yields shortest path trees, while allowing it to be unbounded yields minimum spanning trees). The shortest path tree cost grows linearly with the receive limit, and is very close to the analytical bound. The minimum spanning tree provides the best results for large fanout, and the bounded stretch trees perform nearly as well. We conjecture that a hybrid strategy, which mimics the minimum spanning tree algorithm in the early stages, and the shortest path tree algorithm in later stages, will outperform the “pure” strategies considered here.
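A bounded-stretch construction along these lines can be sketched as follows. This is my own simplified reading of the Prim variant described above, not the code used in the experiments: it grows a tree from the root, rejecting any edge that would make a node's in-tree distance exceed `stretch` times its shortest-path distance. Rejected edges are not revisited, so for intermediate stretch values this simplified version may occasionally fail to reach some nodes.

```python
# Sketch of a bounded-stretch Prim variant: stretch=1 recovers a shortest
# path tree, stretch=inf recovers a (rooted) minimum spanning tree, and
# intermediate values interpolate between the two.
import heapq, math

def dijkstra(adj, root):
    """adj: {node: [(neighbor, weight), ...]}. Shortest-path distances."""
    dist = {root: 0.0}
    pq = [(0.0, root)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def bounded_stretch_tree(adj, root, stretch):
    sp = dijkstra(adj, root)          # reference shortest-path distances
    in_tree = {root}
    tdist = {root: 0.0}               # distance from root within the tree
    parent = {}
    pq = [(w, root, v) for v, w in adj[root]]
    heapq.heapify(pq)
    while pq:
        w, u, v = heapq.heappop(pq)   # lightest frontier edge, Prim-style
        if v in in_tree:
            continue
        if tdist[u] + w > stretch * sp[v]:
            continue                  # edge would violate the stretch bound
        in_tree.add(v)
        tdist[v] = tdist[u] + w
        parent[v] = u
        for x, wx in adj[v]:
            if x not in in_tree:
                heapq.heappush(pq, (wx, v, x))
    return parent, tdist
```

With stretch = 1 every accepted node ends up at exactly its shortest-path distance, which is why the construction degenerates to a shortest path tree at that setting.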

The earlier work also does not address the use of a core subtree for multicast packets. Core subtrees are useful, because they can significantly reduce the amount of routing state needed to “locate” a multicast group. This can be particularly important for applications that use many small, dynamic multicast groups. On the other hand, the use of a core does impose a network bandwidth cost. We studied how this cost changes with the size of the core, and compared it to the cost of implementing multicast without a core. We again generated random trees over n points distributed uniformly over a 2x2 square centered at the origin. Trees were constructed using the variant of Prim’s algorithm mentioned earlier; for each case, several values of stretch were evaluated and the one that produced the least expensive tree for the given provisioning method was selected. The results appear in Figure 5. First, we note that when the core consists of just the “center” node of the comtree, the cost is essentially indistinguishable from the case where no core is used. When the neighbors of the center node are added to the core, there is some increase, but the difference becomes negligible for larger receive limits. Larger cores lead to higher cost, but the cost difference shrinks rapidly as the ratio of receive limits to send limits grows.

The prior work must also be extended to account for capacity limits in the underlying substrate. One way to incorporate capacity limits is to modify the tree construction algorithm to check capacity constraints as each new link is added to the tree; if adding a link causes a constraint to be exceeded (either for the given link or other links already in the tree), the link is marked as excluded and the algorithm proceeds to consider alternate choices. In the absence of capacity constraints, this produces trees that are provably optimal or close to optimal. In the presence of capacity constraints, there is no guarantee that this method will produce a solution at all, even when a solution is known to exist. However, it is a natural starting point for algorithmic study of the capacity-constrained case and merits further study.

Our strategy for provisioning comtree bandwidth can be overly conservative in systems where there is a strong locality to the communication patterns. This can cause it to allocate more bandwidth than the application requires, needlessly increasing cost. The constraint-based network design framework is general enough to accommodate situations like this. For each endpoint u, we define a neighborhood Nu and specify a constraint σ(u,Nu) on the amount of traffic that can go from u to nodes outside Nu. Constraints of the form ρ(u,Nu), limiting the traffic that can come into u from nodes outside Nu, are defined similarly. With these added constraints, the objective for comtree selection is to find a subtree of the overlay network infrastructure that can support any traffic pattern that satisfies both the original send/receive constraints and these additional constraints. We expect that these neighborhood constraints will often be associated with clusters of nodes that are geographically close to one another, leading to a natural hierarchy that matches well with tree topologies. We will study how comtree selection algorithms can be designed to produce high quality solutions for cases like this.

We note that while in many cyber-physical systems, the set of sensors and controllers is static, allowing comtrees to be configured once, before the system begins operation, others are more dynamic (e.g. the case of users interacting in virtual worlds, which they enter and leave dynamically). This implies that comtrees may need to be dynamically reconfigured over time to accommodate changes in the set of endpoints. Most often, it will be possible to add an endpoint, through adjustments to the provisioned capacity of a subset of the comtree links. In other cases, comtrees may need to be restructured in order to accommodate new endpoints. We plan to study ways to migrate a running application from one comtree to another while minimizing the impact of migration on running applications.

Other Services and Extensions

A complete implementation of Forest requires a number of additional elements, which we intend to address over the course of the project. We touch briefly on those elements here.

  • Routing protocol. To support comtree provisioning, a routing protocol is needed to distribute information about the available capacity of links and overlay routers. We propose to use a comtree for this purpose, with each router using a separate multicast address to transmit its state information, while a set of “aggregators” collects status reports from routers in separate geographic regions and distributes succinct summaries. We note that a Forest overlay is operated as a single administrative domain, so we are not concerned with inter-domain routing.
  • Signaling protocol for comtree configuration. The comtree signaling protocol is used to create comtrees and add/remove endpoints. The user that creates a comtree is designated as its “owner” and controls how it is provisioned and accessed by others. The management of comtrees is distributed among the overlay routers using distributed hash table techniques. The router responsible for a given comtree maintains a complete view of its topology and determines how the topology is extended to handle new endpoints. It reserves the resources it needs through a request/response interaction with the routers directly responsible for the required resources.
  • Access protocol. The access protocol is used to connect endpoints to a Forest overlay. The access router within the overlay is chosen to minimize the length of the access connection. Only registered users can connect a new endpoint, and user identity is verified using standard cryptographic techniques. Forest packets going between endpoints and access routers are encapsulated and carried in a “tunnel”.
  • Network Management. The management protocol enables manual configuration of Forest routers and links, as well as user administration functions. It also supports online monitoring, enabling remote administrators to oversee the network operation and carry out remote control operations, as needed.
  • Discovery services. Applications using comtrees will typically need discovery services to determine which other endpoints in the comtree may have information of interest to them. We believe that the form this will take is likely to differ from application to application, so we do not currently plan to provide a built-in discovery service. However, as we experiment with sample applications, we expect to identify common requirements that might usefully be supported, through an application support library or possibly some network-level mechanisms.

We are considering several possible extensions to the core mechanisms planned for Forest and expect to pursue these extensions over the course of the project, as time permits.

  • Fault-tolerant comtrees. Many DCPS applications have critical reliability requirements, making it essential that they be able to tolerate failures in network components. We propose to develop mechanisms that enable applications to recover from failures either using pre-allocated redundant comtrees, or a shared pool of spare network capacity sufficient to enable recovery for a subset of a larger set of comtrees. This latter option will allow more fine-grained cost/reliability trade-offs.
  • High quality synchronization and synchronized delivery. Routers in overlay networks using provisioned links can piggy-back timing information on all packets exchanged with neighboring routers, allowing them to synchronize clocks across the entire network with high precision. Such mechanisms can enable the delivery of packets with consistent delay, no matter the source or destination; this can be useful in applications that are sensitive to the time relationship among events reported by different sensors.
  • Support for reliable multicast packet delivery. Although the provisioning of network capacity makes congestion-induced packet losses unlikely, packets can still be lost due to link transmission errors. References [JA00, TU97] describe mechanisms that can be used to enable efficient end-to-end support for reliable multicast. Such mechanisms can be extended to provide consistent delivery order to all endpoints, making it easier to implement applications with strong consistency requirements.


Network Games Components

The network game system is distributed across multiple sites. Each site has a known geographic location and includes a multiplicity of processing resources of various types. In particular, each site has one or more servers and routers. Servers connect to remote users or clients over the Internet and communicate with other servers through the routers. Sites are connected by metalinks which have an associated provisioned bandwidth. Multiple routers at a site may share the use of a metalink going to another site, but must coordinate their data transmissions so as not to exceed its capacity. In addition, each site has a site controller which manages the site and configures the various other components as needed. Within a site, communication takes place through a switch component that is assumed to be nonblocking. These ideas are illustrated in the figure below. In the planned SPP deployment, each site will correspond to a slice on an SPP node (including a fastpath for the routing function and a GPE portion that implements the site controller functions), plus a set of remote servers which will be connected to the SPP node over statically configured tunnels.

In addition to the components shown in the diagram, there is a web site through which clients can register, learn about game sessions in progress and either join an existing session or start a new one. When the user requests to join a session, the system will select a server and return its IP address and a session id to the client, who will use these to establish a connection to the server and join the session.

Communication in the network occurs over multicast channels. Each channel is identified by a globally unique 32 bit value and has an associated set of components and links which form a multicast tree. Channels are used most commonly to distribute state update information for game sessions. In this case packets are delivered to all servers that belong to the session. However, channels can also be used by the system to distribute control information of various sorts. Packets can be sent to specific components on a channel, or to sets of components (e.g. all site controllers). In general, the links in a channel can carry data in either direction enabling many-to-many communication. The first 1024 channels are reserved for various system control purposes and cannot be used for game sessions.

The various system components communicate by sending packets of various types. The common packet format is shown at right and the fields are described below. Specific protocols used by the various components define specific packet types and additional fields. These are described in the specific sections dealing with those protocols.

Generic Packet Format
  • Version (4 bits). This field defines the version of the protocol suite. The current version is 1.
  • Length (12 bits). Specifies the number of bytes in the packet. Must be divisible by 4 and can be no larger than 1400.
  • Type (8 bits). This field defines the type of the packet. Specific types are discussed in detail below.
  • Flags (8 bits). The currently defined flags are listed below.
    • Route Needed. This field is set by a router when it does not have a route table entry for the specific destination address in the packet and must forward the packet on all of its outgoing branches in the channel. The recipient of the packet is expected to respond with a Route Announcement packet on the same channel, addressed to all routers on the channel.
  • Channel id (32 bits). Identifies the specific communication channel on which the packet is being sent. Each channel is associated with a subset of the network links that collectively form a tree.
  • Source Address (32 bits). This specifies the address of the component that sent the packet.
  • Destination Address (32 bits). This specifies the address of the component to which the packet is being sent.
  • Auxiliary Information (64 bits). The interpretation of this field depends on the type of the packet. Details are given in subsequent sections.
  • Header Checksum (32 bits). Computed over all fields of the header. Packets with an incorrect header checksum are discarded by routers.
  • Payload Checksum (32 bits). Computed over the payload alone.
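To make the field layout concrete, here is a sketch of how the header could be serialized. The exact wire layout (field ordering within the first word, network byte order) is my assumption, and CRC-32 is only a stand-in for whatever checksum the protocol actually specifies; the payload checksum, computed separately over the payload, is omitted.

```python
# Sketch of generic packet header serialization: 4-bit version, 12-bit
# length, 8-bit type, and 8-bit flags packed into one 32-bit word,
# followed by channel id, source, destination, 64-bit auxiliary info,
# and a header checksum (CRC-32 used here as a placeholder).
import struct, zlib

def pack_header(version, length, ptype, flags, channel, src, dst, aux):
    assert length % 4 == 0 and length <= 1400   # constraints from the spec
    word0 = (version << 28) | (length << 16) | (ptype << 8) | flags
    body = struct.pack("!IIIIQ", word0, channel, src, dst, aux)
    hdr_cksum = zlib.crc32(body)                # stand-in for the real checksum
    return body + struct.pack("!I", hdr_cksum)
```

The resulting header is 28 bytes before the payload checksum is appended; a router would recompute the checksum over the first 24 bytes and discard the packet on mismatch.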

All system components have addresses which serve as concise machine-readable identifiers and provide location information. The first 16 bits of each 32 bit address identify a site, and the remaining 16 bits identify a component associated with that site. The site component of the address is called the site id and the remainder is called the component number. Components with addresses include servers and routers. Clients interact with the system only through its web interface and through their IP connections to servers, so do not require netGame addresses.

Addresses with a component number of less than 1000 are reserved for special purposes. These addresses identify certain context-specific destinations, as shown below. These destinations are interpreted within a scope defined by the site number of the packet. Packets with a site number of 0 are delivered to the designated recipients in the global scope (that is, the entire channel). Packets with a non-zero site number designate recipients within the specified site. So for example, a packet with a destination address of 0.200 is delivered to all servers anywhere on the channel, while a packet with a destination address of 2.200 is delivered to all servers within site 2.

  • 100 - Lead controller (for specified site or entire channel).
  • 101 - All site controllers (for specified site or entire channel).
  • 102 - Next site lead controller.
  • 200 - All servers.
  • 201 - All subscribed servers.
  • 300 - All routers.
  • 301 - Next-hop router.
  • 302 - All edge routers.
  • 303 - First edge router.

ALTERNATE APPROACH - Jon

I'm starting to think that this approach is more structured than it needs to be and may needlessly constrain the ways that the channels are used by different game sessions. If the overlay supports different types of game sessions and virtual worlds, it's likely that there will be a variety of usage patterns. So here's an alternate addressing structure that I think we should consider.

Unicast addresses. Top bit=0, next 15 bits define a site. Lower 16 bits define an endpoint within the site. Require that the top seven bits of site number are !=127.

Channel-wide multicast address. Top bit=1, Next 7 bits=127. Remaining 24 bits defines a channel-wide multicast.

Site-local multicast address. Top bit=1, next 15 bits define a site. Remaining 16 bits define a site-local multicast. Require that top 7 bits of site number are !=127.
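The three rules above can be captured in a small classifier. This is my reading of the proposed layout, written as a sketch rather than a fixed design:

```python
# Sketch of Jon's proposed address layout: classify a 32-bit address as
# unicast, channel-wide multicast, or site-local multicast based on the
# top bit and the next seven bits.

def classify(addr):
    if addr >> 31 == 0:                   # top bit 0: unicast
        return ("unicast", (addr >> 16) & 0x7FFF, addr & 0xFFFF)
    if (addr >> 24) & 0x7F == 127:        # next 7 bits == 127
        return ("channel-multicast", addr & 0xFFFFFF)
    return ("site-multicast", (addr >> 16) & 0x7FFF, addr & 0xFFFF)
```

Reserving the site-number prefix 127 for channel-wide multicasts is what keeps the three address classes disjoint, so a single test on the top eight bits is enough to route the lookup.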

With this approach, we don't need all the special-purpose addresses <1000. We can simply configure a multicast within the channel with the appropriate set of endpoints. We might also then use multicast addresses in place of region numbers for state update distribution. This will simplify the lookup process and make things a bit more general.

We could make subscription to a multicast happen as a side-effect of sending to a multicast address. Suppose a server sends a packet to a multicast address which is unknown to its first-hop overlay node. The first hop node can forward the packet to all its neighboring overlay nodes, with a "subscription-requested flag". When the packet reaches a node that knows about that multicast address, it can respond with a multicast route packet that is used by the routers along the path to configure the filters needed to route multicast packets. To avoid lots of excess traffic for subscription requests, we can define a central subtree of the channel as the "core" nodes and require that all core nodes stay subscribed to all multicasts, so that subscription packets propagate only up to the first core node. This generalizes our ideas about how to deal with subscriptions for state updates. Needs more thought, but I think this could be a useful approach.

Packet Forwarding

The routers maintain routing tables for forwarding packets within channels. The channel routing table at a site typically contains an entry for all other sites that are on the channel, and an entry for all components within the site that are on the channel. These entries provide the information needed by the router to make local routing decisions. Routing table entries can be statically configured, or created dynamically. This section describes the basic steps involved in forwarding a packet. It is phrased in terms of a specific reference implementation, but this is meant to be illustrative, not prescriptive.

We assume that each router has a list of neighbors with which it can communicate directly. Each neighbor is identified locally by a numerical index and for each neighbor, the router has its netGames address, its node type (router, server, site controller) and whatever lower level information is needed to send netGames packets to it (e.g. Ethernet MAC address, or IP address plus port#). For each session, the router also maintains a list of those neighbors that are adjacent to it in the session. We also assume that for each arriving packet, we have the index of the neighbor from which it was received. A neighbor number of 0 is used to identify the control component of the router itself.

The node maintains a forwarding table that consists of a set of (key,value) pairs and can be implemented using a hash table data structure. The keys are 64 bits long and are formed using selected fields of the packet header. The first 32 bits are the session id, the next 8 bits identify the input interface on which the packet arrived, the next 4 bits identify the class of the packet and the last 20 bits provide auxiliary addressing information which depends on the class. The classes are defined as follows.

  • State update packets are assigned to class 0. These packets are recognized by their packet type (=xx). For these packets, the auxiliary address information is the region number of the state update, which is contained in the auxiliary information field of the packet header.
  • Packets addressed to foreign destinations (the site number part of the destination address is greater than zero and not equal to the local site number) are assigned to class 1. For such packets, the auxiliary address information is the site number portion of the destination address.
  • All other packets are assigned a class of 2. For these packets the auxiliary information is the component number portion of the destination address. Note that this means that routing table entries determine how the various "special" packets are forwarded.

The value portion of the forwarding table entry is a list of the neighbors to which the packet should be forwarded (specified by their numerical indices).
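The 64-bit key layout can be captured in a small packing function. The shift amounts follow directly from the stated field widths (32-bit session id, 8-bit input interface, 4-bit class, 20-bit auxiliary information); the function name is illustrative.

```python
def make_key(session_id: int, in_iface: int, pkt_class: int, aux: int) -> int:
    """Pack the four header-derived fields into the 64-bit lookup key:
    32-bit session id | 8-bit input interface | 4-bit class | 20-bit aux."""
    assert session_id < (1 << 32) and in_iface < (1 << 8)
    assert pkt_class < (1 << 4) and aux < (1 << 20)
    return (session_id << 32) | (in_iface << 24) | (pkt_class << 20) | aux

# e.g. a class 1 (foreign-site) packet, where aux is the destination site number
key = make_key(session_id=7, in_iface=2, pkt_class=1, aux=42)
```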

To forward a packet, the router forms the lookup key from the packet header fields, finds the matching entry in the table and forwards the packet to the specified neighbors. If the neighbor list includes the neighbor from which the packet was received, that copy of the packet is suppressed. Packets may be forwarded even when there is no matching entry. Specifically, packets addressed to foreign sites for which there is no routing table entry are forwarded to all neighboring routers, and packets addressed to a component in the local site for which there is no routing table entry are forwarded to all neighboring components within the site. In both cases, the "Need Route" bit is set in the flags field. When a component receives a packet addressed to it with the "Need Route" bit set, it is required to respond with a "Route Announce" packet, which is distributed to all routers in the session.

Network Control and Operation

This section defines packet types that are related to basic network operations.

  • 0 - Route Announce. Packets of this type are used to announce a route to a given destination within a specific channel.

We reserve channel ids 0-999 for the use of various network control protocols. These reserved channels are described below.

  • 0 - Bootstrap channel. This channel is used for configuration when a new node comes online. For each new node, one of its neighboring nodes is assigned as its parent node on the bootstrap channel. The root of the bootstrap channel is a Master Controller (MC) for the entire network. On receiving a Node Announcement message from a node, the MC configures the node.
  • 1 - Hello Neighbor channel. This channel is used by nodes to exchange hello packets with the nodes at the other end of each of their links.
  • 100-199 - Network Status Distribution. These channels are used to distribute network status information to site controllers in the network. For small to medium-sized network configurations, we expect a single channel to be sufficient for distributing status information. The additional channels are reserved to facilitate scaling and reliability for very large networks.
  • 500-599 - Channel Configuration. These channels are used to control the configuration of channels used for game sessions. For small to medium-sized network configurations, we expect a single channel to be sufficient for channel configuration. The additional channels are reserved to facilitate scaling and reliability for very large networks.

Session Control and Operations

This section defines packet types used to control and manage individual game sessions.

  • 100 - State update. This packet is used to communicate status information about dynamic objects in the game world to other servers. Packets of this type should be addressed to "all subscribed servers"; that is, the component number portion of the destination address should be 201. The auxiliary information part of the packet includes the identity of the object whose status is being updated (16 bits) and the region of the game world in which the object is currently located (20 bits). The payload includes a list of parameter/value pairs giving the current value of the specified parameters.
  • 101 - Subscription update. This is used by servers to request changes to their region subscriptions. Packets of this type should be addressed to the "first edge-router". The payload contains two lists of up to 150 values of 20 bits each. The first specifies regions to be added to the subscription list; the second specifies regions to be dropped. The first 16 bits of the payload give the lengths of the two lists.
  • 102 - Subscription report. This is sent periodically by routers to servers and gives the router's current view of the regions to which the server is subscribed. This allows servers to recover from lost subscription update packets. The payload simply lists all the regions (as 20-bit values) to which the server is subscribed.
  • 103 - Object query. This is used to request the current state of a particular object in the game world. Packets of this type are addressed to a specific server.
  • 104 - Region query. This is used to request the current state of all objects in the game world that belong to a particular region. Packets of this type should be addressed to "all subscribed servers".
  • 105 - Query reply. This is used to respond to a prior query request. Packets of this type should be addressed to the specific server that sent the initial query.
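As an illustration of the subscription update (type 101) payload, the following encoder packs the two region lists behind the 16-bit length field. The document does not say how the 16 bits are split between the two list lengths, so this sketch assumes two 8-bit counts.

```python
def encode_subscription_update(add_regions, drop_regions):
    """Encode a type-101 payload: a 16-bit length field (assumed here to be
    two 8-bit counts), followed by the added and then the dropped 20-bit
    region numbers, packed contiguously and padded to a byte boundary."""
    assert len(add_regions) <= 150 and len(drop_regions) <= 150
    bits = (len(add_regions) << 8) | len(drop_regions)
    nbits = 16
    for r in list(add_regions) + list(drop_regions):
        assert r < (1 << 20)           # region numbers are 20 bits
        bits = (bits << 20) | r
        nbits += 20
    pad = (-nbits) % 8                 # pad to a whole number of bytes
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")
```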

Network Status

This section summarizes the major system level data structures. The first three items below (client data, connection data and session data) are stored in a distributed hash table, with each site responsible for a portion of the key space in the DHT. The key space covered by each site is known by the other sites, allowing one-hop access to data in the DHT.
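One-hop access works because every site can compute which site owns a key without consulting anyone else. A minimal sketch, assuming the key space is split into equal contiguous ranges (the actual partition and hash function are not specified here):

```python
import hashlib

def owner_site(key: str, num_sites: int) -> int:
    """Map a DHT key (e.g. a user name) to the site responsible for it.
    The 64-bit hash space is split into equal contiguous ranges, one per
    site, so finding the owner is a single local computation -- one hop."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h * num_sites // (1 << 64)
```

Any site can then send a request directly to `owner_site(name, num_sites)` rather than routing through intermediate DHT hops.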

Client data

  • User name (key)
  • User preferences
  • Accounting records
  • Current sessions

Session data

  • Session id (key)
  • Users in the session (specified in terms of client addresses)
  • Multicast tree - sites, inter-site links with reserved capacities

The network status information logically forms a graph, with data associated with its nodes and links. Each site maintains the authoritative information about itself and its incoming links, but each site also maintains a complete copy of the global network status. The status information is distributed across a multicast spanning tree of the network, with each node periodically sending a copy of its current status on the multicast tree. The construction of the multicast spanning tree is carried out by a distributed spanning tree maintenance algorithm.
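The effect of distributing status over the spanning tree can be sketched as a flood that reaches every node exactly once; because the links form a tree, no duplicate copies arise. Names here are illustrative.

```python
def distribute_status(src, tree):
    """Simulate one periodic status report from 'src' reaching every site
    over the spanning tree. 'tree' maps each node to its set of tree
    neighbors; the return value is the set of nodes that got the report.
    Illustrative only -- the real protocol forwards packets, not calls."""
    seen = {src}
    frontier = [src]
    while frontier:
        node = frontier.pop()
        for nbr in tree[node]:
            if nbr not in seen:   # on a tree, this only skips the sender
                seen.add(nbr)
                frontier.append(nbr)
    return seen
```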

Network Status

  • List of sites with up/down status, provisioned capacity (in terms of client sessions) and available capacity.
  • Inter-site links with up/down status, provisioned capacity and available capacity.

SPP Deployment

This section discusses some specifics of the planned SPP deployment. Part of the reason to do this here is to verify that the overall framework we are creating makes sense for the SPP (and similarly for other deployment scenarios). We focus here on the fast path component. For demo purposes, we will most likely configure the fast path manually by logging into a slice on the GPE and configuring filters as needed.

We note that implementation on the SPP requires version 2 of the NPE, since version 1 does not support multicast. Implementing a fast path requires that we develop a new code option, including a parsing component and a header formatting component. The parsing component simply extracts from the netGames packet header the fields to be used in the lookup key. These include the protocol and type fields, plus the channel id, the destination address and, for state update packets, the region number, for a total of 96 bits. The TCAM lookup produces a resultsIndex and a statsIndex. The resultsIndex is used to obtain the forwarding information for the packet, while the statsIndex specifies a statistics counter. Each channel id that is active at an SPP has its own statsIndex, so that we can monitor the traffic sent on the channel. The resultsIndex is used in a second table lookup. For multicast packets, the result of this lookup includes the fanout and, for each copy, a queue id, the IP destination address of the outgoing packet, and the source and destination UDP port numbers.
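The two-stage lookup can be modeled with two tables standing in for the TCAM and the results memory. Every name and field layout below is an assumption made for illustration; the real NPE data structures differ.

```python
# Model of the SPP fast-path lookup: the TCAM match on the 96-bit parse key
# yields (resultsIndex, statsIndex); a second lookup on resultsIndex yields
# the multicast fanout. Dictionaries stand in for the TCAM and result memory.
def fast_path(key, tcam, results, stats):
    ri, si = tcam[key]
    stats[si] = stats.get(si, 0) + 1   # bump the per-channel traffic counter
    return results[ri]                 # one entry per outgoing copy

tcam = {0xABC: (1, 9)}                 # parse key -> (resultsIndex, statsIndex)
results = {1: [(4, "10.0.0.2", 5000, 6000),   # (queue id, IP dest, src port, dst port)
               (7, "10.0.0.3", 5000, 6001)]}
stats = {}
copies = fast_path(0xABC, tcam, results, stats)
```

Per-channel traffic monitoring falls out of giving each active channel id its own statsIndex, as described above.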

Changing the TCAM requires control interactions through the substrate. That is, code running on a GPE must request new filters through the RMP, which talks to the xScale, which changes the TCAM entry. This will be relatively slow. For region-based filtering, we would like to make things more responsive. This seems to require some special processing in either the parse or header format blocks, the two places where we can place slice-specific code (or both).

ONL Deployment

EC2 Deployment

Related Work

Here is a link that classifies related work. This page is under construction.

List of Related work with comments

Assiotis - Netgames 2006

Baker - Transactions on Multimedia 2005

Benford - Computer-Human Interaction 1997

Bharambe - SIGCOMM 2008

A Platform for Dynamic Microcell Redeployment in Massively Multiplayer Online Games by Bruno Van Den Bossche, Tom Verdickt, Bart De Vleeschauwer, Stein Desmet, Stijn De Mulder, Filip De Turck, Bart Dhoedt and Piet Demeester. Nossdav 2006.

This paper describes a multiplayer online game system developed by the authors. The workload is distributed by dividing the game world into a number of small microcells and assigning multiple microcells to each server. This allows a server with a crowded microcell to handle a reduced number of microcells, improving the ability to balance the load. On the other hand, it implies that microcells must be easily and efficiently migrated among servers. The system is implemented in Java (specifically, using J2EE and JMS). The reported data is very limited and not particularly impressive (they report a round-trip time of 3.2 ms for a 750 B message, and the bandwidth through a server appears to be under 30 Mb/s).

Carlson - Virtual Reality 1993

Chan - Netgames 2007

Chertov - Nossdav 2006

Colombo - Transactions on Multimedia 2007

Das - Virtual Reality 1997

On Consistency and Network Latency in Distributed Interactive Applications: A Survey - Part I by Declan Delaney, Tomas Ward, Seamus McLoone. Presence 2006.

This is a survey of techniques used in distributed interactive simulations. This part concentrates on maintaining consistency. Moderately useful, although somewhat superficial. No quantitative assessment of the various methods used to maintain consistency, making it difficult to evaluate the relative merits of the different approaches. Still, it does provide a useful overview of the literature; the reference list has nearly 100 papers.

Delaney - Presence 2007

Fernandes - Nossdav 2007

Frecon - DSE 1993

Frecon - VRST 1999

Frecon - Presence 2001

GE - 2003

Glinka - NetGames 2007

Glinka - IJCGT 2008

Graffi - ISM 2008

Hampel - Netgames 2006

Hogsand - Multimedia 1996

Kantawala 1996

Kinicki - Nossdav 2008

Lee - Presence 2007

Lu - NetGames 2006

Macedonia - 1994

Macedonia - CGA 1995

Ott - Multimedia 2004

Pandzik - Presence 1997

Parker - IJCGT 2008

Perkins - CCR 2000

Quax - IJCGT 2008

Quax - NetGames 2008

Shapiro - 2002

Singhal - Stanford PhD thesis 1996

Smed - TCCS 2002

Snowdon - Virtual Reality 1996

Styz - CG&A 1996

Thalman - Visual Computing 2005

Vik - NetGames 2006

Voronoi Diagram Maintenance

Watsen - Image 1998

Wu - TOMCCAP - 2006

Yang - Nossdav 2006

Zabele - DANCE 2002

Zimmerman - TOMCCAP 2007

Zyder - Computer 2005

Add more later, along with brief comments. Organize into related groups of papers.