This chapter discusses some general issues that programmers implementing game systems should keep in mind, and also provides example designs and implementations of common systems used in MMOGs.
In general, server processing load, internal network bandwidth and external network bandwidth scale linearly with the number of players, provided that player and entity density remain constant. There is a small extra cost as density increases.
Capacity can be added by:
- Adding more BaseApps, for more external connection points and connection processing capacity
- Adding more CellApps, for more spatial processing capacity
- Adding a combination of both more BaseApps and more CellApps, to increase game script processing capacity
CellAppMgr, BaseAppMgr and DBMgr are single instances, and so are theoretical scaling bottlenecks. CellAppMgr and BaseAppMgr are concerned only with managing CellApps and BaseApps, and have very low load; they can scale to handle thousands of BaseApps and CellApps. The main concern for scaling is DBMgr, although BigWorld's design does not make heavy use of the database. This is addressed in BigWorld Database Scalability below.
The number of BaseApps and CellApps required to sufficiently service an entity population should generally scale linearly with the number of entities. Most communication between entities is with those that are nearby. This is handled by keeping those entities together on CellApps as much as possible. Other communication involves point-to-point communication using remote method calls. The main issue here is to try to minimise situations where entities need to be looked up globally.
The DBMgr functionality for writing out entity state is distributed across the secondary databases for each BaseApp and consolidated when the entity is retired. See BigWorld Database Scalability below.
The general strategy for combating bottlenecks in game script is to avoid global game systems where possible, such as singleton entities that control some operation of the game, for example, trading. In general, these bottlenecks can be avoided by restructuring game script and using distributed object methods to implement such global sub-systems, rather than entrusting the request handling to a single entity. An example of this is presented in AoI-based trading below.
Updates to some of the entities in a player's AoI are propagated to the player's client every game tick (by default, gameUpdateHertz is 10Hz). The amount of update data sent to the player's client is constrained to a downstream bit rate (by default, bitsPerSecondToClient is 20kbps).
These updates consist of property changes and method calls. Every cell entity keeps a history of these changes, and for each entity in a player's AoI, the player is updated incrementally about that entity periodically. The position and direction data of an entity is specially treated so that only the most recent value of these properties (so called volatile properties) is sent to the client, instead of the full history of the property.
Internally, entities in an AoI are in a priority queue. The priority of an entity in a player's AoI determines how long it will be before the next update about that entity occurs to the player. Generally speaking, entities closer to the player are updated more frequently than those that are towards the edge of a player's AoI. Properties can have Level of Detail (LoD) rules applied so that these properties will only be updated if the entity is close enough to the player. See the chapter LOD (Level of Detail) on Properties.
In general, many game operations are localised to the specific area that a player inhabits. Load balance partitioning is done across each space depending on the load being generated per cell. As entity densities increase, the partitioning scheme changes in response to equalise the load amongst the cells servicing a space. The amount of data per entity that is sent to the client is also reduced as the density of entities increases. This reduces a lot of the extra cost due to density and also makes good use of the client's bandwidth.
However, very high entity densities can cause problems, as each periodic update to a player client can be overrun with excessive amounts of entity event data. Recall that the downstream bandwidth is a configurable constant. Due to the prioritisation of change events for entities in a player's AoI, updates for entities further away can be starved if many more entities are close to the player.
Increasing the downstream bandwidth can improve this situation, but eventually it is usually the client that becomes the limiting factor. There is a per-entity processing cost on game clients, for example:
- processing notifications for each entity's position and direction
- processing notifications for each entity's property changes
- processing notifications for each entity's method calls
- applying physics rules to each entity
- rendering of each entity
There is also a limit to the amount of information that a player can comprehend. With large numbers of entities nearby, less information tends to be needed for distant entities.
Extreme entity densities that can negatively affect the end-user experience can be avoided with good game design.
A brief discussion of the operation of DBMgr, and its implications for scalability, follows.
When entities are checked out of the database, they are assigned to the least-loaded BaseApp. Once entities are loaded onto a BaseApp they do not generally migrate away from that BaseApp unless that BaseApp process terminates, in which case they are restored on other BaseApps in the system. See the chapter Fault Tolerance.
For each entity that resides on it, the BaseApp is responsible for collecting all the explicit script writes (from calls to BigWorld.writeToDB()) for that entity over its checked-out lifetime, as well as the periodic backups for that entity. These writes are performed on a secondary database stored on the BaseApp machine. There can be arbitrarily many BaseApps in a cluster, and entities are statically load-balanced across them when they are instantiated, by assigning each to the least-loaded BaseApp.
When the entity is destroyed, it is checked back into the primary database: the database writes accumulated in the BaseApp secondary database for that entity are consolidated back into the primary database on the DBMgr.
This consolidation can be a bottleneck, and there are future features planned to reduce this so as to not overload the DBMgr. In general, writing back to the primary database is not a time-critical operation, so that checking entities back into the database is a fire-and-forget operation. No data loss is possible as the data is persisted on the secondary database, and not removed until the consolidation for that entity is done.
On server shutdown, all checked-out entities have their database writes consolidated back into the primary database. If an unexpected failure occurs (e.g. power failure), this consolidation can take place on the next server startup.
The following operations on the DBMgr can still be a bottleneck:
- checking login credentials
- handling lookup requests for entities by name or database ID
- loading entities from persistent storage
- writing entity state when entities are checked back into persistent storage
In practice, looking up which BaseApp an entity is checked out to (by name or database ID) is a read operation, and comparatively inexpensive due to the underlying MySQL query cache. However, schemes such as the PlayerRegistry entity (see Player Look-up below) can be implemented to offload the task of handling lookup requests from the DBMgr to game script running on arbitrarily many BaseApps.
However, the global DBMgr process will still place an implicit limit on how quickly entities can be loaded from, and saved to, persistent storage. Future improvements being considered include sharding the database to spread this load over many external databases.
Each player must be able to query the status of another player by name:
- whether or not they are logged in
- if they are logged in, get their player mailbox
Using BigWorld.lookUpBaseByName() causes a query to the DBMgr (which causes a read on the primary database). While sufficient for many scenarios (and empirically it works in many released BigWorld-based games), this introduces a potential bottleneck. The discussion below outlines a design for a distributed mapping of player names to player mailboxes, which effectively offloads this load to BaseApp game script, and which can be scaled up by adding more BaseApps.
The idea is to have multiple PlayerRegistry base entities that contain a distributed mapping of player names to player mailboxes. These PlayerRegistry entities have no geospatial representation; they exist only as a system service, and so generate no load with respect to AoI updates.
Each BaseApp has a corresponding PlayerRegistry entity. This spreads the PlayerRegistry entities out and protects against BaseApp failures. Having more than one PlayerRegistry entity per BaseApp does not add any additional redundancy benefit.
PlayerRegistry entity instances register themselves globally. Globally registered bases have their mailboxes registered under a string key in a global bases mapping that is synchronised across every BaseApp. Player names are hashed against the known number of player registries, and a particular PlayerRegistry instance is located via the Global Bases mechanism (see Global Bases).
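As a sketch, the name-hashing step might look like the following in game script. The CRC32 hash and the 'PlayerRegistry_<n>' key format are illustrative assumptions, not part of the BigWorld API; any stable hash and key scheme would do:

```python
import zlib

def registryKeyFor(playerName, numRegistries):
    # Hash the player name onto one of the known PlayerRegistry
    # instances.  CRC32 is used here only for illustration; the
    # "PlayerRegistry_<n>" format is simply the assumed string key
    # that each registry registers itself under in the global bases
    # mapping.
    index = zlib.crc32(playerName.encode('utf-8')) % numRegistries
    return 'PlayerRegistry_%d' % index
```

A caller would then locate the registry mailbox through the global bases mapping (assuming it is exposed to script as BigWorld.globalBases), e.g. BigWorld.globalBases.get(registryKeyFor(name, numRegistries)).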
When player entities are created, they add themselves to the distributed registry by hashing their own name to the appropriate PlayerRegistry entity, and registering their base mailbox with that PlayerRegistry entity. On logout, they contact that same PlayerRegistry to notify it of the logout, which results in the removal of the mapping between that player name and that player mailbox. A scheme can be implemented which re-balances the registry entries across the PlayerRegistry entities when a PlayerRegistry is added or removed, such that the hash scheme remains consistent.
Queries for a particular player name are performed by first hashing the name to identify the appropriate PlayerRegistry, and then calling a remote method on that PlayerRegistry, passing a mailbox for a remote callback method. Requests for player look-up are asynchronous: caller entities implement a callback method that is invoked when the look-up is complete.
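The register/look-up flow might be sketched as follows, with remote methods modelled as plain Python calls. The method names registerPlayer, deregisterPlayer and lookUpPlayer are assumptions for illustration, not fixed BigWorld APIs:

```python
class PlayerRegistry(object):
    """Sketch of one shard of the distributed name -> mailbox map.

    In a real game these methods would be remote methods on a base
    entity, and the callback would be a remote method on the caller's
    mailbox; plain calls are used here to keep the sketch runnable.
    """

    def __init__(self):
        self._mailboxes = {}

    def registerPlayer(self, name, mailbox):
        # Called by a player entity on login.
        self._mailboxes[name] = mailbox

    def deregisterPlayer(self, name):
        # Called by a player entity on logout.
        self._mailboxes.pop(name, None)

    def lookUpPlayer(self, name, onLookUpComplete):
        # Asynchronous in spirit: the caller's callback receives the
        # mailbox, or None if the player is not logged in.
        onLookUpComplete(name, self._mailboxes.get(name))
```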
Each PlayerRegistry needs a persistent mailbox list for fault tolerance purposes, so that the registry is restored to another BaseApp along with the PlayerRegistry entity if the BaseApp it formerly resided on fails. In this case, it is likely that it will be restored to a BaseApp which already has its own PlayerRegistry, so re-balancing should be performed and the restored PlayerRegistry then destroyed.
This system can be scaled up by increasing the number of BaseApps to handle queries. Tiered request schemes could also be used to avoid large numbers of globally registered base entities becoming a bottleneck.
Each player maintains a list of other players that they can use for the following purposes:
- to contact a friend
- to send private messages to friends
- to receive presence updates
Assume that the friendship relation is symmetric, so that if A is on the friend list of B, then B is on the friend list of A. A friends list can be implemented as an ARRAY of FIXED_DICT, each element consisting of a STRING name property, the MAILBOX of the player (or None if offline), and a UINT8 Boolean flag hasResponded indicating whether that player has responded to our request to add them as a friend.
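One way to model a single element of this ARRAY in script is sketched below; the field names follow the description above, but the helper function itself is illustrative:

```python
def makeFriendEntry(name, mailbox=None, hasResponded=False):
    # One FIXED_DICT element of the friends-list ARRAY: the friend's
    # name, their base mailbox (None while offline), and the UINT8
    # flag recording whether they have acknowledged the friendship.
    return {'name': name,
            'mailbox': mailbox,
            'hasResponded': 1 if hasResponded else 0}
```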
Let the player adding the friend be called Player A, and the friend being added to Player A's list be called Player B.
- Player A checks that Player B is not already in A's friends list. Player A uses Player B's name to look up B's status and mailbox (if online) via the Player Look-up mechanism.
- If B is not online, then the operation fails. A scheme could be implemented to accommodate this situation, but for the sake of simplicity it is not discussed here. If Player B is online, then Player A adds Player B to its friends list, setting the hasResponded flag to False, and writes itself to the database using Base.writeToDB(), registering a callback for when the database write completes.
- If the write fails, then the friends list is rolled back by removing Player B's FIXED_DICT element, the process is aborted, and Player A's client is informed of a system error. Otherwise, the write completes successfully, and Player A informs Player B via a remote method call to add Player A to Player B's list, passing along Player A's name and mailbox. Periodically (say, every 3 seconds), Player A resends any outstanding requests (indicated by hasResponded being False in the friends list). The mailbox for each of these resends should be looked up each time, in case Player B has been restored to another BaseApp, or has logged off and/or back on again. If Player B is not online during a retry, then the operation fails and Player A's client is informed that Player B is not online.
- Typically, Player B won't already have Player A as a friend, and so Player B adds to its local friends list a FIXED_DICT element for Player A containing Player A's name and mailbox, with the hasResponded flag set to True. A write to the database is requested with a callback. Player B may already have an entry for Player A in its friends list. This can happen if Player A and Player B both simultaneously attempt to add each other as friends (in which case hasResponded will be False). It can also happen if Player B is restored to another BaseApp or is destroyed and re-created during the wait for the database write, or if the database write takes so long that Player A has resent the request; in these cases, hasResponded will be True. If the hasResponded flag is True, then Player B signals to Player A that the operation succeeded straight away. If the hasResponded flag is False, then it should be set to True, and the database write performed and confirmed via its callback before success is signalled to Player A.
- Typically, the write succeeds, and so Player B calls back on Player A to indicate that the request was successful. In the exceptional case where the write fails, Player B removes Player A's FIXED_DICT entry from its friends list, and calls back on Player A to indicate that the operation failed. In this scenario, Player A should try to remove Player B's FIXED_DICT element, and this should be made persistent by writing Player A to the database. However, there is a chance that this second database write fails while the earlier write succeeded, leaving Player A's friends list inconsistent in the database. There are some ways of handling this situation:
  - Do not remove Player B's FIXED_DICT entry from Player A's list, and instead have Player A retry the request to Player B periodically until Player B responds with success.
  - Remove Player B's FIXED_DICT entry, and have Player A retry the database write periodically.
  Both of these approaches assume that database write failures are a temporary phenomenon. A failure could be caused by, for example, the BaseApp secondary databases running out of disk space, which is cleared up when the system administrator makes more space available. A retry count could be kept, with the FIXED_DICT entry removed from the friends list once the count exceeds some threshold, at which point Player A's client is informed of the failure.
- On a successful callback from Player B, Player A sets the hasResponded flag to True. A database write is not necessary at this point, as the periodic backup and archival systems can be relied on to save this out eventually. In the event that the system is restarted, or Player A is restored to another BaseApp, the periodic retry of FIXED_DICT entries with hasResponded set to False will elicit a second successful callback, and the flag will eventually be written out.
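Player B's branching logic above can be summarised in a sketch. The friends list is modelled here as a dict keyed by name for brevity (the real property is an ARRAY of FIXED_DICT), and the returned action strings are illustrative:

```python
def onAddFriendRequest(player, requesterName, requesterMailbox):
    """Player B's handling of an incoming add-friend request (sketch).

    Returns the action to take: 'write_then_ack' means persist the
    friends list (Base.writeToDB with a callback) and acknowledge
    Player A only once the write succeeds; 'ack' means acknowledge
    immediately.
    """
    entry = player.friends.get(requesterName)
    if entry is None:
        # Usual case: create the entry already marked as responded.
        player.friends[requesterName] = {'name': requesterName,
                                         'mailbox': requesterMailbox,
                                         'hasResponded': True}
        return 'write_then_ack'
    if entry['hasResponded']:
        # A resend, or a restore re-created the entry: the operation
        # already succeeded, so signal success straight away.
        return 'ack'
    # Simultaneous mutual add: mark responded, persist, then ack.
    entry['hasResponded'] = True
    entry['mailbox'] = requesterMailbox
    return 'write_then_ack'
```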
Adding to friends lists is not expected to be a frequent operation on average over the entire player population, and players are typically spread out across the available BaseApps.
Removing a player from a friends list can be done in a similar fashion.
See the section on Chat below. Once you have a player mailbox, a chat message can be sent to them using a simple remote method call.
Presence notifications can be implemented simply by calling a method on each player in that player's friends list indicating that they have logged in or logged out (a logout signals that the mailbox is invalidated, and should be set to None in the corresponding FIXED_DICT in the ARRAY).
Player status notifications (e.g. away from keyboard) can be done in a similar way. Player base entities inform their clients of any change in the status of any friends, so they can update a user interface to the friends list.
The friends list can be used as a cache of player mailboxes while those friends are logged in, so that players do not need to use the general Player Look-up mechanism in order to communicate with their friends' player entities. Friend mailboxes are set to None when the friends log out.
Caches are not required to be persistent, and so do not add any additional processing cost to the database.
When a player is restored, some of its friends may have come online or gone offline (or gone offline and then come back online) since the player and its friends list were last backed up. At restore or initialisation time, player entities should perform look-ups on all the players in their friends list. They should also notify all online friends of their new mailbox when restoring.
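A restore-time refresh might be sketched like this, assuming friends-list entries of the shape described above and a lookUpPlayer function modelling the Player Look-up mechanism; friendMailboxChanged is an assumed remote method name:

```python
def onRestore(player, lookUpPlayer):
    """Refresh friend mailboxes after a restore and publish our own.

    lookUpPlayer models the Player Look-up mechanism: it takes a name
    and a callback receiving (name, mailbox-or-None).
    """
    for entry in player.friendsList:
        def update(name, mailbox, entry=entry):
            # Record the friend's current mailbox (None if offline).
            entry['mailbox'] = mailbox
            if mailbox is not None:
                # Tell the online friend about our new base mailbox.
                mailbox.friendMailboxChanged(player.playerName, player)
        lookUpPlayer(entry['name'], update)
```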
- P2P chat
- AoI-based chat
- Channel-based chat (includes guild chat and world chat)
Players need to be able to send messages to other players. Players are identified by name.
See the section on Player Look-up above. Chatting from one player to another player involves the following:
- The mailbox of the destination player needs to be acquired. This can be done in one of the following ways:
  - supplied by the player cell entity, as the destination entity is in the player's AoI
  - a local look-up in the friends list mailbox cache
  - a look-up of their mailbox using the Player Look-up mechanism described above
- Calling the chat remote method on that mailbox with the chat message contents.
To save the remote method cost of look-ups, player mailboxes can be cached on player entities as a non-persistent entity property. For example, private messages to non-friend players tend to result in conversations, so a local cache mapping player names to player mailboxes saves a look-up each time a further chat message is sent.
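A sketch of this resolution order follows, with the mailbox cache and a lookUpPlayer function modelling the Player Look-up mechanism. The names mailboxCache, chat and chatFailed are assumptions for illustration:

```python
def sendPrivateMessage(player, destName, text, lookUpPlayer):
    """Resolve the destination mailbox, preferring the local cache.

    mailboxCache is an assumed non-persistent dict property on the
    player entity; lookUpPlayer takes a name and a callback receiving
    (name, mailbox-or-None).
    """
    mailbox = player.mailboxCache.get(destName)
    if mailbox is not None:
        # Cache hit: no look-up needed for this conversation.
        mailbox.chat(player.playerName, text)
        return

    def onLookUpComplete(name, mb):
        if mb is None:
            player.client.chatFailed(name)   # assumed client method
        else:
            player.mailboxCache[name] = mb   # cache for later messages
            mb.chat(player.playerName, text)

    lookUpPlayer(destName, onLookUpComplete)
```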
Players need to be able to broadcast messages to players in their immediate spatial vicinity.
AoI-based chat can be implemented as a broadcast remote method call to all player entities that have the speaking player in their AoI. This does not require looping through all entities in script, and is implemented efficiently on the CellApp. The chat method call is broadcast to client entities using the same mechanism as any other broadcast method call, or as when an ALL_CLIENTS or OTHER_CLIENTS property changes.
Volatile distance constraints can be specified for that chat method call so that only players within a certain radius of the originating player receive the method call message.
Non-AoI-based chat channels are chat channels of entities that are not necessarily in the same spatial location. This could be used for guild-scope chat and world-scope chat.
A non-AoI-based channel can be implemented as a ChatChannel entity that contains a list of the base mailboxes of the players connected to that chat channel.
When a player wants to connect to a channel, a channel look-up is performed for the particular ChatChannel entity. This could be done via a scheme similar to the Player Look-up scheme described above. Once a mailbox to the channel is found, the player registers its base mailbox with the ChatChannel entity, which adds it to the list of connected player mailboxes.
A connected player broadcasts to that channel via a remote method call carrying the chat message contents. The ChatChannel entity is responsible for broadcasting that message to each of its connected player base mailboxes.
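The ChatChannel entity's connection and broadcast behaviour might be sketched as follows; receiveChat is an assumed remote method on the player base entity, and remote calls are modelled as plain method calls:

```python
class ChatChannel(object):
    """Sketch of a non-AoI chat channel base entity."""

    def __init__(self, name):
        self.name = name
        self.members = []   # connected player base mailboxes

    def connect(self, mailbox):
        # Register a player's base mailbox with the channel.
        if mailbox not in self.members:
            self.members.append(mailbox)

    def disconnect(self, mailbox):
        if mailbox in self.members:
            self.members.remove(mailbox)

    def broadcast(self, senderName, text):
        # Fan the message out to every connected player.
        for mailbox in self.members:
            mailbox.receiveChat(self.name, senderName, text)
```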
Each player must have the ability to send mail to other players. This mail includes some text and optionally in-game items.
The scalability of SMTP/IMAP mail servers can be leveraged here. Note that these game mail servers are completely internal to the game - no public access would be allowed (though this would be up to the game design).
Each player has an associated email address. BaseApps can query IMAP servers asynchronously using a TCP socket registered with BigWorld, without blocking game script. Python has good support for communication with IMAP over a socket (see the chapter Non-Blocking Socket I/O Using Mercury).
Items can be gifted using special attachments or special email headers, depending on the item system used. Item data would never be sent directly via email; instead, gifted items would be held in escrow, as with AoI-based player item trading. See Inventory and Item trading below.
Assume a game inventory system with the following features:
- Items are instances of a finite set of item archetypes.
- Each item instance has associated customisations that differentiate it from other instances of the same item archetype. These may be visual customisations or different attributes (e.g. durability, bonus to strength, etc.).
Store a fixed number of inventory item slots per player on the player entity, to limit the amount of data associated with player inventory.
Some popular MMOs have the concept of banks where players must be in a specific area to access items stored at the bank. This could be a separate entity that is loaded on request when a player is accessing their bank, and then destroyed once they leave the bank. The capacity of the on-player inventory and the bank inventory could be tuned to optimise database load.
The on-player inventory can be stored as a BigWorld ARRAY of item descriptors. Item descriptors themselves would be persisted as a BigWorld FIXED_DICT, but could be class-customised when loaded from the database so that items are represented in script as an arbitrary object type.
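A minimal sketch of a fixed-slot inventory follows, with None marking empty slots; MAX_SLOTS and the descriptor fields are assumptions for illustration:

```python
MAX_SLOTS = 40  # assumed fixed per-player inventory capacity

def addItem(inventory, itemDescriptor):
    # inventory is the ARRAY of item-descriptor FIXED_DICTs, with
    # None marking an empty slot.  Returns the slot index used, or
    # -1 if the inventory is full.
    for i, slot in enumerate(inventory):
        if slot is None:
            inventory[i] = itemDescriptor
            return i
    return -1
```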
Player inventory changes are expected to be frequent. Per-element changes in a BigWorld ARRAY are propagated to the client with a description of the change path to that element and the new element value (i.e. the entire array is not sent from the server to the client each time an element is changed).
If the entity wrote its state out to the secondary database each time the inventory changed, a bottleneck could occur, as this operation is expected to be frequent.
In this case, we rely on the fault tolerance mechanism for ensuring against data loss. This works by periodically saving out the state of the entity to another process. That other process is responsible for restoring the entity in the event of a process failure. For example, cell entity data is backed up to the corresponding base entity's BaseApp, and base entities are backed up to other BaseApps. This is the first level of fault tolerance, and the frequency of backups can be configured.
There is also a second level of fault tolerance, which is the periodic archiving of the base and cell entity state to the secondary databases. The frequency of this can similarly be tuned to achieve optimum BaseApp load.
However, for important changes to the item inventory (for example, acquiring a quest item), game script can request a write to the secondary database and have it confirmed via an onWriteToDB() callback. For trading transactions between two players, see below.
Player entities in the same spatial vicinity must be able to negotiate trade of items that they own.
Each player makes an offer to each other, placing their offered items in escrow. Once both players accept the opposing player's offer, the trade succeeds and the items are traded. If one player cancels the trade, all offered items are returned to their respective players.
Item trading transactions must not result in duplicate items or item loss.
BigWorld can readily supply the base mailboxes of any player entity in a player's AoI. Otherwise, if trading with a specific person not in the player's AoI, a player look-up is required.
Escrow entities are created for the lifetime of a transaction, and hold mailboxes to the two bartering entities. Escrow entities persist to the database, and are created on the least-loaded BaseApp. Trading consists of two stages: the negotiation stage and the transfer stage.
The negotiation stage is a series of offer operations made from a player entity to an Escrow entity, each of which is then forwarded to the opposing player entity.
If the server stops in the middle of a transaction, the Escrow entity has enough persistent information to cancel itself on restore and return items to their owning player entities.
Player entities on the server offer items to the other player (in response to GUI interactions from their player client) in the form of remote method requests to the Escrow entity. In doing so, they transfer those items from their inventory to a special holding area on the player entity on the server. This holding area is not accessible by the player's client for any purpose other than removing an item from the current offer, which moves that item back into the inventory.
Each transfer between the player inventory and the trade holding area results in:
- notification of a change in the items being offered, sent to the Escrow entity via remote method call
- a database write on the Escrow entity
- an acknowledgement remote method call from the Escrow entity back to the originating player entity
- removal of the items from the holding area, and a database write on the player entity
If, for some reason (temporary or otherwise), the database write fails, the entire trade is cancelled, and the items are returned to the players via remote method calls, which the players acknowledge via remote method calls back to the Escrow entity. When the Escrow entity has received acknowledgements from both players that the trade has been cancelled, it deletes itself from the database.
Each player can signal to the Escrow entity that it is willing to accept the trade as it stands. Once the Escrow entity receives positive notification from both parties, it transfers ownership of the items to the corresponding opposing players by signalling to each player the item data that they have traded.
- The Escrow entity transfers ownership to each player of their corresponding traded items.
- On receipt of the items, each player initiates a write to the database. When this is confirmed to be OK, the player entity acknowledges that it has the items by calling back on the Escrow entity.
- The Escrow entity waits for both acknowledgements to return, and then destroys itself and deletes itself from the database.
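The Escrow entity's offer/accept/cancel logic might be sketched as follows, with remote methods modelled as plain calls. The method names receiveItems and tradeCancelled are assumed remote methods on the player base entities, and the database writes described above are omitted for brevity:

```python
class Escrow(object):
    """Sketch of the Escrow entity's negotiation and transfer logic."""

    def __init__(self, mailboxA, mailboxB):
        # Items each party currently has in escrow, keyed by mailbox.
        self.parties = {mailboxA: [], mailboxB: []}
        self.accepted = set()

    def offer(self, party, items):
        # A new offer replaces the party's held items and invalidates
        # any earlier acceptances, so both sides must accept again.
        self.parties[party] = list(items)
        self.accepted.clear()

    def accept(self, party):
        self.accepted.add(party)
        if len(self.accepted) == 2:
            # Transfer stage: each side receives the other's items.
            a, b = self.parties.keys()
            a.receiveItems(self.parties[b])
            b.receiveItems(self.parties[a])

    def cancel(self):
        # Return all held items to their owning players.
        for party, items in self.parties.items():
            party.tradeCancelled(items)
```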
Total database writes: 2 for each offer made (and at least 2 offers are made), plus 3 writes for the transfer stage.
This illustrates that trading can potentially be an expensive operation in terms of writes to disk. However, the writes are distributed amongst the entities involved, and most are made to secondary databases. Only one of the database writes, when the Escrow entity is destroyed, involves the primary database, in order to remove that Escrow entity from persistent storage.
Note that the participating entities in a trading transaction are not required to be on the same process. This scales well because there can be an arbitrary number of BaseApps, and players and Escrow entities would be uniformly distributed amongst the BaseApps. Recall that while CellApps have player distributions that map to where players are spatially, base entities on BaseApps have no such spatial relation.
There is a cost to the primary database associated with the creation and destruction of each Escrow entity. This design can be improved by consolidating the escrow operations onto a pre-existing EscrowManager entity, rather than creating and destroying Escrow entities. A scheme similar to the PlayerRegistry entities could be implemented, with one EscrowManager entity per BaseApp. Trading entities would nominate and agree on a random EscrowManager to use for their trading transaction.