Chapter 3. Developing a Robust Server

Table of Contents

3.1. Development Considerations

3.1.1. Blocking the Main Thread
3.1.2. Ghost Entities and Mailboxes
3.1.3. Validating Client Arguments
3.1.4. Fault Tolerance Considerations
3.1.5. Profiling
3.1.6. Server Logs

3.2. Testing with Bots

3.3. Communicating with the BigWorld Support Team

Developing and deploying an MMOG is a complicated task. Developers using BigWorld Technology need to be aware of many issues to help avoid problems when deploying their game. This chapter lists things to take into account during development and testing of a BigWorld game in order to help achieve a successful deployment.

3.1. Development Considerations

3.1.1. Blocking the Main Thread

Developers should be careful not to run blocking operations in the main thread. This applies to the server components that run scripts, including the BaseApp, CellApp and DBMgr (when using class-customised data types). Avoiding blocking operations ensures that the process can respond to its peers and manager process in a timely manner. In an extreme case, a process that pauses for too long may be considered dead by other server components and stopped via a signal from BWMachineD.

The time limit differs between processes. For example, by default, if a CellApp is unresponsive for more than 3 seconds, the CellAppMgr will kill that CellApp process (a common cause is an infinite loop in script). Similarly, if a BaseApp is unresponsive for longer than 5 seconds, the BaseAppMgr will by default kill that BaseApp process. These limits are configurable; see Server Configuration with bw.xml.

One of the following solutions can be used for large or blocking operations:

Separate the calculation into multiple steps and use a timer to call each step.
Use asynchronous calls to avoid blocking disk access in the main thread. For example, use the fetchDataSection, fetchEntitiesFromChunks and fetchFromChunks methods of BaseApp's BigWorld module.
Pre-load any data from disk that cannot be accessed via asynchronous methods. This can be done, for example, while loading the personality module or any other module loaded during startup.
Implement asynchronous calls using an external process. For example, long blocking database queries could be handled by an intermediate process that runs the queries and sends a callback to the caller once done. Example code can be found at bigworld/src/server/baseapp/eg_tcpecho.cpp and eg_tcpechoserver.py.
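
The first solution above, splitting a calculation into steps, can be sketched in plain Python. This is a minimal illustration rather than engine code: ChunkedJob and run_with_fake_timer are hypothetical names, and the loop stands in for the engine's timer, with each iteration representing one timer callback. In an entity script, step() would instead be called from the entity's timer callback so that control returns to the main loop between batches.

```python
class ChunkedJob:
    """Processes a list in small batches, one batch per timer tick."""

    def __init__(self, items, batch_size, handle_item):
        self._items = list(items)
        self._batch_size = batch_size
        self._handle_item = handle_item
        self._next_index = 0

    @property
    def done(self):
        return self._next_index >= len(self._items)

    def step(self):
        """Handle at most one batch, then return so the main thread is free."""
        end = min(self._next_index + self._batch_size, len(self._items))
        for item in self._items[self._next_index:end]:
            self._handle_item(item)
        self._next_index = end
        return self.done


def run_with_fake_timer(job):
    """Hypothetical stand-in for a timer: each loop iteration is one tick."""
    ticks = 0
    while not job.done:
        job.step()
        ticks += 1
    return ticks


results = []
job = ChunkedJob(range(10), batch_size=4,
                 handle_item=lambda n: results.append(n * n))
ticks = run_with_fake_timer(job)
```

The batch size is the tuning knob: it should be small enough that one batch finishes well within a single tick on production hardware, not just on a development machine.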

It is recommended to review the server warnings during development and to give extra attention to warnings about loading resources in the main thread.

Please also note that operations which take a short time in a test environment might take more time in a production environment.

More information about blocking the main thread can be found in the Server Programming Guide.

3.1.2. Ghost Entities and Mailboxes

Take care that Python script does not assume an entity is a Real Entity when it could be a Ghost Entity or even an Entity Mailbox. When referring to other entities from script inside an entity, never assume that you have a real entity: it may be a ghost entity or, in some situations, a mailbox to a remote entity.

Generally speaking, self will be a real entity, but entities passed in as arguments or found in other ways could be ghost entities or mailboxes. The exception is when a method that is not in the entity's def file is called on a ghost entity, in which case self is a ghost entity.

To help identify these problems during development, enable the bw.xml option cellApp/treatAllOtherEntitiesAsGhosts and disable the option cellApp/shouldResolveMailBoxes. For further details, see the Debugging section of the Server Programming Guide.

Note

Real entities are authoritative. You can call both defined and undefined methods on them, and get and set both defined properties and undefined attributes.

Ghost entities are not authoritative, but you can call remote methods and have read access to properties defined as CELL_PUBLIC.

Mailboxes can only be used to call remote methods. They cannot access any attributes on the real entity, defined or otherwise, except for the entity ID and class name.
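
The practical consequence of the rules in this note can be illustrated with a small, engine-independent sketch. FakeMailbox is a hypothetical stand-in that exposes only an ID, a class name and one remote method; takeDamage and health are invented names for illustration. The unsafe function assumes a real entity and fails on the mailbox, while the safe function uses only a remote call.

```python
class FakeMailbox:
    """Hypothetical stand-in for a mailbox: id, className and remote calls only."""

    def __init__(self, entity_id, class_name):
        self.id = entity_id
        self.className = class_name
        self.sent = []  # records remote calls instead of sending them

    def takeDamage(self, amount):
        self.sent.append(("takeDamage", amount))


def attack_unsafe(target, amount):
    # Assumes a real entity: reads and writes a property directly.
    target.health -= amount


def attack_safe(target, amount):
    # Uses a remote method call only, so it does not rely on attribute access.
    target.takeDamage(amount)


target = FakeMailbox(42, "Monster")
attack_safe(target, 10)

try:
    attack_unsafe(target, 10)
    unsafe_failed = False
except AttributeError:
    # The stand-in has no 'health' attribute, just as a mailbox exposes none.
    unsafe_failed = True
```

Writing script as if every other entity were a mailbox is the most conservative choice; it assumes the remote method is declared in the target's def file.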

3.1.3. Validating Client Arguments

You should validate any input coming from a client, as gamers might try to modify the game client and exploit the server by sending invalid data.

Arguments to defined methods that are typed as PYTHON are inherently risky and should not be used at all for <Exposed> methods that the client can call on the server. This is because a PYTHON argument might contain code which will run on the server.

PYTHON arguments are useful when developing game system prototypes, but they should be converted to another data type such as FIXED_DICT prior to production. You can class-customise FIXED_DICT and wrap the values with another Python object.

Another reason for this restriction is the performance cost of a PYTHON argument. A PYTHON argument requires pickling of the data type for every send and receive. This can be much slower than simply reading and writing primitive types, and also takes up more space on the network stream.
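
As an illustration of validating client input, the following sketch checks the arguments of a hypothetical exposed method that drops items from an inventory. The function name, the limits and the inventory layout are all assumptions made for the example; the point is that type, range and ownership are each checked on the server before any state changes.

```python
MAX_STACK = 99          # hypothetical game limit
INVENTORY_SLOTS = 20    # hypothetical game limit


def validate_drop_request(slot, quantity, inventory):
    """Return an error string for a bad request, or None if it is acceptable."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(slot, int) or isinstance(slot, bool):
        return "slot must be an integer"
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        return "quantity must be an integer"
    if not 0 <= slot < INVENTORY_SLOTS:
        return "slot out of range"
    if not 1 <= quantity <= MAX_STACK:
        return "quantity out of range"
    # Never trust the client's view of what it owns.
    if inventory.get(slot, 0) < quantity:
        return "not enough items in slot"
    return None


inventory = {0: 5, 3: 1}  # slot index -> item count held by the server
ok = validate_drop_request(0, 2, inventory)
too_many = validate_drop_request(0, 6, inventory)
bad_slot = validate_drop_request(-1, 1, inventory)
bad_type = validate_drop_request("0", 1, inventory)
```

A rejected request is also worth logging, since repeated invalid calls from one client may indicate a modified game client.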

3.1.4. Fault Tolerance Considerations

3.1.4.1. Scripting

There are some issues relating to the fault tolerance and disaster recovery mechanisms that game script developers need to be aware of when implementing entities:

Because an entity is saved only periodically, via BaseApp archiving to the database and via BaseApp backups (i.e. not on every property state change), the backup data can represent an outdated copy of an entity. This becomes important in scenarios where an entity is restored due to a CellApp or BaseApp process failure.

On important events in a base entity's lifetime, script should explicitly save the entity's state to the database via the Entity or Base method writeToDB().
It is possible that when a particular entity is backed up or archived to the database, it is in the middle of a transaction that involves other entities. There is no guarantee that the other entities in that transaction will be archived to the database at the same time as this entity. Because the archiving algorithm is randomised, the times at which they are saved to the database may differ by up to twice the configured archive period. It is thus important to have journaling data structures (custom to your game script) recording the transaction steps to be performed, so that transactions can be resumed or rolled back when the base entity is restored from the database.
When restoring from the database, restored base entities have __init__() called on them. They should check whether BigWorld.hasStarted is False, which indicates that they have been restored from the database. Entities need different initialisation handling when restored than when created using BigWorld.createBase*(), because they must check the consistency of the base and cell entity state, which has already been initialised from backups. The cell entity state can be accessed via the cellData dictionary attribute before the cell entity is re-initialised.

Restored base entities are also responsible for recreating their associated cell entity. The cell entity attributes (such as spaceID, position and direction, as well as cell properties defined in the entity's definition file) will have been preserved in the entity archival along with the rest of the data.

If space archiving is enabled via <cellAppMgr/archiveSpaceData>, spaces will also be restored and their associated space IDs will remain the same. If using space archiving, base entities can restore their corresponding cell entities by checking that the spaceID key in the cellData dictionary is non-zero and then calling BigWorld.createCellEntity().

BaseApps will have restored the base entities before the BaseApp personality script callback onBaseAppReady() is called. This callback usually triggers the loading of entities and the setting up of spaces. When restoring from the database, the normal startup of the game (i.e. loading entities and spaces) must be prevented, as the entities and spaces will already be present when onBaseAppReady() is called. The exception is cell-only entities, which will have been lost when the server was shut down and need to be recreated if the game design requires them.
Only persistent entities (i.e. entities with defined persistent properties) will be restored, and only those properties that are marked as persistent will have their values restored from the database. The other property values will be set to their default values.
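
The journaling data structure mentioned above is left to game script. One possible shape is sketched below under stated assumptions: TransferJournal, on_restore and the step names are hypothetical, and the journal is kept as plain lists and strings on the assumption that such data is easy to store in a persistent property. It is not a BigWorld API.

```python
class TransferJournal:
    """Records each step of a multi-entity transaction as plain data."""

    PENDING, DONE = "pending", "done"

    def __init__(self, steps):
        # Each entry is [step_name, status].
        self.entries = [[name, self.PENDING] for name in steps]

    def mark_done(self, name):
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = self.DONE
                return
        raise KeyError(name)

    def remaining(self):
        return [name for name, status in self.entries if status == self.PENDING]

    def completed(self):
        return [name for name, status in self.entries if status == self.DONE]


def on_restore(journal, can_resume):
    """Decide what to do with an unfinished transaction after a restore."""
    if not journal.remaining():
        return ("nothing", [])
    if can_resume:
        return ("resume", journal.remaining())
    # Undo the completed steps in reverse order.
    return ("rollback", list(reversed(journal.completed())))


journal = TransferJournal(["debit_buyer", "move_item", "credit_seller"])
journal.mark_done("debit_buyer")
journal.mark_done("move_item")
resume_plan = on_restore(journal, can_resume=True)
rollback_plan = on_restore(journal, can_resume=False)
```

Whether to resume or roll back depends on whether the other entities in the transaction were archived at a compatible point, which the restored entity has to verify for itself.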

3.1.4.2. Operations

Here are some guidelines on what should be done from an operations viewpoint:

Use controlled shutdown and startup (the default when stopping a server in WebConsole). In particular, do not prematurely kill processes during shutdown, as data loss may occur.
If your game design is such that game state is restored from the archived database state rather than recreated from script, make sure that the bw.xml option <dbMgr/clearRecoveryData> is set to false. This ensures that during startup, archived entities that were present at the most recent controlled shutdown are restored.
Ensure that there are Reviver processes which will revive the BaseAppMgr, CellAppMgr, DBMgr and LoginApps if any of them fail.

See the Server Operations Guide section Fault Tolerance with Reviver.
Ensure that there are enough CellApp and BaseApp processes running in the cluster that the desired performance can be maintained after isolated process failures. There should be procedures in place for analysing process failures, and operations staff should start a new CellApp or BaseApp on a spare machine while analysing the initial process failure.

3.1.5. Profiling

3.1.5.1. Profiling Script Performance

Profile and optimise the performance of your server scripts using the methods specified in the Debugging document.

3.1.5.2. Profiling Entity Sizes and Bandwidth Usage

As entities are the main game objects being used by all components of the BigWorld engine, it is important to make sure that your game entities are implemented as efficiently as possible. This includes:

Minimising persistent properties.
Ensuring properties have the smallest applicable data type (while considering long term scaling).
Ensuring properties have the most appropriate data propagation flags assigned.
Ensuring properties have level-of-detail if appropriate.
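
To make the data type point concrete, the following back-of-the-envelope sketch compares the payload cost of one property at two widths. The sizes are those implied by the fixed-width type names; the change rate and number of receiving clients are invented figures, and the estimate ignores per-message overhead, so treat the result as a relative comparison only.

```python
# Payload sizes in bytes implied by the fixed-width type names.
TYPE_SIZES = {"INT8": 1, "INT16": 2, "INT32": 4, "FLOAT32": 4}


def bytes_per_second(type_name, changes_per_second, observers):
    """Rough payload cost of one property: size x change rate x receivers."""
    return TYPE_SIZES[type_name] * changes_per_second * observers


# Hypothetical property that changes twice a second and is sent to 50 clients.
wide = bytes_per_second("INT32", changes_per_second=2, observers=50)
narrow = bytes_per_second("INT8", changes_per_second=2, observers=50)
saving = wide - narrow
```

The same multiplication shows why propagation flags and level-of-detail matter: they reduce the number of receivers, which scales the cost as directly as the type size does.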

3.1.6. Server Logs

Ideally, developers should review and fix every WARNING, ERROR and CRITICAL message in the server logs. For those messages that cannot be fixed, the developer should have a good understanding of what each message means and why it occurs.

To reduce noise when reviewing production logs, remove any development or debugging log messages from game code and script. This includes any non-essential HACK_MSG and DEBUG_MSG calls, as well as all non-crucial print statements in entity scripts.

3.1.6.1. Collecting Log Data

A summary of the log data over a long period of time can be generated by running mlcat.py --summary, which helps in detecting abnormal behaviour. See mlcat.py --help for more information.

3.2. Testing with Bots

Bot tests should be run as early and as frequently as possible to ensure your game environment scales as expected and is capable of handling the number of concurrent players you anticipate.

The following steps improve the quality of bot testing:

Ideally, the bot machines should not be on the internal cluster network. This makes bot connections behave similarly to client connections in the production environment, and ensures that internal network bandwidth is not affected by the additional external bot traffic.
External latency and packet loss should be enabled on the BaseApps and LoginApps (see the Server Operations Guide for more details). This allows real-life networking issues to be tested while running bot tests.
Bots should log off as well as log on. The log off case is often left untested.

More information on how to run bot tests can be found in Stress Testing with Bots.

Real players should also participate in game testing as part of the normal QA cycle.

3.3. Communicating with the BigWorld Support Team

Maintaining good communication with the BigWorld Support Team is crucial for the successful release of your game. Early reporting of issues will allow us to help solve deployment problems. Communicating your expected beta and release dates will allow us to prepare in advance for the extra effort required to help in releasing your game.

Copyright 1999-2012 BigWorld Pty. Ltd. All rights reserved. Proprietary commercial in confidence.