bw logo

Chapter 5. Backups and Disaster Recovery

BigWorld Technology's fault tolerance ensures that the server continues to operate if a single process is lost. The server also provides a second level of fault tolerance known as disaster recovery that comes into play when multiple server components fail at once and the server needs to be shut down.

For additional protection against exploits or bugs the database should be backed up regularly. The snapshot tool facilitates making backups of the database.

5.1. Disaster Recovery

The server's state can be written periodically to the database. In case the entire server fails, then it can be restarted using this information. To enable the periodic archiving of entities, game time and/or persistence of space data, an archive period needs to be set via the configuration option <archivePeriod> for BaseApp and CellAppMgr processes. For more details, see BaseApp Configuration Options, and CellAppMgr Configuration Options.

Disaster recovery only works when the underlying database being used is MySQL, the XML database should not be used in mission critical environments. See MySQL Support for more information on how to enable MySQL database support in BigWorld.

Starting the server using recovery information from the database is the same as starting it from a controlled shutdown. For more details on controlled shutdowns see Controlled Startup and Shutdown.

For details on the scripting API related to disaster recovery, see the document Server Programming Guide's chapter Disaster Recovery.

5.2. Database Snapshot Tool

When using Secondary Databases, it is necessary to snapshot all data sources at the same time to attempt to provide a cohesive data set that can be restored from. For background information on the need for the snapshot tool see Server Programming Guide's chapter Database Snapshot.

The snapshot tool is a Python script that facilitates making a backup copy of the database that combines information from the primary database and secondary databases. The script itself is located in bigworld/tools/server/snapshot/.

When running the snapshot tool a separate machine should be used to invoke the script to avoid interfering with the normal operation of the server. The snapshot tool can use a significant amount of CPU and disk resources and does so erratically. Henceforth the machine running the snapshot tool will be referred to as the snapshot machine.

5.2.1. Operational Behaviour

The snapshot tool performs the following sequence of operations:

  1. TransferDB (bigworld/bin/Hybrid64/commands/transfer_db) is executed on each BaseApp.

    TransferDB performs a snapshot of the secondary database and sends the newly created snapshot back to the snapshot machine using Rsync.

  2. TransferDB is executed on the primary database's MySQL server.

    TransferDB performs an LVM snapshot of the primary database and sends the newly created snapshot back to the snapshot machine using Rsync.

  3. A MySQL server is started on the snapshot machine and specifies its data directory to be the copy of the primary database.

  4. Data consolidation is performed on the snapshot copies of the primary and secondary databases.

  5. An archive is created of the consolidated snapshot.

The snapshot tool logs all messages to MessageLogger.

Note

A snapshot should be considered invalid if an error occurs during any phase of the snapshot procedure.

5.2.2. Usage

The snapshot tool requires a single command-line argument specifying the directory the snapshot will be written to, and may also be provided options if required. Options available to the snapshot tool are described in the --help as listed below:

Usage: snapshot.py [[options]] SNAPSHOT_DIR

Options:
  -h, --help            show this help message and exit
  -b BWLIMIT_KBPS, --bwlimit-kbps=BWLIMIT_KBPS
                        file transfer bandwidth limit in kbps, default is unlimited
  -n, --no-consolidate  skip consolidation, default is false

The list below provides some common examples of using the snapshot tool:

  • snapshot.py /home/bwtools/snapshots

    Takes a snapshot. The snapshot is archived to/home/bwtools/snapshots.

  • snapshot.py /home/bwtools/snapshots --bwlimit-kbps=5000

    Takes a snapshot. The snapshot is archived to/home/bwtools/snapshots. Each secondary and primary database is transferred with a 5000 kbps bandwidth limit.

  • snapshot.py /home/bwtools/snapshots --no-consolidate

    Takes a snapshot. The unconsolidated snapshot is archived to/home/bwtools/snapshots.

  • snapshot.py /home/bwtools/snapshots -n -b5000

    Takes a snapshot. The unconsolidated snapshot is archived to/home/bwtools/snapshots. Each secondary and primary database is transferred with a 5000 kbps bandwidth limit.

5.2.3. Requirements

For the snapshot tool to work, the following conditions must be met:

  • Secondary databases must be enabled.

    To enable secondary databases set the baseApp/secondaryDB/enable property to true. See Secondary Database Configuration Options for details.

  • The primary database is stored on an LVM volume and there is sufficient unpartitioned space on the primary database machine to perform an LVM snapshot. Note that, for the purposes of database snapshotting, no other cluster machine is required to have spare unpartitioned space on a LVM partition. See also Partitioning.

  • BWMachined must be running on the primary database machine.

    For background information see Server Overview's chapter Server ComponentsBWMachined.

  • All the entity scripts and entity definition files must be present and locatable on the snapshot machine.

    In order for the resource tree to be found the BWMachined configuration file (.bwmachined.conf) must be configured correctly for the user running the snapshot tool.

  • The snapshot section in /etc/bigworld.conf must be configured correctly on the primary database machine.

    See Configuration.

  • The directory the LVM volume is to be mounted to exists and has no other devices mounted to it.

  • The user performing the snapshot must be able to connect to the snapshot machine using ssh from the primary database machine and BaseApp machines without a password.

    For a user whose $HOME directory is shared between machines (e.g, using NFS) this can be done using the following commands:

    $ ssh-keygen -t dsa
    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  • The LVM userland program suite must be installed and accessible by the snapshot user on the primary database machine.

    The LVM userland applications can be installed under CentOS using the command:

    # yum install lvm2

    By default the tools are installed into /usr/sbin, so it might be necessary to modify the snapshot user's environment settings to at this directory to the $PATH.

    You can do this by adding to the PATH environment variable in the $(HOME)/.bash_profile file, by adding a line such as this:

    export PATH=$PATH:/usr/sbin

    You will need to logout and log back in for this change to take effect.

  • The following standard system applications must also be accessible by the snapshot user. mount, umount and chmod.

  • MySQL server must be installed on the snapshot machine. Specifically mysqld_multi must be accessible.

    To install the MySQL server binaries, run the following command as root:

    # yum install mysql-server
  • Rsync must be installed and be on the path of all machines being used, i.e, the snapshot machine, the primary database machine as well as all the BaseApp machines.

    To install Rsync run the following command as root:

    # yum install rsync
  • The SnapshotHelper utility must be built and installed.

    If the BigWorld server was installed using an RPM package, the SnapshotHelper should already be installed under the bin/Hybrid64/commands/_helpers directory of your RPM installation.

    If your server installation is not from an RPM you may need to recompile the SnapshotHelper. To build the SnapshotHelper binary, MySQL development files must be installed. See Compiling DBMgr with MySQL Support for details on how to install MySQL development files.

    After installing MySQL development files, issue the make command from within the bigworld/src/server/tools/snapshot_helper/ directory.

    $ cd bigworld/src/server/tools/snapshot_helper
    $ make
  • The SnapshotHelper binary must have its setuid attribute set so user ID is root upon execution on the primary database machine.

    To make the SnapshotHelper binary setuid root run the following commands as the root user:

    # chown root:root $MF_ROOT/bigworld/tools/server/bin/Hybrid[64]/snapshot_helper
    # chmod 4511 $MF_ROOT/bigworld/tools/server/bin/Hybrid[64]/snapshot_helper

Note

For security purposes all privileged commands (lvcreate, lvremove, mount, unmount and chmod) are invoked via snapshot_helper which reads command arguments from the trusted file /etc/bigworld.conf.

5.2.4. Partitioning

The snapshot tool requires that you have unallocated space on the LVM logical disk that your MySQL database resides on. As a rule-of-thumb, we recommend that the size of the unallocated space be 15-20% of the overall volume's size. Having too little space will cause the snapshot tool to fail.

Unallocated space can be added when partitioning the disks of a machine hosting MySQL. The steps below illustrate how to add a region of unallocated space for snapshotting, in addition to a swap partition and a file system partition.

  • In the disk partitioning step, choose Create custom layout.

  • Select the LVM group VolGroup00 that is created by default, and click on the Edit button.

  • Select each of the LogVol00 and LogVol01, and delete them. You may wish to take note of the size of the swap partition (LogVol01), and reuse this size when recreating the swap partition.

  • Add a new logical volume by clicking on the Add button. The mount point should be /, and we recommend you leave the default filesystem as ext3. The size should be 80-85% of the overall size of the volume group, less the space allocated for swap. In the example illustrated in the image below, we have allocated split a 12160MB drive into 1984MB to be unallocated, 1024MB for swap, and 9152MB for the / mount which will hold the file system.

LVM partitioning setup

5.2.4.1. Post Installation LVM Configuration

If you need to modify your LVM partition structure after the primary operation system installation, you may wish to use the LVM system configuration tools. These can be installed using the following command as the root user:

# yum install system-config-lvm

5.2.5. Configuration

The snapshot tool behaviour can be be configured using the file /etc/bigworld.conf.

Options for the snapshot tool must be placed within a [snapshot] section. Below is a list of valid keywords that can be used in the configuration file:

  • datadir

    The primary database's data directory. By default for CentOS distributions this is /var/lib/mysql.

    This should be the same value as found in the MySQL configuration file /etc/my.cnf.

    $ cat /etc/my.cnf
    [mysqld]
    datadir=/var/lib/mysql
    ...
  • lvgroup

    The name of the LVM volume group in which the primary database belongs to.

    Note

    The LVM snapshot that is created will belong to this LVM group.

  • lvorigin

    The name of the origin LVM volume snapshotted. The MySQL data directory should be on this volume.

  • lvsnapshot

    The name of the LVM snapshot volume that will be created and deleted by the snapshot tool.

    For example, if lvgroup is VolGroup00 and lvsnapshot is ExampleSnapshotVolume, a new volume device will be created called /dev/VolGroup00/ExampleSnapshotVolume.

    The snapshot name is also used to mount the snapshot volume device onto the active filesystem. Using the previous example the following mount command will be issued:

    mount /dev/VolGroup00/ExampleSnapshotVolume  /mnt/ExampleSnapshotVolume

    Note

    The mount point /mnt/lvsnapshot is assumed to already exist on the system prior to snapshots being created.

  • lvsizegb

    The size of the LVM snapshot volume in gigabytes. The snapshot does not need the same amount of storage the origin has as only the modified files are copied, in a typical scenario, 15-20% might be enough. If the LVM snapshot volume becomes full it will be dropped, so it is important to allocate enough space.

Below is an example /etc/bigworld.conf configuration file:

[snapshot]
datadir = /var/lib/mysql
lvgroup = VolGroup00
lvorigin = LogVol00
lvsnapshot = bwsnapshot
lvsizegb = 2

Note: The snapshot tool also reads <res>/server/bw.xml for the dbMgr/host, dbMgr/username, dbMgr/password and dbMgr/databaseName.

5.2.6. Restoring From a Snapshot

The snapshot tool will take a copy of a MySQL data directory consolidated with a copy of the secondary databases. To restore from a snapshotted MySQL data directory, perform the following steps on the machine hosting MySQL:

  1. Stop the BigWorld server if it is running.

  2. Stop the MySQL server if it is running, you can do this by running as root:

    # /etc/init.d/mysqld stop
  3. Move the existing /var/lib/mysql to a temporary location. You may choose to remove it after the restoration process is complete. For example:

    # mv /var/lib/mysql /tmp/mysql-backup
  4. Copy the snapshot directory's mysql directory to /var/lib/mysql (you will generally need to be root to do this) Note that snapshots are identified by date and time of the snapshot, and that the colons used in the time need to be escaped by a backslash. For example:

    # cp -r snapshots/20100924_152642/mysql /var/lib
  5. Change the ownership and permissions:

    # chown -R mysql:mysql /var/lib/mysql
    # find /var/lib/mysql -type d | xargs chmod 700
    # find /var/lib/mysql -type f | xargs chmod 660
  6. Restore the SELinux contexts for these files:

    # /sbin/restorecon -r /var/lib/mysql
  7. Start the MySQL server, verify it is running.

    # /etc/init.d/mysqld start
  8. Start the BigWorld server.

5.3. Data Consolidation Tool

The data consolidation tool is used to incorporate data from secondary databases into the primary database. The application binary is located in bigworld/bin/Hybrid[64]/commands/consolidate_dbs. This tool is only provided in source code form and must be built before use. See Enabling Secondary Databases for details on how to build the data consolidation tool.

Once this tool has been compiled, providing DBMgr is using MySQL the data consolidation tool will be automatically run during system shutdown or during system startup if the server was not shutdown correctly. There should be no need to manually run this tool except for consolidating backups created by the snapshot tool. For details on the snapshot tool see Database Snapshot Tool. The data consolidation tool sends log messages to MessageLogger which can be viewed using the LogViewer tool in WebConsole.

This tool accepts two command line arguments:

  • Primary Database

    The parameters required to connect to the primary database.

    This argument should be specified in the form host;username;password;database_name.

    Note

    When specifying this argument on the command line, it is necessary to quote the entire argument in order to prevent the shell from treating it as multiple commands.

    For example:

    ./consolidate_dbs "qa_host_03;bigworld_user;bigworld_password;fantasydemo"
  • Secondary Databases

    A file containing a newline separated list of fully qualified paths to secondary database files.

When run without command line arguments, it will use DBMgr's configuration options located in <res>/server/bw.xml to connect to the primary database. It will use the data in the bigworldSecondaryDatabase table to retrieve the secondary databases.

The data consolidation tool uses the transfer_db utility located in located in bigworld/bin/Hybrid[64]/command to transfer the secondary database files from the BaseApp machines to the machine where the data consolidation tool is running. The transfer_db utility will be launched (via bwmachined) on each machine that contains a secondary database file. The transfer_db utility then opens a TCP connection to the data consolidation tool through which it will send the secondary database file.

The data consolidation tool will store the secondary databases locally in the directory specified by the <dbMgr>/<consolidation>/<directory> configuration option. It will incorporate the data from the secondary databases into the primary database once all the secondary database files have been transferred to its local machine. If the data consolidation process is successful, both the local and remote copies of the secondary databases will be deleted, and the bigworldSecondaryDatabase table will be cleared. If the data consolidation process is unsuccessful, only the local copy of secondary database files are deleted.

The data consolidation tool logs to Message Logger under the ConsolidateDBs process. Under most circumstances, the data consolidation tool will log errors from transfer_db as well. However, under some circumstances, transfer_db errors will appear under the Tools process. The bwmachined logs, in /var/log/messages on each machine where secondary databases are located, can help to diagnose problems, especially those related to the failure to run transfer_db.

5.3.1. Skipping Data Consolidation

If, for some reason, there is a problem with the data consolidation tool and the server refuses to start, the bigworldSecondaryDatabase table can be manually cleared to skip the data consolidation process. Alternatively, consolidate_dbs can be run with the command-line parameter --clear to achieve the same result.

While skipping the data consolidation process may allow the server to start again, the data in the secondary databases that were not consolidated is essentially lost. Though the secondary database files still exists, it is not recommended to try to consolidate the data after the server has been restarted since the data in the primary database may be more recent than the data in the left over secondary databases.

5.3.2. Ignoring SQLite Errors

When consolidate_dbs encounters an error in reading a secondary database file, it will abort the entire data consolidation process. Such errors should be investigated and corrected. If a secondary database file is genuinely corrupted and cannot be repaired, then it may be preferable for consolidate_dbs to read as much data as possible from the corrupted secondary database and then proceed to consolidate the rest of the secondary databases. When the --ignore-sqlite-errors command-line option is specified, consolidate_dbs will proceed to consolidating the next secondary database when it encounters an error instead of aborting the consolidation process.