If a BigWorld server component fails, there are some steps you should follow to help BigWorld identify and resolve the problem as quickly as possible. These steps are outlined briefly below.
- Change your coredump output directory, if necessary.
- Determine the first process(es) that crashed.
- Generate a stack trace of the process that crashed.
- Retrieve relevant log information.
- Back up the crash information.
- Notify BigWorld Support of the crash.
A crash may be either an intentional event, such as an assert, or an unintentional event, such as a segmentation fault. Whenever a crash occurs, a core file should be written to the bigworld/bin/Hybrid64 directory (or the equivalent directory on your system). These files provide a complete dump of the memory in use by the process at the time the failure occurred, and allow the deep investigation of program state that is often necessary to determine the cause of a problem.
By default, processes that crash will output a coredump into their current working directory, which is generally the directory where the binaries reside. This directory may need to be changed if the user running the server does not have sufficient permissions to write to that directory. This may be the case if you have installed using RPM packages.
The output path can be temporarily changed by writing the new output path and the core file pattern to /proc/sys/kernel/core_pattern. Note that this will change the coredump output path for all processes (not just BigWorld processes) dumping core on that machine.
For example, a command similar to the following can be used (executed as the root user):
$ echo "/your/path/here/core.%e.%h.%p" > /proc/sys/kernel/core_pattern
Note that the change is only temporary and will not persist across machine restarts.
If a more permanent setting is required, the CORE_PATH shell variable defined in the BWMachined init script can be set to a suitable output directory. For example, it can be set like this:
CORE_PATH=/your/path/here
The init script is located at /etc/init.d/bwmachined2. Note that after modifying CORE_PATH, the BWMachined service will need to be restarted for the changes to take effect.
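The edit-and-restart procedure can be sketched as follows. This is a minimal illustration, not a definitive recipe: the directory /your/path/here is a placeholder, and for safety the demo edits a scratch copy of the init script rather than the real /etc/init.d/bwmachined2.

```shell
# Sketch: update CORE_PATH in the BWMachined init script.
# This demo operates on a scratch copy; substitute the real
# /etc/init.d/bwmachined2 (as root) in practice.
INIT_SCRIPT=$(mktemp)                    # stand-in for /etc/init.d/bwmachined2
echo 'CORE_PATH=/tmp' > "$INIT_SCRIPT"   # pretend this is the existing setting

NEW_CORE_PATH=/your/path/here
sed -i "s|^CORE_PATH=.*|CORE_PATH=$NEW_CORE_PATH|" "$INIT_SCRIPT"
grep '^CORE_PATH=' "$INIT_SCRIPT"        # confirm the new value took effect

# On the real script, restart BWMachined afterwards:
#   /etc/init.d/bwmachined2 restart
```

Make sure the chosen directory exists and is writable by the user running the server processes before restarting the service.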
When diagnosing a crash it is important to find the first process that crashed if there have been multiple process failures. In most scenarios multiple crashes occur due to a single failure that then propagates through the cluster, so it is important to identify where the failure started. There are two methods which can be used to achieve this:
- BigWorld logs.
- Chronologically listing core files.
Whenever a crash occurs, the logs should always be investigated as a first step to check for any known problems or obvious errors that may have occurred. When an issue has been identified, searching around the time period for CRITICAL and ERROR messages can help quickly identify the set of processes that failed. Searching through the logs can be performed either via WebConsole and LogViewer, or via the command line utility mlcat.py. Using the command line utility provides more flexibility for saving logs to a file, which can then be collected and sent as part of a BigWorld support ticket.
If a process failure occurs due to a non-assertion failure case you may find no obvious CRITICAL or ERROR messages appear in the server logs. In this circumstance you will need to search all of the server binary directories within your cluster to find the core files relating to a crash. This sort of operation will generally require a custom script that is suited for your environment as a cluster with hundreds of machines can be time consuming to examine.
The easiest way to perform this kind of search is to simply ssh to all the server machines in the cluster and perform a directory listing. Below is the output of a simple chronologically sorted directory listing showing a sequence of core files.
$ cd bigworld/bin/Hybrid64
$ ls -lt
...
-rw------- 1 game Company 124305408 Jan 7 10:17 core.cellapp.pc242.3030
-rw-r--r-- 1 game Company       275 Jan 7 10:16 assert.cellapp.pc242.3030.log
-rw------- 1 game Company  67174400 Jan 7 10:16 core.baseapp.pc242.3024
-rw------- 1 game Company  21635072 Jan 7 10:16 core.cellappmgr.pc242.3018
-rw-r--r-- 1 game Company       167 Jan 7 10:15 assert.baseapp.pc242.3024.log
-rw-r--r-- 1 game Company       166 Jan 7 10:15 assert.cellappmgr.pc242.3018.log
-rw------- 1 game Company 125972480 Jan 6 09:48 core.cellapp.pc242.600
-rw-r--r-- 1 game Company       275 Jan 6 09:48 assert.cellapp.pc242.600.log
-rw------- 1 game Company  21753856 Jan 6 09:47 core.cellappmgr.pc242.596
-rw-r--r-- 1 game Company       165 Jan 6 09:47 assert.cellappmgr.pc242.596.log
-rw------- 1 game Company  62255104 Jan 5 15:49 core.baseapp.pc242.601
-rw-r--r-- 1 game Company       275 Jan 5 15:49 assert.baseapp.pc242.601.log
From the above example you can see there are three separate core files that were generated on Jan 7th. As we are interested in the earliest crash, we look at the two assertions from 10:15, which are for the CellAppMgr and BaseApp. We then use these assertion files to find the corresponding core files using the PIDs 3018 and 3024. Thus the core files we are interested in investigating are:
-
core.baseapp.pc242.3024
-
core.cellappmgr.pc242.3018
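Pairing an assert log with its core file can also be scripted. A minimal sketch, assuming the assert.&lt;process&gt;.&lt;host&gt;.&lt;pid&gt;.log and core.&lt;process&gt;.&lt;host&gt;.&lt;pid&gt; naming seen in the listing above:

```shell
# Derive the core file name from an assert log name, assuming the
# naming convention shown in the directory listing above.
log=assert.baseapp.pc242.3024.log
core=$(echo "$log" | sed 's/^assert\./core./; s/\.log$//')
echo "$core"    # core.baseapp.pc242.3024
```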
As you can see from the directory listing in the previous section, core files can be extremely large. One of the most effective ways of notifying BigWorld of a crash is to generate a simple text stack trace of the crash. This can be done using the GNU Debugger, gdb. For example, using the BaseApp core file from the previous section, we generate a stack trace as follows:
Load the BaseApp binary and core file into GDB.
$ gdb baseapp core.baseapp.pc242.3024
GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
...
Core was generated by `/home/game/game_dir/bigworld/bin/Hybrid64/baseapp -machined --res /home/gam'.
Program terminated with signal 11, Segmentation fault.
#0  DebugMsgHelper::criticalMessageHelper (this=<value optimized out>, isDevAssertion=<value optimized out>,
    format=<value optimized out>, argPtr=<value optimized out>) at debug.cpp:321
321         *(int*)NULL = 0;
(gdb)
Use the bt (backtrace) command to generate a stack trace.
(gdb) bt
#0  DebugMsgHelper::criticalMessageHelper (this=<value optimized out>, isDevAssertion=<value optimized out>,
    format=<value optimized out>, argPtr=<value optimized out>) at debug.cpp:321
#1  0x000000000087882d in DebugMsgHelper::criticalMessage (this=0x2aaab4000020,
    format=0x2aaab4000078 "\020\326_\265\252*") at debug.cpp:128
#2  0x00000000007ce56c in Mercury::Channel::checkOverflowErrors (this=0x1b2e3120) at channel.cpp:460
#3  0x00000000007d16f1 in Mercury::Channel::addResendTimer (this=0x1b2e3120, seq=<value optimized out>,
    p=0x2aaab55fc240, roBeg=0x0, roEnd=<value optimized out>) at channel.cpp:898
#4  0x00000000007f8808 in Mercury::NetworkInterface::send (this=0x7fff6b280320, address=..., bundle=...,
    pChannel=0x1b2e3120) at network_interface.cpp:1095
#5  0x00000000007d060b in Mercury::Channel::send (this=0x1b2e3120, pBundle=0x1b2e2f50) at channel.cpp:769
#6  0x00000000005373c0 in Mercury::ChannelOwner::send (this=0x7fff6b27b950, arg=<value optimized out>)
    at channel_owner.hpp:33
#7  BaseApp::handleTimeout (this=0x7fff6b27b950, arg=<value optimized out>) at baseapp.cpp:4693
#8  0x00000000007da7cd in TimeQueueT<unsigned long>::Node::triggerTimer (this=0x1b2a7400, now=254536585659057)
    at time_queue.ipp:402
#9  TimeQueueT<unsigned long>::process (this=0x1b2a7400, now=254536585659057) at time_queue.ipp:184
#10 0x00000000007d9b16 in Mercury::EventDispatcher::processTimers (this=0x7fff6b27c1c0)
    at event_dispatcher.cpp:408
#11 0x00000000007d9c38 in Mercury::EventDispatcher::processOnce (this=0x7fff6b27c1c0, shouldIdle=true)
    at event_dispatcher.cpp:580
#12 0x00000000007d9c54 in Mercury::EventDispatcher::processContinuously (this=0x2aaab4000020)
    at event_dispatcher.cpp:564
#13 0x00000000007d9c69 in Mercury::EventDispatcher::processUntilBreak (this=0x2aaab4000020)
    at event_dispatcher.cpp:598
#14 0x000000000054ddbe in BaseApp::run (this=0x7fff6b27b950, argc=<value optimized out>,
    argv=<value optimized out>) at baseapp.cpp:1125
#15 0x000000000051a0eb in doMain (dispatcher=<value optimized out>, interface=<value optimized out>, argc=4,
    argv=0x7fff6b280b58) at main.cpp:47
#16 0x000000000051aa65 in bwMain (argc=4, argv=0x7fff6b280b58) at main.cpp:70
#17 0x000000000051b048 in main (argc=4, argv=0x7fff6b280b58) at main.cpp:60
Quit gdb.
(gdb) quit
The stack trace can then be copied and pasted into a separate text file and stored along with the other report information for the crash.
It is also possible to perform this step in one quick operation by issuing a command similar to the following:
$ gdb -ex 'bt' -ex 'q' baseapp core.baseapp.pc242.3024 > core.baseapp.pc242.3024.backtrace
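When several core files need backtraces, the binary name can be recovered from the core file name so that the one-shot command can be built for each file. A minimal sketch, assuming the core.&lt;process&gt;.&lt;host&gt;.&lt;pid&gt; naming seen earlier (the command is echoed here rather than executed):

```shell
# Extract the binary name from a core file named core.<process>.<host>.<pid>
# and build the corresponding one-shot gdb command.
core=core.baseapp.pc242.3024
binary=${core#core.}      # strip the leading 'core.'  -> baseapp.pc242.3024
binary=${binary%%.*}      # keep text before first '.' -> baseapp
echo "gdb -ex 'bt' -ex 'q' $binary $core > $core.backtrace"
```

Wrapping this in a loop over core.* files makes it easy to generate backtraces for every core in a directory.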
It is useful to be able to identify when a bad or invalid stack trace has been generated. These can be caused by a number of issues, such as:
- The process binary has been modified since the core file was generated.
- The core file or binary is corrupt.
- The architecture of the core file and the process binary do not match.
Below is an example of an invalid stack trace that has been generated by GDB.
$ gdb baseapp core.baseapp
...
warning: exec file is newer than core file.
[New Thread 7384]
[New Thread 7368]
Program terminated with signal 11, Segmentation fault.
#0  0x00000000008753d1 in MemberWatcher<double, StatWithRatesOfChange<unsigned int>, double>::getAsString (
    this=0x7fff13a189bf, base=0x1, path=<value optimized out>, result=..., desc=..., mode=@0x2e99d)
    at watcher.hpp:1846
1846        RETURN_TYPE value = (useObject.*getMethod_)();
(gdb)
You can see here that the initial loading of the file is warning us that the BaseApp executable is newer than the core file that is being examined. This can indicate that the binary has been recompiled and no longer corresponds to the core file. If we continue to examine the core file we see the following output:
(gdb) bt
#0  0x00000000008753d1 in MemberWatcher<double, StatWithRatesOfChange<unsigned int>, double>::getAsString (
    this=0x7fff13a189bf, base=0x1, path=<value optimized out>, result=..., desc=..., mode=@0x2e99d)
    at watcher.hpp:1846
#1  0x616d732030373039 in ?? ()
#2  0x71655374754f6c6c in ?? ()
#3  0x383039313d5f7441 in ?? ()
#4  0x65646c6f202c3737 in ?? ()
#5  0x656b63616e557473 in ?? ()
#6  0x38313d5f71655364 in ?? ()
#7  0x6e69772032383736 in ?? ()
#8  0x5f657a6953776f64 in ?? ()
#9  0x6d202c363930343d in ?? ()
#10 0x6c667265764f7861 in ?? ()
#11 0x74656b636150776f in ?? ()
#12 0x000a323931383d73 in ?? ()
#13 0x0000000000c11440 in ?? ()
#14 0x000000000fa4dc00 in ?? ()
#15 0x000000000fa84050 in ?? ()
#16 0x000000000fa840ce in ?? ()
#17 0x000000000fa840d0 in ?? ()
What we are seeing here is GDB being unable to match the information in the core file with the executable, so it presents '?? ()' to indicate that something is wrong. If you see this kind of output while generating a stack trace, please identify the reason for it and generate a new, correct stack trace before sending it to BigWorld Support.
Most server clusters tend not to keep all server logs for a long period of time due to the data storage requirement. If server logs are being deleted on a semi-frequent basis it is important to make a backup of the relevant log files so they can be referred to during crash analysis by either your own team or BigWorld support.
In order to produce a quick log summary of a crash based on a core file, you can use the MessageLogger tool mlcat.py. For example, using the BaseApp core file from the previous sections, we can quickly generate a log summary with the following command:
$ mlcat.py --around core.baseapp.pc242.3024 --context=50 > core.baseapp.pc242.3024.log
This command has queried the server logs based on the timestamp of the core file and saved 50 lines of context from before and after this time into the log file core.baseapp.pc242.3024.log.
Note
As core files can be extremely large and take a long time to write to disk, the time stamp of a core file may be significantly later than the server logs (sometimes in the order of minutes later).
It is also recommended to archive the complete set of logs surrounding a crash. This can be easily performed by using the mltar.py MessageLogger utility program as follows:
$ mltar.py -zcf 20100107_logs.tar.gz <server/message_logger>
Please select the segments to archive (e.g. 0,1,5-10):
 # Time                Duration Entries Size
 0 2010-01-06-09:47:07 7h       83490   4.5MB
 1 2010-01-07-10:15:31 4h       97268   5.5MB
Enter segments to archive [all]:
In this scenario we have already identified that the crash has occurred at around Jan 7th at 10:16am so it would be useful to archive both log segments.
We now have an archive of the relevant server logs in the file 20100107_logs.tar.gz, which can be used by your internal team or by BigWorld support as required.
Having successfully identified all the files that are relevant to a crash, backing up the files is the next step to ensure that all data remains available should it be needed during the investigation process. It is recommended to copy all files relating to a crash into a new, well-named directory on a non-cluster machine to help identify the data at a later point. For example, using the core file and crash information from the previous sections, a new directory might be created such as /home/game_admin/crashes/20100107_baseapp.
You can now copy all the relevant files mentioned in the previous sections using scp or other network file transfer mechanisms suited to your environment. Below is a complete list of files recommended to keep.
- Core files (e.g., core.baseapp.pc242.3024)
- Binaries (e.g., baseapp)
- Stack trace summaries (e.g., core.baseapp.pc242.3024.backtrace)
- Log file summaries (e.g., core.baseapp.pc242.3024.log)
- Complete log files (e.g., 20100107_logs.tar.gz)
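The collection step can be sketched as a short script. The directory and file names below are the examples from the previous sections; when gathering files from remote cluster machines, substitute scp (or another transfer tool suited to your environment) for cp.

```shell
# Sketch: collect crash artifacts into one well-named backup directory.
# File names are the examples used throughout this section.
CRASH_DIR="$HOME/crashes/20100107_baseapp"
mkdir -p "$CRASH_DIR"
for f in core.baseapp.pc242.3024 baseapp \
         core.baseapp.pc242.3024.backtrace \
         core.baseapp.pc242.3024.log \
         20100107_logs.tar.gz; do
  if [ -e "$f" ]; then
    cp -p "$f" "$CRASH_DIR/"    # -p preserves file timestamps
  fi
done
ls "$CRASH_DIR"
```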
You should now have a collected set of information regarding the crash which can be sent to BigWorld Support if you find you need assistance with the analysis and diagnosis of the crash.
When reporting a process crash to BigWorld Support, please include as much information as possible. After you have reviewed the core files and log output summary, if you believe the crash is related to a specific section of Python game script or other custom game resources, providing these files with the initial bug report can greatly reduce the time it takes BigWorld Support to identify the issue and assist in finding a solution.
When creating a support ticket please include the following information:
- Exact BigWorld server version used (e.g., 1.9.4.3)
- All relevant stack traces (see Generate a stack trace of the process that crashed)
- All relevant logs (see Retrieve relevant log information)
- Is the crash reproducible?
- How frequently does the crash occur?
Providing all of the above information in the original support ticket can save days by allowing the support team to start work on the issue as quickly as possible.
If you need to upload large core files and logs to assist with the support process, BigWorld provides an FTP location for you to upload this data to:
- Host: ftp://ftp.bigworldtech.com
- Username: bwguest
- Password: 3f7eepE3
When using this FTP location, please note that it is in use by numerous customers, so naming your files as specifically as possible greatly assists BigWorld support. The FTP is only writable by customers, so feel free to use file names with customer and game information, for example customer_cellapp_report_20100128.zip.
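One way to perform the upload from the command line is with curl, if it is available on your system. A sketch, using the host and guest credentials above; the archive name is only an example:

```shell
# Sketch: upload a well-named archive to the BigWorld support FTP.
# The file name is an example; host and credentials are given in the text above.
curl -T customer_cellapp_report_20100128.zip \
     --user bwguest:3f7eepE3 \
     ftp://ftp.bigworldtech.com/
```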
In certain scenarios BigWorld support may need access to your server cluster in order to fully assist with an issue. While this is not a common occurrence, it can save time if your staff is aware of this possibility and has a plan in place should it be required.