When dealing with characters outside the ASCII range it may be necessary to convert multi-byte values into a well defined format for network transmission and storage in either a file or database.
All areas of the server and the server tools default to using the UTF-8 character encoding as it is well known and widely implemented and supported.
The following chapter discusses the different areas of the server and tools that consider character encoding, along with how the default character encoding can be modified (if possible).
When discussing Unicode and character encodings we use some common terminology. These terms are briefly outlined below to avoid any confusion later.
-
Unicode
A standard for representing text (characters and symbols) from any language.
Characters for each language are represented by a unique code point. When discussing a code point, it is typically prefixed with U or U+ for clarity.
Examples include:
Code point    Description
U+0041        Latin letter A
U+0448        Cyrillic letter SHA
U+4E04        CJK (Han) ideogram for above
U+3082        Hiragana letter MO
U+30E2        Katakana letter MO
-
Character encoding
A standard for representing a code point as a sequence of one or more bytes, for example a 3 byte sequence for a single Unicode code point, when transmitting data between applications.
Examples include: UTF-8, Big5, GB18030, GB2312, KOI8-R
-
Encode
The process of converting a Unicode code point (or series of code points) into a specific character encoding.
For example, encoding U+4E09 (Han character for 3) into UTF-8 would result in a three byte value E4 B8 89.
-
Decode
The process of converting a byte (or array of bytes) from a character encoding into a set of Unicode code points.
For example, decoding the GB18030 value 81 41 (Han character for above) into a Unicode code point would result in the value U+4E04.
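The encode and decode definitions above can be illustrated with a minimal sketch. It is written for Python 3, where str holds code points and bytes holds encoded data (in the Python 2 sessions shown elsewhere in this chapter these roles are played by unicode and str). The variable names are illustrative only.

```python
# Python 3 sketch: str holds Unicode code points, bytes holds encoded data.
han_three = "\u4e09"  # U+4E09, Han character for 3

# Encode: code point(s) -> bytes in a specific character encoding.
utf8_bytes = han_three.encode("utf-8")
print(" ".join("%02X" % b for b in utf8_bytes))  # E4 B8 89

# Decode: bytes -> code point(s), using the same character encoding.
decoded = utf8_bytes.decode("utf-8")
print("U+%04X" % ord(decoded))  # U+4E09

# A value only round-trips if it is decoded with the encoding that was
# used to encode it.
above = "\u4e04"  # U+4E04, Han character for above
round_trip = above.encode("gb18030").decode("gb18030")
print(round_trip == above)  # True
```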
In every Python interpreter there is a default encoding that is used to convert from string objects to unicode objects. The current default encoding can be seen by running the following from within a Python interpreter:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Within the BigWorld FantasyDemo source code, the default encoding is controlled by the variable DEFAULT_ENCODING in the file fantasydemo/res/scripts/common/BWAutoImport.py. By default this value is set to utf-8; however, it may be changed to any valid Python encoding as required.
As Python is the scripting language used for interacting with entities and their properties, it is important to understand the implications of the default Python encoding on entity properties. This primarily affects two types of entity property data types, STRING and UNICODE_STRING.
Entity properties using the STRING data type are transferred around the network as byte arrays without any modification and are stored as a BLOB in the MySQL database. Properties of this type are expected to map directly to a Python string object.
When assigning a unicode string to a STRING property, the programmer must explicitly encode() the string. For example, assuming the default encoding is UTF-8:
>>> self.string_property = u"\u4e04".encode()
>>> print repr( self.string_property )
'\xe4\xb8\x84'
As STRING properties are stored in the database as a binary BLOB, any character encoding may be used with this method. It is up to the programmer to ensure that all Python script references to the string use the same encoding.
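One way to keep every script reference on the same encoding is to route all conversions through a single pair of helpers. The sketch below assumes Python 3 (bytes here corresponds to the Python 2 string object used for STRING properties). STRING_PROPERTY_ENCODING, to_string_property() and from_string_property() are hypothetical names introduced for illustration and are not part of BigWorld.

```python
# Hypothetical constant and helpers, not part of BigWorld: one place that
# decides which encoding is used for the bytes held in STRING properties.
STRING_PROPERTY_ENCODING = "utf-8"


def to_string_property(text):
    """Encode text into the byte array that a STRING property would hold."""
    return text.encode(STRING_PROPERTY_ENCODING)


def from_string_property(raw):
    """Decode the byte array read back from a STRING property."""
    return raw.decode(STRING_PROPERTY_ENCODING)


stored = to_string_property("\u4e04")
print(repr(stored))  # b'\xe4\xb8\x84'

restored = from_string_property(stored)
print(restored == "\u4e04")  # True

# Decoding with a different encoding does not necessarily raise an error;
# it can silently produce different text instead.
wrong = stored.decode("latin-1")
print(wrong == "\u4e04")  # False
```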
Entity properties that use the UNICODE_STRING data type are expected to be Python unicode objects, which can have their encode() and decode() methods invoked as required to convert, respectively, to and from Python string objects.
In order to transfer UNICODE_STRING properties around the network, they are encoded to UTF-8 by the BigWorld engine and then decoded to a Python unicode object after being destreamed. This is performed by the UnicodeStringDataType class in bigworld/src/lib/entitydef/data_types.[ch]pp.
Note
There should be no reason to modify the encoding used to stream / destream UNICODE_STRING properties. This information is provided for reference purposes only.
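The encode-on-send, decode-on-receive pattern described for UNICODE_STRING properties can be sketched as follows. This is not BigWorld's wire format: the 4-byte little-endian length prefix and the function names stream_unicode() and destream_unicode() are assumptions made purely for illustration, and the sketch is written for Python 3.

```python
import struct


def stream_unicode(text):
    """Encode text to UTF-8 and prepend a length prefix.

    The 4-byte little-endian prefix is an assumption for this sketch,
    not the format used by the BigWorld engine.
    """
    payload = text.encode("utf-8")
    return struct.pack("<I", len(payload)) + payload


def destream_unicode(data):
    """Read the length prefix, then decode that many UTF-8 bytes."""
    (length,) = struct.unpack_from("<I", data, 0)
    return data[4:4 + length].decode("utf-8")


message = "\u4e09\u0448A"  # mixed Han, Cyrillic and Latin characters
wire = stream_unicode(message)
print(destream_unicode(wire) == message)  # True
```

The script on either side only ever sees text objects; the byte representation exists only between the two functions.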
The MySQL storage of UNICODE_STRING properties is slightly different to regular STRING objects. These properties result in TEXT or VARCHAR columns in the database with a specific character set encoding on each table and column. For more details please refer to the section UNICODE_STRING storage.
When considering DBMgr and MySQL's usage of character encodings, we must be clear on all areas in which character sets are used.
Character encoding is only particularly relevant when dealing with UNICODE_STRING properties as STRING properties are already being considered as a byte array.
Streaming UNICODE_STRING properties to DBMgr uses the encoding / decoding mechanism outlined in UNICODE_STRING.
Once DBMgr needs to send a UNICODE_STRING property to MySQL, it is passed as UTF-8 to the MySQL client connection[16] for transmission. All BigWorld connections to a MySQL server are established in UTF-8 mode. This can be seen in the MySql::connect() method located in bigworld/src/lib/dbmgr_mysql/wrapper.cpp.
When data is received by the MySQL server from a client connection it may optionally convert the data into another character set[17]. The client connection establishment outlined in the previous step ensures that this character set is also UTF-8 which means that no character set modification will occur here.
Now that the MySQL server has completely received the client data it can store it in whatever format is necessary. When creating entity tables, BigWorld defaults all UNICODE_STRING columns to store their data as UTF-8. This ensures the most compatible mode possible for all customers as UTF-8 should cover the entire Unicode range of characters. The following section UNICODE_STRING storage outlines how the UNICODE_STRING properties are stored in more detail along with details on how to alter the encoding on disk.
Entity properties that have a data type of UNICODE_STRING are stored in a MySQL database as TEXT or VARCHAR columns depending on whether a <DatabaseLength> was specified in the entity definition file.
In order to allow more efficient storage of data in MySQL, it is possible to change the storage type of UNICODE_STRING property columns using the dbMgr/unicodeString/characterSet[18] bw.xml option. The effect of modifying this value can best be seen by using an example.
Using the Chinese character for 3 (Unicode code point U+4E09) we can see from the following Python code that the byte representation of the character is smaller in the GB2312[19] character set than in UTF-8.
>>> three = u"\u4e09"
>>> print repr( three )
u'\u4e09'
>>> print repr( three.encode( "utf8" ) )
'\xe4\xb8\x89'
>>> print repr( three.encode( "gb2312" ) )
'\xc8\xfd'
For this reason, it may make sense for certain games to use an alternate character set for storing UNICODE_STRING properties. However, while this is a supported feature, it may introduce unexpected issues due to differences in the Client input method[20] and the Python unicode string encoding[21].
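To judge whether an alternate character set is worth the risk, it can help to measure the encoded size of representative text. The following is a minimal Python 3 sketch; encoded_sizes() is a hypothetical helper, and it uses Python codec names, which are assumed here to correspond to the MySQL character sets of the same name.

```python
def encoded_sizes(text, encodings=("utf-8", "gb2312")):
    """Return the number of bytes text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in encodings}


# A single Han character is 3 bytes in UTF-8 but 2 bytes in GB2312.
print(encoded_sizes("\u4e09"))  # {'utf-8': 3, 'gb2312': 2}

# ASCII text occupies one byte per character in both encodings, so the
# saving depends entirely on what players actually type.
print(encoded_sizes("abc"))  # {'utf-8': 3, 'gb2312': 3}
```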
For more information on MySQL character encodings please refer to the MySQL online documentation.
As it is possible to modify the character set in which UNICODE_STRING properties are stored in MySQL, it is important to understand how MySQL handles writing data to a column when that data cannot be encoded in the column's character set.
To illustrate this case we will start with a simple Python example. If we attempt to encode() the code point U+4E04 to the ASCII character encoding, an exception is raised as follows:
>>> print u"\u4E04".encode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e04' in position 0: ordinal not in range(128)
Unfortunately, this behaviour is not replicated in MySQL, which will instead silently fail and insert ? characters in place of the invalid characters. As this failure is silent, it is possible to unknowingly corrupt data in your database by having a dbMgr/unicodeString/characterSet value that does not fully cover the range of values that may be provided to MySQL. This is one of the reasons we recommend you leave the storage type as UTF-8 unless absolutely required.
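The silent substitution can be imitated in Python with the "replace" error handler, and a strict encode can serve as a check before a value is written. The sketch below is for Python 3. can_store() is a hypothetical helper, and it takes Python codec names, which are assumed to approximate the corresponding MySQL character sets rather than match them exactly.

```python
def can_store(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Imitating the silent failure: the unencodable character becomes "?".
lossy = "\u4e04".encode("ascii", errors="replace")
print(lossy)  # b'?'

# Checking first makes the problem visible instead of silent.
print(can_store("\u4e04", "ascii"))   # False
print(can_store("\u4e04", "utf-8"))   # True
print(can_store("\u4e09", "gb2312"))  # True
```

A check of this kind could be applied in script before a property is assigned, so that text outside the configured character set is rejected rather than stored as ? characters.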
As each language has its own conventions regarding the order in which a set of values should be sorted, MySQL also provides the ability to modify the behaviour of search results when querying a database. The set of rules used to define sorting order is referred to as a collation.
Each character set that is available in MySQL has one or more collations available. For example the UTF-8 character set in MySQL has 21 collations available which can be seen by running the command:
mysql> SHOW COLLATION LIKE 'utf8_%';
This is relevant for both custom search results you may perform on the BigWorld entity database, as well as for internal server lookups that are performed for looking up entities by their <Identifier> property[22].
Collations are generally referred to as one of the following:
-
Case sensitive
-
Case insensitive
-
Binary
Depending on the behaviour of your game, you may wish to modify the default UNICODE_STRING collation with the dbMgr/unicodeString/collation[23] bw.xml option.
By default the server collation is utf8_bin, which provides case-sensitive lookups.
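The practical difference between a binary collation and a case-insensitive one can be approximated in Python 3 as follows. This is only an analogy: MySQL collations apply their own comparison rules, and the function names below are hypothetical.

```python
names = ["bob", "Alice", "alice", "Bob"]


def find_binary(candidates, wanted):
    """Exact match on the encoded bytes, similar in spirit to a _bin collation."""
    target = wanted.encode("utf-8")
    return [n for n in candidates if n.encode("utf-8") == target]


def find_case_insensitive(candidates, wanted):
    """Match ignoring case, similar in spirit to a case-insensitive collation."""
    target = wanted.casefold()
    return [n for n in candidates if n.casefold() == target]


print(find_binary(names, "alice"))            # ['alice']
print(find_case_insensitive(names, "alice"))  # ['Alice', 'alice']

# Sort order changes as well: byte order places upper case before lower case.
print(sorted(names, key=lambda n: n.encode("utf-8")))
# ['Alice', 'Bob', 'alice', 'bob']
print(sorted(names, key=str.casefold))
# ['Alice', 'alice', 'bob', 'Bob']
```

With a binary comparison, "Alice" and "alice" are two distinct identifiers; with a case-insensitive comparison they collide, which matters when looking up entities by name.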
For more information on MySQL collations and behaviour, please refer to the MySQL online documentation Character Set Support.
[16] This corresponds to the MySQL variable character_set_client.
[17] This corresponds to the MySQL variable character_set_connection.
[18] For more information on this option see the Server Operations Guide, chapter Server Configuration with bw.xml, section DBMgr Configuration Options.
[19] The GB2312 character set is used in the example above rather than the more modern GB18030 character set as MySQL does not support GB18030.
[20] For more information see the Client Programming Guide, chapter Input Method Editors (IME).
[21] For more details see Python and Entity Properties.
[22] This only applies when an <Identifier> property is a UNICODE_STRING.
[23] For more information on this option see the Server Operations Guide, chapter Server Configuration with bw.xml, section DBMgr Configuration Options.