Chapter 10. Character Sets and Encodings


Table of Contents

10.1. Python and Entity Properties

10.1.1. STRING
10.1.2. UNICODE_STRING

10.2. DBMgr and Encodings

10.2.1. UNICODE_STRING storage
10.2.2. Sorting search results

When dealing with characters outside the ASCII range, it may be necessary to convert multi-byte values into a well-defined format for network transmission and for storage in either a file or a database.

All areas of the server and the server tools default to using the UTF-8 character encoding as it is well known and widely implemented and supported.

This chapter discusses the different areas of the server and tools that consider character encoding, along with how the default character encoding can be modified (where possible).

When discussing Unicode and character encodings we use some common terminology. This is briefly outlined below to avoid any confusion later.

Unicode

A standard for representing text (characters and symbols) from any language.

Each character of each language is represented by a unique code point. For clarity, a code point is typically written in hexadecimal with a U or U+ prefix.

Examples include:

Code point	Description
U+0041	Latin letter A
U+0448	Cyrillic letter SHA
U+4E04	CJK (Han) ideogram for above
U+3082	Hiragana letter MO
U+30E2	Katakana letter MO
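The code points in the table above can be inspected from Python using the standard unicodedata module. The following is a minimal sketch written for Python 3 (the u"..." prefix is accepted there as well); the helper name describe() is illustrative only and is not part of BigWorld.

```python
import unicodedata

def describe(ch):
    """Return (code point label, official Unicode name) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch))

# The characters from the table above.
for ch in u"\u0041\u0448\u4e04\u3082\u30e2":
    print("%s\t%s" % describe(ch))
```

The names printed are the official Unicode character names, which are more specific than the short descriptions used in the table.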

Character encoding

A standard that defines how characters, for example Unicode code points, are represented as a sequence of one or more bytes when storing data or transmitting it between applications.

Examples include: UTF-8, Big5, GB18030, GB2312, KOI8-R

Encode

The process of converting a Unicode code point (or series of code points) into a specific character encoding.

For example, encoding U+4E09 (Han character for 3) into UTF-8 would result in the three byte value E4 B8 89.

Decode

The process of converting a byte (or array of bytes) from a character encoding into a set of Unicode code points.

For example, decoding the GB18030 value 81 41 (Han character for above) into a Unicode code point would result in the value U+4E04.
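The two terms can be illustrated with a short round trip. This is a minimal sketch written for Python 3; the helper to_hex() is a hypothetical name introduced only for display purposes.

```python
def to_hex(data):
    """Format a byte string as space separated upper case hex pairs."""
    return " ".join("%02X" % b for b in bytearray(data))

three = u"\u4e09"                      # Han character for 3

# Encode: Unicode code point -> bytes in a specific character encoding.
utf8_bytes = three.encode("utf-8")
print(to_hex(utf8_bytes))              # E4 B8 89

# Decode: bytes in a known character encoding -> Unicode code point.
decoded = utf8_bytes.decode("utf-8")
print("U+%04X" % ord(decoded))         # U+4E09
```

Decoding only recovers the original code point when the bytes are interpreted using the same encoding that produced them.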

10.1. Python and Entity Properties

In every Python interpreter there is a default encoding that is used to convert from string objects to unicode objects. The current default encoding can be seen by running the following from within a Python interpreter:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Within the BigWorld FantasyDemo source code, the default encoding is controlled by the variable DEFAULT_ENCODING in the file fantasydemo/res/scripts/common/BWAutoImport.py. By default this value is set to utf-8; however, it may be changed to any valid Python encoding as required.

As Python is the scripting language used for interacting with entities and their properties, it is important to understand the implications of the default Python encoding on entity properties. This primarily affects two types of entity property data types, STRING and UNICODE_STRING.

10.1.1. STRING

Entity properties using the STRING data type are transferred around the network as byte arrays without any modification and are stored as a BLOB in the MySQL database. Properties of this type are expected to map directly to a Python string object.

When assigning a unicode string to a STRING property, the programmer must explicitly encode() the string. For example, assuming the default encoding is UTF-8:

>>> self.string_property = u"\u4e04".encode()
>>> print repr( self.string_property )
'\xe4\xb8\x84'

As STRING properties are stored in the database as a binary BLOB, any character encoding may be used with this method. It is up to the programmer to ensure that all Python script references to the string use the same encoding.
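The consequence of mixing encodings can be sketched as follows. This example is written for Python 3 and uses a plain variable, string_property, as a stand-in for a real STRING entity property; it is an illustration, not BigWorld code.

```python
# Stand-in for a STRING property: just the raw bytes the script assigned.
string_property = u"\u4e09".encode("gb2312")

# Reading it back with the same encoding recovers the original text.
assert string_property.decode("gb2312") == u"\u4e09"

# Reading it back with a different encoding fails here (for other byte
# sequences it may instead silently yield the wrong characters).
try:
    string_property.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(mismatch_detected)
```

Because nothing in the stored bytes records which encoding was used, the convention has to be maintained by the scripts themselves.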

10.1.2. UNICODE_STRING

Entity properties that use the UNICODE_STRING data type are expected to be Python unicode objects. A unicode object can be converted to a Python string object with its encode() method, and a string object can be converted back to a unicode object with its decode() method.

In order to transfer UNICODE_STRING properties around the network, they are encoded to UTF-8 by the BigWorld engine and then decoded to a Python unicode object after being destreamed. This is performed by the UnicodeStringDataType class in bigworld/src/lib/entitydef/data_types.[ch]pp.

Note

There should be no reason to modify the encoding used to stream / destream UNICODE_STRING properties. This information is provided for reference purposes only.
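Conceptually, the stream / destream step is an encode to UTF-8 followed by a decode from UTF-8. The sketch below, written for Python 3, shows only that idea; the functions stream_unicode() and destream_unicode() are hypothetical names, and the sketch does not attempt to reproduce the actual wire format used by the UnicodeStringDataType class.

```python
def stream_unicode(value):
    """Hypothetical stand-in for streaming: unicode object -> UTF-8 bytes."""
    return value.encode("utf-8")

def destream_unicode(data):
    """Hypothetical stand-in for destreaming: UTF-8 bytes -> unicode object."""
    return data.decode("utf-8")

original = u"\u0448\u4e09\u3082"       # three characters
wire_bytes = stream_unicode(original)
restored = destream_unicode(wire_bytes)

# Three characters occupy 2 + 3 + 3 bytes once encoded as UTF-8.
print(len(original), len(wire_bytes))
```

As UTF-8 can represent every Unicode code point, the round trip is lossless for any unicode object.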

The MySQL storage of UNICODE_STRING properties is slightly different to regular STRING objects. These properties result in TEXT or VARCHAR columns in the database with a specific character set encoding on each table and column. For more details please refer to the section UNICODE_STRING storage.

10.2. DBMgr and Encodings

When considering DBMgr and MySQL's usage of character encodings, we must be clear on all areas that character sets are used.

Character encoding is mainly relevant when dealing with UNICODE_STRING properties, as STRING properties are already treated as byte arrays.

Streaming UNICODE_STRING properties to DBMgr uses the encoding / decoding mechanism outlined in UNICODE_STRING.

When DBMgr needs to send a UNICODE_STRING property to MySQL, it is passed as UTF-8 to the MySQL client connection^[16] for transmission. All BigWorld connections to a MySQL server are established in UTF-8 mode. This can be seen in the MySql::connect() method located in bigworld/src/lib/dbmgr_mysql/wrapper.cpp.

When data is received by the MySQL server from a client connection it may optionally convert the data into another character set^[17]. The client connection establishment outlined in the previous step ensures that this character set is also UTF-8 which means that no character set modification will occur here.

Now that the MySQL server has completely received the client data it can store it in whatever format is necessary. When creating entity tables, BigWorld defaults all UNICODE_STRING columns to store their data as UTF-8. This ensures the most compatible mode possible for all customers as UTF-8 should cover the entire Unicode range of characters. The following section UNICODE_STRING storage outlines how the UNICODE_STRING properties are stored in more detail along with details on how to alter the encoding on disk.
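The conversions described above can be modelled as a chain of re-encoding steps. The following Python 3 sketch is only a model: convert() is a hypothetical helper, and Python codec names are used as stand-ins for the corresponding MySQL character sets.

```python
def convert(data, from_charset, to_charset):
    """Re-encode a byte string from one character set into another."""
    return data.decode(from_charset).encode(to_charset)

value = u"\u4e09"

# Step 1: DBMgr hands the property to the client connection as UTF-8.
sent = value.encode("utf-8")

# Step 2: the server converts from the client character set to the
# connection character set. Both are UTF-8, so the bytes are unchanged.
received = convert(sent, "utf-8", "utf-8")

# Step 3: the server converts to the column character set for storage.
stored_utf8 = convert(received, "utf-8", "utf-8")      # default column
stored_gb2312 = convert(received, "utf-8", "gb2312")   # alternate column

print(repr(stored_utf8), repr(stored_gb2312))
```

With the default settings every step is a no-op, so the bytes on disk are identical to the bytes DBMgr sent. Only a non-default column character set changes the stored representation.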

10.2.1. UNICODE_STRING storage

Entity properties that have a data type of UNICODE_STRING are stored in a MySQL database as TEXT or VARCHAR columns depending on whether a <DatabaseLength> was specified in the entity definition file.

In order to allow more efficient storage of data in MySQL, it is possible to change the storage type of UNICODE_STRING property columns using the dbMgr/unicodeString/characterSet^[18] bw.xml option. The effect of modifying this value can best be seen by using an example.

Using the Chinese character for 3 (Unicode code point U+4E09), we can see from the following Python code that the byte representation of the character is smaller in the GB2312^[19] character set than in UTF-8.

>>> three = u"\u4e09"
>>> print repr( three )
u'\u4e09'
>>> print repr( three.encode( "utf8" ) )
'\xe4\xb8\x89'
>>> print repr( three.encode( "gb2312" ) )
'\xc8\xfd'

For this reason, it may make sense for certain games to use an alternate character set for storing UNICODE_STRING properties. Note, however, that while this is a supported feature, it may introduce unexpected issues due to differences in the Client input method^[20] and the Python unicode string encoding^[21].
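One way to gauge the potential saving before changing the option is to measure representative property values under each candidate character set. The sketch below is written for Python 3; encoded_size() is a hypothetical helper, and the sample strings are invented for illustration.

```python
def encoded_size(text, charset):
    """Number of bytes needed to hold text in the given character set."""
    return len(text.encode(charset))

samples = [u"\u4e09" * 10, u"BigWorld"]
for text in samples:
    # Print the UTF-8 size followed by the GB2312 size.
    print(encoded_size(text, "utf-8"), encoded_size(text, "gb2312"))
```

In this sketch the Han text shrinks from 30 bytes to 20, while the ASCII text takes 8 bytes in both character sets, so the benefit depends on the text the game actually stores.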

For more information on MySQL character encodings please refer to the MySQL online documentation.

10.2.1.1. Storing invalid characters

As it is possible to modify the character set that UNICODE_STRING properties are stored as in MySQL, it is important to understand how MySQL handles the case of writing data to a column that cannot be encoded to the column's character set.

To illustrate this case we will start with a simple Python example. If we attempt to encode() the code point U+4E04 to the ASCII character encoding, an exception is raised as follows:

>>> print u"\u4E04".encode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e04' in position 0: ordinal not in range(128)

Unfortunately, MySQL does not replicate this behaviour. Instead, it fails silently and inserts ? characters in place of the invalid characters. As this failure is silent, it is possible to unknowingly corrupt data in your database by setting dbMgr/unicodeString/characterSet to a value that does not fully cover the range of characters that may be provided to MySQL. This is one of the reasons we recommend you leave the storage type as UTF-8 unless absolutely required.
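A script can guard against this kind of silent loss by checking a value before it is written. The following Python 3 sketch assumes that a Python codec of the same name approximates the MySQL character set, which is not guaranteed to hold exactly; can_store() is a hypothetical helper, not a BigWorld function.

```python
def can_store(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

value = u"A\u4e04"

# errors="replace" mimics the silent substitution described above.
lossy = value.encode("ascii", "replace")
print(lossy)

print(can_store(value, "ascii"), can_store(value, "utf-8"))
```

Rejecting or logging values for which can_store() returns False makes the failure visible in script rather than leaving ? characters in the database.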

10.2.2. Sorting search results

As each language has its own conventions regarding the order in which a set of values should be sorted, MySQL also provides the ability to modify the ordering of search results when querying a database. The set of rules used to define the sorting order is referred to as a collation.

Each character set that is available in MySQL has one or more collations available. For example the UTF-8 character set in MySQL has 21 collations available which can be seen by running the command:

mysql> SHOW COLLATION LIKE 'utf8_%';

This is relevant both for custom searches you may perform on the BigWorld entity database and for internal server lookups of entities by their <Identifier> property^[22].

Collations are generally referred to as one of the following:

Case sensitive
Case insensitive
Binary

Depending on the behaviour of your game, you may wish to modify the default UNICODE_STRING collation with the dbMgr/unicodeString/collation^[23] bw.xml option.

By default the server collation is utf8_bin, which provides case sensitive lookups.
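The practical difference between a binary collation and a case insensitive one can be approximated in Python. The sketch below is written for Python 3 and only imitates the general behaviour; it does not implement the actual comparison rules of any MySQL collation, and lookup() and the sample names are hypothetical.

```python
names = [u"bob", u"Alice", u"alice", u"Bob"]

# Binary style ordering: compare the encoded bytes, so case matters and
# upper case ASCII letters sort before lower case ones.
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))

# Case insensitive style ordering: fold case before comparing. sorted()
# is stable, so names that compare equal keep their original order.
case_insensitive_order = sorted(names, key=lambda s: s.lower())

def lookup(identifier, stored, case_sensitive=True):
    """Return the stored names that match identifier."""
    if case_sensitive:
        return [n for n in stored if n == identifier]
    return [n for n in stored if n.lower() == identifier.lower()]

print(binary_order)
print(case_insensitive_order)
print(lookup(u"alice", names), lookup(u"alice", names, case_sensitive=False))
```

Under the case insensitive comparison a lookup for alice matches two stored names, which is why the choice of collation matters when an <Identifier> property is expected to be unique.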

For more information on MySQL collations and behaviour, please refer to the MySQL online documentation Character Set Support.

^[16]This corresponds to the MySQL variable character_set_client.

^[17]This corresponds to the MySQL variable character_set_connection.

^[18]For more information on this option see the Server Operations Guide, chapter Server Configuration with bw.xml, section DBMgr Configuration Options.

^[19]The GB2312 character set is used in the example above rather than the more modern GB18030 character set as MySQL does not support GB18030.

^[20]For more information see the Client Programming Guide, chapter Input Method Editors (IME).

^[21]For more details see Python and Entity Properties.

^[22]This only applies when an <Identifier> property is a UNICODE_STRING.

^[23]For more information on this option see the Server Operations Guide, chapter Server Configuration with bw.xml, section DBMgr Configuration Options.

Copyright 1999-2012 BigWorld Pty. Ltd. All rights reserved. Proprietary commercial in confidence.