Unicode

What is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The characters in Unicode are mapped to numerical values named 'code points'. The letter 'A', for example, is mapped to the code point U+0041 (U+ for Unicode). A code point does not dictate how it is stored in memory; that is the job of an encoding. For example, the hexadecimal value 0041 can be stored as 0x41-0x00 (UTF-16, little endian), as 0x00-0x41 (UTF-16, big endian), or simply as 0x41 (UTF-8). The important point is that we can no longer assume that every character takes a constant number of bytes. UTF-8, for instance, uses 1 to 4 bytes per character (its original design allowed sequences of up to 6 bytes). The most common encodings are UTF-8, UTF-16 and, less often, UTF-32.
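To make the difference between a code point and its stored bytes concrete, here is a minimal sketch in standard C++. The function names EncodeUtf8 and EncodeUtf16 are hypothetical and are not part of the 3ds Max SDK; the code assumes it is given a valid code point (at most 0x10FFFF and not a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Keeps only the low 8 bits of a value.
static std::uint8_t Byte(std::uint32_t v)
{
    return static_cast<std::uint8_t>(v & 0xFF);
}

// Encodes one code point as UTF-8 (1 to 4 bytes).
std::vector<std::uint8_t> EncodeUtf8(std::uint32_t cp)
{
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(Byte(cp));
    } else if (cp < 0x800) {
        out.push_back(Byte(0xC0 | (cp >> 6)));
        out.push_back(Byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(Byte(0xE0 | (cp >> 12)));
        out.push_back(Byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(Byte(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(Byte(0xF0 | (cp >> 18)));
        out.push_back(Byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(Byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(Byte(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encodes one code point as UTF-16 (2 or 4 bytes) in the requested byte order.
std::vector<std::uint8_t> EncodeUtf16(std::uint32_t cp, bool bigEndian)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Code points above U+FFFF are split into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = Byte(u >> 8);
        const std::uint8_t lo = Byte(u);
        if (bigEndian) { out.push_back(hi); out.push_back(lo); }
        else           { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}
```

With this sketch, U+0041 becomes the single byte 41 in UTF-8, the bytes 41 00 in little-endian UTF-16 and 00 41 in big-endian UTF-16, while U+00E9 (é) already needs two bytes (C3 A9) in UTF-8.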

A Unicode string is a sequence of Unicode characters, each taking a potentially variable number of bytes. Text can also be stored in non-Unicode encodings such as ASCII, ANSI or other code pages, but each of these covers only a subset of the Unicode code points. If a Unicode character is displayed in an environment that uses a specific non-Unicode encoding, a special character called the replacement character (looks like a question mark - see http://en.wikipedia.org/wiki/Replacement_character) or a box is displayed for each character that is not defined in that encoding. Characters that are defined in both encodings may still show up correctly. For example, the following table shows how the string "Résumé", saved in different encodings, is displayed.

Encoding in which the string was saved    Display of the string in an ASCII environment
UTF-16, little endian                     Résumé
UTF-16, big endian                        RΘsumΘ
UTF8                                      Résumé
ANSI                                      RΘsumΘ
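The "Résumé" rows can be reproduced with a small sketch. It assumes the displaying environment treats every byte as one Latin-1 (ISO-8859-1) character, which is an assumption about the environment rather than something the table states; the function name MisreadUtf8AsLatin1 is hypothetical.

```cpp
#include <string>

// Takes UTF-8 bytes, pretends each byte is a separate Latin-1 character,
// and returns what a viewer would then show, itself encoded as UTF-8.
std::string MisreadUtf8AsLatin1(const std::string& utf8Bytes)
{
    std::string shown;
    for (unsigned char b : utf8Bytes) {
        if (b < 0x80) {
            // ASCII bytes mean the same thing in both encodings.
            shown.push_back(static_cast<char>(b));
        } else {
            // In Latin-1, byte value N is code point U+00NN, which takes
            // two bytes when written back out as UTF-8.
            shown.push_back(static_cast<char>(0xC0 | (b >> 6)));
            shown.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return shown;
}
```

Feeding it the UTF-8 bytes of "Résumé" (52 C3 A9 73 75 6D C3 A9) turns each two-byte é into the two visible characters Ã and ©, while the ASCII letters are unaffected.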

Character and String Types in 3ds Max

Strings are no longer treated as arrays of plain char. If you are sure that your strings will always be ASCII and will not include any extra symbols, or if your code will always run on an operating system with the same MBCS encoding, you can still use char arrays; however, this is not recommended.

You can use arrays of wchar_t (wide character) if you are using an encoding such as UTF-16. You can compile the same code using either char or wchar_t arrays for strings if you define a macro and have the preprocessor replace it with either char or wchar_t before compilation. In fact, this is what the TCHAR macro in the Windows API does.
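The mechanism can be sketched without any Windows headers. The names MY_UNICODE, MyChar, MY_TEXT and MyLength below are hypothetical stand-ins chosen for illustration; they are not the actual definitions of TCHAR or _T().

```cpp
#include <cstddef>
#include <type_traits>

// One switch selects the character type and the matching literal prefix.
#ifdef MY_UNICODE
    typedef wchar_t MyChar;
    #define MY_TEXT(s) L##s
#else
    typedef char MyChar;
    #define MY_TEXT(s) s
#endif

// Counts the characters before the null terminator. The same source line
// compiles for both character types because it only mentions MyChar.
std::size_t MyLength(const MyChar* s)
{
    std::size_t n = 0;
    while (s[n] != MyChar(0)) {
        ++n;
    }
    return n;
}
```

Code written as MyLength(MY_TEXT("plugin")) builds unchanged in both modes; only the definition of MY_UNICODE on the compiler command line decides which character type is used.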

When you are programming for a specific product such as 3ds Max, there might be a case where you need to compile against a non-Unicode API (such as 3ds Max 2012), but you want to have Unicode strings in your code or plugin for internal purposes. This is why a separate macro MCHAR is defined, so you can use MCHAR for interfacing with the 3ds Max API, and TCHAR for code internal to your plugin. Similarly, use MSTR if interfacing with the 3ds Max API and TSTR for code internal to your plugin when dealing with strings.

The following table shows the mapping for these types in both MBCS and Unicode modes. You can also refer to the strbasic.h and strclass.h files for more information.

Name MBCS Unicode
TCHAR char wchar_t
MCHAR char wchar_t
TSTR CStr WStr

Arrays of Characters (C Strings)

A basic C string data type is a null-terminated array of fixed-size character elements. For 3ds Max, you have to use either TCHAR* or MCHAR* to point to such an object. The use of ASCII or ANSI strings for text and files should be avoided. These strings are not compatible with TCHAR or MCHAR when built with Unicode. The following table lists a sample of functions which should be avoided, and their suggested alternative(s).

Function to Avoid Reason Replacement Function
fopen 3ds Max must support opening files which contain any valid Unicode character. _tfopen
tolower, toupper These functions do not work with many MBCS code pages, such as Chinese or Japanese. _tcslwr, _tcsupr
fprintf Has issues handling Unicode. It can be replaced by _ftprintf, but even that function produces MBCS output by default. MaxSDK::Util::TextFile::Writer
fwrite, fread Do not perform any kind of conversion. Using these functions to read or write TCHAR data will produce files that are not interchangeable between ASCII and Unicode builds. MaxSDK::Util::TextFile::Writer, MaxSDK::Util::TextFile::Reader, MaxSDK::Util::TextFile::ReaderWriter
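As an illustration of why byte-wise case conversion is unsafe with MBCS code pages, consider the following sketch. It relies on the fact that in Shift-JIS (a Japanese code page) the second byte of a two-byte character may fall in the ASCII letter range; NaiveUpper is a hypothetical helper, and the program is assumed to run in the default "C" locale.

```cpp
#include <cctype>
#include <string>

// Upper-cases a string one byte at a time, the way char-based code
// built around toupper typically does.
std::string NaiveUpper(std::string s)
{
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

Given the two-byte Shift-JIS sequence 83 61, the function leaves the lead byte 83 alone but rewrites the trail byte 61 (which looks like 'a') to 41, silently turning one character into a different one.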

C Runtime TCHAR-enabled functions

The standard C runtime library functions use MBCS strings that cannot be easily converted to Unicode. It is recommended to use their equivalent TCHAR functions to make your project compatible with Unicode. The following tables contain a partial list of MBCS-only functions and their TCHAR equivalent.

File Functions

MBCS function name TCHAR function name
_fprintf_l _ftprintf_l
fgets _fgetts
fopen _tfopen
fprintf _ftprintf
fputs _fputts
unlink _tunlink

String Functions

MBCS function name TCHAR function name
sprintf _stprintf
sscanf _stscanf
strcat _tcscat
strchr _tcschr
strcmp _tcscmp
strcpy _tcscpy
strdup _tcsdup
stricmp _tcsicmp
strlen _tcslen
strlwr _tcslwr
strncat _tcsncat
strncmp _tcsncmp
strncpy _tcsncpy
strstr _tcsstr
strupr _tcsupr
vsprintf _vstprintf

Date Functions

MBCS function name TCHAR function name
_ctime32 _tctime32
_ctime64 _tctime64
_strdate _tstrdate
_strtime _tstrtime
_utime _tutime
_utime32 _tutime32
_utime64 _tutime64
asctime _tasctime
ctime _tctime

Conversion Functions

MBCS function name TCHAR function name
atoi _tstoi
atof _tstof

String Objects (C++ Strings)

String objects are abstract data types composed of a sequence of characters. However, each character can take different memory sizes in the implementation. Moreover, there can be extra information in a string object. It is recommended to use TSTR or MaxString objects when you need a string object.

TSTR Strings

TSTR is a macro replaced by either WStr or CStr, depending on whether the UNICODE symbol is defined. You can still directly use CStr or WStr if you have to specifically store data in a particular format independent of the build mode. In addition, WStr and CStr implement a lazy-copying algorithm which makes it fast and cheap to copy strings inside the application.

Converting C Strings to TSTR

There are several distinct ways of storing and referencing a string in a Windows program. Examples are:

  • Strings stored in ACP (active code page) format and referenced by char*.
  • Strings stored in UTF-16 format and referenced by wchar_t*.
  • Strings stored in BSTR (basic string objects that hold the size of the string in the object itself) and referenced by BSTR and so on.

The TSTR class provides methods for converting between different types of string objects. Be careful: conversion to ACP and to TCHAR (in MBCS mode) can lose character information. The following static methods of TSTR construct a TSTR object from different data types:

static TSTR TSTR::FromBSTR  (BSTR           string [, size_t length]);
static TSTR TSTR::FromACP   (const char*    string [, size_t length]);
static TSTR TSTR::FromUTF8  (const char*    string [, size_t length]);
static TSTR TSTR::FromOLESTR(LPCOLESTR      string [, size_t length]);
static TSTR TSTR::FromUTF16 (const wchar_t* string [, size_t length]);
static TSTR TSTR::FromUTF32 (const unsigned int* string [, size_t length]);
static TSTR TSTR::FromTCHAR (const TCHAR*   string [, size_t length]);
static TSTR TSTR::FromCP    (UINT cp, const char* string [, size_t length]);
static TSTR TSTR::FromCStr  (const CStr& string);
static TSTR TSTR::FromWStr  (const WStr& string);
static TSTR TSTR::FromMCHAR (const wchar_t* string [, size_t length]);
static TSTR TSTR::FromMSTR  (const WStr& string);

Converting from TSTR to Other Strings

The following is a list of TSTR member functions (methods) which return different string objects from the caller TSTR object. Except for "ToBSTR", all pointers returned by TSTR are valid as long as the TSTR object remains alive and unchanged. Note that these do not return copies of the data. Ensure you use the returned pointers promptly and avoid caching them.

Return type Method
wchar_t* ToBSTR() const
MaxSDK::Util::MaxStringCastCP ToCP(UINT cp, size_t* length = NULL) const
MaxSDK::Util::MaxStringCast<char> ToACP(size_t* length = NULL) const
MaxSDK::Util::MaxStringCastUTF8 ToUTF8(size_t* length = NULL) const
MaxSDK::Util::MaxStringCast<WCHAR> ToOLESTR(size_t* length = NULL) const
MaxSDK::Util::MaxStringCast<WCHAR> ToUTF16(size_t* length = NULL) const
MaxSDK::Util::MaxStringCast<unsigned int> ToUTF32(size_t* length = NULL) const
MaxSDK::Util::MaxString ToMaxString() const
CStr ToCStr() const
WStr ToWStr() const
Compiler dependent ToMCHAR(size_t* length = NULL) const
Compiler dependent ToMSTR() const

C++ Generic Strings

You can also safely use the following macros to target generic C++ data types. These macros are compatible with both MBCS and Unicode configurations. These are defined in strbasic.h.

Macro MBCS Unicode
M_STD_STRING std::string std::wstring
M_STD_OSTRINGSTREAM std::ostringstream std::wostringstream
M_STD_ISTRINGSTREAM std::istringstream std::wistringstream
M_STD_OSTREAM std::ostream std::wostream
M_STD_ISTREAM std::istream std::wistream
M_STD_FOSTREAM std::ofstream std::wofstream
M_STD_FISTREAM std::ifstream std::wifstream

There are also several global objects provided by the C++ Standard Library that are specific to either MBCS or Unicode. They have been aliased to a common name that works for both, so that you can use the same name regardless of the character type used for strings.

Macro MBCS Unicode
M_STD_CERR std::cerr std::wcerr
M_STD_CIN std::cin std::wcin
M_STD_COUT std::cout std::wcout

MaxString Strings

MaxString is the internal string data type for TSTR. CStr and WStr use this class to perform their internal string operations. This class uses an internal buffer that can hold a string in different encodings, and has a data member that tells which encoding is being used. Plug-in developers can use this class when they need to have string objects passed to/from 3ds Max. Some advantages of this class are:

  • It is capable of holding the same string in different encodings.
  • It reduces the number of copies of the same string in the memory by using a lazy copy algorithm. Several MaxString objects will point to the same physical memory location unless they need to modify the string. It is only then that an actual copy is performed.
  • It prevents loss of information by performing optimized string operations. For example, concatenating a UTF-16 string with an ACP string will give a UTF-16 string.
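The lazy copy idea from the list above can be sketched with a reference-counted buffer. This is an illustration of the general copy-on-write technique only, not the actual MaxString implementation; the class LazyString and its members are hypothetical, and the sketch makes no attempt at thread safety.

```cpp
#include <memory>
#include <string>

class LazyString
{
public:
    explicit LazyString(const std::wstring& text)
        : buffer_(std::make_shared<std::wstring>(text)) {}

    // Copy construction and assignment use the compiler-generated versions,
    // which only copy the shared pointer, not the characters.

    const std::wstring& Get() const { return *buffer_; }

    // Modifying operations detach from the shared buffer first.
    void Append(const std::wstring& suffix)
    {
        if (buffer_.use_count() > 1) {
            // Someone else still refers to this text: make a private copy.
            buffer_ = std::make_shared<std::wstring>(*buffer_);
        }
        buffer_->append(suffix);
    }

    // True while both objects still point at the same physical buffer.
    bool SharesBufferWith(const LazyString& other) const
    {
        return buffer_ == other.buffer_;
    }

private:
    std::shared_ptr<std::wstring> buffer_;
};
```

Copying such an object costs one pointer copy; the characters are duplicated only when one of the copies is modified, at which point the other copy keeps its original text.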

Converting Strings From/To MaxString

The following static functions create and return a MaxString from other string types.

static MaxString MaxString::FromCP         (UINT codepage, const char*, size_t length = (size_t)-1);
static MaxString MaxString::FromACP        (const char*, size_t length = (size_t)-1);
static MaxString MaxString::FromUTF8       (const char*, size_t length = (size_t)-1);
static MaxString MaxString::FromUTF16      (const WCHAR*, size_t length = (size_t)-1);
static MaxString MaxString::FromUTF32      (const unsigned int*, size_t length = (size_t)-1);
static MaxString MaxString::FromWin32Error (DWORD err);
static MaxString MaxString::FromAnsiError  (int err);

You can use the MaxStringCast template class to convert your MaxString objects to encoding-dependent string objects. The following functions can convert MaxStrings to MaxStringCast<typename ChType> or MaxStringCastUTF8.

MaxStringCast<char>         MaxString::ToACP  (size_t* length) const
MaxStringCastUTF8           MaxString::ToUTF8 (size_t* length) const
MaxStringCast<WCHAR>        MaxString::ToUTF16(size_t* length) const
MaxStringCast<unsigned int> MaxString::ToUTF32(size_t* length) const

You can then use MaxStringCast<typename ChType>::data() to return an encoding-dependent string pointer:

inline const ChType* data() const {
    return (buf) ? buf: null_data();
}

buf is a protected data member pointing to the actual string of the specific type.

Reading and Writing Text Files

There are three new classes inside the MaxSDK::Util::TextFile namespace that were designed to intelligently read and write Unicode-enabled text files. These classes detect the encoding used in a particular file in one of three ways: by identifying the Byte Order Mark (BOM) at the beginning of the file, by validating that the file contents form a valid Unicode encoding such as UTF-16 or UTF-8 (that is, the file does not contain any invalid character), or, failing both, by treating it as an ACP MBCS file.
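The first of these detection steps, recognizing a BOM, can be sketched as follows. The byte sequences are the standard ones (EF BB BF for UTF-8, FF FE for little-endian UTF-16, FE FF for big-endian UTF-16); the names BomKind and DetectBom are hypothetical and do not reflect how the TextFile classes are implemented.

```cpp
#include <cstddef>

enum class BomKind { None, Utf8, Utf16LittleEndian, Utf16BigEndian };

// Inspects the first bytes of a file and reports which BOM, if any, is present.
// Note: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE and is
// not distinguished by this simplified check.
BomKind DetectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return BomKind::Utf8;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return BomKind::Utf16LittleEndian;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return BomKind::Utf16BigEndian;
    }
    return BomKind::None;
}
```

When the result is BomKind::None, a reader has to fall back on the other two strategies described above: validating the content as UTF-8 or UTF-16, and otherwise assuming ACP.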

The following table describes the mapping between the function fopen and its open mode parameter and the 3ds Max API class to use:

fopen Open Mode Class and Open method
"r", "rt", "rb"
    TextFile::Reader reader;
    reader.Open(file);
"r+t", "r+b"
    TextFile::ReaderWriter readerWriter;
    readerWriter.Open(file);
"w", "wt", "wb"
    TextFile::Writer writer;
    writer.Open(file, false);
"w+t", "w+b"
    TextFile::ReaderWriter readerWriter;
    readerWriter.Open(file, false);
"a", "at"
    TextFile::Writer writer;
    writer.Open(file, true);
"a+t", "a+b"
    TextFile::ReaderWriter readerWriter;
    readerWriter.Open(file);

MaxSDK::Util::TextFile::Reader

This class reads and interprets text files, and it was designed to perform file and stream I/O in a code page neutral way. It was designed to:

  • Read and correctly interpret the Byte Order Mark (BOM, an invisible character at the beginning of Unicode files).
  • Correctly detect UTF-8 and UTF-16 files, even if the file is not signed with a BOM.
  • Detect encoding cookies. XML files usually begin with "<?xml encoding='????'>". The detection algorithm interprets this directive correctly.
  • Avoid splitting a character. In UTF-16, UTF-8 and some ANSI code pages, a single character can occupy more than one byte. All the operations of this object avoid returning a partial character.

Plugin developers should consider using this class to perform File I/O to ensure that the files they generate remain compatible with previous versions of 3ds Max.
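To show what "avoid returning a partial character" involves, here is a minimal sketch for UTF-8 only. It assumes well-formed input, and SafeChunkLength is a hypothetical helper rather than a TextFile::Reader method.

```cpp
#include <cstddef>
#include <string>

// Returns the largest length that is at most maxBytes and at which the
// UTF-8 text can be cut without splitting a multi-byte character.
std::size_t SafeChunkLength(const std::string& utf8, std::size_t maxBytes)
{
    if (maxBytes >= utf8.size()) {
        return utf8.size();
    }
    std::size_t len = maxBytes;
    // Continuation bytes have the bit pattern 10xxxxxx. If the byte that would
    // start the next chunk is a continuation byte, the cut is in the middle of
    // a character, so move it back until it lands on a character boundary.
    while (len > 0 && (static_cast<unsigned char>(utf8[len]) & 0xC0) == 0x80) {
        --len;
    }
    return len;
}
```

For the UTF-8 bytes of "Résumé", a request for at most 2 bytes yields a chunk of 1 byte ("R"), because cutting after the second byte would separate the two bytes of "é".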

Opening Reader Files

The following lists the various methods for opening a text file using the MaxSDK::Util::TextFile::Reader class.

bool TextFile::Reader::Open(
    const MCHAR* fileName, 
    unsigned int encoding = 0, 
    LineEndMode mode = Text
);

bool TextFile::Reader::Open(
    FILE* file, 
    unsigned int encoding = 0, 
    LineEndMode mode = Text
);

bool TextFile::Reader::Open(
    HANDLE fileHandle, 
    unsigned int encoding = 0, 
    LineEndMode mode = Text 
);

The parameters of the above functions are described as follows:

  • fileName/file/fileHandle - The first parameter specifies the file to open. The TextFile::Reader class will parse them to determine their format.
  • encoding - A flag to give hints to the detection algorithm. Acceptable values for this parameter are all of the code page numbers recognized by MS Windows. In addition to that, you can also set one of the following flags:

    • FAVOR_UTF8 = 0x10000000 - If not used, the function assumes file encoding is ACP when the file's encoding cannot be detected. Setting this will force this function to use UTF8 when the encoding cannot be automatically detected.
    • FOUND_BOM = 0x20000000 - Tells the function that there is a BOM at the beginning of the file.
    • FOUND_COOKIE = 0x40000000 - Tells the function that there is an encoding cookie at the beginning of the file.
    • FLIPPED = 0x80000000 - Tells the function that the file is flipped (big endian) UTF-16.

    For example, if you specify CP_ACP | FAVOR_UTF8, the detection algorithm will treat any non-UTF8 data as ACP.

  • mode - The last parameter specifies the type of end-of-line characters used in the file. Possible values are:

    • Unchanged - Do not alter the end of line character sequence when reading the file.
    • Enforce_CRLF - Ensure that all lines are terminated by the CRLF character sequence.
    • Enforce_LF - Ensure that all lines are terminated by the LF character.
    • Text - Enforce_CRLF (default in Windows).
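The encoding parameter described above packs a code page number and hint flags into one unsigned int. The sketch below shows how such a value can be taken apart. The flag values are copied from the list above; the constant names, the EncodingHint struct, and the assumption that the code page occupies all bits below the four flag bits are illustrative and should be checked against the SDK headers.

```cpp
#include <cstdint>

// Flag values as listed in this section (local copies for illustration).
const std::uint32_t kFavorUtf8   = 0x10000000;
const std::uint32_t kFoundBom    = 0x20000000;
const std::uint32_t kFoundCookie = 0x40000000;
const std::uint32_t kFlipped     = 0x80000000;

// Assumption: the remaining low bits hold the Windows code page number.
const std::uint32_t kCodePageMask = 0x0FFFFFFF;

// Windows code page identifiers used in the example.
const std::uint32_t kCpAcp  = 0;      // CP_ACP
const std::uint32_t kCpUtf8 = 65001;  // CP_UTF8

struct EncodingHint
{
    std::uint32_t codePage;
    bool favorUtf8;
    bool foundBom;
    bool foundCookie;
    bool flipped;
};

// Splits a combined encoding value into its code page and flags.
EncodingHint SplitEncoding(std::uint32_t encoding)
{
    EncodingHint hint;
    hint.codePage    = encoding & kCodePageMask;
    hint.favorUtf8   = (encoding & kFavorUtf8) != 0;
    hint.foundBom    = (encoding & kFoundBom) != 0;
    hint.foundCookie = (encoding & kFoundCookie) != 0;
    hint.flipped     = (encoding & kFlipped) != 0;
    return hint;
}
```

Under these assumptions, the value CP_ACP | FAVOR_UTF8 from the example above decomposes into code page 0 with only the favor-UTF-8 hint set.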

Querying Reader Encoding

You can use the function unsigned int MaxSDK::Util::TextFile::Reader::Encoding() to query the encoding used by this object. To read from the text file, use one of the following functions:

char TextFile::Reader::ReadChar              (bool peek = false);
MaxString TextFile::Reader::ReadChars        (size_t nchars);
unsigned int TextFile::Reader::ReadCharUTF32 (bool peek = false);
MaxString TextFile::Reader::ReadChunk        (size_t len, bool dontReturnLastEOL = false);
MaxString TextFile::Reader::ReadFull         ();
MaxString TextFile::Reader::ReadLine         (size_t nchars = (size_t)-1, bool dontReturnEOL = false);

MaxSDK::Util::TextFile::Writer

TextFile::Writer is used to write to any Unicode or MBCS text file format. It automatically performs the conversion between TCHAR and the underlying format. As with the TextFile::Reader class, the default encoding is ACP, so it produces an MBCS ACP file unless you change the "encoding" parameter to another format such as CP_UTF8 or MSDE_CP_UTF16. If TextFile::Writer is set to append or overwrite a file, it will detect the file format of the existing file and continue using that particular format. This class avoids mixing different encodings within the same file, and also intelligently converts strings into binary data. Its open functions are similar to those of TextFile::Reader. It has the following file writing functions:

size_t Write (const char *string, size_t nbchars=(size_t)-1);
size_t Write (const wchar_t *string, size_t nbchars=(size_t)-1);
size_t Write (const MaxString &string);

MaxSDK::Util::TextFile::ReaderWriter

While the TextFile::Reader and TextFile::Writer are used for the occasions where you only need to either read or write from/to Unicode text files, the TextFile::ReaderWriter class both reads and writes Unicode files. Please refer to the class documentation for a detailed list of its functionality.

Porting Existing Plug-ins to Unicode Builds of 3ds Max

Plugins have to be ported to Unicode because 3ds Max 2013 is built with Unicode. The library files that ship with the SDK are therefore built for Unicode only, and the API no longer supports MBCS encoding.

The first step for porting existing plug-ins to the Unicode build of 3ds Max is to define the UNICODE and _UNICODE symbols in your project. Remove any MBCS and _MBCS definitions from your plugin project. This will replace the TCHAR, MCHAR and TSTR macros with their equivalent Unicode types during preprocessing. Failing to do so will cause linker errors because your functions will have different signatures from those defined in the 3ds Max SDK.

To define the UNICODE and _UNICODE symbols, you can either set the character set in the general settings of your project's properties to "use Unicode character set", or add the switches /D "UNICODE" /D "_UNICODE" to the command line options in project properties > Configuration Properties > C/C++ > Command Line > Additional Options.

Make sure you are using only Unicode compatible characters and strings when you are passing parameters to/from the Max API. Your plug-in will crash or at least will fail to work properly if you use non-Unicode strings such as char* or CStr. Use TSTR, MSTR or MaxString for strings, and TCHAR or MCHAR for characters.

You have to use Unicode encoded files if your plug-in reads from or writes to text files or uses database interfaces. It is recommended to use MaxSDK::Util::TextFile::Reader and MaxSDK::Util::TextFile::Writer. If you are not using those file utilities, you have to pay extra attention to avoid using file utilities that use non-Unicode strings.

Check all the usages of sizeof() and string length determinations within your code. Remember that the memory size for different characters in your string is not the same anymore. Avoid the common mistakes mentioned in the "Common Pitfalls" section below.

You can now build your plug-in. You might still have missed a few places in your code that use non-Unicode string or character types. Many linker errors (typically unresolved external symbol errors such as LNK2019) are caused by incompatible string or character types.

Common Pitfalls

sizeof() and _countof()

  • sizeof is used in conjunction with memory allocation functions such as: malloc, memset, memcpy, etc.
  • _countof is used in functions which take both a TCHAR pointer and a size parameter. This is a standard across Microsoft and 3ds Max code. Every time a function needs the length of a TCHAR pointer, it's always in number of TCHAR, and not a number of bytes. For example: LoadString.
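The distinction between a size in bytes and a size in characters can be sketched in portable C++. Because _countof is specific to the Microsoft compiler, the sketch uses a hypothetical equivalent named CountOf; CopyName is likewise a made-up function that takes its capacity in characters, in the style described above.

```cpp
#include <cstddef>
#include <cwchar>

// Number of elements in a fixed-size array (portable stand-in for _countof).
template <typename T, std::size_t N>
constexpr std::size_t CountOf(const T (&)[N])
{
    return N;
}

// Copies src into dest. destCount is the capacity in CHARACTERS, not bytes,
// and the result is always null-terminated.
void CopyName(wchar_t* dest, std::size_t destCount, const wchar_t* src)
{
    if (destCount == 0) {
        return;
    }
    std::wcsncpy(dest, src, destCount - 1);
    dest[destCount - 1] = L'\0';
}
```

For a buffer declared as wchar_t name[32], CountOf(name) is 32 while sizeof(name) is 32 * sizeof(wchar_t). Passing sizeof(name) where a character count is expected would tell the callee that the buffer is larger than it really is.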

Varying Character Memory Usage

Developers can make mistakes when they write code which processes strings. For instance, it is tempting to assume that characters use a constant amount of memory. Even in the case of WCHAR, this assumption is false. For example:

  • MBCS characters can take 1 to 2 bytes.
  • UTF-8 characters can take 1 to 4 bytes (the original UTF-8 design allowed up to 6).
  • UTF-16 (WCHAR) characters might take from 2 to 4 bytes.

Operations which deal with strings might also yield different string lengths. As such, do not assume that 1 character is equal to 1 byte. For example, changing the casing of a string might change its size in memory. Avoid using sizeof to determine the length of a string.
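The gap between "number of storage units" and "number of characters" can be measured directly. The following sketch assumes well-formed input; the two counting functions are hypothetical helpers written for this illustration.

```cpp
#include <cstddef>
#include <string>

// Counts code points in UTF-16 text by skipping the second (low) half
// of every surrogate pair.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    return n;
}

// Counts code points in UTF-8 text by skipping continuation bytes (10xxxxxx).
std::size_t CountCodePointsUtf8(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

The UTF-16 string u"A\U0001F600" holds two characters but three 16-bit units, and the UTF-8 bytes of "Résumé" hold six characters in eight bytes, so neither size() nor sizeof reports the character count.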

reinterpret_cast()

When you add an offset expressed in bytes to a wchar_t pointer, the pointer advances by that many wchar_t elements, so you effectively add twice the value (wchar_t is 2 bytes wide on Windows). This is why you will see many occurrences of:

reinterpret_cast<TCHAR *>(reinterpret_cast<char *>(file) + sizeof(astruct));

Note that sometimes, reinterpret_cast<char *> is missing (only the TCHAR version is present). It was decided that all occurrences of reinterpret_cast<char *> are to be replaced with reinterpret_cast<byte *>.
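The effect of pointer arithmetic on wide-character pointers can be checked with a small sketch. AdvanceByBytes is a hypothetical helper; it uses unsigned char* as the portable spelling of a byte pointer and assumes the byte offset is a multiple of sizeof(wchar_t) so that the result stays properly aligned.

```cpp
#include <cstddef>

// Advances a wide-character pointer by a number of BYTES, not characters.
wchar_t* AdvanceByBytes(wchar_t* p, std::size_t byteOffset)
{
    // Step through a byte pointer so that the offset is not scaled
    // by sizeof(wchar_t), then convert back.
    return reinterpret_cast<wchar_t*>(
        reinterpret_cast<unsigned char*>(p) + byteOffset);
}
```

Writing buffer + byteOffset instead moves the pointer by byteOffset * sizeof(wchar_t) bytes, which is twice the intended distance on Windows, where wchar_t is 2 bytes wide.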

unsigned char*

If for any reason you must convert a char* to an unsigned char*, you have to use _TUCHAR. It's the TCHAR equivalent which guarantees that in both Unicode and ASCII versions, all characters are unsigned and non-negative.

TSTR::data() is now const

To prevent invalidating the pointers inside TSTR, the data() method now returns a const MCHAR*. Avoid casting the const string to a normal string.

GetProcAddress()

The Windows function GetProcAddress() does not perform any kind of handling when it comes to retrieving Unicode or ASCII functions. You must explicitly handle both cases, whether you have a Unicode or an ASCII build. Also, GetProcAddress() only accepts "char" (NOT Unicode) as input for the function name. If you have a TCHAR pointer, you need to convert it before using it. For example:

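// moduleHandle is a placeholder for the HMODULE of the already-loaded library.
#ifdef UNICODE
functionPointer = (functionProto)GetProcAddress(moduleHandle, "FunctionW");
#else
functionPointer = (functionProto)GetProcAddress(moduleHandle, "FunctionA");
#endif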

Avoid calling TSTR constructors without the _T macro

When you instantiate a TSTR object, you have to pay attention to implicit conversions. The following code should be avoided. In ASCII mode, it's fine because the TSTR object natively stores all its data in ACP. In Unicode mode, this source code will trigger a run-time conversion from ACP to UTF-16.

// Avoid this:
TSTR stringObject = "some string";
TSTR stringObject2("another string");

You should enclose your string inside the _T macro. The conversion to UTF-16, if needed, is performed at compile-time instead. The following is proper use of the _T() macro to wrap string literals:

// Do this:
TSTR stringObject = _T("some string");
TSTR stringObject2(_T("another string"));

Conventions

The following conventions will likely make your code easier to read and manage.

Implicit TSTRs

If an operation doesn't imply a form of conversion, for example ACP to UTF-16, we favor the implicit constructors and conversions of string objects.

// Avoid this:
FunctionCall(TSTR(GetString(IDS_STRING)));
TSTR stringObject = TSTR(_T("string"));

// Do this:
FunctionCall(GetString(IDS_STRING));
TSTR stringObject = _T("string");

String Concatenation

TSTR objects have concatenation operators for both TCHAR* and TSTR. If you concatenate a TCHAR* to a TSTR, you don't have to explicitly construct an intermediate TSTR object.

// Avoid this:
TSTR o1 = FunctionCall();
TSTR o2 = o1 + TSTR(FunctionCall()) + TSTR(_T("string"));

// Do this:
TSTR o1 = FunctionCall();
TSTR o2 = o1 + FunctionCall() + _T("string");

VARIANTs

Since 3ds Max relies on Windows, many places inside 3ds Max use the OLE variant types: VARIANT, VARIANTARG, and PROPVARIANT. Inside the 3ds Max SDK, we have functions to convert intelligently between VARIANTs and TSTR objects.

namespace MaxSDK { namespace Util {
    TSTR VariantToString(const PROPVARIANT*, UINT encoding=CP_ACP, USHORT flags=0);
    bool VariantIsString(const PROPVARIANT*);
    bool VariantIsStringVector(const PROPVARIANT*);
    bool SetStringToVariant(PROPVARIANT*, const char* str, bool clear = true);
    bool SetStringToVariant(PROPVARIANT*, const wchar_t* str, bool clear = true);
}}

Glossary

ACP: "Active Code Page" - In Windows or Linux, the active code page always refers to the code page that the underlying operating system would expect to get from an application. It always depends on the current user's configuration. For example, in US Windows, the code page is set to 1252 by default. In Chinese Windows, the code page is set to 936.

BOM: "Byte Order Mark" - A sequence of bytes at the beginning of a file that is used to determine the actual format of a file. It identifies whether the file is written in UTF-16 Little Endian, UTF-16 Big Endian or UTF-8.

MBCS: "Multi-Byte Character Set" - An MBCS string is a string where characters use bytes as their base data type but aren't limited to a single byte. In MBCS, all the control characters such as the null terminator, the carriage return, and the line feed always use a single byte.

MCHAR - A synonym of TCHAR that is used in Max SDK's header files.

TCHAR - A primitive data type. When compiled in "Unicode" mode, it becomes a WCHAR (wchar_t, a 16-bit unsigned type on Windows); otherwise, in MBCS mode, it is defined as char (a single byte).

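UTF-16 - The wide char standard implemented by Windows. It stores characters in 2 or 4 bytes. Its predecessor, UCS-2, always uses 2 bytes per character and therefore cannot represent characters outside the Basic Multilingual Plane.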

UTF-32 (UCS-4) - Unicode character encoding in which every character takes exactly 32 bits.