The CharacterSet Intrinsic Class

TADS 3 uses the Unicode character set to represent all strings internally.  Unicode is an international standard that was designed to be capable of representing, in a single character set, characters from every natural language in use throughout the world.  Since most computers use other character sets for the display, keyboard, and file system, though, it is often necessary to translate strings between the Unicode characters that TADS uses internally and these local coding systems.  In almost all cases, TADS performs this translation automatically; when you display a string, for example, TADS translates the string to the display character set, and when you read a string from the keyboard, TADS translates the local character encoding to Unicode in the returned string.

 

In some cases, though, it's useful to be able to translate characters to and from Unicode, or from one local character set to another, under explicit program control.  For these situations, TADS provides the CharacterSet intrinsic class.  This class encapsulates a "character mapping," which defines the correspondences between local character codes and Unicode character codes.

 

To create a CharacterSet object, you use the new operator, specifying the name of the character set you want to translate to or from:

 

  local cs = new CharacterSet('us-ascii');

 

The CharacterSet object can then be used to specify the encoding to use for explicit character translations.  You can use a CharacterSet in these situations:

 

 

In addition, CharacterSet provides a few methods that let you get information about the character mapping it describes.

 

Note: when using the CharacterSet class, you should #include the system header file "charset.h".

Built-In And External Character Mappings

TADS 3 has several character mappings built in to the system.  These mappings are so common that TADS makes them available on all platforms:

 

 

The character sets above are available on every TADS 3 interpreter.  In addition, TADS has a mechanism that allows new character set definitions to be added with external mapping files – see the Character Mapping documentation for details.  You can use any character set for which an external mapping file exists on the local system, simply by using the mapping name in the CharacterSet constructor.  (However, don't use the ".tcm" or other filename suffix – just use the plain mapping name.)

Unknown Character Mappings

You can create a CharacterSet object that refers to a character mapping that doesn't exist on the local system.  This is legal and will not cause any errors at the time you create the object; however, if you try to use the object to perform any character mapping, an exception – UnknownCharSetException – will be thrown.
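
As a rough illustration of this behavior, the sketch below wraps a mapping operation in a try/catch block.  The function name canShowText() is hypothetical, and the code assumes that isMappable() counts as a mapping operation and so throws the exception for an unknown mapping; treat it as a sketch under those assumptions rather than a definitive pattern.

  #include <tads.h>
  #include <charset.h>

  /* hypothetical helper: nil if the text (or the mapping) is unusable */
  canShowText(csName, txt)
  {
    /* this succeeds even if csName isn't installed on this machine */
    local cs = new CharacterSet(csName);

    try
    {
      /* assumption: isMappable() counts as a mapping operation */
      return cs.isMappable(txt);
    }
    catch (UnknownCharSetException exc)
    {
      /* the mapping is unavailable here, so nothing can be mapped */
      return nil;
    }
  }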

 

You can check to see if a character mapping is known by calling the isMappingKnown() method after creating the CharacterSet object.  If this method returns true, the character set is known and you can use it to perform character mapping.
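
A minimal sketch of this check follows.  The helper name openCharSet() is an assumption made for illustration; only the constructor and isMappingKnown() come from the class itself.

  #include <tads.h>
  #include <charset.h>

  /* hypothetical helper: return a usable CharacterSet, or nil */
  openCharSet(name)
  {
    /* creating the object never fails, even for an unknown name */
    local cs = new CharacterSet(name);

    /* only hand back the object if its mapping is available here */
    return cs.isMappingKnown() ? cs : nil;
  }

A caller can then test the result against nil before attempting any translation, which avoids the exception entirely.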

 

It is legal to create a CharacterSet referring to an unknown mapping because it would otherwise be impossible to save the state of a program that contains a CharacterSet object and then restore the state on another computer without the same character mappings.

CharacterSet Methods

getName() – returns a string giving the name of the character set.  This is the same as the name that was used to create the character set object.

 

isMappable(val) – returns true if the character val, which can be given as an integer (giving a Unicode character value) or a string of characters, can be mapped to characters in the character set, nil if not.  If val is a string, the method returns true only if all of the characters in the string can be mapped.

 

Note that it is legal to map a string even if it contains unmappable characters, because the mapping process will simply map any unmappable characters to the "default" character defined in the character mapping.  The default character varies by character set – it's part of the Unicode-to-local mapping definition – but it usually indicates visually that a character is missing; in some character sets it looks like an empty rectangle, and in others it's simply a question mark.
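
The two argument forms of isMappable() might be exercised as in the sketch below.  The function name showMappability() and the sample characters are chosen only for illustration.

  #include <tads.h>
  #include <charset.h>

  showMappability()
  {
    local cs = new CharacterSet('us-ascii');

    /* a single character, given as an integer Unicode value */
    if (cs.isMappable(0x41))
      "U+0041 can be mapped to <<cs.getName()>>.\n";

    /* a string: true only if every character can be mapped */
    if (!cs.isMappable('caf\u00e9'))
      "At least one character has no <<cs.getName()>> equivalent.\n";
  }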

 

isMappingKnown() – returns true if the character set has a known mapping, nil if not.  If this returns nil, any attempts to map characters using the object will throw an UnknownCharSetException.

Examples

Example 1:  Using a CharacterSet to determine if the local machine is capable of displaying Cyrillic characters.

 

If you're writing a game in Russian, you would probably want to make sure the player's computer is capable of displaying Cyrillic characters – if it weren't, the player probably wouldn't be able to read most of the text in your game.  You can do this by creating a CharacterSet object for the local system's display character set, and then testing a string of characters for mappability with the isMappable() method.
 
  #include <tads.h>

  #include <charset.h>
 
  testCyrillic(args)
  {
    /* get the local display character set */
    local cs = new CharacterSet(getLocalCharSet(CharsetDisplay));
 
    /*
     *  Check a few representative Cyrillic alphabetic characters
     *  (see http://charts.unicode.org/Web/U0400.html)
     */
    if (!cs.isMappable('\u0410\u0411\u041a\u042f\u0430\u0431\u044f'))
      "Warning: This game uses Cyrillic characters.  Your system
      does not appear to be localized for Russian, so the text
      in this game might not display properly.  You might need
      to adjust your system localization settings to display
      Cyrillic characters before you can play this game.  If
      you change your localization settings, please close and
      then re-start the game to ensure the new settings are used.";
  }

 

Example 2:  Translating a file from one character set to another.

 

This isn't a very typical situation for most games, but suppose you wanted to write a program that reads a text file that was saved in one character set and saves it in a different character set – say, translate the file from the Macintosh Roman character set to ISO Latin-1.  To do this, you would need a Mac Roman mapping definition on your computer, because this isn't one of the built-in character sets; assuming we had this mapping file (let's say it's called "MacRoman.tcm"), we could perform the translation quite easily using the text file functions.

 

  #include <tads.h>
  #include <charset.h>

  translate(inFileName, outFileName)
  {
    local inFile, outFile;
    local csMac, csISO;
 
    /* create the character set objects */
    csMac = new CharacterSet('MacRoman');
    csISO = new CharacterSet('iso-8859-1');
 
    /* open the files */
    inFile = fileOpen(inFileName, 'rt', csMac);
    outFile = fileOpen(outFileName, 'wt', csISO);
    if (inFile == nil || outFile == nil)
    {
      "Error: cannot open files.\n";
      return;
    }
 
    /* read text and write it back out */
    for (;;)
    {
      local txt;
 
      /* read a line of input; stop if at end of file */
      txt = fileRead(inFile);
      if (txt == nil)
        break;
 
      /* write it out */
      fileWrite(outFile, txt);
    }
 
    /* close the files */
    fileClose(inFile);
    fileClose(outFile);
  }

 

Note that creating CharacterSet objects isn't really necessary in this example, since we could have more simply passed the name of the character set directly to fileOpen().
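
That simpler form might look like the sketch below, which reuses the fileOpen() call from the example and passes the mapping names as plain strings.  The helper name openForTranslation() is hypothetical.

  #include <tads.h>

  /* hypothetical helper: open both files without CharacterSet objects */
  openForTranslation(inFileName, outFileName)
  {
    /* pass the mapping names themselves as the third argument */
    local inFile = fileOpen(inFileName, 'rt', 'MacRoman');
    local outFile = fileOpen(outFileName, 'wt', 'iso-8859-1');

    /* the caller checks each handle for nil before using it */
    return [inFile, outFile];
  }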