BabelStone Blog

Sunday, 17 September 2006

Precomposed Tibetan Part 2 : Stuck in the PUA

As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :

The paper also stated that there should be two implementation levels for Tibetan :

  1. Level 1 : Only works with non-combining and precomposed Tibetan characters
  2. Level 2 : Works with combining and precomposed characters

Level 1 would not be required to process any of the following characters :

Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.

Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.

Tibetan Extension-B (often referred to as "Set B"), covering rarely occuring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.

One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :

These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.

One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.

Implementation of Precomposed Tibetan

If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it go.

1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :

2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernable change, document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").

3. Now hit the u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.

4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.

So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can used the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.

Meanwhile, here are two more things to consider :

1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.

2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB1830 requirement, but things change.


Tibetan | Unicode

Index of BabelStone Blog Posts