BabelStone Blog

Thursday, 31 August 2006

Tibetan Extensions 2 : Balti

N2985 is a proposal by Michael Everson to encode four additional Tibetan letters that are used for transcribing Urdu or Arabic words when writing the Balti language (nothing to do with curry) :

The first two of the proposed letters are uncontroversial as encoding a new character is the only way to deal with letters that are created by reversal of existing letters. However, the other two letters are problematic, as there is already an established mechanism for dealing with Tibetan letters that are created by the addition of an "ear" (Tibetan tsa 'phru ཙ་འཕྲུ་ or sbrang gsad སྦྲང་གསད་). This is explained in the Unicode Standard 4.0 section 9.11 (my highlighting) :

The sign U+0F39 TIBETAN MARK TSA -PHRU (tsa-’phru, which is a lenition mark) is the ornamental flaglike mark that is an integral part of the three consonants U+0F59 TIBETAN LETTER TSA, U+0F5A TIBETAN LETTER TSHA, and U+0F5B TIBETAN LETTER DZA. Although those consonants are not decomposable, this mark has been abstracted and may by itself be applied to “pha” and other consonants to make new letters for use in transliteration and transcription of other languages. For example, in modern literary Tibetan, it is one of the ways used to transcribe the Chinese “fa” and “va” sounds not represented by the normal Tibetan consonants. Tsa-’phru is also used to represent tsa, tsha, or dza in abbreviations.

Thus, according to the Unicode Standard it is quite clear that the Balti letters KHA and GA with a flag should be represented as a sequence of KHA plus TSA -PHRU <0F41 0F39> and GA plus TSA -PHRU <0F42 0F39> respectively. Why then are these two letters being proposed for encoding as distinct characters ?

Well, the main reason is a problem with U+0F39 : Tibetan text that uses this character may not render correctly once it has been normalized, and because of the infamous Stability Policy the problem cannot be fixed. The details are quite simple. U+0F39 has been assigned an immutable canonical combining class of 216, whereas Tibetan vowel signs have canonical combining classes of 129, 130 or 132. When text is normalized, combining marks that follow a base character are sorted into ascending order of combining class, so U+0F39 is reordered after any vowel signs. This is wrong, as U+0F39 should be more closely attached to the base consonant than vowel signs are.
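The reordering can be reproduced with any conformant normalizer. The following is a minimal sketch using Python's standard unicodedata module; the sequence KHA plus TSA -PHRU plus vowel sign U is simply an illustrative choice, and the helper name code_points is my own :

```python
import unicodedata

KHA = "\u0f41"       # TIBETAN LETTER KHA (base consonant, combining class 0)
TSA_PHRU = "\u0f39"  # TIBETAN MARK TSA -PHRU (combining class 216)
VOWEL_U = "\u0f74"   # TIBETAN VOWEL SIGN U (combining class 132)


def code_points(text):
    """Return the code points of text as space-separated hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# The order a user would type: the flag sits next to the base consonant.
logical = KHA + TSA_PHRU + VOWEL_U

# Canonical ordering sorts the marks by ascending combining class,
# so the vowel sign (132) moves in front of the flag (216).
normalized = unicodedata.normalize("NFC", logical)

print(unicodedata.combining(TSA_PHRU))  # 216
print(unicodedata.combining(VOWEL_U))   # 132
print(code_points(logical))             # 0F41 0F39 0F74
print(code_points(normalized))          # 0F41 0F74 0F39
```

NFD gives the same result for this sequence, since none of the three characters decomposes or composes.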

The end result is that under Windows Vista a base consonant followed directly by TSA -PHRU renders correctly, but when normalized so that a vowel sign falls between a base consonant and TSA -PHRU it renders incorrectly (images taken using Uniscribe version 1.0606.5112.0 and the Tibetan Machine Uni font) :

Note that this is only a problem if TSA -PHRU is followed by a vowel sign, and only if the text has been normalized. Personally I do not think that the defect in the canonical combining class of U+0F39 is a legitimate reason to encode these two letters as precomposed characters. The Microsoft rendering engine (Uniscribe) could and should be changed so that both normalized and unnormalized sequences render correctly.
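To illustrate the kind of change I have in mind, here is a rough sketch of a pre-shaping step that moves TSA -PHRU back to the front of each run of combining marks, so that normalized and unnormalized input reach the font in the same order. This is only an assumption about how a fix might be structured, not a description of how Uniscribe or any other engine actually works, and the function name move_tsa_phru_forward is hypothetical :

```python
import unicodedata

TSA_PHRU = "\u0f39"  # TIBETAN MARK TSA -PHRU


def move_tsa_phru_forward(text):
    """Within each run of combining marks, place TSA -PHRU first.

    A "run" here means consecutive characters whose canonical combining
    class is non-zero, i.e. exactly the marks that normalization is
    allowed to reorder among themselves.
    """
    out = []
    run = []  # combining marks collected since the last class-0 character

    def flush():
        # Emit any TSA -PHRU first, then the other marks in their existing order.
        out.extend([c for c in run if c == TSA_PHRU])
        out.extend([c for c in run if c != TSA_PHRU])
        run.clear()

    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


# Normalized order (vowel sign before flag) is restored to logical order.
print(move_tsa_phru_forward("\u0f41\u0f74\u0f39") == "\u0f41\u0f39\u0f74")  # True
```

The result is deliberately no longer in normalized order, so a step like this belongs inside the rendering pipeline only; the stored text should be left untouched.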

The real problem with accepting precomposed characters for these two cases is that there are other letters that are formed by the addition of a flag, and if it is deemed necessary to encode KHA plus TSA -PHRU and GA plus TSA -PHRU separately, then all the other transliteration letters formed with TSA -PHRU should also be encoded separately. There are quite a few of these :

Incidentally, [f] and [v] are nowadays more commonly represented using HA plus SUBJOINED PHA <0F67 0FA5> ཧྥ and HA plus SUBJOINED BA <0F67 0FA6> ཧྦ.

In addition to these seven atomic letters that are composed using TSA -PHRU, TSA -PHRU is also attached to a variety of letters when writing informal shorthand contractions (bskungs-yig བསྐུངས་ཡིག་ "concealed writing" or bsdu-yig བསྡུ་ཡིག་ "conglomerated writing"), where it is used to represent the letters U+0F59 TSA, U+0F5A TSHA, U+0F5B DZA or U+0F5F ZA. The letters it attaches to include U+0F42 GA, U+0F44 NGA, U+0F46 CHA (!), U+0F51 DA, U+0F53 NA, U+0F55 PHA, U+0F56 BA, U+0F62 RA (on the ra mgo in the combination RGYA) and U+0F63 LA, as can be seen in the following examples :

Thus, if the precomposed letters KHHA (KHA plus TSA -PHRU) and GHHA (GA plus TSA -PHRU) are accepted for encoding, then precomposed forms of at least seven other letters plus TSA -PHRU also need to be encoded, and given its common use in shorthand contractions attached to almost any letter, it may be prudent to simply encode TSA -PHRU versions of all thirty plus Tibetan consonants. Not only would this be a major change to the Tibetan encoding model, but due to the constraints of the Stability Policy none of the precomposed letters with TSA -PHRU would be canonically equivalent to the decomposed forms that are currently in use. This would introduce all sorts of problems with legacy data. Moreover, people would still be able to compose transcription letters using TSA -PHRU rather than using the new precomposed letters if they wanted (TSA -PHRU could be deprecated but once encoded a character can never be removed from the standard), thus resulting in multiple non-equivalent spellings. All in all, I believe that such a change would be disastrous, causing chaos for years to come.

So, in conclusion, I hope to see TIBETAN LETTER KKA and TIBETAN LETTER RRA encoded soon (preferably in Unicode 5.1), and Tibetan rendering engines and/or fonts modified so that they render Consonant plus TSA -PHRU sequences correctly with or without intervening vowel signs.

Addendum [2006-10-08]

As I hoped, at the September 2006 Tokyo meeting of WG2 it was agreed to encode TIBETAN LETTER KKA and TIBETAN LETTER RRA at U+0F6B and U+0F6C respectively, and not to encode the two precomposed letters with TSA -PHRU. The two accepted letters are under final ballot for inclusion in ISO/IEC 10646:2003 Amd.3, which will correspond to Unicode 5.1.

