BabelStone Blog

Tuesday, 8 January 2013

What's new in Unicode 6.3 ?



If you were a little disappointed in Unicode 6.2, released in September 2012, which added only a single new character to the Unicode Standard (U+20BA ₺ TURKISH LIRA SIGN), then you may be hoping that Unicode 6.3, scheduled for release in September 2013, will provide a more substantial addition of new characters. If that is the case then you will probably be disappointed by Unicode 6.3 as well. The version of Unicode that includes the characters currently in the process of being added to ISO/IEC 10646:2012 (3rd ed.) Amendment 1 (Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, etc.) and Amendment 2 (Caucasian Albanian, Psalter Pahlavi, Old Hungarian, Mahajani, Grantha, Modi, Pahawh Hmong, Mende, etc.) will not be released until some time in 2014 (as Unicode version 7.0). In the meantime, Unicode 6.3 is being released, largely to accommodate various changes to character properties. This was originally going to be an update version (6.2.1) with no new characters, but it will now include five new characters fast-tracked from ISO/IEC 10646:2012 Amendment 2, all of which are special bidirectional control format characters with no visible glyph: U+061C (ARABIC LETTER MARK), U+2066 (LEFT-TO-RIGHT ISOLATE), U+2067 (RIGHT-TO-LEFT ISOLATE), U+2068 (FIRST STRONG ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE).

These five characters are being encoded as part of a significant revision of the Unicode Bidirectional Algorithm (UAX #9) to allow the implementation of isolate runs; a draft of the revised algorithm is available.
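As a rough illustration (a sketch, not taken from the Unicode documentation), the five new format characters, U+061C and U+2066 to U+2069, can be inspected with Python's standard `unicodedata` module. This assumes a Python build whose Unicode database is version 6.3 or later. The `isolate` helper is a hypothetical name introduced here to show one typical use of the isolate controls: wrapping text of unknown direction so that it does not disturb the surrounding bidirectional ordering.

```python
import unicodedata

# The five bidirectional format characters new in Unicode 6.3.
BIDI_CONTROLS_6_3 = ["\u061C", "\u2066", "\u2067", "\u2068", "\u2069"]


def describe(ch):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


def isolate(text):
    """Hypothetical helper: wrap text in FIRST STRONG ISOLATE ... POP DIRECTIONAL ISOLATE."""
    return "\u2068" + text + "\u2069"


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    for ch in BIDI_CONTROLS_6_3:
        print(*describe(ch))
```

All five characters have general category Cf (format), which is why they have no visible glyph of their own.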

What else can we say about Unicode 6.3 ?

The UTC is still tweaking Cuneiform numeric properties, this time changing the numeric values of U+12456 𒑖 (CUNEIFORM NUMERIC SIGN NIGIDAMIN) and U+12457 𒑗 (CUNEIFORM NUMERIC SIGN NIGIDAESH) from "-1" to "2" and "3" respectively. Two misnamed Cuneiform characters, U+122D4 𒋔 (CUNEIFORM SIGN SHIR TENU) and U+122D5 𒋕 (CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR), will also be given the formal aliases of "CUNEIFORM SIGN NU11 TENU" and "CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR" respectively (implementations may use these aliases instead of the incorrect but immutable formal character names).
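The property changes described above can be checked with a short Python sketch. It assumes a Python build whose `unicodedata` module carries Unicode 6.3 or later; on an older database the numeric values would differ and the alias lookups would raise `KeyError`. The names `NUMERIC_SIGNS`, `CORRECTED_NAMES`, `numeric_values` and `aliases_resolve` are made up for this illustration.

```python
import unicodedata

# The two Cuneiform numeric signs whose numeric values change in Unicode 6.3.
NUMERIC_SIGNS = ("\U00012456", "\U00012457")

# Formal aliases (corrections) mapped to the characters they should resolve to.
CORRECTED_NAMES = {
    "CUNEIFORM SIGN NU11 TENU": "\U000122D4",
    "CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR": "\U000122D5",
}


def numeric_values():
    """Map each code point to its numeric value (None if the database has none)."""
    return {f"U+{ord(ch):04X}": unicodedata.numeric(ch, None) for ch in NUMERIC_SIGNS}


def aliases_resolve():
    """Check that every alias looks up to the expected character."""
    return all(unicodedata.lookup(alias) == ch for alias, ch in CORRECTED_NAMES.items())


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    print(numeric_values())
    print("aliases resolve:", aliases_resolve())
    # The immutable formal names are unchanged; only the aliases are new.
    for ch in CORRECTED_NAMES.values():
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```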

Standardized Variation Sequences for CJK Unified Ideographs

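The other significant change will be the addition of 1,002 standardized variants for CJK Unified Ideographs, corresponding to the 1,002 CJK Compatibility Ideographs in the CJK Compatibility Ideographs and CJK Compatibility Ideographs Supplement blocks, as an alternative, round-trippable mechanism for representing compatibility ideographs. This means that, confusingly, there will now be two different types of variation sequences for CJK ideographs: standardized variation sequences, which are defined by the Unicode Standard itself and use the variation selectors U+FE00..U+FE0F; and ideographic variation sequences, which are registered in the Ideographic Variation Database (IVD) and use the variation selectors U+E0100..U+E01EF.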

Variation sequences were originally intended as a mechanism for defining specific glyph variants of a Unicode character, but the IVD (Ideographic Variation Database) registration process allows (and even encourages) the registration of more than one variation sequence for the same glyph form of a CJK character. The IVD already includes many thousands of such duplicate variation sequences. For example, there are currently 31 registered variation sequences for U+9089 邉 (itself a variant form of the character U+908A 邊 biān), 15 in the Adobe-Japan1 collection and 16 in the Hanyo-Denshi collection, but only 24 distinguishable glyph variants.

The Unicode Standard 6.3 will define the visual appearance of standardized variants for CJK unified ideographs as being the same as the corresponding CJK compatibility ideograph in the Unicode code charts (and presumably where the charts give different national or regional glyph forms for the same compatibility ideograph, as in the case of U+F907, the glyph form of the variation sequence will not be fixed but will depend upon the national or regional context in which it is used). However, the great majority of CJK compatibility ideographs represent pronunciation variants (i.e. where a single character has multiple readings, and in some pre-Unicode national standard the different readings of the same character were assigned different code points). Therefore, in most cases the glyph form of a CJK compatibility ideograph is identical to the national or regional glyph form of the corresponding unified ideograph, with the result that a large proportion of the 1,002 standardized variants for CJK unified ideographs defined in Unicode version 6.3 do not define any meaningful difference in visual appearance from the base character by itself.

As an example, there are three Korean pronunciation variants of the character U+6A02 樂 encoded as compatibility ideographs: U+F914 (K0-5162), U+F95C (K0-5525) and U+F9BF (K0-6879). These will correspond to the standardized variation sequences <6A02, FE00>, <6A02, FE01> and <6A02, FE02> respectively, but all three variation sequences have the same glyph appearance as each other and the same glyph appearance as the Chinese, Japanese, Korean and Vietnamese glyph forms for U+6A02 in the Unicode code charts. Furthermore, the registered ideographic variation sequences <6A02, E0100> (Adobe-Japan1 CID+5276) and <6A02, E0101> (Hanyo-Denshi JA6059) share exactly the same glyph form. Thus, the variation sequences <6A02, FE00>, <6A02, FE01>, <6A02, FE02>, <6A02, E0100> and <6A02, E0101> are all visually indistinguishable, and except in Taiwan and Hong Kong they are essentially identical to U+6A02 by itself.
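To make the "round-trippable" point concrete, here is a minimal Python sketch using the three sequences from the example above. It relies on the fact that CJK compatibility ideographs have canonical singleton decompositions, so any normalization form replaces them with the unified ideograph, whereas a base character followed by a variation selector is left untouched. The names `STANDARDIZED` and `survives_normalization` are invented for this illustration.

```python
import unicodedata

# Compatibility ideograph -> standardized variation sequence (from the example above).
STANDARDIZED = {
    "\uF914": "\u6A02\uFE00",
    "\uF95C": "\u6A02\uFE01",
    "\uF9BF": "\u6A02\uFE02",
}


def survives_normalization(text, form="NFC"):
    """Return True if normalizing text to the given form leaves it unchanged."""
    return unicodedata.normalize(form, text) == text


if __name__ == "__main__":
    for compat, sequence in STANDARDIZED.items():
        print(
            f"U+{ord(compat):04X}",
            "compat survives:", survives_normalization(compat),
            "sequence survives:", survives_normalization(sequence),
        )
```

After normalization all three compatibility ideographs collapse to plain U+6A02 and can no longer be told apart, while the three variation sequences remain distinct in the text even though they look the same.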

National and regional glyph forms for U+6A02

Compatibility ideographs U+F914, U+F95C and U+F9BF (which define the visual appearance of <6A02, FE00>, <6A02, FE01> and <6A02, FE02> respectively)

Ideographic Variation Sequences for U+6A02
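The two registered sequences for U+6A02 quoted earlier can also be handled programmatically. The sketch below assumes the semicolon-separated layout used by the IVD data file (`base selector; collection; identifier`, with `#` starting a comment); the `SAMPLE` text contains only the two entries mentioned in this post, and `parse_ivd` is a hypothetical helper name.

```python
from collections import defaultdict

# Two entries quoted in this post, written in the assumed IVD data-file layout.
SAMPLE = """\
# base selector; collection; identifier
6A02 E0100; Adobe-Japan1; CID+5276
6A02 E0101; Hanyo-Denshi; JA6059
"""


def parse_ivd(text):
    """Group (sequence, collection, identifier) tuples by base character."""
    by_base = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        sequence, collection, identifier = (field.strip() for field in line.split(";"))
        base_hex, selector_hex = sequence.split()
        base = chr(int(base_hex, 16))
        selector = chr(int(selector_hex, 16))
        by_base[base].append((base + selector, collection, identifier))
    return dict(by_base)


if __name__ == "__main__":
    for base, entries in parse_ivd(SAMPLE).items():
        print(f"U+{ord(base):04X}: {len(entries)} registered sequence(s)")
        for sequence, collection, identifier in entries:
            print("  ", " ".join(f"{ord(c):04X}" for c in sequence), collection, identifier)
```

Counting entries per base character in this way shows how many sequences are registered, but not how many of them share a glyph; that still requires comparing the glyphs themselves.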

This is the first time that standardized variation sequences have been defined for character variants that are not visual variants, and perhaps opens the door to using standardized variation sequences to define semantic or phonetic variants of characters from other scripts (which would be a bad thing, in my opinion).


