BabelStone Blog

Saturday, 26 November 2005

What's new in Unicode 5.0 ?

Suzanne asked in a comment to "How many Unicode characters are there?" whether Phoenician is included in the repertoire for Unicode 5.0, which will be coming out next year. The answer is yes. Here are some pages on the Unicode site where you can find out more about what is new for Unicode 5.0 :

Proposed New Scripts : the status of proposed new scripts
Proposed New Characters : the status of proposed new characters
Unicode 5.0 Data : the preliminary Unicode data files

You can also see charts for all the characters that are currently under final ballot for inclusion in Amendment 2 of ISO/IEC 10646:2003 in WG2 document N2991. Normally Unicode and ISO/IEC 10646 are synchronised, so that the repertoire for any given version of Unicode corresponds exactly to the repertoire for a given version of the ISO standard (plus a given amendment if appropriate). However, the 1,369 new characters in Unicode 5.0 correspond to the 1,365 characters in the FDAM2 (Final Draft Amendment 2) plus four characters from the PDAM3 (Proposed Draft Amendment 3). The four fast-tracked characters (U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA, U+097E DEVANAGARI LETTER DDDA and U+097F DEVANAGARI LETTER BBA) are extended Devanagari letters used for writing the Sindhi language.
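As a quick sanity check on those figures, here is a minimal Python sketch that looks up the four fast-tracked code points and adds them to the FDAM2 count. It assumes a Python whose bundled `unicodedata` tables are at Unicode 5.0 or later; the names `FDAM2_COUNT`, `FAST_TRACKED` and `describe` are my own labels for illustration, not anything defined by Unicode or ISO.

```python
import unicodedata

# Figures quoted in the post: the FDAM2 repertoire plus four PDAM3 letters.
FDAM2_COUNT = 1365
FAST_TRACKED = [0x097B, 0x097C, 0x097E, 0x097F]  # the four Sindhi letters


def describe(code_point):
    """Return 'U+XXXX NAME' using Python's bundled Unicode character data."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


sindhi_letters = [describe(cp) for cp in FAST_TRACKED]
total_new = FDAM2_COUNT + len(FAST_TRACKED)

for line in sindhi_letters:
    print(line)
print(f"New characters in Unicode 5.0: {total_new}")
```

The last line should report 1369, matching the total given for Unicode 5.0.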

In summary, Unicode 5.0 will include 1,369 new characters and 9 new blocks :

NKo [07C0..07FF] : 59 characters
Balinese [1B00..1B7F] : 121 characters
Latin Extended-C [2C60..2C7F] : 17 characters
Latin Extended-D [A720..A7FF] : 2 characters
Phags-pa [A840..A87F] : 56 characters
Phoenician [10900..1091F] : 27 characters
Cuneiform [12000..123FF] : 879 characters
Cuneiform Numbers and Punctuation [12400..1247F] : 103 characters
Counting Rod Numerals [1D360..1D37F] : 18 characters
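The per-block counts above do not add up to 1,369 on their own, because some of the new characters (the four Sindhi letters, for example) go into blocks that already exist. The following rough Python sketch tallies the list; the variable names are mine, and the only data used is the ranges and counts given above.

```python
# (block name, first code point, last code point, characters assigned in 5.0)
NEW_BLOCKS = [
    ("NKo", 0x07C0, 0x07FF, 59),
    ("Balinese", 0x1B00, 0x1B7F, 121),
    ("Latin Extended-C", 0x2C60, 0x2C7F, 17),
    ("Latin Extended-D", 0xA720, 0xA7FF, 2),
    ("Phags-pa", 0xA840, 0xA87F, 56),
    ("Phoenician", 0x10900, 0x1091F, 27),
    ("Cuneiform", 0x12000, 0x123FF, 879),
    ("Cuneiform Numbers and Punctuation", 0x12400, 0x1247F, 103),
    ("Counting Rod Numerals", 0x1D360, 0x1D37F, 18),
]
TOTAL_NEW_CHARACTERS = 1369  # total quoted for Unicode 5.0

in_new_blocks = sum(count for _, _, _, count in NEW_BLOCKS)
in_existing_blocks = TOTAL_NEW_CHARACTERS - in_new_blocks

for name, first, last, count in NEW_BLOCKS:
    size = last - first + 1  # block ranges are inclusive
    print(f"{name}: {count} of {size} code points assigned")

print(f"In new blocks: {in_new_blocks}")
print(f"Added to existing blocks: {in_existing_blocks}")
```

By this tally 1,282 of the new characters fall in the nine new blocks, which leaves 87 as additions to existing blocks.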

The new blocks cover five new scripts (ISO 15924 script codes in parentheses) :

NKo (Nkoo)
Balinese (Bali)
Phags-pa (Phag)
Phoenician (Phnx)
Cuneiform (Xsux)

Amendment 2 to ISO/IEC 10646:2003 has proved to be one of the most contentious episodes in the history of the standard, with controversy surrounding the encoding of NKo, Phags-pa and Phoenician.

There has been incessant bickering between the Irish and Canadian national bodies about many issues relating to the encoding of NKo, including whether certain old form characters should be encoded as separate characters or treated simply as glyph variants, what the characters should be named, and whether to use NKo-specific diacritic marks or script-neutral generic ones.

Whilst there have been no real disagreements about the repertoire for Phags-pa, two completely different encoding models were proposed for the script, one proposed by me and an alternative model proposed by Professor Choijinzhab of the Inner Mongolia University and supported by the Chinese and Mongolian national bodies. It took a great deal of discussion between the various parties to finally come to a mutually acceptable agreement. In many ways the disagreement over Phags-pa, whilst invisible to casual observers of Unicode and ISO/IEC 10646 (i.e. most of the participants in the Unicode public mailing list), was far more important than the disagreements relating to NKo or even Phoenician, as the arguments involved went to the very heart of the Unicode encoding philosophy and the meaning of what a character is. But more of this another time.

The Phoenician "debate" on the Unicode public mailing list will be remembered with disgust for many years to come. For months and months the list was swamped with thousands of postings from people who held diametrically opposed (and immovable) positions, and who constantly traded the same old arguments over and over again with extreme vitriol. Finally, and I think this was an unwarranted act of censorship, the subject was banned from the mailing list, and even now mentioning the "P" word on the list may lay you open to moderation. Personally I thought it was all a storm in a teacup ... and for the record (even though I did not take any part in the debate), I was in favour of encoding Phoenician.

In the first round of balloting of Amendment 2, Canada, China, Germany, Ireland and Japan all voted against approving the amendment (see N2876), although only Canada, China and Germany were seriously opposed to anything in the amendment (NKo, Phags-pa and Phoenician respectively). In the end, China agreed to the encoding of Phags-pa, largely as originally proposed, and changed its vote to Yes in the second round of balloting. However, the concerns of Canada and Germany, relating to NKo and Phoenician respectively, were not met, and they both voted No in the second round of balloting (see N2959 and N2990). Whilst WG2 (the working group for ISO/IEC 10646) always tries to reach a consensus on the actions it takes, at the end of the day a ballot is a ballot, and sometimes a majority decision is taken; and in this case there was a 14-to-2 majority in favour of accepting the amendment (with 16 abstentions or non-votes).

Addendum

Ken Whistler has reminded me that the Unicode standard and ISO/IEC 10646 have been temporarily out of sync on one previous occasion. In 1998, when Unicode 2.1 was released, it added only two characters from 10646-1:1993 Amendment 18 (U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER). As is the case with the four Sindhi characters that have been fast-tracked for Unicode 5.0, there was an urgent implementation requirement by the software industry (aka Microsoft), and the Unicode Consortium needed to ensure that the standard did not lag behind implementations (i.e. that Microsoft did not support the Euro before it was officially encoded). Still, Unicode 5.0 will be the first time in eight years that a version of the Unicode standard does not correspond exactly to a particular publication/amendment of the ISO standard.
