BabelStone Blog

Sunday, 1 November 2009

What's new in Unicode 6.0 ?

Previously discussed :

[2010-10-11 : Unicode 6.0 was released on the 11th October 2010.]

[2010-08-30 : The Indian Rupee Sign (see N3862) has now been accepted for fast-tracking into Unicode 6 at U+20B9 by the Unicode Technical Committee, although it is not in either of the corresponding amendments of ISO/IEC 10646, which will cause a temporary desynchronization between the two standards until Unicode 6.1.]

[2010-06-02 : Unicode 6.0 is now in Beta, and is scheduled for release on or about the 11th October 2010.]

[2010-04-24 : The character repertoire, code points and character names for Unicode 6.0 are now fixed.]

Now that Unicode 5.2 has been out for a month, I think that it would be a good idea to look forward to Unicode 6.0, which is scheduled for release in late 2010. Unicode 6.0 will correspond to a new (2nd) edition of ISO/IEC 10646 (ISO/IEC 10646:2010), which itself corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 8. Of these, Amendments 7 and 8 include 2,087 new characters that are not in Unicode 5.2 (if this is confusing, it might be helpful to try reading my post on the relationship between Unicode and ISO/IEC 10646). On top of these, Unicode 6.0 adds the Indian Rupee Sign (U+20B9), which is not yet included in ISO/IEC 10646. In summary, Unicode 6.0 will have a total of 109,449 characters in 206 blocks covering 93 scripts.
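As a quick sanity check on these figures, here is a small Python sketch of the arithmetic. The per-amendment counts are the ones given in the amendment headings further down this post; the Unicode 5.2 total is not quoted anywhere in this post, so the last line only derives the figure that the other numbers imply.

```python
# Character counts quoted in this post.
amendment_7 = 225        # characters added by Amendment 7
amendment_8 = 1_862      # characters added by Amendment 8
rupee_sign = 1           # U+20B9, fast-tracked into Unicode 6.0 only
total_unicode_6_0 = 109_449

# New characters coming from the two ISO/IEC 10646 amendments.
from_amendments = amendment_7 + amendment_8

# New characters in Unicode 6.0 once the Indian Rupee Sign is counted.
new_in_unicode_6_0 = from_amendments + rupee_sign

# Assumption: everything else in the 6.0 total was already in Unicode 5.2,
# so this is the 5.2 total implied by the figures above.
implied_unicode_5_2_total = total_unicode_6_0 - new_in_unicode_6_0

print(from_amendments)            # 2087
print(new_in_unicode_6_0)         # 2088
print(implied_unicode_5_2_total)  # 107361
```

So the 2,087 figure counts only the characters shared with ISO/IEC 10646, and the Indian Rupee Sign brings the number of characters new to Unicode 6.0 up to 2,088.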

Because of problems with the fonts for the CJK-B block, the 2nd edition of ISO/IEC 10646 will have a multi-column format for the CJK, CJK-A, CJK-C and CJK-D blocks, but the large CJK-B block (42,711 characters) will be presented in a single column format with a single font. In order to rectify this failing at the earliest opportunity, it has been decided to immediately start work on yet another new edition of the standard (the 3rd edition) instead of publishing a series of amendments as is normally the case. A summary of the additions which will be made to the 3rd edition (which will correspond to the version of Unicode after 6.0) is available here.

Whereas Unicode 5.2 saw the encoding of fifteen new scripts and a total of 6,648 new characters, Unicode 6.0 only has three new scripts (Mandaic, Batak and Brahmi) and a total of 2,087 new characters. Nevertheless, Unicode 6.0 includes some of the most controversial additions to the standard for a long time. In particular, the addition of a large set of characters corresponding to Japanese Emoji 絵文字 used on mobile phones has been the cause of much heated debate (original proposal documents N3582 and N3583). Google and Apple have pushed hard for the encoding of emoji in Unicode in order to solve interoperability issues between the various vendors, who currently use different variants of emoji at different private-use code points. Two groups of emoji in particular have caused a lot of contention.

Firstly, a group of five characters representing specific cultural icons (Mount Fuji, Tokyo Tower, Statue of Liberty, Silhouette of Japan and Statue of Moyai) have been vigorously opposed because they give the appearance of setting a precedent for encoding hundreds of other characters representing cultural or nationalistic icons, such as the Great Wall of China, the Pyramids of Giza, the Eiffel Tower, Tower Bridge, Mount Kilimanjaro, etc. Some of us would have preferred to encode generic versions of these characters (e.g. Snow-Capped Mountain instead of Mount Fuji), but Google insisted that these characters had specific semantics that generic versions of the characters would not be able to represent, so in the end they were accepted as is. Note, however, that they are not precedents for encoding other characters representing cultural icons, as they were not encoded because of the importance of the objects these characters represent, but for interoperability reasons (cross-mapping to existing emoji codes). Of course, if mobile phone vendors start adding emoji for the Great Wall of China, etc. then ....

Secondly, a group of ten characters representing the flags of ten specific countries (People's Republic of China, Germany, Spain, France, the UK, Italy, Japan, Korea, Russia and the US) caused a great deal of consternation, as it seemed unreasonable to encode flag symbols for a few select countries and not for others. Two solutions were put forward to solve the problem. The US proposed encoding them as ten characters named EMOJI COMPATIBILITY SYMBOL-n with a glyph shape comprising EC-n in a dashed box (i.e. completely hide the fact that these characters map to emoji map symbols). On the other hand, Ireland and Germany proposed encoding 256 characters representing all currently assigned ISO 3166 two-letter country codes (see N3680). Neither of these proposals was acceptable to the other parties, and in the end a compromise solution to encode twenty-six "regional indicator symbols" (see N3727) was accepted. These characters may be combined into two-character sequences corresponding to ISO 3166 two-letter country codes, and applications may then render such sequences with the corresponding country flag. Of course, this does not provide a solution for the representation of flags for countries and regions that do not have an ISO 3166 two-letter code. For example, mobile phone vendors may want to display the Welsh flag in order to indicate Welsh language (GB-WLS) options, but could not do so using the currently defined "regional indicator symbols" mechanism.
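To make the "regional indicator symbols" mechanism a little more concrete, here is a minimal Python sketch of how a two-letter country code could be mapped to a pair of these characters. It assumes the twenty-six symbols occupy U+1F1E6 through U+1F1FF in A to Z order; the function name `flag_sequence` is my own invention for illustration and is not part of any standard or library. Whether the resulting pair is actually displayed as a flag depends entirely on the application and font.

```python
# Assumed code point of REGIONAL INDICATOR SYMBOL LETTER A; the other
# twenty-five letters are taken to follow consecutively up to U+1F1FF.
REGIONAL_INDICATOR_A = 0x1F1E6


def flag_sequence(country_code: str) -> str:
    """Map a two-letter country code to a pair of regional indicator symbols.

    Hypothetical helper for illustration only.
    """
    code = country_code.upper()
    # Only exactly two ASCII letters can be expressed with this mechanism.
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError(f"not a two-letter code: {country_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)


# Japan: two characters, which an application may choose to render as a flag.
jp = flag_sequence("JP")
print([f"U+{ord(ch):04X}" for ch in jp])  # ['U+1F1EF', 'U+1F1F5']

# A subdivision code such as GB-WLS is not a two-letter code, so it
# cannot be expressed with a single pair of regional indicator symbols.
try:
    flag_sequence("GB-WLS")
except ValueError as err:
    print(err)
```

The second call shows the limitation described above: the mechanism only covers two-letter codes, so there is no pair of regional indicator symbols for Wales.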

The encoding of emoji has opened up the standard to the encoding of other related symbols that were traditionally considered outside the scope of character encoding (e.g. transport and map symbols, and symbols for playing cards), so in addition to characters deriving from emoji usage you will find in Unicode 6.0 many other symbols that have been proposed for encoding (see the expanded emoji proposal by Ireland and Germany).

Amendment 7 [225 characters]

Amendment 7 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 7 are available here.

New Scripts

New Blocks

Additions to Existing Blocks

Amendment 8 [1,862 characters]

Amendment 8 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 8 are available here.

Please note that the original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks, and underwent extensive changes. If you wish to follow the paper trail from original proposal to final allocation then you should peruse the following documents:

New Scripts

New Blocks

Additions to Existing Blocks

Unicode 6.0 Fonts

The following are some free or shareware fonts that include some of the characters added in Unicode 6.0:

In addition, the following fonts include the newly-invented Indian Rupee Sign U+20B9 ₹:

And if you have the fonts and want to look through all the 109,384 characters in Unicode 6.0, check out my Unicode Slide Show.


