BabelStone Blog

Monday, 8 January 2007

Vanished in the Twinkling of an Eye

Here's an entry for a Chinese character meaning "to blink" or "to twinkle" in a standard Chinese-English dictionary :

Han-Ying Cidian 漢英詞典 (Shangwu Yinshuguan, 1980) p.594.

Notice anything funny about it ? No ? Then try typing it up using Unicode ... oops, now where is that character ? ... you know, the one with a 目 radical and shǎn 㚒 phonetic.

Look as hard as you like, but not one of the 70,229 unified ideographs that have already been encoded matches the head character in this entry. A mistake or an obscure character perhaps (really obscure not to be in CJK-B) ? But no, the same character is also found in the standard pocket dictionary of Chinese characters :

Xinhua Zidian 新華字典 (Shangwu Yinshuguan, 5th ed., 1979) p.398.

And the best one-volume dictionary of modern Chinese :

Xiandai Hanyu Cidian 現代漢語詞典 (Shangwu Yinshuguan, 2nd ed., 1983) p.998.

So if it is such a common character, how come it's not in Unicode ? Well, the answer is that officially it is in Unicode, just that it's been unified with the very similar character U+4039 䀹. Notice how the character we are interested in has a shǎn 㚒 phonetic, but U+4039 has a jiā 夾 phonetic. U+4039 is an uncommon character which does not occur in any of the three dictionaries cited above, but if you look in the big dictionaries you will find both characters (one with a shǎn 㚒 phonetic, and one with a jiā 夾 phonetic), treated as separate characters in their own entries.

Here's the entry in the great Kangxi Dictionary :

Kangxi Zidian 康熙字典 (Zhonghua Shuju, 1958) p.809.

Let's see how these two entries are handled in an electronic Kangxi dictionary (click on the 康熙字典 tab) -- hmm, the two entries are conflated, but the on-line editor has had to add the apologetic note "䀹原字从㚒,不从夾。" [the character 䀹 was originally written with the shǎn 㚒 component, not the jiā 夾 component] to the first entry. Not very satisfactory !

And here's the entry in the Chinese answer to the OED :

Hanyu Dacidian 漢語大詞典 (Hanyu Dacidian Chubanshe, 1991) vol.7 pp.1221-1222.

Hmm, clearly two distinct characters. So why are they unified in Unicode ? Well, I don't believe they should be unified as the rules for CJK unification are that two characters should not be unified if a source dictionary treats them as distinct lexical items.

By now the most tenacious of my readers will have discovered that although there isn't a unified ideograph corresponding to the character in hand, there is a compatibility ideograph that looks just like it, viz. U+FAD4. So why not use U+FAD4 for this character to distinguish it from U+4039 ? Well, compatibility ideographs are not real Chinese ideographs at all, they only sometimes look as if they are, but with a wave of the magic wand of normalization they vanish away. Or in more prosaic words, U+FAD4 is canonically equivalent to U+4039, which means that any conformant Unicode process can convert U+FAD4 to U+4039 in the twinkle of an eye, without so much as a by-your-leave. So, it is not useful to try to represent our character in permanent electronic text form using its lookalike compatibility ideograph.
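The vanishing act is easy to reproduce. Here is a minimal Python sketch, assuming nothing beyond the standard library's unicodedata module (the constant and variable names are my own, chosen for illustration), that runs U+FAD4 through each of the four normalization forms :

```python
import unicodedata

COMPAT = "\uFAD4"   # compatibility ideograph that looks like the shǎn-phonetic character
UNIFIED = "\u4039"  # unified ideograph with the jiā phonetic

# Apply each of the four standard normalization forms to U+FAD4.
results = {form: unicodedata.normalize(form, COMPAT)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, text in results.items():
    # Print the code point that survives normalization.
    print(form, "U+%04X" % ord(text))
```

Because the mapping is a canonical singleton decomposition, every form, including the composing forms NFC and NFKC, should come back as U+4039 rather than U+FAD4.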

For this reason John Jenkins and I have submitted a proposal to disunify U+4039, which will hopefully see this woefully overlooked character encoded as a unified ideograph in its own right before very long.

Addendum [2007-02-05]

Kenneth Whistler has pointed out that there is one more compatibility ideograph, U+2F949, which is canonically equivalent to U+4039. Whereas the reference glyph for U+FAD4 is the same as the missing character, the reference glyph for U+2F949 is the same as U+4039. So if a new character is created for the shǎn-phonetic character and the jiā-phonetic character remains assigned to U+4039, then the compatibility ideograph U+2F949 will have the correct canonical equivalence to U+4039, but the compatibility ideograph U+FAD4 will be left with an incorrect canonical equivalence. On the other hand, if a new character is created for the jiā-phonetic character, and the shǎn-phonetic character is assigned to U+4039, then the compatibility ideograph U+FAD4 will have the correct canonical equivalence to U+4039, but the compatibility ideograph U+2F949 will be left with an incorrect canonical equivalence. So whichever of these two solutions is chosen (if either), one of the compatibility ideographs will be left with an incorrect canonical equivalence. We will just have to wait and see what the relevant committees decide to do about it.
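The two competing equivalences can be read straight out of the Unicode Character Database. The following is a rough Python sketch, assuming only the standard unicodedata module; canonical_target is a hypothetical helper of my own, not part of any library :

```python
import unicodedata

def canonical_target(ch):
    """Return the code point that a singleton canonical decomposition
    maps ch to, or None if ch has no such decomposition."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility (non-canonical) mappings start with a tag such as "<compat>".
    if not decomp or decomp.startswith("<"):
        return None
    parts = decomp.split()
    if len(parts) != 1:
        return None
    return int(parts[0], 16)

# Look up both compatibility ideographs discussed in this addendum.
targets = {cp: canonical_target(chr(cp)) for cp in (0xFAD4, 0x2F949)}

for cp, target in targets.items():
    print("U+%04X -> U+%04X" % (cp, target))
```

Both lookups should report U+4039, which is the nub of the problem : a single unified ideograph is the canonical target of two compatibility ideographs whose reference glyphs differ.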

Addendum [2007-05-13]

The proposal to disunify U+4039 was the subject of much discussion at the recent WG2 meeting in Frankfurt, which resulted in a decision to encode the shǎn character at the earliest opportunity. To this end the new character has been included in the additions to ISO/IEC 10646 under ballot as Amendment 4, in the basic CJK block at U+9FC3, and if all goes to plan it will make its debut in Unicode 5.1 this time next year.

I have also revised and expanded the disunification proposal (N3196) with further examples of usage and evidence for disunification.

