2005-11-24
[Mirrored from http://babelstone.blogspot.com/2005/11/how-many-unicode-characters-are-there.html]
Otto Stolz asked on the Unicode List how many Unicode characters there were, classified as control characters, format characters, graphic characters, private use characters, noncharacters, surrogate code points, etc. Now I love Unicode facts, figures and trivia, so I can't resist trying to answer this question.
The "Unicode Version History" utility of BabelMap provides precisely the information requested by Otto for all versions of Unicode from 1.0.0 up to the current version (4.1 when I first wrote this post, but now updated to 6.1). This information is tabulated below:
[Unicode Slide Show : 110,116 characters, one at a time]
| Version | 1.0.0 | 1.0.1 | 1.1 | 2.0 | 2.1 | 3.0 | 3.1 | 3.2 | 4.0 | 4.1 | 5.0 | 5.1 | 5.2 | 6.0 | 6.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | October 1991 | June 1992 | June 1993 | July 1996 | May 1998 | September 1999 | March 2001 | March 2002 | April 2003 | March 2005 | July 2006 | April 2008 | October 2009 | October 2010 | January 2012 |
| Scripts | 24 | 25 | 24 | 25 | 25 | 38 | 41 | 45 | 52 | 59 | 64 | 75 | 90 | 93 | 100 |
| Blocks | 57 | 59 | 63 | 67 | 67 | 86 | 95 | 107 | 122 | 142 | 151 | 168 | 194 | 206 | 217 |
| Total Code Points | 65,536 | 65,536 | 65,536 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 |
| Assigned Code Points | 12,795 | 34,505 | 40,635 | 178,500 | 178,502 | 188,809 | 233,787 | 234,803 | 236,029 | 237,302 | 238,671 | 240,295 | 246,943 | 249,031 | 249,763 |
| Unassigned Code Points | 52,741 | 31,031 | 24,901 | 935,612 | 935,610 | 925,303 | 880,325 | 879,309 | 878,083 | 876,810 | 875,441 | 873,817 | 867,169 | 865,081 | 864,349 |
| Encoded Characters | 7,161 | 28,359 | 34,233 | 38,950 | 38,952 | 49,259 | 94,205 | 95,221 | 96,447 | 97,720 | 99,089 | 100,713 | 107,361 | 109,449 | 110,181 |
| Private Use Characters | 5,632 | 6,144 | 6,400 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 |
| Non characters | 2 | 2 | 2 | 34 | 34 | 34 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 |
| Surrogate Code Points | 0 | 0 | 0 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 |
| Graphic Characters | 7,085 | 28,283 | 34,151 | 38,867 | 38,869 | 49,168 | 94,009 | 95,023 | 96,243 | 97,515 | 98,884 | 100,507 | 107,154 | 109,242 | 109,975 |
| Format Characters | 2 | 2 | 2 | 18 | 18 | 26 | 131 | 133 | 139 | 140 | 140 | 141 | 142 | 142 | 141 |
| Control Characters | 74 | 74 | 80 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 |
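The rows of the table are related by simple sums: assigned plus unassigned code points give the total, the assigned code points split into encoded characters, private use characters, noncharacters and surrogates, and the encoded characters split into graphic, format and control characters. A small Python sketch of that check, using three columns copied from the table (the dictionary keys and the `consistent` helper are names made up for this illustration):

```python
# Figures copied from the table for three Unicode versions.
# The key names are illustrative, not taken from BabelMap.
rows = {
    "1.0.0": dict(total=65_536, assigned=12_795, unassigned=52_741,
                  encoded=7_161, private_use=5_632, nonchars=2, surrogates=0,
                  graphic=7_085, fmt=2, control=74),
    "4.1": dict(total=1_114_112, assigned=237_302, unassigned=876_810,
                encoded=97_720, private_use=137_468, nonchars=66, surrogates=2_048,
                graphic=97_515, fmt=140, control=65),
    "6.1": dict(total=1_114_112, assigned=249_763, unassigned=864_349,
                encoded=110_181, private_use=137_468, nonchars=66, surrogates=2_048,
                graphic=109_975, fmt=141, control=65),
}

def consistent(version):
    """Return True if the three sums hold for the given column."""
    r = rows[version]
    return (
        r["assigned"] + r["unassigned"] == r["total"]
        and r["encoded"] + r["private_use"] + r["nonchars"] + r["surrogates"]
            == r["assigned"]
        and r["graphic"] + r["fmt"] + r["control"] == r["encoded"]
    )

for version in rows:
    print(version, consistent(version))
```

Note that in this table "assigned" therefore covers every code point that has been given a designated use, including private use, surrogate and noncharacter code points, not only encoded characters.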
For historic versions of Unicode the statistics are based on the General Category of the characters at the time of encoding, and do not take into account any subsequent changes in General Category. Thus the fact that 4.0 has 139 format characters and 4.1 has 140 format characters is not due to a new format character having been added in 4.1, but rather due to the General Category of U+200B ZERO WIDTH SPACE having been changed from Zs to Cf in Unicode 4.0.1. Note that the statistics for 1.0.0 and 1.0.1 are based upon Ken Whistler's reconstructed Unicode Character Data.
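For the current version of Unicode, a similar breakdown can be approximated from General Category values alone. The following is only a sketch using Python's standard `unicodedata` module, not the method BabelMap uses. It assumes the grouping of General Categories given in Section 2.4 of the Unicode Standard (Cc is control; Cf, Zl and Zp are format; Co is private use; Cs is surrogate; everything else assigned is graphic), and the `classify` and `is_noncharacter` helpers are names invented for this example. The results reflect whichever Unicode version the Python interpreter bundles (see `unicodedata.unidata_version`), so the graphic, format and unassigned counts will differ from the table.

```python
import sys
import unicodedata
from collections import Counter

def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify(cp):
    """Map a code point to one of the classes used in the table."""
    if is_noncharacter(cp):
        return "noncharacter"
    cat = unicodedata.category(chr(cp))
    if cat == "Cs":
        return "surrogate"
    if cat == "Co":
        return "private use"
    if cat == "Cn":
        return "unassigned"
    if cat == "Cc":
        return "control"
    if cat in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"  # L*, M*, N*, P*, S*, Zs

counts = Counter(classify(cp) for cp in range(sys.maxunicode + 1))

print("Unicode version:", unicodedata.unidata_version)
for name, n in sorted(counts.items()):
    print(f"{name:>13}: {n:,}")
```

The surrogate, private use, noncharacter and control counts have been stable since Unicode 3.1, so they should match the corresponding rows of the table regardless of the bundled version.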
To help understand what we're talking about, here are definitions of some of the terms used in the table (see Section 2.4 of the Unicode Standard for further information).



