1,111,998 vs 1,114,112

Dr Ken Lunde
Dec 2, 2022

By Dr Ken Lunde

Very large figures have always intrigued me, perhaps as a result of my 30+ years of experience developing East Asian fonts and working with East Asian standards. And, like many things about which I write, these seemingly large figures are related to the Unicode Standard. This brief article will explore what these figures mean, how they are derived, and how they are used in real-world applications, particularly fonts.

I first encountered these figures in a practical sense when I designed (heh) and developed the open source typeface named Adobe Blank in early 2013, which was nearly 10 years ago. Its font simply maps all available code points in the Unicode Standard, to the tune of 1,111,998, to non-spacing and non-marking glyphs, and is still being used today in the context of web fonts to avoid what some refer to as FOUT (flash of unstyled text). This is the first figure. Adobe Blank 2 and Adobe Blank VF, which I developed in 2015 and 2019, respectively, continued to use this figure as part of their development.

So, how does one arrive at such a large figure?

The Unicode Standard is composed of 17 planes, the first of which is the Basic Multilingual Plane (aka BMP). Among the 16 Supplementary Planes, only the following six are currently being used: 1 (Supplementary Multilingual Plane or SMP), 2 (Supplementary Ideographic Plane or SIP), 3 (Tertiary Ideographic Plane or TIP), 14 (Supplementary Special-purpose Plane or SSP), 15 (Supplementary Private Use Area-A), and 16 (Supplementary Private Use Area-B). Each plane consists of 65,536 code points that can be represented using 16 bits. With a calculator in hand, 17 × 65,536 would naturally mean 1,114,112 available code points, which is the second figure. The difference between the two figures is 2,114.
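For those without a calculator in hand, the arithmetic can be checked with a few lines of Python. This is only a minimal sketch, and the constant names are my own illustrative choices rather than anything defined by the Unicode Standard.

```python
# Illustrative names only; none of these identifiers come from the standard.
PLANES = 17                       # the BMP plus 16 Supplementary Planes
CODE_POINTS_PER_PLANE = 1 << 16   # 65,536 code points, i.e. 16 bits per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
adobe_blank_mappings = 1_111_998  # the first figure, as mapped by Adobe Blank

difference = total_code_points - adobe_blank_mappings

print(f"{total_code_points:,}")   # 1,114,112
print(f"{difference:,}")          # 2,114
```

The total is also one more than the highest code point, 10FFFF, because code points are counted from zero.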

The distinguishing factor is the use of the word available near the end of the previous paragraph, or rather its misuse. Not all code points are available for exchange! In other words, 2,114 code points in the Unicode Standard are unavailable for general use.

2,048 of these 2,114 unavailable code points correspond to the High and Low Surrogates, which are in the BMP, and which are necessary for the UTF-16 encoding form to support the 16 Supplementary Planes. 1,024 of these code points correspond to the High Surrogates Area (D800..DBFF), and the remaining 1,024 naturally correspond to the Low Surrogates Area (DC00..DFFF). This then leaves 66 code points that are unavailable for other reasons.
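To illustrate why the surrogates are reserved, here is a short Python sketch of how UTF-16 pairs one High Surrogate with one Low Surrogate to reach a code point in a Supplementary Plane. The function name to_surrogate_pair and the range constants are hypothetical names of my own, and the arithmetic follows the usual description of the UTF-16 encoding form.

```python
# Hypothetical helper names, used here for illustration only.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800..DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00..DFFF


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a code point in a Supplementary Plane")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits select the High Surrogate
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits select the Low Surrogate
    return high, low


surrogate_count = len(HIGH_SURROGATES) + len(LOW_SURROGATES)
pair_count = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

print(surrogate_count)                 # 2048
print(pair_count == 16 * 65536)        # True

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")         # D83D DE00
```

The 1,024 × 1,024 possible pairs cover exactly the 16 Supplementary Planes, which is why 2,048 code points in the BMP are set aside for this purpose.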

These 66 code points correspond to noncharacters, which are intended for internal use, not for exchange. There are two noncharacters at the very end of the BMP and at the very end of each of the 16 Supplementary Planes, which can be represented as nFFFE and nFFFF, whereby n corresponds to the hexadecimal values 0x0 through 0x10, meaning 34 in total. In other words, the following 17 code-point ranges: FFFE..FFFF, 1FFFE..1FFFF, 2FFFE..2FFFF, 3FFFE..3FFFF, 4FFFE..4FFFF, 5FFFE..5FFFF, 6FFFE..6FFFF, 7FFFE..7FFFF, 8FFFE..8FFFF, 9FFFE..9FFFF, AFFFE..AFFFF, BFFFE..BFFFF, CFFFE..CFFFF, DFFFE..DFFFF, EFFFE..EFFFF, FFFFE..FFFFF, and 10FFFE..10FFFF. The remaining 32 noncharacters are in the Arabic Presentation Forms-A block, in the range FDD0..FDEF.
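The counts above can be verified by brute force. The following Python sketch walks every code point from 0 through 10FFFF, flags the noncharacters and surrogates, and subtracts them from the total. The function names is_noncharacter and is_surrogate are my own, not part of any library.

```python
# Hypothetical predicate names, written for this illustration.
def is_noncharacter(cp: int) -> bool:
    """True for FDD0..FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for the High and Low Surrogates, D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF


all_code_points = range(0x110000)  # 0..10FFFF, all 17 planes

noncharacters = [cp for cp in all_code_points if is_noncharacter(cp)]
surrogate_total = sum(1 for cp in all_code_points if is_surrogate(cp))
available = len(all_code_points) - surrogate_total - len(noncharacters)

print(len(noncharacters))   # 66
print(surrogate_total)      # 2048
print(f"{available:,}")     # 1,111,998
```

The mask test works because the lower 16 bits of nFFFE and nFFFF are FFFE and FFFF regardless of the plane number n.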

Adobe Blank and its derivatives support only 1,111,998 code points, and not 1,114,112, because they are designed to cover a superset of the mappings found in mainstream fonts, meaning fonts that serve as system or authoring (aka document) fonts. However, there exist fonts that are not considered mainstream, scarce as they are, and the best examples are so-called last resort or fallback fonts. Such fonts are designed to support arbitrary code points, including those that would not normally be supported by mainstream fonts, such as the High and Low Surrogates and noncharacters. It is therefore appropriate for last resort fonts to support all 1,114,112 code points.

The Unicode Consortium makes available, via an open source license, last resort fonts—Last Resort and Last Resort High-Efficiency—that map all 1,114,112 code points in the Unicode Standard to representative glyphs that correspond to either special characters, characters in a block, or a plane itself.

About the Author

Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years, from 1991-07-01 to 2019-10-18, specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, became the Chair of the CJK & Unihan Group in 2021, and published UTN #45 (Unihan Property History) in 2022. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.


Dr Ken Lunde

Chair, CJK & Unihan Working Group—Almaden Valley—San José—CA—USA—NW Hemisphere—Terra—Sol—Orion-Cygnus Arm—Milky Way—Local Group—Laniakea Supercluster