Improving Font Information Processing Infrastructure: Pan-CJK Typeface Development
By Dr Ken Lunde, Janitor, Spirits of Christmas Past
Development History & Background
Unlike conventional typefaces whose fonts are intended to be used for a particular region or language, such as Japanese ones, Pan-CJK typefaces are intended to serve multiple East Asian regions and languages, such as China 🇨🇳 (PRC), Taiwan 🇹🇼 (ROC), Hong Kong SAR 🇭🇰, Japan 🇯🇵, and South Korea 🇰🇷 (ROK). The inherent difficulty and complexity of designing and developing Pan-CJK typefaces is directly related to respecting regional conventions, which may necessitate multiple glyphs per code point, not only for ideographs (aka kanji or Chinese characters), but also for punctuation. The extent to which glyphs can or cannot be shared across the supported languages varies not only by typeface style, but also by actual typeface design. Figure 1 shows 68 ideographs that include five distinct language-specific glyphs in a particular Pan-CJK typeface.
There are two primary motivations or advantages of Pan-CJK fonts. One is related to design, specifically that it is possible to exhibit a consistent style, including weight, across languages. This is useful for branding purposes, particularly for businesses that serve customers in multiple East Asian regions. The other is about saving space, in that a typical Pan-CJK font is smaller and includes fewer glyphs than the equivalent and separate region- or language-specific fonts. To put the space-savings into perspective, one of the Source Han Sans Pan-CJK fonts includes 65K glyphs and is approximately 17MB in size, whereas the five separate language-specific subset fonts for Source Han Sans, when added together, include approximately 116K glyphs and are approximately 30MB in size. That’s nearly 45% redundancy for both the number of glyphs and file size.
The development timeline spans nearly a quarter-century, going back to my early years at Adobe. I first presented about Pan-CJK font development at UIW6 (Unicode/ISO 10646 Implementers’ Workshop 6) in September of 1994, with a presentation entitled “Creating Fonts for the Unicode Kanji Set: Problems & Solutions,” which included a demonstration of a CID-keyed font implementation that used multiple region-specific CMap resources. My former colleague, Dirk Meyer, and I built the first working prototypes in early 1999, and he presented them at IUC15 (15th International Unicode Conference) in September of 1999, in a presentation entitled “Unihan Disambiguation Through Font Technology.” I subsequently presented about developing Pan-CJK fonts at IUC33 (33rd Internationalization & Unicode Conference) in October of 2009, in a presentation entitled “Designing & Developing Pan-CJK Fonts for Today.” This presentation got Google’s attention, and we started business discussions in 2010. We came to an agreement in 2012, which kicked off development of two Pan-CJK typeface families. Version 1.000 of the sans serif family was released in July of 2014 as the Adobe-branded Source Han Sans and the Google-branded Noto Sans CJK. Version 1.000 of the serif family was released in April of 2017 with similar branding, as Source Han Serif and Noto Serif CJK. Version 2.000 of the sans serif family, which added support for Hong Kong SAR as a second form of Traditional Chinese among its many enhancements, was released in November of 2018. As a largely personal project, I developed and released a third Pan-CJK typeface, Source Han Mono, in May of 2019. While it is considered a derivative of Source Han Sans, its glyph set is different, and its development presented some challenges, such as applying anisotropic techniques to force the glyphs for hangul and jamo to be full-width, and including italic glyphs.
For Source Han Mono, I also didn’t deploy region-specific subset fonts, mainly because I find them to be uninteresting.
Adobe and Google position these Pan-CJK typefaces differently due to differing business needs and requirements. While Adobe treats the Source Han families as completely independent and stand-alone fonts, Google treats the Noto CJK families as very large pieces of a much larger puzzle: the Noto family, which is implemented via font fallback in their ecosystems. “Noto” is an abbreviation for “No Tofu,” which refers to the prototypical hollow rectangular glyph that appears when attempting to use a character for which the selected or available font doesn’t include a glyph. (Any subsequent references to the Source Han typefaces also apply to the equivalent Noto CJK typefaces.) As a result of Adobe and Google being very large companies that have different business needs and strategies, identical fonts were given two different names for branding purposes.
In terms of actual design, Adobe Chief Type Designer, Ryoko Nishizuka (西塚涼子), is responsible for the overall design, but because Adobe has no in-house Chinese and Korean typeface design expertise, we partnered with leading type foundries in those regions: Changzhou SinoType (PRC) for Chinese (hanzi), Sandoll (ROK) for Korean (hangul syllables, letters, and symbols), and Iwata (Japan) for Korean (hanja) and additional Japanese (kanji). Adobe is now working with Arphic Technology (ROC) for Chinese. All subsequent development was done completely in-house using proven and repeatable processes, which is an absolute necessity when working with nearly a million glyphs.
These Pan-CJK typeface families are positioned differently than conventional Japanese typeface families, such as Adobe’s Kozuka Mincho and Kozuka Gothic, along with the IPA (Information-technology Promotion Agency) fonts. The characters that are supported are those that are considered the most common or required for the following regions in East Asia: Japan, South Korea (ROK), China (PRC), Taiwan (ROC), and Hong Kong SAR. Unicode also played an important role in the projects, as the backend or “working” glyphs are named according to their Unicode code points or sequences. The coverage for Japanese is somewhat unique: not all of Adobe-Japan1-7 is supported, because that glyph set includes glyphs for many symbols that are not considered useful in a Pan-CJK context. Glyphs for all of its kanji are included, however, which means that all Adobe-Japan1 IVSes (Ideographic Variation Sequences) are supported, along with the SVSes (Standardized Variation Sequences) that correspond to CJK Compatibility Ideographs. Furthermore, some JIS X 0212 and JIS X 0213 characters and symbols were intentionally excluded.
Development Details & Challenges
The massive scope and nature of the Source Han typefaces provided an opportunity to explore new possibilities and functionalities, which had the side-effect of exposing poor assumptions on the part of OSes, apps, layout engines, and libraries. Part of this was due to their Pan-CJK nature, and being open source also gave us more flexibility when it came to experimentation and exploration.
The basic characteristics of the actual fonts are that they 1) include the maximum number of glyphs in a font resource, which is 65,535 (CIDs 0 through 65534); 2) are implemented as CID-keyed OpenType/CFF fonts that are based on the special-purpose Adobe-Identity-0 ROS (Registry, Ordering, and Supplement, which refer to three CIDSystemInfo values that are required for CIDFont and CMap resources, the first two of which are used for compatibility purposes); 3) use Unicode-based working glyph names that insulated against CID changes across development and major versions, and also drove the process of establishing Unicode mappings and OpenType features; 4) are Pan-CJK, and therefore support multiple East Asian languages and their regional conventions; 5) specify cross-platform vertical metrics settings; 6) include glyphs for two- and three-em characters and their vertical forms, which result in unusually large font bounding-box values that may cause grief for some apps in terms of line height or line spacing; and 7) are open source, which means that they are free of licensing fees of any kind.
In terms of making the fonts available as open source, I have found that such a paradigm can influence people’s expectations, either because equivalent commercial typefaces have less vocal methods for reporting errors or requesting enhancements, or because there is a feeling that there is an anxious army of developers at the ready to issue fixes or provide enhancements. Nothing could be further from the truth, and we sometimes would jokingly refer to this as the “but, but, pony!” paradigm.
The Source Han typefaces represent the first major implementation of the ‘locl’ (Localized Forms) GSUB feature, which mainly affects ideographs, but also punctuation and even digits. The challenge is that language tagging of text — at the character, paragraph, or document level — is not universally supported in today’s apps. Only modern browsers and Adobe InDesign provide an adequate level of language-tagging support. Other Adobe apps are in the process of adding such support, as is Microsoft Word. To deal with this challenge, in addition to including a fully-functional ‘locl’ GSUB feature in the Source Han fonts, the fonts were released as separate fonts, each with a default language. Source Han Sans Version 2.000 was released with five default languages: Japanese, Korean, Simplified Chinese, Traditional Chinese for Taiwan, and Traditional Chinese for Hong Kong SAR. Source Han Serif Version 1.000 was released with only four default languages, which excluded Traditional Chinese for Hong Kong SAR. In other words, there are two methods to access the default glyphs for the supported languages: 1) by selecting the font whose default language is the target language; and 2) by language-tagging the text at the character, paragraph, or document level. The latter method requires an app that necessarily supports language tagging and the ‘locl’ GSUB feature. At some point, a sixth default language, which is Traditional Chinese for Macao SAR 🇲🇴, can be considered.
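The two access methods described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the glyph names, the pattern of which languages share a glyph, and the helper functions are all hypothetical and do not reflect the actual Source Han data. Only the OpenType language system tags (JAN, KOR, ZHS, ZHT, ZHH) are real; the language-tag keys are illustrative.

```python
from typing import Dict, Optional

# Language tag (illustrative) -> OpenType language system tag (real tags).
OT_LANG = {"ja": "JAN", "ko": "KOR", "zh-Hans": "ZHS", "zh-Hant": "ZHT", "zh-HK": "ZHH"}

# Hypothetical glyph names for one ideograph, U+9AA8. The sharing pattern
# (Korean reusing the Japanese glyph) is invented purely for illustration.
REGIONAL_GLYPHS: Dict[int, Dict[str, str]] = {
    0x9AA8: {"JAN": "uni9AA8-JP", "KOR": "uni9AA8-JP", "ZHS": "uni9AA8-CN",
             "ZHT": "uni9AA8-TW", "ZHH": "uni9AA8-HK"},
}

def build_font(default_lang: str) -> dict:
    """Model one language-specific font: same glyphs, different default mapping."""
    cmap = {cp: forms[default_lang] for cp, forms in REGIONAL_GLYPHS.items()}
    # 'locl' substitutes the default glyph with the form for the tagged language.
    locl = {
        lang: {forms[default_lang]: forms[lang]
               for forms in REGIONAL_GLYPHS.values()
               if forms[lang] != forms[default_lang]}
        for lang in OT_LANG.values()
    }
    return {"cmap": cmap, "locl": locl}

def glyph_for(font: dict, code_point: int, lang: Optional[str] = None) -> str:
    """Return the glyph name, applying 'locl' only when the text is language-tagged."""
    glyph = font["cmap"][code_point]
    if lang is not None:
        glyph = font["locl"][OT_LANG[lang]].get(glyph, glyph)
    return glyph

japanese_font = build_font("JAN")
simplified_font = build_font("ZHS")

# Method 1: select the font whose default language is the target language.
print(glyph_for(simplified_font, 0x9AA8))           # untagged text
# Method 2: keep the Japanese-default font, but language-tag the text.
print(glyph_for(japanese_font, 0x9AA8, "zh-Hans"))  # tagged text
```

Both calls resolve to the same glyph, which is the point of shipping separate default-language fonts alongside a fully functional ‘locl’ feature: apps without language tagging can still reach the right glyphs.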
The separate language-specific fonts for each family share the same glyph set, which gave us an opportunity to deploy them as per-weight and per-family OpenType/CFF Collections (aka OTCs). An OTC is a single font resource (aka file) that contains two or more fonts, and whose main benefit is a reduced footprint that is the result of sharing tables across its fonts. Source Han Sans was the first major deployment of OTCs, whose seven per-weight OTCs include five or ten fonts, depending on the weight (the Regular and Bold weights include additional “Half-Width” fonts whose ASCII range maps to half-width glyphs), and share the same ‘CFF’ (Compact Font Format) table. Its family OTC, which is called a “Super OTC,” includes all 45 fonts and seven per-weight ‘CFF’ tables. The 45 individual fonts range in size from 16 to 18.5MB, meaning approximately 750MB in total. The seven per-weight OTCs range in size from 18 to 20MB, and are approximately 136MB in total. The 45-font Super OTC is 123MB. In other words, the OTC deployment format offered great size savings by sharing the ‘CFF’ and other significant ‘sfnt’ tables. The challenge was the lack of support for OTCs in some environments. OTCs were first supported by Mac OS X 10.8 (aka Mountain Lion, and released in 2012) and Adobe CS6 and greater apps. Microsoft support for OTCs was not implemented until Windows 10 Anniversary Update (Version 1607, released on 2016-08-02). I also built and released what I refer to as “Mega” (the Source Han and Noto CJK families as two separate OTCs) and “Ultra” (the Source Han and Noto CJK families as a single, combined OTC) OTCs that contain an even larger number of fonts. The current Ultra OTC combines three Source Han families and two Noto CJK families, includes 216 fonts, and is approximately 400MB. That’s less than 2MB per 65,535-glyph font. The Mega and Ultra OTCs are available for download. I personally use the Ultra OTC, and therefore recommend it to others.
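The font counts and size savings quoted above can be checked with a little arithmetic. The following Python sketch uses only the approximate figures given in this article; the variable and function names are mine, and the percentages are only as precise as the rounded sizes.

```python
# Five weights ship five fonts each; Regular and Bold ship ten each
# because of the additional "Half-Width" fonts.
fonts_per_weight = [5] * 5 + [10] * 2
total_fonts = sum(fonts_per_weight)

# Approximate totals, in MB, as quoted in the text.
individual_fonts_mb = 750   # 45 separate fonts
per_weight_otcs_mb = 136    # seven per-weight OTCs
super_otc_mb = 123          # one 45-font Super OTC

def saving(before_mb: float, after_mb: float) -> float:
    """Fraction of the footprint that is eliminated by table sharing."""
    return 1 - after_mb / before_mb

per_weight_saving = saving(individual_fonts_mb, per_weight_otcs_mb)
super_saving = saving(individual_fonts_mb, super_otc_mb)

# The Ultra OTC: approximately 400MB for 216 fonts.
ultra_mb_per_font = 400 / 216

print(f"fonts in the Super OTC: {total_fonts}")
print(f"per-weight OTC saving:  {per_weight_saving:.0%}")
print(f"Super OTC saving:       {super_saving:.0%}")
print(f"Ultra OTC MB per font:  {ultra_mb_per_font:.2f}")
```

In round numbers, both OTC formats remove more than 80% of the footprint of the individual fonts, and the Ultra OTC averages under 2MB per font.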
The most interesting and time-consuming aspect of Pan-CJK typeface development is the extent to which characters require multiple glyphs in order to respect regional conventions. A large number of ideographs require only a single glyph, a comparable number require two glyphs, and the number that require three or more glyphs diminishes. In order to determine how many glyphs a particular ideograph needs, character set standards were referenced. Part of this process made it clear that standards, while necessary for developing such typefaces, cannot be completely trusted due to documented and undocumented errors. Figure 2 lists several code point categories, and shows how many glyphs are used to represent the 44,808 code points of Source Han Sans Version 2.001.
The default glyphs for kana in conventional Japanese fonts are shared for horizontal and vertical layout, except for the small kana that require separate vertical glyphs that are positioned differently within the em-box. Some conventional Japanese fonts include completely separate sets of glyphs for kana that are tailored for horizontal and vertical layout, but they are not the default glyphs, and must be enabled via the ‘hkna’ (Horizontal Kana Alternates) or ‘vkna’ (Vertical Kana Alternates) GSUB features. The Source Han typefaces include separate horizontal and vertical glyphs for kana by default. Speaking of vertical layout, the Source Han fonts are among the first implementations that are compliant with UAX #50 (Unicode Vertical Text Layout). Figure 3 shows the horizontal (blue) and vertical (red) forms of three hiragana characters from Source Han Sans.
Conventional CJK fonts typically include glyphs for Greek and Cyrillic, but they are almost always implemented as full-width glyphs. The glyphs for Greek and Cyrillic in the Source Han typefaces are proportional (those in Source Han Mono are monospaced), and are therefore more usable than those in conventional CJK fonts.
Supporting Korean presented an interesting challenge mainly in adding support for combining jamo. Modern Korean can be represented by 11,172 syllables that can be decomposed into 399 two- or 10,773 three-jamo sequences, and fewer than 3,000 of these syllables are considered the most frequently used. Non-Modern (or Archaic) Korean, including some modern dialects such as Jeju (제주말/濟州語), requires syllables that are formed using non-modern jamo. While a very small number of the possible two- and three-jamo sequences that correspond to non-modern syllables are supported by including pre-composed glyphs that are accessible via the ‘ccmp’ (Glyph Composition/Decomposition) GSUB feature, the total number of possible sequences is 1,638,750, or 1,627,578 if the sequences that correspond to the 11,172 modern syllables are excluded. Combining jamo is implemented in the Source Han fonts by including six sets of “leading consonant” jamo (aka L), two sets of “vowel” jamo (aka V), and four sets of “trailing consonant” jamo (aka T). The ‘ljmo’ (Leading Jamo Forms), ‘vjmo’ (Vowel Jamo Forms), and ‘tjmo’ (Trailing Jamo Forms) GSUB features are then used to perform the appropriate contextual substitutions. Figure 4 contrasts the three- and two-jamo sequences for the Korean word “kimchi” (김치) and their corresponding syllables, using the Source Han Sans and Source Han Serif fonts as the example.
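The modern-syllable figures follow from the arithmetic decomposition of hangul syllables that is defined in the Unicode Standard, which can be sketched in Python as follows. The constants come from that algorithm; the function name is mine, and the sketch covers only modern syllables, not the contextual jamo substitutions that the fonts perform.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT              # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT              # 11,172 modern syllables

def decompose(syllable: str) -> str:
    """Decompose one modern hangul syllable into its two- or three-jamo sequence."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a modern hangul syllable")
    jamo = chr(L_BASE + s_index // N_COUNT)               # leading consonant (L)
    jamo += chr(V_BASE + (s_index % N_COUNT) // T_COUNT)  # vowel (V)
    t_index = s_index % T_COUNT
    if t_index:                                           # trailing consonant (T), if any
        jamo += chr(T_BASE + t_index)
    return jamo

two_jamo = L_COUNT * V_COUNT                    # LV syllables
three_jamo = L_COUNT * V_COUNT * (T_COUNT - 1)  # LVT syllables
non_modern = 1_638_750 - S_COUNT                # totals as stated in the text

for syllable in "김치":
    jamo = decompose(syllable)
    # Cross-check against the standard library's canonical decomposition.
    assert jamo == unicodedata.normalize("NFD", syllable)
    print(syllable, [f"U+{ord(j):04X}" for j in jamo])
```

The first syllable of “kimchi” decomposes into three jamo and the second into two, which is why the word makes a convenient example of both sequence types.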
Because new versions of Unicode often include new characters that can be considered within the scope of the Source Han typefaces, the two families tend to leap-frog each other, in terms of one family implementing support for particular characters before the other, or making improvements to the design of some glyphs or components. As an example of the latter, Source Han Sans Version 2.000 introduced improved glyphs and layout support for bopomofo (ㄅㄆㄇㄈ/注音), and also improved the design of the 辶 (Radical #162) component for its Traditional Chinese (Taiwan and Hong Kong SAR) glyphs. As the families become more mature, I expect that this tendency to leap-frog each other will subside.
In order to make the Source Han fonts more exciting, particularly for people who would normally not get excited about fonts, Source Han Serif Version 1.000 included glyphs for a complex ideograph with nearly 60 strokes that is read biáng, and refers to a type of noodle in China. The glyph for its Simplified Chinese form with still over 40 strokes was also included. Source Han Sans Version 2.000 leap-frogged Source Han Serif by including glyphs for an even more complex ideograph with 84 strokes that is read taito (たいと) or otodo (おとど), and is used as the name of a noodle restaurant chain in Japan. These are real ideographs that will be included in CJK Unified Ideographs Extension G (Unicode Version 13.0), but because their Plane 3 code points were not yet stable, their glyphs were made accessible using their IDSes (Ideographic Description Sequences) and the ‘ccmp’ (Glyph Composition/Decomposition) GSUB feature, which can be considered a form of pseudo-encoding. Figure 5 shows the Extension G ideographs that are supported in Source Han Sans Version 2.000, provides their IDSes, and illustrates the region-specific forms.
Adobe’s current challenge is to develop Variable Font versions of the Source Han typefaces, which they may deploy as five-font Variable Font Collections. Unlike conventional fonts that require separate font resources for different variations, such as different weights, Variable Fonts use interpolation between master designs to dynamically specify one or more particular design variations, such as weight or width. For a typeface that is deployed as fonts of many weights, the footprint of a Variable Font version would be considerably less, because only the master designs, such as the lightest and heaviest weights, are actually present. Deploying the Source Han typefaces as five separate language-specific Variable Fonts would completely defeat one of the benefits of Variable Fonts, which is a reduced footprint. A related challenge is about support infrastructure. While Variable Fonts are supported in some key OSes and apps, Variable Font Collections are not.
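To illustrate the interpolation idea, here is a minimal Python sketch that derives an intermediate weight from two master outlines. It is a simplification under stated assumptions: the stem coordinates and the weight-axis range of 250 to 900 are hypothetical, and actual Variable Fonts store deltas in tables such as ‘gvar’ or ‘CFF2’ rather than complete outlines that are blended at run time.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def normalize(value: float, minimum: float, maximum: float) -> float:
    """Map an axis value, such as a weight, onto the range 0.0 through 1.0."""
    return (value - minimum) / (maximum - minimum)

def interpolate(light: List[Point], heavy: List[Point], t: float) -> List[Point]:
    """Linearly interpolate between two masters with compatible outlines."""
    if len(light) != len(heavy):
        raise ValueError("masters must have the same number of points")
    return [(lx + t * (hx - lx), ly + t * (hy - ly))
            for (lx, ly), (hx, hy) in zip(light, heavy)]

# Hypothetical vertical stem: 40 units wide in the lightest master,
# 160 units wide in the heaviest.
LIGHT_STEM = [(480, 0), (520, 0), (520, 700), (480, 700)]
HEAVY_STEM = [(420, 0), (580, 0), (580, 700), (420, 700)]

t = normalize(575, 250, 900)   # a weight halfway along the hypothetical axis
mid_stem = interpolate(LIGHT_STEM, HEAVY_STEM, t)
print(t, mid_stem)
```

Only the two masters need to be stored, and any weight in between is computed on demand, which is where the footprint reduction comes from.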
I strongly encourage anyone who is interested in learning more about the Source Han typefaces to read the long and detailed ReadMe files that were painstakingly prepared for the three families: Source Han Sans, Source Han Serif, and Source Han Mono.
Benefits for Information Processing Infrastructure
One of the goals for the Source Han typefaces was for them to behave as conventional region-specific fonts for each of the supported languages, but with additional functionality, such as supporting the ‘locl’ (Localized Forms) GSUB feature to handle Western versus Japanese digits and some punctuation, with single and double “smart” quotes being prototypical examples. To exemplify their behavior as conventional fonts, if a customer selects the fonts whose default language is Japanese, they are intended to behave as conventional Japanese fonts. The non-Japanese portions of the fonts can be safely ignored. Although I did develop and deploy region-specific subset fonts that basically include only the glyphs that are deemed necessary for a particular region, such as Japan, I still encourage the use of the Pan-CJK versions that support multiple languages, either by selecting a font with a different default language, or through the use of language tagging in apps that support such functionality. The region-specific subset fonts are uninteresting because they are not Pan-CJK fonts.
Another goal, which is unique to the Japanese portion of the Source Han typefaces, was to include all Adobe-Japan1-7 kanji, and by extension, all kanji in the JIS X 0208, JIS X 0212, and JIS X 0213 standards. This allowed the fonts whose default language is Japanese to include all Adobe-Japan1 IVSes (Ideographic Variation Sequences) plus a small but necessary number of SVSes (Standardized Variation Sequences) in their Format 14 (Unicode Variation Sequences) ‘cmap’ subtables. This resulted in improved interoperability with existing Japanese fonts.
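As a rough illustration of how variation sequences are resolved, the Python sketch below classifies a sequence by its variation selector and looks it up in a table that is modeled loosely on a Format 14 ‘cmap’ subtable. The glyph names, the table contents, and the function names are hypothetical, and the sample sequences were chosen only for illustration; the IVD and the Unicode StandardizedVariants.txt data file are the authorities on which sequences are actually registered.

```python
IVS_SELECTORS = range(0xE0100, 0xE01F0)  # VS17 through VS256
SVS_SELECTORS = range(0xFE00, 0xFE10)    # VS1 through VS16

def classify(sequence: str) -> str:
    """Classify a base character plus variation selector by selector range only."""
    if len(sequence) != 2:
        return "other"
    selector = ord(sequence[1])
    if selector in IVS_SELECTORS:
        return "IVS"
    if selector in SVS_SELECTORS:
        return "SVS"
    return "other"

# Hypothetical font data: the default mapping, plus a table keyed by
# (base, selector) pairs in the spirit of a Format 14 subtable.
CMAP = {0x845B: "glyph-default"}
UVS = {
    (0x845B, 0xE0100): "glyph-default",
    (0x845B, 0xE0101): "glyph-variant",
}

def glyph_for(sequence: str) -> str:
    """Resolve a sequence; unsupported selectors fall back to the default glyph."""
    base, selector = ord(sequence[0]), ord(sequence[1])
    return UVS.get((base, selector), CMAP[base])

print(classify("\u845B\U000E0101"), glyph_for("\u845B\U000E0101"))
print(classify("\u585A\uFE00"))
```

Falling back to the default glyph for an unsupported sequence mirrors the usual behavior of ignoring an unrecognized variation selector, so that the text remains legible.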
As mentioned earlier, the unique nature of the Source Han glyph sets required the use of the special-purpose Adobe-Identity-0 ROS, which is the glyph set that must be specified if a typeface doesn’t fit one of Adobe’s public ROSes, such as Adobe-Japan1-7 for Japanese fonts. Although the Source Han typefaces use the Adobe-Identity-0 ROS, their glyph sets are necessarily different, mainly due to one important characteristic of Pan-CJK typefaces: the extent to which glyphs are shared depends not only on the typeface style, such as sans serif versus serif, but also on the actual typeface design. In other words, both typefaces extensively share glyphs across the supported languages, but the distribution is different. The Adobe-Identity-0 fonts that we develop serve as examples for other font developers to study, and our hope is that they serve as inspiration. The first Adobe-Identity-0 font was Kazuraki (かづらき), which Adobe released in 2009. In addition to the Source Han fonts, Ten Mincho (貂明朝), which Adobe first released at the end of 2017, also takes advantage of the Adobe-Identity-0 ROS.
While not necessarily a goal, Adobe made the Source Han fonts available in multiple deployment formats. This was partly necessary to work around limitations in particular environments, but also presented an opportunity to explore new possibilities that may serve to inspire other typefaces to do the same. One of the deployment formats was the OTC (OpenType Collection), which is a single font resource that includes multiple fonts that share the larger ‘sfnt’ tables. Adobe’s Ten Mincho typefaces took advantage of OTCs in that all four faces are available as a four-font OTC that includes a single ‘CFF’ table.
Looking into the near future, Adobe is currently exploring Variable Font Collections as yet another deployment format that will be specific to the Variable Font versions of the Source Han typeface families. To this end, I developed a series of test fonts at the end of January that were specifically intended to simulate the Source Han typefaces as six- and 12-font Variable Font Collections, with the hope that the actual Source Han Variable Font Collections will be supported when they are released, at least in key environments.
One often overlooked benefit of the Source Han typefaces is that their fonts proved to be very useful in exposing or unmasking poor assumptions in environments that consume fonts, to include font development tools. This paved the way for similar fonts to be released, better guaranteeing that they will function as intended. Some of these assumptions included the number of glyphs in the fonts (65,535, which is the architectural limit), the number of fonts in an OTC, how line height or line spacing is determined, and so on.
As stated earlier in this article, genuine Pan-CJK typeface families provide a consistent design across East Asian languages, but also respect region-specific conventions, particularly for ideographs and punctuation. Such fonts allow companies to produce multilingual documentation, promotional materials, and other collateral using a consistent “look and feel,” which is important for branding purposes. My frequent visits to Japan allow me to observe multilingual signage, particularly in the frequently-used transit systems.
In closing, the open source nature of the Source Han fonts, which means they can be used at no charge, effectively eliminates any barriers to their use. Of course, making high-quality Pan-CJK fonts available as open source may at first seem to disrupt existing commercial font licensing businesses, but the harsh reality is that the number of such high-quality open source Pan-CJK typefaces will be limited, mainly due to the amount of time and effort that is necessary to design and develop them.
🐬
About the Author
Dr Ken Lunde worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison, served as Adobe’s representative to the Unicode Consortium since 2006, was Adobe’s primary representative from 2015 until 2019, serves as Unicode’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, and became the Chair of the CJK & Unihan Group in 2021. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR AWD Tesla Model 3 EVs.