2021 “State of the Unification” Report
By Dr Ken Lunde
Unicode-related activities continue to keep me busy, especially as the Chair of the now-official Unicode CJK & Unihan Group (formerly Unihan Ad Hoc). This article, which is turning into a brief annual report that began in 2019, is my way of providing CJK (Chinese, Japanese & Korean) and Unihan (Unified Han) experts and enthusiasts with a snapshot of the current state of affairs in this particular field.
See 2019 “State of the Unification” Report for the original report that I published on Adobe’s now-static CJK Type Blog.
Unicode Version 14.0
New versions of the Unicode Standard continue to be released on an annual basis, and although delayed by six months due to COVID-19, Unicode Version 14.0 was successfully released on 2021-09-14. There are now 92,865 CJK Unified Ideographs, as shown in the detailed synopsis below. Some key takeaways are that the CJK Unified Ideographs URO (Unified Repertoire & Ordering) and Extension B blocks are now completely full, thanks to a five-ideograph “Urgently Needed Character” (UNC) request from Macao SAR for their MSCS-2020 standard (see IRG N2430R), and that four new ideographs have been appended to the Extension C block:
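For readers who want to double-check the 92,865 figure, the arithmetic can be sketched in a few lines of Python. The code point ranges and the count of twelve unified ideographs in the CJK Compatibility Ideographs block are taken from the Unicode Version 14.0 code charts rather than from this report, so treat them as assumptions to verify, and the names used here are purely illustrative.

```python
# Assigned CJK Unified Ideograph ranges as of Unicode Version 14.0 (inclusive).
# Assumption: these ranges come from the Unicode code charts, not this report.
ASSIGNED_RANGES = {
    "URO": (0x4E00, 0x9FFF),
    "Extension A": (0x3400, 0x4DBF),
    "Extension B": (0x20000, 0x2A6DF),
    "Extension C": (0x2A700, 0x2B738),
    "Extension D": (0x2B740, 0x2B81D),
    "Extension E": (0x2B820, 0x2CEA1),
    "Extension F": (0x2CEB0, 0x2EBE0),
    "Extension G": (0x30000, 0x3134A),
}

# Assumption: twelve unified ideographs are encoded in the
# CJK Compatibility Ideographs block rather than in the blocks above.
COMPAT_BLOCK_UNIFIED = 12


def count_unified_ideographs() -> int:
    """Sum the sizes of the assigned ranges plus the twelve outliers."""
    total = sum(last - first + 1 for first, last in ASSIGNED_RANGES.values())
    return total + COMPAT_BLOCK_UNIFIED


if __name__ == "__main__":
    for name, (first, last) in ASSIGNED_RANGES.items():
        print(f"{name:<12} U+{first:04X}..U+{last:04X} {last - first + 1:>7,}")
    print(f"Total: {count_unified_ideographs():,}")
```

Hard-coding the ranges, instead of querying a library such as `unicodedata`, keeps the result independent of whichever Unicode version a given Python build happens to ship.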
In addition, John Jenkins, Vice-Chair of the CJK & Unihan Group, overhauled the provisional kCantonese property, which resulted in increased coverage to the tune of more than 6K ideographs. The provisional kStrange property, which is documented in UTN #43, Unihan Database Property “kStrange,” is new in Unicode Version 14.0.
Now that Unicode Version 14.0 has been unleashed onto an unsuspecting planet, we can shift our focus to Version 15.0 and beyond…
UTC #169 Meeting
Immediately after the release of Unicode Version 14.0, the UTC (Unicode Technical Committee) #169 meeting took place this week — on 2021-10-05 and 2021-10-07 — and like the six prior UTC meetings, it was conducted as a very efficient two-day Zoom virtual meeting.
To prepare for UTC #169, the CJK & Unihan Group met on the evening of 2021-09-24 for a full three hours, and the nine experts who attended discussed all CJK- and Unihan-related public feedback and documents. As the chair of the meeting and group, I spent most of the weekend that followed preparing L2/21-173R, CJK & Unihan Group Recommendations for UTC #169 Meeting, which was presented during UTC #169 on 2021-10-07. We did pretty much the same for the three other UTC meetings that have taken place since last year’s report (UTC #166 through UTC #168), and the corresponding CJK & Unihan Group recommendations for those meetings can be found here.
The 800-pound Gorilla of Unicode Version 15.0
Han character repertoires always mean thousands of new characters in the Unicode Standard. With 4,192 ideographs, the proverbial 800-pound gorilla of Unicode Version 15.0 (2022) is clearly the CJK Unified Ideographs Extension H block (aka IRG Working Set 2017), the current draft of which is available in the IRG #57 document register as IRG N2513. Like the Extension G block, which was added in Unicode Version 13.0 (2020), the Extension H block will be encoded in Plane 3 (aka TIP or Tertiary Ideographic Plane). The detailed synopsis below is marked “tentative” in the sense that the changes that are highlighted in yellow are expected to be reflected in Unicode Version 15.0, but are, of course, subject to change:
Of course, there may be additional CJK Unified Ideographs included in Unicode Version 15.0, which may result from disunifications or “Urgently Needed Character” (UNC) requests. The Extension C block still has seven unassigned code points.
The 800-pound Gorilla of the Unicode Standard
Appreciating the massive scope of the CJK Unified Ideographs Extension B block, which is now completely full with 42,720 ideographs, is easier said than done. While we are on the topic of 800-pound gorillas, the image below should help to illustrate just how massive this particular block is when compared to the other blocks in Plane 2 (aka SIP or Supplementary Ideographic Plane):
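The same comparison can be approximated in text form. The following is a minimal Python sketch that computes the size of each Plane 2 block and draws a proportional bar for it. The block boundaries are assumed from the Unicode block list (Blocks.txt), not from this report, and `block_sizes` and `render_bars` are hypothetical helper names.

```python
# Block boundaries in Plane 2 (SIP), inclusive.
# Assumption: taken from the Unicode block list, not from this report.
PLANE_2_BLOCKS = [
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Compatibility Supplement", 0x2F800, 0x2FA1F),
]


def block_sizes(blocks):
    """Map each block name to its number of code points."""
    return {name: last - first + 1 for name, first, last in blocks}


def render_bars(sizes, width=50):
    """Draw one '#' bar per block, scaled so the largest block spans `width`."""
    largest = max(sizes.values())
    lines = []
    for name, size in sizes.items():
        bar = "#" * max(1, round(size / largest * width))
        lines.append(f"{name:<25} {size:>7,} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_bars(block_sizes(PLANE_2_BLOCKS)))
```

With these boundaries, Extension B alone comes out larger than all of the other Plane 2 blocks combined. Note that the sketch counts every code point in a block, so a block that is not yet full, such as Extension C, is slightly larger here than its number of assigned ideographs.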
Other Ramblings
New Unihan database properties, along with enhancements to existing ones, continue to keep me busy and off of the streets:
- I am tracking kStrange property candidates in CJK Unified Ideographs Extension H (aka IRG Working Set 2017), and have collected 33 thus far. I have also collected 24 additions in the existing URO and Extension B through G blocks. I expect these modest additions to be reflected in Unicode Version 15.0 (2022), which will also entail an update to UTN #43.
- As stated in last year’s report, CITPC (Character Information Technology Promotion Council or 文字情報技術促進協議会 in Japanese) is still in the process of granting to the Unicode Consortium formal permission to use its Moji Jōhō Kiban Database (文字情報基盤データベース) through the use of a special license. The purpose of this particular exercise is now three-fold: ① to establish the new provisional kMojiJoho property as proposed in L2/20-146; ② to expand and improve the existing provisional kMorohashi property as proposed in L2/21-032; and ③ to establish the new provisional kJapanese property, which is expected to cover more than 51K ideographs with readings expressed using hiragana or katakana, and which is intended to replace the existing provisional kJapaneseKun and kJapaneseOn properties that cover only 13,395 ideographs. As a reminder, the license of the Moji Jōhō Kiban Database, CC BY-SA 2.1 JP (Attribution-ShareAlike 2.1 Japan), is incompatible with the license of the UCD (Unicode Character Database), which is why formal permission is necessary. Given the delays in establishing a license, I am now targeting Unicode Version 16.0 (2023) for the new and expanded Unihan database properties.
- One huge gap in the Unihan database is a property that specifies an IDS (Ideographic Description Sequence) for each ideograph. Per L2/21-118R, which is a preliminary proposal, work is underway to encode additional IDCs (Ideographic Description Characters), along with a modest number of new CJK Unified Ideographs that will primarily serve as IDS components. This paves the way for a new Unihan database property, tentatively named kIDS, that will specify at least one IDS for every ideograph. Due to the amount of work involved, which includes IRG (Ideographic Research Group) buy-in, the new target for encoding new IDCs and ideograph components is Unicode Version 16.0 (2023), and the new target for adding the provisional kIDS property to the Unihan database is Unicode Version 17.0 (2024).
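For those less familiar with IDSes, an IDS describes an ideograph in prefix notation: an IDC is followed by its two or three operands, each of which is either a component or a nested IDS, so 好 can be described as ⿰女子. The following Python sketch checks only that structural rule, using the twelve IDCs encoded as of Unicode Version 14.0 (U+2FF0 through U+2FFB). It is a simplified illustration and not the validation that a future kIDS property would require, it says nothing about the format of that property, which is still to be defined, and the function names are hypothetical.

```python
# IDCs as of Unicode Version 14.0: U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS


def idc_arity(ch: str) -> int:
    """Return the operand count for an IDC, or 0 for any other character."""
    if ch in TERNARY_IDCS:
        return 3
    if ch in BINARY_IDCS:
        return 2
    return 0


def is_well_formed_ids(ids: str) -> bool:
    """Return True if `ids` is exactly one complete prefix-notation sequence.

    Simplification: any non-IDC character is accepted as a component.
    """
    if not ids:
        return False
    pending = 1  # operands still expected; the whole IDS counts as one
    for ch in ids:
        if pending == 0:
            return False  # extra characters after a complete sequence
        pending += idc_arity(ch) - 1
    return pending == 0


if __name__ == "__main__":
    for candidate in ("⿰女子", "⿱艹⿰日月", "⿰女", "女子"):
        print(candidate, is_well_formed_ids(candidate))
```

The single counter works because each character consumes one expected operand and each IDC adds as many as its arity. A well-formed IDS brings the counter to zero exactly at its last character.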
Speaking of Unihan database properties, and as stated in the Unihan Property History article, we now have complete historical statistics for all of the properties in the Unihan database going all the way back to Version 2.0. This is thanks to a three-sheet spreadsheet of the same name that I developed earlier this year, along with the tools and processes to maintain it for future versions of the Unicode Standard. We are still considering how to reference this spreadsheet in the context of UAX #38, Unicode Han Database (Unihan).
What about IRG Working Set 2021, which is expected to eventually become CJK Unified Ideographs Extension I? Considering that the work has just begun, in terms of review cycles, it will still be a couple of years until this repertoire is mature enough to start drafting Extension I code charts.
I neglected to mention in last year’s report that the Unicode Consortium established a formal liaison relationship with CESI (China Electronics Standardization Institute or 中国电子技术标准化研究院 in Chinese) in July of 2020. Formal liaison relationships were also established with CITPC in November of 2020, and with the SAT (Saṃgaṇikīkṛtaṃ Taiśotripiṭakaṃ) Daizōkyō Text Database Committee (SAT 大藏經テキストデータベース研究会 in Japanese) just last month. I serve as the Unicode Consortium’s contact person for all three liaison relationships.
In closing, it looks like I will be plenty busy between now and UTC #170, which is tentatively scheduled to take place from 2022-01-24 through 2022-01-27 as the first face-to-face meeting since UTC #162. Regardless of whether UTC #170 is a face-to-face or virtual meeting, which is completely unknown at this point, the CJK & Unihan Group meeting in preparation for that meeting is scheduled to take place on the evening of 2022-01-14.
About the Author
Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, and became the Chair of the CJK & Unihan Group in 2021. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR AWD Tesla Model 3 EVs.