2020 “State of the Unification” Report
By Dr Ken Lunde, Janitor, Spirits of Christmas Past
Unicode-related activities have kept me busy for the past several years, and this brief report is a way to provide a snapshot of the current status of things I have been working on in this particular realm.
The UTC (Unicode Technical Committee) #165 meeting took place this week—on 2020-10-05 and 2020-10-07—and like the two prior UTC meetings, it was conducted as a Zoom virtual meeting…
…thanks to COVID-19.
To prepare for UTC #165, the Unihan Ad Hoc met on the evening of 2020-09-18 for a full three hours, and the eight experts who attended discussed all Unihan-related public feedback and documents. As a co-chair of the meeting, I subsequently spent most of the weekend that followed preparing L2/20-235, Unihan Ad Hoc Recommendations for UTC #165 Meeting, which was presented during UTC #165 on 2020-10-05.
Titles are useful, and as the use of “Unification” in the title of this report aptly suggests, this brief article is for the benefit of Unihan (Unified Han) experts and enthusiasts. The image above is an excerpt of a document that I have been maintaining since Unicode Version 6.1 (2012), and which I keep up-to-date for each subsequent version of the Unicode Standard, at least for those versions that include additional CJK Unified Ideographs. The current version of this document is marked “tentative” in the sense that the changes that are highlighted in yellow are expected to be reflected in Unicode Version 14.0 (2021).
As stated in the 2019 report that was published on the now-static CJK Type Blog, additional CJK Unified Ideographs in Unicode Version 13.0 ended up completely filling the Extension A block. This time, the small number of additional CJK Unified Ideographs appears destined to completely fill the URO (Unified Repertoire and Ordering) and Extension B blocks, and the unassigned code points at the end of the Extension C block will start to be used. The total number of CJK Unified Ideographs in Unicode Version 14.0 is therefore expected to be 92,862 as of this writing. That figure may increase between now and when Unicode Version 14.0 is released next year.
What about IRG Working Set 2017 that is expected to become Extension H? I am so glad that you asked.
The likelihood of Extension H being included in Unicode Version 14.0 is relatively low, but at this stage I do not consider it to be an impossibility. Given the size of its repertoire, it is expected to be encoded in Plane 3 (aka TIP or Tertiary Ideographic Plane) in a new block that starts from code point U+31350. If it is not included in Unicode Version 14.0, it is virtually guaranteed to be included in Unicode Version 15.0 (2022) given its current level of maturity. As of this writing, the latest-and-greatest IRG Working Set 2017 is Version 5.1, and its schedule states that it will progress to Version 5.2 in December.
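As a small aside on why U+31350 falls in Plane 3: a code point's plane is its value divided by 0x10000 (equivalently, shifted right by 16 bits). The following is a minimal Python sketch of that arithmetic; the function name `plane_of` is made up for illustration and is not part of any Unicode tooling.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 through 16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane spans 0x10000 code points, so the plane is the value above the low 16 bits.
    return code_point >> 16


# Expected first code point of the Extension H block, as stated in the text above.
ext_h_start = 0x31350
print(f"U+{ext_h_start:04X} is in Plane {plane_of(ext_h_start)}")
```

The same function places the URO (which starts at U+4E00) in Plane 0 and Extension B (which starts at U+20000) in Plane 2.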
Considerable work will be done by yours truly between now and the next UTC meeting. As stated in Section 6 of L2/20-235, the following is a description of two proposals that I expect to have ready prior to UTC #166:
- CITPC (Character Information Technology Promotion Council or 文字情報技術促進協議会 in Japanese) is in the process of granting to the Unicode Consortium formal permission to use its Moji Jōhō Kiban Database (文字情報基盤データベース) to establish the new provisional Unihan database property kMojiJoho as originally proposed in L2/20-146. As soon as permission has been formally granted, I shall prepare and submit a revised proposal, L2/20-146R, that will be considered during UTC #166 in January of 2021. Only the language in the proposal proper is expected to change. Other than tweaking the first and fourth paragraphs in the “Description” field that is intended for the property’s table in UAX #38 (Unicode Han Database (Unihan)), its data file is not expected to change. The license of the Moji Jōhō Kiban Database, CC BY-SA 2.1 JP (Attribution-ShareAlike 2.1 Japan), is incompatible with the license of the UCD (Unicode Character Database), which is why formal permission is necessary.
- The Unicode Consortium also requested formal permission from CITPC to use its Moji Jōhō Kiban Database to expand and improve the existing provisional kMorohashi Unihan database property. That particular property is currently associated with 18,168 CJK Unified Ideographs and CJK Compatibility Ideographs, though 302 of the property values for ideographs in the range U+F900 through U+FA2D are bogus in that their values are 00000. In any case, the proposal is expected to expand the kMorohashi property to nearly 50K records, and is also expected to include references to SVSes (Standardized Variation Sequences) and Moji_Joho IVSes (Ideographic Variation Sequences) where appropriate.
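The 00000 placeholder values mentioned in the second item could be counted with a short script along these lines. This is only a sketch: it assumes the tab-separated layout of the Unihan data files (code point, property name, value), the function name `count_placeholder_values` is hypothetical, and the sample records are made up for illustration rather than copied from the actual database.

```python
# Made-up sample records in the assumed "U+XXXX<TAB>property<TAB>value" layout.
# They are NOT actual Unihan data; they only exercise the logic below.
SAMPLE = (
    "# illustrative records only\n"
    "U+4E00\tkMorohashi\t00001\n"
    "U+F900\tkMorohashi\t00000\n"
    "U+F901\tkMorohashi\t00000\n"
    "U+F902\tkMorohashi\t38172\n"
    "U+FA2D\tkMorohashi\t00000\n"
)


def count_placeholder_values(lines, prop="kMorohashi",
                             first=0xF900, last=0xFA2D, placeholder="00000"):
    """Count records for `prop` within [first, last] whose value equals `placeholder`."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip anything that is not a three-field record
        cp_text, name, value = fields
        if name != prop or not cp_text.startswith("U+"):
            continue
        code_point = int(cp_text[2:], 16)
        if first <= code_point <= last and value == placeholder:
            count += 1
    return count


print(count_placeholder_values(SAMPLE.splitlines()))
```

To run it against a real data file, pass the open file object in place of `SAMPLE.splitlines()`. If the description above holds, the count for the current kMorohashi data would be 302.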
In other words, I will be plenty busy between now and UTC #166, which takes place on 2021-01-19 and 2021-01-21. I therefore expect that the Unihan Ad Hoc meeting for UTC #166 will take place on the evening of 2021-01-08.
About the Author
Dr Ken Lunde worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison, served as Adobe’s representative to the Unicode Consortium since 2006, was Adobe’s primary representative from 2015 until 2019, serves as Unicode’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, and became the Chair of the CJK & Unihan Group in 2021. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR AWD Tesla Model 3 EVs.