2022 “State of the Unification” Report
By Dr Ken Lunde
Unicode-related activities kept me particularly busy this year, and I continue to serve as the Chair of the Unicode CJK & Unihan Group. This article, which is the fourth installment of a brief annual report that began in 2019, provides CJK (Chinese, Japanese & Korean) and Unihan (Unicode Han) experts and enthusiasts with a snapshot of the latest developments.
See 2019 “State of the Unification” Report for the original report that I published on Adobe’s now-static CJK Type Blog.
Unicode Version 15.0
New versions of the Unicode Standard continue to be released on an annual basis, and Unicode Version 15.0 was successfully released on 2022-09-13. As the detailed synopsis below illustrates, there are now 97,058 CJK Unified Ideographs in the Unicode Standard. New in Unicode Version 15.0 is the CJK Unified Ideographs Extension H block, which, like the Extension G block that was encoded in Unicode Version 13.0, is encoded in Plane 3 (aka TIP or Tertiary Ideographic Plane), includes 4,192 ideographs, and is completely full out of the gate. It began its life as IRG Working Set 2017.
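The "Plane 3" and "completely full" statements can be checked with a little arithmetic. The following is a minimal sketch, assuming the Extension H block range U+31350 through U+323AF, which comes from the Unicode 15.0 Blocks.txt data file rather than from this report. The names `EXT_H_FIRST`, `EXT_H_LAST`, and `plane_of` are made up for illustration.

```python
# Assumption: block range taken from the Unicode 15.0 Blocks.txt data file,
# not from the report itself.
EXT_H_FIRST = 0x31350
EXT_H_LAST = 0x323AF


def plane_of(code_point: int) -> int:
    """Return the plane number; each plane spans 0x10000 code points."""
    return code_point >> 16


# Number of code points in the block, counting both endpoints.
ext_h_size = EXT_H_LAST - EXT_H_FIRST + 1

print(f"Block size: {ext_h_size}")
print(f"Planes: {plane_of(EXT_H_FIRST)} to {plane_of(EXT_H_LAST)}")
```

Under that assumption, the block size equals the 4,192 ideographs mentioned above, which is what "completely full" means: every code point in the block is assigned.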
In addition, U+5F50 彐 was disunified, and its disunified form was appended to the Extension C block at code point U+2B739, though China chose not to change its representative glyph for U+5F50 彐 for compatibility reasons.
In terms of Unihan database properties, kAlternateTotalStrokes was added as a new provisional property, and coverage of the provisional kKangXi property increased to over 70K ideographs, which represents an increase of nearly 50K ideographs.
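For readers who want to measure property coverage themselves, here is a rough sketch of how that might be done. It assumes the tab-separated line format used by the Unihan data files (code point, property name, value), with `#` marking comment lines. The function names `parse_unihan` and `coverage` are hypothetical, and the sample lines are only illustrative input, not an excerpt of any particular file.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_unihan(lines: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Map each code point to its {property: value} entries."""
    table: Dict[int, Dict[str, str]] = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp_field, prop, value = line.split("\t", 2)
        table[int(cp_field[2:], 16)][prop] = value  # drop the "U+" prefix
    return dict(table)


def coverage(table: Dict[int, Dict[str, str]], prop: str) -> int:
    """Count the code points that carry the given property."""
    return sum(1 for props in table.values() if prop in props)


# Illustrative input only; a real run would read the Unihan data files.
SAMPLE = [
    "# sample lines in the Unihan tab-separated format",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E8C\tkTotalStrokes\t2",
]

table = parse_unihan(SAMPLE)
print(coverage(table, "kTotalStrokes"))
```

Running `coverage` for kKangXi against the data files of two consecutive Unicode versions would be one way to see the kind of increase described above.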
Three other key things were released on the same date that Unicode Version 15.0 itself was released:
- Version 2 of UTN (Unicode Technical Note) #45, Unihan Property History, which was updated for Unicode Version 15.0. Version 1, which reflected Unicode Version 14.0, was published on 2022-01-29.
- Version 15.000 of the open source Last Resort fonts, which were updated for Unicode Version 15.0.
- The 2022-09-13 (eighth) version of the IVD (Ideographic Variation Database), which added a single IVS for the registered Adobe-Japan1 IVD collection whose base character is in Extension H.
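As a reminder of what an IVS looks like structurally, the sketch below builds one from a base character followed by a variation selector in the range U+E0100 through U+E01EF (VS17 through VS256). The helper names are hypothetical, the base character is an arbitrary example, and the code checks only the shape of the sequence, not whether it is actually registered in the IVD.

```python
# Variation selectors used for IVSes: VS17 (U+E0100) through VS256 (U+E01EF).
IVS_SELECTOR_FIRST = 0xE0100
IVS_SELECTOR_LAST = 0xE01EF


def make_ivs(base: int, selector: int) -> str:
    """Build a base + variation selector sequence (hypothetical helper)."""
    if not IVS_SELECTOR_FIRST <= selector <= IVS_SELECTOR_LAST:
        raise ValueError(f"U+{selector:04X} is not an IVS variation selector")
    return chr(base) + chr(selector)


def looks_like_ivs(text: str) -> bool:
    """Check the two-code-point shape only; registration is not checked."""
    return (
        len(text) == 2
        and IVS_SELECTOR_FIRST <= ord(text[1]) <= IVS_SELECTOR_LAST
    )


# Arbitrary base character chosen purely for illustration.
example = make_ivs(0x4E00, 0xE0100)
print([f"U+{ord(ch):04X}" for ch in example])
```

Whether a given sequence is valid for a collection such as Adobe-Japan1 can only be determined from the IVD data itself.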
UTC #173 Meeting
In-person UTC (Unicode Technical Committee) meetings resumed with the UTC #172 meeting in July, which was hosted by Microsoft in Redmond, WA. The UTC #173 meeting, which was hosted by Apple in Cupertino, CA, took place this past week, from 2022-11-01 through 2022-11-03. Like the previous meeting, it was conducted as a hybrid meeting that used the preferred AV provider of the meeting host to allow remote participants to attend.
To prepare for the UTC #173 meeting, the CJK & Unihan Group met on the evening of 2022-10-21 for a full four hours (instead of the usual three hours), and the 14 experts who attended discussed six CJK- and Unihan-related public feedback items and 35 documents. As the chair of the meeting and group, I subsequently spent the entire weekend that followed preparing L2/22-247, CJK & Unihan Group Recommendations for UTC #173 Meeting, which was presented during the UTC #173 meeting on the afternoon of 2022-11-02. It ended up becoming 51 pages, which included 38 sections and 17 PDF attachments. Preparing it was somewhat brutal, but considering that virtually all of the group’s recommendations were accepted, it was worth it. We did the same for the three other UTC meetings that took place since last year’s SOTU report—UTC #170 through UTC #172—and the corresponding CJK & Unihan Group recommendations for those meetings can be found here.
The following are some of the notable highlights from the meeting:
- Six new provisional Unihan database properties were accepted, and will be included in Unicode Version 15.1: kJapanese (see L2/22-181), kMojiJoho (see L2/22-187), kSMSZD2003Index (see L2/22-204R), kSMSZD2003Readings (also see L2/22-204R), kVietnameseNumeric (see L2/22-223), and kZhuangNumeric (also see L2/22-223). The first two properties will cover more than 50K ideographs.
- Two provisional Unihan database properties will be removed in Unicode Version 15.1: kIRGDaiKanwaZiten (see L2/22-188) and kRSKangXi (see L2/22-195).
- The provisional kMorohashi property will be enhanced and expanded in Unicode Version 15.1 to cover nearly 50K ideographs. See L2/22-188.
- Five new Ideographic Description Characters (IDCs) that were proposed in L2/22-191 were accepted, and will be included in Unicode Version 15.1.
I am particularly excited about the changes to the provisional kMorohashi property and the new provisional kJapanese and kMojiJoho properties, because they will significantly expand the amount of Japanese metadata in the Unihan database, which has historically been biased toward Chinese metadata.
Unicode Version 15.1
How the Unicode Consortium releases new versions of the Unicode Standard is necessarily changing due to resource changes, and L2/22-270, Release Management Group Recommendations for 2023–2024 Releases, provided a recommendation whose scope and plan for Unicode Version 15.1 (2023) were approved during the UTC #173 meeting.
In terms of actual character additions in Unicode Version 15.1, only the five Ideographic Description Characters have been accepted thus far, and they are likely to be the only character additions.
See The Pipeline.
Although somewhat unlikely at this point, additional ideographs may still be included in Unicode Version 15.1 as a result of disunifications or “Urgently Needed Character” (UNC) requests. Clarity on this particular point will come during the IRG #60 meeting (2023-03-20 through 2023-03-24). The Extension C block still has six unassigned code points that could be used for such purposes.
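The count of six can be reproduced from the code points involved. This is a small sketch under two assumptions that are not stated in this report: that the Extension C block ends at U+2B73F (per the Blocks.txt data file), and that U+2B739, the code point appended in Unicode Version 15.0, is the last assigned code point in the block. The variable names are made up for illustration.

```python
# From the report: the disunified form was appended at U+2B739.
EXT_C_LAST_ASSIGNED = 0x2B739
# Assumption: end of the Extension C block per the Blocks.txt data file.
EXT_C_BLOCK_LAST = 0x2B73F

# Code points after the last assigned one, up to and including the block end.
unassigned = list(range(EXT_C_LAST_ASSIGNED + 1, EXT_C_BLOCK_LAST + 1))

print(len(unassigned))
print(", ".join(f"U+{cp:04X}" for cp in unassigned))
```

Under those assumptions, the remaining code points run from U+2B73A through U+2B73F.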
Unicode Version 16.0
After Unicode Version 15.1 is released, the amount of CJK- and Unihan-related churn for Unicode Version 16.0 (2024) is expected to be smaller. For example, IRG Working Set 2021, which will eventually become the CJK Unified Ideographs Extension I block, will not be mature enough to be considered for inclusion in the scope of Unicode Version 16.0. I expect the major change to be the addition of nearly 24K KP-Source (aka kIRG_KPSource) representative glyphs to the code charts for the URO, Extension A, and Extension B per L2/22-238. These source references correspond to the DPRK (aka North Korea) regional character set standards, KPS 9566 and KPS 10721.
Other Ramblings
Never is there a dull moment in the Lunde household when it comes to work related to the Unicode Standard, and the following activities are likely to keep me sufficiently busy through the holidays:
- The kStrange property changes (seven) and additions (61) that will be included in Unicode Version 15.1 (see L2/22-190) will need to be reflected in UTN #43, which entails preparing and publishing Version 2. This effort will include updating 466 Extension B code chart excerpts to reflect the latest versions that no longer include the UCS2003 representative glyphs.
- A UTC Action Item was recorded during the UTC #173 meeting for me to prepare and submit a proposal to add Kangxi Radical number annotations to the 115 characters in the CJK Radicals Supplement block. These annotations will serve to indicate whether a character is a variant, simplified variant, simplified Chinese variant, or simplified Japanese variant of a particular Kangxi Radical. I already prepared the data file, so all that is left is for me to prepare the proposal proper. Easy peasy.
In closing, preparing for the UTC #174 meeting, which is scheduled to take place from 2023-01-24 through 2023-01-26 as another face-to-face meeting, is expected to be much less work than preparing for the UTC #173 meeting. The CJK & Unihan Group meeting in preparation for the UTC #174 meeting is scheduled to take place on the evening of 2023-01-06.
About the Author
Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, became the Chair of the CJK & Unihan Group in 2021, and published UTN #45 (Unihan Property History) in 2022. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.