2024 “State of the Unification” Report

Dr Ken Lunde
Nov 8, 2024


Unusual (again) this year were several events that affected my standardization-related activities, along with the extent of my engagement. Unchanged, of course, is that I continue to serve as the Chair of the UTC CJK & Unihan Working Group that I formed in early 2020 due to COVID-19. This article, which is the sixth installment of a brief annual report that began its life in 2019, provides to CJK (Chinese, Japanese & Korean) and Unihan (Unicode Han) experts and enthusiasts a snapshot of this year’s developments.

See 2019 “State of the Unification” Report for the original report that I published on Adobe’s long-static CJK Type Blog to which I contributed 300+ articles.

Unicode Version 16.0

New for 2024 is Unicode Version 16.0, which was released on 2024-09-10. There are still 97,680 CJK Unified Ideographs in the Unicode Standard, which is unchanged from Version 15.1 that was released a year earlier. New in this version is a completely revamped Core Specification that was built from the ground up as a web-based document. I would like to make special mention of Appendix F, Documentation of CJK Strokes, whose tables were completely redone and now include color to highlight the strokes in the example ideographs. This appendix also includes the two new strokes that were added to the CJK Strokes block, U+31E4 CJK STROKE HXG and U+31E5 CJK STROKE SZP.

The CJK Strokes block for Unicode Version 16.0

As documented in UAX #38, there was a little bit of churn among Unihan database properties, in terms of an existing property being removed and new properties being added:

  • One existing provisional property — kFrequency — was removed from the Unihan database, because it was deemed to be out-of-date and no longer relevant.
  • Two new provisional properties — kFanqie and kZhuang — were added to the Unihan database.

In addition, the coverage of the normative kIRG_JSource property was significantly expanded to add over 36,000 new source references that correspond to the Moji Jōhō Kiban (文字情報基盤) database. This massive horizontal extension was first suggested by yours truly in the CJK Type Blog article entitled Unihan & Moji Jōhō Kiban Project: The Tip of the Iceberg that was published at the beginning of 2018. See document L2/23-144 (aka WG2 N5221) for more details.

The syntax of the informative kRSUnicode property was once again enhanced, this time to accommodate a second non-Chinese simplified radical. Chinese simplified radicals have been specified by a single apostrophe following the radical number, and Unicode Version 15.1 introduced the use of two apostrophes to specify non-Chinese simplified radicals. Unicode Version 16.0 introduced the use of three apostrophes to specify a second non-Chinese simplified radical. The benefit of this enhancement is best visualized in the Radical-Stroke Index (46MB) whose annotated excerpt for Radical #212 is shown below:

Chinese simplified radicals are outlined in red, the first non-Chinese simplified radicals are outlined in green, and the second non-Chinese simplified radicals are outlined in blue. As you can see, the sorting order for any given radical-stroke value pair is 1) traditional; 2) Chinese simplified; 3) first non-Chinese simplified; and then 4) second non-Chinese simplified. The other radical-stroke indexes, which are associated with the kIICore and kUnihanCore2020 properties, include the same enhancement, but they support a small subset of ideographs, and do not include any instances of a second non-Chinese simplified radical.
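The apostrophe convention and the sorting order described above can be sketched in a few lines of Python. This is only an illustration: the regular expression is my assumption based on the description in this article (a radical number, zero to three ASCII apostrophes, a period, and a residual stroke count), and the names RS_PATTERN, RadicalStroke, and parse_rs are hypothetical. Consult UAX #38 for the normative syntax.

```python
import re
from typing import NamedTuple

# Assumed syntax: radical number, 0-3 apostrophes, a period, and a residual
# stroke count (optionally negative). Not copied from UAX #38.
RS_PATTERN = re.compile(r"^([1-9][0-9]{0,2})('{0,3})\.(-?[0-9]{1,2})$")

VARIANT_NAMES = (
    "traditional",
    "Chinese simplified",
    "first non-Chinese simplified",
    "second non-Chinese simplified",
)


class RadicalStroke(NamedTuple):
    """Field order doubles as the sort order: radical, strokes, variant."""
    radical: int   # radical number
    strokes: int   # residual stroke count
    variant: int   # number of apostrophes, 0 through 3


def parse_rs(value: str) -> RadicalStroke:
    """Parse a single kRSUnicode-style value such as 212''.0."""
    match = RS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a radical-stroke value: {value!r}")
    radical, apostrophes, strokes = match.groups()
    return RadicalStroke(int(radical), int(strokes), len(apostrophes))


if __name__ == "__main__":
    # Made-up sample values for Radical #212, deliberately out of order.
    samples = ["212'''.0", "212.1", "212''.0", "212.0", "212'.0"]
    for value in sorted(samples, key=parse_rs):
        print(value, "->", VARIANT_NAMES[parse_rs(value).variant])
```

Because the tuple compares the radical first, then the residual stroke count, and only then the number of apostrophes, the four forms of a radical stay grouped within each radical-stroke value pair in the order listed above.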

Related to the above, Unicode Version 16.0 also introduced a “plain text” data file version of the Radical-Stroke Index that serves as an alternate representation of the same radical-stroke data. I developed a private command-line tool in Python years ago that now uses this data file to allow me to look up ideographs and their metadata by radical-stroke value pair.
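For readers who want to experiment with this kind of lookup, here is a minimal sketch. It is not my private tool, and it does not read the new data file, whose layout is not described in this article. Instead, it assumes the tab-separated line format of the Unihan database files (code point, property name, value), and the names build_index and lookup are hypothetical.

```python
from collections import defaultdict

# A handful of lines in the Unihan database style: code point, property
# name, and value, separated by tabs. Lines starting with "#" are comments.
SAMPLE = """\
# sample kRSUnicode records
U+4E00\tkRSUnicode\t1.0
U+4E01\tkRSUnicode\t1.1
U+4E03\tkRSUnicode\t1.1
U+9F8D\tkRSUnicode\t212.0
U+9F99\tkRSUnicode\t212'.0
"""


def build_index(lines) -> dict[str, list[str]]:
    """Map each radical-stroke value to the ideographs that carry it."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3 or fields[1] != "kRSUnicode":
            continue
        code_point, _, values = fields
        char = chr(int(code_point[2:], 16))  # drop the "U+" prefix
        for value in values.split():         # a record may list several values
            index[value].append(char)
    return dict(index)


def lookup(index, radical: int, strokes: int, apostrophes: int = 0) -> list[str]:
    """Return the ideographs for one radical-stroke value pair."""
    key = f"{radical}{chr(39) * apostrophes}.{strokes}"  # chr(39) is the apostrophe
    return index.get(key, [])


if __name__ == "__main__":
    index = build_index(SAMPLE.splitlines())
    print(lookup(index, 1, 1))
    print(lookup(index, 212, 0, apostrophes=1))
```

Pointing build_index at a full file in the same format, rather than the five sample lines, is all that a real lookup would require beyond this sketch.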

Three other related things were released on the same date that Unicode Version 16.0 itself was released:

  • Version 3 of UTN (Unicode Technical Note) #43, Unihan Database Property “kStrange,” which was updated for Unicode Version 16.0.
  • Version 4 of UTN #45, Unihan Property History, which was updated for Unicode Version 16.0.
  • Version 16.000 of the open source Last Resort fonts, which were updated for Unicode Version 16.0. This release is different from previous releases in that the sources and build scripts are now included in the repository.

Completely unrelated to Unicode Version 16.0, I published two new Unicode Technical Notes since last year’s SOTU report was published:

  • UTN #53, CJK Unified Ideographs Extension B, UCS2003 Reference Glyphs, which provides the location of and additional background information on the historical “UCS2003” (aka ISO/IEC 10646:2003) representative glyphs that were removed from the code charts for the CJK Unified Ideographs Extension B block in Unicode Version 14.0 (September, 2021).
  • UTN #60, Legacy Hangul Syllables Mappings, which provides historical mappings between the Unicode code points for the 11,172 characters in the Hangul Syllables block (AC00..D7A3) and legacy character set and encoding standards that include hangul syllables.
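As an aside on the figure of 11,172 mentioned in the last item, the Hangul Syllables block is arranged algorithmically: 19 leading consonants, 21 vowels, and 28 trailing positions (27 trailing consonants plus none). The sketch below shows that arithmetic. It illustrates the block's layout only, not the legacy mappings that UTN #60 provides, and the function names are hypothetical.

```python
HANGUL_BASE = 0xAC00   # first code point of the Hangul Syllables block
LEADS = 19             # leading consonants
VOWELS = 21            # vowels
TRAILS = 28            # 27 trailing consonants plus "no trailing consonant"
SYLLABLES = LEADS * VOWELS * TRAILS


def compose(lead: int, vowel: int, trail: int = 0) -> str:
    """Build a syllable from zero-based lead, vowel, and trail indexes."""
    if not (0 <= lead < LEADS and 0 <= vowel < VOWELS and 0 <= trail < TRAILS):
        raise ValueError("index out of range")
    return chr(HANGUL_BASE + (lead * VOWELS + vowel) * TRAILS + trail)


def decompose(syllable: str) -> tuple[int, int, int]:
    """Recover the zero-based lead, vowel, and trail indexes of a syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < SYLLABLES:
        raise ValueError("not in the Hangul Syllables block")
    lead, rest = divmod(offset, VOWELS * TRAILS)
    vowel, trail = divmod(rest, TRAILS)
    return lead, vowel, trail


if __name__ == "__main__":
    print(SYLLABLES)                              # number of syllables
    print(f"{HANGUL_BASE + SYLLABLES - 1:04X}")   # last code point in the block
    print(decompose("한"))
```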

Ideographic Research Group

Involvement in the IRG (Ideographic Research Group) ratcheted up a couple of notches this year, as evidenced by the following two bullet items:

  • I was appointed as the new Convenor of the IRG on 2024-06-14 during the co-located SC 2 (ISO/IEC JTC 1/SC 2) #29 and WG 2 (ISO/IEC JTC 1/SC 2/WG 2) #71 meetings that took place in Prague, Czech Republic. Dr LU Qin (陸勤) retired after serving as the Convenor of the IRG for 20 solid years, from 2004 until 2024. IRG Meeting #63, which took place last month in Seoul (ROK), was the first IRG meeting for which I served as its Convenor.
  • I established a new home page and a new document register for the IRG in early July, and have thus far migrated over one-third of the IRG documents from the previous document register. I expect to complete document migration—at least for the documents that are available in the previous document register—by this time next year.

About the first item, I presented an award to Dr Lu on the first day of IRG Meeting #63, and the photo below shows the etching of that award:

IRG Meeting #63 was hosted by the National Institute of Korean Language (NIKL). As the new Convenor, I prepared for the first time the meeting recommendations and action items, which I published as document IRG N2702.

A topic that will be discussed during IRG Meeting #64 next March is the establishment of a new block that includes CJK components. See document IRG N2733R for the preliminary proposal from China. My idea is to encode these CJK components as ordinary CJK Unified Ideographs in a new block tentatively named CJK Unified Ideographs Components, which would be modeled after the similarly-named Tangut Components block. I discussed this idea during UTC Meeting #181 that took place this week, and I received positive feedback.

Another topic that will be discussed during the next IRG meeting is the subject of script-hybrid ideographs. I demonstrated during IRG Meeting #63, by providing a brief tour of UTN #43, Unihan Database Property “kStrange,” that script-hybrid ideographs have already been encoded, particularly those that include hangul or kana components, along with other oddball components. The crux of the issue is really how to handle script-hybrid ideographs that include Latin components, such as the ones below that are abbreviated forms of the ideographs 慶 (kei) and 應 (ō) that represent 慶應大学 (keiō daigaku; Keio University):

IRG Working Set 2024

Han ideographs will continue to be encoded. To this end, seven IRG member bodies (China, ROK, SAT, TCA, UK, UTC, and Vietnam) submitted a total of 4,674 new ideographs for IRG Working Set 2024, whose review cycle has just started and will continue for the next two to three years. The first review round just finished. Please visit the new IRG home page to get links to each member body submission.

UTC Meetings

Attending UTC meetings in person is very important to me, and unfortunately I was able to attend only three of the four meetings this year in person. To both my surprise and amusement, I discovered in late April, while attempting to attend UTC Meeting #179, that I am suddenly on Adobe’s Watch List™, which means that I am unable to attend meetings and events that are hosted on Adobe property. I therefore attended that meeting remotely from home less than 10 miles away. Sigh. Of course, my departure from Adobe five years ago wasn’t exactly on the most amicable of terms, so I will simply attribute this phenomenon to Murphy. Or something…

In terms of good news, and about the CJK & Unihan Working Group in particular, please join me in welcoming its new Vice-Chair, Eiso Chan (陈永聪). I have known Eiso for many years, and he has already proven to be a very capable Vice-Chair of this UTC working group. I look forward to working with him for many years. The last time we met face-to-face was during IRG Meeting #51 six years ago.

To prepare for the UTC meetings, the CJK & Unihan Working Group continues to meet approximately three weeks prior to each meeting, for up to a full three hours. As the chair of the meeting and working group, I subsequently spend the entire weekend that follows preparing the CJK & Unihan Working Group Recommendations, which are presented during each UTC meeting. See document L2/24-227R for our working group’s recommendations for UTC Meeting #181.

Unicode Version 17.0

Nearly all of the new characters in Unicode Version 17.0 are expected to be in the new CJK Unified Ideographs Extension J block, which was processed by the IRG as IRG Working Set 2021. As submitted to WG 2 as document WG 2 N5257R2, the total number of ideographs is exactly 4,300 (see document IRG N2707 for its first draft). However, the UTC accepted the removal of two ideographs during UTC Meeting #181, so the block now stands at 4,298 ideographs.

Unicode Version 17.0 is also expected to include five additional ideographs that will be appended to the CJK Unified Ideographs Extension C block. Two of them are the result of disunifying U+5CC0 峀 and U+2335F 𣍟, and the other three are urgently-needed ideographs from China (one from document IRG N2691) and TCA (two from document IRG N2709).

Below is the current synopsis of CJK Unified Ideographs in the Unicode Standard with tentative additions for Unicode Version 17.0 highlighted in yellow:

CJK Unified Ideographs in Unicode Version 17.0 (tentative)

China prepared another UNC (Urgently Needed Character) proposal, document IRG N2753, which includes nine ideographs and will be discussed during IRG Meeting #64 next March, so the number of CJK Unified Ideographs in Unicode Version 17.0 is likely to increase by nine. There is always the possibility that they will be deferred until Unicode Version 18.0 (2026). If accepted, they would likely be added to the end of the CJK Unified Ideographs Extension E block.
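For those who like to check the arithmetic, the figures quoted in this section can be tallied as follows. This is a back-of-the-envelope sketch that uses only the numbers stated in this report and assumes no other additions or removals. The totals are tentative, and the names are hypothetical.

```python
# Figures quoted in this report; everything for Version 17.0 is tentative.
UNICODE_16_TOTAL = 97_680       # CJK Unified Ideographs in Version 16.0
EXTENSION_J_SUBMITTED = 4_300   # as submitted to WG 2
EXTENSION_J_REMOVED = 2         # removal accepted during UTC Meeting #181
EXTENSION_C_APPENDED = 5        # two disunifications plus three UNCs
UNC_PROPOSED = 9                # IRG N2753, not yet accepted


def tentative_total(include_unc: bool = False) -> int:
    """Tally the tentative Version 17.0 count from the figures above."""
    total = UNICODE_16_TOTAL
    total += EXTENSION_J_SUBMITTED - EXTENSION_J_REMOVED
    total += EXTENSION_C_APPENDED
    if include_unc:
        total += UNC_PROPOSED
    return total


if __name__ == "__main__":
    print(f"Without the nine proposed UNCs: {tentative_total():,}")
    print(f"With the nine proposed UNCs:    {tentative_total(True):,}")
```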

In closing, I look forward to continuing to chair both CJK & Unihan Working Group and IRG meetings, and attending more IRG and WG 2 meetings than in the past.

About the Author

Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over 28 years, from 1991-07-01 to 2019-10-18, specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC meetings, participates in the UTC Editorial Working Group, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Standard & Research Working Group in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, became the Chair of the CJK & Unihan Working Group in 2021, published UTN #45 (Unihan Property History) in 2022, published UTN #50 (KP-Source Property Value History) and UTN #53 (CJK Unified Ideographs Extension B, UCS2003 Reference Glyphs) in 2023, and published UTN #60 (Legacy Hangul Syllables Mappings) and was appointed as the new IRG (Ideographic Research Group) Convenor in 2024. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.
