2023 “State of the Unification” Report

By Dr Ken Lunde

Unusual this year were several events that affected my standardization-related activities, along with the extent of my engagement. Unchanged, of course, is that I continue to serve as the Chair of the Unicode CJK & Unihan Group. This article, which is the fifth installment of a brief annual report that began its life in 2019, provides to CJK (Chinese, Japanese & Korean) and Unihan (Unicode Han) experts and enthusiasts a snapshot of the latest developments.

See the 2019 “State of the Unification” Report for the original report, which I published on Adobe’s long-static CJK Type Blog, to which I contributed 300+ articles.

Unicode Version 15.1

Naturally, a new version of the Unicode Standard, Unicode Version 15.1, was successfully released on 2023-09-12. As the usual synopsis below illustrates, there are now 97,680 CJK Unified Ideographs in the Unicode Standard. New in Unicode Version 15.1 is the somewhat unexpected CJK Unified Ideographs Extension I block, which, unlike the previous two extension blocks that were encoded in Plane 3 (aka TIP or Tertiary Ideographic Plane), is encoded in Plane 2 and includes only 622 ideographs.

CJK Unified Ideographs in Unicode Version 15.1
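
The plane of a code point is simply its value divided by 0x10000, so the block placement described above is easy to check. The following is a minimal Python sketch. The first code points of the three blocks (U+30000 for Extension G, U+31350 for Extension H, and U+2EBF0 for Extension I) are taken from the Unicode code charts rather than from this report, and the `plane` helper is a hypothetical name used only for illustration.

```python
# First code points of the three most recent extension blocks.
# These values are assumed from the Unicode code charts; they are
# not stated in this report.
EXTENSION_STARTS = {
    "Extension G": 0x30000,
    "Extension H": 0x31350,
    "Extension I": 0x2EBF0,
}


def plane(code_point: int) -> int:
    """Return the plane number (0 through 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return code_point >> 16


for name, start in EXTENSION_STARTS.items():
    print(f"{name}: U+{start:X} is in Plane {plane(start)}")
```

Under these assumptions, Extensions G and H land in Plane 3, while Extension I lands in Plane 2.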

Read the article below to learn how the Extension I block materialized into existence:

In addition, and as mentioned in last year’s “State of the Unification” report, five new Ideographic Description Characters (IDCs) are included in Unicode Version 15.1:

Five new Ideographic Description Characters

As documented in UAX #38, there was a lot of churn among Unihan database properties, in terms of existing properties being removed and new properties being added:

In addition, the coverage of the existing kMorohashi property, which provides indexes into Japan’s Dai Kanwa Jiten (大漢和辞典) dictionary, was expanded from just under 18,000 ideographs to just over 49,000! That, combined with the new kJapanese and kMojiJoho properties that have comparable coverage, significantly increased the overall Japanese coverage in the Unihan database.

There were also some changes and improvements made to the code charts, such as the following:

  • The code charts for the CJK Unified Ideographs (aka URO), Extension A, and Extension B blocks now include over 24,000 KP-source (aka DPRK or North Korea) ideographs.
  • Directly related to the above, the code charts for the CJK Unified Ideographs block now accommodate seven columns.
  • The code charts for the Kangxi Radicals and CJK Radicals Supplement blocks now include helpful annotations. The former now explicitly specifies the radical number, from 1 (U+2F00 ⼀ KANGXI RADICAL ONE) to 214 (U+2FD5 ⿕ KANGXI RADICAL FLUTE), and the latter now explicitly specifies what type of variant a radical is, and of what radical.
Seven-column code chart excerpt for the CJK Unified Ideographs block
Annotations in the Kangxi Radicals (left) and CJK Radicals Supplement (right) blocks
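
Because the Kangxi Radicals block encodes the 214 radicals in order, the radical numbers in these annotations map directly onto code points. The following is a minimal Python sketch that uses only the standard `unicodedata` module; `kangxi_radical` is a hypothetical helper name, not part of any Unicode tooling.

```python
import unicodedata

KANGXI_FIRST = 0x2F00  # U+2F00 KANGXI RADICAL ONE (Radical 1)
KANGXI_COUNT = 214     # U+2FD5 KANGXI RADICAL FLUTE is Radical 214


def kangxi_radical(number: int) -> str:
    """Return the Kangxi Radicals block character for a radical number."""
    if not 1 <= number <= KANGXI_COUNT:
        raise ValueError(f"radical number out of range: {number}")
    # The block is in radical order, so the offset is the number minus one.
    return chr(KANGXI_FIRST + number - 1)


for number in (1, 214):
    char = kangxi_radical(number)
    print(f"Radical {number}: U+{ord(char):04X} {unicodedata.name(char)}")
```

The same arithmetic does not apply to the CJK Radicals Supplement block, whose variant radicals are not encoded one per radical number, which is why the new annotations there are useful.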

Speaking of Radicals 211 through 213:

The syntax of the kRSUnicode property was enhanced to accommodate non-Chinese simplified radicals. Chinese simplified radicals have been specified by a single apostrophe following the radical number, and non-Chinese simplified radicals can now be specified by two apostrophes following the radical number. The benefit of this enhancement is best visualized in the Radical-Stroke Index (46MB) whose annotated excerpt for Radical #211 is shown below:

Chinese simplified radicals are outlined in red, and non-Chinese simplified radicals are outlined in green. As you can see, the sorting order for any given radical-stroke value pair is 1) traditional; 2) Chinese simplified; and then 3) non-Chinese simplified. The other radical-stroke indexes, which are associated with the kIICore and kUnihanCore2020 properties, include the same enhancement, but they support a small subset of ideographs.
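
The apostrophe convention and the resulting sort order can be sketched in a few lines of Python. This is only an illustration, not the code that produces the radical-stroke indexes. The regular expression is a simplified approximation of the value syntax (UAX #38 is the authoritative definition), and the names `parse_rs` and `sort_key`, along with the sample values, are hypothetical.

```python
import re

# Simplified approximation of a single kRSUnicode value: a radical number,
# zero to two apostrophes, a period, then the residual stroke count.
RS_PATTERN = re.compile(r"^([1-9][0-9]{0,2})('{0,2})\.(-?[0-9]{1,2})$")

# Indexed by the number of apostrophes that follow the radical number.
RADICAL_TYPES = ("traditional", "Chinese simplified", "non-Chinese simplified")


def parse_rs(value: str) -> tuple[int, int, int]:
    """Split a value into (radical, residual strokes, apostrophe count)."""
    match = RS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a radical-stroke value: {value!r}")
    radical = int(match.group(1))
    if radical > 214:
        raise ValueError(f"radical number out of range: {radical}")
    return radical, int(match.group(3)), len(match.group(2))


def sort_key(value: str) -> tuple[int, int, int]:
    """Order by radical, then strokes, then traditional before simplified."""
    return parse_rs(value)


# Made-up sample values for Radical #211, deliberately out of order.
samples = ["211''.5", "211'.5", "211.5", "211'.0", "211.0"]
for value in sorted(samples, key=sort_key):
    radical, strokes, apostrophes = parse_rs(value)
    print(f"{value:8} radical {radical}, {strokes} strokes, "
          f"{RADICAL_TYPES[apostrophes]}")
```

Placing the apostrophe count last in the key is what keeps the three radical types together within each radical-stroke pair, rather than grouping all simplified forms after all traditional ones.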

Four other related things were released on the same date that Unicode Version 15.1 itself was released:

  • Version 2 of UTN (Unicode Technical Note) #43, Unihan Database Property “kStrange,” which was updated for Unicode Version 15.1.
  • Version 3 of UTN #45, Unihan Property History, which was updated for Unicode Version 15.1.
  • Version 2 of UTN #50, KP-Source Property Value History, which was updated for Unicode Version 15.1. This particular UTN was first published on 2023-01-23, and is meant to document changes to the kIRG_KPSource property for greater visibility for our friends in DPRK.
  • Version 15.100 of the open source Last Resort fonts, which were updated for Unicode Version 15.1.

The Passing of John H. Jenkins

I lost a close friend and colleague on 2023-02-28, and the standardization community lost a world-renowned expert on all things CJK. John had been a friend for over two decades, and was also a colleague in two senses. In one sense, we became colleagues when I started working at Apple at the beginning of 2020. Interestingly, John joined Apple in 1991, and I joined Adobe during the same year.

In another sense, we worked together on CJK- and Unihan-related matters for the Unicode Consortium, with me serving as Chair of the CJK & Unihan Group, and John serving as Vice-Chair.

As a result of John’s unexpected passing, I necessarily took over all of his responsibilities for the Unicode Standard, which included becoming the primary co-editor of UAX #38, the editor of UAX #45, and the overall maintainer of the Unihan database. This meant more editing responsibility, along with learning how to update the assets for UAX #45, and updating the Unihan database, to include the version that is used for the online Unihan Database Lookup page. It was a huge learning experience, but I am happy to report that I now have all of the tools and processes under control, and used the opportunity to make improvements that should reduce the likelihood of errors in the future.

Lee Collins and I contributed to this Unicode Blog article that honors John and his contributions to the Unicode Standard. Please take a few minutes out of your busy schedule to read it.

UTC Meetings

In-person UTC (Unicode Technical Committee) meetings were thankfully the norm this year, and hopefully that trend will continue, though they are necessarily taking place in hybrid fashion, with some regular participants joining remotely. The UTC #174 meeting, hosted by Google in Sunnyvale, CA, took place in January; the UTC #175 meeting, hosted by Adobe in San José, CA, took place in April; and the UTC #176 meeting, hosted by Microsoft in Redmond, WA, took place in July. I attended all three meetings in person, and only the third one involved travel. The next meeting, UTC #177, hosted by Apple in Cupertino, CA, takes place at the very beginning of November.

To prepare for the UTC meetings, the CJK & Unihan Group met almost three weeks prior to each meeting, usually for a full three hours. As the chair of the meeting and group, I subsequently spent the entire weekend that followed preparing the CJK & Unihan Group Recommendations, which were presented during each UTC meeting.

Unicode Version 16.0

All of the Unihan database changes that resulted from the UTC #175 and #176 meetings were necessarily deferred to Unicode Version 16.0 (2024), because there was risk associated with updating the Unihan database due to the passing of John H. Jenkins. He was the sole maintainer of the Unihan database. I was able to successfully restore the Unihan database on 2023-06-01, but because none of the changes were deemed critical, they continued to be deferred to Unicode Version 16.0. The only changes that were made to the Unihan database were the addition of the CJK Unified Ideographs Extension I block and a single kStrange property syntax correction.

Among the Unihan database changes that have been accumulating for Unicode 16.0, the following are two of the highlights:

  • The new provisional kFanqie property, which covers over 20,000 ideographs, will be added. See document L2/23-002R and Section 14 on pp. 22 and 23 of document L2/23-163 for more details.
  • Japan submitted a horizontal extension to add over 36,000 new source references to the kIRG_JSource property using the existing “JMJ” source prefix. This horizontal extension was first suggested in the CJK Type Blog article entitled Unihan & Moji Jōhō Kiban Project: The Tip of the Iceberg that I published at the beginning of 2018. See document L2/23-144 (aka WG2 N5221) for more details.

IRG Working Set 2024?

New IRG working sets are typically initiated every three years or so, and given that the current IRG working set, IRG Working Set 2021, is nearing maturity for becoming the new CJK Unified Ideographs Extension J block, it is time to start considering a new IRG working set. I therefore predict that the next one will be designated IRG Working Set 2024.

The number of ideographs that have since been added to UAX #45, to include those that are in the pipeline to be added for Unicode Version 16.0, is below 200. That is likely to be the UTC’s submission size.

In closing, preparing for the UTC #177 meeting, which is scheduled to take place from 2023-11-01 through 2023-11-03 as another face-to-face meeting, is expected to be much less work than preparing for the UTC meetings that took place earlier this year. The CJK & Unihan Group meeting in preparation for the UTC #177 meeting is scheduled to take place on the evening of 2023-10-13.

Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O'Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over 28 years (from 1991-07-01 to 2019-10-18), specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded "Source Han" (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded "Noto CJK" (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe's now-static CJK Type Blog. Ken serves as the Unicode Consortium's IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property "kStrange") in 2020, became the Chair of the CJK & Unihan Group in 2021, published UTN #45 (Unihan Property History) in 2022, and published UTN #50 (KP-Source Property Value History) in 2023.



