By Dr Ken Lunde
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.
Wait… Hold on. Full stop!
While that is clearly the text of the First Amendment of the Constitution of the United States of America, it is certainly not the first amendment that is the subject of this particular standardization-related article.
This article is about the not-yet-published first amendment of the GB 18030-2022 standard that itself was published almost one year ago in a country far, far away…
I have been working with China’s GB standards for well over three decades, and as far as I am aware, this is the very first time that an amendment was prepared for one. At least, that statement appears to be valid for the GB standards that are related to character sets and encodings.
As stated on page 14 of CJKV Information Processing, Second Edition, GB stands for Guo Biao (国标 guóbiāo), which is short for Guojia Biaozhun (国家标准 guójiā biāozhǔn), and simply means “National Standard.”
This article will cover the details and ramifications of this not-yet-published GB 18030-2022 amendment.
The First Draft
China made the first draft of this GB 18030-2022 amendment available for review toward the end of last year, and that is when I became aware of it. For those who would like to peruse the first draft of this amendment, please see document L2/23-113.
The CJK & Unihan Group recommendations for the UTC (Unicode Technical Committee) #174 meeting that took place in January of this year, specifically document L2/23-011, included in Section 18 on pp 23 and 24 the following three bullets that summarize the first draft:
- Adds the ideographs that have been subsequently appended to the URO (16: 9FF0..9FFF), Extension A (10: 4DB6..4DBF), Extension B (9: 2A6D7..2A6DF), and Extension C (5: 2B735..2B739), along with Extensions G and H in their entirety. This will synchronize GB 18030-2022 with Unicode Version 15.0, in terms of CJK Unified Ideograph coverage. (For those who are not aware, GB 18030-2022 proper is synchronized with Unicode Version 11.0.) Amending GB 18030 to synchronize with Unicode Version 15.0 is a good thing. At the same time, the Working Draft specifies the same implementation deadline of 2023-08-01 to support Extension G (4,939 ideographs) and Extension H (4,192 ideographs) at Implementation Level 3. This would likely be problematic for developers who are already burdened with supporting CJK Unified Ideographs through Extension F by the 2023-08-01 deadline. A longer lead time for development should be provided.
- Adds the 26 ideographs that have been appended to the URO (16: 9FF0..9FFF) and Extension A (10: 4DB6..4DBF) to the scope of Implementation Level 1, meaning that they will be required. Ken Lunde predicted that this would happen, though Ken didn’t expect it to happen so quickly. This is a good thing.
- Adds 897 ideographs, described as 公安人口信息专用字库补充汉字 (supplementary Chinese characters of the special font for population information used by public security departments), to the beginning of Plane 10 (A0000..A0380) whose code points are currently unallocated. This is an extraordinarily bad thing.
The changes that are described in the first two bullets were predictable, at least for a future version of the GB 18030 standard, and other than the burden for developers to support Extensions G and H in Implementation Level 3, are considered welcome changes. However, encoding an unvetted repertoire of 897 ideographs in an unallocated plane destabilizes the normative relationship between the GB 18030 and ISO/IEC 10646 standards. It also circumvents the international standardization process.
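The counts quoted in the three bullets above can be checked mechanically. What follows is a minimal Python sketch that assumes only the inclusive code point ranges given in those bullets; the helper names `range_size` and `plane_of` are illustrative and do not come from either standard.

```python
# Inclusive code point ranges quoted in the first-draft summary.
RANGES = {
    "URO additions": (0x9FF0, 0x9FFF),
    "Extension A additions": (0x4DB6, 0x4DBF),
    "Extension B additions": (0x2A6D7, 0x2A6DF),
    "Extension C additions": (0x2B735, 0x2B739),
    "Plane 10 repertoire": (0xA0000, 0xA0380),
}


def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def plane_of(code_point: int) -> int:
    """Unicode plane number (0 = BMP, 2 = SIP, and so on)."""
    return code_point >> 16


sizes = {name: range_size(*bounds) for name, bounds in RANGES.items()}
for name, size in sizes.items():
    print(f"{name}: {size}")

# The ideographs that the second bullet moves into Implementation Level 1.
level_1_additions = sizes["URO additions"] + sizes["Extension A additions"]
print(f"Implementation Level 1 additions: {level_1_additions}")
```

The sizes agree with the parenthesized counts in the bullets, and `plane_of(0xA0000)` confirms that the 897-ideograph repertoire starts at the very first code point of Plane 10.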
The 897 Plane 10 ideographs in the first draft whose code point range is U+A0000..U+A0380 are shown below:
The first thing that struck me as odd was that the ordering of the repertoire was seemingly random, and not according to the standard convention that uses indexing radical followed by the number of residual strokes (aka radical-stroke order).
The First Response
Exceptional circumstances require exceptional responses. I was tasked by the UTC to prepare feedback to China on the first draft of this amendment, which became document L2/23-057. The working assumption was that China was simply not aware that unallocated planes should not be used, and merely required feedback stating so. The following five recommendations were conveyed to China via that document:
- Remove altogether from this draft amendment the repertoire of 897 characters in Plane 10 in order to preserve synchronization with the ISO/IEC 10646 standard. For implementations that have an immediate need to support this repertoire of 897 characters, they should be mapped to code points that correspond to one of the three PUA (Private Use Area) blocks. Furthermore, these PUA characters should not be documented in this amendment. After more than 17 years, GB 18030-2022 finally removed the PUA requirement from this standard, so establishing a new PUA requirement, particularly for a large number of characters, is fundamentally a step backwards.
- Remove from the repertoire of 897 characters any character that is already encoded.
- Remove from the repertoire of 897 characters any character that is unifiable with an existing character in a CJK Unified Ideographs block according to IRG rules. These characters can be represented in “plain text” as registered IVSes (Ideographic Variation Sequences) in a new IVD (Ideographic Variation Database) collection according to the procedures described in UTS #37, Unicode Ideographic Variation Database.
- Remove from the repertoire of 897 characters any character that is currently in the pipeline for encoding, meaning that the character is in IRG Working Set 2021.
- Submit to the IRG as a UNC (Urgently Needed Character) repertoire according to IRG Principles and Procedures (aka IRG N2515) all remaining characters in the repertoire of 897 characters.
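The distinction between an unallocated plane and the three PUA blocks in the first recommendation can be made concrete. This is a hedged Python sketch, not part of the feedback document: the PUA ranges are those defined by Unicode, the function names are illustrative, and the `unicodedata` result depends on the Unicode version bundled with the Python build (Plane 10 is assumed to remain unassigned in it).

```python
import unicodedata

# The three Private Use Area blocks defined by Unicode (inclusive ranges).
PUA_BLOCKS = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (Plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (Plane 16)
)


def is_private_use(code_point: int) -> bool:
    """True if the code point falls inside one of the three PUA blocks."""
    return any(first <= code_point <= last for first, last in PUA_BLOCKS)


def general_category(code_point: int) -> str:
    """General_Category as reported by the bundled Unicode database."""
    return unicodedata.category(chr(code_point))


for cp in (0xA0000, 0xE000, 0xF0000, 0x100000):
    print(f"U+{cp:04X}: PUA={is_private_use(cp)} gc={general_category(cp)}")
```

A PUA code point reports the category `Co` (private use), whereas U+A0000 reports `Cn` (unassigned), which is the category that a conformant implementation should not treat as an encoded character.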
So far, so good… 🤞
The Second Draft
Subsequently, China made the second draft of this GB 18030-2022 amendment available for review at the beginning of March of this year, which included the disposition of comments against the first draft. For those who would like to peruse the second draft of this amendment and the disposition of comments against the first draft, please see document L2/23-110. The following are the two most important points from the second draft:
- None of the recommendations as conveyed in document L2/23-057 were adopted.
- The Plane 10 repertoire shrank to 614 ideographs.
The CJK & Unihan Group concluded that China fully intends to encode several hundred unvetted ideographs in an accelerated fashion, so I recommended that a small group of IRG (Ideographic Research Group) experts vet the repertoire, and propose a new CJK Unified Ideographs Extension I block as a compromise solution that adheres to the standardization process to the extent possible, but also recognizes the urgent nature of these ideographs.
The 614 Plane 10 ideographs in the second draft whose code point ranges are U+A0000..U+A0259 and U+A0270..U+A027B are shown below (the 22 code points that are between those two code point ranges are represented with a full-width space):
The Second Response
Ideographs, particularly when they are in a large repertoire, require vetting by multiple experts, and we did precisely that for the 614 Plane 10 ideographs in the second draft of the amendment. This was a one-month process that also involved establishing the required kIRG_GSource, kRSUnicode, and kTotalStrokes property values. In essence, 13 ideographs were removed from the second draft repertoire, and two ideographs were restored from the first draft repertoire, which resulted in a repertoire of 603 ideographs. The proposed Extension I block (U+2EBF0..U+2EE4F) is in Plane 2 (aka SIP or Supplementary Ideographic Plane), and the code point range is U+2EBF0..U+2EE4A. See document L2/23-106 for more details. The 603 Extension I ideographs are shown below:
The UTC accepted the Extension I block for Unicode Version 15.1 per Consensus 175-C10 during the UTC #175 meeting that took place at the end of April. The next step was to add Extension I to ISO/IEC 10646:2020 Amendment 2, which can be seen in document L2/23-146. Document L2/23-114, which includes the same repertoire of 603 ideographs, was discussed during the joint ISO/IEC JTC 1/SC 2 #28 and ISO/IEC JTC 1/SC 2/WG 2 #70 meetings in June, and was formally supported by the national bodies of Japan, the Czech Republic, and the United Kingdom. China was subsequently tasked to provide feedback on the Extension I repertoire by the end of June. China’s feedback (see document L2/23-154) was submitted on 2023-06-30, and resulted in an updated Extension I repertoire. See document L2/23-114R, which proposes a slightly expanded Extension I block (U+2EBF0..U+2EE5F) and includes 622 ideographs in the code point range U+2EBF0..U+2EE5D. Compared to the previous repertoire of 603 ideographs, 47 ideographs were removed and 66 ideographs from the first draft were restored. The 622 Extension I ideographs are shown below:
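The repertoire bookkeeping from the second draft through both Extension I proposals is easy to lose track of, so here is a small Python sketch that replays it. It assumes only the code point ranges and the removed/restored counts stated in this article; the variable names are illustrative.

```python
def range_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


# Second draft: two Plane 10 runs separated by a run of unused code points.
second_draft = range_size(0xA0000, 0xA0259) + range_size(0xA0270, 0xA027B)
gap = 0xA0270 - 0xA0259 - 1

# First Extension I proposal: 13 removed, 2 restored from the first draft.
first_proposal = second_draft - 13 + 2
# Updated Extension I proposal: 47 removed, 66 restored from the first draft.
updated_proposal = first_proposal - 47 + 66

# Assigned ranges versus the enclosing blocks in Plane 2.
first_assigned = range_size(0x2EBF0, 0x2EE4A)
first_block = range_size(0x2EBF0, 0x2EE4F)
updated_assigned = range_size(0x2EBF0, 0x2EE5D)
updated_block = range_size(0x2EBF0, 0x2EE5F)

print(f"second draft: {second_draft} ideographs, gap of {gap}")
print(f"first proposal: {first_proposal} (range holds {first_assigned})")
print(f"updated proposal: {updated_proposal} (range holds {updated_assigned})")
print(f"unused at block end: {first_block - first_assigned}, "
      f"then {updated_block - updated_assigned}")
print(f"plane of U+2EBF0: {0x2EBF0 >> 16}")
```

Each repertoire count matches the size of its stated code point range, which is a quick sanity check that the ranges are contiguous with no holes.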
The updated Extension I repertoire will be discussed during the UTC #176 meeting that takes place at the end of this month, and the UTC is expected to accept it for Unicode Version 15.1. This topic is covered in the CJK & Unihan Group recommendations for the UTC #176 meeting, specifically in Section 01 on pp 1 and 2 of document L2/23-163.
ADDED on 2023-08-30: As expected, the UTC accepted the updated 622-ideograph Extension I repertoire on 2023-07-25 during the UTC #176 meeting per Consensus 176-C1, and its repertoire and ordering are now frozen and stable for Unicode Version 15.1.
My high-level analysis of this GB 18030-2022 amendment can be summarized in the following two bullet points:
- Implementation Level 1 and Implementation Level 2 will become relatively stable. The URO and Extension A blocks became full as of Unicode Version 14.0 (2021) and Version 13.0 (2020), respectively, and the subsequent change to the former implementation level will be reflected in this amendment. The latter implementation level will not change.
- Implementation Level 3 is a genuine moving target. The GB 18030-2022 amendment, once published, is expected to add the Extension G, Extension H, and Extension I blocks. This means that developers who choose—or need—to support this implementation level are effectively on the hook to support any subsequent extension blocks. The next extension block, Extension J, will be the result of IRG Working Set 2021, which is in the final stages of standardization.
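One part of supporting new extension blocks is mechanical: GB 18030 maps supplementary-plane code points to four-byte sequences algorithmically, starting at 90 30 81 30 for U+10000. The Python sketch below illustrates that with the language's built-in `gb18030` codec. It rests on the assumption that the codec, which predates the 2022 edition, still reflects the current mapping for the supplementary planes; the sample labels and function names are illustrative. Encodability says nothing about font coverage, which is where the implementation levels apply.

```python
def gb18030_bytes(code_point: int) -> bytes:
    """Encode a single code point with Python's built-in gb18030 codec."""
    return chr(code_point).encode("gb18030")


def round_trips(code_point: int) -> bool:
    """True if encoding and then decoding returns the same character."""
    return gb18030_bytes(code_point).decode("gb18030") == chr(code_point)


# First code points of Extensions G and H, and both ends of the updated
# Extension I range quoted in this article.
SAMPLES = {
    "Extension G, first": 0x30000,
    "Extension H, first": 0x31350,
    "Extension I, first": 0x2EBF0,
    "Extension I, last": 0x2EE5D,
}

for label, cp in SAMPLES.items():
    encoded = gb18030_bytes(cp)
    print(f"U+{cp:04X} ({label}): {encoded.hex()} round-trip={round_trips(cp)}")
```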
In terms of fonts that support the implementation levels as updated according to this GB 18030-2022 amendment, as of this writing, only Apple’s PingFangSC system fonts, starting from the versions included in iOS Version 16.4, iPadOS Version 16.4, macOS Ventura Version 13.3, tvOS Version 16.4, and watchOS Version 9.4, all of which were released on 2023-03-27, are already compliant with Implementation Level 2.
About the Author
Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, became the Chair of the CJK & Unihan Group in 2021, published UTN #45 (Unihan Property History) in 2022, and published UTN #50 (KP-Source Property Value History) in 2023. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.