The GB 18030-2022 Standard

Dr Ken Lunde
14 min readAug 2, 2022

By Dr Ken Lunde

A non-trivial amount of my professional life is spent tracking regional character set standards, with an extraordinarily strong focus on ones for East Asia. When a significant standard is published or updated, I take it upon myself to research what changed, in a practical sense, and to spread the word to the developer community. That is the purpose of this particular article.

History & Overview

China first published the GB 18030 standard in 2000 as GB 18030-2000 (信息技术 信息交换用汉字编码字符集 基本集的扩充 Information technology — Chinese ideograms coded character set for information interchange — Extension for the basic set), and it was revised five years later as GB 18030-2005 (信息技术 中文编码字符集 Information Technology — Chinese coded character set). Note that the title changed. Seventeen years later, GB 18030-2022 was published with the same title.

GB 18030-2022 cover

The previous versions of this standard simply specified a required repertoire according to code point ranges. The 2022 version now specifies in Section 9 three implementation levels. The main change to be aware of is that Implementation Level 1 requires the following 66 ideographs in the ranges 9FA6..9FB3 and 9FBC..9FEF that were not required in the 2005 version (the ideographs in the intervening range, 9FB4..9FBB, were part of the required repertoire in the 2005 version, and are, of course, still required):

0x82358F33  U+9FA6 龦
0x82358F34 U+9FA7 龧
0x82358F35 U+9FA8 龨
0x82358F36 U+9FA9 龩
0x82358F37 U+9FAA 龪
0x82358F38 U+9FAB 龫
0x82358F39 U+9FAC 龬
0x82359030 U+9FAD 龭
0x82359031 U+9FAE 龮
0x82359032 U+9FAF 龯
0x82359033 U+9FB0 龰
0x82359034 U+9FB1 龱
0x82359035 U+9FB2 龲
0x82359036 U+9FB3 龳
0x82359135 U+9FBC 龼
0x82359136 U+9FBD 龽
0x82359137 U+9FBE 龾
0x82359138 U+9FBF 龿
0x82359139 U+9FC0 鿀
0x82359230 U+9FC1 鿁
0x82359231 U+9FC2 鿂
0x82359232 U+9FC3 鿃
0x82359233 U+9FC4 鿄
0x82359234 U+9FC5 鿅
0x82359235 U+9FC6 鿆
0x82359236 U+9FC7 鿇
0x82359237 U+9FC8 鿈
0x82359238 U+9FC9 鿉
0x82359239 U+9FCA 鿊
0x82359330 U+9FCB 鿋
0x82359331 U+9FCC 鿌
0x82359332 U+9FCD 鿍
0x82359333 U+9FCE 鿎
0x82359334 U+9FCF 鿏
0x82359335 U+9FD0 鿐
0x82359336 U+9FD1 鿑
0x82359337 U+9FD2 鿒
0x82359338 U+9FD3 鿓
0x82359339 U+9FD4 鿔
0x82359430 U+9FD5 鿕
0x82359431 U+9FD6 鿖
0x82359432 U+9FD7 鿗
0x82359433 U+9FD8 鿘
0x82359434 U+9FD9 鿙
0x82359435 U+9FDA 鿚
0x82359436 U+9FDB 鿛
0x82359437 U+9FDC 鿜
0x82359438 U+9FDD 鿝
0x82359439 U+9FDE 鿞
0x82359530 U+9FDF 鿟
0x82359531 U+9FE0 鿠
0x82359532 U+9FE1 鿡
0x82359533 U+9FE2 鿢
0x82359534 U+9FE3 鿣
0x82359535 U+9FE4 鿤
0x82359536 U+9FE5 鿥
0x82359537 U+9FE6 鿦
0x82359538 U+9FE7 鿧
0x82359539 U+9FE8 鿨
0x82359630 U+9FE9 鿩
0x82359631 U+9FEA 鿪
0x82359632 U+9FEB 鿫
0x82359633 U+9FEC 鿬
0x82359634 U+9FED 鿭
0x82359635 U+9FEE 鿮
0x82359636 U+9FEF 鿯

These are URO additions from Unicode Versions 4.1 (14), 5.1 (8), 5.2 (8), 6.1 (1), 8.0 (9), 10.0 (21), and 11.0 (5). The corresponding four-byte GB 18030 code points are shown in the first column to be consistent with the rest of this article.

Implementation Level 1 also states that implementations can choose to support one or more non-Chinese (aka regional or minority) scripts whose four-byte GB 18030 encoding ranges, numbers of characters, and Unicode block names are shown in Table 3 of GB 18030-2022 that is reproduced below with Unicode block names:

0x81318132–0x81319934  42     Arabic
0x8430BA32–0x8430FE35 59 Arabic Presentation Forms-A
0x84318730–0x84319530 84 Arabic Presentation Forms-B
0x8132E834–0x8132FD31 193 Tibetan
0x8134D238–0x8134E337 149 Mongolian
0x9034C538–0x9034C730 13 Mongolian Supplement
0x8134F434–0x8134F830 35 Tai Le
0x8134F932–0x81358437 83 New Tai Lue
0x81358B32–0x81359935 127 Tai Tham
0x82359833–0x82369435 1215 Yi Syllables
0x82369535–0x82369A32 48 Lisu
0x81339D36–0x8133B635 69 Hangul Jamo
0x8139A933–0x8139B734 51 Hangul Compatibility Jamo
0x8237CF35–0x8336BE36 3431 Hangul Syllables
0x9232C636–0x9232D635 133 Miao
0x81398B32–0x8139A135 214 Kangxi Radicals
0x8139EE39–0x82358738 6530 CJK Unified Ideographs Extension A
0x82358F33–0x82359636 66 CJK Unified Ideographs (URO)
0x95328236–0x9835F336 42711 CJK Unified Ideographs Extension B
0x9835F738–0x98399E36 4149 CJK Unified Ideographs Extension C
0x98399F38–0x9839B539 222 CJK Unified Ideographs Extension D
0x9839B632–0x9933FE33 5762 CJK Unified Ideographs Extension E
0x99348138–0x9939F730 7473 CJK Unified Ideographs Extension F

In other words, supporting non-Chinese scripts continues to be optional. We can also see from this table that the following non-Chinese scripts are recognized by GB 18030-2022: Arabic, Tibetan, Mongolian, Tai Le, New Tai Lue, Tai Tham, Yi, Lisu, Hangul (Korean), and Miao.

Those who are familiar with CJK Unified Ideographs blocks may be wondering why Extension A, which includes 6,582 ideographs (at least before Unicode Version 13.0), shows only 6530 in the table above. The missing 52 ideographs are actually mapped from the two-byte region of GB 18030 encoding — in all three versions of the GB 18030 standard — as follows:

0xFE56  U+3447 㑇
0xFE55 U+3473 㑳
0xFE5A U+359E 㖞
0xFE5C U+360E 㘎
0xFE5B U+361A 㘚
0xFE60 U+3918 㤘
0xFE5F U+396E 㥮
0xFE62 U+39CF 㧏
0xFE65 U+39D0 㧐
0xFE63 U+39DF 㧟
0xFE64 U+3A73 㩳
0xFE68 U+3B4E 㭎
0xFE69 U+3C6E 㱮
0xFE6A U+3CE0 㳠
0xFE6F U+4056 䁖
0xFE70 U+415F 䅟
0xFE72 U+4337 䌷
0xFE78 U+43AC 䎬
0xFE77 U+43B1 䎱
0xFE7A U+43DD 䏝
0xFE7B U+44D6 䓖
0xFE7D U+464C 䙌
0xFE7C U+4661 䙡
0xFE80 U+4723 䜣
0xFE81 U+4729 䜩
0xFE82 U+477C 䝼
0xFE83 U+478D 䞍
0xFE85 U+4947 䥇
0xFE86 U+497A 䥺
0xFE87 U+497D 䥽
0xFE88 U+4982 䦂
0xFE89 U+4983 䦃
0xFE8A U+4985 䦅
0xFE8B U+4986 䦆
0xFE8D U+499B 䦛
0xFE8C U+499F 䦟
0xFE8F U+49B6 䦶
0xFE8E U+49B7 䦷
0xFE96 U+4C77 䱷
0xFE93 U+4C9F 䲟
0xFE94 U+4CA0 䲠
0xFE95 U+4CA1 䲡
0xFE97 U+4CA2 䲢
0xFE92 U+4CA3 䲣
0xFE98 U+4D13 䴓
0xFE99 U+4D14 䴔
0xFE9A U+4D15 䴕
0xFE9B U+4D16 䴖
0xFE9C U+4D17 䴗
0xFE9D U+4D18 䴘
0xFE9E U+4D19 䴙
0xFE9F U+4DAE 䶮

I will go over Implementation Level 2 in the last section of this article. Implementation Level 3 is best described as all additional CJK Unified Ideographs, meaning Extensions B through F in their entirety, along with the 214 characters in the Kangxi Radicals (康熙部首) block.

In terms of corrections to representative glyphs, one that I recorded over eight years ago in this CJK Type Blog article, for U+4548 䕈 in Extension A, was corrected in GB 18030-2022. Below is the GB 18030-2022 code chart excerpt:

GB 18030-2022 code chart excerpt for 0x8233AF32 (U+4548 䕈)

Although it is not technically a correction per se, GB 18030-2022 is the very first GB character set standard whose representative glyph of U+FFE5 ¥ FULLWIDTH YEN SIGN is the more correct double-crossbar form. All previous GB character set standards that include this character include the single-crossbar form, which does not actually occur in the wild. Below is the GB 18030-2022 code chart excerpt:

GB 18030-2022 code chart excerpt for 0xA3A4 (U+FFE5 ¥)

No PUA Requirement

Except for 24 characters, all required characters in GB 18030-2005 had equivalent characters in the Unicode Standard. The characters that formerly were required to be represented using PUA (Private Use Area) code points can be classified into two groups. The first group involves mapping changes for 18 characters. The following list provides the two- or four-byte GB 18030 code points, their mappings to the Unicode Standard in GB 18030-2005, their mappings to the Unicode Standard in GB 18030-2022, and the characters, if any, that are displayed in the GB 18030-2022 code charts (on pp 11, 81, 160, and 242):

0xA6D9      U+E78D  U+FE10  ︐
0xA6DA U+E78E U+FE12 ︒
0xA6DB U+E78F U+FE11 ︑
0xA6DC U+E790 U+FE13 ︓
0xA6DD U+E791 U+FE14 ︔
0xA6DE U+E792 U+FE15 ︕
0xA6DF U+E793 U+FE16 ︖
0xA6EC U+E794 U+FE17 ︗
0xA6ED U+E795 U+FE18 ︘
0xA6F3 U+E796 U+FE19 ︙
0xFE59 U+E81E U+9FB4 龴
0xFE61 U+E826 U+9FB5 龵
0xFE66 U+E82B U+9FB6 龶
0xFE67 U+E82C U+9FB7 龷
0xFE6D U+E832 U+9FB8 龸
0xFE7E U+E843 U+9FB9 龹
0xFE90 U+E854 U+9FBA 龺
0xFEA0 U+E864 U+9FBB 龻
0x82359037 U+9FB4 U+E81E (none)
0x82359038 U+9FB5 U+E826 (none)
0x82359039 U+9FB6 U+E82B (none)
0x82359130 U+9FB7 U+E82C (none)
0x82359131 U+9FB8 U+E832 (none)
0x82359132 U+9FB9 U+E843 (none)
0x82359133 U+9FBA U+E854 (none)
0x82359134 U+9FBB U+E864 (none)
0x84318236 U+FE10 U+E78D (none)
0x84318237 U+FE11 U+E78F (none)
0x84318238 U+FE12 U+E78E (none)
0x84318239 U+FE13 U+E790 (none)
0x84318330 U+FE14 U+E791 (none)
0x84318331 U+FE15 U+E792 (none)
0x84318332 U+FE16 U+E793 (none)
0x84318333 U+FE17 U+E794 (none)
0x84318334 U+FE18 U+E795 (none)
0x84318335 U+FE19 U+E796 (none)

The second group simply involves preferring Extension B mappings over PUA ones. The following list provides the two- or four-byte GB 18030 code points, their mappings to the Unicode Standard in GB 18030-2005 and GB 18030-2022, and the ideographs, if any, that are displayed in the GB 18030-2022 code charts (on pp 81, 245, 246, 271, 295, and 325):

0xFE51      U+E816   (none)
0xFE52 U+E817 (none)
0xFE53 U+E818 (none)
0xFE6C U+E831 (none)
0xFE76 U+E83B (none)
0xFE91 U+E855 (none)
0x95329031 U+20087 𠂇
0x95329033 U+20089 𠂉
0x95329730 U+200CC 𠃌
0x9536B937 U+215D7 𡗗
0x9630BA35 U+2298F 𢦏
0x9635B630 U+241FE 𤇾

No CJK Compatibility Ideographs

Some of the required characters, 21 ideographs to be exact, were in the CJK Compatibility Ideographs block. 12 of them are actually CJK Unified Ideographs, which are still required according to GB 18030-2022. However, the nine that are actual CJK Compatibility Ideographs are no longer required, and a character is not displayed in the GB 18030-2022 code charts (on pp 80 and 81). The following list provides the two-byte GB 18030 code points, their mappings to the Unicode Standard in GB 18030-2000, GB 18030-2005, and GB 18030-2022, the ideographs themselves, and their canonically-equivalent CJK Unified Ideographs in parentheses:

0xFD9C  U+F92C 郎 (U+90CE 郎)
0xFD9D U+F979 凉 (U+51C9 凉)
0xFD9E U+F995 秊 (U+79CA 秊)
0xFD9F U+F9E7 裏 (U+88CF 裏)
0xFDA0 U+F9F1 隣 (U+96A3 隣)
0xFE40 U+FA0C 兀 (U+5140 兀)
0xFE41 U+FA0D 嗀 (U+55C0 嗀)
0xFE47 U+FA18 礼 (U+793C 礼)
0xFE49 U+FA20 蘒 (U+8612 蘒)

Of course, there is no requirement for font implementations to remove these nine CJK Compatibility Ideographs. There is simply no longer a requirement to include them. I suspect that most font implementations will leave the glyphs and their mappings in place for the sake of backward compatibility.

TGH 2013 Requirement

Incremental additions to the required portion of the GB 18030 standard are to be expected, and the 2022 update is no exception. The first section of this article mentioned the 66 ideographs that are now required for Implementation Level 1.

In terms of requirements for Implementation Level 2, China published 通用规范汉字表 (Tōngyòng Guīfàn Hànzìbiǎo; aka TGH 2013) in 2013, which lists 8,105 ideographs, called hanzi (汉字 hànzì) in Chinese. These ideographs are separated into three distinct levels that consist of 3,500, 3,000, and 1,605 ideographs, respectively.

通用规范汉字表 (Tōngyòng Guīfàn Hànzìbiǎo) cover

If we factor in Implementation Level 1, which includes the range 4E00..9FEF, there is coverage of 7,909 of the 8,105 ideographs in TGH 2013. What remains are 196 ideographs that are in the Extension B (36), Extension C (44), Extension D (8), and Extension E (108) blocks as listed below (their four-byte GB 18030 code points, their mappings to the Unicode Standard in GB 18030-2005 and GB 18030-2022, and the ideographs themselves are shown):

Extension B — 36 ideographs — Unicode Version 3.1 (2001)

0x9532A632  U+20164 𠅤
0x9533AA30 U+20676 𠙶
0x9534CE36 U+20CD0 𠳐
0x9535FE34 U+2139A 𡎚
0x95368C35 U+21413 𡐓
0x9632F737 U+235CB 𣗋
0x9634A937 U+23C97 𣲗
0x9634A938 U+23C98 𣲘
0x9634D133 U+23E23 𣸣
0x96378333 U+249DB 𤧛
0x96379335 U+24A7D 𤩽
0x96379B31 U+24AC9 𤫉
0x9639A936 U+25532 𥔲
0x9639AE34 U+25562 𥕢
0x9639B534 U+255A8 𥖨
0x9731A435 U+25ED7 𥻗
0x9731F837 U+26221 𦈡
0x9732B837 U+2648D 𦒍
0x9732E936 U+26676 𦙶
0x97338538 U+2677C 𦝼
0x9733E930 U+26B5C 𦭜
0x9733FC37 U+26C21 𦰡
0x97388237 U+27FF9 𧿹
0x9738EA36 U+28408 𨐈
0x9739AB30 U+28678 𨙸
0x9739AD39 U+28695 𨚕
0x9739CF30 U+287E0 𨟠
0x9830A833 U+28B49 𨭉
0x9830C137 U+28C47 𨱇
0x9830C235 U+28C4F 𨱏
0x9830C237 U+28C51 𨱑
0x9830C330 U+28C54 𨱔
0x9830FD31 U+28E99 𨺙
0x9834B536 U+29F7E 𩽾
0x9834B631 U+29F83 𩾃
0x9834B730 U+29F8C 𩾌

Extension C — 44 ideographs — Unicode Version 5.2 (2009)

0x98368F39  U+2A7DD 𪟝
0x9836AC35 U+2A8FB 𪣻
0x9836AF33 U+2A917 𪤗
0x9836CB34 U+2AA30 𪨰
0x9836CC30 U+2AA36 𪨶
0x9836CF34 U+2AA58 𪩘
0x9837D838 U+2AFA2 𪾢
0x98388137 U+2B127 𫄧
0x98388138 U+2B128 𫄨
0x98388333 U+2B137 𫄷
0x98388334 U+2B138 𫄸
0x98389535 U+2B1ED 𫇭
0x9838B130 U+2B300 𫌀
0x9838BA39 U+2B363 𫍣
0x9838BC31 U+2B36F 𫍯
0x9838BC34 U+2B372 𫍲
0x9838BD35 U+2B37D 𫍽
0x9838CB30 U+2B404 𫐄
0x9838CC32 U+2B410 𫐐
0x9838CC35 U+2B413 𫐓
0x9838D433 U+2B461 𫑡
0x9838E137 U+2B4E7 𫓧
0x9838E235 U+2B4EF 𫓯
0x9838E332 U+2B4F6 𫓶
0x9838E335 U+2B4F9 𫓹
0x9838E535 U+2B50D 𫔍
0x9838E536 U+2B50E 𫔎
0x9838E936 U+2B536 𫔶
0x9838F536 U+2B5AE 𫖮
0x9838F537 U+2B5AF 𫖯
0x9838F631 U+2B5B3 𫖳
0x9838FB33 U+2B5E7 𫗧
0x9838FC36 U+2B5F4 𫗴
0x98398236 U+2B61C 𫘜
0x98398237 U+2B61D 𫘝
0x98398336 U+2B626 𫘦
0x98398337 U+2B627 𫘧
0x98398338 U+2B628 𫘨
0x98398430 U+2B62A 𫘪
0x98398432 U+2B62C 𫘬
0x98398E37 U+2B695 𫚕
0x98398E38 U+2B696 𫚖
0x98399131 U+2B6AD 𫚭
0x98399735 U+2B6ED 𫛭

Extension D — 8 ideographs — Unicode Version 6.0 (2010)

0x9839AA33  U+2B7A9 𫞩
0x9839AD31 U+2B7C5 𫟅
0x9839B034 U+2B7E6 𫟦
0x9839B233 U+2B7F9 𫟹
0x9839B236 U+2B7FC 𫟼
0x9839B336 U+2B806 𫠆
0x9839B430 U+2B80A 𫠊
0x9839B538 U+2B81C 𫠜

Extension E — 108 ideographs — Unicode Version 8.0 (2015)

0x9839C534  U+2B8B8 𫢸
0x9839FA31 U+2BAC7 𫫇
0x99308B33 U+2BB5F 𫭟
0x99308B36 U+2BB62 𫭢
0x99308E32 U+2BB7C 𫭼
0x99308E39 U+2BB83 𫮃
0x99309E31 U+2BC1B 𫰛
0x9930C039 U+2BD77 𫵷
0x9930C235 U+2BD87 𫶇
0x9930CD37 U+2BDF7 𫷷
0x9930D237 U+2BE29 𫸩
0x99318739 U+2C029 𬀩
0x99318830 U+2C02A 𬀪
0x99319437 U+2C0A9 𬂩
0x99319830 U+2C0CA 𬃊
0x9931B237 U+2C1D5 𬇕
0x9931B331 U+2C1D9 𬇙
0x9931B633 U+2C1F9 𬇹
0x9931C334 U+2C27C 𬉼
0x9931C436 U+2C288 𬊈
0x9931C734 U+2C2A4 𬊤
0x9931D239 U+2C317 𬌗
0x9931D937 U+2C35B 𬍛
0x9931DA33 U+2C361 𬍡
0x9931DA36 U+2C364 𬍤
0x9931F738 U+2C488 𬒈
0x9931F930 U+2C494 𬒔
0x9931F933 U+2C497 𬒗
0x99328C34 U+2C542 𬕂
0x9932A133 U+2C613 𬘓
0x9932A138 U+2C618 𬘘
0x9932A237 U+2C621 𬘡
0x9932A335 U+2C629 𬘩
0x9932A337 U+2C62B 𬘫
0x9932A338 U+2C62C 𬘬
0x9932A339 U+2C62D 𬘭
0x9932A431 U+2C62F 𬘯
0x9932A630 U+2C642 𬙂
0x9932A638 U+2C64A 𬙊
0x9932A639 U+2C64B 𬙋
0x9932BD34 U+2C72C 𬜬
0x9932BD37 U+2C72F 𬜯
0x9932C839 U+2C79F 𬞟
0x9932CC33 U+2C7C1 𬟁
0x9932D233 U+2C7FD 𬟽
0x9932E833 U+2C8D9 𬣙
0x9932E838 U+2C8DE 𬣞
0x9932E931 U+2C8E1 𬣡
0x9932EA39 U+2C8F3 𬣳
0x9932EC39 U+2C907 𬤇
0x9932ED32 U+2C90A 𬤊
0x9932EF31 U+2C91D 𬤝
0x99338830 U+2CA02 𬨂
0x99338932 U+2CA0E 𬨎
0x99339433 U+2CA7D 𬩽
0x99339837 U+2CAA9 𬪩
0x9933A535 U+2CB29 𬬩
0x9933A539 U+2CB2D 𬬭
0x9933A630 U+2CB2E 𬬮
0x9933A633 U+2CB31 𬬱
0x9933A730 U+2CB38 𬬸
0x9933A731 U+2CB39 𬬹
0x9933A733 U+2CB3B 𬬻
0x9933A737 U+2CB3F 𬬿
0x9933A739 U+2CB41 𬭁
0x9933A838 U+2CB4A 𬭊
0x9933A932 U+2CB4E 𬭎
0x9933AA34 U+2CB5A 𬭚
0x9933AA35 U+2CB5B 𬭛
0x9933AB34 U+2CB64 𬭤
0x9933AB39 U+2CB69 𬭩
0x9933AC32 U+2CB6C 𬭬
0x9933AC35 U+2CB6F 𬭯
0x9933AC39 U+2CB73 𬭳
0x9933AD32 U+2CB76 𬭶
0x9933AD34 U+2CB78 𬭸
0x9933AD38 U+2CB7C 𬭼
0x9933B331 U+2CBB1 𬮱
0x9933B435 U+2CBBF 𬮿
0x9933B436 U+2CBC0 𬯀
0x9933B630 U+2CBCE 𬯎
0x9933C336 U+2CC56 𬱖
0x9933C435 U+2CC5F 𬱟
0x9933D335 U+2CCF5 𬳵
0x9933D336 U+2CCF6 𬳶
0x9933D433 U+2CCFD 𬳽
0x9933D435 U+2CCFF 𬳿
0x9933D438 U+2CD02 𬴂
0x9933D439 U+2CD03 𬴃
0x9933D536 U+2CD0A 𬴊
0x9933E235 U+2CD8B 𬶋
0x9933E237 U+2CD8D 𬶍
0x9933E239 U+2CD8F 𬶏
0x9933E330 U+2CD90 𬶐
0x9933E435 U+2CD9F 𬶟
0x9933E436 U+2CDA0 𬶠
0x9933E534 U+2CDA8 𬶨
0x9933E539 U+2CDAD 𬶭
0x9933E630 U+2CDAE 𬶮
0x9933E939 U+2CDD5 𬷕
0x9933F036 U+2CE18 𬸘
0x9933F038 U+2CE1A 𬸚
0x9933F137 U+2CE23 𬸣
0x9933F230 U+2CE26 𬸦
0x9933F234 U+2CE2A 𬸪
0x9933FA36 U+2CE7C 𬹼
0x9933FB38 U+2CE88 𬺈
0x9933FC39 U+2CE93 𬺓

According to the last paragraph in this CJK Type Blog article that I published over eight years ago, I predicted that TGH 2013 would become a GB 18030 requirement. At least, it is required for Implementation Level 2.

The official machine-readable tables that map GB 18030 code points to their equivalent code points in the Unicode Standard were posted on 2023-03-28 via a page entitled GB18030–2022与UCS代码映射表. The file named GB18030–2022MappingTableBMP.txt covers 63,488 BMP (Basic Multilingual Plane; aka Plane 0) code points, and as many can guess, the file named GB18030–2022MappingTableSMP.txt covers the 16 Supplementary Planes (Planes 1 through 16) to the tune of 1,048,576 code points.

For those who are interested in the mappings for the 8,105 ideographs in China’s 通用规范汉字表 (aka TGH 2013), the official machine-readable table was posted on 2023-03-27 via a page entitled 《通用规范汉字表》汉字的GB18030–2022与UCS代码映射表.

In terms of font implementations, only the Simplified Chinese fonts of the Noto Sans CJK (Google), Source Han Mono (Adobe), and Source Han Sans (Adobe) typeface families are already compliant with GB 18030-2022 Implementation Level 2 by virtue of forward-thinking on the part of their original developers (aka me 😎). Microsoft YaHei (Microsoft), Noto Serif CJK (Google), PingFang (Apple), and Source Han Serif (Adobe) — at least the current versions as of this writing — require a small number of URO additions that are associated with Implementation Level 1 in order to become compliant with GB 18030-2022 Implementation Level 2. As usual, I recorded the gory details.

Pro Tip: If you are interested in learning how to future-proof GB 18030–based font implementations, first consider the following two key points:

  1. GB 18030 is synchronized with ISO/IEC 10646, not with the Unicode Standard. The contents of GB 18030-2022 are synchronized with ISO/IEC 10646:2017 (aka Fifth Edition), which is equivalent to Unicode Version 11.0. This explains why the range 9FF0..9FFF is not included in Implementation Level 1. Those 16 ideographs were appended to the URO in Unicode Versions 13.0 (13) and 14.0 (3).
  2. New versions of GB 18030 incrementally include ideographs that are appended to the URO, whose precedent has been established by GB 18030-2022. I predicted this, and therefore predict that the same will be true of ideographs appended to Extension A. The key point here is that these are the only two CJK Unified Ideographs blocks that are required in their entirety, and that requirement would logically extend to any ideographs that are appended to them. This would impact Implementation Level 1, meaning critical for font implementations. Both the URO and Extension A blocks are now full, as of Unicode Versions 14.0 and 13.0, respectively.

So, in order to future-proof font implementations, I recommend that glyphs for the additional ideographs that were appended to the URO and Extension A be added. The following lists provide the four-byte GB 18030 code points, their mappings to the Unicode Standard in GB 18030-2000, GB 18030-2005, and GB 18030-2022, and the ideographs themselves:

URO — 13 ideographs — Unicode Version 13.0 (2020)

0x82359637  U+9FF0 鿰
0x82359638 U+9FF1 鿱
0x82359639 U+9FF2 鿲
0x82359730 U+9FF3 鿳
0x82359731 U+9FF4 鿴
0x82359732 U+9FF5 鿵
0x82359733 U+9FF6 鿶
0x82359734 U+9FF7 鿷
0x82359735 U+9FF8 鿸
0x82359736 U+9FF9 鿹
0x82359737 U+9FFA 鿺
0x82359738 U+9FFB 鿻
0x82359739 U+9FFC 鿼

URO — 3 ideographs — Unicode Version 14.0 (2021)

0x82359830  U+9FFD 鿽
0x82359831 U+9FFE 鿾
0x82359832 U+9FFF 鿿

Extension A — 10 ideographs — Unicode Version 13.0 (2020)

0x82358739  U+4DB6 䶶
0x82358830 U+4DB7 䶷
0x82358831 U+4DB8 䶸
0x82358832 U+4DB9 䶹
0x82358833 U+4DBA 䶺
0x82358834 U+4DBB 䶻
0x82358835 U+4DBC 䶼
0x82358836 U+4DBD 䶽
0x82358837 U+4DBE 䶾
0x82358838 U+4DBF 䶿

If you heard about an amendment to GB 18030-2022 and want to know more, look no further than The First Amendment.

About the Author

Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange) in 2020, became the Chair of the CJK & Unihan Group in 2021, and published UTN #45 (Unihan Property History) in 2022. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.

--

--

Dr Ken Lunde
Dr Ken Lunde

Written by Dr Ken Lunde

ISO/IEC JTC 1/SC 2/WG 2/IRG Convenor—Almaden Valley—San José—CA—USA—NW Hemisphere—Terra—Sol—Orion-Cygnus Arm—Milky Way—Local Group—Laniakea Supercluster