Unihan Property History

Dr Ken Lunde
Jul 10, 2021

By Dr Ken Lunde, Janitor, Spirits of Christmas Past

Dr kStrange holding the 84-stroke CJK Unified Ideograph U+3106C 𱁬 (Unicode Version 13.0) — note that the “k” is silent

Unihan refers to the tens of thousands of Han ideographs that are encoded in the Unicode Standard, most of which are CJK Unified Ideographs that are in nine separate blocks (and yes, for the benefit of the experts out there, one of those nine blocks is CJK Compatibility Ideographs in which there are twelve CJK Unified Ideographs). As of Unicode Version 13.0 (2020), there are 92,856 CJK Unified Ideographs. There are also 1,002 CJK Compatibility Ideographs in two blocks, and that figure has been unchanged since Unicode Version 6.3 (2013) and is unlikely to change in the future.

Unicode Version 13.0—CJK Unified Ideographs

This article is my way to introduce a new resource for Unihan enthusiasts, and will detail what led to its creation, along with its immediate and longer-term benefits.

One Spreadsheet, Three Sheets

Now is always a good time to find ways to improve the world in which we live. For those who specialize in a particular field, that field is their world. UTN (Unicode Technical Note) #45, Unihan Property History, which is effectively the landing page for a downloadable Microsoft Excel spreadsheet that consists of three sheets, is my way to improve the field in which I work. I encourage you to bookmark it.

This exercise began with a recent suggestion to add to UAX #38, Unicode Han Database (Unihan), the number of ideographs covered by each of the nearly 100 Unihan properties, each of which has its own table in UAX #38. The purpose of this suggestion, of course, was to explicitly indicate the extent of coverage of each Unihan property.

This particular suggestion was a complete non-starter in its original form, partly because it would require a boatload of work to implement initially, but mainly because it would be a maintenance nightmare of immense proportions. No one who is alive today wants to edit approximately 100 individual HTML tables to update a single figure in each new version of the Unicode Standard.

There must be A Better Way™.

This led me to develop a new Python script, unihan-stats.py, that reads one or more Unihan database files from STDIN and reports statistics for each Unihan property to STDOUT. For each Unihan property, it reports the number of ideographs that have a property value, along with the percentage of coverage among all ideographs.
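The core of such a script can be sketched in a few lines. This is not unihan-stats.py itself, only a minimal illustration: the function name property_coverage is hypothetical, and it assumes the usual Unihan record layout of three tab-separated fields (code point, property name, value) with "#" comment lines. It also assumes that the denominator for coverage is the set of code points seen in the input, which may differ from how the actual script counts "all ideographs."

```python
from collections import defaultdict


def property_coverage(lines):
    """Return {property: (ideograph_count, percent_coverage)}.

    Hypothetical helper: coverage is measured against the set of
    code points that appear anywhere in the supplied records.
    """
    chars_by_property = defaultdict(set)
    all_chars = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0].startswith("U+"):
            continue  # ignore anything that is not a full record
        code_point, prop = fields[0], fields[1]
        chars_by_property[prop].add(code_point)
        all_chars.add(code_point)
    total = len(all_chars)
    return {
        prop: (len(chars), 100.0 * len(chars) / total)
        for prop, chars in chars_by_property.items()
    }


# Illustrative input only: three records for two ideographs.
sample = [
    "# sample records",
    "U+4E00\tkRSUnicode\t1.0",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E01\tkRSUnicode\t1.1",
]
for prop, (count, percent) in sorted(property_coverage(sample).items()):
    print(f"{prop}\t{count}\t{percent:.1f}%")
```

To feed it real data, the same function could be called with sys.stdin as the lines argument, for example when several Unihan database files are concatenated on the command line.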

The Unihan database for Unicode Versions 2.0 through 5.1 was distributed as a single file. Starting from Unicode Version 5.2, the properties were distributed among the following eight files:

  • Unihan_DictionaryIndices.txt
  • Unihan_DictionaryLikeData.txt
  • Unihan_IRGSources.txt
  • Unihan_NumericValues.txt
  • Unihan_OtherMappings.txt
  • Unihan_RadicalStrokeCounts.txt
  • Unihan_Readings.txt
  • Unihan_Variants.txt

Please refer to UAX #38 Section 4.3, Listing by Location within Unihan.zip, to determine which properties are included in which Unihan database files, an assignment that has changed over time.

By default, the Python script reports statistics only for the Unihan properties that are present in the Unihan database that is supplied via STDIN, and does so for both CJK Unified Ideographs and CJK Compatibility Ideographs. There are command-line options to report statistics only for CJK Unified Ideographs or CJK Compatibility Ideographs, along with one that reports statistics for all known Unihan properties. As of Unicode Version 14.0 (2021), there are 108 known Unihan properties.
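Separating the two kinds of ideographs comes down to a range check on the code point. The sketch below is an assumption about how such a filter might look, not the script's actual logic; the names classify, UNIFIED_RANGES, COMPAT_RANGES, and UNIFIED_IN_COMPAT_BLOCK are hypothetical. The ranges are whole blocks as of Unicode Version 13.0, so they include some unassigned code points, and the twelve CJK Unified Ideographs that live in the CJK Compatibility Ideographs block are handled as a special case.

```python
# Block ranges as of Unicode Version 13.0 (block boundaries, not
# assigned-character boundaries).
UNIFIED_RANGES = [
    (0x3400, 0x4DBF),    # Extension A
    (0x4E00, 0x9FFF),    # URO
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2B820, 0x2CEAF),  # Extension E
    (0x2CEB0, 0x2EBEF),  # Extension F
    (0x30000, 0x3134F),  # Extension G
]
COMPAT_RANGES = [
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]
# The twelve CJK Unified Ideographs in the compatibility block.
UNIFIED_IN_COMPAT_BLOCK = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}


def classify(code_point):
    """Map a 'U+XXXX' string to 'unified', 'compatibility', or None."""
    value = int(code_point[2:], 16)
    if value in UNIFIED_IN_COMPAT_BLOCK:
        return "unified"
    if any(lo <= value <= hi for lo, hi in UNIFIED_RANGES):
        return "unified"
    if any(lo <= value <= hi for lo, hi in COMPAT_RANGES):
        return "compatibility"
    return None


print(classify("U+3106C"), classify("U+FA0E"), classify("U+F900"))
```

Checking the twelve exceptions first keeps the two range lists simple, at the cost of one extra set lookup per record.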

These command-line options were invoked to produce the data for the following three sheets:

  • Unified & Compatibility—CJK Unified Ideographs and CJK Compatibility Ideographs for 108 Unihan properties
  • Unified—CJK Unified Ideographs for 108 Unihan properties
  • Compatibility—CJK Compatibility Ideographs for 108 Unihan properties

The file for the Unihan database for Unicode Version 2.0, Unihan-1.txt, is truncated in the middle of the records for U+8BC1 证, which means that the data for approximately 25% of the 20,902 CJK Unified Ideographs are missing, and all of the data for the 290 CJK Compatibility Ideographs are missing. While this truncation is mentioned in Section 5, History, of UAX #38, what isn’t mentioned is that the same file that is included on the CD-ROM in The Unicode Standard, Version 2.0 book is truncated at the same position. This spreadsheet includes the statistics for the Unihan database for Unicode Version 2.0 based on a restored version that I prepared. After careful analysis, I concluded that the only substantive difference between the Version 2.0.0 and Version 2.1.2 files, in terms of statistics, was the presence of two kSemanticVariant records and two kSpecializedSemanticVariant records in the former version.
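The "approximately 25%" figure can be checked with simple arithmetic. The sketch below rests on two assumptions that are not stated above: that the 20,902 CJK Unified Ideographs of Version 2.0 occupy U+4E00 through U+9FA5, and that the records in the file are sorted by code point, so that everything after U+8BC1 is lost.

```python
FIRST, LAST = 0x4E00, 0x9FA5  # assumed Version 2.0 range
TRUNCATED_AT = 0x8BC1         # records for this code point are cut off

total = LAST - FIRST + 1           # number of ideographs in the range
missing = LAST - TRUNCATED_AT      # code points entirely after the cut
percent_missing = 100 * missing / total

print(total, missing, f"{percent_missing:.1f}%")
```

The result lands a little under one quarter, which is consistent with the rounded figure.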

Identifying Incomplete Unihan Properties

Incomplete coverage is the norm for Unihan properties, but an incredibly small number of them (two, to be exact) are expected to have 100% coverage in the Unihan database: kRSUnicode and kTotalStrokes.

The Unihan Property History spreadsheet quickly diagnosed that the draft Unihan database for Unicode Version 14.0 (2021) was missing kTotalStrokes property values for 21 ideographs in the range U+9FD6 鿖 through U+9FEA 鿪. These were Unicode Version 10.0 (2017) additions, meaning that the same issue was present in the Unihan database for Unicode Versions 10.0 through 13.0. This discovery demonstrated the value of the spreadsheet.

Of course, I prepared and submitted proposed kTotalStrokes property values for these 21 ideographs, which are reflected in the Unihan database for Unicode Version 14.0.
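A gap of this kind can also be located directly from the records. The following is a minimal sketch under stated assumptions rather than the method used above: the function name ideographs_missing is hypothetical, records are assumed to be tab-separated triples, and an ideograph is only noticed if it has at least one other property value in the input.

```python
def ideographs_missing(lines, required="kTotalStrokes"):
    """List code points that have some property value but not `required`."""
    seen, covered = set(), set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3 or not fields[0].startswith("U+"):
            continue  # comments, blank lines, incomplete records
        seen.add(fields[0])
        if fields[1] == required:
            covered.add(fields[0])
    # Sort numerically so that five-digit code points come last.
    return sorted(seen - covered, key=lambda cp: int(cp[2:], 16))


# Illustrative input: kTotalStrokes is deliberately left out for U+4E01.
sample_records = [
    "U+4E00\tkRSUnicode\t1.0",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E01\tkRSUnicode\t1.1",
]
print(ideographs_missing(sample_records))
```

Running the same check with required="kRSUnicode" covers the other property that is expected to be complete.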

My Python script detected one anomalous record in the Unihan database file for Unicode Version 3.1.1, Unihan-3.1.1.txt, at line 246,442:

U+64AC 297
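The record above has a code point and a value but no property name. One way to flag records like it is to validate each line against the expected three-field shape. This is a sketch, not the actual check in unihan-stats.py: the names RECORD and anomalous_records are hypothetical, and the sample assumes that the anomalous record uses a tab as its separator.

```python
import re

# Expected shape: code point, TAB, property name starting with "k",
# TAB, non-empty value.
RECORD = re.compile(r"^U\+[0-9A-F]{4,6}\tk[A-Za-z0-9_]+\t.+$")


def anomalous_records(lines):
    """Yield (line_number, text) for records that do not fit the shape."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text or text.startswith("#"):
            continue  # blank lines and comments are fine
        if not RECORD.match(text):
            yield number, text


# Illustrative input; the last line mimics the anomalous record.
sample_lines = [
    "# header comment",
    "U+4E00\tkTotalStrokes\t1",
    "U+64AC\t297",
]
for number, text in anomalous_records(sample_lines):
    print(number, repr(text))
```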

Identifying Undocumented Unihan Properties

How does one discover undocumented Unihan properties? Good question. Current Unihan properties are fully documented in UAX #38, and each property includes a table that specifies the following information:

  • Property
  • Status
  • Category
  • Introduced
  • Delimiter
  • Syntax
  • Description

Earlier versions of the Unihan database—prior to the establishment of UAX #38 in 2008 for Unicode Version 5.1—documented each property in its file header.

The first clue is to identify which Unihan properties have been removed, which happens to be the topic of the next section.

The Unihan database for Unicode Version 3.0 includes three properties that are not documented: kAlternateJEF, kJHJ, and kRSMerged:

  • The kAlternateJEF and kRSMerged properties are minimally documented in that the table in Section 4.2 of UAX #38 correctly indicates that both properties were added in Unicode Version 3.0, but were removed in Unicode Version 3.1. However, the header of the Unihan database file for Unicode Version 3.0, Unihan-3.txt, provides no description for these two properties.
  • The kJHJ property is neither present in the table in Section 4.2 of UAX #38 nor has a description in the Unihan database file for Unicode Version 3.0. This property is, however, present in UAX #42, Unicode Character Database in XML.

Speaking of UAX #42, I found that Section 4.4.23, Unihan properties, includes an entry for the kWubi property, but it is neither present in any known version of the Unihan database nor mentioned in UAX #38. The background can be found in the minutes for UTC #107.

With regard to the undocumented kJHJ property, a clue as to its origins can be found in the initials of one of the co-editors of UAX #38, Unicode Han Database (Unihan). Interestingly, someone named Thomas Chan noticed this undocumented property on 2000-10-12.

Removed Unihan Properties

After a period of time, the UTC (Unicode Technical Committee) may determine that a particular Unihan property is no longer necessary, either because its coverage is minimal and is unlikely to improve, or because it is redundant when compared to another Unihan property.

Unicode Version 14.0 (2021) includes 99 Unihan properties. The latest property is kStrange, which is documented in UTN #43, Unihan Database Property “kStrange.” Historically, however, there are 108 Unihan properties. This means that nine Unihan properties have been removed over time, all of which are listed below (their added/dropped Unicode version numbers are in parentheses):

  • kAlternateHanYu (2.0/3.2)
  • kAlternateJEF (3.0/3.1)
  • kAlternateKangXi (2.0/4.1)
  • kAlternateMorohashi (2.0/4.1)
  • kJHJ (3.0/3.1)
  • kRSJapanese (2.0/13.0)
  • kRSKanWa (2.0/13.0)
  • kRSKorean (2.0/13.0)
  • kRSMerged (3.0/3.1)

As stated in the previous section, the kAlternateJEF, kJHJ, and kRSMerged properties are undocumented.
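Finding removed properties amounts to a set difference between the property names of two versions, and the totals above can be checked the same way. The sketch below is illustrative only: removed_properties is a hypothetical name, and the two version sets are small made-up subsets rather than the full property lists.

```python
def removed_properties(older, newer):
    """Return property names present in `older` but absent from `newer`."""
    return sorted(set(older) - set(newer))


# Illustrative subsets, not the complete lists for these versions.
version_3_0 = {"kAlternateJEF", "kJHJ", "kRSMerged", "kRSUnicode"}
version_3_1 = {"kRSUnicode"}
print(removed_properties(version_3_0, version_3_1))

# The nine removed properties listed above, checked against the totals.
removed = [
    "kAlternateHanYu", "kAlternateJEF", "kAlternateKangXi",
    "kAlternateMorohashi", "kJHJ", "kRSJapanese", "kRSKanWa",
    "kRSKorean", "kRSMerged",
]
print(108 - len(removed))
```

In practice the two sets would be collected from the second field of every record in each version's database files.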

Summary

Naturally, if you work with the Unihan database, I hope that UTN #45 proves to be a useful resource in your work or study. I shall endeavor to keep this resource up to date with each release of the Unicode Standard, which is trivial now that I have the tools to do so. For those who would like to access the live version, which is a three-sheet Google Sheets spreadsheet, click here. Any rows or columns that are considered draft are highlighted in yellow.

Oh, and I have been one of the co-editors of UAX #38, Unicode Han Database (Unihan), since Unicode Version 6.1 (2012).

About the Author

Dr Ken Lunde worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison, served as Adobe’s representative to the Unicode Consortium since 2006, was Adobe’s primary representative from 2015 until 2019, serves as Unicode’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, and became the Chair of the CJK & Unihan Group in 2021. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR AWD Tesla Model 3 EVs.

Written by Dr Ken Lunde

ISO/IEC JTC 1/SC 2/WG 2/IRG Convenor—Almaden Valley—San José—CA—USA—NW Hemisphere—Terra—Sol—Orion-Cygnus Arm—Milky Way—Local Group—Laniakea Supercluster
