The Power of “Plain Text”

Dr Ken Lunde
Mar 12, 2022

Twelve years ago I presented The Power of “Plain Text” & the Importance of Meaningful Content at IUC34 (34th Internationalization & Unicode Conference), which is worth revisiting in article form, because almost all of its content is still relevant today. To summarize the presentation:

“Plain text” represents raw text data; can endure in more environments; survives most transcoding operations; persists throughout a document workflow; can be edited, searched, copied, pasted, imported, and exported; can be repurposed, mined, analyzed, and transliterated; can be stylized, made rich, or marked up; and serves as the foundation for “meaningful content.” Furthermore, Unicode “plain text” is far more usable than that based on legacy encodings due to its broad coverage of scripts (159 as of Unicode Version 14.0) and its extensive use of properties.

Editors, or text editors, are among the most wonderful apps, because they allow one to work with raw—or plain—text. One need not care about font selection, at least at the character level, and styles, such as bold and italic, are ignored. From such apps, text can be repurposed via Copy & Paste or by importing a saved file into another app.

When I use the TextEdit app on macOS, I have its Preferences set to plain text for new documents, and I need not care about font selection (much) as font fallback allows virtually any font to be used if the selected font lacks a glyph for one or more characters. (For my daily text-editing needs, I have been using Sublime Text since late 2020, and for the prior 25+ years I was using command-line Emacs in the Terminal app.)

TextEdit app Preferences settings (macOS)

Likewise, when I compose email using the Mail app on macOS, I also have its Preferences set to plain text for both new emails and when responding to emails.

Mail app Preferences settings (macOS)

For all other apps that do not treat pasted text data as plain text, such as the Notes app or virtually all Microsoft apps, I first paste the text into a plain-text environment, such as Sublime Text (macOS) or Textastic (iOS/iPadOS), to effectively sanitize it, copy it again, and then paste it into the desired app. I feel fortunate that Adobe InDesign, which I use almost daily, treats pasted text data as plain text.

How many times have you copied text from a web page only to see unwanted formatting show up when pasting into another app? It is for this reason that I almost always need to sanitize it.

X certainly marks the spot when the X is genuinely an X, and not a Х. What in the world does this even mean in the context of an article about plain text?

How a character appears, which is its presentation, is not nearly as important as its content, or underlying meaning, which leads us to meaningful content and its pitfalls, particularly in the context of the Unicode Standard.

To demonstrate this, search for “plain text” throughout this article. The string “рlаіn tехt” will not be among the search results. Why? Because five of the characters—p, a, i, e, and our friend x—are from a different script, specifically Cyrillic, and therefore have different code points than their Latin script lookalikes. For typefaces whose fonts support both the Latin and Cyrillic scripts, which is common, the glyphs for these lookalike characters are likely to be identical.
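The lookalike problem is easy to observe with Python’s standard unicodedata module. This sketch builds the Cyrillic-contaminated string from escape sequences so that the difference is explicit rather than invisible:

```python
import unicodedata

latin = "plain text"
# The same-looking string with five Cyrillic lookalikes:
# р (U+0440), а (U+0430), і (U+0456), е (U+0435), х (U+0445)
mixed = "\u0440l\u0430\u0456n t\u0435\u0445t"

print(latin == mixed)  # False: visually identical, different code points

# Show exactly which positions differ and why
for l, m in zip(latin, mixed):
    if l != m:
        print(f"U+{ord(l):04X} {unicodedata.name(l)}  vs  "
              f"U+{ord(m):04X} {unicodedata.name(m)}")
```

This is why a plain search for “plain text” skips the contaminated string: string comparison operates on code points, not on glyphs.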

As a further demonstration using another script, Han, search for “八十二” (82) throughout this article. Unlike the previous example, the string “⼋⼗⼆” will—or at least should—be among the search results. In this case, both strings include characters from the Han script. The former string uses code points from the main CJK Unified Ideographs block (aka Unified Repertoire & ordering or URO), but the latter string uses code points from the Kangxi Radicals block. The latter characters are normalized to the former characters when searching.
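This behavior can also be verified with unicodedata, keeping in mind that search implementations typically apply compatibility normalization. The Kangxi radicals carry compatibility decompositions, so NFKC and NFKD fold them into the corresponding unified ideographs, while NFC and NFD, which apply only canonical mappings, leave them alone:

```python
import unicodedata

kangxi = "\u2F0B\u2F17\u2F06"   # ⼋⼗⼆ from the Kangxi Radicals block
unified = "\u516B\u5341\u4E8C"  # 八十二 from the URO

print(kangxi == unified)  # False: different code points

# Compatibility normalization folds the radicals into the URO characters
print(unicodedata.normalize("NFKC", kangxi) == unified)  # True

# Canonical-only normalization does not
print(unicodedata.normalize("NFC", kangxi) == kangxi)    # True
```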

Now, let’s explore in more detail the seven pitfalls that I identified twelve years ago:

Pitfall #1: Code Point Poaching

Code point poaching is when an inappropriate glyph is mapped from the code point of an existing character, and sacrifices long-term stability for short-term benefits. Besides unexpected visual behavior, which makes it difficult to detect, the application of Unicode properties can result in other forms of unexpected behavior. Code point poaching is common practice for fonts that include glyphs for unsupported scripts or characters.

Due to the persistence of particular legacy standards, such as JIS X 0201 (Japan) and KS X 1003 (South Korea), the character U+005C \ REVERSE SOLIDUS may appear as U+00A5 ¥ YEN SIGN or U+20A9 ₩ WON SIGN in some environments. This is because the equivalent code point, 0x5C, mapped to those characters in those legacy standards.

PUA (Private Use Area) code point usage is considered a slightly better practice, which brings us to our next pitfall…

Pitfall #2: PUA Code Point Usage

The 137,468 PUA code points in the Unicode Standard—6,400 in the BMP (U+E000 through U+F8FF), 65,534 in Plane 15 (U+F0000 through U+FFFFD), and 65,534 in Plane 16 (U+100000 through U+10FFFD)—have no useful character properties, and can be reliably exchanged only in closed environments. Unlike code point poaching, it is relatively easy to detect PUA code point usage.
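That detectability comes from the General Category property: every PUA code point is category Co (Private Use) and has no character name in the Unicode Character Database, which a sketch like this can check:

```python
import unicodedata

# The first and last code point of each of the three PUA ranges
for cp in (0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD):
    ch = chr(cp)
    assert unicodedata.category(ch) == "Co"  # Co = Private Use
    try:
        unicodedata.name(ch)  # PUA code points have no UCD name
    except ValueError:
        print(f"U+{cp:04X} is Private Use and unnamed")
```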

As a point of trivia, all legacy implementations of emoji are PUA-based. Unicode Version 6.0 fixed that.

Pitfall #3: Normalization

Normalization is a process whereby multiple representations of the same character or sequence are transformed into a common representation. The primary benefit, of course, is for searching.

A prototypical example is an accented Latin character, such as o with a macron accent. It can be represented as a single character, U+014D LATIN SMALL LETTER O WITH MACRON, or as the sequence U+006F LATIN SMALL LETTER O plus U+0304 COMBINING MACRON. In terms of normalization forms, the former character is NFC and NFKC, and the latter sequence is NFD and NFKD.
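The round trip between the two representations is easy to demonstrate with unicodedata.normalize:

```python
import unicodedata

composed = "\u014D"     # ō as a single character (NFC/NFKC form)
decomposed = "o\u0304"  # o plus COMBINING MACRON (NFD/NFKD form)

print(composed == decomposed)  # False as raw code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
# For this pair there is no compatibility mapping involved,
# so NFKC agrees with NFC and NFKD agrees with NFD
print(unicodedata.normalize("NFKC", decomposed) == composed)  # True
```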

For the Han script, there are 1,002 CJK Compatibility Ideographs in two blocks that are affected by normalization regardless of which of the four normalization forms is applied. Among Japan’s 863 Jinmei-yō Kanji (人名用漢字), which are kanji for use in personal names, 57 are CJK Compatibility Ideographs, and 18 additional kanji in Japan’s JIS X 0213 standard are also CJK Compatibility Ideographs. For example, U+FA47 漢 is normalized into U+6F22 漢. The only way to preserve CJK Compatibility Ideographs in plain text is to use their corresponding Standardized Variation Sequences (SVSes).
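The U+FA47 example, and the SVS escape hatch, can be sketched as follows. Note that compatibility ideographs have singleton canonical decompositions, which is why even NFC erases them, while variation selectors pass through normalization untouched. (The pairing of U+FE00 with this particular base character reflects my reading of the registered SVSes for 漢; consult StandardizedVariants.txt for the authoritative list.)

```python
import unicodedata

compat = "\uFA47"   # CJK COMPATIBILITY IDEOGRAPH-FA47 (漢)
unified = "\u6F22"  # U+6F22 漢, a CJK Unified Ideograph

# All four normalization forms erase the compatibility ideograph
print(unicodedata.normalize("NFC", compat) == unified)   # True
print(unicodedata.normalize("NFKD", compat) == unified)  # True

# The corresponding SVS—base ideograph plus a variation selector—
# survives, because normalization leaves variation selectors alone
svs = "\u6F22\uFE00"
print(unicodedata.normalize("NFC", svs) == svs)  # True
```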

IMPORTANT NOTE: The CJK Compatibility Ideographs block includes twelve CJK Unified Ideographs that are not subject to normalization.

The important takeaway is that normalization may be applied at any time by any process that acts on text data, so it is therefore unwise to depend on distinctions that are erased by normalization.

Pitfall #4: Unassigned/Reserved/Noncharacter Code Point Usage

Like PUA code points, unassigned, reserved, and noncharacter code points have no useful character properties, and should therefore not be used. And, as the name suggests, unassigned code points may become assigned in a future version of the Unicode Standard.
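Noncharacters are easy to spot programmatically: they are permanently assigned to no character, so they share the general category Cn with unassigned code points and have no name. A quick check, using only noncharacters so the result does not depend on the Unicode version:

```python
import unicodedata

# U+FDD0 is one of the 32 BMP noncharacters; U+FFFE and U+10FFFE
# are the plane-final noncharacters of Plane 0 and Plane 16
for cp in (0xFDD0, 0xFFFE, 0x10FFFE):
    ch = chr(cp)
    assert unicodedata.category(ch) == "Cn"  # Cn = unassigned
    try:
        unicodedata.name(ch)
    except ValueError:
        print(f"U+{cp:04X} has no name and no useful properties")
```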

Pitfall #5: Characters That “Fall Between The Proverbial Cracks”

Each subsequent version of the Unicode Standard includes more characters, which are either added to existing blocks or as new blocks. The size of existing blocks can also change. Unless one stays up-to-date and familiar with the latest version of the Unicode Standard, it is easy to overlook additional characters, particularly those that are added to existing blocks.

In terms of the eight CJK Unified Ideograph blocks in Unicode Version 14.0, the table below shows that the URO, Extension A, and Extension B blocks are now full. Ideographs were appended to the URO nine times before it became full. Extension H is expected to be included in Unicode Version 15.0 (2022) as the ninth CJK Unified Ideographs block.

CJK Unified Ideographs in Unicode Version 14.0 (2021)

And, as shown above and as mentioned in Pitfall #3, twelve CJK Unified Ideographs—U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, and U+FA27 through U+FA29—are in the CJK Compatibility Ideographs block, and are therefore easily overlooked as CJK Unified Ideographs.
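One way to confirm that these twelve behave as unified rather than compatibility ideographs is to observe that, unlike their neighbors in the block, they survive every normalization form:

```python
import unicodedata

# The twelve CJK Unified Ideographs that live inside the
# CJK Compatibility Ideographs block (U+F900 through U+FAFF)
unified_in_compat = [0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
                     0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29]

for cp in unified_in_compat:
    ch = chr(cp)
    # No decomposition mapping, so normalization leaves them intact
    assert unicodedata.normalize("NFC", ch) == ch
    assert unicodedata.normalize("NFKC", ch) == ch

print(len(unified_in_compat), "code points, all normalization-stable")
```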

Pitfall #6: Fonts With Glyphs That Map From More Than One Code Point

It is common—and appropriate—to map multiple code points to the same glyph, to ensure that the glyph is identical. Not so when crossing script boundaries, such as the large number of lookalike characters in the Latin and Cyrillic scripts.

In the case of the Han script, it is considered good practice to map the characters in the Kangxi Radicals block and their corresponding CJK Unified Ideographs to the same glyph. For example, U+2F00 ⼀ and U+4E00 一 map to CID+1200 in Adobe-Japan1–based fonts.

The thing to be aware of is that in some environments, such as PDF, the original content may not be preserved, depending on the app that produced the PDF. For our example, all instances of U+2F00 ⼀ and U+4E00 一 may be treated as the former or the latter, though the latter would be the preferred code point, because the former normalizes to the latter. Adobe InDesign is unique as a PDF producer in that its exported PDFs preserve both code points.

Pitfall #7: Supporting Only BMP Code Points

The BMP (Basic Multilingual Plane) is merely one of 17 planes in the Unicode Standard. Of course, being the first plane (aka Plane 0), it includes the most frequently used characters. As of Unicode Version 14.0, there are only 1,426 available code points in the BMP.
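The practical consequence of BMP-only thinking is the surrogate pair: every beyond-BMP code point occupies two UTF-16 code units, which an implementation that counts only 16-bit units will mishandle. A small illustration:

```python
ch = "\U00020BB7"  # 𠮷 (U+20BB7), a CJK Extension B ideograph beyond the BMP

print(len(ch))  # 1: Python strings count code points, not UTF-16 units

utf16 = ch.encode("utf-16-be")
print(len(utf16) // 2)  # 2: one surrogate pair in UTF-16

high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"{high:04X} {low:04X}")  # D842 DFB7
```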

It was actually in this IUC34 presentation that I first delivered Why Support Beyond-BMP Code Points? as a Top Ten List (see pp 16 through 27), so I suggest clicking below to see the latest-and-greatest (2022) version.

In other words, after twelve years, there is still no excuse for BMP-only implementations. Sadly, they still exist.

This article should serve as an excellent reminder to cherish plain text and the content that it conveys, and to understand the pitfalls that can hamper its use. I was actually surprised to see that virtually all of the content from my IUC34 presentation is still relevant today. I guess that speaks to the robust nature of plain text.

About the Author

Dr Ken Lunde has worked for Apple as a Font Developer since 2021-08-02 (and was in the same role as a contractor from 2020-01-16 through 2021-07-30), is the author of CJKV Information Processing Second Edition (O’Reilly Media, 2009), and earned BA (1987), MA (1988), and PhD (1994) degrees in linguistics from The University of Wisconsin-Madison. Prior to working at Apple, he worked at Adobe for over twenty-eight years — from 1991-07-01 to 2019-10-18 — specializing in CJKV Type Development, meaning that he architected and developed fonts for East Asian typefaces, along with the standards and specifications on which they are based. He architected and developed the Adobe-branded “Source Han” (Source Han Sans, Source Han Serif, and Source Han Mono) and Google-branded “Noto CJK” (Noto Sans CJK and Noto Serif CJK) open source Pan-CJK typeface families that were released in 2014, 2017, and 2019, and published over 300 articles on Adobe’s now-static CJK Type Blog. Ken serves as the Unicode Consortium’s IVD (Ideographic Variation Database) Registrar, attends UTC and IRG meetings, participates in the Unicode Editorial Committee, became an individual Unicode Life Member in 2018, received the 2018 Unicode Bulldog Award, was a Unicode Technical Director from 2018 to 2020, became a Vice-Chair of the Emoji Subcommittee in 2019, published UTN #43 (Unihan Database Property “kStrange”) in 2020, became the Chair of the CJK & Unihan Group in 2021, and published UTN #45 (Unihan Property History) in 2022. He and his wife, Hitomi, are proud owners of a His & Hers pair of acceleration-boosted 2018 LR Dual Motor AWD Tesla Model 3 EVs.
