Unicode equivalence
from Wikipedia

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E n LATIN SMALL LETTER N followed by U+0303 ◌̃ COMBINING TILDE is defined by Unicode to be canonically equivalent to the single code point U+00F1 ñ LATIN SMALL LETTER N WITH TILDE (a letter of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).

Sources of equivalence


Character duplication


For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a ring diacritic above" is encoded as U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE (a letter of the alphabet in Swedish and several other languages) or as U+212B ANGSTROM SIGN. Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (such as ⟨V⟩ for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.

Combining and precomposed characters


For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are U+0303 ◌̃ COMBINING TILDE and the Japanese diacritic dakuten (U+3099 ◌゙ COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.

Example

Amélie with its two canonically equivalent Unicode forms (NFC and NFD)
NFC character    A     m     é          l     i     e
NFC code point   0041  006d  00e9       006c  0069  0065
NFD code point   0041  006d  0065 0301  006c  0069  0065
NFD character    A     m     e  ◌́       l     i     e
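
A minimal Python sketch using the standard library's unicodedata module reproduces the table above by listing the code points of each form (expected output in comments):

    import unicodedata

    word = "Am\u00e9lie"                     # "Amélie" with precomposed é
    nfc = unicodedata.normalize("NFC", word)
    nfd = unicodedata.normalize("NFD", word)

    # Print each form as a list of hexadecimal code points.
    print([f"{ord(c):04x}" for c in nfc])   # ['0041', '006d', '00e9', '006c', '0069', '0065']
    print([f"{ord(c):04x}" for c in nfd])   # ['0041', '006d', '0065', '0301', '006c', '0069', '0065']
    print(nfc == nfd)                                 # False: different code points...
    print(unicodedata.normalize("NFC", nfd) == nfc)   # True: ...but canonically equivalent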

Typographical non-interaction


Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.

Typographic conventions


Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
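
A brief sketch (Python's standard unicodedata module) illustrates this distinction: canonical normalization leaves such compatibility variants unchanged, while compatibility normalization folds them to their plain counterparts.

    import unicodedata

    # ① CIRCLED DIGIT ONE, Ａ FULLWIDTH LATIN CAPITAL A, ﬀ LATIN SMALL LIGATURE FF
    for ch in ["\u2460", "\uff21", "\ufb00"]:
        print(ch,
              unicodedata.normalize("NFC", ch),    # unchanged: not canonically equivalent
              unicodedata.normalize("NFKC", ch))   # folded to '1', 'A', 'ff'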

Encoding errors


UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.

Normalization


Text-processing software implementing Unicode string search and comparison functionality must take into account the presence of equivalent code points. Otherwise, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.

Algorithms


Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ﬃ), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
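
The search difference can be demonstrated with a short Python sketch using the standard unicodedata module:

    import unicodedata

    ffi = "\ufb03"           # LATIN SMALL LIGATURE FFI
    nine = "\u2168"          # ROMAN NUMERAL NINE (Ⅸ)
    five_sup = "\u2075"      # SUPERSCRIPT FIVE

    print("f" in unicodedata.normalize("NFC", ffi))    # False: NFC keeps the ligature intact
    print("f" in unicodedata.normalize("NFKC", ffi))   # True:  NFKC decomposes it to "ffi"
    print("I" in unicodedata.normalize("NFKC", nine))  # True:  Ⅸ becomes "IX"
    print(unicodedata.normalize("NFKC", five_sup))     # '5'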

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]
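
These tags can be read directly from the character database; a small sketch with Python's unicodedata.decomposition() shows the tag that precedes each compatibility mapping (expected output in comments):

    import unicodedata

    print(unicodedata.decomposition("\ufb03"))  # '<compat> 0066 0066 0069' (ffi ligature)
    print(unicodedata.decomposition("\u2075"))  # '<super> 0035'            (superscript five)
    print(unicodedata.decomposition("\u00e9"))  # '0065 0301'               (canonical mapping: no tag)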

Normal forms


The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.

NFD
Normalization Form Canonical Decomposition
Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC
Normalization Form Canonical Composition
Characters are decomposed and then recomposed by canonical equivalence.
NFKD
Normalization Form Compatibility Decomposition
Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC
Normalization Form Compatibility Composition
Characters are decomposed by compatibility, then recomposed by canonical equivalence.

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.

The normal forms are not closed under string concatenation.[3] For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break composition.

However, they are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus also not bijective (cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
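
This loss of information can be seen in a short Python sketch using the standard unicodedata module:

    import unicodedata

    angstrom = "\u212b"   # ANGSTROM SIGN
    a_ring   = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

    # Both decompose to "A" + COMBINING RING ABOVE (U+0041 U+030A).
    print(unicodedata.normalize("NFD", angstrom) == unicodedata.normalize("NFD", a_ring))  # True
    # NFC maps both to U+00C5; the original U+212B cannot be recovered afterwards.
    print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True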

A single character (other than a Hangul syllable block) that will be replaced by another under normalization can be identified in the Unicode tables by having a non-empty compatibility field but lacking a compatibility tag.

Canonical ordering


The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
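
A small Python sketch (standard unicodedata module) confirms that the two accent orders are not equivalent, because both marks share combining class 230:

    import unicodedata

    composed = "\u1ebf"                 # ế LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    circ_then_acute = "e\u0302\u0301"   # e + circumflex + acute
    acute_then_circ = "e\u0301\u0302"   # e + acute + circumflex

    print(unicodedata.combining("\u0302"), unicodedata.combining("\u0301"))  # 230 230: same class, no reordering
    print(unicodedata.normalize("NFC", circ_then_acute) == composed)         # True
    print(unicodedata.normalize("NFC", acute_then_circ) == composed)         # False: order is significant
    print(unicodedata.normalize("NFC", acute_then_circ))                     # 'é' + U+0302, as described above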

Errors due to normalization differences


When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Netatalk and Samba file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
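
A minimal Python sketch (standard unicodedata module) shows how the same visible filename becomes two different byte sequences under NFC and NFD, which is the kind of mismatch described above:

    import unicodedata

    name_nfc = unicodedata.normalize("NFC", "caf\u00e9.txt")
    name_nfd = unicodedata.normalize("NFD", "caf\u00e9.txt")

    print(name_nfc == name_nfd)        # False: a naive filename comparison misses the match
    print(name_nfc.encode("utf-8"))    # b'caf\xc3\xa9.txt'
    print(name_nfd.encode("utf-8"))    # b'cafe\xcc\x81.txt'
    # A robust comparison normalizes both sides to the same form first.
    print(unicodedata.normalize("NFC", name_nfd) == name_nfc)   # True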

from Grokipedia
Unicode equivalence refers to the relationships defined in the Unicode Standard between distinct sequences of code points that represent the same or semantically similar text content, enabling processes like normalization to standardize these representations for consistent processing and display across systems. The Unicode Standard distinguishes two primary types of equivalence: canonical equivalence, which applies to sequences that represent the identical abstract character with the same visual appearance and behavior, such as the precomposed character ç being equivalent to the letter c followed by the combining cedilla ◌̧; and compatibility equivalence, a broader category where sequences may have similar but not identical appearances or behaviors, often arising from legacy encodings or stylistic variants, like the mathematical symbol ℌ decomposing to the plain letter H. Canonical equivalence ensures that text maintains its intended meaning and rendering regardless of decomposition or composition, while compatibility equivalence supports mappings from non-Unicode sources but does not guarantee full behavioral interchangeability.

To handle these equivalences practically, the Unicode Standard specifies four normalization forms: canonical decomposition (NFD), canonical decomposition followed by composition into precomposed characters where possible (NFC), compatibility decomposition (NFKD), which includes mappings for variant forms, and compatibility decomposition followed by canonical composition (NFKC). These forms provide unique representations for equivalent strings, preventing issues in searching, sorting, and collation by ensuring that, for example, "é" (U+00E9) and "e" + combining acute accent (U+0065 U+0301) both normalize to the same form in NFC.

Normalization is crucial in applications such as web rendering and database storage, as it resolves canonical equivalences while optionally addressing compatibility ones, with conformance requiring implementations to produce stable results aligned with the Unicode NormalizationTest.txt file and to maintain consistency across versions since Unicode 4.1. Stability policies further guarantee that normalization outcomes for assigned characters do not change in future versions, with composition rules fixed since Unicode 3.1.0, promoting reliable long-term text handling.

Fundamentals

Definition and Scope

Unicode equivalence refers to the situation in which two or more distinct sequences of Unicode code points, often forming grapheme clusters, represent the same abstract character or unit of text, either semantically or in terms of visual rendering. In the Unicode Standard, code points are the numeric values (from U+0000 to U+10FFFF) assigned to characters, while abstract characters denote the conceptual semantic or orthographic units of text independent of their encoding, and grapheme clusters represent the minimal user-perceived units of text that may span multiple code points. This equivalence ensures that text processing can handle variations in representation without altering meaning or appearance.

The scope of Unicode equivalence encompasses two primary categories: canonical equivalence and compatibility equivalence. Canonical equivalence applies to sequences that preserve both the semantics and canonical rendering of the text, such as decomposed and composed forms of accented characters, where the sequences are considered identical for all purposes in the standard. Compatibility equivalence, in contrast, covers cases where sequences represent the same abstract character but allow for differences in visual appearance or behavior, such as typographic variants or ligatures, enabling decomposition for compatibility with legacy systems while requiring careful handling in applications. Equivalence classes, such as those resulting from normalization forms like NFC (composed) and NFD (decomposed), illustrate how these sequences can be interchanged within the standard's framework.

Historically, Unicode equivalence emerged from the effort to merge diverse legacy encoding standards, including ISO 8859 for Western scripts and Shift-JIS for Japanese, into a single universal character encoding. This unification process intentionally introduced redundancies to maintain round-trip compatibility with existing systems, allowing multiple sequences to encode the same text unit while supporting interoperability across global text processing. Normalization provides a means to resolve these equivalences by transforming sequences into standardized forms, though the detailed algorithms are specified elsewhere in the standard.

Canonical vs Compatibility Equivalence

Canonical equivalence in Unicode refers to sequences of code points that represent the same abstract character and are intended to have identical semantic meaning, visual appearance, and behavior across all contexts, making them fully interchangeable without loss of information. For instance, the precomposed character "é" (U+00E9) is canonically equivalent to the sequence "e" followed by the combining acute accent (U+0065 U+0301), as both render the same accented letter with preserved meaning and round-trip stability in normalization processes. This type of equivalence ensures that text processing operations, such as rendering and comparison, treat these forms identically, supporting full fidelity in applications like search and indexing.

In contrast, compatibility equivalence applies to character sequences that are similar but not semantically identical, often introduced for historical, stylistic, or legacy compatibility reasons, allowing decomposition into simpler forms that may alter typographic or formatting intent. A common example is the ligature "fi" (U+FB01), which decomposes compatibly to the separate characters "f" and "i" (U+0066 U+0069), enabling normalization for legacy systems but potentially losing the intended joined appearance. Unlike canonical equivalence, compatibility mappings do not guarantee round-trip stability or preservation of original semantics, as they prioritize broader compatibility over strict identity.

The primary differences between canonical and compatibility equivalence lie in their scope and implications for text handling: canonical equivalence maintains complete interchangeability and fidelity, ensuring no loss in meaning or appearance during operations like normalization, whereas compatibility equivalence permits simplification that can introduce variations in rendering or reduce precision in tasks such as full-text search. For example, canonical equivalences like diacritic combinations preserve exact semantics, while compatibility ones, such as superscript digits (e.g., "²" U+00B2 decomposing to "2" U+0032), allow stylistic variants but risk altering document intent if over-applied. These distinctions are formalized in Unicode Standard Annex #15 (UAX #15), with ongoing updates through Unicode 17.0 (2025) emphasizing secure handling of equivalences to mitigate risks like confusables in identifiers.
Aspect | Canonical equivalence example | Compatibility equivalence example
Representation | "é" (U+00E9) ↔ "e" + combining acute (U+0065 U+0301) | "fi" (U+FB01) → "f" + "i" (U+0066 U+0069)
Purpose | Semantic and visual identity | Stylistic or legacy normalization
Interchangeability | Full, with no loss in rendering or meaning | Partial, may alter typographic intent
Implications | Preserves fidelity in search/indexing | Risks reduced precision in processing
Historical duplicates in Unicode contribute to both types of equivalence by providing multiple encodings for the same character, though canonical forms prioritize semantic consistency.

Sources of Equivalence

Character Duplication and Historical Duplicates

Unicode incorporated characters from multiple legacy encodings during its development to ensure compatibility with established standards, resulting in duplicate code points for the same abstract character. This approach prioritized round-trip conversion, where text encoded in older systems like ISO/IEC 8859-1 could be accurately mapped to and from Unicode without loss. For instance, the Latin-1 Supplement block (U+0080–U+00FF) includes characters such as the micro sign (U+00B5), which serves as a compatibility encoding of the Greek small letter mu (U+03BC) to support legacy uses in scientific notation. Similarly, symbols from standards like ISO 8859-7 for Greek were integrated into the Greek and Coptic block (U+0370–U+03FF), but certain variants, such as the ohm sign (U+2126) in the Letterlike Symbols block, were added separately to preserve historical distinctions while establishing canonical equivalence to the Greek capital letter omega (U+03A9). These duplications arose because early Unicode versions aimed to accommodate diverse national standards without requiring immediate unification of all visually identical glyphs.

The mechanisms behind these duplications were primarily intentional aliases designed for round-trip compatibility, avoiding the need for complex conversion logic in applications handling legacy data. In the case of CJK ideographs, unification across Chinese, Japanese, and Korean scripts was a core principle, yet separate compatibility blocks were created to mirror source-specific encodings. The CJK Compatibility Ideographs block (U+F900–U+FAFF) consists largely of duplicates or near-duplicates of unified ideographs in the main blocks, sourced from standards like KS C 5601-1989; the range U+F900–U+FA0B, for example, contains exact duplicates from the Korean standard to enable precise round-trip mapping. Other major categories of duplicates in the Basic Multilingual Plane include canonical equivalents via singleton decompositions, such as the Kelvin sign (U+212A) mapping to the Latin capital letter K (U+004B), and additional compatibility characters in blocks such as Letterlike Symbols and Geometric Shapes for typographic variants from legacy fonts.

These historical duplicates form equivalence classes where distinct code points represent the same semantic and visual content, allowing substitution in processing without altering rendering or meaning under the appropriate equivalence rules. For example, the angstrom sign (U+212B) is canonically equivalent to the Latin capital letter A with ring above (U+00C5), which in turn canonically decomposes to the base letter A (U+0041) plus combining ring above (U+030A), ensuring consistent behavior in text comparison under normalization. Post-2010 Unicode versions expanded this pattern with emoji-related additions; starting from Unicode 8.0, skin tone modifiers (U+1F3FB–U+1F3FF) were introduced as single code points that pair with base emoji to create variant forms, effectively duplicating representations for diversity while maintaining compatibility with earlier neutral emoji. Such duplications underscore Unicode's evolution to balance historical preservation with modern interoperability needs.

Precomposed and Combining Characters

Unicode includes precomposed characters as single code points that represent common combinations of a base letter and a mark, such as U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) in the Latin-1 Supplement block. These characters were incorporated to ensure compatibility with legacy character encodings, such as ISO 8859-1, which lacked support for dynamic combining sequences and required fixed single-byte representations for accented letters. In contrast, Unicode also supports the construction of equivalent forms through sequences of a base character followed by one or more combining marks, such as U+0065 LATIN SMALL LETTER E (e) combined with U+0301 COMBINING ACUTE ACCENT (◌́). Combining marks are assigned Canonical Combining Class (CCC) values ranging from 1 to 255 to indicate their attachment and ordering relative to the base, while base characters have CCC 0; this system enables proper reordering during normalization to maintain consistent representation. These sequences form grapheme clusters, the user-perceived character units defined by the boundary rules of Unicode Standard Annex #29 (such as no breaks before Extend characters, which include most combining marks, or SpacingMark), ensuring that the composite appears as a single visual unit across equivalent forms.

Under canonical equivalence, a precomposed character is defined as equivalent to its decomposed sequence of base plus combining marks, meaning they must be rendered, searched, and processed identically in conforming implementations. For instance, NFC (Normalization Form C) decomposes text first and then recomposes where possible, converting the sequence U+0075 LATIN SMALL LETTER U (u) + U+0308 COMBINING DIAERESIS (◌̈) into the precomposed U+00FC LATIN SMALL LETTER U WITH DIAERESIS (ü), while preserving the visual appearance and semantic meaning. Rendering systems treat these equivalents alike, displaying the diacritic correctly attached to the base regardless of the underlying structure. This grapheme-cluster behavior extends to modern emoji Zero Width Joiner (ZWJ) sequences, introduced around Unicode 8.0 and expanded in later versions such as Unicode 9.0 for multi-person sequences, where multi-code-point combinations form composite emojis treated as single grapheme clusters; for example, the couple emoji 👨‍❤️‍👨 (U+1F468 MAN + ZWJ + heart + ZWJ + U+1F468 MAN) uses ZWJ (U+200D) to join base emojis into a single user-perceived unit, analogous to the way combining marks join with base characters.
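
The u + combining diaeresis example from this section can be checked with a short Python sketch (standard unicodedata module); raw comparison fails, but the two forms are canonically equivalent:

    import unicodedata

    decomposed = "u\u0308"    # u + COMBINING DIAERESIS
    precomposed = "\u00fc"    # ü LATIN SMALL LETTER U WITH DIAERESIS

    print(decomposed == precomposed)                                 # False: different code points
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True: NFC composes the pair
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True: NFD decomposes it again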

Typographic and Compatibility Variants

Typographic equivalence in Unicode arises from characters and sequences that influence rendering or layout without altering the underlying semantic meaning, ensuring that text processors can treat them as equivalent under compatibility normalization. Characters such as the zero width space (U+200B) serve as invisible format controls, providing line break opportunities or word boundaries while having no visible width or impact on textual equivalence classes. Similarly, variation selectors (U+FE00–U+FE0F) are non-spacing marks that follow a base character to specify glyph variants without shifting the character's core identity or normalization equivalence.

Compatibility variants encompass precomposed characters designed for typographic convenience that decompose into simpler forms via mappings defined in the Unicode Character Database (UCD), particularly in the Decomposition_Mapping field of UnicodeData.txt. These mappings, tagged with <compat> or more specific subtypes such as <super> or <fraction>, allow normalization forms such as NFKC to replace stylistic representations with their semantic equivalents, preserving meaning but standardizing appearance. For instance, ligatures such as U+FB02 (fl) decompose to "f" + "l" (U+0066 U+006C), facilitating font-independent text processing. Other typographic conventions include fraction characters, whose decomposition rules support legacy typography: the vulgar fraction one quarter (U+00BC, ¼) has a compatibility decomposition, tagged <fraction>, to 1⁄4 using the fraction slash (U+0031 U+2044 U+0034), allowing modern rendering engines to reconstruct the fraction dynamically. These mappings are derived from historical encodings and ensure that typographic flourishes do not create semantic barriers in equivalence comparisons.

Since the definition of emoji variation sequences in Unicode 6.1, variation selectors have also been used in emoji rendering, where sequences like U+1F600 U+FE0F (grinning face in emoji style) use VS16 (U+FE0F) to request colorful emoji presentation, while VS15 (U+FE0E) opts for text style; both variants remain semantically equivalent. This mechanism, detailed in emoji-variation-sequences.txt, allows platforms to adapt glyphs to contextual needs without altering their equivalence under normalization.
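
The fraction and ligature mappings above can be inspected with a small Python sketch (standard unicodedata module):

    import unicodedata

    print(unicodedata.decomposition("\u00bc"))      # '<fraction> 0031 2044 0034'
    print(unicodedata.normalize("NFKC", "\u00bc"))  # '1⁄4' (digit, fraction slash, digit)
    print(unicodedata.normalize("NFC", "\u00bc"))   # '¼' unchanged: the mapping is compatibility-only
    print(unicodedata.normalize("NFKC", "\ufb02"))  # 'fl'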

Normalization Techniques

Normalization Algorithms

The Unicode Normalization Algorithm, as defined in Unicode Standard Annex #15, provides a standardized method to convert Unicode text into a canonical or compatibility equivalent form while preserving the original meaning and visual appearance. This algorithm ensures that equivalent sequences of code points are transformed into a unique representation, facilitating consistent processing in software applications. It operates on strings by breaking down composite characters, rearranging elements according to defined rules, and optionally reassembling them, with all operations relying on properties derived from the Unicode Character Database.

The algorithm proceeds through a structured sequence of operations, often described in four key phases: decomposition, allocation of combining classes, reordering, and composition (the latter applied selectively based on the target form). In the decomposition phase, each code point in the input string is examined and, if it has a decomposition mapping, replaced by its canonical components for forms NFD and NFC or compatibility components for NFKD and NFKC; this process is recursive to handle nested decompositions, with special rules for Hangul syllables, which decompose into jamo elements. The Decomposition_Mapping property, sourced from UnicodeData.txt in the UCD, dictates these breakdowns, ensuring that precomposed characters are expanded into base characters followed by combining marks.

Following decomposition, combining classes are allocated to each code point using the Canonical_Combining_Class (CCC) property, also from UnicodeData.txt, which assigns numeric values (0 for base characters and spacing marks, higher values for diacritics and other non-spacing marks) to determine their relative positioning. The reordering phase then sorts the sequence of combining marks in ascending order of their CCC values, treating base characters (CCC = 0) as anchors while stably rearranging subsequent non-starters to achieve canonical order; this step resolves variations in combining mark placement without altering the text's semantics.

For composed forms like NFC and NFKC, the composition phase follows, in which the algorithm scans the reordered sequence from left to right, identifying eligible pairs (a base character followed by a combining mark) and replacing them with a precomposed character if a canonical mapping exists and the pair is not excluded by the composition exclusions list (UnicodeData.txt and CompositionExclusions.txt in the UCD).

Key properties of the algorithm include uniqueness, where canonically equivalent strings yield identical outputs in NFD or NFC, and stability under concatenation, meaning that if two normalized strings are joined, the result remains normalized provided no inter-string reordering is needed across the boundary. The process is idempotent, so applying normalization multiple times produces no change after the first application, and it is version-stable: since Unicode 4.1, no changes destabilize already normalized text, with composition frozen to the Unicode 3.1.0 character set to prevent introducing new decompositions that could affect existing strings.

A high-level pseudocode outline for generating NFD (decomposed form) and NFC (composed form) illustrates the process without full implementation details:

    function decompose(codePoint):
        if Decomposition_Mapping[codePoint] exists:
            return concatenate(decompose(component) for each component)
        else:
            return codePoint

    function toNFD(inputString):
        result = empty string
        for each codePoint in inputString:
            append decompose(codePoint) to result
        assign CCC to each codePoint in result using Canonical_Combining_Class
        reorder result by stable sort on CCC (CCC = 0 fixed, others ascending)
        return result

    function composeNFD(result):        // produces NFC from NFD output
        output = empty string
        i = 0
        while i < length(result):
            if i + 1 < length(result) and canCompose(result[i], result[i+1]) and not excluded:
                append composite(result[i], result[i+1]) to output
                i += 2
            else:
                append result[i] to output
                i += 1
        return output

    function toNFC(inputString):
        nfd = toNFD(inputString)
        return composeNFD(nfd)

This outline omits optimizations like quick checks for already-normalized segments and Hangul-specific shortcuts. The Unicode Standard also specifies the Stream-Safe Text Format to support efficient streaming normalization for large-scale text processing, allowing incremental application without full buffering by restricting combining sequences to prevent unbounded reordering across chunk boundaries.
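
Some libraries expose the quick check and the idempotence property directly; for example, Python 3.8+ provides unicodedata.is_normalized, which a short sketch can use to test a string against the outline above:

    import unicodedata

    s = "Ame\u0301lie"   # decomposed "Amélie" (e + COMBINING ACUTE ACCENT)

    print(unicodedata.is_normalized("NFD", s))   # True: already in NFD
    print(unicodedata.is_normalized("NFC", s))   # False: NFC would compose e + accent into é

    nfc = unicodedata.normalize("NFC", s)
    print(unicodedata.normalize("NFC", nfc) == nfc)   # True: normalization is idempotent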

Normal Forms

Unicode defines four standard normalization forms to handle equivalence among character sequences: Normalization Form Canonical Composition (NFC), Normalization Form Canonical Decomposition (NFD), Normalization Form Compatibility Composition (NFKC), and Normalization Form Compatibility Decomposition (NFKD). These forms transform text into consistent representations, ensuring that equivalent strings map to the same sequence while preserving essential semantics. NFC and NFD address canonical equivalence, where sequences represent the same abstract character with identical visual appearance and behavior, such as a precomposed character like "é" (U+00E9) versus its decomposed form "e" + combining acute accent (U+0065 U+0301). In contrast, NFKC and NFKD incorporate compatibility equivalence, which extends to visually or semantically similar but not canonically identical representations, such as mapping the full-width Latin capital "A" (U+FF21) to the standard "A" (U+0041).

The process for achieving these forms involves decomposition followed by optional composition. For NFC, text undergoes canonical decomposition (breaking characters into their base characters and combining marks) and then composition, which recombines compatible sequences into precomposed characters where defined, resulting in the most compact form. NFD stops at decomposition, producing a fully decomposed sequence suitable for processes like sorting. NFKC applies compatibility decomposition first, which includes additional mappings for variant forms like ligatures or half-width characters, before composition; NFKD similarly uses only compatibility decomposition. These rules ensure that NFC and NFD preserve canonical equivalence, while NFKC and NFKD provide broader folding for compatibility equivalence, though they may alter or lose certain stylistic distinctions.

In practice, NFC is widely recommended for storage and searching in databases and web content, as it minimizes variability and ensures consistent indexing of canonically equivalent text. NFD proves useful for input methods and text processing where separate handling of base characters and diacritics is beneficial, such as in spell-checking or rendering. NFKC and NFKD are employed in scenarios requiring compatibility folding, like identifier matching or search engines that treat full-width and half-width forms as identical. For instance, normalizing the "fi" ligature (U+FB01) to "f" + "i" (U+0066 U+0069) via NFKC allows treating it equivalently to the separate letters in search operations. The normal forms exhibit key properties that support their reliability:
Property | Description | NFC/NFD | NFKC/NFKD
Uniqueness | Equivalent strings yield a unique normalized representation. | Yes | Yes
Stability | The normalized form of a string remains unchanged across Unicode versions (for assigned characters). | Yes | Yes
Round trip | The normalized string remains equivalent to the original, but the exact original code points cannot always be recovered. | Preserves canonical equivalence | No (may lose formatting distinctions)

These properties make the forms predictable for software implementations, though normalization is not reversible in general, and NFKC/NFKD's potential for information loss requires caution in applications that must preserve the original formatting.

Canonical Ordering Behavior

Canonical combining classes are numeric values ranging from 0 to 255 assigned to Unicode characters to determine their combining behavior, with most base characters (starters) having a value of 0 and non-zero values applied to diacritics and other combining marks. These classes resolve ambiguities in how combining marks attach to base characters by enforcing a consistent ordering that reflects typical visual positioning, such as placing below marks before above marks. For instance, the combining dot below (U+0323) is assigned class 220 (below), while the combining dot above (U+0307) has class 230 (above).

In Normalization Form D (NFD) and Normalization Form C (NFC), the reordering process sorts combining marks attached to a base character in ascending order of their canonical combining classes, using a stable sort to preserve the relative order of marks with identical classes. This ensures that canonically equivalent sequences normalize to the identical form. For example, the sequence consisting of the base character 'q' followed by combining dot above (U+0307, class 230) and then combining dot below (U+0323, class 220) is reordered to 'q' + U+0323 + U+0307, placing the lower-class (below) mark before the higher-class (above) mark. Similarly, for the Vietnamese character 'ấ' (U+1EA5, Latin small letter a with circumflex and acute), the canonical decomposition is 'a' + combining circumflex accent (U+0302, class 230) + combining acute accent (U+0301, class 230); since both marks share the same class, their order remains stable during reordering, resulting in a unique normalized sequence.

Behavior rules for canonical ordering account for non-stacking classes, where multiple marks in the same position (e.g., several above marks with class 230) are logically sequenced but may not visually stack in rendering. Exceptions apply to certain fixed-position classes, which prioritize positional consistency over strict reordering in some contexts, preventing illogical attachments while maintaining equivalence. This reordering mechanism directly impacts equivalence by guaranteeing that different input sequences with the same semantic content, such as varied combining mark orders, converge to a single normalized form, facilitating consistent processing across systems. Revisions to Unicode Standard Annex #15 alongside Unicode 16.0 (released in 2024) and Unicode 17.0 (released in 2025) extended support for canonical ordering to new combining marks in scripts such as Kirat Rai, Tulu-Tigalari, and Gurung Khema, as well as scripts added in 17.0 such as Beria Erfe, Sidetic, Tai Yo, and Tolong Siki, including improvements that enhance equivalence preservation in right-to-left (RTL) contexts.
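
The 'q' example above can be verified with a short Python sketch (standard unicodedata module), which reads the combining classes and shows that both mark orders normalize to the same sequence:

    import unicodedata

    print(unicodedata.combining("\u0323"))   # 220: COMBINING DOT BELOW
    print(unicodedata.combining("\u0307"))   # 230: COMBINING DOT ABOVE

    above_first = "q\u0307\u0323"            # q + dot above + dot below
    below_first = "q\u0323\u0307"            # q + dot below + dot above

    # NFD reorders marks of different classes into ascending class order,
    # so both inputs normalize to the same sequence.
    print(unicodedata.normalize("NFD", above_first) == unicodedata.normalize("NFD", below_first))  # True
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", above_first)])  # ['0x71', '0x323', '0x307']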

Practical Implications

Errors from Normalization Differences

Normalization differences between Unicode forms can lead to common errors in text processing, particularly in case folding operations. For instance, in Greek script, the uppercase sigma (U+03A3 Σ) lowercases to either the medial form (U+03C3 σ) or the final form (U+03C2 ς) depending on its position in a word, which can cause mismatches during case-insensitive comparisons if the folding algorithm does not account for contextual rules. This issue arises because simple uppercase or lowercase conversions fail to preserve equivalence, resulting in failed searches or incorrect sorting in applications handling Greek text.

Another frequent error occurs in filename handling and searches due to variations between Normalization Form C (NFC) and Normalization Form D (NFD). File systems like macOS HFS+ store filenames in a decomposed (NFD-like) form, splitting precomposed characters into base letters plus combining marks, while filenames created on most other platforms are commonly in NFC, leading to potential duplicates or missed matches when files are shared across platforms. For example, the string "café" in NFC (using the single precomposed é at U+00E9) differs from "café" in NFD (e at U+0065 followed by combining acute accent at U+0301), causing database indexing errors where equivalent entries are treated as distinct, resulting in retrieval failures or unintended data duplication.

These normalization differences also introduce significant security implications, particularly through homograph attacks that exploit compatibility decompositions and confusable characters. Attackers can use visually similar characters, such as the Cyrillic small letter a (U+0430 а) and the Latin small letter a (U+0061 a), to create deceptive strings that appear identical but differ in code points, bypassing filters that do not normalize before comparison. Compatibility normalization (NFKC or NFKD) can further obscure such attacks by mapping variant forms, allowing malicious inputs to evade detection in security checks. In the context of Internationalized Domain Names (IDNs), normalization variances heighten spoofing risks, as homoglyphs can mimic legitimate domains without triggering equivalence-based safeguards. Documented incidents include exploits using polyglot Unicode characters that normalize to restricted symbols like apostrophes, allowing attackers to bypass validation and inject malicious payloads. More recently, vulnerabilities in authentication implementations have demonstrated how mishandling equivalence during credential processing can enable unauthorized access by exploiting normalization discrepancies.
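
Two of these pitfalls can be illustrated with a small Python sketch (standard unicodedata module and str.casefold): the sigma variants compare unequal until case-folded, and confusable homoglyphs are not unified by any normalization form.

    import unicodedata

    # Final vs medial sigma: binary comparison fails, case folding unifies them.
    print("\u03c2" == "\u03c3")                        # False (ς vs σ)
    print("\u03c2".casefold() == "\u03c3".casefold())  # True

    # Homoglyphs: Cyrillic 'а' and Latin 'a' look alike but are distinct code points,
    # and normalization does not map one to the other.
    print("\u0430" == "a")                                  # False
    print(unicodedata.normalize("NFKC", "\u0430") == "a")   # False: detecting confusables needs dedicated data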

Handling Equivalence in Applications

Applications handling Unicode text must implement normalization to manage equivalence effectively, ensuring consistent representation across storage, processing, and display. The International Components for Unicode (ICU) library provides robust support for this, offering APIs to convert text into standard normalization forms such as NFC (Normalization Form Canonical Composition) and NFD (Normalization Form Canonical Decomposition). Developers are recommended to normalize input to NFC for persistent storage, as this form uses precomposed characters where possible, minimizing byte length and avoiding interoperability issues between systems that may handle combining sequences differently. For interactive editing operations, such as inserting or rearranging diacritics, NFD is preferred because it decomposes characters into base letters and separate combining marks, facilitating precise manipulation without unintended recomposition.

Integration of Unicode equivalence handling appears in various standards and protocols to promote reliability. In XML processing, particularly for digital signatures, exclusive canonicalization ensures that subdocuments maintain equivalence regardless of surrounding context, incorporating Unicode normalization steps like conversion to NFC during encoding transformations to standardize character representations. HTML5 requires consistent normalization for elements like form fields and CSS selectors; for instance, user input in forms should be normalized to NFC to prevent mismatches where precomposed characters fail to match decomposed forms in submitted data. Programming languages embed these capabilities natively; Python's unicodedata module, for example, includes a normalize function that applies the NFC, NFD, NFKC, or NFKD forms to strings, enabling developers to enforce equivalence during string operations.

Best practices emphasize proactive equivalence testing and edge case management to avoid subtle errors. To verify equivalence, applications can normalize strings to a common form (e.g., NFD) and then perform collation using the Unicode Collation Algorithm (UCA), which generates identical sort keys for canonically equivalent sequences, allowing reliable comparison beyond simple binary matching. Edge cases, such as UTF-16 surrogate pairs or invalid sequences, require validation before normalization; surrogates must be paired correctly to represent supplementary characters, while ill-formed bytes should be replaced with the Unicode replacement character (U+FFFD) to prevent processing failures. For security-sensitive applications, such as identifier validation, NFKC_Casefold is recommended to detect and mitigate confusables (visually similar characters that could enable spoofing) by combining compatibility normalization with case folding for thorough equivalence resolution. Recent W3C guidelines, updated in 2025, advise specification developers to document equivalence-related security risks in Web APIs and avoid mandating normalization forms except where necessary for operations like matching, with emerging emphasis on consistent handling in AI-driven text processing to ensure robust multilingual model inputs.
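
The recommendations above can be sketched with Python's standard unicodedata module. The helper names (store_text, identifiers_match) are illustrative, not part of any standard API, and the identifier comparison only approximates NFKC_Casefold: a full implementation would also remove default-ignorable code points, for example via ICU.

    import unicodedata

    def store_text(text: str) -> str:
        """Illustrative helper: normalize user input to NFC before persisting it."""
        return unicodedata.normalize("NFC", text)

    def identifiers_match(a: str, b: str) -> bool:
        """Loose identifier comparison: compatibility-fold and case-fold both sides.

        Approximates NFKC_Casefold with the standard library only; case folding can
        denormalize, so NFKC is reapplied after casefold().
        """
        def fold(s: str) -> str:
            return unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())
        return fold(a) == fold(b)

    print(identifiers_match("\u2168", "ix"))   # True: Roman numeral Ⅸ folds to "ix"
    print(store_text("cafe\u0301"))            # 'café' stored in composed (NFC) form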
