Unicode equivalence
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E n LATIN SMALL LETTER N followed by U+0303 ◌̃ COMBINING TILDE is defined by Unicode to be canonically equivalent to the single code point U+00F1 ñ LATIN SMALL LETTER N WITH TILDE (a letter of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
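As a brief illustration of canonical equivalence, the following minimal sketch uses Python's standard unicodedata module (one of many libraries implementing Unicode normalization); it shows that the two spellings of "ñ" differ as raw code points but become identical after normalization.

```python
import unicodedata

composed = "\u00F1"      # ñ as a single code point (LATIN SMALL LETTER N WITH TILDE)
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE

print(composed == decomposed)                                # False: the code point sequences differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: composing the sequence yields ñ
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: decomposing ñ yields n + tilde
```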
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).
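To make the two equivalence notions and their normal forms concrete, here is a small sketch (again using Python's unicodedata module) that applies each form to a string containing a decomposed "é" and the "fi" ligature; the canonical forms leave the ligature alone, while the compatibility forms replace it with plain letters.

```python
import unicodedata

text = "e\u0301 \uFB01"  # decomposed "é", a space, and the "fi" ligature U+FB01

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, text)
    print(form, [f"U+{ord(c):04X}" for c in result])
# NFC and NFD keep U+FB01 intact; NFKC and NFKD replace it with U+0066 U+0069.
```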
Sources of equivalence
Character duplication
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a ring diacritic above" is encoded as U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE (a letter of the alphabet in Swedish and several other languages) or as U+212B Å ANGSTROM SIGN. Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (such as ⟨V⟩ for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.
Combining and precomposed characters
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are U+0303 ◌̃ COMBINING TILDE and the Japanese diacritic dakuten (U+3099 ◌゙ COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).
In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
Example
| NFC character | A | m | é | l | i | e | |
|---|---|---|---|---|---|---|---|
| NFC code point | 0041 | 006d | 00e9 | 006c | 0069 | 0065 | |
| NFD code point | 0041 | 006d | 0065 | 0301 | 006c | 0069 | 0065 |
| NFD character | A | m | e | ◌́ | l | i | e |
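The table above can be reproduced with a few lines of code; this sketch uses Python's unicodedata module to print the NFC and NFD code points of "Amélie".

```python
import unicodedata

name = "Am\u00E9lie"  # "Amélie" spelled with the precomposed é (NFC form)

for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, name)
    print(form, " ".join(f"{ord(c):04x}" for c in normalized))
# NFC: 0041 006d 00e9 006c 0069 0065
# NFD: 0041 006d 0065 0301 006c 0069 0065
```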
Typographical non-interaction
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
Typographic conventions
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
Encoding errors
UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
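The lossy nature of such repairs can be seen with Python's built-in decoders, which by default reject invalid input but can be told to substitute U+FFFD; in that mode, distinct invalid byte sequences collapse to the same decoded text.

```python
# Two different invalid UTF-8 inputs (0xFF and 0xFE never occur in well-formed UTF-8).
invalid_a = b"abc\xff"
invalid_b = b"abc\xfe"

print(invalid_a.decode("utf-8", errors="replace"))  # 'abc\ufffd'
print(invalid_b.decode("utf-8", errors="replace"))  # 'abc\ufffd' -- the two inputs are now indistinguishable
```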
Normalization
Text-processing software implementing Unicode string search and comparison must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other, visually indistinguishable glyphs that have a different but canonically equivalent code point representation.
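A minimal sketch of the problem, using Python's unicodedata module: a naive substring test fails across canonically equivalent spellings, while comparing normalized forms succeeds.

```python
import unicodedata

haystack = "Ame\u0301lie"  # "Amélie" with a decomposed é
needle = "\u00E9"          # the precomposed é

print(needle in haystack)  # False: the raw code point sequences differ

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

print(nfc(needle) in nfc(haystack))  # True: both sides normalized to the same form
```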
Algorithms
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criterion can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criterion can affect search results. For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ), and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of it. The same applies when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
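The ligature and superscript examples above can be checked directly; this sketch again relies on Python's unicodedata module.

```python
import unicodedata

ligature = "\uFB03"  # ffi ligature

print("f" in unicodedata.normalize("NFC", ligature))   # False: NFC leaves the ligature intact
print("f" in unicodedata.normalize("NFKC", ligature))  # True: NFKC decomposes it to "f", "f", "i"
print(unicodedata.normalize("NFKC", "\u2168"))         # 'IX' for the Roman numeral Ⅸ
print(unicodedata.normalize("NFKC", "\u2075"))         # '5' for the superscript ⁵
```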
Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]
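The compatibility formatting tags themselves can be inspected programmatically; for example, Python's unicodedata.decomposition() returns the raw Decomposition_Mapping string from the Unicode Character Database, including any tag.

```python
import unicodedata

print(unicodedata.decomposition("\uFB00"))  # '<compat> 0066 0066'  (ff ligature)
print(unicodedata.decomposition("\u2075"))  # '<super> 0035'        (superscript five)
print(unicodedata.decomposition("\u00E9"))  # '0065 0301'           (canonical mapping: no tag)
```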
Normal forms
The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.
| Form | Name | Description |
|---|---|---|
| NFD | Normalization Form Canonical Decomposition | Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order. |
| NFC | Normalization Form Canonical Composition | Characters are decomposed and then recomposed by canonical equivalence. |
| NFKD | Normalization Form Compatibility Decomposition | Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order. |
| NFKC | Normalization Form Compatibility Composition | Characters are decomposed by compatibility, then recomposed by canonical equivalence. |
All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
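Idempotence is easy to verify; in this sketch (using Python's unicodedata module), normalizing an already-normalized string never changes it.

```python
import unicodedata

sample = "An\u0303o \uFB01n"  # decomposed ñ plus an fi ligature

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    once = unicodedata.normalize(form, sample)
    twice = unicodedata.normalize(form, once)
    print(form, once == twice)  # True for every form
```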
The normal forms are not closed under string concatenation.[3] For defective Unicode strings starting with a Hangul vowel or a trailing conjoining jamo, concatenation can break composition.
However, the normalization mappings are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus not invertible: the original cannot be restored. For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "◌̊"), which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
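This loss of information can be demonstrated with Python's unicodedata module: both source characters decompose to the same sequence, and recomposition then yields only one of them.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

print(unicodedata.normalize("NFD", angstrom) == "A\u030A")  # True
print(unicodedata.normalize("NFD", a_ring) == "A\u030A")    # True: same decomposition
print(unicodedata.normalize("NFC", angstrom) == a_ring)     # True: U+212B cannot be recovered
```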
A single character (other than a Hangul syllable block) that will be replaced by another under normalization can be identified in the Unicode tables by having a non-empty decomposition field but no compatibility tag.
Canonical ordering
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.
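Combining class values are exposed by normalization libraries; the sketch below uses Python's unicodedata.combining() to show that a dot below (class 220) and a dot above (class 230) do not interact, so both input orders normalize to the same canonical order.

```python
import unicodedata

print(unicodedata.combining("\u0323"))  # 220: COMBINING DOT BELOW
print(unicodedata.combining("\u0307"))  # 230: COMBINING DOT ABOVE

a = "q\u0323\u0307"  # dot below first, then dot above
b = "q\u0307\u0323"  # dot above first, then dot below

# Different classes mean the marks do not interact; normalization sorts them identically.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```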
For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
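The Vietnamese example can be verified as follows (a sketch using Python's unicodedata module): U+1EBF decomposes to e + circumflex + acute, and the swapped accent order composes only as far as U+00E9 U+0302.

```python
import unicodedata

print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u1EBF")])
# ['U+0065', 'U+0302', 'U+0301']: e, circumflex, acute

swapped = "e\u0301\u0302"  # acute first, then circumflex: a different, non-equivalent sequence
print(unicodedata.normalize("NFC", swapped) == "\u00E9\u0302")  # True: only the acute composes
print(unicodedata.normalize("NFC", swapped) == "\u1EBF")        # False: not canonically equivalent
```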
Errors due to normalization differences
When two applications share Unicode data but normalize it differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Netatalk and Samba file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the originals, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
See also
- Complex text layout
- Diacritic
- IDN homograph attack
- ISO/IEC 14651
- Ligature (typography)
- Precomposed character
- The uconv tool can convert to and from NFC and NFD Unicode normalization forms.
- Unicode
- Unicode compatibility characters
Notes
- ^ "UAX #44: Unicode Character Database". Unicode.org. Retrieved 20 November 2014.
- ^ "Unicode in XML and other Markup Languages". Unicode.org. Retrieved 20 November 2014.
- ^ Per "What should be done about concatenation".
- ^ "netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac". SourceForge. Retrieved 20 November 2014.
- ^ "rsync, samba, UTF8, international characters, oh my!". 2009. Archived from the original on January 9, 2010.
Fundamentals
Definition and Scope
Unicode equivalence refers to the situation in which two or more distinct sequences of Unicode code points, often forming grapheme clusters, represent the same abstract character or unit of text, either semantically or in terms of visual rendering.[1] In the Unicode Standard, code points are the numeric values (from U+0000 to U+10FFFF) assigned to characters, abstract characters denote the conceptual semantic or orthographic units of text independent of their encoding, and grapheme clusters represent the minimal user-perceived units of text that may span multiple code points.[2] This equivalence ensures that text processing can handle variations in representation without altering meaning or appearance.

The scope of Unicode equivalence encompasses two primary categories: canonical equivalence and compatibility equivalence. Canonical equivalence applies to sequences that preserve both the semantics and canonical rendering of the text, such as decomposed and composed forms of accented characters, where the sequences are considered identical for all purposes in the standard.[1] Compatibility equivalence, in contrast, covers cases where sequences represent the same abstract character but allow for differences in visual appearance or behavior, such as typographic variants or ligatures, enabling decomposition for compatibility with legacy systems while requiring careful handling in applications.[1] Equivalence classes, such as those resulting from normalization forms like NFC (composed) and NFD (decomposed), illustrate how these sequences can be interchanged within the standard's framework.[1]

Historically, Unicode equivalence emerged from the effort to merge diverse legacy encoding standards, including ISO 8859 for Western scripts and Shift-JIS for Japanese, into a single universal character encoding.[2] This unification process intentionally introduced redundancies to maintain round-trip compatibility with existing systems, allowing multiple code point sequences to encode the same text unit while supporting interoperability across global text processing.[2] Normalization processes provide a means to resolve these equivalences by transforming sequences into standardized forms, though the detailed algorithms are specified elsewhere in the standard.[1]

Canonical vs Compatibility Equivalence
Canonical equivalence in Unicode refers to sequences of code points that represent the same abstract character and are intended to have identical semantic meaning, visual appearance, and behavior across all contexts, making them fully interchangeable without loss of information.[3] For instance, the precomposed character "é" (U+00E9) is canonically equivalent to the sequence "e" followed by combining acute accent (U+0065 U+0301), as both render the same accented letter with preserved meaning and round-trip stability in normalization processes.[3] This type of equivalence ensures that text processing operations, such as rendering or collation, treat these forms identically, supporting full fidelity in applications like search and indexing.[3]

In contrast, compatibility equivalence applies to character sequences that are similar but not semantically identical, often introduced for historical, stylistic, or legacy compatibility reasons, allowing decomposition into simpler forms that may alter typographic or formatting intent.[3] A common example is the ligature "fi" (U+FB01), which decomposes compatibly to the separate characters "f" and "i" (U+0066 U+0069), enabling normalization for legacy systems but potentially losing the intended joined glyph appearance.[3] Unlike canonical equivalence, compatibility mappings do not guarantee round-trip stability or preservation of original semantics, as they prioritize broader interoperability over strict identity.[3]

The primary differences between canonical and compatibility equivalence lie in their scope and implications for text handling: canonical equivalence maintains complete interchangeability and fidelity, ensuring no loss in meaning or appearance during operations like normalization, whereas compatibility equivalence permits simplification that can introduce variations in rendering or reduce precision in tasks such as full-text search.[3] For example, canonical equivalences like diacritic combinations preserve exact semantics, while compatibility ones, such as superscript digits (e.g., "²" U+00B2 decomposing to "2" U+0032), allow stylistic variants but risk altering document intent if over-applied.[3] These distinctions are formalized in Unicode Standard Annex #15 (UAX #15), with ongoing updates through Unicode 17.0 (2025) emphasizing secure handling of equivalences to mitigate risks like confusables in identifiers.[3][4]

| Aspect | Canonical Equivalence Example | Compatibility Equivalence Example |
|---|---|---|
| Representation | "é" (U+00E9) ↔ "e" + acute (U+0065 U+0301) | "fi" (U+FB01) → "f" + "i" (U+0066 U+0069) |
| Purpose | Semantic and visual identity | Stylistic or legacy normalization |
| Interchangeability | Full, with no loss in rendering or meaning | Partial, may alter typographic intent |
| Implications | Preserves fidelity in search/indexing | Risks reduced precision in processing |
Sources of Equivalence
Character Duplication and Historical Duplicates
Unicode incorporated characters from multiple legacy encodings during its development to ensure compatibility with established standards, resulting in duplicate code points for the same abstract character. This approach prioritized round-trip conversion, where text encoded in older systems like ISO/IEC 8859-1 could be accurately mapped to and from Unicode without loss. For instance, the Latin-1 Supplement block (U+0080–U+00FF) includes characters such as the micro sign (U+00B5), which serves as a compatibility encoding of the Greek small letter mu (U+03BC) to support legacy uses in scientific notation. Similarly, symbols from standards like ISO 8859-7 for Greek were integrated into the Greek and Coptic block (U+0370–U+03FF), but certain variants, such as the ohm sign (U+2126) in the Letterlike Symbols block, were added separately to preserve historical distinctions while establishing canonical equivalence to the Greek capital letter omega (U+03A9). These duplications arose because early Unicode versions aimed to accommodate diverse national standards without requiring immediate unification of all visually identical glyphs.[3]

The mechanisms behind these duplications were primarily intentional aliases designed for backward compatibility, avoiding the need for complex transcoding in applications handling legacy data. In the case of CJK ideographs, unification across Chinese, Japanese, and Korean scripts was a core principle, yet separate compatibility blocks were created to mirror source-specific encodings. The CJK Compatibility Ideographs block (U+F900–U+FAFF) contains 472 characters as of Unicode 17.0 (2025), the majority of which are duplicates or near-duplicates of unified ideographs in the main CJK Unified Ideographs blocks, sourced from standards like KS C 5601-1989.[5] In particular, the range U+F900–U+FA0B consists of exact duplicates encoded to enable precise round-trip mapping with the Korean standard. Other major categories of duplicates in the Basic Multilingual Plane encompass roughly one hundred canonical equivalents via singleton decompositions, such as the Kelvin sign (U+212A) mapping to the Latin capital letter K (U+004B), and additional compatibility characters in blocks like Enclosed Alphanumerics and Geometric Shapes for typographic variants from legacy fonts.

These historical duplicates form equivalence classes where distinct code points represent the same semantic and visual content, allowing substitution in processing without altering rendering or meaning under appropriate equivalence rules. For example, the angstrom sign (U+212B) is canonically equivalent to the Latin capital letter A with ring above (U+00C5), which in turn canonically decomposes to the base letter A (U+0041) plus combining ring above (U+030A), ensuring consistent behavior in text comparison under canonical normalization.[6] Post-2010 Unicode versions expanded this pattern with emoji-related additions; starting from Unicode 8.0, skin tone modifiers (U+1F3FB–U+1F3FF) were introduced as single code points that pair with base emoji to create variant forms, effectively duplicating representations for diversity while maintaining compatibility with earlier neutral emoji.[7] Such duplications underscore Unicode's evolution to balance historical preservation with modern interoperability needs.
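These singleton and compatibility mappings can be observed directly; the following minimal sketch uses Python's standard unicodedata module on the characters discussed above.

```python
import unicodedata

print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")   # True: OHM SIGN -> GREEK CAPITAL LETTER OMEGA
print(unicodedata.normalize("NFC", "\u212A") == "K")        # True: KELVIN SIGN -> LATIN CAPITAL LETTER K
print(unicodedata.normalize("NFC", "\u00B5") == "\u00B5")   # True: MICRO SIGN has only a compatibility mapping...
print(unicodedata.normalize("NFKC", "\u00B5") == "\u03BC")  # ...so NFKC, not NFC, maps it to GREEK SMALL LETTER MU
```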
Precomposed and Combining Characters
Unicode includes precomposed characters as single code points that represent common combinations of a base letter and a diacritic mark, such as U+00E9 LATIN SMALL LETTER E WITH ACUTE (é), located in the Latin-1 Supplement block.[1] These characters were incorporated to ensure compatibility with legacy character encodings, such as ISO 8859-1, which lacked support for dynamic combining sequences and required fixed single-byte representations for accented letters.[1]

In contrast, Unicode supports the construction of equivalent forms through sequences of a base character followed by one or more combining marks, such as U+0065 LATIN SMALL LETTER E (e) combined with U+0301 COMBINING ACUTE ACCENT (◌́).[1] Combining marks are assigned positive Canonical Combining Class (CCC) values to indicate their attachment and ordering relative to the base, while base characters have CCC 0; this system enables proper reordering during normalization to maintain a consistent representation.[8] These sequences form grapheme clusters, the user-perceived character units defined by Unicode Standard Annex #29's boundary rules, such as no breaks before Extend characters (including most combining marks) or SpacingMark, ensuring that the composite appears as a single visual unit across equivalent forms.[8]

Under canonical equivalence, a precomposed character is defined as equivalent to its decomposed sequence of base plus combining marks, meaning they must be rendered, searched, and processed identically in conforming implementations.[1] For instance, NFC (Normalization Form C) decomposes text first and then recomposes where possible, converting the sequence U+0075 LATIN SMALL LETTER U (u) + U+0308 COMBINING DIAERESIS (◌̈) into the precomposed U+00FC LATIN SMALL LETTER U WITH DIAERESIS (ü), while preserving the visual appearance and semantic meaning.[1] Rendering systems normalize these equivalents to display the diacritic correctly attached to the base, regardless of the underlying code point structure.[1]

This principle extends to modern mechanisms like emoji Zero Width Joiner (ZWJ) sequences, introduced from Unicode 8.0 onward and expanded in later versions such as Unicode 9.0, where multi-code-point combinations form composite emoji treated as single grapheme clusters; for example, the couple emoji sequence 👨‍❤️‍👨 (man + ZWJ + heavy black heart + ZWJ + man) uses ZWJ (U+200D), which suppresses grapheme cluster breaks, to join base emoji into a single representation of a couple, analogous to the way combining marks attach to a base character.[9][8]
Typographic and Compatibility Variants
Typographic equivalence in Unicode arises from characters and sequences that influence rendering or layout without altering the underlying semantic meaning, ensuring that text processors can treat them as equivalent under compatibility normalization. Characters such as the zero-width space (U+200B) serve as invisible format controls, providing line break opportunities or word boundaries in typesetting while having no visible width and no effect on textual equivalence classes.[10] Similarly, variation selectors (U+FE00–U+FE0F) are non-spacing marks that follow a base character to request a specific glyph variant, such as a text versus emoji presentation, without changing the character's core identity or its normalization behavior.[11]

Compatibility variants encompass precomposed characters designed for typographic convenience that decompose into simpler forms via mappings defined in the Unicode Character Database (UCD), particularly in the Decomposition_Mapping field of UnicodeData.txt. These mappings, tagged with labels such as <font>, <super>, <wide>, or <compat>, mark compatibility decompositions and distinguish them from the untagged canonical mappings.

Normalization Techniques
Normalization Algorithms
The Unicode Normalization Algorithm, as defined in Unicode Standard Annex #15, provides a standardized method to convert Unicode text into a canonical or compatibility equivalent form while preserving the original meaning and visual appearance. This algorithm ensures that equivalent sequences of code points are transformed into a unique representation, facilitating consistent processing in software applications. It operates on strings by breaking down composite characters, rearranging elements according to defined rules, and optionally reassembling them, with all operations relying on properties derived from the Unicode Character Database.[3]

The algorithm proceeds through a structured sequence of operations, often described in four key phases: decomposition, allocation of combining classes, reordering, and composition (the latter applied selectively based on the target form). In the decomposition phase, each code point in the input string is examined and, if it has a decomposition mapping, replaced by its canonical components for forms NFD and NFC or compatibility components for NFKD and NFKC; this process is recursive to handle nested decompositions, with special rules for Hangul syllables, which decompose into jamo elements. The Decomposition_Mapping property, sourced from the UnicodeData.txt file in the UCD, dictates these breakdowns, ensuring that precomposed characters are expanded into base characters followed by combining marks.[3]

Following decomposition, combining classes are allocated to each code point using the Canonical_Combining_Class (CCC) property, also from UnicodeData.txt, which assigns numeric values (0 for base characters and spacing marks, higher values for diacritics and other non-spacing marks) to determine their relative positioning. The reordering phase then sorts the sequence of combining marks in ascending order of their CCC values, treating base characters (CCC=0) as anchors while stably rearranging subsequent non-starters to achieve canonical order; this step resolves variations in combining mark placement without altering the text's semantics. For the composed forms NFC and NFKC, the composition phase follows: the algorithm scans the reordered sequence from left to right, identifying eligible pairs (a base character followed by a combining mark) and replacing them with a precomposed character if a canonical mapping exists and the pair is not excluded by the Composition_Exclusions list (CompositionExclusions.txt in the UCD); primary composition uses the main decomposition components, while secondary composition handles marks that follow the primary combiner.[3]

Key properties of the algorithm include uniqueness, where canonically equivalent strings yield identical outputs in NFD or NFC, and stability under concatenation, meaning that if two normalized strings are joined, the result remains normalized provided no inter-string reordering is needed across the boundary. The process is idempotent, so applying normalization multiple times produces no change after the first application, and it is version-stable: since Unicode 4.1, no changes destabilize already normalized text, with the composition algorithm frozen to the Unicode 3.1.0 character set to prevent introducing new decompositions that could affect existing strings.[3]

A high-level pseudocode outline for generating NFD (decomposed form) and NFC (composed form) illustrates the process without full implementation details:

    function decompose(codePoint):
        if Decomposition_Mapping[codePoint] exists:
            return concatenate(decompose each component recursively)
        else:
            return codePoint

    function toNFD(inputString):
        result = empty string
        for each codePoint in inputString:
            append decompose(codePoint) to result
        assign CCC to each codePoint in result using Canonical_Combining_Class
        reorder result by stable sort on CCC (CCC=0 fixed, others ascending)
        return result

    function composeNFD(result):    // builds NFC from NFD output
        // simplified pairwise composition; the full algorithm composes each
        // combining mark with the most recent starter (CCC=0) character
        output = empty string
        i = 0
        while i < length(result):
            if i + 1 < length(result) and canCompose(result[i], result[i+1]) and not excluded:
                append composite(result[i], result[i+1]) to output
                i += 2
            else:
                append result[i] to output
                i += 1
        return output

    function toNFC(inputString):
        nfd = toNFD(inputString)
        return composeNFD(nfd)
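For comparison with the outline above, an existing implementation such as Python's unicodedata module produces the expected results, including the special Hangul handling: the syllable 한 (U+D55C) decomposes into three conjoining jamo under NFD, and NFC reassembles them into the single syllable.

```python
import unicodedata

syllable = "\uD55C"  # 한 (HANGUL SYLLABLE HAN)

jamo = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(c):04X}" for c in jamo])               # ['U+1112', 'U+1161', 'U+11AB']
print(unicodedata.normalize("NFC", jamo) == syllable)  # True: the jamo recompose into 한
```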
Normal Forms
Unicode defines four standard normalization forms to handle equivalence among character sequences: Normalization Form Canonical Composition (NFC), Normalization Form Canonical Decomposition (NFD), Normalization Form Compatibility Composition (NFKC), and Normalization Form Compatibility Decomposition (NFKD). These forms transform text into consistent representations, ensuring that equivalent strings map to the same sequence while preserving essential semantics. NFC and NFD address canonical equivalence, where sequences represent the same abstract character with identical visual appearance and behavior, such as a precomposed character like "é" (U+00E9) versus its decomposed form "e" + combining acute accent (U+0065 U+0301). In contrast, NFKC and NFKD incorporate compatibility equivalence, which extends to visually or semantically similar but not canonically identical representations, such as mapping the full-width Latin capital "A" (U+FF21) to the standard "A" (U+0041).[1]

The process for achieving these forms involves decomposition followed by optional composition. For NFC, text undergoes canonical decomposition (breaking characters into their base and combining marks) and then canonical composition, which recombines compatible sequences into precomposed characters where defined, resulting in the most compact form. NFD stops at canonical decomposition, producing a fully decomposed sequence suitable for processes like sorting. NFKC applies compatibility decomposition first, which includes additional mappings for variant forms like ligatures or half-width characters, before canonical composition; NFKD similarly uses only compatibility decomposition. These rules ensure that NFC and NFD guarantee canonical equivalence, while NFKC and NFKD provide broader folding for compatibility equivalence, though they may alter or lose certain stylistic distinctions.[1]

In practice, NFC is widely recommended for storage and searching in databases and web content, as it minimizes variability and ensures consistent indexing of canonically equivalent text. NFD proves useful for input methods and text processing where separate handling of base characters and diacritics is beneficial, such as in spell-checking or rendering. NFKC and NFKD are employed in scenarios requiring compatibility folding, like identifier matching or search engines that treat full-width and half-width forms as identical. For instance, normalizing the "fi" ligature (U+FB01) to "f" + "i" (U+0066 U+0069) via NFKC allows treating it equivalently to the separate letters in legacy system integrations.[1][17]

The normal forms exhibit key properties that support their reliability:

| Property | Description | Applies to NFC/NFD | Applies to NFKC/NFKD |
|---|---|---|---|
| Uniqueness | Equivalent strings yield a unique binary representation. | Yes | Yes |
| Stability | The normalized form of a string remains unchanged across Unicode versions. | Yes | Yes |
| Round-trip | Normalization preserves canonical equivalence with the original text, though the exact original code point sequence may not be recoverable. | Yes | No (compatibility distinctions may be lost) |
