Unicode compatibility characters
In Unicode and the Universal Character Set, a compatibility character is a character that is encoded solely to maintain round-trip convertibility with other, often older standards. According to the Unicode Glossary:
A character that would not have been encoded except for compatibility and round-trip convertibility with other standards.
In practice, the definition is more complex than the glossary suggests. Although the term compatibility appears in character names, it is not itself represented as a distinct character property. One of the properties assigned to characters by the Unicode Consortium is decomposition, which includes compatibility decomposition. More than five thousand characters have a compatibility decomposition mapping that relates the compatibility character to one or more other UCS characters. By assigning a compatibility decomposition to a character, Unicode effectively designates it as a compatibility character.
The reasons for assigning compatibility status vary and are discussed in more detail below. The term decomposition can be confusing, because in some cases a character's decomposition consists of a single character. In such cases, the decomposition maps one character to another that is approximately, but not canonically, equivalent.
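The decomposition mapping can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module, whose `decomposition()` function returns the mapping as a string of hexadecimal code points, prefixed by a tag in angle brackets when the mapping is a compatibility one. The helper name `describe_decomposition` is an assumption made for illustration, and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

def describe_decomposition(ch):
    """Split a character's decomposition mapping into (tag, targets).

    tag is None for a canonical mapping; (None, []) means the
    character has no decomposition mapping at all.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, []
    parts = raw.split()
    tag = None
    if parts[0].startswith("<"):
        # A leading <tag> marks a compatibility decomposition.
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [chr(int(p, 16)) for p in parts]

# Roman numeral one, the fi ligature, e with acute, and plain A.
for ch in ["\u2160", "\ufb01", "\u00e9", "A"]:
    print(f"U+{ord(ch):04X}", describe_decomposition(ch))
```

The Roman numeral U+2160 illustrates the single-character case: its compatibility decomposition is just the Latin capital letter I, while U+00E9 has an untagged (canonical) decomposition into two characters.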
The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Characters whose decomposition mapping carries no keyword are termed canonically decomposable characters, and those characters are not compatibility characters. Keywords for compatibility decomposable characters include: <font>, <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, and <compat>. These keywords give some indication of the relation between the compatibility character and its compatibility decomposition character sequence. Compatibility characters fall into three basic categories:
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, users may be confused when a 'find' on a page for the capital Latin letter 'I' fails to match the visually similar Roman numeral 'Ⅰ'.
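One common way to make such a search succeed is to compare text after compatibility normalization. The sketch below assumes Normalization Form KC (NFKC) is an acceptable folding for the application; the function names `compat_fold` and `compat_contains` are hypothetical helpers, not part of any standard API.

```python
import unicodedata

def compat_fold(text):
    """Replace compatibility characters with their decompositions (NFKC)."""
    return unicodedata.normalize("NFKC", text)

def compat_contains(haystack, needle):
    """Substring test that ignores compatibility distinctions."""
    return compat_fold(needle) in compat_fold(haystack)

page = "Chapter \u2160"  # ends with ROMAN NUMERAL ONE, not the letter I

print("I" in page)                 # False: a plain search misses it
print(compat_contains(page, "I"))  # True: both sides folded first
```

Folding is lossy, so a reasonable design is to apply it only to the search key and a search copy of the text, and leave the stored text unchanged.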
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
The UCS, Unicode character properties and the Unicode algorithms give software implementations everything needed to display these characters properly from their decomposition equivalents, which makes these decomposable compatibility characters redundant. Their existence in the character set requires extra text processing to ensure text is properly compared and collated (see Unicode normalization). Moreover, these compatibility characters provide no additional or distinct semantics, nor do they provide any visually distinct rendering, provided the text layout and fonts are Unicode conforming. None of these characters is required for round-trip convertibility to other character sets either, since a converter can map decomposed characters to their precomposed counterparts in the other character set. Similarly, a contextual form, such as a final Arabic letter, can be mapped to the appropriate form character of the legacy character set based on its position within a word.
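The conversion argument can be sketched with a small example. It assumes ISO 8859-1 (Python's `latin-1` codec) as the legacy character set and uses canonical composition (NFC) to recover the precomposed letter before encoding; the helper names `to_legacy` and `from_legacy` are hypothetical.

```python
import unicodedata

def to_legacy(text, encoding="latin-1"):
    """Compose the text, then encode it in the legacy character set."""
    return unicodedata.normalize("NFC", text).encode(encoding)

def from_legacy(data, encoding="latin-1"):
    """Decode legacy bytes and return fully decomposed (NFD) text."""
    return unicodedata.normalize("NFD", data.decode(encoding))

decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

legacy = to_legacy(decomposed)
print(legacy)                             # b'caf\xe9'
print(from_legacy(legacy) == decomposed)  # True: the round trip is lossless

# A contextual presentation form folds back to the base letter:
# ARABIC LETTER YEH FINAL FORM (U+FEF2) -> ARABIC LETTER YEH (U+064A).
print(unicodedata.normalize("NFKC", "\ufef2") == "\u064a")  # True
```

Mapping a base Arabic letter back to a legacy final, medial or initial form requires joining analysis of the surrounding letters, which this sketch does not attempt.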
