Bidirectional text
A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, which is changing text direction in each row.
An example is the RTL Hebrew name Sarah: שרה, spelled sin (ש) on the right, resh (ר) in the middle, and heh (ה) on the left. Many computer programs failed to display this correctly, because they were designed to display text in one direction only.
Some so-called right-to-left scripts such as the Persian script and Arabic are mostly, but not exclusively, right-to-left—mathematical expressions, numeric dates and numbers bearing units are embedded from left to right. That also happens if text from a left-to-right language such as English is embedded in them; or vice versa, if Arabic is embedded in a left-to-right script such as English.
Bidirectional script support
Bidirectional script support is the capability of a computer system to correctly display bidirectional text. The term is often shortened to "BiDi" or "bidi".
Early computer installations were designed only to support a single writing system, typically for left-to-right scripts based on the Latin alphabet only. Adding new character sets and character encodings enabled a number of other left-to-right scripts to be supported, but did not easily support right-to-left scripts such as Arabic or Hebrew, and mixing the two was not practical. Right-to-left scripts were introduced through encodings like ISO/IEC 8859-6 and ISO/IEC 8859-8, storing the letters (usually) in writing and reading order. It is possible to simply flip the left-to-right display order to a right-to-left display order, but doing this sacrifices the ability to correctly display left-to-right scripts. With bidirectional script support, it is possible to mix characters from different scripts on the same page, regardless of writing direction.
In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.
Unicode bidi support
The Unicode standard calls for characters to be ordered 'logically', i.e. in the sequence in which they are intended to be interpreted, as opposed to 'visually', the sequence in which they appear. This distinction is relevant for bidi support because at any bidi transition, the visual presentation ceases to be the 'logical' one. Thus, in order to offer bidi support, Unicode prescribes an algorithm for how to convert the logical sequence of characters into the correct visual presentation. For this purpose, the Unicode encoding standard assigns each of its characters to one of four types: 'strong', 'weak', 'neutral', and 'explicit formatting'.[1]
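The four-way split described above can be inspected with Python's standard `unicodedata` module, which reports each character's Bidi_Class. The sketch below is only an illustration: the grouping sets and the `bidi_category` helper are names invented here, and the grouping follows the character-type table later in this article.

```python
import unicodedata

# Grouping of Bidi_Class values into the four categories named in the text.
# These sets are an assumption of this sketch, not a library API.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(ch: str) -> str:
    """Return 'strong', 'weak', 'neutral', 'explicit', or 'unknown' for one character."""
    cls = unicodedata.bidirectional(ch)  # e.g. 'L', 'R', 'EN', 'WS', 'RLO'
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    if cls in EXPLICIT:
        return "explicit"
    return "unknown"  # unassigned code points report an empty class


# Logical order: the Hebrew letters are stored first-read to last-read.
for ch in "A\u05e9 1\u202e":
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), bidi_category(ch))
```

The string is iterated in logical (storage) order; nothing in this snippet performs the reordering that a renderer applies for display.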
Strong characters
Strong characters are those with a definite direction. Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs, non-European or non-Arabic digits, and punctuation characters that are specific to only those scripts.
Weak characters
Weak characters are those with vague direction. Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols.
Neutral characters
Neutral characters have direction indeterminable without context. Examples include paragraph separators, tabs, and most other whitespace characters. Punctuation symbols that are common to many scripts, such as the exclamation mark, quotation marks, and parentheses, also fall within this category. The colon, comma, full stop, and no-break space are formally classed as weak (Common Number Separator), but they behave like neutrals when they do not separate numbers.
Explicit formatting
Explicit formatting characters, also referred to as "directional formatting characters", are special Unicode characters that direct the algorithm to modify its default behavior. These characters are subdivided into "marks", "embeddings", "isolates", and "overrides". The effects of embeddings, isolates, and overrides continue until the occurrence of either a paragraph separator or a "pop" character.
Marks
If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters. Such Unicode control characters are called marks. The mark (U+200E LEFT-TO-RIGHT MARK (LRM) or U+200F RIGHT-TO-LEFT MARK (RLM)) is to be inserted into a location to make an enclosed weak character inherit its writing direction.
For example, to correctly display the U+2122 ™ TRADE MARK SIGN for an English name brand (LTR) in an Arabic (RTL) passage, an LRM mark is inserted after the trademark symbol if the symbol is not followed by LTR text (e.g. "قرأ Wikipedia™ طوال اليوم."). If the LRM mark is not added, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character. Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order (e.g. "قرأ Wikipedia™ طوال اليوم.").
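A minimal sketch of the trademark example, assuming the text is assembled in Python. The helper name `protect_trailing_symbol` is hypothetical; the only facts relied on are that U+200E is an invisible format character whose Bidi_Class is L.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, but strongly left-to-right


def protect_trailing_symbol(ltr_fragment: str) -> str:
    """Append an LRM so a trailing symbol is enclosed by LTR characters.

    Hypothetical helper for illustration; it does not inspect the fragment.
    """
    return ltr_fragment + LRM


# Logical order: Arabic text, the English brand with its symbol, Arabic text.
sentence = "قرأ " + protect_trailing_symbol("Wikipedia™") + " طوال اليوم."

print(unicodedata.bidirectional(LRM))  # class of the mark itself
print(unicodedata.category(LRM))       # general category of the mark
```

With the mark in place, the symbol sits between two strong left-to-right characters (the final "a" and the LRM), so a conforming renderer keeps it attached to the English name.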
Embeddings
The "embedding" directional formatting characters are the classical Unicode method of explicit formatting, and as of Unicode 6.3, are being discouraged in favor of "isolates". An "embedding" signals that a piece of text is to be treated as directionally distinct. The text within the scope of the embedding formatting characters is not independent of the surrounding text. Also, characters within an embedding can affect the ordering of characters outside. Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use.
Isolates
The "isolate" directional formatting characters signal that a piece of text is to be treated as directionally isolated from its surroundings. As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents – once target platforms are known to support them. These formatting characters were introduced after it became apparent that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use. Unlike the legacy 'embedding' directional formatting characters, 'isolate' characters have no effect on the ordering of the text outside their scope. Isolates can be nested, and may be placed within embeddings and overrides.
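As a sketch of how isolates might be applied when text of unknown direction is inserted into a sentence, the hypothetical helper below wraps a fragment in an isolate initiator and U+2069 POP DIRECTIONAL ISOLATE. The function name and its `direction` parameter are assumptions of this example; the code points are those listed in the table of character types.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the content
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text so that it is directionally isolated from its surroundings.

    direction is 'ltr', 'rtl', or 'auto' (first strong character decides).
    """
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI


# A user-supplied name of unknown direction placed inside an English message.
user_name = "\u05e9\u05e8\u05d4"  # the Hebrew name from the introduction
message = "Signed in as " + isolate(user_name) + " (3 new messages)"
```

Because the scope is closed by its own terminator, the inserted name cannot change the ordering of the text that follows it, which is the property the paragraph above attributes to isolates.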
Overrides
The "override" directional formatting characters allow for special cases, such as for part numbers (e.g. to force a part number made of mixed English, digits and Hebrew letters to be written from right to left), and should be avoided wherever possible. As is true of the other directional formatting characters, "overrides" can be nested one inside another, and in embeddings and isolates.
Using Unicode to override
U+202D LEFT-TO-RIGHT OVERRIDE forces the characters that follow it to be treated as strong left-to-right characters, so text that would otherwise run right to left is displayed left to right. Similarly, U+202E RIGHT-TO-LEFT OVERRIDE forces the characters that follow it to be treated as strong right-to-left characters. Refer to the Unicode Bidirectional Algorithm.
Pops
The "pop" directional formatting characters terminate the scope of the most recent explicit formatting character: U+202C POP DIRECTIONAL FORMATTING (PDF) ends the most recent "embedding" or "override", and U+2069 POP DIRECTIONAL ISOLATE (PDI) ends the most recent "isolate".
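The pairing of initiators and terminators can be checked mechanically. The following is a simplified sketch, not the matching procedure of UAX #9: it ignores paragraph boundaries and depth limits, and the function name `unbalanced_controls` is made up for this example. It does assume two behaviours of the standard: a PDF only closes an embedding or override, and a PDI also closes any embeddings or overrides opened inside the isolate it ends.

```python
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
PDF = "\u202c"  # closes an embedding or override
PDI = "\u2069"  # closes an isolate


def unbalanced_controls(text: str) -> tuple[int, int]:
    """Return (unclosed initiators, stray terminators) for one paragraph."""
    stack: list[str] = []
    stray = 0
    for ch in text:
        if ch in EMBED_OPENERS:
            stack.append("embed")
        elif ch in ISOLATE_OPENERS:
            stack.append("isolate")
        elif ch == PDF:
            if stack and stack[-1] == "embed":
                stack.pop()
            else:
                stray += 1  # nothing suitable to close
        elif ch == PDI:
            if "isolate" in stack:
                # Discard embeddings opened inside the isolate, then the isolate.
                while stack.pop() != "isolate":
                    pass
            else:
                stray += 1
    return len(stack), stray
```

A result of `(0, 0)` means every scope that was opened is also closed within the string; anything else indicates a scope that would run on to the end of the paragraph or a terminator with nothing to end.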
Runs
In the algorithm, each sequence of concatenated strong characters is called a "run". A "weak" character that is located between two "strong" characters with the same orientation will inherit their orientation. A "weak" character that is located between two "strong" characters with a different writing direction will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL).
Table of possible BiDi character types
| Type | Description | Strength | Directionality | General scope | Bidi_Control character |
|---|---|---|---|---|---|
| L | Left-to-Right | Strong | L-to-R | Most alphabetic and syllabic characters, Chinese characters, non-European or non-Arabic digits, LRM character, ... | U+200E LEFT-TO-RIGHT MARK (LRM) |
| R | Right-to-Left | Strong | R-to-L | Adlam, Garay, Hebrew, Mandaic, Mende Kikakui, N'Ko, Samaritan, ancient scripts like Kharoshthi and Nabataean, RLM character, ... | U+200F RIGHT-TO-LEFT MARK (RLM) |
| AL | Arabic Letter | Strong | R-to-L | Arabic, Hanifi Rohingya, Sogdian, Syriac, and Thaana alphabets, and most punctuation specific to those scripts, ALM character, ... | U+061C ARABIC LETTER MARK (ALM) |
| EN | European Number | Weak | | European digits, Eastern Arabic-Indic digits, Coptic epact numbers, ... | |
| ES | European Separator | Weak | | plus sign, minus sign, ... | |
| ET | European Number Terminator | Weak | | degree sign, currency symbols, ... | |
| AN | Arabic Number | Weak | | Arabic-Indic digits, Arabic decimal and thousands separators, Rumi digits, Hanifi Rohingya digits, ... | |
| CS | Common Number Separator | Weak | | colon, comma, full stop, no-break space, ... | |
| NSM | Nonspacing Mark | Weak | | Characters in General Categories Mark, nonspacing, and Mark, enclosing (Mn, Me) | |
| BN | Boundary Neutral | Weak | | Default ignorables, non-characters, control characters other than those explicitly given other types | |
| B | Paragraph Separator | Neutral | | paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination | |
| S | Segment Separator | Neutral | | Tabs | |
| WS | Whitespace | Neutral | | space, figure space, line separator, form feed, General Punctuation block spaces (smaller set than the Unicode whitespace list) | |
| ON | Other Neutrals | Neutral | | All other characters, including object replacement character | |
| LRE | Left-to-Right Embedding | Explicit | L-to-R | LRE character only | U+202A LEFT-TO-RIGHT EMBEDDING (LRE) |
| LRO | Left-to-Right Override | Explicit | L-to-R | LRO character only | U+202D LEFT-TO-RIGHT OVERRIDE (LRO) |
| RLE | Right-to-Left Embedding | Explicit | R-to-L | RLE character only | U+202B RIGHT-TO-LEFT EMBEDDING (RLE) |
| RLO | Right-to-Left Override | Explicit | R-to-L | RLO character only | U+202E RIGHT-TO-LEFT OVERRIDE (RLO) |
| PDF | Pop Directional Format | Explicit | | PDF character only | U+202C POP DIRECTIONAL FORMATTING (PDF) |
| LRI | Left-to-Right Isolate | Explicit | L-to-R | LRI character only | U+2066 LEFT-TO-RIGHT ISOLATE (LRI) |
| RLI | Right-to-Left Isolate | Explicit | R-to-L | RLI character only | U+2067 RIGHT-TO-LEFT ISOLATE (RLI) |
| FSI | First Strong Isolate | Explicit | | FSI character only | U+2068 FIRST STRONG ISOLATE (FSI) |
| PDI | Pop Directional Isolate | Explicit | | PDI character only | U+2069 POP DIRECTIONAL ISOLATE (PDI) |
Security
Unicode bidirectional characters are used in the Trojan Source vulnerability.[2]
Visual Studio Code has highlighted BiDi control characters since version 1.62, released in October 2021.[3]
Visual Studio has highlighted BiDi control characters since version 17.0.3, released on December 14, 2021.[4]
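Highlighting of this kind amounts to scanning text for a fixed set of code points. The sketch below is an assumption about how such a check could be written, not a description of how either editor implements it; the function name and the decision to report marks (LRM, RLM, ALM) alongside embeddings, overrides, and isolates are choices made for this example.

```python
# Code points from the table of character types above.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each bidi control character, 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


# A comment that hides a right-to-left override after the '#'.
sample = "ok = True\nx = 1 # \u202e hidden"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A check like this could run as a pre-commit hook or a review step; whether a given hit is malicious still needs a human decision, since the same characters are legitimate in right-to-left prose and string literals.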
Scripts using bidirectional text
Egyptian hieroglyphs
Egyptian hieroglyphs were written bidirectionally, where the signs that had a distinct "head" or "tail" faced the beginning of the line.
Chinese characters and other CJK scripts
Chinese characters can be written in either direction as well as vertically (top to bottom then right to left), especially in signs (such as plaques), but the orientation of the individual characters does not change. This can often be seen on tour buses in China, where the company name customarily runs from the front of the vehicle to its rear: from right to left on the right side of the bus, and from left to right on the left side of the bus. English texts on the right side of the vehicle are also quite commonly written in reverse order. (See pictures of tour bus and post vehicle below.)
Likewise, other CJK scripts made up of the same square characters, such as the Japanese writing system and Korean writing system, can also be written in any direction, although horizontally left-to-right, top-to-bottom and vertically top-to-bottom right-to-left are the two most common forms.
- The right side (text runs from right to left, including the English text)
- The left side (text runs from left to right)
- On the right side of this Hainan Airlines aircraft, the text runs from right to left (空航南海).
- The left side of this Hainan Airlines aircraft, however, shows the text running from left to right (海南航空).
- A photo that shows text on both sides of a China Post vehicle. On the right door, china post appears as tsop anihc.
Boustrophedon
Boustrophedon is a writing style found in ancient Greek inscriptions, in Old Sabaic (an Old South Arabian language) and in Hungarian runes. This method of writing alternates direction, and usually reverses the individual characters, on each successive line.
Moon type
Moon type is an embossed adaptation of the Latin alphabet invented as a tactile alphabet for the blind. Initially the text changed direction (but not character orientation) at the end of the lines. Special embossed lines connected the end of a line and the beginning of the next.[5] Around 1990, it changed to a left-to-right orientation.
References
- [1] "UAX #9: Unicode Bi-directional Algorithm". Unicode.org. 2018-05-09. Retrieved 2018-06-26.
- [2] "Trojan Source Attacks". trojansource.codes. Retrieved 17 January 2022.
- [3] "Visual Studio Code October 2021". code.visualstudio.com. Retrieved 11 November 2021.
- [4] "Visual Studio 2022 version 17.0 Release Notes". docs.microsoft.com. Retrieved 17 January 2022.
- [5] Moon Type for the Blind, Ramseyer Bible Collection, Kathryn A. Martin Library, University of Minnesota Duluth.
External links
- Unicode Standards Annex #9 The Bidirectional Algorithm
- W3C guidelines on authoring techniques for bi-directional text - includes examples and good explanations
- ICU International Components for Unicode contains an implementation of the bi-directional algorithm — along with other internationalization services
Bidirectional text
[…] unicode-bidi and direction.[3][2] For web content, the base direction can be set using HTML's dir attribute (e.g., dir="rtl"), ensuring consistent rendering across diverse linguistic environments, though developers must account for issues like caret positioning during editing or neutral character resolution at direction boundaries.[2] Overall, the UBA enables seamless global text interchange, supporting users of RTL scripts while maintaining logical order for storage and processing efficiency.[1][3]
Fundamentals
Definition and Overview
Bidirectional text refers to sequences of characters that mix left-to-right (LTR) and right-to-left (RTL) directional scripts, requiring algorithmic reordering to achieve correct visual presentation.[1] This phenomenon arises from the fundamental differences in writing systems: LTR for Latin-based languages like English, and RTL for Semitic languages such as Arabic and Hebrew, which often appear together in multilingual documents or interfaces.[2][4] A simple example is the string "Hello שלום", where the Hebrew word "שלום" (meaning "peace") is laid out right to left while embedded after the LTR "Hello", so the Hebrew portion appears to the right of "Hello" but reads from right to left.[2] In computing, accurate bidirectional text rendering is crucial for maintaining readability in digital applications, web pages, and software interfaces supporting global users, preventing confusion in mixed-language content.[1][4] The Unicode standard addresses this through its Bidirectional Algorithm, ensuring consistent display across diverse platforms.[1]
Historical Context
Bidirectional writing practices trace their roots to ancient civilizations, where script directions varied to accommodate inscription surfaces or aesthetic needs. One of the earliest forms was boustrophedon, a method alternating line directions from left-to-right and right-to-left, akin to an ox plowing a field. This style appeared in ancient Greek inscriptions as early as the 8th century BCE, with examples like the Dipylon Oinochoe vase from Athens demonstrating careful reversal of characters to maintain legibility across lines.[5] Etruscan inscriptions from the 7th century BCE, influenced by Greek contact, also frequently employed boustrophedon, as seen in early monumental texts on stone and metal, reflecting a transitional phase before standardization to consistent directions.[6]
Semitic scripts established a more uniform right-to-left orientation that profoundly shaped later writing systems. The Phoenician alphabet, developed around 1200 BCE in the Levant, was consistently inscribed from right to left, omitting vowels for efficiency in trade and administration. This directionality directly influenced the Hebrew script by the 10th century BCE, where shared consonantal forms and phonetic principles preserved the RTL flow in biblical and epigraphic texts. Similarly, through intermediate Aramaic adaptations around the 9th century BCE, Phoenician RTL conventions evolved into the Arabic script, standardizing right-to-left writing across the Islamic world by the 7th century CE.[7]
Other historical systems exhibited bidirectional flexibility tied to visual or contextual cues rather than strict linearity. In ancient Egyptian hieroglyphs, dating from circa 3200 BCE, the reading direction was determined by the orientation of figures: human and animal glyphs faced toward the start of the text, allowing seamless shifts between left-to-right and right-to-left flows within the same composition, as evidenced in tomb inscriptions and stelae.[8]
The advent of printing in the 15th–16th centuries amplified challenges for bidirectional and mixed scripts, predating digital solutions. Early European presses struggled with RTL languages like Hebrew and Arabic, requiring custom type molds for cursive joins and multiple letter variants (up to four forms per Arabic character), resulting in laborious, error-prone composition by undertrained compositors. Mixed-language texts, such as early Arabic works printed in Italy from 1514, demanded reversed setting and frequent plate realignments, limiting production and fidelity to manuscript traditions until specialized foundries emerged in the 19th century.[9][10]
Technical Standards
Unicode Bidirectional Support
The Unicode Standard establishes a foundational framework for handling bidirectional text via its Bidirectional Algorithm, initially specified in version 2.0, released in July 1996.[1][11] This algorithm outlines rules for resolving the visual ordering of text that mixes left-to-right (LTR) and right-to-left (RTL) scripts, enabling consistent rendering across diverse writing systems without requiring script-specific adjustments in applications.[1]
At its core, the algorithm operates on a paragraph-by-paragraph basis, processing each paragraph independently so that directional behavior does not carry over from one paragraph to the next.[1] It determines the implicit base direction from the first strong directional character in the paragraph, such as an LTR letter or RTL character, and supports explicit overrides through dedicated formatting controls to force specific directions when needed.[1] In resolving implicit direction, the algorithm relies on character classifications, such as left-to-right or right-to-left types, which are detailed elsewhere.[1]
The Bidirectional Algorithm has evolved through successive Unicode versions, with a key enhancement in Unicode 6.3 (released September 2013) introducing bidirectional isolates to manage nested directional runs more precisely, limiting their scope and reducing interference with adjacent text.[1] The Unicode Consortium maintains ongoing refinements to the algorithm, ensuring compatibility with new scripts and internationalization requirements.[1] This support extends to broader standards, where the algorithm informs implementations in HTML and CSS (such as the dir="rtl" attribute for declaring base direction) and is realized in open-source libraries like the International Components for Unicode (ICU), which provides robust bidirectional text processing for software applications.[1][12]
Character Classification
In bidirectional text processing, characters are classified into categories based on their inherent directional behavior, which determines how they influence the ordering of text in mixed-direction scripts. These classifications are defined by the Unicode Bidirectional Algorithm and form the foundation for resolving text directionality. The primary property governing this is the Bidi_Class, a normative Unicode character property that assigns one of 23 possible bidirectional types to each code point, including unassigned and private-use characters.[1]
Characters are grouped into three main categories: strong, weak, and neutral, with additional explicit formatting types for directional control. Strong characters have a fixed direction that strongly influences surrounding text: L for left-to-right (e.g., Latin letters like A), R for right-to-left (e.g., Hebrew letters like א), and AL for right-to-left Arabic letters (e.g., Arabic ا). Weak characters adopt direction based on context, such as EN for European numbers (e.g., digits 0-9), AN for Arabic numbers (e.g., Arabic-Indic digits ٠-٩), ES for European number separators (e.g., + or -), ET for European number terminators (e.g., $ or °), CS for common separators (e.g., , or :), NSM for nonspacing marks (e.g., combining diacritics like the combining acute accent, U+0301), and BN for boundary neutrals (e.g., the soft hyphen). Neutral characters have no inherent direction and resolve based on adjacent strong types, including B for paragraph separators (e.g., U+2029 PARAGRAPH SEPARATOR), S for segment separators (e.g., tab), WS for whitespace (e.g., space), and ON for other neutrals (e.g., most punctuation like !).[1][13]
Explicit embedding and isolate types provide mechanisms for overriding or isolating directional runs: LRE, RLE, LRO, and RLO for legacy embeddings and overrides, terminated by PDF; and LRI, RLI, and FSI for isolates, terminated by PDI. These types are assigned via the Unicode Character Database, where unassigned code points mostly default to strong types (L in general, R or AL in ranges reserved for right-to-left scripts) and private-use characters may vary by implementation. Additional properties refine classification: Bidi_Mirrored indicates characters whose glyphs are mirrored in right-to-left contexts (e.g., an opening parenthesis is drawn with the shape of a closing one), while Bidi_Paired_Bracket identifies paired brackets (with types Open or Close) to ensure proper matching within directional runs, typically treating them as ON unless specified otherwise.[1][13]
The following table summarizes the bidirectional character types, their abbreviations, descriptions, and representative examples:

| Category | Type | Description | Examples (Unicode Code Points) |
|---|---|---|---|
| Strong | L | Left-to-Right | A (U+0041), α (U+03B1) |
| Strong | R | Right-to-Left | א (U+05D0), ב (U+05D1) |
| Strong | AL | Right-to-Left Arabic | ا (U+0627), ދ (U+078B) |
| Weak | EN | European Number | 1 (U+0031), ۱ (U+06F1) |
| Weak | ES | European Number Separator | + (U+002B), − (U+2212) |
| Weak | ET | European Number Terminator | ¢ (U+00A2), ₹ (U+20B9) |
| Weak | AN | Arabic Number | ٠ (U+0660), ١ (U+0661) |
| Weak | CS | Common Number Separator | : (U+003A), ، (U+060C) |
| Weak | NSM | Nonspacing Mark | combining grave accent (U+0300), combining ring below (U+0325) |
| Weak | BN | Boundary Neutral | soft hyphen (U+00AD), zero width non-joiner (U+200C) |
| Neutral | B | Paragraph Separator | line feed (U+000A), paragraph separator (U+2029) |
| Neutral | S | Segment Separator | character tabulation (U+0009), unit separator (U+001F) |
| Neutral | WS | Whitespace | space (U+0020), thin space (U+2009) |
| Neutral | ON | Other Neutral | ! (U+0021), © (U+00A9) |
| Explicit | LRE | Left-to-Right Embedding | U+202A |
| Explicit | LRO | Left-to-Right Override | U+202D |
| Explicit | RLE | Right-to-Left Embedding | U+202B |
| Explicit | RLO | Right-to-Left Override | U+202E |
| Explicit | PDF | Pop Directional Format | U+202C |
| Explicit | LRI | Left-to-Right Isolate | U+2066 |
| Explicit | RLI | Right-to-Left Isolate | U+2067 |
| Explicit | FSI | First Strong Isolate | U+2068 |
| Explicit | PDI | Pop Directional Isolate | U+2069 |
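Individual classifications can be looked up rather than memorised. As a small illustration, Python's standard `unicodedata` module exposes both Bidi_Class and Bidi_Mirrored; the `describe` helper is a name invented for this sketch, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> tuple[str, bool]:
    """Return (Bidi_Class, Bidi_Mirrored) for a single character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


samples = {
    "LATIN CAPITAL A": "A",
    "HEBREW ALEF": "\u05d0",
    "ARABIC-INDIC ZERO": "\u0660",
    "COMBINING ACUTE": "\u0301",
    "TAB": "\t",
    "PARAGRAPH SEPARATOR": "\u2029",
    "LEFT PARENTHESIS": "(",
}

for label, ch in samples.items():
    bidi_class, is_mirrored = describe(ch)
    print(f"{label:20} U+{ord(ch):04X}  {bidi_class:4} mirrored={is_mirrored}")
```

The parenthesis is the only sample here with the mirrored property set, which is why renderers swap its glyph when it ends up in a right-to-left run.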
Formatting Controls
Unicode provides a set of explicit directional formatting characters to control the rendering of bidirectional text without changing its underlying semantics. These controls allow authors to override or adjust the automatic bidirectional algorithm, ensuring proper visual ordering in mixed-direction content. They are defined in the Unicode Standard and detailed in Unicode Standard Annex #9 (UAX #9).[1]
The formatting controls fall into several categories: implicit marks, embeddings, overrides, and isolates. Implicit marks, such as the Left-to-Right Mark (LRM, U+200E) and Right-to-Left Mark (RLM, U+200F), function as zero-width characters that influence the directionality of adjacent neutral or weak characters without visible effect. The Arabic Letter Mark (ALM, U+061C) serves a similar role specifically for Arabic script contexts. These marks are particularly useful for resolving ambiguities in short sequences, such as ensuring correct punctuation attachment in mixed text.[14][15]
Embeddings and overrides provide stronger directional control by altering the embedding levels for subsequent text. The Left-to-Right Embedding (LRE, U+202A) and Right-to-Left Embedding (RLE, U+202B) initiate an embedded sequence in the specified direction, which can nest up to a maximum depth of 125 levels, while the Pop Directional Formatting (PDF, U+202C) terminates the most recent embedding or override, restoring the prior level. Overrides, including the Left-to-Right Override (LRO, U+202D) and Right-to-Left Override (RLO, U+202E), force all following characters, regardless of their inherent directionality, to adopt the specified strong direction until terminated by PDF. For example, in an RTL-dominant context like Hebrew text, inserting LRE before an English phrase such as "Hello World" followed by PDF ensures the phrase renders left-to-right: Hebrew LRE Hello World PDF. These older embedding and override mechanisms can propagate effects to surrounding text, potentially causing unintended reordering.[16][17]
To address limitations of embeddings, Unicode 6.3 introduced directional isolates, which limit directional changes to a scoped segment without influencing adjacent content. The Left-to-Right Isolate (LRI, U+2066) and Right-to-Left Isolate (RLI, U+2067) embed text in the respective directions, while the First Strong Isolate (FSI, U+2068) determines the direction based on the first strong directional character within the isolate. All isolates are terminated by the Pop Directional Isolate (PDI, U+2069), which also closes any nested embeddings or overrides inside the isolate. Isolates are preferred in modern implementations for their cleaner scoping, as they prevent "stacking" issues where mismatched controls disrupt the entire paragraph. For instance, in RTL Arabic text containing an embedded LTR URL, using FSI before the URL and PDI after it allows the URL to render correctly without affecting the surrounding Arabic: Arabic FSI https://example.com PDI.[18]
While embeddings and overrides remain supported for backward compatibility, isolates are recommended for new content to minimize compatibility risks and improve robustness in applications like web browsers and text editors. These controls interact with inherent character classifications, such as strong RTL types, to fine-tune rendering outcomes.[19][20]
Bidirectional Algorithm
The Unicode Bidirectional Algorithm (UBA), defined in Unicode Standard Annex #9, is a standardized process for determining the correct visual ordering of bidirectional text, ensuring that left-to-right (LTR) and right-to-left (RTL) scripts display properly when mixed.[1] It operates on a sequence of characters, each classified by bidirectional type from the Unicode Character Database, such as strong (L for LTR, R or AL for RTL), weak (e.g., numbers like EN), neutral (e.g., punctuation like ON), or explicit formatting controls.[1] The algorithm proceeds in sequential rules to resolve embedding levels and reorder the text for rendering, without altering the logical storage order.[1]
The process begins with identifying the base direction of each paragraph (rules P1–P3). The text is first split into paragraphs at paragraph separators. The base embedding level is then set to 0 (LTR) if the first strong directional character is of type L, or to 1 (RTL) if it is R or AL; if no strong character is found, the base direction defaults to LTR or follows higher-level protocols.[1]
Next, explicit embeddings and overrides are resolved (rules X1–X10), processing directional formatting characters like LRE (start LTR embedding), RLE (start RTL embedding), and LRO/RLO (overrides), which are terminated by PDF, along with the isolates (LRI, RLI, FSI), which are terminated by PDI and confine their effect to the isolated text. These are managed via a directional status stack with a maximum depth of 125 to avoid overflow.[1]
Weak and neutral types are then resolved relative to their neighbors and the embedding direction (rules W1–W7 and N0–N2). Weak characters, such as European numbers (EN) or Arabic numbers (AN), are adjusted according to context; for example, a European number that follows an Arabic letter is treated as an Arabic number (EN becomes AN). Neutral characters, including most punctuation and whitespace, take the direction of the surrounding strong text when both sides agree and the embedding direction otherwise, with special handling for paired brackets (N0).[1] Following this, implicit levels are assigned (rules I1–I2): levels are raised where necessary so that characters of type L, EN, or AN end up on even levels, while R characters end up on odd levels.[1]
The resolved text is segmented into level runs, defined as maximal contiguous sequences of characters with the same resolved embedding level.[1] These are then reordered for visual display, line by line (rules L1–L4). Separators and trailing whitespace are reset to the paragraph level (L1), and contiguous sequences of characters are reversed level by level, from the highest level down to the lowest odd level (L2), to achieve the correct visual order.[1] Combining marks are repositioned if the renderer requires it (L3). Finally, mirroring is applied (L4): characters with resolved RTL direction and the Bidi_Mirrored property (e.g., < becomes >) are replaced with their mirrored glyphs.[1] A high-level pseudocode summary of the resolution phases is as follows:

    Split the text at paragraph separators (P1)
    For each paragraph in the text:
        Determine base embedding level from first strong character (P2–P3)
        Process explicit directional formatting to set embedding levels (X1–X10)
        For each isolating run sequence:
            Resolve weak types based on neighbors (W1–W7)
            Resolve neutral types to strong or embedding direction (N0–N2)
            Assign implicit levels (I1–I2)
        For each rendering line:
            Reset separators and trailing whitespace to the paragraph level (L1)
            Reverse sequences level by level, highest level first (L2)
            Reposition combining marks if the renderer requires it (L3)
            Apply mirroring for RTL mirrored characters (L4)
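The first phase, choosing the paragraph's base embedding level, is small enough to sketch directly. The function below is a simplified reading of rules P2–P3 and makes two labelled assumptions: it does not skip text enclosed in isolates, as the full rule does, and the `default` parameter stands in for whatever a higher-level protocol would supply. The name `base_level` is illustrative.

```python
import unicodedata


def base_level(paragraph: str, default: int = 0) -> int:
    """Return 0 (LTR) or 1 (RTL) from the first strong character.

    Simplified sketch of P2-P3: isolate contents are not skipped here.
    """
    for ch in paragraph:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return 0
        if cls in ("R", "AL"):
            return 1
    return default  # no strong character found


print(base_level("Hello \u05e9\u05dc\u05d5\u05dd"))   # starts with Latin
print(base_level("\u05e9\u05dc\u05d5\u05dd Hello"))   # starts with Hebrew
print(base_level("123 !"))                             # digits and neutrals only
```

Digits and punctuation are skipped because they are not strong, so a paragraph that opens with a number takes its direction from the first letter that follows, or from the default when there is none.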
Script Applications
Right-to-Left Scripts
Right-to-left (RTL) scripts are writing systems where text is primarily arranged from right to left, a directionality that necessitates specific bidirectional handling when mixed with left-to-right (LTR) elements. The primary RTL scripts are Hebrew, Arabic, and Syriac; the Arabic script is also used for languages such as Persian (Farsi), Urdu, Pashto, and Kurdish (Sorani dialect), which adapt its cursive forms to their phonologies while maintaining RTL flow. Hebrew, Arabic, and Syriac are used for Semitic languages and function as abjads, in which consonants form the core of the writing system and vowels are often indicated optionally via diacritics.[21][22]
Hebrew employs a square script, characterized by block-like letter forms that do not connect cursively; it primarily represents consonants, with niqqud diacritics for vowels used mainly in educational or religious contexts. Arabic, in contrast, is a cursive script where letters adopt contextual joining forms (initial, medial, final, or isolated) depending on their position within a word, and it frequently incorporates harakat diacritics for short vowels and pronunciation nuances. Persian uses a modified Arabic script with additional letters for sounds like /p/, /ch/, /zh/, and /g/, retaining cursive RTL directionality. Urdu similarly adapts Arabic script with extra characters for Indic sounds, often including more diacritics (zer, zabar) for vowels. Syriac and related Semitic variants, such as those used for Neo-Aramaic languages, also feature cursive connections similar to Arabic, with combining diacritics like qushshaya and rukkakha for vocalization, though they include unique elements like the ligature for the letter Waw with a vertical stroke.[21]
In bidirectional contexts, these scripts establish an RTL base direction, embedding LTR segments for elements like European numerals and dates, which maintain their natural left-to-right order within the flow; for instance, a date such as "2025-11-10" appears with the year reading LTR amid surrounding RTL text. Punctuation in these scripts often undergoes mirroring as per the Unicode Bidirectional Algorithm: for example, an opening parenthesis in an RTL run is drawn with the mirrored glyph so that the pair still encloses its content. The Unicode Bidirectional Algorithm ensures proper reordering and mirroring for these scripts when integrated with LTR content.[1]
Modern adaptations for RTL scripts include specialized keyboard layouts and font technologies to facilitate digital input and display. The Hebrew QWERTY layout, a phonetic mapping on standard QWERTY keyboards, assigns Hebrew letters to keys based on English sound approximations (e.g., 'k' for Kaf), enabling bilingual users to switch seamlessly between Hebrew and Latin input. For Arabic, Persian, Urdu, and Syriac, fonts must support OpenType shaping tables to render correct joining forms and ligatures, ensuring cursive connectivity across digital platforms.[23][21]
These scripts are prevalent in the Middle East, North Africa, and South Asia, where they serve over 500 million speakers as of 2025, predominantly Arabic (around 400 million), Persian (around 110 million), and Urdu (around 70 million) users, underscoring their cultural and communicative significance in regions spanning from Morocco to Pakistan.[24][25]
Mixed-Direction Examples
Bidirectional text frequently arises in multilingual settings where left-to-right (LTR) elements, such as email addresses, are embedded within right-to-left (RTL) scripts like Arabic. For instance, an Arabic sentence describing a contact might include an LTR email address, which the bidirectional algorithm keeps in its left-to-right order while the surrounding Arabic flows from right to left.[26] Similarly, Hebrew documents often incorporate LTR URLs, such as "https://www.example.com", which keep their sequential order amid RTL text so that hyperlinks remain functional and readable.[1] Product labels in bilingual markets, like those combining English brand names with Arabic descriptions, rely on bidi handling to display prices or instructions without visual disruption, as seen in consumer goods sold across the Middle East.[27]
The visual rendering of mixed-direction text involves reordering characters according to the Unicode Bidirectional Algorithm, grouping LTR segments appropriately within RTL contexts. A representative example is the English phrase "Price: $100" inserted into an Arabic paragraph: the Latin word and the digits keep their internal left-to-right order, and because European digits that follow a strong LTR letter resolve to LTR (rule W7), the phrase stays together as one LTR run rather than being split around the colon. The colon itself is a neutral character and is not mirrored; only its position can change with context.[1] Other neutrals, such as commas and parentheses, likewise take their direction from the adjacent strong characters or from the embedding direction, as in Arabic sentences that quote English prices or dates.
Software applications handle these scenarios through built-in bidi support. Microsoft Word introduced comprehensive RTL and bidirectional features in Office 2000, enabling users to toggle paragraph directions, embed LTR isolates for emails or URLs, and process mixed-script documents like bilingual reports.[28] Web browsers such as Firefox and Chrome implement the algorithm natively and expose control over it through CSS properties like unicode-bidi and direction, rendering inline mixed content, such as Hebrew pages with embedded English URLs, correctly across platforms.[29]
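In plain text, where CSS is not available, an LTR fragment such as an email address or URL can be isolated from surrounding RTL text with the Unicode isolate controls. The sketch below assumes a hypothetical helper named `isolate`; the code points (LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069) are the standard ones, while the address and the Arabic wording are placeholders chosen for illustration.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE (direction taken from the content)
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(fragment: str, direction: str = "auto") -> str:
    """Wrap a fragment in an isolate so it cannot disturb its surroundings."""
    openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    if direction not in openers:
        raise ValueError("direction must be 'ltr', 'rtl', or 'auto'")
    return f"{openers[direction]}{fragment}{PDI}"


# Placeholder address inside an Arabic sentence ("write to us at ... today").
sentence = "راسلنا على " + isolate("user@example.com", "ltr") + " اليوم"
```

The controls are invisible, so the stored string grows by two code points per isolated fragment; whether it then displays as intended still depends on the renderer supporting isolates.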
Culturally, bidirectional text appears in public signage across diverse regions. In Israel, road signs use trilingual layouts with Hebrew (RTL) on top, followed by Arabic (RTL) and English (LTR) on separate lines, avoiding inline mixing to simplify reading for tourists and locals alike.[30] In the UAE, directional signs and product labels pair Arabic (RTL) with Latin (LTR) scripts, often isolating English terms like brand names or prices to maintain clarity in high-traffic areas like Dubai.[31]
Non-Alphabetic Systems
In ancient Egyptian hieroglyphic writing, the direction of reading is determined by the orientation of human and animal figures, which typically face toward the beginning of the text; a rightward-facing preference is dominant, allowing text to be arranged left-to-right or right-to-left accordingly.[32][33] This flexibility introduces bidirectional elements, as inscriptions on monuments or papyri could reverse direction within a single composition to suit artistic layouts. In the modern Gardiner's sign list, a standardized catalog of over 700 hieroglyphs compiled by Egyptologist Alan H. Gardiner, signs are conventionally oriented to face right, facilitating consistent scholarly transcription while preserving the script's inherent directional variability.[34]
CJK (Chinese, Japanese, and Korean) scripts, which employ logographic Han characters, are primarily rendered left-to-right in horizontal lines or top-to-bottom in vertical columns in contemporary usage, though ancient inscriptions and seals often exhibit bidirectional or rotational arrangements. For instance, Chinese seal scripts on chops or imprints frequently arrange characters in anti-clockwise or clockwise sequences to fit circular or square forms, requiring readers to infer direction from context rather than from linear flow.[35] In the Unicode Standard, CJK characters have the bidirectional class "L" (left-to-right), making them strong directional characters that are not reordered by surrounding RTL text unless an explicit override is applied, though neutral punctuation next to them may be affected in mixed-text scenarios.[36]
Boustrophedon writing, meaning "as the ox turns" in Greek, features alternating line directions (left-to-right followed by right-to-left), creating a bidirectional pattern akin to plowing a field; this method appears in various ancient non-alphabetic scripts but lacks native support in Unicode, relying on manual formatting or specialized tools for reproduction. The Mayan script of Mesoamerica, used from approximately 300 BCE to 900 CE, often employed boustrophedon in double-column blocks, where glyphs reversed direction at line ends to fill codex pages efficiently.[37] Similarly, the undeciphered Rongorongo script of Easter Island, dating to the 19th century or earlier, follows a reverse boustrophedon style, in which each successive line is rotated 180 degrees relative to the previous one, as evidenced in surviving wooden tablets.[38]
Moon type, a tactile writing system developed in 1845 by British inventor William Moon for blind readers, adapts simplified Latin-derived symbols embossed on paper and employs a boustrophedon layout, alternating left-to-right and right-to-left across lines to optimize page space and finger navigation. Because its shapes resemble print letters, it was accessible to blind readers, particularly older ones, who were already familiar with print, in contrast to Braille with its fixed left-to-right progression. Historically promoted in 19th-century Britain by the British and Foreign Blind Association, Moon type saw widespread use until the early 20th century, with approximately 300 books and other works produced.[39][40][41]
Challenges and Considerations
Rendering Issues
Rendering bidirectional text presents several challenges across platforms and software implementations, which often deviate from the behavior specified by the Unicode Bidirectional Algorithm. One common issue arises in legacy software, where characters are reordered incorrectly because support for complex scripts is incomplete. For instance, before Uniscribe was introduced in Windows 2000, earlier versions such as Windows 95 and 98 lacked robust bidirectional handling, leading to garbled display of mixed-direction text such as Arabic embedded in English.[42]
Nested embeddings exacerbate these problems: improperly nested directional controls can cause punctuation and neutral characters to associate with the wrong embedding level, producing visually incorrect layouts. The Unicode standard addresses this with explicit directional isolates (LRI, RLI, and FSI, each closed by PDI), which prevent an embedded segment from interfering with its surroundings, but many implementations fail to handle deep nesting correctly, leading to reversed or misaligned text blocks.[43][44][45]
Platform variations further complicate rendering, with notable differences between mobile operating systems. iOS introduced stronger RTL support with Auto Layout in iOS 6 (2012), enabling automatic mirroring of layouts for languages like Arabic and Hebrew, though full UI overhauls came in iOS 9 (2015). In contrast, Android's bidirectional support evolved unevenly; versions prior to 4.2 (2012) offered minimal RTL handling, while later releases such as Android 5.0 and above integrated better bidi handling via HarfBuzz, but inconsistencies persist in text shaping for mixed scripts across devices.[46][47][48]
On the web, inconsistencies arise without explicit CSS controls like unicode-bidi: bidi-override, as browsers may apply the bidirectional algorithm differently, leading to erratic reordering of inline elements. For example, Firefox and Chrome have historically diverged in handling SVG RTL text with overrides, requiring developers to use the bdo element or directional formatting codes for consistent isolation.[49][50][51]
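Since badly nested directional controls are a common source of these layout errors, a simple lint can flag them before text reaches a renderer. The following Python sketch uses a hypothetical function, `find_unbalanced`, which reports the indices of formatting characters that have no properly nested partner. It is stricter than the Unicode Bidirectional Algorithm itself, which lets a PDI implicitly close any embeddings left open inside the isolate.

```python
ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}  # LRI, RLI, FSI
EMBEDDING_INITIATORS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDI = "\u2069"  # closes an isolate
PDF = "\u202c"  # closes an embedding or override


def find_unbalanced(text: str) -> list[int]:
    """Return sorted indices of directional controls that are not strictly nested."""
    problems = []
    stack = []  # entries are (kind, index) for each control still open
    for i, ch in enumerate(text):
        if ch in ISOLATE_INITIATORS:
            stack.append(("isolate", i))
        elif ch in EMBEDDING_INITIATORS:
            stack.append(("embedding", i))
        elif ch in (PDI, PDF):
            wanted = "isolate" if ch == PDI else "embedding"
            if stack and stack[-1][0] == wanted:
                stack.pop()
            else:
                problems.append(i)  # closer with no matching opener on top
    problems.extend(index for _, index in stack)  # openers never closed
    return sorted(problems)


print(find_unbalanced("a\u2066b\u2069c"))  # []
print(find_unbalanced("\u2067abc"))        # [0]
```

A check of this kind catches authoring mistakes only; it says nothing about whether a given renderer handles correctly nested controls properly.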
Accessibility challenges include screen readers mishandling text direction and multilingual content, which can disrupt the logical reading order for users. Developers are advised to test with tools such as the Unicode text-rendering test suite, which includes bidi conformance tests, to ensure proper isolation and directionality in applications.[52][53]
