Complex text layout


Complex text layout (CTL) or complex text rendering is the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes. The term is used in the field of software internationalization, where each grapheme is a character.
Scripts which require CTL for proper display may be known as complex scripts. Examples include the Arabic alphabet and scripts of the Brahmic family, such as Devanagari, Khmer script or the Thai alphabet. Many scripts do not require CTL. For instance, the Latin alphabet or Chinese characters can be typeset by simply displaying each character one after another in straight rows or columns. However, even these scripts have alternate forms or optional features (such as cursive writing) which require CTL to produce on computers.
Characteristics requiring CTL
The main characteristics of CTL complexity are:
- Bi-directional text, where characters may be written either right-to-left or left-to-right.
- Context-sensitive shaping and ligatures, where a character may change its shape, dependent on its location and/or the surrounding characters. For example, a character in Arabic script can have as many as four different shape-forms, depending on context.
- Ordering, where the displayed order of the characters is not the same as the logical order. For example, in Devanagari, which is written from left to right, the grapheme for "short i" appears to the left of ("before") the consonant that it follows: in कि ki, the ि -i should render on the left, with its bow extending over the क k- to its right.
Not all occurrences of these characteristics require CTL. For example, the Greek alphabet has context-sensitive shaping of the letter sigma, which appears as ς at the end of a word and σ elsewhere. However, these two forms are normally stored as different characters; for instance, Unicode has both U+03C2 ς GREEK SMALL LETTER FINAL SIGMA and U+03C3 σ GREEK SMALL LETTER SIGMA, and does not treat them as equivalent. For collation and comparison purposes, software should consider the string "δῖος Ἀχιλλεύς" equivalent to "δῖοσ Ἀχιλλεύσ",[1] but for typesetting purposes they are distinct and CTL is not required to choose the correct form.
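The ordering and sigma examples above can be checked against the stored code points. The following is a minimal Python sketch using only the standard `unicodedata` module; the helper names `code_points` and `fold_final_sigma` are illustrative rather than part of any library, and the folding shown is a deliberately simplified stand-in for real collation.

```python
import unicodedata


def code_points(text):
    """Return the stored (logical) order of a string as 'U+XXXX NAME' entries."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def fold_final_sigma(text):
    """Map final sigma (U+03C2) to sigma (U+03C3), for comparison only."""
    return text.replace("\u03c2", "\u03c3")


# In कि the consonant is stored first and the vowel sign second, even though
# a CTL-capable renderer draws the vowel sign to the left of the consonant.
print(code_points("\u0915\u093f"))

# The two sigma forms are distinct characters, so a plain comparison fails,
# while a comparison that folds them treats the spellings as equivalent.
a = "\u03b4\u1fd6\u03bf\u03c2"  # δῖος
b = "\u03b4\u1fd6\u03bf\u03c3"  # δῖοσ
print(a == b, fold_final_sigma(a) == fold_final_sigma(b))
```

Nothing here performs layout: the sketch only shows that the stored order and the stored sigma form are plain data, which is why typesetting the sigma needs no CTL while drawing the Devanagari vowel sign does.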
Implementations
Most text-rendering software that is capable of CTL will include information about specific scripts, and so will be able to render them correctly without font files needing to supply instructions on how to lay out characters. Such software is usually provided in a library; examples include:
- Core Text for macOS
- Uniscribe (with Universal Shaping Engine) and DirectWrite for Microsoft Windows
- HarfBuzz, a cross-platform library
- Pango, a cross-platform library which nowadays incorporates HarfBuzz
However, such software is unable to properly render any script for which it lacks instructions, which can include many minority scripts. The alternative approach is to include the rendering instructions in the font file itself. Rendering software still needs to be capable of reading and following the instructions, but this is relatively simple.
Examples of this latter approach include Apple Advanced Typography (AAT) and Graphite. Both of these names encompass both the instruction format and the software supporting it; AAT is included on Apple operating systems, while Graphite is available for Microsoft Windows and Linux-based systems.
The OpenType format is primarily intended for systems using the first approach (layout knowledge in the renderer, not the font), but it has a few features that assist with CTL, such as contextual ligatures. AAT and Graphite instructions can be embedded in OpenType font files.
See also
- Typography
- Unicode
- Writing systems which require complex text layout:
- Arabic alphabet
- Most of the Brahmic family of scripts
- N'Ko script
- Tengwar (diacritics and numbers)
References
[1] "FAQ - Greek Language & Script". Unicode Consortium. 2012-12-03. Retrieved 2013-09-13. Quote: "It is easier to simply equate the two sigma codes for operations which are concerned with word content, for example."
External links
- Examples of complex rendering — SIL International's examples of complex writing systems around the world
- Complex Text Layout — The Open Group's Desktop Technologies
- Supporting Indic Scripts in Mozilla — also other CTL scripts
- Project SILA — Graphite and Mozilla integration project
- CTL Architecture in Solaris — Solaris Globalization Whitepapers
- Complex Scripts — Microsoft Global Development and Computing Portal
- Theppitak's Homepage — information about Thai language processing
- HarfBuzz's page at Freedesktop.org
- D-Type Unicode Text Module — Portable software library for complex text
- BidiRenderer — An application that illustrates the shaping and layout of complex text in bidirectional paragraphs using FriBidi, FreeType, and HarfBuzz
- Tehreer-Android — A library that gives full control over text-related technologies such as the bidirectional algorithm, OpenType shaping, text typesetting and text rendering
- Tehreer-Cocoa — Standalone font/text engine for iOS
- MediaWiki test cases for complex script rendering
Complex text layout
Introduction
Definition and Scope
Complex text layout (CTL) refers to the typesetting and rendering of writing systems in which the shape, position, or order of a grapheme depends on its context, such as adjacent characters or the surrounding text direction. This process involves transformations between the logical storage of text in Unicode and its visual display, distinguishing it from simple linear rendering where characters are presented without modification.[1][2]
The scope of CTL includes bidirectional (BiDi) text that mixes right-to-left and left-to-right directions, cursive joining behaviors, ligature formation for combined glyphs, and vertical or multidirectional layouts, but generally excludes straightforward left-to-right scripts like basic Latin unless they require contextual features such as combining marks. These elements ensure that text is legible and culturally appropriate across diverse scripts, with brief handling of BiDi reordering to maintain logical flow in mixed-language documents.[1][2]
For example, the simple Latin string "abc" displays as isolated characters in fixed positions, while the Arabic phrase "العربية" demands contextual shaping: letters connect cursively and alter forms (initial, medial, final, or isolated) based on their neighbors, resulting in a fluid, joined appearance.
CTL's importance lies in its role for software internationalization (i18n), allowing applications to support global languages accurately and reducing localization costs for vendors entering international markets.[5]
Historical Development
In the 1980s and early 1990s, digital typesetting technologies like Adobe's PostScript, introduced in 1984, were optimized for Latin-based scripts, creating substantial hurdles for non-Latin writing systems that demanded bidirectional rendering, variable glyph widths, or contextual shaping.[6] These systems often relied on fixed-width encodings or ad hoc extensions, complicating the handling of scripts such as Arabic, Hebrew, or CJK ideographs, where mixed byte lengths in standards like Shift-JIS further exacerbated access and unification issues.[7] Early Unicode releases, including version 1.0 in 1991 and 1.1 in 1993, provided a universal encoding foundation but omitted full bidirectional support, restricting effective digital representation of right-to-left and mixed-direction texts.[7]
Key advancements in the mid-1990s addressed these deficiencies through standardized algorithms and font formats. Unicode 2.0, released in 1996, incorporated the Bidirectional Algorithm, enabling logical-to-visual text reordering for scripts with opposing directionalities.[8] Complementing this, OpenType 1.0, jointly developed by Microsoft and Adobe and published in April 1997, introduced glyph substitution and positioning through the GSUB and GPOS tables, facilitating complex shaping for cursive and conjunct-dependent scripts.[9] As proprietary solutions proved insufficient for diverse linguistic needs, open-source initiatives gained traction: SIL International launched Graphite in 2004 as a programmable system for TrueType fonts targeting lesser-known languages, while HarfBuzz emerged in 2006 from collaborations between Pango and Qt developers to provide a robust, unified OpenType shaping engine.[10]
The post-2000 era marked a transition to open, web-centric standards, driven by the internet's expansion into non-Western markets and the demand for global content accessibility. This evolution culminated in specifications like the CSS Writing Modes Module Level 3, issued as a W3C Working Draft in February 2011, which defined properties for horizontal, vertical, and bidirectional layouts to support international scripts in browsers.[11]
Despite these strides, pre-2020 implementations revealed persistent gaps in minority script support, where many endangered or low-resource writing systems lacked encoding, shaping rules, or font resources for complex layouts. Unicode expansions, including version 3.0 in 1999 and subsequent releases up to 13.0 in 2020, systematically incorporated new characters, bidirectional properties, and script-specific behaviors to bridge these deficiencies and preserve linguistic diversity; this work continued in later versions up to 16.0 in September 2024.[12][13]
Writing Systems Requiring CTL
Bidirectional Scripts
Bidirectional scripts are writing systems that incorporate text flowing primarily from right to left (RTL), often intermixed with left-to-right (LTR) elements such as numbers, punctuation, or embedded phrases in other languages, necessitating algorithmic reordering to achieve correct visual presentation.[14] These scripts arise in languages where the base direction is RTL, but neutral or weak directional characters require resolution based on surrounding context to prevent visual distortion.[15]
Primary examples include the Arabic, Hebrew, and Syriac scripts, abjads used mainly for Semitic languages. In Arabic and Syriac the letters also connect and change form contextually, and in all three the layout demands bidirectional handling for coherent display.[16] Numbers, typically classified as European numbers (EN) or Arabic numbers (AN), and punctuation marks like parentheses or quotes are treated as neutral (ON) or weak elements, adopting the direction of adjacent strong directional text or the paragraph's embedding level.[17] For instance, in an Arabic sentence containing a European numeral, the number flows LTR within the RTL context, ensuring readability without manual adjustment.[18]
The Unicode Bidirectional Algorithm, specified in Unicode Standard Annex #9 (UAX #9), governs this reordering through a multi-pass process that assigns directional levels to characters.[14] Embedding levels allow nesting of opposite-direction text using control characters like left-to-right embedding (LRE, U+202A) or right-to-left embedding (RLE, U+202B); even levels denote LTR and odd levels RTL, and nesting is capped at a maximum depth of 125 to avoid overflow.[19] Overrides, via left-to-right override (LRO, U+202D) or right-to-left override (RLO, U+202E), force uniform direction but are discouraged due to accessibility and security concerns.[20] Resolution occurs in phases: first, splitting into paragraphs (P1) and applying explicit embeddings (X1–X10); then resolving weak types like numbers (W1–W7); followed by neutral resolution (N0–N2), where neutrals inherit direction from neighbors; and finally implicit levels (I1–I2), culminating in visual reordering by level (L1–L4).[15]
These scripts affect hundreds of millions of users worldwide, with Arabic alone spoken by over 450 million people across 25 countries, underscoring the global scale of bidirectional layout needs.[21] Historical precedents trace to ancient systems like the Phoenician script, an RTL abjad from the 11th century BCE that influenced modern Semitic writing directions.[22]
Challenges emerge prominently in mixed-content scenarios, such as RTL documents embedding LTR quotes, URLs, or code snippets, where unhandled neutrals can lead to reversed or mirrored appearances. For example, a URL in Arabic text might display with slashes and dots in inverted order, confusing readers.[23] Modern solutions recommend directional isolates (LRI, RLI, FSI, PDI; U+2066–U+2069) to encapsulate segments without affecting surroundings, mitigating these issues in digital interfaces.[24]
Complex Shaping Scripts
Complex shaping scripts involve writing systems where individual characters or glyphs change form, combine into ligatures, or reposition relative to one another within a word or syllable to achieve proper rendering. These scripts require sophisticated layout engines to handle intra-word transformations, such as vowel signs attaching to consonants or letters adopting contextual shapes based on their position. Unlike in simple scripts, shaping here applies rules for clustering and substitution to ensure legibility and aesthetic harmony.[25]
The Indic or Brahmic family of scripts, including Devanagari, Bengali, and Tamil, exemplifies complex shaping through abugida structures where consonants carry an inherent vowel that can be modified or suppressed. In Devanagari, dependent vowel signs known as matras attach above, below, to the left, or right of a base consonant; for instance, the matra U+093F ◌ि repositions to the left of the consonant क (U+0915) to form the syllable कि (ki). Bengali follows similar rules, allowing up to three left-side vowel signs per syllable, while Tamil uses the puḷḷi (U+0BCD) to suppress inherent vowels and positions vowel signs accordingly. These scripts rely on glyph substitution (GSUB) and positioning (GPOS) tables in OpenType fonts to handle reordering and attachment of matras and consonant conjuncts.[25][26]
Southeast Asian scripts like Thai, Khmer, and Lao also demand intricate shaping due to their stacked diacritics and lack of inter-word spacing. In Thai, tone marks (e.g., U+0E48 ◌่ mai ek) and vowel signs (e.g., U+0E39 ◌ู) appear above or below the base consonant, while left-side vowels are encoded in visual order, before the base consonant. Khmer employs a coeng (U+17D2 ◌្) for subjoined consonants and vowel signs that wrap around the base, such as composites like U+17B6 U+17C6 for certain vowels, while avoiding spaces between words. Lao mirrors Thai in tone mark and vowel placement, using diacritics that stack outward from the consonant. These features necessitate precise vertical positioning to prevent overlaps in rendering.[27]
Cursive scripts such as Arabic and Mongolian further complicate shaping by requiring glyphs to adopt position-dependent forms for fluid connection. Arabic letters have up to four contextual forms: isolated (standalone), initial (word-start), medial (mid-word, joining both sides), and final (word-end). All four apply to dual-joining characters like م (U+0645); right-joining letters like ر (U+0631) use only isolated and final forms. This cursive joining is managed through OpenType features like init, medi, and fina. Mongolian, written vertically, exhibits similar cursive behavior where letters join on both sides within words, with context-sensitive forms ensuring continuous flow from top to bottom.[28][29][30][31]
Vertical and Multidirectional Layouts
Vertical text layout involves arranging characters in lines that flow from top to bottom, often with columns progressing from right to left, a convention prevalent in certain writing systems to accommodate their visual and cultural traditions.[32] This approach contrasts with the predominant horizontal left-to-right flow in many scripts and requires specific handling for character orientation, such as keeping ideographs upright while rotating punctuation or Latin letters.[33] In East Asian languages, vertical presentation has historical roots in scroll-based writing, where text advances downward along the spine, enhancing readability for dense ideographic content.[34]
East Asian scripts exemplify vertical layout through their handling of Hanzi (Chinese characters), Kanji (Japanese characters borrowed from Chinese), Hiragana and Katakana (Japanese syllabaries), and Hangul (Korean syllables). Hanzi and Kanji remain upright in vertical text, with lines flowing top to bottom and succeeding columns from right to left, preserving the square aspect of each glyph for optimal legibility.[32] Hiragana and Katakana characters also stay upright, integrating seamlessly with ideographs in mixed-script documents common in Japanese publications.[33] For Korean, Hangul syllables are composed of stacked jamo (consonants and vowels) that appear upright in vertical flow, though the overall syllable block does not rotate; this allows natural progression down the line without disrupting phonetic clustering.[32]
The Mongolian script represents a distinct vertical system where text is written in columns from top to bottom, with columns advancing from left to right across the page. Individual letters rotate 90 degrees counterclockwise to align with the vertical baseline and connect fluidly within each column, forming a cursive-like chain that reflects the script's traditional calligraphic style.[35] This rotation and connection ensure that vowels and consonants interlock properly, maintaining the script's aesthetic continuity in vertical presentation.[36]
Multidirectional layouts extend vertical flow by incorporating non-linear progressions, as seen in Tibetan script, which primarily runs horizontally left to right but can adopt vertical arrangements top to bottom with successive columns progressing from left to right in certain manuscript traditions.[37] This column progression, combined with the script's inherent stacking of subjoined consonants below main glyphs, creates a dynamic flow suited to religious texts or artistic layouts.[38] Ancient scripts like Linear B, used for Mycenaean Greek around 1450–1200 BCE, occasionally employed boustrophedon writing, alternating direction per line (left to right, then right to left), on clay tablets.[39]
Unicode Standard Annex #50 (UAX #50) addresses these needs by defining the Vertical_Orientation property, which assigns each character a default behavior in vertical text, such as upright positioning or 90-degree rotation, enabling consistent rendering in vertical contexts without relying solely on font-specific adjustments.[32] The property also interacts with text directionality handling, so that mixed vertical and horizontal flows remain coherent.[32]
Key Characteristics
Text Directionality
Text directionality in complex text layout (CTL) refers to the foundational rules governing how text flows, either from left to right (LTR) or right to left (RTL), particularly in mixed-direction content. For languages like English, the default base direction is LTR, while scripts such as Hebrew and Arabic use RTL as the base direction.[40] The base direction of a paragraph is typically determined by the first strong directional character encountered, which could be L (left-to-right, e.g., Latin letters), R (right-to-left, e.g., Hebrew letters), or AL (Arabic letters with right-to-left direction).[40] If no strong character is present, higher-level protocols may set the direction explicitly.[40]
The Unicode Bidirectional Algorithm (UBA), defined in Unicode Standard Annex #9 (UAX #9), provides a standardized method to resolve directionality through a sequence of rules divided into phases: separating paragraphs, resolving embedding levels, handling weak and neutral characters, and final reordering.[40] For instance, Rule P1 splits the text into paragraphs at paragraph separators, Rule P2 finds the first strong character in each paragraph, and Rule P3 sets the paragraph level to 0 (LTR) or 1 (RTL) based on that character.[40] Explicit directional embeddings and overrides are managed by formatting codes, such as Rule X2 for RLE (Right-to-Left Embedding), which raises the embedding level to the next odd number to force RTL direction within a segment, later terminated by PDF (Pop Directional Formatting).[40] Rule L1 then resets the levels of paragraph separators, trailing whitespace, and isolate terminators to match the paragraph's base level.[40]
Directionality operates at both paragraph and inline levels within CTL. Paragraphs are processed independently, split by B-type (paragraph separator) characters, with each establishing its own base direction before line-by-line reordering.[40] Inline elements, such as embedded text or objects, inherit or adapt to the surrounding context, treating inline objects as the neutral U+FFFC character for direction resolution.[40] Weak directional characters, including numbers, are resolved in the algorithm's third phase using Rules W1 through W7; for example, a European number (EN) changes to an Arabic number (AN) if the nearest preceding strong character is AL (Rule W2), or to L if that character is L (Rule W7), ensuring numbers align appropriately in RTL contexts without disrupting the overall flow.[40]
In web and document technologies, directionality can be overridden using standards like CSS. The CSS direction property specifies the base inline direction as ltr or rtl for an element, influencing the UBA's paragraph level and affecting text ordering, table layouts, and overflow behavior.[41] Complementing this, the unicode-bidi property controls bidirectional embedding and isolation, with values like embed (inserting LRE or RLE codes), isolate (using directional isolates for scoped direction), or bidi-override (forcing direction regardless of character types), allowing precise control over mixed-direction rendering while integrating with the UBA.[41] These properties enable authors to handle CTL in bidirectional scripts, such as embedding LTR quotes in RTL text.[41]
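The first-strong-character rule for a paragraph's base level, and the odd and even level arithmetic used by embeddings, are small enough to sketch. The Python below relies on the standard `unicodedata.bidirectional` lookup; the function names are hypothetical, and the code covers only paragraph-level detection and next-level computation, not the full algorithm (no weak or neutral resolution, no depth limit, no reordering).

```python
import unicodedata

# Bidi class of a strong character -> paragraph level it implies.
STRONG_LEVEL = {"L": 0, "R": 1, "AL": 1}
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def paragraph_level(text):
    """Return 0 (LTR) or 1 (RTL) for one paragraph, from its first strong character.

    Characters inside a directional isolate are skipped. If no strong
    character is found, this sketch falls back to 0.
    """
    isolate_depth = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            isolate_depth += 1
        elif bidi_class == "PDI":
            isolate_depth = max(0, isolate_depth - 1)
        elif isolate_depth == 0 and bidi_class in STRONG_LEVEL:
            return STRONG_LEVEL[bidi_class]
    return 0


def next_odd_level(level):
    """Least odd level greater than `level` (the level an RLE asks for)."""
    return (level + 1) | 1


def next_even_level(level):
    """Least even level greater than `level` (the level an LRE asks for)."""
    return (level + 2) & ~1


print(paragraph_level("Hello \u05e9\u05dc\u05d5\u05dd"))  # Latin text comes first
print(paragraph_level("\u05e9\u05dc\u05d5\u05dd Hello"))  # Hebrew text comes first
print(next_odd_level(0), next_even_level(1))
```

Digits and spaces are not strong characters, so a paragraph that opens with "123" followed by Arabic text still resolves to level 1 in this sketch. A production implementation would normally delegate all of this to an existing bidi library rather than reimplement the rules.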