Zero-width space
from Wikipedia
In Unicode: U+200B ZERO WIDTH SPACE (&NegativeMediumSpace;, &NegativeThickSpace;, &NegativeThinSpace;, &NegativeVeryThinSpace;, &ZeroWidthSpace;)

The zero-width space (HTML entity: &ZeroWidthSpace; or &#8203;), abbreviated ZWSP, is a non-printing character used in computerized typesetting to indicate word boundaries without displaying a visible space in the rendered text. This lets text-processing systems recognize word boundaries in scripts that do not use explicit spacing, so that line breaks can be handled appropriately.

The zero-width space is Unicode character U+200B, and is located in the Unicode General Punctuation block. In HTML, it can be represented by the character entity reference &ZeroWidthSpace;.

Purpose

The zero-width space marks a potential line break without hyphenation. Its semantics and HTML implementation are similar to the soft hyphen, but soft hyphens display a hyphen character at the point where the line is broken.

The zero-width space can be used to mark word breaks in languages without visible space between words, such as Thai, Myanmar, Khmer, and Japanese.[1]

In justified text, the rendering engine may add inter-character spacing, also known as letter spacing, between letters separated by a zero-width space, unlike around fixed-width spaces.[1]

Example

To show the effect of the zero-width space in text, the following words have been separated with zero-width spaces:

Lorem​Ipsum​Dolor​Sit​Amet​Consectetur​Adipiscing​Elit​Sed​Do​Eiusmod​Tempor​Incididunt​Ut​Labore​Et​Dolore​Magna​Aliqua​Ut​Enim​Ad​Minim​Veniam​Quis​Nostrud​Exercitation​Ullamco​Laboris​Nisi​Ut​Aliquip​Ex​Ea​Commodo​Consequat​Duis​Aute​Irure​Dolor​In​Reprehenderit​In​Voluptate​Velit​Esse​Cillum​Dolore​Eu​Fugiat​Nulla​Pariatur​Excepteur​Sint​Occaecat​Cupidatat​Non​Proident​Sunt​In​Culpa​Qui​Officia​Deserunt​Mollit​Anim​Id​Est​Laborum

By contrast, the following words have not been separated:

LoremIpsumDolorSitAmetConsecteturAdipiscingElitSedDoEiusmodTemporIncididuntUtLaboreEtDoloreMagnaAliquaUtEnimAdMinimVeniamQuisNostrudExercitationUllamcoLaborisNisiUtAliquipExEaCommodoConsequatDuisAuteIrureDolorInReprehenderitInVoluptateVelitEsseCillumDoloreEuFugiatNullaPariaturExcepteurSintOccaecatCupidatatNonProidentSuntInCulpaQuiOfficiaDeseruntMollitAnimIdEstLaborum

The first text is broken into lines but only at word boundaries, and resizing the browser window will re-break the text accordingly, while the second text is not broken at all.

Usage

HTML

In HTML pages, the HTML element <wbr> functions as a zero-width space. In Internet Explorer 6, the zero-width space was not supported in some fonts.[2]
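Since `<wbr>` and the zero-width space play the same role in HTML, text that already carries ZWSP characters can be converted to markup that uses the element instead. The following is a minimal sketch of that conversion; the helper name `zwsp_to_wbr` is hypothetical and not part of any library.

```python
import html

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def zwsp_to_wbr(text: str) -> str:
    """Escape text for HTML, then replace each ZWSP with a <wbr> element.

    Hypothetical helper for illustration: html.escape leaves U+200B
    untouched, so the replacement can safely run after escaping.
    """
    return html.escape(text).replace(ZWSP, "<wbr>")


print(zwsp_to_wbr("Lorem\u200bIpsum\u200bDolor"))  # Lorem<wbr>Ipsum<wbr>Dolor
```

Escaping first keeps any `<` or `&` in the input from being interpreted as markup, while the inserted `<wbr>` elements remain real tags.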

Prohibition in domain names

ICANN rules prohibit domain names from containing non-displayed characters, including the zero-width space, and most browsers prohibit their use within domain names because they can be used to create a homograph attack, where a malicious URL is visually indistinguishable from a legitimate one.[3][4]

Encoding

Character information
Unicode name: ZERO WIDTH SPACE

Encoding                      Decimal        Hex
Unicode                       8203           U+200B
UTF-8                         226 128 139    E2 80 8B
Numeric character reference   &#8203;        &#x200B;

Named character references: &NegativeMediumSpace;, &NegativeThickSpace;, &NegativeThinSpace;, &NegativeVeryThinSpace;, &ZeroWidthSpace;

The zero-width space character is encoded in Unicode as U+200B ZERO WIDTH SPACE.[5]
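The decimal and hexadecimal values listed for the character can be reproduced with a few lines of Python. This is only an illustrative check of the code point and its UTF-8 byte sequence; the variable names are arbitrary.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

code_point = ord(ZWSP)            # Unicode scalar value as an integer
utf8_bytes = ZWSP.encode("utf-8")  # three-byte UTF-8 sequence

print(code_point, f"U+{code_point:04X}")              # 8203 U+200B
print(list(utf8_bytes), utf8_bytes.hex(" ").upper())  # [226, 128, 139] E2 80 8B
```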

In HTML, it can be referenced as &ZeroWidthSpace;, &#8203; or &#x200B;. Additionally, the character entities &NegativeThickSpace;, &NegativeMediumSpace;, &NegativeThinSpace;, and &NegativeVeryThinSpace; all also refer to the zero-width space, contrary to what their names suggest.[6]
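As a rough check of the claim that all of these references resolve to the same character, Python's standard `html.unescape` can decode each one. The sketch assumes the interpreter's HTML5 entity table matches the HTML specification, which is what the standard library documents.

```python
import html

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

references = [
    "&ZeroWidthSpace;",
    "&#8203;",
    "&#x200B;",
    "&NegativeThickSpace;",
    "&NegativeMediumSpace;",
    "&NegativeThinSpace;",
    "&NegativeVeryThinSpace;",
]

# Decode every reference and record whether it yields U+200B.
decoded = {ref: html.unescape(ref) for ref in references}
for ref, value in decoded.items():
    print(f"{ref:26} -> U+{ord(value):04X}")
```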

The TeX representation is \hskip0pt; the LaTeX representation is \hspace{0pt};[7] and the groff representation is \:.[8]

from Grokipedia
The zero-width space (ZWSP), designated as Unicode character U+200B, is a non-printing format character that occupies no visual width and indicates invisible word boundaries and line break opportunities in text processing. Introduced in Unicode 1.0 in 1991 as part of the General Punctuation block, the ZWSP was initially classified under the space category (Zs) but was later reclassified as a format character (Cf) to emphasize its role in layout control rather than spacing. This character enables proper text rendering in languages and scripts that lack explicit visible spaces between words, such as Thai, Lao, Khmer, and certain uses in Japanese.

In the Unicode Line Breaking Algorithm (UAX #14), the ZWSP functions as a break opportunity: line breaks are prohibited before it (as they are before regular spaces) but explicitly allowed after it, so it can delimit words without altering visual appearance. During text justification, it permits the addition of inter-letter spacing, distinguishing it from fixed-width spaces like the en space (U+2002). The ZWSP differs from related zero-width characters, such as the zero-width non-joiner (U+200C) and zero-width joiner (U+200D), which control joining and ligature formation rather than providing break points. While primarily a tool for text layout, its invisible nature requires careful handling in editing software to avoid unintended insertions or rendering issues.

Overview

Definition

The zero-width space (ZWSP) is a non-printing character designated as U+200B, which occupies no horizontal space in rendered text but indicates a potential word boundary or line break opportunity. This invisible character is detectable by text processing systems, allowing them to apply breaking rules without altering the visual layout. Introduced in Unicode 1.0 in October 1991 as part of the General Punctuation block (U+2000 to U+206F), the ZWSP was developed to support international text handling in digital environments. It addresses the needs of scripts that lack explicit visible spaces between words, such as Thai, Lao, Khmer, and Japanese, by providing an invisible separator for word breaks in layout algorithms. In rendering, the ZWSP remains completely invisible to users, with no glyph or advance width, yet it influences line-breaking behavior in applications like word processors and web browsers. This property makes it essential for maintaining readability in complex scripts while preserving the intended structure of the text.

The zero-width space (U+200B) is primarily intended for invisible word separation and line break control: it allows a line break opportunity without visible width, though it may expand slightly in justified text. In contrast, the zero-width non-joiner (U+200C, ZWNJ) separates characters that would otherwise form ligatures or join, such as in Indic languages, but it does not create a line break opportunity or affect word boundaries. Thus, while both are invisible, the ZWSP provides potential breaks for formatting, whereas the ZWNJ prevents unwanted joining without altering layout flow.

The zero-width joiner (U+200D, ZWJ) functions oppositely to the ZWNJ by requesting the joining of adjacent characters that would not normally connect, such as combining base emoji into sequences (e.g., family emoji or the rainbow flag) or linking elements in scripts like Devanagari. Unlike the ZWSP, which permits separation and breaks, the ZWJ enforces visual or semantic unity without creating a break opportunity, making it unsuitable for word boundary marking. This distinction is critical in complex text rendering, where misuse could disrupt intended glyph formation or emoji display.

Another related character is the zero-width no-break space (U+FEFF, ZWNBSP), which prohibits a line break at its position and is widely used as a byte order mark (BOM) to indicate the encoding of files in UTF-8 or UTF-16. In opposition to the ZWSP's breakable nature, the ZWNBSP ensures non-breaking behavior, and Unicode explicitly deprecates its use as an invisible non-breaking character in favor of U+2060 WORD JOINER. Fundamentally, the ZWSP is a neutral, breakable separator for general text processing, while the ZWNJ, ZWJ, and ZWNBSP control joining or prohibit breaks in contexts involving complex scripts or encoding signatures.
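Because all of these characters are invisible, the practical way to tell them apart is by code point. The sketch below prints each character's official name using Python's `unicodedata` module and shows that an emoji ZWJ sequence is several code points long; the dictionary labels are informal abbreviations chosen for this example.

```python
import unicodedata

# Informal labels mapped to the characters discussed above.
related = {
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "ZWJ": "\u200d",
    "WJ": "\u2060",
    "ZWNBSP": "\ufeff",
}

names = {label: unicodedata.name(ch) for label, ch in related.items()}
for label, name in names.items():
    print(f"{label:7} U+{ord(related[label]):04X} {name}")

# A family emoji is three emoji code points glued together by two ZWJs.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(len(family), family.count("\u200d"))  # 5 2
```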

Encoding and Standards

Unicode Specification

The zero-width space is encoded in the Unicode Standard at U+200B, named ZERO WIDTH SPACE, and resides in the General Punctuation block spanning U+2000 through U+206F. It was first assigned in Version 1.0.0, released in October 1991, and its code point has remained stable since, with no reallocation or deprecation. In the Unicode Character Database, it has the General_Category value Cf (Other, Format), reflecting its role as a formatting character rather than a visible space; this value was changed from the original Zs (Separator, Space) in Unicode Version 4.0.1 to better match its invisible, non-spacing behavior. Other key properties are Bidi_Class=BN (Boundary Neutral), which means it does not affect bidirectional embedding levels, and Line_Break=ZW, a class reserved for this character that enables invisible line break opportunities without contributing to text width.

The character has no default glyph or advance width, and it is a Default_Ignorable_Code_Point, meaning a renderer that does not act on it should display nothing for it rather than a fallback glyph. It is covered in Unicode Standard Annex #9 (Unicode Bidirectional Algorithm) as a boundary neutral format character, and in Unicode Standard Annex #14 (Unicode Line Breaking Algorithm) as a dedicated class for zero-width break opportunities, distinct from other spaces and joiners.
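Two of the properties mentioned here, General_Category and Bidi_Class, are exposed by Python's `unicodedata` module, so they can be inspected directly. This is an illustrative check against whatever Unicode version the interpreter ships with, and it also shows a practical consequence: Python does not treat the character as whitespace.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

category = unicodedata.category(ZWSP)         # General_Category
bidi_class = unicodedata.bidirectional(ZWSP)  # Bidi_Class

print(category, bidi_class)  # Cf BN

# Because the category is Cf rather than Zs, str.isspace() is False
# and str.split() with no arguments does not split on the character.
print(ZWSP.isspace())        # False
print("a\u200bb".split())    # one element: the string is not split
```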

Representations in Markup and Protocols

In HTML, the zero-width space (U+200B) can be written as the numeric character references &#8203; (decimal) or &#x200B; (hexadecimal), which insert the character without requiring the source file itself to contain it. In XML and related markup languages, such as those used in RSS and Atom feeds, the zero-width space is typically inserted directly in UTF-8 (bytes E2 80 8B) or UTF-16, or via the same numeric character references to ensure compatibility across parsers. This approach is common for invisible separators in structured feeds, where the character has no visual impact.

Programming languages provide standard ways to produce the zero-width space in strings, often for testing or text manipulation. In JavaScript, it is created using String.fromCharCode(8203). In Python, chr(0x200B) or chr(8203) returns the character. In C#, the escape sequence \u200B embeds it directly in strings.

In network protocols, the zero-width space requires specific encoding for transmission. In URLs, it is percent-encoded from its UTF-8 bytes as %E2%80%8B. In email, MIME (RFC 2045) carries it in quoted-printable or base64-encoded bodies. In JSON (RFC 8259), U+200B may appear literally in a string or be escaped as \u200b; escaping keeps the otherwise invisible character visible in the serialized text.
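The URL and JSON representations described above can be reproduced with Python's standard library. The sketch below uses `urllib.parse.quote` and `json.dumps` with their default settings; other serializers may make different choices about escaping.

```python
import json
from urllib.parse import quote, unquote

ZWSP = chr(0x200B)  # same character as chr(8203) or "\u200b"

# Percent-encoding works on the UTF-8 bytes of the character.
percent_encoded = quote(ZWSP)
print(percent_encoded)                   # %E2%80%8B
print(unquote(percent_encoded) == ZWSP)  # True

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
json_text = json.dumps("a" + ZWSP + "b")
print(json_text)                         # "a\u200bb"
print(json.loads(json_text) == "a" + ZWSP + "b")  # True
```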

Core Purposes

Word Boundary Marking

The zero-width space (ZWSP, U+200B) functions as an invisible separator that marks word boundaries in languages without visible inter-word spacing, such as Thai, Lao, and Khmer. By inserting ZWSP between words, text processors can accurately segment continuous scripts for tasks such as dictionary lookup, preserving the original visual appearance while enabling precise linguistic analysis.

ZWSP also helps parsers identify word boundaries without disrupting layout, which is particularly useful in non-spaced scripts or for annotating compounds. For instance, in Thai text like "สวัสดี" (hello), a ZWSP placed after "สวัส" gives a segmentation algorithm an explicit boundary at that point. This benefits applications that rely on explicit boundaries for tokenization.

The use of ZWSP improves text search accuracy by providing reliable word-level granularity, especially when combined with text segmentation rules, and supports line breaking in systems like LaTeX by allowing breaks at designated points in compounds without visible gaps. For example, inserting a ZWSP after the slash in a term like "input/output" allows a line break there while keeping the compound visually intact.
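The idea of storing word boundaries invisibly can be sketched in a few lines: join pre-segmented words with ZWSP, and later recover the segmentation by splitting on it. The two-word Thai sample ("ภาษา" + "ไทย", "Thai language") is an assumed illustration, not taken from any particular corpus.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Words already segmented by some upstream process (assumed input).
words = ["ภาษา", "ไทย"]

# Mark the boundaries invisibly; the text looks unchanged when rendered.
marked = ZWSP.join(words)

# A consumer can recover the segmentation without a dictionary.
recovered = marked.split(ZWSP)

# Stripping the markers yields the plain unsegmented text again.
plain = marked.replace(ZWSP, "")

print(recovered == words)        # True
print(len(marked) - len(plain))  # 1 (one invisible boundary marker)
```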

Line Break Facilitation

The zero-width space (ZWSP, U+200B) serves as an invisible break opportunity in text layout systems, permitting line wrapping at designated points without altering the visual appearance of the content. In the Unicode Line Breaking Algorithm (UAX #14), ZWSP is assigned the line breaking class ZW, which has its own rules: breaks are prohibited before ZWSP (rule LB7: × ZW) but allowed after it, even across any spaces that follow (rule LB8: ZW SP* ÷). This makes ZWSP a non-hyphenating alternative to the soft hyphen (U+00AD), enabling controlled fragmentation of otherwise unbreakable sequences without the insertion of a hyphen mark.

In practice, ZWSP allows line breaks in extended constructs such as URLs, where inserting it (for instance, within "example.com/very/long/path") prevents horizontal overflow in constrained viewports without compromising readability. Similarly, it supports wrapping in long chemical formulas (e.g., C₆₀H₁₂₂) and in text in scripts written without spaces between words, including Chinese, Japanese, and Korean (CJK), where it can mark preferred break points; UAX #14 notes its utility for indicating break opportunities where a visible space cannot be used. These applications let text flow adaptively across devices and formats.

Within web styling contexts, ZWSP works with CSS line-breaking behavior to allow discretionary wraps where standard spaces would introduce undesired width; for example, browsers treat it equivalently to the <wbr> element for suggesting breaks in inline content. However, a ZWSP marks only an opportunity: whether a break actually occurs there depends on the available width and the layout engine. ZWSP is therefore a suggestive rather than mandatory cue in line wrapping.
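To make the "break allowed after ZWSP, never before" behavior concrete, here is a deliberately simplified greedy wrapper that breaks only at ZWSP positions. It is a toy sketch and not an implementation of UAX #14: it ignores spaces, hyphens, and every other line-breaking class, and the function name `wrap_at_zwsp` is hypothetical.

```python
from typing import List

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def wrap_at_zwsp(text: str, width: int) -> List[str]:
    """Greedily pack ZWSP-delimited chunks into lines of at most `width`
    visible characters. A chunk longer than `width` is left unbroken,
    which mirrors how an unbreakable run overflows its container."""
    lines: List[str] = []
    current = ""
    for chunk in text.split(ZWSP):
        if current and len(current) + len(chunk) > width:
            lines.append(current)  # take the break opportunity
            current = chunk
        else:
            current += chunk       # opportunity not needed; stay on this line
    if current:
        lines.append(current)
    return lines


url = ZWSP.join(["example.com/", "very/", "long/", "path"])
print(wrap_at_zwsp(url, 14))   # ['example.com/', 'very/long/path']
print(wrap_at_zwsp(url, 100))  # ['example.com/very/long/path']
```

The second call shows the optional nature of the break: when everything fits, no ZWSP position is used and the output is visually identical to the unmarked text.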

Practical Applications

In Multilingual Text Processing

The zero-width space (U+200B) plays a role in East Asian typography, particularly for Chinese, Japanese, and Korean (CJK) languages, where it provides subtle control over spacing and line breaks without introducing visible gaps. In CJK text processing, layout algorithms may automatically insert proportional spacing between characters for justification, and an inserted zero-width space may be used to influence this behavior and keep the layout precise in horizontal or vertical arrangements. In Japanese typesetting, for instance, the zero-width space provides break opportunities at phrase boundaries, maintaining aesthetic balance across lines while adhering to monospaced character grids. This is especially useful in typesetting software, where CJK justification relies on distributed spacing rather than word gaps and the zero-width space acts as an invisible delimiter to fine-tune character distribution.

In bidirectional text processing, the zero-width space supports layouts involving right-to-left scripts like Hebrew and Arabic by serving as a formatting character that does not alter the overall directional flow. In the Unicode Bidirectional Algorithm (UAX #9), the zero-width space belongs to the Boundary Neutral (BN) class, so it can sit between directional runs, such as left-to-right numbers or English terms inside right-to-left text, without forcing reordering or changing embedding levels. This neutrality lets the zero-width space mark logical separations in mixed-direction text, such as Arabic sentences containing Hebrew quotes, while preserving the right-to-left rendering produced by the algorithm's resolution phases.

For search and indexing in natural language processing (NLP), the zero-width space improves tokenization accuracy in script-mixed multilingual text by explicitly indicating word boundaries where visible spaces are absent or ambiguous. Under Unicode Standard Annex #29, U+200B acts as a word separator, enabling tools to distinguish tokens in languages without inter-word spacing, such as when English words are embedded in Arabic or Thai sentences; for example, inserting it between "hello" and an adjacent Arabic term prevents erroneous merging during indexing. Libraries like the International Components for Unicode (ICU) incorporate these rules in their BreakIterator implementation, supporting precise segmentation for search engines and NLP pipelines handling diverse scripts.

In multilingual input methods, the zero-width space can be entered as an invisible separator where keyboard configurations allow insertion of non-printing characters via modifier keys. For example, certain international layouts let users produce U+200B through compose sequences or AltGr combinations, so typists can add boundaries while composing mixed-script documents without visible artifacts.
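A tokenizer that honors ZWSP boundaries can be sketched with a regular expression. Note that Python's `\s` does not match U+200B, so the character has to be listed explicitly. This is a simplified stand-in for illustration, not the UAX #29 algorithm or ICU's BreakIterator, and the function name `tokenize` is hypothetical.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Split on runs of ordinary whitespace and/or zero-width spaces.
_BOUNDARY = re.compile(r"[\s\u200b]+")


def tokenize(text: str) -> list:
    """Return the non-empty tokens between whitespace or ZWSP boundaries."""
    return [token for token in _BOUNDARY.split(text) if token]


sample = "hello" + ZWSP + "สวัสดี ครับ"

print(sample.split())    # plain whitespace split keeps 'hello' glued to the Thai word
print(tokenize(sample))  # ['hello', 'สวัสดี', 'ครับ']
```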

In Web Development and HTML

In web development, the zero-width space (U+200B) is inserted into HTML using the numeric character reference &#8203;, enabling line breaks within inline elements without introducing visible spacing. This technique is particularly useful where standard spaces would disrupt the design, such as in navigation menus whose items need to wrap responsively on smaller screens without awkward gaps. For instance, placing &#8203; between parts of menu link text allows the browser to break the line at that point if the viewport narrows, preserving readability without adding width. Similarly, in code snippets displayed inline, &#8203; allows long identifiers or URLs to wrap so that they do not overflow their containers.

When combined with CSS, the zero-width space refines text wrapping behavior, especially in responsive designs. It pairs with white-space: pre-wrap, which preserves whitespace and line breaks while still allowing wrapping, so developers can place ZWSP characters to control where breaks occur without altering the visual flow. Alongside overflow-wrap: break-word (or the legacy word-break: break-word), ZWSP provides preferred hyphenless break points in long, space-less strings like URLs or compound words, preventing overflow in fluid layouts across devices. This approach is common in mobile-first designs, where precise control over text reflow is needed to avoid horizontal scrolling.

In JavaScript, handling zero-width spaces is important for input sanitization, to mitigate risks such as hidden payloads in user-submitted data. Developers often remove ZWSP with a regular expression, such as string.replace(/\u200B/g, ''), which targets the Unicode code point and strips all instances globally. This keeps form data clean, particularly in web applications where malicious actors might embed invisible characters to evade validation filters.

The zero-width space gained prominence in web development around 2010, as mobile browsers improved support and responsive techniques became standard, enabling better handling of invisible formatting in cross-device layouts.
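The same sanitization step can be sketched on the server side in Python. The helper name `strip_zwsp` is hypothetical, and the example only removes U+200B; a real filter would likely cover other invisible characters too.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def strip_zwsp(value: str) -> str:
    """Remove every zero-width space from user-supplied text."""
    return value.replace(ZWSP, "")


submitted = "ad" + ZWSP + "min"

# The raw value looks like "admin" when displayed but does not compare equal.
print(submitted == "admin")              # False
print(strip_zwsp(submitted) == "admin")  # True
```

Comparing before and after stripping shows why the step matters: a blocklist check against "admin" would miss the raw value even though users would see the two strings as identical.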

In Typography and Document Formatting

In typesetting applications, the zero-width space (ZWSP, U+200B) serves as an invisible delimiter to support spacing adjustments in non-Latin scripts, such as Thai, where visible spaces are absent between words. By marking word boundaries without adding width, it lets software apply appropriate inter-character spacing and optical metrics tailored to the font's design, preventing awkward gaps or overlaps in complex layouts. This is particularly useful for maintaining readability in documents that mix scripts, as the ZWSP informs the engine of logical breaks for justification without altering visual appearance.

As an alternative to soft hyphens, the ZWSP allows line breaks in justified text without introducing visible hyphenation marks. For instance, in legal documents requiring precise, unobtrusive formatting, inserting a ZWSP at potential break points allows even line endings across paragraphs while avoiding hyphens, which can imply fragmentation in formal prose. This supports full justification by permitting controlled letter-spacing expansion instead of erratic word gaps.

During generation of PDFs and other digital publications, the ZWSP aids consistent rendering across devices by explicitly signaling allowable line breaks within embedded fonts, which may vary in glyph metrics or justification behavior. This helps prevent overflow or reflow issues, especially for long compounds or non-spaced scripts, regardless of the reader's font substitution or screen size. In LaTeX environments, packages such as polyglossia incorporate the ZWSP to manage script-specific spacing in multilingual PDFs, inserting it at language transitions or word boundaries to apply hyphenation rules per script. This improves output quality for documents blending Latin and non-Latin content, like academic texts, by building on XeLaTeX's fontspec integration.

Restrictions and Challenges

Prohibitions in Identifiers

The zero-width space (U+200B) is prohibited in internationalized domain names (IDNs) under ICANN policies to prevent homograph attacks and invisible spoofing that could enable phishing or visual deception. A briefing on IDN permissible code points discusses U+200B as a non-displayed character in the context of potential user confusion. Similarly, RFC 5892, which defines the code points eligible for IDNA labels, classifies U+200B as DISALLOWED, excluding it from the protocol-valid (PVALID) category and thereby barring its use in registered domain labels.

In programming languages, treatment of the zero-width space in identifiers varies. In Java, U+200B is a format character and therefore counts as an ignorable character within identifiers (see Character.isIdentifierIgnorable), so it cannot by itself create a distinct name. In contrast, the ISO C++ standard has permitted U+200B in identifiers under its Unicode support rules (C++11 through C++20), an allowance criticized for enabling invisible variations in code that can lead to subtle bugs or malicious insertions.

Security implications arise from the zero-width space's potential for malicious use in phishing, where it is inserted into URLs to create lookalike domains such as "examp​le.com" that evade detection while appearing identical to legitimate ones. This technique, dubbed Z-WASP, has been used to bypass protections in systems such as Microsoft Office 365 by obfuscating malicious links without altering their functionality. As of January 2025, variants like "shy z-wasp" continue to exploit zero-width characters in phishing campaigns.

Policy regarding the zero-width space in identifiers reflects growing awareness of Unicode security risks: Unicode Technical Report #36 (updated in 2010) discourages the use of invisible characters like U+200B in user-facing identifiers to mitigate confusability and spoofing threats. This guidance encourages standards bodies and implementers to prefer visible, unambiguous characters in contexts like domain names and code naming.
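A simple way to surface this kind of spoofing is to scan a string for characters in the Unicode format category (Cf), which includes U+200B. The following sketch is an illustrative check only, not an implementation of the IDNA rules in RFC 5892; the function name `invisible_format_chars` is hypothetical.

```python
import unicodedata


def invisible_format_chars(text: str) -> list:
    """Return (index, code point) pairs for every Cf (format) character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


spoofed = "examp\u200ble.com"

print(spoofed == "example.com")         # False, despite looking identical
print(invisible_format_chars(spoofed))  # [(5, 'U+200B')]

# Python itself rejects U+200B inside identifiers.
print("user\u200bname".isidentifier())  # False
```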

Compatibility and Rendering Issues

The zero-width space (U+200B) presents several compatibility challenges in web browsers, particularly in older versions. For instance, early versions of Internet Explorer, such as version 6, did not fully support the character in certain fonts, so it was ignored or rendered incorrectly, which disrupted intended line break opportunities. In some software from around 2006, inserting zero-width spaces for URL wrapping could reportedly trigger crashes during PDF generation. Modern browsers like Chrome have also been observed to insert U+200B into snippets copied from developer tools, complicating debugging.

Rendering varies across fonts and operating systems. Fonts such as Arial Unicode MS provide support for U+200B, but without appropriate fallback mechanisms, WebKit-based browsers may display it as a square or a tiny visible gap if the primary font lacks the glyph, leading to inconsistent visual output. Input methods on macOS and Windows can insert the character accidentally; for example, selecting text (e.g., via Cmd+A) in web applications like Outlook Web App on macOS has been reported to add extraneous U+200B instances. In legacy ASCII-only environments, the character cannot be represented and is typically substituted with a replacement like "?" or stripped entirely, nullifying its formatting role.

To mitigate these problems, developers commonly use regular expressions such as /\u200B/g in JavaScript to detect and remove zero-width spaces, a practice that has become routine in code audits to eliminate artifacts from web copy-pasting.
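For a code audit, it is often more useful to report where stray zero-width spaces are than to strip them silently. The sketch below lists 1-based line and column positions, and also shows the ASCII substitution behavior described above using Python's standard codec error handlers; the function name `find_zwsp` and the sample source text are assumptions for illustration.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def find_zwsp(source: str) -> list:
    """Return (line, column) pairs, both 1-based, for each ZWSP found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == ZWSP:
                hits.append((line_no, col))
    return hits


pasted = "let a = 1;\nlet\u200b b = 2;\n"
print(find_zwsp(pasted))  # [(2, 4)]

# Encoding to ASCII either substitutes "?" or drops the character,
# depending on the error handler chosen.
print("a\u200bb".encode("ascii", errors="replace"))  # b'a?b'
print("a\u200bb".encode("ascii", errors="ignore"))   # b'ab'
```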
