Double-byte character set
A double-byte character set (DBCS) is a character encoding in which either all characters (including control characters) are encoded in two bytes, or merely every graphic character not representable by an accompanying single-byte character set (SBCS) is encoded in two bytes (Han characters would generally comprise most of these two-byte characters). A DBCS supports national languages that contain many unique characters or symbols (the maximum number of characters that can be represented with one byte is 256 characters, while two bytes can represent up to 65,536 characters). Examples of such languages include Korean, Japanese, and Chinese. Korean Hangul does not contain as many characters, but KS X 1001 supports both Hangul and Hanja, and uses two bytes per character.
In CJK computing
The term DBCS traditionally refers to a character encoding where each graphic character is encoded in two bytes.
In an 8-bit code, such as Big-5 or Shift JIS, a character from the DBCS is represented with a lead (first) byte whose most significant bit is set (i.e., a value greater than 0x7F), and the DBCS is paired with a single-byte character set (SBCS). For the practical reason of maintaining compatibility with unmodified, off-the-shelf software, the SBCS is associated with half-width characters and the DBCS with full-width characters. In a 7-bit code such as ISO-2022-JP, escape sequences or shift codes are used to switch between the SBCS and DBCS.
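The following sketch (not from the article's sources) uses Python's standard shift_jis codec to show this split: ASCII and half-width katakana come from the SBCS and occupy one byte, while full-width characters come from the DBCS and occupy two bytes.

```python
# Illustrative sketch using Python's standard "shift_jis" codec; byte values in
# the comments are what this codec produces for each character.
text = "A\uff71\u30a2\u6f22"   # 'A', half-width katakana ｱ, full-width katakana ア, kanji 漢

for ch in text:
    b = ch.encode("shift_jis")
    kind = "SBCS (half-width)" if len(b) == 1 else "DBCS (full-width)"
    msb = "set" if b[0] & 0x80 else "clear"
    print(f"U+{ord(ch):04X} {ch}: bytes {b.hex()} -> {kind}, first-byte MSB {msb}")

# U+0041 A: bytes 41   -> SBCS (half-width), first-byte MSB clear
# U+FF71 ｱ: bytes b1   -> SBCS (half-width), first-byte MSB set
# U+30A2 ア: bytes 8341 -> DBCS (full-width), first-byte MSB set
# U+6F22 漢: bytes 8abf -> DBCS (full-width), first-byte MSB set
```

Note that the half-width katakana block of the SBCS also uses byte values with the high bit set, which is one source of the parsing ambiguity discussed below.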
Sometimes, the use of the term "DBCS" can imply an underlying structure that does not comply with ISO 2022. For example, "DBCS" can sometimes mean a double-byte encoding that is specifically not Extended Unix Code (EUC).
This original meaning of DBCS is different from what some consider correct usage today. Some insist that these character encodings should properly be called multi-byte character sets (MBCS) or variable-width encodings, because character encodings such as EUC-JP, EUC-KR, EUC-TW, GB 18030, and UTF-8 use more than two bytes for some characters and only one byte for others.
Ambiguity
Some people use DBCS to mean the UTF-16 and UTF-8 encodings, while others use the term to mean older (pre-Unicode) character encodings that use more than one byte per character. Shift JIS, GB 2312 and Big5 are a few character encodings that can use more than one byte per character, but even calling these encodings DBCS is strictly incorrect, because they are really variable-width encodings (as are both UTF-16 and UTF-8). Some IBM mainframes do have true DBCS code pages, which contain only the double-byte portion of a multi-byte code page.
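As a hedged illustration using Python's standard codecs (not drawn from the sources above), the encodings named here all use one byte for ASCII and two bytes for a typical CJK character, which is exactly what makes them variable-width:

```python
# Byte lengths of an ASCII letter and a common CJK character under several
# legacy East Asian encodings and Unicode transformation formats.
for label, ch in [("ASCII 'A'", "A"), ("CJK U+4E2D", "\u4e2d")]:
    for codec in ["shift_jis", "gb2312", "big5", "utf-8", "utf-16-le"]:
        print(f"{label:12} {codec:10} -> {len(ch.encode(codec))} byte(s)")

# 'A' takes 1 byte in Shift JIS, GB 2312, Big5, and UTF-8, and 2 bytes in UTF-16;
# U+4E2D takes 2 bytes in the legacy encodings, 3 in UTF-8, and 2 in UTF-16.
```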
If a person uses the term "DBCS enablement" for software internationalization, the terminology is ambiguous: they may mean writing software for East Asian markets using older code-page technology, or they may be planning to use Unicode. Sometimes the term also implies translation into an East Asian language. Usually "Unicode enablement" means internationalizing software by using Unicode, while "DBCS enablement" means internationalizing it using the mutually incompatible legacy character encodings of the various East Asian countries. Since Unicode, unlike many other character encodings, supports all the major languages of East Asia, it is generally easier to enable and maintain software that uses Unicode. DBCS (non-Unicode) enablement is usually only desired when much older operating systems or applications do not support Unicode.
TBCS
A triple-byte character set (TBCS) is a character encoding in which characters (including control characters) are encoded in three bytes.
See also
- Variable-width encoding (also known as MBCS – multi-byte character set)
- DOS/V
External links
- Microsoft's definition of "double-byte character set"
- IBM's definition of "double-byte character set" at the Wayback Machine (archived October 18, 2018)
Double-byte character set
Fundamentals
Definition
A double-byte character set (DBCS) is a character encoding scheme that uses either one or two bytes to represent characters, extending single-byte character sets (SBCS) to support up to 65,536 unique characters.[3] This capacity significantly exceeds the 256-character limit of SBCS, which use only 8 bits per character and are sufficient for alphabetic scripts like Latin-based languages.[3] In DBCS, characters are mapped to code points, with single-byte characters typically in the ASCII range and double-byte sequences for ideographic characters, providing a variable-width structure.[2]

The primary purpose of DBCS is to accommodate logographic writing systems, such as those used in Chinese, Japanese, and Korean (collectively known as CJK languages), which require thousands of distinct glyphs to represent characters and ideographs.[2] Unlike alphabetic scripts that rely on a limited set of letters combined into words, CJK scripts demand extensive character inventories—often exceeding 20,000 symbols—for full expressiveness in text processing, display, and storage.[3] This encoding approach emerged as a solution for handling the complexity of these scripts in computing environments, ensuring efficient representation without the constraints of single-byte limitations.[2]

In terms of basic mechanics, a DBCS assigns characters to either single-byte or double-byte values from a predefined table. In variable-width DBCS, which is the most common form, single-byte characters are identified by values in a specific range (e.g., 0x00 to 0x7F for ASCII compatibility), while double-byte characters begin with a lead byte (typically in the high range, such as 0x81 to 0x9F) followed by a trail byte (in a complementary range).[1] Fixed-width variants, where all characters use two bytes, exist but are less common for mixed text. For instance, implementations based on the JIS X 0208 standard often use this variable approach to encode Japanese characters including kanji, hiragana, and katakana.[2]
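A minimal sketch of these mechanics, assuming the Shift JIS / CP932 lead-byte ranges mentioned later in the article rather than any universal rule:

```python
# A walker over a DBCS byte string, assuming Shift JIS / CP932-style lead-byte
# ranges (0x81-0x9F and 0xE0-0xFC); other DBCS use different ranges.
def split_dbcs(data: bytes):
    """Yield (offset, length) for each character in a Shift JIS-style byte string."""
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0x9F or 0xE0 <= data[i] <= 0xFC:
            yield i, 2          # lead byte: this character occupies two bytes
            i += 2
        else:
            yield i, 1          # single-byte character
            i += 1

data = "DBCS漢字".encode("cp932")
print([(off, data[off:off + n].decode("cp932")) for off, n in split_dbcs(data)])
# [(0, 'D'), (1, 'B'), (2, 'C'), (3, 'S'), (4, '漢'), (6, '字')]
```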
Comparison to Other Encodings

Double-byte character sets (DBCS) differ fundamentally from single-byte character sets (SBCS) in their capacity to represent characters. SBCS encodings, such as ASCII, allocate one byte per character, limiting the repertoire to 256 distinct code points, which is sufficient for Latin-based scripts but inadequate for languages with large character inventories like Chinese, Japanese, and Korean (CJK).[17] In contrast, DBCS expands this capability by using up to two bytes per character, enabling comprehensive coverage of CJK ideographs through variable-width encoding.[18][17]

DBCS is a specific type of multibyte character set (MBCS), where characters are encoded in one or two bytes, distinguishing it from variable-length schemes like UTF-8 (which uses 1 to 4 bytes per character).[19] UTF-8, as a Unicode transformation format, supports over a million code points and maintains backward compatibility with ASCII by encoding Latin characters in a single byte, whereas DBCS uses lead bytes to signal double-byte sequences for non-ASCII characters, avoiding the more complex parsing of UTF-8 but limiting universality.[18][19]

DBCS encompasses primarily variable-width variants, with some fixed-width implementations. Variable-width DBCS, such as Shift JIS, optimizes by using one byte for ASCII-compatible characters and two for others, enhancing efficiency for mixed-language text but introducing parsing complexity through lead-byte identification (e.g., a byte above 0x7F signaling the lead byte of a two-byte sequence).[18][19] Fixed-width DBCS employs uniform two-byte sequences for all characters, simplifying parsing and indexing since boundaries are predictable, though this leads to redundancy for ASCII subsets.[19]

Regarding space efficiency, variable-width DBCS strikes a balance for mixed ASCII-CJK content, similar to UTF-8's variable approach, but its regional specificity limits universal applicability.[18] For predominantly CJK text, DBCS uses two bytes per character, which can be more compact than UTF-8's typical three bytes for such characters.[20]
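A rough size comparison for mostly-CJK text, illustrated here with Python's standard codecs (an editorial sketch, not a benchmark from the cited sources):

```python
# For predominantly CJK text, a legacy DBCS stores each character in 2 bytes,
# while UTF-8 needs 3 bytes for the same BMP code points and UTF-16 needs 2.
text = "日本語の文書" * 1000          # 6,000 full-width characters
print("shift_jis:", len(text.encode("shift_jis")), "bytes")   # 12000
print("utf-8:    ", len(text.encode("utf-8")), "bytes")       # 18000
print("utf-16-le:", len(text.encode("utf-16-le")), "bytes")   # 12000
```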
Historical Development
Origins in the 1980s
In the mid-1980s, the rapid rise of personal computing in Japan and China highlighted the inadequacies of existing character encoding systems for handling East Asian languages. ASCII, originally a 7-bit standard limited to 128 characters and later extended to 8-bit ISO-8859 variants supporting up to 256 characters, proved insufficient for representing the vast repertoires of kanji in Japanese and hanzi in Chinese, which required encoding thousands of unique glyphs for practical use.[1][21] In Japan, the proliferation of personal computers like the NEC PC-9800 series and IBM's Multistation 5550 drove demand for native language support, while in China, early imports and domestic developments in the mid-1980s necessitated solutions for simplified hanzi processing.[22]

The linguistic motivation for double-byte character sets (DBCS) stemmed from the ideographic nature of CJK (Chinese, Japanese, Korean) scripts, where characters are logographic symbols rather than phonetic units, demanding encoding for 2,000 to 7,000 or more commonly used characters—far exceeding single-byte capacities. Unlike alphabetic scripts, CJK languages rely on these ideographs for semantic expression, with Japanese kanji alone drawing from historical sets like the 1,850 tōyō kanji and additional extensions, making one-byte encodings impractical for comprehensive text representation.[23][21] This complexity fueled the shift toward 16-bit (double-byte) approaches, allowing up to 65,536 possible characters while preserving compatibility with ASCII for Roman letters and symbols.[1]

Early developments of DBCS were pioneered by IBM in Japan around 1984, particularly through implementations on mainframe-derived systems like the Multistation 5550 workstation, which integrated double-byte support for Japanese text processing. These efforts were heavily influenced by emerging Japanese Industrial Standards (JIS), building on the foundational JIS C 6226-1978 (later JIS X 0208) that defined a 94×94 grid for over 6,000 kanji and kana.[24] In China, parallel advancements culminated in the national standard GB 2312-1980 (published 1981), a double-byte encoding for 6,763 simplified hanzi, addressing similar needs for domestic computing.[25]

Initial adoption of DBCS occurred in specialized hardware for East Asian markets, including early word processors and terminals that enabled kana-to-kanji conversion and full-script display. In Japan, systems like the Multistation 5550, released in 1983 with full DBCS rollout by 1984, supported business applications and text editing via Shift JIS encoding.[24][26] Similarly, in China, GB 2312 facilitated hanzi input on imported and modified PCs, powering the first wave of localized software for administrative and publishing tasks by the mid-1980s.[27]
Key Milestones and Standards

In 1987, the Japanese Industrial Standards Committee formalized JIS X 0208 as the nation's primary double-byte character set (DBCS), renaming and updating the earlier JIS C 6226 standard to encode 6,879 graphic characters, including kanji, hiragana, katakana, and symbols, arranged in a 94x94 grid for efficient representation of Japanese text.[28][15] This milestone addressed the limitations of single-byte encodings for East Asian scripts, enabling broader digital adoption in computing and printing industries.

The 1990s saw significant expansions of DBCS standards across East Asia, building on JIS X 0208's model. Although initially published in 1980 as a national standard for simplified Chinese, GB 2312 gained widespread DBCS implementation during this decade through encodings like EUC-CN and GBK, supporting over 6,700 characters for Mandarin text in personal computing and software localization.[29] Similarly, South Korea's KS C 5601, established in 1987, encoded 2,350 modern Hangul syllables, 4,888 Hanja characters, and additional symbols in a DBCS format, with revisions in 1992 adding precomposed syllables to enhance compatibility with international systems.[30][31] These standards facilitated regional text processing but highlighted the need for interoperability amid growing global data exchange.[32]

International efforts in the early 1990s influenced DBCS evolution through ISO/IEC 10646, whose initial drafts proposed a fixed-width 16-bit universal character set to accommodate East Asian requirements, drawing from DBCS structures like JIS X 0208 while aiming for broader coverage.[33] Concurrently, Microsoft extended DBCS via code pages such as CP932, an enhanced Shift JIS variant incorporating JIS X 0208 with proprietary additions for Windows environments, improving font rendering and input methods.[18] The decade's peak came in 1992 with Windows 3.1's integration of DBCS for East Asian locales, providing standardized lead-byte detection and text handling that propelled adoption in Japan, China, and Korea.[34]
Encoding Types
Fixed-Width Encodings
Fixed-width double-byte character sets, also known as fixed-width encodings, represent each character—including control characters—using exactly two bytes, resulting in uniform 16-bit code points for all symbols. This approach eliminates the need for state machines or shift sequences during decoding, as every code unit directly corresponds to a single character without ambiguity in byte boundaries.[35]

A prominent example of such an encoding is UCS-2, the original 16-bit form of the Universal Character Set defined in early versions of ISO/IEC 10646, which served as a predecessor to the more flexible UTF-16. UCS-2 maps characters solely within the Basic Multilingual Plane (BMP), supporting up to 65,536 code points, and was widely adopted in early Unicode implementations for its simplicity in processing CJK and other scripts. In CJK contexts, fixed-width DBCS often appear as 16-bit process codes for Asian character sets, such as internal representations of JIS X 0208 in UNIX systems.[35][36][37]

The primary advantages of fixed-width encodings lie in their predictable memory layout and support for straightforward random access in strings, where the nth character can be located by simply multiplying the index by two bytes without parsing variable lengths. This facilitates efficient indexing and substring operations in software, particularly in environments predating widespread Unicode adoption. However, these encodings are inefficient for text dominated by ASCII characters, as they double the storage size compared to single-byte representations, and they cannot handle characters beyond the BMP without surrogate mechanisms, limiting their scalability for full Unicode coverage.[38]
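A short sketch of this random-access property, using UTF-16-BE restricted to BMP characters as a stand-in for UCS-2 (an illustrative assumption, not a statement about any particular product):

```python
# O(1) indexing in a fixed-width two-byte encoding, using UTF-16-BE restricted
# to BMP characters as a stand-in for UCS-2.
text = "漢字ABC"
buf = text.encode("utf-16-be")       # every BMP character occupies exactly two bytes

def char_at(data: bytes, index: int) -> str:
    """Return the index-th character by direct offset arithmetic (offset = index * 2)."""
    offset = index * 2
    return data[offset:offset + 2].decode("utf-16-be")

print(char_at(buf, 0), char_at(buf, 2))   # 漢 A
```

The same offset arithmetic is exactly what breaks down for variable-width encodings, where character boundaries cannot be computed without scanning the preceding bytes.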
Variable-Width Encodings

In variable-width double-byte character sets (DBCS), ASCII characters are encoded using a single byte in the range 0x00 to 0x7F, while characters from CJK languages are encoded using two bytes to accommodate larger character repertoires.[1] The first byte of a two-byte sequence, known as the lead byte, typically has its high bit set (e.g., in the range 0x81 to 0x9F), signaling that the subsequent byte—the trail byte—forms part of the same character.[39] This structure allows seamless integration of single-byte and double-byte codes within the same text stream, enabling efficient representation of mixed-language content.[1]

Decoding these encodings is state-dependent, requiring software to maintain a parsing state that tracks whether a lead byte has been encountered and a trail byte is expected next.[1] Applications must scan the byte stream sequentially from the beginning, as random access or substring operations can misinterpret boundaries between single-byte and double-byte characters without proper state management.[1] Some byte values may serve as either lead or trail bytes depending on context, which can introduce ambiguity if the state is not correctly tracked—for instance, a lone lead byte might be erroneously treated as a single-byte character.[40]

Common issues arise from invalid or incomplete sequences, such as a lead byte without a valid trail byte or mismatched pairs, which can result in data corruption during processing or conversion between encodings.[41] Without robust error handling, these malformed sequences may cause characters to be skipped, replaced with placeholders, or misinterpreted, leading to garbled output in displays or files.[1]

The variable-width design provides space efficiency for documents mixing English text with CJK characters, as prevalent in East Asian computing environments, by avoiding the overhead of fixed two-byte encoding for all characters.[1] This approach reduces storage and transmission costs compared to uniform fixed-width schemes, particularly when ASCII content predominates.[1]
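The following sketch (an editorial illustration using Python's shift_jis codec) shows both hazards: a byte-level slice that splits a character, and a state-aware scan that steps one character at a time:

```python
# Hazard: a byte-level slice that ignores character boundaries splits the first
# two-byte character, so the fragment no longer decodes to the intended text.
data = "価格500円".encode("shift_jis")
fragment = data[1:]                                    # cuts between lead and trail byte of 価
print(fragment.decode("shift_jis", errors="replace"))  # mojibake, not '格500円'

# Remedy: a state-aware scan starts at the beginning and steps one character
# at a time, using a crude Shift JIS lead-byte test.
chars, i = [], 0
while i < len(data):
    is_lead = data[i] >= 0x81 and not (0xA1 <= data[i] <= 0xDF)
    step = 2 if is_lead else 1
    chars.append(data[i:i + step].decode("shift_jis"))
    i += step
print(chars)                                           # ['価', '格', '5', '0', '0', '円']
```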
Specific Implementations
Japanese Encodings
Japanese encodings for double-byte character sets (DBCS) primarily revolve around the Japanese Industrial Standards (JIS) developed by the Japanese Standards Association to handle the complexities of kanji, hiragana, katakana, and other symbols in the Japanese writing system. The foundational standard, JIS X 0208 established in 1978 (with revisions in 1983, 1990 and 1997), defines a 94x94 grid encoding for 6,879 graphic characters, including 6,355 kanji (2,965 in Level 1 and 3,390 in Level 2) and 524 non-kanji characters such as hiragana, katakana, punctuation, and symbols.[42] This set serves as the core for many Japanese DBCS implementations, enabling representation of over 5,000 essential kanji used in everyday text, place names, and technical documentation.[42]

A prominent variable-width encoding based on JIS X 0208 is Shift-JIS, developed in the 1980s by Microsoft for MS-DOS systems and later adopted by IBM and Apple. Shift-JIS extends the single-byte JIS X 0201 encoding by incorporating double-byte sequences for JIS X 0208 characters, using lead bytes in the ranges 0x81–0x9F or 0xE0–0xEF followed by trail bytes in 0x40–0x7E or 0x80–0xFC, resulting in a total range from 0x8140 to 0xEFFC.[42][43] This design ensures backward compatibility with ASCII while supporting the full JIS X 0208 repertoire, making it a de facto standard for Japanese text in Windows environments prior to Unicode adoption. An alternative variable-width encoding for Unix systems is EUC-JP, which encodes JIS X 0208 using two bytes with lead bytes 0xA1–0xFE and trail bytes 0xA1–0xFE, maintaining compatibility with ASCII in the 0x00–0x7F range.[42]

To address limitations in character coverage, supplementary standards emerged. JIS X 0212, introduced in 1990, adds 5,801 kanji (primarily Level 3 and 4) and 266 non-kanji characters in another 94x94 grid, often integrated into EUC-JP via three-byte sequences starting with 0x8F.[42] For Windows-specific needs, Microsoft extended Shift-JIS into CP932 (also known as Windows-31J), incorporating approximately 1,000 additional characters including NEC special symbols (Row 13), IBM extensions (Rows 89–92 and 115–119), and user-defined areas, expanding the total repertoire to around 13,000 characters when combined with JIS X 0208 and JIS X 0212.[42][44] These extensions enhance support for proprietary fonts and legacy applications while preserving the core JIS structure.
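As an illustrative aside (using Python's standard codecs, not the standards documents themselves), the same JIS X 0208 characters receive different byte values under Shift-JIS and EUC-JP:

```python
# The same characters under the two common encodings of the JIS X 0208 repertoire.
for ch in "漢あA":
    print(f"{ch}: shift_jis={ch.encode('shift_jis').hex()}  euc_jp={ch.encode('euc_jp').hex()}")

# 漢: shift_jis=8abf  euc_jp=b4c1
# あ: shift_jis=82a0  euc_jp=a4a2
# A: shift_jis=41  euc_jp=41
```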
Chinese and Korean Encodings

Double-byte character sets (DBCS) for Chinese languages emerged in the 1980s to address the limitations of single-byte encodings like ASCII for representing the vast repertoire of Hanzi characters in Traditional and Simplified Chinese scripts. In Taiwan, the Big5 encoding was developed around 1984 by a consortium of vendors to support Traditional Chinese characters, employing a variable-width scheme that uses one byte for ASCII compatibility and two bytes for Hanzi.[27] Big5 encompasses 13,053 Hanzi along with associated symbols, focusing on a subset commonly used in Taiwanese contexts.[45] On the mainland, the GB 2312 standard, established in 1980 by the Chinese government, targets Simplified Chinese with a variable-width DBCS format, encoding 6,763 Hanzi characters alongside 682 non-Hanzi symbols for a total of 7,445 glyphs.[46][47]

Korean DBCS implementations, such as KS C 5601 standardized in 1987 by the Korea Industrial Standards Association, adopt a similar variable-width approach to encode both Hangul syllables and Hanja (Chinese-derived characters).[30] This standard includes 2,350 precomposed Hangul syllables—formed by combining consonants and vowels phonetically—and 4,888 Hanja, totaling 7,238 characters, alongside additional symbols and graphics.[48] A common variant, EUC-KR, maps KS C 5601 into an Extended Unix Code format, using one byte for ASCII and two bytes for Korean-specific content, ensuring compatibility in Unix-like environments.[49]

Key differences between Chinese and Korean DBCS lie in their script priorities: Chinese encodings like Big5 and GB 2312 emphasize subsets of logographic Hanzi tailored to regional orthographic norms (Traditional for Taiwan, Simplified for mainland China), whereas Korean standards such as KS C 5601 prioritize phonetic Hangul combinations alongside a smaller Hanja component for Sino-Korean vocabulary.[48] This reflects the alphabetic nature of Hangul, which allows for syllable composition, contrasting with the ideographic focus of Hanzi.[50]

Extensions to these standards addressed gaps in coverage. GBK, introduced in 1995 as a national extension of GB 2312, extends coverage to additional Hanzi, including traditional forms absent from GB 2312, expanding to 21,886 total glyphs while maintaining backward compatibility for Simplified Chinese applications.[51] Similarly, Big5-HKSCS, developed by the Hong Kong government in 1995 and revised in 1999, extends Big5 with supplementary characters specific to Cantonese usage in Hong Kong, incorporating over 4,700 additional glyphs in its initial release.[52][53]
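A brief illustrative sketch with Python's standard codecs (not taken from the cited standards) showing the one-byte ASCII / two-byte CJK split in these encodings:

```python
# One-byte ASCII versus two-byte CJK under mainland Chinese, Taiwanese, and
# Korean legacy encodings (Python standard codecs).
samples = [("中", "gb2312"), ("中", "big5"), ("한", "euc_kr"), ("A", "euc_kr")]
for ch, enc in samples:
    b = ch.encode(enc)
    print(f"{ch} in {enc}: {b.hex()} ({len(b)} byte(s))")

# 中 in gb2312: d6d0 (2 byte(s))
# 中 in big5: a4a4 (2 byte(s))
# 한 in euc_kr: c7d1 (2 byte(s))
# A in euc_kr: 41 (1 byte(s))
```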
Technical Challenges
Ambiguity Issues
In variable-width double-byte character sets (DBCS), such as Shift-JIS, parsing ambiguity occurs because the byte ranges designated for lead bytes, trail bytes, and single-byte characters overlap, making it impossible to unambiguously classify individual bytes without examining surrounding context or maintaining a parsing state. For instance, in Shift-JIS, lead bytes occupy the ranges 0x81–0x9F and 0xE0–0xFC, trail bytes span 0x40–0x7E and 0x80–0xFC, and single-byte half-width katakana characters use 0xA1–0xDF—a subset of the trail byte range. This overlap allows certain byte sequences to have more than one valid interpretation, such as a sequence starting with an even number of bytes in the half-width katakana range potentially being parsed as single bytes or as part of multi-byte characters depending on subsequent bytes.[54][55][56]

A concrete example arises in Shift-JIS with the byte 0x5C, which represents the reverse solidus (ASCII backslash) as a standalone single-byte character but can also function as a valid trail byte (within 0x40–0x7E) when immediately following a lead byte, forming part of a two-byte character. Determining the correct interpretation requires sequential parsing rules that track whether the previous byte was a lead byte, as isolated examination of 0x5C alone provides no definitive clue.[58]

These ambiguities pose significant risks during string operations, such as substring extraction, length calculation, or pattern searching, where failure to maintain parsing state can split a two-byte character across boundaries, resulting in data corruption, incomplete characters (e.g., isolated trail bytes rendered as invalid glyphs), or infinite loops in naive decoders. In Java, for example, using byte-based methods like getBytes() on Shift-JIS strings without character-aware alternatives can lead to buffer overruns or misaligned data if overlaps are not handled.[54][56]
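A minimal sketch of the 0x5C pitfall, using Python's shift_jis codec; 表 (U+8868) is the commonly cited example because its trail byte is 0x5C:

```python
# The kanji 表 (U+8868) encodes in Shift JIS as 0x95 0x5C, so its second byte
# equals the ASCII backslash. A byte-oriented scan therefore reports a
# backslash that is not actually present in the text.
data = "表示".encode("shift_jis")
print(data.hex())                              # begins with 955c; 0x5c here is a trail byte

naive_hit = data.find(b"\\")                   # byte-level search, blind to DBCS boundaries
print("byte-level search finds '\\' at offset:", naive_hit)    # 1 (false positive)

print("character-level search:", "表示".find("\\"))            # -1 (no backslash in the text)
```

In practice this is why Shift-JIS-aware software must check whether a 0x5C byte follows a lead byte before treating it as a path separator or escape character.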
Historically, such issues manifested in early email systems that assumed all text adhered to 7-bit ASCII, leading to mangled DBCS content when Japanese or other CJK messages traversed networks without proper encoding indicators, often producing mojibake where lead bytes were treated as control characters or garbled symbols. For instance, pre-MIME email protocols (before RFC 1341 in 1992) lacked support for multi-byte encodings, causing Shift-JIS text to be corrupted in transit across ASCII-only gateways, a problem exacerbated in software like early Windows ActiveX controls handling international email.[59]
