Extended Unix Code

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and Simplified Chinese.

The most commonly used EUC codes are variable-length encodings with a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII) taking one byte, and a character belonging to a 94×94 coded character set (such as GB 2312) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code, whereas a single character in EUC-TW can take up to four bytes.
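The byte lengths described above can be checked with the EUC codecs that ship with CPython. This is a minimal sketch, not a reference implementation: the helper name `encoded_length` and the sample characters are illustrative choices, and `gb2312` is used as Python's codec name for EUC-CN. The standard library has no EUC-TW codec, so the four-byte case is not shown.

```python
# Byte lengths of single characters under EUC codecs from CPython's
# standard library: "euc_kr", "gb2312" (EUC-CN) and "euc_jp".

def encoded_length(ch: str, codec: str) -> int:
    """Return how many bytes one character occupies in the given codec."""
    return len(ch.encode(codec))

# A three-byte EUC-JP character: the shift code 0x8F followed by two
# high-bit bytes. Decoding raw bytes avoids guessing the character.
three_byte_char = b"\x8f\xb0\xa1".decode("euc_jp")

samples = [
    ("A", "euc_kr"),              # ISO 646 / ASCII range: one byte
    ("한", "euc_kr"),             # 94x94 set (KS X 1001): two bytes
    ("中", "gb2312"),             # 94x94 set (GB 2312): two bytes
    (three_byte_char, "euc_jp"),  # shift code + two bytes: three bytes
]

lengths = [encoded_length(ch, codec) for ch, codec in samples]

if __name__ == "__main__":
    for (ch, codec), n in zip(samples, lengths):
        print(f"{codec:8} U+{ord(ch):04X} -> {n} byte(s)")
```

The three-byte sample is built by decoding rather than encoding so that the example depends only on the byte layout, not on knowing which character sits at that code point.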

Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors. EUC is, however, still very popular, especially EUC-KR in South Korea.

The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with a sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8836 (94²) characters, or 830584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.
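The byte ranges and set sizes above can be written out as a short sketch. The names `GL`, `GR`, and `classify_byte` are illustrative, and the "edge" label for 0x20, 0x7F, 0xA0, and 0xFF is simply a placeholder for the four bytes whose role depends on whether a 94- or 96-character set is in use.

```python
# The two 94-byte graphic ranges described in the text.
GL = range(0x21, 0x7F)  # 0x21-0x7E, most significant bit cleared
GR = range(0xA1, 0xFF)  # 0xA1-0xFE, most significant bit set


def classify_byte(b: int) -> str:
    """Name the ISO/IEC 2022 region that a single byte value falls into."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x1F:
        return "C0"
    if b in GL:
        return "GL"
    if 0x80 <= b <= 0x9F:
        return "C1"
    if b in GR:
        return "GR"
    return "edge"  # 0x20, 0x7F, 0xA0 or 0xFF


# Set sizes for one-, two- and three-byte graphic characters.
set_sizes = [len(GL) ** n for n in (1, 2, 3)]

if __name__ == "__main__":
    print(set_sizes)
    print(classify_byte(0x41), classify_byte(0xB0), classify_byte(0x8E))
```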

EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as ISO-2022-JP. As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared). If ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from ASCII is that 0x5C (backslash in ASCII) is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR.

The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows the software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.
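The rules in this paragraph can be sketched as two small functions: one that turns a kuten (row, cell) pair into its two-byte EUC form by adding 160 to each number, and one that splits a packed EUC byte string into characters by code set. The function names and the `GR_BYTES` table are hypothetical. The widths in that table (two bytes for code set 1, one byte after SS2, two bytes after SS3) are an assumption matching the EUC-JP layout; other EUC variants assign different widths. The validity check uses the 0xA0–0xFF range exactly as stated above.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control codes for code sets 2 and 3

# ASSUMPTION: number of high-bit bytes per character, EUC-JP style.
GR_BYTES = {1: 2, 2: 1, 3: 2}


def kuten_to_euc(row: int, cell: int) -> bytes:
    """Convert a kuten (row, cell) pair to its two-byte EUC form."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0xA0, cell + 0xA0])  # adding 160 sets the high bit


def split_euc(data: bytes) -> list[tuple[int, bytes]]:
    """Split packed EUC bytes into (code set, character bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # high bit clear: code set 0 (ISO 646)
            out.append((0, data[i:i + 1]))
            i += 1
            continue
        if b == SS2:
            codeset, start = 2, i + 1
        elif b == SS3:
            codeset, start = 3, i + 1
        else:
            codeset, start = 1, i
        end = start + GR_BYTES[codeset]
        body = data[start:end]
        # After any shift code, every byte must lie in 0xA0-0xFF.
        if len(body) != GR_BYTES[codeset] or any(x < 0xA0 for x in body):
            raise ValueError(f"invalid EUC sequence at offset {i}")
        out.append((codeset, data[i:end]))
        i = end
    return out


if __name__ == "__main__":
    print(kuten_to_euc(16, 1).hex())
    print(split_euc(b"A\xb0\xa1\x8e\xb1\x8f\xb0\xa1"))
```

Because code set 0 bytes always have the high bit clear and all other character bytes have it set, the splitter never needs to look backwards to decide where a character starts.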

The EUC code itself does not make use of the announcement and designation sequences from ISO 2022. However, the code specification is equivalent to a fixed sequence of four ISO 2022 announcement sequences.

The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format, and it is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format.
