JIS X 0208
from Wikipedia

Alias(es): JIS C 6226
Standard: JIS X 0208:1978 through 1997
Preceded by: JIS X 0201
Succeeded by: JIS X 0213
Associated supplements: JIS X 0212

JIS X 0208 is a 2-byte character set specified as a Japanese Industrial Standard, containing 6879 graphic characters suitable for writing text, place names, personal names, and so forth in the Japanese language. The official title of the current standard is 7-bit and 8-bit double byte coded KANJI sets for information interchange (7ビット及び8ビットの2バイト情報交換用符号化漢字集合, Nana-Bitto Oyobi Hachi-Bitto no Ni-Baito Jōhō Kōkan'yō Fugōka Kanji Shūgō). It was originally established as JIS C 6226 in 1978, and was revised in 1983, 1990, and 1997. IBM calls it Code page 952, and calls the 1978 version Code page 955.

Scope of use and compatibility

The character set JIS X 0208 establishes is primarily for the purpose of information interchange (情報交換, jōhō kōkan) between data processing systems and the devices connected to them, or mutually between data communication systems. This character set can be used for data processing and text processing.

Partial implementations of the character set are not considered compatible. The original drafting committee of the first standard took care to separate characters between level 1 and level 2, and the second standard then shuffled some variant characters (異体字, itaiji) between the levels. From this it is conjectured that, at least at the time of the first and second standards, Japanese computer systems implementing only the non-kanji and the level 1 kanji were considered for development. However, such implementations have never been specified as compatible, though examples such as the early NEC PC-9801 did exist.[1]

Even though there are provisions in the JIS X 0208:1997 standard concerning compatibility, at the present time, it is generally considered that this standard neither certifies compatibility nor is it an official manufacturing standard that amounts to a declaration of self-compatibility.[2] Consequently, de facto, JIS X 0208-"compatible" products are not considered to exist. Terminology such as "conformant" (準拠, junkyo) and "support" (対応, taiō) is included in JIS X 0208, but the semantics of these terms vary from person to person.

Code charts

Lead byte

The first encoding byte (the lead byte) corresponds to the row (ku) number plus 0x20, or 32 in decimal (see below). Hence, the code set starting with 0x21 has a row number of 1, and its cell 1 has a continuation byte of 0x21 (or 33), and so forth.
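
The relationship between a 7-bit byte pair and its row and cell numbers can be sketched in a few lines of Python. The function name `jis_to_kuten` is an illustrative assumption and is not defined by the standard.

```python
def jis_to_kuten(lead: int, trail: int) -> tuple[int, int]:
    """Convert a 7-bit JIS X 0208 byte pair to (row, cell) numbers."""
    for byte in (lead, trail):
        # Both bytes must lie in the 94-byte range 0x21..0x7E.
        if not 0x21 <= byte <= 0x7E:
            raise ValueError(f"byte 0x{byte:02X} is outside 0x21..0x7E")
    # Each number is simply the byte value minus 0x20.
    return lead - 0x20, trail - 0x20


print(jis_to_kuten(0x21, 0x21))  # (1, 1)
```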

For lead bytes used for characters other than kanji, links are provided to charts on this page listing the characters encoded under that lead byte. For lead bytes used for kanji, links are provided to the appropriate section of Wiktionary's kanji index.

JIS X 0208 (lead bytes)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x  SP  1-_ 2-_ 3-_ 4-_ 5-_ 6-_ 7-_ 8-_ 9-_ 10-_ 11-_ 12-_ 13-_ 14-_ 15-_
3x 16-_ 17-_ 18-_ 19-_ 20-_ 21-_ 22-_ 23-_ 24-_ 25-_ 26-_ 27-_ 28-_ 29-_ 30-_ 31-_
4x 32-_ 33-_ 34-_ 35-_ 36-_ 37-_ 38-_ 39-_ 40-_ 41-_ 42-_ 43-_ 44-_ 45-_ 46-_ 47-_
5x 48-_ 49-_ 50-_ 51-_ 52-_ 53-_ 54-_ 55-_ 56-_ 57-_ 58-_ 59-_ 60-_ 61-_ 62-_ 63-_
6x 64-_ 65-_ 66-_ 67-_ 68-_ 69-_ 70-_ 71-_ 72-_ 73-_ 74-_ 75-_ 76-_ 77-_ 78-_ 79-_
7x 80-_ 81-_ 82-_ 83-_ 84-_ 85-_ 86-_ 87-_ 88-_ 89-_ 90-_ 91-_ 92-_ 93-_ 94-_ DEL

Non-Kanji rows

Character set 0x21 (row number 1, special characters)

Some vendors use slightly different Unicode mapping for this set than the one below. For example, Microsoft maps kuten 1-29 (JIS 0x213D) to U+2015 (Horizontal Bar),[3] whereas Apple maps it to U+2014 (Em Dash).[4] Similarly, Microsoft maps kuten 1-61 (JIS 0x215D) to U+FF0D[3] (the fullwidth form of U+002D Hyphen-Minus), and Apple maps it to U+2212 (Minus Sign).[4] Unicode mapping of the wave dash also differs between vendors. See the cells with footnotes below.
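
Differences of this kind can be observed with the codecs bundled with CPython. The sketch below assumes that the `shift_jis` codec follows the Unicode Consortium's JIS X 0208 mapping table and that `cp932` follows Microsoft's mapping; the helper name `mappings` and the choice of cells are illustrative. The byte pairs are the Shift JIS encodings of kuten 1-33 and 1-61.

```python
# Shift JIS byte pairs for two row-1 cells whose Unicode mapping varies by vendor.
CELLS = {
    "1-33 (wave dash)": b"\x81\x60",
    "1-61 (minus sign)": b"\x81\x7c",
}


def mappings(raw: bytes) -> dict[str, str]:
    """Decode the same bytes with two codecs and report the resulting code points."""
    return {
        codec: f"U+{ord(raw.decode(codec)):04X}"
        for codec in ("shift_jis", "cp932")
    }


for label, raw in CELLS.items():
    print(label, mappings(raw))
```

On a standard CPython build this is expected to report U+301C versus U+FF5E for the wave dash, and U+2212 versus U+FF0D for the minus sign.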

ASCII and JISCII punctuation (shown here with a yellow background) may use alternative mappings to the Halfwidth and Fullwidth Forms block if used in an encoding which combines JIS X 0208 with ASCII or with JIS X 0201, such as Shift JIS, EUC-JP or ISO 2022-JP.

JIS X 0208 (prefixed with 0x21)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x IDSP , . : ; ? ! ´ ` ¨
3x ^ _ [c] /
4x \ [d] [e] | ( ) [ ]
5x { } + [f] ± ×
6x ÷ = < > ° ¥
7x $ ¢ £ % # & * @ §

Character set 0x22 (row number 2, special characters)

Most of the characters in this set were added in 1983, except for characters 0x2221–0x222E (kuten 2-1 through 2-14, or the first line of the chart below), which were included in the original 1978 version of the standard.

JIS X 0208 (prefixed with 0x22)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x
4x ¬
5x
6x
7x

Character set 0x23 (row number 3, digits and Roman)

This set includes a subset of the ISO 646 invariant set (and therefore also a subset of both ASCII and the JIS X 0201 Roman set), minus punctuation and symbols, comprising western Arabic numerals and both cases of the Basic Latin alphabet. Characters in this set may use alternative Unicode mappings to the Halfwidth and Fullwidth Forms block if used in an encoding which combines JIS X 0208 with ASCII or with JIS X 0201, such as EUC-JP, Shift JIS or ISO 2022-JP.

Compare row 3 of KPS 9566, which this row exactly matches. Compare and contrast row 3 of KS X 1001 and of GB 2312, which include their entire national variants of ISO 646 in this row, rather than only the alphanumeric subset.

JIS X 0208 (prefixed with 0x23)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x 0 1 2 3 4 5 6 7 8 9
4x A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z
6x a b c d e f g h i j k l m n o
7x p q r s t u v w x y z

Character set 0x24 (row number 4, Hiragana)

This row contains Japanese Hiragana.

Compare row 4 of GB 2312, which matches this row. Compare and contrast row 10 of KPS 9566 and of KS X 1001, which use the same layout, but in a different row.

JIS X 0208 (prefixed with 0x24)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x
4x
5x
6x
7x

Character set 0x25 (row number 5, Katakana)

This row contains Japanese Katakana.

Compare row 5 of GB 2312, which matches this row. Compare and contrast row 11 of KPS 9566 and of KS X 1001, which use the same layout, but in a different row. Contrast the considerably different Katakana layout used by JIS X 0201.

JIS X 0208 (prefixed with 0x25)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x
4x
5x
6x
7x

Character set 0x26 (row number 6, Greek)

This row contains basic support for the modern Greek alphabet, without diacritics or the final sigma.

Compare row 6 of GB 2312 and GB 12345 and row 6 of KPS 9566, which include the same Greek letters in the same layout, although GB 12345 adds vertical presentation forms and KPS 9566 adds Roman numerals. Compare and contrast row 5 of KS X 1001, which offsets the Greek letters to include the Roman numerals first.

JIS X 0208 (prefixed with 0x26)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
3x Π Ρ Σ Τ Υ Φ Χ Ψ Ω
4x α β γ δ ε ζ η θ ι κ λ μ ν ξ ο
5x π ρ σ τ υ φ χ ψ ω
6x
7x

Character set 0x27 (row number 7, Cyrillic)

This row contains the modern Russian alphabet and is not necessarily sufficient for representing other forms of the Cyrillic script.

Compare row 7 of GB 2312, which matches this row. Compare and contrast row 12 of KS X 1001 and row 5 of KPS 9566, which use the same layout (but in a different row).

JIS X 0208 (prefixed with 0x27)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x А Б В Г Д Е Ё Ж З И Й К Л М Н
3x О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э
4x Ю Я
5x а б в г д е ё ж з и й к л м н
6x о п р с т у ф х ц ч ш щ ъ ы ь э
7x ю я

Character set 0x28 (row number 8, box drawing)

All characters in this set were added in 1983, and were not present in the original 1978 revision of the standard.

JIS X 0208 (prefixed with 0x28)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x
4x
5x
6x
7x

Extension character set 0x2D (row number 13, NEC special characters)

Rows 9 through 15 of the JIS X 0208 standard are left empty.

However, the following layout for row 13, first introduced by NEC, is a common extension. It is used (with minor variations, noted in footnotes) by Windows-932[3] (which is matched by the WHATWG Encoding Standard used by HTML5), by the PostScript variant (but, since KanjiTalk version 7, not the regular variant)[5] of MacJapanese, and by JIS X 0213 (the successor to JIS X 0208).[5][6] Unlike the other extensions made by Windows-932/WHATWG and JIS X 0213, the two match rather than colliding, so decoding of most of this row is better supported than the other extensions made by JIS X 0213.

NEC Special Characters for JIS X 0208 (prefixed by 0x2D)
0 1 2 3 4 5 6 7 8 9 A B C D E F
2x
3x [g]
4x
5x [g] [h]
6x
7x [i] [i] [i] [i] [i] [i] [i] [i] [i] [g] [g]

Kanji rows

Code structure

In order to represent code points, column/line numbers are used for one-byte codes and kuten numbers are used for two-byte codes. For a way to identify a character without depending on a code, character names are used.

Single byte codes

Almost all JIS X 0208 graphic character codes are represented with two bytes of at least seven bits each. However, every control character, as well as the plain space – although not the ideographic space – is represented with a one-byte code. In order to represent the bit combination (ビット組合せ, bitto kumiawase) of a one-byte code, two decimal numbers – a column number and a line number – are used. Three high-order bits out of seven or four high-order bits out of eight, counting from zero to seven or from zero to fifteen respectively, form the column number. Four low-order bits counting from zero to fifteen form the line number. Each decimal number corresponds to one hexadecimal digit. For example, the bit combination corresponding to the graphic character "space" is 010 0000 as a 7-bit number, and 0010 0000 as an 8-bit number. In column/line notation, this is represented as 2/0. Other representations of the same single-byte code include 0x20 as hexadecimal, or 32 as a single decimal number.
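
As a minimal sketch of the column/line notation just described, the following Python function splits a byte value into its high-order and low-order parts. The name `column_line` is an assumption made for illustration.

```python
def column_line(byte: int) -> str:
    """Return the column/line notation for a one-byte code (7-bit or 8-bit)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value between 0 and 255")
    column = byte >> 4    # high-order bits: 0..7 for 7-bit codes, 0..15 for 8-bit
    line = byte & 0x0F    # four low-order bits: 0..15
    return f"{column}/{line}"


print(column_line(0x20))  # 2/0, the space character
print(column_line(0x7E))  # 7/14
```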

Code points and code numbers

The double-byte codes are laid out in 94 numbered groups, each called a row (区, ku; lit. "section"). Every row contains 94 numbered codes, each called a cell (点, ten; lit. "point").[j] This makes a total of 8836 (94 × 94) possible code points (although not all are assigned, see below); these are laid out in the standard in a 94-line, 94-column code table.

A row number and a cell number (each numbered from 1 to 94, for a standard JIS X 0208 code) form a kuten (区点) point, which is used to represent double-byte code points. A code number or kuten number (区点番号, kuten bangō) is expressed in the form "row-cell", the row and cell numbers being separated by a hyphen. For example, the character "亜" has a code point at row 16, cell 1, so its code number is represented as "16-01".

In 7-bit JIS X 0208 (as might be switched to in JIS X 0202 / ISO-2022-JP), both bytes must be from the 94-byte range of 0x21 (used for row or cell number 1) through 0x7E (used for row or cell number 94) – exactly corresponding to the range used for 7-bit ASCII printing characters, not counting the space. Accordingly, the encoded bytes are obtained by adding 0x20 (32) to each number.[7] For instance, the above example of 16-01 ("亜") would be represented by the bytes 0x30 0x21. The 8-bit EUC-JP instead uses the range 0xA1 through 0xFE (setting the high bit to 1), whereas other encodings such as Shift JIS use more complicated transforms. Shift JIS includes more encoding space than is needed for JIS X 0208 itself; some Shift JIS specific extensions to JIS X 0208 make use of row numbers above 94.[8]
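
The following Python sketch turns a kuten number into bytes for the three encodings mentioned above. The 7-bit and EUC-JP forms follow directly from the text; the Shift JIS arithmetic is the commonly documented transform and is stated here as an assumption, since the details are not given above. All function names are illustrative.

```python
def kuten_to_jis(row: int, cell: int) -> bytes:
    """7-bit form: add 0x20 to the row and to the cell."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc(row: int, cell: int) -> bytes:
    """EUC-JP form: the 7-bit bytes with the high bit set (0xA1..0xFE)."""
    return bytes(b | 0x80 for b in kuten_to_jis(row, cell))


def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Shift JIS form (assumed transform): two rows share one lead byte."""
    kuten_to_jis(row, cell)  # reuse the range check
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2:              # odd row: trail bytes 0x40..0x9E, skipping 0x7F
        trail = cell + 0x3F
        if trail >= 0x7F:
            trail += 1
    else:                    # even row: trail bytes 0x9F..0xFC
        trail = cell + 0x9E
    return bytes([lead, trail])


print(kuten_to_jis(16, 1).hex(), kuten_to_euc(16, 1).hex(), kuten_to_sjis(16, 1).hex())
# 3021 b0a1 889f
```

Decoding the EUC-JP and Shift JIS results with the corresponding Python codecs should give back "亜", which is a convenient cross-check of the arithmetic.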

This structure is also used in the Mainland Chinese GB 2312, where it is natively known as 区位; qūwèi, and the South Korean KS C 5601 (currently KS X 1001), where the ku and ten are respectively known as hang[9] (행; 行; haeng) and yol[9] (열; 列; yeol). The later JIS X 0213 extends this structure by having more than one plane (面, men; lit. "face") of rows, which is also the structure used by CNS 11643, and related to the structure used by CCCII.

Unassigned code points

Among the 2-byte codes, rows 9 to 15 and 85 to 94 are unassigned code points (空き領域, aki ryōiki); that is, they are code points with no characters assigned to them. Some cells in other rows are also essentially unassigned code points.

These empty areas contain code points that should basically not be used. Except when there is prior agreement among the relevant parties, characters (gaiji) for information interchange should not be assigned to the unassigned code points.

Even when assigning characters to unassigned code points, graphic characters defined in the standard should not be assigned to them, and the same character should not be assigned to multiple unassigned code points; characters should not be duplicated in the set.

Furthermore, when assigning characters to unassigned code points, it is necessary to be cautious of unification in regards to kanji glyphs. For example, row 25 cell 66 corresponds to the kanji meaning "high" or "expensive"; both the form with a component resembling the "mouth" character (口) in the middle (高) and the less common form with a ladder-like construction in the same location (髙) are subsumed into the same code point. Consequently, limiting point 25-66 to the "mouth" form and assigning the latter "ladder" form to an unassigned code point would technically be in violation of the standard.

In practice, however, several vendor-specific Shift JIS variants, including Windows-932 and MacJapanese, encode vendor extensions in unallocated rows of the encoding space for JIS X 0208. Also, most of the codes unassigned in JIS X 0208 are assigned by the newer JIS X 0213 standard.

Character names

Each JIS X 0208 character is given a name. By using a character's name, it is possible to identify characters without relying on their codes. The names of characters are coordinated with other character set standards, notably the Universal Coded Character Set (UCS/Unicode), so this is one possible source of character mappings to character sets such as Unicode. For example, both the character at ISO/IEC 646 International Reference Version (US-ASCII) column 4 line 1 and the one at JIS X 0208 row 3 cell 33 have the name "LATIN CAPITAL LETTER A". Therefore, the character at 4/1 in ASCII and the character at 3-33 in JIS X 0208 can be regarded as the same character (although, in practice, alternative mapping is used for the JIS X 0208 character due to encodings providing ASCII separately). Conversely, ASCII characters 2/2 (quotation mark), 2/7 (apostrophe), 2/13 (hyphen-minus), and 7/14 (tilde) can be determined to be characters that do not exist in this standard.

Character names of non-kanji characters use uppercase Roman letters, spaces, and hyphens. Non-kanji characters are given a Japanese-language common name (日本語通用名称, Nihongo tsūyō meishō), but some provisions for these names do not exist.[k] The names of kanji, on the other hand, are mechanically set according to the corresponding hexadecimal representation of their code in UCS/Unicode. The name of a kanji can be arrived at by prepending the Unicode codepoint with "CJK UNIFIED IDEOGRAPH-". For example, row 16 cell 1 (亜) corresponds to U+4E9C in UCS, so the name of it would be "CJK UNIFIED IDEOGRAPH-4E9C". Kanji are not given Japanese common names.
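
The naming rule for kanji can be sketched in Python as follows. The helper names are illustrative, and the lookup relies on two assumptions: that CPython's `euc_jp` codec decodes a JIS X 0208 cell to the expected Unicode character, and that `unicodedata.name` reports the UCS name. Note that the codec uses the alternative (fullwidth) mapping for row 3, as mentioned above.

```python
import unicodedata


def kanji_name(code_point: int) -> str:
    """Build a kanji name mechanically from its UCS code point."""
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"


def name_of_kuten(row: int, cell: int) -> str:
    """Look up the Unicode name of a JIS X 0208 cell via the EUC-JP codec."""
    char = bytes([row + 0xA0, cell + 0xA0]).decode("euc_jp")
    return unicodedata.name(char)


print(kanji_name(0x4E9C))    # CJK UNIFIED IDEOGRAPH-4E9C
print(name_of_kuten(16, 1))  # CJK UNIFIED IDEOGRAPH-4E9C
print(name_of_kuten(3, 33))  # FULLWIDTH LATIN CAPITAL LETTER A
```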

Kanji set

Overview

JIS X 0208 prescribes a set of 6879 graphical characters that correspond to two-byte codes with either seven or eight bits to the byte; in JIS X 0208, this is called the kanji set (漢字集合, kanji shūgō), which includes 6355 kanji as well as 524 non-kanji (非漢字, hikanji), including characters such as Latin letters, kana, and so forth.

Special characters
Occupies rows 1 and 2. There are 18 descriptor symbols (記述記号, kijutsu kigō) such as the "ideographic space" (　), and the Japanese comma and period; eight diacritical marks such as dakuten and handakuten; 10 characters for things that follow kana or kanji (仮名又は漢字に準じるもの, kana mata wa kanji ni junjiru mono) such as the iteration mark; 22 bracket symbols (括弧記号, kakko kigō); 45 mathematical symbols (学術記号, gakujutsu kigō); and 44 unit symbols, which include the currency sign and the postal mark, for a total of 147 characters.
Numerals
Occupies part of row 3. The ten digits from "0" to "9".
Latin letters
Occupies part of row 3. The 26 letters of the English alphabet in uppercase and lowercase form for a total of 52.
Hiragana
Occupies row 4. Contains 48 unvoiced kana (including the obsolete wi and we), 20 voiced kana (dakuten), 5 semi-voiced kana (handakuten), 10 small kana for palatalized and assimilated sounds, for a total of 83 characters.
Katakana
Occupies row 5. There are 86 characters: the katakana equivalents of the hiragana characters, plus the small ka/ke kana (ヵ/ヶ) and the vu kana (ヴ).
Greek letters
Occupies row 6. The 24 letters of the Greek alphabet in uppercase and lowercase form (minus the final sigma) for a total of 48.
Cyrillic letters
Occupies row 7. The 33 letters of the Russian alphabet in uppercase and lowercase form for a total of 66.
Box-drawing characters
Occupies row 8. Thin segments, thick segments, and mixed thin and thick segments, 32 total.
Kanji
The 2965 characters of level 1 (第1水準, dai ichi suijun) from row 16 to row 47, and the 3390 characters of level 2 (第2水準, dai ni suijun) from row 48 to row 84 for a total of 6355.
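
The category totals listed above can be checked against the overall figures with a short Python tally. The dictionary names are illustrative; the counts are taken from the list.

```python
# Character counts per category, as listed above.
NON_KANJI = {
    "special characters": 147,
    "numerals": 10,
    "Latin letters": 52,
    "hiragana": 83,
    "katakana": 86,
    "Greek letters": 48,
    "Cyrillic letters": 66,
    "box-drawing characters": 32,
}
KANJI = {"level 1": 2965, "level 2": 3390}

non_kanji_total = sum(NON_KANJI.values())
kanji_total = sum(KANJI.values())

print(non_kanji_total, kanji_total, non_kanji_total + kanji_total)
# 524 6355 6879
```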

Special characters, numerals, and Latin characters

As for the special characters in the kanji set, some characters from the graphic character set of the International Reference Version (IRV) of ISO/IEC 646:1991 (equivalent to ASCII) are absent from JIS X 0208. These are the aforementioned four characters "QUOTATION MARK", "APOSTROPHE", "HYPHEN-MINUS", and "TILDE". The first three are each split into several different code points in the kanji set (Nishimura, 1978; JIS X 0221-1:2001 standard, Section 3.8.7). The "TILDE" of IRV has no corresponding character in the kanji set.

In the following table, the ISO/IEC 646:1991 IRV characters in question are compared with their multiple equivalents in JIS X 0208, except for the IRV character "TILDE", which is compared with the "WAVE DASH" of JIS X 0208. The entries under the "Symbol" columns utilize UCS/Unicode code points, so the specifics of display may differ.

The ASCII/IRV characters without exact JIS X 0208 equivalents were later assigned code points by JIS X 0213; these are also listed below, as is Microsoft's mapping of the four characters.

Non-strict correspondence between ISO/IEC 646:1991 IRV (ASCII) and JIS X 0208
ISO/IEC 646:1991 IRV columns: Column/Line, JIS X 0213[6], Microsoft, Symbol, Name. JIS X 0208 columns (indented rows): Kuten, Symbol, Name.
2/2 1-2-16 92-94[A], 115-24[B] " QUOTATION MARK
    1-15 ¨ DIAERESIS
    1-40 “ LEFT DOUBLE QUOTATION MARK
    1-41 ” RIGHT DOUBLE QUOTATION MARK
    1-77 ″ DOUBLE PRIME
2/7 1-2-15 92-93[A], 115-23[B] ' APOSTROPHE
    1-13 ´ ACUTE ACCENT
    1-38 ‘ LEFT SINGLE QUOTATION MARK
    1-39 ’ RIGHT SINGLE QUOTATION MARK
    1-76 ′ PRIME
2/13 1-2-17 1-61[C] - HYPHEN-MINUS
    1-30 ‐ HYPHEN
    1-61 − MINUS SIGN
7/14 1-2-18 1-33[D] ~ TILDE
    (no corresponding character)
(no corresponding character)
    1-33 〜 WAVE DASH[D]
  A. From "NEC selection of IBM extensions". Occupies a code point unallocated in JIS X 0208.
  B. From "IBM extensions". Outside range of JIS X 0208, but encodable in Shift_JIS.
  C. Microsoft treats the JIS minus sign as a fullwidth form of the hyphen-minus.
  D. Wave Dash is sometimes treated as a fullwidth form of the tilde, e.g. by Microsoft (see Tilde § Unicode and Shift JIS encoding of wave dash). The ASCII / IRV tilde is an ambiguous code point which may appear either as a tilde accent mark (˜) or as a dash with the same curvature (∼), although the dash is more common due to the spacing accent having a separate code point in Windows-1252; there is no JIS X 0208 character for a tilde accent. Character 1-2-18 in JIS X 0213 is shown as a tilde accent in the code chart.[6]

This means that the kanji set is not upward-compatible with the IRV, and it is the most widespread character set in the world with that property; this is counted as one of the weak points of the standard.

Even with the 90 special characters, numerals, and Latin letters the kanji set and the IRV set have in common, this standard does not follow the arrangement of ISO/IEC 646. These 90 characters are split between rows 1 (punctuation) and 3 (letters and numbers), although row 3 does follow ISO 646 arrangement for the 62 letters and numbers alone (e.g. 4/1 ("A") in ISO 646 becomes 2/3 4/1 (i.e. 3-33) in JIS X 0208).
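
Because row 3 keeps the ISO 646 arrangement, the kuten number of a letter or digit can be derived directly from its ASCII code. The Python sketch below illustrates this; the function name `alnum_to_row3` is an assumption.

```python
def alnum_to_row3(ch: str) -> tuple[int, int]:
    """Map an ASCII letter or digit to its (row, cell) in JIS X 0208 row 3."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalnum()):
        raise ValueError("expected a single ASCII letter or digit")
    # The second byte equals the ASCII code, so the cell is that code minus 0x20.
    return 3, ord(ch) - 0x20


row, cell = alnum_to_row3("A")
print(row, cell)                                # 3 33
print(bytes([row + 0x20, cell + 0x20]).hex())   # 2341, i.e. 2/3 4/1
```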

These incompatibilities are thought to be the reason why the numerals, Latin letters, and so forth in the kanji set came to be treated as "full-width alphanumeric characters" (全角英数字, zenkaku eisūji), and why early implementations interpreted them differently from their IRV counterparts.

Ever since the first standard, it has been possible to represent composites (合成, gōsei) such as encircled numbers, ligatures for measurement unit names, and Roman numerals;[10] they were not given independent kuten code points. Although individual companies that manufacture information systems can make an effort to represent these characters as customers may require by the composition of the characters, none has requested to have them added to the standard, instead choosing to proprietarily offer them as gaiji.

In the fourth standard (1997), all these characters were explicitly defined as characters that accompany an advancement of the current position; that is to say, they are spacing characters. Furthermore, it was ruled that they should not be made by the composition of characters. For this reason, it became disallowed to represent Latin characters with diacritics at all, with possibly the sole exception of the ångström symbol (Å) at row 2 cell 82.

Hiragana and katakana

The hiragana and katakana in JIS X 0208, unlike JIS X 0201, include dakuten and handakuten markings as part of a character. The katakana wi (ヰ) and we (ヱ) (both obsolete in modern Japanese) as well as the small wa (ヮ), not in JIS X 0201, are also included.

The arrangement of kana in JIS X 0208 is different from the arrangement of katakana in JIS X 0201. In JIS X 0201, the syllabary starts with wo (ヲ), followed by the small kana sorted by gojūon order, followed by the full-size kana, also in gojūon order (ヲァィゥェォャュョッーアイウエオ......ラリルレロワン). On the other hand, in JIS X 0208, the kana are sorted first by gojūon order, then in the order of "small kana, full-size kana, kana with dakuten, and kana with handakuten" such that the same fundamental kana is grouped with its derivatives (ぁあぃいぅうぇえぉお......っつづ......はばぱひびぴふぶぷへべぺほぼぽ......ゎわゐゑをん). This ordering was chosen in order to more simply facilitate the sorting of kana-based dictionary look-ups (Yasuoka, 2006).[l]
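
The effect of the two arrangements can be seen by comparing encoded bytes. This Python sketch assumes CPython's `euc_jp` codec (whose byte order mirrors the JIS X 0208 row and cell order) and its `shift_jis` codec (which encodes JIS X 0201 katakana as single bytes); the function name is illustrative.

```python
def jis_x_0208_key(ch: str) -> bytes:
    """Sort key: EUC-JP bytes, which preserve JIS X 0208 row/cell order."""
    return ch.encode("euc_jp")


# A base kana sorts together with its dakuten and handakuten derivatives.
print(sorted("ぱはば", key=jis_x_0208_key))  # ['は', 'ば', 'ぱ']

# In JIS X 0201 (half-width katakana), wo precedes a ...
print("ｦ".encode("shift_jis") < "ｱ".encode("shift_jis"))  # True

# ... whereas in JIS X 0208 it comes near the end of the row.
print(jis_x_0208_key("ヲ") < jis_x_0208_key("ア"))  # False
```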

As mentioned above, JIS X 0208 did not follow the katakana order previously defined in JIS X 0201. It is thought that the JIS X 0201 katakana came to be treated as "half-width kana" because of this incompatibility with the katakana of this standard. This point is also counted as one of the weaknesses of this standard.

Kanji

The sources from which the kanji in this standard were chosen, the reason they are split into level 1 and level 2, and the way they are arranged are all explained in detail in the fourth standard (1997). Per that explanation, the kanji included in the following four kanji listings were reflected in the 6349 characters of the first standard (1978).

  • Kanji Listing for Standard Code (Tentative) (標準コード用漢字表 (試案), Hyōjun Kōdo-yō Kanjihyō (Shian))
    The Information Processing Society of Japan kanji code committee compiled this list in 1971. In the below "Correspondence Analysis Results", this appears to be 6086 characters.
  • Basic Kanji for Administrative Data Processing Use (行政情報処理用基本漢字, Gyōsei Jōhō Shoriyō Kihon Kanji)
    Selected by the Administrative Management Agency of Japan in 1975, it consists of 2817 characters. For data for the purpose of selection, the Agency made a report which, starting with the "Kanji Listing for Standard Code (Tentative)", contrasted several kanji listings, the "Correspondence Analysis Results and Frequency of Use of Kanji for Administrative Data Processing Use Normal Kanji Selection" (行政情報処理用標準漢字選定のための漢字の使用頻度および対応分析結果, Gyōsei Jōhō Shoriyō Kihon Kanji Sentei no Tame no Kanji no Shiyō Hindo Oyobi Taiō Bunseki Kekka), or "Correspondence Analysis Results" (対応分析結果, Taiō Bunseki Kekka) for short.
  • Japanese Personality Registration Name Kanji (日本生命収容人名漢字, Nihon Seimei Shūyō Jinmei Kanji)
    One of the kanji listings that compose the "Correspondence Analysis Results", consisting of 3044 characters. It no longer exists. The original list was nonexistent for the original drafting committee; this kanji list was reflected in the standard to follow the "Correspondence Analysis Results".
  • Kanji for National Administrative District Listing (国土行政区画総覧使用漢字, Kokudo Gyōsei Kukaku Sōran Shiyō Kanji)
    One of the kanji listings that compose the "Correspondence Analysis Results", consisting of 3251 characters. They are the kanji used in the list of all administrative place names compiled by the Japan Geographic Data Center, the "National Administrative District Listing" (国土行政区画総覧, Kokudo Gyōsei Kukaku Sōran). The original drafting committee did not investigate the listing itself; the kanji used from this list followed the "Correspondence Analysis Results".

The second and third standards added four and two characters to level 2, respectively, bringing the total to 6355 kanji. The second standard also changed some character forms and transposed characters between the levels; the third standard changed character forms as well. These changes are described further below.

Level partitioning

The 2,965 Level 1 kanji occupy rows 16 to 47. The 3,390 Level 2 kanji occupy rows 48 to 84.
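
Combined with the row ranges given earlier for non-kanji (rows 1 to 8) and unassigned areas (rows 9 to 15 and 85 to 94), this partitioning can be expressed as a small Python lookup. The function name and the label strings are illustrative choices.

```python
def classify_row(row: int) -> str:
    """Classify a JIS X 0208 row number by the kind of content it holds."""
    if not 1 <= row <= 94:
        raise ValueError("row must be in 1..94")
    if row <= 8:
        return "non-kanji"
    if row <= 15:
        return "unassigned"
    if row <= 47:
        return "level 1 kanji"
    if row <= 84:
        return "level 2 kanji"
    return "unassigned"


print(classify_row(16))  # level 1 kanji
print(classify_row(13))  # unassigned
```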

For level 1, characters common to multiple kanji glyph listings were chosen, using the tōyō kanji, the tōyō kanji correction draft, and the jinmeiyō kanji as a basis. Also, JIS C 6260 ("To-Do-Fu-Ken (Prefecture) Identification Code"; currently JIS X 0401) and JIS C 6261 ("Identification code for cities, towns and villages"; currently JIS X 0402) were consulted; kanji for nearly all Japanese prefectures, cities, districts, wards, towns, villages, and so forth were intentionally placed in level 1.[m] Furthermore, amendments by experts were added.

Level 2 was dedicated to kanji that made an appearance in the aforementioned four major listings but were not selected for level 1. As noted below, the kanji of level 1 were ordered by their pronunciation, so among the kanji whose pronunciation were difficult to determine, there were those that were transferred from level 1 to level 2 on that basis (Nishimura, 1978).

Due to these decisions, for the most part, level 1 contains more frequently used kanji, and level 2 contains more infrequently used kanji, but of course, those were judged by the standards of the day; over the passage of time, some level 2 kanji have become more frequently used, such as one meaning "to soar" (翔) and one meaning "to glitter" (煌); and inversely, some level 1 kanji have become infrequent, notably the ones meaning "centimeter" (糎) and "millimeter" (粍). Of the current jōyō kanji, 30 fall into level 2,[n] while three are missing altogether (塡, 剝 and 頰).[o] Of the current jinmeiyō kanji, 192 are in level 2,[p] while 105 are not part of the standard.[q]

Arrangement

The kanji in level 1 are sorted in order of each one's "representative reading" (i.e. a canonical reading chosen for the purposes of this standard only); the reading of a kanji for this may be an on or a kun reading; readings are sorted in gojūon order.[r] As a general rule, the on (Chinese-sound) reading is considered the representative reading; where a kanji has multiple on readings, the reading judged to be predominant in use frequency is used for the representative reading (JIS C 6226-1978 standard, Section 3.4). For the small percentage of kanji that either do not have an on reading or have an on reading which is little known and not in common use, the kun reading was employed as the representative reading. Where a verb kun reading must be used as the representative reading, the ren'yōkei (rather than the shūshikei) form is used.

For example, cells 1 to 41 on row 16 are 41 characters sorted as starting with a reading of a. Within these, 22 characters, including 16-10 (葵: on reading "ki"; kun reading "aoi") and 16-32 (粟: on readings "zoku" and "shoku"; kun reading "awa") are there on the basis of their kun readings. 16-09 (逢: on reading "hō", kun reading "a(i)") and 16-23 (扱: on readings "sō" and "kyū", kun reading "atsuka(i)") are just two examples of ren'yōkei-form verbs used for the representative reading.

Where the representative reading is the same between different kanji, a kanji that uses an on reading is placed ahead of one that uses a kun reading. Where the on or kun readings are the same between more than one kanji, they are then ordered by their primary radical and stroke count.
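
The first two ordering rules can be sketched as a Python sort key. This is a simplified illustration under stated assumptions: readings are written in hiragana and compared by code point (which only approximates gojūon order), the radical and stroke-count tie-breakers are omitted, and the `Entry` type and `level1_key` function are hypothetical. The four sample entries are the row 16 characters discussed above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    kanji: str
    reading: str  # representative reading, in hiragana
    kind: str     # "on" or "kun"


def level1_key(entry: Entry) -> tuple[str, bool]:
    """Order by reading; for equal readings, on readings precede kun readings."""
    return entry.reading, entry.kind == "kun"


ENTRIES = [
    Entry("粟", "あわ", "kun"),
    Entry("扱", "あつかい", "kun"),
    Entry("葵", "あおい", "kun"),
    Entry("逢", "あい", "kun"),
]

print([e.kanji for e in sorted(ENTRIES, key=level1_key)])
# ['逢', '葵', '扱', '粟']
```

The resulting order matches the cell order 16-09, 16-10, 16-23, 16-32.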

Whether on level 1 or level 2, itaiji are arranged to directly follow their exemplar form. For example, in level 2, right after row 49 cell 88 (劍), the immediately following characters deviate from the general rule (stroke count in this case) to include three variants of 49-88 (劔, 劒, and 剱).[s]

The kanji in level 2 are arranged in order of primary radical and stroke count. Where these two properties are the same for different kanji, they are then sorted by reading.

Kanji from unknown sources

Kanji for which sources are unclear, unknown, or otherwise unidentifiable in JIS X 0208:1997 Appendix 7
Kuten Symbol Classification
52-55 墸 Unknown
52-63 壥 Unknown
54-12 妛 Source unclear
55-27 彁 Unidentifiable
57-43 挧 Source unclear
58-83 暃 Source unclear
59-91 椦 Source unclear
60-57 槞 Source unclear
74-12 蟐 Source unclear
74-57 袮 Source unclear
79-64 閠 Source unclear
81-50 駲 Source unclear

It has been pointed out that there are kanji in the kanji set that are not found in comprehensive, unabridged kanji dictionaries, and that the sources thereof are unknown. For example, only one year after the first standard was established, Tajima (1979) reported that he had confirmed 63 kanji that were not to be found in Shinjigen (a large kanji dictionary published by Kadokawa Shoten), nor in Dai Kan-Wa jiten, and they did not make sense as ryakuji of any sort; he noted that it would be preferable for kanji not available in kanji dictionaries to be selected from definite sources. These kanji came to be known as "ghost" characters (幽霊文字, yūrei moji) or "ghost kanji" (幽霊漢字, yūrei kanji), among other names.

The drafting committee for the fourth version of the standard also saw the existence of kanji with sources unknown as a problem, and so made an inquiry into just what kind of sources the drafting committee of the first version referenced. As a result, it was discovered that the original drafting committee had heavily relied on the "Correspondence Analysis Results" to collect kanji. When the drafting committee investigated the "Correspondence Analysis Results", it became clear that many of the kanji included in the kanji set but not found in exhaustive kanji dictionaries supposedly came from the "Japanese Personality Registration Name Kanji" and "Kanji for National Administrative District Listing" lists mentioned in the "Correspondence Analysis Results".

It was confirmed that no original text for the "Japanese Personality Registration Name Kanji" referenced in the "Correspondence Analysis Results" exists. For the "National Administrative District Listing", Sasahara Hiroyuki of the fourth version's drafting committee examined the kanji that appeared on the in-progress development pages for the first standard. The committee also consulted many ancient writings, as well as many examples of personal names in a database of NTT phone books.

Due to this thorough investigation, the committee was able to pare down the number of kanji for which the source cannot be confidently explained to twelve, shown in the table above. Of these, it is conjectured that several glyphs came about due to copying errors. In particular, 妛 was probably created when printers tried to create 𡚴 by cutting and pasting 山 and 女 together. A shadow from that process was misinterpreted as a line, resulting in 妛 (a picture of this can be found in the Jōyō kanji jiten).
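The kuten (row-cell) numbers in the table can be turned back into characters mechanically. The following is a minimal sketch, assuming Python's built-in `euc_jp` codec as the lookup table (in EUC-JP, each JIS X 0208 byte is the row or cell number plus 0xA0). The names `GHOST_KUTEN`, `kuten_to_char`, and `ghost_kanji` are illustrative and do not come from the standard.

```python
# Kuten (row, cell) pairs copied from the table of kanji with unexplained sources.
GHOST_KUTEN = [
    (52, 55), (52, 63), (54, 12), (55, 27), (57, 43), (58, 83),
    (59, 91), (60, 57), (74, 12), (74, 57), (79, 64), (81, 50),
]


def kuten_to_char(row: int, cell: int) -> str:
    """Decode one JIS X 0208 kuten pair through the EUC-JP codec.

    Assumption: the codec's JIS X 0208 mapping is used as the lookup table.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([0xA0 + row, 0xA0 + cell]).decode("euc_jp")


# Map "row-cell" labels to the decoded characters.
ghost_kanji = {f"{r}-{c}": kuten_to_char(r, c) for r, c in GHOST_KUTEN}

# Print only code points so the output does not depend on terminal fonts.
for kuten, char in ghost_kanji.items():
    print(kuten, f"U+{ord(char):04X}")
```

The dictionary holds the characters themselves, so they can be compared against a dictionary index or a font's coverage as needed.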

Unification of kanji variants


According to the specifications in the fourth standard (1997), unification (包摂, hōsetsu; not the same term used for Unicode's "unification" although it is nearly the same concept) is the action of giving the same code point to a character without regard to its different character forms. In the fourth standard, the glyphs allowed are limited; the extent to which particular allographic glyphs are unified into a graphemic code point is clearly defined.

Furthermore, according to the specifications in the standard, a glyph (字体, jitai; lit. "character body") is an abstract notion as to the graphical representation of a graphic character; a character form (字形, jikei; lit. "character shape"; also a "glyph" in a sense, but differentiated on a different level for standardization purposes) is the representation as a graphical shape that a glyph takes in actuality (e.g. due to a glyph being handwritten, printed, displayed on a screen, etc.). For a single glyph, there exists an endless range of possible concretely and/or visibly different character forms. A variation among the character forms of one glyph is termed a "design difference" (デザインの差, dezain no sa).

The extent to which a glyph is unified to one code point is determined according to that code point's "example glyph" (例示字体, reiji jitai) and the "unification criteria" (包摂規準, hōsetsu kijun) that can be applied to that example glyph; that is, the example glyph for a code point applies to that code point, and any glyphs for which the parts that compose the example glyph are replaced in accordance with the unification criteria also apply to that code point.

For example, the example glyph at 33-46 is composed of radical 9 and the kanji from which the kana so eventually derived. Unification criterion 101 displays three forms of that second component: the first is the form most often seen in Japanese; the second is a more traditional form in which the first two strokes form radical 12 (the kanji numeral for the number 8); and the third is like the second, except that radical 12 is inverted. Consequently, all three permutations apply to the code point at row 33 cell 46.

In the fourth standard, including one of the errata for the first printing, there are 186 unification criteria.

When a code point's example glyph is composed of more than one part glyph, unification criteria can be applied to each part. After a unification criterion is applied to one part glyph, that part cannot have any more unification criteria applied to it. Also, a unification criterion is not allowed to apply if the resulting glyph would coincide with that of another code point entirely.

An example glyph is no more than an example for that code point; it is not a glyph "endorsed" by the standard. Also, the unification criteria need only be used for generally used kanji and for the purpose of assigning things to the code points of this standard. The standard requests that generally unused kanji not be created based on the example glyphs and unification criteria.

The kanji of the kanji set are not chosen completely consistently according to the unification criteria. For example, according to unification criterion 72, 41-7 corresponds both to the form where the third and fourth strokes cross and to the form where they do not, whereas 20-73 corresponds only to the form where they do not cross, and 80-90 only to the form where they do.

The terms "unification", "unification criteria", and "example glyph" were adopted in the fourth standard. From the first to the third version, kanji and relations between kanji were grouped into three types: "independent" (独立, dokuritsu), "compatible" (対応, taiō), and "equivalent" (同値, dōchi); it was explained that the characters recognized as equivalent "consolidate to just one point". "Equivalence" included, other than kanji with exactly the same shape, kanji with differences due to style, and kanji where the difference in character form is small.

In the first standard, it was stipulated that "this standard ... does not establish the particulars of character forms" (Section 3.1); it also states that "the aim of this standard is to establish the general idea of characters and their codes; the design of their character forms and such lie outside its scope." The second and third standards likewise contain notes to the effect that specific designs of character forms lie outside their scope (the note on item 1). The fourth standard also stipulates that "This standard regulates graphic characters as well as their bit patterns, and the use, specific designs of individual characters, and so forth are not within the scope of this standard" (JIS X 0208:1997, item 1).

Unification criteria for compatibility


In the fourth standard, "unification criteria for maintaining compatibility with previous standards" (過去の規格との互換性を維持するための包摂規準, kako no kikaku to no gokansei wo iji suru tame no hōsetsu kijun) are defined. Their application is limited to 29 code points whose glyphs differ greatly between JIS C 6226-1978 and the versions from JIS C 6226-1983 onward. For those 29 code points, the glyphs from JIS C 6226-1983 onward are displayed as "A", and the glyphs from JIS C 6226-1978 as "B". Either the "A" or the "B" glyph may be used at each of them. However, in order to claim compatibility with the standard, whether the "A" or "B" form has been used for each code point must be explicitly noted.

Character encodings


Encoding schemes stipulated by JIS X 0208


In JIS X 0208:1997, article 7 combined with appendices 1 and 2 define a total of eight encoding schemes.

In the descriptions below, the "CL" (control left), "GL" (graphic left), "CR" (control right), and "GR" (graphic right) regions are respectively, in column/line notation, from 0/0 to 1/15, from 2/1 to 7/14, from 8/0 to 9/15, and from 10/1 to 15/14. For each code, 2/0 is assigned the graphic character "SPACE" and 7/15 the control character "DELETE". The C0 control characters (defined in JIS X 0211 and matching ISO/IEC 6429) are assigned to the CL region.

7-bit encoding for kanji
Stipulated in the standard itself. The JIS X 0208 double-byte set is assigned to the GL region.
8-bit encoding for kanji
Stipulated in the standard itself. Same as the 7-bit encoding, but defined in terms of 8-bit bytes. The CR region may be unused, or encode the C1 control characters from JIS X 0211. The GR region is unused.
International Reference Version + 7-bit encoding for kanji
Stipulated in the standard itself. The Shift In control character designates the ISO/IEC 646:1991 IRV (International Reference Version, equivalent to US-ASCII) to the GL region. The Shift Out control character designates the JIS X 0208 double-byte set to the same region.
Latin characters + 7-bit encoding for kanji
Stipulated in the standard itself. As with IRV+7-bit, but with ISO/IEC 646:IRV replaced with ISO/IEC 646:JP (the Roman set of JIS X 0201).
International Reference Version + 8-bit encoding for kanji
Stipulated in the standard itself. ISO/IEC 646:IRV is assigned to the GL region, JIS X 0208 to the GR region. This is effectively a subset of EUC-JP, excluding the half-width katakana from JIS X 0201 and the supplemental kanji from JIS X 0212.
Latin characters + 8-bit encoding for kanji
Stipulated in the standard itself. As with IRV+8-bit, but with ISO/IEC 646:IRV replaced with ISO/IEC 646:JP.
Shift-coded character set
Stipulated in Appendix 1: "Shift-Coded Representation" (シフト符号化表現, Shifuto Fugōka Hyōgen). The authoritative definition of Shift JIS.
RFC 1468-coded character set
Stipulated in Appendix 2: "RFC 1468-Coded Representation" (RFC 1468符号化表現, RFC 1468 Fugōka Hyōgen). Resembles ISO-2022-JP (which is authoritatively defined in RFC 1468) but is defined in terms of eight-bit bytes, whereas ISO-2022-JP is defined in terms of seven-bit bytes.

Among the encodings stipulated in the fourth standard, only the "Shift" coded character set is registered by the IANA.[11] However, certain others are closely related to IANA-registered encodings defined elsewhere (EUC-JP and ISO-2022-JP).
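The relationship between a row-cell pair and its bytes in these schemes can be sketched in a few lines. The GL and GR offsets follow from the region boundaries given above (row 3 cell 33 becomes 10/3 12/1 in the 8-bit code). The Shift JIS arithmetic is the commonly cited conversion and is an assumption here rather than a quotation from Appendix 1. All function names are illustrative.

```python
def kuten_to_7bit(row: int, cell: int) -> bytes:
    """7-bit code for kanji: both bytes land in GL (2/1 to 7/14)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def kuten_to_gr(row: int, cell: int) -> bytes:
    """IRV + 8-bit code for kanji: the same bytes moved into GR (10/1 to 15/14)."""
    return bytes(b | 0x80 for b in kuten_to_7bit(row, cell))


def kuten_to_shift_jis(row: int, cell: int) -> bytes:
    """Shift-coded representation (commonly cited arithmetic; an assumption)."""
    kuten_to_7bit(row, cell)  # reuse the range check
    # Two rows share one lead byte; lead bytes skip from 0x9F to 0xE0.
    lead = (row - 1) // 2 + (0x81 if row <= 62 else 0xC1)
    if row % 2:
        # Odd row: trail bytes 0x40..0x9E, skipping 0x7F.
        trail = cell + 0x3F + (1 if cell >= 64 else 0)
    else:
        # Even row: trail bytes 0x9F..0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


# Hiragana "a" sits at row 4 cell 2.
print(kuten_to_7bit(4, 2).hex(), kuten_to_gr(4, 2).hex(), kuten_to_shift_jis(4, 2).hex())
```

The GR form can be checked against Python's `euc_jp` codec and the shift-coded form against its `shift_jis` codec, since both decode the JIS X 0208 portion of their input.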

Escape sequences for JIS X 0202 / ISO 2022


JIS X 0208 may be used within ISO 2022/JIS X 0202 (of which ISO-2022-JP is a subset). The escape sequences to designate JIS X 0208 to each of the four ISO 2022 code sets are listed below. Here, "ESC" refers to the control character "Escape" (0x1B, or 1/11).

ISO 2022 escape sequences to select JIS C 6226 and JIS X 0208
Standard | G0 | G1 | G2 | G3
78 | ESC 2/4 4/0 | ESC 2/4 2/9 4/0 | ESC 2/4 2/10 4/0 | ESC 2/4 2/11 4/0
83 | ESC 2/4 4/2 | ESC 2/4 2/9 4/2 | ESC 2/4 2/10 4/2 | ESC 2/4 2/11 4/2
90 onward | ESC 2/6 4/0 ESC 2/4 4/2 | ESC 2/6 4/0 ESC 2/4 2/9 4/2 | ESC 2/6 4/0 ESC 2/4 2/10 4/2 | ESC 2/6 4/0 ESC 2/4 2/11 4/2

The escape sequence starting ESC 2/4 selects a multi-byte character set. The escape sequence starting ESC 2/6 specifies a revision of the upcoming character set selection. JIS C 6226:1978 is identified by the multibyte-94-set identifier byte 4/0 (corresponding to ASCII @). JIS C 6226:1983 / JIS X 0208:1983 is identified by the multibyte-94-set identifier byte 4/2 (B). JIS X 0208:1990 is also identified by the 94-set identifier byte 4/2, but can be distinguished with the revision identifier 4/0 (@).
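The column/line notation maps directly to byte values (column times 16 plus line), so the sequences in the table can be generated rather than memorized. The sketch below assumes hypothetical helper names (`cl`, `designate`) and uses the labels "78", "83", and "90" for the three rows of the table.

```python
ESC = 0x1B  # the Escape control character, 1/11


def cl(column: int, line: int) -> int:
    """Convert column/line notation (for example 2/4) to a byte value."""
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must each be in 0..15")
    return column * 16 + line


# Intermediate byte that selects the target code set; G0 needs none
# in the sequences listed in the table.
_TARGET = {"G0": [], "G1": [cl(2, 9)], "G2": [cl(2, 10)], "G3": [cl(2, 11)]}
# Final (identifier) byte for each revision.
_FINAL = {"78": cl(4, 0), "83": cl(4, 2), "90": cl(4, 2)}


def designate(standard: str, target: str = "G0") -> bytes:
    """Build the escape sequence designating the given revision to a code set."""
    seq = [ESC, cl(2, 4), *_TARGET[target], _FINAL[standard]]
    if standard == "90":
        # Revision announcement ESC 2/6 4/0 precedes the designation.
        seq = [ESC, cl(2, 6), cl(4, 0)] + seq
    return bytes(seq)


print(designate("78"), designate("83"), designate("90", "G1"))
```

As a cross-check, text encoded with Python's `iso2022_jp` codec begins its kanji runs with the same bytes that `designate("83")` returns.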

Duplicate encodings of ASCII and JIS X 0201


When using the kanji set of this standard with either the ISO/IEC 646:1991 IRV graphic character set (ASCII) or JIS X 0201's graphic character set for Latin characters (JIS-Roman), the treatment of the characters common to both sets becomes problematic. Unless one takes special measures, the characters included in both sets do not all map to each other one-to-one, and a single character may be given more than one code point; that is, it may cause a duplicate encoding.

When a character is common to both sets, JIS X 0208:1997 generally forbids the use of its code point in the kanji set (one of the character's two code points), thereby eliminating the duplicate encoding. Characters that have the same name are judged to be the same character.

For example, both the name of the character corresponding to the bit pattern 4/1 in ASCII and the name of the character corresponding to row 3 cell 33 of the kanji set are "LATIN CAPITAL LETTER A". In International Reference Version + 8-bit code for kanji, whether by the bit pattern 4/1 or by the bit pattern corresponding to the kanji set's row 3 cell 33 (10/3 12/1), the letter "A" (i.e. "LATIN CAPITAL LETTER A") is represented. The standard forbids the use of the "10/3 12/1" bit pattern, in an attempt to eliminate the duplicate encoding.

In consideration of implementations that treat the characters of the code points in the kanji set as "full-width characters" and those of ASCII or JIS-Roman as different characters, the use of the kanji set code points is permitted only for the sake of backwards compatibility. For example, for the purpose of backwards compatibility, it is permitted to consider 10/3 12/1 in International Reference Version + 8-bit code for kanji to correspond to a full-width "A".

If the kanji set is used along with ASCII or JIS-Roman, then even if the standard is abided by strictly, the unique encoding of a character is not guaranteed. For example, in the International Reference Version + 8-bit code for kanji, it is valid to represent a hyphen with the bit pattern 2/13 for the character "HYPHEN-MINUS", as well as with the kanji set's row 1 cell 30 (bit pattern 10/1 11/14) for the character "HYPHEN". In addition, the standard does not define which of the two to use for what, and so the hyphen is not given one unique encoding. The same problem affects the minus sign, the quotation marks, and so forth.

Moreover, even if the kanji set is used as a separate code, there is no guarantee that characters are uniquely encoded. In many cases, the full-width "IDEOGRAPHIC SPACE" at row 1 cell 1 and the half-width space (2/0) coexist. How the two should differ is not self-evident, and is not specified in the standard.
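The two bit patterns for "A" discussed above can be observed directly. This is a sketch that uses Python's `euc_jp` codec as a stand-in for the International Reference Version + 8-bit code for kanji (described earlier as effectively a subset of EUC-JP). Note that the codec takes the "full-width" interpretation that the standard permits only for backwards compatibility. The helper name `irv_8bit_kanji` is illustrative.

```python
import unicodedata


def irv_8bit_kanji(row: int, cell: int) -> bytes:
    """Bytes for a kanji-set code point in the GR region (row/cell + 0xA0)."""
    return bytes([0xA0 + row, 0xA0 + cell])


ascii_a = bytes([0x41])            # bit pattern 4/1
kanji_a = irv_8bit_kanji(3, 33)    # bit pattern 10/3 12/1

decoded = [ascii_a.decode("euc_jp"), kanji_a.decode("euc_jp")]
names = [unicodedata.name(ch) for ch in decoded]
print(names)

# Compatibility normalization folds the full-width letter back onto ASCII,
# which is one common way to undo this particular duplicate encoding.
folded = unicodedata.normalize("NFKC", decoded[1])

# The hyphen case: two different byte patterns, two different characters.
hyphen_minus = bytes([0x2D])         # bit pattern 2/13
hyphen = irv_8bit_kanji(1, 30)       # bit pattern 10/1 11/14
print(hyphen_minus.hex(), hyphen.hex())
```

Whether such folding is appropriate depends on the application; for the hyphen and similar punctuation, the standard itself leaves the choice between the two code points open.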

Comparison of encoding schemes used in practice

Encoding | Alternate name | 7-bit?[A] | ISO 2022? | Stateless?[B] | Accepts ASCII? | 0x00–7F always ASCII? | Superset of 8-bit JIS X 0201? | Supports JIS X 0212? | Bytewise self-synchronizing? | Bitwise self-synchronizing?
ISO-2022-JP | "JIS" (JIS X 0202) | Yes | Yes | No[C] | Yes | Sequences can be non-ASCII[C] | No (encoding possible)[D] | Possible[E] | No | No
Shift_JIS | "SJIS" | No | No | Yes | Almost[F] | Isolated bytes can be non-ASCII[G] | Yes | No | No | No
EUC-JP | "UJIS" (Unixized JIS) | No | Yes[H] | Yes[H] | Usually[I] | Yes | No (encoded)[J] | Usually available[K] | No | No
Unicode formats for comparison[L]
UTF-8 | | No | No | Yes | Yes | Yes | No (encoded) | Available | Yes | Usually[M]
UTF-16 | "Unicode"[N] | No | No | Yes | No | No | No (encoded) | Available | Over 16-bit words only | No
GB 18030 | | No | No[O] | Yes | Yes | Isolated bytes can be non-ASCII | No (encoded) | Available | No | No
UTF-32 | | No | No | Yes | No | No | No (encoded) | Available | Usually, in practice[P] | No
  A. i.e. does not require 8-bit clean transmission.
  B. i.e. the sequence used to encode a given character is always the same, no matter what the previous character(s) were. See state (computer science).
  C. ISO-2022-JP is a stateful encoding: all charsets are encoded over 0x21–7E and are switched between using ANSI escapes. Hence, while it is ASCII in its initial state, entire sequences of non-ASCII characters can be encoded with ASCII bytes.
  D. JIS X 0201 katakana are available in JIS X 0202 and ISO 2022, but not included in the basic ISO-2022-JP profile, although they are a common extension.
  E. JIS X 0212 is available in JIS X 0202 and ISO 2022, and included in the ISO-2022-JP-1 and ISO-2022-JP-2 profiles, but not in the basic ISO-2022-JP profile.
  F. Single byte characters 0x21–7E in Shift_JIS are properly ISO-646-JP, in order to be a superset of 8-bit JIS X 0201, but are often decoded (not necessarily displayed) as ASCII, which differs only in two places.
  G. Some (not all) ASCII bytes can appear as second bytes, but not first bytes, of double-byte characters in Shift_JIS. Hence in a sequence of two or more ASCII bytes, the second byte onward are necessarily ASCII (or ISO-646-JP) characters.
  H. Packed-format EUC is based on ISO 2022 mechanisms, with charset designations pre-arranged. Charset designation escapes and locking shifts are avoided, whereas use of single shifts can be implemented in a non-stateful manner. The constraints of ISO 2022 are nonetheless followed.
  I. Single byte characters 0x21–7E in EUC-JP are generally considered ASCII, but sometimes treated as ISO-646-JP.
  J. Unlike Shift_JIS, EUC-JP will not handle plain 8-bit JIS X 0201 input without prior conversion, due to the different representation of the JIS X 0201 katakana (with single-shifts).
  K. JIS X 0212 in EUC-JP is not always implemented.
  L. Besides the properties of the encodings themselves, Unicode formats have further advantages stemming from the underlying character set: they are not limited to JIS coded characters but can represent the entirety of UCS (including the full repertoire of JIS coded characters), and are hence suited to international use. They are also less badly affected by colliding proprietary extensions, due to their greater base repertoire and designated private use areas.
  M. Most bitwise frameshifts of UTF-8-encoded text will produce invalid UTF-8, but it is possible to construct sequences of characters that remain valid UTF-8 even when frameshifted by one or more bits.
  N. By Microsoft only.
  O. While GB 18030 and GBK are extensions of the EUC-CN form of GB/T 2312, they do not follow the constraints of EUC or ISO 2022, unlike EUC-JP (or the original EUC-CN).
  P. Although, in theory, UTF-32 is self-synchronizing over 32-bit dwords only, the use of a 32-bit value to represent a 21-bit value means that, in practice, UTF-32 contains a continuous run of at least 11 zero bits at the high end of each character, which can usually be used to align to character boundaries, depending on the codepoint(s) involved.
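The "0x00–7F always ASCII?" column can be illustrated with a single katakana character. The sketch below relies on Python's built-in `shift_jis`, `euc_jp`, `iso2022_jp`, and `utf-8` codecs; the helper name `ascii_range_bytes` is illustrative. The character chosen (katakana "so", row 5 cell 29) is a well-known case whose Shift_JIS second byte equals the byte for a backslash.

```python
def ascii_range_bytes(data: bytes) -> bytes:
    """Return only those bytes of `data` that fall in 0x00..0x7F."""
    return bytes(b for b in data if b < 0x80)


text = "\u30bd"  # katakana "so"

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")
jis = text.encode("iso2022_jp")
utf8 = text.encode("utf-8")

# Shift_JIS: the second byte of this double-byte character is 0x5C,
# so a byte-level search for a backslash gives a false positive.
print("shift_jis :", sjis.hex(), "backslash byte present:", b"\\" in sjis)

# EUC-JP and UTF-8: no byte of a multi-byte character is in the ASCII range.
print("euc_jp    :", euc.hex(), "ascii-range bytes:", len(ascii_range_bytes(euc)))
print("utf-8     :", utf8.hex(), "ascii-range bytes:", len(ascii_range_bytes(utf8)))

# ISO-2022-JP: every byte is in the ASCII range, and only the preceding
# escape sequence tells a decoder that they are not ASCII characters.
print("iso2022_jp:", jis.hex(), "all 7-bit:", ascii_range_bytes(jis) == jis)
```

This is why byte-oriented tools that are unaware of the encoding (path splitting on a backslash, for instance) are safer with EUC-JP or UTF-8 input than with Shift_JIS.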

History


Within five years of being established, reaffirmed, or revised, a Japanese Industrial Standard undergoes a process of reaffirmation, revision, or withdrawal. Since its establishment, this standard has been revised three times, and at present the fourth standard is valid.

First standard


The first standard is JIS C 6226-1978 "Code of Japanese Graphic Character Set for Information Interchange" (情報交換用漢字符号系, Jōhō Kōkan'yō Kanji Fugōkei), established by the Japanese Minister of International Trade and Industry on 1 January 1978. It is also called 78JIS for short. Entrusted by the Agency of Industrial Science and Technology, a JIPDEC kanji code standardization research and study committee produced the draft. The committee chairman was Moriguchi Shigeichi.

The code included 453 non-Kanji (including Hiragana, Katakana, the Roman, Greek and Cyrillic alphabets and punctuation) and 6349 Kanji (2965 level 1 Kanji and 3384 level 2 Kanji) for a total of 6802 characters.[12] It did not yet include box-drawing characters. The standard itself was set in Shaken Co., Ltd's Ishii Mincho typeface.

Second standard


The second standard JIS C 6226-1983 "Code of Japanese Graphic Character Set for Information Interchange" (情報交換用漢字符号系, Jōhō Kōkan'yō Kanji Fugōkei) revised the first standard on 1 September 1983. It is also called 83JIS. Entrusted by the AIST, a JIPDEC kanji code-related JIS committee produced the draft. The committee chairman was Motooka Tōru.

The draft of the second standard took into account factors such as the promulgation of the jōyō kanji, the enforcement of the jinmeiyō kanji, and the standardization of Japanese-language Teletex by the Ministry of Posts and Telecommunications. The following modifications were also made to keep pace with JIS C 6234-1983 (24-pixel matrix printer character forms; presently JIS X 9052).

Addition of special characters
39 characters were added to the special characters. They were chosen, following JICST recommendations and drawing on standards such as JIS Z 8201-1981 (mathematical symbols) and JIS Z 8202-1982 (quantity, unit, and chemical symbols), from among symbols that could not be represented by composition.
Newly added box-drawing characters
32 box-drawing characters were added.
Swapping of itaiji code points
Code points for 22 variant pairs of Kanji were swapped, such that the variant in level 2 was moved to level 1 and vice versa.[12][13] For example, the character at (level 1's) row 36 cell 59 in the first standard was moved to (level 2's) row 52 cell 68; the character originally at row 52 cell 68 was in turn moved to row 36 cell 59.
Additions to the level 2 kanji
Three characters from level 1 and one character from level 2 were given new code points at previously unassigned code points in row 84 as level 2 kanji. Itaiji for each of those characters were newly assigned to their original locations.[14] For example, the character at row 84 cell 1 in the second standard was moved there from row 22 cell 38, where a different form not included in the first standard was then placed as a level 1 kanji.
Modification of character forms
The character forms of approximately 300 kanji were amended.[15]

Among the changes in those 300 or so kanji character forms, many level 1 glyphs that were in the style of the Kangxi Dictionary were changed into variants, especially more simplified forms (e.g. ryakuji and extended shinjitai). For example, two code points that are often criticized for having been greatly changed are row 18 cell 10 and row 38 cell 34.

There were many smaller changes away from the Kangxi-style variants; for example, row 25 cell 84 lost part of a stroke. Conversely, some level 2 glyphs that were not Kangxi-style forms were changed into their Kangxi-style forms; for example, row 80 cell 49 gained part of a stroke (i.e., the same part of the stroke that 25-84 lost).

To clarify the original intent of the first standard, the fourth standard brought these differences within the scope of its unification criteria. The difference in form between the two examples noted above (25-84 and 80-49) falls under unification criterion 42.[t]

The bulk of the changes to character forms are differences between level 1 and level 2 kanji. Specifically, simplification was done more often for level 1 kanji than for level 2 kanji; simplifications applied to level 1 kanji were not generally applied to kanji in level 2. The aforementioned 25-84 and 80-49 were likewise treated differently, as the former is in level 1 and the latter is in level 2. Even so, some changes were made regardless of level; for instance, characters containing the "door" and "winter" components were changed with no difference in treatment between level 1 and level 2 kanji.

However, for 29 code points (such as the problematic 18-10 and 38-34 mentioned above), the forms inherited by the fourth standard contradict the original intent of the first. For these, there are special unification criteria to maintain compatibility with the previous standards at these code points.

When the new "X" category for Japanese Industrial Standards (for information-related fields) was introduced, the second standard was re-termed JIS X 0208-1983[12] on 1 March 1987.

Third standard


The third standard JIS X 0208-1990 "Code of Japanese Graphic Character Set for Information Interchange" (情報交換用漢字符号, Jōhō Kōkan'yō Kanji Fugō) revised the second standard on 1 September 1990. It is also called 90JIS for short. Entrusted by the AIST, a committee at the Japanese Standards Association for the revision of JIS X 0208 created the draft. The committee chairman was Tajima Kazuo.

225 kanji glyphs were changed, and two characters were added to level 2 (at 84-05 and 84-06). This was a disunification of itaiji for two characters already included (at 49-59 and 63-70). Some of the changes and the two additions corresponded to the 118 jinmeiyō kanji added in March 1990.[12] The standard itself was set in Heisei Mincho.
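The character counts given for the first three standards can be tallied to confirm that they arrive at the 6879 graphic characters of the current set. This is a small worked check that uses only figures stated in this article; it reads the 1983 row 84 changes as a net gain of four level 2 kanji, since the four original locations were refilled with variants.

```python
# First standard (1978): counts as stated in the text.
non_kanji = 453
level1 = 2965
level2 = 3384
assert level1 + level2 == 6349
total_1978 = non_kanji + level1 + level2

# Second standard (1983): 39 special characters and 32 box-drawing characters
# were added, and four kanji were re-coded into row 84 while variants took
# their old positions (net +4 in level 2 under the reading stated above).
non_kanji += 39 + 32
level2 += 4
total_1983 = non_kanji + level1 + level2

# Third standard (1990): two kanji added to level 2 (84-05 and 84-06).
level2 += 2
total_1990 = non_kanji + level1 + level2

print(total_1978, total_1983, total_1990)
```

The fourth standard added, removed, and moved nothing, so the 1990 total is also the size of the current character set.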

Fourth standard


The fourth standard JIS X 0208:1997 "7-bit and 8-bit double byte coded KANJI sets for information interchange" (7ビット及び8ビットの2バイト情報交換用符号化漢字集合, Nana-Bitto Oyobi Hachi-Bitto no Ni-Baito Jōhō Kōkan'yō Fugōka Kanji Shūgō) revised the third standard on 20 January 1997. It is also called 97JIS for short. Entrusted by the AIST, a JSA committee for research and study of coded character sets produced the draft. The committee chairman was Shibano Kōji.

The basic policies of this revision were to make no changes to the character set, to clarify ambiguous provisions, and to make the standard relatively easier to use. Addition, removal, and code point rearrangement were not done, and without exception, the example glyphs were also left unchanged. However, the stipulations of the standard were completely re-written and/or supplemented. Whereas the third standard was 65 pages long without the explanations, the fourth standard was 374 pages without the explanations.

The main points of the revision are:

Definition of encoding methods
Until the third standard, only the encoding method based on JIS X 0202 code extension was defined. This is something unusual as far as coded character sets go. In the fourth standard, encoding methods that do not use escape sequences for the purpose of code extension were defined.
Definition of the general prohibition of the use of unassigned code points and methods of usage for unassigned code points
The third standard's explanatory notes, which were not part of the standard proper, read as though it were acceptable to assign gaiji to some unassigned code points. The fourth standard clarified that the use of unassigned code points is generally prohibited, and specified the conditions under which they may be used.
General elimination of duplicate encodings
Each character was given a "character name" that maps to those of other standards. Also, encoding methods to use them together with the ISO/IEC 646's International Reference Version or JIS X 0201 were specified. When JIS X 0208 is used together with either, among two assigned code points for characters with the same name, only one is permitted; thus, duplicate encodings were generally eliminated.
Investigation into sources of kanji
Characters included in the standard so far that are found in neither the Kangxi Dictionary nor the Dai Kanwa Jiten were identified. The committee then investigated for what purpose, and from which sources, these kanji had been included during compilation of the first standard.
Definition of kanji unification criteria
Based on things such as the materials for the drafting of the first standard, an attempt was made to restore the intent of the first standard for the scope of the glyphs each code point represents. Moreover, the criteria for unifying kanji glyphs were clearly defined.
Inclusion of de facto standards
By the time of the fourth standard, the encoding methods Shift JIS and ISO-2022-JP had become de facto standards for personal computing and e-mail, respectively. These encoding methods were included as "Shift-Coded Representation" and "RFC 1468-Coded Representation" (described above).

Successors


JIS X 0213 (extended kanji) was designed "with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start";[16] it defines a character set that expands upon the kanji set of JIS X 0208. The drafters of JIS X 0213 recommend migration from JIS X 0208 to JIS X 0213, among the advantages being JIS X 0213's compatibility with the Hyōgai Kanji Glyph List and with newer jinmeiyō kanji.

Contrary to the expectations of the drafters, adoption of JIS X 0213 has been anything but fast since its enactment in the year 2000. The drafting committee of JIS X 0213:2004 wrote (in the year 2004), "The status where 'what the majority of information systems can use in common is JIS X 0208 only' still continues." (JIS X 0213:2000, Appendix 1:2004, section 2.9.7)

For Microsoft Windows, the predominant operating system (and hence supplying the predominant desktop environment) in the personal computing sector, the JIS X 0213 repertoire has been included since Windows Vista, released in November 2006. Mac OS X has been compatible with JIS X 0213 since version 10.1 (released in 2001). Many Unix-likes such as Linux can (optionally) support JIS X 0213 if desired. Therefore, it is thought that with time, JIS X 0213 support on personal computers will not be an impediment to its eventual adoption.

Among the drafters of JIS X 0213, there are those who expect to see a mix of JIS X 0208 and JIS X 0213 before any adoption of JIS X 0213 (Satō, 2004). However, JIS X 0208 continues to be used for the present, and many predict it to endure as a standard. There are barriers that need to be overcome if JIS X 0213 is to supplant JIS X 0208 in common usage:

  • The character repertoires utilized in Japanese mobile phones at the present time are based on JIS X 0208. There are no officially announced plans whatsoever to migrate these to JIS X 0213 compatibility. As mobile phones are now a pervasive aspect of Japanese textual communication (see Japanese mobile phone culture), being a widespread, commonly accessed medium for sending e-mail and accessing the World Wide Web, a lack of adoption for mobile phones deters usage elsewhere.
  • JIS X 0213 is not strictly upward-compatible with JIS X 0208 in terms of unification criteria (see below). For large-scale archives (e.g. bibliographic databases and Aozora Bunko) that use JIS X 0208 and follow its unification criteria strictly, it is thought that it would be extremely difficult work to both convert all the data to JIS X 0213 and preserve the same standard of textual integrity.
  • In practice, many systems define and use unassigned code points in JIS X 0208. For example, Windows assigns IBM and NEC extended characters and user-defined character areas (see Windows-932), and mobile phones assign emoji in some such places. The code points of these gaiji conflict with the code points that JIS X 0213 codes use, so there would be some difficulty in migrating these systems from JIS X 0208 to JIS X 0213. There are also plans to migrate to UCS/Unicode and use the JIS X 0213 repertoire from there, but until a system administrator is able to judge that the implementations of UCS/Unicode surrogate pairs and character compositions are sufficiently stable, he or she is likely to hesitate to use the repertoire of JIS X 0213 that requires those implementations.
  • The improvements provided by JIS X 0213 are mostly in the realm of characters that are not used as often as the ones already present in JIS X 0208. Because there are nearly twice as many glyphs that need to be implemented for less usage of those extra glyphs, it can be a low return on investment in many cases, especially where resources are constrained.

Implementations


Because JIS X 0208 / JIS C 6226 is primarily a character set and not a strictly defined character encoding, several companies have implemented their own encodings of the character set.

Several of these incorporate vendor-specific character assignments in place of unallocated regions of the standard. These include Windows-932 and MacJapanese, as well as NEC's PC98 character encoding. While IBM-932 and IBM-942 also include vendor assignments, they include them outside of the region used for JIS X 0208.

Relation to other standards


ISO/IEC 646 IRV and ASCII


As noted above, the kanji set is not upwardly compatible with the ISO/IEC 646:1991 IRV (ASCII) graphic character set. The kanji set and the IRV graphic character set can be used together as specified in JIS X 0208 (IRV + 7-bit code for kanji and IRV + 8-bit code for kanji). They can be used together in EUC-JP as well.

JIS X 0201


The kanji set lacks three characters included in JIS X 0201's graphic character set for Latin characters: 2/2 (QUOTATION MARK), 2/7 (APOSTROPHE), and 2/13 (HYPHEN-MINUS). The kanji set contains all characters included in JIS X 0201's graphic character set for katakana.

The kanji set and the graphic character set for Latin characters can be used together as specified in JIS X 0208 (Latin characters + 7-bit code for kanji and the Latin characters + 8-bit code for kanji). The kanji set, graphic character set for Latin characters, and JIS X 0201's graphic character set for katakana can be used together as specified in JIS X 0208 (the shift-coded character set; i.e. Shift JIS). The kanji set and graphic character set for katakana can be used together in EUC-JP.

JIS X 0212

JIS X 0212 (supplementary kanji) defines additional characters with code points for the purposes of information processing that requires characters not found in JIS X 0208. Rather than allocating characters within the main JIS X 0208 kanji set, it defines a second 94-by-94 kanji set containing supplementary characters.

JIS X 0212 can be used with JIS X 0208 in EUC-JP. Also, JIS X 0208 and JIS X 0212 are both source standards for UCS/Unicode's Han unification, meaning that kanji from both sets can be included in one Unicode-format document.
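
A minimal sketch of this coexistence, assuming Python's built-in `euc_jp` codec (which accepts JIS X 0212 characters behind the three-byte prefix 0x8F). The two sample characters are chosen for illustration only.

```python
# 亜 (U+4E9C) is the first level 1 kanji of JIS X 0208 (row 16, cell 1).
# 丂 (U+4E02) is a supplementary kanji found in JIS X 0212 but not JIS X 0208.
basic = "\u4e9c".encode("euc_jp")
supplementary = "\u4e02".encode("euc_jp")

print(basic.hex())          # two bytes: the JIS X 0208 code with high bits set
print(supplementary.hex())  # three bytes: prefix 0x8F, then the JIS X 0212 code

# Both characters can share one EUC-JP byte stream and survive a round trip.
mixed = "\u4e9c\u4e02"
round_trip = mixed.encode("euc_jp").decode("euc_jp")
print(round_trip == mixed)
```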

Among the code points that the second version of JIS X 0208 changed, 28 code points in JIS X 0212 reflect the character forms from before the changes.[17] Also, JIS X 0212 reassigns the "closure mark" that JIS X 0208 had assigned as a non-kanji (〆, at row 1 cell 26) as a kanji (乄, at row 16 cell 17). JIS X 0212 has no characters in common with JIS X 0208 other than these. Hence, it is not suited for general use on its own.

However, in the fourth version of JIS X 0208, the connection to JIS X 0212 was not defined at all. It is believed that this is because the drafting committee of the fourth JIS X 0208 standard had a critical opinion of the selection and identification methods of JIS X 0212.[18] The character meanings and selection rationales were not properly documented, making it difficult to identify whether desired kanji corresponded to those in its repertoire.[19] The text of the fourth standard, as well as pointing out the problematic points of the character selection of JIS X 0212, states that "it is thought that not only is character selection impossible, it is also impossible to use together; the connection to JIS X 0212 is not defined at all." (section 3.3.1)

JIS X 0213

Euler diagram comparing repertoires of JIS X 0208, JIS X 0212, JIS X 0213, Windows-31J, the Microsoft standard repertoire and Unicode.

JIS X 0213 (extension kanji) defines a kanji set that expands upon the kanji set of JIS X 0208. According to this standard, it is "designed with the goal being to offer a sufficient character set for the purposes of encoding the modern Japanese language that JIS X 0208 intended to be from the start."[16]

The kanji set of JIS X 0213 incorporates all characters that can be represented in the kanji set of JIS X 0208, with many additions. In total, JIS X 0213 defines 1183 non-kanji and 10,050 kanji (for a total of 11,233 characters), within two 94-by-94 planes (面, men). The first plane (non-kanji and level 1–3 kanji) is based on JIS X 0208, whereas the second plane (level 4 kanji) is designed to fit within the unallocated rows of JIS X 0212, allowing use in EUC-JP.[20] JIS X 0213 also defines Shift_JISx0213, a variant of Shift_JIS capable of encoding the entirety of JIS X 0213.

For most intents and purposes, JIS X 0213 plane 1 is a superset of JIS X 0208. However, different unification criteria are applied to some code points in JIS X 0213 compared to JIS X 0208. Consequently, some pairs of kanji glyphs that were represented by one JIS X 0208 code point, due to being unified, are given separate code points in JIS X 0213. For example, the glyph at row 33 cell 46 of JIS X 0208 ("僧", described above) unifies a few variants due to its right-hand component. In JIS X 0213, two forms (the ones containing the component "丷") are unified on plane 1 row 33 cell 46, and the other (containing the component "八") is located at plane 1 row 14 cell 41. Therefore, whether JIS X 0208 row 33 cell 46 should be mapped to JIS X 0213 plane 1 row 33 cell 46 or plane 1 row 14 cell 41 cannot be determined automatically.[u] This limits the extent to which JIS X 0213 can be considered upwardly compatible with JIS X 0208, as admitted by the JIS X 0213 drafting committee.[21]

However, for the most part, row m cell n in JIS X 0208 corresponds to plane 1 row m cell n in JIS X 0213; therefore, not much confusion arises in practice. This is because most typefaces have come to use the glyphs exemplified in JIS X 0208, and most users are not consciously aware of the unification criteria.

ISO/IEC 10646 and Unicode

The kanji set of JIS X 0208 is among the original source standards for the Han unification in ISO/IEC 10646 (UCS) and Unicode. Every kanji in JIS X 0208 corresponds to its own code point in UCS/Unicode's Basic Multilingual Plane (BMP).

The non-kanji in JIS X 0208 also correspond to their own code points in the BMP. However, for some special characters, some systems implement correspondences that differ from those of UCS/Unicode (which are based on the character names given in JIS X 0208:1997).
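
One well-known case of such a divergence is the wave dash at row 1 cell 33. The sketch below decodes the same Shift JIS byte pair with two of Python's built-in codecs: `shift_jis`, which follows the name-based mapping, and `cp932`, which follows Microsoft's Windows-31J mapping. Using these two codecs as stand-ins for "standard" and "vendor" behavior is an assumption of this example.

```python
# Shift JIS bytes for JIS X 0208 row 1 cell 33 (JIS code 0x2141).
WAVE_DASH_CELL = bytes([0x81, 0x60])

as_standard = WAVE_DASH_CELL.decode("shift_jis")  # name-based mapping
as_windows = WAVE_DASH_CELL.decode("cp932")       # Windows-31J mapping

for label, ch in (("shift_jis", as_standard), ("cp932", as_windows)):
    print(label, f"U+{ord(ch):04X}")
```

The two codecs report WAVE DASH (U+301C) and FULLWIDTH TILDE (U+FF5E) respectively, so text converted to Unicode by one system may fail to convert back on the other.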

Footnotes

See also

  • JIS coded character sets
    • JIS X 0201 "7-bit and 8-bit coded character sets for information interchange"
    • JIS X 0202 "Information technology – Character code structure and extension techniques" (ISO/IEC 2022)
    • JIS X 0208 "7-bit and 8-bit double byte coded KANJI sets for information interchange"
    • JIS X 0211 "Control functions for coded character sets" (ISO/IEC 6429)
    • JIS X 0212 "Code of the supplementary Japanese graphic character set for information interchange"
    • JIS X 0213 "7-bit and 8-bit double byte coded extended KANJI sets for information interchange"
    • JIS X 0221 "Universal Multiple-Octet Coded Character Set (UCS)" (ISO/IEC 10646)
  • Extended shinjitai
  • Help:Japanese

References

from Grokipedia
JIS X 0208 is a Japanese Industrial Standard specifying a two-byte character encoding for the graphic character set used in information interchange, encompassing 6,879 characters suitable for Japanese text, including 6,355 kanji (divided into 2,965 first-level and 3,390 second-level characters), full-width Latin letters, hiragana, katakana, Greek and Cyrillic letters, and various symbols and punctuation. Originally published in 1978 as JIS C 6226, it was revised in 1983 and renamed JIS X 0208 effective March 1, 1987, before a 1990 revision that added two characters to the existing set without altering the overall structure. The standard employs a 94-by-94 grid layout for its double-byte characters, excluding control codes and ASCII space, and serves as the foundational character set for several Japanese encodings, including those compliant with ISO 2022 for switching between single-byte ASCII and multibyte Japanese modes in applications like email and legacy systems. As a core component of Japanese computing history, JIS X 0208 enabled the representation of complex scripts in early digital environments, influencing subsequent standards like JIS X 0212 for supplementary characters while remaining integral to formats such as Shift JIS and EUC-JP.

Overview and Scope

Definition and Purpose

JIS X 0208:1997 is a Japanese Industrial Standard defining a 94×94 double-byte character set comprising 6,879 graphic characters intended for information interchange. The standard establishes a structured encoding for representing text in Japanese computing environments, using a grid in which each character is identified by a pair of bytes corresponding to its row and column positions within the 94-by-94 matrix.

The primary purpose of JIS X 0208:1997 is to standardize the digital representation of the Japanese writing system, including kanji ideographs and the hiragana and katakana syllabaries, as well as supplementary characters such as Latin letters, Greek and Cyrillic scripts, numerals, punctuation, and symbols. It supports 6,355 kanji divided into two levels (Level 1 with 2,965 commonly used kanji for everyday text and Level 2 with 3,390 additional kanji for specialized or less frequent usage), alongside 83 hiragana characters and 86 katakana characters, enabling comprehensive handling of Japanese text in software. The grid also leaves space for extensions and for compatibility with other standards, ensuring versatility in mixed-script environments.

Prior to the widespread adoption of Unicode in the late 1990s and early 2000s, JIS X 0208 served as the de facto standard for Japanese text encoding in systems across Japan and internationally, particularly in conjunction with ISO 2022 escape sequences for character set switching. This role underscored its importance in facilitating reliable data exchange and display of Japanese content in pre-Unicode computing infrastructures.

Scope of Use and Compatibility

JIS X 0208 serves as the foundational Japanese Industrial Standard for encoding graphic characters in information interchange, supporting essential applications in Japanese computing such as text processing in word processors, data storage in databases, printing systems, and early web content creation. Established to facilitate reliable exchange of Japanese text across systems, it enables the representation of kanji, hiragana, katakana, and symbols in environments requiring precise handling of double-byte characters. Its design prioritizes compatibility in professional and technical workflows, where consistent character rendering is critical for documents, reports, and digital communications.

The standard is engineered for use in both 7-bit and 8-bit data transmission environments through ISO 2022 code extension techniques, allowing dynamic switching between character sets. This includes provisions for mixing with ASCII, where sequences begin and end in ASCII to ensure compatibility with international systems, using escape sequences to invoke the JIS X 0208 set as needed. Such features make it suitable for legacy networks and protocols that operate under byte constraints, without requiring full 8-bit channels.

However, JIS X 0208's fixed structure as a 94×94 grid limits its coverage to 6,879 defined graphic characters, excluding certain modern or specialized Japanese glyphs introduced in subsequent standards. Characters outside this repertoire can only be handled through additional mapping layers or other character sets, which can complicate integration in diverse text-handling scenarios. Despite these limitations, its backward compatibility with prior iterations, such as the 1978 JIS C 6226 and the 1983 revision, preserves continuity for existing Japanese data archives and software. Later revisions maintained alignment with these earlier versions to avoid disrupting established implementations.

Character Set Composition

Non-Kanji Characters

The non-kanji characters in JIS X 0208 occupy rows 1 through 8 of the 94×94 code space, corresponding to lead bytes 0x21 to 0x28 in the standard's double-byte encoding scheme. Vendor implementations additionally place symbols in row 13 (lead byte 0x2D), which the standard itself leaves unassigned. Each row provides 94 positions for graphic characters, accommodating the phonetic, alphabetic, and symbolic elements needed for text processing and display. This organization complements single-byte character sets while extending support to multibyte sequences in information interchange.

In total, JIS X 0208 defines 524 non-kanji characters across these rows, covering the hiragana and katakana phonetic scripts alongside Latin, Greek, Cyrillic, and various symbols. The arrangement places punctuation and symbols first, followed by alphabetic and phonetic scripts and then line-drawing elements. The key groupings are:

  • Row 1 (lead byte 0x21): special punctuation and symbols
  • Row 2 (lead byte 0x22): additional special characters and symbols, including arrows
  • Row 3 (lead byte 0x23): full-width digits and Roman alphabet characters
  • Row 4 (lead byte 0x24): full-width hiragana
  • Row 5 (lead byte 0x25): full-width katakana
  • Row 6 (lead byte 0x26): Greek letters
  • Row 7 (lead byte 0x27): Cyrillic letters
  • Row 8 (lead byte 0x28): box-drawing line elements
  • Row 13 (lead byte 0x2D): vendor-specific special symbols from NEC, outside the standard proper

Rows 9 through 15 are unassigned in the standard, reserving space for potential future extensions without disrupting existing implementations. This layout contrasts with the kanji allocations, which start at lead byte 0x30 in row 16, allowing distinct handling of ideographic and non-ideographic content.

Kanji Characters

The kanji characters in JIS X 0208 are allocated to rows 16 through 84 of the standard's 94×94 grid, corresponding to lead bytes ranging from 0x30 to 0x74 in the double-byte encoding scheme. This allocation spans 69 rows, with 6,355 assigned characters (2,965 in Level 1 across rows 16–47 and 3,390 in Level 2 across rows 48–84). Not all 94 positions in these rows are assigned to kanji: the trailing cells of rows 47 and 84 are left unassigned. These rows contrast with the earlier rows (1–15), which are reserved for non-kanji elements such as hiragana, katakana, and symbols.

The selection of kanji for JIS X 0208 emphasizes commonly used characters drawn from educational standards, including the kyōiku kanji taught in schools, as well as those prevalent in general usage for publications, documents, and names. This includes the jōyō kanji for everyday and official purposes (1,945 characters in the 1981 list; the 2010 list has 2,136) and the jimmeiyō kanji for personal names, while deliberately excluding rare variants and obscure forms to maintain a practical set for information interchange. The chosen kanji are divided into Level 1 (rows 16–47, focusing on frequently used characters) and Level 2 (rows 48–84, encompassing supplementary general-use kanji).

Within the grid, kanji are encoded using a double-byte system in which each character is identified by its ku-ten (区点) coordinates, denoting the row (ku) and column (ten) position, such as 16-1 for the first kanji in row 16. These coordinates are converted to byte values by adding 0x20 (32 decimal) to both ku and ten, forming the 16-bit code (e.g., ku-ten 27-68 becomes bytes 0x3B and 0x64).

Kanji serve as the core component of JIS X 0208 for semantic representation in Japanese text, enabling the encoding of ideographic content that conveys meaning beyond phonetic scripts like hiragana and katakana. This ideographic focus supports applications in printing, computing, and digital communication, where kanji provide compact expression of complex ideas essential to written Japanese.
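
The ku-ten conversion and the level boundaries described above can be sketched as follows. The function names `kuten_to_jis` and `kanji_level` are hypothetical and used only for illustration.

```python
def kuten_to_jis(ku, ten):
    """Convert ku-ten coordinates (each 1..94) to a JIS X 0208 byte pair."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return ku + 0x20, ten + 0x20


def kanji_level(ku):
    """Return 1 or 2 for a kanji row, or None for rows outside 16..84."""
    if 16 <= ku <= 47:
        return 1
    if 48 <= ku <= 84:
        return 2
    return None


lead, trail = kuten_to_jis(27, 68)
print(f"0x{lead:02X} 0x{trail:02X}", kanji_level(27))  # 0x3B 0x64 1
```

Note that `kanji_level` classifies rows only; it does not check whether a given cell within the row is actually assigned.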

Code Structure

Code Points and Lead Bytes

JIS X 0208 defines its character set using a double-byte code structure, where each graphic character is represented by a pair of bytes known as the lead byte and the trail byte. Both the lead byte and the trail byte range from 0x21 to 0x7E, corresponding to 94 possible values each, forming a 94 by 94 grid of potential code positions. The lead bytes are mapped to specific character groups, with the initial range of 0x21 to 0x28 allocated to non-kanji characters such as symbols, punctuation, Latin letters, and phonetic scripts like hiragana and katakana. Kanji characters occupy the lead byte range of 0x30 to 0x74, encompassing rows 16 through 84 in the grid. Lead bytes 0x29 to 0x2F (rows 9 through 15) are unassigned in the standard, although vendors such as NEC have placed extension characters in row 13 (lead byte 0x2D).

Each position in the 94×94 grid can also be given a unique linear identifier, ranging from 1 to 8,836. The formula derives the row number as (lead byte − 0x20) and the column number as (trail byte − 0x20), then computes the code point as ((row − 1) × 94) + column. This numbering system facilitates precise referencing of characters within the standard, for example when building tables that map to Unicode equivalents. The grid begins at the byte pair 0x21 0x21 (row 1, cell 1), which is assigned to the ideographic space. The trail byte follows the same 0x21 to 0x7E range without additional exclusions beyond the standard grid boundaries.
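
The linear numbering described above can be sketched as a pair of functions. The names `linear_index` and `from_linear_index` are hypothetical; the arithmetic follows the formula in the text.

```python
GRID = 94  # rows and columns in the JIS X 0208 grid


def linear_index(lead, trail):
    """Map a (lead, trail) byte pair to a linear index in 1..8836."""
    for b in (lead, trail):
        if not 0x21 <= b <= 0x7E:
            raise ValueError("bytes must be in 0x21..0x7E")
    row = lead - 0x20
    column = trail - 0x20
    return (row - 1) * GRID + column


def from_linear_index(index):
    """Inverse of linear_index: recover the (lead, trail) byte pair."""
    if not 1 <= index <= GRID * GRID:
        raise ValueError("index must be in 1..8836")
    row0, col0 = divmod(index - 1, GRID)  # zero-based row and column
    return row0 + 1 + 0x20, col0 + 1 + 0x20


print(linear_index(0x21, 0x21))  # 1, the first cell of row 1
print(linear_index(0x30, 0x21))  # 1411, the first cell of row 16
print(linear_index(0x7E, 0x7E))  # 8836, the last cell of row 94
```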

Single-Byte and Double-Byte Codes

JIS X 0208 is fundamentally a double-byte character set, in which each character is represented by two bytes, with no native single-byte encoding for its kanji or other graphic characters. However, to facilitate integration with international standards and enable transmission over 7-bit channels, it is combined with single-byte sets through the ISO/IEC 2022 framework. In this mode, single-byte codes handle ASCII characters (ISO-IR 6, designated via ESC ( B) and elements from JIS X 0201, including the Roman set (ESC ( J) and half-width katakana (ESC ( I). These single-byte sets use 7-bit codes in the range 0x21–0x7E for up to 94 graphic characters, allowing English text and basic Japanese phonetic symbols to be mixed in without shifting to double-byte mode.

The priority in JIS X 0208 implementations remains on double-byte encoding, where all 6,879 characters (including kanji, hiragana, full-width katakana, and symbols) are encoded as pairs of bytes, each in the range 0x21–0x7E. There is no provision for single-byte kanji within the native JIS X 0208 encoding, so ideographs always require the full two-byte sequence, which keeps them unambiguous and avoids conflicts with single-byte sets.

Switching between single-byte and double-byte modes occurs via ISO 2022 designation sequences, such as ESC $ B to invoke the JIS X 0208-1983 double-byte set (preferred over ESC $ @ for the 1978 version) or ESC ( B to return to ASCII. This mechanism uses the escape character (0x1B) followed by specific bytes to designate a character set to G0, allowing dynamic toggling within a text stream.

Practically, this hybrid approach enables JIS X 0208 to operate over 7-bit transmission channels, such as early email (RFC 822) and network news (RFC 1036), by confining all bytes to the 0x00–0x7F range through state-based shifting. For mixed Japanese and English content, sequences begin and end in ASCII mode, with shifts to double-byte mode only for JIS X 0208 characters, minimizing overhead and ensuring compatibility with systems limited to 7-bit clean transport without requiring additional encodings like Base64. This design was crucial for early adoption on 7-bit networks, supporting text interchange while preserving the integrity of the double-byte repertoire.
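
The escape-sequence switching can be observed with Python's built-in `iso2022_jp` codec, used here as a convenient stand-in for an ISO 2022 implementation (an assumption of this sketch, not something the standard prescribes). The sample string is arbitrary.

```python
ESC_TO_JISX0208 = b"\x1b$B"  # ESC $ B: designate JIS X 0208-1983 to G0
ESC_TO_ASCII = b"\x1b(B"     # ESC ( B: designate ASCII to G0

text = "JIS \u65e5\u672c OK"  # "JIS 日本 OK"
encoded = text.encode("iso2022_jp")
print(encoded)

# The stream starts and ends in ASCII; the two kanji sit between the escapes.
start = encoded.index(ESC_TO_JISX0208) + len(ESC_TO_JISX0208)
end = encoded.index(ESC_TO_ASCII)
kanji_bytes = encoded[start:end]
print(kanji_bytes.hex())  # 467c4b5c: two byte pairs, one per kanji

# Every byte stays below 0x80, so the stream is safe for 7-bit transport.
print(all(b < 0x80 for b in encoded))
```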

Unassigned Code Points

JIS X 0208 organizes its double-byte character codes into a 94×94 grid, yielding 8,836 possible code points, of which 6,879 are assigned to graphic characters, leaving 1,957 unassigned. The largest gaps are entire rows with no assigned characters: rows 9 through 15 (lead bytes 0x29 to 0x2F) and rows 85 through 94 (lead bytes 0x75 to 0x7E). The remaining gaps are unused cells within the non-kanji rows and at the ends of kanji rows 47 and 84. Encodings built on the set add their own byte-level exclusions, such as Shift JIS avoiding 0x7F as a trail byte. The unassigned code points leave room for future expansion and, in practice, for vendor extensions. This approach promotes extensibility but constrains the standard repertoire to 6,879 characters, influencing implementations in encodings such as EUC-JP and Shift JIS.
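
The figure of 1,957 can be checked with a little arithmetic. The sketch below uses the character counts given in this article, and it assumes that rows 9 to 15 and rows 85 to 94 carry no assignments in the standard itself; the variable names are illustrative.

```python
CELLS_PER_ROW = 94
TOTAL_CELLS = CELLS_PER_ROW * CELLS_PER_ROW  # 8,836 positions

assigned = {"non_kanji": 524, "level_1_kanji": 2965, "level_2_kanji": 3390}
total_assigned = sum(assigned.values())
unassigned = TOTAL_CELLS - total_assigned
print(total_assigned, unassigned)  # 6879 1957

# Break the unassigned cells down by where they fall in the grid.
empty_rows = list(range(9, 16)) + list(range(85, 95))  # wholly empty rows
in_empty_rows = len(empty_rows) * CELLS_PER_ROW
in_non_kanji_rows = 8 * CELLS_PER_ROW - assigned["non_kanji"]      # rows 1-8
in_kanji_rows = 69 * CELLS_PER_ROW - (2965 + 3390)                 # rows 16-84
print(in_empty_rows, in_non_kanji_rows, in_kanji_rows)  # 1598 228 131
```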

Character Names and Identification

JIS X 0208 characters are primarily identified using the ku-ten notation, a coordinate system that specifies the row (ku) and position within the row (ten) in the standard's 94×94 grid structure. This notation ranges from 1-1 to 94-94, excluding unassigned positions, and provides a unique identifier for each defined character; for example, the kanji 日 (meaning "sun" or "day") is located at 38-92, while the hiragana あ is at 4-2. The ku-ten system originates from the arrangement in the official JIS X 0208 code tables and is converted to binary code points by adding 0x20 (decimal 32) to both the ku and ten values before combining them into a 16-bit sequence, such as 0x467C for 日.

In addition to ku-ten codes, the standard includes descriptive names for characters, particularly non-kanji glyphs, defined in its annexes to ensure unambiguous referencing. These names follow a formal convention matching that of ISO/IEC 10646 and Unicode, such as "IDEOGRAPHIC COMMA" for the mark 、 at ku-ten 1-2 and "HIRAGANA LETTER A" for あ at 4-2. Kanji characters typically lack individual descriptive names in the standard, relying instead on ku-ten or decimal equivalents (e.g., 13168 for a specific kanji) for identification.

These naming and identification mechanisms are standardized in the JIS X 0208 annexes to support consistent identification across software systems, enabling reliable mapping in fonts, text processing, and character databases. For instance, ku-ten notations are essential for aligning glyphs in font files, while descriptive names aid in property assignment for rendering and searching applications.
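
A minimal sketch of looking characters up by ku-ten and reporting their names. It assumes Python's built-in `euc_jp` codec as the mapping table (EUC-JP bytes are simply ku and ten plus 0xA0) and uses Unicode names from `unicodedata` as a proxy for the names in the standard; `kuten_to_char` is a hypothetical helper.

```python
import unicodedata


def kuten_to_char(ku, ten):
    """Decode a ku-ten pair by way of EUC-JP (bytes are ku/ten + 0xA0)."""
    return bytes([ku + 0xA0, ten + 0xA0]).decode("euc_jp")


for ku, ten in ((1, 2), (4, 2), (38, 92)):
    ch = kuten_to_char(ku, ten)
    jis_code = ((ku + 0x20) << 8) | (ten + 0x20)  # 16-bit JIS code
    print(f"{ku}-{ten}", f"0x{jis_code:04X}", ch, unicodedata.name(ch))
```

For the kanji, `unicodedata.name` returns only a generic label derived from the code point, which mirrors the lack of individual descriptive names for kanji noted above.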

Detailed Character Groups

Special Characters and Symbols

Row 1 of JIS X 0208, corresponding to the lead byte 0x21, is dedicated to special characters and symbols, encompassing 94 graphic characters primarily focused on punctuation, diacritical marks, quotation symbols, brackets, mathematical operators, and miscellaneous ideographic and geometric symbols. This row forms part of the standard's non-kanji allocation, providing essential elements for Japanese text composition that complement the alphanumeric and phonetic scripts in subsequent rows. The characters are arranged logically by category within the 94-position row (second byte ranging from 0x21 to 0x7E), beginning with spacing and basic punctuation, progressing to diacritics and iteration marks, then to dashes, quotation marks, various bracket types, mathematical and relational symbols, and concluding with currency signs, section markers, and simple geometric shapes. The first two positions (0x2121 and 0x2122) are the ideographic space and ideographic comma, which act as foundational delimiters in Japanese text.

Notable features include symbols tailored for Japanese usage, such as the katakana middle dot (0x2126, U+30FB) used for separating words or enumerations in katakana text, and the katakana-hiragana prolonged sound mark (0x213C, U+30FC), which extends vowel sounds in phonetic notation. Other distinctive elements are the voiced and semi-voiced sound marks (0x212B: U+309B; 0x212C: U+309C), which are the spacing (non-combining) forms of the hiragana and katakana diacritics, and the wave dash (0x2141, U+301C), commonly used in Japanese text for ranges or approximations. Representative examples across categories illustrate the row's diversity:
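| Category | JIS code | Unicode | Description |
|---|---|---|---|
| Punctuation | 0x2121 | U+3000 | Ideographic space (full-width space for East Asian typography) |
| Punctuation | 0x2122 | U+3001 | Ideographic comma |
| Punctuation | 0x2123 | U+3002 | Ideographic full stop |
| Diacritics & marks | 0x212B | U+309B | Katakana-hiragana voiced sound mark (dakuten) |
| Diacritics & marks | 0x213C | U+30FC | Katakana-hiragana prolonged sound mark |
| Brackets | 0x214A | U+FF08 | Full-width left parenthesis |
| Brackets | 0x214C | U+3014 | Left tortoise shell bracket (common in Japanese quotes) |
| Mathematical symbols | 0x215C | U+FF0B | Full-width plus sign |
| Mathematical symbols | 0x215D | U+2212 | Minus sign |
| Mathematical symbols | 0x215F | U+00D7 | Multiplication sign |
| Currency & misc. | 0x216F | U+FFE5 | Full-width yen sign |
| Currency & misc. | 0x2179 | U+2606 | White star |
| Currency & misc. | 0x217B | U+25CB | White circle |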
These selections highlight the row's role in supporting precise punctuation and symbolic expression in Japanese documents, with full-width variants ensuring compatibility in mixed-script layouts.

Numerals, Latin, Greek, and Cyrillic

Row 2 of JIS X 0208 allocates positions for additional special characters and symbols, extending the punctuation and marks introduced in row 1. These include geometric shapes, arrows, and other forms suitable for technical and mathematical notation in Japanese contexts. For example, position 2-1 maps to the black diamond (U+25C6) and 2-3 to the black square (U+25A0), continuing the sequence of shapes that ends row 1, where 1-91 is the white circle (U+25CB). Such symbols are rendered in full-width forms to align with the fixed character widths of East Asian typography.

Row 3 assigns 62 of its 94 positions to full-width representations of Western numerals and Latin letters, facilitating mixed-script text in Japanese documents. Cells 3-16 through 3-25 encode the digits 0 through 9 as full-width forms (U+FF10 to U+FF19), followed by uppercase Latin letters A through Z (3-33 through 3-58, U+FF21 to U+FF3A) and lowercase a through z (3-65 through 3-90, U+FF41 to U+FF5A). These full-width variants ensure uniform character width in fonts designed for CJK integration, preventing alignment issues in mixed layouts. Common symbols like the full-width exclamation mark (1-10, U+FF01) and question mark (1-9, U+FF1F) appear in row 1, supporting basic punctuation in bilingual text.

Rows 6 and 7 provide dedicated spaces for Greek and Cyrillic scripts, respectively, each with uppercase and lowercase variants to support scientific, mathematical, and international terminology within Japanese publications. Row 6 (lead byte 0x26) encodes the 24 uppercase Greek letters starting at 6-1 with alpha (U+0391) through omega at 6-24 (U+03A9), followed by lowercase from 6-33 (alpha, U+03B1) to 6-56 (omega, U+03C9). These are standard Greek mappings but rendered full-width in East Asian fonts for typographic harmony. JIS-specific glyph variants may differ slightly from ISO standards to align with Japanese printing conventions, ensuring compatibility in legacy systems.

Row 7 (lead byte 0x27) similarly accommodates the basic 33 letters of the Cyrillic alphabet, with uppercase forms from 7-1 (А, U+0410) to 7-33 (Я, U+042F) and lowercase from 7-49 (а, U+0430) to 7-81 (я, U+044F), excluding obsolete or supplementary characters. Like the Greek set, these are full-width for font integration and may feature JIS-adapted shapes, such as rounded forms for certain letters, to match East Asian aesthetic preferences. This arrangement reflects JIS X 0208's use of separate rows for non-Latin scripts, distinct from the full-width Latin letters in row 3.
| Row | Script/Group | Key examples (position: Unicode name) |
|---|---|---|
| 2 | Special symbols | 2-1: BLACK DIAMOND (U+25C6) |
| 2 | Special symbols | 2-3: BLACK SQUARE (U+25A0) |
| 2 | Special symbols | 2-10: RIGHTWARDS ARROW (U+2192) |
| 3 | Numerals & Latin | 3-16: FULLWIDTH DIGIT ZERO (U+FF10) |
| 3 | Numerals & Latin | 3-33: FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) |
| 3 | Numerals & Latin | 3-65: FULLWIDTH LATIN SMALL LETTER A (U+FF41) |
| 6 | Greek | 6-1: GREEK CAPITAL LETTER ALPHA (U+0391) |
| 6 | Greek | 6-33: GREEK SMALL LETTER ALPHA (U+03B1) |
| 6 | Greek | 6-56: GREEK SMALL LETTER OMEGA (U+03C9) |
| 7 | Cyrillic | 7-1: CYRILLIC CAPITAL LETTER A (U+0410) |
| 7 | Cyrillic | 7-49: CYRILLIC SMALL LETTER A (U+0430) |
| 7 | Cyrillic | 7-81: CYRILLIC SMALL LETTER YA (U+044F) |
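
Because row 3 mirrors the layout of ASCII, the Unicode code point of each assigned cell follows a simple offset: U+FF00 plus the cell number. The sketch below expresses that rule and cross-checks it against Python's built-in `euc_jp` codec (the codec is an assumption of this example, and `row3_to_unicode` is a hypothetical name).

```python
def row3_to_unicode(ten):
    """Map a row 3 cell to its full-width code point, or None if unassigned."""
    assigned = (16 <= ten <= 25) or (33 <= ten <= 58) or (65 <= ten <= 90)
    return 0xFF00 + ten if assigned else None


assigned_cells = [t for t in range(1, 95) if row3_to_unicode(t) is not None]
print(len(assigned_cells))  # 62: ten digits plus 26 + 26 letters

# Cross-check every assigned cell against the codec's own mapping table.
matches = all(
    bytes([0xA3, 0xA0 + t]).decode("euc_jp") == chr(row3_to_unicode(t))
    for t in assigned_cells
)
print(matches)
```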

Hiragana and Katakana

Row 4 of JIS X 0208 is dedicated to the hiragana phonetic script, encompassing 83 characters that represent the basic syllables, voiced variants (dakuten), semi-voiced variants (handakuten), small characters for compounding, and historical forms. These include the standard set from あ (a) to ん (n), along with modifications such as が (ga) and ぱ (pa), enabling the full expression of Japanese phonemes in a cursive, fluid style typically used for native words, grammatical particles, and inflections. The inclusion of obsolete characters like ゐ (wi) and ゑ (we) reflects historical orthography, preserved for compatibility with legacy texts despite their obsolescence in modern usage following post-war script reforms.

Row 5 mirrors this structure with 86 katakana characters, providing angular counterparts to the hiragana glyphs for phonetic representation. Katakana, often employed for emphasis, scientific terms, and foreign loanwords, includes equivalent voiced and semi-voiced forms such as ガ (ga) and パ (pa), as well as small variants like ャ (small ya). Like hiragana, it incorporates the historical katakana for wi (ヰ) and we (ヱ), and it adds three characters with no hiragana counterpart in row 4 (ヴ, ヵ, and ヶ).

All hiragana and katakana in JIS X 0208 are encoded as double-byte sequences to align with the standard's overall structure for non-kanji and kanji characters, ensuring uniform processing in text streams. Dakuten (゛) and handakuten (゜) are integrated directly into the character glyphs as precomposed forms rather than separate combining marks, with dedicated code points for each variant (e.g., か for ka and が for ga), facilitating straightforward rendering without additional diacritic application. This approach supports the phonetic completeness of the scripts while adhering to the 94x94 grid layout of the standard.
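
The row sizes and the precomposed treatment of dakuten can be checked with a short sketch. It assumes that Python's built-in `euc_jp` codec reflects the JIS X 0208 assignments and rejects unassigned cells; `row_chars` is a hypothetical helper.

```python
import unicodedata


def row_chars(ku):
    """Return the characters assigned in one row, skipping unassigned cells."""
    chars = []
    for ten in range(1, 95):
        try:
            chars.append(bytes([ku + 0xA0, ten + 0xA0]).decode("euc_jp"))
        except UnicodeDecodeError:
            pass  # the codec rejects cells with no assigned character
    return chars


hiragana = row_chars(4)
katakana = row_chars(5)
print(len(hiragana), len(katakana))

# が occupies its own cell; Unicode can decompose it into か plus a
# combining mark, but JIS X 0208 has no such combining form.
ga = "\u304c"
decomposed = unicodedata.normalize("NFD", ga)
print(ga in hiragana, [f"U+{ord(c):04X}" for c in decomposed])
```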

Box Drawing and Graphic Symbols

Row 8 of JIS X 0208 contains 32 box-drawing characters (cells 1 through 32), semigraphic elements that enable the construction of borders, tables, and basic diagrams within fixed-width text environments. These characters were introduced in the 1983 revision of the standard to support graphical representations in Japanese systems, particularly for terminal-based applications and early text software. The set draws inspiration from IBM's character sets, incorporating similar conventions for compatibility with the international hardware and software influences prevalent in the era.

The characters come in light and heavy line weights, along with mixed light-and-heavy variants for connections and intersections. Representative examples include horizontal lines such as the light horizontal (Unicode equivalent U+2500 ─) and heavy horizontal (U+2501 ━), vertical lines like the light vertical (U+2502 │) and heavy vertical (U+2503 ┃), corner pieces such as the light lower-left corner (U+2514 └) and heavy lower-left corner (U+2517 ┗), and tee junctions like the light left tee (U+251C ├) and heavy left tee (U+2523 ┣). These elements allow users to assemble complex structures by combining segments, promoting consistent rendering across monospaced displays without requiring bitmap graphics.

Beyond row 8, additional graphic symbols appear in nearby rows, such as arrows (e.g., the rightwards arrow in row 2) and geometric shapes, which complement box-drawing for broader illustrative purposes in text interfaces. Overall, this collection serves terminal displays and tabular layouts in text-based systems, facilitating accessible visual organization in the resource-constrained environments typical of early Japanese computing.
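
As an illustration, the sketch below assembles a small grid from light box-drawing characters and confirms that each one encodes to row 8. It assumes Python's built-in `euc_jp` codec, in which the lead byte minus 0xA0 gives the row; `ku_of` is a hypothetical helper.

```python
BOX = [
    "┌─┬─┐",
    "├─┼─┤",
    "└─┴─┘",
]


def ku_of(ch):
    """Return the JIS X 0208 row of a character via its EUC-JP lead byte."""
    return ch.encode("euc_jp")[0] - 0xA0


print("\n".join(BOX))
rows_used = {ku_of(ch) for line in BOX for ch in line}
print(rows_used)  # {8}: every segment comes from row 8

# Double-line characters such as U+2550 are not part of the set.
try:
    "\u2550".encode("euc_jp")
    has_double_line = True
except UnicodeEncodeError:
    has_double_line = False
print(has_double_line)
```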

NEC Extension Characters

The NEC extension characters occupy row 13 (lead byte 0x2D in the 7-bit JIS encoding) of the JIS X 0208 grid, a row left unassigned in the official standard. Developed by NEC, this extension comprises 83 proprietary characters that augment the standard set with specialized symbols for Japanese environments. These additions were particularly prominent in NEC's hardware and software, such as early personal computers and text processing systems, where they filled gaps in symbol support for technical and cultural notations.

The characters in this row include circled numerals (① through ⑳), Roman numerals (Ⅰ through Ⅹ), and square-form unit symbols derived from katakana or Roman letters (e.g., ㌔ for kilo and ㍍ for meter). Additional entries include mathematical symbols such as the summation sign (∑), enclosed forms such as parenthesized and circled ideographs, and Japanese-specific marks for emphasis or annotation. Notably, the row includes ligatured forms of era names: ㍾ for Meiji (明治) at position 13-77, ㍽ for Taishō (大正) at 13-78, ㍼ for Shōwa (昭和) at 13-79, and ㍻ for Heisei (平成) at 13-63. Each combines two kanji into a single square character for use in dates and documents.

Although not incorporated into the core JIS X 0208 specification, these characters gained acceptance in certain implementations, including Microsoft's Code Page 932 (also known as Windows-31J), where they are encoded with Shift_JIS lead byte 0x87 followed by specific second bytes. In Unicode mappings, they are assigned to ordinary code points rather than the Private Use Area, which facilitates round-trip conversions; for instance, the era name ligatures reside in the CJK Compatibility block (U+3300–U+33FF), while the circled numerals fall in Enclosed Alphanumerics (U+2460–U+24FF). This compatibility preserved their utility in legacy systems but introduced challenges like duplicate representations when converting to standardized encodings.

Originally vital for early Japanese PC ecosystems, the NEC extensions have become largely obsolete with the advent of JIS X 0213 in 2000, which reallocated some symbols to official positions while deprecating others, and the widespread shift to Unicode, which prioritizes unified character representations over vendor-specific variants.
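As a rough illustration of the Code Page 932 mapping described above, the following Python sketch decodes a few row 13 byte pairs with the standard library's `cp932` codec. The helper name `decode_nec` and the choice of sample bytes are assumptions made for this example; the byte values follow from the lead byte 0x87 and the row 13 positions given in the text.

```python
# Sketch: decode NEC row 13 characters as encoded in Code Page 932.
# decode_nec is a hypothetical helper for this example only.

def decode_nec(raw: bytes) -> str:
    """Decode one two-byte CP932 sequence into a Unicode character."""
    return raw.decode("cp932")

samples = {
    "circled digit one (13-01)": b"\x87\x40",
    "Heisei ligature (13-63)": b"\x87\x7e",
    "Meiji ligature (13-77)": b"\x87\x8d",
}

for label, raw in samples.items():
    ch = decode_nec(raw)
    print(f"{label}: {raw.hex()} -> U+{ord(ch):04X} {ch}")
```

A strict JIS X 0208 decoder has no assignment for row 13, so the same bytes may be rejected by codecs that do not include vendor extensions.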

Kanji-Specific Features

Overview of Kanji Coverage

JIS X 0208 defines a repertoire of 6,355 kanji, partitioned into two levels based on frequency of occurrence in contemporary Japanese writing: level 1 contains 2,965 commonly used kanji suitable for general text processing, while level 2 contains 3,390 less frequent kanji that remain relevant in specialized contexts. This division prioritizes efficient encoding for practical applications, with level 1 covering the majority of typical documents according to usage surveys conducted during the standard's development.

The selection draws from established lists, including all 1,945 jōyō kanji (from the 1981 list) designated for everyday educational and literary use by the Japanese Ministry of Education, the 166 jinmeiyō kanji (pre-1990) approved by the government for personal names, and supplementary characters chosen from empirical data on frequency in newspapers, technical literature, and administrative records. These sources ensure comprehensive support for standard written Japanese while incorporating characters vital for proper nouns and professional terminology. While JIS X 0208 includes all jōyō kanji from the 1981 list, some kanji from the 2010 expansion (including 塡, 剝, and 頰) are not encoded, so later standards such as JIS X 0213 are required for full current coverage.

While providing robust coverage for routine and formal Japanese expression, JIS X 0208 omits uncommon and archaic kanji, which are instead accommodated in the auxiliary standard JIS X 0212, containing 5,801 additional kanji for needs such as historical texts or specialized fields. This focused scope aligns with the standard's goal of meeting immediate information interchange requirements without overburdening early computing resources. The kanji occupy a subset of the overall 94×94 double-byte code grid: 69 rows (16 through 84), with the trailing positions of rows 47 and 84 left unassigned.

Level Partitioning and Arrangement

JIS X 0208 partitions its kanji into two levels to prioritize commonly used characters. Level 1 comprises 2,965 high-frequency kanji, covering the everyday vocabulary essential for general text. These occupy the first part of the kanji area of the 94×94 code grid, rows 16 through 47, which provide 3,008 positions; the positions left over at the end of row 47 are unassigned. Level 2 includes 3,390 supplementary kanji, focused on less frequent usages like personal and place names, technical terminology, and specialized expressions, positioned in rows 48 through 84. Both levels use the same two-byte representation. The partition instead made it possible for early systems to implement only the non-kanji rows and level 1, although such partial implementations were never specified as compatible.

The partition criteria drew on surveys and official lists from Japan's Ministry of Education, incorporating the jōyō kanji (regular-use characters taught in schools, 1,945 in the 1981 list), the jinmeiyō kanji (for names, 166 at the time), and additional selections from frequency analyses of newspapers, literature, and administrative documents. Level 1 emphasizes characters that appear in the majority of typical Japanese text, while level 2 extends to rarer but necessary characters, with unification rules preventing duplicate entries. Subsequent revisions refined the assignment: the 1983 update adjusted glyph forms, exchanged some variant pairs between the levels, and added a few kanji, and the 1990 revision added 2 kanji to level 2.

Within the levels, kanji are addressed by the ku-ten (区点) system, which gives the row (ku, 1-94) and column (ten, 1-94) coordinates in the grid. Level 1 kanji are ordered by representative reading, mainly the on'yomi (Chinese-derived pronunciation), so the level begins with 亜 (a) at 16-01. In contrast, level 2 kanji follow the traditional radical-stroke order, grouped by the 214 Kangxi radicals (row 48 begins with radical 1, 一), then by residual stroke count within each radical, and finally by reading where stroke counts are equal; 薔, for example, is filed under radical 140 (艸). This dual arrangement balances ease of lookup for frequent characters with dictionary-like organization for supplementary ones, supporting applications from text input to printing.

Sources and Unknown Kanji

The kanji in JIS X 0208 were primarily sourced from the official kanji lists: the 1,850 tōyō kanji, established in 1946 by Japan's Ministry of Education based on usage surveys in education, government, and media, and their successor, the 1,945 jōyō kanji of 1981, which reflected contemporary needs while maintaining compatibility with the earlier list. Supplementary kanji beyond these lists were drawn from comprehensive dictionaries such as the Daikanwa Jiten and from usage surveys conducted by organizations like the National Language Research Institute (NLRI), which analyzed frequency in printed materials to ensure coverage of less common but relevant forms for administrative and technical applications.

The inclusion process spanned the 1970s to the 1990s and involved decisions by the Japanese Industrial Standards Committee, which compiled data from multiple contributors, including the Information Processing Society of Japan (which listed 6,086 kanji in 1971), the Administrative Management Agency (which identified 2,817 kanji in administrative use in 1975), and a list of names from Nippon Life Insurance that supplied practical usage examples. These efforts relied on frequency data derived from corpora, such as the NLRI's 1970 survey of compound words and a corpus analyzed by NTT, prioritizing characters that appeared in modern texts while accommodating historical and specialized needs, for a total of 6,355 kanji by the 1990 revision.

Among the included kanji, approximately 60 have questionable origins or represent non-standard forms, often incorporated for compatibility with legacy systems or to cover edge cases in data interchange, though many stem from transcription errors during compilation.
A subset of 12 of these are known as "ghost characters" (yūrei moji), erroneous kanji with no verifiable historical origins, resulting from misreadings of handwritten sources, ink blots, or degraded photocopies during the 1970s compilation; examples include 彁 (intended as a variant but untraceable) and 妛 (a fabrication without attestation in classical texts). For legitimately obscure but attested kanji, such as 龠 (denoting an ancient bamboo flute), inclusion was justified by references in historical texts like the Shijing, ensuring support for scholarly and cultural applications despite low modern frequency. The 1997 revision scrutinized these characters, confirming sources where possible or noting discrepancies to refine the standard's integrity.

Variant Unification and Compatibility Criteria

JIS X 0208 adopts a unification policy that merges minor design variants of a kanji, such as differences arising from typeface design, handwriting styles, or small component changes, into a single code point when the forms are semantically equivalent and visually similar enough to represent the same abstract character. This approach limits the total number of encoded characters by treating such variations as non-distinct and prioritizing a standardized form for common usage in Japanese text.

The criteria for unification emphasize similarity of glyph shape, frequency of usage in modern Japanese, and the absence of any semantic or contextual difference that would warrant separate encoding. Approximately 186 such unification rules were applied during the standard's development and revisions, drawing from sources like the jōyō kanji list and historical dictionaries to resolve variants.

Unification does not extend to forms that differ substantially. Many shinjitai (simplified) and kyūjitai (traditional) pairs have distinct code points, which preserves round-trip conversion with legacy data and pre-reform texts. For instance, 国 (country) and 学 (learn) are encoded in level 1 in their shinjitai forms, while the traditional forms 國 and 學 have separate positions in level 2.

Encoding Schemes

Standard Encoding Methods in JIS X 0208

JIS X 0208 defines a double-byte encoding scheme for its character repertoire: a fixed-width, two-byte representation. Each character is assigned a unique pair of bytes drawn from a 94×94 matrix, in which the lead byte and the trail byte each take one of 94 values. In the 7-bit form, both bytes range from 0x21 to 0x7E. In the 8-bit form, 0x80 is added to each byte, giving the range 0xA1 to 0xFE, which keeps double-byte characters distinct from ASCII and from control characters in 8-bit byte streams.

The standard excludes control bytes (0x00–0x1F and 0x7F) from both byte positions, confining them to the printable range 0x21–0x7E in the 7-bit form or 0xA1–0xFE in the 8-bit form, so that no double-byte sequence can be misinterpreted as a control sequence. The 7-bit form packs the 94×94 = 8,836 positions into two consecutive 7-bit bytes without high bits set, forming the basis for interchange over 7-bit channels, though it requires the set to be designated (for example, by an ISO 2022 escape sequence) before use.

For Unix-like systems, JIS X 0208:1997 specifies an EUC-JP-like packing method as a standard 8-bit encoding variant, in which the lead and trail bytes both lie in 0xA1–0xFE, enabling storage and transmission of the full set without escape mechanisms. This maps directly to the Extended Unix Code (EUC) format, with JIS X 0208 occupying EUC code set 1, and supports the standard's total of 6,879 graphic characters.
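The arithmetic relating grid positions to the 7-bit and 8-bit forms can be sketched in Python as follows. The function names `kuten_to_jis` and `jis_to_8bit` are assumptions made for this illustration, and the final decode relies on the standard library's `euc_jp` codec treating the 8-bit form as EUC-JP code set 1.

```python
# Sketch: row/cell (ku-ten) position -> 7-bit JIS bytes -> 8-bit (EUC-JP) bytes.
# Function names are hypothetical, chosen for this example.

def kuten_to_jis(ku: int, ten: int) -> bytes:
    """Return the 7-bit byte pair for a position in the 94x94 grid."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([ku + 0x20, ten + 0x20])  # 0x21..0x7E

def jis_to_8bit(jis: bytes) -> bytes:
    """Set the high bit of each byte (equivalent to adding 0x80)."""
    return bytes(b | 0x80 for b in jis)    # 0xA1..0xFE

jis = kuten_to_jis(16, 1)      # first level 1 kanji
eight_bit = jis_to_8bit(jis)
print(jis.hex(), eight_bit.hex(), eight_bit.decode("euc_jp"))
```

Because the two forms differ only in the high bit, conversion in either direction is a per-byte operation with no lookup table.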

ISO 2022 Escape Sequences

JIS X 0208 is integrated into the ISO/IEC 2022 framework through specific escape sequences that designate its character sets for use in 7-bit or 8-bit environments, enabling dynamic switching between ASCII and Japanese graphic characters. The standard employs the Escape character (ESC, 0x1B) followed by intermediate and final bytes to designate the 94×94 matrix containing kanji, hiragana, and katakana. These sequences support both the original 1978 version and subsequent revisions, with the final byte identifying the particular edition of the set.

For the 1978 version of JIS X 0208 (originally JIS C 6226), the designation sequence is ESC $ @ (0x1B 0x24 0x40), which assigns the full 94×94 character set to the G0 graphic set in ISO 2022. This set is registered as ISO-IR 42 in the International Register of Coded Character Sets. The 1983 revision uses ESC $ B (0x1B 0x24 0x42) and is registered as ISO-IR 87; this remains the most commonly used sequence for JIS X 0208 in ISO-2022-JP encodings. Both sequences designate the entire plane, including hiragana in row 4 and full-width katakana in row 5, which need no separate designation within double-byte mode.

The 1990 revision added two characters and is registered as ISO-IR 168. It uses the same ESC $ B designation, which may be preceded by the revision announcement ESC & @. The 1997 revision updated character references and glyph descriptions but preserved the code points, escape sequences, and registration under ISO-IR 168.

The structure follows ISO 2022's format: ESC followed by one or more intermediate bytes (e.g., $) and a final byte (e.g., B) to specify the 94×94 set, without altering the underlying byte layout of JIS X 0208. When the set is designated to G0, as in ISO-2022-JP, subsequent bytes are interpreted as double-byte JIS X 0208 codes until a sequence such as ESC ( B reassigns G0 to ASCII (ISO-IR 6). In profiles that designate the set to G1 instead, locking shifts such as Shift Out (SO, 0x0E) and Shift In (SI, 0x0F) switch between double-byte and single-byte mode. Full-width katakana are part of the main JIS X 0208 designation, while half-width katakana from JIS X 0201 may be designated separately via ESC ( I, though this is outside the core JIS X 0208 sequences.

These mechanisms ensure JIS X 0208's compatibility in protocols like email, where lines must end in ASCII mode to avoid rendering issues.
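The designation and revert sequences can be observed with Python's standard `iso2022_jp` codec. The sketch below wraps raw 7-bit JIS X 0208 byte pairs by hand and compares the result with the codec's output; the constant names and the helper `wrap_jisx0208` are assumptions made for this example.

```python
# Sketch: framing 7-bit JIS X 0208 byte pairs with ISO 2022 designations.
# Constant and function names are hypothetical.

ESC = b"\x1b"
TO_JISX0208_1983 = ESC + b"$B"   # ESC $ B, designate JIS X 0208 to G0
TO_ASCII = ESC + b"(B"           # ESC ( B, designate ASCII to G0

def wrap_jisx0208(pairs: bytes) -> bytes:
    """Frame 7-bit JIS byte pairs so the stream starts and ends in ASCII."""
    if len(pairs) % 2:
        raise ValueError("expected an even number of bytes")
    return TO_JISX0208_1983 + pairs + TO_ASCII

manual = b"abc" + wrap_jisx0208(b"\x46\x7c\x4b\x5c\x38\x6c")  # 日本語
via_codec = "abc日本語".encode("iso2022_jp")
print(manual == via_codec, via_codec)
```

The codec returns to ASCII at the end of the string, which matches the requirement that text end in ASCII mode.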

Integration with ASCII and JIS X 0201

JIS X 0208 includes duplicate encodings of characters from ASCII and JIS X 0201 to support consistent rendering in mixed Japanese and Latin text, where full-width forms are preferred for typographic alignment. In particular, the third row (ku-ten row 03) of the JIS X 0208 code table contains full-width equivalents of the Latin uppercase and lowercase letters (A–Z, a–z) and the digits (0–9), so these characters can be represented in the double-byte JIS X 0208 space without a mode switch, though single-byte alternatives exist for efficiency. Full-width katakana in row 5 likewise duplicate the half-width katakana of JIS X 0201.

The primary integration occurs through the ISO 2022 framework, which enables dynamic switching between character sets in a single data stream. To designate ASCII, the escape sequence ESC ( B is used; the Roman set of JIS X 0201 is designated by ESC ( J, and the JIS X 0201 half-width katakana set by ESC ( I. These single-byte modes allow efficient encoding of ASCII-compatible text and katakana without entering the double-byte JIS X 0208 mode, which is designated by ESC $ B. JIS X 0201 Roman matches ASCII in all but two of the 94 printable characters (the yen sign replaces the backslash and the overline replaces the tilde), so nearly all Latin text can be handled in a lightweight single-byte mode, minimizing overhead in transmissions like email.

Key differences arise in form and byte usage: JIS X 0208's duplicates are full-width (double-byte, occupying the same width as a kanji for visual balance), in contrast to the half-width (single-byte) variants in JIS X 0201, which prioritize compactness for early computing constraints. JIS X 0201 further extends ASCII with a dedicated set of 63 half-width katakana characters, absent from standard 7-bit ASCII, to support phonetic Japanese without kanji.

The purpose of these overlaps is to reduce the need for frequent escape sequence insertions during text processing, promoting interoperability in protocols like ISO-2022-JP while maintaining compatibility with 7-bit networks.
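In row 3 the cell byte of each full-width letter or digit equals the ASCII code of its single-byte counterpart, which the following Python sketch uses to build the double-byte form. The helper name `fullwidth_jis` is an assumption made for this example, and the check against Unicode relies on the standard library's `euc_jp` codec and NFKC normalization.

```python
# Sketch: full-width (row 3) duplicate of an ASCII letter or digit.
# fullwidth_jis is a hypothetical helper for this example.
import unicodedata

def fullwidth_jis(ch: str) -> bytes:
    """Return the 7-bit JIS X 0208 byte pair for the full-width form of ch."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalnum()):
        raise ValueError("expected one ASCII letter or digit")
    return bytes([0x23, ord(ch)])          # row 3 lead byte is 0x23

jis = fullwidth_jis("A")
wide = bytes(b | 0x80 for b in jis).decode("euc_jp")
print(jis.hex(), wide, unicodedata.normalize("NFKC", wide))
```

NFKC normalization folds the full-width form back to the ASCII character, which is one way converters collapse the duplicate representations.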

Practical Encoding Variations and Comparisons

Shift-JIS, a variant developed by ASCII Corporation and Microsoft, encodes JIS X 0208 characters using a variable-width scheme in which single-byte characters (0x00–0x7F) are supported directly and double-byte sequences use lead bytes in the ranges 0x81–0x9F or 0xE0–0xFC followed by trail bytes in 0x40–0x7E or 0x80–0xFC. The mapping shifts the JIS X 0208 row and cell values to fit these byte ranges. Round-trip conversion is not guaranteed for every code point, because extensions like CP932 add vendor-specific characters outside the standard JIS set, so that different converters may map multiple sources to the same byte sequence or disagree on some characters.

In contrast, EUC-JP, the standard encoding for Unix systems, maps JIS X 0208 characters directly to double-byte sequences with both lead and trail bytes in the range 0xA1–0xFE, corresponding one-to-one with the JIS rows and cells (0xA0 is added to the row and cell numbers). This results in a more uniform structure, with ASCII handled as single bytes (0x00–0x7F) and optional support for JIS X 0212 via three-byte sequences prefixed by 0x8F; JIS X 0208 coverage remains fully invertible without extensions.

Both encodings provide complete coverage of JIS X 0208's 6,355 kanji and associated characters, and both use one byte for ASCII and two bytes for JIS X 0208 characters. They differ mainly for half-width katakana, which take one byte in Shift-JIS (0xA1–0xDF) but two bytes in EUC-JP (prefix 0x8E), and in ease of parsing: EUC-JP multibyte sequences never contain bytes in the ASCII range, whereas Shift-JIS trail bytes can coincide with ASCII characters such as the backslash (0x5C). Neither encoding has byte-order variants; both are defined as byte streams with the lead byte first.

| Aspect | Shift-JIS | EUC-JP |
| --- | --- | --- |
| Coverage | Full + extensions | Full + optional JIS X 0212 |
| Byte efficiency | 1 byte for ASCII and half-width katakana, 2 bytes for JIS X 0208 | 1 byte for ASCII, 2 bytes for JIS X 0208 and half-width katakana |
| Lead bytes | 0x81–0x9F, 0xE0–0xFC | 0xA1–0xFE |
| Invertibility | Partial (due to extensions) | Full for standard characters |

Shift-JIS became the dominant encoding on Windows and other early personal computer platforms, while EUC-JP dominated Unix and Unix-like environments, including server-side applications. Following the 1997 revision of JIS X 0208 and the rise of Unicode, both have been treated as legacy encodings, though they persist in older software and files for compatibility.
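The two byte mappings can be compared side by side in a short Python sketch. The function names are assumptions made for this illustration, the Shift-JIS arithmetic is one common formulation of the row/cell shift rather than text quoted from the standard, and the results are checked against the standard library's `shift_jis` and `euc_jp` codecs.

```python
# Sketch: map a JIS X 0208 row/cell (ku-ten) position to Shift-JIS and EUC-JP.
# Function names are hypothetical; vendor extensions are not handled.

def kuten_to_sjis(ku: int, ten: int) -> bytes:
    """Two rows share one Shift-JIS lead byte; odd rows take the low trail range."""
    lead = ((ku - 1) >> 1) + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        trail = ten + 0x3F + (1 if ten >= 64 else 0)  # 0x40..0x7E, 0x80..0x9E
    else:
        trail = ten + 0x9E                             # 0x9F..0xFC
    return bytes([lead, trail])

def kuten_to_euc(ku: int, ten: int) -> bytes:
    """EUC-JP adds 0xA0 to both the row and the cell number."""
    return bytes([ku + 0xA0, ten + 0xA0])

for ch, (ku, ten) in {"亜": (16, 1), "日": (38, 92), "本": (43, 60)}.items():
    sjis, euc = kuten_to_sjis(ku, ten), kuten_to_euc(ku, ten)
    print(ch, sjis.hex(), euc.hex(),
          sjis == ch.encode("shift_jis"), euc == ch.encode("euc_jp"))
```

The skipped trail byte 0x7F in odd rows is what makes the Shift-JIS mapping piecewise, whereas the EUC-JP mapping is a single addition.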

Historical Development

Initial Standard (1978)

The initial standard for Japanese character encoding, designated JIS C 6226-1978 and titled "Code of the Japanese Graphic Character Set for Information Interchange," was published by the Japanese Industrial Standards Committee on January 1, 1978. This standard addressed the growing demand for computerized processing of Japanese text during the 1970s computing boom in Japan, where earlier single-byte codes like JIS X 0201 proved inadequate for handling the thousands of characters essential to the Japanese writing system. Development began in 1969 under the Information Processing Society of Japan's Standards Committee, involving collaboration with government agencies such as the Administrative Management Agency and with linguists, and built on a provisional 1971 table of 6,086 characters to create a unified set suitable for information interchange in government, business, and education.

The standard introduced a 94×94 double-byte grid structure, allowing up to 8,836 positions for graphic characters, with each position identified by a row (ku) and column (ten), a notation known as the kuten code. It encompassed a total of 6,802 characters: 6,349 kanji, divided into level 1 (2,965 frequently used kanji, ordered by phonetic reading) and level 2 (3,384 less common kanji, ordered by radical and stroke count), plus 453 non-kanji characters such as hiragana, katakana, Roman letters, and punctuation. Among the kanji, it incorporated all 1,850 tōyō kanji from the 1946 official list, along with additional characters for names, places, and technical terms to support practical text processing.

Despite its innovations, the 1978 standard had notable limitations. It relied on pre-1981 kanji inventories such as the 1946 tōyō kanji list and earlier provisional tables, and it omitted some characters that later became standard for modern usage, such as certain kanji added in subsequent revisions. Early inclusion errors, including "ghost characters" without verified historical sources, highlighted the difficulty of verifying so large a repertoire, and the fixed grid size constrained expansion without re-encoding. These issues were incrementally addressed in later revisions, such as those of 1983 and 1990, which refined the set for better compatibility and coverage.

Revisions (1983, 1987, 1990)

The second edition of the standard, published in 1983 as JIS C 6226-1983, added 75 characters: 39 special characters, 32 box-drawing characters, and 4 kanji, increasing the total number of graphic characters to 6,877. It also changed the forms of approximately 200 characters to align with the 1981 jōyō kanji list and made minor corrections to glyph shapes for improved consistency and readability, addressing issues identified in early implementations.

The 1987 update renamed the standard from JIS C 6226 to JIS X 0208, effective March 1, 1987, with no changes to the character repertoire, which remained at 6,877 graphic characters.

The 1990 revision of JIS X 0208 added 2 kanji (disunified variants), resulting in a total of 6,879 graphic characters. This version refined the handling of variant characters and was coordinated with emerging international standards, including early drafts of ISO/IEC 10646, to facilitate better cross-platform compatibility.

These revisions collectively responded to evolving needs in Japanese character standardization, balancing updates to the official kanji lists with efforts toward global interoperability while maintaining backward compatibility with prior editions.

Final Revision (1997) and Successors

The 1997 edition of JIS X 0208 was the fourth edition and the final major revision of the standard. This version focused on clarifying the unification of variant character forms, including forms that had been split or altered in prior editions such as the 1983 revision, and on appending Shift_JIS as an official encoding method for compatibility. The character repertoire totaled 6,879 graphic characters, comprising 6,355 kanji (level 1 with 2,965 characters and level 2 with 3,390 characters) and 524 non-kanji characters such as hiragana, katakana, symbols, and punctuation; no new characters were added. No substantive changes to the character set have occurred since 1997, making it the definitive edition as priorities shifted toward expanded standards.

Related standards addressed limitations in JIS X 0208 by providing supplementary coverage. JIS X 0212, established in 1990, introduced a separate 94×94 plane dedicated to rare and supplementary characters, comprising 5,801 kanji and 266 non-kanji characters absent from the primary set, primarily for specialized or historical texts.

JIS X 0213, released in 2000 and amended in 2004, is the successor. It extends JIS X 0208 into a two-plane structure while ensuring backward compatibility. It incorporates the full JIS X 0208 repertoire, takes over 2,743 kanji that also appear in JIS X 0212, adds 952 new kanji across levels 3 and 4, and includes numerous non-kanji additions such as accented Roman letters, characters for Ainu orthography, and obsolete symbols, for a total of 10,040 kanji in the 2000 edition (the 2004 amendment refined the glyphs of 168 kanji and added 10 more). This transition reflects evolving demands for comprehensive Japanese representation in digital environments, positioning JIS X 0213 as the preferred standard for new implementations while JIS X 0208 persists in legacy contexts.

Implementations and Relations

Software and Hardware Implementations

JIS X 0208 has been implemented in various software environments to handle Japanese text processing. In GNU Emacs, the charset japanese-jisx0208 receives high priority in Japanese language environments, enabling font selection and display for JIS X 0208 characters. Web browsers support JIS X 0208 through encodings like Shift_JIS, allowing legacy web pages to be rendered by decoders that handle multi-byte sequences. In databases, MySQL's sjis and cp932 character sets incorporate JIS X 0208 alongside JIS X 0201, supporting storage and querying of Japanese data in legacy applications. Microsoft's Windows code page 932, an extension of Shift_JIS, relies on JIS X 0208 mappings for core Japanese characters, though it includes vendor-specific extensions that diverge from the standard.

Hardware implementations of JIS X 0208 emerged in early Japanese computing systems. The NEC PC-9800 series featured built-in character ROMs to display JIS X 0208 kanji on screens and peripherals. Unix terminals adopted EUC-JP as an encoding for JIS X 0208, enabling multi-byte character output in environments like DIGITAL UNIX, where each JIS X 0208 code is represented by two bytes with the most significant bit set. Printers, such as POS models, directly support JIS X 0208 code pages for printing Japanese text, including kanji from the 94×94 grid.

Implementing JIS X 0208 presents challenges, particularly in font rendering. Glyph forms differ across revisions (e.g., 1978 vs. 1990) while code points are shared, so a system that expects the forms of one edition may display text prepared under another inconsistently. Conversion tools like iconv address encoding differences by supporting transformations between JIS X 0208-based encodings and modern ones such as UTF-8, with Solaris and other implementations handling JIS X 0208 alongside extensions like JIS X 0212.

As of 2025, JIS X 0208 usage is declining in favor of Unicode, but it remains essential for migrating legacy Japanese data in files and archives, where it serves as a component of encodings like EUC-JP and Shift_JIS.
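A migration of legacy data along the lines described above might look like the following Python sketch, which transcodes bytes from a JIS X 0208-based encoding to UTF-8 with the standard library codecs. The function name `transcode` and its default arguments are assumptions made for this example; real migrations usually also need a policy for undecodable bytes and vendor extensions.

```python
# Sketch: convert legacy EUC-JP or Shift_JIS bytes to UTF-8.
# transcode is a hypothetical helper for this example.

def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode with the legacy codec, then re-encode; raises on invalid input."""
    return data.decode(source).encode(target)

legacy_euc = b"\xc6\xfc\xcb\xdc"    # 日本 in EUC-JP
legacy_sjis = b"\x93\xfa\x96\x7b"   # 日本 in Shift_JIS

print(transcode(legacy_euc, "euc_jp").decode("utf-8"))
print(transcode(legacy_sjis, "shift_jis") == transcode(legacy_euc, "euc_jp"))
```

Decoding strictly, as here, surfaces bytes that fall outside the declared encoding instead of silently replacing them.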

Relations to Other Japanese Standards

JIS X 0201 serves as a single-byte complement to JIS X 0208, providing 7-bit and 8-bit encodings for basic Latin characters (nearly equivalent to ASCII) and half-width katakana, both of which are also available in full-width double-byte forms within JIS X 0208. This allows efficient mixing in multi-byte encodings, where JIS X 0201 handles Roman characters and katakana using single bytes, while JIS X 0208 covers kanji and full-width characters with two bytes. In practice, EUC-JP places ASCII in code set 0 (single bytes), JIS X 0208 in code set 1 (two bytes), half-width katakana in code set 2 (prefix 0x8E), and JIS X 0212 in code set 3 (prefix 0x8F), while ISO-2022-JP switches between the sets with escape sequences.

JIS X 0212 functions as an orthogonal supplement to JIS X 0208, defining a separate 94×94 grid with 6,067 characters, including 5,801 rare kanji not covered in the primary set. Unlike JIS X 0208, which focuses on commonly used characters, JIS X 0212 targets supplementary ideographs for specialized applications, with minimal overlap: only one character duplicates directly from JIS X 0208. It has its own ISO/IEC 2022 escape sequence to designate its plane, allowing independent use without conflicting with JIS X 0208's structure, as seen in extended encodings like EUC-JP, where three-byte sequences access JIS X 0212.

JIS X 0213 expands upon JIS X 0208 by incorporating all 6,879 characters from the latter into its first plane while adding 4,344 new characters across two 94×94 planes, resulting in a total of 11,223 characters, including 10,040 kanji. This extension fills unassigned rows and positions of the original grid and adds a second plane for modern and historical characters, such as those needed for legal names, and it includes 2,743 kanji that also appear in JIS X 0212. Designed for backward compatibility, JIS X 0213 retains the full repertoire of JIS X 0208, enabling upgraded systems to process legacy data without loss.

Each of these standards is registered with its own ISO/IEC 2022 escape sequences, which allows their combined use in protocols like email and web content, where the stream switches dynamically between single-byte (JIS X 0201), core double-byte (JIS X 0208), supplementary (JIS X 0212), and extended (JIS X 0213) sets. JIS X 0213's compatibility ensures that content encoded in JIS X 0208 remains fully representable, while the orthogonal nature of JIS X 0212 prevents encoding conflicts in multi-standard environments.

Mapping to International Standards like Unicode

JIS X 0208 includes full-width counterparts of the graphic characters of ISO 646: the digits and Latin letters in row 3 (codes 0x2330 to 0x237A in the 7-bit form) and punctuation and symbols in rows 1 and 2. This provides overlap with ASCII for basic Latin text while extending the 7-bit framework into a double-byte structure that accommodates Japanese characters in additional rows. Minor differences from ASCII reflect Japanese conventions; for example, row 1 contains both a full-width backslash and a full-width yen sign (¥).

The character repertoire of JIS X 0208 is fully integrated into ISO 10646 (the basis for Unicode), with most code points assigned in the range U+3000 to U+9FFF, in blocks such as CJK Symbols and Punctuation (U+3000–U+303F), Hiragana (U+3040–U+309F), Katakana (U+30A0–U+30FF), and CJK Unified Ideographs (U+4E00–U+9FFF). The standard defines 6,879 graphic characters in total, including 6,355 kanji, which are unified with ideographs from other East Asian standards under the Han unification process to minimize duplication in Unicode.

Mapping challenges arise mainly from characters outside the core standard and from symbols with more than one plausible Unicode counterpart. Variant ideographs that Unicode keeps apart only for round-trip purposes are encoded in the CJK Compatibility Ideographs block (U+F900–U+FAFF), which is used by vendor extensions and later standards rather than by the JIS X 0208 kanji themselves. Such compatibility characters preserve legacy implementations, for example Shift_JIS variants, where exact glyph matching is required for display fidelity.

The alignment of JIS X 0208 with international standards evolved through collaborative efforts around 1990, culminating in its incorporation into ISO 10646 and Unicode to support global text interchange. JIS X 0208 served as one of the source standards for the earliest versions of Unicode, which guaranteed round-trip conversion for its kana, symbols, and kanji. Subsequent revisions refined these mappings for completeness and stability, ensuring that JIS X 0208 serves as a reliable bridge between Japanese legacy systems and modern Unicode-based applications.
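The block assignments listed above can be spot-checked with Python's standard `euc_jp` codec, which embeds a JIS X 0208 to Unicode mapping table. The helper name `kuten_to_codepoint` and the sample positions are assumptions made for this illustration; other conversion tables may differ for a small number of symbols.

```python
# Sketch: look up the Unicode code point for a JIS X 0208 row/cell position.
# kuten_to_codepoint is a hypothetical helper for this example.

def kuten_to_codepoint(ku: int, ten: int) -> int:
    """Decode the 8-bit (EUC-JP) form of a position and return its code point."""
    return ord(bytes([ku + 0xA0, ten + 0xA0]).decode("euc_jp"))

samples = {
    "ideographic space (01-01)": (1, 1),
    "hiragana a (04-02)": (4, 2),
    "katakana a (05-02)": (5, 2),
    "first level 1 kanji (16-01)": (16, 1),
}

for label, (ku, ten) in samples.items():
    print(f"{label}: U+{kuten_to_codepoint(ku, ten):04X}")
```

Each sample lands in one of the blocks named in the text: CJK Symbols and Punctuation, Hiragana, Katakana, and CJK Unified Ideographs.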
