Code page
from Wikipedia

In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte. (In some contexts these terms are used more precisely; see Character encoding § Terminology.)

The term "code page" originated from IBM's EBCDIC-based mainframe systems,[1] but Microsoft, SAP,[2] and Oracle Corporation[3] are among the vendors that use this term. The majority of vendors identify their own character sets by a name. In the case when there is a plethora of character sets (like in IBM), identifying character sets through a number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual,[4][5][6] a condition which has not held for a long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
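As a small illustration of vendor-specific numbering, the sketch below records the three UTF-8 assignments quoted above and resolves a (vendor, number) pair to an encoding name. The dictionary and the helper `encoding_for` are hypothetical names introduced only for this example; no vendor API is implied.

```python
from typing import Optional

# Vendor-specific code page numbers for one and the same encoding (UTF-8),
# taken from the paragraph above.
UTF8_CODE_PAGE = {
    "IBM": 1208,
    "Microsoft": 65001,
    "SAP": 4110,
}

def encoding_for(vendor: str, number: int) -> Optional[str]:
    """Return 'utf-8' if `number` is that vendor's UTF-8 code page, else None."""
    if UTF8_CODE_PAGE.get(vendor) == number:
        return "utf-8"
    return None

print(encoding_for("IBM", 1208))        # utf-8
print(encoding_for("Microsoft", 1208))  # None: 1208 is IBM's number, not Microsoft's
```

The point of the lookup being keyed by vendor is that a bare number is ambiguous unless the numbering authority is known.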

Hewlett-Packard uses a similar concept in its HP-UX operating system and its Printer Command Language[7] (PCL) protocol for printers (either for HP printers or not). The terminology, however, is different: What others call a character set, HP calls a symbol set, and what IBM or Microsoft call a code page, HP calls a symbol set code. HP developed a series of symbol sets,[8][9] each with an associated symbol set code, to encode both its own character sets and other vendors’ character sets.

The multitude of character sets leads many vendors to recommend Unicode.

The code page numbering system

IBM introduced the concept of systematically assigning a small but globally unique 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.

With the release of PC DOS version 3.3 (and the near identical MS-DOS 3.3) IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way.[10]

[Figure: IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft's use of code page numbers for CJK encodings differs and is noted in brackets where applicable.]

After IBM and Microsoft ceased to cooperate in the 1990s, the two companies have maintained the list of assigned code page numbers independently from each other, resulting in some conflicting assignments. At least one third-party vendor (Oracle) also has its own different list of numeric assignments.[3] IBM's current assignments are listed in their CCSID repository, while Microsoft's assignments are documented within the MSDN.[11] Additionally, a list of the names and approximate IANA (Internet Assigned Numbers Authority) abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer).

Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code-points into eight bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.
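A single-byte code page of this kind amounts to a 256-entry lookup table. The sketch below builds such a table for code page 437 using Python's built-in `cp437` codec (assumed here to be a faithful rendering of that code page) and decodes by plain table lookup; `decode_with_table` is a hypothetical helper name.

```python
# Build the complete byte -> character table for code page 437.
# Python's "cp437" codec assigns a character to every one of the 256 byte values.
table = [bytes([b]).decode("cp437") for b in range(256)]

# Each byte maps to exactly one character: no combining sequences,
# no multi-byte states.
assert len(table) == 256
assert all(len(ch) == 1 for ch in table)

def decode_with_table(data: bytes) -> str:
    """Decode single-byte text by looking each byte up in the table."""
    return "".join(table[b] for b in data)

print(decode_with_table(b"caf\x82"))  # café (0x82 is 'é' in code page 437)
```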

The text mode of standard (VGA-compatible) PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to eight may be stored in the display adapter for easy switching.[12] There was a selection of third-party code page fonts that could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. Even so, the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages.

Relationship to ASCII

The majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these "extended ASCII character sets" and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
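The superset relationship can be checked directly. The following sketch, which assumes Python's built-in codecs as stand-ins for the code pages named, confirms that several code pages agree with ASCII on bytes 0x00 to 0x7F, shows that a byte in the extended range is interpreted differently from one code page to the next, and shows that an EBCDIC code page is not an ASCII superset.

```python
# Bytes 0x00-0x7F: the 7-bit ASCII range.
ascii_range = bytes(range(0x80))

# Codec names are Python's; they are assumed to match the code pages of the same number.
ASCII_SUPERSETS = ["cp437", "cp850", "cp1252", "latin-1"]

for name in ASCII_SUPERSETS:
    # The lower half of each code page is identical to ASCII.
    assert ascii_range.decode(name) == ascii_range.decode("ascii")

# A byte with the top bit set depends on the code page in use.
high_byte = bytes([0xE9])
interpretations = {name: high_byte.decode(name) for name in ASCII_SUPERSETS}
print(interpretations)

# EBCDIC code pages (here code page 37) place letters and digits elsewhere.
ebcdic_is_ascii_superset = ascii_range.decode("cp037") == ascii_range.decode("ascii")
print(ebcdic_is_ascii_superset)  # False
```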

Relationship to Unicode

Unicode is an effort to include all characters from all currently and historically used human languages in a single character enumeration (effectively one large single code page), removing the need to distinguish between different code pages when handling digitally stored text. Unicode tries to retain backwards compatibility with many legacy code pages, copying some code pages 1:1 in the design process. An explicit design goal of Unicode was to allow round-trip conversion between all common legacy code pages, although this goal has not always been achieved. Some vendors, namely IBM and Microsoft, have anachronistically assigned code page numbers to Unicode encodings. This convention allows code page numbers to be used as metadata to identify the correct decoding algorithm when encountering binary stored data.
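Both ideas, the code page number as decoding metadata and round-trip conversion through Unicode, can be sketched in a few lines. The mapping from numbers to Python codec names is an assumption made for this example (65001 is the Microsoft number for UTF-8 mentioned earlier), and `decode_tagged` and `round_trips` are hypothetical helpers.

```python
# Code page number -> Python codec name (an illustrative, partial mapping).
CODE_PAGE_TO_CODEC = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    65001: "utf-8",   # Microsoft's number for UTF-8
}

def decode_tagged(data: bytes, code_page: int) -> str:
    """Decode stored bytes using the code page number kept alongside them."""
    return data.decode(CODE_PAGE_TO_CODEC[code_page])

def round_trips(code_page: int) -> bool:
    """Check that every assigned single byte survives bytes -> Unicode -> bytes."""
    codec = CODE_PAGE_TO_CODEC[code_page]
    for b in range(256):
        raw = bytes([b])
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # byte value not assigned (or not a complete sequence) here
        if text.encode(codec) != raw:
            return False
    return True

print(decode_tagged(b"\x82", 437))                  # é
print(decode_tagged("é".encode("utf-8"), 65001))    # é
print(round_trips(437), round_trips(1252))
```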

IBM code pages

EBCDIC-based code pages

These code pages are used by IBM in its EBCDIC character sets for mainframe computers.[13]

  • 1 – USA WP, Original
  • 2 – USA
  • 3 – USA Accounting, Version A
  • 4 – USA
  • 5 – USA
  • 6 – Latin America
  • 7 – Germany F.R. / Austria
  • 8 – Germany F.R.
  • 9 – France, Belgium
  • 10 – Canada (English)
  • 11 – Canada (French)
  • 12 – Italy
  • 13 – Netherlands
  • 14 – Spain
  • 15 – Switzerland (French)
  • 16 – Switzerland (French / German)
  • 17 – Switzerland (German)
  • 18 – Sweden / Finland
  • 19 – Sweden / Finland WP, version 2
  • 20 – Denmark/Norway
  • 21 – Brazil
  • 22 – Portugal
  • 23 – United Kingdom
  • 24 – United Kingdom
  • 25 – Japan (Latin)
  • 26 – Japan (Latin)
  • 27 – Greece (Latin)
  • 29 – Iceland
  • 30 – Turkey
  • 31 – South Africa
  • 32 – Czechoslovakia (Czech / Slovak)
  • 33 – Czechoslovakia
  • 34 – Czechoslovakia
  • 35 – Romania
  • 36 – Romania
  • 37 – USA/Canada - CECP (same with euro: 1140)
  • 37-2 – The real 3279 APL codepage, as used by C/370. This is very close to 1047, except for caret and not-sign inverted. It is not officially recognized by IBM, even though SHARE has pointed out its existence.[14]
  • 38 – USA ASCII
  • 39 – United Kingdom / Israel
  • 40 – United Kingdom
  • 251 – China
  • 252 – Poland
  • 254 – Hungary
  • 256 – International #1 (superseded by 500)
  • 257 – International #2
  • 258 – International #3
  • 259 – Symbols, Set 7
  • 260 – Canadian French - 116
  • 264 – Print Train & Text processing extended
  • 273 – Germany F.R./Austria - CECP (same with euro: 1141)
  • 274 – Old Belgium Code Page
  • 275 – Brazil - CECP
  • 276 – Canada (French) - 94
  • 277 – Denmark, Norway - CECP (same with euro: 1142)
  • 278 – Finland, Sweden - CECP (same with euro: 1143)
  • 279 – French - 94[14]
  • 280 – Italy - CECP (same with euro: 1144)
  • 281 – Japan (Latin) - CECP
  • 282 – Portugal - CECP
  • 283 – Spain - 190[14]
  • 284 – Spain/Latin America - CECP (same with euro: 1145)
  • 285 – United Kingdom - CECP (same with euro: 1146)
  • 286 – Austria / Germany F.R. Alternate
  • 287 – Denmark / Norway Alternate
  • 288 – Finland / Sweden Alternate
  • 289 – Spain Alternate
  • 290 – Japanese (Katakana) Extended
  • 293 – APL
  • 297 – France (same with euro: 1147)[14]
  • 298 – Japan (Katakana)
  • 300 – Japan (Kanji) DBCS (For JIS X 0213)
  • 310 – Graphic Escape APL/TN
  • 320 – Hungary
  • 321 – Yugoslavia
  • 322 – Turkey
  • 330 – International #4
  • 340 – EBCDIC, OCR (same as 893, superseded by 892 and 893)
  • 351 – GDDM default
  • 352 – Printing and publishing option
  • 353 – BCDIC-A
  • 354 – BCDIC-B
  • 355 – PTTC/BCD standard option
  • 357 – PTTC/BCD H option
  • 358 – PTTC/BCD Correspondence option
  • 359 – PTTC/BCD Monocase option
  • 360 – PTTC/BCD Duocase option
  • 361 – EBCDIC Publishing International
  • 363 – Symbols, set 8
  • 382 – EBCDIC Publishing Austria, Germany F.R. Alternate
  • 383 – EBCDIC Publishing Belgium
  • 384 – EBCDIC Publishing Brazil
  • 385 – EBCDIC Publishing Canada (French)
  • 386 – EBCDIC Publishing Denmark, Norway
  • 387 – EBCDIC Publishing Finland, Sweden
  • 388 – EBCDIC Publishing France
  • 389 – EBCDIC Publishing Italy
  • 390 – EBCDIC Publishing Japan (Latin)
  • 391 – EBCDIC Publishing Portugal
  • 392 – EBCDIC Publishing Spain, Philippines
  • 393 – EBCDIC Publishing Latin America (Spanish Speaking)
  • 394 – EBCDIC Publishing China (Hong Kong), UK, Ireland
  • 395 – EBCDIC Publishing Australia, New Zealand, USA, Canada (English)
  • 396 – BookMaster Specials
  • 410 – Cyrillic (revisions: 880, 1025, 1154)
  • 420 – Arabic
  • 421 – Maghreb/French
  • 423 – Greek (superseded by 875)
  • 424 – Hebrew (Bulletin Code)
  • 425 – Arabic / Latin for OS/390 Open Edition
  • 435 – Teletext Isomorphic
  • 500 – International #5 (ECECP; supersedes 256) (same with euro: 1148)
  • 803 – Hebrew Character Set A (Old Code)
  • 829 – Host Math Symbols- Publishing
  • 830 – Math Format
  • 831 – Portugal (Alternate) (same as 37)
  • 833 – Korean Extended (SBCS)
  • 834 – Korean Hangul (KSC5601; DBCS with UDCs)
  • 835 – Traditional Chinese DBCS
  • 836 – Simplified Chinese Extended
  • 837 – Simplified Chinese DBCS
  • 838 – Thai with Low Marks & Accented Characters (same with euro: 1160)
  • 839 – Thai DBCS
  • 870 – Latin 2 (same with euro: 1153) (revision: 1110)
  • 871 – Iceland (same with euro: 1149)[14]
  • 875 – Greek (supersedes 423)
  • 880 – Cyrillic (revision of 410) (revisions: 1025, 1154)
  • 881 – United States - 5080 Graphics System
  • 882 – United Kingdom - 5080 Graphics System
  • 883 – Sweden - 5080 Graphics System
  • 884 – Germany - 5080 Graphics System
  • 885 – France - 5080 Graphics System
  • 886 – Italy - 5080 Graphics System
  • 887 – Japan - 5080 Graphics System
  • 888 – France AZERTY - 5080 Graphics System
  • 889 – Thailand
  • 890 – Yugoslavia
  • 892 – EBCDIC, OCR A
  • 893 – EBCDIC, OCR B
  • 905 – Latin 3
  • 918 – Urdu Bilingual
  • 924 – Latin 9
  • 930 – Japan MIX (290 + 300) (same with euro: 1390)
  • 931 – Japan MIX (37 + 300)
  • 933 – Korea MIX (833 + 834) (same with euro: 1364)
  • 935 – Simplified Chinese MIX (836 + 837) (same with euro: 1388)
  • 937 – Traditional Chinese MIX (37 + 835) (same with euro: 1371)
  • 939 – Japan MIX (1027 + 300) (same with euro: 1399)
  • 1001 – MICR
  • 1002 – EBCDIC DCF Release 2 Compatibility
  • 1003 – EBCDIC DCF, US Text subset
  • 1005 – EBCDIC Isomorphic Text Communication
  • 1007 – EBCDIC Arabic (XCOM2)
  • 1024 – EBCDIC T.61
  • 1025 – Cyrillic, Multilingual (same with euro: 1154) (Revision of 880)
  • 1026 – EBCDIC Turkey (Latin 5) (same with euro: 1155) (supersedes 905 in that country)
  • 1027 – Japanese (Latin) Extended (JIS X 0201 Extended)
  • 1028 – EBCDIC Publishing Hebrew
  • 1030 – Japanese (Katakana) Extended
  • 1031 – Japanese (Latin) Extended
  • 1032 – MICR, E13-B Combined
  • 1033 – MICR, CMC-7 Combined
  • 1037 – Korea - 5080/6090 Graphics System
  • 1039 – GML Compatibility
  • 1047 – Latin 1/Open Systems[14]
  • 1068 – DCF Compatibility
  • 1069 – Latin 4
  • 1070 – USA / Canada Version 0 (Code page 37 Version 0)
  • 1071 – Germany F.R. / Austria (Code page 273 Version 0)
  • 1072 – Belgium (Code page 274 Version 0)
  • 1073 – Brazil (Code page 275 Version 0)
  • 1074 – Denmark, Norway (Code page 277 Version 0)
  • 1075 – Finland, Sweden (Code page 278 Version 0)
  • 1076 – Italy (Code page 280 Version 0)
  • 1077 – Japan (Latin) (Code page 281 Version 0)
  • 1078 – Portugal (Code page 282 Version 0)
  • 1079 – Spain / Latin America Version 0 (Code page 284 Version 0)
  • 1080 – United Kingdom (Code page 285 Version 0)
  • 1081 – France Version 0 (Code page 297 Version 0)
  • 1082 – Israel (Hebrew)
  • 1083 – Israel (Hebrew)
  • 1084 – International#5 Version 0 (Code page 500 Version 0)
  • 1085 – Iceland (Code page 871 Version 0)
  • 1087 – Symbol Set
  • 1091 – Modified Symbols, Set 7
  • 1093 – IBM Logo[15]
  • 1097 – Farsi Bilingual
  • 1110 – Latin 2 (Revision of 870)
  • 1112 – Baltic Multilingual (same with euro: 1156)
  • 1113 – Latin 6
  • 1122 – Estonia (same with euro: 1157)
  • 1123 – Cyrillic, Ukraine (same with euro: 1158)
  • 1130 – Vietnamese (same with euro: 1164)
  • 1132 – Lao EBCDIC
  • 1136 – Hitachi Katakana
  • 1137 – Devanagari EBCDIC
  • 1140 – USA, Canada, etc. ECECP (same without euro: 37) (Traditional Chinese version: 1159)
  • 1141 – Austria, Germany ECECP (same without euro: 273)
  • 1142 – Denmark, Norway ECECP (same without euro: 277)
  • 1143 – Finland, Sweden ECECP (same without euro: 278)
  • 1144 – Italy ECECP (same without euro: 280)
  • 1145 – Spain, Latin America (Spanish) ECECP (same without euro: 284)
  • 1146 – UK ECECP (same without euro: 285)
  • 1147 – France ECECP with euro (same without euro: 297)
  • 1148 – International ECECP with euro (same without euro: 500)
  • 1149 – Icelandic ECECP with euro (same without euro: 871)
  • 1150 – Korean Extended with box characters
  • 1151 – Simplified Chinese Extended with box characters
  • 1152 – Traditional Chinese Extended with box characters
  • 1153 – Latin 2 Multilingual with euro (same without euro: 870)
  • 1154 – Cyrillic, Multilingual with euro (same without euro: 1025; an older version is 1166)
  • 1155 – Turkey with euro (same without euro: 1026) (same with lira: 1175)
  • 1156 – Baltic Multi with euro (same without euro: 1112)
  • 1157 – Estonia with euro (same without euro: 1122)
  • 1158 – Cyrillic, Ukraine with euro (same without euro: 1123)
  • 1159 – T-Chinese EBCDIC (Traditional Chinese euro update of 1140)
  • 1160 – Thai with Low Marks & Accented Characters with euro (same without euro: 838)
  • 1164 – Vietnamese with euro (same without euro: 1130)
  • 1165 – Latin 2/Open Systems
  • 1166 – Cyrillic Kazakh
  • 1175 – Turkey with euro and lira (same without lira: 1155)
  • 1278 – EBCDIC Adobe (PostScript) Standard Encoding
  • 1279 – Hitachi Japanese Katakana Host[6]
  • 1300 – Generic Bar Code/OCR-B
  • 1301 – Zip + 4 POSTNET Bar Code
  • 1302 – Facing Identification Marks
  • 1303 – EBCDIC Bar Code
  • 1364 – Korea MIX (833 + 834 + euro) (same without euro: 933)
  • 1371 – Traditional Chinese MIX (1159 + 835) (same without euro: 937)
  • 1376 – Traditional Chinese DBCS Host extension for HKSCS
  • 1377 – Mixed Host HKSCS Growing (37 + 1376)
  • 1378 – Traditional Chinese DBCS Host extension for HKSCS and Simplified Chinese (superset of 1376)
  • 1379 – Mixed Host HKSCS and Simplified Chinese Growing (37 + 1378) (superset of 1377)
  • 1388 – Simplified Chinese MIX (same without euro: 935) (836 + 837 + euro)
  • 1390 – Japan MIX (290 + 300 + euro) (same without euro: 930)
  • 1399 – Japan MIX (1027 + 300 + euro) (same without euro: 939)
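To make the list above more concrete, the sketch below encodes a short string in code page 37 and compares code page 37 with its euro counterpart 1140. It relies on Python's built-in `cp037` and `cp1140` codecs, which are assumed here to reflect IBM's definitions.

```python
text = "CODE PAGE 37"

# EBCDIC places letters and digits at different byte values than ASCII.
ebcdic = text.encode("cp037")
print(ebcdic.hex(" "))
assert ebcdic != text.encode("ascii")
assert ebcdic.decode("cp037") == text

# "Same with euro: 1140": find the byte values where 37 and 1140 disagree.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp037") != bytes([b]).decode("cp1140")
]
for b in differing:
    print(hex(b), repr(bytes([b]).decode("cp037")), "->", repr(bytes([b]).decode("cp1140")))
```

In Python's tables the difference is expected to be confined to the position where the euro sign replaces the generic currency sign.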

DOS code pages

These code pages are used by IBM in its PC DOS operating system. These code pages were originally embedded directly in the text mode hardware of the graphic adapters used with the IBM PC and its clones, including the original MDA and CGA adapters whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets).

  • 301 – IBM-PC Japan (Kanji) DBCS
  • 437 – Original IBM PC hardware code page
  • 720 – Arabic (Transparent ASMO)
  • 737 – Greek
  • 775 – Latin-7
  • 808 – Russian with euro (same without euro: 866)
  • 848 – Ukrainian with euro (same without euro: 1125)
  • 849 – Belarusian with euro (same without euro: 1131)
  • 850 – Latin-1
  • 851 – Greek
  • 852 – Latin-2
  • 853 – Latin-3
  • 855 – Cyrillic (same with euro: 872)
  • 856 – Hebrew
  • 857 – Latin-5
  • 858 – Latin-1 with euro symbol
  • 859 – Latin-9
  • 860 – Portuguese
  • 861 – Icelandic
  • 862 – Hebrew
  • 863 – Canadian French
  • 864 – Arabic
  • 865 – Danish/Norwegian
  • 866 – Belarusian, Russian, Ukrainian (same with euro: 808)
  • 867 – Hebrew + euro (based on CP862) (conflictive ID: NEC Czech (Kamenický), which was created before this code page)
  • 868 – Urdu
  • 869 – Greek
  • 872 – Cyrillic with euro (same without euro: 855)
  • 874 – Thai with Low Tone Marks & Ancient Chars (conflictive ID with Windows 874; version with euro: 1161; Windows version is IBM 1162)
  • 876 – OCR A
  • 877 – OCR B
  • 878 – KOI8-R
  • 891 – Korean PC SBCS
  • 898 – IBM-PC WP Multilingual
  • 899 – IBM-PC Symbol
  • 903 – Simplified Chinese PC SBCS
  • 904 – Traditional Chinese PC SBCS
  • 906 – International Set #5 3812/3820
  • 907 – ASCII APL (3812)
  • 909 – IBM-PC APL2 Extended
  • 910 – IBM-PC APL2
  • 911 – IBM-PC Japan #1
  • 926 – Korean PC DBCS
  • 927 – Traditional Chinese PC DBCS
  • 928 – Simplified Chinese PC DBCS
  • 929 – Thai PC DBCS
  • 932 – IBM-PC Japan MIX (DOS/V) (DBCS) (897 + 301) (conflictive ID with Windows 932; Windows version is IBM 943)
  • 934 – IBM-PC Korea MIX (DOS/V) (DBCS) (891 + 926)
  • 936 – IBM-PC Simplified Chinese MIX (gb2312) (DOS/V) (DBCS) (903 + 928) (conflictive ID with Windows 936; Windows version is IBM 1386)
  • 938 – IBM-PC Traditional Chinese MIX (DOS/V, OS/2) (904 + 927)
  • 942 – IBM-PC Japan MIX (Japanese SAA (OS/2)) (1041 + 301)
  • 943 – IBM-PC Japan OPEN (897 + 941) (Windows CP 932)
  • 944 – IBM-PC Korea MIX (Korean SAA (OS/2)) (1040 + 926)
  • 946 – IBM-PC Simplified Chinese (Simplified Chinese SAA (OS/2)) (1042 + 928)
  • 948 – IBM-PC Traditional Chinese (Traditional Chinese SAA (OS/2)) (1043 + 927)
  • 949 – Korean (Extended Wansung (ks_c_5601-1987)) (1088 + 951) (conflictive ID with Windows 949 (Unified Hangul Code); Windows version is IBM 1363)
  • 951 – Korean DBCS (IBM KS Code) (conflictive ID with Windows 951, a hack of Windows 950 with Unicode mappings for some PUA Unicode characters found in HKSCS, based on the file name)
  • 1034 – Printer Application - Shipping Label, Set #2
  • 1040 – Korean Extended
  • 1041 – Japanese Extended (JIS X 0201 Extended)
  • 1042 – Simplified Chinese Extended
  • 1043 – Traditional Chinese Extended
  • 1044 – Printer Application - Shipping Label, Set #1
  • 1086 – IBM-PC Japan #1
  • 1088 – Revised Korean (SBCS)
  • 1092 – IBM-PC Modified Symbols
  • 1098 – Farsi
  • 1108 – DITROFF Base Compatibility
  • 1109 – DITROFF Specials Compatibility
  • 1115 – IBM-PC People's Republic of China
  • 1116 – Estonian
  • 1117 – Latvian
  • 1118 – Lithuanian (IBM's implementation of Lika's code page 774)
  • 1119 – Lithuanian and Russian (IBM's implementation of Lika's code page 772)
  • 1125 – Cyrillic, Ukrainian (same with euro: 848) (IBM modification of RUSCII)
  • 1127 – IBM-PC Arabic / French
  • 1131 – IBM-PC Data, Cyrillic, Belarusian (same with euro: 849)
  • 1139 – Japan Alphanumeric Katakana
  • 1161 – Thai with Low Tone Marks & Ancient Chars with euro (same without euro: 874)
  • 1167 – KOI8-RU
  • 1168 – KOI8-U
  • 1370 – Traditional Chinese MIX (Big5 encoding) (1114 + 947 + euro) (same without euro: 950)
  • 1380 – IBM-PC Simplified Chinese GB PC-DATA (DBCS PC IBM GB 2312-80)
  • 1381 – IBM-PC Simplified Chinese (1115 + 1380)
  • 1393 – Japanese JIS X 0213 DBCS
  • 1394 – IBM-PC Japan (JIS X 0213) (897 + 1393)

When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but newer encoding systems, in particular Unicode, are encouraged for new designs.
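Supporting a legacy DOS code page in practice usually means decoding with the correct code page and re-encoding as Unicode. The sketch below shows this for code page 866; `transcode` is a hypothetical helper, and Python's `cp866` codec is assumed to match the code page of that number.

```python
def transcode(data: bytes, source_codec: str, target_codec: str = "utf-8") -> bytes:
    """Convert bytes from a legacy code page to another encoding via Unicode."""
    return data.decode(source_codec).encode(target_codec)

# Six Cyrillic letters stored as six bytes in code page 866.
legacy = "Привет".encode("cp866")
utf8 = transcode(legacy, "cp866")

print(len(legacy), len(utf8))        # 6 12: each letter needs two bytes in UTF-8
print(utf8.decode("utf-8"))          # Привет

# Decoding with the wrong code page does not fail; it silently yields other characters.
print(legacy.decode("cp437"))
```

Because a wrong guess produces plausible-looking garbage rather than an error, the code page has to be known from metadata or context; it cannot be reliably inferred from the bytes alone.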

DOS code pages are typically stored in .CPI files.[16][17][18][19][20]

IBM AIX code pages

These code pages are used by IBM in its AIX operating system. They emulate several character sets, namely those designed to be used in accordance with ISO standards, such as on UNIX-like operating systems.

Code page 819 is identical to Latin-1 (ISO/IEC 8859-1) and, with slightly modified commands, permits MS-DOS machines to use that encoding. It was used with IBM AS/400 minicomputers.

IBM OS/2 code pages

These code pages are used by IBM in its OS/2 operating system.

  • 1004 – Latin-1 Extended, Desk Top Publishing/Windows[21]

Windows emulation code pages

These code pages are used by IBM when emulating the Microsoft Windows character sets. Most of these code pages have the same number as Microsoft code pages, although they are not exactly identical. Some code pages, though, are new from IBM, not devised by Microsoft.

Macintosh emulation code pages

These code pages are used by IBM when emulating the Apple Macintosh character sets.

  • 1275 – Apple Roman
  • 1280 – Apple Greek
  • 1281 – Apple Turkish
  • 1282 – Apple Central European
  • 1283 – Apple Cyrillic
  • 1284 – Apple Croatian
  • 1285 – Apple Romanian
  • 1286 – Apple Icelandic

Adobe emulation code pages

These code pages are used by IBM when emulating the Adobe character sets.

  • 1038 – Adobe Symbol Encoding
  • 1276 – Adobe (PostScript) Standard Encoding
  • 1277 – Adobe (PostScript) Latin 1

HP emulation code pages

These code pages are used by IBM when emulating the HP character sets.

DEC emulation code pages

These code pages are used by IBM when emulating the DEC character sets.

  • 1020 – 7-bit Canadian (French) NRC Set
  • 1021 – 7-bit Switzerland NRC Set
  • 1023 – 7-bit Spanish NRC Set
  • 1090 – Special Characters and Line Drawing Set
  • 1100 – DEC Multinational
  • 1101 – 7-bit British NRC Set
  • 1102 – 7-bit Dutch NRC Set
  • 1103 – 7-bit Finnish NRC Set
  • 1104 – 7-bit French NRC Set
  • 1105 – 7-bit Norwegian/Danish NRC Set
  • 1106 – 7-bit Swedish NRC Set
  • 1107 – 7-bit Norwegian/Danish NRC Alternate
  • 1287 – DEC Greek
  • 1288 – DEC Turkish

IBM Unicode code pages

Microsoft code pages

Windows code pages

These code pages are used by Microsoft in its own Windows operating system. Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80–0x9F for extra printable characters rather than the C1 control codes from ISO 6429 mentioned by ISO 8859-1.[24] Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252.
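The relationship between code page 1252 and ISO 8859-1 can be checked byte by byte. This sketch uses Python's `cp1252` and `latin-1` codecs as assumed stand-ins for the two encodings.

```python
# 0xA0-0xFF: code page 1252 and ISO 8859-1 agree on every byte.
for b in range(0xA0, 0x100):
    assert bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")

# 0x80-0x9F: ISO 8859-1 leaves these to C1 control codes,
# while code page 1252 assigns printable characters to most of them.
for b in (0x80, 0x93, 0x94):
    raw = bytes([b])
    print(hex(b), repr(raw.decode("latin-1")), repr(raw.decode("cp1252")))

euro = bytes([0x80]).decode("cp1252")
print(euro)  # €
```

This difference is why text labelled as ISO 8859-1 but actually produced on Windows often shows up with broken quotation marks or euro signs when decoded strictly.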

Microsoft recommends new applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.[25]

DBCS code pages

These code pages represent DBCS character encodings for various CJK languages. In Microsoft operating systems, these are used as both the "OEM" and "Windows" code page for the applicable locale.
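In a DBCS code page, ASCII characters still occupy one byte while CJK characters occupy two. The sketch below feeds bytes one at a time to Python's incremental decoder for `cp932` (assumed to correspond to Windows code page 932) and records how many bytes each character consumed; `char_widths` is a hypothetical helper.

```python
import codecs

def char_widths(data: bytes, codec: str):
    """Return (character, byte_count) pairs for text in a multi-byte code page."""
    decoder = codecs.getincrementaldecoder(codec)()
    result = []
    pending = 0
    for b in data:
        pending += 1
        # The decoder returns '' while a multi-byte sequence is still incomplete.
        ch = decoder.decode(bytes([b]))
        if ch:
            result.append((ch, pending))
            pending = 0
    return result

data = "A日本".encode("cp932")
print(len(data))                   # 5
print(char_widths(data, "cp932"))  # [('A', 1), ('日', 2), ('本', 2)]
```

Because character boundaries depend on the preceding bytes, such text cannot be safely split or indexed at arbitrary byte offsets.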

MS-DOS code pages

These code pages are used by Microsoft in its MS-DOS operating system. Microsoft refers to these as the OEM code pages because they were defined by the original equipment manufacturers who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards organization. Most of these code pages have the same number as the equivalent IBM code pages, although some are not exactly identical.[26]

Macintosh emulation code pages

These code pages are used by Microsoft when emulating the Apple Macintosh character sets.

Various other Microsoft code pages

The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages. They emulate several character sets, namely those designed to be used in accordance with ISO standards, such as on UNIX-like operating systems.

Microsoft Unicode code pages

HP Symbol Sets

HP developed a series of Symbol Sets (each with its associated Symbol Set Code) to encode either its own character sets or other vendors’ character sets. They are normally 7-bit character sets which, when moved to the higher part and associated with the ASCII character set, make up 8-bit character sets.
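The composition described above, ASCII in the lower half and a 7-bit set relocated to the upper half, can be sketched as follows. The function `compose_8bit` and the two-entry `toy_extension` table are hypothetical and purely illustrative; they are not taken from HP documentation.

```python
def compose_8bit(upper_7bit: dict) -> dict:
    """Combine ASCII (0x00-0x7F) with a 7-bit set moved to 0x80-0xFF."""
    # Lower half: plain ASCII.
    table = {code: chr(code) for code in range(0x80)}
    # Upper half: each 7-bit position is relocated by setting the top bit.
    for code7, ch in upper_7bit.items():
        if not 0 <= code7 < 0x80:
            raise ValueError(f"not a 7-bit code: {code7:#x}")
        table[code7 | 0x80] = ch
    return table

# Illustrative excerpt of a 7-bit extension set (positions are assumptions).
toy_extension = {0x21: "À", 0x22: "Â"}

eight_bit = compose_8bit(toy_extension)
print(eight_bit[0x41])  # A  (ASCII half unchanged)
print(eight_bit[0xA1])  # À  (0x21 moved to 0xA1)
```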

HP's own Symbol Sets

  • Symbol Set 0E — HP Roman Extension — 7-bit character set with accented letters (coded by IBM as code page 1050)
  • Symbol Set 0G — HP 7-bit German
  • Symbol Set 0L — HP 7-bit PC Line (coded by IBM as code page 1055)
  • Symbol Set 0M — HP Math-7
  • Symbol Set 0T — HP Thai-8
  • Symbol Set 1S — HP 7-bit Spanish
  • Symbol Set 1U — HP 7-bit Gothic Legal (coded by IBM as code page 1052)
  • Symbol Set 4Q — HP Line Draw (coded by IBM as code page 1056)
  • Symbol Set 4U — HP Roman-9 — Roman-8 + €
  • Symbol Set 7J — HP Desktop
  • Symbol Set 7S — HP 7-bit European Spanish
  • Symbol Set 8E — HP East-8
  • Symbol Set 8G — HP Greek-8 (based on IR 088; not on ELOT 927)
  • Symbol Set 8H — HP Hebrew-8
  • Symbol Set 8I — MS LineDraw (ASCII + HP PC Line)
  • Symbol Set 8K — HP Kana-8 (ASCII + Japanese Katakana)
  • Symbol Set 8L — HP LineDraw (ASCII + HP Line Draw)
  • Symbol Set 8M — HP Math-8 (ASCII + HP Math-7)
  • Symbol Set 8R — HP Cyrillic-8
  • Symbol Set 8S — HP 7-bit Latin American Spanish
  • Symbol Set 8T — HP Turkish-8
  • Symbol Set 8U — HP Roman-8 (ASCII + HP Roman Extension; coded by IBM as code page 1051)
  • Symbol Set 8V — HP Arabic-8
  • Symbol Set 9K — HP Korean-8
  • Symbol Set 9T — PC 8T (also known as Code Page 437-T; this is not code page 857)
  • Symbol Set 9V — Latin / Arabic for Windows (this is not code page 1256)
  • Symbol Set 11U — PC 8D/N (also known as Code Page 437-N; coded by IBM as code page 1058; this is not code page 865)
  • Symbol set 14G — PC-8 Greek Alternate (also known as Code Page 437-G; almost the same as code page 737)
  • Symbol Set 18K —
  • Symbol Set 18T —
  • Symbol Set 19C —
  • Symbol Set 19K —

Symbol Sets from other vendors

  • Symbol Set 0D — ISO 60: 7-bit Norwegian
  • Symbol Set 0F — ISO 25: 7-bit French
  • Symbol Set 0H — HP 7-bit Hebrew — Practically the same as Israeli Standard SI 960
  • Symbol Set 0I — ISO 15: 7-bit Italian
  • Symbol Set 0K — ISO 14: 7-bit Japanese Katakana
  • Symbol Set 0N — ISO 8859-1 Latin 1 (Initially called "Gothic-1"; coded by IBM as code page 1053)
  • Symbol Set 0R — ISO 8859-5 Latin/Cyrillic (1986 version — IR 111)
  • Symbol Set 0S — ISO 11: 7-bit Swedish
  • Symbol Set 0U — ISO 6: 7-bit U.S.
  • Symbol Set 0V — Arabic
  • Symbol Set 1D — ISO 61: 7-bit Norwegian
  • Symbol Set 1E — ISO 4: 7-bit U. K.
  • Symbol Set 1F — ISO 69: 7-bit French
  • Symbol Set 1G — ISO 21: 7-bit German
  • Symbol Set 1K — ISO 13: 7-bit Japanese Latin
  • Symbol Set 1T — Windows Thai (Practically the same as 874)
  • Symbol Set 2K — ISO 57: 7-bit Simplified Chinese Latin
  • Symbol Set 2N — ISO 8859-2 Latin 2
  • Symbol Set 2S — ISO 17: 7-bit Spanish
  • Symbol Set 2U — ISO 2: 7-bit International Reference Version
  • Symbol Set 3N — ISO 8859-3 Latin 3
  • Symbol Set 3R — PC-866 Russia (Practically the same as code page 866)
  • Symbol Set 3S — ISO 10: 7-bit Swedish
  • Symbol Set 4N — ISO 8859-4 Latin 4
  • Symbol Set 4S — ISO 16: 7-bit Portuguese
  • Symbol Set 5M — PS Math Symbol (Practically the same as Adobe Symbols)
  • Symbol Set 5N — ISO 8859-9 Latin 5
  • Symbol Set 5S — ISO 84: 7-bit Portuguese
  • Symbol Set 5T — Windows 3.1 Latin-5 (Practically the same as code page 1254)
  • Symbol Set 6J — Microsoft Publishing
  • Symbol Set 6M — Ventura Math
  • Symbol Set 6N — ISO 8859-10 Latin 6
  • Symbol Set 6S — ISO 85: 7-bit Spanish
  • Symbol Set 7H — ISO 8859-8 Latin/Hebrew
  • Symbol Set 9E — Windows 3.1 Latin 2 (Practically the same as code page 1250)
  • Symbol Set 9G — Windows 98 Greek (Practically the same as code page 1253)
  • Symbol Set 9J — PC 1004
  • Symbol Set 9L — Ventura ITC Zapf Dingbats
  • Symbol Set 9N — ISO 8859-15 Latin 9
  • Symbol Set 9R — Windows 98 Cyrillic (Practically the same as code page 1251)
  • Symbol Set 9U — Windows 3.0
  • Symbol Set 10G — PC-851 Latin/Greek (Practically the same as code page 851)
  • Symbol Set 10J — PS Text (Practically the same as Adobe Standard)
  • Symbol Set 10L — PS ITC Zapf Dingbats (Practically the same as Adobe Dingbats)
  • Symbol Set 10N — ISO 8859-5 Latin/Cyrillic (1988 version — IR 144)
  • Symbol Set 10R — PC-855 Cyrillic (Practically the same as code page 855)
  • Symbol Set 10T — Teletex
  • Symbol Set 10U — PC-8 (Practically the same as code page 437; coded by IBM as code page 1057)
  • Symbol Set 10V — CP-864 (Practically the same as code page 864)
  • Symbol Set 11G — CP-869 (Practically the same as code page 869)
  • Symbol Set 11J — PS ISO Latin-1 (Practically the same as Adobe Latin-1)
  • Symbol Set 11N — ISO 8859-6 Latin/Arabic
  • Symbol Set 12G — PC Latin/Greek (Practically the same as code page 737)
  • Symbol Set 12J — MC Text (Practically the same as Macintosh Roman)
  • Symbol Set 12N — ISO 8859-7 Latin/Greek
  • Symbol Set 12R — PC Gost (Practically the same as PC GOST Main)
  • Symbol Set 12U — PC-850 Latin 1 (Practically the same as code page 850)
  • Symbol Set 13J — Ventura International
  • Symbol Set 13R — PC Bulgarian (Practically the same as MIK)
  • Symbol Set 13U — PC-858 Latin 1 + € (Practically the same as code page 858)
  • Symbol Set 14J — Ventura U. S.
  • Symbol Set 14L — Windows Dingbats
  • Symbol Set 14P — ABICOMP International (Practically the same as ABICOMP)
  • Symbol Set 14R — PC Ukrainian (Practically the same as RUSCII)
  • Symbol Set 15H — PC-862 Israel (Practically the same as code page 862)
  • Symbol Set 16U — PC-857 Latin 5 (Practically the same as code page 857)
  • Symbol Set 17U — PC-852 Latin 2 (Practically the same as code page 852)
  • Symbol Set 18N — UTF-8
  • Symbol Set 18U — PC-853 Latin 3 (Practically the same as code page 853)
  • Symbol Set 19L — Windows 98 Baltic (Practically the same as code page 1257)
  • Symbol Set 19M — Windows Symbol
  • Symbol Set 19U — Windows 3.1 Latin 1 (Practically the same as code page 1252)
  • Symbol Set 20U — PC-860 Portugal (Practically the same as code page 860)
  • Symbol Set 21U — PC-861 Iceland (Practically the same as code page 861)
  • Symbol Set 23U — PC-863 Canada - French (Practically the same as code page 863)
  • Symbol Set 24Q — PC-Polish Mazowia (Practically the same as Mazovia encoding)
  • Symbol Set 25U — PC-865 Denmark/Norway (Practically the same as code page 865)
  • Symbol Set 26U — PC-775 Latin 7 (Practically the same as code page 775)
  • Symbol Set 27Q — PC-8 PC Nova (Practically the same as PC Nova)
  • Symbol Set 27U — PC Latvian Russian (also known as 866-Latvian)
  • Symbol Set 28U — PC Lithuanian/Russian (Practically the same as code page 774)
  • Symbol Set 29U — PC-772 Lithuanian/Russian (Practically the same as code page 772)

Code pages from other vendors

These code pages are independent assignments by third party vendors. Since the original IBM PC code page (number 437) was not really designed for international use, several partially compatible country or region specific variants emerged.

These code page number assignments are official neither with IBM nor with Microsoft, and almost none of them is registered as a usable character set with IANA. The numbers assigned to these code pages are arbitrary and may clash with registered numbers in use by IBM or Microsoft. Some of them may predate the addition of code page switching in DOS 3.3.

  • 100 – DOS Hebrew hardware fontpage (Not from IBM; HDOS)[34]
  • 111 – DOS Greek (Not from IBM; AST Premium Exec DOS 5.0[35][36][37])
  • 112 – DOS Turkish (Not from IBM; AST Premium Exec DOS 5.0[35][36][37])
  • 113 – DOS Yugoslavian (Not from IBM; AST Premium Exec DOS 5.0[35][36][37])
  • 151 – DOS Nafitha Arabic (Not from IBM; ADOS)
  • 152 – DOS Nafitha Arabic (Not from IBM; ADOS)
  • 161 – DOS Arabic (Not from IBM; ADOS)[34]
  • 162 – DOS Arabic with vowel diacritics (Not from IBM; ADOS)
  • 163 – DOS Arabic and French (Not from IBM; ADOS)[34]
  • 164 – DOS Arabic and French with vowel diacritics (Not from IBM; ADOS)
  • 165 – DOS Arabic (864 Extended) (Not from IBM; ADOS)[34]
  • 166 – IBM Arabic PC (ADOS)[34]
  • 190 – DEC DOS German (appears to be identical to Code page 437)
  • 210 – DEC DOS Greek (NEC Jetmate printers)
  • 220 – DEC DOS Spanish (Not from IBM)
  • 489 – Czechoslovakian [OCR software 1993]
  • 620 – DOS Polish (Mazovia) (Not from IBM)
  • 667 – DOS Polish (Mazovia) (Not from IBM)
  • 668 – DOS Polish (Not from IBM)
  • 706 – MS-DOS Server Arabic Sakhr (Not from IBM; Sakhr Software from MSX Computers)
  • 707 – MS-DOS Arabic Sakhr (Not from IBM; Sakhr Software from MSX Computers)
  • 709 – MS-DOS Arabic (ASMO 449+/BCON V4)
  • 710 – MS-DOS Arabic (Transparent Arabic)
  • 711 – MS-DOS Arabic Nafitha Enhanced (Not from IBM)
  • 714 – MS-DOS Arabic Sakr (Not from IBM)
  • 715 – MS-DOS Arabic APTEC (Not from IBM)
  • 721 – MS-DOS Arabic Nafitha International (Not from IBM)
  • 768 – Arabic Al-Arabi (Not from IBM)
  • 770 – DOS Estonian, Latvian, Lithuanian[38] (From Lithuanian Lika Software;[39] Lithuanian RST 1095-89 National Standard)
  • 771 – DOS Lithuanian/Cyrillic — KBL[40] (From Lithuanian Lika Software[39])
  • 772 – DOS Lithuanian/Cyrillic[41] (From Lithuanian Lika Software;[39] Lithuanian LST 1284:1993 National Standard; adopted by IBM as code page 1119)
  • 773 – DOS Latin-7 — KBL (From Lithuanian Lika Software)
  • 774 – DOS Lithuanian[42] (From Lithuanian Lika Software;[39] Lithuanian LST 1283:1993 National Standard; adopted by IBM as code page 1118)
  • 775 – DOS Latin-7 Baltic Rim (From Lithuanian Lika Software;[39] Lithuanian LST 1590-1 National Standard; adopted by IBM and Microsoft as code page 775)
  • 776 – DOS Lithuanian (extended CP770)[43] (From Lithuanian Lika Software[39])
  • 777 – DOS Accented Lithuanian (old) (extended CP773) — KBL[43] (From Lithuanian Lika Software[39])
  • 778 – DOS Accented Lithuanian (extended CP775)[43] (From Lithuanian Lika Software[39])
  • 790 – DOS Polish (Mazovia) with curly quotation marks
  • 854 – Spanish[44][6]
  • 881 – Latin 1 (Not from IBM; AST Premium Exec DOS 5.0[35][36][37]) (ID conflicts with IBM EBCDIC 881)
  • 882 – Latin 2 (ISO 8859-2) (Not from IBM; same as Code page 912; AST Premium Exec DOS 5.0[35][36][37]) (ID conflicts with IBM EBCDIC 882)
  • 883 – Latin 3 (Not from IBM; AST Premium Exec DOS 5.0[35][36][37]) (ID conflicts with IBM EBCDIC 883)
  • 884 – Latin 4 (Not from IBM; AST Premium Exec DOS 5.0[35][36][37]) (ID conflicts with IBM EBCDIC 884)
  • 885 – Latin 5 (Not from IBM; AST Premium Exec DOS 5.0[35][36][37]) (ID conflicts with IBM EBCDIC 885)
  • 895 – Czech (Kamenický) (Not from IBM; ID conflicts with IBM CP895, 7-bit EUC Japanese Roman)
  • 896 – DOS Polish (Mazovia) (Not from IBM; ID conflicts with IBM CP896, 7-bit EUC Japanese Katakana)
  • 900 – DOS Russian (Russian MS-DOS 5.0 LCD.CPI)
  • 928 – Greek (on Star[45] printers); same as Greek National Standard ELOT 928 (Not from IBM; ID conflicts with IBM CP928, Simplified Chinese PC DBCS)
  • 966 – Saudi Arabian (Not from IBM)
  • 972 – Hebrew (VT100) (Not from IBM)
  • 991 – DOS Polish (Mazovia) (Not from IBM)
  • 999 – DOS Serbo-Croatian I (Not from IBM); also known as PC Nova and CroSCII; lower part is JUSI.B1.002, upper part is code page 437; supports Slovenian and Serbo-Croatian (Latin script)
  • 1001 – Arabic (on Star[45] printers) (Not from IBM; ID conflicts with IBM CP1001, MICR)
  • 1261 – Windows Korean IBM-1261 LMBCS-17, similar to 1363
  • 1270 – Windows Sámi
  • 1300 – ANSI [PTS-DOS 6.70, not 6.51] (Not from IBM; ID conflicts with IBM EBCDIC 1300, Generic Bar Code/OCR-B)
  • 2001 – Lithuanian KBL (on Star[45] printers); same as code page 771
  • 3001 – Estonian 1 (on Star[45] printers); same as code page 1116
  • 3002 – Estonian 2 (on Star[45] printers); same as code page 922
  • 3011 – Latvian 1 (on Star[45] printers); same as code page 437-Latvian
  • 3012 – Latvian-2 (on Star[45] printers); same as code page 866-Latvian (Latvian RST 1040-90 National Standard)
  • 3021 – Bulgarian (on Star[45] printers); same as MIK
  • 3031 – Hebrew (on Star[45] printers); same as code page 862
  • 3041 – Maltese (on Star[45] printers); same as ISO 646 Maltese
  • 3840 – IBM-Russian (on Star[45] printers); nearly the same as CP 866
  • 3841 – Gost-Russian (on Star[45] printers); GOST 13052 plus characters for Central Asian languages
  • 3843 – Polish (on Star[45] printers); same as Mazovia
  • 3844 – CS2 (on Star[45] printers); same as Kamenický
  • 3845 – Hungarian (on Star[45] printers); same as CWI
  • 3846 – Turkish (on Star[45] printers); same as PC-8 Turkish + old Turkish Lira sign (Tʟ) at code point A8
  • 3847 – Brazil-ABNT (on Star[45] printers); same as the Brazilian National Standard NBR-9614:1986
  • 3848 – Brazil-ABICOMP (on Star[45] printers); same as ABICOMP
  • 3850 – Standard KU (on Star[45] printers); variation of the Kasetsart University encoding for Thai
  • 3860 – Rajvitee KU (on Star[45] printers); variation of the Kasetsart University encoding for Thai
  • 3861 – Microwiz KU (on Star[45] printers); variation of the Kasetsart University encoding for Thai
  • 3863 – STD988 TIS (on Star[45] printers); variation of the TIS 620 encoding for Thai
  • 3864 – Popular TIS (on Star[45] printers); variation of the TIS 620 encoding for Thai
  • 3865 – Newsic TIS (on Star[45] printers); variation of the TIS 620 encoding for Thai
  • 28799 – FOCAL (on Star[45] printers); same as FOCAL character set
  • 28800 – HP RPL (on Star[45] printers); same as RPL
  • (number missing) – CWI-2 (for DOS) supports Hungarian
  • (number missing) – MIK (for DOS) supports Bulgarian
  • (number missing) – DOS Serbo-Croatian II; supports Slovenian and Serbo-Croatian (Latin script)
  • (number missing) – Russian Alternative code page (for DOS); this is the origin for IBM CP 866

List of code page assignments

List of known code page assignments (incomplete):

| ID | Names | Description | Origin | Platform | DOS | OS/2 | Windows | Mac | Else | Encoding | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Reserved | IBM, Microsoft | | 3.3+ | 1.0+ | ? | ? | ? | | Internal OS use[34] |
| 437 | CP437, IBM437 | PC US | IBM[46] | IBM PC | 3.3+ | 1.0+ | Yes | ? | Yes | 8-bit SBCS | |
| 57344 - 61439 | | Private use derivations | IBM | | | | | | | various | Private use code page derivations (E000h-EFFFh) |
| 65280 - 65533 | | Private use definitions | IBM | | | | | | | various | Private use code page definitions (FF00h-FFFDh) |
| 65534 | | Reserved | IBM, Microsoft | | ? | ? | ? | ? | ? | various | Internal OS use (FFFEh) |
| 65535 | | Reserved | IBM, Microsoft | | 3.3+ | 1.0+ | ? | ? | ? | various | Internal OS use (FFFFh)[34] |

Criticism

Many older character encodings (unlike Unicode) suffer from several problems. Some vendors insufficiently document the meaning of all code point values in their code pages, which makes it harder to handle textual data consistently across different computer systems. Some vendors add proprietary extensions to established code pages to add or change certain code point values: for example, byte 0x5C in Shift JIS can represent either a backslash or a yen sign depending on the platform. Finally, in order to support several languages in a program that does not use Unicode, the code page used for each string or document needs to be stored.

Applications may also mislabel text in Windows-1252 as ISO-8859-1. The only difference between these code pages is that the code point values in the range 0x80–0x9F, used by ISO-8859-1 for control characters, are instead used as additional printable characters in Windows-1252, notably for quotation marks, the euro sign and the trademark symbol among others. Browsers on non-Windows platforms would tend to show empty boxes or question marks for these characters, making the text hard to read. Most browsers fixed this by ignoring the declared character set and interpreting the text as Windows-1252 so that it displays acceptably. In HTML5, treating ISO-8859-1 as Windows-1252 is even codified as a W3C standard.[47] Although browsers were typically programmed to deal with this behaviour, this was not always true of other software. Consequently, when receiving a file transfer from a Windows system, non-Windows platforms would either ignore these characters or treat them as standard control characters and attempt to take the specified control action accordingly.
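The mislabelling problem described above can be reproduced with Python's built-in codecs. This is a minimal sketch, not how any particular browser is implemented: the sample bytes and the helper name `decode_like_a_browser` are illustrative assumptions.

```python
# Bytes as a Windows program might write them: curly quotes around "hi", then a euro sign.
raw = b"\x93hi\x94 \x80"

# Decoded as ISO-8859-1, bytes 0x80-0x9F turn into invisible C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as Windows-1252, the same bytes are printable punctuation and symbols.
as_cp1252 = raw.decode("cp1252")


def decode_like_a_browser(data: bytes, declared: str) -> str:
    """Hypothetical helper: treat an ISO-8859-1 label as Windows-1252."""
    label = declared.strip().lower().replace("_", "-")
    if label in {"iso-8859-1", "latin-1", "latin1"}:
        label = "cp1252"
    # errors="replace" covers the few byte values Python's cp1252 codec leaves undefined.
    return data.decode(label, errors="replace")


print([hex(ord(ch)) for ch in as_latin1])
print(as_cp1252)
print(decode_like_a_browser(raw, "ISO-8859-1"))
```

The first line of output shows code points such as 0x93 and 0x80, which are control characters in Unicode, while the other two lines show the quotation marks and euro sign the author intended.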

Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, the problems listed above are rarely a concern for Unicode. UTF-8 (which can encode over one million codepoints) has replaced the code-page method in terms of popularity on the Internet.[48][49]

Private code pages

When, early in the history of personal computers, users did not find their character encoding requirements met, private or local code pages were created using terminate-and-stay-resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g. CP895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another example is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.

In order to overcome such problems, the IBM Character Data Representation Architecture level 2 specifically reserves ranges of code page IDs for user-definable and private-use assignments. Whenever such code page IDs are used, the user must not assume that the same functionality and appearance can be reproduced in another system configuration or on another device or system unless the user takes care of this specifically. The code page range 57344-61439 (E000h-EFFFh) is officially reserved for user-definable code pages (or actually CCSIDs in the context of IBM CDRA), whereas the range 65280-65533 (FF00h-FFFDh) is reserved for any user-definable "private use" assignments. For example, a non-registered custom variant of code page 437 (1B5h) or 28591 (6FAFh) could become 57781 (E1B5h) or 61359 (EFAFh), respectively, in order to avoid potential conflicts with other assignments and maintain the sometimes existing internal numerical logic in the assignments of the original code pages. A code page in the private range, such as 65280 (FF00h), could be assigned to an unregistered private code page not based on an existing code page; to a device-specific code page like a printer font, which just needs a logical handle to become addressable for the system; to a frequently changing download font; or to a code page number with a symbolic meaning in the local environment.

The code page IDs 0, 65534 (FFFEh) and 65535 (FFFFh) are reserved for internal use by operating systems such as DOS and must not be assigned to any specific code pages.
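The ranges and the two worked examples above (437 to 57781, 28591 to 61359) can be checked with a short sketch. The rule used here, keeping the low 12 bits of the original ID and placing them in the E000h block, is an inference from those two examples rather than a quotation from CDRA, and the function names are hypothetical.

```python
PRIVATE_DERIVATIONS = range(0xE000, 0xF000)  # 57344-61439
PRIVATE_DEFINITIONS = range(0xFF00, 0xFFFE)  # 65280-65533
RESERVED_IDS = {0, 0xFFFE, 0xFFFF}           # internal OS use


def derive_private_id(base_id: int) -> int:
    """Assumed rule: keep the low 12 bits of the base ID inside E000h-EFFFh."""
    return 0xE000 | (base_id & 0x0FFF)


def classify(code_page_id: int) -> str:
    """Name the special range a 16-bit code page ID falls into, if any."""
    if code_page_id in RESERVED_IDS:
        return "reserved"
    if code_page_id in PRIVATE_DERIVATIONS:
        return "private-use derivation"
    if code_page_id in PRIVATE_DEFINITIONS:
        return "private-use definition"
    return "regular assignment"


for base in (437, 28591):
    derived = derive_private_id(base)
    print(f"{base} ({base:X}h) -> {derived} ({derived:X}h), {classify(derived)}")
```

Note that the assumed rule discards the top four bits, so two base IDs that differ only in those bits would collide; a real deployment would need to check for that.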

from Grokipedia
A code page is a coded character set that defines the mapping of code points (nonnegative integer values) to abstract characters, enabling the representation of text in computing systems through specific byte sequences. Primarily associated with IBM's Character Data Representation Architecture (CDRA), the term denotes a particular encoding scheme, often single-byte, tailored to support a given language, region, or application.

The concept originated in the mid-20th century as part of IBM's evolution of encoding standards, building on early codes like Hollerith's 12-position system and progressing through 6-bit BCDIC to 8-bit EBCDIC, introduced with the System/360 in 1964. Code page numbers initially referred to literal page numbers in IBM's standard character set manual, which documented various encodings for mainframe and terminal systems, ensuring compatibility across hardware like the IBM 1050 data communication terminal using PTTC (Paper Tape Transmission Code). This numbering system facilitated identification and interchange of character data, and CDRA later formalized it to maintain character identity across diverse code pages via coded character set identifiers (CCSIDs).

Microsoft adopted and expanded the term in the 1980s for its operating systems, starting with DOS code pages like 437 (the original IBM PC encoding, supporting English and extended graphics) and evolving into Windows code pages such as 1252 for ANSI Latin-1 Western European text. These encodings addressed limitations of 7-bit ASCII by providing 8-bit extensions for accented characters, symbols, and regional scripts, though they often conflicted with international standards like ISO 8859. While code pages enabled early globalization efforts, such as supporting European diacritics or East Asian double-byte sets, they suffered from fragmentation, with hundreds of variants leading to data interchange issues.

The rise of Unicode in the 1990s, with its universal repertoire and encodings like UTF-8, has largely supplanted code pages in modern applications, though they persist in legacy IBM iSeries/AS/400 systems, Windows APIs, and certain file formats for backward compatibility.

Fundamentals

Definition and Purpose

A code page is a mapping that associates specific byte values, typically written in hexadecimal notation, with individual characters, symbols, or control codes within a defined character set. This structured table enables computers to interpret and display text by translating binary data into readable or functional elements.

The underlying encoding schemes for what became known as code pages emerged in the 1960s with the development of mainframe systems, such as IBM's System/360 released in 1964, to accommodate the need for character representations beyond the limitations of 7-bit standards like ASCII, particularly for international languages and specialized symbols. The term "code page" itself was adopted by IBM and originally referred to page numbers in its standard character set manuals that documented these encodings. The primary purpose of code pages was to facilitate text processing and data interchange in early computing environments, including mainframes and peripherals, by providing a consistent encoding scheme tailored to specific linguistic or operational requirements.

Single-byte code pages are fixed-width: each character is represented by a single byte (8 bits). Multi-byte code pages exist for languages requiring larger character repertoires, such as East Asian scripts. Code pages also come in vendor-specific implementations (such as those from IBM or Microsoft) and remain in use in legacy systems for compatibility during data exchange. These encodings prioritize simplicity in single-language contexts but require careful handling for multilingual applications.

In basic structure, a single-byte code page consists of a 256-entry table, with each index from 0 to 255 corresponding to a unique byte value that maps to a character or control code. For instance, the first 128 entries often align with standard ASCII for basic Latin characters, while the remaining 128 support extended symbols.
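The 256-entry table described above can be materialised directly. The sketch below uses Python's bundled `cp437` codec as one convenient example of a single-byte code page; the choice of code page and the variable names are illustrative.

```python
# Decode every possible byte value once to obtain the full 256-entry table.
table = bytes(range(256)).decode("cp437")

# The lower half is identical to 7-bit ASCII.
lower_half_is_ascii = table[:128] == bytes(range(128)).decode("ascii")

# Look up a few entries by byte value (the table index is the byte itself).
for byte in (0x41, 0x82, 0xC4):
    print(f"0x{byte:02X} -> {table[byte]!r} (U+{ord(table[byte]):04X})")

print("entries:", len(table), "lower half is ASCII:", lower_half_is_ascii)
```

Because the table index is the byte value, decoding a single-byte code page amounts to one array lookup per byte.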

Numbering System

IBM's code page numbering system originated in the 1960s alongside the development of EBCDIC for its System/360 mainframes, with CP037 introduced as the standard code page for English-speaking regions. This scheme employs three-digit numeric identifiers, typically assigning lower numbers (such as 037 and 500) to EBCDIC-based variants for mainframe environments, while higher numbers (starting from around 300, like 437 for the original IBM PC OEM code page) denote ASCII-based variants for personal computers and other systems.

As code pages proliferated across vendors, the system evolved to accommodate extensions, notably by Microsoft, which adopted four-digit identifiers for Windows-specific encodings, such as 1252 for Windows Latin-1 (Western European). This led to overlaps and aliases, where the same encoding might be referenced by multiple numbers or names across platforms; for instance, IBM's CCSID 037 aligns with Microsoft's code page 037. Specific numbering rules emerged for certain categories, including the range 850–859 reserved for multilingual DOS code pages, with 850 serving as the standard Latin-1 variant supporting multiple Western European languages.

To resolve conflicts and standardize references, the Internet Assigned Numbers Authority (IANA) maintains a registry of character set names and aliases, such as "ibm-1047" for IBM's Open Systems Latin-1 code page (CCSID 1047). In application programming interfaces (APIs), these numbers map directly to encoding names; for example, Windows APIs use numeric identifiers like 1252 to load the corresponding code page, while IBM systems often reference them via CCSIDs in functions for character conversion.
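As one concrete illustration of numbers and aliases resolving to the same encoding, Python's `codecs` registry can be queried by number-based name or by vendor alias. This is only a sketch of one runtime's naming convention; the helper `codec_for_code_page` is hypothetical, and it only works for the code pages Python happens to ship.

```python
import codecs


def codec_for_code_page(number: int) -> codecs.CodecInfo:
    """Hypothetical helper: look up Python's codec for a numeric code page ID."""
    # Python names the code page codecs it ships "cp" plus the number,
    # zero-padded to three digits (for example cp037, cp437, cp1252).
    return codecs.lookup(f"cp{number:03d}")


for number in (37, 437, 500, 850, 1252):
    print(number, "->", codec_for_code_page(number).name)

# Vendor-style aliases resolve to the same canonical codec names.
for alias in ("IBM437", "ibm037", "windows-1252"):
    print(alias, "->", codecs.lookup(alias).name)
```

A number Python does not ship, such as 1047, raises `LookupError`, which is a reminder that the registry of known code pages differs from platform to platform.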

Relationship to ASCII

ASCII-based code pages function as 8-bit supersets of the 7-bit ASCII standard, preserving the original ASCII character repertoire in the range of code values 0 through 127 while utilizing the additional 128 values (128 through 255) to encode extended characters. This design ensures backward compatibility, allowing systems and applications to interpret ASCII data without alteration when the high bit (bit 7) is set to zero. EBCDIC-based code pages, however, use a distinct encoding scheme incompatible with ASCII positions. The American Standard Code for Information Interchange (ASCII) defines 128 characters primarily for English-language text, controls, and basic symbols, forming the foundational layer for most subsequent 8-bit encodings.

A key aspect of this compatibility is evident in code pages like IBM's Code Page 437 (CP437), the original OEM code page for English-language IBM PCs and MS-DOS systems, which retains all ASCII characters in their standard positions while incorporating extended glyphs such as box-drawing elements (e.g., ─, ┌, └) and accented Latin letters for limited international support. These additions enabled early personal computers to display text-mode user interfaces and simple multilingual text without disrupting ASCII-based data interchange. Similarly, the structure supports "high-ASCII" usage, where the extended range facilitates region-specific adaptations while maintaining interoperability with pure ASCII environments.

The ISO/IEC 8859 series represents standardized international extensions of ASCII, with each variant defining an 8-bit code set that includes the full 7-bit ASCII subset and adds characters for specific linguistic needs, such as Western European languages in ISO/IEC 8859-1 (Latin-1). This approach influenced proprietary code page designs, including IBM's Code Page 850 (CP850), a multilingual extension supporting Western European languages like Danish, Dutch, French, German, and Spanish by mapping accented characters and symbols into the upper code range. Such variants promoted broader adoption of ASCII-compatible encodings in diverse computing ecosystems.

In practice, this relationship introduces challenges for data portability, particularly in mixed environments where applications assume a pure 7-bit ASCII subset but encounter 8-bit code page variations across systems. Differences in code page assignments, such as varying ANSI code pages on different computers, can lead to garbled output, where extended characters are misinterpreted or replaced with incorrect glyphs during transfer or display. For instance, assuming universal ASCII compatibility in file exchanges between systems using distinct code pages may result in garbled text, underscoring the need for explicit encoding declarations to mitigate interoperability issues.
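The portability problem is easy to demonstrate: the ASCII half of a string survives any ASCII-based code page, while a single extended byte is read differently by each one. The sample word below is an illustrative choice, and the codecs are the ones bundled with Python.

```python
text = "café"

# Written on a system whose active code page is Windows-1252.
data = text.encode("cp1252")

# Read back under three different ASCII-based code pages.
readings = {codec: data.decode(codec) for codec in ("cp1252", "cp850", "cp437")}

for codec, result in readings.items():
    print(f"{codec}: {result}")

# The first three bytes are plain ASCII and are safe everywhere.
print(data[:3].decode("ascii"))
```

Only the reader that uses the writer's code page recovers the accented letter; the others display a different character for byte 0xE9, which is why an explicit encoding declaration needs to travel with the data.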

Relationship to Unicode

Code pages serve as predefined mappings of byte values to characters, many of which align with subsets of the Unicode standard, where each code page corresponds to specific ranges of code points. For instance, Windows Code Page 1252 (CP1252) maps bytes 0xA0 to 0xFF to the identically numbered code points U+00A0 to U+00FF in the Latin-1 Supplement block, while its printable characters in the 0x80-0x9F range map to code points elsewhere in Unicode (the euro sign at 0x80, for example, is U+20AC), enabling direct representation of Western European scripts within Unicode's universal character set. Similarly, other code pages, such as those based on the ISO 8859 series, fit into Unicode's Basic Latin and extension blocks, facilitating interoperability by treating code page characters as aliases for Unicode scalar values.

Conversion between code pages and Unicode relies on standardized mapping tables and utilities to transform legacy encoded data into Unicode formats like UTF-8 or UTF-16. IBM's Coded Character Set Identifier (CCSID) system provides official mappings from EBCDIC-based code pages to Unicode, allowing dynamic or predefined conversions for multilingual data processing on IBM platforms such as z/OS. Tools such as the libiconv library implement the iconv function for batch conversions, supporting a wide array of code pages to Unicode by referencing internal tables that handle character-by-character substitution. These processes ensure that text in single-byte code pages can be migrated to Unicode while preserving semantic meaning where possible.

Despite these mappings, conversions between certain code pages and Unicode can be imperfect. For example, Code Page 437 (CP437), used in early DOS systems, includes symbols like box-drawing elements and legacy graphics that map to Unicode characters such as U+2500 for horizontal lines, but may result in visual discrepancies or substitution errors during round-trip conversions. Such limitations arise because code pages were designed for specific hardware and locales, often prioritizing display glyphs over universal semantics, leading to potential data fidelity issues in modern Unicode-based applications.

Legacy support for code pages persists through platform-specific APIs, even as Unicode has become the dominant standard since the early 2000s. On Windows, the MultiByteToWideChar function converts strings from a specified code page to UTF-16, handling flags for error detection and substitution of unmappable characters. IBM systems maintain CCSID-to-Unicode conversions in their Unicode Services for use in enterprise environments. However, migration to Unicode has accelerated since the 2000s, with operating systems and software favoring native Unicode implementations to reduce complexity, though code pages remain available for legacy file handling and international data exchange.
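A table-driven conversion of the kind described above can be sketched with Python's codecs, which play the same role as iconv tables here. The helper name `transcode` and the sample bytes are assumptions made for illustration.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Hypothetical helper: convert bytes from one encoding to another via Unicode."""
    return data.decode(source).encode(target)


# Three CP437 box-drawing bytes: top-left corner, horizontal line, top-right corner.
box = b"\xda\xc4\xbf"
utf8 = transcode(box, "cp437")
print(utf8.decode("utf-8"), [f"U+{ord(ch):04X}" for ch in utf8.decode("utf-8")])

# Going back to CP437 restores the original bytes for these characters.
round_trip = transcode(utf8, "utf-8", "cp437")
print(round_trip == box)

# The reverse direction is where data is lost: CP437 has no euro sign,
# so the encoder must substitute a placeholder.
lossy = "5 €".encode("cp437", errors="replace")
print(lossy)
```

In this sketch the loss shows up when Unicode text is forced into a code page that lacks a character, which is the usual reason for substitution characters in converted data.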

IBM Code Pages

EBCDIC-Based Code Pages

EBCDIC, or Extended Binary Coded Decimal Interchange Code, originated as an 8-bit character encoding standard developed by IBM in 1963, primarily to support data processing on mainframe systems and complement the punched card technology prevalent at the time. This encoding was introduced with the IBM System/360 in the mid-1960s, establishing it as the default for IBM's mainframe environments, where it replaced earlier BCDIC formats used in punch card systems. Unlike ASCII, EBCDIC features non-contiguous ordering of characters, with gaps in the alphanumeric sequence: for instance, the uppercase letters A through Z are not encoded contiguously, and the collating sequence places lowercase letters before uppercase letters and digits after letters. IBM assigns numbers to its EBCDIC-based code pages typically in the range 000-199, reflecting their foundational role in mainframe character handling.

Key variants within the EBCDIC family address specific linguistic and regional needs while maintaining core compatibility. Code page 037 (CCSID 37), for example, serves as the standard for U.S. English and related locales, encoding the full Latin-1 character set in an EBCDIC framework. Code page 500 (CCSID 500) provides multilingual support, incorporating a broad Latin-1 charset for international data interchange on mainframes. For enhanced compatibility with open systems, code page 1047 (CCSID 1047) provides Latin-1 coverage aligned with the ISO 8859-1 repertoire. National variants, such as code page 870 (CCSID 870), target Latin-2 multilingual needs for Central and Eastern European languages, supporting characters essential for Romanian, Czech, and similar scripts.

Structurally, EBCDIC diverges from ASCII in ways that reflect its mainframe heritage, including support for zoned decimal formats where numeric data uses a high-order nibble of 0xF for digits, enabling efficient arithmetic operations in legacy applications. Additionally, EBCDIC incorporates graphics escape sequences, such as the Graphic Escape (0x08), to invoke special symbols and control mainframe display and printing devices like 3270 terminals. These features prioritize hardware compatibility over the contiguous, byte-optimized design of ASCII, leading to invariant characters (like digits and basic punctuation) that remain consistent across variants, while variant characters adapt to regional scripts.

In modern contexts, EBCDIC-based code pages remain integral to IBM z/OS operating systems and CICS transaction processing environments, where they encode vast legacy datasets in banking, insurance, and government applications. Conversion to ASCII or Unicode poses significant challenges due to the incompatible ordering and encoding schemes; for instance, direct byte-for-byte mapping can produce invalid ASCII characters exceeding the 7-bit range or misinterpret zoned decimals as text, necessitating specialized tools like IBM's iconv utility or CCSID-aware translators to preserve data integrity during migrations. Such conversions are routine in hybrid environments but require careful handling of variant characters to avoid corruption, especially when interfacing with Unicode-based systems that support far more code points than EBCDIC's 256.
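The structural differences described above (gaps in the alphabet, a different collating order, and digits carrying an 0xF zone nibble) can be observed with Python's bundled `cp037` codec. This is a small sketch; the helper `ebcdic_hex` and the sample strings are illustrative.

```python
def ebcdic_hex(text: str) -> str:
    """Hypothetical helper: show the code page 037 bytes of a string as hex."""
    return text.encode("cp037").hex(" ").upper()


# I and J are adjacent letters but their EBCDIC codes are 8 apart.
print("A I J ->", ebcdic_hex("AIJ"))

# Digits carry the zone nibble 0xF, the basis of zoned decimal fields.
print("123   ->", ebcdic_hex("123"))

# The same items sort differently under ASCII and EBCDIC byte order.
items = ["a", "A", "1"]
ascii_order = sorted(items, key=lambda s: s.encode("ascii"))
ebcdic_order = sorted(items, key=lambda s: s.encode("cp037"))
print("ASCII order: ", ascii_order)
print("EBCDIC order:", ebcdic_order)
```

The differing sort orders are one reason a byte-for-byte copy between EBCDIC and ASCII systems is not enough; sorted files and key ranges may need to be rebuilt after conversion.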

DOS and Early PC Code Pages

In 1981, with the release of PC-DOS 1.0 alongside the original IBM Personal Computer, Code Page 437 (CP437) was established as the default original equipment manufacturer (OEM) code page for the IBM PC, extending the ASCII standard with an additional 128 characters primarily dedicated to block graphics characters for creating borders, tables, and simple diagrams in text-based interfaces. This set, also known as OEM-US or PC-8, maintained full compatibility with ASCII in the 0-127 range while allocating positions 128-255 for line-drawing elements, mathematical symbols, and a limited set of accented Latin characters to support basic graphical applications on early PC displays.

To accommodate international markets, IBM developed multilingual variants of the OEM code pages starting in the late 1980s. Code Page 850 (CP850), introduced with MS-DOS/PC-DOS 3.3 in 1987, targeted Western European languages and Latin American Spanish by replacing many block graphics with additional accented characters such as ñ, ç, and ä, while retaining some line-drawing capabilities for compatibility. Similarly, Code Page 852 (CP852) emerged in 1991 with MS-DOS/PC-DOS 5.0 to support Central and Eastern European Slavic languages, incorporating characters for Polish, Czech, Hungarian, and others like ł, ś, and ň. Code Page 855 (CP855), added in 1994 with MS-DOS/PC-DOS 6.22, focused on Cyrillic scripts for Russian and related languages, prioritizing letters such as я, щ, and ё over extensive graphics.

These code pages evolved to support advanced displays like the Enhanced Graphics Adapter (EGA) and Video Graphics Array (VGA), which allowed multiple font sets to be loaded into video memory for selectable character rendering. Users could switch code pages at boot time using the COUNTRY command in CONFIG.SYS, specifying a country code and the corresponding OEM code page number, enabling runtime adaptation for different locales without hardware changes. This flexibility was crucial for VGA systems, where up to eight font blocks could be stored, with CP437 typically as block 0 and variants like CP850 in subsequent blocks.

The legacy of these DOS and early PC code pages persists in file systems such as FAT, where filenames and directory entries were encoded using the active OEM code page, leading to compatibility challenges in international deployments, such as garbled characters when files created under one locale (e.g., CP850) were accessed under another (e.g., CP437) without proper conversion. This encoding mismatch often required manual code page switching or conversion tools, highlighting the limitations of single-byte encodings in global software distribution.

Emulation and Platform-Specific Code Pages

code pages in the range 800 to 999 were primarily designed for emulation purposes and platform-specific adaptations, enabling compatibility with various character encodings across different systems such as environments and legacy operating systems. These code pages support cross-platform data exchange by mapping characters from one encoding scheme to another, often integrating with standards like for Unix compatibility or providing double-byte character set (DBCS) support for Asian languages. For instance, code pages like CP864 emulate PC encoding for Latin-based systems, facilitating the representation of in environments traditionally using Latin alphabets. In the AIX operating system, a Unix-like platform developed by , specific code pages address regional needs while adhering to standards. CP921, for example, provides support for Latvian and Lithuanian languages, allowing seamless integration of Baltic characters in AIX applications and ensuring compliance with Unix localization requirements. Similarly, CP964 is tailored for Chinese (Taiwan) on AIX, extending support for in Unix-based workflows. For the operating system, created code pages that emphasize multilingual capabilities, including DBCS for Asian languages to handle complex scripts. CP942 serves as a superset of the Microsoft CP932 for Japanese on OS/2, incorporating and characters essential for Japanese text processing. CP943 further enhances this by supporting both CP932 and Shift-JIS encodings, enabling robust DBCS operations in OS/2 environments for East Asian data interchange. These adaptations were crucial for OS/2's role in enterprise computing, where multilingual support was key to global deployments. Emulation code pages within this numeric range focus on bridging legacy -based systems with ASCII-derived encodings for cross-platform compatibility. 
CP423 emulates Greek characters in an context, mapping them to facilitate data exchange from mainframe environments to PC-like Latin setups such as IBM-850. This was particularly useful in terminal emulations and printer drivers requiring Greek script support in mixed-system architectures. Likewise, CP864 supports emulation by providing PC-compatible mappings to Latin structures, aiding in the transition of Arabic data across diverse platforms. These emulation sets, assigned numbers 800-999, were employed in systems like —a microkernel-based successor to —and early web servers to handle international content and ensure reliable data portability. IBM's approach to integrating Unicode with its legacy code page system relies on the Coded Character Set Identifier (CCSID), a numbering scheme that extends traditional code pages to encompass modern encodings like and . This system allows platforms, such as and , to map characters from EBCDIC-based code pages to , facilitating conversions between legacy data and contemporary applications without full system overhauls. By assigning specific CCSIDs to Unicode variants, ensures compatibility across its ecosystem, where CCSID 1208 designates as a growing character set that incorporates new additions over time. Among these, code page (CP1200) represents UTF-16 in little-endian byte order, serving as a bridge for applications requiring wide-character support in environments. Similarly, CCSID 1390 (associated with IBM-1390) provides an EBCDIC-based encoding for Japanese text, featuring an alternative Unicode conversion table that maps double-byte characters to their Unicode equivalents, particularly useful for mixed-script data in East Asian contexts. These mappings prioritize round-trip fidelity, ensuring that characters from legacy Japanese code pages, such as those in CCSID 5026, can be accurately transformed to and from without loss. 
The development of Unicode-related CCSIDs gained momentum after the 1990s, particularly on the iSeries (formerly AS/400) platform, where OS/400 V5R2 introduced explicit Unicode data support to handle globalized applications. This evolution addressed the limitations of earlier EBCDIC-centric systems by incorporating Unicode as a core encoding option, enabling features like GB18030 for Chinese and broader internationalization. IBM leveraged the International Components for Unicode (ICU) library to implement conversion tools that map between CCSIDs and Unicode scalar values, supporting operations in products like Db2 and Integration Bus. ICU's converters handle the nuances of stateful encodings, such as CP1390, by using predefined tables for bidirectional transformations.

As of 2025, while IBM promotes UTF-8 (CCSID 1208) as the preferred encoding for new development due to its simplicity and universal compatibility, legacy CCSIDs such as 1390 remain integral to maintaining existing applications in banking and mainframe environments. This partial shift reflects a strategic balance: Db2 continues to support these CCSIDs to ease data migration, though full reliance on proprietary CCSIDs is discouraged in favor of standard Unicode encodings to reduce conversion overhead. Retention of these code pages underscores their role in hybrid systems where legacy-encoded data persists alongside Unicode workflows.
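The EBCDIC-to-Unicode conversion described above can be illustrated with a minimal sketch. It assumes Python's bundled `cp037` codec as a stand-in for CCSID 037 (US EBCDIC); the helper names are hypothetical, and this is not IBM's or ICU's own conversion tooling.

```python
# Minimal sketch: convert between an EBCDIC code page and UTF-8.
# Assumption: Python's "cp037" codec stands in for IBM CCSID 037.

def ebcdic_to_utf8(data: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Decode EBCDIC bytes to Unicode text, then re-encode as UTF-8."""
    return data.decode(ebcdic_codec).encode("utf-8")


def utf8_to_ebcdic(data: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Decode UTF-8 bytes to Unicode text, then re-encode as EBCDIC."""
    return data.decode("utf-8").encode(ebcdic_codec)


# "Hello" has entirely different byte values in EBCDIC than in ASCII/UTF-8.
ebcdic_hello = "Hello".encode("cp037")
utf8_hello = ebcdic_to_utf8(ebcdic_hello)

print(ebcdic_hello.hex(), "->", utf8_hello.hex())

# Round trip: converting back yields the original EBCDIC bytes.
assert utf8_to_ebcdic(utf8_hello) == ebcdic_hello
```

Unicode acts as the pivot in both directions, which is why round-trip fidelity depends on every legacy character having a distinct Unicode mapping.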

Microsoft Code Pages

Windows Code Pages

Windows code pages, often referred to as ANSI code pages in the Windows environment, are single-byte character encodings designed to support text display and input in graphical user interfaces and applications, extending the capabilities of earlier MS-DOS code pages for broader international use. These code pages map byte values from 128 to 255 to characters specific to various scripts and languages, while preserving the ASCII range (0-127) for compatibility. Microsoft developed them to handle regional linguistic needs in Windows operating systems, with the active code page determined by system locale settings.

The primary Windows code page for Western European languages is CP1252, also known as Windows-1252 or ANSI Latin 1, which has served as the default for English and most Western European applications since 1992. It defines up to 256 code points, including Latin letters with diacritics. It is commonly referred to as an ANSI code page (though this is a misnomer) because it was based on an early draft of what became ISO/IEC 8859-1. Unlike the related ISO-8859-1 (Latin-1), CP1252 populates the 0x80-0x9F range with printable characters such as curly quotes (“ ” ‘ ’) and the em dash (U+2014), using positions that ISO-8859-1 does not assign to printable characters, for better typographic support in applications like word processors.

In 1999, Microsoft updated CP1252 and several related code pages to include the euro sign (€) at code point 0x80, aligning with the introduction of the euro currency on January 1, 1999, as specified in OpenType font standards. This update supported financial and business applications in Eurozone countries without requiring a full encoding overhaul. The code page also incorporates other symbols such as the en dash (U+2013), which aid document formatting in Windows GUI environments.
Microsoft provides regional variants of these code pages in the 1250-1258 range to accommodate other scripts and languages, allowing users to select appropriate encodings via regional settings in the Control Panel. These variants keep the ASCII base but assign script-specific characters to the upper byte range. For example:
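| Code Page | Name | Primary Language(s) |
| --- | --- | --- |
| 1250 | Windows-1250 | Central European (e.g., Polish, Czech) |
| 1251 | Windows-1251 | Cyrillic (e.g., Russian, Bulgarian) |
| 1255 | Windows-1255 | Hebrew |
| 1256 | Windows-1256 | Arabic |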
Each includes the euro sign where relevant, and Windows supports right-to-left rendering for bidirectional scripts like Hebrew and Arabic. These code pages evolved from earlier MS-DOS code pages but are optimized for Windows' graphical interfaces.

Integration with Windows APIs facilitates programmatic handling of these code pages; for instance, the Win32 function GetACP() retrieves the system's current ANSI code page identifier, enabling applications to convert text based on locale. Developers can query code page details using functions like GetCPInfoExA, which returns character information for a given code page and helps multilingual software stay compatible. Regional settings, configurable through the operating system, control which code page is active, promoting consistent text rendering across user environments.
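The practical effect of these regional variants is that one byte value stands for different characters depending on the active code page. A small sketch, assuming Python's bundled `cp125x` codecs match the Windows code pages of the same numbers:

```python
# Minimal sketch: decode the same byte under several Windows code pages.
# Assumption: Python's cp125x codecs mirror the corresponding Windows pages.

RAW = bytes([0xE0])  # a single byte in the upper (non-ASCII) half
WINDOWS_CODECS = ["cp1250", "cp1251", "cp1252", "cp1255", "cp1256"]

decoded = {name: RAW.decode(name) for name in WINDOWS_CODECS}

for name, ch in decoded.items():
    # Print the code page, the resulting Unicode code point, and the glyph.
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

Because nothing in the byte itself identifies the code page, text exchanged between systems with different locales needs an explicit encoding label.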

MS-DOS and DBCS Code Pages

Microsoft's code pages for MS-DOS extended the original IBM PC code page (CP437), which was based on ASCII with added block graphics and accented Latin characters, to support various international languages through variants introduced in the 1980s. These variants included single-byte character sets (SBCS) tailored for non-Latin scripts, such as CP720 for Arabic (Transparent ASMO), which retained box-drawing characters while accommodating right-to-left text and diacritics, and was added in MS-DOS 6.22 in 1994.

For Asian languages requiring more than 256 characters, Microsoft implemented double-byte character sets (DBCS) starting with MS-DOS version 4.0 in 1988, using code pages in the 700-999 range to distinguish them from SBCS. Notable examples include CP932 for Japanese, an extension of Shift-JIS that maps several thousand characters via lead bytes (0x81-0x9F and 0xE0-0xFC) followed by trail bytes, and CP936 for Simplified Chinese, based on GBK with similar lead/trail byte mechanisms for encoding hanzi and other symbols. In DBCS mode, the system interprets certain byte ranges as lead bytes signaling a following trail byte, enabling dense representation of large character sets while maintaining ASCII compatibility for the first 128 code points.

Code page switching in MS-DOS was managed through the MODE command, introduced in version 3.3 and enhanced in later releases, allowing users to prepare, select, and refresh code pages for devices like displays and printers. For instance, "MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)" loaded code page 850 for the console, followed by "MODE CON CODEPAGE SELECT=850" to activate it, with NLSFUNC.EXE providing the necessary national language support. DBCS code pages required special loading via DISPLAY.SYS or similar drivers in international versions. Support for Far East markets, including DBCS handling, was significantly improved in MS-DOS 5.0, released in 1991, enabling better integration of Japanese, Chinese, and Korean locales through dedicated language editions.
However, filename handling on the FAT file system imposed limitations: the 8.3 format (8 characters for the name, 3 for the extension) was enforced, and while DBCS characters were permitted starting in DBCS-enabled versions, each double-byte character consumed two bytes in the fixed 11-byte directory entry field, effectively reducing the maximum number of characters in a name. This byte-level constraint often led to truncated or incompatible filenames when mixing SBCS and DBCS elements across systems.
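The lead-byte mechanism and the byte-level filename limit can be sketched as follows. The lead-byte ranges used here (0x81-0x9F and 0xE0-0xFC) are the ones commonly documented for code page 932 and should be treated as an assumption; the helper names are hypothetical, and Python's `cp932` codec is used only to produce sample bytes.

```python
# Minimal sketch of DBCS scanning for code page 932 (Japanese).
# Assumption: lead bytes are 0x81-0x9F and 0xE0-0xFC.

def is_lead_byte(b: int) -> bool:
    """Return True if the byte starts a two-byte character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_dbcs(data: bytes) -> list:
    """Split a byte string into one- and two-byte character units."""
    units = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following trail byte as well.
        width = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        units.append(data[i:i + width])
        i += width
    return units


name = "日本語A"                  # three double-byte characters plus one ASCII
encoded = name.encode("cp932")
units = split_dbcs(encoded)

print(len(name), "characters,", len(encoded), "bytes")

# The 8.3 format allows 8 bytes for the base name, not 8 characters.
fits_in_8_3_base = len(encoded) <= 8
```

Four characters occupy seven bytes here, so a name of five double-byte characters would already exceed the eight-byte base-name field.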

Emulation Code Pages

Emulation code pages in Windows provide mappings for character sets developed by other vendors, enabling interoperability and data portability across platforms without native support. These code pages are particularly valuable for handling legacy files and applications from systems like Apple Macintosh or from Indian standards, where a lack of direct compatibility might otherwise lead to garbled text. By emulating external encodings, Windows allows developers and users to import and process foreign data through standard APIs, such as the Win32 conversion functions.

Code Page 10000 (CP10000) specifically emulates Apple's Mac Roman encoding, an 8-bit character set designed for Western European languages on Macintosh systems. Mac Roman extends ASCII with 128 additional characters, including accented letters, currency symbols, and typographic marks unique to Apple's early font libraries. This emulation maps these Apple-specific glyphs to equivalent Windows representations, facilitating the exchange of text files, such as documents created in Mac applications like Microsoft Word for Mac or Adobe tools on older systems. For instance, symbols like the Apple logo or fraction characters are preserved during conversion, preventing display issues in Windows environments.

Similarly, CP57002 emulates the Indian Script Code for Information Interchange (ISCII) standard for the Devanagari script, supporting languages such as Marathi. ISCII, established as an 8-bit encoding for Indian scripts, unifies multiple regional writing systems under a single framework. In Windows, this code page enables the processing of ISCII-encoded data from non-Microsoft Indian software, allowing conjunct consonants and vowel signs to render accurately in cross-vendor scenarios, such as importing legacy government or educational documents.

For PostScript-based systems, emulation relies on alignment with Adobe character sets, notably the Adobe Standard Encoding, which corresponds to code page 1276.
This encoding supports Latin text with additional printing symbols for PostScript compatibility, allowing Windows applications to interpret Adobe-generated files without loss of glyphs like mathematical operators or diacritics. It is important for legacy desktop publishing workflows, where PostScript output must integrate with Windows print drivers. Since the early 2000s, Microsoft has reserved the 10xxx numbering range for such emulations, primarily targeting Macintosh variants to broaden cross-platform support. Examples include CP10001 for x-mac-japanese and CP10002 for x-mac-chinesetrad, used in Windows APIs for legacy data import. This systematic assignment helped maintain compatibility as Unicode adoption grew, prioritizing portability over exhaustive native implementations.

Microsoft introduced Unicode-related code pages to facilitate direct integration with the Unicode standard within Windows environments, enabling applications to handle international text without relying solely on legacy single-byte encodings. Code page 1200 (CP1200) represents UTF-16 in little-endian byte order, which encodes the Basic Multilingual Plane in single 16-bit code units (and other characters as surrogate pairs) and serves as the primary internal representation for Unicode strings in Windows APIs. Similarly, code page 65001 (CP65001) implements UTF-8, a variable-length encoding built from 8-bit code units that supports the full Unicode repertoire while maintaining compatibility with ASCII for English text. These UTF-based code pages have been standard since 1996, allowing developers to use Unicode transformations alongside traditional code pages.

For double-byte character set (DBCS) environments, particularly in East Asian locales, Microsoft extended support through code page 54936 (CP54936), which corresponds to the GB18030 standard for Simplified Chinese.
This code page builds on legacy DBCS encodings like CP936 (GBK) by incorporating four-byte sequences to achieve complete coverage of the Unicode standard, including rare and historical characters not representable in earlier GB standards. CP54936 allows applications handling Chinese text to convert to and from Unicode without data loss, addressing limitations in prior DBCS implementations.

Windows provides APIs such as WideCharToMultiByte for converting between Unicode (UTF-16) strings and multi-byte representations in specified code pages, including UTF-8 (CP65001). This function maps wide-character strings to byte sequences, supporting flags for error handling and default character substitution to maintain data integrity during transformations. Complementing these are system functions like GetACP for retrieving the active code page identifier, which helps applications identify and adapt to the current encoding context, though Microsoft recommends direct Unicode usage over code page dependencies.

In Windows 10 and subsequent versions, Microsoft has discouraged reliance on non-Unicode (ANSI) code pages in favor of UTF-8 and UTF-16. Version 1903 added support for the activeCodePage manifest property, which lets an application opt in to UTF-8 as its process code page, and a beta administrative setting allows UTF-8 to be used as the system locale. These options promote consistent rendering of international text in GDI and console output as of 2025. However, code pages remain available for legacy support, particularly in SQL Server environments where older collations and data imports may require them for compatibility with pre-Unicode databases.
| Code Page | Encoding | Introduction | Key Use |
| --- | --- | --- | --- |
| 1200 | UTF-16LE | 1996 | Internal Unicode string handling in APIs |
| 65001 | UTF-8 | 1996 | Cross-platform text interchange and web content |
| 54936 | GB18030 | 2001 | Full Unicode coverage for Simplified Chinese DBCS |
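The relationship between these three code pages can be illustrated outside Windows. The sketch below is not a call to WideCharToMultiByte; it assumes Python's `utf-16-le`, `utf-8`, and `gb18030` codecs correspond to code pages 1200, 65001, and 54936, and the mapping table name is hypothetical.

```python
# Minimal sketch: the same text in three Unicode-complete encodings.
# Assumption: these Python codec names match the Windows code page numbers.

CODE_PAGE_TO_CODEC = {
    1200: "utf-16-le",
    65001: "utf-8",
    54936: "gb18030",
}

# ASCII, Latin, a BMP ideograph, and a character outside the BMP (U+20000).
text = "A\u00e9\u4e2d\U00020000"

sizes = {}
for code_page, codec in CODE_PAGE_TO_CODEC.items():
    encoded = text.encode(codec)
    # All three encodings cover the full Unicode repertoire, so the
    # round trip must reproduce the original string exactly.
    assert encoded.decode(codec) == text
    sizes[code_page] = len(encoded)

print(sizes)

# Characters outside the BMP take four bytes in all three encodings.
supplementary = "\U00020000"
supplementary_sizes = {
    cp: len(supplementary.encode(codec)) for cp, codec in CODE_PAGE_TO_CODEC.items()
}
```

The byte counts differ per encoding, but no information is lost in any direction, which is the property that distinguishes these code pages from the legacy single-byte and double-byte ones.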

Code Pages from Other Vendors

HP Symbol Sets

HP Symbol Sets refer to a family of proprietary 8-bit character encodings developed by Hewlett-Packard (HP) in the 1980s for use in its printer control languages, particularly Printer Command Language (PCL), and operating systems like HP-UX. These sets function similarly to standard code pages by mapping byte values to glyphs, enabling the printing and display of extended characters beyond ASCII, including Western European accents, currency symbols, and mathematical notations. Originating with the introduction of the LaserJet printer in 1984, they were designed to support multilingual text in printing environments while maintaining compatibility with early hardware.

The foundational set, HP Roman-8 (PCL identifier 8U), extends the 7-bit US ASCII standard into an 8-bit encoding, with the lower 128 code points (0x00–0x7F) matching ASCII and the upper 128 (0x80–0xFF) providing additional symbols such as accented letters (e.g., à, é, ñ) and symbols like ± (plus-minus), ° (degree), and µ (micro). This set, equivalent to IBM code page 1051, was specifically tailored for HP's early LaserJet series and PCL 5 implementations, supporting up to 218 printable glyphs in bound fonts. A variant, HP Turkish-8 (PCL identifier 8T), modifies Roman-8 to include Turkish-specific characters like ğ, ı, and ş, facilitating localization for that language while retaining core ASCII compatibility. These sets prioritize printer output, with structures modeled after ISO 8859 standards but customized for HP hardware constraints.

In terms of structure, HP Symbol Sets divide the 256 possible code points into areas: areas 0 and 2 for control or non-printing functions, and areas 1 and 3 for printable glyphs, allowing flexible binding to scalable fonts such as Intellifont typefaces. They are closely tied to HP's font cartridges, such as the 92286Z cartridge, which preloads glyphs mapped to Roman-8 for consistent rendering in early LaserJet models without requiring full font downloads.
Mathematical symbol support in Roman-8 is limited to a small number of symbols such as ±, °, and µ, which makes it adequate for basic technical printing but limited compared to dedicated math encodings. Integration occurs via PCL escape sequences, such as ESC (8U to select Roman-8 as the primary symbol set or ESC )8U for the secondary set, enabling dynamic switching during print jobs without resetting the printer. This allows applications to embed multinational text in documents processed by HP printers. In HP-UX, Roman-8 serves as the default codeset for terminals, providing compatibility with legacy Unix applications and proper handling of extended characters in system locales. While primarily HP-native, these sets have been emulated in other environments for cross-platform printing compatibility.
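The escape sequences above are short byte strings that a driver prepends to the text it sends to the printer. A minimal sketch, assuming Python's `hp_roman8` codec as an approximation of Roman-8; the helper `select_symbol_set` is hypothetical and covers only the two sequences named in the text.

```python
# Minimal sketch: build PCL symbol-set selection sequences for a print job.
# Assumption: Python's "hp_roman8" codec approximates HP Roman-8.

ESC = b"\x1b"


def select_symbol_set(set_id: str, primary: bool = True) -> bytes:
    """Return ESC ( <id> for the primary set or ESC ) <id> for the secondary."""
    paren = b"(" if primary else b")"
    return ESC + paren + set_id.encode("ascii")


# Select Roman-8 (8U) as the primary symbol set, then send encoded text.
job = select_symbol_set("8U") + "caf\u00e9".encode("hp_roman8")

print(job)
```

A real job would also carry font selection and page setup commands; only the symbol-set portion is shown here.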

Adobe and Other Emulation Sets

Adobe Standard Encoding, introduced in 1985 as part of Adobe's PostScript LanguageLevel 1, serves as the foundational encoding for text representation in PostScript documents and fonts. This encoding extends ASCII with additional diacritics, symbols, and typographic elements, enabling consistent mapping across printers and software in early desktop publishing workflows. It forms the basis for PDF text handling, where character codes are indexed to glyph names in font dictionaries like Type 1 formats. IBM later assigned code page 1276 to this encoding in 1995 to facilitate compatibility in multi-platform environments.

Variants of Adobe Standard Encoding address non-Latin scripts, such as Adobe Standard Cyrillic, specified in 1998 to support Russian and related languages by mapping the upper 128 code points to ISO 8859-5 equivalents while using alphanumeric glyph names compatible with PostScript. Similarly, Adobe encodings for Greek, often via the Expert Encoding vector, incorporate polytonic characters and diacritics for classical and modern Greek typography in PostScript fonts. These variants emulate regional standards while maintaining PostScript's device-independent rendering.

Other vendors developed code pages to emulate Adobe sets for cross-system compatibility. For instance, IBM code page 1038, which corresponds to Adobe's Symbol Encoding, allows mathematical and special symbols to render consistently in PostScript-derived outputs. In desktop publishing, Adobe encodings enabled precise typographic control during the 1980s and 1990s, powering tools like PageMaker for high-quality output to PostScript devices. However, modern tools face challenges with glyph substitution, where legacy Adobe mappings may trigger incorrect fallbacks or missing characters in Unicode-based workflows, requiring manual overrides in applications like InDesign to preserve fidelity.

DEC and Additional Vendor Sets

Digital Equipment Corporation (DEC) developed the National Replacement Character Sets (NRCS) as a feature for its VT series of computer terminals, beginning with the VT200 series in the early 1980s. These sets consist of 7-bit character encodings that modify the standard ASCII set by substituting a small number of graphic characters with equivalents tailored to specific national languages or dialects, enabling localized text display without requiring full 8-bit support. For instance, the DEC Greek NRCS replaces certain symbols, including curly braces, with Greek letters such as alpha and beta, facilitating Greek text input and display on terminals like the VT220 and VT320.

A key component of DEC's ecosystem was the DEC Multinational Character Set (MCS), introduced in 1983 for the VT220 terminal and registered by IBM as code page 1100 (also known as CCSID 1100). This 8-bit set supports Western European accented characters and symbols while maintaining compatibility with ASCII in the 32–126 range, and terminals offered both 7-bit and 8-bit modes for systems requiring broader coverage. The MCS was integral to DEC's operating environments, including the VMS operating system for VAX computers and earlier PDP-11 systems, where it handled multinational text processing in applications and assemblers like MACRO-11.

Beyond DEC, other vendors introduced specialized code pages during the same era. NeXT Computer, Inc., used the NeXT character set (often referred to as NS Roman in documentation) in its NeXTSTEP operating system starting in 1988; this 8-bit encoding is based on Adobe's Standard Encoding and extends it with symbols, accented Latin characters, and typographic elements for desktop publishing and user interfaces on NeXT workstations.
Sun Microsystems extended character support in Solaris through custom locale definitions and code page mappings, incorporating extensions for international text handling, such as multi-byte support for Asian scripts and supplementary mappings for European languages beyond ISO 8859 standards. These DEC and vendor-specific sets were predominantly used in terminal-based and early workstation environments, but their legacy persists in modern emulation: some current terminal emulators support DEC NRCS and MCS selections via escape sequences, allowing compatibility with legacy applications.
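The substitution principle behind NRCS can be sketched with a small replacement table. The example assumes the United Kingdom set, in which the number sign at 0x23 is replaced by the pound sign; the table and function names are hypothetical, and other national sets replace more positions than shown here.

```python
# Minimal sketch of a 7-bit National Replacement Character Set.
# Assumption: the UK set replaces only 0x23 ('#') with the pound sign.

UK_NRCS = {0x23: "\u00a3"}


def decode_nrcs(data: bytes, replacements: dict) -> str:
    """Decode 7-bit data, applying the national replacements over ASCII."""
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"not a 7-bit value: {b:#04x}")
        # Positions not listed in the table keep their ASCII meaning.
        chars.append(replacements.get(b, chr(b)))
    return "".join(chars)


line = decode_nrcs(b"Price: #5", UK_NRCS)
print(line)
```

The cost of this approach is visible in the example: once the set is selected, the replaced ASCII character can no longer be displayed.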

Code Page Assignments and Lists

Numbering Assignments by Vendor

IBM maintains a structured numbering system for its code pages, referred to as Coded Character Set Identifiers (CCSIDs). The range 000–199 is primarily allocated to EBCDIC-based encodings, supporting legacy mainframe environments and international variants such as CCSID 037 for U.S. English EBCDIC. Numbers 300–499 are largely assigned to ASCII and compatible code pages, including CCSID 437 for the original IBM PC OEM character set. Extensions and additional encodings, such as those for double-byte character sets, occupy numbers 500 and higher, with the full registry documented in IBM's official resources for system configuration and data conversion.

Microsoft employs a distinct assignment scheme. Very small values act as aliases rather than encodings: the constants CP_ACP (0) and CP_OEMCP (1) refer to the system's current ANSI and OEM code pages, the latter typically a page such as 437 used by console and legacy DOS applications. The 1250–1258 range designates ANSI code pages, such as 1252 for Western European languages, which extend the basic ASCII set for graphical user interfaces. Unicode encodings have their own identifiers, including 1200 for UTF-16 little-endian and 65001 for UTF-8, and many of these code pages are also registered with IANA to support interoperability across Windows systems and international software.

Other vendors follow proprietary numbering conventions tailored to their hardware and software ecosystems. Hewlett-Packard assigns numbers 0–99 to symbol sets within its Printer Command Language (PCL), enabling precise character mapping for printing tasks, such as Roman-8 under set 8U. Digital Equipment Corporation (DEC) used numbers 10–99 for its National Replacement Character Sets (NRCS) in VT-series terminals, with 10 denoting the U.S. ASCII variant and higher numbers for European locales. To mitigate conflicts arising from overlapping assignments across vendors, standardized aliases are employed, such as equating windows-1252 to cp1252 in cross-platform applications. Significant gaps exist in code page coverage, particularly with unassigned numbers beyond 2000, reflecting the shift away from proprietary extensions toward universal encodings.
The IANA character set registry, last updated June 6, 2024, registers these with aliases for interoperability.

Common Code Page Charts and Mappings

Code page charts provide tabular representations of byte values mapped to characters, typically in hexadecimal (hex) or decimal formats, facilitating conversion between legacy encodings and modern standards like Unicode. These charts are essential for developers and system administrators handling text in older software or files. For instance, the US OEM code page (CP437), originally designed for IBM PCs and MS-DOS, extends ASCII with graphics characters for box-drawing and international symbols. Its mapping table, hosted by the Unicode Consortium, lists 256 entries from byte 0x00 to 0xFF, where the first 128 bytes (0x00-0x7F) align with ASCII control and printable characters, while the upper range (0x80-0xFF) includes shading and line-drawing elements (e.g., 0xB0 to light shade U+2591), Greek letters (e.g., 0xE0 to alpha U+03B1), and mathematical symbols (e.g., 0xF6 to division sign U+00F7).

To interpret such mappings, locate the byte value in hex (base-16, e.g., 0x41) or decimal (base-10, e.g., 65), which corresponds to a Unicode code point (e.g., U+0041 for 'A'). Tools like BabelMap, a free Windows application developed by Unicode expert Andrew West, allow users to visualize these mappings by selecting a code page from installed system encodings and displaying glyphs alongside Unicode equivalents, supporting searches by byte value or character name for accurate conversions. A representative partial table for CP437 illustrates its graphics focus:
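| Hex Byte | Decimal | Unicode | Character/Description |
| --- | --- | --- | --- |
| 0x00 | 0 | U+0000 | NULL |
| 0x01 | 1 | U+0001 | START OF HEADING |
| 0x20 | 32 | U+0020 | SPACE |
| 0x41 | 65 | U+0041 | LATIN CAPITAL LETTER A |
| 0xB0 | 176 | U+2591 | LIGHT SHADE |
| 0xB1 | 177 | U+2592 | MEDIUM SHADE |
| 0xDA | 218 | U+250C | BOX DRAWINGS LIGHT DOWN AND RIGHT |
| 0xE0 | 224 | U+03B1 | GREEK SMALL LETTER ALPHA |
| 0xF6 | 246 | U+00F7 | DIVISION SIGN |
| 0xFF | 255 | U+00A0 | NO-BREAK SPACE |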
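A chart like the one above can be regenerated from any codec that implements the mapping. A minimal sketch, assuming Python's bundled `cp437` codec follows the same table; the sample byte list is arbitrary.

```python
# Minimal sketch: rebuild rows of a CP437 chart from a codec.
# Assumption: Python's "cp437" codec matches the published mapping table.
import unicodedata

SAMPLE_BYTES = [0x41, 0xB0, 0xB1, 0xDA, 0xE0, 0xF6, 0xFF]

rows = []
for b in SAMPLE_BYTES:
    ch = bytes([b]).decode("cp437")
    # Record the byte value, Unicode code point, and Unicode character name.
    rows.append((b, ord(ch), unicodedata.name(ch, "<unnamed>")))

for b, cp, name in rows:
    print(f"0x{b:02X}  {b:3d}  U+{cp:04X}  {name}")
```

Replacing the codec name produces the equivalent chart for another code page, which is a quick way to compare two encodings byte by byte.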
Another key example is CP1252, the Windows Latin-1 code page for Western European languages, which differs from ISO-8859-1 primarily in the 0x80-0x9F range: ISO-8859-1 reserves these for control characters, but CP1252 assigns printable symbols, enabling broader text display in early Windows applications. For instance, byte 0x80 maps to the euro sign (U+20AC), 0x82 to single low-9 quotation mark (U+201A), and 0x85 to horizontal ellipsis (U+2026), while bytes like 0x81 and 0x8D remain undefined. The full mapping covers 256 byte values, with 0x00-0x7F matching ASCII and 0xA0-0xFF aligning with ISO-8859-1. Sample mappings highlighting CP1252's differences:
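| Hex Byte | Decimal | Unicode | Character/Description (vs. ISO-8859-1 Control) |
| --- | --- | --- | --- |
| 0x80 | 128 | U+20AC | EURO SIGN |
| 0x82 | 130 | U+201A | SINGLE LOW-9 QUOTATION MARK |
| 0x85 | 133 | U+2026 | HORIZONTAL ELLIPSIS |
| 0x91 | 145 | U+2018 | LEFT SINGLE QUOTATION MARK |
| 0x9F | 159 | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS |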
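The difference between the two encodings can be checked mechanically. A minimal sketch, assuming Python's `cp1252` and `latin-1` codecs as stand-ins for CP1252 and ISO-8859-1; note that this codec treats the unassigned CP1252 bytes as errors, whereas some other implementations map them to control characters.

```python
# Minimal sketch: compare CP1252 with ISO-8859-1 in the 0x80-0x9F range.
# Assumption: Python's "cp1252" codec rejects the unassigned byte values.

differences = {}   # byte value -> printable character assigned by CP1252
undefined = []     # byte values with no CP1252 assignment in this codec

for b in range(0x80, 0xA0):
    raw = bytes([b])
    try:
        differences[b] = raw.decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)

# Outside that range the two encodings agree byte for byte.
same_upper = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(0xA0, 0x100)
)

print(len(differences), "printable;", [hex(b) for b in undefined], "undefined")
```

This is why text labelled ISO-8859-1 but actually produced on Windows is often decoded as CP1252 in practice: the only bytes affected are ones that would otherwise be invisible control codes.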
Common code pages by usage include legacy encodings still encountered in files, databases, and embedded systems, with web statistics showing ISO-8859-1 at 1.0%, Windows-1252 (CP1252) at 0.3%, and Shift_JIS (related to CP932) at 0.1% as of November 2025. Other frequently referenced ones are CP437 (OEM US), CP850 (DOS Multilingual Latin-1), CP932 (Windows Japanese), and CP1251 (Windows Cyrillic), often prioritized in migration tools due to their prevalence in regional software. The IANA registry, last updated June 6, 2024, registers these with aliases for interoperability: e.g., windows-1252 (MIBenum 2252), ibm437 or cp437 (MIBenum 2011), and Windows-31J for CP932 (MIBenum 2024). Excerpts include windows-1250 (Central European, MIBenum 2250), windows-1251 (Cyrillic, MIBenum 2251), and windows-1253 (Greek, MIBenum 2253), emphasizing vendor-specific mappings.

Limitations and Criticism

Historical Limitations

Code pages, particularly single-byte variants, were inherently limited to 256 characters due to their reliance on 8-bit encoding schemes, which proved insufficient for representing the vast repertoires required by many non-Latin scripts. For instance, languages such as Chinese, Japanese, and Korean (collectively known as CJK) demand tens of thousands of characters, necessitating double-byte character sets (DBCS) to extend beyond the 256-code-point barrier and accommodate up to 65,536 possibilities with two bytes. This constraint forced developers to implement complex shift mechanisms or separate encodings, complicating software development and data interchange across systems.

Vendor-specific implementations exacerbated these issues through fragmentation, where incompatible code pages led to widespread mojibake: garbled text resulting from mismatched interpretations during data transfer. A notable example is the divergence between IBM's Code Page 850 (designed for multilingual Latin support in DOS environments) and Microsoft's Windows-1252 (an extension of ISO 8859-1 for Western European languages), which assign different characters throughout the 0x80–0xFF range, causing accented letters or symbols to appear as unrelated glyphs when files were opened in the wrong environment. Such incompatibilities were common in cross-platform exchanges during the 1980s and 1990s, as vendors like IBM, Microsoft, and DEC prioritized proprietary optimizations over interoperability, resulting in frequent data corruption without explicit encoding declarations.

In the 1980s and 1990s, code pages lacked standardized mechanisms for sorting and collation, leading to inconsistent ordering of text across applications and locales, as there were no uniform rules for precedence among characters beyond basic ASCII.
Additionally, heavy dependencies on hardware configurations plagued deployment; printers and terminals, such as IBM's 3270 series, required specific code page support embedded in firmware or drivers, rendering output unpredictable when mismatched with the host system's encoding and often necessitating custom mappings for reliable rendering.

Specific events underscored these vulnerabilities, including the 1999 introduction of the euro sign (€), which prompted urgent code page revisions akin to Y2K preparations, as vendors like IBM created new "Euro Country Extended Code Pages" (ECECPs) by reassigning existing code points, such as replacing the international currency symbol in positions like 0x9F, potentially disrupting legacy applications reliant on prior mappings. Similarly, early web content suffered from display errors due to unstandardized code page handling in HTTP transfers, where text conversion between disparate systems frequently produced mojibake, as browsers and servers defaulted to local encodings without robust negotiation protocols.
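The mismatch between Code Page 850 and Windows-1252 described in this section can be reproduced in a few lines. A minimal sketch, assuming Python's `cp850` and `cp1252` codecs as stand-ins for the two code pages; the sample word is arbitrary.

```python
# Minimal sketch: mojibake from reading Windows-1252 bytes as Code Page 850.
# Assumption: Python's "cp1252" and "cp850" codecs match the two code pages.

text = "caf\u00e9"

# Written on a Windows-1252 system, then opened on a CP850 system.
stored = text.encode("cp1252")
garbled = stored.decode("cp850")

# The damage is reversible only if the wrong guess is known and every
# byte was preserved: re-encode with the wrong codec, decode with the right one.
restored = garbled.encode("cp850").decode("cp1252")

print(repr(text), "->", repr(garbled), "->", repr(restored))
```

The ASCII letters survive because both code pages share the lower 128 positions; only the accented letter changes, which is why such corruption often went unnoticed in mostly English text.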

Transition to Unicode and Modern Critique

The development of Unicode emerged as a direct response to the fragmentation caused by numerous proprietary code pages, which complicated multilingual text handling across systems. In October 1991, the Unicode Consortium released version 1.0 of the Unicode Standard, establishing a universal character encoding scheme to unify representations of scripts from diverse languages and resolve the incompatibilities inherent in code page proliferation. This standard assigned unique code points to characters, enabling a single encoding to support multiple scripts including Latin, Greek, Cyrillic, and others, with thousands of characters initially, far surpassing the limitations of single-byte code pages like those from IBM or Microsoft.

The transition accelerated in operating systems, with Microsoft Windows 2000, released in February 2000, marking a pivotal shift by upgrading its internal character handling from the fixed-width UCS-2 to the variable-width UTF-16 encoding, thereby prioritizing Unicode over legacy code pages for new applications. Subsequent Windows versions further emphasized UTF-8 and UTF-16 as preferred encodings, with Microsoft recommending their use in APIs to avoid code page dependencies. By the early 2000s, major platforms like Windows had internalized Unicode support, reducing reliance on code pages for core text processing while retaining backward compatibility through conversion layers.

Despite this shift, legacy code page debt persists in modern databases, particularly in systems like SQL Server, where older collations tied to specific code pages, such as SQL_Latin1_General_CP1_CI_AS (based on code page 1252), cause sorting and comparison inconsistencies for Unicode data. These legacy collations perform incomplete code-point comparisons, leading to issues in mixed-language environments and requiring manual conversions between varchar (code page-bound) and nvarchar (Unicode) columns during migrations.
Such remnants complicate upgrades, as seen in SQL Server 2022 and later, where deprecated binary collations exacerbate performance overhead in globalized applications.

Security vulnerabilities arise from code page misinterpretation, where improper handling of character encodings allows attackers to bypass filters by exploiting discrepancies between assumed and actual encodings. For instance, Unicode normalization differences can enable injection attacks, as malicious input disguised via alternative encodings evades validation in legacy systems still parsing code page data. The OWASP Foundation highlights how Unicode encoding variants can conceal payloads in web inputs, leading to cross-site scripting (XSS) or SQL injection when code pages interpret bytes differently from UTF-8 expectations. These risks are amplified in transitional environments, where mixed code page and Unicode usage creates parsing ambiguities exploitable for command insertion.

Critics argue that maintaining code page emulators imposes unnecessary environmental costs, as legacy systems consume more energy due to inefficient processing and outdated hardware dependencies compared to streamlined Unicode implementations. For example, emulating code page conversions in virtualized environments increases computational overhead, contributing to higher carbon emissions from data centers. This ongoing support for obsolete encodings diverts resources from sustainable computing practices, perpetuating e-waste through prolonged use of incompatible hardware.

Code pages create accessibility barriers for non-Latin scripts, as their limited character sets fail to render complex writing systems properly in assistive technologies. Screen readers often mispronounce or garble text when encountering code page-mapped glyphs for bidirectional or combining characters, hindering comprehension for visually impaired users in multilingual contexts.
Without Unicode's comprehensive support, these scripts suffer from incomplete phonetic mapping, exacerbating exclusion for billions of people worldwide using non-Latin scripts.

In 2025, regulatory pressures underscore the transition's incompleteness, with the European Accessibility Act mandating accessible digital formats for products and services, including government documents, by June 28, 2025, to ensure compatibility across member states' diverse languages. Gaps persist, particularly for legacy code pages in IoT devices, where integration challenges arise from undocumented serial protocols relying on vendor-specific encodings, leading to failures in industrial settings. These omissions highlight ongoing risks in embedded systems, where incomplete mappings contribute to data silos and security blind spots.
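The filter-evasion risk raised in this section can be shown with a deliberately simplified check. This is a sketch only: the blocklist, the filter functions, and the payload are hypothetical, and real input validation should not rely on substring matching. It uses fullwidth angle brackets (U+FF1C, U+FF1E), which NFKC normalization folds to their ASCII forms.

```python
# Minimal sketch: an encoding variant slipping past a naive filter.
# Assumption: a later processing stage normalizes input with NFKC.
import unicodedata

BLOCKLIST = ("<script",)


def naive_filter(value: str) -> bool:
    """Return True if the input looks safe to a byte-for-byte check."""
    lowered = value.lower()
    return not any(bad in lowered for bad in BLOCKLIST)


def normalizing_filter(value: str) -> bool:
    """Normalize first so compatibility variants are checked as ASCII."""
    return naive_filter(unicodedata.normalize("NFKC", value))


# Fullwidth '<' and '>' are distinct code points from ASCII '<' and '>'.
payload = "\uff1cscript\uff1ealert(1)"

print(naive_filter(payload), normalizing_filter(payload))
```

The general point is that validation and interpretation must agree on one encoding and one normalization form; checking in one representation and processing in another leaves a gap.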

Private and Custom Code Pages

Definition and Usage

Private code pages, also referred to as custom code pages, are non-standard character encodings that establish unique, user-defined mappings between byte values (typically in the 8-bit range) and characters or symbols, without registration in official registries such as the IANA character sets list or in a vendor's own code page numbering scheme. These mappings are frequently developed for proprietary data representation, such as 8-bit extensions in video games to accommodate custom symbols, icons, or graphical elements not covered by standard encodings.

In practice, private code pages find application in resource-constrained environments such as embedded systems, where tailored encodings minimize storage and processing overhead for displaying text on microcontrollers or low-memory devices. They also appear in legacy DOS software that supported user-created code pages through editable .WCP files to handle characters from unsupported regional sets, for example by adding a symbol to Code Page 1252 via custom byte assignments. A key drawback is their inherent non-interoperability: data encoded this way is opaque or garbled on any system that lacks the proprietary mapping, which complicates data exchange across platforms.

Creating private code pages generally involves specialized tools, such as custom font editors that let designers assign glyphs to byte values, or plain-text configuration files that define the mappings for integration into software or firmware. While the Internet Assigned Numbers Authority (IANA) actively discourages the creation and deployment of unregistered encodings in order to foster global compatibility in internet protocols, limited private customization remains possible within established standards such as Unicode's designated areas. Specifically, Unicode reserves the Private Use Area (PUA) code points from U+E000 to U+F8FF for such non-standard assignments, enabling applications to define and interpret characters privately without conflicting with the core standard.

Historically, private code pages gained traction in early Bulletin Board Systems (BBS) during the 1980s and 1990s, where custom variations of encodings like IBM Code Page 437 were adapted to optimize the rendering of ANSI art (text-based graphics incorporating colors and extended characters for visually engaging interfaces). As of 2025, reliance on private code pages has significantly diminished, with most custom needs now addressed through Unicode's PUA to maintain broader compatibility in contemporary systems.
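
As a rough illustration of the byte-to-character mapping idea, the following Python sketch defines a hypothetical private 8-bit code page. The layout is an assumption made for this example only: bytes 0x00 to 0x7F keep their ASCII meaning, and bytes 0x80 to 0xFF are assigned to the first 128 Private Use Area code points starting at U+E000. The table and function names are invented for illustration.

```python
PUA_START = 0xE000  # first code point of the Unicode Private Use Area

# Decoding table: index = byte value, entry = character.
# Low half is ASCII; high half is mapped to PUA code points (assumed layout).
DECODING_TABLE = [chr(b) for b in range(0x80)] + [
    chr(PUA_START + i) for i in range(0x80)
]

# Encoding table: the inverse mapping, character -> byte value.
ENCODING_TABLE = {ch: b for b, ch in enumerate(DECODING_TABLE)}


def decode_private(data: bytes) -> str:
    """Turn bytes in the private code page into a Unicode string."""
    return "".join(DECODING_TABLE[b] for b in data)


def encode_private(text: str) -> bytes:
    """Turn a string back into bytes; raises KeyError for unmapped characters."""
    return bytes(ENCODING_TABLE[ch] for ch in text)


sample = b"HP \x80\x81"          # two custom glyph bytes after ASCII text
text = decode_private(sample)    # custom bytes become U+E000 and U+E001
round_trip = encode_private(text)
```

Because the PUA carries no standard meaning, a string produced this way only displays correctly with a font or renderer that knows the same private assignments, which mirrors the interoperability drawback noted above.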

Examples and Risks

In the gaming domain, early role-playing games (RPGs) often employed custom encodings to handle non-standard text, such as Japanese characters, ensuring compatibility with limited hardware resources.

Private code pages introduce significant risks during data migrations, where mismatches between source and target encodings can result in character corruption, rendering text unreadable or substituting incorrect symbols, as seen in database transfers involving non-Unicode columns. To mitigate these issues, developers should maintain documented private mappings within technical specifications to facilitate auditing. A recommended transition involves adopting Unicode's Private Use Area (PUA), which reserves code points such as U+E000 to U+F8FF for custom assignments without conflicting with standard characters, enabling safer round-trip conversions in modern applications. As of 2025, private code pages remain rare but can persist in legacy systems.
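
The migration risk can be shown with a short Python sketch that uses two standard code pages as stand-ins for a source and a wrongly assumed target encoding. The choice of Code Page 437 and Windows-1252 and the sample words are assumptions for illustration, not a description of any particular migration.

```python
# Bytes as a legacy system using Code Page 437 would have stored them.
legacy = "café".encode("cp437")

# Decoding with the documented source encoding recovers the text.
correct = legacy.decode("cp437")

# Decoding with a wrong assumption silently substitutes another symbol.
garbled = legacy.decode("cp1252")

# Some mismatches fail outright instead: the byte used for "ü" in cp437
# has no assignment in Python's cp1252 codec.
strict_failure = False
try:
    "über".encode("cp437").decode("cp1252")
except UnicodeDecodeError:
    strict_failure = True

print(correct, garbled, strict_failure)
```

The silent case is the more dangerous one in practice, since nothing signals that the stored text has changed meaning; this is one reason documented mappings matter for auditing.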
