Windows code page
Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows,[citation needed] although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.
Current Windows versions support Unicode; new Windows applications should use Unicode (UTF-8) rather than 8-bit character encodings.[1]
There are two groups of system code pages in Windows systems: OEM and Windows-native ("ANSI") code pages. (ANSI is the American National Standards Institute.) Code pages in both of these groups are extended ASCII code pages. Additional code pages are supported by standard Windows conversion routines, but not used as either type of system code page.
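As a rough illustration of the two system code pages, the sketch below asks a running Windows system for its ANSI and OEM code page numbers through the Win32 functions `GetACP` and `GetOEMCP`. The helper name `system_code_pages` is hypothetical, and the code assumes Python with `ctypes`; on any other platform it simply returns `None`.

```python
import ctypes
import sys


def system_code_pages():
    """Return (ansi, oem) code page numbers, or None when not on Windows."""
    if sys.platform != "win32":
        return None  # ctypes.windll only exists on Windows
    kernel32 = ctypes.windll.kernel32
    # GetACP and GetOEMCP take no arguments and return an unsigned integer.
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = system_code_pages()
if pages is None:
    print("Not running on Windows; no system code pages to query.")
else:
    print(f"ANSI code page: {pages[0]}, OEM code page: {pages[1]}")
```

The values returned depend on the system locale, so the sketch prints them rather than assuming any particular pair.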
ANSI code page
| Alias(es) | ANSI (misnomer) |
|---|---|
| Standard | WHATWG Encoding Standard |
| Extends | ASCII |
| Preceded by | ISO 8859 |
| Succeeded by | Unicode UTF-16 (in Win32 API) UTF-8 (for files) |
ANSI code pages (officially called "Windows code pages"[2] after Microsoft accepted that the former term is a misnomer[3]) are used for native non-Unicode (that is, byte-oriented) applications using a graphical user interface on Windows systems. The term "ANSI" is a misnomer because these Windows code pages do not comply with any ANSI (American National Standards Institute) standard; code page 1252 was based on an early ANSI draft that became the international standard ISO 8859-1,[3] which adds a further 32 control codes and space for 96 printable characters. Among other differences, Windows code pages allocate printable characters to the supplementary control code space, making them at best illegible to standards-compliant operating systems.[a]
Most legacy "ANSI" code pages have code page numbers in the pattern 125x. However, 874 (Thai) and the East Asian multi-byte "ANSI" code pages (932, 936, 949, 950), all of which are also used as OEM code pages, are numbered to match IBM encodings, none of which are identical to the Windows encodings (although most are similar). While code page 1258 is also used as an OEM code page, it is original to Microsoft rather than an extension to an existing encoding. IBM has assigned its own, different numbers to Microsoft's variants; these are given for reference in the lists below where applicable.
All of the 125x Windows code pages, as well as 874 and 936, are labelled by Internet Assigned Numbers Authority (IANA) as "Windows-number", although "Windows-936" is treated as a synonym for "GBK". Windows code page 932 is instead labelled as "Windows-31J".[4]
ANSI Windows code pages, and especially the code page 1252, were so called since they were purportedly based on drafts submitted or intended for ANSI. However, ANSI and ISO have not standardized any of these code pages. Instead they are either:[3]
- Supersets of the standard sets such as those of ISO 8859 and the various national standards (like Windows-1252 vs. ISO-8859-1),
- Major modifications of these (making them incompatible to various degrees, like Windows-1250 vs. ISO-8859-2)
- Having no parallel encoding (like Windows-1257 vs. ISO-8859-4; ISO-8859-13 was introduced much later). Also, Windows-1251 follows neither the ISO-standardised ISO-8859-5 nor the then-prevailing KOI-8.
Microsoft assigned about twelve of the typography and business characters (including notably, the euro sign, €) in CP1252 to the code points 0x80–0x9F that, in ISO 8859, are assigned to C1 control codes. These assignments are also present in many other ANSI/Windows code pages at the same code-points. Windows did not use the C1 control codes, so this decision had no direct effect on Windows users. However, if included in a file transferred to a standards-compliant platform like Unix or MacOS, the information was invisible and potentially disruptive.[a]
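The effect of reusing the C1 range can be seen with a short sketch, assuming Python's built-in `cp1252` and `latin-1` (ISO 8859-1) codecs. The three byte values are chosen only for illustration: the same bytes become printable characters under Windows-1252 and invisible control codes under ISO 8859-1.

```python
import unicodedata

# Three bytes from the 0x80-0x9F range, chosen for illustration.
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")    # Windows-1252: euro sign and curly quotes
as_latin1 = raw.decode("latin-1")   # ISO 8859-1: C1 control codes

print([f"U+{ord(c):04X}" for c in as_cp1252])
print([unicodedata.category(c) for c in as_latin1])  # 'Cc' means control character
```

A file written on Windows with these bytes therefore carries characters that a strict ISO 8859-1 reader treats as non-printing controls.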
OEM code page
The OEM code pages (original equipment manufacturer) are used by Win32 console applications, and by virtual DOS, and can be considered a holdover from DOS and the original IBM PC architecture. A separate suite of code pages was implemented not only due to compatibility, but also because the fonts of VGA (and descendant) hardware suggest encoding of line-drawing characters to be compatible with code page 437. Most OEM code pages share many code points, particularly for non-letter characters, with the second (non-ASCII) half of CP437.
A typical OEM code page, in its second half, does not resemble any ANSI/Windows code page even roughly. Nevertheless, two single-byte, fixed-width code pages (874 for Thai and 1258 for Vietnamese) and four multibyte CJK code pages (932, 936, 949, 950) are used as both OEM and ANSI code pages. Code page 1258 uses combining diacritics, as Vietnamese requires more than 128 letter-diacritic combinations. This is in contrast to VISCII, which replaces some of the C0 (i.e. ASCII) control codes.
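To make the difference between the OEM and ANSI halves concrete, the following sketch decodes the same four bytes with Python's `cp437` and `cp1252` codecs. The byte values are picked for illustration: under the OEM code page they form the top edge of a double-line box, while under the ANSI code page they are accented letters and a guillemet.

```python
# Bytes from the upper (non-ASCII) half, chosen for illustration.
box_bytes = bytes([0xC9, 0xCD, 0xCD, 0xBB])

as_oem = box_bytes.decode("cp437")    # line-drawing characters
as_ansi = box_bytes.decode("cp1252")  # accented letters and punctuation

print("CP437: ", as_oem)
print("CP1252:", as_ansi)
```

This is why console output captured as raw bytes often looks wrong when opened in a GUI editor that assumes the ANSI code page.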
History
Early computer systems had limited storage and restricted the number of bits available to encode a character. Although earlier proprietary encodings had fewer, the American Standard Code for Information Interchange (ASCII) settled on seven bits: this was sufficient to encode a 96 member subset of the characters used in the US. As eight-bit bytes came to predominate, Microsoft (and others) expanded the repertoire to 224, to handle a variety of other uses such as box-drawing symbols. The need to provide precomposed characters for the Western European and South American markets required a different character set: Microsoft established the principle of code pages, one for each alphabet. For the segmental scripts used in most of Africa, the Americas, southern and south-east Asia, the Middle East and Europe, a character needs just one byte but two or more bytes are needed for the ideographic sets used in the rest of the world. The code-page model was unable to handle this challenge.
Since the late 1990s, software and systems have adopted Unicode as their preferred character encoding format: Unicode is designed to handle millions of characters. All current Microsoft products and application program interfaces use Unicode internally,[citation needed] but some applications continue to use the default encoding[clarification needed] of the computer's 'locale' when reading and writing text data to files or standard output.[citation needed] Therefore, files may still be encountered that are legible and intelligible in one part of the world but unintelligible mojibake in another.
UTF-8, UTF-16
Microsoft adopted a Unicode encoding (first the now-obsolete UCS-2, which was then Unicode's only encoding, and later UTF-16) for all its operating systems from Windows NT onwards, and has additionally supported UTF-8 (also known as CP_UTF8) since Windows 10 version 1803.[5]
UTF-16 encodes every Unicode character in the Basic Multilingual Plane (BMP) in a single 16-bit code unit, but the remaining characters (e.g. emojis) need a surrogate pair of two code units (32 bits, or four bytes). The rest of the industry (Unix-like systems and the web), and now Microsoft, chose UTF-8, which uses one byte for the 7-bit ASCII character set, two or three bytes for other characters in the BMP, and four bytes for the remainder.
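The byte counts described above can be checked with a small sketch, assuming Python's `utf-8` and `utf-16-le` codecs. The four sample characters are arbitrary picks: one ASCII letter, two BMP characters and one character outside the BMP.

```python
# One ASCII letter, two BMP characters, one character outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))  # little-endian, no byte order mark
    sizes[ch] = (utf8_len, utf16_len)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Only the last sample needs four bytes in UTF-16, because it is stored as a surrogate pair.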
List
The following Windows code pages exist:
Windows-125x series
These nine code pages are all extended ASCII 8-bit SBCS encodings, and were designed by Microsoft for use as ANSI codepages on Windows. They are commonly known by their IANA-registered[6] names as windows-<number>, but are also sometimes called cp<number>, "cp" for "code page". They are all used as ANSI code pages; Windows-1258 is also used as an OEM code page.
The Windows-125x series includes nine of the ANSI code pages, and mostly covers scripts from Europe and West Asia with the addition of Vietnam. System encodings for Thai and for East Asian languages were numbered to match similar IBM code pages and are used as both ANSI and OEM code pages; these are covered in following sections.
| ID | Description | Relationship to ISO 8859 or other established encodings |
|---|---|---|
| 1250[7][8] | Latin 2 / Central European | Similar to ISO-8859-2 but moves several characters, including multiple letters. |
| 1251[9][10] | Cyrillic | Incompatible with both ISO-8859-5 and KOI-8. |
| 1252[11][12] | Latin 1 / Western European | Superset of ISO-8859-1 (without C1 controls). Letter repertoire accordingly similar to CP850. |
| 1253[13][14] | Greek | Similar to ISO 8859-7 but moves several characters, including a letter. |
| 1254[15][16] | Turkish | Superset of ISO 8859-9 (without C1 controls). |
| 1255[17][18] | Hebrew | Almost a superset of ISO 8859-8, but with two incompatible punctuation changes. |
| 1256[19][20] | Arabic | Not compatible with ISO 8859-6; rather, OEM Code page 708 is an ISO 8859-6 (ASMO 708) superset. |
| 1257[21][22] | Baltic | Not ISO 8859-4; the later ISO 8859-13 is closely related, but with some differences in available punctuation. |
| 1258[23][24] | Vietnamese (also OEM) | Not related to VSCII or VISCII, uses fewer base characters with combining diacritics. |
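
Because every page in the series reuses the same upper-half byte values for different scripts, a single byte has no meaning without its code page. The sketch below, assuming Python's `cp1251`, `cp1252` and `cp1253` codecs and an arbitrarily chosen byte, decodes 0xE9 under three of the pages listed above.

```python
raw = bytes([0xE9])  # one byte from the upper half, chosen for illustration

decoded = {cp: raw.decode(cp) for cp in ("cp1251", "cp1252", "cp1253")}
for cp, ch in decoded.items():
    print(f"{cp}: {ch!r} (U+{ord(ch):04X})")
```

The byte comes out as a Cyrillic, a Latin and a Greek letter respectively.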
DOS code pages
These are also ASCII-based. Most of these are included for use as OEM code pages; code page 874 is also used as an ANSI code page.
- 437 – IBM PC US, 8-bit SBCS extended ASCII.[25] Known as OEM-US, the encoding of the primary built-in font of VGA graphics cards.
- 708 – Arabic, extended ISO 8859-6 (ASMO 708)
- 720 – Arabic, retaining box drawing characters in their usual locations
- 737 – "MS-DOS Greek". Retains all box drawing characters. More popular than 869.
- 775 – "MS-DOS Baltic Rim"
- 850 – "MS-DOS Latin 1". Full (re-arranged) repertoire of ISO 8859-1.
- 852 – "MS-DOS Latin 2"
- 855 – "MS-DOS Cyrillic". Mainly used for South Slavic languages. Includes (re-arranged) repertoire of ISO-8859-5. Not to be confused with cp866.
- 857 – "MS-DOS Turkish"
- 858 – Western European with euro sign
- 860 – "MS-DOS Portuguese"
- 861 – "MS-DOS Icelandic"
- 862 – "MS-DOS Hebrew"
- 863 – "MS-DOS French Canada"
- 864 – Arabic
- 865 – "MS-DOS Nordic"
- 866 – "MS-DOS Cyrillic Russian", cp866. Sole purely OEM code page (rather than ANSI or both) included as a legacy encoding in WHATWG Encoding Standard for HTML5.
- 869 – "MS-DOS Greek 2", IBM869. Full (re-arranged) repertoire of ISO 8859-7.
- 874 – Thai, also used as the ANSI code page, extends ISO 8859-11 (and therefore TIS-620) with a few additional characters from Windows-1252. Corresponds to IBM code page 1162 (IBM-874 is similar but has different extensions).
East Asian multi-byte code pages
These often differ from the IBM code pages of the same number. Code pages 932, 949 and 950 only partly match their IBM namesakes; the number 936 was used by IBM for another Simplified Chinese encoding, which is now deprecated; and Windows-951, as part of a kludge, is unrelated to IBM-951. IBM equivalent code pages are given in the "IBM Equivalent" column. Code pages 932, 936, 949 and 950/951 are used as both ANSI and OEM code pages on the locales in question.
| ID | Language | Encoding | IBM Equivalent | Difference from IBM CCSID of same number | Use |
|---|---|---|---|---|---|
| 932 | Japanese | Shift JIS (Microsoft variant) | 943[26] | IBM-932 is also Shift JIS, has fewer extensions (but those extensions it has are in common), and swaps some variant Chinese characters (itaiji) for interoperability with earlier editions of JIS C 6226. | ANSI/OEM (Japan) |
| 936 | Chinese (simplified) | GBK | 1386 | IBM-936 is a different Simplified Chinese encoding with a different encoding method, which has been deprecated since 1993. | ANSI/OEM (PRC, Singapore) |
| 949 | Korean | Unified Hangul Code | 1363 | IBM-949 is also an EUC-KR superset, but with different (colliding) extensions. | ANSI/OEM (Republic of Korea) |
| 950 | Chinese (traditional) | Big5 (Microsoft variant) | 1373[27] | IBM-950 is also Big5, but includes a different subset of the ETEN extensions, adds further extensions with an expanded trail byte range, and lacks the Euro. | ANSI/OEM (Taiwan, Hong Kong) |
| 951 | Chinese (traditional) including Cantonese | Big5-HKSCS (2001 ed.) | 5471[28] | IBM-951 is the double-byte plane from IBM-949 (see above), and unrelated to Microsoft's internal use of the number 951. | ANSI/OEM (Hong Kong, 98/NT4/2000/XP with HKSCS patch) |

A few further multiple-byte code pages are supported for decoding or encoding using operating system libraries, but not used as either sort of system encoding in any locale.
| ID | IBM Equivalent | Language | Encoding | Use |
|---|---|---|---|---|
| 1361 | - | Korean | Johab (KS C 5601-1992 annex 3) | Conversion |
| 20000 | - | Chinese (traditional) | An encoding of CNS 11643 | Conversion |
| 20001 | - | Chinese (traditional) | TCA | Conversion |
| 20002 | - | Chinese (traditional) | Big5 (ETEN variant) | Conversion |
| 20003 | 938 | Chinese (traditional) | IBM 5550 | Conversion |
| 20004 | - | Chinese (traditional) | Teletext | Conversion |
| 20005 | - | Chinese (traditional) | Wang | Conversion |
| 20932 | 954 (roughly) | Japanese | EUC-JP | Conversion |
| 20936 | 5479 | Chinese (simplified) | GB 2312 | Conversion |
| 20949, 51949 | 970 | Korean | Wansung (8-bit with ASCII, i.e. EUC-KR)[29] | Conversion |
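
The variable-length nature of the four East Asian system code pages can be sketched as follows, assuming Python's `cp932`, `cp936`, `cp949` and `cp950` codecs. The sample characters are arbitrary; the point is that ASCII stays at one byte while each CJK character takes two bytes, the first of which lies above the ASCII range.

```python
# One sample character per code page, chosen for illustration.
samples = {
    "cp932": "\u3042",  # Japanese hiragana A
    "cp936": "\u4e2d",  # Chinese character "middle" (simplified locale)
    "cp949": "\ud55c",  # Korean syllable HAN
    "cp950": "\u4e2d",  # the same Chinese character (traditional locale)
}

encoded = {}
for codec, ch in samples.items():
    ascii_bytes = "A".encode(codec)
    cjk_bytes = ch.encode(codec)
    encoded[codec] = cjk_bytes
    print(f"{codec}: 'A' -> {ascii_bytes.hex()}, U+{ord(ch):04X} -> {cjk_bytes.hex()}")
```

Note that the same Chinese character receives different byte pairs under 936 and 950, so the two are not interchangeable.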
EBCDIC code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 37 | 37 | Country Extended Code Page for US, Canada, Netherlands, Portugal, Brazil, Australia, New Zealand[30] |
| 500 | 500 | Country Extended Code Page for Belgium, Canada and Switzerland |
| 870 | 870 | EBCDIC Latin-2 |
| 875 | 875 | EBCDIC Greek |
| 1026 | 1026 | EBCDIC Latin-5 (Turkish) |
| 1047 | 1047 | Country Extended Code Page for Open Systems (POSIX) |
| 1140 | 1140 | Euro-sign Country Extended Code Page for US, Canada, Netherlands, Portugal, Brazil, Australia, New Zealand |
| 1141 | 1141 | Euro-sign Country Extended Code Page for Austria and Germany |
| 1142 | 1142 | Euro-sign Country Extended Code Page for Denmark and Norway |
| 1143 | 1143 | Euro-sign Country Extended Code Page for Finland and Sweden |
| 1144 | 1144 | Euro-sign Country Extended Code Page for Italy |
| 1145 | 1145 | Euro-sign Country Extended Code Page for Spain and Latin America |
| 1146 | 1146 | Euro-sign Country Extended Code Page for UK |
| 1147 | 1147 | Euro-sign Country Extended Code Page for France |
| 1148 | 1148 | Euro-sign Country Extended Code Page for Belgium, Canada and Switzerland |
| 1149 | 1149 | Euro-sign Country Extended Code Page for Iceland |
| 20273 | 273 | Country Extended Code Page for Germany |
| 20277 | 277 | Country Extended Code Page for Denmark/Norway |
| 20278 | 278 | Country Extended Code Page for Finland/Sweden |
| 20280 | 280 | Country Extended Code Page for Italy |
| 20284 | 284 | Country Extended Code Page for Latin America/Spain |
| 20285 | 285 | Country Extended Code Page for United Kingdom |
| 20290 | 290 | Japanese Katakana EBCDIC |
| 20297 | 297 | Country Extended Code Page for France |
| 20420 | 420 | EBCDIC Arabic |
| 20423 | 423 | EBCDIC Greek with Extended Latin |
| 20424 | 424 | EBCDIC Hebrew |
| 20833 | 833 | Korean EBCDIC for N-Byte Hangul; x-EBCDIC-KoreanExtended |
| 20838 | 838 | EBCDIC Thai |
| 20871 | 871 | Country Extended Code Page for Iceland |
| 20880 | 880 | EBCDIC Cyrillic (DKOI) |
| 20905 | 905 | EBCDIC Latin-3 (Maltese, Esperanto and Turkish) |
| 20924 | 924 | EBCDIC Latin-9 (including Euro sign) for Open Systems (POSIX) |
| 21025 | 1025 | EBCDIC Cyrillic (DKOI) with section sign |
| 21027 | (1027) | Japanese EBCDIC (an incomplete implementation of IBM code page 1027,[31] now deprecated)[32] |
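
Unlike the other families listed here, EBCDIC code pages are not ASCII-based, so even plain letters and digits get different byte values. A minimal sketch, assuming Python's `cp037` codec (its counterpart to code page 37) and an arbitrary sample string:

```python
text = "Hello 123"

ebcdic = text.encode("cp037")   # EBCDIC bytes
ascii_bytes = text.encode("ascii")

print("EBCDIC:", ebcdic.hex(" "))
print("ASCII: ", ascii_bytes.hex(" "))

# Converting back requires naming the EBCDIC code page explicitly.
round_trip = ebcdic.decode("cp037")
```

None of the byte values coincide between the two encodings for this string, which is why EBCDIC data always needs an explicit conversion step.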
Unicode-related code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 1200 | 1202, 1203 | Unicode (BMP of ISO 10646, UTF-16LE). Available only to managed applications.[32] |
| 1201 | 1200, 1201 | Unicode (UTF-16BE). Available only to managed applications.[32] |
| 12000 | 1234, 1235 | UTF-32. Available only to managed applications.[32] |
| 12001 | 1232, 1233 | UTF-32. Big-endian. Available only to managed applications.[32] |
| 65000 | - | Unicode (UTF-7) |
| 65001 | 1208, 1209 | Unicode (UTF-8) |
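
For orientation, the sketch below pairs each identifier in the table with a Python codec name and encodes the letter "A" with it. The dictionary `UNICODE_CODE_PAGES` is a hypothetical helper written for this example, not a Windows or Python API; the pairing of 12000 with little-endian UTF-32 follows Microsoft's code page identifier list.

```python
# Hypothetical lookup table: Windows code page ID -> Python codec name.
UNICODE_CODE_PAGES = {
    1200: "utf-16-le",
    1201: "utf-16-be",
    12000: "utf-32-le",
    12001: "utf-32-be",
    65000: "utf-7",
    65001: "utf-8",
}

encoded = {cp: "A".encode(codec) for cp, codec in UNICODE_CODE_PAGES.items()}
for cp, data in encoded.items():
    print(f"{cp:>5}: {data.hex(' ')}")
```

The little-endian and big-endian variants differ only in byte order, which is what distinguishes 1200 from 1201 and 12000 from 12001.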
Macintosh compatibility code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 10000 | 1275 | Apple Macintosh Roman |
| 10001 | - | Apple Macintosh Japanese |
| 10002 | - | Apple Macintosh Chinese (traditional) (BIG-5) |
| 10003 | - | Apple Macintosh Korean |
| 10004 | - | Apple Macintosh Arabic |
| 10005 | - | Apple Macintosh Hebrew |
| 10006 | 1280 | Apple Macintosh Greek |
| 10007 | 1283 | Apple Macintosh Cyrillic |
| 10008 | - | Apple Macintosh Chinese (simplified) (GB 2312) |
| 10010 | 1285 | Apple Macintosh Romanian |
| 10017 | - | Apple Macintosh Ukrainian |
| 10021 | - | Apple Macintosh Thai |
| 10029 | 1282 | Apple Macintosh Roman II / Central Europe |
| 10079 | 1286 | Apple Macintosh Icelandic |
| 10081 | 1281 | Apple Macintosh Turkish |
| 10082 | 1284 | Apple Macintosh Croatian |
ISO 8859 code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 28591 | 819, 5100 | ISO-8859-1 – Latin-1 |
| 28592 | 912 | ISO-8859-2 – Latin-2 |
| 28593 | 913 | ISO-8859-3 – Latin-3 or South European |
| 28594 | 914 | ISO-8859-4 – Latin-4 or North European |
| 28595 | 915 | ISO-8859-5 – Latin/Cyrillic |
| 28596 | - | ISO-8859-6 – Latin/Arabic |
| 28597 | 813, 4909, 9005 | ISO-8859-7 – Latin/Greek (1987 edition, i.e. without euro sign, drachma sign or iota subscript)[33] |
| 28598 | - | ISO-8859-8 – Latin/Hebrew (visual order; 1988 edition, i.e. without LRM and RLM)[33] |
| 28599 | 920 | ISO-8859-9 – Latin-5 or Turkish |
| 28600 | 919 | ISO-8859-10 – Latin-6 or Nordic |
| 28601 | - | ISO-8859-11 – Latin/Thai |
| 28602 | - | ISO-8859-12 – reserved for Latin/Devanagari but abandoned (not supported) |
| 28603 | 921 | ISO-8859-13 – Latin-7 or Baltic Rim |
| 28604 | - | ISO-8859-14 – Latin-8 or Celtic |
| 28605 | 923 | ISO-8859-15 – Latin-9 |
| 28606 | - | ISO-8859-16 – Latin-10 or South-Eastern European |
| 38596 | 1089 | ISO-8859-6-I – Latin/Arabic (logical bidirectional order) |
| 38598 | 916, 5012 | ISO-8859-8-I – Latin/Hebrew (logical bidirectional order; 1988 edition, i.e. without LRM and RLM)[33] |
ITU-T code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 20105 | 1009 | 7-bit IA5 IRV (Western European)[34][35][36] |
| 20106 | 1011 | 7-bit IA5 German (DIN 66003)[34][35][37] |
| 20107 | 1018 | 7-bit IA5 Swedish (SEN 850200 C)[34][35][38] |
| 20108 | 1016 | 7-bit IA5 Norwegian (NS 4551-2)[34][35][39] |
| 20127 | 367 | 7-bit ASCII[34][35][40] |
| 20261 | 1036 | T.61 (T.61-8bit) |
| 20269 | ? | ISO-6937 |
KOI8 code pages
| ID | IBM Equivalent | Description |
|---|---|---|
| 20866 | 878 | Russian – KOI8-R |
| 21866 | 1167, 1168 | Ukrainian – KOI8-U (or KOI8-RU in some versions)[41] |
Problems arising from the use of code pages
Microsoft strongly recommends using Unicode in modern applications, but many applications or data files still depend on the legacy code pages.
- Programs need to know what code page to use in order to display the contents of (pre-Unicode) files correctly. If a program uses the wrong code page it may show text as mojibake.
- The code page in use may differ between machines, so (pre-Unicode) files created on one machine may be unreadable on another.
- Data is often improperly tagged with the code page, or not tagged at all, making determination of the correct code page to read the data difficult.
- These Microsoft code pages differ to various degrees from some of the standards and other vendors' implementations. This isn't a Microsoft issue per se, as it happens to all vendors, but the lack of consistency makes interoperability with other systems unreliable in some cases.
- The use of code pages limits the set of characters that may be used.
- Characters expressed in an unsupported code page may be converted to question marks (?) or other replacement characters, or to a simpler version (such as removing accents from a letter). In either case, the original character may be lost.
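Several of the problems listed above can be reproduced in a few lines. The sketch below assumes Python's `cp1251`, `cp1252`, `utf-8` and `ascii` codecs, and the sample strings are arbitrary: it shows text read with the wrong code page, UTF-8 misread as a legacy code page, and a character lost to a replacement question mark.

```python
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian greeting "Privet"

# Written under the Cyrillic code page, read under the Western European one.
stored = original.encode("cp1251")
misread = stored.decode("cp1252")
print(misread)  # mojibake: accented Latin letters instead of Cyrillic

# UTF-8 bytes misread as a single-byte code page.
utf8_misread = "caf\u00e9".encode("utf-8").decode("cp1252")
print(utf8_misread)

# A target encoding that lacks the character: it is replaced and lost.
lossy = "caf\u00e9".encode("ascii", errors="replace")
print(lossy)

# If the wrong guess is known, the first mistake can be undone.
recovered = misread.encode("cp1252").decode("cp1251")
```

The first mistake is reversible here only because every byte involved happens to be defined in both code pages; the replacement in the last case cannot be undone.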
Notes
References
[edit]- ^ "Unicode and character sets". Microsoft. 2023-06-13. Retrieved 2024-05-27.
- ^ "Code Pages". 2016-03-07. Archived from the original on 2016-03-07. Retrieved 2021-05-26.
- ^ a b c "Glossary of Terms Used on this Site". December 8, 2018. Archived from the original on 2018-12-08.
The term "ANSI" as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. "ANSI applications" are usually a reference to non-Unicode or code page–based applications.
- "Character Sets". www.iana.org. Archived from the original on 2021-05-25. Retrieved 2021-05-26.
- hylom (2017-11-14). "Windows 10のInsider PreviewでシステムロケールをUTF-8にするオプションが追加される" [The option to make UTF-8 the system locale added in Windows 10 Insider Preview]. スラド (in Japanese). Archived from the original on 2018-05-11. Retrieved 2018-05-10.
- "Character Sets". IANA. Archived from the original on 2016-12-03. Retrieved 2019-04-07.
- Microsoft. "Windows 1250". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01250". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1251". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01251". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1252". Archived from the original on 2013-05-04. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01252". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1253". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01253". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1254". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01254". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1255". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01255". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1256". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01256". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1257". Archived from the original on 2013-03-16. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01257". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- Microsoft. "Windows 1258". Archived from the original on 2013-10-25. Retrieved 2014-07-06.
- IBM. "SBCS code page information document CPGID 01258". Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- IBM. "SBCS code page information document - CPGID 00437". Archived from the original on 2016-06-09. Retrieved 2014-07-04.
- "IBM-943 and IBM-932". IBM Knowledge Center. IBM. Archived from the original on 2018-08-18. Retrieved 2020-07-08.
- "Converter Explorer: ibm-1373_P100-2002". ICU Demonstration. International Components for Unicode. Archived from the original on 2021-05-26. Retrieved 2020-06-27.
- "Coded character set identifiers – CCSID 5471". IBM Globalization. IBM. Archived from the original on 2014-11-29.
- Julliard, Alexandre (11 March 2021). "dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file". make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project. Archived from the original on 2021-05-26. Retrieved 2021-03-14.
- IBM. "SBCS code page information document - CPGID 00037". Archived from the original on 2014-07-14. Retrieved 2014-07-04.
- Steele, Shawn (2005-09-12). "Code Page 21027 "Extended/Ext Alpha Lowercase"". MSDN. Archived from the original on 2019-04-06. Retrieved 2019-04-06.
- "Code Page Identifiers". docs.microsoft.com. Archived from the original on 2019-04-07. Retrieved 2019-04-07.
- Mozilla Foundation. "Relationship with Windows Code Pages". Crate encoding_rs. Docs.rs.
- "Code Page Identifiers". Microsoft Developer Network. Microsoft. 2014. Archived from the original on 2016-06-19. Retrieved 2016-06-19.
- "Web Encodings - Internet Explorer - Encodings". WHATWG Wiki. 2012-10-23. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Foller, Antonin (2014) [2011]. "Western European (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Foller, Antonin (2014) [2011]. "German (IA5) encoding – Windows charsets". WUtils.com – Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Foller, Antonin (2014) [2011]. "Swedish (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Foller, Antonin (2014) [2011]. "Norwegian (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Foller, Antonin (2014) [2011]. "US-ASCII encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- Nechayev, Valentin (2013) [2001]. "Review of 8-bit Cyrillic encodings universe". Archived from the original on 2016-12-05. Retrieved 2016-12-05.
External links
- National Language Support (NLS) API Reference. Table showing ANSI and OEM codepages per language (from web-archive since Microsoft removed the original page)
- IANA Charset Name Registrations
- Unicode mapping table for Windows code pages
- Unicode mappings of windows code pages with "best fit"
Windows code page
[...] MultiByteToWideChar and WideCharToMultiByte to convert between legacy encodings and Unicode.[3][1] Microsoft recommends transitioning to Unicode encodings, such as UTF-8 (code page 65001), for new applications to avoid data corruption risks associated with varying system-default code pages across locales.[2][1] Over 50 code pages are supported in modern Windows versions, covering languages from Arabic (code page 1256) to Vietnamese (code page 1258), but their use is increasingly limited to specific scenarios like regional file naming or command-line tools.[2][3]
Fundamentals
Definition and Purpose
A Windows code page is a character encoding system that defines a mapping between byte values, typically in an 8-bit range, and specific characters, enabling the representation of text in legacy Windows applications and files. These mappings associate sequences of bytes (most commonly single bytes, for 256 possible characters) with Unicode code points, allowing for the encoding and decoding of text data. Developed by Microsoft, code pages extend the limitations of 7-bit ASCII by incorporating an additional 128 characters in the upper byte range (0x80–0xFF), which vary according to language or regional requirements.[3][1]
The primary purpose of Windows code pages is to facilitate internationalization in software and systems prior to the widespread adoption of Unicode, supporting non-ASCII characters for various scripts such as Latin extensions, Cyrillic, Arabic, and others. By providing deterministic translations (often one-to-one for single-byte sets or many-to-one for multi-byte variants), code pages ensure compatibility for legacy applications, older mail and news servers, command-line tools, and document formats that rely on regional character sets. For instance, code page identifiers like CP1252 are used for Western European languages, assigning unique byte values to accented letters and symbols not covered by basic ASCII. This approach addressed the need for localized text handling in global markets without requiring a universal encoding standard at the time.[3][1]
Key characteristics of Windows code pages include their identification by numeric codes (e.g., 1252 for ANSI Western European), support for both single-byte character sets (SBCS) and double- or multi-byte character sets (DBCS/MBCS) for denser scripts, and a fixed mapping that remains consistent within a given locale. Commonly referred to as "ANSI code pages" in Windows contexts, they are not identical to formal ANSI or ISO standards but are based on drafts like ISO 8859-1, with Microsoft-specific extensions for broader compatibility. While modern Windows primarily uses Unicode internally for universal character support, code pages persist for backward compatibility in transitional environments.[1][3]
Types of Windows Code Pages
Windows code pages are categorized into several main types based on their intended use and character encoding mechanisms within the operating system. These include ANSI code pages, OEM code pages, multi-byte character set (MBCS) code pages, and Unicode-based code pages, each serving distinct roles in handling text data across different contexts such as graphical user interfaces, consoles, and internationalized applications.[2]
ANSI code pages, also known as active code pages (ACP), are primarily used for text rendering in Windows graphical user interfaces (GUI), file I/O operations, and legacy text files, with the default varying by system locale; for instance, code page 1252 (Windows-1252) is the default for English-language systems.[2][4] Most are single-byte encodings that map the 256 possible byte values to characters, supporting Western European languages in the default case, and the active one is retrieved programmatically via the GetACP API function.[4]
OEM code pages, in contrast, are designed for console applications, command-line interfaces, and compatibility with MS-DOS-era systems, often differing from ANSI pages to accommodate hardware-specific character sets like box-drawing symbols.[2][5] For example, code page 437 serves as the OEM default for United States English locales, and it can be queried using the GetOEMCP API.[2][6] These pages ensure proper display in text-mode environments but are locale-dependent and not suitable for cross-system data exchange without verification.[5]
Multi-byte code pages (MBCS) extend single-byte capabilities to support languages with extensive character sets, such as East Asian scripts, by employing a variable-length scheme where most characters use a single byte but others require a lead byte followed by a trail byte to encode extended glyphs.[2][7] Examples include code page 932 for Japanese (Shift JIS) and 936 for Simplified Chinese (GBK), which allow Windows applications to process double-byte characters seamlessly in MBCS-aware strings.[2][3] This approach enables denser representation of non-Latin scripts but requires careful byte parsing to distinguish single- from multi-byte sequences.[8]
Unicode code pages represent a modern bridge between legacy systems and universal text encoding, incorporating UTF formats directly as code pages for interoperability; notable examples are code page 1200 for UTF-16 little-endian and 65001 for UTF-8, which support all Unicode characters without locale-specific limitations.[2] These are increasingly recommended over traditional pages to avoid data corruption from varying system defaults, and they integrate with APIs like MultiByteToWideChar for conversions.[9][1]
All Windows code pages are identified by unique numeric identifiers (e.g., 1252 for ANSI Western European), which applications use to specify an encoding in conversion functions like MultiByteToWideChar or in registry queries.[2] Character mappings for these code pages are stored in National Language Support (NLS) files, such as C_1252.NLS, located in the %SystemRoot%\System32 directory, while active code page settings are configurable via registry keys under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage (e.g., ACP for ANSI).[10][11] These mappings are loaded into memory by system DLLs like kernel32.dll during runtime for efficient text processing.[1]
Historical Development
Origins in MS-DOS and Early Windows
The origins of Windows code pages trace back to the MS-DOS era in the 1980s, where they emerged as essential extensions to handle international characters and graphics on the IBM PC platform. In 1981, with the release of the IBM PC, Code Page 437 (CP437) was introduced as the original OEM code page, extending the 7-bit US-ASCII standard to an 8-bit, 256-character set that included box-drawing graphics, mathematical symbols, and a selection of European accented characters to support text-based user interfaces and early applications.[12] This code page, also known as OEM-US or PC-8, was designed for compatibility with the IBM PC's hardware, particularly its display adapters, and became the foundational character encoding for MS-DOS systems.[1] Key milestones in the 1980s built upon CP437 as the baseline for subsequent code pages. IBM established CP437 as the standard for the US English market, while Microsoft extended this framework for international MS-DOS versions to accommodate diverse linguistic needs. A notable example is CP850, introduced in 1987 with MS-DOS 3.3, which served as a multilingual extension for Western European languages, Latin America, and Canada, incorporating a broader set of Latin-1 characters while retaining compatibility with CP437's structure.[12] These developments allowed MS-DOS to support country-specific variants, loaded dynamically to adapt to regional requirements without altering the core operating system. Technically, these early code pages operated as 8-bit supersets of 7-bit US-ASCII, where the first 128 characters (hex 00-7F) matched ASCII exactly, and the upper 128 (hex 80-FF) provided extensions for localized content such as diacritics and line art. 
In MS-DOS, country-specific code pages were configured via the CONFIG.SYS file during boot, using commands like COUNTRY=XXX to load appropriate national language support files (e.g., COUNTRY.SYS) and DISPLAY.SYS to set console code pages, enabling seamless switching between encodings like CP437 for the US or equivalents for other regions.[13] This modular approach ensured hardware and software compatibility across global markets.
Early Windows versions from 1.0 (1985) to 3.x (up to 1992) inherited these MS-DOS OEM code pages primarily for console and backward compatibility, maintaining support for CP437 and its variants in command-line interfaces and file systems. For graphical user interfaces (GUI), Windows introduced "ANSI" code pages—distinct from true ANSI standards but based on ECMA-94—to handle text rendering, with the initial set in Windows 1.0 evolving to include additional characters by Windows 3.1 and tying selections to regional settings for localized installations.[14] This dual system of OEM for legacy DOS integration and ANSI for native Windows applications laid the groundwork for the platform's character encoding architecture.[1]
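The practical effect of this OEM/ANSI split is that one byte above 0x7F can mean different things depending on which code page is assumed. The following is a minimal sketch using Python's built-in `cp437` and `cp1252` codecs as stand-ins for the Windows mapping tables; the helper name `describe` is hypothetical.

```python
# Python's own codec tables are used here, not the Windows NLS files.
OEM, ANSI = "cp437", "cp1252"

def describe(byte_value: int) -> tuple:
    """Decode a single byte under the OEM and the ANSI code page."""
    raw = bytes([byte_value])
    return raw.decode(OEM), raw.decode(ANSI)

# The lower half (0x00-0x7F) is plain ASCII in both code pages.
ascii_bytes = bytes(range(0x80))
assert ascii_bytes.decode(OEM) == ascii_bytes.decode(ANSI)

# The upper half diverges: the same letter is stored as different bytes.
oem_e_acute = "é".encode(OEM)    # b'\x82'
ansi_e_acute = "é".encode(ANSI)  # b'\xe9'

# Reading ANSI-encoded text with the OEM table produces mojibake.
mojibake = "café".encode(ANSI).decode(OEM)
```

This mismatch is why text written by a GUI program can look wrong when typed out in a console that still uses the OEM code page.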
Evolution and Standardization
With the release of Windows 95 in 1995 and Windows NT 4.0 in 1996, Microsoft formalized the Windows-125x series as the primary ANSI code pages for single-byte character encodings in the Windows environment, establishing them as the default for text handling in graphical applications. These code pages, such as CP1252 for Western European languages, were designed to extend the ASCII range (0x00-0x7F) while aligning as closely as possible with the ISO/IEC 8859 standards, for instance mapping CP1252 to ISO/IEC 8859-1 for Latin-1 characters. However, alignments were incomplete due to proprietary Microsoft extensions, including the addition of 27 printable characters in the 0x80-0x9F range of CP1252, which ISO/IEC 8859-1 left undefined for control codes.[15][16] Simultaneously, Windows 95 introduced enhanced support for multi-byte character sets through Double-Byte Character Sets (DBCS), enabling efficient handling of languages requiring more than 256 characters, such as those in East Asia, by using lead and trail bytes for extended glyphs while preserving ASCII compatibility. This formalization was documented in the Windows 95 SDK, where conversion tables for code pages like CP1252 were provided in files such as UNICODE.BIN, supporting up to 18 code pages in international editions for bidirectional mappings between code pages and internal Unicode representations. 
Standardization efforts involved Microsoft's submission of the Windows-125x mappings to the Internet Assigned Numbers Authority (IANA) for official MIME charset registration, with CP1250 (Central European) registered on May 3, 1996, following collaborative development with ISO/IEC to incorporate ISO 8859-2 mappings while adding vendor-specific characters.[16][17][18]
In the mid-1990s, Microsoft expanded the Windows-125x series to address global market needs, introducing CP1251 for Cyrillic scripts in 1995 to support languages like Russian and Bulgarian, building on earlier non-English Windows 3.1 implementations but integrating it fully into the Windows 95/NT ecosystem. This was followed by CP1256 for Arabic in 1996, which incorporated right-to-left text rendering and visual ordering adjustments, registered with IANA on May 3, 1996, to facilitate telecom and document exchange in Middle Eastern regions. These expansions reflected influences from ITU-T recommendations for international telecom encodings, such as those in Recommendation T.61 for teletex services, where Microsoft adapted mappings to ensure compatibility with global data transmission standards despite proprietary deviations.
Key documentation of these code pages, including detailed glyph tables and conversion matrices, appeared in Microsoft SDK releases from 1995 onward, serving as authoritative references for developers despite the incomplete harmonization with ISO standards due to extensions for Windows-specific typography.[14][19][20]
Transition to Unicode
The Windows NT family has used Unicode internally since Windows NT 3.1 in 1993, initially with UCS-2 encoding, providing Unicode support for the enterprise line while the consumer Windows 9x series continued relying on code pages. With the release of Windows 2000 in 2000, Microsoft upgraded the operating system's internal text encoding to native UTF-16, including surrogate pairs for full Unicode support beyond the Basic Multilingual Plane across applications and system components.[21] This change marked a significant pivot from reliance on single-byte and multi-byte code pages, which were limited to specific language sets, toward a unified encoding capable of handling global scripts. However, to maintain compatibility with existing software, Windows retained support for legacy code pages through conversion APIs such as MultiByteToWideChar and WideCharToMultiByte, allowing applications to translate between code page-based strings and UTF-16 Unicode.[1] Windows XP, released in 2001, introduced UTF-8 as code page 65001 (CP_UTF8), enabling limited support for this variable-length encoding in APIs and file handling, though initial implementations suffered from bugs, particularly in console output and certain localization scenarios.[2] These issues persisted for years, prompting developers to favor UTF-16 for reliability, but Microsoft addressed many through cumulative updates, with notable improvements in console handling by Windows 10's version 1607 (Anniversary Update) in 2016, enhancing UTF-8 stability for non-Unicode applications.[22] By Windows 10 version 1903 (May 2019 Update), further refinements included the activeCodePage manifest property, allowing apps to declare UTF-8 as their default code page, and a beta system-wide option to set UTF-8 as the active ANSI code page via registry or settings (e.g., enabling "Beta: Use Unicode UTF-8 for worldwide language support" under Administrative language settings).[23] As Windows evolved, code pages' role diminished, 
with Microsoft explicitly marking them as legacy components by Windows 11 in 2021 to discourage new development from relying on them, in favor of Unicode encodings like UTF-8 and UTF-16.[3] This deprecation trend emphasized UTF-8 adoption for legacy non-Unicode apps through configuration tweaks, such as registry keys under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage, while ensuring backward compatibility for older systems and tools.[23] Overall, the transition underscored Unicode's superiority for internationalization, reducing the fragmentation caused by region-specific code pages while preserving interoperability via robust conversion mechanisms.[24]
Single-Byte Code Page Families
Windows-125x Series
The Windows-125x series comprises a family of 8-bit single-byte code pages, designated as code pages 1250 through 1258, developed by Microsoft to support Western European and other non-East Asian scripts in Windows operating systems. These code pages serve as the primary ANSI code pages for graphical user interfaces, extending the ISO/IEC 8859 family of standards by incorporating additional glyphs not defined in the ISO specifications, such as curly quotation marks, em dashes, and other typographic symbols in the range 0x80–0x9F. Unlike the ISO 8859 standards, which reserve this C1 control range for non-printing characters, the Windows-125x implementations assign printable characters to these bytes to better accommodate common usage in word processing and display applications.[25][1] Each code page in the series targets specific linguistic regions, mapping the first 128 bytes (0x00–0x7F) identically to the ASCII standard while using the extended range (0x80–0xFF) for language-specific accented letters, symbols, and punctuation. 
For instance, code page 1252 (Western European, also known as Windows Latin 1) was adopted in the 1980s as the default for English and other Western European languages, based on an early American National Standards Institute (ANSI) draft that preceded the finalization of ISO 8859-1; it places printable characters such as the euro sign (€) at 0x80 and the en dash (–) at 0x96, positions that ISO 8859-1 reserves for C1 control codes, while the range 0xA0–0xFF, beginning with the non-breaking space at 0xA0, matches ISO 8859-1.[1][25]
Code page 1250 supports Central European languages (e.g., Polish, Czech) by extending ISO 8859-2 with additional diacritics; 1251 handles Cyrillic scripts (e.g., Russian, Bulgarian) based on ISO 8859-5; 1253 covers Greek, extending ISO 8859-7; 1254 addresses Turkish needs, modifying ISO 8859-9; 1255 encodes Hebrew, a right-to-left script, drawing from ISO 8859-8; 1256 supports Arabic, based on ISO 8859-6; 1257 serves Baltic languages (e.g., Latvian, Lithuanian), extending ISO 8859-4 and 13; and 1258 accommodates Vietnamese, combining Latin characters with tone marks in a manner similar to but distinct from VISCII.[2][25]
The following table summarizes the key code pages in the series, their primary language coverage, and .NET encoding names:
| Code Page | Description | Primary Languages/Region | .NET Name |
|---|---|---|---|
| 1250 | ANSI Central European | Central/Eastern European (Latin script) | windows-1250 |
| 1251 | ANSI Cyrillic | Cyrillic (Russian, Ukrainian, etc.) | windows-1251 |
| 1252 | ANSI Latin 1 (Western) | Western European (English, French, etc.) | windows-1252 |
| 1253 | ANSI Greek | Greek | windows-1253 |
| 1254 | ANSI Turkish | Turkish | windows-1254 |
| 1255 | ANSI Hebrew | Hebrew | windows-1255 |
| 1256 | ANSI Arabic | Arabic | windows-1256 |
| 1257 | ANSI Baltic | Baltic (Latvian, Lithuanian, etc.) | windows-1257 |
| 1258 | ANSI/OEM Vietnamese | Vietnamese | windows-1258 |
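The difference between the Windows-125x pages and their ISO 8859 relatives is easiest to see in the 0x80–0x9F block. The sketch below uses Python's `cp1252` and `latin-1` codecs (assumed here to mirror the Windows and ISO tables) and a hypothetical helper, `c1_printables`, that counts how many of those 32 positions decode to printable characters.

```python
# Bytes 0x93, 0x94 and 0x96 lie in the range ISO 8859-1 reserves for controls.
raw = b"\x93Hello\x94 \x96 caf\xe9"

as_cp1252 = raw.decode("cp1252")   # curly quotes and an en dash
as_latin1 = raw.decode("latin-1")  # C1 control characters instead

def c1_printables(codec: str) -> int:
    """Count bytes in 0x80-0x9F that the codec maps to printable characters."""
    count = 0
    for value in range(0x80, 0xA0):
        try:
            ch = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # position left undefined by this code page
        if ch.isprintable():
            count += 1
    return count

# The same byte selects a different letter in each regional code page.
byte_e0 = {cp: b"\xe0".decode(cp) for cp in ("cp1250", "cp1251", "cp1252", "cp1253")}
```

Because every page in the family reuses the same 128 upper byte values, a file carries no indication of which page it was written in; the reader has to know or guess.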
OEM and DOS Code Pages
OEM and DOS code pages refer to a family of 8-bit character encodings designed for use in MS-DOS systems and Windows console applications, where they handle text display and input in legacy environments. Unlike ANSI code pages, which prioritize international characters in the upper byte range, OEM code pages allocate values from 0x80 to 0xFF primarily to graphics symbols, including line-drawing elements, block characters, and punctuation for text-based user interfaces.[1] These encodings originated with the IBM PC in the early 1980s, evolving from the initial MS-DOS support for regional variations.[12]
In MS-DOS, OEM code pages were configured and loaded through the COUNTRY= directive in the CONFIG.SYS file, which specified a country code and optional code page identifier to enable appropriate character sets, keyboard layouts, and formatting conventions from a supporting file like COUNTRY.SYS.[26] This mechanism allowed DOS to adapt to different locales without altering the core 7-bit ASCII base (0x00–0x7F), which remained consistent across code pages. The first such code page, CP437, was introduced in 1981 for the United States and included dedicated slots for box-drawing characters to support applications like early text editors and games.[12][2]
Subsequent OEM code pages extended this model to other regions, replacing graphics symbols with language-specific characters while retaining many visual elements for compatibility. Key examples include:
| Code Page | Name | Region/Language |
|---|---|---|
| 437 | IBM437 | United States |
| 737 | IBM737 | Greek |
| 775 | IBM775 | Baltic States |
| 850 | IBM850 | Western Europe (Multilingual) |
| 852 | IBM852 | Central Europe |
| 855 | IBM855 | Cyrillic (Russian) |
| 857 | IBM857 | Turkish |
| 860 | IBM860 | Portuguese |
| 861 | IBM861 | Icelandic |
| 863 | IBM863 | Canadian French |
| 865 | IBM865 | Nordic |
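The box-drawing slots mentioned above can be illustrated with a few bytes of the kind a DOS program would write to the screen. This is a minimal sketch using Python's `cp437` and `cp1252` codecs; the byte sequence and the helper `box_drawing_count` are made up for illustration.

```python
# A small frame drawn with CP437 double-line box characters.
frame = bytes([
    0xC9, 0xCD, 0xCD, 0xCD, 0xBB, 0x0A,   # top edge
    0xBA, 0x20, 0x21, 0x20, 0xBA, 0x0A,   # sides around "!"
    0xC8, 0xCD, 0xCD, 0xCD, 0xBC,         # bottom edge
])

as_oem = frame.decode("cp437")    # renders as a double-line box
as_ansi = frame.decode("cp1252")  # accented letters and symbols instead

def box_drawing_count(text: str) -> int:
    """Count characters in the Unicode Box Drawing block (U+2500-U+257F)."""
    return sum(0x2500 <= ord(ch) <= 0x257F for ch in text)
```

Printing `as_oem` shows the frame, while `as_ansi` shows what a GUI text editor using the ANSI code page would display for the same file.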
Other Single-Byte Encodings
Windows supports several single-byte encodings derived from international standards beyond its proprietary Windows-125x and OEM families, including adaptations of the ISO/IEC 8859 series, ITU-T recommendations, and KOI8 variants. These encodings facilitate compatibility with global standards for text handling in legacy applications and data exchange, particularly in regions where specific scripts require precise mapping to 8-bit code points.[2]
The ISO/IEC 8859 series provides single-byte encodings for various Latin-based scripts, with Windows assigning dedicated code page identifiers for direct support. For instance, code page 28591 corresponds to ISO/IEC 8859-1 (Latin-1), covering Western European languages with characters such as accented letters and symbols for English, French, German, and Spanish. Similarly, code page 28599 maps to ISO/IEC 8859-9 (Latin-5), tailored for Turkish, incorporating letters like Ğ (U+011E) and dotless ı (U+0131) to support the Turkish alphabet. These Windows mappings adhere closely to the ISO standards but include platform-specific implementations for accent handling and control codes, differing from extensions in Windows' own code pages.[2]
ITU-T code pages in Windows address telecommunications and multimedia needs, drawing from standards like ISO/IEC 6937. Code page 20269 implements ISO/IEC 6937, a non-spacing accent encoding for Latin scripts used in early digital telephony and videotex systems, allowing combined diacritics for efficient transmission of accented characters in bandwidth-limited environments.[2]
Code page 20866 supports KOI8-R, a Cyrillic encoding standardized in the 1990s for Russian text, originating from Soviet-era computing but adapted for post-Cold War interoperability in Unix-like systems and early web content.[2] KOI8 variants extend this support to other Cyrillic scripts, enhancing legacy Unix-Windows data interchange. 
Code page 21866 corresponds to KOI8-U, an extension of KOI8-R for Ukrainian, incorporating characters like Є (U+0404) and І (U+0406) while maintaining compatibility with the Russian base for shared Cyrillic layouts. These encodings remain relevant for processing older files from Eastern European systems, where full Unicode adoption has been gradual.[2]
All these single-byte encodings are integrated into Windows through code page APIs, such as MultiByteToWideChar and WideCharToMultiByte, enabling conversion to and from Unicode (UTF-16) for applications requiring backward compatibility. Support persists in modern Windows versions, including mappings for rare ITU-T standards that have seen limited updates since the early 2000s, ensuring reliable handling of international legacy data without requiring custom implementations.[2][25]
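Converting between two of these legacy encodings always passes through Unicode. The sketch below shows that pattern with Python's `koi8_r`, `koi8_u`, `cp1251` and `iso8859_9` codecs, which are assumed to correspond to code pages 20866, 21866, 1251 and 28599; `transcode` and `can_encode` are hypothetical helper names.

```python
text = "Привет"  # Russian sample

koi8 = text.encode("koi8_r")
cp1251 = text.encode("cp1251")

def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by decoding to Unicode and encoding again."""
    return data.decode(source).encode(target)

def can_encode(s: str, codec: str) -> bool:
    """Report whether every character of s exists in the target encoding."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The Ukrainian letter is absent from KOI8-R but present in KOI8-U.
ukrainian_in_koi8_r = can_encode("Є", "koi8_r")
ukrainian_in_koi8_u = can_encode("Є", "koi8_u")
```

The same check distinguishes ISO 8859-9 from ISO 8859-1: `can_encode("ı", "iso8859_9")` succeeds, while the Latin-1 equivalent does not.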
Multi-Byte and Specialized Code Page Families
East Asian Multi-Byte Code Pages
East Asian multi-byte code pages in Windows support CJK (Chinese, Japanese, Korean) languages by combining single-byte character sets (SBCS) for ASCII compatibility with double-byte character sets (DBCS) for ideographic characters. These encodings use variable-width representations, where the first 128 code points (0x00–0x7F) encode ASCII characters in a single byte, while extended characters require two bytes: a lead byte typically in the range 0x81–0xFE followed by a trail byte that together represent a hanzi, kanji, or hangul syllable.[3][28]
The lead byte signals the start of a multi-byte sequence, allowing parsers to distinguish between single-byte and double-byte characters during text processing. For example, in code page 932, lead bytes occupy the ranges 0x81–0x9F and 0xE0–0xFC, enabling encoding of thousands of Japanese characters beyond the 7-bit ASCII limit. Trail byte ranges vary by code page and can overlap the ASCII range (in code page 932, trail bytes include 0x40–0x7E), so a byte can be classified reliably only by scanning forward from a known character boundary. This structure preserves backward compatibility with 8-bit systems while accommodating the vast character sets needed for East Asian scripts.[28]
Prominent East Asian multi-byte code pages in Windows include:
- CP932: A Microsoft variant of the Shift JIS encoding for Japanese, developed in the 1990s to handle JIS X 0208 characters plus extensions; it supports over 6,000 kanji and kana.[3][2]
- CP936: The encoding for Simplified Chinese, initially based on GB 2312 but extended to GBK in Windows to include additional characters from GB 13000.1 for better coverage of modern usage in mainland China and Singapore.[3][2]
- CP949: An extension of EUC-KR based on the KS C 5601 standard for Korean, incorporating unified Hangul syllables and Hanja characters for compatibility with Windows Korean locales.[3][2]
- CP950: An extension of the Big5 encoding for Traditional Chinese, used in Taiwan and Hong Kong, with added characters for regional variants and compatibility.[3][2]
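The lead-byte scanning described above can be sketched for code page 932. The ranges 0x81–0x9F and 0xE0–0xFC are the usual Shift JIS lead-byte ranges; the function names are hypothetical, and Python's `cp932` codec stands in for the Windows table.

```python
# Lead-byte ranges for code page 932 (Shift JIS family).
CP932_LEAD_RANGES = ((0x81, 0x9F), (0xE0, 0xFC))

def is_lead_byte(value: int) -> bool:
    return any(lo <= value <= hi for lo, hi in CP932_LEAD_RANGES)

def split_characters(data: bytes) -> list:
    """Split CP932 bytes into one bytes object per character."""
    chunks, i = [], 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte character")
        chunks.append(data[i:i + width])
        i += width
    return chunks

sample = "A表B".encode("cp932")
chunks = split_characters(sample)
```

The character 表 encodes as 0x95 0x5C, and 0x5C is also the ASCII backslash. A naive byte search for a path separator would therefore match inside the character, which is why scanning has to start from a known character boundary.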
EBCDIC Code Pages
EBCDIC, or Extended Binary Coded Decimal Interchange Code, is an 8-bit character encoding developed by IBM in the early 1960s as part of its System/360 mainframe architecture to facilitate data interchange on punched cards and tapes.[30][31] This encoding extends earlier BCD-based codes used in IBM systems, assigning 256 possible values to characters while maintaining compatibility with decimal arithmetic through a structured bit layout. Unlike ASCII, which follows a sequential ordering where digits precede letters, EBCDIC places alphabetic characters before numeric digits in its collating sequence, a design choice rooted in legacy punched-card sorting practices.[32][33]
Windows provides support for EBCDIC code pages mainly to enable interoperability with IBM mainframe environments, such as z/OS, where EBCDIC remains the native encoding for legacy applications and data stores. These code pages are classified as non-native encodings in Windows and are handled through system APIs rather than as default system locales. Key variants include those tailored for specific languages and regions, using the EBCDIC framework's flexibility for national adaptations. For instance, the high-order "zone" bits (bits 7-4) categorize characters into groups like punctuation, letters, and digits, while the low-order "numeric" bits (bits 3-0) define specific symbols within those groups, allowing variants to remap accented or national characters without altering the core structure.[1][25][34]
The following table summarizes prominent Windows EBCDIC code pages, their associated names, and primary uses, drawn from Microsoft's supported code page specifications:
| Code Page | Name | Description | CCSID (IBM Equivalent) |
|---|---|---|---|
| 037 | IBM EBCDIC US-Canada | Standard for English text in North America | 37 |
| 500 | IBM EBCDIC International | Supports Western European languages | 500 |
| 870 | IBM EBCDIC Multilingual Latin 2 | For Central and Eastern European Latin scripts | 870 |
| 1047 | IBM EBCDIC Open Systems Latin 1 | POSIX-compliant variant for Western Latin | 1047 |
Conversion between these code pages and Unicode goes through MultiByteToWideChar and WideCharToMultiByte, which handle EBCDIC-to-Unicode translations. This support is particularly relevant in enterprise scenarios involving Host Integration Server, where SNA protocols enable seamless data flow between Windows clients and mainframes. Microsoft maintains over 40 EBCDIC variants in its code page library, ensuring compatibility without native rendering in the Windows UI.[1][36][35]
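The zone/numeric split and the letters-before-digits collating order can be checked with Python's `cp037` codec, assumed here to match code page 037. The helper `nibbles` and the sample words are illustrative only.

```python
def nibbles(ch: str, codec: str = "cp037") -> tuple:
    """Return the (zone, numeric) halves of a character's EBCDIC byte."""
    (value,) = ch.encode(codec)
    return value >> 4, value & 0x0F

words = ["b2", "A1", "9z", "a0"]

# Sort by raw byte values in each encoding to compare collating sequences.
ascii_order = sorted(words, key=lambda w: w.encode("ascii"))
ebcdic_order = sorted(words, key=lambda w: w.encode("cp037"))

# A record as a mainframe would store it; the space is 0x40, not 0x20.
record = "HELLO 123".encode("cp037")
```

In ASCII the word starting with a digit sorts first, whereas under code page 037 it sorts last, after both lowercase and uppercase letters. Data sorted on a mainframe may therefore need re-sorting after conversion.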
Macintosh Compatibility Code Pages
Windows includes a series of code pages specifically designed for compatibility with classic Mac OS character encodings, facilitating cross-platform text handling for applications and file transfers. The foundational encoding is MacRoman (CP10000), a single-byte character set introduced by Apple in 1984 to support Western European languages on Macintosh computers. MacRoman uses bytes 0-127 for ASCII compatibility but assigns distinct characters to the 128-255 range, incorporating Apple-specific symbols like the dagger (†), the Apple logo, and various fractions, which differ significantly from equivalent ranges in Windows encodings such as CP1252.[37][2]
These compatibility code pages extend beyond Roman scripts to cover other Macintosh language systems, mapping them to Windows representations for bidirectional conversion. Key examples include support for East Asian and right-to-left scripts, ensuring that text from Mac OS applications could be processed in Windows without loss of meaning.
| Code Page ID | IANA Name | Description |
|---|---|---|
| 10000 | macintosh | MAC Roman; Western European (Mac) |
| 10001 | x-mac-japanese | Japanese (Mac) |
| 10002 | x-mac-chinesetrad | Traditional Chinese (Big5; Mac) |
| 10003 | x-mac-korean | Korean (Mac) |
| 10004 | x-mac-arabic | Arabic (Mac) |
| 10005 | x-mac-hebrew | Hebrew (Mac) |
| 10006 | x-mac-greek | Greek (Mac) |
| 10007 | x-mac-cyrillic | Cyrillic (Mac) |
| 10008 | x-mac-chinesesimp | Simplified Chinese (GB2312; Mac) |
Conversion between these code pages and Unicode relies on MultiByteToWideChar and WideCharToMultiByte, which handle mapping for accurate round-trip preservation where possible. These mechanisms were essential for pre-Unicode file exchanges, such as sharing documents between Macintosh and Windows systems in mixed environments like publishing or office workflows.[9][38]
Developed during the 1990s amid growing interoperability needs between Microsoft Windows and Apple Macintosh platforms, these code pages addressed the challenges of diverse legacy encodings but have seen limited adoption in contemporary applications, overshadowed by Unicode standards.[14]
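The divergence between MacRoman and CP1252 in the upper range, and the limits of round-trip conversion, can be sketched with Python's `mac_roman` and `cp1252` codecs. They are assumed to correspond to code pages 10000 and 1252, and `mac_to_windows` is a hypothetical helper.

```python
raw = b"\xa0"  # one byte, two meanings

as_mac = raw.decode("mac_roman")  # dagger
as_win = raw.decode("cp1252")     # non-breaking space

def mac_to_windows(data: bytes, errors: str = "strict") -> bytes:
    """Re-encode MacRoman bytes as Windows-1252 via Unicode."""
    return data.decode("mac_roman").encode("cp1252", errors)

converted = mac_to_windows(b"caf\x8e \xa0")  # "café †" in MacRoman

# MacRoman 0xAD is the not-equal sign, which CP1252 cannot represent.
lossy = mac_to_windows(b"\xad", errors="replace")
```

Characters present in both sets survive the conversion with new byte values, while those unique to MacRoman either raise an error or degrade to a substitute, depending on the error policy chosen.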
Unicode-Related Code Pages
UTF-8 and UTF-16 Integration
Windows has utilized UTF-16 as its primary internal encoding for text processing since the introduction of Windows NT in 1993, initially based on UCS-2 and later extended to full UTF-16 support with surrogate pairs to handle Unicode code points beyond the Basic Multilingual Plane (BMP), exceeding 65,536 characters.[39] However, code page 1200 designates a UCS-2-compatible version of UTF-16 in little-endian byte order, limited to the BMP, which is the default for Windows systems in code page conversion APIs, while code page 1201 specifies big-endian UTF-16 with the same BMP limitation.[2] This encoding serves as the native format for Windows API calls involving wide characters (WCHAR), ensuring efficient handling of Unicode data within the operating system kernel and user-mode applications, though code page conversions via 1200/1201 do not fully support surrogates. In contrast, UTF-8 support in Windows, designated as code page 65001, was introduced with Windows XP in 2001 as a variable-width encoding using 1 to 4 bytes per character to represent the full Unicode repertoire.[2] UTF-8 enables compatibility with web standards and cross-platform text files, but its adoption was initially limited due to incomplete system integration. Significant improvements occurred starting with Windows 10 version 1903 (May 2019 Update), which enhanced UTF-8 handling in the console, file I/O, and API functions, making it viable for broader system-wide use without relying solely on UTF-16 conversions.[23] Integration of these encodings into Windows code page mechanisms occurs primarily through string conversion APIs, such as WideCharToMultiByte, which accepts code page 65001 to convert UTF-16 wide strings to UTF-8 byte sequences, and the reciprocal MultiByteToWideChar for the reverse.[38] Developers can specify these code pages explicitly for portability, bypassing locale-dependent ANSI code pages. 
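The difference between the two encodings, and the role of surrogate pairs, can be made concrete with a short sketch. It uses only Python's standard `utf-8` and `utf-16-le` codecs; `widths` and `surrogate_pair` are hypothetical helper names.

```python
def widths(ch: str) -> dict:
    """Byte counts for one character in UTF-8 and UTF-16LE."""
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),
    }

def surrogate_pair(ch: str) -> tuple:
    """Compute the UTF-16 surrogate pair for a code point above U+FFFF."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

samples = {ch: widths(ch) for ch in ("A", "é", "€", "😀")}
high, low = surrogate_pair("😀")  # U+1F600
```

Characters inside the BMP always take two bytes in UTF-16 and one to three in UTF-8; a character outside the BMP takes four bytes in both, with UTF-16 splitting it into a high and a low surrogate.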
Since Windows 10 version 1903, UTF-8 can also be set as the active ANSI code page (ACP) in two ways. The first is a beta system locale option under Settings > Time & Language > Language > Administrative language settings: enabling "Beta: Use Unicode UTF-8 for worldwide language support" updates the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage to set ACP to 65001. The second is the stable activeCodePage manifest element, set to "UTF-8" for an individual application; Windows 11 expanded this element to also accept "Legacy" or specific code pages.[23][40] Either route makes UTF-8 the default for non-Unicode (A) API variants like CreateFileA.
Early implementations of code page 65001 in Windows XP and Server 2003 exhibited limitations and bugs, including inconsistent handling of invalid sequences and partial support for Unicode operations like normalization, which could lead to data corruption in multi-byte contexts.[41] These issues, such as errors in surrogate pair processing and flag support in conversion functions, were progressively addressed through updates, with key fixes for normalization and error detection by Windows Vista in 2006 and further refinements in subsequent service packs.[38] Despite these advancements, UTF-8 remains non-default in most Windows configurations to maintain backward compatibility with legacy single-byte code pages, though Microsoft recommends transitioning to UTF-8 or UTF-16 for new applications to avoid encoding pitfalls.[23]
Other Unicode-Derived Code Pages
In addition to the primary UTF-8 and UTF-16 encodings, Windows supports several other code pages derived from or closely related to Unicode standards, primarily for legacy compatibility and specialized applications. Code page 1200 corresponds to UTF-16 in little-endian byte order, encoding the Basic Multilingual Plane (BMP) of ISO/IEC 10646 and available only to managed applications (e.g., .NET).[2] This format evolved from UCS-2, an earlier fixed-width 16-bit encoding limited to the BMP and incapable of representing characters beyond U+FFFF without surrogates; UCS-2 has been deprecated in favor of full UTF-16 since Unicode 2.0, though Windows maintains backward compatibility for applications assuming the legacy behavior.[42] Code page 65000 implements UTF-7, a variable-width encoding designed to be safe for transport over 7-bit channels like email, where ASCII remains unmodified and non-ASCII Unicode characters are encoded using a modified Base64 scheme with escape sequences.[2] Developed as part of early Unicode efforts, UTF-7 prioritizes compatibility with legacy protocols but is less efficient than UTF-8 and rarely used outside specific contexts like IMAP folders. 
Code page 1201 provides UTF-16 in big-endian byte order, mirroring the structure of CP1200 but with reversed byte serialization for network or cross-platform interchange where big-endian is conventional, and also available only to managed code environments in Windows, such as .NET applications.[2] Windows also supports UTF-32 encodings via code page 12000 (little-endian) and 12001 (big-endian), fixed-width 32-bit formats that directly map Unicode code points without surrogates, available only to managed applications for scenarios requiring explicit 32-bit Unicode handling.[2]
The Standard Compression Scheme for Unicode (SCSU), a draft Unicode Technical Standard for reducing storage needs through dynamic windowing of frequent character ranges, sees limited implementation in Windows products like SQL Server for internal Unicode compression, but it lacks a dedicated code page identifier and is not exposed for general file or text handling.[43][44] Experimental encodings like UTF-EBCDIC, proposed to map Unicode onto EBCDIC-compatible structures for mainframe interoperability, remain unimplemented as Windows code pages and are confined to niche research or vendor-specific extensions post-2010.
Usage in Windows
API and System Implementation
Windows code pages are integrated into the operating system through the National Language Support (NLS) subsystem, which provides APIs in kernel32.dll for managing and converting between code pages and Unicode representations.[45] The NLS framework loads locale-specific data, including code page translation tables stored in binary NLS files (e.g., c_1252.nls for Windows-1252), from the %SystemRoot%\System32 directory, enabling dynamic access to character mappings without embedding them in applications.[2] Core APIs for querying active code pages include GetACP, which retrieves the current ANSI code page identifier used for non-Unicode text in the system locale, and GetOEMCP, which returns the OEM code page identifier typically used for console and DOS compatibility operations.[4][6] These functions allow applications to determine the system's default mappings for ANSI and OEM contexts, respectively, ensuring compatibility with legacy single-byte encodings. For validation, the IsValidCodePage API checks whether a specified code page identifier (e.g., 1252 for Western European) is supported by the system, returning a nonzero value if valid.[46] Character conversion between multi-byte code pages and Unicode is handled primarily by MultiByteToWideChar and its counterpart WideCharToMultiByte, both exported from kernel32.dll. MultiByteToWideChar translates a string from a specified code page to UTF-16, with flags such as MB_PRECOMPOSED (the default) directing the function to produce precomposed Unicode characters where possible, avoiding separate base and combining marks.[9] Conversely, WideCharToMultiByte performs the reverse, mapping UTF-16 to a target code page, and supports similar flags to control decomposition behavior for compatibility with legacy applications. These APIs rely on NLS-loaded tables for accurate mappings and are essential for bridging code page-based data with internal Unicode processing. 
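As a rough illustration of the query functions named above, the following sketch calls GetACP, GetOEMCP, and IsValidCodePage through Python's `ctypes`. The wrapper names are hypothetical, and on platforms other than Windows the wrappers return `None`, because kernel32.dll is not available there.

```python
import ctypes
import sys

def system_code_pages():
    """Return the ANSI and OEM code page IDs, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {"ansi": kernel32.GetACP(), "oem": kernel32.GetOEMCP()}

def is_valid_code_page(code_page: int):
    """Wrap IsValidCodePage; returns None where the API is unavailable."""
    if sys.platform != "win32":
        return None
    return bool(ctypes.windll.kernel32.IsValidCodePage(code_page))

pages = system_code_pages()
```

On a US English installation with default settings, the text above suggests `pages` would report 1252 for the ANSI entry and 437 for the OEM entry, though the actual values depend on the configured system locale.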
Error handling in these conversion functions addresses invalid byte sequences, which occur when input bytes do not map to valid characters in the source code page. Without the MB_ERR_INVALID_CHARS flag, MultiByteToWideChar substitutes a default or replacement character for such sequences, allowing partial conversions to proceed rather than failing entirely; setting the flag causes the function to return 0 and set GetLastError to ERROR_NO_UNICODE_TRANSLATION upon encountering invalid input.[9] This substitution mechanism prevents crashes in legacy code but can lead to silent data loss, so applications that cannot tolerate it should set MB_ERR_INVALID_CHARS and confirm that the code page itself is supported via IsValidCodePage beforehand.
In modern Windows versions starting from Windows 10, the system prefers UTF-16 as the internal encoding for strings and UI elements, reducing reliance on code pages for core operations while maintaining API support for backward compatibility.[1] Starting with Windows 10, beta system-wide UTF-8 support is available as an alternative to traditional ANSI code pages, configurable via administrative settings, and this feature is supported in Windows 11.[23]
Regional and Language Settings
In Windows, users configure code pages through the Regional and Language Settings in the Settings app (or legacy Control Panel), specifically under Time & Language > Region > Administrative language settings, where the system locale can be changed to select a non-Unicode code page for legacy applications.[23][47] For example, selecting the Russian (Russia) system locale sets the default ANSI code page to Windows-1251 (CP1251), which determines how non-Unicode programs interpret text characters.[47] This configuration affects the active code page used by the system for ANSI operations, with changes requiring a restart to take effect.[48]
Locale IDs (LCIDs) in Windows associate specific code pages with languages and regions, enabling the system to retrieve relevant encoding information for internationalization.[49] Each LCID is a 32-bit value combining a primary language identifier (lower 10 bits), sublanguage (next 6 bits), sort ID (next 4 bits), and reserved bits, with the default ANSI code page tied to the locale; for instance, LCID 1049 corresponds to Russian (Russia) and links to CP1251.[50] Applications can query these associations using the GetLocaleInfo function with the LOCALE_IDEFAULTANSICODEPAGE constant to obtain the ANSI code page for a given LCID, supporting runtime locale-specific text handling.[51][52]
Multilingual User Interface (MUI) packs and language feature updates install additional locales and their associated code pages during Windows setup or post-installation via the Settings app under Time & Language > Language > Add a language.[53] These packs extend system support for non-default languages, automatically incorporating the corresponding code pages without altering the primary system locale.[54] In recent versions, such as Windows 11, Microsoft has promoted UTF-8 as a configurable system locale option (labeled as a beta feature in Administrative settings), allowing users to set it as the default active code page (CP65001) for broader Unicode compatibility across legacy and modern applications.[23][55]
These settings directly impact compatibility for legacy applications that rely on the system locale's ANSI code page rather than Unicode, such as older versions of Notepad or console tools, where mismatched locales can result in garbled text display.[47][56] Enabling UTF-8 as the system locale enhances cross-language support in such apps by standardizing on a Unicode-based encoding, though it may require application-specific adjustments for optimal rendering.[23]
Limitations and Challenges
Common Encoding Problems
One prevalent issue in handling Windows code pages is mojibake, where text becomes garbled because it is decoded with the wrong code page. Different code pages assign different characters to the same byte values, so the misinterpretation is systematic; because the ANSI code page can vary across systems or be changed, a file written under one code page is easily corrupted when read under another. A common example involves a file saved in Windows-1252 (CP1252), which maps the euro symbol (€) to byte 0x80. When the file is interpreted as CP850 (OEM Multilingual Latin I), the same byte maps to Ç (C with cedilla), so every euro sign is displayed as Ç; under a code page that leaves the byte unassigned, a replacement character may be shown instead. Such mismatches are particularly frequent in legacy applications or cross-system file transfers without explicit encoding metadata.
Round-trip conversion problems further complicate text handling: converting from a code page to Unicode and back does not always preserve the original data, because some mappings are not reversible. In multi-byte code pages like CP932 (Shift-JIS for Japanese), multiple byte sequences may map to a single Unicode character, and the reverse conversion cannot unambiguously reconstruct the original bytes, leading to loss of information or altered content. The issue is exacerbated when the system and thread code pages differ: when MultiByteToWideChar and WideCharToMultiByte are called with CP_ACP or CP_THREAD_ACP, they use whichever code page is active in that context, so the two halves of a round trip may apply different mappings in multithreaded environments.
Locale mismatches amplify these risks, especially when files from regions using incompatible code pages are processed on systems configured for different locales.
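Both failure modes are easy to reproduce. The sketch below again relies on Python's codecs rather than the Windows conversion functions, so individual mappings may differ from what Windows produces. `misread` and `round_trips` are hypothetical helper names, and the two CP932 byte sequences were chosen on the assumption that both decode to the same character (U+2252).

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


def round_trips(raw: bytes, code_page: str) -> bool:
    """Return True if bytes -> Unicode -> bytes reproduces the input exactly."""
    return raw.decode(code_page).encode(code_page) == raw


# The euro sign is byte 0x80 in Windows-1252; CP850 assigns that byte to "Ç".
print(misread("€", "cp1252", "cp850"))

# Two CP932 byte sequences assumed to decode to the same character (U+2252).
standard_form = b"\x81\xe0"
nec_form = b"\x87\x90"
same_char = standard_form.decode("cp932") == nec_form.decode("cp932")

# If both decode to one character, encoding can reproduce at most one of them.
print(same_char, round_trips(standard_form, "cp932"), round_trips(nec_form, "cp932"))
```

A check like `round_trips` is cheap enough to run over sample data before committing to a conversion path, since it shows directly whether the original bytes can be recovered.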
A document encoded in CP1251 (Windows Cyrillic) illustrates such a locale mismatch: the Russian text "Привет" (bytes 0xCF 0xF0 0xE8 0xE2 0xE5 0xF2) appears as the Latin gibberish "Ïðèâåò" when viewed with a Latin-based code page like CP1252 on a Western European-configured Windows system. The corruption stems from the fundamental incompatibility between script-specific code pages: the byte values that CP1251 assigns to Cyrillic letters are assigned to accented Latin letters and symbols in CP1252, so the text is unreadable until it is decoded with the correct code page.
Detecting the correct code page poses significant challenges, particularly for legacy files. Unlike UTF-8 or UTF-16 files, text in a traditional Windows code page carries no Byte Order Mark (BOM) or other signature. Without metadata, applications must rely on heuristics or user intervention, often leading to trial-and-error decoding attempts. In console environments, the chcp command switches the active code page (e.g., chcp 1252 selects CP1252), but this changes only the code page the console uses for input and output; it neither detects nor converts the encoding of existing files, which complicates troubleshooting in mixed-locale scenarios.
Deprecation and Migration Strategies
Microsoft recommends that developers prioritize Unicode encodings, such as UTF-8 or UTF-16, for new Windows applications to avoid the limitations of legacy code pages and ensure broad international support.[2] This guidance favors UTF-8 for its compact, ASCII-compatible variable-width encoding and its compatibility with web standards, while UTF-16 remains suitable for internal string processing in Windows APIs.[57] Since Windows 10 version 1903, Microsoft has provided a system configuration option to set UTF-8 as the default ANSI code page for legacy non-Unicode applications, accessible via the "Use Unicode UTF-8 for worldwide language support" setting in Region > Administrative settings, which sets the ACP value under the registry key HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage to 65001.[23] This feature, initially introduced as beta, enables smoother transitions by interpreting legacy API calls through UTF-8, though it requires a system reboot and may impact performance in some scenarios.[58]
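The practical argument for Unicode is coverage: a single legacy code page cannot hold text that mixes scripts, while UTF-8 can, at the cost of a variable number of bytes per character. A minimal Python sketch of that trade-off follows. The helper name `encodable_in` is hypothetical, and Python's codec names (`cp1251`, `cp1252`) stand in for the corresponding Windows code pages.

```python
def encodable_in(text, encodings):
    """Report, for each encoding, whether it can represent every character."""
    result = {}
    for name in encodings:
        try:
            text.encode(name)
            result[name] = True
        except UnicodeEncodeError:
            result[name] = False
    return result


# Cyrillic and Western European letters in one string.
mixed = "Привет, Ça va?"
coverage = encodable_in(mixed, ["cp1251", "cp1252", "utf-8"])
print(coverage)

# UTF-8 is variable-width: the number of bytes depends on the character.
widths = {ch: len(ch.encode("utf-8")) for ch in "AÇП€"}
print(widths)
```

Neither single-byte code page can store the mixed string, which is the situation the UTF-8 system setting is meant to remove for legacy applications.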
For bulk migration of existing data from code pages to Unicode, several tools facilitate conversion. PowerShell cmdlets, such as Get-Content -Encoding <codepage> to read files in legacy encodings (e.g., Default for ANSI) followed by Set-Content -Encoding utf8, allow scripted batch processing of text files to UTF-8.[59] Windows native APIs like MultiByteToWideChar and WideCharToMultiByte provide programmatic conversion capabilities for developers integrating migration into applications.[1] Additionally, the iconv utility, installable via Git for Windows or similar environments, supports command-line bulk conversions, such as iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt, for handling multiple files from specific code pages.[60]
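The same read-decode-rewrite pipeline can be sketched in a few lines of Python, shown here only as an illustration of the approach rather than as a replacement for the tools above. The function name `convert_tree` and its parameters are hypothetical, and the source encoding must be known in advance.

```python
from pathlib import Path


def convert_tree(root, source_encoding, pattern="*.txt"):
    """Re-encode every matching file under root from source_encoding to UTF-8.

    Decoding is strict, so a file that is not valid in source_encoding is
    left untouched and reported instead of being silently corrupted.
    Returns the list of paths that could not be converted.
    """
    failed = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        try:
            text = raw.decode(source_encoding)
        except UnicodeDecodeError:
            failed.append(path)
            continue
        # Work on bytes so that line endings are preserved exactly.
        path.write_bytes(text.encode("utf-8"))
    return failed
```

A call such as `convert_tree("docs", "cp1252")` rewrites the files in place, so it should be run once on a backup copy: the sketch does not detect files that are already UTF-8, and converting them a second time would itself produce mojibake.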
Best practices for migration include embedding a Byte Order Mark (BOM) in UTF-8 files to assist legacy Windows applications in detecting the encoding, particularly for text files opened in Notepad or Excel.[61] After conversion, the output should be validated. The IsTextUnicode API applies statistical heuristics to guess whether a buffer is likely to contain Unicode (UTF-16) text, so it can flag obviously wrong data but cannot guarantee integrity; strictly decoding the converted output, for example with MultiByteToWideChar and MB_ERR_INVALID_CHARS, detects malformed sequences left by a mismatched code page. Enterprises adopting Windows 10 and later have increasingly implemented these strategies during upgrades, often combining automated scripts with testing phases to handle large-scale data shifts in multilingual environments.[62]
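A post-conversion check can be as simple as testing for the BOM and strictly decoding the bytes. The sketch below does this in Python as an illustration of the idea; it does not use IsTextUnicode or any other Windows API, and the function names are hypothetical. Strict decoding here plays the role that MB_ERR_INVALID_CHARS plays for MultiByteToWideChar.

```python
import codecs


def encode_utf8_with_bom(text: str) -> bytes:
    """Encode text as UTF-8 preceded by the byte order mark EF BB BF."""
    return text.encode("utf-8-sig")


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


def is_valid_utf8(raw: bytes) -> bool:
    """Strictly decode the data; any malformed sequence makes it invalid."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


converted = encode_utf8_with_bom("Привет")
leftover = "Привет".encode("cp1251")  # a file that was never converted

print(has_utf8_bom(converted), is_valid_utf8(converted))
print(has_utf8_bom(leftover), is_valid_utf8(leftover))
```

Strict decoding catches files that were skipped by a migration, but not files that were converted from the wrong source code page: those are valid UTF-8 that simply contains the wrong characters, and still need a spot check by someone who can read the text.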
As of 2025, Microsoft continues to promote Unicode adoption without announcing full deprecation of code pages, maintaining backward compatibility for legacy software while encouraging UTF-8 as the standard for future development.[23]