ISO/IEC 8859
from Wikipedia

ISO 8859 encoding family
Standard: ISO/IEC 8859
Classification: 8-bit extended ASCII, ISO/IEC 4873 level 1
Extends: ASCII
Preceded by: ISO/IEC 646
Succeeded by: ISO/IEC 10646 (Unicode)
Other related encodings: ISO/IEC 10367, Windows-125x

ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12.[1] The ISO working group maintaining this series of standards has been disbanded.

ISO/IEC 8859 parts 1, 2, 3, and 4 were originally Ecma International standard ECMA-94.

Introduction

While the bit patterns of the 95 printable ASCII characters are sufficient to exchange information in modern English, most other languages that use Latin alphabets need additional symbols not covered by ASCII. ISO/IEC 8859 sought to remedy this problem by utilizing the eighth bit in an 8-bit byte to allow positions for another 96 printable characters. Early encodings were limited to 7 bits because of restrictions of some data transmission protocols, and partially for historical reasons. However, more characters were needed than could fit in a single 8-bit character encoding, so several mappings were developed, including at least ten suitable for various Latin alphabets.

The ISO/IEC 8859 standard parts only define printable characters, although they explicitly set apart the byte ranges 0x00–1F and 0x7F–9F as "combinations that do not represent graphic characters" (i.e. which are reserved for use as control characters) in accordance with ISO/IEC 4873; they were designed to be used in conjunction with a separate standard defining the control functions associated with these bytes, such as ISO 6429 or ISO 6630.[2] To this end a series of encodings registered with the IANA add the C0 control set (control characters mapped to bytes 0 to 31) from ISO 646 and the C1 control set (control characters mapped to bytes 128 to 159) from ISO 6429, resulting in full 8-bit character maps with most, if not all, bytes assigned. These sets have ISO-8859-n as their preferred MIME name or, in cases where a preferred MIME name is not specified, their canonical name. Many people use the terms ISO/IEC 8859-n and ISO-8859-n interchangeably. ISO/IEC 8859-11 did not get such a charset assigned, presumably because it was almost identical to TIS 620.

Characters

The ISO/IEC 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO/IEC 8859 standards, or use Unicode instead.

An inexact rule based on practical experience states that if a character or symbol was not already part of a widely used data-processing character set and was also not usually provided on typewriter keyboards for a national language, it did not get in. Hence the directional double quotation marks « and » used for some European languages were included, but not the directional double quotation marks “ and ” used for English and some other languages.

French did not get its œ and Œ ligatures because they could be typed as 'oe'. Likewise, Ÿ, needed for all-caps text, was dropped as well.[3][4][5] Albeit under different codepoints, these three characters were later reintroduced with ISO/IEC 8859-15 in 1999, which also introduced the new euro sign character €. Likewise Dutch did not get the ij and IJ letters, because Dutch speakers had become used to typing these as two letters instead.

Romanian did not initially get its Ș/ș and Ț/ț (with comma) letters, because these letters were initially unified with Ş/ş and Ţ/ţ (with cedilla) by the Unicode Consortium, considering the shapes with comma beneath to be glyph variants of the shapes with cedilla. However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO/IEC 8859-16.
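The cedilla versus comma-below distinction can be observed directly with Python's built-in codecs. The following is a minimal sketch, assuming the standard-library codec names `iso8859_2` and `iso8859_16` correspond to parts 2 and 16; the helper `can_encode` is a hypothetical name introduced only for this illustration.

```python
# Assumption: Python's built-in codecs "iso8859_2" and "iso8859_16"
# implement parts 2 and 16 of the standard.
S_CEDILLA = "\u015e"  # LATIN CAPITAL LETTER S WITH CEDILLA
S_COMMA = "\u0218"    # LATIN CAPITAL LETTER S WITH COMMA BELOW


def can_encode(char: str, codec: str) -> bool:
    """Return True if the codec assigns a byte to this character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Part 2 only has the cedilla form; part 16 has the comma-below form.
print("part 2, cedilla:", can_encode(S_CEDILLA, "iso8859_2"))
print("part 2, comma:", can_encode(S_COMMA, "iso8859_2"))
print("part 16, comma:", can_encode(S_COMMA, "iso8859_16"))
```

Both letters occupy byte 0xAA in their respective parts, so text converted between the two parts by byte value alone silently changes which letter is meant.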

Most of the ISO/IEC 8859 encodings provide diacritic marks required for various European languages using the Latin script. Others provide non-Latin alphabets: Greek, Cyrillic, Hebrew, Arabic and Thai. Most of the encodings contain only spacing characters, although the Thai, Hebrew, and Arabic ones do also contain combining characters.

The standard makes no provision for the scripts of East Asian languages (CJK), as their ideographic writing systems require many thousands of code points. Although it uses Latin based characters, Vietnamese does not fit into 96 positions (without using combining diacritics such as in Windows-1258) either. Each Japanese syllabic alphabet (hiragana or katakana, see Kana) would fit, as in JIS X 0201, but like several other alphabets of the world they are not encoded in the ISO/IEC 8859 system.

The parts of ISO/IEC 8859

ISO/IEC 8859 is divided into the following parts:

Part 1: Latin-1 (Western European)
Revisions: 1987, 1998
Other standards: ECMA-94 (1985, 1986)
Description: Perhaps the most widely used part of ISO/IEC 8859, covering most Western European languages: Danish (partial),[nb 1] Dutch,[nb 2] English, Faeroese, Finnish (partial),[nb 3] French (partial),[nb 3] German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Catalan, and Swedish. Languages from other parts of the world are also covered, including: Eastern European Albanian, Southeast Asian Indonesian, as well as the African languages Afrikaans and Swahili. A modification of DEC MCS; the first (1985) standard version at the ECMA level lacked the times sign and division obelus, which were added the next year. The missing euro sign and capital Ÿ are in the revised version ISO/IEC 8859-15 (see below). The corresponding IANA character set is ISO-8859-1.

Part 2: Latin-2 (Central European)
Revisions: 1987, 1999
Other standards: ECMA-94 (1986)[nb 4]
Description: Supports those Central and Eastern European languages that use the Latin alphabet, including Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian, and Hungarian. The missing euro sign can be found in version ISO/IEC 8859-16.

Part 3: Latin-3 (South European)
Revisions: 1988, 1999
Description: Turkish, Maltese, and Esperanto. Largely superseded by ISO/IEC 8859-9 for Turkish.

Part 4: Latin-4 (North European)
Revisions: 1988, 1998
Description: Estonian, Latvian, Lithuanian, Greenlandic, and Sami.

Part 5: Latin/Cyrillic
Revisions: 1988, 1999
Other standards: ECMA-113 (1988, 1999)[nb 5]
Description: Covers mostly Slavic languages that use a Cyrillic alphabet, including Belarusian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian (partial).[nb 6]

Part 6: Latin/Arabic
Revisions: 1987, 1999
Description: Covers the most common Arabic language characters. Does not support other languages using the Arabic script. Needs to be BiDi and cursive joining processed for display.

Part 7: Latin/Greek
Revisions: 1987, 2003
Description: Covers the modern Greek language (monotonic orthography). Can also be used for Ancient Greek written without accents or in monotonic orthography, but lacks the diacritics for polytonic orthography. These were introduced with Unicode. Updated 2003 to add the euro sign, drachma sign and spacing ypogegrammeni.

Part 8: Latin/Hebrew
Revisions: 1988, 1999
Description: Covers the modern Hebrew alphabet as used in Israel. In practice two different encodings exist, logical order (needs to be BiDi processed for display) and visual (left-to-right) order (in effect, after bidi processing and line breaking). Updated 1999 to add LRM and RLM. Updated at national standard level in 2002 to add euro and shekel signs and more bidirectional format effectors; the 2002 additions were never incorporated back into the ISO standard version.

Part 9: Latin-5 (Turkish)
Revisions: 1989, 1999
Description: Largely the same as ISO/IEC 8859-1, replacing the rarely used Icelandic letters with Turkish ones.

Part 10: Latin-6 (Nordic)
Revisions: 1992, 1998
Other standards: ECMA-144 (1990, 1992, 2000)
Description: A rearrangement of Latin-4. Considered more useful for Nordic languages. Baltic languages use Latin-4 more.

Part 11: Latin/Thai
Revisions: 2001
Other standards: TIS-620 (1986, 1990)
Description: Contains characters needed for the Thai language. First revision established in 1986 at national standard level as TIS 620. Elevated to ISO standard status as a part of ISO 8859 in 2001, with the addition of a non-breaking space.

Part 12: Latin/Devanagari
Revisions: N/A
Other standards: none
Description: Originally proposed to support the Celtic languages,[6][7] then slated for Latin/Devanagari,[8] but abandoned in 1997, during the 12th meeting of ISO/IEC JTC 1/SC 2/WG 3.[9] The Celtic proposal was changed to ISO 8859-14, with part 12 possibly being reserved for ISCII Indian.[10]

Part 13: Latin-7 (Baltic Rim)
Revisions: 1998
Other standards: none
Description: Added some characters for Baltic languages which were missing from Latin-4 and Latin-6. Related to the earlier-published[nb 7] Windows-1257.

Part 14: Latin-8 (Celtic)
Revisions: 1998
Other standards: none
Description: Covers Celtic languages such as Gaelic and the Breton language. Welsh letters correspond to the earlier (1994) ISO-IR-182.

Part 15: Latin-9
Revisions: 1999
Other standards: none
Description: A revision of 8859-1 that removes some little-used symbols, replacing them with the euro sign and the letters Š, š, Ž, ž, Œ, œ, and Ÿ, which completes the coverage of French, Finnish and Estonian.

Part 16: Latin-10 (South-Eastern European)
Revisions: 2001
Other standards: SR 14111 (1998)
Description: Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovene, but also Finnish, French, German and Irish Gaelic (new orthography). The focus lies more on letters than symbols. The generic currency sign is replaced with the euro sign.

Each part of ISO/IEC 8859 is designed to support languages that often borrow from each other, so the characters needed by each language are usually accommodated by a single part. However, there are some characters and language combinations that are not accommodated without transcriptions. Efforts were made to make conversions as smooth as possible. For example, German has all of its seven special characters at the same positions in all Latin variants (1–4, 9, 10, 13–16), and in many positions the characters only differ in the diacritics between the sets. In particular, variants 1–4 were designed jointly, and have the property that every encoded character appears either at a given position or not at all.
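The claim about the seven German special characters can be checked with a short script. This is a sketch that assumes Python's built-in `iso8859_N` codecs correspond to the Latin parts named above; `encodings_of` is a hypothetical helper introduced for the illustration.

```python
# Assumption: these Python codec numbers correspond to the Latin parts
# listed in the text (1-4, 9, 10, 13-16).
LATIN_PARTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]

# The seven German special characters: A, O, U with diaeresis in both
# cases, plus sharp s. Escapes are used to keep this source file ASCII.
GERMAN_SPECIALS = "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df"


def encodings_of(text: str) -> dict[int, bytes]:
    """Encode `text` with every Latin part and return the bytes per part."""
    return {part: text.encode(f"iso8859_{part}") for part in LATIN_PARTS}


results = encodings_of(GERMAN_SPECIALS)

# A single distinct value means every part uses the same byte positions.
print(set(results.values()))
```

If any part placed one of these letters elsewhere, the set would contain more than one byte string (or the encode call would fail for a part that lacked the letter).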

Table

Comparison of the various parts (1–16) of ISO/IEC 8859
Binary Oct Dec Hex 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16
1010 0000 240 160 A0 Non-breaking space (NBSP)
1010 0001 241 161 A1 ¡ Ą Ħ Ą Ё     ¡ Ą ¡ Ą
1010 0010 242 162 A2 ¢ ˘ ĸ Ђ   ¢ Ē ¢ ¢ ą
1010 0011 243 163 A3 £ Ł £ Ŗ Ѓ   £ Ģ £ Ł
1010 0100 244 164 A4 ¤ Є ¤ ¤ Ī ¤ Ċ
1010 0101 245 165 A5 ¥ Ľ   Ĩ Ѕ   ¥ Ĩ ċ ¥
1010 0110 246 166 A6 ¦ Ś Ĥ Ļ І   ¦ Ķ ¦ Š
1010 0111 247 167 A7 § Ї   § §
1010 1000 250 168 A8 ¨ Ј   ¨ Ļ Ø š
1010 1001 251 169 A9 © Š İ Š Љ   © Đ ©
1010 1010 252 170 AA ª Ş Ē Њ   ͺ × ª Š Ŗ ª Ș
1010 1011 253 171 AB « Ť Ğ Ģ Ћ   « Ŧ « «
1010 1100 254 172 AC ¬ Ź Ĵ Ŧ Ќ ، ¬ Ž ¬ ¬ Ź
1010 1101 255 173 AD Soft hyphen (SHY) SHY
1010 1110 256 174 AE ® Ž   Ž Ў     ® Ū ® ź
1010 1111 257 175 AF ¯ Ż ¯ Џ   ¯ Ŋ Æ Ÿ ¯ Ż
1011 0000 260 176 B0 ° А   ° ° °
1011 0001 261 177 B1 ± ą ħ ą Б   ± ą ± ±
1011 0010 262 178 B2 ² ˛ ² ˛ В   ² ē ² Ġ ² Č
1011 0011 263 179 B3 ³ ł ³ ŗ Г   ³ ģ ³ ġ ³ ł
1011 0100 264 180 B4 ´ Д   ΄ ´ ī Ž
1011 0101 265 181 B5 µ ľ µ ĩ Е   ΅ µ ĩ µ µ
1011 0110 266 182 B6 ś ĥ ļ Ж   Ά ķ
1011 0111 267 183 B7 · ˇ · ˇ З   · · ·
1011 1000 270 184 B8 ¸ И   Έ ¸ ļ ø ž
1011 1001 271 185 B9 ¹ š ı š Й   Ή ¹ đ ¹ ¹ č
1011 1010 272 186 BA º ş ē К   Ί ÷ º š ŗ º ș
1011 1011 273 187 BB » ť ğ ģ Л ؛ » ŧ » »
1011 1100 274 188 BC ¼ ź ĵ ŧ М   Ό ¼ ž ¼ Œ
1011 1101 275 189 BD ½ ˝ ½ Ŋ Н   ½ ½ œ
1011 1110 276 190 BE ¾ ž   ž О   Ύ ¾ ū ¾ Ÿ
1011 1111 277 191 BF ¿ ż ŋ П ؟ Ώ   ¿ ŋ æ ¿ ż
1100 0000 300 192 C0 À Ŕ À Ā Р   ΐ   À Ā Ą À
1100 0001 301 193 C1 Á С ء Α   Á Į Á
1100 0010 302 194 C2 Â Т آ Β   Â Ā Â
1100 0011 303 195 C3 Ã Ă   Ã У أ Γ   Ã Ć Ã Ă
1100 0100 304 196 C4 Ä Ф ؤ Δ   Ä Ä
1100 0101 305 197 C5 Å Ĺ Ċ Å Х إ Ε   Å Å Ć
1100 0110 306 198 C6 Æ Ć Ĉ Æ Ц ئ Ζ   Æ Ę Æ
1100 0111 307 199 C7 Ç Į Ч ا Η   Ç Į Ē Ç
1100 1000 310 200 C8 È Č È Č Ш ب Θ   È Č Č È
1100 1001 311 201 C9 É Щ ة Ι   É É
1100 1010 312 202 CA Ê Ę Ê Ę Ъ ت Κ   Ê Ę Ź Ê
1100 1011 313 203 CB Ë Ы ث Λ   Ë Ė Ë
1100 1100 314 204 CC Ì Ě Ì Ė Ь ج Μ   Ì Ė Ģ Ì
1100 1101 315 205 CD Í Э ح Ν   Í Ķ Í
1100 1110 316 206 CE Î Ю خ Ξ   Î Ī Î
1100 1111 317 207 CF Ï Ď Ï Ī Я د Ο   Ï Ļ Ï
Binary Oct Dec Hex 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16
1101 0000 320 208 D0 Ð Đ   Đ а ذ Π   Ğ Ð Š Ŵ Ð
1101 0001 321 209 D1 Ñ Ń Ñ Ņ б ر Ρ   Ñ Ņ Ń Ñ Ń
1101 0010 322 210 D2 Ò Ň Ò Ō в ز     Ò Ō Ņ Ò
1101 0011 323 211 D3 Ó Ķ г س Σ   Ó Ó
1101 0100 324 212 D4 Ô д ش Τ   Ô Ō Ô
1101 0101 325 213 D5 Õ Ő Ġ Õ е ص Υ   Õ Õ Ő
1101 0110 326 214 D6 Ö ж ض Φ   Ö Ö
1101 0111 327 215 D7 × з ط Χ   × Ũ × × Ś
1101 1000 330 216 D8 Ø Ř Ĝ Ø и ظ Ψ   Ø Ų Ø Ű
1101 1001 331 217 D9 Ù Ů Ù Ų й ع Ω   Ù Ų Ł Ù
1101 1010 332 218 DA Ú к غ Ϊ   Ú Ś Ú
1101 1011 333 219 DB Û Ű Û л   Ϋ   Û   Ū Û
1101 1100 334 220 DC Ü м   ά   Ü   Ü
1101 1101 335 221 DD Ý Ŭ Ũ н   έ   İ Ý   Ż Ý Ę
1101 1110 336 222 DE Þ Ţ Ŝ Ū о   ή   Ş Þ   Ž Ŷ Þ Ț
1101 1111 337 223 DF ß п   ί ß ฿ ß
1110 0000 340 224 E0 à ŕ à ā р ـ ΰ א à ā ą à
1110 0001 341 225 E1 á с ف α ב á į á
1110 0010 342 226 E2 â т ق β ג â ā â
1110 0011 343 227 E3 ã ă   ã у ك γ ד ã ć ã ă
1110 0100 344 228 E4 ä ф ل δ ה ä ä
1110 0101 345 229 E5 å ĺ ċ å х م ε ו å å ć
1110 0110 346 230 E6 æ ć ĉ æ ц ن ζ ז æ ę æ
1110 0111 347 231 E7 ç į ч ه η ח ç į ē ç
1110 1000 350 232 E8 è č è č ш و θ ט è č č è
1110 1001 351 233 E9 é щ ى ι י é é
1110 1010 352 234 EA ê ę ê ę ъ ي κ ך ê ę ź ê
1110 1011 353 235 EB ë ы ً λ כ ë ė ë
1110 1100 354 236 EC ì ě ì ė ь ٌ μ ל ì ė ģ ì
1110 1101 355 237 ED í э ٍ ν ם í ķ í
1110 1110 356 238 EE î ю َ ξ מ î ī î
1110 1111 357 239 EF ï ď ï ī я ُ ο ן ï ļ ï
1111 0000 360 240 F0 ð đ   đ ِ π נ ğ ð š ŵ ð đ
1111 0001 361 241 F1 ñ ń ñ ņ ё ّ ρ ס ñ ņ ń ñ ń
1111 0010 362 242 F2 ò ň ò ō ђ ْ ς ע ò ō ņ ò
1111 0011 363 243 F3 ó ķ ѓ   σ ף ó ó
1111 0100 364 244 F4 ô є   τ פ ô ō ô
1111 0101 365 245 F5 õ ő ġ õ ѕ   υ ץ õ õ ő
1111 0110 366 246 F6 ö і   φ צ ö ö
1111 0111 367 247 F7 ÷ ї   χ ק ÷ ũ ÷ ÷ ś
1111 1000 370 248 F8 ø ř ĝ ø ј   ψ ר ø ų ø ű
1111 1001 371 249 F9 ù ů ù ų љ   ω ש ù ų ł ù
1111 1010 372 250 FA ú њ   ϊ ת ú ś ú
1111 1011 373 251 FB û ű û ћ   ϋ   û ū û
1111 1100 374 252 FC ü ќ   ό   ü   ü
1111 1101 375 253 FD ý ŭ ũ §   ύ LRM ı ý   ż ý ę
1111 1110 376 254 FE þ ţ ŝ ū ў   ώ RLM ş þ   ž ŷ þ ț
1111 1111 377 255 FF ÿ ˙ џ       ÿ ĸ   ÿ
Binary Oct Dec Hex 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16

  unassigned code points.
  new additions in ISO/IEC 8859-7:2003 and ISO/IEC 8859-8:1999 versions, previously unassigned.

Relationship to Unicode and the UCS

Since 1991, the Unicode Consortium has been working with ISO and IEC to develop the Unicode Standard and ISO/IEC 10646, the Universal Character Set (UCS), in tandem. Newer editions of ISO/IEC 8859 express characters in terms of their Unicode/UCS names and the U+nnnn notation, effectively making each part of ISO/IEC 8859 a Unicode/UCS character encoding scheme that maps a very small subset of the UCS to single 8-bit bytes. The first 256 characters in Unicode and the UCS are identical to those in ISO/IEC 8859-1 (Latin-1).
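The correspondence between Latin-1 and the first 256 Unicode code points can be illustrated as follows. This sketch assumes Python's `latin-1` codec, which implements the IANA charset ISO-8859-1 (including the C0 and C1 control positions).

```python
# Decode every possible byte value as ISO-8859-1 (Python codec "latin-1").
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value n should come back as the code point U+00nn.
identical = all(ord(ch) == n for n, ch in enumerate(decoded))
print(identical)
```

Because of this property, converting Latin-1 bytes to Unicode code points needs no lookup table: the byte value is the code point.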

Single-byte character sets including the parts of ISO/IEC 8859 and derivatives of them were favoured throughout the 1990s, having the advantages of being well-established and more easily implemented in software: the equation of one byte to one character is simple and adequate for most single-language applications, and there are no combining characters or variant forms. As Unicode-enabled operating systems became more widespread, ISO/IEC 8859 and other legacy encodings became less popular. While remnants of ISO 8859 and single-byte character models remain entrenched in many operating systems, programming languages, data storage systems, networking applications, display hardware, and end-user application software, most modern computing applications use Unicode internally, and rely on conversion tables to map to and from other encodings, when necessary.
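The "one byte, one character" property mentioned above can be contrasted with a variable-width encoding. This is a small sketch using an arbitrary sample string; the string itself is an assumption chosen only because it contains two non-ASCII letters.

```python
# "naive cafe" with i-diaeresis and e-acute, written with escapes so the
# source stays ASCII.
text = "na\u00efve caf\u00e9"

latin1 = text.encode("latin-1")  # single-byte encoding
utf8 = text.encode("utf-8")      # variable-width encoding

# In Latin-1 the byte count equals the character count, so the n-th
# character is simply the n-th byte. In UTF-8 the two accented letters
# take two bytes each.
print(len(text), len(latin1), len(utf8))

# Indexing by byte offset works directly in the single-byte encoding.
print(hex(latin1[2]), hex(ord(text[2])))
```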

Current status

The ISO/IEC 8859 standard was maintained by ISO/IEC Joint Technical Committee 1, Subcommittee 2, Working Group 3 (ISO/IEC JTC 1/SC 2/WG 3). In June 2004, WG 3 disbanded, and maintenance duties were transferred to SC 2. The standard is not currently being updated, as the Subcommittee's only remaining working group, WG 2, is concentrating on development of Unicode's Universal Coded Character Set.

The WHATWG Encoding Standard, which specifies the character encodings that HTML5-compliant browsers must support,[12] includes most parts of ISO/IEC 8859,[13] except for parts 1, 9 and 11, which are instead interpreted as Windows-1252, Windows-1254 and Windows-874 respectively.[14] Authors of new pages and designers of new protocols are instructed to use UTF-8 instead.[14]
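The practical effect of treating an ISO-8859-1 label as Windows-1252 shows up only in the byte range 0x80–0x9F. The sketch below uses Python's `latin-1` and `cp1252` codecs as stand-ins; it is not an implementation of the WHATWG decoder, and the sample bytes are an assumption chosen for illustration.

```python
# Bytes 0x93 and 0x94 fall in the 0x80-0x9F range.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")  # C1 control characters U+0093, U+0094
as_cp1252 = raw.decode("cp1252")   # curly quotes U+201C, U+201D

# ascii() escapes non-ASCII characters so the output is terminal-safe.
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

Bytes from 0xA0 upward decode identically under both codecs, which is why the substitution is mostly invisible for well-formed Latin-1 text.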

from Grokipedia
ISO/IEC 8859 is a multipart international standard jointly published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) that defines a family of 8-bit single-byte coded graphic character sets for information technology applications. Each part specifies up to 191 graphic characters: the 95 graphic characters of the 7-bit US-ASCII repertoire plus up to 96 additional characters. The standards were developed to enable efficient encoding of characters from various scripts, primarily focusing on European languages but extending to some Middle Eastern and Asian ones.

The series comprises numbered parts, each designated for specific linguistic groups; for instance, ISO/IEC 8859-1 (Latin alphabet No. 1) covers Western European languages like English, French, and German, while ISO/IEC 8859-2 (Latin alphabet No. 2) addresses Central and Eastern European languages such as Polish and Czech. Other notable parts include ISO/IEC 8859-7 for Greek, ISO/IEC 8859-8 for Hebrew, and ISO/IEC 8859-11 for Thai, with additional parts supporting Baltic, Turkish, Nordic, South European, and Celtic languages. These encodings are compatible with related standards from ANSI and ECMA International, ensuring interoperability in early computing environments.

Originating from proposals by ECMA Technical Committee 11 in the mid-1980s, the ISO/IEC 8859 parts were progressively adopted as international standards starting with Part 1 in 1987. Although widely used in legacy systems for their simplicity and efficiency in handling 256 code points, the family has been largely superseded by the Unicode standard, whose first 256 code points match ISO/IEC 8859-1 and which supports far broader multilingual capabilities.

Overview and History

Definition and Purpose

ISO/IEC 8859 is a series of international standards jointly published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), defining single-byte 8-bit character encodings with 256 code positions. This family of standards extends the 7-bit ASCII code, which is restricted to 128 positions, by using the full range of an 8-bit byte to accommodate additional graphic symbols and letters needed for multilingual text processing.

The primary purpose of ISO/IEC 8859 is to facilitate the standardized representation and interchange of text in various languages, particularly those using alphabetic scripts such as Latin, Greek, Cyrillic, Arabic, and Hebrew, within computing environments constrained to 8-bit data units. By providing a consistent framework for encoding characters beyond basic English, the standards address the need for international data compatibility in applications like document processing, data transmission, and software localization. Each part of the series targets specific linguistic repertoires, enabling efficient handling of accented letters, diacritics, and other script-specific symbols without requiring multi-byte extensions.

A key feature of ISO/IEC 8859 is its structural compatibility with ASCII: the first 128 code positions (0x00–0x7F) match the ISO/IEC 646 (equivalent to ASCII) control and graphic characters exactly, ensuring backward compatibility. In the upper 128 positions (0x80–0xFF), the range 0x80–0x9F is set aside for control functions, and the range 0xA0–0xFF holds up to 96 additional printable graphic characters specific to each part. Together with the 95 ASCII graphic characters, this allows for up to 191 graphic characters across the 256 positions, with the remaining positions reserved for control functions.

The standards emphasize single-byte coding for simplicity in processing and storage, making them suitable for early 8-bit systems. Developed in the mid-1980s by the European Computer Manufacturers Association (ECMA) in collaboration with ANSI and subsequently adopted by ISO/IEC JTC 1, ISO/IEC 8859 emerged to overcome the limitations of 7-bit codes for languages other than English in an increasingly internationalized computing landscape. The first part, ISO/IEC 8859-1 (Latin Alphabet No. 1), was published in 1987, marking the beginning of the series' expansion to cover diverse language needs.

Development and Standardization

The development of the ISO/IEC 8859 series originated in the mid-1980s through efforts by Ecma International (formerly the European Computer Manufacturers Association) to create standardized 8-bit extensions to the 7-bit ASCII character set, addressing the need for multilingual support in information processing. Ecma published its foundational standard, ECMA-94, in March 1985, defining the character repertoire for Latin-based Western European languages, which served as the basis for the first part of the ISO/IEC series.

Standardization of ISO/IEC 8859 was managed by the Joint Technical Committee ISO/IEC JTC 1, specifically Subcommittee 2 (SC 2) on Coded Character Sets, with Working Group 3 (WG 3) overseeing the development of 7-bit and 8-bit codes. SC 2, established in the early 1960s as part of ISO/TC 97 and integrated into JTC 1 in 1987, focused on graphic character sets and their coded representations for international interchange.

The initial parts of ISO/IEC 8859 were published between 1987 and the early 1990s: ISO/IEC 8859-1 (Latin alphabet No. 1) in February 1987, ISO/IEC 8859-2 (Latin alphabet No. 2, for Central and Eastern European languages) also in February 1987, and subsequent parts such as 8859-3 (Latin alphabet No. 3) in 1988 and 8859-4 (Latin alphabet No. 4) in 1988.

The revision process involved technical updates to accommodate emerging needs, with new parts added to cover additional scripts and languages through collaborative proposals within SC 2/WG 3. For instance, ISO/IEC 8859-15 (Latin alphabet No. 9), an update to 8859-1, was published in March 1999 to include the euro sign (€) and characters for French and Finnish, replacing less-used symbols from the original. The series culminated in ISO/IEC 8859-16 (Latin alphabet No. 10), published in July 2001, which drew on the repertoires of prior parts to support Southeastern European languages including Romanian, Albanian, and Croatian.

Technical Structure

Character Encoding Principles

ISO/IEC 8859 is an 8-bit single-byte character encoding standard that uses a 256-position code table arranged in a 16-by-16 grid. Each character is encoded using exactly one byte, enabling straightforward mapping without multi-byte sequences or state dependencies. The encoding is fixed for each part of the standard, meaning the assignment of code points to characters does not vary based on context or prior bytes, which facilitates reliable data interchange in multilingual environments.

The first 128 positions (0x00 to 0x7F) are invariant across all parts of ISO/IEC 8859 and identical to those defined in US-ASCII (ISO 646), ensuring backward compatibility with 7-bit systems. Within this range, the C0 control set occupies positions 0x00–0x1F and DEL occupies 0x7F, for a total of 33 control positions covering functions such as NUL and TAB, as specified in ISO/IEC 6429. The remaining 95 positions, 0x20–0x7E, hold the invariant graphic characters from ASCII, including letters, digits, and common symbols.

The upper 128 positions (0x80 to 0xFF) extend the encoding for national character sets. Positions 0x80–0x9F are designated for the C1 control set from ISO/IEC 6429, which includes functions like IND and NEL, and are not assigned graphic characters in the standard to avoid conflicts with control usage. Consequently, the 96 positions from 0xA0 to 0xFF are available for part-specific graphic characters, such as accented letters or symbols tailored to particular languages, resulting in a total of up to 191 graphic characters per part when combined with the ASCII graphics. This allocation prioritizes control integrity while providing space for linguistic extensions.
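The layout of the code table described above can be expressed as a small classifier that also reproduces the 191 figure. This is a sketch only; the function name `classify` and the region labels are hypothetical and not taken from the standard.

```python
def classify(byte: int) -> str:
    """Name the region of the 8-bit code table that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not an 8-bit value")
    if byte <= 0x1F:
        return "C0 control"
    if byte <= 0x7E:
        return "ASCII graphic"
    if byte == 0x7F:
        return "DEL"
    if byte <= 0x9F:
        return "C1 control"
    return "part-specific graphic"


# Count how many of the 256 positions fall in each region.
counts: dict[str, int] = {}
for value in range(256):
    region = classify(value)
    counts[region] = counts.get(region, 0) + 1

print(counts)

# 95 ASCII graphics + 96 part-specific positions = 191 graphic characters.
print(counts["ASCII graphic"] + counts["part-specific graphic"])
```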

Code Page Organization

ISO/IEC 8859 defines a collection of 15 published parts (numbered 1 through 11 and 13 through 16, with part 12 abandoned), each constituting a distinct 8-bit single-byte coded graphic character set designed for specific linguistic needs. Each part functions as an independent code page, allocating 256 positions where the first 128 (0x00 to 0x7F) align with the ASCII standard for basic Latin characters and controls, while the remaining 128 (0x80 to 0xFF) accommodate up to 96 additional graphic characters tailored to regional scripts and diacritics, leaving 0x80–0x9F for controls as per ISO/IEC 6429. For instance, Windows code page 1252 (CP1252) serves as a proprietary variant of part 1 (ISO/IEC 8859-1), extending it by assigning printable characters to positions in 0x80–0x9F, which hold no graphic characters in the ISO standard.

The parts are grouped by script type. The Latin-based parts (1 through 4, 9, 10, and 13 through 16) primarily address Western, Central, Southern, Northern, and Southeastern European languages using the Latin alphabet, incorporating extensions for diacritics, ligatures, and symbols needed for languages such as English, Polish, Turkish, the Baltic languages, and the Celtic languages. The non-Latin parts (5 through 8 and 11) cover Cyrillic (part 5, for Russian and Bulgarian), Arabic (part 6), Greek (part 7), Hebrew (part 8), and Thai (part 11).

All parts share the same base Latin repertoire in positions 0x20–0x7E, ensuring compatibility for common Western characters, while divergences occur exclusively in the upper half, where each part places its script-specific characters. This modular design allows implementations to select a single part for a given locale, promoting efficient 8-bit storage and transmission in legacy systems.

For practical identification and interoperability, the Internet Assigned Numbers Authority (IANA) maintains a registry of preferred charset names for each part, such as "ISO-8859-1" for part 1 and "ISO-8859-16" for part 16, which are used in protocols like MIME for email and web content negotiation. These labels reference the corresponding ISO/IEC standards, ensuring unambiguous reference in software and network applications.
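Because each part is an independent code page, the same upper-half byte means something different depending on which charset label accompanies it. The sketch below assumes Python's `iso8859_N` codec names stand in for the IANA charsets "ISO-8859-N"; the byte value 0xC1 and the helper `decode_in_parts` are arbitrary choices made for illustration.

```python
import unicodedata

# Assumption: Python's codec names "iso8859_N" stand in for the IANA
# charsets "ISO-8859-N"; the byte value 0xC1 is an arbitrary example.
BYTE = b"\xc1"


def decode_in_parts(raw: bytes, parts: tuple[int, ...]) -> dict[int, str]:
    """Decode the same bytes under several parts of the standard."""
    return {part: raw.decode(f"iso8859_{part}") for part in parts}


decoded = decode_in_parts(BYTE, (1, 5, 6, 7))
for part, char in decoded.items():
    # Print code point and Unicode name rather than the glyph itself,
    # so the output is safe on any terminal encoding.
    print(f"ISO-8859-{part}: U+{ord(char):04X} {unicodedata.name(char)}")
```

A lower-half byte such as 0x41 would decode to the same letter under every part, which is the shared ASCII foundation the text describes.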

Specific Parts

Part 1: Latin Alphabet No. 1

ISO/IEC 8859-1, also known as Latin-1, is the first part of the ISO/IEC 8859 series, defining an 8-bit single-byte coded character set for the Latin alphabet primarily used in Western European languages. The standard was first published in February 1987 as ISO 8859-1:1987 by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), under the auspices of Joint Technical Committee 1 (JTC 1) Subcommittee 2 (SC 2). It was technically revised and reissued as the second edition, ISO/IEC 8859-1:1998, on April 15, 1998, maintaining compatibility with the original while clarifying certain aspects. This encoding extends the 7-bit US-ASCII set (codes 0x00–0x7F) with an additional 96 graphic characters in the range 0xA0–0xFF, resulting in a total repertoire of 191 graphic characters, excluding control codes.

The character set in 0xA0–0xFF includes spacing accents, letters with diacritical marks, and symbols essential for languages such as English, French, German, Spanish, Italian, Dutch, and Portuguese. Notable among these are accented Latin letters like á (0xE1), ñ (0xF1), and ü (0xFC), which support proper orthography in Romance and Germanic languages. Key punctuation and symbols include the inverted exclamation mark ¡ (0xA1) and inverted question mark ¿ (0xBF), used in Spanish; currency symbols such as the cent sign ¢ (0xA2), pound sign £ (0xA3), and yen sign ¥ (0xA5); and the guillemets « (0xAB) and » (0xBB) used as quotation marks. These characters enable multilingual text processing and interchange without requiring multi-byte encodings.

Due to its broad applicability, ISO/IEC 8859-1 became the dominant encoding for early internet applications. It served as the default charset for HTTP text/* content types without an explicit charset parameter, as specified in RFC 2616, and was widely assumed for early HTML documents. The named character entities in HTML (e.g., &aacute; for á) were originally defined based on the ISO/IEC 8859-1 repertoire, facilitating the inclusion of these characters in web pages. The Internet Assigned Numbers Authority (IANA) designates "ISO-8859-1" as the preferred MIME name for this charset.

The need for the euro sign (€) was later addressed by a separate part, ISO/IEC 8859-15:1999 (Latin Alphabet No. 9), published in March 1999, which modifies 8 positions in the 0xA0–0xFF range of ISO/IEC 8859-1, including € at 0xA4, while retaining the other original characters.
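The eight positions that differ between ISO/IEC 8859-1 and ISO/IEC 8859-15 can be listed by decoding each upper-half byte under both codecs. This is a sketch assuming Python's `latin-1` and `iso8859_15` codecs; `differing_positions` is a hypothetical helper name.

```python
def differing_positions(codec_a: str, codec_b: str) -> dict[int, tuple[str, str]]:
    """Map each byte value in 0xA0-0xFF to the two characters it decodes to,
    keeping only the positions where the two codecs disagree."""
    diffs = {}
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        a, b = raw.decode(codec_a), raw.decode(codec_b)
        if a != b:
            diffs[value] = (a, b)
    return diffs


diffs = differing_positions("latin-1", "iso8859_15")

# Print code points rather than glyphs so the output is terminal-safe.
for value, (old, new) in diffs.items():
    print(f"0x{value:02X}: U+{ord(old):04X} -> U+{ord(new):04X}")
```

The first entry printed is position 0xA4, where the generic currency sign U+00A4 gives way to the euro sign U+20AC.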

Other Notable Parts

ISO/IEC 8859-2, first published in 1987 and revised in 1999, defines Latin alphabet No. 2, a character set designed for Central and Eastern European languages that use Latin script, such as Czech, Polish, Hungarian, and Romanian. It includes 191 graphic characters, extending the ASCII base with diacritics and special letters necessary for these languages, such as Ą (A with ogonek), Ł (L with stroke), Ś (S with acute), and Š (S with caron).

ISO/IEC 8859-5, introduced in 1988 and updated in 1999, specifies the Latin/Cyrillic alphabet for Cyrillic-based languages, and was used particularly on older systems for Bulgarian and Russian. This part encodes 191 characters, incorporating the Cyrillic alphabet alongside Latin letters, with examples including Ё (Io), А (A), а (a), and ё (io), to support text processing in environments requiring Cyrillic script compatibility.

Part 7 of ISO/IEC 8859, originally published in 1987 and revised in 2003, provides the Latin/Greek alphabet tailored for modern Greek text. It comprises 188 graphic characters, adding Greek letters such as α (alpha), β (beta), and Ω (omega) to the Latin base, for use in data interchange and applications involving Greek-language content.

ISO/IEC 8859-9, first issued in 1989 and revised in 1999, establishes Latin alphabet No. 5 specifically for the Turkish language. The set includes 191 characters and differs from Part 1 only in replacing the six Icelandic letters (Ð, ð, Ý, ý, Þ, þ) with the Turkish letters Ğ (G with breve), ğ, İ (dotted capital I), ı (dotless i), Ş (S with cedilla), and ş, to better accommodate Turkish orthography in information processing.

ISO/IEC 8859-15, published in 1999, serves as an update to Part 1 (Latin-1), known as Latin alphabet No. 9. It retains most of the original repertoire but replaces the generic currency sign (¤) with the euro sign (€), and replaces seven other little-used symbols (¦, ¨, ´, ¸, ¼, ½, ¾) with the letters Š, š, Ž, ž, Œ, œ, and Ÿ, improving coverage of languages such as French, Finnish, and Estonian in legacy systems.

ISO/IEC 8859-16, released in 2001, defines Latin alphabet No. 10. Aimed at South-Eastern European languages like Romanian and Albanian, it encodes 191 characters in a single set, combining letters found in several earlier Latin parts so that broader Latin-script coverage does not require multiple code pages.

ISO/IEC 8859-12 was proposed for a Latin/Devanagari alphabet but abandoned due to conflicts with the development of ISO/IEC 10646 (Unicode). Some uses of other parts have also been superseded within the family; for example, Part 3 (Latin-3) was displaced by Part 9 for Turkish. Overall, the ISO/IEC 8859 series is no longer actively maintained, as the responsible working group was disbanded in 2004, and the encodings have been largely superseded by Unicode.
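Because every part reuses the same 0xA0–0xFF byte range, one byte value can stand for a different character in each part. The sketch below illustrates this with Python's built-in codecs, which are assumed here to match the parts discussed above; the `PARTS` table and the `decode_byte` helper are hypothetical names chosen for illustration.

```python
# Part label -> Python codec name (assumed to implement that part).
PARTS = {
    "8859-1": "iso8859-1",
    "8859-2": "iso8859-2",
    "8859-5": "iso8859-5",
    "8859-7": "iso8859-7",
    "8859-9": "iso8859-9",
    "8859-15": "iso8859-15",
    "8859-16": "iso8859-16",
}


def decode_byte(byte: int) -> dict[str, str]:
    """Decode a single byte value under each part listed in PARTS.

    errors="replace" turns positions a part leaves unassigned into U+FFFD
    instead of raising an exception.
    """
    return {
        part: bytes([byte]).decode(codec, errors="replace")
        for part, codec in PARTS.items()
    }


if __name__ == "__main__":
    for byte in (0xA3, 0xA4, 0xF0):
        print(f"0x{byte:02X}", decode_byte(byte))
```

A receiver therefore has to know which part was used; the bytes alone do not carry that information.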

Compatibility and Mappings

Relation to ASCII and ISO 646

ISO/IEC 8859 is designed as a superset of the American Standard Code for Information Interchange (ASCII), defined in ANSI X3.4-1968: every part assigns the 95 printable ASCII characters to their original code values (0x20 to 0x7E) and leaves 0x00–0x1F and 0x7F reserved for control characters, while extending the code to an 8-bit structure. This design ensures backward compatibility, allowing 7-bit ASCII data to be processed within ISO/IEC 8859 environments without alteration.

The standard also builds upon ISO/IEC 646:1973, which establishes the International Reference Version (IRV) as a 7-bit coded character set for information interchange, serving as the international counterpart to ASCII. ISO/IEC 646 permitted national variants to accommodate local needs, such as the British variant (ISO 646-GB), which replaces the US ASCII hash symbol (#) at position 0x23 with the pound sterling sign (£). By the 1991 revision of ISO/IEC 646, the IRV aligned fully with ASCII, but the existence of conflicting national variants had already created the need for a unified extension.

To address these inconsistencies, ISO/IEC 8859 adds 96 graphic character positions in the upper range 0xA0 to 0xFF, with 0x80 to 0x9F reserved for control characters, standardizing characters that were ambiguous or unavailable in ISO/IEC 646 variants. For instance, code point 0xA0 is assigned to the no-break space, a graphic character absent from 7-bit sets but useful for consistent text formatting across international systems. This extension mechanism resolves national differences by providing fixed, multilingual repertoires while preserving the 7-bit base as the invariant core.

Despite this compatibility focus, early implementations mixing 7-bit ASCII or ISO/IEC 646 systems with 8-bit ISO/IEC 8859 data often produced mojibake: bytes in the 0x80–0xFF range had their high bit stripped or were misinterpreted as control codes, rendering accented characters or symbols as garbled text. Such issues arose particularly in data transmission protocols limited to 7 bits and in software that assumed ASCII-only input, leading to widespread interoperability problems before 8-bit support became prevalent.
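Both the ASCII compatibility and the high-bit stripping failure can be reproduced in a few lines. This is a minimal sketch that simulates a 7-bit channel by masking each byte with 0x7F; the function names are illustrative, and Python's built-in codecs are assumed to stand in for the ISO/IEC 8859 parts.

```python
def is_ascii_compatible(codec: str) -> bool:
    """Check that the printable ASCII range decodes identically under `codec`."""
    printable = bytes(range(0x20, 0x7F))
    return printable.decode(codec) == printable.decode("ascii")


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the eighth bit of every byte."""
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    print(all(is_ascii_compatible(c) for c in ("iso8859-1", "iso8859-2", "iso8859-7")))

    original = "café".encode("iso8859-1")   # é is the single byte 0xE9
    damaged = strip_high_bit(original)      # 0xE9 & 0x7F == 0x69, the letter "i"
    print(original, damaged.decode("ascii"))
```

The damage is silent: the result is still valid ASCII, so nothing downstream can tell that "cafi" was once "café".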

Integration with Unicode

ISO/IEC 8859 characters are integrated into Unicode through direct mappings to the Basic Multilingual Plane (BMP), ensuring compatibility for legacy systems while allowing expansion to a universal character set. Each part's graphic characters beyond the ASCII subset (at most 96) map to BMP blocks suited to its script: the Latin parts draw mainly on Latin-1 Supplement (U+0080–U+00FF) and Latin Extended-A (U+0100–U+017F), while the Cyrillic, Greek, Arabic, Hebrew, and Thai parts map into the corresponding script blocks. Because each byte corresponds to a single BMP code point, conversion is a straightforward per-byte lookup with no data loss for the defined repertoires.

For ISO/IEC 8859-1 (Latin-1), the mapping is one-to-one, with byte values 0x00–0xFF aligning directly with Unicode code points U+0000–U+00FF. The first 128 positions (0x00–0x7F) match ASCII exactly, while the upper range (0xA0–0xFF) contains 96 additional characters such as accented letters and symbols. Positions 0x80–0x9F, reserved for C1 control characters, map to the Unicode control code points U+0080–U+009F rather than to printable glyphs; these are preserved in conversions or ignored in display contexts depending on the application. No positions in this range are left undefined in the official mapping.

Similar mappings apply to other ISO/IEC 8859 parts, with characters distributed across BMP blocks according to their linguistic focus. For instance, ISO/IEC 8859-2 (Central European) uses Latin-1 Supplement for shared symbols and Latin Extended-A for characters like Ą (U+0104) and Ł (U+0141). The Unicode Consortium provides explicit mapping tables for all 15 parts, allowing lossless round-trip conversion of the graphic characters each part defines (191 in a fully populated part) as well as the control characters.

Conversion libraries such as the International Components for Unicode (ICU) apply these mappings by looking up each byte in the table for the relevant part to obtain a Unicode scalar value (for 8859-1, the byte value itself) and then encoding it in UTF-8 or another form. C1 controls in 8859-1 are typically output as two-byte UTF-8 sequences (e.g., 0x80 becomes C2 80) or suppressed if non-printable.

While Unicode fully encompasses the character repertoires of all ISO/IEC 8859 parts, it extends beyond them by specifying bidirectional text behavior, which the byte-oriented ISO/IEC 8859 framework does not define. The Unicode Bidirectional Algorithm enables proper rendering of mixed left-to-right and right-to-left text, such as the Arabic of ISO/IEC 8859-6 or the Hebrew of ISO/IEC 8859-8, for which the 8859 parts themselves carry no directional metadata. Legacy ISO/IEC 8859 data can therefore be migrated to Unicode environments and gain these processing capabilities.
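The mapping properties described in this section can be verified with a short script. The sketch below relies on Python's built-in codecs rather than ICU, on the assumption that they follow the published mapping tables; the helper names are illustrative only.

```python
def latin1_is_identity() -> bool:
    """True if every byte b decodes to the code point U+00bb under ISO 8859-1."""
    return all(bytes([b]).decode("iso8859-1") == chr(b) for b in range(256))


def code_points(data: bytes, codec: str) -> list[str]:
    """Decode `data` and return each character as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(codec)]


def round_trips(codec: str) -> bool:
    """True if all 256 byte values survive a decode/encode cycle under `codec`."""
    all_bytes = bytes(range(256))
    return all_bytes.decode(codec).encode(codec) == all_bytes


if __name__ == "__main__":
    print(latin1_is_identity())
    print(code_points(b"\xa1\xa3", "iso8859-2"))            # Ą and Ł in Part 2
    print(b"\x80".decode("iso8859-1").encode("utf-8").hex())  # C1 control as UTF-8
    print(round_trips("iso8859-2"))
```

The round-trip check only holds for codecs in which all 256 positions are assigned; parts with unassigned positions would raise a decoding error on those bytes.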

Usage and Current Status

Historical Applications

ISO/IEC 8859 saw widespread early adoption in operating systems during the 1980s and 1990s as a means to extend ASCII for multilingual support in Western European languages. In UNIX-like systems, including early Linux distributions, the ISO 8859 family was integrated into locale definitions to handle accented characters, with ISO 8859-1 (Latin-1) serving as a primary encoding for Western European text processing and display. Microsoft's MS-DOS used its own code pages, some of which, such as the multilingual code page 850, cover the Latin-1 repertoire at different code positions; Windows later assigned code page 28591 to ISO 8859-1 itself. On Apple systems, the Macintosh Roman encoding, introduced in 1984, provided an 8-bit extension comparable to ISO 8859-1, supporting diacritics and symbols for Roman-based languages in early Mac applications and file systems.

In web and email protocols, ISO 8859-1 emerged as a de facto standard during the mid-1990s internet expansion. The HTTP/1.1 specification defined ISO 8859-1 as the default character set for text media types when no explicit charset was specified, facilitating the transfer of Western European content over the web. The HTML 4.01 specification acknowledged this HTTP default but, noting that it had proved unreliable in practice, instructed user agents not to assume any default charset. In email, the MIME standard (RFC 2046, published in 1996) supported ISO 8859-1 as a charset for text, allowing reliable handling of international characters in messages beyond the 7-bit ASCII repertoire.

For printing and typography, ISO 8859-1 was integral to early desktop publishing workflows from the late 1980s onward. Adobe's PostScript language incorporated ISOLatin1Encoding, a 256-entry vector compatible with ISO 8859-1, to map character codes to glyphs in standard fonts like Helvetica and Times, enabling precise rendering of Western scripts on laser printers and typesetters. Early Portable Document Format (PDF) files, introduced in 1993, similarly relied on ISO 8859-1 for text encoding in Western documents, with Adobe Acrobat supporting it as a baseline for embedding Latin-1 characters before broader Unicode integration.

Regionally, ISO 8859-1 was used in certain European Union contexts for official documentation around the euro's introduction in 1999. EU directives, such as Council Directive 1999/37/EC on vehicle registration, required data fields to use Latin characters, and ISO 8859-1 was commonly employed to encode them so that member states' administrative systems could interoperate. Official Journal publications and related forms from the 1990s routinely employed Latin-1 to accommodate multilingual content in English, French, German, and other Roman-script languages without encoding conflicts.

Modern Relevance and Deprecation

ISO/IEC 8859 has been largely superseded by Unicode and its UTF-8 encoding since the mid-2000s, as the latter provides comprehensive support for global scripts and multilingual text without the limitations of single-byte encodings. The WHATWG Encoding Standard designates UTF-8 as the most appropriate encoding for interchanging Unicode, the universal coded character set, making legacy encodings like ISO/IEC 8859 unsuitable for new applications.

Despite this shift, ISO/IEC 8859 persists in modern web browsers for backward compatibility with older content, such as legacy web pages that declare these encodings in HTML. Browsers decode such content through the Encoding API's TextDecoder, which accepts labels such as iso-8859-1 and iso-8859-2; under the WHATWG Encoding Standard, the iso-8859-1 label is decoded as windows-1252. In embedded systems and certain legacy infrastructures, ISO/IEC 8859 continues to be supported due to its simplicity and its fit with resource-constrained environments that have not yet migrated to Unicode.

Regarding deprecation, the original ISO 8859-1:1987 edition has been withdrawn and replaced, but the current ISO/IEC 8859-1:1998 remains published and was last confirmed in 2020, indicating no formal withdrawal but limited ongoing maintenance. Similarly, other parts of the series, such as ISO/IEC 8859-9:1999 and ISO/IEC 8859-7:2003, remain active under ISO/IEC JTC 1/SC 2, though the working group responsible for the series has been disbanded, halting new developments or revisions. Equivalents to ISO/IEC 8859 persist in some national standards, preserving compatibility for regional applications.

Looking to the future, minimal new development is expected for ISO/IEC 8859, with efforts centered on migration tools and strategies to convert archival data to Unicode/UTF-8. Resources like Oracle's character set migration best practices emphasize planning and tools for transitioning databases and files from ISO 8859 to UTF-8, addressing potential data loss in legacy conversions. This focus reflects a broader industry shift toward universal encodings, reducing reliance on ISO/IEC 8859 except in isolated preservation contexts.
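A migration of this kind reduces to decoding with the declared legacy encoding and re-encoding as UTF-8, with a check for bytes that the source part leaves unassigned. The following is a minimal sketch, not a substitute for a dedicated migration tool; the function names are hypothetical, and it assumes the source encoding is already known and is a single-byte encoding available as a Python codec.

```python
def find_undecodable(data: bytes, source_codec: str) -> list[int]:
    """Return the offsets of bytes that `source_codec` leaves unassigned.

    Decoding byte by byte is valid only for single-byte encodings such as
    the ISO/IEC 8859 parts.
    """
    bad_offsets = []
    for offset, byte in enumerate(data):
        try:
            bytes([byte]).decode(source_codec)
        except UnicodeDecodeError:
            bad_offsets.append(offset)
    return bad_offsets


def transcode_to_utf8(data: bytes, source_codec: str) -> bytes:
    """Convert legacy bytes to UTF-8, raising UnicodeDecodeError on unassigned bytes."""
    return data.decode(source_codec, errors="strict").encode("utf-8")


if __name__ == "__main__":
    legacy = "café".encode("iso8859-1")
    print(transcode_to_utf8(legacy, "iso8859-1"))

    # 0xA1 is unassigned in ISO 8859-6, so it is reported instead of converted.
    print(find_undecodable(b"a\xa1b", "iso8859-6"))
```

Strict decoding is chosen deliberately: failing loudly on an unassigned byte usually means the data was labeled with the wrong part, which is safer to detect before conversion than after.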
