Base32
Base32 is a binary-to-text encoding based on the base-32 numeral system. It uses an alphabet of 32 digits, each of which represents a different combination of 5 bits (2⁵). Since base32 is not very widely adopted, the question of notation, i.e. which characters to use to represent the 32 digits, is not as settled as in the case of more well-known numeral systems (such as hexadecimal), even though RFCs and unofficial and de facto standards exist. One way to represent Base32 numbers in human-readable form is using digits 0–9 followed by the twenty-two upper-case letters A–V. However, many other variations are used in different contexts. Historically, Baudot code could be considered a modified (stateful) base32 code. Base32 is often used to represent byte strings.
RFC 4648 encodings
The October 2006 proposed Internet standard[1] RFC 4648 documents base16, base32 and base64 encodings. It includes two schemes for base32, but recommends one over the other. It further recommends that regardless of precedent, only the alphabet it defines in its section 6 actually be called base32, and that the other similar alphabet in its section 7 instead be called base32hex.[a] Agreement with those recommendations is not universal. Care needs to be taken when using systems that are called base32, as those systems could be base32 per RFC 4648 §6, or per §7 (possibly disregarding that RFC's deprecation of the simpler name for the latter), or they could be yet another encoding variant (see further below).
Base 32 Encoding per §6
The most widely used[citation needed] base32 alphabet is defined in RFC 4648 §6 and the earlier RFC 3548 (2003). The scheme was originally designed in 2000 by John Myers for SASL/GSSAPI.[2] It uses an alphabet of A–Z, followed by 2–7. The digits 0, 1 and 8 are skipped due to their similarity with the letters O, I and B (thus "2" has a decimal value of 26).
In some circumstances padding is not required or used (the padding can be inferred from the length of the string modulo 8). RFC 4648 states that padding must be used unless the specification of the standard (referring to the RFC) explicitly states otherwise. Excluding padding is useful when using Base32 encoded data in URL tokens or file names where the padding character could pose a problem.
| Value | Symbol | Value | Symbol | Value | Symbol | Value | Symbol |
|---|---|---|---|---|---|---|---|
| 0 | A | 8 | I | 16 | Q | 24 | Y |
| 1 | B | 9 | J | 17 | R | 25 | Z |
| 2 | C | 10 | K | 18 | S | 26 | 2 |
| 3 | D | 11 | L | 19 | T | 27 | 3 |
| 4 | E | 12 | M | 20 | U | 28 | 4 |
| 5 | F | 13 | N | 21 | V | 29 | 5 |
| 6 | G | 14 | O | 22 | W | 30 | 6 |
| 7 | H | 15 | P | 23 | X | 31 | 7 |
| padding | = | | | | | | |
This is an example of a Base32 representation using the previously described 32-character set (IPFS CIDv1 in Base32 upper-case encoding): BAFYBEICZSSCDSBS7FFQZ55ASQDF3SMV6KLCW3GOFSZVWLYARCI47BGF354
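The padding behaviour described above can be tried out with Python's standard `base64` module, which implements the RFC 4648 §6 alphabet. The following is a minimal sketch: the helper names `b32_encode` and `b32_decode` are hypothetical wrappers introduced here for illustration, and the unpadded form assumes the receiving side is willing to infer the padding from the string length modulo 8.

```python
import base64


def b32_encode(data: bytes, pad: bool = True) -> str:
    """Encode with the RFC 4648 section 6 alphabet; optionally drop '=' padding."""
    text = base64.b32encode(data).decode("ascii")
    return text if pad else text.rstrip("=")


def b32_decode(text: str) -> bytes:
    """Decode section 6 text, re-adding padding inferred from len(text) % 8."""
    padded = text + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


print(b32_encode(b"foobar"))             # MZXW6YTBOI======
print(b32_encode(b"foobar", pad=False))  # MZXW6YTBOI
print(b32_decode("MZXW6YTBOI"))          # b'foobar'
```

Dropping the padding keeps the string free of `=`, which is convenient in URL tokens and file names, at the cost of departing from the RFC's default.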
Base 32 Encoding with Extended Hex Alphabet per §7
"Extended hex" base 32 or base32hex,[3] another scheme for base 32 per RFC 4648 §7, extends hexadecimal in a more natural way: Its lower half is identical with hexadecimal, and beyond that, base32hex simply continues the alphabet through to the letter V.
This scheme was first proposed by Christian Lanctot, a programmer working at Sage software, in a letter to Dr. Dobb's magazine in March 1999[4] as part of a suggested solution for the Y2K bug. Lanctot referred to it as "Double Hex". The same alphabet was described in 2000 in RFC 2938 under the name "Base-32". RFC 4648, while acknowledging existing use of this version in NSEC3, refers to it as base32hex and discourages referring to it as only "base32".
Since this notation uses digits 0–9 followed by consecutive letters of the alphabet, it matches the digits used by the JavaScript parseInt() function[5] and the Python int() constructor[6] when a base larger than 10 (such as 16 or 32) is specified. It also retains hexadecimal's property of preserving bitwise sort order of the represented data, unlike RFC 4648's §6 base32, or base64.[3]
Unlike many other base 32 notation systems, base32hex digits beyond 9 are contiguous. However, its set of digits includes characters that may visually conflict. With many fonts it is possible to visually distinguish between similar looking characters like (0, O) and (1, I), but in other fonts this may be difficult and thus they may be unsuitable for rendering base32 character sequences. This is especially true in a notation system that is only expressing numbers, when the context English usually provides is not present.[b] The choice of font is controlled by neither notation nor encoding, yet base32hex makes no attempt to compensate for the shortcomings of affected fonts.[c]
| Value | Symbol | Value | Symbol | Value | Symbol | Value | Symbol |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 8 | 16 | G | 24 | O |
| 1 | 1 | 9 | 9 | 17 | H | 25 | P |
| 2 | 2 | 10 | A | 18 | I | 26 | Q |
| 3 | 3 | 11 | B | 19 | J | 27 | R |
| 4 | 4 | 12 | C | 20 | K | 28 | S |
| 5 | 5 | 13 | D | 21 | L | 29 | T |
| 6 | 6 | 14 | E | 22 | M | 30 | U |
| 7 | 7 | 15 | F | 23 | N | 31 | V |
| padding | = | | | | | | |
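The correspondence between base32hex digits and Python's `int()` constructor can be checked directly. The sketch below is illustrative only: `b32hex_encode` is a hypothetical helper that re-maps the standard library's §6 output onto the §7 alphabet (rather than relying on a dedicated library function), and the integer comparison assumes an input of exactly 5 bytes so that no padding is produced.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"      # RFC 4648 section 6
EXTENDED_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 section 7
TO_HEX = str.maketrans(STANDARD, EXTENDED_HEX)


def b32hex_encode(data: bytes) -> str:
    """Encode as base32hex by re-mapping the section 6 output symbol by symbol."""
    return base64.b32encode(data).decode("ascii").translate(TO_HEX)


print(b32hex_encode(b"foobar"))  # CPNMUOJ1E8======

# With exactly 5 input bytes there is no padding, so the 8 digits can be
# read directly as a base-32 integer by int().
data = bytes([0x12, 0x34, 0x56, 0x78, 0x9A])
digits = b32hex_encode(data)
print(int(digits, 32) == int.from_bytes(data, "big"))  # True
print(int("V", 32), int("10", 32))                     # 31 32
```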
Alternative encoding schemes
The alternative schemes below change the Base32 alphabet, but all of them use similar combinations of alphanumeric symbols.
z-base-32
z-base-32[7] is a Base32 encoding designed by Zooko Wilcox-O'Hearn to be easier for human use and more compact. It includes 1, 8 and 9 but excludes l, v, 0 and 2. It also permutes the alphabet so that the easier characters are the ones that occur more frequently.[clarification needed] It compactly encodes bitstrings whose length in bits is not a multiple of 8[clarification needed] and omits trailing padding characters. z-base-32 was used in the Mnet open source project, and is currently used in Phil Zimmermann's ZRTP protocol, and in the Tahoe-LAFS open source project.
| Value | Symbol | Value | Symbol | Value | Symbol | Value | Symbol |
|---|---|---|---|---|---|---|---|
| 0 | y | 8 | e | 16 | o | 24 | a |
| 1 | b | 9 | j | 17 | t | 25 | 3 |
| 2 | n | 10 | k | 18 | 1 | 26 | 4 |
| 3 | d | 11 | m | 19 | u | 27 | 5 |
| 4 | r | 12 | c | 20 | w | 28 | h |
| 5 | f | 13 | p | 21 | i | 29 | 7 |
| 6 | g | 14 | q | 22 | s | 30 | 6 |
| 7 | 8 | 15 | x | 23 | z | 31 | 9 |
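As a rough illustration of the table above, the sketch below encodes and decodes whole octets with the z-base-32 alphabet and no padding characters. The function names are hypothetical, and the code deliberately ignores the scheme's support for bit strings whose length is not a multiple of 8; it only shows the alphabet mapping and the absence of `=`.

```python
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"  # values 0..31, per the table


def zbase32_encode(data: bytes) -> str:
    """Encode whole octets; zero bits are appended to fill the last symbol."""
    nbits = len(data) * 8
    fill = -nbits % 5
    value = int.from_bytes(data, "big") << fill
    nchars = (nbits + fill) // 5
    return "".join(
        ZBASE32[(value >> (5 * (nchars - 1 - i))) & 31] for i in range(nchars)
    )


def zbase32_decode(text: str) -> bytes:
    """Decode to whole octets, discarding the trailing fill bits."""
    value = 0
    for ch in text:
        value = (value << 5) | ZBASE32.index(ch)
    nbytes = len(text) * 5 // 8
    value >>= len(text) * 5 - nbytes * 8
    return value.to_bytes(nbytes, "big")


print(zbase32_encode(b"hello, world\n"))  # pb1sa5dxfoo8q551pt1yw
```

Thirteen octets (104 bits) need 21 symbols here, whereas a padded RFC 4648 encoding of the same input would occupy 24 characters.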
Crockford's Base32
Another alternative design for Base32 was created by Douglas Crockford, who proposes using additional characters for a mod-37 checksum.[8] It excludes the letters I, L, and O to avoid confusion with digits. It also excludes the letter U to reduce the likelihood of accidental obscenity.
Libraries to encode binary data in Crockford's Base32 are available in a variety of languages.
| Value | Encode Digit | Decode Digit | Value | Encode Digit | Decode Digit |
|---|---|---|---|---|---|
| 0 | 0 | 0 o O | 16 | G | g G |
| 1 | 1 | 1 i I l L | 17 | H | h H |
| 2 | 2 | 2 | 18 | J | j J |
| 3 | 3 | 3 | 19 | K | k K |
| 4 | 4 | 4 | 20 | M | m M |
| 5 | 5 | 5 | 21 | N | n N |
| 6 | 6 | 6 | 22 | P | p P |
| 7 | 7 | 7 | 23 | Q | q Q |
| 8 | 8 | 8 | 24 | R | r R |
| 9 | 9 | 9 | 25 | S | s S |
| 10 | A | a A | 26 | T | t T |
| 11 | B | b B | 27 | V | v V |
| 12 | C | c C | 28 | W | w W |
| 13 | D | d D | 29 | X | x X |
| 14 | E | e E | 30 | Y | y Y |
| 15 | F | f F | 31 | Z | z Z |
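The encode and decode columns above, together with the mod-37 check symbol, can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, byte strings are assumed to be zero-filled on the right to a multiple of 5 bits, the check symbol is assumed to be computed over the bytes read as a big-endian integer, and the five extra check symbols (`*`, `~`, `$`, `=`, `U`) follow the description given later in this document.

```python
ENCODE = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # encode digits for values 0..31
CHECK = ENCODE + "*~$=U"                     # 37 symbols for the check value

# Decode map: canonical digits plus the aliases listed in the table.
DECODE = {ch: i for i, ch in enumerate(ENCODE)}
DECODE.update({"O": 0, "I": 1, "L": 1})


def crockford_encode(data: bytes, checksum: bool = False) -> str:
    """Encode bytes; assumption: zero bits are appended on the right."""
    nbits = len(data) * 8
    number = int.from_bytes(data, "big")
    fill = -nbits % 5
    padded = number << fill
    nchars = (nbits + fill) // 5
    out = "".join(
        ENCODE[(padded >> (5 * (nchars - 1 - i))) & 31] for i in range(nchars)
    )
    if checksum:
        out += CHECK[number % 37]  # assumption: check over the unpadded integer
    return out


def crockford_values(text: str) -> list:
    """Map symbols to 5-bit values, ignoring hyphens and letter case."""
    cleaned = text.replace("-", "").upper()
    return [DECODE[ch] for ch in cleaned]


print(crockford_encode(b"base"))                 # C9GQ6S8
print(crockford_encode(b"base", checksum=True))  # C9GQ6S8J
print(crockford_values("c9gq-6s8"))              # [12, 9, 16, 23, 6, 25, 8]
```

Accepting `o`, `i` and `l` on input while never emitting them is what lets a decoder absorb the most common transcription slips.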
Electrologica
An earlier form of base 32 notation was used by programmers working on the Electrologica X1 to represent machine addresses. The "digits" were represented as decimal numbers from 0 to 31. For example, 12-16 would represent the machine address 400 (= 12 × 32 + 16).
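The arithmetic of this notation is ordinary positional base 32 with decimal-written digits. A small sketch, using hypothetical helper names and assuming hyphen-separated digits as in the example:

```python
def x1_to_address(notation: str) -> int:
    """Convert hyphen-separated base-32 digits (written in decimal) to an integer."""
    value = 0
    for part in notation.split("-"):
        digit = int(part)
        if not 0 <= digit < 32:
            raise ValueError(f"digit out of range: {part}")
        value = value * 32 + digit
    return value


def address_to_x1(address: int) -> str:
    """Inverse conversion for non-negative addresses."""
    digits = []
    while True:
        address, digit = divmod(address, 32)
        digits.append(str(digit))
        if address == 0:
            break
    return "-".join(reversed(digits))


print(x1_to_address("12-16"))  # 400
print(address_to_x1(400))      # 12-16
```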
Geohash
In the Geohash algorithm, a modified base32 representation is used to represent latitude and longitude values in one (bit-interlaced) positive integer.[9] This representation uses all decimal digits (0–9) and almost all of the lower case alphabet, except letters "a", "i", "l", "o", as shown by the following character map:
| Decimal | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base 32 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | b | c | d | e | f | g |
| Decimal | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| Base 32 | h | j | k | m | n | p | q | r | s | t | u | v | w | x | y | z |
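The bit interlacing can be sketched as a repeated halving of the longitude and latitude ranges, emitting one symbol from the character map for every 5 bits. The function below is a minimal illustration, not the reference algorithm: the name `geohash_encode` and the default length are choices made here, and the sample coordinates are an assumed test point in northern Denmark.

```python
GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz"  # values 0..31, per the map


def geohash_encode(lat: float, lon: float, length: int = 6) -> str:
    """Interleave longitude and latitude bits and map each 5 bits to a symbol."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value = 0
    for i in range(length * 5):
        # Even-numbered bits refine longitude, odd-numbered bits latitude.
        rng, coord = (lon_range, lon) if i % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        if i % 5 == 4:    # five bits collected: emit one symbol
            chars.append(GEOHASH_ALPHABET[value])
            value = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744))  # u4pruy
```

Because every symbol appends 5 more bits to the same interleaved integer, truncating a hash yields a larger cell that contains the original point.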
Turing's encoding
In approximately 1950,[10] Alan Turing wrote software requirements for the Manchester Mark I computing system.[11] A transcription of Turing's manual for the Mark I is available on archive.org.[12]
The University of Manchester's archive site commemorating 60 years of computing [13] has a table of the base 32 encoding that Turing used. The table and the accompanying explanation also appear in the manual.
Another account of this period in Turing's life appears on his biography page under Early computers and the Turing test.
Video games
Before NVRAM became universal, several video games for Nintendo platforms used base 31 numbers for passwords. These systems omit vowels (except Y) to prevent the game from accidentally giving a profane password. Thus, the characters are generally some minor variation of the following set: 0–9, B, C, D, F, G, H, J, K, L, M, N, P, Q, R, S, T, V, W, X, Y, Z, and some punctuation marks. Games known to use such a system include Mario Is Missing!, Mario's Time Machine, Tetris Blast, and The Lord of the Rings (Super NES).
Word-safe alphabet
The word-safe Base32 alphabet is an extension of the Open Location Code Base20 alphabet. That alphabet uses 8 numeric digits and 12 case-sensitive letter digits chosen to avoid accidentally forming words. Treating the alphabet as case-sensitive produces a 32 (8+12+12) digit set.
| Decimal | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base 32 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | C | F | G | H | J | M | P | Q |
| Decimal | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| Base 32 | R | V | W | X | c | f | g | h | j | m | p | q | r | v | w | x |
Comparisons with other systems
Advantages
Base32 has a number of advantages over Base64:
- The resulting character set is all one case, which can often be beneficial when using a case-insensitive filesystem, DNS names, spoken language, or human memory.
- The result can be used as a file name because it cannot possibly contain the '/' symbol, which is the Unix path separator.
- The alphabet can be selected to avoid similar-looking pairs of different symbols, so the strings can be accurately transcribed by hand. (For example, the RFC 4648 §6 symbol set omits the digits for one, eight and zero, since they could be confused with the letters 'I', 'B', and 'O'.)
- A result excluding padding can be included in a URL without encoding any characters.
Base32 has advantages over hexadecimal/Base16:
- Base32 representation takes 20% less space. (1000 bits takes 200 characters, compared with 250 for Base16.)
Compared with 8-bit-based encodings, 5-bit systems might also have advantages when used for character transmission:
- Featuring the complete alphabet, the RFC 4648 §6 Base32 scheme and similar allow encoding two more characters per 32-bit integer (for a total of 6 instead of 4, with 2 bits to spare), saving bandwidth in constrained domains such as radiomeshes.
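The last point can be made concrete: six 5-bit symbols occupy 30 bits, so they fit in one 32-bit word with 2 bits left over. The sketch below uses the RFC 4648 §6 alphabet; `pack6` and `unpack6` are hypothetical names, and how the two spare bits are used is left open.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 section 6


def pack6(text: str) -> int:
    """Pack exactly six alphabet characters into the low 30 bits of an integer."""
    if len(text) != 6:
        raise ValueError("expected exactly 6 characters")
    word = 0
    for ch in text:
        word = (word << 5) | ALPHABET.index(ch)
    return word


def unpack6(word: int) -> str:
    """Recover the six characters from a packed word."""
    return "".join(ALPHABET[(word >> shift) & 31] for shift in range(25, -1, -5))


word = pack6("HELLO2")
print(word < 2**30)   # True: two bits of a 32-bit word remain unused
print(unpack6(word))  # HELLO2
```

Storing the same six characters as 8-bit codes would take 48 bits, so only four of them would fit in the same word.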
Disadvantages
Base32 representation takes roughly 20% more space than Base64. Also, because it encodes five 8-bit bytes (40 bits) to eight 5-bit base32 characters rather than three 8-bit bytes (24 bits) to four 6-bit base64 characters, padding to an 8-character boundary is a greater burden on short messages (which may be a reason to elide padding, which is an option in RFC 4648).
| | Base64 | Base32 | Hexadecimal |
|---|---|---|---|
| 8-bit | 133% | 160% | 200% |
| 7-bit | 117% | 140% | 175% |
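The percentages in the table follow from dividing the width of each transmitted character by the number of payload bits it carries. The sketch below reproduces them under that reading, which is an interpretation of the row labels rather than something the table states; padding is ignored and the names are illustrative.

```python
# Payload bits carried by one character of each notation.
PAYLOAD_BITS = {"Base64": 6, "Base32": 5, "Hexadecimal": 4}


def relative_size(char_width: int, payload_bits: int) -> int:
    """Encoded size as a rounded percentage of the raw data size."""
    return round(100 * char_width / payload_bits)


for char_width in (8, 7):
    row = {name: relative_size(char_width, bits) for name, bits in PAYLOAD_BITS.items()}
    print(f"{char_width}-bit", row)
```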
Even though Base32 takes roughly 20% less space than hexadecimal, Base32 is much less used. Hexadecimal can easily be mapped to bytes because two hexadecimal digits make up exactly one byte. Base32 does not map to individual bytes. However, two Base32 digits correspond to ten bits, which can encode (32 × 32 =) 1,024 values, with obvious applications for orders of magnitude of multiple-byte units in terms of powers of 1,024.
Hexadecimal is easier to learn and remember, since that only entails memorising the numerical values of six additional symbols (A–F), and even if those are not instantly recalled, it is easier to count through just over a handful of values.
Software implementations
Base32 programs are suitable for encoding arbitrary byte data using a restricted set of symbols that can both be conveniently used by humans and processed by computers.
Base32 implementations use a symbol set made up of at least 32 different characters (sometimes a 33rd for padding), as well as an algorithm for encoding arbitrary sequences of 8-bit bytes into a Base32 alphabet. Because more than one 5-bit Base32 character is needed to represent each 8-bit input byte, if the input is not a multiple of 5 bytes (40 bits), then it doesn't fit exactly in 5-bit Base32 characters. In that case, some specifications require padding characters to be added while some require extra zero bits to make a multiple of 5 bits. The closely related Base64 system, in contrast, uses a set of 64 symbols (or 65 symbols when padding is used).
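To show what such an implementation does internally, here is a small from-scratch encoder for the padded RFC 4648 §6 form. It is a sketch for illustration, with a hypothetical function name; in practice a library routine such as Python's `base64.b32encode` would normally be used instead.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 section 6


def b32_encode_manual(data: bytes) -> str:
    """Encode 5-byte groups into 8 symbols, zero-filling and padding the last group."""
    out = []
    for start in range(0, len(data), 5):
        chunk = data[start:start + 5]
        nbits = len(chunk) * 8
        nchars = -(-nbits // 5)  # ceiling division: symbols that carry data
        # Append zero bits on the right so the bit count is a multiple of 5.
        value = int.from_bytes(chunk, "big") << (nchars * 5 - nbits)
        for i in range(nchars):
            out.append(ALPHABET[(value >> (5 * (nchars - 1 - i))) & 31])
        out.append("=" * (8 - nchars))  # empty string for a full group
    return "".join(out)


print(b32_encode_manual(b"f"))       # MY======
print(b32_encode_manual(b"foobar"))  # MZXW6YTBOI======
```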
Base32 implementations in C/C++,[14][15] Perl,[16] Java,[17] JavaScript,[18] Python,[19] Go[20] and Ruby[21] are available.[22]
See also
- .onion – Special-use top-level internet domain
- Ascii85 – Encoding for a sequence of byte values using 85 printable characters
- Base16 – Encoding for a sequence of byte values using hexadecimal representation
- Base64 – Encoding for a sequence of byte values using 64 printable characters
- Base36 – Encoding for a sequence of byte values using 36 printable characters
- Base58 – Representation of binary data as text
- Geohash – Public domain geocoding invented in 2008
Notes
- ^ For context, the proposed standard also documents two base64 encodings, and here too expresses a preference for one, though for different reasons. Only one base16 encoding is documented – long universally adopted even prior to the publication of RFC 4648 or its predecessor RFC 3548.
- ^ The similarity used to be a feature, not a bug, because it allowed early typewriters to omit extra keys for the numbers 0 and 1, thus reducing mechanical complexity. When computers were introduced, it was felt desirable for early computer printers to be able to produce the same type as quality typewriters, hence typewriter-like fonts kept these letters looking alike. As of this writing in 2025, it is no longer necessary to use fonts that don't clearly distinguish some letters, but the tradition persists. It is also not just typewriter-style fonts that have similar problems – many influential fonts do, e.g. Helvetica.
- ^ The design of many base32 variants is driven by the view that it is risky to assume a distinguishable font will be used. On the other hand, the logic of a scheme not trying to compensate for quirks outside its remit may be more straightforward.
References
- ^ "Official Internet Protocol Standards » RFC Editor".
- ^ Myers, J. (May 23, 2000). SASL GSSAPI mechanisms. I-D draft-ietf-cat-sasl-gssapi-01. Retrieved 2023-06-24.
- ^ a b Josefsson, Simon (2006). "7. Base 32 Encoding with Extended Hex Alphabet". RFC 4648: The Base16, Base32, and Base64 Data Encodings. IETF. doi:10.17487/RFC4648.
- ^ Lanctot, Christian (1999-03-01). "A Better Date? (second letter under that heading) - Letters". Dr Dobb's.
- ^ "parseInt() - JavaScript". MDN Web Docs. Mozilla. 29 December 2023.
- ^ "Built-in Functions". Python documentation. Python Software Foundation. Archived from the original on 2018-10-26. Retrieved 2017-08-09.
- ^ O'Whielacronx, Zooko (2009). "Human-oriented base-32 encoding".
- ^ Douglas Crockford. "Base 32". Archived from the original on 2002-12-23.
- ^ "Tips & Tricks - geohash.org". geohash.org. Archived from the original on 2020-04-28. Retrieved 2020-04-03.
- ^ "Alan M. Turing (1912 - 1954)". Computer 50. The University of Manchester. Retrieved 17 April 2025.
- ^ "Alan M. Turing (1912 - 1954)". Digital 60. The University of Manchester. Retrieved 17 April 2025.
- ^ Alan M. Turing, transcribed by Robert S. Thau (13 February 2000). "Alan Turing's Manual for the Ferranti Mk. I" (PDF). Computer 50. The University of Manchester. Archived from the original (PDF) on 7 June 2011. Retrieved 17 April 2025.
- ^ "Programming on the Ferranti Mark 1". Digital 60. The University of Manchester. Retrieved 17 April 2025.
- ^ "CyoEncode". SourceForge. 24 June 2023.
- ^ "Gnulib - GNU Portability Library - GNU Project - Free Software Foundation". www.gnu.org.
- ^ "MIME-Base32 - Base32 encoder and decoder". MetaCPAN. Retrieved 2018-07-29.
- ^ "Base32 (Apache Commons Codec 1.15 API)". commons.apache.org.
- ^ "base32". npm. 27 September 2022.
- ^ "base64 — Base16, Base32, Base64, Base85 Data Encodings". Python documentation.
- ^ "Base32 package - encoding/Base32 - PKG.go.dev".
- ^ "base32 | RubyGems.org | your community gem host". rubygems.org.
- ^ "String To Hex Converter". Beautify Code.
Base32
Fundamentals
Definition and Purpose
Base32 is a binary-to-text encoding scheme that converts arbitrary binary data into an ASCII-compatible string representation using a fixed alphabet of 32 characters, with each character encoding 5 bits of data.[1] This method groups input octets into 40-bit blocks (5 octets), which are then divided into eight 5-bit values, each mapped to a character from the alphabet, resulting in an encoded output that is approximately 60% larger than the original binary due to the reduced information density per character compared to 8-bit octets.[1] The scheme includes padding with the "=" character to ensure proper alignment when the input length is not a multiple of 5 octets, maintaining decodability without ambiguity.[1]
The primary purposes of Base32 are to enable the safe transmission and storage of binary data across text-only protocols and systems that restrict or alter non-ASCII characters, such as email (via MIME), URLs, and other ASCII-limited channels.[1] It avoids the use of control characters or ambiguous symbols that could be misinterpreted or stripped during transit, while providing a case-insensitive encoding suitable for environments where uppercase and lowercase distinctions are unreliable.[1] Although not explicitly optimized for human readability, the choice of alphanumeric characters facilitates occasional manual inspection or transcription in technical contexts.[1]
Base32's development emerged in the early 2000s as part of IETF efforts to standardize encodings for internet protocols, with its first formal description appearing in RFC 2938 (2000) for representing composite media features in a compact, case-insensitive format.[3] It was subsequently refined and broadly specified in RFC 3548 (2003), which established common alphabets and rules for Base16, Base32, and Base64, and later updated in RFC 4648 (2006) to address ambiguities and improve interoperability, obsoleting the prior version.[4][1] This evolution reflects the need for reliable binary-to-text mappings in growing internet applications, building on earlier encodings like Base64 but prioritizing case insensitivity and simplicity in certain use cases.[1]
Alphabet and Encoding Mechanics
The Base32 encoding scheme utilizes a fixed alphabet consisting of 32 symbols to represent values from 0 to 31, enabling the efficient mapping of binary data into a textual format suitable for transmission over text-based protocols. The standard alphabet, as defined in RFC 4648, comprises the uppercase letters A through Z (values 0 to 25) followed by the digits 2 through 7 (values 26 to 31), resulting in the sequence: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7.[5] This selection includes letters I and O, prioritizing a full 26-letter set for compatibility with existing systems, while the digits 0 and 1 are omitted to reduce visual ambiguity with letters, and 8 and 9 are excluded to maintain the 32-symbol limit.[5]
| Value | Symbol | Value | Symbol | Value | Symbol | Value | Symbol |
|---|---|---|---|---|---|---|---|
| 0 | A | 8 | I | 16 | Q | 24 | Y |
| 1 | B | 9 | J | 17 | R | 25 | Z |
| 2 | C | 10 | K | 18 | S | 26 | 2 |
| 3 | D | 11 | L | 19 | T | 27 | 3 |
| 4 | E | 12 | M | 20 | U | 28 | 4 |
| 5 | F | 13 | N | 21 | V | 29 | 5 |
| 6 | G | 14 | O | 22 | W | 30 | 6 |
| 7 | H | 15 | P | 23 | X | 31 | 7 |
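Decoding reverses the table lookup: each symbol is turned back into its 5-bit value, the values are concatenated, and the zero fill bits at the end are dropped. The sketch below illustrates this for the standard alphabet; the function name is hypothetical, and the code performs no validation of padding length or of non-zero fill bits, which a strict decoder would be expected to check.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
VALUES = {symbol: value for value, symbol in enumerate(ALPHABET)}


def b32_decode_manual(text: str) -> bytes:
    """Decode standard-alphabet Base32 text, with or without '=' padding."""
    symbols = text.rstrip("=")
    value = 0
    for symbol in symbols:
        value = (value << 5) | VALUES[symbol]  # KeyError on foreign characters
    nbits = len(symbols) * 5
    nbytes = nbits // 8
    value >>= nbits - nbytes * 8               # discard trailing fill bits
    return value.to_bytes(nbytes, "big")


print(b32_decode_manual("MY======"))          # b'f'
print(b32_decode_manual("MZXW6YTBOI======"))  # b'foobar'
```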
Standard Encodings
RFC 4648 Base32 (§6)
The RFC 4648 Base32 encoding specifies a method for representing arbitrary sequences of octets as a textual string using a 32-character subset of US-ASCII, designed primarily for applications requiring a URL-safe and human-readable format without ambiguous characters.[5] The alphabet consists of the uppercase letters A through Z followed by the digits 2 through 7, resulting in the ordered set: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 2 3 4 5 6 7.[5] Each character encodes 5 bits of data, with the most significant bit first, and the output is produced in uppercase letters without line wrapping unless explicitly required by the application context.[5]
The encoding process groups input octets into blocks of 5 (40 bits), which are then divided into 8 groups of 5 bits each; each 5-bit value serves as an index into the alphabet to select the corresponding character.[5] For input lengths not divisible by 5 octets, padding is applied by appending the pad character '=' to ensure the output length is a multiple of 8 characters: specifically, 1 octet yields 2 characters followed by 6 '='; 2 octets yield 4 characters followed by 4 '='; 3 octets yield 5 characters followed by 3 '='; and 4 octets yield 7 characters followed by 1 '='.[5] This padding aligns with 40-bit processing blocks and facilitates unambiguous decoding.[5]
A representative example is the encoding of the single ASCII character "f" (hexadecimal 0x66, binary 01100110). The 8 input bits form an incomplete 40-bit group, so two zero bits are appended on the right to reach a whole number of 5-bit groups: 01100 (index 12 → M) and 11000 (index 24 → Y). The six remaining positions of the 8-character block carry no data and are filled with '=' rather than with 'A', so the result is "MY======".[5] This process demonstrates the bit-shifting mechanics: the first 5 bits (01100) map directly, and the second group combines the remaining 3 bits of the octet (110) with the two appended zeros.
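The padding counts listed above follow from rounding the bit count of the final partial group up to a whole number of 5-bit symbols. The sketch below computes them and prints the standard library's output alongside for comparison; `final_group_shape` is a hypothetical helper name.

```python
import base64


def final_group_shape(tail_octets: int) -> tuple:
    """Return (data_chars, pad_chars) for a final group of 1 to 4 octets."""
    if not 1 <= tail_octets <= 4:
        raise ValueError("a partial group holds 1 to 4 octets")
    data_chars = -(-tail_octets * 8 // 5)  # ceiling of (bits / 5)
    return data_chars, 8 - data_chars


for n in range(1, 5):
    encoded = base64.b32encode(b"\x00" * n).decode("ascii")
    print(n, final_group_shape(n), encoded)
```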
This encoding is compliant with MIME content-transfer-encoding requirements and is inherently safe for inclusion in URLs and filenames, as it avoids characters with special meanings in those contexts and produces no ambiguous symbols that could be misread (e.g., no lowercase, digits 0/1, or punctuation beyond '=').[5] In MIME usage, non-alphabet characters are ignored during decoding, and padding may be omitted if the input length is known in advance; for URLs, the '=' pad is often percent-encoded as %3D to prevent parsing issues.[5] Relative to the earlier RFC 3548, the Base32 specification in RFC 4648 includes minor clarifications on padding handling and output formatting, along with added test vectors and corrections to illustrative examples for improved interoperability.[7]
RFC 4648 Base32hex (§7)
The Base32hex encoding, defined in Section 7 of RFC 4648, is an extended hexadecimal variant of the Base32 encoding scheme designed to represent binary data using a 32-character alphabet that prioritizes compatibility with hexadecimal notation while preserving bit-wise sort order.[1] This variant maps input octets to groups of 5 bits, producing an output stream of 8 characters per 40 input bits (5 octets), similar to the standard Base32 encoding in Section 6, but with a distinct alphabet that begins with the digits 0-9 followed by the uppercase letters A-V to facilitate direct representation of hexadecimal values.[1]
The encoding process involves concatenating input bits into 40-bit blocks, dividing each block into eight 5-bit segments, and translating each segment to the corresponding character from the alphabet, with zero bits appended to incomplete blocks to form full quanta.[1] Output is always in uppercase letters, and padding with the "=" character is required to ensure the encoded length is a multiple of 8 characters, unless explicitly omitted in a specific application.[1] The alphabet for Base32hex consists of the following 32 characters, assigned to values 0 through 31:
| Value | Character | Value | Character | Value | Character | Value | Character |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 8 | 16 | G | 24 | O |
| 1 | 1 | 9 | 9 | 17 | H | 25 | P |
| 2 | 2 | 10 | A | 18 | I | 26 | Q |
| 3 | 3 | 11 | B | 19 | J | 27 | R |
| 4 | 4 | 12 | C | 20 | K | 28 | S |
| 5 | 5 | 13 | D | 21 | L | 29 | T |
| 6 | 6 | 14 | E | 22 | M | 30 | U |
| 7 | 7 | 15 | F | 23 | N | 31 | V |
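The sort-order property can be seen with two 5-octet inputs whose first 5-bit groups are 0 and 26. In base32hex the symbols for those values ('0' and 'Q') compare the same way as the raw bytes, whereas in the Section 6 alphabet value 26 is written '2', which sorts before 'A' in ASCII. The sketch below is illustrative: the helper names are hypothetical and the base32hex output is obtained by re-mapping the standard library's Section 6 output.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"      # RFC 4648 section 6
EXTENDED_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 section 7
TO_HEX = str.maketrans(STANDARD, EXTENDED_HEX)


def b32(data: bytes) -> str:
    return base64.b32encode(data).decode("ascii")


def b32hex(data: bytes) -> str:
    return b32(data).translate(TO_HEX)


low = bytes([0x00, 0, 0, 0, 0])   # first 5-bit group is 0
high = bytes([0xD0, 0, 0, 0, 0])  # first 5-bit group is 26

print(low < high)                  # True: raw byte order
print(b32hex(low) < b32hex(high))  # True: "00000000" < "Q0000000"
print(b32(low) < b32(high))        # False: "AAAAAAAA" > "2AAAAAAA" in ASCII
```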
Variant Encodings
z-base-32
z-base-32 is a variant of Base32 encoding designed for improved human usability and compactness, particularly in contexts like URIs and file identifiers. Developed by Zooko Wilcox-O'Hearn in November 2002, it prioritizes readability and error resistance by selecting and ordering an alphabet that minimizes visual confusion during transcription.[8] The alphabet consists of the 32 characters: ybndrfg8ejkmcpqxot1uwisza345h769. This set excludes potentially confusable symbols such as 0 (zero), l (lowercase L), v, and 2 to reduce transcription errors, while including digits 1, 3, 4, 5, 6, 7, 8, 9 and a permuted selection of lowercase letters. The permutation ensures that more distinguishable and frequently used characters appear more often in typical encodings, enhancing ergonomic handling. Encoding follows the standard Base32 process of grouping input bits into 5-bit segments, mapping each to an alphabet symbol, but omits padding characters like '=' for conciseness, allowing variable-length inputs without fixed octet alignment.[8][9] A key feature is full case-insensitivity: decoding accepts both uppercase and lowercase letters, mapping them to the lowercase alphabet for consistency, which makes it suitable for case-insensitive environments like filenames and web URLs. Unlike some variants, it does not incorporate hyphens or other separators as part of the core encoding, though applications may add them post-encoding for readability if needed. This design was motivated by needs in projects like Mnet, where 30-octet cryptographic values required compact, human-transmittable URI representations.[8][10] In practice, z-base-32 offers advantages in web and file naming scenarios by producing purely alphanumeric strings that are URL-safe and free of ambiguous characters, thereby lowering error rates in manual entry compared to standard Base32 alphabets that include '0', 'O', or 'I'. 
For instance, a 128-bit UUID needs 128 / 5 = 25.6 symbols, rounded up to 26 characters, and can be encoded without padding. Shorter inputs give shorter strings: the 13-octet ASCII input "hello, world" followed by a newline encodes to the 21-character string "pb1sa5dxfoo8q551pt1yw". This compactness facilitates shorter identifiers in distributed systems such as Tahoe-LAFS.[8][11]
Crockford's Base32
Crockford's Base32 is a variant of the Base32 encoding scheme developed by Douglas Crockford in 2002 specifically to facilitate the accurate transmission of binary data between humans and computers, particularly for short identifiers like UUIDs. It prioritizes human readability and error resistance over strict adherence to standards like RFC 4648.[12]
The alphabet consists of 32 symbols: the digits 0 through 9, followed by the uppercase letters A through Z excluding I, L, O, and U to minimize visual confusion with numerals and avoid unintended vulgarities. This results in the set: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, J, K, M, N, P, Q, R, S, T, V, W, X, Y, Z. Encoding treats input bytes as a bit stream, grouping them into 5-bit quanta, each mapped to a symbol from the alphabet; to avoid padding, the input is zero-extended if necessary to ensure the bit length is a multiple of 5. Outputs use uppercase letters exclusively, with no padding characters appended.[12][13]
A distinguishing feature is the optional modulo-37 checksum, which appends a single check symbol to detect transcription errors, using an extended set of 37 symbols including the primary 32 plus *, ~, $, =, and U for the checksum value. Hyphens may be inserted arbitrarily in the encoded string for readability during manual transcription and are ignored during decoding. Decoding is case-insensitive, accepting lowercase letters and mapping ambiguous characters like 'i' or 'l' to '1' and 'o' to '0' to aid error correction; if a checksum is present, it is validated, and mismatches cause decoding to fail, preventing common input errors.[12]
For instance, the ASCII string "base" encodes to "C9GQ6S8" without checksum and "C9GQ6S8J" with checksum, where 'J' is the check symbol. This flexible yet robust design enhances reliability in scenarios involving human entry, such as serial numbers or keys.[13]
Other Specialized Variants
In the historical context, early adaptations of 5-bit encoding schemes laid groundwork for modern Base32 by representing data in 32-symbol sets tailored to computing constraints of the era. The Electrologica X1, a transistorized computer developed in the Netherlands during the early 1960s, incorporated 5-bit portions for encoding source code and data on 5-channel punched tape systems.[14] Similarly, Alan Turing's contributions to the Manchester Mark 1 computer in the late 1940s promoted a base-32 numerical system for efficient data representation and output, devising encoding methods like Scheme A in collaboration with Cicely Popplewell to map binary values to 32 distinct symbols, influencing post-war computer design.[15]
A prominent geospatial variant is Geohash, introduced by Gustavo Niemeyer in 2008 as a public-domain system for encoding latitude and longitude into short, hierarchical strings.[16] It uses a modified Base32 alphabet consisting of the digits 0–9 and the lowercase letters b–h, j, k, m, n and p–z (excluding a, i, l, o to avoid visual similarities with numerals). Each additional character contributes 5 more bits and so narrows the encoded cell to 1/32 of its previous area. This adaptation interleaves binary coordinates via Z-order curve principles, producing strings like "gcpvj" for central London, facilitating efficient spatial indexing in databases and URL-shortened geolinks.[17]
Application-specific variants often prioritize obfuscation and usability in constrained environments.
Word-safe Base32 adaptations, for instance, modify the alphabet to exclude ambiguous characters and select letters to avoid forming dictionary words or offensive terms across languages, thereby enhancing security in contexts like key generation or data transmission where readability must not imply meaning.[18] These designs maintain the 5-bit grouping for compactness but select symbols to minimize unintended linguistic patterns.[19] Across these specialized forms, a common trait is the retention of Base32's fundamental 5-bit mechanics for binary-to-text conversion while customizing the symbol set to address domain needs like historical hardware limitations, geospatial hierarchy, or security obfuscation; however, their niche focus has limited broader adoption compared to standardized variants.[20]
Comparisons
With Base64
Base32 and Base64 are both binary-to-text encoding schemes defined in RFC 4648, but they differ fundamentally in their design parameters and implications for data representation. Base32 encodes data using a 32-character alphabet, mapping 5 bits per character, which results in processing 40-bit groups (equivalent to 5 octets) into 8 characters. In contrast, Base64 employs a 64-character alphabet, encoding 6 bits per character and handling 24-bit groups (3 octets) into 4 characters. This leads to distinct efficiency profiles: Base32 expands input data by approximately 60% for complete 5-octet blocks (8 characters for 5 bytes), while Base64 achieves about 33% expansion (4 characters for 3 bytes).[1]
The alphabets further highlight differences in safety and compatibility. Base32's alphabet consists of the uppercase letters A–Z and digits 2–7, with "=" for padding, making it entirely case-insensitive and free of special characters. Base64, however, uses A–Z, a–z, 0–9, plus "+" and "/", with "=" for padding, which can introduce issues in URL-safe contexts or systems intolerant to these symbols, often necessitating variants like Base64url. Both schemes use "=" padding exclusively to align incomplete quanta, but Base32's restricted set enhances readability and reduces errors in human-transmitted identifiers.[1]
In terms of use cases, Base32 is preferred in scenarios requiring unambiguous, human-readable strings, such as shared secrets in Time-based One-Time Password (TOTP) systems, where it encodes keys to minimize transcription errors. Base64 remains the standard for general-purpose applications like MIME email attachments and binary data transfer in protocols, due to its higher density.
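The expansion figures quoted in this comparison can be checked with Python's standard `base64` module. The following is a minimal sketch; the helper name `expansion` and the 15-byte sample input are illustrative choices, not part of any standard.

```python
import base64


def expansion(encoded_len: int, raw_len: int) -> float:
    """Return the size increase of the encoded form as a fraction of the input."""
    return encoded_len / raw_len - 1


# 15 bytes is a multiple of both 5 (the Base32 group size) and 3 (the Base64
# group size), so neither encoding needs "=" padding for this input.
raw = bytes(range(15))

b32 = base64.b32encode(raw)  # 8 characters per 5 input bytes
b64 = base64.b64encode(raw)  # 4 characters per 3 input bytes

b32_overhead = expansion(len(b32), len(raw))
b64_overhead = expansion(len(b64), len(raw))

print(f"Base32: {len(b32)} chars, {b32_overhead:.0%} larger")
print(f"Base64: {len(b64)} chars, {b64_overhead:.0%} larger")
```

For inputs that are not a whole number of groups, padding pushes both ratios slightly higher.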
Although Base32 demands more output characters, incurring higher storage and transmission overhead, its 40-bit groups (5 octets) align to byte boundaries just as Base64's 24-bit groups (3 octets) do; the practical difference for a decoder is the 5-bit rather than 6-bit width of each symbol. Historically, Base32 was standardized in RFC 3548 (2003) and carried forward into RFC 4648 as an alternative to Base64 for restricted US-ASCII environments and case-insensitive needs, prioritizing error resistance over compactness.[1][21]
Advantages and Disadvantages
Base32 encoding offers several advantages over other binary-to-text schemes, particularly in scenarios prioritizing human readability and error resistance. Its alphabet of 32 characters (uppercase letters A–Z and digits 2–7) omits the digits 0 and 1, though it retains letters such as I, L, and O that a reader may mistake for numerals.[1] This design reduces transcription errors compared to Base64, whose alphabet contains easily confused characters such as 0, O, and l. Additionally, standard Base32 is case-insensitive, allowing flexible input during decoding without altering the output, which simplifies usage in varied environments. Variants like Crockford's Base32 go further by excluding the letters I, L, O, and U and by being inherently URL-safe, avoiding symbols that could interfere with web transmission.[12]
In terms of compactness, Base32 encodes each 40-bit block into exactly 8 characters, a density of 5 bits per symbol that exceeds Base16's 4 bits per symbol for general binary data. Relative to Base16 (hexadecimal), Base32 yields more compact representations; for instance, 20 bits require 4 Base32 characters versus 5 Base16 characters, while remaining readable without familiarity with hexadecimal notation.[1]
However, Base32 has notable disadvantages, primarily its lower efficiency compared to Base64. Its output is approximately 60% larger than the input (versus Base64's 33% overhead), since every 5 bytes of input expand to 8 characters, making it less suitable for bandwidth-constrained applications. Padding with "=" characters further increases the length for inputs that are not multiples of 40 bits, adding to the overhead in short encodings.
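The padding overhead for short inputs follows from simple arithmetic. The sketch below assumes padded RFC 4648 Base32; the function name `base32_lengths` is hypothetical, and the loop cross-checks the arithmetic against Python's `base64.b32encode`.

```python
import base64


def base32_lengths(n_bytes: int) -> tuple[int, int]:
    """Return (data_chars, pad_chars) for padded RFC 4648 Base32 output."""
    data_chars = (n_bytes * 8 + 4) // 5   # ceil(bits / 5): 5 bits per character
    padded_len = (n_bytes + 4) // 5 * 8   # output always fills whole 8-char groups
    return data_chars, padded_len - data_chars


for n in range(1, 6):
    data_chars, pad_chars = base32_lengths(n)
    encoded = base64.b32encode(b"\x00" * n)
    # The arithmetic should agree with the standard library's output.
    assert len(encoded) == data_chars + pad_chars
    assert encoded.count(b"=") == pad_chars
    print(f"{n} byte(s): {data_chars} data chars + {pad_chars} padding chars")
```

A single input byte therefore occupies 8 output characters, 6 of which are padding, which is why unpadded forms are common for short tokens.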
For data already in hexadecimal form, Base16 is simpler to apply, as each digit maps directly to 4 bits without any regrouping.[1]
On security aspects, Base32 provides no inherent encryption or confidentiality; it merely represents binary data in text form, and the length of the encoded text can reveal the approximate length of the underlying data, which may matter in sensitive contexts. While variants such as Crockford's incorporate optional checksums (using modulo-37 arithmetic) to detect transcription errors or alterations, these do not mitigate cryptographic vulnerabilities and add minor computational overhead.[12]
Overall, Base32 trades raw efficiency for enhanced readability and safety, making it preferable in human-centric applications like identifiers or DNS records over purely optimized schemes like Base64, though it underperforms in high-volume data transfer.[1]
Implementations and Applications
Software Libraries
Several programming languages provide built-in support or popular third-party libraries for Base32 encoding and decoding, primarily adhering to the RFC 4648 standard. These implementations convert binary data to and from Base32-encoded strings, enabling applications in data serialization, URL-safe transmission, and human-readable representations of binary values.
In Java, the standard library has no native Base32 support; java.util.Base64 covers only Base64. Developers typically rely on third-party libraries like Apache Commons Codec, which offers a Base32 class for encoding and decoding per RFC 4648, or Google Guava's BaseEncoding for flexible binary-to-text conversions including Base32. Similarly, in C#, the .NET framework lacks a built-in System.Convert.ToBase32String method, with implementations often using custom code or libraries like the ConvertBase32 utility in open-source projects for RFC 4648 compliance.
Python includes native Base32 functions in its standard base64 module, with b32encode() converting bytes to Base32-encoded bytes and b32decode() performing the reverse, supporting optional case folding and character mapping for robustness. Third-party packages like base32-crockford extend this for variants, such as Crockford's Base32, providing additional encoding options beyond the standard alphabet.
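A short sketch of the standard-library functions described above. The `casefold` and `map01` parameters belong to `base64.b32decode`; the sample inputs and variable names are illustrative only.

```python
import base64
import binascii

encoded = base64.b32encode(b"hello")   # bytes in, bytes out
decoded = base64.b32decode(encoded)    # round-trips to the original bytes

# By default the decoder accepts only the uppercase alphabet.
try:
    base64.b32decode(b"nbswy3dp")
    lowercase_rejected = False
except binascii.Error:
    lowercase_rejected = True

# casefold=True accepts lowercase input as well.
folded = base64.b32decode(b"nbswy3dp", casefold=True)

# map01 names the letter that the digit 1 stands for (I or L); when it is
# given, the digit 0 is read as the letter O. Here "0A======" is decoded as
# if it had been typed "OA======".
mapped = base64.b32decode(b"0A======", map01=b"L")

print(encoded, decoded, lowercase_rejected, folded, mapped)
```

The same module also exposes `b32hexencode()` and `b32hexdecode()` for the base32hex alphabet in Python 3.10 and later.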
Go features a standard library package encoding/base32 that implements RFC 4648 encoding and decoding, including StdEncoding for the standard alphabet and HexEncoding for the "Extended Hex" (base32hex) alphabet; it supports streaming via NewEncoder and NewDecoder for efficient handling of large data. In Rust, the base32 crate provides encode() and decode() functions for various Base32 alphabets, including RFC 4648, and is no_std compatible for embedded use cases.
JavaScript lacks native Base32 support in browsers or Node.js, but npm libraries such as base32-encode and its companion base32-decode cover multiple variants. The Node.js Buffer class offers no Base32 encoding of its own, so conversion goes through such third-party packages.
Support for Base32 variants is more limited and often confined to specialized libraries. For Crockford's Base32, the crockford-base32 npm package in JavaScript implements the human-readable encoding without ambiguous characters, and similar libraries exist for Rust and Go. z-base-32 has sparse adoption, with implementations like the zbase32 npm module for JavaScript, the z-base-32 PyPI package for Python, and the zbase32 Go package, focusing on URL-safety and brevity but lacking widespread integration.
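Crockford's variant is small enough to sketch without a library. The code below is a minimal illustration, not the API of any package named above: it assumes the alphabet and decoding aliases from Crockford's published description, treats the value as a non-negative integer, and appends the optional modulo-37 check symbol. The function names are hypothetical.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # 32 symbols; I, L, O, U omitted
CHECK_SYMBOLS = ALPHABET + "*~$=U"             # 37 symbols for the check character
ALIASES = str.maketrans("ILO", "110")          # decoding reads I and L as 1, O as 0


def crockford_encode(number: int, checksum: bool = False) -> str:
    """Encode a non-negative integer, optionally appending a check symbol."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    digits = []
    n = number
    while True:
        n, remainder = divmod(n, 32)
        digits.append(ALPHABET[remainder])
        if n == 0:
            break
    text = "".join(reversed(digits))
    if checksum:
        text += CHECK_SYMBOLS[number % 37]
    return text


def crockford_decode(text: str, checksum: bool = False) -> int:
    """Decode case-insensitively, ignoring hyphens and verifying the check symbol."""
    symbols = text.upper().replace("-", "")
    check = None
    if checksum:
        symbols, check = symbols[:-1], symbols[-1]
    number = 0
    for ch in symbols.translate(ALIASES):
        number = number * 32 + ALPHABET.index(ch)  # ValueError on invalid symbols
    if check is not None and CHECK_SYMBOLS.index(check) != number % 37:
        raise ValueError("check symbol does not match")
    return number


print(crockford_encode(1234), crockford_encode(1234, checksum=True))
print(crockford_decode("i6-j"))  # aliases and hyphens are tolerated
```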
Base32 implementations generally run in linear time, O(n) in the input size, using straightforward bit shifting and table lookups. Decoding can be somewhat slower because of padding and alphabet validation, and hardware acceleration such as SIMD instructions is not commonly used.
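The bit shifting and table lookup mentioned above can be sketched as a single pass over the input. This is an illustrative encoder for the RFC 4648 §6 alphabet with padding, not the code of any particular library; the name `b32encode_sketch` is hypothetical, and the final loop compares its output with Python's `base64.b32encode`.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 section 6 lookup table


def b32encode_sketch(data: bytes) -> str:
    """Encode bytes as padded Base32 in one O(n) pass."""
    out = []
    buffer = 0  # bit accumulator
    bits = 0    # number of unread bits currently held in the accumulator
    for byte in data:
        buffer = (buffer << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append(ALPHABET[(buffer >> bits) & 0x1F])  # take the top 5 bits
        buffer &= (1 << bits) - 1  # keep only the leftover low bits
    if bits:
        # Left-align the final partial group and fill it with zero bits.
        out.append(ALPHABET[(buffer << (5 - bits)) & 0x1F])
    out.extend("=" * (-len(out) % 8))  # pad to a multiple of 8 characters
    return "".join(out)


for sample in (b"", b"f", b"fo", b"foo", b"foob", b"fooba", b"foobar"):
    assert b32encode_sketch(sample) == base64.b32encode(sample).decode("ascii")
    print(sample, b32encode_sketch(sample))
```

Each input byte is touched once and each output character costs one shift, one mask, and one table lookup, which is where the linear running time comes from.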
