Collation
Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.
Collation differs from classification in that the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a sorting algorithm to arrange the items by class.
Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information (items with the same identifier are not placed in any defined order).
A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.
The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).
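As a minimal sketch of the lookup benefit described above, the following Python example uses the standard `bisect` module to run a binary search over an already sorted list. The helper names `contains` and `with_prefix` and the word list are illustrative assumptions, not part of any standard API.

```python
from bisect import bisect_left

def contains(sorted_items, target):
    """Return True if target occurs in sorted_items (which must already be sorted)."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

def with_prefix(sorted_items, prefix):
    """Return the contiguous run of items that start with prefix."""
    start = bisect_left(sorted_items, prefix)  # first item >= prefix
    end = start
    while end < len(sorted_items) and sorted_items[end].startswith(prefix):
        end += 1
    return sorted_items[start:end]

words = sorted(["carbon", "cart", "carthorse", "dog", "apple", "carp"])

print(contains(words, "carp"))    # True
print(contains(words, "cat"))     # False
print(with_prefix(words, "car"))  # ['carbon', 'carp', 'cart', 'carthorse']
```

Because the list is sorted, all items sharing a prefix sit next to each other, which is why a search that knows only the first few letters can still find the whole range.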
Ordering
Numerical and chronological
Strings representing numbers may be sorted based on the values of the numbers that they represent. For example, "−4", "2.5", "10", "89", "30,000" are in numerical order. Pure application of this method may provide only a partial ordering on the strings, since different strings can represent the same number (as with "2" and "2.0" or, when scientific notation is used, "2e3" and "2000").
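The numeric approach can be sketched in Python as follows. The parsing rules here are assumptions made for illustration: the helper `numeric_value` accepts the Unicode minus sign (U+2212) and treats commas as thousands separators, which is not valid in every locale. Adding the string itself as a tie-breaker is one possible way to turn the partial ordering into a total one.

```python
from decimal import Decimal

def numeric_value(text):
    """Parse a numeric string (assumes U+2212 or '-' as minus, ',' as thousands separator)."""
    cleaned = text.replace("\u2212", "-").replace(",", "")
    return Decimal(cleaned)

strings = ["30,000", "10", "\u22124", "89", "2.5"]
ordered = sorted(strings, key=numeric_value)
print(ordered)  # ['−4', '2.5', '10', '89', '30,000']

# Different strings can represent the same number, so the value alone
# does not decide their relative order.
print(numeric_value("2") == numeric_value("2.0"))     # True
print(numeric_value("2e3") == numeric_value("2000"))  # True

# Tie-breaking on the original string yields a total order.
print(sorted(["2.0", "10", "2"], key=lambda s: (numeric_value(s), s)))  # ['2', '2.0', '10']
```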
A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion.
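A short Python sketch of chronological sorting, under the assumption that the dates arrive either in ISO 8601 form (YYYY-MM-DD) or in a US-style MM/DD/YYYY form; the sample dates and the helper `parse_us` are made up for illustration. ISO 8601 strings already sort chronologically as plain text, while the other format has to be parsed first.

```python
from datetime import date, datetime

# ISO 8601 dates: plain string order and chronological order agree.
iso_dates = ["2025-11-10", "2023-01-15", "2024-02-29"]
print(sorted(iso_dates) == sorted(iso_dates, key=date.fromisoformat))  # True

def parse_us(text):
    """Parse an MM/DD/YYYY string into a date (format assumed for this example)."""
    return datetime.strptime(text, "%m/%d/%Y").date()

us_dates = ["03/01/2024", "12/25/2023", "01/15/2025"]
lexical = sorted(us_dates)                      # ordered by characters, not by time
chronological = sorted(us_dates, key=parse_us)  # ordered by the dates they represent

print(lexical)        # ['01/15/2025', '03/01/2024', '12/25/2023']
print(chronological)  # ['12/25/2023', '03/01/2024', '01/15/2025']
```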
Alphabetical
Alphabetical order is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question. (The system is not limited to alphabets in the strict technical sense; languages that use a syllabary or abugida, for example Cherokee, can use the same ordering principle provided there is a set ordering for the symbols used.)
To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
Capital letters are typically treated as equivalent to their corresponding lowercase letters. (For alternative treatments in computerized systems, see Automated collation, below.)
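The comparison procedure described above can be sketched in Python. This is a minimal illustration that assumes the strings contain only the 26 letters of the basic Latin alphabet; the names `ALPHABET`, `RANK`, and `compare_alphabetical` are hypothetical. Capital letters are folded to lowercase before comparison, matching the convention just mentioned.

```python
import string
from functools import cmp_to_key

ALPHABET = string.ascii_lowercase
RANK = {letter: i for i, letter in enumerate(ALPHABET)}  # position of each letter

def compare_alphabetical(a, b):
    """Return -1, 0 or 1 as a sorts before, equal to, or after b."""
    for x, y in zip(a.lower(), b.lower()):
        if RANK[x] != RANK[y]:
            return -1 if RANK[x] < RANK[y] else 1
    # All shared positions match: the string that runs out of letters comes first.
    return (len(a) > len(b)) - (len(a) < len(b))

words = ["carthorse", "dog", "cart", "Apple", "cat"]
ordered = sorted(words, key=cmp_to_key(compare_alphabetical))
print(ordered)  # ['Apple', 'cart', 'carthorse', 'cat', 'dog']
```

Any sorting algorithm can be driven by such a pairwise comparison; `cmp_to_key` simply adapts it to the key-based interface of Python's `sorted`.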
Certain limitations, complications, and special conventions may apply when alphabetical order is used:
- When strings contain spaces or other word dividers, the decision must be taken whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. For example, if the first approach is taken then "car park" will come after "carbon" and "carp" (as it would if it were written "carpark"), whereas in the second approach "car park" will come before those two words. The first rule is used in many (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo).
- Abbreviations may be treated as if they were spelt out in full. For example, names containing "St." (short for the English word Saint) are often ordered as if they were written out as "Saint". There is also a traditional convention in English that surnames beginning Mc and M' are listed as if those prefixes were written Mac.
- Strings that represent personal names will often be listed by alphabetical order of surname, even if the given name comes first. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way.
- Very common initial words, such as The in English, are often ignored for sorting purposes. So The Shining would be sorted as just "Shining" or "Shining, The".
- When some of the strings contain numerals (or other non-letter characters), various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example 1776 would be sorted as if spelled out "seventeen seventy-six", and 24 heures du Mans as if spelled "vingt-quatre..." (French for "twenty-four"). When numerals or other symbols are used as special graphical forms of letters, as in 1337 for leet or Se7en for the movie title Seven, they may be sorted as if they were those letters.
- Languages have different conventions for treating modified letters and certain letter combinations. For example, in Spanish the letter ñ is treated as a basic letter following n, and the digraphs ch and ll were formerly (until 1994) treated as basic letters following c and l, although they are now alphabetized as two-letter combinations. A list of such conventions for various languages can be found at Alphabetical order § Language-specific conventions.
In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.
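A few of the conventions listed above can be sketched as key functions in Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, only the space is treated as a word divider, and only a leading "The" and the abbreviation "St." are handled.

```python
import re

def letter_by_letter(s):
    """First approach: ignore spaces, so "car park" is compared as "carpark"."""
    return s.lower().replace(" ", "")

def word_by_word(s):
    """Second approach: compare word lists, so a space precedes every letter."""
    return s.lower().split(" ")

def filing_key(title):
    """Drop a leading "The" and expand "St." to "Saint" before comparing."""
    key = title.lower()
    key = re.sub(r"^the\s+", "", key)
    key = re.sub(r"\bst\.\s*", "saint ", key)
    return key

entries = ["carp", "car park", "carbon"]
print(sorted(entries, key=letter_by_letter))  # ['carbon', 'carp', 'car park']
print(sorted(entries, key=word_by_word))      # ['car park', 'carbon', 'carp']

titles = ["The Shining", "St. Elmo's Fire", "Salem's Lot", "Saw"]
print(sorted(titles, key=filing_key))
# ["St. Elmo's Fire", "Salem's Lot", 'Saw', 'The Shining']
```

The original strings are returned unchanged; only the comparison sees the transformed form.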
Root sorting
Some Arabic dictionaries, such as Hans Wehr's bilingual A Dictionary of Modern Written Arabic, group and sort Arabic words by semitic root.[1] For example, the words kitāba (كتابة 'writing'), kitāb (كتاب 'book'), kātib (كاتب 'writer'), maktaba (مكتبة 'library'), maktab (مكتب 'office'), maktūb (مكتوب 'fate,' or 'written'), are agglomerated under the triliteral root k-t-b (ك ت ب), which denotes 'writing'.[2]
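Grouping by root can be sketched in Python with a small hand-entered table. Everything in this example is an assumption made for illustration: the words are given in Latin transliteration, the d-r-s entries are added only to provide a second root, and the roots are ordered by plain string comparison of their transliterations rather than by the Arabic alphabet.

```python
from collections import defaultdict

# (transliterated word, triliteral root) pairs, entered by hand for this sketch
ENTRIES = [
    ("kitāb", "k-t-b"),
    ("madrasa", "d-r-s"),
    ("maktaba", "k-t-b"),
    ("kātib", "k-t-b"),
    ("dars", "d-r-s"),
]

# Collect every word under its root; the root, not the spelling, is the sort key.
by_root = defaultdict(list)
for word, root in ENTRIES:
    by_root[root].append(word)

listing = [(root, sorted(by_root[root])) for root in sorted(by_root)]
for root, words in listing:
    print(root, words)
# d-r-s ['dars', 'madrasa']
# k-t-b ['kitāb', 'kātib', 'maktaba']
```

Note that maktaba is filed under k rather than m, which is the practical difference from alphabetical ordering of the written forms.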
Radical-and-stroke sorting
Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the hanzi of Chinese and the kanji of Japanese, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 (meaning "mother") is sorted as a six-stroke character under the three-stroke primary radical 女 (meaning "woman").
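A radical-and-stroke sort key can be sketched in Python as a tuple. The table below is hand-entered for illustration (real software would take this data from a character database), the field layout is an assumption of this sketch, and ties between characters with the same radical and stroke count are broken arbitrarily by code point.

```python
# (character, radical, Kangxi radical number, strokes besides the radical),
# entered by hand for this sketch
CHARS = [
    ("明", "日", 72, 4),
    ("妈", "女", 38, 3),
    ("你", "人", 9, 5),
    ("好", "女", 38, 3),
    ("他", "人", 9, 3),
]

def radical_stroke_key(entry):
    """Group by radical first, then order by the remaining stroke count."""
    char, _radical, radical_number, residual_strokes = entry
    return (radical_number, residual_strokes, char)

ordered = [entry[0] for entry in sorted(CHARS, key=radical_stroke_key)]
print(ordered)
```

In the output, the two characters under 人 come first (他 before 你, since it has fewer remaining strokes), then the two under 女, then 明 under 日.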
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京) can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.[citation needed]
In addition, Chinese characters can also be sorted by stroke-based sorting. In Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.
Automation
When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.[3]
The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasons[note 1]) before comparison of ASCII values.
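The code-based ordering described above is what Python's built-in `sorted` applies to strings by default, so it can be demonstrated directly. The word list is an illustrative assumption; case conversion before comparison is shown with `str.upper` as the key.

```python
chars = ["a", "b", "C", "d", "$"]
by_code = sorted(chars)
print(by_code)                    # ['$', 'C', 'a', 'b', 'd']
print([ord(c) for c in by_code])  # [36, 67, 97, 98, 100]

words = ["banana", "Zebra", "apple", "Mango", "cherry"]

# Plain code order: every capitalized word precedes every lower-case word.
asciibetical = sorted(words)
print(asciibetical)  # ['Mango', 'Zebra', 'apple', 'banana', 'cherry']

# Converting case before comparison restores the expected alphabetical order.
case_folded = sorted(words, key=str.upper)
print(case_folded)   # ['apple', 'banana', 'cherry', 'Mango', 'Zebra']
```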
In many collation algorithms, the comparison is based not on the numerical codes of the characters, but with reference to the collating sequence – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under Alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.[3]
Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
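The German and Turkish conventions can be contrasted with two hand-written key functions in Python. This is a deliberately simplified sketch, not a substitute for a real collation library: it assumes lowercase input, folds only ä, ö, ü and ß for the German dictionary rule, and ranks Turkish words by position in the 29-letter Turkish alphabet. All names are hypothetical.

```python
import unicodedata

def nfc(text):
    """Use precomposed characters so that each letter is a single code point."""
    return unicodedata.normalize("NFC", text)

# German dictionary rule: umlauted vowels are compared as their base vowels.
GERMAN_FOLD = str.maketrans(nfc("äöü"), "aou")

def german_dictionary_key(word):
    word = nfc(word)
    return (word.translate(GERMAN_FOLD).replace("ß", "ss"), word)

# Turkish rule: ö is a separate letter that follows o in the alphabet.
TURKISH_ALPHABET = nfc("abcçdefgğhıijklmnoöprsştuüvyz")
TURKISH_RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET)}

def turkish_key(word):
    return [TURKISH_RANK[ch] for ch in nfc(word)]

german = sorted(["olfaktorisch", "ökonomisch", "offenbar"], key=german_dictionary_key)
print(german)  # ['offenbar', 'ökonomisch', 'olfaktorisch']

pair = ["öbür", "oyun"]
print(sorted(pair, key=turkish_key))            # ['oyun', 'öbür']
print(sorted(pair, key=german_dictionary_key))  # ['öbür', 'oyun']
```

The last two lines show the multilingual problem: the same pair of words is ordered differently depending on which language's rule is applied.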
A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in Common Locale Data Repository.
Sort keys
In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, The Shining might be sorted as Shining, The (see Alphabetical order above), but it may still be desired to display it as The Shining. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called sort keys.
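Storing a display string alongside a separate sort key might look like the following Python sketch. The `Entry` class, the `make_entry` helper, and the catalog contents are assumptions made for illustration; only a leading "The" is moved to the end of the key.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    display: str   # the string shown to the user
    sort_key: str  # the string used only for ordering

def make_entry(title):
    """Build an entry whose sort key moves a leading "The" to the end."""
    words = title.split(" ", 1)
    if len(words) == 2 and words[0].lower() == "the":
        return Entry(title, f"{words[1]}, {words[0]}")
    return Entry(title, title)

catalog = [make_entry(t) for t in ["The Shining", "Misery", "The Stand", "Carrie"]]
ordered = [e.display for e in sorted(catalog, key=lambda e: e.sort_key.lower())]

print(make_entry("The Shining").sort_key)  # Shining, The
print(ordered)  # ['Carrie', 'Misery', 'The Shining', 'The Stand']
```

Computing the key once and storing it avoids repeating the transformation every time two items are compared.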
Issues with numbers
Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Microsoft Windows does this when sorting file names.
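One common way to get this behavior for embedded integers is to split each string into alternating text and digit runs and compare the digit runs as numbers. The Python sketch below shows the idea; the function name `natural_key` is an assumption, and decimals, signs, and Roman numerals are deliberately not handled.

```python
import re

def natural_key(text):
    """Split into alternating text and digit runs; compare digit runs as integers."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd positions always hold the digit runs,
    # so text is only ever compared with text and numbers with numbers.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

labels = ["Figure 11a", "Figure 7b", "Figure 2", "Figure 10"]

print(sorted(labels))
# ['Figure 10', 'Figure 11a', 'Figure 2', 'Figure 7b']
print(sorted(labels, key=natural_key))
# ['Figure 2', 'Figure 7b', 'Figure 10', 'Figure 11a']
```

The extra parsing work per comparison key is the source of the slowdown mentioned above.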
Sorting decimals properly is a bit more difficult, because different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.
Labeling of ordered items
In some contexts, numbers and letters are used not so much as a basis for establishing an ordering, but as a means of labeling items that are already ordered. For example, pages, sections, chapters, and the like, as well as the items of lists, are frequently "numbered" in this way. Labeling series that may be used include ordinary Arabic numerals (1, 2, 3, ...), Roman numerals (I, II, III, ... or i, ii, iii, ...), or letters (A, B, C, ... or a, b, c, ...). (An alternative method for indicating list items, without numbering them, is to use a bulleted list.)
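Generating such label series is straightforward to sketch in Python. The function names and the `style` values below are hypothetical, and the letter series is limited to the 26 letters of the basic Latin alphabet.

```python
import string

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)  # how many times this symbol fits
        out.append(symbol * count)
    return "".join(out)

def labels(count, style):
    """Return the first `count` labels in the requested series."""
    if style == "arabic":
        return [str(i) for i in range(1, count + 1)]
    if style == "roman":
        return [to_roman(i) for i in range(1, count + 1)]
    if style == "alpha":
        return list(string.ascii_uppercase[:count])  # at most 26 labels
    raise ValueError(f"unknown style: {style}")

print(labels(4, "arabic"))  # ['1', '2', '3', '4']
print(labels(4, "roman"))   # ['I', 'II', 'III', 'IV']
print(labels(4, "alpha"))   # ['A', 'B', 'C', 'D']
```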
When letters of an alphabet are used for this purpose of enumeration, there are certain language-specific conventions as to which letters are used. For example, the Russian letters Ъ and Ь (which in writing are only used for modifying the preceding consonant), and usually also Ы, Й, and Ё, are omitted. Also in many languages that use extended Latin script, the modified letters are often not used in enumeration.
References
- Abu-Haidar, J. A. (1983). "Review of A Dictionary of Modern Written Arabic (Arabic-English)". Bulletin of the School of Oriental and African Studies, University of London. 46 (2): 351–353. doi:10.1017/S0041977X00079040. ISSN 0041-977X. JSTOR 615409.
- "Hans Wehr Arabic-English Dictionary". ejtaal.net. Retrieved 2023-06-04.
- Walters, Richard F. (1997). M Programming: A Comprehensive Guide. Digital Press.
External links
- Unicode Collation Algorithm: Unicode Technical Standard #10
- Collation in Spanish Archived 2006-08-13 at the Wayback Machine
- Collation of the names of the member states of the United Nations Archived August 30, 2005, at the Wayback Machine
- Typographical collation for many languages, as proposed in the List module of Cascading Style Sheets.
- Collation Charts: Charts demonstrating language-specific sorting orders in various operating systems and DBMS
- ICU Locale Explorer Archived 2008-05-11 at the Wayback Machine: An online demonstration of sorting in different languages that uses the Unicode Collation Algorithm with International Components for Unicode
Collation
Principles of Ordering
Numerical and Chronological Ordering
Numerical ordering in collation refers to comparing strings by interpreting them as mathematical values rather than as sequences of characters, so that the relative magnitude of the numbers determines their position in the sorted output. For instance, in a list containing "−4", "2.5", and "10", numerical collation places "−4" first, followed by "2.5" and then "10", reflecting their numeric values; a code-point comparison would instead place "10" before "2.5". The Unicode Collation Algorithm (UCA) sorts digits lexicographically by default, but it can be tailored to parse and compare numeric substrings for more intuitive, human-readable results.[3]
Chronological ordering extends this principle to time-based sequences, where dates or timestamps are sorted according to their temporal progression, often using standardized formats in which lexical and chronological order coincide. The ISO 8601 standard, for example, represents dates in the YYYY-MM-DD format, so that "2023-01-15" precedes "2025-11-10" both chronologically and as a string, which simplifies organization in databases and archives. This format was developed to promote unambiguous international exchange and machine-readable chronological consistency, avoiding the ambiguities of regional date conventions like MM/DD/YYYY.
A key challenge in numerical and chronological ordering is that the value alone gives only a partial ordering of the strings: different representations of the same value compare as equal. For example, "2" and "2.0" represent the same number but differ in precision or format, and in contexts like version numbering or precise data logging, collation rules may need to treat them as distinct to avoid unintended merging in sorted output. This calls for hybrid approaches, as in ICU collation, where numeric parsing is combined with a secondary string comparison to separate equivalent representations.[8]
Historically, numerical and chronological ordering emerged in early filing systems and calendars to manage records efficiently amid growing administrative demands. In the late 19th and early 20th centuries, U.S. government offices, including the State Department around 1910, adopted numerical filing to standardize document retrieval, replacing haphazard arrangements with sequential number assignments for faster access. Similarly, ancient calendars, such as the Egyptian solar calendar from circa 3000 BCE, imposed chronological ordering on events for agricultural and ritual purposes, laying foundational principles for temporal sequencing that influenced modern standards.[9][10]
Alphabetical Ordering
Alphabetical ordering involves comparing strings character by character based on their positions within an alphabet, such as placing A before B in the Latin script.[3] This letter-by-letter approach determines the sequence by assigning weights to each character, starting at the primary level for base letters and proceeding to the secondary level for diacritics only if needed, so that "role" precedes "roles" because of the additional 's' at the primary level.[11]
Case handling in alphabetical ordering varies by system, but many dictionary conventions treat uppercase and lowercase letters as equivalent in value, allowing "Apple" to file next to "apple" without strict precedence.[12] In some computational and filing systems, however, uppercase letters precede lowercase ones based on ASCII values, so that "Apple" sorts before "banana" because 'A' (code 65) comes before 'b' (code 98).[13]
Language-specific rules adapt alphabetical ordering to account for digraphs, ligatures, and modified letters unique to each script. In Spanish, the letter ñ is treated as a distinct character positioned after 'n' but before 'o'. The digraphs "ch" and "ll", once considered separate letters, were reclassified in 1994 and excluded from the alphabet in 2010; they are now sorted letter by letter, as 'c' followed by 'h' (after "ce" but before "ci") and 'l' followed by 'l' (after "li" but before "lo"), respectively.[14] In French, accented letters like é share the primary position of their base letter 'e'; diacritics are evaluated at the secondary level, where unaccented e precedes é, and some French dictionary conventions weight the diacritics backward, starting from the end of the word.[15] Ligatures such as œ in French are typically expanded to "oe" for collation, so that "cœur" and "coeur" sort next to each other and differ only below the primary level.[16]
Abbreviations and punctuation are frequently ignored or treated as separators to simplify ordering, preventing disruptions from non-letter characters. Spaces and hyphens serve as element dividers, while periods in abbreviations like "St." are disregarded, filing "St. Louis" under "S" as if it were "Saint Louis."[12] In English word lists, this results in "U.S.A." sorting under "U" as if it were written "USA".[12]
Examples from major languages illustrate these principles. In English dictionaries, "cat" precedes "dog" via primary letter comparison, with case-insensitive filing placing "Cat" adjacent to "cat."[12] French lists order "cote" before "côte" (unaccented before accented) and "éte" after "ete" but before "fête," reflecting secondary diacritic weighting.[17] Spanish dictionaries place "ñandú" after "nuez" but before "oasis" because of ñ's position, and "chico" after "cebra" but before "cima" under letter-by-letter treatment of the digraph.[14]
Specialized Sorting Methods
Root-Based Sorting
Root-based sorting is a collation method used primarily in dictionaries of Semitic languages, where entries are organized by shared consonantal roots (typically triliteral sequences of consonants that form the core semantic unit) rather than by the full spelled-out words. For instance, in Arabic, words derived from the root k-t-b (كتب), such as kitāb (book) and kataba (he wrote), are grouped together under the root entry, with subentries arranged by vowel patterns or affixes.[18] This approach prioritizes morphological structure over linear alphabetical sequence, in contrast with the standard alphabetical ordering used for non-Semitic scripts.
A prominent historical example is Hans Wehr's A Dictionary of Modern Written Arabic, first published in German in 1952 with English editions appearing by 1961, which employs roots as the primary sort keys followed by form patterns.[18] In this dictionary, roots are listed in a modified alphabetical order based on their consonants, enabling users to locate related derivations systematically.[19]
This method extends to other Semitic languages, such as Hebrew, where standard lexicons like the Brown-Driver-Briggs Hebrew and English Lexicon arrange entries by triliteral roots to reflect etymological families.[20] Similarly, in Amharic, dictionaries including Grover Hudson's A Student’s Amharic-English, English-Amharic Dictionary (1994) and the Kane Amharic-English Dictionary organize words by roots when applicable, following the Ethiopic syllabary for root sequencing.[21][22]
The advantages of root-based sorting lie in its ability to reveal etymological and semantic connections among words, facilitating deeper understanding for language learners and researchers by grouping morphologically related terms. This organization highlights the templatic morphology of Semitic languages, where a single root can generate dozens of forms across grammatical categories.
Radical-and-Stroke Sorting
Radical-and-stroke sorting is a hierarchical method employed in East Asian writing systems to order logographic characters, primarily by identifying a semantic or graphic component known as the radical, followed by the number of strokes in the remaining portion of the character. This approach facilitates dictionary lookup and collation for scripts like Chinese hanzi, Japanese kanji, and Korean hanja, where characters do not follow phonetic alphabets. The radical serves as the primary key, often hinting at the character's meaning, while the residual stroke count provides the secondary sorting criterion, ensuring a systematic arrangement without reliance on linear phonetic sequences.[23][24]
In the Chinese system, characters are decomposed such that the radical (typically the leftmost, topmost, or bottommost component) determines the main category, with the total stroke count minus the radical's strokes used for subordering. For instance, the character 妈 ("mother") is sorted under the radical 女 (nǚ, meaning "female," 3 strokes), with 3 additional strokes in the remainder 马 (mǎ, "horse"). Similarly, 好 ("good") falls under 女, followed by 3 strokes in 子 (zǐ, "child"). This method relies on a standardized set of 214 radicals, ordered by their own stroke counts from 1 to 17.[23][25]
The foundational framework emerged from the Kangxi Dictionary (康熙字典, Kāngxī Zìdiǎn), commissioned by the Qing emperor Kangxi and completed in 1716, which formalized the 214-radical system drawing from earlier Ming-era works like the Zhengzitong. This dictionary organized approximately 47,000 characters under these radicals, with subentries by residual strokes, establishing an enduring standard for traditional Chinese collation that persists in modern print and digital references. Its influence extends to adaptations in simplified Chinese contexts, where variant radicals are mapped to maintain compatibility.[23][24]
Japanese kanji dictionaries adopt a parallel structure, classifying characters under one of the 214 Kangxi radicals before sorting by additional strokes, often supplemented by total stroke indices for cross-verification. For example, the character 読 ("read") is indexed under the radical 言 ("speech," 7 strokes), with 7 further strokes in the component 売, the simplified Japanese form of 賣. This method supports efficient lookup in resources like the Dai Kan-Wa Jiten, aligning closely with Chinese traditions while accommodating Japanese-specific readings and usages.[26][24]
In contemporary digital environments, radical-and-stroke sorting has been integrated into font systems and collation algorithms through the Unicode Standard, which encodes the 214 Kangxi radicals in the range U+2F00–U+2FDF and provides radical-stroke indices in the Unihan Database for CJK unified ideographs. This enables consistent machine-readable ordering across Chinese, Japanese, and Korean texts, preserving the method's utility in search engines, databases, and typesetting software.[25][24]
Korean hanja collation mirrors the radical-and-stroke approach using the same 214 Kangxi radicals, with characters sorted first by radical strokes and then by residuals, though practical dictionaries often integrate this with Hangul phonetic indices for Sino-Korean compounds. This adaptation supports hanja's role in formal nomenclature and etymology, where it coexists with the alphabetic Hangul script without disrupting the logographic ordering.[27][24]
Collation in Computing
Sort Keys and Algorithms
In computing, collation relies on sort keys, which are binary strings or arrays of integers derived from character codes to enable efficient string comparisons. These keys transform Unicode code points into weighted sequences that reflect linguistic order rather than raw numerical values, allowing binary operations like memcmp for sorting. For example, the character "a" (U+0061) might map to a multi-byte key such as [0x15EF, 0x0020, 0x0002], representing its primary, secondary, and tertiary weights.[3]
The historical evolution of collation mechanisms traces back to the American Standard Code for Information Interchange (ASCII), standardized in 1963 by the American Standards Association to provide a 7-bit encoding for 128 characters, primarily supporting English text. Early ASCII-based sorting used simple codepoint comparisons, ordering characters by their binary values (e.g., 'A' at 65 precedes 'a' at 97), which sufficed for basic English but failed for multilingual needs. The introduction of Unicode in 1991 initiated a major evolution in character encoding, with the standard growing to encompass 159,801 assigned characters across 172 scripts as of version 17.0 (September 2025), necessitating algorithms beyond codepoint order to ensure culturally appropriate sorting.[28][29]
The Unicode Collation Algorithm (UCA), specified in Unicode Technical Standard #10, provides a foundational, customizable method for generating these sort keys and performing comparisons. It decomposes strings into collation elements (triplets of weights) and compares them level by level: primary weights for base letter differences (e.g., "a" < "b"), secondary for diacritics (e.g., "e" < "é"), and tertiary for case or other variants (e.g., "a" < "A" in the default ordering). The algorithm is tailorable through modifications to the Default Unicode Collation Element Table (DUCET), allowing reordering of scripts, contractions (e.g., "ch" in Spanish), or level adjustments without altering the core process.[3]
The UCA's main algorithm proceeds in four steps:
- Normalization: Canonicalize the input strings to Normalization Form D (NFD), decomposing combined characters (e.g., "é" to "e" + combining acute). This ensures consistent element mapping.[30]
- Collation Element Generation: Map each character, or each contracting sequence of characters, to one or more collation elements from the collation element table. Simple mappings use a single triplet [P.S.T]; expansions map one character to several elements (e.g., the "ffi" ligature expands to multiple elements); contractions treat digraphs as units. Elements with a zero primary weight are ignored at the primary level.[31]
- Sort Key Formation: For each level, collect the non-zero weights into a key array, optionally processing secondary weights backward for certain languages. The full key concatenates the levels with a zero separator between them: L1 || 0000 || L2 || 0000 || L3, forming a binary string for comparison.[32]
- Comparison: Compare the keys level by level, stopping at the first differing weight (L1 first, then L2, and so on). If the keys are equal up to the tertiary level, the strings are equivalent; further levels (such as a quaternary level) can resolve ties.[33]
In pseudocode, the comparison can be summarized as follows:
function compareStrings(s1, s2):
    normalize s1 and s2 to NFD
    ce1 = getCollationElements(s1)   // array of [P, S, T] triplets
    ce2 = getCollationElements(s2)
    key1 = buildSortKey(ce1)         // levels L1, L2, L3 as concatenated weights
    key2 = buildSortKey(ce2)
    for level in 1 to 3:
        result = compareLevel(key1[level], key2[level])
        if result != 0:
            return result            // the first differing level decides
    return 0                         // equal at all three levels
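The multilevel idea can be made concrete with a toy Python sketch. The weight tables below are invented for illustration and are not the DUCET values; only the letters a to z and three combining accents are covered, and contractions, expansions, and tailoring are left out. The names `collation_elements` and `sort_key` are hypothetical.

```python
import unicodedata

# Toy weight tables (NOT the real DUCET values).
PRIMARY = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
MARKS = {"\u0301": 2, "\u0300": 3, "\u0302": 4}  # acute, grave, circumflex

def collation_elements(text):
    """Steps 1-2: normalize to NFD, then map characters to (P, S, T) triplets."""
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in MARKS:
            elements.append((0, MARKS[ch], 1))  # accent: no primary weight
        elif ch.lower() in PRIMARY:
            tertiary = 2 if ch.isupper() else 1  # lowercase before uppercase
            elements.append((PRIMARY[ch.lower()], 1, tertiary))
        # characters outside the toy tables are ignored entirely
    return elements

def sort_key(text):
    """Step 3: concatenate the non-zero weights of each level, separated by 0."""
    elements = collation_elements(text)
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(e[level] for e in elements if e[level] != 0)
    return key

# Step 4: comparing the keys as plain lists compares level by level.
words = ["roles", "rôle", "Role", "role"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['role', 'Role', 'rôle', 'roles']
```

Because the primary weights come first in the key, the extra letter in "roles" outweighs both the accent in "rôle" and the capital in "Role", and the accent (secondary) outweighs the capital (tertiary).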
