
Words (Unix)

from Wikipedia

words is a standard file on Unix and Unix-like operating systems, and is simply a newline-delimited list of dictionary words. It is used, for instance, by spell-checking programs.[1]

The words file is usually stored in /usr/share/dict/words or /usr/dict/words.

On Debian and Ubuntu, the words file is provided by the wordlist package, or its provider packages wbritish, wamerican, etc. On Fedora, Alpine Linux and Arch Linux, the words file is provided by the words package. The words package is sourced from data from the Moby Project, a public domain compilation of words.[2]

References

from Grokipedia
In Unix-like operating systems, the words file is a standard plain-text dictionary containing a sorted list of common English words, one per line, designed for efficient binary search and used primarily by system utilities for word lookup and validation. It is typically located at /usr/share/dict/words, as specified in the Filesystem Hierarchy Standard (FHS) for architecture-independent data, including optional word lists for spell checkers.[1] The file supports commands like look(1), which displays all lines (words) beginning with a specified string while ignoring case and non-alphanumeric characters, making it useful for tasks such as prefix-based word searches.

Historically rooted in early Unix implementations, it also serves as the core reference for spell-checking tools like spell(1), which compares input words against the dictionary to identify potential misspellings, with provisions for local extensions in /usr/local/share/dict/words.[2] Content varies across distributions: FreeBSD, for instance, includes a lexicon tailored for system documentation along with supplemental dictionaries, while the file's exact word count (often exceeding 100,000 entries) depends on the package providing it, such as bos.data and bos.txt in IBM AIX.

Beyond text processing, the words file plays a role in system security: it can serve as the dictionary list for password policies that prohibit common dictionary terms, thereby reducing the risk of guessable credentials. In AIX, this is configured through the dictionlist attribute in /etc/security/user, which references the file directly. Its presence underscores Unix's emphasis on modular, shareable components, allowing applications like editors (e.g., Vim with dictionary integration) and games to use a built-in linguistic resource without external dependencies.[3]

Overview

Description

The words file in Unix-like systems is a plain text file containing an alphabetical list of English words, with one entry per line, serving as a standard reference dictionary for system utilities.[4] It is commonly located at /usr/share/dict/words.[5] Its primary purpose is to provide a corpus of valid words for software tools, such as spell-checking programs and the look utility, which performs a binary search on the sorted content to find entries beginning with a given prefix while ignoring case and non-alphanumeric characters.[4][5] It also supports password validation mechanisms, which check candidate passwords against common dictionary terms.[6]

The file typically contains approximately 100,000 to 250,000 entries, depending on the distribution and word list variant. Many lists are drawn from sources like the SCOWL (Spell Checker Oriented Word Lists) project, which offers collections in several sizes.[7] Entries are mostly lowercase, although many lists keep proper nouns capitalized, and SCOWL-derived lists include possessive forms with apostrophes while excluding other punctuation. The opening lines are usually very short entries such as "a" and "aa"; the latter, a term for a type of Hawaiian lava (also written "a'a"), shows that the lists cover obscure words as well as everyday ones.
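
The prefix lookup described above can be sketched in Python. This is only an illustration of the idea, not the implementation of look(1): the names fold and look_prefix are hypothetical, the word list is held in memory rather than searched on disk, and the list is assumed to be sorted by the same folded key that is used for comparison.

```python
from bisect import bisect_left

def fold(word: str) -> str:
    """Comparison key: keep only alphanumeric characters, lowercased."""
    return "".join(ch for ch in word if ch.isalnum()).lower()

def look_prefix(prefix: str, words: list[str]) -> list[str]:
    """Return the entries whose folded form starts with the folded prefix.

    `words` must already be sorted by fold(); the binary search relies on it.
    """
    key = fold(prefix)
    # Building the key list is O(n); a real tool would bisect the file itself.
    keys = [fold(w) for w in words]
    start = bisect_left(keys, key)      # first entry that is >= the prefix
    matches = []
    for i in range(start, len(words)):
        if not keys[i].startswith(key):
            break                       # sorted input: no later entry can match
        matches.append(words[i])
    return matches

# Small hypothetical sample standing in for /usr/share/dict/words.
sample = sorted(["apple", "Apple's", "apply", "banana", "app", "apt"], key=fold)
print(look_prefix("APP", sample))
```

Because the input is sorted, the scan can stop at the first non-matching entry after the insertion point, which is what makes a sorted file attractive for prefix queries.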

File Location

In modern Unix-like systems, the words file is normally located at /usr/share/dict/words.[8] On older systems, such as early GNU/Linux distributions, it was commonly placed at /usr/dict/words. Variations exist across implementations, including /usr/share/lib/dict/words in Solaris and other System V derivatives.[9]

The file is accessed with standard filesystem commands; for example, cat /usr/share/dict/words displays its full contents, while grep can query it for patterns, such as grep '^apple$' /usr/share/dict/words to check for a specific entry. In distributions like Debian and Ubuntu, /usr/share/dict/words is a symbolic link that the dictionaries-common infrastructure maintains; the select-default-wordlist command lets administrators choose among the installed word list providers (e.g., wamerican or wbritish, both built from the scowl source package) without altering application configurations.[10] Permissions on the words file are typically set to 644 (owner read/write, group and others read-only), so that it is world-readable for spell-checkers and scripts while unprivileged users cannot modify it.[11]

Although neither the words file nor its conventional paths are mandated by POSIX, /usr/share/dict/words has emerged as a de facto standard in most Unix derivatives, including Linux and the BSDs, which promotes portability in tool development.[4]
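
Because the path differs between systems, a portable script may want to probe the conventional locations in turn. The following is a minimal sketch under that assumption; CANDIDATE_PATHS and find_words_file are illustrative names, and the list of paths is limited to the three mentioned above.

```python
import os

# Conventional locations named in the text, in order of preference.
CANDIDATE_PATHS = [
    "/usr/share/dict/words",
    "/usr/dict/words",
    "/usr/share/lib/dict/words",
]

def find_words_file(candidates=CANDIDATE_PATHS):
    """Return (path, resolved_path) for the first readable candidate, or None.

    resolved_path follows symbolic links, so on systems where the words file
    is a link it shows which provider file is actually in use.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path, os.path.realpath(path)
    return None

print(find_words_file())
```

On a system with no word list package installed the function returns None, so callers should handle that case rather than assume the file exists.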

History

Origins in Early Unix

The words file emerged during the initial phases of Unix development at Bell Laboratories in the mid-1970s, specifically to support the spell utility, a basic spell-checking program written by Stephen C. Johnson. This utility, introduced in Version 6 Unix in 1975, processed text by comparing words against a dedicated spelling list, marking a foundational step in Unix text processing capabilities. The file was placed in the /usr/dict directory, in line with early Unix conventions for system utilities and resources.

The initial word list was compiled by Johnson and later refined by M. Douglas McIlroy, drawing from diverse sources such as a standard desk dictionary, the Brown Corpus of American English (a collection of about one million words from 1960s texts), and supplementary lists including common proper nouns like surnames from phone directories. Focused on American English, the list began with approximately 24,000 entries, emphasizing common roots while excluding most inflected forms so that derived words could be checked algorithmically, for example by recognizing "committees" from the root "committee". This curation process involved iterative culling and field testing on thousands of documents, adding around 1,000 specialized terms to improve coverage without bloating the file, which was compressed to about 26,000 16-bit words for efficient storage and lookup.

By Unix Version 7 in 1979, the words file had become an integral component of the /usr/dict directory, directly supporting spell's operation in line with the Unix philosophy of crafting simple, modular tools that could be composed for complex tasks like document preparation.[12] The first documented applications of this setup demonstrated spell's effectiveness in identifying misspellings and derivational errors in technical memos and reports. This early integration exemplified how Bell Labs researchers, building on the core system designed by Ken Thompson and Dennis Ritchie, extended Unix with accessible text-handling tools.[13]

Evolution and Standardization

Following its origins in the spell utility developed at Bell Labs in the mid-1970s, the words file expanded significantly during the 1980s and 1990s in both BSD and System V Unix variants, where word lists grew from small initial sets to larger compilations as multiple dictionary sources were merged to support better spell-checking. Word counts increased over this period, often reaching tens of thousands of entries, driven by the need for broader linguistic coverage in diverse computing environments. The growth was further propelled by the influence of the Free Software Foundation (FSF) and the open source movement, which emphasized freely distributable tools and data files to foster interoperability in Unix-like systems.[14]

Efforts toward standardization emerged prominently with the Filesystem Hierarchy Standard (FHS). Its predecessor, the Filesystem Standard (FSSTND), was first released in 1994 by the Linux community; the FHS itself, introduced as version 2.0 in 1997, recommended placing the words file at /usr/share/dict/words as an optional component for architecture-independent data supporting utilities like spell checkers.[15] Later FHS versions, up to 3.0 in 2015, reinforced this location while allowing links to variant dictionaries (e.g., American or British English) to accommodate diverse user needs.[4][16]

Key milestones in this evolution include the integration of the words file into GNU spell utilities during the 1990s, where it served as the primary dictionary for the GNU clone of the traditional Unix spell command, aligning free software implementations with established practices.[17] In the 2000s, packages built from SCOWL (Spell Checker Oriented Word Lists) were introduced in Debian, providing modular updates to the word list through structured mergers of sources such as 12Dicts and variant databases, which made maintenance and expansion easier without altering core system files.[18] Despite these advances, challenges remain because POSIX has no formal specification for the words file: the standard focuses on interfaces and utilities and does not mandate filesystem contents like dictionary files, so adoption across Unix variants is voluntary and inconsistent.

Content and Format

Sources of the Word List

On many Linux distributions, the primary source for the word list in the words file is the Spell Checker Oriented Word Lists (SCOWL), developed by Kevin Atkinson starting in the early 2000s. SCOWL compiles English words from multiple public domain and open-licensed dictionaries, including the 12Dicts package, the UK Advanced Cryptics Dictionary (UKACD) by J. Ross Beresford, and variant databases like VARCON for American, British, Canadian, and Australian spellings.[7][19] Additional contributors to SCOWL and similar lists include the Moby Project, a public domain collection by Grady Ward featuring over 350,000 single words, compounds, and names, and the ENABLE (Enhanced North American Benchmark Lexicon) word list, which extends basic lists with inflected forms.[7][20] Some implementations add system-specific entries, such as proper nouns or regional terms, to tailor the list for local use.[19]

The compilation process uses automated scripts (typically in Perl for SCOWL) to merge source lists, apply frequency-based filtering (e.g., including words from high-popularity classes first), add inflections, and deduplicate the result so that each word appears once. In distributions like Debian, this is handled through the scowl and wordlist packages, which generate the final /usr/share/dict/words file from SCOWL outputs.[7][21]

Licensing for these sources is generally permissive, with most components in the public domain (e.g., Moby Project, 12Dicts, and ENABLE) or under open terms requiring only copyright attribution (e.g., SCOWL itself and UKACD). This enables free redistribution in Unix-like systems without proprietary restrictions.[19][20]
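
The merge-and-deduplicate step can be illustrated with a short Python sketch. It is not SCOWL's actual pipeline, which is script-driven and also performs frequency filtering and inflection handling; merge_word_lists is a hypothetical helper that shows only how several source lists collapse into one sorted list of unique entries.

```python
def merge_word_lists(*sources):
    """Merge iterables of words into one sorted list without duplicates.

    Each source may be an open file or a list of strings; surrounding
    whitespace and blank lines are discarded.
    """
    merged = set()
    for source in sources:
        for line in source:
            word = line.strip()
            if word:
                merged.add(word)
    # Sort case-insensitively, with the exact spelling as a tie-breaker so
    # that "Zebra" and "zebra" keep a stable relative order.
    return sorted(merged, key=lambda w: (w.lower(), w))

# Hypothetical inputs standing in for two source dictionaries.
list_a = ["apple\n", "Zebra\n", "\n"]
list_b = ["zebra", "apple", "bee "]
print(merge_word_lists(list_a, list_b))
```

Using a set for the intermediate result makes deduplication independent of the order in which the sources are read.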

Structure and Organization

The words file in Unix-like systems is a plain text document, traditionally encoded in ISO-8859-1 (Latin-1) with ASCII-compatible content for English words. It holds one word per line, with no definitions, pronunciations, or other metadata, and no punctuation beyond the apostrophes found in possessives and contractions.[22][7] This minimalist format ensures compatibility with text-processing tools and spell-checkers, allowing straightforward line-by-line parsing. For example, entries like "apple" or "computers" each appear alone on a line, which simplifies indexing and lookup.

The file is sorted in alphabetical order, with words predominantly in lowercase to enable case-insensitive searching in applications; capitalized entries may appear in some implementations for proper nouns.[22][7] This organization supports binary search algorithms, as used by utilities like look(1), which improves performance for prefix-based queries.

Within the file, common English words are selected based on frequency and utility for spell-checking, including inflected forms such as plurals (e.g., "cat" alongside "cats") and contractions (e.g., "don't") where they align with standard usage patterns, but excluding rare or archaic variants unless deemed essential.[7] Some versions, such as those derived from the SCOWL project, omit highly offensive terms to suit general-purpose environments. Word lengths typically range from 1 to 30 characters, capturing everyday vocabulary while avoiding overly specialized or contrived entries.[7] The structure follows principles from sources like SCOWL, prioritizing a clean layout optimized for queries.[7]
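
Since the format is simply one entry per line, reading it takes only a few lines of Python. The sketch below assumes Latin-1 encoding as described above (some systems ship UTF-8 lists instead, in which case the encoding argument should be changed); load_words and summarize are illustrative names rather than part of any standard tool.

```python
def load_words(path, encoding="latin-1"):
    """Read a words file into a list, one entry per line, skipping blank lines.

    Latin-1 maps every byte to a character, so decoding never fails, but a
    UTF-8 file read this way would show garbled non-ASCII letters.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]

def summarize(words):
    """Return a few shape statistics for a loaded word list."""
    return {
        "entries": len(words),
        "longest": max((len(w) for w in words), default=0),
        "with_apostrophe": sum("'" in w for w in words),
        "capitalized": sum(w[:1].isupper() for w in words),
    }

# Example usage on a real system (path is an assumption):
#   print(summarize(load_words("/usr/share/dict/words")))
```

Running summarize on the local file is a quick way to see how a particular distribution's list handles capitalization and possessives.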

Usage

Role in Spell-Checking Programs

The /usr/share/dict/words file functions as the primary dictionary for Unix spell-checking utilities, providing a comprehensive list of valid English words against which input text is validated to detect misspellings.[2] In tools like the spell(1) command, words extracted from documents or standard input are compared directly to this sorted list, and unmatched terms are output as potential errors in a sorted list without duplicates.[2] Introduced in Version 5 AT&T UNIX, the original spell utility relied on /usr/dict/words (or its equivalent path) as its core reference, performing a primitive check by flagging any word absent from the dictionary after preprocessing to remove punctuation and contractions.[2] This approach used simple sorted-file comparisons, often relying on commands like comm for efficiency in early implementations.[23]

Interactive spell-checkers such as ispell and the later hunspell extend this foundation by compiling the words file or similar lists into hashed or affix-enabled formats for faster lookups, loading them into memory to handle large-scale text processing without linear searches.[24][25] For instance, ispell employs buildhash to generate binary hash files from raw word lists, enabling rapid querying of root words and their morphological variants.[24]

To accommodate user needs and regional differences, these programs support personal dictionaries, stored in user-specific files such as ~/.ispell_hashfile (where hashfile stands for the name of the dictionary in use) or ~/.hunspell_en_US, to which custom terms can be added interactively during sessions.[24][25] Additionally, variants for British and American spellings are managed via dedicated files, such as /usr/share/dict/american and /usr/share/dict/british, which can be specified to override or supplement the main dictionary for locale-specific checking.[2]
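
The basic check performed by spell (extract words, compare them with the dictionary, report the unmatched ones sorted and without duplicates) can be approximated as follows. This is a simplified sketch, not the algorithm of any real spell implementation: it does no affix or derivation handling, folds everything to lowercase, and the tokenizing pattern WORD_RE is an assumption.

```python
import re

# Letters, optionally joined by internal apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def misspelled(text, dictionary):
    """Return the sorted, unique words in `text` that are not in `dictionary`.

    Comparison is case-insensitive; `dictionary` is any iterable of words.
    """
    known = {w.lower() for w in dictionary}
    seen = {m.group(0).lower() for m in WORD_RE.finditer(text)}
    return sorted(seen - known)

# Tiny hypothetical dictionary standing in for the words file.
dictionary = ["the", "quick", "fox", "don't", "brown", "jump"]
print(misspelled("The quick brwn fox don't jmp. The fox!", dictionary))
```

Loading the dictionary into a set gives constant-time membership tests, which is the same motivation behind the hashed formats used by ispell and hunspell.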

Applications in Scripting and Tools

The /usr/share/dict/words file serves as a versatile resource in Unix shell scripting, enabling developers to filter, manipulate, and generate text based on English vocabulary. For instance, grep can search for words matching specific patterns, such as those starting with a prefix and having a defined length: grep -E '^ex.{3}$' /usr/share/dict/words retrieves five-letter words beginning with "ex", like "exact" and "exalt".[26] (The -E option enables extended regular expressions, in which {3} is a repetition count.) These techniques are commonly employed in automation tasks, like generating word lists for testing or data processing pipelines.[27]

Random word selection from the file is facilitated by the shuf utility, part of GNU coreutils, which shuffles lines and outputs a specified number of them; for example, shuf -n 1 /usr/share/dict/words picks a single random entry for applications like temporary naming or simulation inputs.[28] In password generation scripts, multiple words can be selected for memorable passphrases, often combined with capitalization or delimiters to enhance security while maintaining readability.[29] The wc command can count the entries in the file, as in wc -l /usr/share/dict/words, which yields around 235,000 lines on systems that ship the Webster's-derived list and closer to 100,000 on many Linux distributions; scripts can use the count for sizing or validation.[30]

Beyond scripting, the file integrates with command-line games like hangman, a traditional Unix utility that selects words from /usr/share/dict/words for guessing challenges and tracks incorrect attempts with a visual gallows.[31] In security contexts, tools such as CrackLib build their dictionaries from this file using create-cracklib-dict /usr/share/dict/words, creating indexed databases used to detect weak passwords containing dictionary terms.[32] The PAM module pam_pwquality uses these CrackLib dictionaries via its dictcheck=1 option to enforce policies preventing dictionary-based passwords, ensuring compliance with strength requirements when passwords are set or changed.[33]
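
The scripting uses described above (pattern filtering, random selection for passphrases, and dictionary checks on passwords) can be sketched in Python. The function names are hypothetical, the substring test is far simpler than what CrackLib actually does, and the sample list stands in for the real file. A five-letter word starting with "ex" is "ex" followed by exactly three more characters, which is what the pattern in the example expresses.

```python
import re
import secrets

def matching(words, pattern):
    """Return the words that match the regular expression in full."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

def passphrase(words, count=4, sep="-"):
    """Join `count` words picked with a cryptographically secure RNG."""
    return sep.join(secrets.choice(words) for _ in range(count))

def contains_dictionary_word(password, words, min_len=4):
    """Naive check: does the password contain any word of min_len or more?

    This only looks for substrings, ignoring case; real checkers also try
    reversed and otherwise mangled forms.
    """
    lowered = password.lower()
    return any(len(w) >= min_len and w.lower() in lowered for w in words)

# Hypothetical sample standing in for /usr/share/dict/words.
words = ["exact", "exalt", "excel", "example", "ex", "apple"]
print(matching(words, r"ex.{3}"))
print(passphrase(words, count=3))
print(contains_dictionary_word("MyApple99", words))
```

The secrets module is used instead of random because passphrases are security-sensitive; for non-security uses such as test data, random.choice would be sufficient.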

Variations Across Systems

Differences in Linux Distributions

In Debian and derivatives like Ubuntu, the /usr/share/dict/words file is provided by the virtual wordlist package, with providers such as wamerican for American English or wbritish for British English, both derived from the SCOWL (Spell Checker Oriented Word Lists) project.[34][7] The file is a symlink to one of the provider files, such as the size 50 variant containing over 101,000 words. These variants allow users to choose locale-specific spellings, with British English alternatives emphasizing differences like "colour" over "color," and the file is installed via the apt package manager, often as an optional component not included in minimal server installations.

Red Hat Enterprise Linux and Fedora provide the words file primarily through the words package, which delivers a list of approximately 99,000 English words. It is installed using the dnf package manager (successor to yum) and remains optional in base or minimal setups to conserve space.

Arch Linux supplies the words file via the words package, which includes SCOWL-derived English lists alongside international variants, totaling about 13.8 MB installed and supporting customizable installations through related packages like aspell-en for spell-checking or AUR options such as words-insane for fuller SCOWL sizes (e.g., up to 95 with nearly all English words).[35][36] Users can select mini variants for smaller footprints (around 100,000 words) or full ones exceeding 200,000, managed via the pacman tool, and the package is typically excluded from core minimal environments to prioritize user choice.

Implementations in Other Unix-like OSes

In FreeBSD, NetBSD, and OpenBSD, the words file is located at /usr/share/dict/words and serves as the primary dictionary for spell-checking utilities. This file is derived from Webster's Second International Dictionary (web2), containing approximately 235,000 entries, including common English words, proper nouns, and inflected forms.[37][38][39] It is sourced from the base system distribution and integrates directly with tools like ispell, where it provides the core word list for validation and suggestions during spelling checks.[40] Administrators can supplement it with local dictionaries for enhanced coverage in applications such as documentation tools or custom scripts.

On macOS, which is built on Darwin, the words file resides at /usr/share/dict/words, symlinked to the web2 variant from Webster's Second International Dictionary, comprising over 235,000 words.[41] This legacy file supports command-line utilities like look and spell but is supplemented by Apple's Cocoa frameworks for graphical applications. In TextEdit, spell-checking uses the system's NSSpellChecker, which draws from this base list alongside dynamic user dictionaries and language-specific resources for real-time corrections.

Solaris and its derivative Illumos maintain the words file at /usr/share/lib/dict/words, a more compact dictionary tied to legacy spell-checking tools such as the original spell utility. This file provides a foundational list of common English terms for basic validation in system commands and password policies, emphasizing security checks against dictionary-based attacks rather than comprehensive linguistic coverage.[42]

For cross-platform consistency across Unix-like systems, package managers like Homebrew on macOS or the ports collections on BSD variants enable installation of standardized dictionary packages, such as those from the SCOWL project, allowing users to align word lists without relying solely on vendor-specific implementations.[43]

References
