Hubbry Logo
search
logo
Soundex
Soundex
current hub

Soundex

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as IBM Db2, PostgreSQL, MySQL, SQLite, Ingres, MS SQL Server, Oracle, ClickHouse, Snowflake and SAP ASE.) Improvements to Soundex are the basis for many modern phonetic algorithms.

Soundex was developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922. A variation, American Soundex, was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery, and especially when described in Donald Knuth's The Art of Computer Programming.

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261". "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P'), and "Honeyman" yields "H555".

The following algorithm is followed by most SQL languages (excluding PostgreSQL[example needed]):

The two algorithms above do not return the same results in all cases primarily because of the difference between when the vowels are removed. The first algorithm is used by most programming languages and the second is used by SQL. For example, "Tymczak" yields "T522" in the first algorithm, but "T520" in the algorithm used by SQL. Often, both algorithms generate the same code. As examples, both "Robert" and "Rupert" yield "R163" and "Honeyman" yields "H555". In designing an application, which combines SQL and a programming language, the architect must decide whether to do all of the Soundex encoding in the SQL server or all in the programming language. The MySQL implementation can return more than 4 characters.

See all
User Avatar
No comments yet.