Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1] The reverse process is speech recognition.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity.[citation needed] For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2]
The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written words on a home computer. The earliest computer operating system to have included a speech synthesizer was Unix in 1974, through the Unix speak utility.[3] In 2000, Microsoft Sam was the default text-to-speech voice synthesizer used by the narrator accessibility feature, which shipped with all Windows 2000 operating systems, and subsequent Windows XP systems.

A text-to-speech system (or "engine") is composed of two parts:[4] a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations),[5] which is then imposed on the output speech.
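The front-end stages described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any particular engine: the names `normalize`, `front_end` and `ProsodicUnit`, the tiny abbreviation table, and the ARPAbet-style toy lexicon are all hypothetical, and punctuation is used as the only cue for prosodic boundaries.

```python
import re
from dataclasses import dataclass

# Hypothetical toy tables; a real front-end uses far larger resources.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
ABBREVIATIONS = {"dr.": "doctor", "approx.": "approximately"}
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}


@dataclass
class ProsodicUnit:
    words: list      # normalized words in this phrase
    phonemes: list   # one phoneme list per word
    boundary: str    # punctuation mark that closed the phrase


def normalize(text):
    """Text normalization: expand abbreviations and digits into words."""
    out = []
    for token in text.lower().split():
        trailing = ""
        if token not in ABBREVIATIONS and token[-1] in ",.?!":
            token, trailing = token[:-1], token[-1]
        if token in ABBREVIATIONS:
            token = ABBREVIATIONS[token]
        elif token.isdigit():
            token = " ".join(DIGIT_WORDS[int(d)] for d in token)
        out.append(token + trailing)
    return " ".join(out)


def front_end(text):
    """Return the symbolic linguistic representation as prosodic units."""
    units = []
    for phrase, boundary in re.findall(r"([^,.?!]+)([,.?!]?)", normalize(text)):
        words = phrase.split()
        if not words:
            continue
        # Text-to-phoneme step: dictionary lookup with an unknown marker.
        phonemes = [LEXICON.get(word, ["<unk>"]) for word in words]
        units.append(ProsodicUnit(words, phonemes, boundary or "."))
    return units
```

A back-end would take the returned units and turn the phonemes and boundary marks into sound; that part is omitted here.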
History
Long before the invention of electronic signal processing, some people tried to build machines to emulate human speech.[6][better source needed] There were also legends of the existence of "Brazen Heads", such as those involving Pope Silvester II (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).[7]
In 1779, the German-Danish scientist Christian Gottlieb Kratzenstein won the first prize in a competition announced by the Russian Imperial Academy of Sciences and Arts for models he built of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation: [aː], [eː], [iː], [oː] and [uː]).[8] There followed the bellows-operated "acoustic-mechanical speech machine" of Wolfgang von Kempelen of Pressburg, Hungary, described in a 1791 paper.[9] This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1846, Joseph Faber exhibited the "Euphonia". In 1923, Paget resurrected Wheatstone's design.[10]
In the 1930s, Bell Labs developed the vocoder, which automatically analyzed speech into its fundamental tones and resonances. From his work on the vocoder, Homer Dudley developed a keyboard-operated voice-synthesizer called The Voder (Voice Demonstrator), which he exhibited at the 1939 New York World's Fair.
Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories built the Pattern playback in the late 1940s and completed it in 1950. There were several different versions of this hardware device; only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues discovered acoustic cues for the perception of phonetic segments (consonants and vowels).
Electronic devices
The first computer-based speech-synthesis systems originated in the late 1950s. Noriko Umeda et al. developed the first general English text-to-speech system in 1968, at the Electrotechnical Laboratory in Japan.[11] In 1961, physicist John Larry Kelly, Jr and his colleague Louis Gerstman[12] used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs.[citation needed] Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,[13] where the HAL 9000 computer sings the same song as astronaut Dave Bowman puts it to sleep.[14] Despite the success of purely electronic speech synthesis, research into mechanical speech-synthesizers continues.[15][independent source needed]
Linear predictive coding (LPC), a form of speech coding, began development with the work of Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966. Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.[16] LPC was later the basis for early speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.
In 1975, Fumitada Itakura developed the line spectral pairs (LSP) method for high-compression speech coding, while at NTT.[17][18][19] From 1975 to 1981, Itakura studied problems in speech analysis and synthesis based on the LSP method.[19] In 1980, his team developed an LSP-based speech synthesizer chip. LSP is an important technology for speech synthesis and coding, and in the 1990s was adopted by almost all international speech coding standards as an essential component, contributing to the enhancement of digital speech communication over mobile channels and the internet.[18]
In 1975, MUSA was released as one of the first speech synthesis systems. It consisted of stand-alone computer hardware and specialized software that enabled it to read Italian. A second version, released in 1978, was also able to sing Italian in an "a cappella" style.[20]
Dominant systems in the 1980s and 1990s were the DECtalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system;[21] the latter was one of the first multilingual language-independent systems, making extensive use of natural language processing methods.


Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976.[22][23] Other devices had primarily educational purposes, such as the Speak & Spell toy produced by Texas Instruments in 1978.[24] Fidelity released a speaking version of its electronic chess computer in 1979.[25] The first video game to feature speech synthesis was the 1980 shoot 'em up arcade game, Stratovox (known in Japan as Speak & Rescue), from Sun Electronics.[26][27] The first personal computer game with speech synthesis was Manbiki Shoujo (Shoplifting Girl), released in 1980 for the PET 2001, for which the game's developer, Hiroshi Suzuki, developed a "zero cross" programming technique to produce a synthesized speech waveform.[28] Another early example, the arcade version of Berzerk, also dates from 1980. The Milton Bradley Company produced the first multi-player electronic game using voice synthesis, Milton, in the same year.
In 1976, Computalker Consultants released their CT-1 Speech Synthesizer. Designed by D. Lloyd Rice and Jim Cooper, it was an analog synthesizer built to work with microcomputers using the S-100 bus standard.[29]
Synthesized voices typically sounded male until 1990, when Ann Syrdal, at AT&T Bell Laboratories, created a female voice.[30]
Kurzweil predicted in 2005 that as the cost-performance ratio caused speech synthesizers to become cheaper and more accessible, more people would benefit from the use of text-to-speech programs.[31]
Artificial intelligence
In September 2016, DeepMind released WaveNet, which demonstrated that deep learning models are capable of modeling raw waveforms and generating speech from acoustic features like spectrograms or mel-spectrograms, starting the field of deep learning speech synthesis. Although WaveNet was initially considered too computationally expensive and slow to be used in consumer products, a year after its release DeepMind unveiled a modified version known as "Parallel WaveNet", a production model 1,000 times faster than the original.[32] This was followed by Google AI's Tacotron 2 in 2018, which demonstrated that neural networks could produce highly natural speech synthesis but required substantial training data (typically tens of hours of audio) to achieve acceptable quality. Tacotron 2 employed an encoder-decoder architecture with attention mechanisms to convert input text into mel-spectrograms, which were then converted to waveforms using a separate neural vocoder. When trained on smaller datasets, such as 2 hours of speech, the output quality degraded but remained intelligible; with just 24 minutes of training data, Tacotron 2 failed to produce intelligible speech.[33]
In 2019, Microsoft Research introduced FastSpeech, which addressed speed limitations in autoregressive models like Tacotron 2.[34] In 2020, HiFi-GAN, a generative adversarial network (GAN)-based vocoder, improved the efficiency of waveform generation while producing high-fidelity speech.[35] The same year, Glow-TTS introduced a flow-based approach that allowed for both fast inference and voice style transfer capabilities.[36]
In March 2020, the free text-to-speech website 15.ai was launched. 15.ai gained widespread international attention in early 2021 for its ability to synthesize emotionally expressive speech of fictional characters from popular media with a minimal amount of data.[37][38][39] The creator of 15.ai stated that 15 seconds of training data is sufficient to perfectly clone a person's voice (hence its name, "15.ai"), a significant reduction from the previously known data requirement of tens of hours.[40] 15.ai is credited as the first platform to popularize AI voice cloning in memes and content creation.[41][42][40] In January 2022, the first instance of speech synthesis NFT fraud occurred when a cryptocurrency company called Voiceverse generated voice lines using 15.ai, pitched them up to sound unrecognizable, promoted them as the byproduct of their own technology, and sold them as NFTs without permission.[43][44][45][46]
In January 2023, ElevenLabs launched its browser-based text-to-speech platform, which employs advanced algorithms to analyze contextual aspects of text and detect emotions such as anger, sadness, happiness, or alarm.[47][48][49] The platform is able to adjust intonation and pacing based on linguistic context to produce lifelike speech with human-like inflection, and offers features including multilingual speech generation and long-form content creation.[50][51]
In March 2024, OpenAI corroborated the 15-second benchmark for cloning a human voice.[52] However, the company deemed its Voice Engine tool "too risky" for general release, stating that it would offer only a preview rather than release the technology for public use.[53]
Synthesizer technologies
The most important qualities of a speech synthesis system are naturalness and intelligibility.[54] Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.
Concatenation synthesis
Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.[55] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
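The idea of choosing the "best chain of candidate units" can be illustrated with a small dynamic-programming search. Note that this is a different technique from the weighted decision tree mentioned above: it is a Viterbi-style search that minimizes a target cost (how well a unit matches the requested pitch) plus a join cost (how smoothly adjacent units connect). The cost functions, the `pitch`-only unit description, and the function name `select_units` are simplifying assumptions made for illustration.

```python
def select_units(targets, database, join_weight=1.0):
    """Pick one database unit per target so that the total cost is minimal.

    targets:  list of (phone, desired_pitch_hz)
    database: dict mapping phone -> list of units {"id": str, "pitch": float}
    """
    if not targets:
        return []

    def target_cost(unit, pitch):
        return abs(unit["pitch"] - pitch)

    def join_cost(left, right):
        return abs(left["pitch"] - right["pitch"])

    # best[i][j] = (cheapest cost of a chain ending in candidate j, backpointer)
    best = []
    for i, (phone, pitch) in enumerate(targets):
        row = []
        for unit in database[phone]:
            cost = target_cost(unit, pitch)
            if i == 0:
                row.append((cost, None))
                continue
            previous = database[targets[i - 1][0]]
            totals = [best[i - 1][k][0] + join_weight * join_cost(prev, unit)
                      for k, prev in enumerate(previous)]
            k_best = min(range(len(totals)), key=totals.__getitem__)
            row.append((totals[k_best] + cost, k_best))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(database[targets[i][0]][j]["id"])
        j = best[i][j][1]
    return chain[::-1]
```

Setting `join_weight` to zero reduces the search to picking the closest unit for each target independently, which shows why the join cost matters: a locally ideal unit can be a poor choice if it connects badly to its neighbors.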
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.[56] Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.[57] Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.[58]
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA[59] or MBROLA,[60] or more recent techniques such as pitch modification in the source domain using discrete cosine transform.[61] Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining,[citation needed] although it continues to be used in research because there are a number of freely available software implementations. An early example of diphone synthesis is a teaching robot, Leachim, that was invented by Michael J. Freeman.[62] Leachim contained information regarding class curricula and certain biographical information about the students whom it was programmed to teach.[63] It was tested in a fourth grade classroom in the Bronx, New York.[64][65]
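As a small illustration of how a phone sequence maps onto the diphone inventory described above, the following sketch splits an utterance into its sound-to-sound transitions. The function name, the `-` separator, and the use of `_` for silence at utterance edges are assumptions chosen for readability, not a standard notation.

```python
def to_diphones(phones, silence="_"):
    """Return the diphones (adjacent phone pairs) needed for an utterance.

    The utterance is padded with silence so that the transitions into the
    first phone and out of the last phone are also covered.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]
```

For example, `to_diphones(["h", "e", "l", "o"])` yields `["_-h", "h-e", "e-l", "l-o", "o-_"]`. A diphone synthesizer would look up the single stored recording for each of these keys and then impose the target prosody with signal processing.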
Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.[66] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.[citation needed]
Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌklɪəɹˈʌʊt/). Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
Formant synthesis
Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis).[67] Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
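A rough sense of how a formant synthesizer builds a waveform from parameters alone can be given with a source-filter sketch. It assumes a simple impulse-train voice source and a cascade of two-pole digital resonators of the kind used in Klatt-style synthesizers; this is one common way to realize formants, not the only one. The function names and the formant frequencies and bandwidths used in the example are illustrative approximations.

```python
import math


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0, formants, duration=0.3, sample_rate=16000):
    """Generate a steady vowel from a pitch and a list of (freq, bandwidth)."""
    n = int(duration * sample_rate)
    period = int(sample_rate / f0)
    # Voiced source: one impulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Each resonator imposes one formant peak on the spectrum.
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# Approximate values for an "ah"-like vowel (assumed for illustration).
samples = synthesize_vowel(120, [(730, 60), (1090, 90), (2440, 150)])
```

Varying `f0` and the formant list over time, rather than holding them fixed as here, is what lets such a system produce different sounds and intonation contours without any recorded speech.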
Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines[68] and many Atari, Inc. arcade games[69] using the TMS5220 LPC chips. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.[70][when?]
For tonal languages, such as Chinese or Taiwanese, different levels of tone sandhi are required, and the output of a speech synthesizer sometimes contains tone sandhi errors.[71]
Articulatory synthesis
Articulatory synthesis consists of computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".
More recent synthesizers, developed by Jorge C. Lucero and colleagues, incorporate models of vocal fold biomechanics, glottal aerodynamics and acoustic wave propagation in the bronchi, trachea, nasal and oral cavities, and thus constitute full systems of physics-based speech simulation.[72][73]
HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.[74]
Sinewave synthesis
Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.[75]
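The replacement of formants by pure tones can be sketched as a sum of sinusoids whose frequencies follow the formant tracks. The input format (one list of `(frequency, amplitude)` pairs per 10 ms frame), the function name, and the example values are assumptions for illustration; in practice the tracks would be measured from a recording of natural speech.

```python
import math


def sinewave_speech(formant_tracks, frame_duration=0.01, sample_rate=16000):
    """Render formant tracks as a sum of time-varying pure tones.

    formant_tracks: list of frames; each frame is a list of
                    (frequency_hz, amplitude) pairs, one per formant.
    """
    n_tones = len(formant_tracks[0])
    phases = [0.0] * n_tones  # running phase keeps each tone continuous
    samples_per_frame = int(frame_duration * sample_rate)
    out = []
    for frame in formant_tracks:
        for _ in range(samples_per_frame):
            value = 0.0
            for k, (freq, amp) in enumerate(frame):
                phases[k] += 2.0 * math.pi * freq / sample_rate
                value += amp * math.sin(phases[k])
            out.append(value)
    return out


# Three steady tones standing in for F1, F2 and F3 (illustrative values).
tracks = [[(500, 0.5), (1500, 0.3), (2500, 0.2)]] * 10
audio = sinewave_speech(tracks)
```

Accumulating phase per tone, instead of recomputing `sin(2πft)` from scratch in each frame, avoids clicks when a formant frequency changes between frames.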
Deep learning-based synthesis
Deep learning speech synthesis uses deep neural networks (DNN) to produce artificial speech from text (text-to-speech) or spectrum (vocoder). The deep neural networks are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.
Audio deepfakes
In 2023, VICE reporter Joseph Cox published findings that he had recorded five minutes of himself talking and then used a tool developed by ElevenLabs to create voice deepfakes that defeated a bank's voice-authentication system.[83]
Challenges
Text normalization challenges
The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are unreliable, poorly understood, and computationally ineffective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
Recently TTS systems have begun to use HMMs (discussed above) to generate "parts of speech" to aid in disambiguating homographs. This technique is quite successful for many cases such as whether "read" should be pronounced as "red" implying past tense, or as "reed" implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to required training corpora is frequently difficult in these languages.
Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.[84] Roman numerals can also be read differently depending on context. For example, "Henry VIII" reads as "Henry the Eighth", while "Chapter VIII" reads as "Chapter Eight".
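The different readings of "1325" can be reproduced with a short sketch that chooses an expansion based on the preceding word. The cue-word sets and function names below are hypothetical, and the code only handles numbers up to 9999; a real front end would use much richer context than a single neighboring word.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical cue words for choosing a reading style.
YEAR_CUES = {"in", "since", "year"}
DIGIT_CUES = {"pin", "code", "room"}


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def digit_by_digit(digits):
    """Read each digit separately, as for codes and identifiers."""
    return " ".join(ONES[int(d)] for d in digits)


def as_pairs(n):
    """Read a four-digit number as two pairs, the way years are often read."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def expand_number(digits, previous_word=""):
    """Choose a reading for a digit string from the word before it."""
    cue = previous_word.lower()
    if cue in DIGIT_CUES:
        return digit_by_digit(digits)
    if cue in YEAR_CUES and len(digits) == 4:
        return as_pairs(int(digits))
    return cardinal(int(digits))
```

With these definitions, `expand_number("1325", "in")` gives "thirteen twenty-five", `expand_number("1325", "PIN")` gives "one three two five", and without a recognized cue the result is "one thousand three hundred twenty-five".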
Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs, such as "Ulysses S. Grant" being rendered as "Ulysses South Grant".
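An "educated guess" of this kind can be as simple as looking at the next token. The sketch below handles only the "St" example from the text and uses a single assumed heuristic: "St" followed by a capitalized word is read as "Saint", and otherwise as "Street". The function name is hypothetical.

```python
def expand_st(tokens):
    """Expand the abbreviation "St" using the following token as context."""
    out = []
    for i, token in enumerate(tokens):
        if token.rstrip(".") == "St":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: a capitalized name after "St" means "Saint".
            out.append("Saint" if following[:1].isupper() else "Street")
        else:
            out.append(token)
    return out
```

Applied to `["12", "St", "John", "St."]` this returns `["12", "Saint", "John", "Street"]`. The heuristic is fragile: a street name followed by another capitalized word would be misread, which is the kind of error that motivates statistical disambiguation.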
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.
Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
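A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The exception dictionary, the letter-to-sound table, and the ARPAbet-style symbols below are deliberately tiny and hypothetical; they exist only to show how the two methods complement each other.

```python
# Hypothetical exception dictionary for irregular words.
LEXICON = {
    "of": ["AH", "V"],
    "one": ["W", "AH", "N"],
}

# Hypothetical letter-to-sound rules; digraphs must be tried before letters.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH",
    "a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH",
    "b": "B", "d": "D", "f": "F", "g": "G", "k": "K", "l": "L",
    "m": "M", "n": "N", "p": "P", "r": "R", "s": "S", "t": "T",
}
ORDERED_RULES = sorted(RULES.items(), key=lambda item: -len(item[0]))


def rule_based(word):
    """Apply letter-to-sound rules from left to right, longest match first."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in ORDERED_RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phones


def grapheme_to_phoneme(word):
    """Use the dictionary when possible, otherwise fall back to the rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return rule_based(word)
```

Here `grapheme_to_phoneme("fish")` is handled by the rules and gives `["F", "IH", "SH"]`, while "of" is taken from the dictionary as `["AH", "V"]`. Calling `rule_based("of")` directly gives `["AA", "F"]`, which illustrates why irregular words need dictionary entries.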
Languages with a phonemic orthography have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and loanwords, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that are not in their dictionaries.
Evaluation challenges
The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.
Since 2005, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset.[85]
Prosodics and emotional content
A study in the journal Speech Communication by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling.[86][87][88] It was suggested that identification of the vocal features that signal emotional content may be used to help make synthesized speech sound more natural. One of the related issues is modification of the pitch contour of the sentence, depending upon whether it is an affirmative, interrogative or exclamatory sentence. One of the techniques for pitch modification[61] uses discrete cosine transform in the source domain (linear prediction residual). Such pitch synchronous pitch modification techniques need a priori pitch marking of the synthesis speech database using techniques such as epoch extraction using dynamic plosion index applied on the integrated linear prediction residual of the voiced regions of speech.[89] In general, prosody remains a challenge for speech synthesizers, and is an active research topic.
Dedicated hardware
- Icophone
- General Instrument SP0256-AL2
- National Semiconductor DT1050 Digitalker (Mozer – Forrest Mozer)
- Texas Instruments LPC Speech Chips[90]
Hardware and software systems
Popular systems that have offered speech synthesis as a built-in capability include the following.
Texas Instruments
In the early 1980s, TI was known as a pioneer in speech synthesis, and a highly popular plug-in speech synthesizer module was available for the TI-99/4 and 4A. Speech synthesizers were offered free with the purchase of a number of cartridges and were used by many TI-written video games (games offered with speech during this promotion included Alpiner and Parsec). The synthesizer uses a variant of linear predictive coding and has a small in-built vocabulary. The original intent was to release small cartridges that plugged directly into the synthesizer unit, which would increase the device's built-in vocabulary. However, the success of software text-to-speech in the Terminal Emulator II cartridge canceled that plan.
Mattel
The Mattel Intellivision game console offered the Intellivoice Voice Synthesis module in 1982. It included the SP0256 Narrator speech synthesizer chip on a removable cartridge. The Narrator had 2 kB of read-only memory (ROM), which was used to store a database of generic words that could be combined to make phrases in Intellivision games. Since the Orator chip could also accept speech data from external memory, any additional words or phrases needed could be stored inside the cartridge itself. The data consisted of strings of analog-filter coefficients to modify the behavior of the chip's synthetic vocal-tract model, rather than simple digitized samples.
SAM
Also released in 1982, Software Automatic Mouth was the first commercial all-software voice synthesis program. It was later used as the basis for MacInTalk. The program was available for non-Macintosh Apple computers (including the Apple II and the Lisa), various Atari models and the Commodore 64. The Apple version preferred additional hardware that contained DACs, although it could instead use the computer's one-bit audio output (with the addition of much distortion) if the card was not present. The Atari version made use of the embedded POKEY audio chip. Speech playback on the Atari normally disabled interrupt requests and shut down the ANTIC chip during vocal output; when the screen is left on, the audible output is extremely distorted speech. The Commodore 64 version made use of the 64's embedded SID audio chip.
Atari
Arguably, the first speech system integrated into an operating system was that of the unreleased Atari 1400XL/1450XL computers, circa 1983. These used the Votrax SC01 chip and a finite-state machine to enable World English Spelling text-to-speech synthesis.[91]
The Atari ST computers were sold with "stspeech.tos" on floppy disk.
Apple
The first speech system integrated into an operating system that shipped in quantity was Apple Computer's MacInTalk. The software was licensed from third-party developers Joseph Katz and Mark Barton (later, SoftVoice, Inc.) and was featured during the 1984 introduction of the Macintosh computer. This January demo required 512 kilobytes of RAM, so it could not run in the 128 kilobytes of RAM the first Mac actually shipped with.[92] The demo was therefore run on a prototype 512k Mac, although those in attendance were not told of this, and the synthesis demo created considerable excitement for the Macintosh. In the early 1990s, Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, Apple included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a fully supported program, PlainTalk, for people with vision problems. VoiceOver was first featured in 2005 in Mac OS X Tiger (10.4). During 10.4 (Tiger) and the first releases of 10.5 (Leopard), only one standard voice shipped with Mac OS X. Starting with 10.6 (Snow Leopard), the user can choose from a wide range of voices. VoiceOver voices take realistic-sounding breaths between sentences and offer improved clarity at high read rates compared with PlainTalk. Mac OS X also includes say, a command-line application that converts text to audible speech. The AppleScript Standard Additions include a say verb that allows a script to use any of the installed voices and to control the pitch, speaking rate and modulation of the spoken text.
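As an illustration of driving the say command from a script, the following minimal Python sketch builds an argument list and runs it only when the command is present. The helper names build_say_command and speak are hypothetical, the voice name is only an example that may not be installed on a given machine, and the sketch assumes the -v (voice), -r (rate in words per minute) and -o (output file) options of the macOS say tool.

```python
import shutil
import subprocess


def build_say_command(text, voice=None, rate_wpm=None, output_file=None):
    """Build the argument list for the macOS `say` command-line tool."""
    cmd = ["say"]
    if voice is not None:
        cmd += ["-v", voice]          # choose one of the installed voices
    if rate_wpm is not None:
        cmd += ["-r", str(rate_wpm)]  # speaking rate in words per minute
    if output_file is not None:
        cmd += ["-o", output_file]    # write audio to a file instead of playing it
    cmd.append(text)
    return cmd


def speak(text, **options):
    """Speak `text` if `say` is available; return False on other systems."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, **options), check=True)
    return True
```

For example, speak("Hello from the command line", voice="Alex", rate_wpm=180) would speak the sentence on a Mac that has that voice installed, and simply return False elsewhere.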
Amazon
Amazon's speech synthesis is used in Alexa and has been offered as software as a service in AWS[93] (from 2017).
AmigaOS
The second operating system to feature advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from SoftVoice, Inc., who also developed the original MacInTalk text-to-speech system. It featured a complete system of voice emulation for American English, with both male and female voices and "stress" indicator markers, made possible through the Amiga's audio chipset.[94] The synthesis system was divided into a translator library, which converted unrestricted English text into a standard set of phonetic codes, and a narrator device, which implemented a formant model of speech generation. AmigaOS also featured a high-level "Speak Handler", which allowed command-line users to redirect text output to speech. Speech synthesis was occasionally used in third-party programs, particularly word processors and educational software. The synthesis software remained largely unchanged from the first AmigaOS release, and Commodore eventually removed speech synthesis support from AmigaOS 2.1 onward.
Despite the American English phoneme limitation, an unofficial version with multilingual speech synthesis was developed. This made use of an enhanced version of the translator library which could translate a number of languages, given a set of rules for each language.[95]
Microsoft Windows
Modern Windows desktop systems can use SAPI 4 and SAPI 5 components to support speech synthesis and speech recognition. SAPI 4.0 was available as an optional add-on for Windows 95 and Windows 98. Windows 2000 added Narrator, a text-to-speech utility for people who have visual impairment. Third-party programs such as JAWS for Windows, Window-Eyes, NonVisual Desktop Access, Supernova and System Access can perform various text-to-speech tasks, such as reading text aloud from a specified website, email account, text document, the Windows clipboard or the user's keyboard typing. Not all programs can use speech synthesis directly.[96] Some programs can use plug-ins, extensions or add-ons to read text aloud. Third-party programs are available that can read text from the system clipboard.
Microsoft Speech Server is a server-based package for voice synthesis and recognition. It is designed for network use with web applications and call centers.
Votrax
From 1971 to 1996, Votrax produced a number of commercial speech synthesizer components. A Votrax synthesizer was included in the first generation Kurzweil Reading Machine for the Blind.
Text-to-speech systems
Text-to-speech (TTS) refers to the ability of computers to read text aloud. A TTS engine converts written text to a phonemic representation, then converts the phonemic representation to waveforms that can be output as sound. TTS engines with different languages, dialects and specialized vocabularies are available through third-party publishers.[97]
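The two-stage conversion described above (text to phonemes, then phonemes to a waveform) can be sketched as a toy pipeline. Everything in this example is an illustrative assumption: the tiny lexicon, the ARPAbet-style phoneme labels and the use of a fixed sine tone per phoneme stand in for the pronunciation dictionaries, letter-to-sound rules and acoustic models that a real engine would use.

```python
import math

# Hypothetical toy lexicon; a real engine combines a large pronunciation
# dictionary with rules for words that are not listed.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Placeholder acoustics: each phoneme becomes a short sine tone. The
# frequencies are arbitrary illustration values, not measured speech data.
PHONEME_TONE_HZ = {
    "HH": 200.0, "AH": 700.0, "L": 350.0, "OW": 500.0,
    "W": 300.0, "ER": 450.0, "D": 250.0,
}

SAMPLE_RATE = 8000
SAMPLES_PER_PHONEME = 800  # 0.1 s per phoneme at 8 kHz


def text_to_phonemes(text):
    """Stage 1: map written words to a flat phoneme sequence."""
    phonemes = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?;:")
        if not word:
            continue
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
        phonemes.extend(LEXICON[word])
    return phonemes


def phonemes_to_waveform(phonemes):
    """Stage 2: render each phoneme as samples in the range [-1, 1]."""
    samples = []
    for phoneme in phonemes:
        freq = PHONEME_TONE_HZ[phoneme]
        samples.extend(
            math.sin(2.0 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(SAMPLES_PER_PHONEME)
        )
    return samples


waveform = phonemes_to_waveform(text_to_phonemes("Hello, world!"))
```

The output is a sequence of tones rather than intelligible speech; the point of the sketch is only the separation between the symbolic stage and the signal-generating stage.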
Android
Version 1.6 of Android added support for speech synthesis (TTS).[98]
Internet
Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser or Google Toolbar. Some specialized software can narrate RSS feeds. Online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts, and they are available on almost any personal computer connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.
A growing field in Internet-based TTS is web-based assistive technology, e.g. 'Browsealoud' from a UK company and Readspeaker. It can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The non-profit project Pediaphon was created in 2006 to provide a similar web-based TTS interface to Wikipedia.[99]
Other work is being done in the context of the W3C through the W3C Audio Incubator Group with the involvement of the BBC and Google Inc.
Open source
Some open-source software systems are available, such as:
- eSpeak, which supports a broad range of languages.
- Festival Speech Synthesis System, which uses diphone-based synthesis as well as more modern techniques.
- gnuspeech, from the Free Software Foundation, which uses articulatory synthesis.[100]
Others
- Following the commercial failure of the hardware-based Intellivoice, gaming developers sparingly used software synthesis in later games.[citation needed] Earlier systems from Atari, such as the Atari 5200 (Baseball) and the Atari 2600 (Quadrun and Open Sesame), also had games utilizing software synthesis.[citation needed]
- Some e-book readers, such as the Amazon Kindle, Samsung E6, PocketBook eReader Pro, enTourage eDGe, and the Bebook Neo, include text-to-speech.
- The BBC Micro incorporated the Texas Instruments TMS5220 speech synthesis chip.
- Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.[101]
- IBM's OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.
- GPS Navigation units produced by Garmin, Magellan, TomTom and others use speech synthesis for automobile navigation.
- Yamaha produced a music synthesizer in 1999, the Yamaha FS1R, which included a formant synthesis capability. Sequences of up to 512 individual vowel and consonant formants could be stored and replayed, allowing short vocal phrases to be synthesized.
Digital sound-alikes
At the 2018 Conference on Neural Information Processing Systems (NeurIPS), researchers from Google presented the work 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', which transfers learning from speaker verification to achieve text-to-speech synthesis that can be made to sound almost like anybody from a speech sample of only 5 seconds.[102]
Researchers from Baidu Research also presented a voice cloning system with similar aims at the 2018 NeurIPS conference,[103] though the result is rather unconvincing.
By 2019, digital sound-alikes had found their way into the hands of criminals: Symantec researchers knew of three cases in which digital sound-alike technology had been used for crime.[104][105]
This adds to the disinformation problem, together with the facts that:
- Human image synthesis has improved since the early 2000s to the point that a human cannot tell a real human imaged with a real camera from a simulation of a human imaged with a simulation of a camera.
- 2D video forgery techniques were presented in 2016 that allow near real-time counterfeiting of facial expressions in existing 2D video.[106]
- At SIGGRAPH 2017, an audio-driven digital look-alike of the upper torso of Barack Obama was presented by researchers from the University of Washington. After a training phase that acquired lip sync and wider facial information from 2D videos with audio, the animation was driven only by a voice track as source data.[107]
Speech synthesis markup languages
A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.[citation needed]
Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.[citation needed]
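To give a sense of what such markup looks like, the following minimal Python sketch generates a small SSML document using the standard library. The helper name build_ssml and its parameters are hypothetical; the sketch assumes the speak, prosody, s and break elements and the http://www.w3.org/2001/10/synthesis namespace defined by SSML, and it does not validate the result against the specification.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(sentences, lang="en-US", rate="medium", pause_ms=300):
    """Wrap sentences in SSML with a speaking rate and pauses between them."""
    # Serialize SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    prosody = ET.SubElement(speak, f"{{{SSML_NS}}}prosody", {"rate": rate})
    for index, sentence in enumerate(sentences):
        s = ET.SubElement(prosody, f"{{{SSML_NS}}}s")
        s.text = sentence
        # Insert an explicit pause between sentences, but not after the last.
        if index < len(sentences) - 1:
            ET.SubElement(prosody, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml_document = build_ssml(["Hello.", "This is synthesized speech."], rate="slow")
```

The markup carries only instructions about how the text should be spoken; an SSML-capable synthesizer is still needed to turn the document into audio.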
Applications
Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been in screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading disabilities as well as by pre-literate children.[108] They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.[109] Work to personalize a synthetic voice to better match a person's personality or historical voice is becoming available.[110] A noted application of speech synthesis was the Kurzweil Reading Machine for the Blind, which incorporated text-to-phonetics software based on work from Haskins Laboratories and a black-box synthesizer built by Votrax.[111]

Speech synthesis techniques are also used in entertainment productions such as games and animations. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications.[112] The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of characters from the Japanese anime series Code Geass: Lelouch of the Rebellion R2.[113] 15.ai has been frequently used for content creation in various fandoms, including the My Little Pony: Friendship Is Magic fandom, the Team Fortress 2 fandom, the Portal fandom, and the SpongeBob SquarePants fandom.[114][115][116]
Text-to-speech aids for disability and impaired communication have become widely available. Text-to-speech is also finding new applications; for example, speech synthesis combined with speech recognition allows for interaction with mobile devices via natural language processing interfaces. Some users have also created AI virtual assistants using 15.ai and external voice control software.[37][117]
Text-to-speech is also used in second language acquisition. Voki, for instance, is an educational tool created by Oddcast that allows users to create their own talking avatars, using different accents. The avatars can be emailed, embedded on websites or shared on social media.
Content creators have used voice cloning tools to recreate their voices for podcasts,[118][119] narration,[49] and comedy shows.[120][121][122] Publishers and authors have also used such software to narrate audiobooks and newsletters.[123][124] Another area of application is AI video creation with talking heads. Webapps and video editors like Elai.io or Synthesia allow users to create video content involving AI avatars, who are made to speak using text-to-speech technology.[125][126]
Speech synthesis is a valuable computational aid for the analysis and assessment of speech disorders. A voice quality synthesizer, developed by Jorge C. Lucero et al. at the University of Brasília, simulates the physics of phonation and includes models of vocal frequency jitter and tremor, airflow noise and laryngeal asymmetries.[72] The synthesizer has been used to mimic the timbre of dysphonic speakers with controlled levels of roughness, breathiness and strain.[73]
Singing synthesis
See also
References
- ^ Allen, Jonathan; Hunnicutt, M. Sharon; Klatt, Dennis (1987). From Text to Speech: The MITalk system. Cambridge University Press. ISBN 978-0-521-30641-6.
- ^ Rubin, P.; Baer, T.; Mermelstein, P. (1981). "An articulatory synthesizer for perceptual research". Journal of the Acoustical Society of America. 70 (2): 321–328. Bibcode:1981ASAJ...70..321R. doi:10.1121/1.386780.
- ^ McIlroy, M. D. (1974-04-01). "Synthetic English speech by rule". The Journal of the Acoustical Society of America. 55 (S1): S55 – S56. Bibcode:1974ASAJ...55R..55M. doi:10.1121/1.1919804. ISSN 0001-4966.
- ^ van Santen, Jan P. H.; Sproat, Richard W.; Olive, Joseph P.; Hirschberg, Julia (1997). Progress in Speech Synthesis. Springer. ISBN 978-0-387-94701-3.
- ^ Van Santen, J. (April 1994). "Assignment of segmental duration in text-to-speech synthesis". Computer Speech & Language. 8 (2): 95–128. doi:10.1006/csla.1994.1005.
- ^ Gomede, Everton (2024-03-10). "The Evolution of Speech Synthesis through Deep Learning". The Modern Scientist. Medium. Retrieved 2025-09-08.
- ^ ""You Are My Friend": Early Androids and Artificial Speech". The Public Domain Review. Retrieved 2025-09-08.
- ^ History and Development of Speech Synthesis, Helsinki University of Technology, Retrieved on November 4, 2006
- ^ Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine ("Mechanism of the human speech with description of its speaking machine", J. B. Degen, Wien). (in German)
- ^ Mattingly, Ignatius G. (1974). Sebeok, Thomas A. (ed.). "Speech synthesis for phonetic and phonological models" (PDF). Current Trends in Linguistics. 12. Mouton, The Hague: 2451–2487. Archived from the original (PDF) on 2013-05-12. Retrieved 2011-12-13.
- ^ Klatt, D (1987). "Review of text-to-speech conversion for English". Journal of the Acoustical Society of America. 82 (3): 737–93. Bibcode:1987ASAJ...82..737K. doi:10.1121/1.395275. PMID 2958525.
- ^ Lambert, Bruce (March 21, 1992). "Louis Gerstman, 61, a Specialist In Speech Disorders and Processes". The New York Times.
- ^ "Arthur C. Clarke Biography". Archived from the original on December 11, 1997. Retrieved 5 December 2017.
- ^ "Where "HAL" First Spoke (Bell Labs Speech Synthesis website)". Bell Labs. Archived from the original on 2000-04-07. Retrieved 2010-02-17.
- ^ Anthropomorphic Talking Robot Waseda-Talker Series Archived 2016-03-04 at the Wayback Machine
- ^ Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346. Archived (PDF) from the original on 2022-10-09.
- ^ Zheng, F.; Song, Z.; Li, L.; Yu, W. (1998). "The Distance Measure for Line Spectrum Pairs Applied to Speech Recognition" (PDF). Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98) (3): 1123–6. Archived (PDF) from the original on 2022-10-09.
- ^ a b "List of IEEE Milestones". IEEE. Retrieved 15 July 2019.
- ^ a b "Fumitada Itakura Oral History". IEEE Global History Network. 20 May 2009. Retrieved 2009-07-21.
- ^ Billi, Roberto; Canavesio, Franco; Ciaramella, Alberto; Nebbia, Luciano (1 November 1995). "Interactive voice technology at work: The CSELT experience". Speech Communication. 17 (3): 263–271. doi:10.1016/0167-6393(95)00030-R.
- ^ Sproat, Richard W. (1997). Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Springer. ISBN 978-0-7923-8027-6.
- ^ TSI Speech+ & other speaking calculators
- ^ Gevaryahu, Jonathan. "TSI S14001A Speech Synthesizer LSI Integrated Circuit Guide" [dead link]
- ^ Breslow, et al. US 4326710: "Talking electronic game", April 27, 1982
- ^ Voice Chess Challenger
- ^ Gaming's most important evolutions Archived 2011-06-15 at the Wayback Machine, GamesRadar
- ^ Adlum, Eddie (November 1985). "The Replay Years: Reflections from Eddie Adlum". RePlay. Vol. 11, no. 2. pp. 134-175 (160-3).
- ^ Szczepaniak, John (2014). The Untold History of Japanese Game Developers. Vol. 1. SMG Szczepaniak. pp. 544–615. ISBN 978-0992926007.
- ^ "A Short History of Computalker". Smithsonian Speech Synthesis History Project.
- ^ Metz, Cade (2020-08-20). "Ann Syrdal, Who Helped Give Computers a Female Voice, Dies at 74". The New York Times. Retrieved 2020-08-23.
- ^ Kurzweil, Raymond (2005). The Singularity is Near. Penguin Books. ISBN 978-0-14-303788-0.
- ^ van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
- ^ "Audio samples from "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis"". 2018-08-30. Archived from the original on 2020-11-11. Retrieved 2022-06-05.
- ^ Ren, Yi (2019). "FastSpeech: Fast, Robust and Controllable Text to Speech". arXiv:1905.09263 [cs.CL].
- ^ Kong, Jungil (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". arXiv:2010.05646 [cs.SD].
- ^ Kim, Jaehyeon (2020). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". arXiv:2005.11129 [eess.AS].
- ^ a b Kurosawa, Yuki (January 19, 2021). "ゲームキャラ音声読み上げソフト「15.ai」公開中。『Undertale』や『Portal』のキャラに好きなセリフを言ってもらえる" [Game Character Voice Reading Software "15.ai" Now Available. Get Characters from Undertale and Portal to Say Your Desired Lines]. AUTOMATON (in Japanese). Archived from the original on January 19, 2021. Retrieved December 18, 2024.
- ^ 遊戲, 遊戲角落 (January 20, 2021). "這個AI語音可以模仿《傳送門》GLaDOS講出任何對白!連《Undertale》都可以學" [This AI Voice Can Imitate Portal's GLaDOS Saying Any Dialog! It Can Even Learn Undertale]. United Daily News (in Chinese (Taiwan)). Archived from the original on December 19, 2024. Retrieved December 18, 2024.
- ^ Lamorlette, Robin (January 25, 2021). "Insolite : un site permet de faire dire ce que vous souhaitez à GlaDOS (et à d'autres personnages de jeux vidéo)" [Unusual: A site lets you make GlaDOS (and other video game characters) say whatever you want]. Clubic (in French). Archived from the original on January 19, 2025. Retrieved March 23, 2025.
- ^ a b Temitope, Yusuf (December 10, 2024). "15.ai Creator reveals journey from MIT Project to internet phenomenon". The Guardian. Archived from the original on December 28, 2024. Retrieved December 25, 2024.
- ^ Anirudh VK (March 18, 2023). "Deepfakes Are Elevating Meme Culture, But At What Cost?". Analytics India Magazine. Archived from the original on December 26, 2024. Retrieved December 18, 2024.
- ^ Wright, Steven (March 21, 2023). "Why Biden, Trump, and Obama Arguing Over Video Games Is YouTube's New Obsession". Inverse. Archived from the original on December 20, 2024. Retrieved December 18, 2024.
- ^ Innes, Ruby (January 18, 2022). "Voiceverse Is The Latest NFT Company Caught Using Someone Else's Content". Kotaku Australia. Archived from the original on July 26, 2024. Retrieved February 28, 2025.
- ^ Phillips, Tom (January 17, 2022). "Troy Baker-backed NFT firm admits using voice lines taken from another service without permission". Eurogamer. Archived from the original on January 17, 2022. Retrieved December 31, 2024.
- ^ Williams, Demi (January 18, 2022). "Voiceverse NFT admits to taking voice lines from non-commercial service". NME. Archived from the original on January 18, 2022. Retrieved December 18, 2024.
- ^ Lam, Khoa (January 14, 2022). "Incident 277: Voices Created Using Publicly Available App Stolen and Resold as NFT without Attribution". AI Incident Database. Archived from the original on January 13, 2025. Retrieved February 27, 2025.
- ^ "Generative AI comes for cinema dubbing: Audio AI startup ElevenLabs raises pre-seed". Sifted. January 23, 2023. Retrieved 2023-02-03.
- ^ WIRED Staff. "This Podcast Is Not Hosted by AI Voice Clones. We Swear". Wired. ISSN 1059-1028. Retrieved 2023-07-25.
- ^ a b Ashworth, Boone (April 12, 2023). "AI Can Clone Your Favorite Podcast Host's Voice". Wired. Retrieved 2023-04-25.
- ^ Wiggers, Kyle (2023-06-20). "Voice-generating platform ElevenLabs raises $19M, launches detection tool". TechCrunch. Retrieved 2023-07-25.
- ^ Bonk, Lawrence. "ElevenLabs' Powerful New AI Tool Lets You Make a Full Audiobook in Minutes". Lifewire. Retrieved 2023-07-25.
- ^ "Navigating the Challenges and Opportunities of Synthetic Voices". OpenAI. March 9, 2024. Archived from the original on November 25, 2024. Retrieved December 18, 2024.
- ^ Hern, Alex (March 31, 2024). "OpenAI deems its voice cloning tool too risky for general release". The Guardian. Retrieved September 30, 2025.
- ^ Taylor, Paul (2009). Text-to-speech synthesis. Cambridge, UK: Cambridge University Press. p. 3. ISBN 9780521899277.
- ^ Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002.
- ^ John Kominek and Alan W. Black. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
- ^ Julia Zhang. Language Generation and Speech Synthesis in Dialogues for Language Learning, masters thesis, Section 5.6 on page 54.
- ^ William Yang Wang and Kallirroi Georgila. (2011). Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis, IEEE ASRU 2011.
- ^ "Pitch-Synchronous Overlap and Add (PSOLA) Synthesis". Archived from the original on February 22, 2007. Retrieved 2008-05-28.
- ^ T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes. ICSLP Proceedings, 1996.
- ^ a b Muralishankar, R.; Ramakrishnan, A. G.; Prathibha, P. (February 2004). "Modification of Pitch using DCT in the Source Domain". Speech Communication. 42 (2): 143–154. doi:10.1016/j.specom.2003.05.001.
- ^ "Education: Marvel of The Bronx". Time. 1974-04-01. ISSN 0040-781X. Retrieved 2019-05-28.
- ^ "1960 - Rudy the Robot - Michael Freeman (American)". cyberneticzoo.com. 2010-09-13. Retrieved 2019-05-23.
- ^ New York Magazine. New York Media, LLC. 1979-07-30.
- ^ The Futurist. World Future Society. 1978. pp. 359, 360, 361.
- ^ L.F. Lamel, J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. Generation and Synthesis of Broadcast Messages, Proceedings ESCA-NATO Workshop and Applications of Speech Technology, September 1993.
- ^ Dartmouth College: Music and Computers Archived 2011-06-08 at the Wayback Machine, 1993.
- ^ Examples include Astro Blaster, Space Fury, and Star Trek: Strategic Operations Simulator
- ^ Examples include Star Wars, Firefox, Return of the Jedi, Road Runner, The Empire Strikes Back, Indiana Jones and the Temple of Doom, 720°, Gauntlet, Gauntlet II, A.P.B., Paperboy, RoadBlasters, Vindicators Part II, Escape from the Planet of the Robot Monsters.
- ^ John Holmes and Wendy Holmes (2001). Speech Synthesis and Recognition (2nd ed.). CRC. ISBN 978-0-7484-0856-6.
- ^ Zhu, Jian (2020-05-25). "Probing the phonetic and phonological knowledge of tones in Mandarin TTS models". Speech Prosody 2020. ISCA: 930–934. arXiv:1912.10915. doi:10.21437/speechprosody.2020-190. S2CID 209444942.
- ^ a b Lucero, J. C.; Schoentgen, J.; Behlau, M. (2013). "Physics-based synthesis of disordered voices" (PDF). Interspeech 2013. Lyon, France: International Speech Communication Association: 587–591. doi:10.21437/Interspeech.2013-161. S2CID 17451802. Retrieved Aug 27, 2015.
- ^ a b Englert, Marina; Madazio, Glaucya; Gielow, Ingrid; Lucero, Jorge; Behlau, Mara (2016). "Perceptual error identification of human and synthesized voices". Journal of Voice. 30 (5): 639.e17–639.e23. doi:10.1016/j.jvoice.2015.07.017. PMID 26337775.
- ^ "The HMM-based Speech Synthesis System". Hts.sp.nitech.ac.j. Archived from the original on 2012-02-13. Retrieved 2012-02-22.
- ^ Remez, R.; Rubin, P.; Pisoni, D.; Carrell, T. (22 May 1981). "Speech perception without traditional speech cues" (PDF). Science. 212 (4497): 947–949. Bibcode:1981Sci...212..947R. doi:10.1126/science.7233191. PMID 7233191. Archived from the original (PDF) on 2011-12-16. Retrieved 2011-12-14.
- ^ Smith, Hannah; Mansted, Katherine (April 1, 2020). Weaponised deep fakes: National security and democracy. Vol. 28. Australian Strategic Policy Institute. pp. 11–13. ISSN 2209-9689.
- ^ Lyu, Siwei (2020). "Deepfake Detection: Current Challenges and Next Steps". 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 1–6. arXiv:2003.09234. doi:10.1109/icmew46912.2020.9105991. ISBN 978-1-7281-1485-9. S2CID 214605906.
- ^ Diakopoulos, Nicholas; Johnson, Deborah (June 2020). "Anticipating and addressing the ethical implications of deepfakes in the context of elections". New Media & Society. 23 (7) (published 2020-06-05): 2072–2098. doi:10.1177/1461444820925811. ISSN 1461-4448. S2CID 226196422.
- ^ Murphy, Margi (20 February 2024). "Deepfake Audio Boom Exploits One Billion-Dollar Startup's AI". Bloomberg.
- ^ Chadha, Anupama; Kumar, Vaibhav; Kashyap, Sonu; Gupta, Mayank (2021), Singh, Pradeep Kumar; Wierzchoń, Sławomir T.; Tanwar, Sudeep; Ganzha, Maria (eds.), "Deepfake: An Overview", Proceedings of Second International Conference on Computing, Communications, and Cyber-Security, Lecture Notes in Networks and Systems, vol. 203, Singapore: Springer Singapore, pp. 557–566, doi:10.1007/978-981-16-0733-2_39, ISBN 978-981-16-0732-5, S2CID 236666289, retrieved 2022-06-29
- ^ "AI gave Val Kilmer his voice back. But critics worry the technology could be misused". Washington Post. ISSN 0190-8286. Retrieved 2022-06-29.
- ^ Etienne, Vanessa (August 19, 2021). "Val Kilmer Gets His Voice Back After Throat Cancer Battle Using AI Technology: Hear the Results". PEOPLE.com. Retrieved 2022-07-01.
- ^ Newman, Lily Hay. "AI-Generated Voice Deepfakes Aren't Scary Good—Yet". Wired. ISSN 1059-1028. Retrieved 2023-07-25.
- ^ "Speech synthesis". World Wide Web Organization.
- ^ "Blizzard Challenge". Festvox.org. Retrieved 2012-02-22.
- ^ "Smile -and the world can hear you". University of Portsmouth. January 9, 2008. Archived from the original on May 17, 2008.
- ^ "Smile – And The World Can Hear You, Even If You Hide". Science Daily. January 2008.
- ^ Drahota, A. (2008). "The vocal communication of different kinds of smile" (PDF). Speech Communication. 50 (4): 278–287. doi:10.1016/j.specom.2007.10.001. S2CID 46693018. Archived from the original (PDF) on 2013-07-03.
- ^ Prathosh, A. P.; Ramakrishnan, A. G.; Ananthapadmanabha, T. V. (December 2013). "Epoch extraction based on integrated linear prediction residual using plosion index". IEEE Transactions on Audio, Speech, and Language Processing. 21 (12): 2471–2480. Bibcode:2013ITASL..21.2471P. doi:10.1109/TASL.2013.2273717. S2CID 10491251.
- ^ EE Times. "TI will exit dedicated speech-synthesis chips, transfer products to Sensory Archived 2012-05-28 at the Wayback Machine." June 14, 2001.
- ^ "1400XL/1450XL Speech Handler External Reference Specification" (PDF). Archived from the original (PDF) on 2012-03-24. Retrieved 2012-02-22.
- ^ "It Sure Is Great To Get Out Of That Bag!". folklore.org. Retrieved 2013-03-24.
- ^ "Amazon Polly". Amazon Web Services, Inc. Retrieved 2020-04-28.
- ^ Miner, Jay; et al. (1991). Amiga Hardware Reference Manual (3rd ed.). Addison-Wesley Publishing Company, Inc. ISBN 978-0-201-56776-2.
- ^ Devitt, Francesco (30 June 1995). "Translator Library (Multilingual-speech version)". Archived from the original on 26 February 2012. Retrieved 9 April 2013.
- ^ "Accessibility Tutorials for Windows XP: Using Narrator". Microsoft. 2011-01-29. Archived from the original on June 21, 2003. Retrieved 2011-01-29.
- ^ "How to configure and use Text-to-Speech in Windows XP and in Windows Vista". Microsoft. 2007-05-07. Retrieved 2010-02-17.
- ^ Jean-Michel Trivi (2009-09-23). "An introduction to Text-To-Speech in Android". Android-developers.blogspot.com. Retrieved 2010-02-17.
- ^ Andreas Bischoff, The Pediaphon – Speech Interface to the free Wikipedia Encyclopedia for Mobile Phones, PDA's and MP3-Players, Proceedings of the 18th International Conference on Database and Expert Systems Applications, Pages: 575–579 ISBN 0-7695-2932-1, 2007
- ^ "gnuspeech". Gnu.org. Retrieved 2010-02-17.
- ^ "Smithsonian Speech Synthesis History Project (SSSHP) 1986–2002". Mindspring.com. Archived from the original on 2013-10-03. Retrieved 2010-02-17.
- ^ Jia, Ye; Zhang, Yu; Weiss, Ron J. (2018-06-12), "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis", Advances in Neural Information Processing Systems, 31: 4485–4495, arXiv:1806.04558
- ^ Arık, Sercan Ö.; Chen, Jitong; Peng, Kainan; Ping, Wei; Zhou, Yanqi (2018), "Neural Voice Cloning with a Few Samples", Advances in Neural Information Processing Systems, 31, arXiv:1802.06006
- ^ "Fake voices 'help cyber-crooks steal cash'". bbc.com. BBC. 2019-07-08. Retrieved 2019-09-11.
- ^ Harwell, Drew (2019-09-04). "An artificial-intelligence first: Voice-mimicking software reportedly used in a major theft". Washington Post. Retrieved 2019-09-08.
- ^ Thies, Justus (2016). "Face2Face: Real-time Face Capture and Reenactment of RGB Videos". Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Retrieved 2016-06-18.
- ^ Suwajanakorn, Supasorn; Seitz, Steven; Kemelmacher-Shlizerman, Ira (2017), Synthesizing Obama: Learning Lip Sync from Audio, University of Washington, retrieved 2018-03-02
- ^ Brunow, David A.; Cullen, Theresa A. (2021-07-03). "Effect of Text-to-Speech and Human Reader on Listening Comprehension for Students with Learning Disabilities". Computers in the Schools. 38 (3): 214–231. doi:10.1080/07380569.2021.1953362. hdl:11244/316759. ISSN 0738-0569. S2CID 243101945.
- ^ Triandafilidi, Ioanis I.; Tatarnikova, T. M.; Poponin, A. S. (2022-05-30). "Speech Synthesis System for People with Disabilities". 2022 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF). St. Petersburg, Russian Federation: IEEE. pp. 1–5. doi:10.1109/WECONF55058.2022.9803600. ISBN 978-1-6654-7083-4. S2CID 250118756.
- ^ Zhao, Yunxin; Song, Minguang; Yue, Yanghao; Kuruvilla-Dugdale, Mili (2021-07-27). "Personalizing TTS Voices for Progressive Dysarthria". 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). Athens, Greece: IEEE. pp. 1–4. doi:10.1109/BHI50953.2021.9508522. ISBN 978-1-6654-0358-0. S2CID 236982893.
- ^ "Evolution of Reading Machines for the Blind: Haskins Laboratories" Research as a Case History" (PDF). Journal of Rehabilitation Research and Development. 21 (1). 1984. Archived from the original (PDF) on 2021-07-25. Retrieved 2021-07-03.
- ^ "Speech Synthesis Software for Anime Announced". Anime News Network. 2007-05-02. Retrieved 2010-02-17.
- ^ "Code Geass Speech Synthesizer Service Offered in Japan". Animenewsnetwork.com. 2008-09-09. Retrieved 2010-02-17.
- ^ Ruppert, Liana (January 18, 2021). "Make Portal's GLaDOS And Other Beloved Characters Say The Weirdest Things With This App". Game Informer. Archived from the original on January 18, 2021. Retrieved December 18, 2024.
- ^ Clayton, Natalie (January 19, 2021). "Make the cast of TF2 recite old memes with this AI text-to-speech tool". PC Gamer. Archived from the original on January 19, 2021. Retrieved December 18, 2024.
- ^ "这个网站可用AI生成语音 让ACG角色"说"出你输入的文本" [This Website Can Use AI to Generate Voice, Making ACG Characters "Say" the Text You Input]. GamerSky (in Chinese). January 18, 2021. Archived from the original on December 11, 2024. Retrieved December 18, 2024.
- ^ Furushima, Takayuki (January 18, 2021). "『Portal』のGLaDOSや『UNDERTALE』のサンズがテキストを読み上げてくれる。文章に込められた感情まで再現することを目指すサービス「15.ai」が話題に" [Portal's GLaDOS and UNDERTALE's Sans Will Read Text for You. "15.ai" Service Aims to Reproduce Even the Emotions in Text, Becomes Topic of Discussion]. Den Fami Nico Gamer (in Japanese). Archived from the original on January 18, 2021. Retrieved December 18, 2024.
- ^ "Now hear this: Voice cloning AI startup ElevenLabs nabs $19M from a16z and other heavy hitters". VentureBeat. 2023-06-20. Retrieved 2023-07-25.
- ^ "Sztuczna inteligencja czyta głosem Jarosława Kuźniara. Rewolucja w radiu i podcastach". Press.pl (in Polish). April 9, 2023. Retrieved 2023-04-25.
- ^ Knibbs, Kate. "Generative AI Podcasts Are Here. Prepare to Be Bored". Wired. ISSN 1059-1028. Retrieved 2023-07-25.
- ^ Suciu, Peter. "Arrested Succession Parody On YouTube Features 'Narration' By AI-Generated Ron Howard". Forbes. Retrieved 2023-07-25.
- ^ Fadulu, Lola (2023-07-06). "Can A.I. Be Funny? This Troupe Thinks So". The New York Times. ISSN 0362-4331. Retrieved 2023-07-25.
- ^ Kanetkar, Riddhi. "Hot AI startup ElevenLabs, founded by ex-Google and Palantir staff, is set to raise $18 million at a $100 million valuation. Check out the 14-slide pitch deck it used for its $2 million pre-seed". Business Insider. Retrieved 2023-07-25.
- ^ "AI-Generated Voice Firm Clamps Down After 4chan Makes Celebrity Voices for Abuse". Vice.com. January 30, 2023. Retrieved 2023-02-03.
- ^ "Usage of text-to-speech in AI video generation". elai.io. Retrieved 10 August 2022.
- ^ "AI Text to speech for videos". synthesia.io. Retrieved 12 October 2023.
- ^ Bruno, Chelsea A (2014-03-25). Vocal Synthesis and Deep Listening (Master of Music Music thesis). Florida International University. doi:10.25148/etd.fi14040802.
External links
- Simulated singing with the singing robot Pavarobotti or a description from the BBC on how the robot synthesized the singing.
Speech synthesis
View on Grokipedia
Speech synthesis is the computational generation of audible speech signals that approximate human vocal production, most commonly from text input via text-to-speech (TTS) systems employing algorithms to model phonetic, prosodic, and acoustic features.[1][2] Early mechanical attempts date to the 18th century with devices like Wolfgang von Kempelen's speaking machine, which used bellows and reeds to produce basic vowels and consonants through physical simulation of the vocal tract.[3] Electronic milestones include Bell Labs' Voder in 1939, an operator-controlled formant synthesizer that demonstrated real-time speech generation at the New York World's Fair, marking the shift to electrical analogs of speech production parameters.[4] Subsequent developments encompassed rule-based formant synthesis, which constructs speech from source-filter models of excitation and resonance, and concatenative methods that splice pre-recorded speech units for natural timbre at the cost of limited flexibility.[5][6] Statistical parametric synthesis in the 2000s introduced hidden Markov models to predict spectral and prosodic parameters from text, enabling compact representations but often yielding robotic intonation due to over-smoothing.[7] Contemporary neural architectures, such as Google's Tacotron series and DeepMind's WaveNet, leverage deep learning for end-to-end mapping from text to mel-spectrograms or raw waveforms, achieving unprecedented naturalness through autoregressive generation and attention mechanisms that capture contextual dependencies.[8]
Applications span assistive devices for individuals with speech impairments, enabling communication via systems like those used in real-time decoding of neural signals; navigation aids and virtual assistants for hands-free interaction; and content creation tools for audiobooks or multilingual translation.[9][10] Empirical evaluations highlight neural TTS's superiority in mean opinion scores for intelligibility and preference, with WaveNet-conditioned models outperforming traditional vocoders in perceptual fidelity across diverse languages and speakers.[8] While enabling accessibility, the technology raises challenges in detecting synthetic audio to mitigate deception in voice impersonation, underscoring the need for robust forensic methods amid advancing realism.[11]
History
Pre-electronic and early mechanical attempts
In the late 18th century, early efforts to synthesize speech mechanically focused on replicating the acoustic properties of vowels through resonators powered by bellows and reeds. Christian Kratzenstein, a professor of physiology, constructed devices in 1779 that produced the five long vowels (/a/, /e/, /i/, /o/, /u/) by exciting tuned resonators with air from bellows vibrating against free reeds, demonstrating physiological differences in vocal tract resonance.[4][3] These apparatuses, submitted to the St. Petersburg Academy, marked one of the first systematic attempts to artificially generate distinct vowel sounds, though limited to isolated tones without consonants or connected speech.[3]
Working along similar lines, Wolfgang von Kempelen developed a more advanced mechanical synthesizer, beginning in 1769 and publishing a detailed description in 1791. His device used bellows to simulate lungs, a reed for vocal cord vibration, and adjustable leather tubes and chambers to mimic the pharynx, mouth, and nasal cavities, enabling production of vowels, consonants, syllables, words, and short sentences like "arni" or "mama."[12][13] Operators manually controlled keys and levers to shape resonances and airflow, achieving intelligible but monotonous and labored speech that highlighted the causal role of vocal tract geometry in articulation.[12] Kempelen's work emphasized empirical observation of human anatomy, influencing later phonetic studies despite the machine's cumbersome operation and limited fluency.[13]
By the mid-19th century, mechanical synthesis advanced toward more humanoid forms with Joseph Faber's Euphonia, exhibited publicly in Philadelphia in 1845 and London in 1846 after over two decades of development. This apparatus featured a mannequin head with artificial lips, tongue, jaw, and bellows-driven lungs, capable of reciting programmed phrases, numbers, and poems in multiple languages via a keyboard that manipulated reeds and valves for 16 basic sounds combinable into about 1,000 words.[14][15] Faber's design prioritized visible anthropomorphism, producing eerie, whispery speech that drew crowds but underscored mechanical constraints like slow response times and unnatural timbre due to imprecise control of formants.[14]
These pre-electronic devices collectively demonstrated that speech arises from modulated airflow through configurable resonators, laying groundwork for understanding synthesis as physical modeling, though practicality remained hindered by manual complexity and acoustic fidelity issues.[16]
Electronic and formant-based pioneers (1930s–1970s)
In the 1930s, electronic speech synthesis emerged at Bell Laboratories with Homer Dudley's development of the vocoder, a system that analyzed and resynthesized speech by encoding spectral envelopes into a reduced set of channels to capture formant-like resonances while transmitting fundamental frequency and amplitude.[17] Dudley's work, initiated in 1928, culminated in the Voder demonstrator unveiled at the 1939 New York World's Fair, which used a keyboard and pedal interface to generate continuous human-like speech through electronic filters and oscillators, marking the first fully electronic speech synthesizer without mechanical components.[18] The Voder produced recognizable vowels and consonants by manually controlling formant frequencies, though its output required skilled operation and sounded robotic due to limited channel resolution and lack of automated rules.[19]
Following World War II, researchers at Haskins Laboratories advanced synthesis through the Pattern Playback, invented by Franklin S. Cooper in the late 1940s, which converted hand-painted spectrographic patterns into audible sound using optical scanning of drawings that represented frequency, amplitude, and timing of speech components.[20] This device, operational by 1950, enabled systematic experimentation with acoustic cues for phoneme perception, synthesizing isolated sounds and simple words to test theories of speech perception, though it remained a research tool rather than a real-time synthesizer due to its manual pattern preparation.[21]
The 1950s saw the introduction of dedicated formant synthesizers, beginning with Walter Lawrence's Parametric Artificial Talker (PAT) in 1953 at the Signals Research and Development Establishment, which modeled the vocal tract as a series of resonant filters to generate speech from parametric inputs for formants, frication noise, and voicing.[4] PAT used three formant circuits for vowels and added noise sources for consonants, producing intelligible British English phrases under manual control, and influenced subsequent rule-based systems by demonstrating that a small number of time-varying parameters could approximate natural prosody.[22]
By the 1960s and 1970s, formant synthesis matured with computational implementations, as Dennis Klatt at MIT developed software-based synthesizers starting in the mid-1960s, culminating in the Klattalk system around 1979, which automated formant trajectories via rules derived from linguistic analysis for more fluent text-to-speech conversion.[23] Klatt's cascade-parallel formant architecture, refined in the 1970s, improved naturalness by separately modeling glottal source and vocal tract filtering, enabling applications like the DECtalk hardware synthesizer and influencing assistive devices, though outputs still exhibited monotonous intonation and spectral distortions from idealized formant modeling.[24] These pioneers established formant synthesis as a dominant paradigm, prioritizing computational efficiency over waveform fidelity, with empirical validation through perceptual tests confirming intelligibility for limited vocabularies.[4]
Digital concatenative and parametric advances (1980s–2000s)
In the 1980s, increased computational power facilitated the shift toward concatenative speech synthesis, which assembled utterances from pre-recorded natural speech segments such as diphones (transitions between adjacent sounds), yielding output that sounded more lifelike than prior formant-based methods reliant on synthetic waveforms.[25] This approach minimized the robotic quality of rule-based synthesizers by leveraging human-recorded units, though it required careful segment selection to avoid audible discontinuities at join points.[26] Early implementations, like Yoshinori Sagisaka's nuu-talk system developed at Japan's Advanced Telecommunications Research labs in the late 1980s and early 1990s, demonstrated concatenative techniques using diphone inventories to generate fluent Japanese speech.[26]
By the mid-1990s, concatenative methods advanced with unit selection synthesis, which optimized the choice of segments from large speech corpora to reduce distortion in both acoustic quality and prosody. Alan Black and Nick Campbell introduced this framework in 1995, modeling selection as a cost-minimization problem that balanced target unit suitability and concatenation smoothness, often using dynamic programming over subword units like demi-syllables or phonemes.[27] Subsequent refinements, such as Andrew Hunt and Alan Black's 1996 system employing large databases (e.g., thousands of utterances), enabled scalable synthesis with improved naturalness by prioritizing contextually appropriate units over fixed diphone sets.[28] Open-source platforms like the Festival Speech Synthesis System, initiated at the University of Edinburgh in the late 1990s, integrated diphone and unit selection modules, supporting multilingual voices and customizable corpora for research and applications.[29]
Toward the late 1990s and into the 2000s, parametric synthesis emerged as a data-driven alternative, parameterizing speech via statistical models to generate waveforms from acoustic features like spectrum, fundamental frequency, and duration, thus avoiding some artifacts of direct concatenation. Hidden Markov model (HMM)-based systems, pioneered by researchers including Keiichi Tokuda in Japan, first demonstrated viability around 1995–1996 through trainable context-dependent models that clustered phoneme states via decision trees, enabling adaptation to new speakers with limited data.[30] These methods produced intelligible speech by generating parameter trajectories from the trained HMM distributions and synthesizing via vocoders like STRAIGHT, outperforming concatenative systems in flexibility for prosodic control and voice modification, though early versions exhibited buzziness from over-smoothed parameters.[31] The HMM-based Speech Synthesis System (HTS) toolkit, released in December 2002, marked a practical milestone by providing open-source tools for HMM training and synthesis, influencing commercial TTS deployments.[32]
Neural and deep learning revolution (2010s–present)
The integration of deep neural networks into speech synthesis during the early 2010s surpassed the limitations of hidden Markov model-based parametric approaches by enabling hierarchical feature learning and more accurate mapping from text to acoustic parameters.[33] Deep neural networks replaced Gaussian mixture models for predicting mel-frequency cepstral coefficients, yielding improvements in naturalness and reducing audible artifacts, as evidenced by higher mean opinion scores in evaluations.[33]
A pivotal advancement occurred in September 2016 with DeepMind's WaveNet, an autoregressive convolutional neural network that models raw audio waveforms directly rather than intermediate representations like spectrograms.[34] WaveNet generates speech by predicting each audio sample conditioned on previous ones, capturing fine-grained temporal dependencies and producing output preferred over traditional concatenative systems in listening tests, with mean opinion scores exceeding 4.0 on a 5-point scale for certain voices.[35] This approach, however, incurred high computational costs due to sequential generation, limiting real-time applicability initially.[35]
Building on spectrogram prediction, Google's Tacotron, introduced in March 2017, pioneered end-to-end text-to-speech synthesis by using a sequence-to-sequence model with attention mechanisms to convert raw text characters directly into mel-spectrograms, bypassing explicit phoneme or linguistic front-ends.[36] Tacotron 2, released later in 2017, combined this with a WaveNet vocoder, achieving near-human quality in blind evaluations for single-speaker synthesis, where listeners rated synthesized speech as comparable to real recordings.[37]
To mitigate latency in autoregressive models, Microsoft developed FastSpeech in 2019, a non-autoregressive feed-forward transformer-based architecture that generates entire spectrograms in parallel, reducing inference time by orders of magnitude while preserving quality through a duration predictor and length regulator. FastSpeech 2, an iteration from 2020, further enhanced prosody control and stability by training directly on ground-truth targets rather than a teacher model's outputs and by adding a variance adaptor that predicts duration, pitch, and energy, outperforming its predecessor in both speed and subjective quality metrics.[38]
The 2020s have seen proliferation of efficient architectures and zero-shot capabilities, exemplified by Microsoft's VALL-E in January 2023, a neural codec language model that synthesizes personalized speech from just a 3-second audio enrollment clip without speaker-specific fine-tuning, leveraging in-context learning from large-scale speech-text pairs.[39] VALL-E 2, announced in June 2024, advanced this to human-parity zero-shot text-to-speech, with evaluations showing indistinguishability from real speech in timbre, prosody, and content across diverse speakers and languages.[40] Diffusion probabilistic models, such as those in Grad-TTS and subsequent vocoders, have complemented these by enabling stable, high-fidelity waveform inversion and generation through iterative denoising, addressing mode collapse in GAN-based alternatives.[41]
These neural paradigms have driven commercial deployments, including Google Cloud Text-to-Speech's WaveNet integration in 2018, which powers multilingual voices with enhanced expressiveness.[42] Despite gains in realism, challenges persist in multi-speaker generalization, ethical voice cloning risks, and computational demands for low-resource languages.[43]
Core Technologies and Methods
Formant and rule-based synthesis
Formant synthesis generates artificial speech by modeling the acoustic resonances, or formants, of the human vocal tract according to the source-filter theory. This approach separates speech production into a neutral sound source (typically a periodic pulse train for voiced sounds or noise for unvoiced sounds) and a linear time-invariant filter that shapes the source spectrum to produce specific phonetic qualities through adjustable formant frequencies, bandwidths, and amplitudes. The source-filter model was formalized by Gunnar Fant in 1960, building on earlier work in acoustic phonetics to explain how vocal tract configurations determine spectral peaks corresponding to vowels and consonants.[44]
In practice, formant synthesizers employ either cascade or parallel configurations of resonators to simulate the filter. A cascade formant synthesizer passes the source through a series of second-order resonators connected in tandem, mimicking the serial filtering effect of the vocal tract, while a parallel setup sums outputs from independent formant branches for greater flexibility in spectral control.[45] Rule-based synthesis integrates linguistic rules to derive these parameters from input text: text is first normalized and converted to phonemic sequences, then rules dictate formant trajectories, source excitation patterns, durations, and fundamental frequency (F0) contours based on phonetic context, stress, and intonation patterns.[46] For instance, vowel formants are set to target values interpolated over time, with transitions smoothed for coarticulation effects.
Pioneering implementations include the Pattern Playback device developed at Haskins Laboratories in the late 1940s, which converted hand-painted spectrograms into sound for phonetic research.
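As a rough illustration of the cascade configuration described above, the following Python sketch passes an impulse-train source through a chain of two-pole digital resonators, one per formant. The resonator difference equation follows the form commonly used in Klatt-style synthesizers; the function names, the formant frequencies and bandwidths chosen for the /a/-like vowel, and the 16 kHz sample rate are illustrative assumptions, not values taken from any system discussed here.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Cascade synthesis: the source passes through each resonator in series."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Assumed (frequency, bandwidth) pairs in Hz for an /a/-like vowel.
VOWEL_A = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(VOWEL_A)
```

Scaled and written to an audio file, `samples` would give a steady, buzzy vowel; a rule-based system would additionally vary the formant targets and F0 over time according to its phonetic rules.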
A landmark digital system was Dennis Klatt's cascade/parallel formant synthesizer, implemented in software for the DEC PDP-11 computer in 1980, capable of real-time synthesis with 12 formants and detailed control over glottal source parameters like open quotient and aspiration noise.[45] This design underpinned commercial systems such as DECtalk, released in 1984, which used rule-based parameter generation to produce intelligible speech from text at rates up to 200 words per minute on hardware of that era.[47]
Rule-based formant synthesis excels in computational efficiency, requiring minimal storage (often under 1 MB for rules and models) compared to waveform-based methods, enabling deployment on early microprocessors.[48] It also permits straightforward manipulation of prosody and voice characteristics by altering rules, facilitating applications like foreign accent simulation or low-bitrate transmission. However, the idealized filter models often yield a mechanical, "buzzy" timbre lacking the nuanced harmonics and transients of natural speech, with intelligibility rates typically 80–90% for isolated words but dropping in continuous discourse due to imprecise modeling of fricatives and nasal murmurs.[47] Despite these limitations, formant-rule systems influenced subsequent TTS architectures and remain relevant in resource-constrained environments, such as embedded devices.[46]
Concatenative synthesis techniques
Concatenative synthesis techniques generate speech by selecting and sequentially joining pre-recorded acoustic units from a large speech corpus, preserving the natural timbre and prosody of human recordings while constructing novel utterances. These methods emerged as a shift from rule-based formant synthesis in the late 1980s, prioritizing waveform fidelity over parametric modeling, though they demand extensive databases to cover phonetic and prosodic variations. Unit sizes typically include sub-phonemic fragments, diphones, phonemes, syllables, or multi-word phrases, with selection guided by algorithmic optimization to minimize perceptual artifacts at join points.[49]
Diphone-based concatenative synthesis represents an early and efficient variant, employing units that span the transition between two adjacent phonemes, typically running from the midpoint of one phone to the midpoint of the next. For languages like English with approximately 40 phonemes, this yields around 1,600 possible diphones (40 × 40), sufficient to span most co-articulation effects with compact storage compared to inventories of larger units such as syllables or words. Synthesis involves phonetic transcription of input text, diphone inventory lookup, and concatenation, often augmented by prosodic adjustments like pitch contour superposition or duration scaling via time-domain pitch-synchronous overlap-add (PSOLA) to align fundamental frequency and reduce glitches. Pioneered in systems like those developed at British Telecom in the 1980s, diphone methods excel in resource-constrained environments but struggle with out-of-corpus prosody, leading to robotic intonation unless hybridized with rule-based modifications.[50][51][52]
Corpus-based unit selection advances diphone principles by drawing from expansive, speaker-specific databases, often exceeding 10 hours of read speech, enabling flexible unit granularities beyond fixed diphones.
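The search at the heart of unit selection can be sketched as a small dynamic program. The Python example below is a minimal sketch under simplifying assumptions: each candidate unit is described only by hypothetical `f0_start` and `f0_end` values, the target cost is the distance between a unit's mean F0 and the requested F0, and the join cost is the F0 jump at the boundary. Real systems combine many more features and tuned weights.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search for the candidate sequence with the lowest total cost.

    targets[i] is the specification for position i and candidates[i] is the
    list of database units that could fill it.
    """
    # best[i][j] = (cheapest cumulative cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


def target_cost(target, unit):
    """Assumed target cost: mismatch between requested and mean unit F0 (Hz)."""
    return abs(target["f0"] - (unit["f0_start"] + unit["f0_end"]) / 2.0)


def join_cost(prev, unit):
    """Assumed concatenation cost: F0 discontinuity at the boundary (Hz)."""
    return abs(prev["f0_end"] - unit["f0_start"])


# Toy inventory with made-up values, two candidates per target diphone.
TARGETS = [{"f0": 120}, {"f0": 125}, {"f0": 110}]
CANDIDATES = [
    [{"id": "he_1", "f0_start": 118, "f0_end": 122},
     {"id": "he_2", "f0_start": 119, "f0_end": 140}],
    [{"id": "el_1", "f0_start": 123, "f0_end": 127},
     {"id": "el_2", "f0_start": 141, "f0_end": 128}],
    [{"id": "lo_1", "f0_start": 126, "f0_end": 100},
     {"id": "lo_2", "f0_start": 90, "f0_end": 130}],
]

path, total = select_units(TARGETS, CANDIDATES, target_cost, join_cost)
```

In this toy inventory `lo_2` matches its target pitch exactly, yet the search prefers `lo_1` because it joins far more smoothly to the preceding unit, which is the trade-off between the two cost terms.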
Algorithms compute a target cost reflecting linguistic context (e.g., phoneme identity, stress) and prosodic features (e.g., F0 trajectory, duration, energy), alongside a concatenation cost evaluating spectral and temporal continuity at boundaries via metrics like Mel-cepstral distance or waveform correlation. Viterbi search or beam search traverses a graph of candidate units to optimize the cumulative path cost, as formalized in early implementations requiring corpora of at least 5,000 utterances for robust coverage. This approach, detailed in foundational work from 1996, yields higher naturalness by favoring unmodified segments matching desired intonation, though it incurs computational overhead proportional to database size.[28][53]
Hybrid concatenative techniques integrate diphone efficiency with corpus-scale selection, or blend with parametric elements for prosody transplantation; for instance, selecting multi-phoneme units (2–4 phonemes) to better preserve co-articulation while applying harmonic model-based smoothing for seamless joins. Post-selection signal processing, such as weighted overlap-add or LPC residual modification, mitigates discontinuities by blending 20–50 ms windows at edges, with perceptual evaluations confirming reduced buzz or clipping artifacts. These methods dominated commercial TTS until the mid-2000s, powering systems like AT&T's Natural Voices, but scalability limits persist for low-resource languages due to corpus acquisition costs.[54][55]
Statistical parametric synthesis
Statistical parametric speech synthesis generates speech waveforms by statistically estimating sequences of acoustic parameters, such as spectral envelopes, fundamental frequency, and durations, from models trained on large speech corpora, followed by vocoder-based reconstruction.[56] This contrasts with concatenative methods by averaging features across similar phonetic contexts rather than selecting and joining pre-recorded units, enabling compact representations and modifiable prosody.[57]
The core technique relies on hidden Markov models (HMMs) to capture context-dependent speech variations, where full-context labels align text-derived phoneme sequences with acoustic features extracted via tools like mel-cepstral analysis.[58] During synthesis, maximum likelihood parameter generation algorithms produce smooth trajectories for static and dynamic features (deltas and delta-deltas), often incorporating global variance modeling to mitigate underestimation of parameter variability and enhance naturalness.[59] Vocoders such as STRAIGHT or mixed-phase implementations then synthesize the waveform from these parameters, typically with a frame shift of 5 milliseconds.[31]
Early developments trace to the late 1990s, with foundational work on HMM-based voice conversion and synthesis by Tokuda, Kobayashi, and Imai, including a 1997 ICASSP paper on adapting pitch and spectrum parameters.[60] The HMM-based Speech Synthesis System (HTS), an open-source toolkit, was first released in December 2002, supporting multi-speaker training and adaptation via techniques like MLLR (maximum likelihood linear regression).[32] By 2007, HTS version 2.0 incorporated advanced clustering and parameter generation for improved efficiency.[61]
Advantages include reduced storage needs compared to unit-selection systems (only model parameters are stored rather than full waveforms) and inherent support for prosody manipulation, speaker adaptation, and expressive synthesis through feature interpolation.[62] These properties make it suitable for resource-constrained environments and low-data scenarios, where concatenative methods degrade due to insufficient coverage.[56]
However, classical HMM-based implementations suffer from over-smoothing, where generated spectra average natural variability, yielding muffled or buzzy output with limited high-frequency detail and unnatural prosody transitions.[63] Evaluations, such as mean opinion scores from blind listening tests, consistently rated parametric synthesis below natural speech and early concatenative systems in perceptual naturalness until refinements like deep neural network substitutions in the 2010s.[57] Despite these limitations, the framework laid groundwork for data-driven TTS, influencing hybrid systems in tools like the Festival speech synthesis suite.[58]
Articulatory and hybrid approaches
Articulatory synthesis generates speech by simulating the biomechanical processes of human speech production, modeling the vocal tract's geometry and the movements of articulators such as the tongue, lips, jaw, and larynx to produce acoustic waveforms.[64] These models typically solve differential equations approximating airflow, pressure, and sound propagation through the vocal tract, often using finite element or finite difference methods for computational efficiency.[65] Early implementations, dating to the 1960s at institutions like Haskins Laboratories, relied on simplified tube models derived from X-ray data of speakers, but accuracy was limited by incomplete physiological data and high computational demands.[66]
Key challenges in pure articulatory synthesis include achieving realistic coarticulation (where articulator positions overlap across phonemes) and modeling glottal source excitation from the larynx, which requires precise control parameters often derived from electromagnetic articulography (EMA) or magnetic resonance imaging (MRI).[67] Systems like the Maeda articulatory synthesizer use parametric control of vocal tract shapes to invert acoustic signals back to articulatory gestures, enabling synthesis but suffering from unnatural timbre due to idealized geometries.[68] Computational costs historically restricted real-time use, with synthesis rates on 1990s hardware reaching only 10–20 words per minute for complex utterances.[69]
Hybrid approaches mitigate these limitations by integrating articulatory models with acoustic, formant, or statistical methods, leveraging the interpretability of articulatory parameters for prosody control while borrowing efficiency from waveform generation techniques.[70] For instance, hybrid articulatory-acoustic synthesizers map biomechanical trajectories to spectral envelopes using deep neural networks (DNNs), as demonstrated in 2016 work training on EMA data to achieve real-time control with perceptual naturalness scores exceeding 3.5 on MOS scales.[71] Time-frequency domain hybrids combine finite difference vocal tract simulations with source-filter models, reducing artifacts in fricatives and nasals by dynamically adjusting filter parameters based on articulator positions.[72]
Recent hybrids incorporate machine learning for articulatory feature integration into parametric frameworks, such as hidden Markov models (HMMs) augmented with trajectory predictions from articulator data, improving intonation variability in read speech by 15–20% over purely acoustic baselines in listener evaluations.[73] Differentiable rendering techniques, advanced in 2024, enable end-to-end optimization of articulatory parameters via gradients, supporting diverse vocal sounds beyond standard speech, though scalability to full languages remains constrained by training data volumes typically under 10 hours per speaker.[74] These methods prioritize causal fidelity to human physiology, offering potential for applications in speech therapy and assistive devices, but require validation against empirical articulatory datasets to counter modeling assumptions that overestimate uniformity in speaker anatomy.[75]
Neural network and deep learning synthesis
Neural network-based speech synthesis emerged in the early 2010s as an evolution of statistical parametric methods, initially incorporating deep neural networks (DNNs) to predict acoustic features from linguistic inputs, often in hybrid systems with hidden Markov models (HMMs). These early DNN approaches improved naturalness over traditional Gaussian mixture models by better capturing non-linear mappings, achieving mean opinion scores (MOS) up to 3.5 on benchmark datasets like Blizzard Challenge entries by 2013.[76] However, they retained reliance on hand-crafted front-end processing for text analysis and phoneme alignment, limiting scalability and expressiveness.[77]
A pivotal advancement occurred in 2016 with WaveNet, developed by DeepMind, which introduced autoregressive convolutional neural networks to generate raw audio waveforms directly, bypassing intermediate parametric representations like mel-cepstral coefficients. WaveNet employs dilated causal convolutions to model long-range dependencies in audio sequences, producing speech with MOS ratings exceeding 4.0 and outperforming parametric synthesizers by capturing subtle variations in timbre and prosody that prior methods approximated poorly.[34] This waveform-level modeling captures sample-to-sample dependencies in the audio signal, enabling higher fidelity but at the cost of slow inference due to sequential generation.[35]
Building on WaveNet's vocoding capabilities, Google's Tacotron in 2017 pioneered end-to-end deep learning frameworks, using encoder-decoder architectures with attention mechanisms to map raw text characters directly to mel-spectrograms. Tacotron achieved an MOS of 3.82 on U.S. English evaluations, surpassing production parametric systems by automating linguistic-to-acoustic mappings and reducing errors from modular pipelines.[36] Tacotron 2, released later in 2017, integrated WaveNet as a vocoder, yielding MOS scores above 4.5 and human-like intonation through sequence-to-sequence training on paired text-audio data.[8] These models demonstrated deep learning's capacity for data-driven prosody modeling, though they required large corpora (e.g., tens of hours of single-speaker recordings) and did not readily generalize beyond training voices.[76]
Subsequent innovations in the late 2010s addressed inference latency and training efficiency, with non-autoregressive models like FastSpeech (2019) using feed-forward transformers to predict spectrograms in parallel, reducing synthesis time by orders of magnitude while maintaining comparable MOS to autoregressive baselines.[76] Generative adversarial networks (GANs), as in Parallel WaveGAN (2019), accelerated vocoding by training discriminators on waveform realism, enabling real-time applications without sacrificing perceptual quality.[76] By the early 2020s, transformer-based architectures dominated, supporting multilingual synthesis and voice adaptation with fewer parameters, as evidenced by systems achieving MOS over 4.2 across low-resource languages via transfer learning.[76] These developments underscored deep learning's empirical superiority in mimicking human speech acoustics, driven by scalable architectures rather than rule-based heuristics.
Emerging paradigms including diffusion and large language model integration
Diffusion models for text-to-speech (TTS) synthesis treat audio generation as a reverse diffusion process, starting from Gaussian noise and iteratively denoising to produce waveforms or spectrograms conditioned on text inputs, which allows for parallel sampling and superior naturalness compared to autoregressive neural vocoders.[78] This paradigm gained traction post-2020, with early implementations like Diff-TTS (2021) demonstrating improved perceptual quality through continuous-time diffusion, while discrete-time variants address computational efficiency for longer utterances.[78] Recent advances, such as E3-TTS (2023) and NaturalSpeech 2 (2024), leverage latent diffusion in compressed representations to reduce inference latency and enhance scalability, achieving mean opinion scores (MOS) exceeding 4.0 on benchmarks like LibriTTS for both quality and similarity.[79] Despite these gains, diffusion TTS faces challenges in real-time applications due to multiple denoising steps (typically 50–1000), prompting optimizations like classifier-free guidance and accelerated samplers that cut steps to under 10 while preserving fidelity.[80]
Integration of large language models (LLMs) with diffusion TTS further refines controllability and expressiveness by incorporating semantic priors from pre-trained text models to guide prosody, emotion, and style without explicit annotations. For instance, VALL-E (2023), developed by Microsoft, frames TTS as conditional language modeling over discrete audio tokens, enabling zero-shot voice cloning from 3 seconds of reference speech with MOS ratings of 3.97 for naturalness on unseen speakers.[43] Prompt-based systems like PL-TTS (2024) augment diffusion decoders with LLM-generated style descriptors, allowing fine-grained control over attributes such as speaking rate and accent via natural language inputs, outperforming baselines in subjective evaluations for style fidelity.[81] Hybrid approaches, including superposed LLM layers on diffusion backbones, boost synthesis quality by aligning textual semantics with acoustic features, as evidenced in models fine-tuned on datasets exceeding 100,000 hours, yielding up to 15% relative improvements in word error rates for downstream speech-to-text verification.[82]
These paradigms converge in unified architectures like DiT-TTS variants (2024–2025), where diffusion transformers process LLM-encoded prompts directly in latent space, facilitating multilingual zero-resource synthesis and reducing hallucinations (erroneous content insertions) through reinforced alignment between text and audio tokens.[83] Empirical evaluations on datasets like LibriSpeech and VCTK indicate diffusion-LLM systems achieve state-of-the-art zero-shot performance, with speaker-embedding cosine similarity scores above 0.85, though they demand vast training corpora (often >60,000 hours) to mitigate overfitting in low-data regimes.[82] Ongoing research addresses efficiency via distillation and metric optimization, positioning these methods as frontrunners for expressive, context-aware TTS in applications like virtual assistants and audiobooks.[84]
Technical Challenges and Limitations
Text preprocessing and normalization
Text preprocessing and normalization constitute the initial stage in the text-to-speech (TTS) pipeline, converting raw input text (often containing non-standard elements such as numbers, abbreviations, dates, currencies, symbols, and bracketed annotations) into a canonical spoken form suitable for subsequent linguistic analysis and waveform generation.[85] This process ensures that written representations align with how they would be verbalized in natural speech, preventing errors like pronouncing "123" as individual digits rather than "one hundred twenty-three."[86] Without effective normalization, downstream components such as grapheme-to-phoneme conversion produce incorrect phonetic outputs, leading to unnatural or unintelligible synthesis.[87]
Key subprocesses include tokenization, which segments text into meaningful units like words, punctuation, and non-alphabetic tokens; abbreviation expansion, drawing from dictionaries to resolve forms such as "e.g." to "for example"; and verbalization of numerals, where algorithms apply language-specific rules to handle cardinal, ordinal, and decimal representations.[88] Punctuation is interpreted to infer prosodic cues, such as pauses for commas or sentence boundaries, while case normalization standardizes text to lowercase for consistency, excluding proper nouns.[86] Electronic addresses, URLs, and acronyms pose additional complexities, often requiring custom rules to avoid literal reading, as in expanding "http://example.com" to descriptive spoken equivalents.
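As a rough illustration of the rule-based subprocesses described above, the following Python sketch performs whitespace tokenization, dictionary-based abbreviation expansion, and cardinal verbalization. The abbreviation table, the function names, and the 0 to 999,999 range limit are all assumptions made for this example; production normalizers cover ordinals, decimals, dates, currencies, and context-dependent readings that are not handled here.

```python
import re

# Hypothetical abbreviation table; production systems use far larger lexicons.
ABBREVIATIONS = {"e.g.": "for example", "etc.": "et cetera", "vs.": "versus"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def verbalize_cardinal(n: int) -> str:
    """Spell out an integer in the range 0..999,999 (American English style)."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only covers 0..999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = f" {verbalize_cardinal(rest)}" if rest else ""
        return f"{ONES[hundreds]} hundred{tail}"
    thousands, rest = divmod(n, 1000)
    tail = f" {verbalize_cardinal(rest)}" if rest else ""
    return f"{verbalize_cardinal(thousands)} thousand{tail}"


def normalize(text: str) -> str:
    """Whitespace-tokenize, expand abbreviations, and verbalize bare integers."""
    out = []
    for token in text.split():
        # A bare integer, optionally followed by one punctuation mark.
        number = re.fullmatch(r"(\d{1,6})([,.;:!?]?)", token)
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif number:
            out.append(verbalize_cardinal(int(number.group(1))) + number.group(2))
        else:
            out.append(token)
    return " ".join(out)


print(normalize("Chapter 123 covers 40 cases, e.g. dates."))
# Chapter one hundred twenty-three covers forty cases, for example dates.
```

Checking the abbreviation table before the number pattern keeps tokens such as "e.g." from being misread, which is a small-scale version of the ordering problems that hand-written normalization grammars have to manage.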
Bracketed stage directions or annotations (e.g., [laughs]) are typically stripped via regular expression matching to remove content within brackets, followed by whitespace normalization, preventing verbalization of non-spoken elements and ensuring clean audio output.[89]
Traditional approaches rely on rule-based systems employing hand-crafted grammars and finite-state transducers, which excel in coverage for high-resource languages like English but demand extensive manual engineering and struggle with ambiguity, such as homographs ("lead" as metal or verb) resolved via part-of-speech tagging or context.[90] Statistical and neural methods, including sequence-to-sequence models trained on parallel written-spoken corpora, have gained prominence since the 2010s, achieving lower error rates (e.g., under 1% word error rate on standard benchmarks for English) by learning contextual mappings end-to-end.[86] Hybrid systems combine rules for deterministic cases with machine learning for rare or context-sensitive tokens, as implemented in frameworks like NVIDIA NeMo, which supports multilingual normalization via weighted finite-state transducers augmented with neural components.[85]
Challenges persist in handling context-dependent disambiguation, where up to 20-30% of tokens in real-world text (e.g., news or web content) require inference from surrounding words, and in low-resource languages lacking parallel data, leading to reliance on transfer learning or zero-shot techniques with error rates exceeding 5%.[89] Multilingual systems must navigate code-switching and orthographic variations, such as digit grouping in European vs. Anglo-American formats, while dynamic content like financial reports amplifies the need for real-time, accurate verbalization to maintain intelligibility.[91] Evaluation typically uses metrics like normalized word error rate against gold-standard spoken transcripts, highlighting that rule-based methods scale poorly to informal text, whereas neural approaches, though data-hungry, reduce human-perceived unnaturalness in synthesized output.[90]
Phoneme conversion and linguistic mapping
Phoneme conversion in speech synthesis refers to the process of transforming orthographic text into a sequence of phonemes, the basic units of sound in a language, which serves as an intermediate representation for subsequent acoustic modeling. This step, often termed grapheme-to-phoneme (G2P) conversion, addresses the non-trivial mapping between written symbols (graphemes) and their phonetic realizations, essential for generating intelligible speech from arbitrary text inputs.[92] In systems reliant on phonemic input, accurate G2P ensures that the synthesizer produces correct pronunciations, particularly for languages with irregular spelling-to-sound correspondences like English.[93]
Traditional G2P methods include rule-based systems, which apply hand-crafted linguistic rules to derive phonemes from graphemes, and dictionary-based approaches that look up pre-stored pronunciations for known words. Data-driven techniques, such as statistical models trained on pronunciation lexicons, have largely supplanted rules for handling out-of-vocabulary (OOV) words by generalizing from corpus data. More recent advancements employ neural networks, including sequence-to-sequence models and large language models (LLMs), to capture contextual dependencies and improve accuracy on ambiguous cases, outperforming baselines by integrating in-context learning from speech recordings or phonetic corpora.[94][95][96]
Linguistic mapping extends phoneme conversion by incorporating higher-level language-specific knowledge, such as morphology, syntax, and prosodic features, to resolve ambiguities like homographs (e.g., "lead" as metal or verb) or stress patterns. In multilingual TTS, cross-lingual phoneme mapping aligns inventories from source and target languages using acoustic similarity metrics or learned correspondences, enabling voice transfer across under-resourced languages without native data.
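The dictionary lookup, rule-based fallback, and part-of-speech-driven homograph resolution described above can be sketched in a few lines of Python. Everything in this example is an assumption for illustration: the three-word lexicon, the ARPAbet-style symbols, the crude letter rules, and the `NOUN`/`VERB` tag names, which would come from a separate tagger in a real pipeline.

```python
# Illustrative data only: every entry below is an assumption for this sketch.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "speech": ["S", "P", "IY", "CH"],
    "lead": ["L", "IY", "D"],            # default reading when no POS tag is given
}
# Homographs are keyed by (word, part-of-speech tag).
HOMOGRAPHS = {
    ("lead", "NOUN"): ["L", "EH", "D"],  # the metal
    ("lead", "VERB"): ["L", "IY", "D"],  # to guide
}
# Crude letter-to-sound rules used only for out-of-vocabulary words.
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH", "ph": "F"}
LETTERS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def letter_to_sound(word):
    """Greedy left-to-right fallback: try a two-letter rule, then one letter."""
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phonemes.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        else:
            # Characters without a rule (digits, punctuation) are skipped.
            phonemes.extend(LETTERS.get(word[i], "").split())
            i += 1
    return phonemes


def g2p(word, pos=None):
    """Return phonemes: homograph table first, then lexicon, then rules."""
    word = word.lower()
    if (word, pos) in HOMOGRAPHS:
        return list(HOMOGRAPHS[(word, pos)])
    if word in LEXICON:
        return list(LEXICON[word])
    return letter_to_sound(word)


print(g2p("lead", "NOUN"))  # ['L', 'EH', 'D']
print(g2p("lead", "VERB"))  # ['L', 'IY', 'D']
print(g2p("ship"))          # ['SH', 'IH', 'P'] via the fallback rules
```

The fallback rules here are deliberately naive; data-driven and neural G2P models replace exactly this last step with a learned mapping.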
Techniques often combine phonetic similarity tables, human-validated alignments, and neural embeddings to bridge phonological gaps, as demonstrated in systems supporting dozens of languages via shared acoustic-phonetic spaces.[97][98][99]
Challenges in phoneme conversion and mapping arise from orthographic irregularities, contextual variability, and resource scarcity in low-resource languages, where OOV rates can exceed 20% and lead to pronunciation errors. Ambiguities from polysemous graphemes require disambiguation via surrounding text or part-of-speech tagging, while dialectal variations demand adaptive mappings. Emerging solutions leverage phoneme-aligned graphemes from TTS data to refine realizations, reducing reliance on static dictionaries and enhancing scalability, though evaluation remains tied to word error rates on held-out test sets.[93][100][101]
Prosody, intonation, and emotional expressiveness
Prosody in speech synthesis encompasses the suprasegmental features of speech, including rhythm, stress, and timing, which contribute to naturalness beyond individual phonemes. Intonation refers to variations in fundamental frequency (F0) that signal phrasing, emphasis, and sentence type, while emotional expressiveness involves modulating these elements to convey affect, such as joy or anger, through pitch contours, duration adjustments, and energy levels. In text-to-speech (TTS) systems, accurate prosody modeling is essential for listener comprehension and perceived humanity, as flat or mismatched prosody results in robotic output that impairs engagement.[102][103]
Early rule-based and concatenative TTS methods struggled with prosody due to hand-crafted rules or limited unit selection, often producing monotonous intonation lacking contextual adaptation, such as rising F0 for questions or stress on content words. Statistical parametric synthesis introduced hidden Markov models (HMMs) for duration and F0 prediction, but these relied on simplistic Gaussian mixtures, yielding unnatural variability. Neural approaches, particularly since 2016 with WaveNet and Tacotron, advanced prosody via end-to-end learning, where acoustic models predict mel-spectrograms incorporating prosodic cues from text embeddings, though initial implementations over-smoothed contours.[102][104]
Modern neural TTS employs techniques like global style tokens (GSTs) to capture latent prosodic styles, including emotional variance, by conditioning vocoders on clustered embeddings from expressive datasets. Fine-grained modeling uses predicted ToBI labels or pre-trained language models for syllable-level prosody, enhancing intonation control; for instance, cross-utterance prosody transfer via pre-trained acoustic encoders improves rhythm consistency across sentences. Diffusion-based models, integrated post-2022, refine prosody through iterative denoising of F0 trajectories, yielding more dynamic intonation than autoregressive methods. Prompt-driven systems, emerging in 2024-2025, enable explicit emotion and intensity control by injecting textual descriptors into multi-speaker architectures, addressing variability in affective synthesis.[105][106][107]
Emotional expressiveness remains challenging, as prosodic markers like exaggerated pitch range for excitement or slowed tempo for sadness require disentangling from linguistic content, often leading to over- or under-modulation in zero-shot scenarios. Low-resource languages exacerbate issues, with insufficient expressive data causing generic intonation; hybrid articulatory models attempt mitigation by simulating vocal tract dynamics for nuanced emotion but demand high computational cost. Evaluation relies on mean opinion scores (MOS) for naturalness and prosodic adequacy, supplemented by objective metrics like F0 correlation or prosodic deviation indices, though subjective human judgments reveal persistent gaps in conveying subtle affects like sarcasm. Despite progress, synthesized speech in 2025 still lags human variability, particularly in real-time applications where prosody prediction must balance fidelity and latency.[108][109][110]
Evaluation methodologies and metrics
Subjective evaluation remains the gold standard for assessing speech synthesis quality, as it directly captures human perceptual judgments of attributes like naturalness, intelligibility, and expressiveness.[111] In Mean Opinion Score (MOS) tests, listeners rate synthesized speech on a 1-5 scale (1: bad, 5: excellent) for overall quality or specific dimensions, following ITU-T Recommendation P.800 guidelines established in 1996 and updated periodically. MOS is widely used due to its simplicity but suffers from limitations, including poor sensitivity to subtle differences in high-fidelity modern systems and vulnerability to inter-listener variability, with studies showing correlations dropping below 0.7 for neural TTS outputs.[112] To mitigate anchoring effects and improve comparative reliability, Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests present multiple stimuli (including a hidden natural reference and low/high anchors) rated on a 0-100 scale, enabling finer discrimination as validated in ITU-R BS.1534-3 (2015).[113] MUSHRA outperforms MOS for evaluating advanced TTS, detecting quality gaps in prosody and timbre that MOS often conflates, though it requires more participant effort and controlled conditions.[114]
Objective metrics provide scalable, automated alternatives by computing distances or predictions against reference speech, though their correlation with human judgments varies (typically 0.6-0.9 for PESQ) and weakens for non-linear neural distortions.[115] Mel-Cepstral Distortion (MCD) quantifies spectral envelope differences via cepstral coefficients, with lower values (e.g., <5 dB for good quality) indicating similarity, but it ignores phase and temporal alignment.[116] Perceptual Evaluation of Speech Quality (PESQ), standardized in ITU-T P.862 (2001) and improved as POLQA (P.863, 2014), models human auditory perception to predict MOS scores, achieving high correlation (up to 0.93) for degraded speech but underperforming on clean, expressive synthesis.[117] Short-Time Objective Intelligibility (STOI) estimates word recognition rates by correlating short-time spectra, correlating at ρ=0.95 with subjective intelligibility but focusing narrowly on clarity over naturalness.[118] Emerging neural predictors like MOSNet (2019) use deep networks trained on MOS data to estimate scores from raw audio, offering faster evaluation but risking overfitting to training biases in datasets.[119]
| Metric Type | Example Metrics | Primary Assessment | Strengths | Limitations |
|---|---|---|---|---|
| Subjective | MOS, MUSHRA | Naturalness, intelligibility | Aligns with human perception | Costly, subjective variance |
| Objective | MCD, PESQ, STOI | Spectral similarity, quality prediction, intelligibility | Automated, repeatable | Weaker correlation for high-quality TTS; requires references |
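Two of the metrics above are simple enough to sketch directly. The Python example below computes a MOS with a normal-approximation 95% confidence interval and a Mel-Cepstral Distortion using the commonly cited form MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over frames. It assumes the reference and synthesized frames are already time-aligned (for example by dynamic time warping, which is not shown) and that coefficient 0, the energy term, is excluded; the function names and the input layout are choices made for this sketch.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of a normal-approximation CI."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over frames that are assumed to be time-aligned.

    Each frame is a sequence of mel-cepstral coefficients; index 0 (energy)
    is skipped, as is conventional.
    """
    if len(ref_frames) != len(syn_frames):
        raise ValueError("frames must be aligned to the same length")
    scale = 10.0 / math.log(10.0)
    per_frame = []
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        per_frame.append(scale * math.sqrt(2.0 * squared))
    return statistics.fmean(per_frame)


# Listener ratings on the 1-5 scale for one system (made-up numbers).
mean, half_width = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")

# Two aligned frames of (made-up) mel-cepstral coefficients.
reference = [[1.0, 0.50, -0.20], [1.1, 0.40, -0.10]]
synthesized = [[0.9, 0.45, -0.25], [1.0, 0.35, -0.05]]
print(f"MCD = {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

Reporting the interval alongside the mean makes it easier to see when two systems' MOS values are too close to distinguish with the available number of listeners.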
Scalability issues in multilingual and low-resource languages
Scalability in speech synthesis for multilingual environments is hindered by the heavy data demands of neural TTS models, which typically require thousands of hours of high-quality, paired text-audio data per language to achieve natural-sounding output. For low-resource languages (defined as those with fewer than 1 million speakers or limited digitized corpora), this scarcity leads to undertrained models exhibiting artifacts like unnatural prosody, phonetic inaccuracies, and speaker inconsistencies. In 2022, analyses showed that most of the world's roughly 7,000 languages lack sufficient resources for robust TTS development, exacerbating digital divides as high-resource languages like English dominate datasets comprising 90% or more of training corpora.[122][123][124]
Multilingual TTS systems aim to address this by pooling data across languages into shared models, but scalability falters due to cross-lingual interference, where phonetic and prosodic features from dominant languages degrade performance in target low-resource ones. For instance, models pretrained on Indo-European languages struggle with tonal systems in African or Austronesian tongues, resulting in mean opinion scores (MOS) dropping by 0.5–1.0 points for unseen low-resource variants. Data quality compounds the issue: multilingual corpora often suffer from inconsistent annotations, code-switching artifacts, and biased sampling favoring urban dialects, with error rates in phoneme alignment exceeding 20% in under-resourced pairs. This limits zero-shot generalization, where models fail to synthesize fluent speech for novel languages without fine-tuning, as evidenced in evaluations across 100+ languages using unsupervised found data.
Further challenges arise in preprocessing and linguistic mapping for diverse scripts and morphologies; low-resource languages frequently lack standardized grapheme-to-phoneme converters or normalization tools, inflating out-of-vocabulary rates to 15–30% and necessitating manual interventions that are infeasible at scale. Evaluation metrics like MOS or word error rates prove unreliable across languages due to cultural variances in perceived naturalness, with inter-annotator agreement falling below 0.6 for non-English low-resource cases. These factors render deploying TTS at global scale computationally prohibitive, as adapting models for each of the estimated 40 low-resource languages targeted in recent benchmarks requires 10–100x more parameters than monolingual setups, straining inference on edge devices.[125][126][127]
Implementations in Hardware and Software
Dedicated speech synthesis hardware
Dedicated speech synthesis hardware emerged in the late 1970s and 1980s as specialized integrated circuits and modules designed to generate speech from text or phoneme inputs, primarily for embedded applications where computational resources were limited. These devices typically employed techniques such as linear predictive coding (LPC), phoneme synthesis, or formant synthesis to produce intelligible speech at low cost and power. Unlike general-purpose processors running software-based synthesis, dedicated hardware prioritized real-time performance and simplicity, finding use in toys, computers, arcade games, and accessibility aids.[128]
Texas Instruments pioneered single-chip LPC synthesis with the TMC0280 (later designated TMS5100), introduced in 1978 in the Speak & Spell educational toy, which used a digital filter driven by excitation signals to synthesize speech from pre-stored LPC coefficients.[128] Improved variants, the TMS5200 and TMS5220, featured enhanced chirp tables and statistical modeling for better quality and were integrated into devices such as the TI-99/4A computer's speech synthesizer module released in 1981.[129] These chips required external ROM for vocabulary storage and processed data at rates supporting continuous speech output via an internal D/A converter.[128]
Votrax's SC-01, released around 1980, was a single-chip phoneme synthesizer capable of unlimited English vocabulary by combining 64 phonemes at 70 bits per second.[130] It generated speech through formant-like filtering of voiced/unvoiced excitations and was employed in standalone devices like the Type 'n Talk board and arcade titles including Gorf (1981) and Q*bert (1982).[131] Similarly, General Instrument's SP0256-AL2 chip from the early 1980s utilized 59 allophones for low-bitrate synthesis, enabling applications in toys and early computers by sequencing discrete speech primitives.[132]
Digital Equipment Corporation's DECtalk DTC-01, introduced in 1984, represented a more advanced formant synthesizer hardware unit that converted unrestricted text to speech with high intelligibility across multiple voices.[133] Based on cascaded resonators modeling the vocal tract, it supported prosodic control and was widely adopted for accessibility, notably by physicist Stephen Hawking from 1986 until his death in 2018.[134] The system's hardware implementation allowed standalone operation via serial input, outputting audio through integrated amplification.
By the 1990s, advances in general-purpose DSPs and software algorithms diminished the prevalence of dedicated TTS hardware, shifting synthesis to programmable platforms for greater flexibility and naturalness, though emulations and niche revivals persist for vintage computing.[135]
Integrated systems in consumer electronics and OS
Integrated speech synthesis systems are embedded in major operating systems to support accessibility features, virtual assistants, and user interfaces, leveraging on-device processing for low latency and privacy. In Apple's iOS and macOS, the AVSpeechSynthesizer (iOS) and NSSpeechSynthesizer (macOS) frameworks enable text-to-speech conversion with adjustable parameters such as speech rate, pitch multiplier, and volume, supporting dozens of voices across multiple languages including English, Spanish, and Mandarin.[136][137] AVSpeechSynthesizer, introduced with iOS 7 in September 2013, integrates with features like VoiceOver for screen reading and Siri for responsive interactions, processing synthesis via the device's CPU and neural models for natural prosody.
Android incorporates text-to-speech through the TextToSpeech class in its SDK, allowing apps to synthesize speech offline using installed engines like Google's, with support for locale-specific voices and synthesis callbacks for pausing or queuing utterances. This integration dates to Android 1.6 (Donut) in 2009, evolving to include neural voices via updates like those in Android 10 (2019), and powers TalkBack accessibility and Google Assistant responses.[138]
Microsoft Windows employs the Speech API (SAPI) version 5, first released in 2000 and refined in subsequent versions, to drive TTS in Narrator and other applications, supporting XML-based speech markup (SSML) for prosody control and multiple installed voices.[139]
In consumer electronics, these OS-level systems extend to smartphones, where iOS and Android TTS handle real-time readout in apps, and to smart speakers like Amazon Echo devices, which integrate Amazon's Polly neural TTS engine for Alexa responses, processing text via cloud or edge computation for multilingual output.[140] Hardware in such devices typically relies on general-purpose SoCs (system-on-chips) for synthesis, with audio DSPs accelerating waveform generation, as dedicated TTS silicon remains rare outside specialized assistive hardware.[141]
Commercial text-to-speech platforms and APIs
Commercial text-to-speech (TTS) platforms and APIs provide developers with cloud-based services to generate synthesized speech from text inputs, typically via RESTful APIs or software development kits (SDKs), enabling integration into applications for voiceovers, virtual assistants, and accessibility tools.[142][143] These services leverage neural network models, such as WaveNet or deep learning architectures, to produce natural-sounding audio with customizable parameters like pitch, speed, and prosody.[144] Major providers include Google Cloud, Amazon Web Services (AWS), Microsoft Azure, and IBM Watson, each offering pay-per-use pricing models based on characters processed or audio minutes generated, with support for Speech Synthesis Markup Language (SSML) for fine-tuned control.[145][146]
Google Cloud Text-to-Speech, launched in March 2018, initially featured 32 voices across 12 languages using DeepMind's WaveNet technology for high-fidelity output, and by 2019 had expanded to 95 WaveNet voices in 33 languages.[147][148] As of 2025, it supports over 220 voices in more than 40 languages and variants, including custom voice options and real-time streaming synthesis via API calls that allow adjustments to speaking rate, volume, and pitch.[142] The service integrates with other Google Cloud tools for applications like content creation and supports SSML for expressive features such as pauses and emphasis.[149] It employs a pay-per-use pricing model with free monthly quotas: the first 4 million characters for Standard (non-WaveNet) voices and 1 million characters for WaveNet voices; billing must be enabled, but charges apply only if usage exceeds these limits, and new customers receive $300 in free credits.[150]
Amazon Polly, introduced as part of AWS services around 2016, uses deep learning to convert text or SSML inputs into lifelike speech, supporting over 60 voices in more than 30 languages with neural TTS for improved expressiveness.[143][151] Developers access it via API operations like SynthesizeSpeech, which outputs audio streams in formats such as MP3 or PCM, and includes lexicon support for custom pronunciations.[144] Polly emphasizes low-latency generation for real-time use cases and provides speech marks for synchronizing text with audio timestamps.[145]
Microsoft Azure AI Speech service, encompassing TTS capabilities through its Speech SDK and REST APIs, supports neural voices for human-like synthesis and allows custom voice creation from audio samples.[146] Launched as part of Cognitive Services (now Azure AI), it handles real-time synthesis in multiple languages, with features like pronunciation assessment and SSML for prosody control, updated as of August 2025 to include enhanced voice gallery options.[152][153] The SDK supports cross-platform integration for applications requiring adaptive speech output.[154]
IBM Watson Text to Speech, available via IBM Cloud, synthesizes text into audio using neural models, offering a range of voices and dialects across languages with API endpoints for both plain text and SSML inputs.[155][156] It supports expressive styles and customization for enterprise applications, with documentation updated as of June 2023 emphasizing natural intonation.[157]
Emerging commercial providers like ElevenLabs offer specialized APIs focused on ultra-realistic, emotionally nuanced TTS, with low-latency text-to-speech endpoints supporting voice cloning from short audio samples (seconds to minutes) and multilingual output for commercial integrations, serving as accessible AI-driven tools for voice-over creation. Launched post-2022, ElevenLabs' API enables developers to generate audio with adaptive pacing and intonation via simple HTTP requests, priced on credit-based tiers for high-volume use.[164][165] Platforms such as Yandex SpeechKit provide TTS specialized for Russian language support with customizable voices via API.[158] Murf.ai supports content creation applications with professional tone options, and Play.ht enables synthesis across over 140 languages including Russian.[159][160] For accessible alternatives, free tools include NaturalReader with a user-friendly interface for text-to-speech conversion and Balabolka, which accommodates custom voices and various input file formats. Platforms such as Fish Audio enable rapid text-to-speech generation with similar cloning capabilities across multiple languages. For open-source alternatives, numerous pre-trained TTS models on Hugging Face facilitate local use and customization by technically skilled users.[161][162][163]
| Provider | Approximate Launch Year | Voices/Languages Supported | Key Features |
|---|---|---|---|
| Google Cloud TTS | 2018 | 220+ voices / 40+ languages | WaveNet neural synthesis, SSML, custom pitch/speed, real-time streaming[142] |
| Amazon Polly | 2016 | 60+ voices / 30+ languages | Deep learning SSML processing, speech marks, lexicon customization[145] |
| Microsoft Azure Speech | ~2016 (Cognitive Services) | Neural/custom voices / Multiple languages | SDK/REST APIs, pronunciation tools, voice gallery[146] |
| IBM Watson TTS | Pre-2023 (IBM Cloud) | Variety of voices/dialects / Multiple languages | Neural expressiveness, SSML support, enterprise scalability[155] |
| ElevenLabs | Post-2022 | High-fidelity cloned voices / Multilingual | Emotional awareness, low-latency API, voice adaptation[161] |
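Because the major services listed above accept SSML input, a request body is often assembled programmatically. The Python sketch below builds a small SSML document using only standard elements (`speak`, `prosody`, `s`, `break`) and the standard library's XML writer, which also takes care of escaping characters such as `&`. The `build_ssml` helper and its parameters are hypothetical names for this example; some services require additional attributes on the root element (version, namespace, language) or support only a subset of tags, so each provider's documentation should be checked before sending the result to an API.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences, rate="medium", pitch="default", pause_ms=300):
    """Wrap sentences in <prosody> and separate them with <break> pauses."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    for index, sentence in enumerate(sentences):
        element = ET.SubElement(prosody, "s")
        element.text = sentence  # special characters are escaped on output
        if index < len(sentences) - 1:
            ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(["Welcome back.", "Today: Tom & Jerry."], rate="slow")
print(ssml)
```

Building the markup through an XML library rather than string concatenation avoids malformed requests when user-supplied text contains characters that are reserved in XML.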
Open-source and research-oriented systems
Open-source speech synthesis systems provide accessible platforms for developers and researchers to build, modify, and experiment with TTS technologies, often prioritizing reproducibility, customization, and deployment flexibility over commercial polish. These systems span traditional rule-based and concatenative approaches to modern neural architectures, fostering innovation in areas like multilingual support and low-resource languages. While commercial systems may leverage vast proprietary datasets, open-source efforts rely on community-contributed data and models, enabling rapid prototyping but sometimes resulting in variable audio quality due to training constraints.[166]
Early open-source frameworks include Festival, a modular system developed at the University of Edinburgh that supports diphone-based synthesis and allows integration of custom voices and languages through Scheme scripting. Released in the mid-1990s, Festival has been used in research for building domain-specific synthesizers, though its output sounds more robotic compared to neural methods.[167] Similarly, eSpeak NG employs formant synthesis for compact, cross-platform operation, supporting over 100 languages and accents with phonetic rules rather than large corpora, making it suitable for embedded devices despite its synthetic timbre.[168]
In the neural era, Coqui TTS (formerly Mozilla TTS) stands out as a comprehensive deep learning toolkit, offering pretrained models for over 1,100 languages and tools for training architectures such as Tacotron2, Glow-TTS, and VITS, with support for multi-speaker and voice cloning via fine-tuning. Active development through 2023 emphasized extensibility for research, including vocoder integration for waveform generation.[166] Piper, an optimized neural TTS engine, leverages VITS-like end-to-end models for real-time inference on consumer hardware, generating speech at speeds exceeding 100x realtime on CPUs while maintaining natural prosody through lightweight neural networks trained on public datasets.[169] Tortoise TTS prioritizes fidelity with diffusion-based autoregressive modeling, enabling zero-shot multi-voice synthesis from short audio clips, though inference requires significant GPU resources (often minutes per sentence), highlighting trade-offs in research prototypes between quality and efficiency.[170]
Research-oriented systems often emerge from academic papers with open implementations, advancing core challenges like parallelism and controllability. VITS, proposed in 2021, combines conditional variational autoencoders, normalizing flows, and adversarial training for fully parallel end-to-end synthesis that merges acoustic modeling and vocoding in a single stage, outperforming prior two-stage models in mean opinion scores (MOS) on datasets like LJ Speech, with real-time factor under 0.2 on GPUs.[171] These models facilitate experimentation in prosody modeling and zero-shot adaptation, though empirical evaluations reveal sensitivities to training data quality, underscoring the need for diverse corpora to mitigate biases in open-source benchmarks.[171]
| System | Synthesis Type | Key Strengths | Limitations | Initial Release |
|---|---|---|---|---|
| Festival | Concatenative/diphone | Modular design, easy voice building | Dated sound quality | Mid-1990s |
| eSpeak NG | Formant | Multilingual, low footprint | Robotic prosody | 2015 (NG fork) |
| Coqui TTS | Neural end-to-end | Training toolkit, broad language support | Compute-intensive fine-tuning | 2019 |
| Piper | Neural (VITS-based) | On-device speed, natural flow | Limited voice variety out-of-box | 2022 |
| Tortoise TTS | Autoregressive + diffusion | High-fidelity cloning, intonation | Slow generation | 2022 |
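
The two evaluation figures quoted in this section, mean opinion score (MOS) and real-time factor, reduce to simple arithmetic. The following is a minimal Python sketch of how they are commonly computed, not the procedure used by any of the cited systems. The function names, the sample ratings, and the timings are hypothetical, and the 95% interval assumes a normal approximation.

```python
import math
import statistics


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean the system runs faster than realtime.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """A speed quoted as 'Nx realtime' is the reciprocal of the RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def mean_opinion_score(ratings):
    """Return (mean, half_width) for listener ratings on a 1-5 scale.

    half_width is a normal-approximation 95% confidence interval,
    i.e. 1.96 times the standard error of the mean (an assumption here).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mean = statistics.fmean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * standard_error


# Hypothetical numbers, chosen only to illustrate the arithmetic.
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=10.0)
mos, ci = mean_opinion_score([4, 5, 4, 4, 3, 5, 4, 4])
print(f"RTF {rtf:.3f}, {speedup_over_realtime(rtf):.1f}x realtime")
print(f"MOS {mos:.2f} +/- {ci:.2f}")
```

Read this way, a real-time factor under 0.2 corresponds to more than 5x realtime, and a speed above 100x realtime corresponds to a real-time factor below 0.01. MOS comparisons between systems are usually reported with an interval like the one above, since differences smaller than the interval are hard to interpret.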


