eSpeak
| eSpeakNG | |
|---|---|
| Original author | Jonathan Duddington |
| Developer | Alexander Epaneshnikov et al. |
| Initial release | February 2006 |
| Stable release | 1.51[1] |
| Repository | GitHub |
| Written in | C |
| Operating system | Linux, Windows, macOS, FreeBSD |
| Type | Speech synthesizer |
| License | GPLv3 |
| Website | GitHub |
eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG (Next Generation) is a continuation of the original developer's project, with more feedback from native speakers.
Because of its small size and many languages, eSpeakNG is included in the NVDA open-source screen reader for Windows,[2] as well as in Android,[3] Ubuntu[4] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016[5] and was used by Google Translate for 27 languages in 2010;[6] 17 of these were subsequently replaced by proprietary voices.[7]
The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.[8] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.
History
In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers, supporting British English.[9] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.[10] Development on Speak continued until version 1.14, when it was renamed to eSpeak.
Development of eSpeak continued from 1.16 (there was no 1.15 release),[10] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first release of eSpeak to be version-controlled using Subversion,[11] with separate source and binary downloads made available on SourceForge.[10] From version 1.27, eSpeak was updated to use the GPLv3 license.[11] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.[12] The last development release of eSpeak was 1.48.15, on 16 April 2015.[13]
eSpeak represents phonemes with ASCII characters, using a scheme derived from the Usenet (Kirshenbaum) notation.[14]
eSpeak NG
On 25 June 2010,[15] Reece Dunn started a fork of eSpeak on GitHub, based on the 1.43.46 release. This began as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.
On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.[16][17]
On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington over the previous 8 months from the last eSpeak development release. This evolved into discussions of continuing development of eSpeak in Jonathan's absence.[18][19] The result of this was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.
On 11 December 2015, the espeak-ng fork was started.[20] The first release of espeak-ng was 1.49.0 on 10 September 2016,[21] containing significant code cleanup, bug fixes, and language updates.
Features
eSpeakNG can be used as a command-line program or as a shared library.
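As an illustration of the shared-library usage just mentioned, the following minimal C sketch speaks a string through the espeak-ng API. It assumes libespeak-ng and its speak_lib.h header are installed (the build command cc demo.c -lespeak-ng is likewise an assumption about the local toolchain); it is a sketch, not code taken from the project itself:

```c
#include <string.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    const char *text = "Hello from eSpeakNG.";

    /* Initialize for direct playback; returns the sample rate, or a negative
       value on failure. */
    if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
        return 1;

    espeak_SetVoiceByName("en");                 /* select the English voice */
    espeak_Synth(text, strlen(text) + 1, 0,      /* synthesize the whole string */
                 POS_CHARACTER, 0, espeakCHARS_AUTO, NULL, NULL);
    espeak_Synchronize();                        /* block until speech has finished */
    espeak_Terminate();
    return 0;
}
```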
It supports Speech Synthesis Markup Language (SSML).
Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.
Phonetic representations can be embedded in the input text by enclosing them in double square brackets. For example, espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English.
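The voice-variant and double-bracket phoneme mechanisms described above are also reachable from the library API. The sketch below is illustrative and rests on an assumption: that the espeakPHONEMES flag declared in speak_lib.h enables [[...]] phoneme input when synthesizing through the API (on the command line no flag is needed):

```c
#include <string.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    /* Phoneme mnemonics inside double brackets, as in the command-line example. */
    const char *text = "Hello [[w3:ld]]";

    if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
        return 1;

    /* "af+f2": the Afrikaans voice modified by the "f2" variant described above. */
    espeak_SetVoiceByName("af+f2");

    /* espeakPHONEMES is assumed here to make the engine honor [[ ]] input. */
    espeak_Synth(text, strlen(text) + 1, 0, POS_CHARACTER, 0,
                 espeakCHARS_AUTO | espeakPHONEMES, NULL, NULL);
    espeak_Synchronize();
    espeak_Terminate();
    return 0;
}
```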
Synthesis method
eSpeakNG can be used as a text-to-speech translator in different ways, depending on which step of the text-to-speech pipeline the user wants to use.
Step 1 – text-to-phoneme translation
There are many languages (notably English) which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.
- Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks).
- Pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced aloud).
To add intonation to the speech, prosody data are necessary (e.g. syllable stress, falling or rising pitch of the fundamental frequency, pauses), along with other information that allows more natural, less monotonous speech to be synthesized. For example, in eSpeakNG notation a stressed syllable is marked with an apostrophe, z'i@r0ks, which produces more natural speech.
If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.
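For callers who want only this first step, the espeak-ng library exposes a text-to-phoneme call. The following is a minimal sketch, assuming espeak_TextToPhonemes() behaves as described in the library header (one clause translated per call, with phoneme mode 0 selecting eSpeak's ASCII mnemonics):

```c
#include <stdio.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    const char *text = "xerox";
    const void *ptr = text;
    const char *phonemes;

    /* AUDIO_OUTPUT_RETRIEVAL: no sound device is opened; we only want phonemes. */
    if (espeak_Initialize(AUDIO_OUTPUT_RETRIEVAL, 0, NULL, 0) < 0)
        return 1;
    espeak_SetVoiceByName("en");

    /* Translate the next clause of the text into phoneme mnemonics. */
    phonemes = espeak_TextToPhonemes(&ptr, espeakCHARS_AUTO, 0);
    if (phonemes != NULL)
        printf("%s\n", phonemes);   /* expected, per the text above: z'i@r0ks */

    espeak_Terminate();
    return 0;
}
```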
Step 2 – sound synthesis from prosody data
eSpeakNG provides two different approaches to formant speech synthesis: its own eSpeakNG synthesizer and a Klatt synthesizer:[22]
- The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by additive synthesis, adding together sine waves to make the total sound (a conceptual sketch follows this list). Unvoiced consonants, e.g. /s/, are made by playing recorded sounds,[23] because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
- The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it produces sounds by subtractive synthesis: it starts with generated noise, which is rich in harmonics, and then applies digital filters and enveloping to shape the frequency spectrum and amplitude envelope needed for a particular consonant (s, t, k) or sonorant (l, m, n) sound.
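The stand-alone sketch below illustrates the additive idea from the first approach above: it sums sine waves at the harmonics of a fundamental, weighting each one by its distance from three assumed formant peaks. The frequencies, bandwidths and levels are illustrative values, not eSpeakNG's actual data or code:

```c
#include <math.h>
#include <stdio.h>

/* Conceptual additive synthesis of a static, vowel-like tone. */
int main(void)
{
    const double PI = 3.14159265358979323846;
    const double sample_rate = 22050.0;
    const double f0 = 120.0;                                /* fundamental pitch, Hz */
    const double formant[3]   = { 700.0, 1200.0, 2600.0 };  /* rough F1, F2, F3 */
    const double bandwidth[3] = {  90.0,  110.0,  160.0 };
    const double level[3]     = {   1.0,    0.6,    0.3 };

    for (int n = 0; n < (int)(0.5 * sample_rate); n++) {    /* half a second of audio */
        double t = n / sample_rate, sample = 0.0;
        for (int h = 1; h * f0 < sample_rate / 2.0; h++) {  /* one sine per harmonic */
            double freq = h * f0, weight = 0.0;
            for (int k = 0; k < 3; k++) {
                double d = (freq - formant[k]) / bandwidth[k];
                weight += level[k] * exp(-d * d);           /* stronger near a formant */
            }
            sample += weight * sin(2.0 * PI * freq * t);
        }
        printf("%f\n", sample);                             /* raw samples to stdout */
    }
    return 0;
}
```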
For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program using the PHO file format and captures the audio output created by MBROLA. That audio is then handled by eSpeakNG.
Languages
eSpeakNG performs text-to-speech synthesis for the following languages:[24]
- Afrikaans[25]
- Albanian[26]
- Amharic
- Ancient Greek
- Arabic[a]
- Aragonese[27]
- Armenian (Eastern Armenian)
- Armenian (Western Armenian)
- Assamese
- Azerbaijani
- Bashkir
- Basque
- Belarusian
- Bengali
- Bishnupriya Manipuri
- Bosnian
- Bulgarian[27]
- Burmese
- Cantonese[27]
- Catalan[27]
- Cherokee
- Chinese (Mandarin)
- Croatian[27]
- Czech
- Chuvash
- Danish[27]
- Dutch[27]
- English (American)[27]
- English (British)
- English (Caribbean)
- English (Lancastrian)
- English (New York City)[b]
- English (Received Pronunciation)
- English (Scottish)
- English (West Midlands)
- Esperanto[27]
- Estonian[27]
- Finnish[27]
- French (Belgian)[27]
- French (Canada)
- French (France)
- Georgian[27]
- German[27]
- Greek (Modern)[27]
- Greenlandic
- Guarani
- Gujarati
- Hakka Chinese[c]
- Haitian Creole
- Hawaiian
- Hebrew
- Hindi[27]
- Hungarian[27]
- Icelandic[27]
- Indonesian[27]
- Ido
- Interlingua
- Irish[27]
- Italian[27]
- Japanese[d][28]
- Kannada[27]
- Kazakh
- Klingon
- Kʼicheʼ
- Konkani[29]
- Korean
- Kurdish[27]
- Kyrgyz
- Quechua
- Latin
- Latgalian
- Latvian[27]
- Lingua Franca Nova
- Lithuanian
- Lojban[27]
- Luxembourgish
- Macedonian
- Malay[27]
- Malayalam[27]
- Maltese
- Manipuri
- Māori
- Marathi[27]
- Nahuatl (Classical)
- Nepali[27]
- Norwegian (Bokmål)[27]
- Nogai
- Oromo
- Papiamento
- Persian[27]
- Persian (Latin alphabet)
- Polish[27]
- Portuguese (Brazilian)[27]
- Portuguese (Portugal)
- Punjabi[30]
- Pyash (a constructed language)
- Quenya
- Romanian[27]
- Russian[27]
- Russian (Latvia)
- Scottish Gaelic
- Serbian[27]
- Setswana
- Shan (Tai Yai)
- Sindarin
- Sindhi
- Sinhala
- Slovak[27]
- Slovenian
- Spanish (Spain)[27]
- Spanish (Latin American)
- Swahili[25]
- Swedish[27]
- Tamil[27]
- Tatar
- Telugu
- Thai
- Turkmen
- Turkish[27]
- Uyghur
- Ukrainian
- Urarina
- Urdu
- Uzbek
- Vietnamese (Central Vietnamese)[27]
- Vietnamese (Northern Vietnamese)
- Vietnamese (Southern Vietnamese)
- Welsh
- ^ Currently, only fully diacritized Arabic is supported.
- ^ Currently unreleased; it must be built from the latest source code.
- ^ Currently, only Pha̍k-fa-sṳ is supported.
- ^ Currently, only Hiragana and Katakana are supported.
References
- ^ "Release 1.51".
- ^ "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/nvda". GitHub.
- ^ "eSpeak TTS - Android Apps on Google Play". play.google.com.
- ^ "espeak-ng package : Ubuntu". Launchpad. 21 December 2023.
- ^ "Download voices for Immersive Reader, Read Mode, and Read Aloud".
- ^ Google blog, Giving a voice to more languages on Google Translate, May 2010
- ^ Google blog, Listen to us now, December 2010.
- ^ "eSpeak Speech Synthesizer". espeak.sourceforge.net.
- ^ "eSpeak: Speech Synthesizer". espeak.sourceforge.net.
- ^ a b c "ESpeak: Speech synthesis - Browse /Espeak at SourceForge.net".
- ^ a b "eSpeak: speech synthesis / Code / Browse Commits". sourceforge.net.
- ^ "Espeak: Downloads".
- ^ http://espeak.sourceforge.net/test/latest.html Archived 4 January 2017 at the Wayback Machine [bare URL]
- ^ van Leussen, Jan-Wilem; Tromp, Maarten (26 July 2007). "Latin to Speech". p. 6. CiteSeerX 10.1.1.396.7811.
- ^ "Build: Allow portaudio 18 and 19 to be switched easily. · rhdunn/Espeak@63daaec". GitHub.
- ^ "Espeakedit: Fix argument processing for unicode argv types · rhdunn/Espeak@61522a1". GitHub.
- ^ "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/Nvda". GitHub.
- ^ "[Espeak-general] Taking ownership of the espeak project and its future | eSpeak: speech synthesis". sourceforge.net.
- ^ "[Espeak-general] Vote for new main espeak developer | eSpeak: speech synthesis". sourceforge.net.
- ^ "Rebrand the espeak program to espeak-ng. · Issue #2 · espeak-ng/espeak-ng". GitHub.
- ^ "Release 1.49.0 · espeak-ng/espeak-ng". GitHub.
- ^ Klatt, Dennis H. (1979). "Software for a cascade/parallel formant synthesizer" (PDF). J. Acoustical Society of America, 67(3) March 1980.
- ^ "espeak-ng". GitHub.
- ^ "ESpeak NG Text-to-Speech". GitHub. 13 February 2022.
- ^ a b Butgereit, L., & Botha, A. (2009, May). Hadeda: The noisy way to practice spelling vocabulary using a cell phone. In The IST-Africa 2009 Conference, Kampala, Uganda.
- ^ Hamiti, M., & Kastrati, R. (2014). Adapting eSpeak for converting text into speech in Albanian. International Journal of Computer Science Issues (IJCSI), 11(4), 21.
- ^ a b c d e f g h i j k l m n o p q r s t u v w x y z aa ab ac ad ae af ag ah ai aj ak al am an ao ap Kayte, S., & Gawali, D. B. (2015). Marathi Speech Synthesis: A review. International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
- ^ Pronk, R. (2013). Adding Japanese language synthesis support to the eSpeak system. University of Amsterdam.
- ^ Mohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). Text Reader for Konkani Language. Automation and Autonomous System, 4(8), 409-414.
- ^ Kaur, R., & Sharma, D. (2016). An Improved System for Converting Text into Speech for Punjabi Language using eSpeak. International Research Journal of Engineering and Technology, 3(4), 500-504.
eSpeak

History
Initial Development
The initial development of eSpeak traces back to 1995, when Jonathan Duddington created the "speak" program for Acorn/RISC OS computers, initially supporting British English as a compact speech synthesizer.[1] This early version was designed with efficiency in mind, targeting the limited resources of RISC OS systems, and laid the foundation for the formant-based synthesis techniques that would define the project.[2] In 2007, Duddington enhanced and rewrote the program, renaming it eSpeak to reflect its expanded capabilities, including relaxed memory and processing constraints that enabled broader applicability beyond RISC OS.[2] The first public releases around this time established eSpeak as an open-source, formant-based text-to-speech synthesizer, initially focused on English but quickly incorporating support for other languages.[3]

Key enhancements under Duddington's solo development included ports to multiple platforms such as Linux and Windows, which broadened its accessibility, and initial multi-language support through rule-based phoneme conversion.[1] Duddington maintained active development through numerous iterations, with the version history progressing from early releases in the 2000s to more refined builds addressing prosody, voice variants, and language dictionaries. The last major release series, eSpeak 1.48 (including subversions such as 1.48.04), concluded in 2015, incorporating improvements in synthesis quality and platform integration before development slowed.[4] Duddington's subsequent inactivity marked the end of this individually led phase, after which maintenance transitioned to community-driven open-source development in the form of eSpeak NG.[5]

eSpeak NG Continuation
eSpeak NG was forked from the original eSpeak project in late 2015 to improve maintainability and enable ongoing development through a more collaborative structure. Following the prolonged absence of the original developer, Jonathan Duddington, the project saw increased activity from a community of volunteers.[5] The project moved to a dedicated GitHub repository under the espeak-ng organization, where volunteer developers, including native speakers, contribute feedback to refine language rules and pronunciations.[2] This open-source effort has focused on enhancing compatibility and quality through iterative contributions.

Key releases include version 1.50 in December 2019, which introduced support for SSML <phoneme> tags along with nine new languages such as Bashkir.[6] Version 1.51 in April 2022 added features such as new voice variants and a Chromium extension, expanded language coverage with over 20 additions including Belarusian, and improved platform integration, for example on Android.[7] The latest release, 1.52 in December 2024, added a CMake build system and stress marks for improved prosody, along with bug fixes and six new languages such as Tigrinya.[8] Community-driven improvements have emphasized better prosody, through additions like stress marks in phoneme events, and integration with modern toolchains, including the CMake build system that replaces the older autoconf approach. As of 2025, eSpeak NG remains an active project, with over 500 issues resolved on GitHub and support for more than 100 languages and accents.[9][2]

Overview and Features
Core Functionality
eSpeak is a free and open-source, cross-platform software speech synthesizer designed to convert written text into audible speech. It operates primarily through formant synthesis, a method that generates speech by modeling the resonances of the human vocal tract, resulting in a compact engine suitable for resource-constrained devices. The core engine, including the program and data files for multiple languages, occupies less than 2 MB, making it lightweight and efficient for embedded systems or low-power environments.[1]

At its foundation, eSpeak provides a straightforward command-line interface for basic text-to-speech operations, allowing users to speak text directly via commands such as espeak-ng "Hello, world", or to read input from files, standard input, or strings. For more advanced programmatic use, it offers an API through a shared library (or DLL on Windows), enabling developers to integrate speech synthesis into applications for automated reading of text. This API supports embedding eSpeak within software for tasks like screen readers or voice assistants.[1][10]
eSpeak incorporates support for Speech Synthesis Markup Language (SSML), which allows fine-grained control over speech attributes including pitch, speaking rate, and volume through markup tags in the input text. Output can be directed in various formats, such as generating WAV audio files for storage, playing audio directly through the system's sound device, or piping the synthesized speech to other command-line tools for further processing. These capabilities ensure versatility in both standalone and integrated scenarios.[1][2]
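A sketch combining both capabilities (SSML input and capturing the synthesized audio instead of playing it), assuming the synthesis-callback interface declared in speak_lib.h; writing a proper WAV header is omitted, so the output file holds raw 16-bit mono PCM at the rate reported by espeak_Initialize():

```c
#include <stdio.h>
#include <string.h>
#include <espeak-ng/speak_lib.h>

static FILE *out;

/* Called repeatedly with blocks of synthesized audio; return 0 to continue. */
static int on_samples(short *wav, int numsamples, espeak_EVENT *events)
{
    (void)events;
    if (wav != NULL && numsamples > 0)
        fwrite(wav, sizeof(short), (size_t)numsamples, out);
    return 0;
}

int main(void)
{
    const char *ssml =
        "<speak>Normal speed, then <prosody rate=\"slow\">much slower</prosody>.</speak>";
    int sample_rate;

    out = fopen("speech.raw", "wb");
    if (out == NULL)
        return 1;

    /* AUDIO_OUTPUT_SYNCHRONOUS: audio is delivered to the callback, not played. */
    sample_rate = espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, NULL, 0);
    if (sample_rate < 0)
        return 1;
    fprintf(stderr, "raw mono 16-bit PCM at %d Hz\n", sample_rate);

    espeak_SetSynthCallback(on_samples);
    espeak_SetVoiceByName("en");
    espeak_Synth(ssml, strlen(ssml) + 1, 0, POS_CHARACTER, 0,
                 espeakCHARS_AUTO | espeakSSML,   /* espeakSSML: honor the markup */
                 NULL, NULL);
    espeak_Synchronize();
    espeak_Terminate();
    fclose(out);
    return 0;
}
```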
Advantages and Limitations
eSpeak demonstrates significant advantages in portability, supporting a wide range of operating systems including Linux, Windows, Android (version 4.0 and later), BSD, Solaris, and macOS through command-line interfaces, shared libraries, and Windows SAPI5 compatibility.[2] Its formant synthesis method enables rapid processing, achieving practical synthesis speeds of up to 500 words per minute, which facilitates efficient real-time applications.[11] Additionally, eSpeak's compact footprint of just a few megabytes allows it to operate on resource-constrained embedded systems with minimal computational demands.[2] As an open-source project licensed under GPL-3.0 or later, it is freely available and modifiable, promoting widespread adoption and customization.[12]

Despite these strengths, eSpeak's output often sounds robotic and artificial due to its formant-based approach, lacking the naturalness of neural text-to-speech systems such as WaveNet, which generate speech from human recordings or deep learning models.[1][2] It exhibits limited capabilities in emotional intonation and expressive prosody, restricting its suitability for applications requiring nuanced vocal delivery.[13] For non-English languages, synthesis accuracy depends heavily on rule-based phoneme conversion, which can result in approximations or errors in complex phonetics, particularly for languages with intricate orthography or tonal features.[14][15]

In comparisons, eSpeak is more compact than the Festival speech synthesis system, which offers greater expressiveness through diphone or unit-selection methods but at the cost of higher resource requirements.[16][17] However, it falls short in naturalness and emotional range compared to modern AI-driven synthesizers, making it particularly well suited for accessibility tools, such as screen readers, rather than entertainment or high-fidelity audio production.[18] User feedback highlights its clear enunciation even at elevated speeds, though artifacts may arise in handling intricate phonetic sequences.[1] eSpeak NG's extensive multi-language support, covering over 100 languages and accents, further contributes to its versatility in diverse applications.[2]

Synthesis Method
Text-to-Phoneme Conversion
eSpeak's text-to-phoneme conversion serves as the initial stage in its speech synthesis pipeline, transforming input text into a sequence of phonetic symbols that represent pronunciation, including markers for stress and timing. This process relies on linguistic preprocessing to standardize the text and on rule-based grapheme-to-phoneme (G2P) translation to map orthographic representations to phonemes, ensuring compatibility across supported languages.[1][19]

Preprocessing begins with text normalization, which handles elements such as abbreviations, numbers, and punctuation to convert them into a form suitable for phonemization. For instance, numbers are processed via the TranslateNumber() function, which constructs spoken forms from fragments based on language-specific options like langopts.numbers.[19] Abbreviations and punctuation are addressed through replacement rules in the language's _rules file, such as the .replace section that standardizes characters (e.g., replacing special letters like ô or ő in certain languages). Tokenization occurs implicitly by breaking the text into words and applying rules or dictionary lookups word by word, preparing sequences for G2P matching.[19][15]
The core of the G2P conversion uses a rule-based system combined with pronunciation dictionaries for efficiency and accuracy. Rules, defined in files like en_rules for English, employ regex-like patterns to match letter sequences with context: <pre><match><post> <phonemes>, where <pre> and <post> provide surrounding context, and <match> identifies the grapheme to replace with phonemes. These rules are scored and prioritized, with the best match selected for conversion; for example, a rule might transform "b oo k" into the phoneme [U] for the "oo" in "book". Dictionaries, stored in compiled files like en_dict, supplement rules with explicit entries for common or irregular words. The English dictionary, for instance, includes approximately 5,500 entries in its en_list file for precise pronunciations, such as "book bUk".[19][20][19] For unknown words, the system falls back to algorithmic rules after checking for standard prefixes or suffixes, ensuring broad coverage.[19]
Language-specific rulesets are implemented through dedicated phoneme files (e.g., ph_english) and rule files, inheriting a base set of phonemes while adding custom vowels, consonants, and translation logic tailored to the language's orthography. Phonemes are represented using 1- to 4-character mnemonics based on the Kirshenbaum ASCII IPA scheme, allowing compact yet precise notation.[15][21][22]
Ambiguities in pronunciation, such as stress placement and syllable boundaries, are resolved using prosodic markers embedded in the phoneme output. Stress is assigned via symbols like $1 for primary stress on the first syllable or $u for unstressed syllables, determined by dictionary entries or rule-based heuristics that analyze word structure. Syllable boundaries are implied through these markers and control phonemes from the base phoneme table, guiding the prosody for natural rhythm.[19][15]
The output of this stage is a stream of phonemes annotated with stress indicators and hints for duration and pitch, which feeds directly into subsequent synthesis processes to generate speech with appropriate intonation.[1]
Formant Synthesis and Prosody
eSpeak utilizes formant synthesis to produce speech audio from phoneme sequences, modeling the human vocal tract through time-varying sine waves that represent the primary formants, typically the first three (F1 for vowel openness, F2 for front-back position, and F3 for additional spectral shaping), while incorporating noise sources to simulate fricatives and other unvoiced sounds.[1] This approach enables compact representation of multiple languages, as it relies on algorithmic generation rather than stored waveforms, resulting in clear output suitable for high-speed synthesis up to 500 words per minute.[1] The core waveform creation follows a Klatt-style synthesizer, employing a combination of cascade and parallel digital filters to shape an excitation signal (periodic pulses for voiced phonemes and random noise for unvoiced ones) into resonant formants that mimic natural speech spectra.[23]

Prosody in eSpeak is generated rule-based, applying intonation "tunes" to clauses determined by punctuation, such as a rising pitch contour for questions to convey interrogative intent or falling contours for statements.[24] These contours are structured into components like the prehead (rising to the first stress), head (stressed syllables with modulated envelope), nucleus (peak stress with final pitch movement), and tail (declining unstressed endings), achieved by adjusting pitch envelopes on vowels within phoneme lists.[24] Rhythm and emphasis are handled through duration scaling, where stressed syllables receive longer lengths than unstressed ones based on linguistic rules for syllable stress, influencing overall speech timing without altering the fundamental phoneme identities.[25][26] For example, emphasis on a word increases its phoneme durations to highlight prosodic prominence.[26]

A key aspect of prosodic variation involves pitch modulation for smooth intonation transitions. This model allows dynamic adjustment of fundamental frequency (F0) across utterances, with voice traits customizable via parameters such as base pitch (scaled 0-99, corresponding to approximately 100-300 Hz for typical male-to-female ranges), speaking speed (80-500 words per minute), and amplitude for volume control (0-200, default 100).[24][11][27] These settings enable users to tailor the synthetic voice for clarity or expressiveness while maintaining the synthesizer's efficiency.[1]
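These voice parameters can also be set programmatically. A minimal sketch using espeak_SetParameter(), with value ranges taken from the description above:

```c
#include <string.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    const char *text = "The same sentence, spoken faster, higher and a little louder.";

    if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
        return 1;
    espeak_SetVoiceByName("en");

    espeak_SetParameter(espeakRATE, 250, 0);    /* speaking speed in words per minute */
    espeak_SetParameter(espeakPITCH, 70, 0);    /* base pitch, 0-99 (default 50) */
    espeak_SetParameter(espeakRANGE, 80, 0);    /* pitch range used for intonation */
    espeak_SetParameter(espeakVOLUME, 120, 0);  /* amplitude, 0-200 (default 100) */

    espeak_Synth(text, strlen(text) + 1, 0, POS_CHARACTER, 0,
                 espeakCHARS_AUTO, NULL, NULL);
    espeak_Synchronize();
    espeak_Terminate();
    return 0;
}
```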
Language Support

Coverage and Accents
eSpeak NG provides support for over 100 languages and accents.[2] This extensive coverage includes major world languages such as English (with variants including American en-us, British en, Caribbean en-029, and Scottish en-gb-scotland), Mandarin (cmn), Spanish (Spain es and Latin American es-419), French (France fr, Belgium fr-be, and Switzerland fr-ch), Arabic (ar), and Hindi (hi).[28]
Regional accents and voices are achieved through customized phoneme mappings and prosody adjustments tailored to specific dialects, enabling variations like Brazilian Portuguese (pt-br) alongside standard European Portuguese (pt).[28] The rule-based synthesis method facilitates this multi-language support in a compact form.[2]
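A short sketch of selecting several of these accent codes from the C API; the codes are the ones mentioned in this section, but whether each is available depends on the installed voice data:

```c
#include <string.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    const char *codes[] = { "en-us", "en-gb-scotland", "pt-br", "fr-be" };
    const char *sample  = "One, two, three.";

    if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
        return 1;

    for (unsigned i = 0; i < sizeof(codes) / sizeof(codes[0]); i++) {
        if (espeak_SetVoiceByName(codes[i]) != EE_OK)
            continue;                            /* skip codes missing from this install */
        espeak_Synth(sample, strlen(sample) + 1, 0, POS_CHARACTER, 0,
                     espeakCHARS_AUTO, NULL, NULL);
        espeak_Synchronize();
    }
    espeak_Terminate();
    return 0;
}
```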
Language data primarily derives from community-contributed rules developed by native speakers and contributors, which has enabled broad but uneven coverage across global linguistic families.[2] Speech quality varies across languages, depending on the maturity of the rules and dictionaries, with support for languages like Swahili and Zulu.[2][29]
eSpeak NG continues to expand through ongoing community efforts, including enhancements to underrepresented languages such as additional African dialects in recent releases.[2]
Customization and Extension
eSpeak NG allows users and developers to modify existing voices and add support for new languages or accents by editing plaintext data files, enabling customization without altering the core source code.[15] Voices can be adjusted by editing phoneme definitions in files located in the phsource/phonemes directory, such as inheriting from a base table and specifying custom sounds for vowels, consonants, and stress patterns.[30] Dictionaries, which handle text-to-phoneme (G2P) conversion, are modified in the dictsource/ directory through rule files (e.g., lang_rules) for general pronunciation patterns and exception lists (e.g., lang_list) for irregular words, allowing fine-tuning of accents like regional variations in English or French.[15]
To add a new language, contributors create a voice file in espeak-ng-data/voices/ or espeak-ng-data/lang/ defining parameters such as pitch, speed, and prosody rules, alongside new phoneme and dictionary files tailored to the language's orthography and phonology.[30] For tone languages or those with unique prosody, additional rules in the voice file adjust intonation and rhythm.[15] The espeak-ng --compile=lang command compiles these changes into usable formats like phontab for phonemes and lang_dict for dictionaries; the functionality of the older espeakedit utility has been integrated into the espeak-ng program itself for command-line access.[2] Integration with external dictionaries is possible by referencing custom rule sets during compilation.[30]
Best practices for customization emphasize starting with a rough prototype based on similar existing languages, followed by iterative refinement through native speaker feedback to ensure natural pronunciation and intonation.[15] Testing involves running synthesized audio against sample texts using tools like make check or manual playback, with new unit tests added to the tests/ directory to verify stability across updates.[30] Contributions, including modified voices or new languages, are submitted via pull requests to the eSpeak NG GitHub repository, where maintainers review and integrate them into official releases.[2]
Community efforts have extended eSpeak NG to constructed languages like Esperanto through custom phoneme tables and rules capturing its phonetic regularity.[2] Similarly, user-contributed improvements for Welsh include refined G2P rules for its mutations and vowel harmony, while ongoing work on Vietnamese has enhanced tone rendering via updated prosody parameters.[2] These extensions demonstrate how the modular file structure facilitates broad participation in expanding eSpeak NG's capabilities.[15]