Time delay neural network

A time delay neural network (TDNN)^[1] is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network. It is essentially a one-dimensional convolutional neural network (CNN).

Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern (such as speech), the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.

For contextual modelling in a TDNN, each neural unit at each layer receives input not only from the activations/features of the layer below at a single position, but from a pattern of unit outputs together with their context. For time signals, each unit receives as input the activation patterns over time from the units below. Applied to two-dimensional classification (images, time-frequency patterns), the TDNN can be trained with shift-invariance in the coordinate space and avoids precise segmentation in the coordinate space.

History

The TDNN was introduced in the late 1980s and applied to phoneme classification for automatic speech recognition, where the automatic determination of precise segments or feature boundaries in the speech signal was difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features independent of position in time, it improved performance over static classification.^[1]^[2] It was also applied to two-dimensional signals (time-frequency patterns in speech^[3] and coordinate-space patterns in OCR^[4]).

Kunihiko Fukushima published the neocognitron in 1980.^[5] Max pooling appears in a 1982 publication on the neocognitron^[6] and in the 1989 publication on LeNet-5.^[7]

In 1990, Yamaguchi et al. used max pooling in TDNNs to realize a speaker-independent isolated word recognition system.^[8]

Overview

Architecture

In modern terms, the TDNN is a 1D convolutional neural network in which the convolution runs along the time dimension. The original design has exactly 3 layers.

The input to the network is a continuous speech signal, preprocessed into a 2D array (a mel scale spectrogram). One dimension is time, at 10 ms per frame, and the other is frequency. The time dimension can be arbitrarily long, but the frequency dimension has only 16 coefficients. The original experiment considered only very short speech signals pronouncing single words like "baa", "daa", "gaa", so each input was only 15 frames long (150 ms).

In detail, they processed a voice signal as follows:

Input speech is sampled at 12 kHz and Hamming-windowed.
Its FFT is computed every 5 ms.
The mel scale coefficients are computed from the power spectrum by taking log energies in each mel scale energy band.
Adjacent coefficients in time are smoothed, resulting in one frame every 10 ms.
For each signal, a human manually detects the onset of the vowel, and the entire speech signal is cut off except for 7 frames before and 7 frames after, leaving 15 frames in total, centered at the onset of the vowel.
The coefficients are normalized by subtracting the mean and then scaling, so that the signals fall between -1 and +1.
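
The last three steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original code: it assumes that smoothing means averaging adjacent pairs of 5 ms frames and that scaling means dividing by the largest absolute value. The function names and the random stand-in data are hypothetical.

```python
import numpy as np

def smooth_to_10ms(coeffs_5ms):
    """Average adjacent pairs of 5 ms frames (assumed smoothing) into 10 ms frames.

    coeffs_5ms has shape (n_mel, n_frames_5ms).
    """
    n_mel, n = coeffs_5ms.shape
    n_even = n - (n % 2)                      # drop a trailing odd frame
    pairs = coeffs_5ms[:, :n_even].reshape(n_mel, n_even // 2, 2)
    return pairs.mean(axis=2)

def cut_around_onset(frames, onset, before=7, after=7):
    """Keep `before` frames before and `after` frames after the labelled onset."""
    if onset - before < 0 or onset + after >= frames.shape[1]:
        raise ValueError("not enough context around the onset")
    return frames[:, onset - before : onset + after + 1]

def normalize(token):
    """Subtract the mean, then scale into [-1, +1] (assumed: divide by max |x|)."""
    centered = token - token.mean()
    peak = np.abs(centered).max()
    return centered / peak if peak > 0 else centered

rng = np.random.default_rng(0)
mel_5ms = rng.normal(size=(16, 80))           # stand-in for log mel energies
frames = smooth_to_10ms(mel_5ms)              # shape (16, 40)
token = normalize(cut_around_onset(frames, onset=20))
print(token.shape)                            # (16, 15)
```

The resulting 16 by 15 array has the shape of one network input as described above.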

The first layer of the TDNN is a 1D convolutional layer. The layer contains 8 kernels of shape $3\times 16$ . It outputs a tensor of shape $8\times 13$ .

The second layer of the TDNN is a 1D convolutional layer. The layer contains 3 kernels of shape $5\times 8$ . It outputs a tensor of shape $3\times 9$ .
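
The output lengths of the two convolutional layers follow from the usual rule for a convolution with stride 1 and no padding (both inferred here from the stated shapes): an input of $T$ frames and a kernel spanning $k$ frames give $T-k+1$ output frames.

$$T_1 = 15 - 3 + 1 = 13, \qquad T_2 = 13 - 5 + 1 = 9.$$

Ignoring any bias terms, the first layer therefore has $8\times 3\times 16 = 384$ shared weights and the second has $3\times 5\times 8 = 120$.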

The third layer of the TDNN is not a convolutional layer. Instead, it is simply a fixed layer with 3 neurons. Let the output from the second layer be $x_{i,j}$ where $i\in 1:3$ and $j\in 1:9$ . The $i$ -th neuron in the third layer computes $\sigma (\sum _{j\in 1:9}x_{i,j})$ , where $\sigma$ is the sigmoid function. Essentially, it can be thought of as a convolution layer with 3 kernels of shape $1\times 9$ .
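
The three layers can be traced end to end with a small NumPy sketch. It uses random weights and is only meant to confirm the shapes given above. The sigmoid applied after the two convolutional layers, the zero biases, and the helper name `conv1d_time` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_time(x, kernels, bias):
    """Valid 1D convolution over time.

    x:       (channels_in, T)
    kernels: (channels_out, channels_in, k)
    bias:    (channels_out,)
    returns  (channels_out, T - k + 1)
    """
    c_out, c_in, k = kernels.shape
    t_out = x.shape[1] - k + 1
    out = np.empty((c_out, t_out))
    for t in range(t_out):
        window = x[:, t : t + k]                         # same weights at every t
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(16, 15))                    # 16 mel bands, 15 frames

w1, b1 = rng.normal(scale=0.1, size=(8, 16, 3)), np.zeros(8)
w2, b2 = rng.normal(scale=0.1, size=(3, 8, 5)), np.zeros(3)

h1 = sigmoid(conv1d_time(x, w1, b1))                     # (8, 13)
h2 = sigmoid(conv1d_time(h1, w2, b2))                    # (3, 9)
y = sigmoid(h2.sum(axis=1))                              # (3,): one score per class
print(h1.shape, h2.shape, y.shape)
```

Because the same kernels are applied at every time step, shifting a feature within the input shifts the corresponding activations in `h1` and `h2`, and the final sum over time discards the position.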

It was trained on about 800 samples for 20,000 to 50,000 backpropagation steps. Each step was computed in a batch over the entire training dataset, i.e. it was not stochastic. Training required the use of an Alliant supercomputer with 4 processors.

Example

In the case of a speech signal, inputs are spectral coefficients over time.

In order to learn critical acoustic-phonetic features (for example formant transitions, bursts, and frication) without first requiring precise localization, the TDNN is trained time-shift-invariantly. Time-shift invariance is achieved through weight sharing across time during training: time-shifted copies of the TDNN are made over the input range (from left to right in Fig. 1). Backpropagation is then performed from an overall classification target vector (in the TDNN diagram, three phoneme class targets, /b/, /d/ and /g/, are shown in the output layer), resulting in gradients that will generally differ for each of the time-shifted network copies. Since the time-shifted networks are only copies, however, the position dependence is removed by weight sharing. In this example, this is done by averaging the gradients from each time-shifted copy before performing the weight update.

In speech, time-shift-invariant training was shown to learn weight matrices that are independent of the precise positioning of the input. The weight matrices could also be shown to detect acoustic-phonetic features that are known to be important for human speech perception, such as formant transitions and bursts.^[1] TDNNs could also be combined or grown by way of pre-training.^[9]
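
The gradient averaging described above can be illustrated on a toy case: one shared kernel applied at 13 time shifts, a linear output, and a squared-error loss. All of these choices, along with the learning rate and the variable names, are assumptions for the sketch and are not taken from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 15))             # input token: 16 coefficients, 15 frames
w = rng.normal(scale=0.1, size=(16, 3))   # one shared kernel spanning 3 frames
target = 1.0

def forward(w):
    # One "copy" of the unit per time shift, all using the same weights w.
    acts = np.array([np.sum(w * x[:, t:t + 3]) for t in range(13)])
    return acts.sum()

def loss(w):
    return 0.5 * (forward(w) - target) ** 2

# Gradient for each time-shifted copy, computed as if the copies had
# independent weights.
err = forward(w) - target
per_copy_grads = np.stack([err * x[:, t:t + 3] for t in range(13)])  # (13, 16, 3)

# Tying the weights: average the per-copy gradients into a single update.
shared_grad = per_copy_grads.mean(axis=0)
w_new = w - 0.01 * shared_grad
```

The averaged gradient equals the exact gradient with respect to the shared weights divided by the number of copies, so averaging and summing differ only by a constant factor that can be absorbed into the learning rate.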

Implementation

The precise architecture of a TDNN (time delays, number of layers) is mostly determined by the designer, depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specifically for each application. Work has also been done to create adaptable time-delay TDNNs^[10] in which this manual tuning is eliminated.
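
When choosing context sizes, it helps to know how many input frames a unit in a given layer can see. The following sketch computes this for stacked layers, assuming stride 1 and contiguous windows; the function name is hypothetical.

```python
def receptive_field(kernel_widths):
    """Frames of input seen by one unit after stacking stride-1 layers."""
    field = 1
    for k in kernel_widths:
        field += k - 1
    return field

print(receptive_field([3]))        # 3
print(receptive_field([3, 5]))     # 7
print(receptive_field([3, 5, 9]))  # 15
```

With the window widths of the original three-layer design (3, 5, and a final integration over 9 positions), the output units cover all 15 input frames.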

State of the art

TDNN-based phoneme recognizers compared favourably in early comparisons with HMM-based phone models.^[1]^[9] Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to 50% word error reduction over GMM-based acoustic models.^[11]^[12] While the different layers of TDNNs are intended to learn features of increasing context width, they still model only local contexts. When longer-distance relationships and pattern sequences have to be processed, learning states and state sequences is important, and TDNNs can be combined with other modelling techniques.^[13]^[3]^[4] TDNN architectures have also been adapted to spiking neural networks, leading to state-of-the-art results while lending themselves to energy-efficient hardware implementations.^[14]
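
One way to picture sub-sampled connections is as a layer that reads its input at a sparse set of time offsets instead of a contiguous window. The sketch below stacks the selected frames; a learned affine transform would then be applied to each stacked column. The offsets and the helper name `splice` are illustrative assumptions, not values from a specific published system.

```python
import numpy as np

def splice(x, offsets):
    """Stack frames of x (features, T) taken at the given time offsets.

    Returns an array of shape (features * len(offsets), T_valid), keeping only
    the time steps for which every offset falls inside the input.
    """
    t = x.shape[1]
    left = max(0, -min(offsets))
    right = max(0, max(offsets))
    parts = [x[:, left + o : t - right + o] for o in offsets]
    return np.concatenate(parts, axis=0)

x = np.arange(2 * 10, dtype=float).reshape(2, 10)   # 2 features, 10 frames
dense = splice(x, offsets=(-1, 0, 1))               # contiguous 3-frame context
sparse = splice(x, offsets=(-3, 3))                 # 7-frame span, 2 frames used
print(dense.shape, sparse.shape)                    # (6, 8) (4, 4)
```

The sparse variant spans a wider context than the dense one while feeding fewer values into the next layer.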

Applications

Speech recognition

TDNNs for speech recognition were introduced in 1989^[2] and initially focused on shift-invariant phoneme recognition. Speech lends itself well to TDNNs, as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible. By scanning a sound over past and future, the TDNN is able to construct a model for the key elements of that sound in a time-shift-invariant manner. This is particularly useful as sounds are smeared out through reverberation.^[11]^[12] Large phonetic TDNNs can be constructed modularly through pre-training and combining smaller networks.^[9]

Large vocabulary speech recognition

Large vocabulary speech recognition requires recognizing sequences of phonemes that make up words, subject to the constraints of a large pronunciation vocabulary. TDNNs can be integrated into large vocabulary speech recognizers by introducing state transitions and search between the phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network (MS-TDNN) can be trained discriminatively from the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification.^[13]^[15]^[4]
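
The idea of searching over state transitions between phonemes can be sketched with a generic dynamic-programming alignment. This is not the published MS-TDNN formulation; it only assumes that a network supplies per-frame phoneme log scores and that a word is a left-to-right sequence of phonemes, each of which may span several frames. The function name and the score values are hypothetical.

```python
import numpy as np

def word_score(frame_scores, phoneme_ids):
    """Best left-to-right alignment score of a phoneme sequence to frames.

    frame_scores: (n_phonemes, T) per-frame phoneme log scores
    phoneme_ids:  phoneme indices spelling the word, in order
    """
    n_states, t_len = len(phoneme_ids), frame_scores.shape[1]
    best = np.full((n_states, t_len), -np.inf)
    best[0, 0] = frame_scores[phoneme_ids[0], 0]
    for t in range(1, t_len):
        for s in range(n_states):
            stay = best[s, t - 1]                          # remain in the phoneme
            advance = best[s - 1, t - 1] if s > 0 else -np.inf  # move to the next
            best[s, t] = max(stay, advance) + frame_scores[phoneme_ids[s], t]
    return best[-1, -1]

# Hypothetical per-frame scores for three phonemes (/b/, /d/, /g/) over 4 frames.
scores = np.log(np.array([
    [0.8, 0.7, 0.1, 0.1],
    [0.1, 0.2, 0.8, 0.7],
    [0.1, 0.1, 0.1, 0.2],
]))
b_then_d = word_score(scores, [0, 1])
d_then_b = word_score(scores, [1, 0])
print(b_then_d > d_then_b)   # True
```

Scoring at the word level in this way is what allows an error signal defined on whole words to be passed back to the frame-level phoneme scores.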

Speaker independence

Two-dimensional variants of the TDNNs were proposed for speaker independence.^[3] Here, shift-invariance is applied to the time as well as to the frequency axis in order to learn hidden features that are independent of precise location in time and in frequency (the latter being due to speaker variability).

Reverberation

One of the persistent problems in speech recognition is recognizing speech when it is corrupted by echo and reverberation (as is the case in large rooms and with distant microphones). Reverberation can be viewed as corrupting speech with delayed versions of itself. In general, however, it is difficult to de-reverberate a signal, because the impulse response function (and thus the convolutional noise experienced by the signal) is not known for an arbitrary space. The TDNN was shown to recognize speech robustly despite different levels of reverberation.^[11]^[12]
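
The view of reverberation as a mixture of delayed copies can be made concrete with a short NumPy sketch. The impulse response below is a hypothetical three-tap example (a direct path plus two echoes), and the random signal only stands in for dry speech.

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.normal(size=1000)                 # stand-in for a dry speech signal

# Hypothetical room impulse response: direct path plus two decaying echoes.
impulse_response = np.zeros(300)
impulse_response[0] = 1.0
impulse_response[120] = 0.5
impulse_response[299] = 0.25

reverberant = np.convolve(speech, impulse_response)   # length 1000 + 300 - 1

# The same signal written explicitly as a sum of delayed, scaled copies.
manual = np.zeros(len(speech) + len(impulse_response) - 1)
for delay, gain in ((0, 1.0), (120, 0.5), (299, 0.25)):
    manual[delay : delay + len(speech)] += gain * speech
```

Both constructions give the same signal. In a real room the impulse response is unknown, which is why undoing this convolution is difficult.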

Lip-reading – audio-visual speech

TDNNs were also successfully used in early demonstrations of audio-visual speech, where the sounds of speech are complemented by visually reading lip movement.^[15] Here, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused nicely in a neural net.

Handwriting recognition

TDNNs have been used effectively in compact and high-performance handwriting recognition systems.^[16] Shift-invariance was also adapted to spatial patterns (x/y axes) in offline, image-based handwriting recognition.^[4]

Video analysis

Video has a temporal dimension, which makes the TDNN well suited to analysing motion patterns. One example of this analysis is a combination of vehicle detection and pedestrian recognition.^[17] When examining videos, successive frames are fed into the TDNN as input. The strength of the TDNN comes from its ability to examine objects shifted forward and backward in time, so that an object remains detectable as time changes. If an object can be recognized in this manner, an application can anticipate where that object will be found in the future and act accordingly.

Image recognition

Two-dimensional TDNNs were later applied to other image-recognition tasks under the name of "Convolutional Neural Networks", where shift-invariant training is applied to the x/y axes of an image.

Common libraries

TDNNs can be implemented in virtually all machine-learning frameworks using one-dimensional convolutional neural networks, because the two methods are equivalent.
Matlab: The neural network toolbox has explicit functionality for producing a time delay neural network, given the step size of the time delays and an optional training function. The default training algorithm is a supervised back-propagation algorithm that updates filter weights using Levenberg-Marquardt optimization. The function is timedelaynet(delays, hidden_layers, train_fnc) and returns a time-delay neural network architecture that a user can train and provide inputs to.^[18]
Kaldi: The Kaldi ASR toolkit has an implementation of TDNNs with several optimizations for speech recognition.^[19]

References

[1] Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. (1989). "Phoneme recognition using time-delay neural networks" (PDF). IEEE Transactions on Acoustics, Speech, and Signal Processing. 37 (3): 328–339. doi:10.1109/29.21701.
[2] Alexander Waibel, Phoneme Recognition Using Time-Delay Neural Networks, Procedures of the Institute of Electrical, Information and Communication Engineers (IEICE), December, 1987, Tokyo, Japan.
[3] John B. Hampshire; Alex Waibel. "Connectionist Architectures for Multi-Speaker Phoneme Recognition". Advances in Neural Information Processing Systems. 2: 203–210.
[4] Jaeger, S.; Manke, S.; Reichert, J.; Waibel, A. (2001). "Online handwriting recognition: The NPen++ recognizer". International Journal on Document Analysis and Recognition. 3 (3): 169–180. doi:10.1007/PL00013559.
[5] Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Archived (PDF) from the original on 3 June 2014. Retrieved 16 November 2013.
[6] Fukushima, Kunihiko; Miyake, Sei (1982-01-01). "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position". Pattern Recognition. 15 (6): 455–469. Bibcode:1982PatRe..15..455F. doi:10.1016/0031-3203(82)90024-3. ISSN 0031-3203.
[7] LeCun, Yann; Boser, Bernhard; Denker, John; Henderson, Donnie; Howard, R.; Hubbard, Wayne; Jackel, Lawrence (1989). "Handwritten Digit Recognition with a Back-Propagation Network". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
[8] Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
[9] Alexander Waibel, Hidefumi Sawai, Kiyohiro Shikano, Modularity and Scaling in Large Phonemic Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1989.
[10] Wöhler, C.; Anlauf, J.K. (1999). "An adaptable time-delay neural-network algorithm for image sequence analysis". IEEE Transactions on Neural Networks. 10 (6): 1531–1536. doi:10.1109/72.809100. PMID 18252656. S2CID 16813677.
[11] Peddinti, Vijayaditya; Povey, Daniel; Khudanpur, Sanjeev (2015). "A time delay neural network architecture for efficient modeling of long temporal contexts". Interspeech 2015. pp. 3214–3218. doi:10.21437/Interspeech.2015-647. S2CID 8536162.
[12] David Snyder, Daniel Garcia-Romero, Daniel Povey, A Time-Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition, Proceedings of ASRU 2015.
[13] Haffner, Patrick; Waibel, Alex (1991). "Multi-State Time Delay Networks for Continuous Speech Recognition". proceedings.neurips.cc. 4. NIPS: 135–142.
[14] D’Agostino, Simone; Moro, Filippo; Torchet, Tristan; Demirağ, Yiğit; Grenouillet, Laurent; Castellani, Niccolò; Indiveri, Giacomo; Vianello, Elisa; Payvand, Melika (2024-04-24). "DenRAM: neuromorphic dendritic architecture with RRAM for efficient temporal processing with delays". Nature Communications. 15 (1): 3446. doi:10.1038/s41467-024-47764-w. ISSN 2041-1723. PMC 11043378.
[15] Bregler, C.; Hild, H.; Manke, S.; Waibel, A. (1993). "Improving connected letter recognition by lipreading". IEEE International Conference on Acoustics Speech and Signal Processing. Vol. 1. pp. 557–560. doi:10.1109/ICASSP.1993.319179. ISBN 0-7803-0946-4.
[16] Guyon, I.; Albrecht, P.; Le Cun, Y.; Denker, J.; Hubbard, W. (1991-01-01). "Design of a neural network character recognizer for a touch terminal". Pattern Recognition. 24 (2): 105–119. doi:10.1016/0031-3203(91)90081-F. ISSN 0031-3203.
[17] Wöhler, C.; Anlauf, J. K. (2001). "Real-time object recognition on image sequences with the adaptable time delay neural network algorithm — applications for autonomous vehicles". Image and Vision Computing. 19 (9–10): 593–618. doi:10.1016/S0262-8856(01)00040-3.
[18] "Time Series and Dynamic Systems - MATLAB & Simulink". mathworks.com. Retrieved 21 June 2016.
[19] Peddinti, Vijayaditya; Chen, Guoguo; Manohar, Vimal; Ko, Tom; Povey, Daniel; Khudanpur, Sanjeev (2015). "JHU ASpIRE system: Robust LVCSR with TDNNS, iVector adaptation and RNN-LMS" (PDF). 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). pp. 539–546. doi:10.1109/ASRU.2015.7404842. ISBN 978-1-4799-7291-3.
