Connectionist temporal classification
Connectionist temporal classification (CTC) is a type of neural network output and associated scoring function for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable. It can be used for tasks like on-line handwriting recognition[1] or recognizing phonemes in speech audio. CTC refers to the outputs and scoring, and is independent of the underlying neural network structure. It was introduced in 2006.[2]
The input is a sequence of observations, and the output is a sequence of labels, which can include blank outputs. The difficulty of training comes from there being many more observations than labels: in speech audio, for instance, multiple time slices can correspond to a single phoneme. Since the alignment of the observed sequence with the target labels is unknown, the network predicts a probability distribution over labels at each time step.[3] A CTC network has a continuous output (e.g. softmax), which is fitted through training to model the probability of a label. CTC does not attempt to learn boundaries and timings: label sequences are considered equivalent if they differ only in alignment, ignoring blanks. Equivalent label sequences can arise in many ways, which makes scoring non-trivial, but there is an efficient forward–backward algorithm for it.
CTC scores can then be used with the back-propagation algorithm to update the neural network weights.
Alternative approaches to a CTC-fitted neural network include a hidden Markov model (HMM).
In 2009, a Connectionist Temporal Classification (CTC)-trained LSTM network was the first RNN to win pattern recognition contests when it won several competitions in connected handwriting recognition.[4][5]
In 2014, the Chinese company Baidu used a bidirectional RNN (not an LSTM) trained with the CTC loss function to break the Switchboard Hub5'00 speech recognition benchmark[6] without using any traditional speech processing methods.[7]
In 2015, it was used in Google voice search and dictation on Android devices.[8]
CTC is limited to monotonic alignments. This is not a problem for speech recognition, but it can be a problem for machine translation, where a later word in language A may correspond to an earlier word in language B, since word order differs between languages.[3]
References
- ^ Liwicki, Marcus; Graves, Alex; Bunke, Horst; Schmidhuber, Jürgen (2007). "A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks". Proceedings of the 9th International Conference on Document Analysis and Recognition, ICDAR 2007. CiteSeerX 10.1.1.139.5852.
- ^ Graves, Alex; Fernández, Santiago; Gomez, Faustino; Schmidhuber, Juergen (2006). "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks". Proceedings of the International Conference on Machine Learning, ICML 2006: 369–376. CiteSeerX 10.1.1.75.6306.
- ^ Hannun, Awni (27 November 2017). "Sequence Modeling with CTC". Distill. 2 (11). arXiv:1508.01211. doi:10.23915/distill.00008. ISSN 2476-0757.
- ^ Schmidhuber, Jürgen (January 2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
- ^ Graves, Alex; Schmidhuber, Jürgen (2009). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". In Koller, D.; Schuurmans, D.; Bengio, Y.; Bottou, L. (eds.). Advances in Neural Information Processing Systems. Vol. 21. Neural Information Processing Systems (NIPS) Foundation. pp. 545–552.
- ^ "2000 HUB5 English Evaluation Speech - Linguistic Data Consortium". catalog.ldc.upenn.edu.
- ^ Hannun, Awni; Case, Carl; Casper, Jared; Catanzaro, Bryan; Diamos, Greg; Elsen, Erich; Prenger, Ryan; Satheesh, Sanjeev; Sengupta, Shubho (17 December 2014). "Deep Speech: Scaling up end-to-end speech recognition". arXiv:1412.5567 [cs.CL].
- ^ Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 2015). "Google voice search: faster and more accurate".
External links
- Section 16.4, "CTC", in Jurafsky and Martin's Speech and Language Processing, 3rd edition
- Hannun, Awni (27 November 2017). "Sequence Modeling with CTC". Distill. 2 (11): e8. doi:10.23915/distill.00008. ISSN 2476-0757.
Connectionist temporal classification
Overview
Definition and Purpose
Connectionist Temporal Classification (CTC) is an output layer and associated loss function for recurrent neural networks (RNNs) that enables direct training on unsegmented input-output sequence pairs, allowing the network to generate label sequences without explicit alignment between inputs and outputs.[1] This approach interprets the RNN's sequential outputs as a probability distribution over possible labelings, overcoming the constraint in standard RNN training where outputs are treated as independent classifications at each time step.[1]

The core purpose of CTC lies in handling sequence-to-sequence tasks where input and output lengths differ and no pre-alignment is feasible, such as mapping variable-length audio frames to corresponding character labels.[1] It supports end-to-end learning by eliminating the need for preprocessing steps like forced alignment or segmentation, which are often required in conventional models and can introduce errors or biases.[1]

At a conceptual level, CTC achieves this by marginalizing over all possible alignments of the target output sequence with the input, computing the total probability as the sum of probabilities for all valid paths that map to the desired labeling; a special "blank" token is incorporated to handle repetitions and gaps in alignments, allowing multiple input frames to correspond to a single output label.[1] This summation is computed efficiently using the forward-backward algorithm, enabling scalable training on raw, unaligned data.[1]

CTC was developed to address key limitations in traditional sequence models, such as hidden Markov models (HMMs), which demand substantial task-specific knowledge for state definitions and impose explicit independence assumptions that may not align with real-world dependencies.[1] By training discriminatively without such priors, CTC facilitates more flexible and data-driven modeling of temporal sequences.[1]
Key Advantages
One of the primary advantages of Connectionist Temporal Classification (CTC) is its elimination of the need for explicit forced alignment during training, while implicitly assuming monotonic alignments, which significantly reduces preprocessing effort compared to traditional methods like hidden Markov models (HMMs) that require explicit segmentation of input sequences.[1] This approach allows recurrent neural networks (RNNs) to label unsegmented sequences directly, avoiding the labor-intensive step of aligning input features to output labels and thereby streamlining the training pipeline for sequence-to-sequence tasks such as speech recognition.[1]

CTC further enables end-to-end differentiability, permitting direct optimization of deep neural networks via backpropagation without intermediate non-differentiable components like separate alignment modules.[1] By integrating sequence modeling and labeling into a single architecture, CTC facilitates gradient-based learning across the entire model, enhancing the ability to capture complex temporal dependencies in data like audio or handwriting.[1]

Additionally, CTC handles variable-length input and output sequences without requiring any explicit alignment step during training, as the method implicitly marginalizes over all possible alignments through its probabilistic formulation.[1] This flexibility makes it particularly suitable for real-world applications where sequence lengths vary, such as automatic speech recognition, without the need for custom alignment heuristics.

Empirically, CTC has demonstrated substantial performance gains over HMM-RNN hybrids in early benchmarks. For instance, on the TIMIT phoneme recognition task, a bidirectional LSTM trained with CTC achieved a label error rate of 30.51%, roughly a 20% relative reduction compared to a context-independent HMM baseline of 38.85%.[1] These improvements highlight CTC's efficacy in reducing error rates by better modeling temporal variations and inter-label dependencies without task-specific assumptions.[1]
Historical Development
Origins and Original Paper
Connectionist temporal classification (CTC) was developed by Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber at the Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA) in Switzerland in 2006.[1] The method was introduced in the paper "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks", presented at the 23rd International Conference on Machine Learning (ICML) in Pittsburgh, Pennsylvania.[1][6]

The primary motivation stemmed from the challenges of training recurrent neural networks (RNNs) on unsegmented sequence data, such as in speech or handwriting recognition, where traditional approaches required pre-segmentation of inputs and post-processing of outputs, often relying on hybrid models like hidden Markov model (HMM)-RNN combinations that incorporated task-specific assumptions.[1] CTC was proposed as a purely connectionist solution enabling direct labeling of unsegmented sequences with RNNs, eliminating the need for such alignments and hybrids while leveraging the discriminative power and noise robustness of neural networks.[1]

Initial experiments demonstrated CTC's efficacy on the TIMIT speech corpus, a standard benchmark for phonetic recognition consisting of 4,620 training utterances and 1,680 test utterances with manually segmented phoneme transcripts.[1] Using a bidirectional long short-term memory (LSTM) network trained with CTC, the approach achieved a label error rate (LER) of 30.51% with prefix search decoding, outperforming a baseline context-dependent HMM (35.21% LER) and an HMM-RNN hybrid (31.57% weighted error).[1] These results highlighted CTC's potential to surpass established methods without requiring external alignment procedures.[1]
Evolution and Adoption
Following its introduction, Connectionist Temporal Classification (CTC) saw early adoption in the 2010s through integration with long short-term memory (LSTM) recurrent neural networks (RNNs) for sequence-to-sequence tasks, particularly in speech and handwriting recognition. For handwriting recognition, CTC was integrated with bidirectional LSTM (BLSTM) networks as early as 2008 using multidimensional RNNs, enabling unsegmented text line transcription on datasets such as IAM.[3] In speech recognition, a seminal application was the 2014 end-to-end system by Alex Graves (Google DeepMind) and Navdeep Jaitly (University of Toronto), which used deep RNNs trained via CTC to map audio inputs directly to characters, achieving word error rates competitive with traditional hybrid models on benchmarks like the Wall Street Journal corpus.[7]

The period from 2015 to 2020 marked a surge in CTC's use within end-to-end automatic speech recognition (ASR) models, driven by advances in deep learning architectures. Baidu's Deep Speech system exemplified this trend, employing CTC with multi-layer RNNs to scale to large datasets and achieve low error rates on English and Mandarin speech, such as a 4.98% word error rate on the WSJ eval93 test set.[8] This era saw CTC become a cornerstone of alignment-free sequence learning, facilitating the shift from phonetic to character-level modeling in industrial ASR pipelines.

In the 2020s, CTC extended to transformer-based architectures, often in encoder-only configurations where the transformer processes input sequences and CTC handles alignment-free output, as in the Conformer model that combined convolutions with self-attention for state-of-the-art ASR performance. By 2025, CTC had become a standard component in major deep learning frameworks, with native implementations in PyTorch via the CTCLoss module for efficient training and decoding. Similarly, TensorFlow provides CTC loss functions optimized for GPU acceleration, supporting scalable deployment.

Its adoption extended to real-time applications, such as mobile ASR systems for on-device transcription, and non-traditional domains like automatic music transcription, where CTC enables polyphonic note prediction from audio without explicit onset alignment.[9] Hybrid approaches combining CTC with attention mechanisms further enhanced robustness, as demonstrated in works integrating attention layers to refine CTC posteriors for improved sequence accuracy in noisy environments.[10]
Mathematical Formulation
Problem Statement
Connectionist temporal classification (CTC) addresses the problem of labeling unsegmented sequential data, where the input is a sequence of observations without predefined boundaries corresponding to the target labels.[1] Formally, the input sequence is denoted as $\mathbf{x} = (x_1, \ldots, x_T)$, consisting of $T$ feature vectors (e.g., acoustic frames in speech recognition), while the target label sequence is $\mathbf{l} = (l_1, \ldots, l_U)$, a sequence of $U$ labels drawn from a finite alphabet $L$, with typically $U \le T$ due to the variable-length nature of the input.[1] The input sequence is often processed into frame-level outputs by a recurrent neural network operating on the raw data.[1]

The core challenge in this setup is the absence of a direct one-to-one correspondence or explicit alignment between the positions in $\mathbf{x}$ and $\mathbf{l}$, making it infeasible to train models using standard supervised methods that require paired timestamps.[1] Instead, CTC aims to compute the posterior probability $p(\mathbf{l} \mid \mathbf{x})$ by marginalizing over all possible alignments between the input and output sequences, without enumerating them explicitly.[1]

To model this, CTC assumes that the outputs at different time steps are conditionally independent given the input sequence.[1] The alphabet is extended by introducing a special blank symbol $\varnothing$, forming the expanded set $L' = L \cup \{\varnothing\}$, which allows the model to handle repetitions in the target sequence by inserting blanks between identical consecutive labels.[1]
Alignment and Paths
In connectionist temporal classification (CTC), the alignment between an input sequence of length $T$ and a target label sequence $\mathbf{l}$ of length $U$ over an alphabet $L$ is modeled through intermediate paths that account for variable-length mappings without explicit segmentation.[1] A path $\pi$ is a sequence of length $T$ drawn from the extended alphabet $L' = L \cup \{\varnothing\}$, where $\varnothing$ denotes a special blank symbol that does not correspond to any label in $L$.[1] These paths represent potential alignments by emitting labels or blanks at each time step, allowing for repetitions and insertions to bridge the length mismatch between $T$ and $U$.[1]

Paths in CTC are required to be monotonic, ensuring that the sequence of non-blank labels in $\pi$ appears in the same order as in $\mathbf{l}$, without backtracking or reordering.[1] This monotonicity preserves the temporal progression of the target labels while permitting flexible spacing via blanks and repeated emissions.

The collapse function, denoted as the mapping $\mathcal{B}$, transforms a path $\pi$ into the target sequence by first merging any consecutive duplicate labels and then removing all blank symbols.[1] For example, the path $(a, \varnothing, a, a, b)$ collapses under $\mathcal{B}$ to $(a, a, b)$.[1] The set of all valid paths that align to a specific $\mathbf{l}$ is the preimage $\mathcal{B}^{-1}(\mathbf{l})$, comprising all $\pi$ such that $\mathcal{B}(\pi) = \mathbf{l}$.[1] In general, this set contains very many elements, as the combinatorial possibilities for inserting blanks and repeating labels scale rapidly with $T$ and $U$.[1]
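To make the mapping concrete, the following short Python sketch implements $\mathcal{B}$ (the function name collapse and the string blank marker are illustrative choices, not taken from any particular library):

def collapse(path, blank="-"):
    """The CTC mapping B: merge consecutive duplicates, then remove blanks."""
    merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return [k for k in merged if k != blank]

print(collapse(["a", "-", "a", "a", "b"]))       # ['a', 'a', 'b']
print(collapse(["-", "a", "a", "-", "-", "b"]))  # ['a', 'b']
print(collapse(["a", "-", "a"]))                 # ['a', 'a']: the blank keeps the repeat

The third call shows why the blank is needed: without it, the two emissions of "a" would merge into one, making genuinely repeated labels unrepresentable.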
CTC Loss Function
The connectionist temporal classification (CTC) loss function is derived from the outputs of a recurrent neural network (RNN) processing an input sequence $\mathbf{x}$. At each time step $t$, the RNN produces a probability distribution over the possible labels, including a blank symbol, denoted as $y^t_k$ for label $k$ in the extended label set $L' = L \cup \{\varnothing\}$, where $L$ is the set of target labels.[1] A path $\pi$ is a sequence of length $T$ (matching the input length) over $L'$, and its probability given $\mathbf{x}$ is the product of the per-timestep probabilities:

$$p(\pi \mid \mathbf{x}) = \prod_{t=1}^{T} y^t_{\pi_t}$$

This assumes conditional independence of label emissions across time steps given the input.[1]

The CTC objective marginalizes over all possible alignments of the target sequence (also called the labeling $\mathbf{l}$) by summing the probabilities of all paths that map to $\mathbf{l}$ via the mapping function $\mathcal{B}$, which collapses repeated labels and removes blanks:

$$p(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi \mid \mathbf{x})$$

where $\mathcal{B}^{-1}(\mathbf{l})$ is the set of all paths mapping to $\mathbf{l}$. This marginal probability represents the total likelihood of observing $\mathbf{l}$ under any valid alignment.[1]

The CTC loss for a single example is the negative log-likelihood $-\ln p(\mathbf{l} \mid \mathbf{x})$, which is minimized during training via backpropagation through time, as it is differentiable with respect to the network outputs. For a training set $S$ of weighted examples $(\mathbf{x}, \mathbf{l})$, the full objective is

$$\mathcal{L}(S) = -\sum_{(\mathbf{x}, \mathbf{l}) \in S} w_{(\mathbf{x}, \mathbf{l})} \ln p(\mathbf{l} \mid \mathbf{x})$$

where the weights $w_{(\mathbf{x}, \mathbf{l})}$ can be uniform (e.g., 1) or adjusted for balance. In modern implementations, such as PyTorch's CTCLoss, the loss for a batch is aggregated using reduction options: 'sum' totals the per-example losses, while 'mean' (the default) averages them after dividing by the target sequence lengths, enabling efficient stochastic gradient descent.[1][11]
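On tiny inputs, the sum over paths can be checked by brute force: enumerate every length-$T$ path over $L'$, collapse it, and add up the product probabilities of the paths that collapse to the target. The sketch below does exactly that (all names are illustrative, and it is for intuition only, since the number of paths grows as $|L'|^T$):

import itertools
import numpy as np

def collapse(path, blank=0):
    """The CTC mapping B: merge consecutive duplicates, then remove blanks."""
    merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return tuple(k for k in merged if k != blank)

def ctc_prob_bruteforce(y, target, blank=0):
    """p(l|x) as an explicit sum over all paths collapsing to `target`.

    y: [T, C] array of per-timestep label probabilities (softmax outputs).
    """
    T, C = y.shape
    total = 0.0
    for path in itertools.product(range(C), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return total

# Three timesteps, classes {0: blank, 1: 'a', 2: 'b'}; each row sums to 1
y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])
print(ctc_prob_bruteforce(y, [1, 2]))  # p('ab' | x), summed over 5 valid paths

The forward-backward algorithm described next computes the same quantity in time proportional to $T \cdot U$ rather than exponential time.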
Computational Methods
Forward-Backward Algorithm
The forward-backward algorithm provides an efficient dynamic programming approach to compute the CTC loss function by summing the probabilities of all valid alignments between an input sequence of length $T$ and a target label sequence of length $U$, without explicitly enumerating the exponential number of possible paths.[1] This method, analogous to the forward-backward algorithm in hidden Markov models, enables gradient-based training of recurrent neural networks by calculating both the total probability and the necessary derivatives.[1]

The algorithm operates over an extended label sequence $\mathbf{l}'$ of length $2U + 1$, which inserts blanks ($\varnothing$) before, after, and between the labels in $\mathbf{l}$ to represent valid paths. The forward variables, denoted $\alpha_t(s)$, represent the total probability of all paths that produce the prefix of $\mathbf{l}'$ up to state $s$ at time step $t$. These are computed recursively. First, define the initialization

$$\alpha_1(1) = y^1_{\varnothing}, \qquad \alpha_1(2) = y^1_{l'_2}, \qquad \alpha_1(s) = 0 \ \text{for } s > 2.$$

Then, for $t > 1$:

- If $l'_s = \varnothing$ or $l'_s = l'_{s-2}$:
$$\alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y^t_{l'_s}$$
- Otherwise:
$$\alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y^t_{l'_s}$$

The total sequence probability is recovered at the final time step as $p(\mathbf{l} \mid \mathbf{x}) = \alpha_T(2U+1) + \alpha_T(2U)$, since a valid path must end in either the final label or the trailing blank. The backward variables $\beta_t(s)$, giving the total probability of all path suffixes that complete $\mathbf{l}'$ from state $s$ at time $t$, satisfy the mirror-image recursion run from $t = T$ down to $t = 1$:

- If $l'_s = \varnothing$ or $l'_s = l'_{s+2}$:
$$\beta_t(s) = \big(\beta_{t+1}(s) + \beta_{t+1}(s+1)\big)\, y^t_{l'_s}$$
- Otherwise:
$$\beta_t(s) = \big(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\big)\, y^t_{l'_s}$$

Combining the two, the quantity $\alpha_t(s)\,\beta_t(s)/y^t_{l'_s}$ is the total probability of all paths passing through state $s$ at time $t$, from which the gradient of the loss with respect to each network output follows in closed form.[1]
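As a concrete illustration, here is a minimal NumPy sketch of the forward pass (the function name ctc_forward and its argument layout are illustrative; real implementations work in log space to avoid numerical underflow):

import numpy as np

def ctc_forward(y, labels, blank=0):
    """Total probability p(l|x) via the CTC forward recursion.

    y:      [T, C] per-timestep label probabilities (softmax outputs)
    labels: target label sequence without blanks, e.g. [1, 3, 3, 2]
    """
    T = y.shape[0]
    ext = [blank]                       # extended sequence l': blanks around labels
    for l in labels:
        ext += [l, blank]
    S = len(ext)                        # S = 2U + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]          # start with the leading blank ...
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]      # ... or the first label

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                       # stay in the same state
            if s > 0:
                a += alpha[t - 1, s - 1]              # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]              # skip the blank between distinct labels
            alpha[t, s] = a * y[t, ext[s]]

    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]  # end in final label or blank

On the tiny example from the previous section, ctc_forward reproduces the brute-force sum exactly while filling only $T \cdot (2U + 1)$ table cells.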
Decoding Strategies
Decoding in Connectionist Temporal Classification (CTC) involves inferring the most likely output sequence from the model's frame-level probability distributions at inference time, without requiring explicit alignment between input and output lengths.[1] The process leverages the CTC framework's blank symbol to handle variable-length sequences and repetitions, collapsing repeated labels and removing blanks to produce the final transcription. Common strategies balance computational efficiency with accuracy and are typically evaluated using metrics such as character error rate (CER) for subword-level tasks or word error rate (WER) for full transcriptions.

Greedy decoding, also known as best-path decoding, is the simplest approach: it selects the most probable symbol (including the blank) at each time step and then applies the CTC collapse operation to remove duplicates and blanks.[1] This method approximates the most likely labeling by $\mathbf{l}^* \approx \mathcal{B}(\arg\max_{\pi} p(\pi \mid \mathbf{x}))$, where $\pi$ is a path and $\mathbf{x}$ is the input sequence, but it ignores the fact that many paths can share a single labeling and can therefore yield suboptimal results.[1] On benchmarks like TIMIT, greedy decoding achieves a label error rate (LER) of around 31.47%, highlighting its computational efficiency at the cost of accuracy.[1]

Beam search improves upon greedy decoding by maintaining a fixed-width beam of the top-K most probable partial hypotheses at each time step, scoring them using CTC probabilities computed via the forward-backward algorithm and pruning low-scoring paths. This allows incorporation of linguistic constraints, such as n-gram or neural language models, to favor coherent sequences while exploring multiple alignments. For instance, on the Switchboard Eval2000 corpus, beam search with a character-based language model reduces WER to 18.6%, compared to 30.4% for greedy decoding.[12]

Prefix search variants, such as those using weighted finite-state transducers (WFSTs), extend beam search by integrating lexicon and grammar constraints into a search graph that models valid output sequences. In WFST-based decoding, the CTC blank label is handled explicitly in the transducer topology, enabling efficient beam search over character or phoneme inputs with dictionary constraints. This approach achieves strong performance, such as a 17.0% WER on the Switchboard Eval2000 corpus for character models, by composing the CTC topology with language model finite-state acceptors.[12]

Evaluation of decoding strategies typically employs CER, which counts insertion, substitution, and deletion errors at the character level, or WER for word-level assessment, providing standardized benchmarks across datasets like Eval2000. For example, WFST decoding yields CER components of 8.8% insertions, 13.0% substitutions, and 1.9% deletions on Eval2000, demonstrating balanced error reduction.[12]

A practical tip for improving decoding efficiency and quality is blank probability thresholding, where output sequences are segmented at frames whose blank probability exceeds a high threshold (e.g., 99.99%), allowing independent decoding of the resulting sections and mitigating boundary errors in long inputs.[1] This heuristic reduces the risk of over-repetition without significantly affecting overall accuracy.[1]
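Best-path decoding follows directly from the collapse rule and fits in a few lines. The sketch below (illustrative names, blank assumed at index 0) takes per-frame probabilities of shape [T, C] and returns the collapsed label sequence:

import numpy as np

def greedy_decode(y, blank=0):
    """Best-path CTC decoding: per-frame argmax, then collapse."""
    best = np.argmax(y, axis=1)                 # most probable class per frame
    merged = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return [int(k) for k in merged if k != blank]

y = np.array([[0.1, 0.7, 0.2],   # frame 1 -> label 1
              [0.6, 0.2, 0.2],   # frame 2 -> blank
              [0.2, 0.1, 0.7],   # frame 3 -> label 2
              [0.2, 0.1, 0.7]])  # frame 4 -> label 2, merged with frame 3
print(greedy_decode(y))          # [1, 2]

Note that the per-frame argmax path is not necessarily the most probable labeling: a labeling's probability is the sum over all of its paths, which is what beam search approximates more faithfully.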
Applications
Speech Recognition
In automatic speech recognition (ASR), connectionist temporal classification (CTC) enables end-to-end training of neural networks that map variable-length audio inputs, typically represented as log-mel spectrograms or mel-frequency cepstral coefficients (MFCCs), directly to output sequences of phonemes, characters, or subword units. This formulation inherently handles discrepancies in speaking rates and durations by incorporating a special blank symbol to denote non-emitting states or repetitions, allowing the model to marginalize over all possible alignments between input frames and output labels without requiring explicit time synchronization. As a result, CTC simplifies the traditional ASR pipeline by bypassing intermediate steps like hidden Markov models (HMMs) and forced alignment, making it particularly suitable for processing unsegmented audio streams.

A landmark implementation of CTC in ASR is Baidu's Deep Speech 2 system, introduced in 2015, which employed a deep bidirectional long short-term memory (LSTM) architecture trained end-to-end with CTC to transcribe English speech. On the Wall Street Journal (WSJ) corpus, a standard benchmark for read speech, Deep Speech 2 achieved a word error rate (WER) of 3.1%, surpassing many conventional hybrid DNN-HMM systems of the time and demonstrating CTC's efficacy for large-vocabulary continuous recognition. Similarly, CTC-based models have been evaluated on the Google Speech Commands dataset, a collection of short audio clips for keyword spotting, where lightweight convolutional or recurrent architectures yield accuracies above 95% on 35-command subsets, highlighting CTC's utility in resource-constrained embedded applications.[13][14]

CTC offers key advantages in ASR by promoting robustness to variations such as speaker accents and environmental noise, as alignment-free training allows the model to learn flexible mappings from raw acoustic features without predefined phonetic dictionaries or phone-level annotations. This end-to-end paradigm reduces sensitivity to the misalignment errors common in traditional systems, enabling better generalization across diverse audio conditions. By 2025, hybrid CTC-attention mechanisms had further advanced ASR, as seen in adaptations of OpenAI's Whisper model (originally released in 2022), where CTC provides monotonic alignments for streaming decoding combined with attention for contextual refinement, facilitating real-time deployment on edge devices with latencies under 100 ms.[15][16]

Prominent datasets for developing and assessing CTC-based ASR systems include LibriSpeech, comprising approximately 960 hours of English audiobooks with aligned transcripts in clean and noisy subsets, and Mozilla Common Voice, a crowdsourced repository exceeding 20,000 hours of multilingual speech by 2025 that emphasizes inclusivity across accents and languages. These corpora support scalable training of CTC models, with LibriSpeech often yielding WERs below 5% for English baselines and Common Voice enabling low-resource adaptations.
Optical Character Recognition
Connectionist temporal classification (CTC) enables optical character recognition (OCR) by processing sequential image features, such as pixel rows or convolutional feature maps from scanned documents or photos, directly into character sequences without requiring prior segmentation or alignment. This method is especially effective for handwriting and printed text, where input variability, such as differing line thicknesses or distortions, poses challenges to traditional approaches. By marginalizing over all possible alignments between input frames and output labels, CTC accommodates the irregular timing of visual features, making it suitable for both offline (static image) and scene text recognition tasks.[17]

A foundational application of CTC in offline handwriting recognition was demonstrated by Graves and Schmidhuber in 2008, who developed a system using multidimensional recurrent neural networks to process raw pixel data from document images. Their model achieved a character error rate (CER) of 10.7% on the validation set of the IFN/ENIT Arabic handwriting database, outperforming competition entries and highlighting CTC's ability to handle cursive scripts without explicit feature engineering. For scene text recognition, the convolutional recurrent neural network (CRNN) model by Shi et al., introduced in 2015 and detailed in their 2017 publication, integrates CTC for end-to-end transcription of text in natural images. On the Street View Text (SVT) dataset without a lexicon, CRNN attained 80.8% accuracy, demonstrating robust performance on the irregular, perspective-distorted text common in real-world scenes.[18]

By 2025, CTC-based architectures continued to underpin multilingual OCR in document AI pipelines, supporting non-Latin scripts through adaptable vocabulary expansions and shared sequence modeling. These advancements have enabled seamless integration into commercial systems for extracting text from diverse documents, enhancing accessibility for low-resource languages.[19]

CTC specifically addresses OCR challenges like variable stroke lengths in handwriting, where input sequences may span differing numbers of frames for the same character, by permitting many-to-one mappings in its alignment paths and thus avoiding fixed-length assumptions. Ligatures and cursive connections are managed through the blank symbol, which acts as a separator distinguishing adjacent or repeated characters (e.g., "oo" from "o"), allowing the model to collapse redundant predictions while preserving sequence integrity. In practice, decoding strategies such as beam search are applied to CTC outputs in OCR to generate the most likely text hypotheses from these probabilistic alignments.[20]
Implementations and Tools
Software Libraries
Several major open-source software libraries provide implementations of Connectionist Temporal Classification (CTC), enabling its integration into deep learning workflows for tasks such as sequence alignment and loss computation. These libraries typically include the CTC loss function, often built on the forward-backward algorithm for efficient probability summation over alignments, along with support for batched processing and decoding utilities.[21]

In PyTorch, the torch.nn.CTCLoss module, introduced in version 1.0 released in October 2018, computes the CTC loss between unsegmented time series inputs and target sequences. It supports batched inputs, configurable blank label indices, and reduction modes such as 'mean', 'sum', and 'none' for flexible loss aggregation across samples. This implementation handles variable-length sequences efficiently via input and target length tensors, making it suitable for training recurrent or convolutional sequence models.[22]
TensorFlow offers tf.nn.ctc_loss as a core function for CTC loss calculation, available since version 0.8 in April 2016, with enhancements in subsequent releases for better numerical stability and GPU acceleration via cuDNN. It processes logits, sparse labels, and sequence lengths, returning per-example losses that can be reduced as needed. For advanced decoding, TensorFlow includes built-in functions like tf.nn.ctc_beam_search_decoder, which performs beam search to find high-probability alignments without external dependencies.[23][21][24]
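As a minimal sketch of the TensorFlow 2 loss API (shapes and values are illustrative; tf.nn.ctc_loss uses class 0 as the blank by default):

import tensorflow as tf

batch, time_steps, num_classes = 2, 5, 29      # class 0 serves as the blank here
logits = tf.random.normal([time_steps, batch, num_classes])  # time-major logits

# Dense, zero-padded labels with explicit per-example lengths
labels = tf.constant([[1, 3, 3, 2], [5, 0, 0, 0]], dtype=tf.int32)
label_length = tf.constant([4, 1], dtype=tf.int32)
logit_length = tf.fill([batch], time_steps)

loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                      logits_time_major=True, blank_index=0)
print(loss.numpy())   # per-example losses; reduce with tf.reduce_mean as needed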
Baidu's PaddlePaddle framework provides paddle.nn.CTCLoss, which integrates the Warp-CTC library for high-performance CTC computation on both CPU and GPU. The Warp-CTC library is a foundational C++/CUDA implementation of CTC, originally developed by Baidu for efficient training and inference. This implementation supports batched variable-length inputs and is optimized for large-scale training, commonly used in PaddleOCR for optical character recognition pipelines.[25][26]
For optimized inference, Apache TVM, an open deep learning compiler, enables deployment of CTC-based models across diverse hardware by compiling frameworks like PyTorch or TensorFlow graphs into efficient runtime code, achieving speedups through operator fusion and hardware-specific tuning.
Hugging Face's Transformers library facilitates the use of pre-trained CTC models, such as Wav2Vec2, which employ CTC loss during fine-tuning for automatic speech recognition. These models can be loaded via simple APIs, with built-in support for CTC decoding and integration with PyTorch or TensorFlow backends.
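A minimal sketch of running a pre-trained CTC model through Transformers follows; the checkpoint name facebook/wav2vec2-base-960h is one common example, and the silent dummy waveform merely stands in for real 16 kHz audio:

import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for one second of 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # [batch, frames, vocab]

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))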
ONNX format supports exporting CTC loss and decoding operations from PyTorch and TensorFlow, allowing cross-framework model portability and inference on runtimes like ONNX Runtime for edge devices. In Rust ecosystems, the Tract crate serves as an ONNX inference engine capable of executing CTC-optimized models in embedded environments with low overhead.[27]
The following Python example illustrates basic CTC loss computation in PyTorch. Note that nn.CTCLoss expects time-major log-probabilities, so raw logits are first normalized with log_softmax, and the targets for the whole batch are concatenated into a single tensor whose per-example lengths are given separately:
import torch
import torch.nn as nn

# Example inputs: batch_size=2, time_steps=5, num_classes=29 (class 0 is the blank)
# nn.CTCLoss expects log-probabilities of shape [time, batch, classes]
logits = torch.randn(5, 2, 29)           # [time, batch, classes]
log_probs = logits.log_softmax(dim=2)    # normalize to log-probabilities

# Concatenated targets: (1, 3, 3, 2) for the first example, (5,) for the second
targets = torch.tensor([1, 3, 3, 2, 5], dtype=torch.int32)
input_lengths = torch.tensor([5, 5], dtype=torch.int32)
target_lengths = torch.tensor([4, 1], dtype=torch.int32)

ctc_loss = nn.CTCLoss(blank=0, reduction='mean')
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())