Seq2seq
from Wikipedia
Animation of seq2seq with RNN and attention mechanism

Seq2seq is a family of machine learning approaches used for natural language processing.[1] Applications include language translation,[2] image captioning,[3] conversational models,[4] speech recognition,[5] and text summarization.[6] Seq2seq uses sequence transformation: it turns one sequence into another sequence.

History

One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.

— Warren Weaver, Letter to Norbert Wiener, March 4, 1947

Shannon's diagram of a general communications system, showing the process by which a message sent becomes the message received (possibly corrupted by noise)

seq2seq is an approach to machine translation (or more generally, sequence transduction) with roots in information theory, where communication is understood as an encode-transmit-decode process, and machine translation can be studied as a special case of communication. This viewpoint was elaborated, for example, in the noisy channel model of machine translation.

In practice, seq2seq maps an input sequence into a real-numerical vector by using a neural network (the encoder), and then maps it back to an output sequence using another neural network (the decoder).

The idea of encoder-decoder sequence transduction had been developed in the early 2010s (see [2][1] for earlier papers). The papers most commonly cited as the originators of seq2seq are two papers from 2014.[2][1]

In seq2seq as proposed in those papers, both the encoder and the decoder were LSTMs. This design had a "bottleneck" problem: since the encoding vector has a fixed size, long input sequences are difficult to fit into it, and information tends to be lost. The attention mechanism, proposed in 2014,[7] resolved the bottleneck problem. Its authors called their model RNNsearch, as it "emulates searching through a source sentence during decoding a translation".

A problem with seq2seq models at this point was that recurrent neural networks are difficult to parallelize. The 2017 publication of Transformers[8] resolved the problem by replacing the encoding RNN with self-attention Transformer blocks ("encoder blocks"), and the decoding RNN with cross-attention causally-masked Transformer blocks ("decoder blocks").

Priority dispute

One of the papers cited as the originator of seq2seq is (Sutskever et al. 2014),[1] published at Google Brain while its authors were working on Google's machine translation project. The research allowed Google to overhaul Google Translate into Google Neural Machine Translation in 2016.[1][9] Tomáš Mikolov claims to have developed the idea (before joining Google Brain) of using a "neural language model on pairs of sentences... and then [generating] translation after seeing the first sentence"—which he equates with seq2seq machine translation, and to have mentioned the idea to Ilya Sutskever and Quoc Le (while at Google Brain), who failed to acknowledge him in their paper.[10] Mikolov had worked on RNNLM (using RNNs for language modelling) for his PhD thesis,[11] and is more notable for developing word2vec.

Architecture

The main reference for this section is [12].

Encoder

RNN encoder

The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, as a context vector. The context vector is the weighted sum of the input hidden states and is generated for every time step of the output sequence.
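For illustration, a minimal encoder along these lines might look like the following sketch. PyTorch is assumed here; the class name, vocabulary size, and embedding and hidden dimensions are illustrative choices, not taken from any reference implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal RNN encoder sketch: token ids -> per-step hidden states."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, src_len)
        embedded = self.embed(token_ids)        # (batch, src_len, emb_dim)
        states, final = self.rnn(embedded)      # all hidden states, last state
        return states, final                    # states later feed the attention

enc = Encoder()
states, final = enc(torch.randint(0, 1000, (2, 7)))  # toy batch of 2 sentences
print(states.shape, final.shape)                      # (2, 7, 128), (1, 2, 128)
```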

Decoder

RNN decoder

The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. The decoder operates in an autoregressive manner, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, the context vector, and the input sequence information to make predictions for the next element in the output sequence. Specifically, in a model with an attention mechanism, the context vector and the hidden state are concatenated to form an attention hidden vector, which is used as an input for the decoder.
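A single decoding step of this kind could be sketched as follows, again under illustrative assumptions: PyTorch, toy dimensions, and a context vector supplied by some attention computation that is not shown here.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decoder step of an attention-based seq2seq sketch."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim, hidden_dim)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, prev_token, prev_hidden, context):
        hidden = self.cell(self.embed(prev_token), prev_hidden)
        # concatenate context and hidden state into the attention hidden vector
        attn_hidden = torch.cat([context, hidden], dim=-1)
        logits = self.out(attn_hidden)          # scores over the vocabulary
        return logits, hidden

step = DecoderStep()
logits, hidden = step(torch.tensor([3, 7]),     # previously generated tokens
                      torch.zeros(2, 128),      # previous decoder hidden state
                      torch.zeros(2, 128))      # context vector from attention
print(logits.shape)                             # (2, 1000)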

The seq2seq method developed in the early 2010s uses two neural networks: an encoder network converts an input sentence into numerical vectors, and a decoder network converts those vectors into sentences in the target language. The attention mechanism, grafted onto this structure in 2014, is shown below; it was later refined into the encoder-decoder Transformer architecture of 2017.

Training vs prediction

Training a seq2seq model via teacher forcing
Predicting a sequence using a seq2seq model

There is a subtle difference between training and prediction. During training time, both the input and the output sequences are known. During prediction time, only the input sequence is known, and the output sequence must be decoded by the network itself.

Specifically, consider an input sequence x_1, …, x_n and output sequence y_1, …, y_m. The encoder would process the input step by step. After that, the decoder would take the output from the encoder, as well as the <bos> token, as input, and produce a prediction ŷ_1. Now, the question is: what should be input to the decoder in the next step?

A standard method for training is "teacher forcing". In teacher forcing, no matter what is output by the decoder, the next input to the decoder is always the reference. That is, even if ŷ_1 ≠ y_1, the next input to the decoder is still y_1, and so on.

During prediction time, the "teacher" would be unavailable. Therefore, the input to the decoder must be ŷ_1, then ŷ_2, and so on.

It is found that if a model is trained purely by teacher forcing, its performance degrades at prediction time, since generation based on the model's own output differs from generation based on the teacher's output. This is called exposure bias, or a train/test distribution shift. A 2015 paper recommends randomly switching between teacher forcing and no teacher forcing during training.[13]
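A rough sketch of such a mixed training regime (per-step random switching between teacher forcing and feeding back the model's own prediction) is shown below. The toy decoder, vocabulary size, tensor shapes, and mixing probability are all illustrative assumptions, not details from the cited paper.

```python
import random
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Toy decoder cell for illustration only: embeds a token, updates a GRU
    state, and predicts a distribution over a small hypothetical vocabulary."""
    def __init__(self, vocab=12, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, prev_token, state):
        state = self.cell(self.emb(prev_token), state)
        return self.out(state), state           # logits, new state

def train_step(dec, enc_final, target, p_teacher=0.5):
    """One training step mixing teacher forcing and free-running generation.
    `target` is (batch, T) with the <bos> token assumed at position 0."""
    loss_fn = nn.CrossEntropyLoss()
    state, loss = enc_final, 0.0
    inp = target[:, 0]                          # <bos>
    for t in range(1, target.size(1)):
        logits, state = dec(inp, state)
        loss = loss + loss_fn(logits, target[:, t])
        use_teacher = random.random() < p_teacher
        # teacher forcing feeds the reference token; otherwise feed the model's own guess
        inp = target[:, t] if use_teacher else logits.argmax(dim=-1)
    return loss / (target.size(1) - 1)

# Usage with random data (shapes only, for illustration)
dec = ToyDecoder()
enc_final = torch.zeros(4, 16)                  # pretend encoder summary
target = torch.randint(0, 12, (4, 7))           # pretend reference sequence
loss = train_step(dec, enc_final, target)
loss.backward()
```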

Attention for seq2seq

The attention mechanism is an enhancement introduced by Bahdanau et al. in 2014 to address a limitation of the basic seq2seq architecture, in which a longer input sequence causes the encoder's hidden-state output to become a poor summary for the decoder. It enables the model to selectively focus on different parts of the input sequence during the decoding process. At each decoder step, an alignment model calculates the attention score using the current decoder state and all of the attention hidden vectors as input. The alignment model is another neural network, trained jointly with the seq2seq model, that calculates how well an input, represented by its hidden state, matches the previous output, represented by the attention hidden state. A softmax function is then applied to the attention scores to obtain the attention weights.

Seq2seq RNN encoder-decoder with attention mechanism, where the detailed construction of attention mechanism is exposed. See attention mechanism page for details.

In some models, the encoder states are fed directly into an activation function, removing the need for an alignment model. The activation function receives one decoder state and one encoder state and returns a scalar value indicating their relevance.

Animation of seq2seq with RNN and attention mechanism

Consider a seq2seq English-to-French translation task. To be concrete, let us consider the translation of "the zone of international control <end>", which should translate to "la zone de contrôle international <end>". Here, we use the special <end> token as a control character to delimit the end of input for both the encoder and the decoder.

An input sequence of text is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors h_0, h_1, …, h_5, where h stands for "hidden vector".

After the encoder has finished processing, the decoder starts operating over the hidden vectors to produce an output sequence y_1, y_2, …, autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder and what the decoder itself has produced before, to produce the next output word:

  1. (h_0 … h_5, "<start>") → "la"
  2. (h_0 … h_5, "<start> la") → "la zone"
  3. (h_0 … h_5, "<start> la zone") → "la zone de"
  4. ...
  5. (h_0 … h_5, "<start> la zone de contrôle international") → "la zone de contrôle international <end>"

Here, we use the special <start> token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "<end>" appears in the decoder output.
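A greedy decoding loop matching these steps might be sketched as follows. The function decode_step stands in for the trained decoder and is purely hypothetical; it is only meant to show the control flow of appending one token per step and stopping at the end marker.

```python
def greedy_decode(decode_step, encoder_hidden, max_len=50,
                  start="<start>", end="<end>"):
    """Minimal greedy decoding loop sketch.  `decode_step(hidden, prefix)` is a
    hypothetical stand-in for the trained decoder: given the encoder's hidden
    vectors and the tokens generated so far, it returns the next output token."""
    output = [start]
    for _ in range(max_len):
        next_token = decode_step(encoder_hidden, output)
        output.append(next_token)
        if next_token == end:                 # terminate on the end marker
            break
    return output[1:]                          # drop the <start> control token

# Toy stand-in that "translates" the running example deterministically.
reference = "la zone de contrôle international <end>".split()
toy_step = lambda hidden, prefix: reference[len(prefix) - 1]
print(greedy_decode(toy_step, encoder_hidden=None))
```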

Attention weights

Attention mechanism with attention weights, overview

As hand-crafting weights would defeat the purpose of machine learning, the model must compute the attention weights on its own. By analogy with the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query and obtains a reply in the form of a weighted sum of the values, where the weight is proportional to how closely the query resembles each key.

The decoder first processes the "<start>" input partially, to obtain an intermediate vector h^d_0, the 0th hidden vector of the decoder. Then, the intermediate vector is transformed by a linear map W_Q into a query vector q_0 = h^d_0 W_Q. Meanwhile, the hidden vectors output by the encoder are transformed by another linear map W_K into key vectors k_i = h_i W_K. The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.

Now, the query and keys are compared by taking dot products: q_0 k_0^T, q_0 k_1^T, …, q_0 k_5^T. Ideally, the model should have learned to compute the keys and values such that q_0 k_0^T is large, q_0 k_1^T is small, and the rest are very small. This can be interpreted as saying that the attention weight should be mostly applied to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest.

In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over 0, 1, …, 5. This can be accomplished by the softmax function, thus giving us the attention weights:

(α_0, α_1, …, α_5) = softmax(q_0 k_0^T, q_0 k_1^T, …, q_0 k_5^T)

This is then used to compute the context vector:

c_0 = α_0 v_0 + α_1 v_1 + ⋯ + α_5 v_5

where v_i = h_i W_V are the value vectors, linearly transformed by another matrix W_V to provide the model with freedom to find the best way to represent values. Without the matrices W_Q, W_K, W_V, the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.
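The whole computation for this single decoding step can be sketched numerically as follows, a minimal NumPy transcription in which random matrices stand in for the learned maps W_Q, W_K, W_V and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # hidden size (illustrative)
H = rng.normal(size=(6, d))               # encoder hidden vectors h_0 .. h_5
h0_dec = rng.normal(size=d)               # decoder's 0th hidden vector h^d_0

# Learned linear maps (random here, just to show the shapes involved)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

q = h0_dec @ W_Q                          # query from the decoder
K = H @ W_K                               # keys from the encoder
V = H @ W_V                               # values from the encoder

scores = K @ q                            # dot products q_0 k_i^T
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax -> attention weights alpha_i
context = weights @ V                     # weighted sum of the values, c_0
print(weights.round(3), context.shape)
```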

Computing the attention weights by dot-product. This is the "decoder cross-attention".

This is the dot-product attention mechanism. The particular version described in this section is "decoder cross-attention": the output context vector is used by the decoder, and the keys and values come from the encoder while the query comes from the decoder, hence "cross-attention".

More succinctly, we can write it as

c_0 = Attention(h^d_0 W_Q, H W_K, H W_V) = softmax((h^d_0 W_Q) (H W_K)^T) (H W_V)

where the matrix H is the matrix whose rows are the encoder hidden vectors h_0, …, h_5. Note that the querying vector, h^d_0, is not necessarily the same as the key-value vectors h_i. In fact, it is theoretically possible for query, key, and value vectors to all be different, though that is rarely done in practice.

Seq2seq RNN encoder-decoder with attention mechanism, training
Seq2seq RNN encoder-decoder with attention mechanism, training and inferring

Other applications

In 2019, Facebook announced its use of seq2seq in symbolic integration and the solving of differential equations. The company claimed that it could solve complex equations more rapidly and with greater accuracy than commercial solutions such as Mathematica, MATLAB and Maple. First, the equation is parsed into a tree structure to avoid notational idiosyncrasies. An LSTM neural network then applies its standard pattern-recognition facilities to process the tree.[14][15]

In 2020, Google released Meena, a 2.6 billion parameter seq2seq-based chatbot trained on a 341 GB data set. Google claimed that the chatbot has 1.7 times greater model capacity than OpenAI's GPT-2.[4]

In 2022, Amazon introduced AlexaTM 20B, a moderate-sized (20 billion parameter) seq2seq language model. It uses an encoder-decoder architecture to accomplish few-shot learning. The encoder outputs a representation of the input that the decoder uses as input to perform a specific task, such as translating the input into another language. The model outperforms the much larger GPT-3 in language translation and summarization. Training mixes denoising (appropriately inserting missing text in strings) and causal language modeling (meaningfully extending an input text). It allows adding features across different languages without massive training workflows. AlexaTM 20B achieved state-of-the-art performance in few-shot-learning tasks across all Flores-101 language pairs, outperforming GPT-3 on several tasks.[16]

from Grokipedia
Sequence-to-sequence (seq2seq) models are a class of deep learning architectures designed to transform an input sequence into an output sequence of potentially different length, using an encoder-decoder framework that processes variable-length data end-to-end. Introduced in 2014, these models make minimal assumptions about the input structure and leverage recurrent neural networks (RNNs), particularly long short-term memory (LSTM) units, to handle long-range dependencies in sequences. The core consists of an encoder LSTM that reads the input timestep by timestep to produce a fixed-dimensional vector representation, followed by a decoder LSTM that generates the output conditioned on this representation. In their foundational implementation, Sutskever, Vinyals, and Le employed four-layer LSTMs with approximately 384 million parameters, trained on large parallel corpora, and incorporated a technique of reversing the source order to facilitate gradient flow during training. This approach yielded a breakthrough in machine translation, achieving a BLEU score of 34.8 on the WMT'14 English-to-French dataset, surpassing traditional phrase-based systems.

A significant advancement came with the integration of an attention mechanism, proposed by Bahdanau, Cho, and Bengio in 2014, which addresses the bottleneck of fixed vector representations by enabling the decoder to dynamically weigh and attend to relevant portions of the input sequence. This "soft-alignment" process, computed as a weighted sum of encoder hidden states, improved translation quality for longer sentences, raising BLEU scores from 17.8 (basic encoder-decoder) to 26.8 on English-to-French tasks for sentences up to 50 words in length. Attention has since become a standard component in seq2seq models, enhancing their ability to model alignments between input and output elements.

Beyond translation, seq2seq models have been widely adopted for diverse natural language processing (NLP) tasks, including dialogue generation, where they produce contextually relevant responses to input utterances; abstractive text summarization, condensing long documents into concise outputs; and speech recognition, converting audio sequences to text transcripts. These applications demonstrate the model's versatility in handling sequential data, paving the way for further innovations like transformer-based variants that scale to even larger contexts.

Introduction

Definition and Purpose

Sequence-to-sequence (seq2seq) models constitute a foundational paradigm in deep learning for transforming an input sequence into an output sequence, employing an encoder-decoder framework to handle tasks such as mapping a sentence in a source language to its translation in a target language. The encoder compresses the variable-length input into a fixed-dimensional representation that captures essential information, while the decoder autoregressively generates the output by producing elements one at a time, conditioned on the encoded representation and previously generated tokens.

The core purpose of seq2seq models is to facilitate end-to-end learning for sequence transduction problems, where the entire mapping from input to output is optimized jointly from raw data, bypassing the need for hand-engineered features or multi-stage pipelines. This addresses the constraints of earlier fixed-length models, which assume uniform input and output dimensions and fail to model sequential dependencies effectively. Seq2seq architectures emerged as a response to persistent challenges in machine translation, particularly the difficulties faced by traditional phrase-based statistical machine translation (SMT) systems in capturing long-range dependencies and handling non-monotonic alignments between source and target sequences. By enabling direct learning of sequence mappings, seq2seq models improve fluency and accuracy in generating coherent outputs from diverse inputs.

Key Components Overview

Seq2seq models consist of three primary components: an encoder, a decoder, and, optionally, an attention mechanism to enhance performance on longer sequences. The encoder processes the input sequence to produce a fixed-length context vector that captures the essential information of the input. The decoder then uses this context vector to generate the output sequence one element at a time. To handle variable-length sequences while preserving order and dependencies, seq2seq models employ recurrent layers, such as long short-term memory (LSTM) units, in both the encoder and decoder. These layers process the sequence timestep by timestep, allowing the model to manage inputs and outputs of differing lengths without fixed-size assumptions.

Input sequences are represented through tokenization into words or subwords, followed by embedding layers that map tokens to dense vector representations, typically in high-dimensional spaces such as 1000 dimensions. Outputs are generated probabilistically using a softmax layer over a predefined vocabulary, enabling the model to predict the next token based on the accumulated context. For instance, in machine translation, an input like "Hello world" is tokenized and embedded by the encoder to form a context vector, which the decoder uses to produce tokens for the target language, such as "Bonjour le monde" in French, step by step during inference. Later enhancements, like attention, allow the decoder to focus dynamically on relevant parts of the input, improving alignment for complex tasks.

Historical Development

Early Foundations

The foundations of sequence-to-sequence (seq2seq) models trace back to the pre-neural era of statistical machine translation (SMT), which dominated the field from the late 1980s to the early 2010s. SMT systems modeled translation as a noisy channel problem, estimating the probability of a target sentence given a source sentence using parallel corpora to learn translation and language models. These approaches shifted away from rigid rule-based systems toward data-driven methods that could handle real-world linguistic variability.

A cornerstone of early SMT was the series of IBM Models, developed by researchers at IBM's Thomas J. Watson Research Center in the early 1990s. These models, detailed in a seminal series of five progressively complex formulations, focused on word-level alignments between source and target languages to map sequences probabilistically. Model 1 introduced basic lexical translation probabilities assuming uniform alignment, while later models incorporated distortion (position-based alignments) and fertility (the number of target words generated per source word) to better capture structural differences between languages. By estimating parameters via expectation-maximization on bilingual data, the IBM Models enabled automatic alignment extraction, forming the basis for phrase-based SMT systems that improved translation quality significantly over prior methods.

Despite their impact, SMT frameworks faced key challenges in handling variable-length sequences and alignments. N-gram language models, commonly used in SMT for fluency scoring, suffered from data sparsity and limited context (typically 3-5 grams), failing to capture long-range dependencies essential for coherent translation. Alignment processes required heuristics to resolve ambiguities, such as one-to-many mappings, and struggled with reordering in linguistically distant language pairs, leading to error propagation in decoding. These limitations highlighted the need for models that could learn continuous representations and sequential dependencies more flexibly.

Neural precursors emerged in the late 2000s and early 2010s with recurrent neural networks (RNNs) applied to language modeling, offering a path toward addressing these issues. In 2010, Mikolov et al. introduced RNN-based language models (RNNLMs) that used hidden states to maintain context across arbitrary lengths, outperforming traditional n-grams on perplexity metrics for speech recognition and text prediction tasks. These models demonstrated RNNs' ability to learn distributed word representations and handle sequential data, paving the way for neural integration into translation pipelines. By 2011-2013, neural language models were incorporated as features in SMT systems for rescoring hypotheses, yielding modest but consistent BLEU score improvements (e.g., 0.5-1.0 points) on benchmarks like WMT.

Initial neural ideas for machine translation built on these advances, proposing encoder structures to compress source sequences into fixed representations. In 2013, Kalchbrenner and Blunsom presented recurrent continuous translation models that encode source sentences into continuous representations and generate target sequences with recurrent networks, achieving competitive results on small-scale English-French tasks without relying on phrase tables. This work emphasized end-to-end learning of alignments through continuous embeddings, reducing the modular complexity of SMT. The transition from statistical to neural end-to-end learning accelerated around 2013-2014, driven by advances in GPU computing and larger datasets, enabling seq2seq paradigms to supplant hybrid SMT-neural systems.

Priority Dispute and Key Publications

The RNN encoder-decoder architecture, a foundational component of sequence-to-sequence (seq2seq) models, was first introduced by Kyunghyun Cho and colleagues in their June 2014 arXiv preprint, later published at EMNLP 2014. Authored by researchers from institutions including New York University and the Université de Montréal, the paper proposed using two recurrent neural networks (RNNs)—one as an encoder to compress input sequences into a fixed-length vector and another as a decoder to generate output sequences—for tasks such as statistical machine translation, with additional applications noted for automatic summarization, question answering, and dialogue systems. The model emphasized learning continuous phrase representations to improve integration with traditional phrase-based translation systems.

Shortly thereafter, Ilya Sutskever and colleagues from Google Brain published their September 2014 arXiv preprint, accepted at NeurIPS 2014, which formalized "sequence to sequence learning with neural networks" and popularized the "seq2seq" terminology. This work built upon the encoder-decoder framework by employing long short-term memory (LSTM) units to handle longer sequences more effectively, focusing primarily on end-to-end machine translation without relying on intermediate phrase alignments. Sutskever et al. explicitly cited Cho et al. as related prior work, acknowledging the encoder-decoder structure while advancing it through deeper LSTMs and empirical demonstrations on translation benchmarks.

The Sutskever paper had profound impact by demonstrating state-of-the-art performance on the WMT'14 English-to-French translation task, achieving a BLEU score of 34.81 with an ensemble of deep LSTMs—surpassing the previous phrase-based SMT baseline of 33.3 and establishing seq2seq as a viable alternative to traditional methods. This result, obtained via direct sequence generation without an intermediate phrase-based pipeline, highlighted the model's ability to capture long-range dependencies in translation. The work's emphasis on reversing input sequences during training further improved convergence and performance. Subsequent influences included rapid integration of seq2seq models into major deep learning frameworks such as TensorFlow, where example implementations and tutorials for encoder-decoder architectures emerged by 2017 to facilitate experimentation in neural machine translation and beyond. These developments accelerated adoption across research and industry, cementing seq2seq as a cornerstone of neural sequence modeling.

Core Architecture

Encoder Mechanism

The encoder in a sequence-to-sequence (seq2seq) model serves to process an input sequence x_1, …, x_T, where each x_t is typically an embedding of a token such as a word or subword, through a series of recurrent layers to generate a sequence of hidden states h_1, …, h_T. These hidden states capture the contextual information from the input up to each time step t. In the basic formulation, the final hidden state h_T is often used as a fixed-dimensional context vector that summarizes the entire input for downstream processing.

Recurrent neural networks (RNNs) form the backbone of the encoder, with the hidden state at each step computed as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), where W_xh and W_hh are weight matrices and b_h is a bias term; this update allows the model to maintain a memory of prior inputs while incorporating the current one. To enhance context capture, especially for tasks requiring understanding of future tokens, bidirectional RNNs or LSTMs are commonly employed, processing the sequence in both forward and backward directions to produce concatenated hidden states. Long short-term memory (LSTM) units, which replace the simple tanh activation with a cell state and gating mechanisms—including forget, input, and output gates—address limitations in standard RNNs by selectively retaining or discarding information over extended sequences.

A key challenge in encoder design is the vanishing gradient problem, where gradients diminish exponentially during backpropagation through time, hindering learning of long-range dependencies in sequences. LSTMs mitigate this by using multiplicative gates to maintain stable gradient flow, enabling effective encoding of lengthy inputs. For instance, in machine translation, the encoder processes a source sentence like "The cat sat on the mat" to produce hidden states that encode syntactic and semantic relationships, summarized in the context vector for decoding the target language output.
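The plain-RNN recurrence above can be transcribed directly into code, as in the following minimal sketch; the dimensions are arbitrary and random weights stand in for learned parameters.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh, b_h):
    """Plain RNN encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    X has shape (T, d_in); returns all hidden states, shape (T, d_h)."""
    T, d_h = X.shape[0], W_hh.shape[0]
    h = np.zeros(d_h)                     # h_0 initialized to zeros
    states = []
    for t in range(T):
        h = np.tanh(W_xh @ X[t] + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)               # states[-1] is the context vector h_T

# Illustrative shapes: 5 input embeddings of size 4, hidden size 6.
rng = np.random.default_rng(1)
H = rnn_encode(rng.normal(size=(5, 4)),
               rng.normal(size=(6, 4)) * 0.1,
               rng.normal(size=(6, 6)) * 0.1,
               np.zeros(6))
print(H.shape, H[-1][:3])
```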

Decoder Mechanism

The decoder in a sequence-to-sequence (seq2seq) model is responsible for generating the output sequence y_1, …, y_{T'} autoregressively, conditioned on the input sequence x via a fixed-dimensional context vector c produced by the encoder. This process begins with the decoder's initial hidden state initialized from c, and each subsequent hidden state s_t is computed as s_t = f(y_{t-1}, s_{t-1}), where f denotes the recurrent function (typically an LSTM or GRU), y_{t-1} is the previous output symbol (as an embedding), and s_{t-1} is the prior hidden state; the context vector c influences the decoding through this initialization, enabling the model to produce variable-length outputs without explicit alignment to the input.

During training, the decoder employs teacher forcing, where the ground-truth previous outputs y_{<t} are fed as inputs to compute the next symbol, rather than the model's own predictions. This facilitates efficient optimization by maximizing the conditional likelihood P(y_t | y_{<t}, x) = softmax(W_o s_t), where W_o is a learned output matrix mapping the hidden state s_t to the vocabulary size, and the softmax yields the probability distribution over possible output tokens. The overall training objective decomposes into the product of these conditional probabilities across the output sequence, allowing the model to learn coherent generations step by step.

At inference time, the decoder operates fully autoregressively, starting from a special start token and using its own predictions to generate subsequent symbols, which introduces a train-test discrepancy. To improve output quality over greedy decoding (which selects the highest-probability token at each step), beam search is commonly used; it maintains a fixed number of partial hypotheses (beam width B), expanding and pruning them based on cumulative log-probability to explore more likely sequences. Experiments on machine translation tasks demonstrate that beam search with B = 12 can yield substantial gains; for an ensemble of five models, it improves the BLEU score from 33.0 (greedy decoding) to 34.81 on the WMT'14 English-to-French task.

A key challenge in this setup is exposure bias, arising from the mismatch between training (where errors do not propagate, thanks to teacher forcing) and inference (where compounding errors from model predictions degrade performance over long sequences). This discrepancy means the model is exposed only to ground-truth prefixes during training, leading to suboptimal robustness when generating from its own outputs.
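The beam search procedure described above can be sketched as follows; the step function that returns next-token log-probabilities is a hypothetical stand-in for a trained decoder, and the beam width and toy vocabulary are illustrative.

```python
def beam_search(step_logprobs, bos, eos, beam_width=3, max_len=20):
    """Minimal beam search sketch.  `step_logprobs(prefix)` is a stand-in for
    the decoder: it returns {token: log P(token | prefix, x)} for the next
    step.  Hypotheses are ranked by cumulative log-probability."""
    beams = [([bos], 0.0)]                          # (tokens, score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in step_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # hypotheses ending in <eos> are set aside; others stay active
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy distribution: prefers "b" after "a", then the end token.
def toy_step(prefix):
    table = {"<bos>": {"a": -0.1, "b": -2.3},
             "a": {"b": -0.2, "<eos>": -1.6},
             "b": {"<eos>": -0.1, "a": -2.3}}
    return table[prefix[-1]]

print(beam_search(toy_step, "<bos>", "<eos>"))
```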

Training and Inference Differences

Seq2seq models are trained using maximum likelihood estimation, where the objective is to maximize the conditional log-likelihood of the target sequence given the input sequence, typically formulated as minimizing the cross-entropy loss L = −Σ_t log P(y_t | y_{<t}, x), with the sum taken over the target tokens y_t. This loss is computed autoregressively, but during training, the decoder receives the ground-truth previous tokens as input—a technique known as teacher forcing—to accelerate convergence and avoid error propagation from early predictions.

In contrast, inference in seq2seq models involves autoregressive decoding without access to ground-truth targets, where the decoder generates each token conditioned only on the model's previous outputs, leading to potential error accumulation over long sequences—a phenomenon called exposure bias that can degrade performance compared to training. To mitigate suboptimal greedy selection of the most probable token at each step, strategies like beam search are employed, maintaining a fixed-width beam of k hypotheses and exploring the top-k likely extensions at each step to find a higher-probability output sequence overall. This process is computationally more intensive than greedy decoding, as it requires evaluating multiple partial sequences in parallel, often resulting in longer inference times, especially for larger beam widths.

A key distinction arises from the absence of ground truth during inference, necessitating extrinsic evaluation metrics such as BLEU, which measures n-gram overlap between generated and reference sequences to approximate quality without human judgment. Additionally, to stabilize training of the underlying recurrent components, techniques like gradient clipping are applied, capping the gradient norm (e.g., at 5) to prevent exploding gradients that could destabilize optimization.
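The training objective and gradient-norm clipping might be combined roughly as in the following sketch. PyTorch is assumed; the toy module, tensor shapes, optimizer, and the clip value of 5 are illustrative choices following the practice described above.

```python
import torch
import torch.nn as nn

# Toy stand-in for a decoder output layer: maps hidden states to vocab logits.
vocab, dim, T, batch = 10, 8, 6, 4
model = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

states = torch.randn(batch, T, dim)        # pretend decoder hidden states s_t
targets = torch.randint(0, vocab, (batch, T))

logits = model(states)                     # (batch, T, vocab)
# L = -sum_t log P(y_t | y_<t, x): cross-entropy over every target position
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                   targets.reshape(-1))

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # cap ||grad||
opt.step()
print(float(loss))
```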

Attention Integration

Role of Attention in Seq2seq

In vanilla sequence-to-sequence (seq2seq) models, the encoder compresses the entire input into a single fixed-length context vector, which serves as the sole source of information for the decoder during generation. This approach creates a significant bottleneck, particularly for long sequences, as the encoder must encode all relevant details into a limited representation, leading to information loss and degraded performance as input length increases.

To address this limitation, attention mechanisms were introduced to enable the decoder to dynamically focus on different parts of the input sequence at each generation step, rather than relying on a static context vector. In their seminal work, Bahdanau et al. (2014) proposed an additive attention mechanism specifically for machine translation, where the decoder computes alignment scores between the current hidden state and all encoder annotations to weigh their contributions selectively. This allows the model to emphasize relevant input elements, such as specific words or phrases, improving alignment between source and target sequences.

The integration of attention involves the decoder attending to the full set of encoder hidden states—representations produced by the encoder for each input position—at every decoding timestep. By forming a context vector as a weighted sum of these states, the model handles long sequences more effectively, avoiding the need to propagate all input information through deep recurrent layers solely via the initial fixed context vector. This enhancement yields several benefits: it mitigates error propagation from the encoder by allowing on-demand retrieval of input information, and it produces interpretable soft-alignment weights that reveal plausible linguistic correspondences between input and output, aiding model inspection and analysis.

Computing Attention Weights

In sequence-to-sequence models, attention weights are computed to determine the relevance of each encoder hidden state to the current decoder state. The process begins with calculating alignment scores, which quantify the compatibility between the decoder's hidden state s_t at time step t and each encoder hidden state h_i for input positions i = 1 to T. In the additive mechanism introduced by Bahdanau et al., the score e_ti is computed as e_ti = v_a^T tanh(W_a [s_t; h_i]), where W_a is a learnable weight matrix, v_a is a learnable vector, and [s_t; h_i] denotes concatenation. These raw scores are then normalized to form attention weights using the softmax function: α_ti = exp(e_ti) / Σ_j exp(e_tj). The resulting weights α_ti sum to 1 over i, enabling the computation of a context vector c_t = Σ_i α_ti h_i, which aggregates a weighted sum of the encoder states to inform the decoder's output at step t.

Alternative formulations exist for computing alignment scores. For instance, the dot-product attention variant, proposed by Luong et al., simplifies the score to e_ti = s_t^T h_i (or a scaled version thereof), which is computationally efficient and effective when encoder and decoder states are in the same space. Multi-head attention extends these mechanisms by projecting queries, keys, and values into multiple subspaces and computing attention in parallel, allowing the model to jointly attend to information from different representation subspaces, as later developed in Transformer architectures.

Attention weights are often visualized as heatmaps, where rows correspond to decoder time steps and columns to encoder positions, with color intensity representing α_ti values; such visualizations reveal alignments between source and target sequences, such as focusing on specific input words during translation.
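A direct transcription of the additive formulation into code, with random weights standing in for the learned W_a and v_a and arbitrary dimensions, might look as follows.

```python
import numpy as np

def additive_attention(s_t, H, W_a, v_a):
    """Bahdanau-style additive attention for one decoder step (sketch).
    s_t: decoder state, shape (d_s,);  H: encoder states, shape (T, d_h)."""
    T = H.shape[0]
    concat = np.concatenate([np.tile(s_t, (T, 1)), H], axis=1)   # [s_t; h_i]
    e = np.tanh(concat @ W_a.T) @ v_a                 # alignment scores e_ti
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                              # softmax -> weights alpha_ti
    c_t = alpha @ H                                   # context vector c_t
    return alpha, c_t

rng = np.random.default_rng(2)
d_s, d_h, d_a, T = 6, 6, 5, 4                         # illustrative sizes
alpha, c_t = additive_attention(rng.normal(size=d_s),
                                rng.normal(size=(T, d_h)),
                                rng.normal(size=(d_a, d_s + d_h)),
                                rng.normal(size=d_a))
print(alpha.round(3), c_t.shape)
```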

Applications and Extensions

Machine Translation

Sequence-to-sequence (seq2seq) models were initially developed for machine translation as an end-to-end approach, directly mapping source language sentences to target language outputs without relying on intermediate phrase alignments or rules. This approach, introduced in 2014, outperformed traditional phrase-based statistical machine translation (SMT) systems, which dominated prior benchmarks. For instance, on the WMT'14 English-to-French task, an ensemble of deep LSTM-based seq2seq models achieved a BLEU score of 34.8, surpassing the phrase-based SMT baseline of 33.3. The integration of attention mechanisms further enhanced performance by allowing the decoder to focus on relevant parts of the input sequence, yielding a BLEU score of 28.45 (excluding unknown words) on the same dataset with an extended training regime.

In 2016, Google adopted neural machine translation based on seq2seq architectures with attention in its Translate service, marking a widespread commercial deployment. The Google Neural Machine Translation (GNMT) system, utilizing deep LSTM layers, reduced translation errors by 60% relative to phrase-based systems on English-to-French, English-to-Spanish, and English-to-Chinese pairs. This transition improved fluency and accuracy, establishing seq2seq as the foundation for production-scale translation systems.

Seq2seq models for machine translation are commonly evaluated on Workshop on Machine Translation (WMT) datasets, which provide standardized corpora for language pairs like English-French and English-German, using the BLEU metric to measure n-gram overlap with human references. Early attention-augmented seq2seq models achieved BLEU scores exceeding 28 on English-to-French tasks, setting new standards that phrase-based methods struggled to match. Later implementations, such as GNMT, pushed BLEU scores to 38.95 on WMT'14 English-to-French, demonstrating the scalability of seq2seq to larger datasets and deeper networks.

The evolution of seq2seq in machine translation progressed from basic RNN and LSTM architectures to optimized toolkits like fairseq, developed by Facebook AI Research, which facilitates efficient training of seq2seq models including LSTMs and convolutions for translation tasks. A key advancement addressed the challenge of rare words, which comprise up to 20% of vocabulary in training corpora and often lead to unknown token issues in fixed-vocabulary models. Byte Pair Encoding (BPE) subword units resolve this by decomposing rare and out-of-vocabulary words into frequent subword segments, enabling open-vocabulary translation; for example, on WMT 2015 English-to-German, BPE improved BLEU by 0.5 points and rare word accuracy (unigram F1) from 36.8% to 41.8%.

A notable case study in seq2seq translation involves visualizing attention-derived alignments, which reveal how the model associates source and target words. In the attention model of Bahdanau et al., soft-alignment weights are depicted as grayscale matrices, illustrating alignments such as mapping "European Economic Area" in English to "zone économique européenne" in French, providing interpretability into the translation process. This visualization underscores attention's role in producing coherent, context-aware translations.
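Returning to the BPE technique mentioned above, its merge-learning procedure can be sketched in a few lines; the toy word-frequency dictionary and number of merges are illustrative, not drawn from any corpus.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Tiny sketch of byte-pair-encoding merge learning: repeatedly merge the
    most frequent adjacent symbol pair.  `words` maps word -> frequency."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(merges[:5])
```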

Speech Recognition and Other NLP Tasks

Seq2seq models have been adapted for speech recognition by treating the problem as mapping acoustic input sequences, such as mel-frequency cepstral coefficients, to output sequences of characters or words. A seminal approach is the Listen, Attend and Spell (LAS) model, which employs an encoder-decoder architecture with attention to directly transcribe speech utterances without explicit alignment, achieving end-to-end learning from audio to text. This contrasts with connectionist temporal classification (CTC), an earlier alignment-free method that uses a recurrent network to compute probabilities over output labels at each time step, often combined with a language model for decoding, but lacking an attention mechanism for focusing on relevant audio segments. Attention-based decoders in seq2seq, as in LAS, generally outperform CTC in handling variable-length inputs and improving transcription accuracy on large-vocabulary tasks by dynamically weighting acoustic features.

Beyond speech, seq2seq architectures power other natural language processing tasks, such as abstractive text summarization, where an encoder processes the input document to produce a fixed representation, and the decoder generates a concise abstract by attending to key parts of the source text. A foundational model for this is the neural attention-based abstractive summarizer, which frames summarization as sequence transduction and demonstrates superior fluency and informativeness over extractive methods on datasets like the Gigaword corpus. In dialogue systems, seq2seq models enable response generation by encoding conversational context as input sequences and decoding coherent replies, as exemplified by the neural conversational model trained on multi-turn dialogues, which captures utterance dependencies to produce contextually relevant outputs.

Seq2seq extends to multimodal and non-text domains, including image captioning, where a convolutional neural network (CNN) encoder extracts visual features from images, feeding them into an RNN decoder to generate descriptive captions. Early captioning models pioneered this encoder-decoder paradigm for vision-to-language tasks, attaining state-of-the-art performance on benchmarks like MSCOCO by leveraging attention to align image regions with words. For time-series forecasting, seq2seq models encode historical data sequences to predict future values, providing a flexible framework for multivariate predictions that outperforms traditional autoregressive methods in capturing long-range dependencies. Modern advancements integrate seq2seq with pre-trained language models, such as T5, which unifies diverse NLP tasks under a text-to-text format using an encoder-decoder Transformer, fine-tuned on large corpora to enhance performance in summarization, translation, and beyond while preserving the core seq2seq structure.