Vanishing gradient problem
In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, each neural network weight is updated in proportion to the partial derivative of the loss function with respect to that weight. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude can make training unstable, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. Its derivative lies in the range (0,1] and is strictly less than 1 away from zero, so the product of many such factors shrinks exponentially with depth. The inverse problem, where weight gradients at earlier layers grow exponentially larger, is called the exploding gradient problem.
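The shrinking-product effect can be seen in a minimal scalar sketch (the weight value 0.9, the starting input, and the depths below are illustrative assumptions, not values from the text): the gradient of the final output with respect to the first input is a product of per-layer factors, each at most the weight times the tanh derivative, so it decays with depth.

```python
import math

def tanh_prime(z):
    t = math.tanh(z)
    return 1.0 - t * t          # derivative of tanh, always in (0, 1]

def gradient_through_depth(depth, w=0.9, x0=0.5):
    """Scalar toy 'network' x_t = tanh(w * x_{t-1}).
    Returns d x_depth / d x_0: by the chain rule this is the
    product of one local factor w * tanh'(z_t) per layer."""
    x, grad = x0, 1.0
    for _ in range(depth):
        z = w * x
        grad *= w * tanh_prime(z)   # each factor is strictly below 1 here
        x = math.tanh(z)
    return grad

for d in (1, 5, 10, 20):
    print(d, gradient_through_depth(d))
```

Since every factor is bounded above by w = 0.9, the gradient after 20 layers is already below 0.9**20 ≈ 0.12, and deeper chains vanish correspondingly faster.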
Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).
This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio.
A generic recurrent network has hidden states $h_1, h_2, \ldots$, inputs $u_1, u_2, \ldots$, and outputs $x_1, x_2, \ldots$. Let it be parameterized by $\theta$, so that the system evolves as

$$(h_t, x_t) = F(h_{t-1}, u_t, \theta).$$

Often, the output $x_t$ is a function of $h_t$, as some $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so we simplify our notation to the special case with:

$$x_t = F(x_{t-1}, u_t, \theta).$$

Now, take its differential:

$$dx_t = \nabla_\theta F(x_{t-1}, u_t, \theta)\,d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\,dx_{t-1}.$$

Unrolling this recursion expresses $dx_T$ as a sum of terms, each containing a product of factors $\nabla_x F$. Training the network requires us to define a loss function to be minimized. Let it be $L(x_T, u_1, \ldots, u_T)$; then minimizing it by gradient descent gives

$$\Delta\theta = -\eta\,\big[\nabla_x L\,\big(\nabla_\theta F + \nabla_x F\,\nabla_\theta F + \nabla_x F\,\nabla_x F\,\nabla_\theta F + \cdots\big)\big]^T$$

where $\eta$ is the learning rate.

The vanishing/exploding gradient problem appears because there are repeated multiplications of the form

$$\nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_x F(x_{t-2}, u_{t-1}, \theta)\,\nabla_x F(x_{t-3}, u_{t-2}, \theta)\cdots$$

For a concrete example, consider a typical recurrent network defined by

$$x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b$$

where $\theta = (W_{rec}, W_{in})$ is the network parameter, $\sigma$ is the sigmoid activation function, applied to each vector coordinate separately, and $b$ is the bias vector.
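A minimal numerical sketch of this concrete example (the hidden size, Gaussian weight initialization, and number of time steps are illustrative assumptions): for this network the one-step Jacobian is $\nabla_x F = W_{rec}\,\mathrm{diag}(\sigma'(x_{t-1}))$, and since $\sigma'(z) = \sigma(z)(1-\sigma(z)) \le 1/4$ everywhere, the norm of the accumulated product shrinks rapidly over time.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20                                            # hidden size (illustrative)
W_rec = rng.standard_normal((n, n)) / np.sqrt(n)  # recurrent weight matrix
W_in = rng.standard_normal((n, n)) / np.sqrt(n)   # input weight matrix
b = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(n)
J = np.eye(n)          # accumulated Jacobian  d x_T / d x_0
norms = []
for t in range(30):
    u = rng.standard_normal(n)
    s = sigmoid(x)
    # One-step Jacobian: d x_t / d x_{t-1} = W_rec @ diag(sigma'(x_{t-1})),
    # with sigma'(z) = sigma(z) * (1 - sigma(z)) <= 1/4 everywhere.
    J = W_rec @ np.diag(s * (1.0 - s)) @ J
    x = W_rec @ s + W_in @ u + b
    norms.append(np.linalg.norm(J))

print(f"Jacobian norm after 1 step: {norms[0]:.3e}, after 30 steps: {norms[-1]:.3e}")
```

Each multiplication contributes a factor of at most (1/4)·||W_rec||, so unless the recurrent weights are large enough to compensate, the gradient signal reaching early time steps decays geometrically — exactly the repeated-product mechanism described above.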
