Vanishing gradient problem
In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, each neural network weight is updated in proportion to the partial derivative of the loss function with respect to that weight. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude can make training unstable, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. Its derivative lies in the range (0,1] and is strictly less than 1 away from zero, so the product of many such factors shrinks exponentially with depth. The inverse problem, where weight gradients at earlier layers grow exponentially larger, is called the exploding gradient problem.
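The shrinking-product effect can be seen in a minimal scalar sketch (the weight value 0.9, the starting input, and the depths below are illustrative assumptions, not values from the text): the gradient of the final output with respect to the first input is a product of per-layer factors, each at most the weight times the tanh derivative, so it decays with depth.

```python
import math

def tanh_prime(z):
    t = math.tanh(z)
    return 1.0 - t * t          # derivative of tanh, always in (0, 1]

def gradient_through_depth(depth, w=0.9, x0=0.5):
    """Scalar toy 'network' x_t = tanh(w * x_{t-1}).
    Returns d x_depth / d x_0: by the chain rule this is the
    product of one local factor w * tanh'(z_t) per layer."""
    x, grad = x0, 1.0
    for _ in range(depth):
        z = w * x
        grad *= w * tanh_prime(z)   # each factor is strictly below 1 here
        x = math.tanh(z)
    return grad

for d in (1, 5, 10, 20):
    print(d, gradient_through_depth(d))
```

Since every factor is bounded above by w = 0.9, the gradient after 20 layers is already below 0.9**20 ≈ 0.12, and deeper chains vanish correspondingly faster.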
Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).
This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio.
A generic recurrent network has hidden states $h_1, h_2, \ldots$, inputs $u_1, u_2, \ldots$, and outputs $x_1, x_2, \ldots$. Let it be parameterized by $\theta$, so that the system evolves as

$$(h_t, x_t) = F(h_{t-1}, u_t, \theta).$$

Often, the output $x_t$ is a function of $h_t$, as some $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so we simplify our notation to the special case with:

$$x_t = F(x_{t-1}, u_t, \theta).$$

Now, take its differential:

$$dx_t = \nabla_\theta F(x_{t-1}, u_t, \theta)\,d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\,dx_{t-1}.$$

Unrolling this recursion expresses $dx_T$ as a sum of terms, each containing a product of factors $\nabla_x F$. Training the network requires us to define a loss function to be minimized. Let it be $L(x_T, u_1, \ldots, u_T)$; then minimizing it by gradient descent gives

$$\Delta\theta = -\eta\,\big[\nabla_x L\,\big(\nabla_\theta F + \nabla_x F\,\nabla_\theta F + \nabla_x F\,\nabla_x F\,\nabla_\theta F + \cdots\big)\big]^T$$

where $\eta$ is the learning rate.

The vanishing/exploding gradient problem appears because there are repeated multiplications of the form

$$\nabla_x F(x_{t-1}, u_t, \theta)\,\nabla_x F(x_{t-2}, u_{t-1}, \theta)\,\nabla_x F(x_{t-3}, u_{t-2}, \theta)\cdots$$

For a concrete example, consider a typical recurrent network defined by

$$x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\,\sigma(x_{t-1}) + W_{in}\,u_t + b$$

where $\theta = (W_{rec}, W_{in})$ is the network parameter, $\sigma$ is the sigmoid activation function, applied to each vector coordinate separately, and $b$ is the bias vector.
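A minimal numerical sketch of this concrete example (the hidden size, Gaussian weight initialization, and number of time steps are illustrative assumptions): for this network the one-step Jacobian is $\nabla_x F = W_{rec}\,\mathrm{diag}(\sigma'(x_{t-1}))$, and since $\sigma'(z) = \sigma(z)(1-\sigma(z)) \le 1/4$ everywhere, the norm of the accumulated product shrinks rapidly over time.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20                                            # hidden size (illustrative)
W_rec = rng.standard_normal((n, n)) / np.sqrt(n)  # recurrent weight matrix
W_in = rng.standard_normal((n, n)) / np.sqrt(n)   # input weight matrix
b = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(n)
J = np.eye(n)          # accumulated Jacobian  d x_T / d x_0
norms = []
for t in range(30):
    u = rng.standard_normal(n)
    s = sigmoid(x)
    # One-step Jacobian: d x_t / d x_{t-1} = W_rec @ diag(sigma'(x_{t-1})),
    # with sigma'(z) = sigma(z) * (1 - sigma(z)) <= 1/4 everywhere.
    J = W_rec @ np.diag(s * (1.0 - s)) @ J
    x = W_rec @ s + W_in @ u + b
    norms.append(np.linalg.norm(J))

print(f"Jacobian norm after 1 step: {norms[0]:.3e}, after 30 steps: {norms[-1]:.3e}")
```

Each multiplication contributes a factor of at most (1/4)·||W_rec||, so unless the recurrent weights are large enough to compensate, the gradient signal reaching early time steps decays geometrically — exactly the repeated-product mechanism described above.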
