Activation function

from Wikipedia

[Figure: Logistic activation function]

In artificial neural networks, the activation function of a node is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear.[1]

Modern activation functions include the logistic (sigmoid) function used in the 2012 speech recognition model developed by Hinton et al.;[2] the ReLU used in the 2012 AlexNet computer vision model[3][4] and in the 2015 ResNet model; and the smooth version of the ReLU, the GELU, which was used in the 2018 BERT model.[5]

Comparison of activation functions

Aside from their empirical performance, activation functions also have different mathematical properties:

Nonlinear
When the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator.[6] This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
Range
When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.[citation needed]
Continuously differentiable
This property is desirable for enabling gradient-based optimization methods (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but it is still possible). The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.[7]

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.

Mathematical details

The most common activation functions can be divided into three categories: ridge functions, radial functions and fold functions.

An activation function $f$ is saturating if $\lim_{|v| \to \infty} |\nabla f(v)| = 0$. It is nonsaturating if $\lim_{|v| \to \infty} |\nabla f(v)| \neq 0$. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions, because they are less likely to suffer from the vanishing gradient problem.[8]
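The distinction can be checked numerically. A minimal sketch (helper names are illustrative, not from any library) comparing the logistic gradient, which saturates, with the ReLU gradient, which does not:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)   # derivative of the logistic function

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # subgradient convention: 0 at x = 0

# The logistic gradient tends to 0 for large |x| (saturating),
# while the ReLU gradient stays at 1 for any positive input.
for x in (0.0, 5.0, 20.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```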

Ridge activation functions

Ridge functions are multivariate functions acting on a linear combination of the input variables. Commonly used examples include:

  • Linear activation: $\phi(v) = a + v^\top b$,
  • ReLU activation: $\phi(v) = \max(0, a + v^\top b)$,
  • Heaviside activation: $\phi(v) = 1_{a + v^\top b > 0}$,
  • Logistic activation: $\phi(v) = (1 + e^{-a - v^\top b})^{-1}$.
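The ridge forms above all apply a scalar nonlinearity to the same linear combination of the inputs; a minimal sketch, with illustrative names:

```python
import math

def ridge(v, b, a, phi):
    """Ridge activation: apply a scalar nonlinearity phi to a + v . b."""
    z = a + sum(vi * bi for vi, bi in zip(v, b))
    return phi(z)

linear    = lambda z: z
relu      = lambda z: max(0.0, z)
heaviside = lambda z: 1.0 if z > 0 else 0.0
logistic  = lambda z: 1.0 / (1.0 + math.exp(-z))

v, b, a = [1.0, -2.0], [0.5, 0.25], 0.1
for name, phi in [("linear", linear), ("ReLU", relu),
                  ("Heaviside", heaviside), ("logistic", logistic)]:
    print(name, ridge(v, b, a, phi))
```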

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell.[9] In its simplest form, this function is binary—that is, either the neuron is firing or not. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.

The function looks like $\phi(v) = U(a + v^\top b)$, where $U$ is the Heaviside step function.

If a line has a positive slope, on the other hand, it may reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form $\phi(v) = a + v^\top b$.

[Figure: Rectified linear unit and Gaussian error linear unit activation functions]

Radial activation functions

A special class of activation functions known as radial basis functions (RBFs) are used in RBF networks. These activation functions can take many forms, but they are usually found as one of the following functions:

  • Gaussian: $\phi(v) = \exp\left(-\frac{\|v - c\|^2}{2\sigma^2}\right)$
  • Multiquadratics: $\phi(v) = \sqrt{\|v - c\|^2 + a^2}$
  • Inverse multiquadratics: $\phi(v) = \left(\|v - c\|^2 + a^2\right)^{-1/2}$
  • Polyharmonic splines

where $c$ is the vector representing the function center and $a$ and $\sigma$ are parameters affecting the spread of the radius.
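These radial forms depend on the input only through its distance from the center; a toy sketch (function names are illustrative), with the center and spread passed as plain arguments:

```python
import math

def radius(v, c):
    """Euclidean distance from input v to center c."""
    return math.sqrt(sum((vi - ci) ** 2 for vi, ci in zip(v, c)))

def gaussian(v, c, sigma):
    r = radius(v, c)
    return math.exp(-r * r / (2.0 * sigma ** 2))

def multiquadratic(v, c, a):
    r = radius(v, c)
    return math.sqrt(r * r + a * a)

def inverse_multiquadratic(v, c, a):
    return 1.0 / multiquadratic(v, c, a)

c = [0.0, 0.0]
print(gaussian([0.0, 0.0], c, 1.0))   # peak value 1 at the center
print(gaussian([3.0, 4.0], c, 1.0))   # decays with distance r = 5
```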

Other examples

Periodic functions can serve as activation functions. Usually the sinusoid is used, as any periodic function is decomposable into sinusoids by the Fourier transform.[10]

Quadratic activation maps $x \mapsto x^2$.[11][12]

Folding activation functions

Folding activation functions are extensively used in the pooling layers in convolutional neural networks, and in output layers of multiclass classification networks. These activations perform aggregation over the inputs, such as taking the mean, minimum or maximum. In multiclass classification the softmax activation is often used.
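Softmax is a folding activation in the sense described: it aggregates the whole input vector into a normalized distribution. A common numerically stable sketch:

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtracting max(xs) leaves the result
    unchanged mathematically but avoids overflow in exp for large inputs."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
print(sum(probs))   # the outputs form a probability distribution
```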

Table of activation functions

The following table compares the properties of several activation functions that are functions of one fold x from the previous layer or layers:

| Name | Function $f(x)$ | Range | Order of continuity |
|------|-----------------|-------|---------------------|
| Identity | $x$ | $(-\infty, \infty)$ | $C^\infty$ |
| Binary step | $\begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$ | $\{0, 1\}$ | $C^{-1}$ |
| Logistic, sigmoid, or soft step | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | $(0, 1)$ | $C^\infty$ |
| Hyperbolic tangent (tanh) | $\dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | $C^\infty$ |
| Soboleva modified hyperbolic tangent (smht) | $\dfrac{e^{ax} - e^{-bx}}{e^{cx} + e^{-dx}}$ with parameters $a, b, c, d$ | $(-1, 1)$ | $C^\infty$ |
| Softsign | $\dfrac{x}{1 + \lvert x \rvert}$ | $(-1, 1)$ | $C^1$ |
| Rectified linear unit (ReLU)[13] | $\max(0, x)$ | $[0, \infty)$ | $C^0$ |
| Gaussian Error Linear Unit (GELU)[5] | $x\,\Phi(x) = \dfrac{x}{2}\left(1 + \operatorname{erf}\left(\dfrac{x}{\sqrt{2}}\right)\right)$, where $\Phi$ is the standard Gaussian CDF | $(-0.17\ldots, \infty)$ | $C^\infty$ |
| Softplus[14] | $\ln(1 + e^x)$ | $(0, \infty)$ | $C^\infty$ |
| Exponential linear unit (ELU)[15] | $\begin{cases} \alpha(e^x - 1) & x \le 0 \\ x & x > 0 \end{cases}$ with parameter $\alpha$ | $(-\alpha, \infty)$ | $C^1$ if $\alpha = 1$, else $C^0$ |
| Scaled exponential linear unit (SELU)[16] | $\lambda \begin{cases} \alpha(e^x - 1) & x < 0 \\ x & x \ge 0 \end{cases}$ with parameters $\lambda \approx 1.0507$ and $\alpha \approx 1.67326$ | $(-\lambda\alpha, \infty)$ | $C^0$ |
| Leaky rectified linear unit (Leaky ReLU)[17] | $\begin{cases} 0.01x & x < 0 \\ x & x \ge 0 \end{cases}$ | $(-\infty, \infty)$ | $C^0$ |
| Parametric rectified linear unit (PReLU)[18] | $\begin{cases} \alpha x & x < 0 \\ x & x \ge 0 \end{cases}$ with learnable parameter $\alpha$ | $(-\infty, \infty)$ | $C^0$ |
| Rectified Parametric Sigmoid Units (flexible, 5 parameters)[19] | | | |
| Sigmoid linear unit (SiLU,[5] Sigmoid shrinkage,[20] SiL,[21] or Swish-1[22]) | $\dfrac{x}{1 + e^{-x}}$ | $[-0.278\ldots, \infty)$ | $C^\infty$ |
| Exponential Linear Sigmoid SquasHing (ELiSH)[23] | $\begin{cases} \dfrac{e^x - 1}{1 + e^{-x}} & x < 0 \\ \dfrac{x}{1 + e^{-x}} & x \ge 0 \end{cases}$ | | $C^1$ |
| Gaussian | $e^{-x^2}$ | $(0, 1]$ | $C^\infty$ |
| Sinusoid | $\sin(x)$ | $[-1, 1]$ | $C^\infty$ |

The following table lists activation functions that are not functions of a single fold x from the previous layer or layers:

| Name | Equation | Range | Order of continuity |
|------|----------|-------|---------------------|
| Softmax | $f_i(\vec{x}) = \dfrac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$ for $i = 1, \ldots, J$ [1][2] | $(0, 1)$ | $C^\infty$ |
| Maxout[24] | $\max_i x_i$ | $(-\infty, \infty)$ | $C^0$ |

^ Here, $\delta_{ij}$ is the Kronecker delta, which appears in the softmax derivative $\dfrac{\partial f_i}{\partial x_j} = f_i(\delta_{ij} - f_j)$.
^ For instance, $j$ could be iterating through the number of kernels of the previous neural network layer while $i$ iterates through the number of kernels of the current layer.
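The softmax derivative involving the Kronecker delta can be checked numerically; a small sketch (helper names illustrative) building the Jacobian $\partial f_i/\partial x_j = f_i(\delta_{ij} - f_j)$:

```python
import math

def softmax(xs):
    m = max(xs)                       # shift for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(xs):
    """d softmax_i / d x_j = s_i * (delta_ij - s_j), delta = Kronecker delta."""
    s = softmax(xs)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

J = softmax_jacobian([0.5, 1.5, -1.0])
# Each row of the Jacobian sums to zero because the softmax outputs sum to 1.
print([abs(sum(row)) < 1e-12 for row in J])
```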

Quantum activation functions

In quantum neural networks programmed on gate-model quantum computers, based on quantum perceptrons instead of variational quantum circuits, the non-linearity of the activation function can be implemented without measuring the output of each perceptron at each layer. Quantum properties loaded within the circuit, such as superposition, can be preserved by creating a Taylor series of the argument computed by the perceptron itself, with suitable quantum circuits computing the powers up to a desired approximation degree. Because of the flexibility of such quantum circuits, they can be designed to approximate any classical activation function.[25]

from Grokipedia
In artificial neural networks, an activation function is a mathematical operation applied to the weighted sum of inputs at a neuron, transforming it into an output that introduces non-linearity, thereby enabling the network to model complex, non-linear relationships in data.[1] These functions are essential components of neural architectures, as without them, multi-layer networks would reduce to simple linear models incapable of capturing intricate patterns.[2]

The concept of activation functions traces its origins to early models of biological neurons, notably the 1943 McCulloch-Pitts neuron, which employed a binary step function as its activation to simulate logical operations like AND and OR gates.[3] This threshold-based approach laid the foundation for computational neuroscience and inspired the 1958 perceptron by Frank Rosenblatt, which also utilized a step function but faced limitations in handling nonlinearly separable problems, as highlighted in Minsky and Papert's 1969 critique.[4] The resurgence of neural networks in the 1980s, driven by the backpropagation algorithm introduced by Rumelhart, Hinton, and Williams in 1986, popularized smooth, differentiable activation functions such as the sigmoid, which maps inputs to a range from 0 to 1 and facilitates gradient-based learning.[5]

Common activation functions include the sigmoid function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$, valued for its probabilistic interpretation in binary classification but prone to vanishing gradients during training; the hyperbolic tangent, $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, which centers outputs around zero for better convergence compared to sigmoid; and the rectified linear unit (ReLU), $f(x) = \max(0, x)$, introduced prominently in 2010 by Nair and Hinton, which accelerates training by avoiding vanishing gradients and promoting sparsity, though it can suffer from "dying ReLU" issues where neurons output zero indefinitely.[6] Variants like Leaky ReLU and parametric ReLU address these limitations by allowing small negative slopes.[1]

In modern deep learning, activation functions are pivotal for performance, with choices influencing training stability, generalization, and computational efficiency; for instance, ReLU and its derivatives dominate convolutional neural networks due to their simplicity and empirical success in large-scale image recognition tasks.[7] Ongoing research continues to explore novel functions, such as swish ($f(x) = x \cdot \sigma(\beta x)$) and mish, to further mitigate issues like gradient saturation and enhance expressivity in architectures like transformers.[1]

Fundamentals

Definition and Purpose

In neural networks, an activation function is defined as a non-linear mathematical mapping applied element-wise to the output of a linear transformation within each layer, transforming input values into output values that introduce non-linearity into the model.[8] This mapping, commonly denoted in its general form as $ f(\mathbf{x}) $, where $ \mathbf{x} $ represents the input vector or scalar, produces a corresponding output that can be scalar or vector-valued, enabling the network to process and propagate information non-linearly.[9] During forward propagation, the activation function follows the computation of a linear combination in each neuron, where the input is first transformed via a weighted sum plus bias—typically $ z = \mathbf{w}^T \mathbf{x} + b $—and then passed through the activation to yield the neuron's final output $ a = f(z) $.[10] This sequential application across layers allows the network to build hierarchical representations from raw inputs. The primary purpose of activation functions is to enable neural networks to approximate arbitrary non-linear functions and model complex relationships in data that exceed linear separability, as without non-linearity, multi-layer networks would collapse to a single linear transformation.[11] For instance, they permit the solution of problems like the XOR gate, which a single-layer perceptron cannot handle due to its inherent linearity.[12] By introducing these non-linearities, activation functions underpin the universal approximation capabilities of neural networks, allowing them to capture intricate patterns in diverse applications.[11]
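The forward computation described above ($z = \mathbf{w}^T \mathbf{x} + b$, then $a = f(z)$) is only a few lines of code; a minimal sketch with illustrative names:

```python
import math

def neuron_forward(w, x, b, f):
    """One neuron: weighted sum plus bias, then the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# z = 0.4*1.0 - 0.6*2.0 + 0.8 is approximately 0, so the output is ~0.5
a = neuron_forward([0.4, -0.6], [1.0, 2.0], 0.8, sigmoid)
print(a)
```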

Historical Development

The origins of activation functions trace back to the foundational work on computational models of neurons in the early 1940s. In 1943, Warren McCulloch and Walter Pitts introduced a simplified model of a neuron that employed a threshold-based step function to mimic binary firing behavior, enabling the representation of logical operations through networks of such units.[13] This model laid the groundwork for artificial neural networks by demonstrating how non-linear thresholds could simulate complex propositional logic in nervous activity, though it lacked learning mechanisms.

The mid-20th century saw further advancements with the development of learning-capable systems that incorporated threshold-based activation functions. Frank Rosenblatt's perceptron, described in 1958, utilized a step function with a fixed threshold for binary classification tasks, allowing the network to adapt weights based on input-output patterns in pattern recognition.[14] Building on this, Bernard Widrow and Ted Hoff introduced the ADALINE in the early 1960s, which employed a linear activation followed by a threshold but emphasized adaptive linear combinations for error minimization in adaptive filtering systems. These innovations marked a shift toward trainable models, yet they were limited to single-layer architectures and struggled with non-linearly separable problems, contributing to early enthusiasm followed by setbacks.

The 1970s and 1980s brought periods known as AI winters, during which reduced funding and skepticism, exacerbated by critiques like Marvin Minsky and Seymour Papert's 1969 analysis of perceptron limitations, stifled neural network research, including explorations of activation functions.
A revival occurred in 1986 with David Rumelhart, Geoffrey Hinton, and Ronald Williams' popularization of backpropagation, which required differentiable activation functions such as the logistic sigmoid to propagate errors through multi-layer networks, enabling the training of deeper architectures.[15] This breakthrough addressed prior limitations but highlighted issues like vanishing gradients in deep setups. The deep learning boom accelerated after 2006, driven by Hinton's introduction of deep belief networks, which revived interest in scalable training and prompted innovations in activation functions to mitigate gradient problems. A pivotal milestone came in 2010 when Vinod Nair and Geoffrey Hinton proposed the rectified linear unit (ReLU), a simple piecewise linear function that accelerated convergence and alleviated vanishing gradients in deep networks by allowing sparse activation and better gradient flow.[6] This shift, amid surging computational power and data availability, transformed activation functions from niche tools into core components of modern neural architectures.

Common Activation Functions

Binary Step Function

The binary step function, also known as the threshold or Heaviside step function, is the most basic activation function in artificial neural networks, producing a binary output of 0 for inputs below a specified threshold—typically 0—and 1 for inputs at or above it. This design directly emulates the all-or-none response of biological neurons, where a neuron either fires (outputs 1) or remains inactive (outputs 0) based on whether the summed excitatory and inhibitory inputs exceed a firing threshold. Mathematically, the function is defined as:
f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}
In the McCulloch-Pitts neuron model, introduced in 1943, this binary activation enabled networks of such units to simulate logical operations like AND, OR, and NOT gates, laying the groundwork for computational models of the brain by treating neural activity as propositional logic. The function was central to Frank Rosenblatt's perceptron in 1958, a single-layer network for binary classification that adjusted weights via a perceptron learning rule to classify inputs into two categories, such as separating linearly separable patterns in two dimensions.[14] Its primary advantages lie in computational simplicity, requiring only a single threshold comparison with no complex arithmetic, which made it feasible for early hardware implementations, and in its interpretability as a clear decision boundary for binary decisions.[16] However, the function's discontinuity renders it non-differentiable everywhere except at the threshold, preventing the use of gradient descent for training in multilayer networks and limiting its applicability to simple, linearly separable problems.[16] Later developments, such as the sigmoid function, addressed this by offering a continuous, differentiable approximation to the step response.[17]
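The perceptron learning rule over the step activation can be sketched on a small linearly separable task such as the AND gate (a toy illustration, not Rosenblatt's original implementation):

```python
def step(z):
    return 1 if z >= 0 else 0

# Perceptron learning rule on the (linearly separable) AND gate.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(25):                        # a few epochs suffice here
    for (x1, x2), target in data:
        y = step(w[0] * x1 + w[1] * x2 + b)
        err = target - y                   # 0 when the example is classified
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b    += lr * err

print([step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in data])  # [0, 0, 0, 1]
```

Convergence on separable data is guaranteed by the perceptron convergence theorem, which is why a fixed small number of epochs is enough for this toy problem.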

Sigmoid Function

The sigmoid function, also known as the logistic sigmoid, is a smooth, S-shaped activation function that maps any real-valued input to the open interval (0,1), making it suitable for representing probabilities or normalized outputs in neural networks.[8] It is mathematically defined by the equation
\sigma(x) = \frac{1}{1 + e^{-x}},
where $e$ is the base of the natural logarithm, ensuring the output approaches 1 as $x$ becomes large and positive, and 0 as $x$ becomes large and negative.[17] This function derives from the logistic function originally developed in statistics to model growth processes and binary outcomes, such as in logistic regression where it serves as the inverse of the logit transformation to bound predictions between 0 and 1. In the context of neural networks, it was adapted as an activation for artificial neurons to introduce non-linearity while remaining differentiable, facilitating gradient-based learning algorithms like backpropagation, as introduced in seminal work on multi-layer networks. Key properties of the sigmoid include its symmetry around the point $(0, 0.5)$, where $\sigma(0) = 0.5$, and saturation at the extremes: the derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ peaks at 0.25 when $x = 0$ but approaches zero for large $|x|$, leading to regions where gradients vanish during training.[8] These characteristics make it continuous and infinitely differentiable everywhere, though the saturation can hinder learning in deep networks by causing vanishing gradients.[17] Historically, the sigmoid was widely used in the output layers of shallow neural networks for binary classification tasks, where its probabilistic output directly corresponds to class probabilities without needing additional transformations. Prior to the widespread adoption of rectified linear units in the 2010s, it also served as a common activation in hidden layers of early multi-layer perceptrons, enabling the modeling of complex decision boundaries through composition of non-linear transformations.[8]
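The derivative identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and its maximum of 0.25 can be verified against a finite-difference approximation; a short sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # closed-form derivative

# Check the closed form against a central finite difference.
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-8

print(sigmoid_prime(0.0))   # maximum value of the derivative: 0.25
```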

Hyperbolic Tangent

The hyperbolic tangent activation function, commonly denoted $\tanh$, serves as a smooth, S-shaped nonlinearity in neural networks, transforming input values $x$ into outputs bounded within the open interval (-1, 1). This bounded range ensures that neuron activations remain controlled, preventing explosive growth during forward propagation. Unlike unbounded functions, $\tanh$ introduces non-linearity while maintaining differentiability everywhere, making it suitable for gradient-based optimization. The function is defined mathematically as
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
or, equivalently, using hyperbolic functions,
\tanh(x) = \frac{\sinh(x)}{\cosh(x)},
where $\sinh(x) = \frac{e^x - e^{-x}}{2}$ and $\cosh(x) = \frac{e^x + e^{-x}}{2}$. Its derivative is $\tanh'(x) = 1 - \tanh(x)^2$, which facilitates efficient backpropagation. $\tanh$ resembles a scaled and shifted version of the logistic sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, related by the identity
\tanh(x) = 2\sigma(2x) - 1.
This connection highlights how $\tanh$ can be derived from the sigmoid, inheriting similar saturation behavior near the asymptotes but with symmetric output around zero. A key advantage of $\tanh$ over sigmoid lies in its zero-centered output, which has an expected value near zero for symmetric inputs, thereby reducing the bias shift in weights of downstream layers and promoting more stable gradient flow during training. This zero-centering often leads to fewer training epochs compared to sigmoid's positive bias, enhancing convergence in multi-layer networks. However, like sigmoid, $\tanh$ suffers from vanishing gradients for large $|x|$, where the derivative approaches zero, potentially slowing learning in deep architectures. In historical context, $\tanh$ gained prominence in recurrent neural networks for handling sequential data with bounded states. It was notably adopted in the Long Short-Term Memory (LSTM) units proposed by Hochreiter and Schmidhuber in 1997, where $\tanh$ activates the candidate cell state to squash values into (-1, 1), aiding in the preservation of long-term dependencies without unbounded growth. This choice complemented sigmoid gates in LSTMs, enabling effective training on tasks requiring memory over extended time lags. LSTMs with $\tanh$ have since become a cornerstone in sequence modeling, influencing architectures like gated recurrent units.
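The identity $\tanh(x) = 2\sigma(2x) - 1$ is easy to confirm numerically:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh as a scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12

print("identity holds on the sampled points")
```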

Rectified Linear Unit

The rectified linear unit (ReLU) is a piecewise linear activation function defined as $ f(x) = \max(0, x) $, which outputs the input directly if it is positive and zero otherwise, thereby introducing sparsity in neural network activations by nullifying negative values.[6] This function was introduced in 2010 by Vinod Nair and Geoffrey Hinton to improve the training of restricted Boltzmann machines, where it demonstrated faster convergence compared to sigmoid activations by preserving relative intensities across layers.[6] ReLU gained widespread adoption following its use in the AlexNet architecture, which achieved breakthrough performance on the ImageNet Large Scale Visual Recognition Challenge in 2012, marking a pivotal advancement in deep convolutional neural networks. One key advantage of ReLU is its ability to mitigate the vanishing gradient problem, as the gradient is either 1 or 0 for positive inputs, enabling effective backpropagation through deep networks without the saturation issues common in sigmoid or hyperbolic tangent functions.[6] Additionally, ReLU is computationally efficient, requiring only a simple thresholding operation without expensive exponentials or divisions, which accelerates training in large-scale models.[18] The sparsity induced by zeroing negative inputs further reduces parameter redundancy and can enhance generalization in sparse representations.[6] To address the minor drawback of "dying ReLU," where neurons can become inactive for all inputs during training, variants have been developed to allow small gradients for negative values. 
Leaky ReLU modifies the function to $ f(x) = \max(\alpha x, x) $, where $ \alpha $ is a small positive constant (typically 0.01), permitting a leaky flow for negative inputs to prevent neuron death while retaining ReLU's efficiency; it was proposed in 2013 for improving acoustic models in deep neural networks.[18] Another variant, the exponential linear unit (ELU), is defined as $ f(x) = x $ if $ x > 0 $ and $ f(x) = \alpha (e^x - 1) $ otherwise, with $ \alpha = 1 $, which centers the mean activation near zero for faster learning and reduced bias shift; ELU was introduced in 2015 to accelerate convergence in deep networks.[19]
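The three functions discussed here differ only in their negative branch; a compact sketch (default parameter values follow the text):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small negative slope keeps a gradient flowing for x < 0
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # smooth negative branch pushes mean activations toward zero
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```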

Properties and Characteristics

Differentiability and Continuity

Activation functions in neural networks must generally be differentiable to facilitate training via gradient-based optimization methods such as gradient descent, which relies on computing gradients to update parameters.[16] This differentiability allows the application of the chain rule during backpropagation, enabling efficient propagation of error signals through the network layers.[20] For functions like the rectified linear unit (ReLU), which is non-differentiable at the origin (where $ f(x) = \max(0, x) $), subgradients are employed; the subgradient at $ x = 0 $ is conventionally set to 0 or any value in [0, 1] to handle this point during optimization.[21] Most common activation functions are continuous everywhere, ensuring smooth mappings from inputs to outputs, with the notable exception of the binary step function, which introduces a discontinuity at the threshold (typically 0).[8] Continuity is a prerequisite for differentiability, as non-continuous functions cannot have derivatives at points of discontinuity, limiting their utility in gradient-based learning.[8] Certain activations, such as the sigmoid and hyperbolic tangent, can lead to vanishing gradients due to saturation regions at large input magnitudes, where derivatives approach zero and impede learning in deep networks.[22] For the sigmoid function $ \sigma(x) = \frac{1}{1 + e^{-x}} $, the derivative is $ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $, which has a maximum value of 0.25 and diminishes rapidly for large $|x|$.[9] Similarly, the hyperbolic tangent $ \tanh(x) $ has a derivative $ \operatorname{sech}^2(x) $, which is less than 1 for $|x| > 0$ and approaches 0 as $|x|$ increases, exacerbating gradient flow issues in deeper architectures.[9] For ReLU, the derivative is piecewise defined as $ f'(x) = 1 $ if $ x > 0 $ and 0 otherwise, avoiding saturation but introducing the non-differentiability at zero.[9] These properties directly influence optimization dynamics: smooth, differentiable activations support stable gradient propagation via the chain rule, while issues like vanishing gradients necessitate alternatives like ReLU to maintain effective training in deep models.[20]

Non-linearity Requirements

Activation functions in neural networks must introduce non-linearity to prevent the collapse of multi-layer architectures into equivalent single-layer linear models. If the activation function $f$ is linear, such as $f(z) = kz + b$, then composing multiple layers still yields an affine map: for input $x$, the output of two layers is $f(W_2 f(W_1 x)) = k^2 W_2 W_1 x + c$ for some constant vector $c$ that absorbs the biases, rendering deeper networks no more expressive than a shallow linear regressor.[23] This limitation confines the network to modeling only linear relationships, severely restricting its ability to capture complex data patterns.[23] Non-linear activation functions overcome this by enabling the network to approximate arbitrary continuous functions on compact subsets of $\mathbb{R}^n$, as established by the universal approximation theorem. Originally proven for sigmoidal activations, this theorem demonstrates that a single hidden layer with sufficiently many neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, with extensions applying to other non-linear functions like ReLU under certain conditions.[24] Without non-linearity, networks fail to solve problems requiring non-linear decision boundaries, such as the XOR gate, which single-layer perceptrons cannot classify due to its non-linear separability, as shown in early analyses of perceptron limitations. The key criterion for non-linearity is that the activation must not preserve linearity when composed with affine transformations; specifically, $f(Wz + b)$ should not reduce to an affine function for all $W, b, z$.
Piecewise linear or smoothly curved forms, like those in ReLU or sigmoid, satisfy this by introducing bends or saturations that allow layered compositions to generate non-linear manifolds.[23] Biologically, this mirrors the threshold-based firing of neurons, where inputs are integrated until exceeding a firing potential, as modeled in the foundational McCulloch-Pitts neuron, which uses a step function to simulate all-or-nothing spikes only above a threshold.
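The collapse argument is easy to demonstrate concretely: with the identity activation, a two-layer network equals a single affine map with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$. A toy sketch with hand-rolled matrix helpers:

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

# Two layers with the identity activation...
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 0.5]
W2, b2 = [[3.0, 0.0], [1.0, 1.0]], [-1.0, 2.0]

x = [0.3, -0.7]
two_layer = vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)

# ...collapse to one affine map W x + b with W = W2 W1 and b = W2 b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
b = vadd(matvec(W2, b1), b2)
one_layer = vadd(matvec(W, x), b)

same = all(abs(u - v) < 1e-9 for u, v in zip(two_layer, one_layer))
print(same)  # True
```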

Specialized Variants

Radial Basis Functions

Radial basis functions (RBFs) serve as activation functions in neural networks, characterized by their dependence solely on the radial distance from a specified center point. Formally, an RBF is defined as $ f(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{c}\|) $, where $\mathbf{x}$ is the input vector, $\mathbf{c}$ is the center vector, $\|\cdot\|$ denotes the Euclidean norm, and $\phi$ is a univariate function that operates on the distance $ r = \|\mathbf{x} - \mathbf{c}\| $. This structure ensures radial symmetry, making the activation invariant to rotations around the center.[25] The Gaussian function is the most prevalent form of RBF, expressed as
\phi(r) = \exp\left( -\frac{r^2}{2\sigma^2} \right),
where $\sigma > 0$ is a scale parameter that determines the function's width and thus the extent of its localized response. This exponential decay produces a smooth, bell-shaped curve peaking at $ r = 0 $ with value 1 and approaching 0 as $ r $ increases. RBFs exhibit infinite support, being non-zero for all finite $ r $, yet their rapid decay beyond a few multiples of $\sigma$ results in effectively localized peaks, ideal for capturing regional features in data. Additionally, their form ensures translation invariance: shifting $\mathbf{c}$ merely relocates the peak without altering its shape or height.[25][26] In radial basis function networks, these activations form the hidden layer, where the output is a weighted sum of multiple RBFs centered at selected points, enabling universal approximation of continuous functions on compact sets. This architecture, introduced for multivariable interpolation, excels in tasks requiring precise fitting to scattered data points, such as function approximation in high dimensions. The localized nature of RBFs facilitates efficient learning via methods like orthogonal least squares, avoiding the vanishing gradient issues that can saturate sigmoidal activations during backpropagation. Furthermore, the Gaussian RBF extends to kernel methods, notably as the radial basis kernel in support vector machines, where it implicitly maps inputs to a high-dimensional feature space for non-linear separation.[25]

Swish and Parametric Functions

Swish is a self-gated activation function defined as $ f(x) = x \cdot \sigma(\beta x) $, where $ \sigma $ is the sigmoid function and $ \beta $ is a learnable parameter that allows the function to adapt during training.[27] Introduced by Ramachandran et al. in 2017, Swish generalizes the ReLU by incorporating a smooth gating mechanism, enabling non-monotonic behavior that can enhance performance in deep neural networks.[27] Other parametric activation functions include the Parametric Rectified Linear Unit (PReLU), proposed by He et al. in 2015, which extends ReLU with a learnable slope parameter for negative inputs, formulated as $ f(x) = \max(0, x) + a \min(0, x) $ where $ a $ is trainable.[28] Similarly, the Gaussian Error Linear Unit (GELU), developed by Hendrycks and Gimpel in 2016, is defined as $ f(x) = x \Phi(x) $, where $ \Phi(x) $ is the cumulative distribution function of the standard Gaussian, providing a probabilistic interpretation that smooths transitions near zero.[29] These learnable activations offer advantages over fixed functions by avoiding abrupt zeros in the negative regime, which can mitigate dying neuron issues, and by permitting the network to optimize the activation's shape for specific tasks.[27][28][29] They have found widespread use in advanced architectures, such as Swish and PReLU in convolutional neural networks for image recognition, and GELU in transformer models like BERT for natural language processing.[27][28][30]
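These parametric forms are direct to implement; a sketch in which the default values for $\beta$ and the PReLU slope $a$ are illustrative choices, not canonical constants:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    # self-gated: the input multiplied by its own sigmoid gate
    return x * sigmoid(beta * x)

def prelu(x, a=0.25):
    # learnable negative slope a; fixed here for illustration
    return max(0.0, x) + a * min(0.0, x)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, 0.0, 2.0):
    print(x, round(swish(x), 4), round(prelu(x), 4), round(gelu(x), 4))
```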

Comparison and Applications

Performance Evaluation

Performance evaluation of activation functions relies on key metrics that quantify their influence on training efficiency, stability, and model performance. Convergence speed is a primary metric, often assessed by the number of epochs needed to achieve a specified accuracy threshold on benchmark datasets; faster convergence indicates more effective learning dynamics. Gradient variance measures the fluctuation in backpropagated gradients across layers, where excessive variance can lead to unstable optimization, and normalized gradient variance provides a more reliable indicator of convergence behavior than raw variance. Sparsity evaluates the percentage of zero-valued activations, promoting computational efficiency and potentially enhancing generalization by inducing feature sparsity in the network. Empirical benchmarks highlight these metrics in practice. On the MNIST dataset, rectified linear unit (ReLU) activations generally enable faster convergence and higher accuracy compared to sigmoid, often reaching over 98% accuracy more quickly due to reduced gradient saturation issues.[16] Similar trends appear on CIFAR-10, where ReLU-based convolutional neural networks (CNNs) demonstrate superior training speed and accuracy compared to sigmoid-based models, in image classification tasks.[16] For more advanced functions, Swish has shown marginal improvements over ReLU on the ImageNet dataset, boosting top-1 classification accuracy by 0.9% in Mobile NASNet-A architectures while maintaining comparable convergence rates.[27] Computational factors further inform evaluation. ReLU incurs minimal computational overhead with simple thresholding, whereas sigmoid and tanh demand more operations involving exponentials, leading to higher overall training and inference costs. Memory usage is also lower for sparse activations like those from ReLU, as zero values reduce storage needs during forward passes. 
Frameworks such as TensorFlow and PyTorch facilitate ablation studies by allowing seamless substitution of activation functions within identical architectures, enabling direct measurement of metrics like epochs-to-accuracy and gradient statistics. Current trends underscore ReLU variants' dominance in CNNs for vision tasks owing to their speed and sparsity benefits, while hyperbolic tangent (tanh) remains prevalent in recurrent neural networks (RNNs) for sequential modeling, where its zero-centered output aids gradient propagation over time steps.
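The sparsity metric described above can be checked directly: ReLU produces exact zeros for all non-positive pre-activations, while sigmoid never does. A minimal NumPy sketch, using simulated zero-mean pre-activations rather than a trained network:

```python
import numpy as np

def sparsity(activations):
    # Fraction of exactly-zero activations (ReLU yields true zeros)
    a = np.asarray(activations)
    return float(np.mean(a == 0.0))

rng = np.random.default_rng(0)
pre = rng.standard_normal(10_000)  # simulated pre-activations

relu_out = np.maximum(pre, 0.0)
sigmoid_out = 1.0 / (1.0 + np.exp(-pre))

# ReLU zeroes roughly half of zero-mean inputs;
# sigmoid outputs lie strictly in (0, 1), so its sparsity is 0
print(sparsity(relu_out), sparsity(sigmoid_out))
```

The same measurement applied per layer in a real model is what enables the memory and efficiency comparisons discussed above.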

Selection Criteria

The selection of an activation function in neural networks depends primarily on the nature of the task, as different functions are suited to specific output requirements. For binary classification tasks, the sigmoid function is commonly applied in the output layer to produce probabilities between 0 and 1, enabling direct interpretation as class likelihoods.[31] In multi-class classification, softmax (a generalization of the sigmoid function) is preferred for output layers to generate normalized probabilities across classes.[32] For hidden layers in feedforward networks, the rectified linear unit (ReLU) is a standard choice due to its ability to introduce non-linearity without saturating gradients during backpropagation, facilitating efficient training in deep architectures.[33]

Architectural considerations further guide the choice, particularly in recurrent neural networks (RNNs), where bounded activations like tanh are favored to mitigate exploding gradients by constraining signal propagation over time steps. In contrast, unbounded functions such as ReLU are well suited to convolutional neural networks (CNNs), supporting deeper layers without vanishing signals and promoting sparsity in feature representations.[33]

Practical constraints, including computational resources and training stability, also influence decisions. ReLU's simple thresholding operation (outputting the input if positive and zero otherwise) ensures low computational overhead, making it ideal for deployment on resource-limited edge devices where efficiency is paramount.[16] To enhance stability, saturating functions like sigmoid should be avoided in hidden layers, as they can lead to vanishing gradients that hinder learning in deep networks.[16]

Heuristics provide practical starting points for practitioners: begin with ReLU for most hidden layers due to its robustness and speed, then experiment with alternatives like Swish if overfitting occurs, as its smooth, non-monotonic shape can improve generalization in complex models.[27] Additionally, consider the data distribution; zero-centered activations such as tanh are beneficial when inputs are symmetrically distributed around zero, as they prevent bias shifts in subsequent layers and accelerate convergence.[32] Emerging trends point toward automated methods for selection, with AutoML techniques enabling the search for task-specific activation functions through reinforcement learning or evolutionary algorithms, potentially yielding optimized variants beyond manual choices.[27]
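The heuristics above can be condensed into a small lookup. The helper below is hypothetical, written only to illustrate the decision rules; its name and signature are not from any library, and real projects should validate such defaults empirically:

```python
def pick_activations(task, architecture="feedforward"):
    """Hypothetical helper encoding the selection heuristics above.

    Returns (hidden_activation, output_activation) as names; these are
    starting points to be confirmed by ablation, not fixed rules.
    """
    # Bounded tanh for RNNs (gradient stability over time steps),
    # ReLU elsewhere (non-saturating, cheap, sparse).
    hidden = "tanh" if architecture == "rnn" else "relu"
    # Output layer follows the task's output requirements.
    output = {
        "binary_classification": "sigmoid",      # probabilities in (0, 1)
        "multiclass_classification": "softmax",  # normalized class probabilities
        "regression": "linear",                  # unconstrained real outputs
    }.get(task, "linear")
    return hidden, output
```

For example, a binary classifier with a feedforward body would get ReLU hidden layers and a sigmoid output, matching the guidance above.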

Advanced Topics

Quantum Activation Functions

Quantum activation functions refer to non-linear mappings implemented within quantum circuits to enable expressive power in quantum neural networks (QNNs), typically through measurement-based protocols or variational quantum circuits that introduce non-linearity without violating quantum linearity constraints. Unlike classical activations, these functions operate on quantum states, leveraging superposition and entanglement to process information in a Hilbert space, where the output is often obtained via partial measurements or post-selection to approximate non-linear behaviors.[34] This approach addresses the inherent linearity of unitary quantum operations by incorporating probabilistic elements, such as projective measurements, to mimic classical non-linearities while preserving quantum coherence where possible.

Prominent examples include the quantum sigmoid, realized through amplitude encoding of classical inputs into quantum states followed by variational circuits that approximate the sigmoid curve via trainable parameters, and extensions of Mercer's theorem to quantum kernels, which allow non-linear feature mappings in quantum support vector machines by embedding data into high-dimensional quantum Hilbert spaces. Another set of examples comprises QReLU and m-QReLU, quantum analogs of the rectified linear unit designed for binary classification tasks; QReLU applies a quantum rotation gate conditioned on the input amplitude to enforce rectification, while m-QReLU incorporates measurement outcomes to adaptively threshold activations in multi-qubit settings. Additionally, Quantum Splines (QSplines) and Generalized Hybrid Quantum Splines (GHQSplines) use variational quantum circuits to piecewise approximate arbitrary non-linear functions, enabling trainable quantum gates to serve as activation layers in QNNs.[34][35]

A primary challenge in implementing these functions stems from the no-cloning theorem, which prohibits duplicating unknown quantum states for parallel classical-like non-linear processing, necessitating partial measurements that introduce noise and decoherence risks. To mitigate this, techniques like ancillary qubits and controlled measurements are employed, though they can limit scalability on noisy intermediate-scale quantum (NISQ) devices.[34]

In applications, quantum activation functions enhance quantum machine learning (QML) models for tasks such as optimization and pattern recognition, potentially offering exponential speedups in high-dimensional data processing compared to classical counterparts. For example, quantum-inspired activations like QReLU have been applied in classical convolutional neural networks for medical diagnostics, such as detecting COVID-19 from lung ultrasound images and Parkinson's disease from spiral drawings, where they improved accuracy, precision, recall, and F1-scores compared to traditional ReLU variants.[35] Recent research since 2018 has focused on hybrid quantum-classical frameworks, with seminal works exploring trainable quantum gates for end-to-end QNN training and kernel-based methods that leverage quantum activations for provable advantages in specific problems.[34][35] More recent developments as of 2025 include quantum variational activation functions (QVAFs), which leverage data re-uploading in variational circuits for improved approximation in quantum neural architectures, and optimized quantum circuits for activation functions targeting fault-tolerant quantum devices.[36][37]

Periodic Activation Functions

Periodic activation functions are a class of activation mechanisms in neural networks designed to process periodic or cyclical data by applying trigonometric or periodic mappings that generate repeating outputs, enabling the network to capture inherent periodicities without artificial discontinuities. A representative example is $ f(x) = \sqrt{2} \sin(x) $, which generates a smooth, oscillating output, facilitating the representation of repeating patterns in data. This approach contrasts with traditional activations like ReLU by inherently embedding periodicity, which aids in modeling domains where inputs wrap around, such as angular measurements or seasonal cycles.[38]

Key properties of periodic activation functions include their continuity across periodic boundaries and their ability to preserve smoothness in cyclical representations, making them suitable for data exhibiting seasonality, such as calendar-based timestamps or directional angles. Unlike linear or piecewise activations, they avoid abrupt jumps at cycle edges (for example, treating 0° and 360° as equivalent), thus reducing gradient issues in optimization for circular data. These functions also promote better generalization in tasks with inherent repetition, as the periodic operation supports translation-invariant behavior and higher uncertainty for out-of-distribution data in Bayesian neural networks, enhancing the network's inductive bias toward periodicity.[38]

In practice, periodic activations find application in recurrent neural networks (RNNs) for temporal data analysis, where they help model time series with seasonal components, such as daily or yearly cycles in financial or environmental datasets, by avoiding the discontinuities that plague standard activations in circular domains. For instance, periodic activations have been integrated into models for time-series classification, where variants such as the periodic ReLU, which incorporate periodicity into their form, improve handling of oscillating signals while maintaining computational efficiency. Triangular wave functions, another periodic example, approximate linear rises and falls within each period, offering differentiable alternatives for tasks requiring precise periodicity capture in neural architectures.

The development of periodic activation functions gained prominence in the 2020s, driven by the needs of specialized machine learning tasks involving implicit representations of signals and shapes with periodic structures. Seminal work in 2021 introduced periodic mechanisms to enable neural networks to learn high-frequency details in cyclical data and induce global stationarity in Bayesian neural networks, outperforming conventional activations in fitting periodic targets and improving robustness. Subsequent advancements extended this to time-series domains, emphasizing their role in stabilizing training for models processing seasonal or angular inputs.[38]
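The wrap-around behavior described above follows directly from the periodicity of the sine. A minimal sketch of the example function $ f(x) = \sqrt{2} \sin(x) $, showing that inputs differing by a full turn (0° and 360°, i.e. $0$ and $2\pi$ radians) produce identical activations:

```python
import math

def periodic_activation(x):
    # f(x) = sqrt(2) * sin(x): smooth, repeating output with period 2*pi
    return math.sqrt(2.0) * math.sin(x)

# Angles that differ by a full turn map to the same activation value,
# so the cycle edges 0 and 2*pi are treated as equivalent inputs.
a = periodic_activation(0.0)
b = periodic_activation(2.0 * math.pi)
print(abs(a - b) < 1e-9)
```

A ReLU or linear unit applied to the raw angle would instead assign very different outputs to 0° and 360°, which is exactly the discontinuity at cycle edges that periodic activations avoid.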

References
