
Swish function

from Wikipedia

The swish function is a family of mathematical functions defined as follows:

$$\operatorname{swish}_\beta(x) = x \operatorname{sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}},$$
[1]

where $\beta$ can be constant (usually set to 1) or trainable and "sigmoid" refers to the logistic function.

The swish family was designed to smoothly interpolate between a linear function and the ReLU function.

For positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2]: Eq. 3. Variants of the swish function include Mish.[3]

Special values


For β = 0, the function is linear: f(x) = x/2.

For β = 1, the function is the Sigmoid Linear Unit (SiLU).

With β → ∞, the function converges to ReLU.

Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]

Since $\operatorname{swish}_\beta(x) = \operatorname{swish}_1(\beta x)/\beta$, all instances of swish have the same shape as the default $\operatorname{swish}_1$, zoomed by $1/\beta$. One usually sets $\beta > 0$. When $\beta$ is trainable, this constraint can be enforced by $\beta = e^b$, where $b$ is trainable.
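This rescaling identity is easy to check numerically. A minimal pure-Python sketch (the `swish` helper is an illustrative name, not a library API):

```python
import math

def swish(x, beta=1.0):
    """Illustrative helper: swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

# Identity: swish_beta(x) = swish_1(beta * x) / beta, i.e. every member of
# the family is the default swish_1 curve rescaled by 1/beta in both axes.
for beta in (0.5, 2.0, 5.0):
    for x in (-3.0, -0.7, 0.0, 1.2, 4.0):
        assert abs(swish(x, beta) - swish(beta * x, 1.0) / beta) < 1e-12
```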

Derivatives


Because $\operatorname{swish}_\beta(x) = \operatorname{swish}_1(\beta x)/\beta$, it suffices to calculate its derivatives for the default case $\beta = 1$. With $\sigma$ denoting the logistic sigmoid,

$$\operatorname{swish}_1'(x) = \sigma(x) + x\,\sigma(x)\,(1 - \sigma(x)),$$

so $\operatorname{swish}_1'(x) - \tfrac{1}{2}$ is odd.

$$\operatorname{swish}_1''(x) = \sigma(x)\,(1 - \sigma(x))\,\bigl(2 + x\,(1 - 2\sigma(x))\bigr),$$

so $\operatorname{swish}_1''$ is even.
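The symmetry claims for the default case ($\operatorname{swish}_1' - \tfrac{1}{2}$ is odd, $\operatorname{swish}_1''$ is even) can be verified with central finite differences. A pure-Python sanity check with illustrative helper names:

```python
import math

def swish1(x):
    """Default swish (beta = 1): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def d1(x, h=1e-5):
    """Central finite difference for swish1'."""
    return (swish1(x + h) - swish1(x - h)) / (2 * h)

def d2(x, h=1e-4):
    """Central finite difference for swish1''."""
    return (swish1(x + h) - 2 * swish1(x) + swish1(x - h)) / h**2

for x in (0.3, 1.0, 2.5):
    # swish1'(x) - 1/2 odd  <=>  swish1'(x) + swish1'(-x) = 1
    assert abs(d1(x) + d1(-x) - 1.0) < 1e-6
    # swish1'' even  <=>  swish1''(x) = swish1''(-x)
    assert abs(d2(x) - d2(-x)) < 1e-4
```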

History


SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[5][1] Over a year after its initial discovery, the same function was proposed once more under the name SWISH, originally without the learnable parameter β (so that β implicitly equaled 1); the swish paper was later updated to propose the activation with the learnable parameter β.

In 2017, after performing analysis on ImageNet data, researchers from Google indicated that using this function as an activation function in artificial neural networks improves the performance, compared to ReLU and sigmoid functions.[1] It is believed that one reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.[6]


References

from Grokipedia
The Swish function, also known as the Swish activation, is a smooth, non-monotonic activation function employed in deep neural networks, mathematically defined as $ f(x) = x \cdot \sigma(\beta x) $, where $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid function and $ \beta $ is a hyperparameter often fixed at 1 (known as the Sigmoid Linear Unit or SiLU) or made learnable during training.[1] Introduced in 2017 by researchers Prajit Ramachandran, Barret Zoph, and Quoc V. Le at Google Brain, Swish emerged from an automated search process that combined exhaustive enumeration of simple functions with reinforcement learning-based neural architecture search to explore promising activation forms.[1] This self-gated mechanism, where the input $ x $ gates itself via the sigmoid term, distinguishes Swish from piecewise linear activations like ReLU, enabling it to exhibit small negative values for negative inputs while transitioning smoothly to positive growth, which facilitates improved gradient propagation in deep architectures.[1] Empirical evaluations in the original study revealed Swish's superiority over ReLU across diverse tasks, including image classification on CIFAR-10/100 and ImageNet datasets—yielding a 0.9% top-1 accuracy gain on Mobile NASNet-A and 0.6% on Inception-ResNet-v2 for ImageNet—as well as enhancements in recurrent neural networks for character-level language modeling and machine translation on the WMT 2014 English-to-German dataset.[1] The study showed Swish's efficacy in deeper models, often accelerating convergence and boosting performance in convolutional and recurrent networks, though it incurs modestly higher computational costs due to the sigmoid computation compared to ReLU.[1] Its straightforward implementation—a single operation in frameworks like TensorFlow and PyTorch—has contributed to widespread adoption in modern deep learning pipelines, including variants like hard-Swish for mobile efficiency.[2]

Definition

Mathematical Formulation

The Swish function, denoted $f(x)$, is defined mathematically as

$$f(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}},$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the standard sigmoid function and $\beta$ is a hyperparameter that can be fixed or learned during training, with the conventional value $\beta = 1$ for the base form.[3] This formulation is the product of the identity function $x$ and a scaled sigmoid $\sigma(\beta x)$, introducing a self-gating mechanism in which the sigmoid acts as an adaptive multiplier that modulates the linear input based on its own value.[3] The resulting curve is smooth and non-monotonic, exhibiting positive values for $x > 0$, negative values for $x < 0$, and a characteristic "bump" in the negative region before approaching zero asymptotically.[3]
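As a concrete illustration, the definition is a single expression in code. A minimal pure-Python sketch (the `swish` helper is an illustrative name, not a framework API):

```python
import math

def swish(x: float, beta: float = 1.0) -> float:
    """f(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + math.exp(-beta * x))

print(swish(1.0))   # ≈ 0.7311: positive inputs pass through mostly unchanged
print(swish(-1.0))  # ≈ -0.2689: the small negative "bump" region
print(swish(-6.0))  # ≈ -0.0148: large negative inputs are squashed toward 0
```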

Parameter Interpretation

The β parameter in the Swish activation function serves as a scaling factor that modulates the steepness of the sigmoid component within the self-gated structure, thereby influencing the overall curvature and behavior of the function.[3] Specifically, it controls the degree to which the gating mechanism, sigmoid(βx), transitions between a nearly constant value and a sharp step-like response, allowing Swish to interpolate between smoother, linear-like responses and piecewise linear approximations.[3] When β = 0, the sigmoid term evaluates to 0.5 regardless of x, reducing Swish to the scaled linear function f(x) = (1/2)x, which passes input values proportionally without nonlinearity.[3] As β approaches infinity, the sigmoid(βx) term approximates a Heaviside step function (0 for x < 0 and 1 for x > 0), making Swish behave asymptotically like the ReLU activation for large |x|, though with smoother transitions near the origin.[3] When β = 1, Swish is equivalent to the Sigmoid Linear Unit (SiLU) proposed by Elfwing et al. (2017).[3]

In standard implementations, β is fixed at 1, yielding the canonical Swish variant, which balances non-monotonicity in the negative domain with positive unbounded growth, often leading to improved gradient flow compared to ReLU.[3] This choice of β = 1 was empirically determined to perform well across various architectures, such as convolutional networks on ImageNet, without requiring hyperparameter tuning.[3] However, β can also be treated as a learnable parameter during training, typically initialized around 1, which allows the model to adapt the activation's shape to specific tasks or dataset characteristics.[3] Learned values of β often converge to the range of 0 to 1.5, with peaks near 1 in optimized networks like Mobile NASNet-A, enhancing performance by 0.6% to 0.9% on ImageNet top-1 accuracy when jointly trained with other parameters.[3] The value of β directly affects the function's shape: higher β values (>1) sharpen the sigmoid, promoting ReLU-like sparsity by suppressing negative inputs more aggressively while preserving strong positive signals, which can aid efficiency in deep networks.[3] Conversely, lower β values (<1) result in a smoother, more linear profile with reduced sparsity, allowing greater signal propagation through negative regions and potentially mitigating vanishing gradients. Training with a learnable β also calls for practical adjustments, such as using a slightly lower learning rate than for ReLU to stabilize convergence, and ensuring the scale parameter is enabled in batch normalization layers, since Swish is not piecewise linear like ReLU.[3] These strategies help β evolve without destabilizing the network's dynamics.[3]
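The two limiting cases of β described above can be confirmed numerically. A pure-Python sketch (illustrative helper names; a large finite β stands in for β → ∞):

```python
import math

def swish(x, beta):
    """Illustrative helper: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

# beta = 0: the gate is constant 0.5, so f(x) = x / 2 (scaled linear).
assert swish(3.0, 0.0) == 1.5
assert swish(-2.0, 0.0) == -1.0

# Large beta: the gate saturates to a step, so f approaches ReLU away from 0.
relu = lambda t: max(t, 0.0)
for x in (-4.0, -1.0, 1.0, 4.0):
    assert abs(swish(x, 50.0) - relu(x)) < 1e-3
```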

Properties

Special Values

The Swish function, defined as $f(x) = x \cdot \sigma(\beta x)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, evaluates to zero at the origin regardless of the parameter $\beta > 0$: $f(0) = 0 \cdot \sigma(0) = 0$, since the sigmoid at zero is $0.5$ but is multiplied by zero. This fixed point at the origin underscores its role as a self-gating mechanism.[4]

For the standard case $\beta = 1$, known as the Sigmoid Linear Unit (SiLU), the function takes approximate values $f(-1) \approx -0.269$ and $f(1) \approx 0.731$. These values reflect the non-monotonic behavior near the origin, where the function lies above the line $y = x$ for negative inputs (outputting values less negative than the input) before rising smoothly toward $y = x$ for positive ones.[4]

The Swish function crosses zero only at $x = 0$, with $f(x) < 0$ for all $x < 0$ and $f(x) > 0$ for all $x > 0$, because the always-positive sigmoid multiplier preserves the sign of $x$. No other zeros exist, since solving $f(x) = 0$ with a non-zero sigmoid factor reduces to $x = 0$. This single sign change at the origin distinguishes Swish from purely positive activations like ReLU.[4]

Varying $\beta$ alters the steepness and scaling of these evaluations; for instance, at $x = 1$, $f(1) = \sigma(\beta)$, yielding $\approx 0.731$ for $\beta = 1$ but $\approx 0.881$ for $\beta = 2$, where the higher $\beta$ produces a steeper sigmoid and thus a value closer to 1. This parameter tunes the function's gating strength, with lower $\beta$ values approaching linearity and higher ones mimicking ReLU-like clipping at positive points.[4]
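These special values are straightforward to verify. A short pure-Python check (illustrative helper name):

```python
import math

def swish(x, beta=1.0):
    """Illustrative helper: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

assert swish(0.0) == 0.0                           # fixed point at the origin
assert abs(swish(1.0) - 0.7311) < 1e-4             # f(1) = sigmoid(1)
assert abs(swish(-1.0) + 0.2689) < 1e-4            # negative "bump" region
assert abs(swish(1.0, beta=2.0) - 0.8808) < 1e-4   # f(1) = sigmoid(beta)
# The sign of x is preserved everywhere except the single zero at x = 0.
assert all(swish(x) < 0 for x in (-5.0, -0.5)) and swish(2.0) > 0
```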

Derivatives

The first derivative of the Swish function $f(x) = x \cdot \sigma(\beta x)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, is given by

$$f'(x) = \sigma(\beta x) + \beta x \cdot \sigma(\beta x) \cdot (1 - \sigma(\beta x)).$$

This can be rewritten in factored form as

$$f'(x) = \sigma(\beta x)\left[1 + \beta x\,(1 - \sigma(\beta x))\right].$$

Evaluating at $x = 0$ yields $f'(0) = \frac{1}{2}$, independent of $\beta > 0$. For $\beta > 0$, $f'(x) > 0$ when $x > 0$, but $f'(x) < 0$ for $x \lesssim -1.28$ (the approximate location of the local minimum for $\beta = 1$), rendering the Swish function non-monotonic overall. This non-monotonicity introduces a local minimum for negative inputs, distinguishing Swish from strictly increasing activations like ReLU.[4] The second derivative, obtained by differentiating $f'(x)$, is

$$f''(x) = \beta\,\sigma(\beta x)\,(1 - \sigma(\beta x))\left[2 + \beta x\,(1 - 2\sigma(\beta x))\right].$$

This expression changes sign depending on $x$ and $\beta$, indicating that Swish is convex in some regions (where $f''(x) > 0$) and concave in others (where $f''(x) < 0$). Such curvature properties can influence optimization dynamics in neural networks by allowing adaptive gradient flow without uniform convexity assumptions. The smoothness of these derivatives (Swish is infinitely differentiable and expressed via elementary sigmoid operations) facilitates efficient computation during backpropagation, avoiding the discontinuous derivatives of piecewise-linear activations.
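The closed forms above can be validated against finite-difference approximations. A pure-Python sketch (illustrative helper names):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, b):
    return x * sigmoid(b * x)

def d1(x, b):
    """Closed form: f'(x) = sigmoid(bx) * (1 + b*x*(1 - sigmoid(bx)))."""
    s = sigmoid(b * x)
    return s * (1.0 + b * x * (1.0 - s))

def d2(x, b):
    """Closed form: f''(x) = b*s*(1 - s) * (2 + b*x*(1 - 2s))."""
    s = sigmoid(b * x)
    return b * s * (1.0 - s) * (2.0 + b * x * (1.0 - 2.0 * s))

h = 1e-5
for b in (0.5, 1.0, 3.0):
    for x in (-2.0, -0.3, 0.0, 1.7):
        num1 = (swish(x + h, b) - swish(x - h, b)) / (2 * h)
        num2 = (swish(x + h, b) - 2 * swish(x, b) + swish(x - h, b)) / h**2
        assert abs(d1(x, b) - num1) < 1e-6
        assert abs(d2(x, b) - num2) < 1e-3

assert d1(0.0, 1.0) == 0.5  # f'(0) = 1/2, independent of beta
```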

Asymptotic Behavior

As $x \to \infty$, the Swish function $f(x) = x \cdot \sigma(\beta x)$ behaves asymptotically as $f(x) \sim x$, exhibiting linear growth analogous to the ReLU activation, while its derivative $f'(x) \to 1$.[4] This linear regime ensures unbounded positive outputs without saturation, facilitating effective signal propagation in positive domains.[4] In contrast, as $x \to -\infty$, $f(x) \sim x e^{\beta x}$, decaying exponentially toward 0 from below, with $f'(x) \to 0$.[4] This behavior arises because the sigmoid component approaches 0 faster than the linear factor $|x|$ grows, yielding near-zero outputs for large negative inputs. Additionally, in the negative domain Swish is non-monotonic, featuring a small dip below zero before asymptotically approaching 0, reflected in the derivative's characteristic "bump" for $x < 0$.[4] These asymptotic properties contribute to desirable characteristics in deep neural networks: the approach to 0 for negative inputs promotes sparsity by effectively suppressing inactive neurons, similar to ReLU, while the non-zero derivative in the positive regime and the non-monotonic bump in the negative domain enhance gradient flow, mitigating vanishing gradients and improving training stability in deeper architectures.[4]
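Both tails are easy to check numerically. A quick pure-Python verification (illustrative helper name, β = 1):

```python
import math

def swish(x, beta=1.0):
    """Illustrative helper: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

# Positive tail: f(x) ~ x, so f(x) / x -> 1.
assert abs(swish(20.0) / 20.0 - 1.0) < 1e-8

# Negative tail: f(x) ~ x * exp(beta * x), decaying to 0 from below.
x = -20.0
assert swish(x) < 0 and abs(swish(x)) < 1e-7
assert abs(swish(x) - x * math.exp(x)) < 1e-15
```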

Development

Origins

The Swish activation function was introduced in 2017 by Prajit Ramachandran, Barret Zoph, and Quoc V. Le, researchers at Google Brain.[4] This development stemmed from efforts to identify activation functions that could outperform the widely used ReLU in deep neural networks, addressing limitations such as ReLU's tendency to produce dying neurons and suboptimal performance in deeper architectures.[4] The origins of Swish trace back to a systematic search for novel activation functions, combining exhaustive enumeration of simple mathematical forms with reinforcement learning (RL)-based exploration.[4] An RNN controller was employed in the RL approach to generate and evaluate candidate functions, focusing on scalar, non-monotonic activations that could enhance gradient flow and representational power.[4] Swish emerged as a leading candidate from this process, defined initially with a fixed parameter β ≈ 1, though later recognized as potentially learnable.[4] Early theoretical insights highlighted Swish's self-gating property, where the function multiplies the input by a sigmoid-based gate derived from itself, akin to gating mechanisms in LSTMs or attention models.[4] This structure enables a smooth, nonlinear interpolation between linear behavior for positive inputs and ReLU-like suppression for negatives, potentially improving information propagation in deep networks without the abruptness of traditional thresholds.[4] The discovery was detailed in the seminal paper "Searching for Activation Functions," marking Swish as a promising alternative discovered through innovative search techniques rather than manual design.[4]

Naming and Variants

The Swish activation function was named "Swish" by the authors.[4] It is also referred to as the Sigmoid Linear Unit (SiLU) in certain contexts, particularly for the case where the parameter β is fixed at 1.[5] Notable variants include Swish-β, which incorporates a learnable parameter β to allow adaptation during training, enhancing flexibility over the fixed-β version.[4] For efficiency in resource-constrained environments like mobile devices, approximations such as Hard Swish have been developed; Hard Swish replaces the sigmoid with a piecewise linear function, defined as $ x \cdot \frac{\operatorname{ReLU6}(x + 3)}{6} $, where ReLU6 clamps values between 0 and 6.[6] Following its introduction, Swish saw rapid adoption in major deep learning frameworks, with native implementations added to TensorFlow in 2018 and PyTorch in 2021.
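The Hard Swish approximation described above can be sketched in a few lines of pure Python (`relu6` and `hard_swish` are illustrative names here, not framework APIs):

```python
import math

def relu6(x):
    """Clamp to [0, 6] (the ReLU6 nonlinearity)."""
    return min(max(x, 0.0), 6.0)

def hard_swish(x):
    """MobileNetV3-style approximation: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

def swish(x):
    return x / (1.0 + math.exp(-x))

# Piecewise behavior: identity above 3, zero below -3, quadratic in between.
assert hard_swish(4.0) == 4.0
assert hard_swish(-4.0) == 0.0
# Inside [-3, 3] the approximation stays close to the true swish curve.
assert all(abs(hard_swish(x) - swish(x)) < 0.4 for x in (-2.0, 0.0, 1.0, 2.0))
```

The piecewise-linear gate trades a small approximation error for sigmoid-free arithmetic, which is the motivation for its use on mobile hardware.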

Applications

Use in Neural Networks

The Swish activation function serves as a drop-in replacement for the ReLU activation in both convolutional neural network (CNN) layers and fully connected layers, requiring minimal modifications to existing architectures due to its similar output behavior for positive inputs while introducing subtle improvements for negative ones.[4] This seamless integration allows practitioners to substitute Swish directly into pre-trained models or during training to potentially enhance performance without altering the overall network structure.[4] Implementation of Swish is readily available in major deep learning frameworks, facilitating its adoption. In TensorFlow, it is provided via tf.nn.swish or tf.nn.silu, which computes $ x \cdot \sigma(x) $ where $ \sigma $ is the sigmoid function.[7] In PyTorch, the equivalent is torch.nn.SiLU, also known as Swish with a fixed $ \beta = 1 $, applying the same element-wise operation. In deep neural networks, Swish offers advantages through improved gradient propagation, as its derivative remains non-zero even for negative inputs, enabling small gradients to flow backward and mitigating issues like vanishing gradients in deeper layers.[4] This property supports smoother optimization and faster convergence compared to activations that fully suppress negative signals. Specific applications include its use in vision models such as EfficientNet, where Swish replaces ReLU across convolutional blocks to boost accuracy and efficiency in image classification tasks.[8] Additionally, Swish has been incorporated into transformer architectures, particularly in feed-forward network layers, to enhance training dynamics and model performance in sequence modeling.
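The gradient-propagation argument can be made concrete: unlike ReLU, the swish derivative is non-zero for negative inputs, so some gradient signal still flows backward through inactive units. A pure-Python sketch using the analytic derivative for β = 1 (helper names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish_grad(x):
    """Analytic swish' (beta = 1): sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# ReLU blocks all gradient for x < 0; swish still passes a small signal,
# which is the "improved gradient propagation" property noted above.
for x in (-0.5, -1.0, -2.0):
    assert relu_grad(x) == 0.0
    assert swish_grad(x) != 0.0
```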

Empirical Performance

In the original study introducing Swish, experiments on CIFAR-10 and CIFAR-100 datasets using ResNet architectures demonstrated accuracy improvements of approximately 0.9% over ReLU, with Swish achieving 95.5% on CIFAR-10 (Wide ResNet-28-10) and 83.9% on CIFAR-100 (DenseNet-100-12).[4] On ImageNet, Swish yielded gains of 0.9% to 1.4% top-1 accuracy in Mobile NASNet-A models and 0.6% in Inception-ResNet-v2 compared to ReLU.[4] Subsequent research has confirmed Swish's advantages in vision tasks, particularly in efficient architectures like MobileNet variants, where its adoption (or approximations like h-Swish in MobileNetV3) led to top-1 accuracy improvements of 0.5% to 1.0% on ImageNet while maintaining low latency on mobile devices. In natural language processing, evaluations across tasks such as sentiment analysis and question classification showed Swish outperforming ReLU by 1-3 percentage points in F1 scores on datasets like MR and TREC when integrated into recurrent and convolutional models, though it was sometimes surpassed in stability by other functions like penalized tanh.[9] Despite these benefits, Swish incurs a modestly higher computational cost than ReLU—approximately 10-20% more FLOPs per layer owing to the sigmoid computation—but this is often mitigated by faster convergence, reducing overall training time by 5-15% in deep networks. Post-2020 studies on large-scale Vision Transformers (ViTs) have further validated its efficacy in hybrid CNN-Transformer models for tasks like medical imaging and land-use classification as of 2025, enhancing performance without significant overhead.