Why Do We Need Activation Functions?
In theory, training a neural network is the process of fitting a function y = f(x) that maps inputs x to outputs y. How well that function can be fit depends on the quality of the data and on the structure of the model. Models such as logistic regression and the perceptron have limited fitting capacity: they cannot even fit the XOR function.
According to the universal approximation theorem, a feed-forward network with a linear output layer and at least one hidden layer using a "squashing" activation function can approximate any continuous function to arbitrary precision, given enough hidden neurons. Activation functions are what make this possible: they apply non-linear transformations to the feature space, compressing values numerically and deforming its geometry.
Without activation functions, no matter how deep the network is, the output is still just a linear (affine) combination of the inputs, and a problem that is not linearly separable in the input space stays linearly inseparable in every transformed feature space.
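As a quick sanity check, here is a minimal sketch (assuming only torch is installed; the layer sizes are arbitrary) showing that two stacked linear layers with no activation in between collapse into a single linear map:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers without an activation in between...
f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 3, bias=False)

# ...collapse to one linear map with weight W = W2 @ W1.
W = f2.weight @ f1.weight
x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), x @ W.T, atol=1e-6))  # True
```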
How to Choose an Appropriate Activation Function?
An activation function should provide a non-linear transformation and be differentiable (at least almost everywhere) so that gradients can flow through it. Hidden layers and the output layer also have different requirements, so the best choice can differ between them. Let's discuss some commonly used activation functions:
Sigmoid and Tanh
Tanh generally outperforms sigmoid in hidden layers because its outputs lie in [−1, +1] and are roughly zero-centered, which gives the subsequent layers better-behaved inputs. Sigmoid outputs are always positive (not zero-centered), which makes gradient updates zig-zag and slows optimization. For the output layer of a binary classifier, however, sigmoid is usually preferred because its output can be read as a probability.
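A minimal PyTorch sketch of the two functions (assuming only torch is installed; the input values are arbitrary):

```python
import torch

x = torch.linspace(-5.0, 5.0, steps=5)

# Sigmoid squashes inputs into (0, 1); Tanh squashes into (-1, 1) and is zero-centered.
print(torch.sigmoid(x))
print(torch.tanh(x))

# The same functions as layers, for use inside nn.Sequential models:
sigmoid_layer = torch.nn.Sigmoid()
tanh_layer = torch.nn.Tanh()
print(sigmoid_layer(x), tanh_layer(x))
```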
ReLU and Leaky ReLU
ReLU is computationally cheap and tends to speed up gradient descent. It also makes activations sparse: units whose pre-activation is negative output zero and receive no gradient, so a unit that stays negative for every input effectively "dies" and stops learning. Leaky ReLU mitigates this by keeping a small non-zero slope for negative inputs.
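A short sketch of the two in PyTorch (the slope of 0.01 below is just the library default, not a tuned value):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = torch.nn.ReLU()
leaky_relu = torch.nn.LeakyReLU(negative_slope=0.01)

print(relu(x))        # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(leaky_relu(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.5000])
```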
Softplus
Softplus, defined as softplus(x) = log(1 + e^x), is a smooth approximation of ReLU, but in practice it is generally not as effective.
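In PyTorch it is available as a ready-made module (a minimal sketch; the input values are arbitrary):

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

softplus = torch.nn.Softplus()   # softplus(x) = log(1 + exp(x))
print(softplus(x))               # tensor([0.1269, 0.6931, 2.1269])
```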
Swish
Swish, defined as swish(x) = x · sigmoid(βx), behaves like ReLU for large positive inputs but is smooth and non-monotonic around zero, and it often matches or outperforms ReLU.
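With β = 1, Swish is what PyTorch ships as SiLU; the general form is also easy to write by hand (a minimal sketch, with β treated as a fixed constant here):

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

# Built-in Swish with beta = 1 (called SiLU in PyTorch).
silu = torch.nn.SiLU()
print(silu(x))

# The general form, with beta as a tunable (or learnable) parameter.
beta = 1.0
print(x * torch.sigmoid(beta * x))
```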
Maxout
Maxout takes the maximum over several learned linear functions of its input, so the activation itself is a learnable piece-wise linear function. The benefit is adaptability; the cost is several times more parameters per unit.
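PyTorch has no built-in Maxout layer, so the sketch below is a hypothetical minimal implementation (the class name, sizes, and the choice of k = 3 pieces are all illustrative):

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: k linear pieces per output feature, take the element-wise max."""
    def __init__(self, in_features, out_features, num_pieces=2):
        super().__init__()
        self.out_features = out_features
        self.num_pieces = num_pieces
        self.linear = nn.Linear(in_features, out_features * num_pieces)

    def forward(self, x):
        z = self.linear(x)                                   # (batch, out * k)
        z = z.view(-1, self.out_features, self.num_pieces)   # (batch, out, k)
        return z.max(dim=-1).values                          # max over the k pieces

layer = Maxout(in_features=8, out_features=4, num_pieces=3)
print(layer(torch.randn(2, 8)).shape)  # torch.Size([2, 4])
```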
RBF
RBF (Radial Basis Function) units are seldom used in neural networks: they fire only when the input is close to a learned center, so the activation saturates to zero for most inputs, which makes them difficult to optimize.
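There is no standard RBF layer in PyTorch either; the following is a hypothetical sketch of a Gaussian RBF unit (the class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class RBF(nn.Module):
    """Gaussian RBF unit: activation peaks when the input is near a learned center."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(out_features, in_features))
        self.log_sigma = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Squared distance between each input and each center: (batch, out_features)
        dist_sq = torch.cdist(x, self.centers).pow(2)
        return torch.exp(-dist_sq / (2 * torch.exp(self.log_sigma) ** 2))

layer = RBF(in_features=8, out_features=4)
print(layer(torch.randn(2, 8)).shape)  # torch.Size([2, 4])
```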
These are just a few examples. The choice of activation function is largely empirical and depends on the task at hand.