activation_functions / activation_tutorial.md
AmberLJC's picture
Upload activation_tutorial.md with huggingface_hub
3283ee8 verified

Comprehensive Tutorial: Activation Functions in Deep Learning

Table of Contents

  1. Introduction
  2. Theoretical Background
  3. Experiment 1: Gradient Flow
  4. Experiment 2: Sparsity and Dead Neurons
  5. Experiment 3: Training Stability
  6. Experiment 4: Representational Capacity
  7. Experiment 5: Temporal Gradient Analysis (NEW)
  8. Summary and Recommendations

Introduction

Activation functions are a critical component of neural networks that introduce non-linearity, enabling networks to learn complex patterns. This tutorial provides both theoretical explanations and empirical experiments to understand how different activation functions affect:

  1. Gradient Flow: Do gradients vanish or explode during backpropagation?
  2. Sparsity & Dead Neurons: How easily do units turn on/off?
  3. Stability: How robust is training under stress (large learning rates, deep networks)?
  4. Representational Capacity: How well can the network approximate different functions?

Activation Functions Studied

Function Formula Range Key Property
Linear f(x) = x (-∞, ∞) No non-linearity
Sigmoid f(x) = 1/(1+e⁻ˣ) (0, 1) Bounded, saturates
Tanh f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) (-1, 1) Zero-centered, saturates
ReLU f(x) = max(0, x) [0, ∞) Sparse, can die
Leaky ReLU f(x) = max(αx, x) (-∞, ∞) Prevents dead neurons
ELU f(x) = x if x>0, α(eˣ-1) otherwise (-α, ∞) Smooth negative region
GELU f(x) = xΒ·Ξ¦(x) β‰ˆ(-0.17, ∞) Smooth, probabilistic
Swish f(x) = xΒ·Οƒ(x) β‰ˆ(-0.28, ∞) Self-gated

Theoretical Background

Why Non-linearity Matters

Without activation functions, a neural network of any depth is equivalent to a single linear transformation:

f(x) = Wβ‚™ Γ— Wₙ₋₁ Γ— ... Γ— W₁ Γ— x = W_combined Γ— x

Non-linear activations allow networks to approximate any continuous function (Universal Approximation Theorem).

The Gradient Flow Problem

During backpropagation, gradients flow through the chain rule:

βˆ‚L/βˆ‚Wα΅’ = βˆ‚L/βˆ‚aβ‚™ Γ— βˆ‚aβ‚™/βˆ‚aₙ₋₁ Γ— ... Γ— βˆ‚aα΅’β‚Šβ‚/βˆ‚aα΅’ Γ— βˆ‚aα΅’/βˆ‚Wα΅’

Each layer contributes a factor of Οƒ'(z) Γ— W, where Οƒ' is the activation derivative.

Vanishing Gradients: When |Οƒ'(z)| < 1 repeatedly

  • Sigmoid: Οƒ'(z) ∈ (0, 0.25], maximum at z=0
  • For n layers: gradient β‰ˆ (0.25)ⁿ β†’ 0 as n β†’ ∞

Exploding Gradients: When |Οƒ'(z) Γ— W| > 1 repeatedly

  • More common with unbounded activations
  • Mitigated by gradient clipping, proper initialization

Experiment 1: Gradient Flow

Question

How do gradients propagate through deep networks with different activations?

Method

  • Built networks with depths [5, 10, 20, 50]
  • Measured gradient magnitude at each layer during backpropagation
  • Used Xavier initialization for fair comparison

Results

Gradient Flow

Gradient Ratio (Layer 10 / Layer 1) at Depth=20

Activation Gradient Ratio Interpretation
Linear 1.43e+00 Stable gradient flow
Sigmoid inf Severe vanishing gradients
Tanh 5.07e-01 Stable gradient flow
ReLU 1.08e+00 Stable gradient flow
LeakyReLU 1.73e+00 Stable gradient flow
ELU 8.78e-01 Stable gradient flow
GELU 3.34e-01 Stable gradient flow
Swish 1.14e+00 Stable gradient flow

Theoretical Explanation

Sigmoid shows the most severe gradient decay because:

  • Maximum derivative is only 0.25 (at z=0)
  • In deep networks: 0.25²⁰ β‰ˆ 10⁻¹² (effectively zero!)

ReLU maintains gradients better because:

  • Derivative is exactly 1 for positive inputs
  • But can be exactly 0 for negative inputs (dead neurons)

GELU/Swish provide smooth gradient flow:

  • Derivatives are bounded but not as severely as Sigmoid
  • Smooth transitions prevent sudden gradient changes

Experiment 2: Sparsity and Dead Neurons

Question

How do activations affect the sparsity of representations and the "death" of neurons?

Method

  • Trained 10-layer networks with high learning rate (0.1) to stress-test
  • Measured activation sparsity (% of near-zero activations)
  • Measured dead neuron rate (neurons that never activate)

Results

Sparsity and Dead Neurons

Activation Sparsity (%) Dead Neurons (%)
Linear 0.0% 100.0%
Sigmoid 8.2% 8.2%
Tanh 0.0% 0.0%
ReLU 48.8% 6.6%
LeakyReLU 0.1% 0.0%
ELU 0.0% 0.0%
GELU 0.0% 0.0%
Swish 0.0% 0.0%

Theoretical Explanation

ReLU creates sparse representations:

  • Any negative input β†’ output is exactly 0
  • ~50% sparsity is typical with zero-mean inputs
  • Sparsity can be beneficial (efficiency, regularization)

Dead Neuron Problem:

  • If a ReLU neuron's input is always negative, it outputs 0 forever
  • Gradient is 0, so weights never update
  • Caused by: bad initialization, large learning rates, unlucky gradients

Solutions:

  • Leaky ReLU: Small gradient (0.01) for negative inputs
  • ELU: Smooth negative region with non-zero gradient
  • Proper initialization: Keep activations in a good range

Experiment 3: Training Stability

Question

How stable is training under stress conditions (large learning rates, deep networks)?

Method

  • Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0]
  • Tested depths: [5, 10, 20, 50, 100]
  • Measured whether training diverged (loss β†’ ∞)

Results

Stability

Key Observations

Learning Rate Stability:

  • Sigmoid/Tanh: Most stable (bounded outputs prevent explosion)
  • ReLU: Can diverge at high learning rates
  • GELU/Swish: Good balance of stability and performance

Depth Stability:

  • All activations struggle with depth > 50 without special techniques
  • Sigmoid fails earliest due to vanishing gradients
  • ReLU/LeakyReLU maintain trainability longer

Theoretical Explanation

Why bounded activations are more stable:

  • Sigmoid outputs ∈ (0, 1), so activations can't explode
  • But gradients can vanish, making learning very slow

Why ReLU can be unstable:

  • Unbounded outputs: large inputs β†’ large outputs β†’ larger gradients
  • Positive feedback loop can cause explosion

Modern solutions:

  • Batch Normalization: Keeps activations in good range
  • Residual Connections: Allow gradients to bypass layers
  • Gradient Clipping: Prevents explosion

Experiment 4: Representational Capacity

Question

How well can networks with different activations approximate various functions?

Method

  • Target functions: sin(x), |x|, step, sin(10x), xΒ³
  • 5-layer networks, 500 epochs training
  • Measured test MSE

Results

Representational Capacity

Predictions

Test MSE by Activation Γ— Target Function

| Activation | sin(x) | |x| | step | sin(10x) | xΒ³ | |------------|------|------|------|------|------| | Linear | 0.0262 | 0.3347 | 0.0406 | 0.4906 | 1.4807 | | Sigmoid | 0.0015 | 0.0025 | 0.0007 | 0.4910 | 0.0184 | | Tanh | 0.0006 | 0.0022 | 0.0000 | 0.4903 | 0.0008 | | ReLU | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0002 | | LeakyReLU | 0.0000 | 0.0000 | 0.0000 | 0.0008 | 0.0004 | | ELU | 0.0007 | 0.0005 | 0.0012 | 0.2388 | 0.0003 | | GELU | 0.0000 | 0.0006 | 0.0001 | 0.0009 | 0.0033 | | Swish | 0.0000 | 0.0017 | 0.0004 | 0.4601 | 0.0016 |

Theoretical Explanation

Universal Approximation Theorem:

  • Any continuous function can be approximated with enough neurons
  • But different activations have different "inductive biases"

ReLU excels at piecewise functions (like |x|):

  • ReLU networks compute piecewise linear functions
  • Perfect match for |x| which is piecewise linear

Smooth activations for smooth functions:

  • GELU, Swish produce smoother decision boundaries
  • Better for smooth targets like sin(x)

High-frequency functions are hard:

  • sin(10x) has 10 oscillations in [-2, 2]
  • Requires many neurons to capture all oscillations
  • All activations struggle without sufficient width

Experiment 5: Temporal Gradient Analysis

Question

How do gradients evolve during training? Does the vanishing gradient problem persist or improve?

Method

  • Measured gradient magnitudes at epochs 1, 100, and 200
  • Tracked gradient ratio (Layer 10 / Layer 1) over time
  • Analyzed whether training helps or hurts gradient flow

Results

Gradient Flow Over Epochs

Gradient Evolution

Gradient Magnitudes at Key Training Epochs

Activation Epoch Layer 1 Layer 5 Layer 10 Ratio (L10/L1)
Linear 1 4.01e-04 3.29e-04 7.44e-04 1.86
Linear 100 3.10e-05 2.78e-05 3.57e-05 1.15
Linear 200 1.12e-07 9.99e-08 1.21e-07 1.08
Sigmoid 1 1.66e-10 2.40e-07 3.68e-03 2.22e+07
Sigmoid 100 1.04e-10 3.24e-10 4.77e-06 4.59e+04
Sigmoid 200 1.32e-10 1.24e-10 3.23e-08 2.45e+02
ReLU 1 1.20e-05 6.12e-06 3.23e-05 2.69
ReLU 100 2.04e-03 1.28e-03 4.84e-04 0.24
ReLU 200 1.27e-04 7.49e-05 1.91e-05 0.15
Leaky ReLU 1 2.78e-06 5.04e-06 3.17e-04 114
Leaky ReLU 100 1.30e-03 4.29e-04 3.37e-04 0.26
Leaky ReLU 200 8.98e-04 8.29e-04 1.79e-04 0.20
GELU 1 4.10e-07 7.02e-07 1.50e-04 365
GELU 100 2.66e-04 1.54e-04 2.57e-04 0.97
GELU 200 4.87e-04 2.21e-04 1.63e-04 0.34

Key Insights

1. Sigmoid's Catastrophic Vanishing Gradients

  • At epoch 1: Gradient ratio is 22 million to 1 (Layer 10 vs Layer 1)
  • This means Layer 1 receives 22 million times less gradient signal than Layer 10
  • The early layers essentially cannot learn!
  • Even after 200 epochs, the ratio is still 245:1

2. Modern Activations Self-Correct

  • ReLU, Leaky ReLU, GELU: Start with some gradient imbalance
  • By epoch 100-200, ratios approach 0.2-1.0 (healthy range)
  • The network learns to balance gradient flow through weight adaptation

3. Training Dynamics Visualization

Training Dynamics Summary

This comprehensive figure shows:

  • Panel A: Loss curves showing convergence speed
  • Panel B: Gradient ratio evolution over training
  • Panel C: Final learned functions
  • Panels D1-D3: Gradient flow at epochs 1, 100, 200
  • Panels E1-E3: Function approximation at epochs 50, 200, 499

Theoretical Explanation

Why Sigmoid gradients don't improve:

  • Sigmoid saturates to 0 or 1 for large inputs
  • Derivative Οƒ'(z) = Οƒ(z)(1-Οƒ(z)) β†’ 0 when Οƒ(z) β†’ 0 or 1
  • Deep layers push activations toward saturation
  • Early layers are "locked" and cannot adapt

Why ReLU/GELU gradients stabilize:

  • Adam optimizer adapts learning rates per-parameter
  • Weights adjust to keep activations in "active" region
  • Network finds a gradient-friendly configuration

Practical Implications

  1. Sigmoid is fundamentally broken for deep hidden layers

    • Not just slow to train, but mathematically unable to learn
    • Early layers receive ~10⁻¹⁰ gradient magnitude
  2. Modern activations are self-healing

    • Initial gradient imbalance corrects during training
    • Adam optimizer helps by adapting per-parameter learning rates
  3. Monitor gradient ratios during training

    • Ratio > 100 indicates vanishing gradients
    • Ratio < 0.01 indicates exploding gradients
    • Healthy range: 0.1 to 10

Summary and Recommendations

Comparison Table

Property Best Activations Worst Activations
Gradient Flow LeakyReLU, GELU Sigmoid, Tanh
Avoids Dead Neurons LeakyReLU, ELU, GELU ReLU
Training Stability Sigmoid, Tanh, GELU ReLU (high lr)
Smooth Functions GELU, Swish, Tanh ReLU
Sharp Functions ReLU, LeakyReLU Sigmoid
Computational Speed ReLU, LeakyReLU GELU, Swish

Practical Recommendations

  1. Default Choice: ReLU or LeakyReLU

    • Simple, fast, effective for most tasks
    • Use LeakyReLU if dead neurons are a concern
  2. For Transformers/Attention: GELU

    • Standard in BERT, GPT, modern transformers
    • Smooth gradients help with optimization
  3. For Very Deep Networks: LeakyReLU or ELU

    • Or use residual connections + batch normalization
    • Avoid Sigmoid/Tanh in hidden layers
  4. For Regression with Bounded Outputs: Sigmoid (output layer only)

    • Use for probabilities or [0, 1] outputs
    • Never in hidden layers of deep networks
  5. For RNNs/LSTMs: Tanh (traditional choice)

    • Zero-centered helps with recurrent dynamics
    • Modern alternative: use Transformers instead

The Big Picture

                    ACTIVATION FUNCTION SELECTION GUIDE
                    
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                     Is it a hidden layer?                    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β–Ό                               β–Ό
           YES                               NO (output layer)
              β”‚                               β”‚
              β–Ό                               β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Is it a         β”‚             β”‚ What's the task?    β”‚
    β”‚ Transformer?    β”‚             β”‚                     β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚ Binary class β†’ Sigmoid
              β”‚                     β”‚ Multi-class β†’ Softmax
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”             β”‚ Regression β†’ Linear β”‚
      β–Ό               β–Ό             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    YES              NO
      β”‚               β”‚
      β–Ό               β–Ό
    GELU      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚ Worried about   β”‚
              β”‚ dead neurons?   β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
              β–Ό               β–Ό
            YES              NO
              β”‚               β”‚
              β–Ό               β–Ό
         LeakyReLU          ReLU
           or ELU

Files Generated

File Description
learned_functions.png Final learned functions vs ground truth
loss_curves.png Training loss curves over 500 epochs
gradient_flow.png Gradient magnitude across layers (epoch 1)
gradient_flow_epochs.png NEW Gradient flow at epochs 1, 100, 200
gradient_evolution.png NEW Gradient ratio evolution over training
hidden_activations.png Activation distributions in trained network
training_dynamics_functions.png NEW Function learning over time
activation_evolution.png NEW Activation distribution evolution
training_dynamics_summary.png NEW Comprehensive training dynamics
exp1_gradient_flow.png Gradient magnitude across layers
exp2_sparsity_dead_neurons.png Sparsity and dead neuron rates
exp2_activation_distributions.png Activation value distributions
exp3_stability.png Stability vs learning rate and depth
exp4_representational_heatmap.png MSE heatmap for different targets
exp4_predictions.png Actual predictions vs ground truth
summary_figure.png Comprehensive summary visualization

References

  1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
  2. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
  3. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).
  4. Ramachandran, P., et al. (2017). Searching for Activation Functions.
  5. Nwankpa, C., et al. (2018). Activation Functions: Comparison of trends in Practice and Research for Deep Learning.

Tutorial generated by Orchestra Research Assistant All experiments are reproducible with the provided code