AmberLJC
/

activation_functions

AmberLJC committed 19 days ago

Commit

3283ee8

verified ·

1 Parent(s): 417be58

Upload activation_tutorial.md with huggingface_hub

activation_tutorial.md +450 -0

	@@ -0,0 +1,450 @@

+# Comprehensive Tutorial: Activation Functions in Deep Learning
+## Table of Contents
+1. [Introduction](#introduction)
+2. [Theoretical Background](#theoretical-background)
+3. [Experiment 1: Gradient Flow](#experiment-1-gradient-flow)
+4. [Experiment 2: Sparsity and Dead Neurons](#experiment-2-sparsity-and-dead-neurons)
+5. [Experiment 3: Training Stability](#experiment-3-training-stability)
+6. [Experiment 4: Representational Capacity](#experiment-4-representational-capacity)
+7. [**Experiment 5: Temporal Gradient Analysis**](#experiment-5-temporal-gradient-analysis) *(NEW)*
+8. [Summary and Recommendations](#summary-and-recommendations)
+---
+## Introduction
+Activation functions are a critical component of neural networks that introduce non-linearity, enabling networks to learn complex patterns. This tutorial provides both **theoretical explanations** and **empirical experiments** to understand how different activation functions affect:
+1. **Gradient Flow**: Do gradients vanish or explode during backpropagation?
+2. **Sparsity & Dead Neurons**: How easily do units turn on/off?
+3. **Stability**: How robust is training under stress (large learning rates, deep networks)?
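+4. **Representational Capacity**: How well can the network approximate different functions?
+5. **Gradient Dynamics over Training**: Do per-layer gradient magnitudes improve or degrade as training proceeds?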
+### Activation Functions Studied
+| Function | Formula | Range | Key Property |
+|----------|---------|-------|--------------|
+| Linear | f(x) = x | (-∞, ∞) | No non-linearity |
+| Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 1) | Bounded, saturates |
+| Tanh | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centered, saturates |
+| ReLU | f(x) = max(0, x) | [0, ∞) | Sparse, can die |
+| Leaky ReLU | f(x) = max(αx, x), 0 < α < 1 | (-∞, ∞) | Prevents dead neurons |
+| ELU | f(x) = x if x>0, α(eˣ-1) otherwise | (-α, ∞) | Smooth negative region |
+| GELU | f(x) = x·Φ(x) | ≈(-0.17, ∞) | Smooth, probabilistic |
+| Swish | f(x) = x·σ(x) | ≈(-0.28, ∞) | Self-gated |
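The formulas in the table can be written out directly. The sketch below uses only the Python standard library and assumes α = 0.01 for Leaky ReLU and α = 1.0 for ELU (common defaults, not values stated in this tutorial). It also checks the approximate lower bounds listed for GELU and Swish by scanning a grid.

```python
import math

def sigmoid(x: float) -> float:
    # Branch on sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # alpha = 0.01 is an assumed default.
    return x if x > 0 else alpha * x

def elu(x: float, alpha: float = 1.0) -> float:
    # alpha = 1.0 is an assumed default; expm1 is accurate near 0.
    return x if x > 0 else alpha * math.expm1(x)

def gelu(x: float) -> float:
    # x * Phi(x), with Phi the standard normal CDF expressed through erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x: float) -> float:
    return x * sigmoid(x)

# Scan [-10, 10] to locate the minimum of the two non-monotonic activations.
grid = [i / 1000.0 for i in range(-10000, 10001)]
gelu_min = min(gelu(x) for x in grid)
swish_min = min(swish(x) for x in grid)
print(f"GELU min  ~ {gelu_min:.3f}")
print(f"Swish min ~ {swish_min:.3f}")
```

The scan is a numerical check, not a proof, but it agrees with the ranges in the table to two decimal places.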
+---
+## Theoretical Background
+### Why Non-linearity Matters
+Without activation functions, a neural network of any depth is equivalent to a single linear transformation:
+```
+f(x) = Wₙ × Wₙ₋₁ × ... × W₁ × x = W_combined × x
+```
+With a non-linear activation, a sufficiently wide network can approximate **any continuous function** on a compact domain to arbitrary accuracy (Universal Approximation Theorem).
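A small numeric check makes the collapse concrete. In this sketch the 2×2 weight matrices and the input vector are arbitrary illustrative values, and biases are omitted to match the formula above.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x k) times (k x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def relu_vec(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, -1.0]]   # illustrative weights
W2 = [[0.5, 0.0], [3.0, 1.0]]
x = [1.0, 2.0]

# Two stacked linear layers ...
two_layer = matvec(W2, matvec(W1, x))
# ... equal one layer whose weight is the product W2 @ W1.
W_combined = matmul(W2, W1)
one_layer = matvec(W_combined, x)

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(two_layer, one_layer, with_relu)
```

Here the hidden vector is `[5.0, -2.0]`; ReLU zeroes the second component, so the non-linear network's output differs from that of any single matrix applied to `x`.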
+### The Gradient Flow Problem
+During backpropagation, gradients flow through the chain rule:
+```
+∂L/∂Wᵢ = ∂L/∂aₙ × ∂aₙ/∂aₙ₋₁ × ... × ∂aᵢ₊₁/∂aᵢ × ∂aᵢ/∂Wᵢ
+```
+Each layer contributes a factor of **σ'(z) × W**, where σ' is the activation derivative.
+**Vanishing Gradients**: When |σ'(z) × W| < 1 repeatedly
+- Sigmoid: σ'(z) ∈ (0, 0.25], maximum at z=0
+- For n layers, the activation-derivative factors alone multiply to at most (0.25)ⁿ, which → 0 as n → ∞ unless the weights compensate
+**Exploding Gradients**: When |σ'(z) × W| > 1 repeatedly
+- More common with unbounded activations
+- Mitigated by gradient clipping, proper initialization
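To get a feel for how fast the sigmoid bound shrinks, the sketch below evaluates it for the depths used later in Experiment 1. It covers only the activation-derivative factors in the best case (every pre-activation at z = 0) and ignores the weight matrices, so it is an upper bound on that factor rather than a prediction of measured gradients.

```python
import math

def sigmoid_prime(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case for sigmoid: the derivative peaks at z = 0 with value 0.25.
peak = sigmoid_prime(0.0)

# Product of n peak derivatives, for the depths used in Experiment 1.
bounds = {depth: peak ** depth for depth in (5, 10, 20, 50)}

for depth, bound in bounds.items():
    print(f"depth={depth:>2}  bound on activation factor = {bound:.2e}")
```

Real pre-activations are rarely all at zero, so the actual product of derivatives is smaller still.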
+---
+## Experiment 1: Gradient Flow
+### Question
+How do gradients propagate through deep networks with different activations?
+### Method
+- Built networks with depths [5, 10, 20, 50]
+- Measured gradient magnitude at each layer during backpropagation
+- Used Xavier initialization for fair comparison
+### Results
+![Gradient Flow](exp1_gradient_flow.png)
+#### Gradient Ratio (Layer 10 / Layer 1) at Depth=20
+| Activation | Gradient Ratio | Interpretation |
+|------------|----------------|----------------|
+| Linear | 1.43e+00 | Stable gradient flow |
+| Sigmoid | inf | Severe vanishing gradients |
+| Tanh | 5.07e-01 | Stable gradient flow |
+| ReLU | 1.08e+00 | Stable gradient flow |
+| LeakyReLU | 1.73e+00 | Stable gradient flow |
+| ELU | 8.78e-01 | Stable gradient flow |
+| GELU | 3.34e-01 | Stable gradient flow |
+| Swish | 1.14e+00 | Stable gradient flow |
+### Theoretical Explanation
+**Sigmoid** shows the most severe gradient decay because:
+- Maximum derivative is only 0.25 (at z=0)
+- Across 20 layers the derivative factors multiply to at most 0.25²⁰ ≈ 9.1 × 10⁻¹³ (effectively zero)
+**ReLU** maintains gradients better because:
+- Derivative is exactly 1 for positive inputs
+- But can be exactly 0 for negative inputs (dead neurons)
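+**GELU/Swish** provide smooth gradient flow:
+- Derivatives stay within roughly (-0.13, 1.13) for GELU and (-0.10, 1.10) for Swish, so they are not capped at 0.25 the way Sigmoid's is
+- The derivative changes continuously around z=0 instead of jumping from 0 to 1 as ReLU's does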
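The derivative claims above can be checked numerically. This sketch evaluates closed-form derivatives on a grid over [-8, 8] (the grid range and step are arbitrary choices) and reports the largest slope each activation can contribute per layer.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))  # safe for |x| <= 8 as used below

def d_sigmoid(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2

def d_relu(x: float) -> float:
    return 1.0 if x > 0 else 0.0

def d_gelu(x: float) -> float:
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def d_swish(x: float) -> float:
    # d/dx [x * sigma(x)] = sigma(x) + x * sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

derivatives = {"Sigmoid": d_sigmoid, "Tanh": d_tanh, "ReLU": d_relu,
               "GELU": d_gelu, "Swish": d_swish}

grid = [i / 100.0 for i in range(-800, 801)]
max_slope = {name: max(fn(x) for x in grid) for name, fn in derivatives.items()}

for name, value in max_slope.items():
    print(f"{name:8s} max derivative ~ {value:.4f}")
```

Sigmoid is the only one of these whose best-case slope is well below 1, which is why its per-layer factor shrinks the gradient even when no unit is saturated.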
+---
+## Experiment 2: Sparsity and Dead Neurons
+### Question
+How do activations affect the sparsity of representations and the "death" of neurons?
+### Method
+- Trained 10-layer networks with high learning rate (0.1) to stress-test
+- Measured activation sparsity (% of near-zero activations)
+- Measured dead neuron rate (neurons that never activate)
+### Results
+![Sparsity and Dead Neurons](exp2_sparsity_dead_neurons.png)
+| Activation | Sparsity (%) | Dead Neurons (%) |
+|------------|--------------|------------------|
+| Linear | 0.0% | 100.0% |
+| Sigmoid | 8.2% | 8.2% |
+| Tanh | 0.0% | 0.0% |
+| ReLU | 48.8% | 6.6% |
+| LeakyReLU | 0.1% | 0.0% |
+| ELU | 0.0% | 0.0% |
+| GELU | 0.0% | 0.0% |
+| Swish | 0.0% | 0.0% |
+### Theoretical Explanation
+**ReLU creates sparse representations**:
+- Any negative input → output is exactly 0
+- About 50% sparsity is expected when pre-activations are symmetric around zero (ReLU measured 48.8% here)
+- Sparsity can be beneficial (efficiency, regularization)
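+**Dead Neuron Problem**:
+- If a ReLU neuron's pre-activation is negative for every input, it always outputs 0
+- Its gradient is then 0, so its weights stop updating
+- Caused by: bad initialization, large learning rates, or a single large update that pushes the bias far negative
+- Units whose derivative is non-zero everywhere (Linear, Leaky ReLU, ELU) cannot die this way, so the 100% reported for Linear above is best treated as a measurement artifact, possibly from diverged training at this learning rate, rather than as dead units
+**Solutions**:
+- **Leaky ReLU**: Small slope α (commonly 0.01) for negative inputs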
+- **ELU**: Smooth negative region with non-zero gradient
+- **Proper initialization**: Keep activations in a good range
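One way to compute the two metrics from a batch of recorded activations is sketched below. The function name `sparsity_and_dead_rate`, the threshold `eps`, and the exact definitions (near-zero means |a| < eps; dead means near-zero on every sample) are assumptions for illustration, since the tutorial does not spell out its criteria.

```python
def sparsity_and_dead_rate(activations, eps=1e-6):
    """activations[i][j] is the output of neuron j on sample i.

    Returns (sparsity, dead_rate):
      sparsity  = fraction of all activation values with |a| < eps
      dead_rate = fraction of neurons that are near-zero on every sample
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])

    near_zero = sum(1 for row in activations for a in row if abs(a) < eps)
    sparsity = near_zero / (n_samples * n_neurons)

    dead = sum(
        1 for j in range(n_neurons)
        if all(abs(row[j]) < eps for row in activations)
    )
    return sparsity, dead / n_neurons

# Toy batch: 4 samples, 3 ReLU neurons. Neuron 0 never fires.
acts = [
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.7],
    [0.0, 0.3, 0.0],
    [0.0, 2.1, 0.4],
]
sparsity, dead_rate = sparsity_and_dead_rate(acts)
print(f"sparsity={sparsity:.3f}, dead_rate={dead_rate:.3f}")
```

A neuron counts as dead only relative to the batch it is evaluated on, so a larger and more varied evaluation set gives a more trustworthy dead-neuron rate.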
+---
+## Experiment 3: Training Stability
+### Question
+How stable is training under stress conditions (large learning rates, deep networks)?
+### Method
+- Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0]
+- Tested depths: [5, 10, 20, 50, 100]
+- Measured whether training diverged (loss → ∞)
+### Results
+![Stability](exp3_stability.png)
+### Key Observations
+**Learning Rate Stability**:
+- Sigmoid/Tanh: Most stable (bounded outputs prevent explosion)
+- ReLU: Can diverge at high learning rates
+- GELU/Swish: Good balance of stability and performance
+**Depth Stability**:
+- All activations struggle with depth > 50 without special techniques
+- Sigmoid fails earliest due to vanishing gradients
+- ReLU/LeakyReLU maintain trainability longer
+### Theoretical Explanation
+**Why bounded activations are more stable**:
+- Sigmoid outputs ∈ (0, 1), so activations can't explode
+- But gradients can vanish, making learning very slow
+**Why ReLU can be unstable**:
+- Unbounded outputs: large inputs → large outputs → larger gradients
+- Positive feedback loop can cause explosion
+**Modern solutions**:
+- Batch Normalization: Keeps activations in good range
+- Residual Connections: Allow gradients to bypass layers
+- Gradient Clipping: Prevents explosion
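Of the three mitigations, gradient clipping is the simplest to show in isolation. The sketch below implements clipping by global L2 norm on a flat list of gradient values; this is one common variant, and the tutorial does not state which form of clipping (if any) its experiments used. The function name and `max_norm` parameter are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the budget: return an unchanged copy.
        return list(grads)
    scale = max_norm / total_norm
    # A uniform rescale preserves the gradient direction.
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> norm 1
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, unchanged
print(clipped, untouched)
```

Because the whole vector is rescaled by a single factor, the update direction is unchanged and only the step length is limited.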
+---
+## Experiment 4: Representational Capacity
+### Question
+How well can networks with different activations approximate various functions?
+### Method
+- Target functions: sin(x), |x|, step, sin(10x), x³
+- 5-layer networks, 500 epochs training
+- Measured test MSE
+### Results
+![Representational Capacity](exp4_representational_heatmap.png)
+![Predictions](exp4_predictions.png)
+#### Test MSE by Activation × Target Function
+| Activation | sin(x) | \|x\| | step | sin(10x) | x³ |
+|------------|------|------|------|------|------|
+| Linear | 0.0262 | 0.3347 | 0.0406 | 0.4906 | 1.4807 |
+| Sigmoid | 0.0015 | 0.0025 | 0.0007 | 0.4910 | 0.0184 |
+| Tanh | 0.0006 | 0.0022 | 0.0000 | 0.4903 | 0.0008 |
+| ReLU | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0002 |
+| LeakyReLU | 0.0000 | 0.0000 | 0.0000 | 0.0008 | 0.0004 |
+| ELU | 0.0007 | 0.0005 | 0.0012 | 0.2388 | 0.0003 |
+| GELU | 0.0000 | 0.0006 | 0.0001 | 0.0009 | 0.0033 |
+| Swish | 0.0000 | 0.0017 | 0.0004 | 0.4601 | 0.0016 |
+### Theoretical Explanation
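+**Universal Approximation Theorem**:
+- Any continuous function on a compact domain can be approximated with enough neurons
+- But different activations have different "inductive biases"
+**ReLU excels at piecewise functions** (like |x|):
+- ReLU networks compute piecewise linear functions
+- |x| is itself piecewise linear (|x| = ReLU(x) + ReLU(-x)), so two ReLU units represent it exactly
+**Smooth activations for smooth functions**:
+- GELU and Swish produce smooth outputs, which in principle suits smooth targets like sin(x)
+- In this experiment the advantage did not show up: ReLU and LeakyReLU matched them on sin(x) (MSE 0.0000) and beat them on x³
+**High-frequency functions are hard**:
+- sin(10x) has a period of 2π/10 ≈ 0.63, so it completes about 6.4 oscillations in [-2, 2]
+- Each oscillation needs several units (or linear pieces) to capture
+- ReLU, LeakyReLU, and GELU fit it (MSE < 0.001); Linear, Sigmoid, Tanh, and Swish stay near 0.49, roughly the MSE of predicting a constant 0 if test points are spread evenly over [-2, 2], and ELU reaches only 0.24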
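A quick check of how many periods sin(10x) completes on [-2, 2], and of the MSE a model would score by predicting a constant 0 there, helps when reading the sin(10x) column of the results table. The sketch assumes test points are spread uniformly over [-2, 2], which the tutorial does not state explicitly.

```python
import math

lo, hi = -2.0, 2.0                 # input interval given in the text
period = 2.0 * math.pi / 10.0      # period of sin(10x)
n_periods = (hi - lo) / period

# MSE of the trivial predictor y_hat = 0, estimated with a midpoint grid.
n = 100_000
xs = [lo + (hi - lo) * (i + 0.5) / n for i in range(n)]
mse_zero = sum(math.sin(10.0 * x) ** 2 for x in xs) / n

# Closed form of the same average: (1/4) * integral of sin^2(10x) over [-2, 2].
closed_form = 0.5 - math.sin(40.0) / 80.0

print(f"periods in [-2, 2]: {n_periods:.2f}")
print(f"MSE of predicting 0: {mse_zero:.4f} (closed form {closed_form:.4f})")
```

Under that uniform-sampling assumption, any entry near 0.49 in the sin(10x) column means the network has learned essentially nothing about the oscillation.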
+---
+## Experiment 5: Temporal Gradient Analysis
+### Question
+How do gradients evolve during training? Does the vanishing gradient problem persist or improve?
+### Method
+- Measured gradient magnitudes at epochs 1, 100, and 200
+- Tracked gradient ratio (Layer 10 / Layer 1) over time
+- Analyzed whether training helps or hurts gradient flow
+### Results
+![Gradient Flow Over Epochs](gradient_flow_epochs.png)
+![Gradient Evolution](gradient_evolution.png)
+#### Gradient Magnitudes at Key Training Epochs
+| Activation | Epoch | Layer 1 | Layer 5 | Layer 10 | Ratio (L10/L1) |
+|------------|-------|---------|---------|----------|----------------|
+| Linear | 1 | 4.01e-04 | 3.29e-04 | 7.44e-04 | 1.86 |
+| Linear | 100 | 3.10e-05 | 2.78e-05 | 3.57e-05 | 1.15 |
+| Linear | 200 | 1.12e-07 | 9.99e-08 | 1.21e-07 | 1.08 |
+| **Sigmoid** | **1** | **1.66e-10** | **2.40e-07** | **3.68e-03** | **2.22e+07** |
+| **Sigmoid** | **100** | **1.04e-10** | **3.24e-10** | **4.77e-06** | **4.59e+04** |
+| **Sigmoid** | **200** | **1.32e-10** | **1.24e-10** | **3.23e-08** | **2.45e+02** |
+| ReLU | 1 | 1.20e-05 | 6.12e-06 | 3.23e-05 | 2.69 |
+| ReLU | 100 | 2.04e-03 | 1.28e-03 | 4.84e-04 | 0.24 |
+| ReLU | 200 | 1.27e-04 | 7.49e-05 | 1.91e-05 | 0.15 |
+| Leaky ReLU | 1 | 2.78e-06 | 5.04e-06 | 3.17e-04 | 114 |
+| Leaky ReLU | 100 | 1.30e-03 | 4.29e-04 | 3.37e-04 | 0.26 |
+| Leaky ReLU | 200 | 8.98e-04 | 8.29e-04 | 1.79e-04 | 0.20 |
+| GELU | 1 | 4.10e-07 | 7.02e-07 | 1.50e-04 | 365 |
+| GELU | 100 | 2.66e-04 | 1.54e-04 | 2.57e-04 | 0.97 |
+| GELU | 200 | 4.87e-04 | 2.21e-04 | 1.63e-04 | 0.34 |
+### Key Insights
+#### 1. Sigmoid's Catastrophic Vanishing Gradients
+- **At epoch 1**: Gradient ratio is **22 million to 1** (Layer 10 vs Layer 1)
+- This means Layer 1 receives 22 million times less gradient signal than Layer 10
+- The early layers essentially cannot learn!
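+- Even after 200 epochs the ratio is still 245:1, and it fell only because Layer 10's gradient shrank (3.68e-03 → 3.23e-08); Layer 1's gradient stayed near 1e-10 throughout
+#### 2. Modern Activations Self-Correct
+- **ReLU, Leaky ReLU, GELU**: Start with some gradient imbalance (ratios of 2.69, 114, and 365 at epoch 1)
+- By epochs 100-200, their ratios settle between 0.15 and 0.97
+- The network learns to balance gradient flow through weight adaptation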
+#### 3. Training Dynamics Visualization
+![Training Dynamics Summary](training_dynamics_summary.png)
+This comprehensive figure shows:
+- **Panel A**: Loss curves showing convergence speed
+- **Panel B**: Gradient ratio evolution over training
+- **Panel C**: Final learned functions
+- **Panels D1-D3**: Gradient flow at epochs 1, 100, 200
+- **Panels E1-E3**: Function approximation at epochs 50, 200, 499
+### Theoretical Explanation
+**Why Sigmoid's early-layer gradients don't improve**:
+- Sigmoid saturates to 0 or 1 for large inputs
+- Derivative σ'(z) = σ(z)(1-σ(z)) → 0 when σ(z) → 0 or 1
+- Deep layers push activations toward saturation
+- Early layers are "locked" and cannot adapt
+**Why ReLU/GELU gradients stabilize**:
+- Adam optimizer adapts learning rates per-parameter
+- Weights adjust to keep activations in "active" region
+- Network finds a gradient-friendly configuration
+### Practical Implications
+1. **Sigmoid is impractical for deep hidden layers**
+   - In these experiments the early layers effectively stop learning rather than merely training slowly
+   - Early layers receive ~10⁻¹⁰ gradient magnitude
+2. **Modern activations are self-healing**
+   - Initial gradient imbalance corrects during training
+   - Adam optimizer helps by adapting per-parameter learning rates
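+3. **Monitor gradient ratios during training** (last-layer gradient magnitude divided by first-layer gradient magnitude)
+   - Ratio > 100 indicates vanishing gradients (early layers receive far less signal)
+   - Ratio < 0.01 indicates exploding gradients (gradients grow on the way back to early layers)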
+   - Healthy range: 0.1 to 10
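The thresholds above translate into a small monitoring helper. In this sketch the function name `diagnose` and the "borderline" label for ratios that fall between the healthy range and the alarm thresholds are assumptions added for illustration; the numeric cut-offs are the ones listed above, and the example inputs are Layer 10 and Layer 1 magnitudes from the results table.

```python
def diagnose(late_grad: float, early_grad: float):
    """Classify gradient flow from last-layer and first-layer magnitudes.

    Returns (ratio, label), with ratio = late_grad / early_grad.
    """
    if early_grad == 0.0:
        # First-layer gradient underflowed to zero: treat as fully vanished.
        return float("inf"), "vanishing"
    ratio = late_grad / early_grad
    if ratio > 100.0:
        label = "vanishing"
    elif ratio < 0.01:
        label = "exploding"
    elif 0.1 <= ratio <= 10.0:
        label = "healthy"
    else:
        label = "borderline"  # assumed label for the gaps between thresholds
    return ratio, label

# Values taken from the Experiment 5 table (Layer 10, Layer 1).
print(diagnose(3.68e-03, 1.66e-10))  # Sigmoid, epoch 1
print(diagnose(1.91e-05, 1.27e-04))  # ReLU, epoch 200
print(diagnose(1.50e-04, 4.10e-07))  # GELU, epoch 1
```

In a training loop, the two magnitudes would typically be the mean absolute gradient of the last and first weight matrices, logged every few epochs.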
+---
+## Summary and Recommendations
+### Comparison Table
+| Property | Best Activations | Worst Activations |
+|----------|------------------|-------------------|
+| Gradient Flow | LeakyReLU, GELU | Sigmoid (Tanh saturates in theory but was stable at depth 20) |
+| Avoids Dead Neurons | LeakyReLU, ELU, GELU | ReLU |
+| Training Stability | Sigmoid, Tanh, GELU | ReLU (high lr) |
+| Smooth Functions (sin(x), x³) | ReLU, LeakyReLU, GELU | Linear, Sigmoid |
+| Sharp Functions | ReLU, LeakyReLU | Sigmoid |
+| Computational Speed | ReLU, LeakyReLU | GELU, Swish |
+### Practical Recommendations
+1. **Default Choice**: **ReLU** or **LeakyReLU**
+   - Simple, fast, effective for most tasks
+   - Use LeakyReLU if dead neurons are a concern
+2. **For Transformers/Attention**: **GELU**
+   - Standard in BERT, GPT, modern transformers
+   - Smooth gradients help with optimization
+3. **For Very Deep Networks**: **LeakyReLU** or **ELU**
+   - Or use residual connections + batch normalization
+   - Avoid Sigmoid/Tanh in hidden layers
+4. **For Regression with Bounded Outputs**: **Sigmoid** (output layer only)
+   - Use for probabilities or [0, 1] outputs
+   - Never in hidden layers of deep networks
+5. **For RNNs/LSTMs**: **Tanh** (traditional choice)
+   - Zero-centered helps with recurrent dynamics
+   - Modern alternative: use Transformers instead
+### The Big Picture
+```
+                    ACTIVATION FUNCTION SELECTION GUIDE
+    ┌─────────────────────────────────────────────────────────────┐
+    │                     Is it a hidden layer?                    │
+    └─────────────────────────────────────────────────────────────┘
+                              │
+              ┌───────────────┴───────────────┐
+              ▼                               ▼
+           YES                               NO (output layer)
+              │                               │
+              ▼                               ▼
+    ┌─────────────────┐             ┌────────────────────────┐
+    │ Is it a         │             │ What's the task?       │
+    │ Transformer?    │             │                        │
+    └─────────────────┘             │ Binary class → Sigmoid │
+              │                     │ Multi-class → Softmax  │
+      ┌───────┴───────┐             │ Regression → Linear    │
+      ▼               ▼             └────────────────────────┘
+    YES              NO
+      │               │
+      ▼               ▼
+    GELU      ┌─────────────────┐
+              │ Worried about   │
+              │ dead neurons?   │
+              └─────────────────┘
+                      │
+              ┌───────┴───────┐
+              ▼               ▼
+            YES              NO
+              │               │
+              ▼               ▼
+         LeakyReLU          ReLU
+           or ELU
+```
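The flowchart above can also be written as a small function. This is only a restatement of the diagram; the function name, parameter names, and task keys (`"binary"`, `"multiclass"`, `"regression"`) are hypothetical choices for illustration.

```python
def choose_activation(hidden: bool,
                      transformer: bool = False,
                      worried_about_dead_neurons: bool = False,
                      task: str = "") -> str:
    """Walk the selection guide and return the suggested activation."""
    if hidden:
        if transformer:
            return "GELU"
        if worried_about_dead_neurons:
            return "LeakyReLU or ELU"
        return "ReLU"

    # Output layer: the choice depends on the task.
    output_choices = {
        "binary": "Sigmoid",
        "multiclass": "Softmax",
        "regression": "Linear",
    }
    if task not in output_choices:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(output_choices)}")
    return output_choices[task]

print(choose_activation(hidden=True, transformer=True))
print(choose_activation(hidden=True, worried_about_dead_neurons=True))
print(choose_activation(hidden=False, task="multiclass"))
```

The guide is a starting point: the experiments above suggest the best choice also depends on depth, learning rate, and the target function.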
+---
+## Files Generated
+| File | Description |
+|------|-------------|
+| learned_functions.png | Final learned functions vs ground truth |
+| loss_curves.png | Training loss curves over 500 epochs |
+| gradient_flow.png | Gradient magnitude across layers (epoch 1) |
+| gradient_flow_epochs.png | **NEW** Gradient flow at epochs 1, 100, 200 |
+| gradient_evolution.png | **NEW** Gradient ratio evolution over training |
+| hidden_activations.png | Activation distributions in trained network |
+| training_dynamics_functions.png | **NEW** Function learning over time |
+| activation_evolution.png | **NEW** Activation distribution evolution |
+| training_dynamics_summary.png | **NEW** Comprehensive training dynamics |
+| exp1_gradient_flow.png | Gradient magnitude across layers |
+| exp2_sparsity_dead_neurons.png | Sparsity and dead neuron rates |
+| exp2_activation_distributions.png | Activation value distributions |
+| exp3_stability.png | Stability vs learning rate and depth |
+| exp4_representational_heatmap.png | MSE heatmap for different targets |
+| exp4_predictions.png | Actual predictions vs ground truth |
+| summary_figure.png | Comprehensive summary visualization |
+---
+*Tutorial generated by Orchestra Research Assistant*
+*All experiments are reproducible with the provided code*