# Comprehensive Tutorial: Activation Functions in Deep Learning

## Table of Contents
1. [Introduction](#introduction)
2. [Theoretical Background](#theoretical-background)
3. [Experiment 1: Gradient Flow](#experiment-1-gradient-flow)
4. [Experiment 2: Sparsity and Dead Neurons](#experiment-2-sparsity-and-dead-neurons)
5. [Experiment 3: Training Stability](#experiment-3-training-stability)
6. [Experiment 4: Representational Capacity](#experiment-4-representational-capacity)
7. [**Experiment 5: Temporal Gradient Analysis**](#experiment-5-temporal-gradient-analysis) *(NEW)*
8. [Summary and Recommendations](#summary-and-recommendations)

---

## Introduction

Activation functions are a critical component of neural networks: they introduce the non-linearity that enables networks to learn complex patterns. This tutorial pairs **theoretical explanations** with **empirical experiments** to show how different activation functions affect:

1. **Gradient Flow**: Do gradients vanish or explode during backpropagation?
2. **Sparsity & Dead Neurons**: How easily do units turn on and off?
3. **Stability**: How robust is training under stress (large learning rates, deep networks)?
4. **Representational Capacity**: How well can the network approximate different functions?

### Activation Functions Studied

| Function | Formula | Range | Key Property |
|----------|---------|-------|--------------|
| Linear | f(x) = x | (-∞, ∞) | No non-linearity |
| Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 1) | Bounded, saturates |
| Tanh | f(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centered, saturates |
| ReLU | f(x) = max(0, x) | [0, ∞) | Sparse, can die |
| Leaky ReLU | f(x) = max(αx, x) | (-∞, ∞) | Prevents dead neurons |
| ELU | f(x) = x if x>0, α(eˣ-1) otherwise | (-α, ∞) | Smooth negative region |
| GELU | f(x) = xΒ·Ξ¦(x) | β‰ˆ(-0.17, ∞) | Smooth, probabilistic |
| Swish | f(x) = xΒ·Οƒ(x) | β‰ˆ(-0.28, ∞) | Self-gated |
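
For reference, the formulas above can be written directly in code. This is a minimal NumPy sketch for illustration only; the tutorial's own experiment code is not included in this file:

```python
import numpy as np
from scipy.special import erf  # used for Ξ¦(x), the standard normal CDF

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # max(Ξ±x, x) for Ξ± < 1

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))   # x Β· Ξ¦(x)

def swish(x):
    return x * sigmoid(x)                            # x Β· Οƒ(x); tanh is just np.tanh

x = np.linspace(-4.0, 4.0, 9)
for name, f in [("relu", relu), ("gelu", gelu), ("swish", swish)]:
    print(name, np.round(f(x), 3))
```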

---

## Theoretical Background

### Why Non-linearity Matters

Without activation functions, a neural network of any depth is equivalent to a single linear transformation:

```
f(x) = Wβ‚™ Γ— Wₙ₋₁ Γ— ... Γ— W₁ Γ— x = W_combined Γ— x
```

Non-linear activations allow networks to approximate **any continuous function** (Universal Approximation Theorem).
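
The collapse of stacked linear layers can be checked numerically; a small NumPy sketch (illustrative, not taken from the tutorial's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Three stacked linear layers...
deep_linear = W3 @ (W2 @ (W1 @ x))
# ...equal one linear layer with a combined weight matrix.
W_combined = W3 @ W2 @ W1
print(np.allclose(deep_linear, W_combined @ x))       # True

# Inserting a non-linearity (here ReLU) between layers breaks the collapse.
relu = lambda z: np.maximum(0.0, z)
deep_nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
print(np.allclose(deep_nonlinear, W_combined @ x))    # False in general
```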

### The Gradient Flow Problem

During backpropagation, gradients flow through the chain rule:

```
βˆ‚L/βˆ‚Wα΅’ = βˆ‚L/βˆ‚aβ‚™ Γ— βˆ‚aβ‚™/βˆ‚aₙ₋₁ Γ— ... Γ— βˆ‚aα΅’β‚Šβ‚/βˆ‚aα΅’ Γ— βˆ‚aα΅’/βˆ‚Wα΅’
```

Each layer contributes a factor of **Οƒ'(z) Γ— W**, where Οƒ' is the activation derivative.

**Vanishing Gradients**: When |Οƒ'(z)| < 1 repeatedly
- Sigmoid: Οƒ'(z) ∈ (0, 0.25], maximum at z=0
- For n layers: gradient β‰ˆ (0.25)ⁿ β†’ 0 as n β†’ ∞ (illustrated numerically below)

**Exploding Gradients**: When |Οƒ'(z) Γ— W| > 1 repeatedly
- More common with unbounded activations
- Mitigated by gradient clipping and proper initialization
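
To make the compounding concrete, here is a small sketch of the per-layer derivative factors only, ignoring the weight matrices (an idealized illustration, not the experiment code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # maximum value 0.25, reached at z = 0

n_layers = 20
z = 0.0                               # best case for sigmoid: every pre-activation at 0

sigmoid_factor = sigmoid_prime(z) ** n_layers
relu_factor = 1.0 ** n_layers         # ReLU derivative is exactly 1 for positive inputs

print(f"sigmoid: {sigmoid_factor:.2e}")   # ~9.1e-13, i.e. roughly 10^-12
print(f"relu:    {relu_factor:.2e}")      # 1.0
```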

---

## Experiment 1: Gradient Flow

### Question
How do gradients propagate through deep networks with different activations?

### Method
- Built networks with depths [5, 10, 20, 50]
- Measured gradient magnitude at each layer during backpropagation (see the measurement sketch below)
- Used Xavier initialization for a fair comparison
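
A minimal sketch of how per-layer gradient norms can be measured, assuming a plain PyTorch MLP; the tutorial's actual experiment code is not included here, so the width, loss, and data are illustrative:

```python
import torch
import torch.nn as nn

def gradient_norms(depth=20, width=64, activation=nn.Tanh):
    """One forward/backward pass; return the gradient norm of each linear layer's weight."""
    layers = []
    for i in range(depth):
        linear = nn.Linear(width if i > 0 else 1, width)
        nn.init.xavier_uniform_(linear.weight)          # Xavier initialization
        layers += [linear, activation()]
    layers.append(nn.Linear(width, 1))
    model = nn.Sequential(*layers)

    x = torch.randn(128, 1)
    loss = nn.functional.mse_loss(model(x), torch.sin(x))
    loss.backward()

    return [l.weight.grad.norm().item() for l in model if isinstance(l, nn.Linear)]

norms = gradient_norms(activation=nn.Sigmoid)
print(f"layer 1: {norms[0]:.2e}, layer 10: {norms[9]:.2e}, ratio: {norms[9] / norms[0]:.2e}")
```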

### Results

![Gradient Flow](exp1_gradient_flow.png)

#### Gradient Ratio (Layer 10 / Layer 1) at Depth = 20

| Activation | Gradient Ratio | Interpretation |
|------------|----------------|----------------|
| Linear | 1.43e+00 | Stable gradient flow |
| Sigmoid | inf | Severe vanishing gradients |
| Tanh | 5.07e-01 | Stable gradient flow |
| ReLU | 1.08e+00 | Stable gradient flow |
| LeakyReLU | 1.73e+00 | Stable gradient flow |
| ELU | 8.78e-01 | Stable gradient flow |
| GELU | 3.34e-01 | Stable gradient flow |
| Swish | 1.14e+00 | Stable gradient flow |

### Theoretical Explanation

**Sigmoid** shows the most severe gradient decay because:
- Its maximum derivative is only 0.25 (at z = 0)
- In deep networks: 0.25²⁰ β‰ˆ 10⁻¹², effectively zero

**ReLU** maintains gradients better because:
- Its derivative is exactly 1 for positive inputs
- But it can be exactly 0 for negative inputs (dead neurons)

**GELU/Swish** provide smooth gradient flow:
- Their derivatives shrink, but not as severely as Sigmoid's
- Smooth transitions prevent sudden gradient changes

---

## Experiment 2: Sparsity and Dead Neurons

### Question
How do activations affect the sparsity of representations and the "death" of neurons?

### Method
- Trained 10-layer networks with a high learning rate (0.1) to stress-test them
- Measured activation sparsity (% of near-zero activations)
- Measured the dead-neuron rate (neurons that never activate); both measurements are sketched below
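
A minimal sketch of how sparsity and dead-neuron rate can be estimated from one batch of activations, assuming a PyTorch `nn.Sequential` model; the threshold and layer widths are illustrative assumptions, not the tutorial's code:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sparsity_and_dead_rate(model, x, threshold=1e-6):
    """Fraction of near-zero activations, and fraction of hidden units that never
    activate on this batch (a proxy for dead neurons)."""
    acts, h = [], x
    for layer in model:
        h = layer(h)
        if not isinstance(layer, nn.Linear):        # record post-activation values
            acts.append(h)
    a = torch.cat([t.flatten(1) for t in acts], dim=1)      # (batch, all hidden units)
    near_zero = a.abs() < threshold
    sparsity = near_zero.float().mean().item()
    dead = near_zero.all(dim=0).float().mean().item()       # zero for every input in the batch
    return sparsity, dead

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
print(sparsity_and_dead_rate(model, torch.randn(512, 1)))
```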

### Results

![Sparsity and Dead Neurons](exp2_sparsity_dead_neurons.png)

| Activation | Sparsity (%) | Dead Neurons (%) |
|------------|--------------|------------------|
| Linear | 0.0% | 100.0% |
| Sigmoid | 8.2% | 8.2% |
| Tanh | 0.0% | 0.0% |
| ReLU | 48.8% | 6.6% |
| LeakyReLU | 0.1% | 0.0% |
| ELU | 0.0% | 0.0% |
| GELU | 0.0% | 0.0% |
| Swish | 0.0% | 0.0% |

### Theoretical Explanation

**ReLU creates sparse representations**:
- Any negative input β†’ output is exactly 0
- ~50% sparsity is typical with zero-mean inputs
- Sparsity can be beneficial (efficiency, regularization)

**Dead Neuron Problem**:
- If a ReLU neuron's input is always negative, it outputs 0 forever
- Its gradient is 0, so its weights never update
- Causes: bad initialization, large learning rates, unlucky gradient updates

**Solutions**:
- **Leaky ReLU**: Small slope (0.01) for negative inputs keeps a non-zero gradient
- **ELU**: Smooth negative region with non-zero gradient
- **Proper initialization**: Keep activations in a good range

---

## Experiment 3: Training Stability

### Question
How stable is training under stress conditions (large learning rates, deep networks)?

### Method
- Tested learning rates: [0.001, 0.01, 0.1, 0.5, 1.0]
- Tested depths: [5, 10, 20, 50, 100]
- Measured whether training diverged (loss β†’ ∞)

### Results

![Stability](exp3_stability.png)

### Key Observations

**Learning Rate Stability**:
- Sigmoid/Tanh: Most stable (bounded outputs prevent explosion)
- ReLU: Can diverge at high learning rates
- GELU/Swish: Good balance of stability and performance

**Depth Stability**:
- All activations struggle at depths above 50 without special techniques
- Sigmoid fails earliest due to vanishing gradients
- ReLU/LeakyReLU maintain trainability longer

### Theoretical Explanation

**Why bounded activations are more stable**:
- Sigmoid outputs lie in (0, 1), so activations cannot explode
- But gradients can vanish, making learning very slow

**Why ReLU can be unstable**:
- Unbounded outputs: large inputs β†’ large outputs β†’ larger gradients
- This positive feedback loop can cause explosion

**Modern solutions** (the last two are sketched below):
- Batch Normalization: keeps activations in a good range
- Residual Connections: let gradients bypass layers
- Gradient Clipping: prevents explosion
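
A short PyTorch sketch of a residual block plus gradient clipping, for illustration only; it does not reproduce the tutorial's stability experiments, and the widths and learning rate are assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x): the identity path lets gradients bypass the block."""
    def __init__(self, width):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width))

    def forward(self, x):
        return x + self.f(x)

model = nn.Sequential(nn.Linear(1, 64),
                      *[ResidualBlock(64) for _ in range(10)],
                      nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.5)   # deliberately aggressive

x = torch.randn(256, 1)
loss = nn.functional.mse_loss(model(x), torch.sin(x))
loss.backward()

# Gradient clipping: rescale the global gradient norm before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```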

---

## Experiment 4: Representational Capacity

### Question
How well can networks with different activations approximate various functions?

### Method
- Target functions: sin(x), |x|, step, sin(10x), xΒ³
- 5-layer networks, trained for 500 epochs
- Measured test MSE (the target setup is sketched below)
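
A small sketch of how such targets and the test MSE could be set up, assuming PyTorch and the [-2, 2] input range mentioned later in this section; the training loop is omitted and the network shown is illustrative:

```python
import torch
import torch.nn as nn

# Target functions from the experiment, defined on the assumed input range [-2, 2].
targets = {
    "sin(x)":   torch.sin,
    "|x|":      torch.abs,
    "step":     lambda x: (x > 0).float(),
    "sin(10x)": lambda x: torch.sin(10 * x),
    "x^3":      lambda x: x ** 3,
}

def test_mse(model, fn, n=1000):
    x = torch.linspace(-2, 2, n).unsqueeze(1)
    with torch.no_grad():
        return nn.functional.mse_loss(model(x), fn(x)).item()

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
print({name: round(test_mse(model, fn), 4) for name, fn in targets.items()})  # untrained baseline
```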

### Results

![Representational Capacity](exp4_representational_heatmap.png)

![Predictions](exp4_predictions.png)

#### Test MSE by Activation Γ— Target Function

| Activation | sin(x) | \|x\| | step | sin(10x) | xΒ³ |
|------------|--------|-------|------|----------|-----|
| Linear | 0.0262 | 0.3347 | 0.0406 | 0.4906 | 1.4807 |
| Sigmoid | 0.0015 | 0.0025 | 0.0007 | 0.4910 | 0.0184 |
| Tanh | 0.0006 | 0.0022 | 0.0000 | 0.4903 | 0.0008 |
| ReLU | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0002 |
| LeakyReLU | 0.0000 | 0.0000 | 0.0000 | 0.0008 | 0.0004 |
| ELU | 0.0007 | 0.0005 | 0.0012 | 0.2388 | 0.0003 |
| GELU | 0.0000 | 0.0006 | 0.0001 | 0.0009 | 0.0033 |
| Swish | 0.0000 | 0.0017 | 0.0004 | 0.4601 | 0.0016 |

### Theoretical Explanation

**Universal Approximation Theorem**:
- Any continuous function can be approximated with enough neurons
- But different activations have different "inductive biases"

**ReLU excels at piecewise functions** (like |x|):
- ReLU networks compute piecewise-linear functions
- A perfect match for |x|, which is itself piecewise linear

**Smooth activations for smooth functions**:
- GELU and Swish produce smoother decision boundaries
- Better for smooth targets such as sin(x)

**High-frequency functions are hard**:
- sin(10x) oscillates rapidly, completing about six full periods over [-2, 2]
- Capturing all of those oscillations requires many neurons
- All activations struggle without sufficient width

---

## Experiment 5: Temporal Gradient Analysis

### Question
How do gradients evolve during training? Does the vanishing gradient problem persist or improve?

### Method
- Measured gradient magnitudes at epochs 1, 100, and 200
- Tracked the gradient ratio (Layer 10 / Layer 1) over time
- Analyzed whether training helps or hurts gradient flow

### Results

![Gradient Flow Over Epochs](gradient_flow_epochs.png)

![Gradient Evolution](gradient_evolution.png)

#### Gradient Magnitudes at Key Training Epochs

| Activation | Epoch | Layer 1 | Layer 5 | Layer 10 | Ratio (L10/L1) |
|------------|-------|---------|---------|----------|----------------|
| Linear | 1 | 4.01e-04 | 3.29e-04 | 7.44e-04 | 1.86 |
| Linear | 100 | 3.10e-05 | 2.78e-05 | 3.57e-05 | 1.15 |
| Linear | 200 | 1.12e-07 | 9.99e-08 | 1.21e-07 | 1.08 |
| **Sigmoid** | **1** | **1.66e-10** | **2.40e-07** | **3.68e-03** | **2.22e+07** |
| **Sigmoid** | **100** | **1.04e-10** | **3.24e-10** | **4.77e-06** | **4.59e+04** |
| **Sigmoid** | **200** | **1.32e-10** | **1.24e-10** | **3.23e-08** | **2.45e+02** |
| ReLU | 1 | 1.20e-05 | 6.12e-06 | 3.23e-05 | 2.69 |
| ReLU | 100 | 2.04e-03 | 1.28e-03 | 4.84e-04 | 0.24 |
| ReLU | 200 | 1.27e-04 | 7.49e-05 | 1.91e-05 | 0.15 |
| Leaky ReLU | 1 | 2.78e-06 | 5.04e-06 | 3.17e-04 | 114 |
| Leaky ReLU | 100 | 1.30e-03 | 4.29e-04 | 3.37e-04 | 0.26 |
| Leaky ReLU | 200 | 8.98e-04 | 8.29e-04 | 1.79e-04 | 0.20 |
| GELU | 1 | 4.10e-07 | 7.02e-07 | 1.50e-04 | 365 |
| GELU | 100 | 2.66e-04 | 1.54e-04 | 2.57e-04 | 0.97 |
| GELU | 200 | 4.87e-04 | 2.21e-04 | 1.63e-04 | 0.34 |

### Key Insights

#### 1. Sigmoid's Catastrophic Vanishing Gradients
- **At epoch 1**: the gradient ratio is **22 million to 1** (Layer 10 vs Layer 1)
- Layer 1 therefore receives 22 million times less gradient signal than Layer 10
- The early layers essentially cannot learn
- Even after 200 epochs, the ratio is still 245:1

#### 2. Modern Activations Self-Correct
- **ReLU, Leaky ReLU, GELU**: start with some gradient imbalance
- By epochs 100-200, ratios approach 0.2-1.0 (a healthy range)
- The network learns to balance gradient flow through weight adaptation

#### 3. Training Dynamics Visualization

![Training Dynamics Summary](training_dynamics_summary.png)

This comprehensive figure shows:
- **Panel A**: Loss curves showing convergence speed
- **Panel B**: Gradient ratio evolution over training
- **Panel C**: Final learned functions
- **Panels D1-D3**: Gradient flow at epochs 1, 100, 200
- **Panels E1-E3**: Function approximation at epochs 50, 200, 499

### Theoretical Explanation

**Why Sigmoid gradients don't improve**:
- Sigmoid saturates toward 0 or 1 for large-magnitude inputs
- Its derivative Οƒ'(z) = Οƒ(z)(1-Οƒ(z)) β†’ 0 as Οƒ(z) β†’ 0 or 1
- Deep layers push activations toward saturation
- Early layers are "locked" and cannot adapt

**Why ReLU/GELU gradients stabilize**:
- The Adam optimizer adapts learning rates per parameter
- Weights adjust to keep activations in the "active" region
- The network settles into a gradient-friendly configuration

### Practical Implications

1. **Sigmoid is fundamentally broken for deep hidden layers**
   - Not just slow to train: early layers receive almost no learning signal
   - Early-layer gradient magnitudes sit around 10⁻¹⁰

2. **Modern activations are self-healing**
   - Initial gradient imbalance corrects itself during training
   - The Adam optimizer helps by adapting per-parameter learning rates

3. **Monitor gradient ratios during training** (see the monitoring sketch below)
   - Ratio > 100 indicates vanishing gradients
   - Ratio < 0.01 indicates exploding gradients
   - Healthy range: 0.1 to 10
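
A minimal monitoring helper along these lines, assuming a PyTorch model whose first and last `nn.Linear` layers correspond to Layer 1 and Layer 10; the thresholds are taken from the list above, everything else is illustrative:

```python
import torch.nn as nn

def gradient_ratio_report(model):
    """Ratio of last-layer to first-layer weight-gradient norms, checked against
    the thresholds above. Call after loss.backward()."""
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
    first, last = linears[0], linears[-1]
    ratio = last.weight.grad.norm().item() / (first.weight.grad.norm().item() + 1e-12)
    if ratio > 100:
        status = "vanishing gradients (early layers starved)"
    elif ratio < 0.01:
        status = "exploding gradients (early layers dominate)"
    else:
        status = "healthy (roughly 0.1 to 10)"
    return ratio, status
```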

---

## Summary and Recommendations

### Comparison Table

| Property | Best Activations | Worst Activations |
|----------|------------------|-------------------|
| Gradient Flow | LeakyReLU, GELU | Sigmoid, Tanh |
| Avoids Dead Neurons | LeakyReLU, ELU, GELU | ReLU |
| Training Stability | Sigmoid, Tanh, GELU | ReLU (high lr) |
| Smooth Functions | GELU, Swish, Tanh | ReLU |
| Sharp Functions | ReLU, LeakyReLU | Sigmoid |
| Computational Speed | ReLU, LeakyReLU | GELU, Swish |

### Practical Recommendations

1. **Default Choice**: **ReLU** or **LeakyReLU** (see the selection sketch below)
   - Simple, fast, and effective for most tasks
   - Use LeakyReLU if dead neurons are a concern

2. **For Transformers/Attention**: **GELU**
   - Standard in BERT, GPT, and modern transformers
   - Smooth gradients help with optimization

3. **For Very Deep Networks**: **LeakyReLU** or **ELU**
   - Or use residual connections + batch normalization
   - Avoid Sigmoid/Tanh in hidden layers

4. **For Regression with Bounded Outputs**: **Sigmoid** (output layer only)
   - Use for probabilities or [0, 1] outputs
   - Never use it in the hidden layers of deep networks

5. **For RNNs/LSTMs**: **Tanh** (the traditional choice)
   - Zero-centered outputs help with recurrent dynamics
   - Modern alternative: use Transformers instead
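
These recommendations can be collapsed into a small lookup helper; a PyTorch sketch where the context names are illustrative, not part of the tutorial:

```python
import torch.nn as nn

def pick_activation(context: str) -> nn.Module:
    """Map the recommendations above onto PyTorch modules (illustrative helper)."""
    table = {
        "default":        nn.ReLU(),
        "dead_neurons":   nn.LeakyReLU(0.01),    # or nn.ELU()
        "transformer":    nn.GELU(),
        "very_deep":      nn.LeakyReLU(0.01),    # plus residual connections / batch norm
        "bounded_output": nn.Sigmoid(),           # output layer only
        "rnn":            nn.Tanh(),
    }
    return table[context]

hidden_act = pick_activation("transformer")   # GELU, as used in BERT/GPT-style models
```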

### The Big Picture

```
ACTIVATION FUNCTION SELECTION GUIDE

                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚  Is it a hidden layer? β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β–Ό                                    β–Ό
           YES                          NO (output layer)
            β”‚                                    β”‚
            β–Ό                                    β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚     Is it a     β”‚            β”‚ What's the task?           β”‚
  β”‚   Transformer?  β”‚            β”‚  Binary class β†’ Sigmoid    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚  Multi-class  β†’ Softmax    β”‚
           β”‚                     β”‚  Regression   β†’ Linear     β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   β–Ό               β–Ό
  YES              NO
   β”‚               β”‚
   β–Ό               β–Ό
  GELU    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚  Worried about  β”‚
          β”‚  dead neurons?  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
           β–Ό               β–Ό
          YES              NO
           β”‚               β”‚
           β–Ό               β–Ό
      LeakyReLU          ReLU
       or ELU
```

---

## Files Generated

| File | Description |
|------|-------------|
| learned_functions.png | Final learned functions vs ground truth |
| loss_curves.png | Training loss curves over 500 epochs |
| gradient_flow.png | Gradient magnitude across layers (epoch 1) |
| gradient_flow_epochs.png | **NEW** Gradient flow at epochs 1, 100, 200 |
| gradient_evolution.png | **NEW** Gradient ratio evolution over training |
| hidden_activations.png | Activation distributions in the trained network |
| training_dynamics_functions.png | **NEW** Function learning over time |
| activation_evolution.png | **NEW** Activation distribution evolution |
| training_dynamics_summary.png | **NEW** Comprehensive training dynamics |
| exp1_gradient_flow.png | Gradient magnitude across layers |
| exp2_sparsity_dead_neurons.png | Sparsity and dead-neuron rates |
| exp2_activation_distributions.png | Activation value distributions |
| exp3_stability.png | Stability vs learning rate and depth |
| exp4_representational_heatmap.png | MSE heatmap for different targets |
| exp4_predictions.png | Actual predictions vs ground truth |
| summary_figure.png | Comprehensive summary visualization |

---

## References

1. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
2. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
3. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).
4. Ramachandran, P., et al. (2017). Searching for Activation Functions.
5. Nwankpa, C., et al. (2018). Activation Functions: Comparison of Trends in Practice and Research for Deep Learning.

---

*Tutorial generated by Orchestra Research Assistant*
*All experiments are reproducible with the provided code*