Key Concepts
- Know the benefits of learning rate decay and apply it to your optimization
- Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
- Use random mini-batches to accelerate convergence and improve optimization
[mathjax]
Optimization algorithms
Mini-batch gradient descent
Batch vs. mini-batch gradient descent
Vectorization allows you to efficiently compute on m examples.
\[
\underbrace{X}_{(n_x,m)}=[\underbrace{x^{(1)} \dots x^{(1000)}}_{X^{\{1\}}(n_x,1000)} | \underbrace{ x^{(1001)} \dots x^{(2000)} }_{X^{\{2\}} (n_x,1000) } | \dots | \underbrace{ \dots x^{(m)} }_{X^{\{5000\}} (n_x,1000) } ]\\
\underbrace{Y}_{(1,m)}=[\underbrace{y^{(1)} \dots y^{(1000)}}_{Y^{\{1\}}(1,1000)} | \underbrace{ y^{(1001)} \dots y^{(2000)} }_{Y^{\{2\}} (1,1000) } | \dots | \underbrace{ \dots y^{(m)} }_{Y^{\{5000\}} (1,1000) } ]\\
\text{What if } m=5{,}000{,}000? \Rightarrow \text{5000 mini-batches of 1000 each}\\
\text{mini-batch } t: X^{\{t\}},Y^{\{t\}}
\]
Mini-batch gradient descent
- for t=1,…,5000
- Forward prop on \(X^{\{t\}}\)
- \( Z^{[1]}=w^{[1]}X^{\{t\}}+b^{[1]}\)
- \( A^{[1]}=g^{[1]}(Z^{[1]})\)
- …
- \( A^{[L]}=g^{[L]}(Z^{[L]})\)
- Compute cost \( J^{\{t\}}=\frac1{1000}\sum_{i=1}^{1000} L(\underbrace{\hat y^{(i)} , y^{(i)}}_{\text{from } X^{\{t\}}, Y^{\{t\}} })+\frac{\lambda}{2 \times 1000}\sum_l{\|w^{[l]}\|_F^2}\)
- Backprop to compute gradients of cost \( J^{\{t\}}\) (using \((X^{\{t\}}, Y^{\{t\}})\))
- \( w^{[l]}:=w^{[l]}-\alpha d w^{[l]}, b^{[l]}:=b^{[l]}-\alpha db^{[l]}\)
- One pass through the training set (all 5000 mini-batches) = 1 epoch
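As a concrete sketch of this loop, here is one epoch in NumPy for a single sigmoid unit with cross-entropy loss (the one-layer model and the omitted regularization term are simplifications to keep forward and backward prop short):

```python
import numpy as np

def minibatch_gd_epoch(X, Y, w, b, alpha=0.01, batch_size=1000):
    """One epoch of mini-batch gradient descent for a single sigmoid unit
    with cross-entropy loss. X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    for t in range(0, m, batch_size):
        Xt, Yt = X[:, t:t + batch_size], Y[:, t:t + batch_size]  # (X^{t}, Y^{t})
        mt = Xt.shape[1]
        # Forward prop on the mini-batch only
        A = 1.0 / (1.0 + np.exp(-(w.T @ Xt + b)))  # sigmoid(Z)
        # Backprop: gradient of the cross-entropy cost J^{t}
        dZ = A - Yt                                # (1, mt)
        dw = (Xt @ dZ.T) / mt                      # (n_x, 1)
        db = float(np.sum(dZ)) / mt
        # One gradient descent step per mini-batch
        w, b = w - alpha * dw, b - alpha * db
    return w, b
```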
Understanding mini-batch gradient descent
Training with mini batch gradient descent
- Batch gradient descent:
  - plot cost J against # iterations: should decrease on every iteration
- Mini-batch gradient descent:
  - plot cost \(J^{\{t\}}\) against mini-batch number t: noisy, but trending downward
Choosing your mini-batch size
- If mini-batch size = m: Batch gradient descent\[
(X^{\{t\}},Y^{\{t\}})=(X,Y)
\]
  - Takes too long per iteration
  - Low noise, relatively big steps
- If mini-batch size = 1: Stochastic gradient descent (every example is its own mini-batch)\[
(X^{\{1\}},Y^{\{1\}})=(x^{(1)},y^{(1)}),\ (X^{\{2\}},Y^{\{2\}})=(x^{(2)},y^{(2)}),\ \dots
\]
  - Loses the speedup from vectorization
  - Noisy; never converges, just oscillates around the minimum
- In practice: something in between 1 and m (mini-batch size not too big, not too small)
  - Fastest learning
  - Keeps the speedup from vectorization (~1000 examples per batch)
  - Makes progress without processing the entire training set
- If the training set is small (m ≤ 2000): use batch gradient descent
- Typical mini-batch sizes: 64, 128, 256, 512, 1024
- Make sure each mini-batch fits in CPU/GPU memory
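The shuffle-and-partition recipe (recalled again in the programming-assignment notes at the end) can be sketched like this; the function name and signature here are my own, not necessarily the assignment's:

```python
import numpy as np

def shuffle_and_partition(X, Y, batch_size=64, seed=0):
    """Build random mini-batches: shuffle the columns of X and Y together,
    then partition into consecutive slices. The last batch may be smaller."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                      # step 1: shuffle
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, t:t + batch_size],          # step 2: partition
             Y_shuf[:, t:t + batch_size])
            for t in range(0, m, batch_size)]
```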
Exponentially weighted averages
Exponentially weighted moving average
\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
\beta=0.9 : \approx \text{average over the last 10 days' temperature}\\
\beta=0.98 : \approx \text{average over the last 50 days' temperature}\\
\beta=0.5 : \approx \text{average over the last 2 days' temperature}\\
\]
\[
v_t \approx \text{average over the last } \frac1{1-\beta} \text{ days' temperature}
\]
- When \(\beta\) is large: adapts more slowly, smoother curve
- When \(\beta\) is small: noisier, more oscillation
Understanding exponentially weighted averages
\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
v_{100}=0.9v_{99}+0.1\theta_{100}\\
v_{99}=0.9v_{98}+0.1\theta_{99}\\
v_{98}=0.9v_{97}+0.1\theta_{98}\\
v_{100}=0.1\theta_{100}
+0.1\times0.9\,\theta_{99}
+0.1\times(0.9)^2\theta_{98}
+0.1\times(0.9)^3\theta_{97}
+0.1\times(0.9)^4\theta_{96}
+\dots\\
0.9^{10}\approx 0.35 \approx \frac1e \quad\text{(the weights decay to about } \tfrac1e \text{ after 10 steps, hence } \approx \text{10 days' temperature)}\\
(1-\epsilon)^{\frac1\epsilon}\approx\frac1e \text{, with } \epsilon=1-\beta=0.1 \text{ here}
\]
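A quick numerical check of the \(1/e\) claim above (my own sketch, not course code):

```python
import numpy as np

beta = 0.9
k = np.arange(11)
weights = (1 - beta) * beta ** k   # weight on theta_{t-k} inside v_t
print(beta ** 10)                  # 0.3487..., close to 1/e = 0.3679...
print(weights.sum())               # 0.686...; the full infinite sum is 1
```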
Implementing exponentially weighted averages
\[
v_0=0\\
v_1=\beta v_0+(1-\beta)\theta_1\\
v_2=\beta v_1+(1-\beta)\theta_2\\
v_3=\beta v_2+(1-\beta)\theta_3\\
\dots
\]
- \(v_\theta = 0\)
- Repeat{
- Get next \(\theta_t\)
- \(v_\theta := \beta v_\theta+(1-\beta)\theta_t\)
- }
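The same loop as a small Python function (a sketch; `readings` stands in for the stream of \(\theta_t\) values):

```python
def ewma(readings, beta=0.9):
    """Exponentially weighted moving average of a stream of readings."""
    v = 0.0
    out = []
    for theta in readings:                  # "get next theta_t"
        v = beta * v + (1 - beta) * theta   # v := beta*v + (1-beta)*theta_t
        out.append(v)
    return out
```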
Bias correction in exponentially weighted averages
Bias correction
\[
\frac{v_t}{1-\beta^t}\\
\]
- Useful when you are worried about the bias during the initial phase of the estimate
- Lets you obtain a good estimate earlier on
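A sketch of the bias-corrected version; since \(\beta^t \to 0\) as t grows, the correction only matters for small t:

```python
def ewma_corrected(readings, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta^t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(readings, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))     # removes the bias toward 0 at small t
    return out
```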
Gradient descent with momentum
Gradient descent example
- Gradient descent
  - takes many steps
  - oscillates up and down while making slow progress toward the minimum
  - can't use a big \(\alpha\) (the oscillations would diverge)
- Momentum
  - damps the vertical oscillations
  - moves faster in the horizontal direction
Implementation details
- On iteration t:
- compute \(dW,db\) on current mini-batch
- \( v_{dW}=\beta v_{dW} + \underbrace{(1-\beta)}_{\text{sometimes omitted}} dW\)
- \( v_{db}=\beta v_{db} + \underbrace{ (1-\beta) }_{\text{sometimes omitted}} db\)
- \( W:=W-\alpha v_{dW}, b:=b-\alpha v_{db} \)
- Analogy: a ball rolling down a bowl
- \(dW,db\): acceleration
- \(v\): velocity
- \(\beta\): friction
- Hyperparameters: \( \alpha, \beta \)
- \( \beta=0.9 \): averages the gradients over roughly the last 10 iterations
- Omitting \( 1-\beta \) changes the scale of \(dW,db\) accumulated in \(v\), so \(\alpha\) has to be retuned as well
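One momentum update step as a sketch (the function packaging is my own; it works on floats or NumPy arrays):

```python
def momentum_step(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum."""
    vW = beta * vW + (1 - beta) * dW   # velocity: running average of gradients
    vb = beta * vb + (1 - beta) * db   # (1 - beta) is sometimes omitted;
    W = W - alpha * vW                 # if so, alpha must be retuned
    b = b - alpha * vb
    return W, b, vW, vb
```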
RMSprop (root mean square prop)
- On iteration t: compute \(dW,db\) on current mini-batch
- \[
\begin{align*}
S_{dW}&=\beta S_{dW} + (1-\beta) (dW)^2 \text{ (elementwise)} : \text{small}\\
S_{db}&=\beta S_{db} + (1-\beta) (db)^2 : \text{large}\\
W&:=W-\alpha \frac{dW}{\sqrt{S_{dW}}+\epsilon}\\
b&:=b-\alpha \frac{db}{\sqrt{S_{db}}+\epsilon}
\end{align*}
\]
- In practice a small \(\epsilon\) (e.g. \(10^{-8}\)) is added to the denominators so you never divide by a value near zero
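A corresponding RMSprop update step as a sketch (the default values here are illustrative, not prescribed by the course):

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sW, sb, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop step: divide each gradient by a running RMS of itself,
    so directions with large gradients get damped and small ones sped up."""
    sW = beta * sW + (1 - beta) * dW ** 2        # elementwise square
    sb = beta * sb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sW) + eps)     # eps avoids division by ~0
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb
```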
Adam optimization algorithm
Adam = adaptive moment estimation
\[
V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0\\
\text{momentum: } \beta_1, \quad \text{RMSprop: } \beta_2
\]
on iteration t:
Compute \(dW,db\) using current mini-batch
\[
\begin{align*}
V_{dW}&=\beta_1 V_{dW} + (1-\beta_1) dW\\
V_{db}&=\beta_1 V_{db} + (1-\beta_1) db\\
S_{dW}&=\beta_2 S_{dW} + (1-\beta_2) (dW)^2 \text{ (elementwise)}\\
S_{db}&=\beta_2 S_{db} + (1-\beta_2) db^2\\
V_{dW}^{corrected}&=\frac{V_{dW}}{1-\beta_1^t}\\
V_{db}^{corrected}&=\frac{V_{db}}{1-\beta_1^t}\\
S_{dW}^{corrected}&=\frac{S_{dW}}{1-\beta_2^t}\\
S_{db}^{corrected}&=\frac{S_{db}}{1-\beta_2^t}\\
W&:=W-\alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\
b&:=b-\alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}
\end{align*}
\]
Hyperparameters choice:\[
\begin{align*}
\alpha&: \text{ needs to be tuned}\\
\beta_1&: 0.9 \rightarrow dW\\
\beta_2&: 0.999 \rightarrow (dW)^2\\
\epsilon&: 10^{-8} \rightarrow \text{default; rarely needs tuning}
\end{align*}
\]
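The full Adam update for a single parameter as a sketch (showing W only; b is updated with the identical formula):

```python
import numpy as np

def adam_step(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter W; t is the 1-based step count."""
    vW = beta1 * vW + (1 - beta1) * dW            # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2       # RMSprop term (elementwise)
    v_hat = vW / (1 - beta1 ** t)                 # bias correction
    s_hat = sW / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, vW, sW
```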
Learning rate decay
Learning rate decay
1 epoch = 1 pass through all data
\(\alpha=\frac1{1+(\text{decayRate}) \times (\text{epochNumber})}\alpha_0\)
With \(\alpha_0=0.2\) and decayRate = 1:

| Epoch | \(\alpha\) |
|-------|------------|
| 1 | 0.1 |
| 2 | 0.067 |
| 3 | 0.05 |
| 4 | 0.04 |
Other learning rate decay methods
\[
\alpha=0.95^{\text{epochNum}}\,\alpha_0 \quad\text{(exponential decay)}\\
\alpha=\frac{k}{\sqrt{\text{epochNum}}}\,\alpha_0 \text{ or } \frac{k}{\sqrt{t}}\,\alpha_0\\
\]
Plot of \(\alpha\) against t: discrete staircase (drops in steps)
- Manual decay
  - Decreasing \(\alpha\) by hand, e.g. hour by hour or day by day; feasible when training only a small number of models
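A sketch collecting these schedules into one helper (the `schedule` keyword and function name are my own):

```python
import numpy as np

def decayed_alpha(alpha0, epoch, decay_rate=1.0, schedule="inverse", k=1.0):
    """A few common learning-rate decay schedules."""
    if schedule == "inverse":        # alpha0 / (1 + decayRate * epochNum)
        return alpha0 / (1 + decay_rate * epoch)
    if schedule == "exponential":    # 0.95^epochNum * alpha0
        return 0.95 ** epoch * alpha0
    if schedule == "sqrt":           # (k / sqrt(epochNum)) * alpha0
        return k / np.sqrt(epoch) * alpha0
    raise ValueError(schedule)

# Reproduces the table above (alpha0 = 0.2, decayRate = 1):
print([round(decayed_alpha(0.2, e), 3) for e in range(1, 5)])
# -> [0.1, 0.067, 0.05, 0.04]
```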
The problem of local optima
Local optima in neural networks
Problem of plateaus
- Unlikely to get stuck in a bad local optimum
  - In a reasonably high-dimensional space, most zero-gradient points are saddle points rather than local optima, so you rarely get trapped
- Plateaus can make learning slow
  - Momentum, RMSprop, and Adam help here
Programming assignment
Optimization
Gradient descent: What you should remember:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter \(\alpha\).
- With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
Mini-Batch Gradient descent: What you should remember:
- Shuffling and Partitioning are the two steps required to build mini-batches
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
Momentum: What you should remember:
- Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter \(\beta\) and a learning rate \(\alpha\).