Key Concepts
- Know the benefits of learning rate decay and apply it to your optimization
- Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
- Use random mini-batches to accelerate convergence and improve optimization
[mathjax]
Optimization algorithms
Mini-batch gradient descent
Batch vs. mini-batch gradient descent
Vectorization allows you to efficiently compute on m examples.
\[
\underbrace{X}_{(n_x,m)}=[\underbrace{x^{(1)} \dots x^{(1000)}}_{X^{\{1\}}(n_x,1000)} | \underbrace{ x^{(1001)} \dots x^{(2000)} }_{X^{\{2\}} (n_x,1000) } | \dots | \underbrace{ \dots x^{(m)} }_{X^{\{5000\}} (n_x,1000) } ]\\
\underbrace{Y}_{(1,m)}=[\underbrace{y^{(1)} \dots y^{(1000)}}_{Y^{\{1\}}(1,1000)} | \underbrace{ y^{(1001)} \dots y^{(2000)} }_{Y^{\{2\}} (1,1000) } | \dots | \underbrace{ \dots y^{(m)} }_{Y^{\{5000\}} (1,1000) } ]\\
\text{What if } m=5{,}000{,}000? \Rightarrow \text{5000 mini-batches of 1000 each}\\
\text{mini-batch } t: X^{\{t\}},Y^{\{t\}}
\]
Mini-batch gradient descent
- for t=1,…,5000
- Forward prop on \(X^{\{t\}}\)
- \( Z^{[1]}=w^{[1]}X^{\{t\}}+b^{[1]}\)
- \( A^{[1]}=g^{[1]}(Z^{[1]})\)
- …
- \( A^{[L]}=g^{[L]}(Z^{[L]})\)
- Compute cost \( J^{\{t\}}=\frac1{1000}\sum_{i=1}^{1000} L(\underbrace{\hat y^{(i)} , y^{(i)}}_{\text{from } X^{\{t\}}, Y^{\{t\}} })+\frac{\lambda}{2 \times 1000}\sum_l{\|w^{[l]}\|_F^2}\)
- Backprop to compute gradients of cost \( J^{\{t\}}\) (using \((X^{\{t\}}, Y^{\{t\}})\))
- \( w^{[l]}:=w^{[l]}-\alpha d w^{[l]}, b^{[l]}:=b^{[l]}-\alpha db^{[l]}\)
- One pass through the training set (all 5000 mini-batches) = 1 epoch
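As a concrete sketch of this loop, here is one epoch in NumPy for a single sigmoid unit with cross-entropy loss (the one-layer model and the omitted regularization term are simplifications to keep forward and backward prop short):

```python
import numpy as np

def minibatch_gd_epoch(X, Y, w, b, alpha=0.01, batch_size=1000):
    """One epoch of mini-batch gradient descent for a single sigmoid unit
    with cross-entropy loss. X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    for t in range(0, m, batch_size):
        Xt, Yt = X[:, t:t + batch_size], Y[:, t:t + batch_size]  # (X^{t}, Y^{t})
        mt = Xt.shape[1]
        # Forward prop on the mini-batch only
        A = 1.0 / (1.0 + np.exp(-(w.T @ Xt + b)))  # sigmoid(Z)
        # Backprop: gradient of the cross-entropy cost J^{t}
        dZ = A - Yt                                # (1, mt)
        dw = (Xt @ dZ.T) / mt                      # (n_x, 1)
        db = float(np.sum(dZ)) / mt
        # One gradient descent step per mini-batch
        w, b = w - alpha * dw, b - alpha * db
    return w, b
```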
Understanding mini-batch gradient descent
Training with mini batch gradient descent
- Batch gradient descent:
  - plot cost J against # iterations: should decrease on every iteration
- Mini-batch gradient descent:
  - plot cost \(J^{\{t\}}\) against mini-batch number t: noisy, but trending downward
Choosing your mini-batch size
- If mini-batch size = m: Batch gradient descent\[
(X^{\{t\}},Y^{\{t\}})=(X,Y)
\]
  - Takes too long per iteration
  - Low noise, relatively big steps
- If mini-batch size = 1: Stochastic gradient descent (every example is its own mini-batch)\[
(X^{\{1\}},Y^{\{1\}})=(x^{(1)},y^{(1)}),\ (X^{\{2\}},Y^{\{2\}})=(x^{(2)},y^{(2)}),\ \dots
\]
  - Loses the speedup from vectorization
  - Noisy; never converges, just oscillates around the minimum
- In practice: something in between 1 and m (mini-batch size not too big, not too small)
  - Fastest learning
  - Keeps the speedup from vectorization (~1000 examples per batch)
  - Makes progress without processing the entire training set
- If the training set is small (m ≤ 2000): use batch gradient descent
- Typical mini-batch sizes: 64, 128, 256, 512, 1024
- Make sure each mini-batch fits in CPU/GPU memory
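The shuffle-and-partition recipe (recalled again in the programming-assignment notes at the end) can be sketched like this; the function name and signature here are my own, not necessarily the assignment's:

```python
import numpy as np

def shuffle_and_partition(X, Y, batch_size=64, seed=0):
    """Build random mini-batches: shuffle the columns of X and Y together,
    then partition into consecutive slices. The last batch may be smaller."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                      # step 1: shuffle
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, t:t + batch_size],          # step 2: partition
             Y_shuf[:, t:t + batch_size])
            for t in range(0, m, batch_size)]
```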
Exponentially weighted averages
Exponentially weighted moving average
\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
\beta=0.9 : \approx \text{average over the last 10 days' temperature}\\
\beta=0.98 : \approx \text{average over the last 50 days' temperature}\\
\beta=0.5 : \approx \text{average over the last 2 days' temperature}\\
\]
\[
v_t \approx \text{average over the last } \frac1{1-\beta} \text{ days' temperature}
\]
- When \(\beta\) is large: adapts more slowly, smoother curve
- When \(\beta\) is small: noisier, more oscillation
Understanding exponentially weighted averages
\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
v_{100}=0.9v_{99}+0.1\theta_{100}\\
v_{99}=0.9v_{98}+0.1\theta_{99}\\
v_{98}=0.9v_{97}+0.1\theta_{98}\\
v_{100}=0.1\theta_{100}
+0.1\times0.9\,\theta_{99}
+0.1\times(0.9)^2\theta_{98}
+0.1\times(0.9)^3\theta_{97}
+0.1\times(0.9)^4\theta_{96}
+\dots\\
0.9^{10}\approx 0.35 \approx \frac1e \quad\text{(the weights decay to about } \tfrac1e \text{ after 10 steps, hence } \approx \text{10 days' temperature)}\\
(1-\epsilon)^{\frac1\epsilon}\approx\frac1e \text{, with } \epsilon=1-\beta=0.1 \text{ here}
\]
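A quick numerical check of the \(1/e\) claim above (my own sketch, not course code):

```python
import numpy as np

beta = 0.9
k = np.arange(11)
weights = (1 - beta) * beta ** k   # weight on theta_{t-k} inside v_t
print(beta ** 10)                  # 0.3487..., close to 1/e = 0.3679...
print(weights.sum())               # 0.686...; the full infinite sum is 1
```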
Implementing exponentially weighted averages
\[
v_0=0\\
v_1=\beta v_0+(1-\beta)\theta_1\\
v_2=\beta v_1+(1-\beta)\theta_2\\
v_3=\beta v_2+(1-\beta)\theta_3\\
\dots
\]
- \(v_\theta = 0\)
- Repeat{
- Get next \(\theta_t\)
- \(v_\theta := \beta v_\theta+(1-\beta)\theta_t\)
- }
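The same loop as a small Python function (a sketch; `readings` stands in for the stream of \(\theta_t\) values):

```python
def ewma(readings, beta=0.9):
    """Exponentially weighted moving average of a stream of readings."""
    v = 0.0
    out = []
    for theta in readings:                  # "get next theta_t"
        v = beta * v + (1 - beta) * theta   # v := beta*v + (1-beta)*theta_t
        out.append(v)
    return out
```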
Bias correction in exponentially weighted averages
Bias correction
\[
\frac{v_t}{1-\beta^t}\\
\]
- Useful when you are worried about the bias during the initial phase of the estimate
- Lets you obtain a good estimate earlier on
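A sketch of the bias-corrected version; since \(\beta^t \to 0\) as t grows, the correction only matters for small t:

```python
def ewma_corrected(readings, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta^t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(readings, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))     # removes the bias toward 0 at small t
    return out
```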
Gradient descent with momentum
Gradient descent example
- Gradient descent
  - takes many steps
  - oscillates up and down while making slow progress toward the minimum
  - can't use a big \(\alpha\) (the oscillations would diverge)
- Momentum
  - damps the vertical oscillations
  - moves faster in the horizontal direction
Implementation details
- On iteration t:
- compute \(dW,db\) on current mini-batch
- \( v_{dW}=\beta v_{dW} + \underbrace{(1-\beta)}_{\text{sometimes omitted}} dW\)
- \( v_{db}=\beta v_{db} + \underbrace{ (1-\beta) }_{\text{sometimes omitted}} db\)
- \( W:=W-\alpha v_{dW}, b:=b-\alpha v_{db} \)
- Analogy: a ball rolling down a bowl
- \(dW,db\): acceleration
- \(v\): velocity
- \(\beta\): friction
- Hyperparameters: \( \alpha, \beta \)
- \( \beta=0.9 \): averages the gradients over roughly the last 10 iterations
- Omitting \( 1-\beta \) changes the scale of \(dW,db\) accumulated in \(v\), so \(\alpha\) has to be retuned as well
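One momentum update step as a sketch (the function packaging is my own; it works on floats or NumPy arrays):

```python
def momentum_step(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum."""
    vW = beta * vW + (1 - beta) * dW   # velocity: running average of gradients
    vb = beta * vb + (1 - beta) * db   # (1 - beta) is sometimes omitted;
    W = W - alpha * vW                 # if so, alpha must be retuned
    b = b - alpha * vb
    return W, b, vW, vb
```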
RMSprop (root mean square prop)
- On iteration t: compute \(dW,db\) on current mini-batch
- \[
\begin{align*}
S_{dW}&=\beta S_{dW} + (1-\beta) (dW)^2 \text{ (elementwise)} : \text{small}\\
S_{db}&=\beta S_{db} + (1-\beta) (db)^2 : \text{large}\\
W&:=W-\alpha \frac{dW}{\sqrt{S_{dW}}+\epsilon}\\
b&:=b-\alpha \frac{db}{\sqrt{S_{db}}+\epsilon}
\end{align*}
\]
- In practice a small \(\epsilon\) (e.g. \(10^{-8}\)) is added to the denominators so you never divide by a value near zero
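A corresponding RMSprop update step as a sketch (the default values here are illustrative, not prescribed by the course):

```python
import numpy as np

def rmsprop_step(W, b, dW, db, sW, sb, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop step: divide each gradient by a running RMS of itself,
    so directions with large gradients get damped and small ones sped up."""
    sW = beta * sW + (1 - beta) * dW ** 2        # elementwise square
    sb = beta * sb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sW) + eps)     # eps avoids division by ~0
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb
```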
Adam optimization algorithm
Adam = adaptive moment estimation
\[
V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0\\
\text{momentum: } \beta_1, \quad \text{RMSprop: } \beta_2
\]
on iteration t:
Compute \(dW,db\) using current mini-batch
\[
\begin{align*}
V_{dW}&=\beta_1 V_{dW} + (1-\beta_1) dW\\
V_{db}&=\beta_1 V_{db} + (1-\beta_1) db\\
S_{dW}&=\beta_2 S_{dW} + (1-\beta_2) (dW)^2 \text{ (elementwise)}\\
S_{db}&=\beta_2 S_{db} + (1-\beta_2) db^2\\
V_{dW}^{corrected}&=\frac{V_{dW}}{1-\beta_1^t}\\
V_{db}^{corrected}&=\frac{V_{db}}{1-\beta_1^t}\\
S_{dW}^{corrected}&=\frac{S_{dW}}{1-\beta_2^t}\\
S_{db}^{corrected}&=\frac{S_{db}}{1-\beta_2^t}\\
W&:=W-\alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\
b&:=b-\alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}
\end{align*}
\]
Hyperparameters choice:\[
\begin{align*}
\alpha&: \text{ needs to be tuned}\\
\beta_1&: 0.9 \rightarrow dW\\
\beta_2&: 0.999 \rightarrow (dW)^2\\
\epsilon&: 10^{-8} \rightarrow \text{default; rarely needs tuning}
\end{align*}
\]
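The full Adam update for a single parameter as a sketch (showing W only; b is updated with the identical formula):

```python
import numpy as np

def adam_step(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter W; t is the 1-based step count."""
    vW = beta1 * vW + (1 - beta1) * dW            # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2       # RMSprop term (elementwise)
    v_hat = vW / (1 - beta1 ** t)                 # bias correction
    s_hat = sW / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, vW, sW
```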
Learning rate decay
Learning rate decay
1 epoch = 1 pass through all data
\(\alpha=\frac1{1+(\text{decayRate}) \times (\text{epochNumber})}\alpha_0\)
With \(\alpha_0=0.2\) and decayRate = 1:

| Epoch | \(\alpha\) |
|-------|------------|
| 1 | 0.1 |
| 2 | 0.067 |
| 3 | 0.05 |
| 4 | 0.04 |
Other learning rate decay methods
\[
\alpha=0.95^{\text{epochNum}}\,\alpha_0 \quad\text{(exponential decay)}\\
\alpha=\frac{k}{\sqrt{\text{epochNum}}}\,\alpha_0 \text{ or } \frac{k}{\sqrt{t}}\,\alpha_0\\
\]
Plot of \(\alpha\) against t: discrete staircase (drops in steps)
- Manual decay
  - Decreasing \(\alpha\) by hand, e.g. hour by hour or day by day; feasible when training only a small number of models
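A sketch collecting these schedules into one helper (the `schedule` keyword and function name are my own):

```python
import numpy as np

def decayed_alpha(alpha0, epoch, decay_rate=1.0, schedule="inverse", k=1.0):
    """A few common learning-rate decay schedules."""
    if schedule == "inverse":        # alpha0 / (1 + decayRate * epochNum)
        return alpha0 / (1 + decay_rate * epoch)
    if schedule == "exponential":    # 0.95^epochNum * alpha0
        return 0.95 ** epoch * alpha0
    if schedule == "sqrt":           # (k / sqrt(epochNum)) * alpha0
        return k / np.sqrt(epoch) * alpha0
    raise ValueError(schedule)

# Reproduces the table above (alpha0 = 0.2, decayRate = 1):
print([round(decayed_alpha(0.2, e), 3) for e in range(1, 5)])
# -> [0.1, 0.067, 0.05, 0.04]
```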
The problem of local optima
Local optima in neural networks
Problem of plateaus
- Unlikely to get stuck in a bad local optimum
  - In a reasonably high-dimensional space, most zero-gradient points are saddle points rather than local optima, so you rarely get trapped
- Plateaus can make learning slow
  - Momentum, RMSprop, and Adam help here
Programming assignment
Optimization
Gradient descent: What you should remember:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter \(\alpha\).
- With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
Mini-Batch Gradient descent: What you should remember:
- Shuffling and Partitioning are the two steps required to build mini-batches
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
Momentum: What you should remember:
- Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter \(\beta\) and a learning rate \(\alpha\).