
DL [Course 2/5] Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization [Week 2/3] Optimization algorithms

Key Concepts

  • Know the benefits of learning rate decay and apply it to your optimization
  • Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
  • Use random minibatches to accelerate the convergence and improve the optimization

[mathjax]

Notes from the course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (deeplearning.ai)

Optimization algorithms

Mini-batch gradient descent

Batch vs. mini-batch gradient descent

Vectorization allows you to efficiently compute on m examples.

\[
\underbrace{X}_{(n_x,m)}=[\underbrace{x^{(1)} \dots x^{(1000)}}_{X^{\{1\}}(n_x,1000)} | \underbrace{ x^{(1001)} \dots x^{(2000)} }_{X^{\{2\}} (n_x,1000) } | \dots | \underbrace{ \dots x^{(m)} }_{X^{\{5,000\}} (n_x,1000) } ]\\
\underbrace{Y}_{(1,m)}=[\underbrace{y^{(1)} \dots y^{(1000)}}_{Y^{\{1\}}(1,1000)} | \underbrace{ y^{(1001)} \dots y^{(2000)} }_{Y^{\{2\}} (1,1000) } | \dots | \underbrace{ \dots y^{(m)} }_{Y^{\{5,000\}} (1,1000) } ]\\
\text{What if } m=5{,}000{,}000? \Rightarrow 5{,}000 \text{ mini-batches of } 1{,}000 \text{ examples each}\\
\text{mini-batch } t: \ X^{\{t\}},\, Y^{\{t\}}
\]

Mini-batch gradient descent

  • for t=1,…,5000
    • Forward prop on \(X^{\{t\}}\)
      • \( Z^{[1]}=w^{[1]}X^{\{t\}}+b^{[1]}\)
      • \( A^{[1]}=g^{[1]}(Z^{[1]})\)
      • \( A^{[L]}=g^{[L]}(Z^{[L]})\)
    • Compute cost \( J^{\{t\}}=\frac1{1000}\sum_{i=1}^{1000} L(\underbrace{\hat y^{(i)} , y^{(i)}}_{\text{from } X^{\{t\}}, Y^{\{t\}} })+\frac{\lambda}{2 \times 1000}\sum_l{\|w^{[l]}\|_F^2}\)
    • Backprop to compute the gradients of cost \( J^{\{t\}}\) (using \( (X^{\{t\}}, Y^{\{t\}}) \))
    • \( w^{[l]}:=w^{[l]}-\alpha\, dw^{[l]}, \quad b^{[l]}:=b^{[l]}-\alpha\, db^{[l]}\) (see the code sketch below)
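A minimal, self-contained NumPy sketch of this loop, using logistic regression (a single sigmoid unit) as a stand-in for the L-layer network so the forward and backward passes fit in a few lines; the function and variable names are my own, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_epoch_minibatch_gd(X, Y, W, b, alpha=0.01, batch_size=1000):
    """One pass over (X, Y) with mini-batch gradient descent.
    Shapes: X (n_x, m), Y (1, m), W (1, n_x), b scalar."""
    m = X.shape[1]
    for t in range(0, m, batch_size):
        X_t = X[:, t:t + batch_size]              # X^{t}
        Y_t = Y[:, t:t + batch_size]              # Y^{t}
        m_t = X_t.shape[1]
        A = sigmoid(W @ X_t + b)                  # forward prop on the mini-batch
        cost = -np.mean(Y_t * np.log(A) + (1 - Y_t) * np.log(1 - A))  # J^{t}
        dZ = A - Y_t                              # backprop
        dW = (dZ @ X_t.T) / m_t
        db = np.sum(dZ) / m_t
        W = W - alpha * dW                        # gradient descent update
        b = b - alpha * db
    return W, b, cost
```

With m = 5,000,000 and batch_size = 1000, this makes 5,000 parameter updates per epoch instead of one.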

Understanding mini-batch gradient descent

Training with mini-batch gradient descent

  • Batch gradient descent:
    • Plot cost \(J\) vs. # iterations: it should decrease on every iteration
  • Mini-batch gradient descent:
    • Plot cost \(J^{\{t\}}\) vs. mini-batch number \(t\): noisy, but trending downward

Choosing your mini-batch size

  • If mini-batch size=m: Batch gradient descent\[
    (X^{\{t\}},Y^{\{t\}})=(X,Y)
    \]
    • Batch gradient descent (mini-batch size = m)
      • Takes too long per iteration when m is large
      • Low noise, relatively big steps
  • If mini-batch size=1: Stochastic gradient descent\[
    (X^{\{t\}},Y^{\{t\}})=(x^{(1)},y^{(1)}),\ (x^{(2)},y^{(2)}),\dots \quad \text{(every example is its own mini-batch)}
    \]
    • Stochastic gradient descent
      • Loses the speedup from vectorization
      • Noisy; never quite converges (keeps oscillating around the minimum)
  • In practice: something in between 1 and m
    • In-between (mini-batch size not too big, not too small)
      • Fastest learning
      • Vectorization(~1000)
      • make progress without processing entire training set
  • If small training set: use batch gradient descent
    • m<=2000
  • Typical mini-batch size:
    • 64,128,256,512,1024
  • Make sure each mini-batch fits in CPU/GPU memory (see the sketch below)
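A sketch of building mini-batches of a chosen power-of-two size (shuffle, then partition; the shuffle step comes from the programming assignment notes later in this post). The function name random_mini_batches mirrors the assignment, but the details here are my own assumptions:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the (X, Y) columns, then partition them into mini-batches.
    Shapes: X (n_x, m), Y (1, m). The last mini-batch may be smaller than the rest."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)              # step 1: shuffle
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]
    mini_batches = []
    for t in range(0, m, mini_batch_size):        # step 2: partition
        mini_batches.append((X_shuffled[:, t:t + mini_batch_size],
                             Y_shuffled[:, t:t + mini_batch_size]))
    return mini_batches
```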

Exponentially weighted averages

Exponentially weighted moving average

\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
\beta=0.9 : \approx \text{average over the last 10 days' temperature}\\
\beta=0.98 : \approx \text{average over the last 50 days' temperature}\\
\beta=0.5 : \approx \text{average over the last 2 days' temperature}
\]

\[
v_t \approx \text{average over the last } \frac1{1-\beta} \text{ days' temperature}
\]

  • Larger \(\beta\): smoother curve, but adapts more slowly.
  • Smaller \(\beta\): noisier, more oscillation, but adapts faster.

Understanding exponentially weighted averages

\[
v_t = \beta v_{t-1}+(1-\beta)\theta_t\\
v_{100}=0.9v_{99}+0.1\theta_{100}\\
v_{99}=0.9v_{98}+0.1\theta_{99}\\
v_{98}=0.9v_{97}+0.1\theta_{98}\\
v_{100}=0.1\theta_{100}
+0.1\times0.9\theta_{99}
+0.1\times(0.9)^2\theta_{98}
+0.1\times(0.9)^3\theta_{97}
+0.1\times(0.9)^4\theta_{96}
+\dots\\
0.9^{10}\approx 0.35 \approx \frac1e \ \Rightarrow\ \text{roughly a 10-day temperature average}\\
\text{in general } (1-\epsilon)^{\frac1\epsilon} \approx \frac1e, \quad \text{here } \epsilon=1-\beta=0.1
\]

Implementing exponentially weighted averages

\[
v_0=0\\
v_1=\beta v_0+(1-\beta)\theta_1\\
v_2=\beta v_1+(1-\beta)\theta_2\\
v_3=\beta v_2+(1-\beta)\theta_3\\
\dots
\]

  • \(v_\theta = 0\)
  • Repeat{
    • Get next \(\theta_t\)
    • \(v_\theta := \beta v_\theta+(1-\beta)\theta_t\)
  • }
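A direct transcription of this loop into Python (the θ sequence here is a made-up temperature series, just for illustration; the helper name is mine):

```python
import numpy as np

def exp_weighted_average(thetas, beta=0.9):
    """Running exponentially weighted average of a sequence of values."""
    v = 0.0                                  # v_0 = 0
    averages = []
    for theta in thetas:                     # "get next theta_t"
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return np.array(averages)

temps = 15 + 10 * np.sin(np.linspace(0, 6, 365)) + np.random.randn(365)
smoothed = exp_weighted_average(temps, beta=0.9)   # ~1/(1-0.9) = 10-day average
```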

Bias correction in exponentially weighted averages

Bias correction

\[
\frac{v_t}{1-\beta^t}\\
\]

  • Use this when the bias of the early estimates (caused by initializing \(v_0=0\)) is a concern
  • It gives you a good estimate earlier on; for large \(t\), \(\beta^t \approx 0\) and the correction has almost no effect (see the sketch below)
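Continuing the sketch above, bias correction only changes the value you read out, not the running average itself (helper name is mine):

```python
def exp_weighted_average_corrected(thetas, beta=0.9):
    """Bias-corrected exponentially weighted average: read out v_t / (1 - beta^t)."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # correction matters only for small t
    return corrected
```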

Gradient descent with momentum

Gradient descent example

  • Gradient descent
    • Takes a lot of steps
    • Oscillates up and down, making slow progress toward the minimum
    • Can't use a large \(\alpha\) (the oscillations would get worse)
  • Momentum
    • Damps the vertical oscillations
    • Moves faster in the horizontal direction, toward the minimum

Implementation details

  • On iteration t:
    • compute \(dW,db\) on current mini-batch
    • \( v_{dW}=\beta v_{dW} + \underbrace{(1-\beta)}_{\text{sometimes omitted}} dW\)
    • \( v_{db}=\beta v_{db} + \underbrace{ (1-\beta) }_{\text{sometimes omitted}} db\)
    • \( W:=W-\alpha v_{dW}, \quad b:=b-\alpha v_{db} \)
  • Analogy: a ball rolling down a bowl
    • \(dW,db\): acceleration
    • \(v\): velocity
    • \(\beta\): friction
  • Hyperparameters: \( \alpha, \beta \)
    • \( \beta=0.9 \): roughly averages the gradients over the last 10 iterations
    • If the \(1-\beta\) factor is omitted, \(v_{dW}, v_{db}\) end up scaled by \(\frac1{1-\beta}\), so \(\alpha\) has to be retuned accordingly (see the sketch below)
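A sketch of a single momentum step for one parameter matrix W (the same code applies to b); the shapes and helper name are assumptions for illustration:

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update for a parameter W."""
    v_dW = beta * v_dW + (1 - beta) * dW     # velocity; (1 - beta) is sometimes omitted
    W = W - alpha * v_dW
    return W, v_dW

W = np.random.randn(3, 2)                    # toy parameter
v_dW = np.zeros_like(W)                      # initialize velocity to zero
dW = np.random.randn(3, 2)                   # stand-in for the mini-batch gradient
W, v_dW = momentum_step(W, dW, v_dW)
```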

RMSprop (root mean square prop)

  • On iteration t: compute \(dW,db\) on the current mini-batch (a code sketch follows below)
    • \[
      \begin{align*}
      S_{dW}&=\beta S_{dW} + (1-\beta)\, dW^{2} \text{ (element-wise square)} : \text{small (horizontal direction)}\\
      S_{db}&=\beta S_{db} + (1-\beta)\, db^2 : \text{large (vertical, oscillating direction)}\\
      W&:=W-\alpha \frac{dW}{\sqrt{S_{dW}}+\epsilon}\\
      b&:=b-\alpha \frac{db}{\sqrt{S_{db}}+\epsilon}
      \end{align*}
      \]
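A sketch of one RMSprop step for W (the same form applies to b); as in the Adam formulas below, a small ε keeps the division numerically stable. The helper name and defaults are assumptions:

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta=0.999, epsilon=1e-8):
    """One RMSprop update: divide the gradient by the root of its running mean square."""
    S_dW = beta * S_dW + (1 - beta) * dW ** 2        # element-wise square
    W = W - alpha * dW / (np.sqrt(S_dW) + epsilon)   # large S => small effective step
    return W, S_dW
```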

Adam optimization algorithm

Adam = adaptive moment estimation

\[
V_{dw}=0, S_{dw}=0, V_{db}=0, S_{db}=0\\
\text{momentum: } \beta_1, \quad \text{RMSprop: } \beta_2
\]
on iteration t:
Compute \(dW,db\) using current mini-batch
\[
\begin{align*}
V_{dW}&=\beta_1 V_{dW} + (1-\beta_1) dW\\
V_{db}&=\beta_1 V_{db} + (1-\beta_1) db\\ S_{dW}&=\beta_2 S_{dW} + (1-\beta_2) dW^{2(elementwise)}\\
S_{db}&=\beta_2 S_{db} + (1-\beta_2) db^2\\
V_{dW}^{corrected}&=\frac{V_{dW}}{1-\beta_1^t}\\
V_{db}^{corrected}&=\frac{V_{db}}{1-\beta_1^t}\\
S_{dW}^{corrected}&=\frac{S_{dW}}{1-\beta_2^t}\\
S_{db}^{corrected}&=\frac{S_{db}}{1-\beta_2^t}\\
W&:=W-\alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\
b&:=b-\alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}
\end{align*}
\]
Hyperparameter choices:\[
\begin{align*}
\alpha&: \text{ needs to be tuned}\\
\beta_1&: 0.9 \rightarrow \text{(for } dW\text{)}\\
\beta_2&: 0.999 \rightarrow \text{(for } dW^2\text{)}\\
\epsilon&: 10^{-8} \rightarrow \text{default; rarely needs tuning}
\end{align*}
\]
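Putting the formulas together, a sketch of one Adam step for W at iteration t (the same code applies to b; the function and variable names are mine):

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update: momentum + RMSprop with bias correction."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW           # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # RMSprop term (element-wise square)
    v_corr = v_dW / (1 - beta1 ** t)                 # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return W, v_dW, s_dW
```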

Learning rate decay

Learning rate decay

1 epoch = 1 pass through all data

\(\alpha=\frac1{1+(\text{decayRate}) \times (\text{epochNumber})}\,\alpha_0\)

Example (with \(\alpha_0=0.2\) and decayRate = 1):

  • Epoch 1: \(\alpha=0.1\)
  • Epoch 2: \(\alpha\approx 0.067\)
  • Epoch 3: \(\alpha=0.05\)
  • Epoch 4: \(\alpha=0.04\)
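A couple of lines reproduce the table (using the same \(\alpha_0=0.2\), decayRate = 1 values):

```python
alpha0, decay_rate = 0.2, 1.0
for epoch in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch)
    print(epoch, round(alpha, 3))    # 1 0.1, 2 0.067, 3 0.05, 4 0.04
```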

Other learning rate decay methods

\[
\alpha=0.95^{\text{epochNum}}\,\alpha_0 \quad\dots\text{ exponential decay}\\
\alpha=\frac{k}{\sqrt{\text{epochNum}}}\,\alpha_0 \ \text{ or } \ \frac{k}{\sqrt{t}}\,\alpha_0
\]
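The alternative schedules as code, with \(\alpha_0\) and \(k\) chosen arbitrarily for illustration:

```python
import numpy as np

alpha0, k = 0.2, 1.0
for epoch in range(1, 5):
    alpha_exp  = 0.95 ** epoch * alpha0            # exponential decay
    alpha_sqrt = k / np.sqrt(epoch) * alpha0       # decay proportional to 1/sqrt(epoch)
    print(epoch, round(alpha_exp, 3), round(alpha_sqrt, 3))
```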

Discrete staircase decay: drop \(\alpha\) in steps, so a plot of \(\alpha\) vs. \(t\) looks like a staircase.

  • Manual decay
    • Manually decrease \(\alpha\) over time (e.g., hour by hour or day by day) while watching training.

The problem of local optima

Local optima in neural networks

Problem of plateaus

  • Unlikely to get stuck in a bad local optimum
    • In a sufficiently high-dimensional space, points where the gradient is zero are far more likely to be saddle points than bad local optima, so you rarely get trapped
  • Plateaus can make learning slow
    • Momentum, RMSprop, and Adam help here

Programming assignment

Optimization

Gradient descent: What you should remember:

  • The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
  • You have to tune a learning rate hyperparameter \(\alpha\).
  • With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

Mini-Batch Gradient descent: What you should remember:

  • Shuffling and Partitioning are the two steps required to build mini-batches
  • Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.

Momentum: What you should remember:

  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
  • You have to tune a momentum hyperparameter \(\beta\) and a learning rate \(\alpha\).
