
# DL [Course 2/5] Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization [Week 2/3] Optimization algorithms

Key Concepts

• Know the benefits of learning rate decay and apply it to your optimization
• Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
• Use random minibatches to accelerate the convergence and improve the optimization

## Optimization algorithms

#### Batch vs. mini-batch gradient descent

Vectorization allows you to efficiently compute on m examples.

$\underbrace{X}_{(n_x,m)}=[\underbrace{x^{(1)} \dots x^{(1000)}}_{X^{\{1\}}(n_x,1000)} | \underbrace{ x^{(1001)} \dots x^{(2000)} }_{X^{\{2\}} (n_x,1000) } | \dots | \underbrace{ \dots x^{(m)} }_{X^{\{5,000\}} (n_x,1000) } ]\\ \underbrace{Y}_{(1,m)}=[\underbrace{y^{(1)} \dots y^{(1000)}}_{Y^{\{1\}}(1,1000)} | \underbrace{ y^{(1001)} \dots y^{(2000)} }_{Y^{\{2\}} (1,1000) } | \dots | \underbrace{ \dots y^{(m)} }_{Y^{\{5,000\}} (1,1000) } ]\\ \text{What if } m=5,000,000?\\ \text{5000 mini-batches of 1000 each}\\ \text{mini-batch: t: }X^{\{t\}},Y^{\{t\}}$

• for t=1,…,5000
• Forward prop on $$X^{\{t\}}$$
• $$Z^{[1]}=w^{[1]}X^{\{t\}}+b^{[1]}$$
• $$A^{[1]}=g^{[1]}(Z^{[1]})$$
• $$A^{[L]}=g^{[L]}(Z^{[L]})$$
• Compute Cost $$J^{\{t\}}=\frac1{1000}\sum_{i=1}^{1000} L(\underbrace{\hat y^{(i)} , y^{(i)}}_{\text{from } X^{\{t\}}, Y^{\{t\}} })+\frac{\lambda}{2 \times 1000}\sum_l{\|w^{[l]}\|_F^2}$$
• Backprop to compute gradients of cost $$J^{\{t\}}$$ (using $$(X^{\{t\}}, Y^{\{t\}})$$)
• $$w^{[l]}:=w^{[l]}-\alpha d w^{[l]}, b^{[l]}:=b^{[l]}-\alpha db^{[l]}$$
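The loop above can be sketched in NumPy. This is a minimal one-layer (logistic regression) model chosen for concreteness; the toy data, `w_true`, and hyperparameter values are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: m = 16 examples of dimension n_x = 3, labels from a known w_true
rng = np.random.default_rng(0)
n_x, m, batch_size = 3, 16, 4
X = rng.standard_normal((n_x, m))
w_true = np.array([[1.0, -2.0, 0.5]])
Y = (w_true @ X > 0).astype(float)

W, b, alpha = np.zeros((1, n_x)), 0.0, 0.1
for epoch in range(50):
    for t in range(m // batch_size):                     # loop over mini-batches
        Xt = X[:, t * batch_size:(t + 1) * batch_size]   # X^{t}
        Yt = Y[:, t * batch_size:(t + 1) * batch_size]   # Y^{t}
        A = sigmoid(W @ Xt + b)                          # forward prop on X^{t}
        dZ = A - Yt                                      # backprop (cross-entropy)
        dW, db = dZ @ Xt.T / batch_size, dZ.sum() / batch_size
        W, b = W - alpha * dW, b - alpha * db            # one update per mini-batch

A_all = sigmoid(W @ X + b)
cost = -np.mean(Y * np.log(A_all) + (1 - Y) * np.log(1 - A_all))
```

Note that the parameters are updated once per mini-batch, so one pass over the data already performs m/batch_size gradient steps.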

#### Training with mini-batch gradient descent

• Batch gradient descent: plot cost $$J$$ against #iterations; it should decrease on every iteration
• Mini-batch gradient descent: plot cost $$J^{\{t\}}$$ against mini-batch number $$t$$; it trends downward but is noisy

• If mini-batch size = m: Batch gradient descent, $(X^{\{t\}},Y^{\{t\}})=(X,Y)$
• Too long per iteration
• Low noise, big steps
• If mini-batch size = 1: Stochastic gradient descent; every example $(x^{(i)},y^{(i)})$ is its own mini-batch
• Lose the speedup from vectorization
• Noisy; never exactly converges (oscillates around the minimum)
• In practice: somewhere in between 1 and m
• In between (mini-batch size neither too big nor too small)
• Fastest learning
• Still get the speedup from vectorization (~1000 examples per batch)
• Make progress without processing the entire training set
• If small training set: use batch gradient descent
• m ≤ 2000
• Typical mini-batch sizes:
• 64, 128, 256, 512, 1024
• Make sure each mini-batch fits in CPU/GPU memory
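The shuffle-then-partition recipe for building random mini-batches can be sketched as follows (the helper name `random_mini_batches` echoes the programming assignment, but the details here are my own sketch, not the assignment's exact code):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) together, then partition into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # step 1: shuffle
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):          # step 2: partition
        batches.append((X_shuf[:, t:t + batch_size],
                        Y_shuf[:, t:t + batch_size]))
    return batches  # last batch is smaller if batch_size doesn't divide m

X = np.arange(10, dtype=float).reshape(1, 10)
Y = np.ones((1, 10))
batches = random_mini_batches(X, Y, batch_size=4)
# 10 examples with batch size 4 -> 3 mini-batches of sizes 4, 4, 2
```

Shuffling X and Y with the same permutation keeps each label paired with its example.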

### Exponentially weighted averages

Exponentially weighted moving average

$v_t = \beta v_{t-1}+(1-\beta)\theta_t\\ \beta=0.9 : \approx \text{average over last 10 days' temperature}\\ \beta=0.98 : \approx \text{50 days}\\ \beta=0.5 : \approx \text{2 days}\\$

$v_t \approx \text{average over the last } \frac1{1-\beta} \text{ days' temperature}\\$

• Beta is large: adapts more slowly (smoother curve)
• Beta is small: noisy, oscillates

### Understanding exponentially weighted averages

$v_t = \beta v_{t-1}+(1-\beta)\theta_t\\ v_{100}=0.9v_{99}+0.1\theta_{100}\\ v_{99}=0.9v_{98}+0.1\theta_{99}\\ v_{98}=0.9v_{97}+0.1\theta_{98}\\ v_{100}=0.1\theta_{100} +0.1\times0.9\theta_{99} +0.1\times(0.9)^2\theta_{98} +0.1\times(0.9)^3\theta_{97} +0.1\times(0.9)^4\theta_{96} +\dots\\ 0.9^{10}\approx 0.35 \approx \frac1e \quad (\text{the weights fall to } \frac1e \text{ after} \approx 10 \text{ days})\\ \text{In general } (1-\epsilon)^{\frac1\epsilon}\approx\frac1e \text{ with } \epsilon=1-\beta \ (\text{here } \epsilon=0.1)$

#### Implementing exponentially weighted averages

$v_0=0\\ v_1=\beta v_0+(1-\beta)\theta_1\\ v_2=\beta v_1+(1-\beta)\theta_2\\ v_3=\beta v_2+(1-\beta)\theta_3\\ \dots$

• $$v_\theta = 0$$
• Repeat{
• Get next $$\theta_t$$
• $$v_\theta := \beta v_\theta+(1-\beta)\theta_t$$
• }
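The pseudocode above in runnable form (the toy temperature values are made up):

```python
def ewma(thetas, beta=0.9):
    """Exponentially weighted moving average, per the pseudocode above."""
    v = 0.0                          # v_theta = 0
    out = []
    for theta in thetas:             # "get next theta_t"
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [40.0, 49.0, 45.0, 44.0, 46.0]
v = ewma(temps, beta=0.9)
# v[0] = 0.1 * 40 = 4.0, v[1] = 0.9 * 4.0 + 0.1 * 49 = 8.5, ...
```

Note how the early values sit far below the data because v starts at 0; this is the bias that the next section corrects.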

### Bias correction in exponentially weighted averages

#### Bias correction

$\frac{v_t}{1-\beta^t}\\$

• Useful when the bias of the estimate in the early phase is a concern
• Gives a good estimate even in the early iterations
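A quick check of the correction on a constant signal (the numbers are illustrative):

```python
beta = 0.98
thetas = [10.0] * 5                      # constant signal; true average is 10
v = 0.0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta    # biased estimate, starts near 0
    v_corrected = v / (1 - beta ** t)    # divide by (1 - beta^t)
# After 5 steps v is still far below 10, but v_corrected recovers it exactly,
# because for a constant signal v_t = 10 * (1 - beta^t).
```

As t grows, beta^t goes to 0 and the correction factor approaches 1, so it only matters early on.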

### Gradient descent with momentum

• Plain (mini-batch) gradient descent:
• takes many steps
• slowly oscillates toward the minimum
• can't use a big $$\alpha$$, or the oscillations diverge
• Momentum:
• smaller oscillating steps vertically
• speedy horizontal movement toward the minimum

#### Implementation details

• On iteration t:
• compute $$dW,db$$ on current mini-batch
• $$v_{dW}=\beta v_{dW} + \underbrace{(1-\beta)}_{\text{sometimes omitted}} dW$$
• $$v_{db}=\beta v_{db} + \underbrace{(1-\beta)}_{\text{sometimes omitted}} db$$
• $$W:=W-\alpha v_{dW}, b:=b-\alpha v_{db}$$
• like a bowl
• $$dW,db$$: acceleration
• $$v$$: velocity
• $$\beta$$: friction
• Hyperparameters: $$\alpha, \beta$$
• $$\beta=0.9$$: averages the gradients over roughly the last 10 iterations
• If $$(1-\beta)$$ is omitted, the scaling of $$dW, db$$ changes, so $$\alpha$$ must be retuned as well
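The momentum update above as a function, plus a toy run on $f(w)=w^2$ (the quadratic example is mine, not the lecture's):

```python
import numpy as np

def momentum_update(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One momentum step, matching the update rules above."""
    vdW = beta * vdW + (1 - beta) * dW   # velocity accumulates the gradient
    vdb = beta * vdb + (1 - beta) * db
    return W - alpha * vdW, b - alpha * vdb, vdW, vdb

# Minimizing f(w) = w^2 (gradient 2w): the velocity smooths the steps
w, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    w, _, v, _ = momentum_update(w, 0.0, 2.0 * w, 0.0, v, 0.0, alpha=0.1)
```

The bowl analogy from the notes maps directly: `dW` plays acceleration, `vdW` velocity, and `beta` friction that keeps the velocity from growing without bound.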

### RMSprop (root mean square propagation)

• On iteration t:compute $$dW,db$$ on current mini-batch
• \begin{align*} S_{dW}&=\beta S_{dW} + (1-\beta) dW^{2\ (\text{element-wise})} : \text{small}\\ S_{db}&=\beta S_{db} + (1-\beta) db^2: \text{large}\\ W&:=W-\alpha \frac{dW}{\sqrt{S_{dW}}+\epsilon}\\ b&:=b-\alpha \frac{db}{\sqrt{S_{db}}+\epsilon} \end{align*}
• Dividing by $$\sqrt{S}$$ damps the large-gradient direction ($$db$$) and boosts the small-gradient direction ($$dW$$); a small $$\epsilon$$ keeps the division numerically safe
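A sketch of one RMSprop step (the default β here is an assumption matching the β₂ value quoted later for Adam; standalone RMSprop often uses 0.9):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, SdW, Sdb, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop step; eps keeps the division numerically safe."""
    SdW = beta * SdW + (1 - beta) * dW ** 2   # element-wise square
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(SdW) + eps)
    b = b - alpha * db / (np.sqrt(Sdb) + eps)
    return W, b, SdW, Sdb

# One step on a scalar parameter with gradient 10
W, b, SdW, Sdb = rmsprop_update(np.array([5.0]), 0.0, np.array([10.0]), 0.0,
                                np.array([0.0]), 0.0)
```

Because the step is `dW / sqrt(SdW)`, directions with persistently large gradients get their effective learning rate shrunk, which is what tames the vertical oscillation.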

### Adam optimization algorithm

$V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0\\ \text{momentum: } \beta_1, \text{ RMSprop: } \beta_2$
on iteration t:
Compute $$dW,db$$ using current mini-batch
\begin{align*} V_{dW}&=\beta_1 V_{dW} + (1-\beta_1) dW\\ V_{db}&=\beta_1 V_{db} + (1-\beta_1) db\\ S_{dW}&=\beta_2 S_{dW} + (1-\beta_2) dW^{2(elementwise)}\\ S_{db}&=\beta_2 S_{db} + (1-\beta_2) db^2\\ V_{dW}^{corrected}&=\frac{V_{dW}}{1-\beta_1^t}\\ V_{db}^{corrected}&=\frac{V_{db}}{1-\beta_1^t}\\ S_{dW}^{corrected}&=\frac{S_{dW}}{1-\beta_2^t}\\ S_{db}^{corrected}&=\frac{S_{db}}{1-\beta_2^t}\\ W&:=W-\alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}\\ b&:=b-\alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon} \end{align*}
Hyperparameters choice:\begin{align*} \alpha&: \text{ needs to be tuned}\\ \beta_1&: 0.9 \rightarrow dW\\ \beta_2&: 0.999 \rightarrow dW^2\\ \epsilon&: 10^{-8} \rightarrow \text{by default, no need to tune} \end{align*}
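The Adam update above for a single parameter tensor (`b` would be handled identically; the scalar example is illustrative):

```python
import numpy as np

def adam_update(W, dW, VdW, SdW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for one parameter tensor (t is the 1-based step count)."""
    VdW = beta1 * VdW + (1 - beta1) * dW         # momentum term
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2    # RMSprop term
    V_corr = VdW / (1 - beta1 ** t)              # bias correction
    S_corr = SdW / (1 - beta2 ** t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, VdW, SdW

# First step on a scalar parameter with gradient 10: after bias correction
# V_corr = dW and sqrt(S_corr) = |dW|, so the step size is ~alpha
W, VdW, SdW = adam_update(np.array([5.0]), np.array([10.0]),
                          np.array([0.0]), np.array([0.0]), t=1, alpha=0.1)
```

Passing `t` explicitly matters: the bias correction divides by $1-\beta^t$, which would be wrong if the step counter were reset per epoch.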

### Learning rate decay

#### Learning rate decay

1 epoch = 1 pass through all data

$$\alpha=\frac1{1+(\text{decayRate}) \times (\text{epochNumber})}\alpha_0$$

#### Other learning rate decay methods

$\alpha=0.95^{\text{epochNum}}\alpha_0 \dots\text{ exponential decay}\\ \alpha=\frac{k}{\sqrt{\text{epochNum}}}\alpha_0 \text{ or } \frac{k}{\sqrt{t}}\alpha_0\\$

plot $$\alpha$$ vs. $$t$$: discrete staircase

• Manual decay
• Manually decrease $$\alpha$$ over time, e.g., by the hour or by the day.
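The decay schedules above collected into one helper (`decayed_lr` is a hypothetical function of mine, not from the course):

```python
def decayed_lr(alpha0, epoch, decay_rate=1.0, k=1.0, method="inverse"):
    """Learning rate after `epoch` epochs, per the schedules in the notes."""
    if method == "inverse":           # alpha0 / (1 + decayRate * epochNumber)
        return alpha0 / (1 + decay_rate * epoch)
    if method == "exponential":       # 0.95^epochNum * alpha0
        return 0.95 ** epoch * alpha0
    if method == "sqrt":              # (k / sqrt(epochNum)) * alpha0
        return k / epoch ** 0.5 * alpha0
    raise ValueError(f"unknown method: {method}")

# With decay_rate = 1, the inverse schedule halves the rate after epoch 1
alpha1 = decayed_lr(0.2, 1)
```

All three shrink α as epochs accumulate, so early epochs take large steps and late epochs settle into a tighter region around the minimum.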

### The problem of local optima

#### Problem of plateaus

• Unlikely to get stuck in a bad local optimum
• In a relatively high-dimensional space you rarely get trapped, since most zero-gradient points are saddle points rather than local optima
• Plateaus can make learning slow

## Programming assignment

### Optimization

Gradient descent: What you should remember:

• The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
• You have to tune a learning rate hyperparameter $$\alpha$$.
• With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms either batch gradient descent or stochastic gradient descent (particularly when the training set is large).

Mini-Batch Gradient descent: What you should remember:

• Shuffling and Partitioning are the two steps required to build mini-batches
• Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.

Momentum: What you should remember:

• Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
• You have to tune a momentum hyperparameter $$\beta$$ and a learning rate $$\alpha$$.