
DL [Course 2/5] Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization [Week 3/3] Hyperparameter tuning, Batch Normalization and Programming Frameworks

Key Concepts

  • Master the process of hyperparameter tuning

\(\require{cancel}\)

Notes from the course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (deeplearning.ai).

Hyperparameter tuning

Tuning process

  • Hyperparameters importance
    • 1st
      • \(\alpha\)
    • 2nd
      • \(\beta\): ~0.9
      • #hidden units
      • mini-batch size
    • 3rd
      • #layers
      • learning rate decay
    • Adam defaults
      • \(\beta_1\): 0.9
      • \(\beta_2\): 0.999
      • \(\epsilon\): \(10^{-8}\)
  • Try random values: Don’t use a grid
    • Random sampling tests far more distinct values per hyperparameter than a grid of the same size
  • Coarse to fine search
    • Sample randomly over a wide region, zoom into the region where the good values lie, then sample randomly again (see the sketch after this list)
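A minimal sketch of one coarse-to-fine round of random search; evaluate(alpha, units) is a hypothetical helper that trains a model with those hyperparameters and returns a validation score, not code from the course:

import numpy as np

def random_search(log_alpha_range, units_range, n_trials, evaluate):
    # evaluate(alpha, units) is a hypothetical helper: it trains a model and
    # returns a validation score (higher is better)
    results = []
    for _ in range(n_trials):
        alpha = 10 ** np.random.uniform(*log_alpha_range)  # log-scale sampling (next section)
        units = np.random.randint(*units_range)            # uniform sampling of #hidden units
        results.append((evaluate(alpha, units), alpha, units))
    return sorted(results, reverse=True)                   # best trials first

# coarse pass over a wide region, then zoom in around the best results:
# coarse = random_search((-4, 0), (50, 200), 25, evaluate)
# fine   = random_search((-3, -2), (80, 120), 25, evaluate)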

Using an appropriate scale to pick hyperparameters

  • Picking hyperparameters at random
    • \[
      n^{[l]}=50,\dots,100\\
      \#\text{layers } L: 2,\dots,4
      \]
  • Appropriate scale for hyperparameters (see the sketch after this list)
    • \[
      \alpha=0.0001,\dots,1\\
      a=\log_{10} 0.0001=-4\\
      b=\log_{10} 1=0\\
      r=-4*np.random.rand() \leftarrow r\in [-4,0]\\
      \alpha=10^r \leftarrow 10^{-4}\dots 10^0
      \]
  • Hyperparameters for exponentially weighted averages
    • \[
      \beta=0.9\dots0.999\\
      1-\beta=0.1\dots0.001\\
      r\in [-3,-1]\\
      1-\beta=10^r\\
      \beta=1-10^r
      \]
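The two sampling rules above translate directly to NumPy; nothing here goes beyond the formulas:

import numpy as np

# alpha: sample the exponent r uniformly in [-4, 0], then alpha = 10^r,
# so alpha is spread evenly over [1e-4, 1] on a log scale
r = -4 * np.random.rand()
alpha = 10 ** r

# beta for exponentially weighted averages: sample 1 - beta on a log scale
# (r in [-3, -1]), so beta covers [0.9, 0.999] sensibly
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r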

Hyperparameters tuning in practice: Pandas vs. Caviar

  • Re-test hyperparameters occasionally
    • NLP, Vision, Speech, Ads, logistics, …
    • Intuitions do get stale. Re-evaluate occasionally.
  • Babysitting one model (panda approach)
    • when you don’t have enough CPU/GPU resources
    • e.g. online advertising or vision applications (huge data, huge models)
    • fairly common in practice
    • tweak the parameters of a single model day by day
    • like a panda: raise a small number of offspring with full attention
  • Training many models in parallel (caviar approach)
    • when CPU/GPU resources are plentiful
    • train multiple models with different sets of hyperparameters at the same time
    • like caviar: lay a huge number of eggs and let probability do the rest

Batch Normalization

Normalizing activations in a network

  • Normalizing inputs to speed up learning
    • Normalizing the inputs x speeds up gradient descent
      • the same idea extends from x to the activations of the hidden layers
    • \[
      \mu=\frac1m\sum_i x^{(i)}\\
      x=x-\mu\\
      \sigma^2=\frac1m\sum_i (x^{(i)})^2\ \text{(element-wise)}\\
      x=\frac{x}{\sigma^2}
      \]
    • Normalizing \(a^{[2]}\) lets us train \(w^{[3]},b^{[3]}\) faster.
    • Normalize \(z^{[2]}\), not \(a^{[2]}\)
      • in practice, normalizing z rather than a is what the course recommends
  • Implementing Batch Norm
    • Given some intermediate values \(z^{(1)},\dots,z^{(m)}\) in the NN (for one layer \(l\))
    • \[
      \mu=\frac1m\sum_i z^{(i)}\\
      \sigma^2=\frac1m\sum_i (z_i-\mu)^2\\
      z_{norm}^{(i)}=\frac{z^{(i)}-\boxed{\mu}}{\boxed{\sqrt{\sigma^2+\epsilon}}}\\
      \tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
      \]
    • \(\gamma,\beta\): learnable parameters of model
      • If\[
        \gamma=\boxed{\sqrt{\sigma^2+\epsilon}}\\
        \beta=\boxed{\mu}
        \]
      • then\[
        \tilde z^{(i)}=z^{(i)}
        \]
      • use \(\tilde z^{(i)}\) instead of \(z^{(i)}\) (see the sketch after this list)
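A minimal NumPy sketch of these equations, assuming the layer’s z values are stacked column-wise in a matrix Z of shape \((n^{[l]}, m)\) as in the course; the function name is mine:

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (n_units, m) -- the z values of one layer for a mini-batch of m examples
    mu = np.mean(Z, axis=1, keepdims=True)       # (n_units, 1)
    var = np.var(Z, axis=1, keepdims=True)       # (n_units, 1)
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # mean 0, variance 1 per unit
    Z_tilde = gamma * Z_norm + beta              # gamma, beta: learnable, shape (n_units, 1)
    return Z_tilde, Z_norm, mu, var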

Fitting Batch Norm into a neural network

  • Adding Batch Norm to a network
    • \[
      X
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{Batch-norm}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow
      a^{[1]}=g^{[1]}(\tilde Z^{[1]})
      \overset{w^{[2]},b^{[2]}}{\rightarrow}
      Z^{[2]}
      \overset{\beta^{[2]},\gamma^{[2]}}{\underset{Batch-norm}{\rightarrow}}
      \tilde Z^{[2]}
      \rightarrow
      a^{[2]}\dots
      \]
    • Parameters:\[
      \begin{cases}
      w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}\\
      \beta^{[1]}, \gamma^{[1]}, \beta^{[2]}, \gamma^{[2]}, \dots, \beta^{[L]}, \gamma^{[L]}
      \end{cases}
      \]
    • \[
      d\beta^{[l]},\quad \beta^{[l]}:=\beta^{[l]}-\alpha\, d\beta^{[l]}
      \]
    • The \(\beta\) here is not the \(\beta\) used as a hyperparameter (momentum); the notation simply follows the Batch Norm paper.
    • In a framework batch norm is a single line, so you normally don’t need to implement it yourself:
tf.nn.batch_normalization
  • Working with mini-batches
    • \[
      \begin{array}{l}
      X^{\{1\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow
      a^{[1]}=g^{[1]}(\tilde Z^{[1]})
      \overset{w^{[2]},b^{[2]}}{\rightarrow} Z^{[2]}\dots\\
      X^{\{2\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow\dots\\
      X^{\{3\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}\dots\\
      \vdots
      \end{array}
      \]
    • Parameters:\[
      w^{[l]},
      \underbrace{b^{[l]}}_{(n^{[l]},1)},
      \underbrace{\beta^{[l]}}_{(n^{[l]},1)} ,
      \underbrace{\gamma^{[l]} }_{(n^{[l]},1)}
      \]
    • \[
      Z^{[l]}=w^{[l]}a^{[l-1]}+\cancel{b^{[l]}}\\
      Z^{[l]}=w^{[l]}a^{[l-1]}\\
      Z_{norm}^{[l]}\\
      \tilde Z^{[l]}=\gamma^{[l]}Z_{norm}^{[l]}+\beta^{[l]}
      \]
    • \(b^{[l]}\) can be dropped: batch norm subtracts the mean of \(Z^{[l]}\), so any constant added to every example cancels out, and \(\beta^{[l]}\) takes over its role
  • Implementing gradient descent
    • for t=1 … num Mini-batches
      • Compute forward prop on \(X^{\{t\}}\)
        • In each hidden layer, use BN to replace \(Z^{[l]}\) with \(\tilde Z^{[l]}\)
      • Use backprop to compute \(dw^{[l]}, db^{[l]}, d\beta^{[l]}, d\gamma^{[l]}\)
      • Update parameters: \(w^{[l]}, \beta^{[l]}, \gamma^{[l]}\)
    • Works with momentum, RMSprop, Adam (a sketch of this loop follows)
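A rough sketch of this loop; forward_prop_with_bn, backprop_with_bn, minibatches, params, L and alpha are hypothetical placeholders, not code from the course:

# One epoch of mini-batch gradient descent with batch norm (hypothetical helpers).
for t in range(num_minibatches):
    X_t, Y_t = minibatches[t]
    cache = forward_prop_with_bn(X_t, params)      # each layer uses Z_tilde instead of Z
    grads = backprop_with_bn(Y_t, cache, params)   # dW, dbeta, dgamma per layer
    for l in range(1, L + 1):
        params["W" + str(l)] -= alpha * grads["dW" + str(l)]
        params["beta" + str(l)] -= alpha * grads["dbeta" + str(l)]
        params["gamma" + str(l)] -= alpha * grads["dgamma" + str(l)]
# The plain update above can be swapped for momentum, RMSprop, or Adam.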

Why does Batch Norm work?

  • Learning on a shifting input distribution
    • covariate shift: the distribution differs between training and prediction
  • Why is this a problem for neural networks?
    • normalize: mean 0, variance 1
    • however the earlier activations change, batch norm keeps their mean and variance fixed
    • this mitigates the problem of shifting input values, so learning is more stable
    • even while the earlier layers keep learning, their effect on the later layers is reduced
    • each layer can learn somewhat independently of the others, which speeds up learning
  • Batch Norm as regularization
    • Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
    • This adds some noise to the values \(z^{[l]}\) within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
    • This has a slight regularization effect.
    • Don’t use batch norm for the purpose of regularization; the effect is only a slight, unintended side effect.

Batch Norm at test time

  • batch-norm in training:\[
    \mu=\frac1m\sum_i z^{(i)}\\
    \sigma^2=\frac1m\sum_i (z^{(i)}-\mu)^2\\
    z_{norm}^{(i)}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}\\
    \tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
    \]
  • At test time: the network processes a single example at a time
    • \(\mu,\sigma^2\): estimate them with an exponentially weighted average across mini-batches (sketched after this list)
    • \(X^{\{t\}}, \mu^{\{t\}[l]}, \sigma^{2\{t\}[l]} \rightarrow \mu,\sigma^2\) for each layer \(l\)
    • \[
      z_{norm}=\frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}\\
      \tilde z=\gamma z_{norm}+\beta
      \]
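A minimal sketch of this idea; the helper names and the 0.9 momentum are assumptions, not specified in the lecture:

import numpy as np

# During training: keep exponentially weighted averages of the per-mini-batch statistics.
def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

# At test time: normalize a single example z with the stored statistics.
def batch_norm_test(z, gamma, beta, running_mu, running_var, eps=1e-8):
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta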

Multi-class classification

Softmax Regression

  • Generalization of Logistic regression
    • multiple classification
  • Recognizing cats, dogs,and baby chicks
    • \[
      C=\#\text{classes}=4\ (0,\dots,3)\\
      n^{[L]}=4=C\\
      \text{4 outputs: } P(\text{other}\mid x), P(\text{cat}\mid x), P(\text{dog}\mid x), P(\text{baby chick}\mid x)
      \]
    • \[
      Z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]} \dots(4,1)
      \]
    • Activation function (see the sketch after this list):\[
      t=e^{z^{[L]}}\ \text{(element-wise)}\\
      a^{[L]}=\frac{e^{z^{[L]}}}{\sum_{i=1}^4 t_i}\\
      a_i^{[L]}=\frac{t_i}{\sum_{j=1}^4 t_j}
      \]
    • \[
      Z^{[L]}=\begin{bmatrix}
      5\\2\\-1\\3\end{bmatrix}\\
      t=\begin{bmatrix}
      e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}
      =\begin{bmatrix}148.4\\7.4\\0.4\\20.1\end{bmatrix},\quad \sum_{j=1}^4 t_j=176.3\\
      a^{[L]}=\frac{t}{176.3}
      \]
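A quick NumPy check that reproduces the numbers above:

import numpy as np

def softmax(z):
    t = np.exp(z)            # element-wise exponentiation
    return t / np.sum(t)     # normalize so the entries sum to 1

z = np.array([5., 2., -1., 3.])
print(np.exp(z).sum())       # ~176.3
print(softmax(z))            # ~[0.842, 0.042, 0.002, 0.114]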

Training a softmax classifier

  • Understanding softmax
  • \[
    z^{[L]}=\begin{bmatrix}
    5\\2\\-1\\3\end{bmatrix}\\
    t=\begin{bmatrix}
    e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}\\
    a^{[L]}=g^{[L]}(z^{[L]})=\begin{bmatrix}
    e^5 / ( e^5+e^2+e^{-1}+e^3 )\\
    e^2 / ( e^5+e^2+e^{-1}+e^3 )\\
    e^{-1} / ( e^5+e^2+e^{-1}+e^3 )\\
    e^3 / ( e^5+e^2+e^{-1}+e^3 )
    \end{bmatrix}
    = \begin{bmatrix}
    0.842\\0.042\\0.002\\0.114\end{bmatrix}
    \]
    • Softmax regression generalizes logistic regression to C classes.
  • Loss function
  • \[
    y=\begin{bmatrix}0\\1\\0\\0\end{bmatrix}\dots \text{cat}\\
    a^{[L]}=\hat y=\begin{bmatrix}0.3\\0.2\\0.1\\0.4\end{bmatrix}\\
    L(\hat y, y)=-\sum_{j=1}^4 y_j \log \hat y_j\\
    =-y_2 \log \hat y_2=-\log \hat y_2 \dots\text{minimizing the loss makes }\hat y_2\text{ big}
    \]
  • Gradient descent with softmax
  • backprop (key step, sketched after this list): \[
    dz^{[L]}=\frac{\partial J}{\partial z^{[L]}}=\hat y-y
    \]
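A quick NumPy check of the loss and the gradient \(dz^{[L]}=\hat y-y\) for the example above:

import numpy as np

y     = np.array([0., 1., 0., 0.])        # ground truth: cat
y_hat = np.array([0.3, 0.2, 0.1, 0.4])    # softmax output a^[L]

loss = -np.sum(y * np.log(y_hat))         # = -log(0.2) ~ 1.609
dz   = y_hat - y                          # dJ/dz^[L] = [0.3, -0.8, 0.1, 0.4]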

Introduction to programming frameworks

Deep learning frameworks

  • frameworks
    • Caffe/Caffe2
    • CNTK
    • DL4J
    • Keras
    • Lasagne
    • mxnet
    • PaddlePaddle
    • TensorFlow
    • Theano
    • Torch
  • Choosing deep learning frameworks
    • Ease of programming (development and deployment)
    • Running speed
    • Truly open (open source with good governance)

TensorFlow

  • Motivating problem
    • minimize cost function: \[
      J(w,b)
      \]
    • find: \[
      \begin{align*}
      J(w)&=w^2-10w+25\\
      &=(w-5)^2\\
      w&=5
      \end{align*}
      \]
w = tf.Variable(0,dtype=tf.float32)
cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w)) # 0.0

session.run(train)
print(session.run(w)) # 0.1

for i in range(1000):
  session.run(train)
print(session.run(w)) # 4.9999
w = tf.Variable(0,dtype=tf.float32)
#cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
cost = w**2 - 10*w + 25 # tf overloads +, -, * for tensors
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)
...
import numpy as np
import tensorflow as tf

coefficients = np.array([[1.],[-10.],[25.]])

w = tf.Variable(0,dtype=tf.float32)
x = tf.placeholder(tf.float32,[3,1])
#cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
#cost = w**2 - 10*w + 25
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0] # (w-5)**2
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()

#session = tf.Session()
#session.run(init)
#print(session.run(w)) # 0.0

# with-block style: the session is closed automatically on exit,
# so every session.run() call has to go inside the block
with tf.Session() as session:
  session.run(init)
  print(session.run(w)) # 0.0

  session.run(train, feed_dict={x:coefficients})
  print(session.run(w)) # 0.1

  for i in range(1000):
    session.run(train, feed_dict={x:coefficients})
  print(session.run(w)) # 4.9999

Programming assignment

Exploring the Tensorflow Library

  • To summarize, remember to initialize your variables, create a session and run the operations inside the session.

To summarize, you now know how to:

  1. Create placeholders
  2. Specify the computation graph corresponding to operations you want to compute
  3. Create the session
  4. Run the session, using a feed dictionary if necessary to specify placeholder variables’ values (see the sketch below).
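Those four steps in one minimal TF1-style sketch; the toy graph y = 2x + 1 is just an illustrative assumption, not part of the assignment:

import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")   # 1. create a placeholder
y = 2 * x + 1                              # 2. specify the computation graph
with tf.Session() as sess:                 # 3. create the session
    print(sess.run(y, feed_dict={x: 3.}))  # 4. run it, feeding a value for x -> 7.0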

What you should remember:

  • Tensorflow is a programming framework used in deep learning
  • The two main object classes in tensorflow are Tensors and Operators.
  • When you code in tensorflow you have to take the following steps:
    • Create a graph containing Tensors (Variables, Placeholders …) and Operations (tf.matmul, tf.add, …)
    • Create a session
    • Initialize the session
    • Run the session to execute the graph
  • You can execute the graph multiple times as you’ve seen in model()
  • The backpropagation and optimization are done automatically when running the session on the “optimizer” object.
