
DL [Course 2/5] Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization [Week 3/3] Hyperparameter tuning, Batch Normalization and Programming Frameworks

Key Concepts

  • Master the process of hyperparameter tuning

\(\require{cancel}\)

Notes from the course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (deeplearning.ai).

Hyperparameter tuning

Tuning process

  • Hyperparameters importance
    • 1st
      • \(\alpha\)
    • 2nd
      • \(\beta\): ~0.9
      • #hidden units
      • mini-batch size
    • 3rd
      • #layers
      • learning rate decay
    • Adam defaults
      • \(\beta_1\): 0.9
      • \(\beta_2\): 0.999
      • \(\epsilon\): \(10^{-8}\)
  • Try random values: Don’t use a grid
    • Random sampling tests far more distinct values per hyperparameter than a grid of the same size
  • Coarse to fine search
    • Sample randomly over a wide region, zoom into the region where the good values lie, then sample randomly again (see the sketch after this list)
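A minimal sketch of one coarse-to-fine round of random search; evaluate(alpha, units) is a hypothetical helper that trains a model with those hyperparameters and returns a validation score, not code from the course:

import numpy as np

def random_search(log_alpha_range, units_range, n_trials, evaluate):
    # evaluate(alpha, units) is a hypothetical helper: it trains a model and
    # returns a validation score (higher is better)
    results = []
    for _ in range(n_trials):
        alpha = 10 ** np.random.uniform(*log_alpha_range)  # log-scale sampling (next section)
        units = np.random.randint(*units_range)            # uniform sampling of #hidden units
        results.append((evaluate(alpha, units), alpha, units))
    return sorted(results, reverse=True)                   # best trials first

# coarse pass over a wide region, then zoom in around the best results:
# coarse = random_search((-4, 0), (50, 200), 25, evaluate)
# fine   = random_search((-3, -2), (80, 120), 25, evaluate)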

Using an appropriate scale to pick hyperparameters

  • Picking hyperparameters at random
    • \[
      n^{[l]}=50,\dots,100\\
      \#\text{layers } L: 2,\dots,4
      \]
  • Appropriate scale for hyperparameters (see the sketch after this list)
    • \[
      \alpha=0.0001,\dots,1\\
      a=\log_{10} 0.0001=-4\\
      b=\log_{10} 1=0\\
      r=-4*np.random.rand() \leftarrow r\in [-4,0]\\
      \alpha=10^r \leftarrow 10^{-4}\dots 10^0
      \]
  • Hyperparameters for exponentially weighted averages
    • \[
      \beta=0.9\dots0.999\\
      1-\beta=0.1\dots0.001\\
      r\in [-3,-1]\\
      1-\beta=10^r\\
      \beta=1-10^r
      \]
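The two sampling rules above translate directly to NumPy; nothing here goes beyond the formulas:

import numpy as np

# alpha: sample the exponent r uniformly in [-4, 0], then alpha = 10^r,
# so alpha is spread evenly over [1e-4, 1] on a log scale
r = -4 * np.random.rand()
alpha = 10 ** r

# beta for exponentially weighted averages: sample 1 - beta on a log scale
# (r in [-3, -1]), so beta covers [0.9, 0.999] sensibly
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r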

Hyperparameters tuning in practice: Pandas vs. Caviar

  • Re-test hyperparameters occasionally
    • NLP, Vision, Speech, Ads, logistics, …
    • Intuitions do get stale. Re-evaluate occasionally.
  • Babysitting one model (panda approach)
    • when you don’t have enough CPU/GPU resources
    • e.g. online advertising or vision applications (huge data, huge models)
    • fairly common in practice
    • tweak the parameters of a single model day by day
    • like a panda: raise a small number of offspring with full attention
  • Training many models in parallel (caviar approach)
    • when CPU/GPU resources are plentiful
    • train multiple models with different sets of hyperparameters at the same time
    • like caviar: lay a huge number of eggs and let probability do the rest

Batch Normalization

Normalizing activations in a network

  • Normalizing inputs to speed up learning
    • Normalizing the inputs x speeds up gradient descent
      • the same idea extends from x to the activations of the hidden layers
    • \[
      \mu=\frac1m\sum_i x^{(i)}\\
      x=x-\mu\\
      \sigma^2=\frac1m\sum_i (x^{(i)})^2\ \text{(element-wise)}\\
      x=\frac{x}{\sigma^2}
      \]
    • Normalizing \(a^{[2]}\) lets us train \(w^{[3]},b^{[3]}\) faster.
    • Normalize \(z^{[2]}\), not \(a^{[2]}\)
      • in practice, normalizing z rather than a is what the course recommends
  • Implementing Batch Norm
    • Given some intermediate values \(z^{(1)},\dots,z^{(m)}\) in the NN (for one layer \(l\))
    • \[
      \mu=\frac1m\sum_i z^{(i)}\\
      \sigma^2=\frac1m\sum_i (z_i-\mu)^2\\
      z_{norm}^{(i)}=\frac{z^{(i)}-\boxed{\mu}}{\boxed{\sqrt{\sigma^2+\epsilon}}}\\
      \tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
      \]
    • \(\gamma,\beta\): learnable parameters of model
      • If\[
        \gamma=\boxed{\sqrt{\sigma^2+\epsilon}}\\
        \beta=\boxed{\mu}
        \]
      • then\[
        \tilde z^{(i)}=z^{(i)}
        \]
      • use \(\tilde z^{(i)}\) instead of \(z^{(i)}\) (see the sketch after this list)
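A minimal NumPy sketch of these equations, assuming the layer’s z values are stacked column-wise in a matrix Z of shape \((n^{[l]}, m)\) as in the course; the function name is mine:

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (n_units, m) -- the z values of one layer for a mini-batch of m examples
    mu = np.mean(Z, axis=1, keepdims=True)       # (n_units, 1)
    var = np.var(Z, axis=1, keepdims=True)       # (n_units, 1)
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # mean 0, variance 1 per unit
    Z_tilde = gamma * Z_norm + beta              # gamma, beta: learnable, shape (n_units, 1)
    return Z_tilde, Z_norm, mu, var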

Fitting Batch Norm into a neural network

  • Adding Batch Norm to a network
    • \[
      X
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{Batch-norm}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow
      a^{[1]}=g^{[1]}(\tilde Z^{[1]})
      \overset{w^{[2]},b^{[2]}}{\rightarrow}
      Z^{[2]}
      \overset{\beta^{[2]},\gamma^{[2]}}{\underset{Batch-norm}{\rightarrow}}
      \tilde Z^{[2]}
      \rightarrow
      a^{[2]}\dots
      \]
    • Parameters:\[
      \begin{cases}
      w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}\\
      \beta^{[1]}, \gamma^{[1]}, \beta^{[2]}, \gamma^{[2]}, \dots, \beta^{[L]}, \gamma^{[L]}
      \end{cases}
      \]
    • \[
      d\beta^{[l]},\quad \beta^{[l]}:=\beta^{[l]}-\alpha\, d\beta^{[l]}
      \]
    • The \(\beta\) here is not the \(\beta\) used as a hyperparameter (momentum); the notation simply follows the Batch Norm paper.
    • In a framework batch norm is a single line, so you normally don’t need to implement it yourself:
tf.nn.batch_normalization
  • Working with mini-batches
    • \[
      \begin{array}{l}
      X^{\{1\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow
      a^{[1]}=g^{[1]}(\tilde Z^{[1]})
      \overset{w^{[2]},b^{[2]}}{\rightarrow} Z^{[2]}\dots\\
      X^{\{2\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
      \overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
      \tilde Z^{[1]}
      \rightarrow\dots\\
      X^{\{3\}}
      \overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}\dots\\
      \vdots
      \end{array}
      \]
    • Parameters:\[
      w^{[l]},
      \underbrace{b^{[l]}}_{(n^{[l]},1)},
      \underbrace{\beta^{[l]}}_{(n^{[l]},1)} ,
      \underbrace{\gamma^{[l]} }_{(n^{[l]},1)}
      \]
    • \[
      Z^{[l]}=w^{[l]}a^{[l-1]}+\cancel{b^{[l]}}\\
      Z^{[l]}=w^{[l]}a^{[l-1]}\\
      Z_{norm}^{[l]}\\
      \tilde Z^{[l]}=\gamma^{[l]}Z_{norm}^{[l]}+\beta^{[l]}
      \]
    • \(b^{[l]}\) can be dropped: batch norm subtracts the mean of \(Z^{[l]}\), so any constant added to every example cancels out, and \(\beta^{[l]}\) takes over its role
  • Implementing gradient descent
    • for t=1 … num Mini-batches
      • Compute forward prop on \(X^{\{t\}}\)
        • In each hidden layer, use BN to replace \(Z^{[l]}\) with \(\tilde Z^{[l]}\)
      • Use backprop to compute \(dw^{[l]}, db^{[l]}, d\beta^{[l]}, d\gamma^{[l]}\)
      • Update parameters: \(w^{[l]}, \beta^{[l]}, \gamma^{[l]}\)
    • Works with momentum, RMSprop, Adam (a sketch of this loop follows)
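A rough sketch of this loop; forward_prop_with_bn, backprop_with_bn, minibatches, params, L and alpha are hypothetical placeholders, not code from the course:

# One epoch of mini-batch gradient descent with batch norm (hypothetical helpers).
for t in range(num_minibatches):
    X_t, Y_t = minibatches[t]
    cache = forward_prop_with_bn(X_t, params)      # each layer uses Z_tilde instead of Z
    grads = backprop_with_bn(Y_t, cache, params)   # dW, dbeta, dgamma per layer
    for l in range(1, L + 1):
        params["W" + str(l)] -= alpha * grads["dW" + str(l)]
        params["beta" + str(l)] -= alpha * grads["dbeta" + str(l)]
        params["gamma" + str(l)] -= alpha * grads["dgamma" + str(l)]
# The plain update above can be swapped for momentum, RMSprop, or Adam.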

Why does Batch Norm work?

  • Learning on a shifting input distribution
    • covariate shift: the distribution differs between training and prediction
  • Why is this a problem for neural networks?
    • normalize: mean 0, variance 1
    • however the earlier activations change, batch norm keeps their mean and variance fixed
    • this mitigates the problem of shifting input values, so learning is more stable
    • even while the earlier layers keep learning, their effect on the later layers is reduced
    • each layer can learn somewhat independently of the others, which speeds up learning
  • Batch Norm as regularization
    • Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
    • This adds some noise to the values \(z^{[l]}\) within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
    • This has a slight regularization effect.
    • Don’t use batch norm for the purpose of regularization; the effect is only a slight, unintended side effect.

Batch Norm at test time

  • batch-norm in training:\[
    \mu=\frac1m\sum_i z^{(i)}\\
    \sigma^2=\frac1m\sum_i (z^{(i)}-\mu)^2\\
    z_{norm}^{(i)}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}\\
    \tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
    \]
  • At test time: the network processes a single example at a time
    • \(\mu,\sigma^2\): estimate them with an exponentially weighted average across mini-batches (sketched after this list)
    • \(X^{\{t\}}, \mu^{\{t\}[l]}, \sigma^{2\{t\}[l]} \rightarrow \mu,\sigma^2\) for each layer \(l\)
    • \[
      z_{norm}=\frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}\\
      \tilde z=\gamma z_{norm}+\beta
      \]
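A minimal sketch of this idea; the helper names and the 0.9 momentum are assumptions, not specified in the lecture:

import numpy as np

# During training: keep exponentially weighted averages of the per-mini-batch statistics.
def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

# At test time: normalize a single example z with the stored statistics.
def batch_norm_test(z, gamma, beta, running_mu, running_var, eps=1e-8):
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta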

Multi-class classification

Softmax Regression

  • Generalization of Logistic regression
    • multiple classification
  • Recognizing cats, dogs,and baby chicks
    • \[
      C=\#\text{classes}=4\ (0,\dots,3)\\
      n^{[L]}=4=C\\
      \text{4 outputs: } P(\text{other}\mid x), P(\text{cat}\mid x), P(\text{dog}\mid x), P(\text{baby chick}\mid x)
      \]
    • \[
      Z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]} \dots(4,1)
      \]
    • Activation function (see the sketch after this list):\[
      t=e^{z^{[L]}}\ \text{(element-wise)}\\
      a^{[L]}=\frac{e^{z^{[L]}}}{\sum_{i=1}^4 t_i}\\
      a_i^{[L]}=\frac{t_i}{\sum_{j=1}^4 t_j}
      \]
    • \[
      Z^{[L]}=\begin{bmatrix}
      5\\2\\-1\\3\end{bmatrix}\\
      t=\begin{bmatrix}
      e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}
      =\begin{bmatrix}148.4\\7.4\\0.4\\20.1\end{bmatrix},\quad \sum_{j=1}^4 t_j=176.3\\
      a^{[L]}=\frac{t}{176.3}
      \]
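A quick NumPy check that reproduces the numbers above:

import numpy as np

def softmax(z):
    t = np.exp(z)            # element-wise exponentiation
    return t / np.sum(t)     # normalize so the entries sum to 1

z = np.array([5., 2., -1., 3.])
print(np.exp(z).sum())       # ~176.3
print(softmax(z))            # ~[0.842, 0.042, 0.002, 0.114]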

Training a softmax classifier

  • Understanding softmax
  • \[
    z^{[L]}=\begin{bmatrix}
    5\\2\\-1\\3\end{bmatrix}\\
    t=\begin{bmatrix}
    e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}\\
    a^{[L]}=g^{[L]}(z^{[L]})=\begin{bmatrix}
    e^5 / ( e^5+e^2+e^{-1}+e^3 )\\
    e^2 / ( e^5+e^2+e^{-1}+e^3 )\\
    e^{-1} / ( e^5+e^2+e^{-1}+e^3 )\\
    e^3 / ( e^5+e^2+e^{-1}+e^3 )
    \end{bmatrix}
    = \begin{bmatrix}
    0.842\\0.042\\0.002\\0.114\end{bmatrix}
    \]
    • Softmax regression generalizes logistic regression to C classes.
  • Loss function
  • \[
    y=\begin{bmatrix}0\\1\\0\\0\end{bmatrix}\dots \text{cat}\\
    a^{[L]}=\hat y=\begin{bmatrix}0.3\\0.2\\0.1\\0.4\end{bmatrix}\\
    L(\hat y, y)=-\sum_{j=1}^4 y_j \log \hat y_j\\
    =-y_2 \log \hat y_2=-\log \hat y_2 \dots\text{minimizing the loss makes }\hat y_2\text{ big}
    \]
  • Gradient descent with softmax
  • backprop (key step, sketched after this list): \[
    dz^{[L]}=\frac{\partial J}{\partial z^{[L]}}=\hat y-y
    \]
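A quick NumPy check of the loss and the gradient \(dz^{[L]}=\hat y-y\) for the example above:

import numpy as np

y     = np.array([0., 1., 0., 0.])        # ground truth: cat
y_hat = np.array([0.3, 0.2, 0.1, 0.4])    # softmax output a^[L]

loss = -np.sum(y * np.log(y_hat))         # = -log(0.2) ~ 1.609
dz   = y_hat - y                          # dJ/dz^[L] = [0.3, -0.8, 0.1, 0.4]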

Introduction to programming frameworks

Deep learning frameworks

  • frameworks
    • Caffe/Caffe2
    • CNTK
    • DL4J
    • Keras
    • Lasagne
    • mxnet
    • PaddlePaddle
    • TensorFlow
    • Theano
    • Torch
  • Choosing deep learning frameworks
    • Ease of programming (development and deployment)
    • Running speed
    • Truly open (open source with good governance)

TensorFlow

  • Motivating problem
    • minimize cost function: \[
      J(w,b)
      \]
    • find: \[
      \begin{align*}
      J(w)&=w^2-10w+25\\
      &=(w-5)^2\\
      w&=5
      \end{align*}
      \]
w = tf.Variable(0,dtype=tf.float32)
cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w)) # 0.0

session.run(train)
print(session.run(w)) # 0.1

for i in range(1000):
  session.run(train)
print(session.run(w)) # 4.9999
w = tf.Variable(0,dtype=tf.float32)
#cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
cost = w**2 - 10*w + 25 # tf overloads +, -, * for tensors
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)
...
import numpy as np
import tensorflow as tf

coefficients = np.array([[1.],[-10.],[25.]])

w = tf.Variable(0,dtype=tf.float32)
x = tf.placeholder(tf.float32,[3,1])
#cost = tf.add(tf.add(w**2,tf.multiply(-10.,w)),25)
#cost = w**2 - 10*w + 25
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0] # (w-5)**2
train= tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()

#session = tf.Session()
#session.run(init)
#print(session.run(w)) # 0.0

# with-block style: the session is closed automatically on exit,
# so every session.run() call has to go inside the block
with tf.Session() as session:
  session.run(init)
  print(session.run(w)) # 0.0

  session.run(train, feed_dict={x:coefficients})
  print(session.run(w)) # 0.1

  for i in range(1000):
    session.run(train, feed_dict={x:coefficients})
  print(session.run(w)) # 4.9999

Programming assignment

Exploring the Tensorflow Library

  • To summarize, remember to initialize your variables, create a session and run the operations inside the session.

To summarize, you now know how to:

  1. Create placeholders
  2. Specify the computation graph corresponding to operations you want to compute
  3. Create the session
  4. Run the session, using a feed dictionary if necessary to specify placeholder variables’ values (see the sketch below).
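Those four steps in one minimal TF1-style sketch; the toy graph y = 2x + 1 is just an illustrative assumption, not part of the assignment:

import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")   # 1. create a placeholder
y = 2 * x + 1                              # 2. specify the computation graph
with tf.Session() as sess:                 # 3. create the session
    print(sess.run(y, feed_dict={x: 3.}))  # 4. run it, feeding a value for x -> 7.0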

What you should remember:

  • Tensorflow is a programming framework used in deep learning
  • The two main object classes in tensorflow are Tensors and Operators.
  • When you code in tensorflow you have to take the following steps:
    • Create a graph containing Tensors (Variables, Placeholders …) and Operations (tf.matmul, tf.add, …)
    • Create a session
    • Initialize the session
    • Run the session to execute the graph
  • You can execute the graph multiple times as you’ve seen in model()
  • The backpropagation and optimization are done automatically when running the session on the “optimizer” object.
