Key Concepts
- Master the process of hyperparameter tuning
[mathjax]\(\require{cancel}\)
Hyperparameter tuning
Tuning process
- Hyperparameter importance
- 1st
- \(\alpha\)
- 2nd
- \(\beta\): ~0.9
- #hidden units
- mini-batch size
- 3rd
- #layers
- learning rate decay
- Adam defaults
- \(\beta_1\): 0.9
- \(\beta_2\): 0.999
- \(\epsilon\): \(10^{-8}\)
- Try random values: Don’t use a grid
- Random sampling lets you test far more distinct values than picking points on a grid
- Coarse to fine search
- Sample at random over a wide region, zoom in on the sub-region with good values, then sample at random again (see the sketch below)
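A minimal sketch of random search plus coarse-to-fine zooming (my own illustration, not course code); `validation_error` is a made-up stand-in for "train the model, measure dev-set error" over two hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(hidden_units, alpha):
    # hypothetical objective standing in for training + dev-set evaluation
    return (hidden_units - 80) ** 2 / 1e4 + 100 * (alpha - 0.03) ** 2

# random, not grid: 25 random points try 25 distinct values of each hyperparameter,
# whereas a 5x5 grid would try only 5 distinct values of each
coarse = [(rng.uniform(10, 200), rng.uniform(0.0001, 1)) for _ in range(25)]
h_best, a_best = min(coarse, key=lambda p: validation_error(*p))

# coarse to fine: zoom into the region around the best point and sample again
fine = [(rng.uniform(h_best - 20, h_best + 20), rng.uniform(a_best / 3, a_best * 3))
        for _ in range(25)]
print(min(fine, key=lambda p: validation_error(*p)))
```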
Using an appropriate scale to pick hyperparameters
- Picking hyperparameters at random
- \[
n^{[l]}=50,\dots,100\\
\text{#layers} L: 2,\dots,4
\]
- Appropriate scale for hyperparameters
- \[
\alpha=0.0001,\dots,1\\
a=\log_{10} 0.0001=-4\\
b=\log_{10} 1=0\\
r=-4*\text{np.random.rand()} \leftarrow r\in [-4,0]\\
\alpha=10^r \leftarrow 10^{-4}\dots 10^0
\]
- Hyperparameters for exponentially weighted averages
- \[
\beta=0.9\dots0.999\\
1-\beta=0.1\dots0.001\\
r\in [-3,-1]\\
1-\beta=10^r\\
\beta=1-10^r
\]
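A short NumPy sketch of the two sampling rules above (ranges taken from the notes; not course code):

```python
import numpy as np

r = -4 * np.random.rand()        # r uniform in [-4, 0]
alpha = 10 ** r                  # learning rate, uniform on a log scale in [1e-4, 1]

r = np.random.uniform(-3, -1)    # r uniform in [-3, -1]
beta = 1 - 10 ** r               # beta in [0.9, 0.999], sampled more densely near 1
print(alpha, beta)
```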
Hyperparameters tuning in practice: Pandas vs. Caviar
- Re-test hyperparameters occasionally
- NLP, Vision, Speech, Ads, logistics, …
- Intuitions do get stale. Re-evaluate occasionally.
- Babysitting one model (panda approach)
- When you do not have much CPU/GPU capacity
- e.g. online advertising or vision applications (huge data, large models)
- Still fairly common in practice
- Tweak the parameters of a single model a little every day
- Like a panda: raise a small number of offspring with full attention
- Training many models in parallel (caviar approach)
- When you have plenty of CPU/GPU capacity
- Train many models with different hyperparameter settings at the same time
- Like caviar: lay a huge number of eggs and let probability do the work
Batch Normalization
Normalizing activations in a network
- Normalizing inputs to speed up learning
- Normalizing the inputs x speeds up gradient descent
- Extend the same idea from x to the activations of the hidden layers
- \[
\mu=\frac1m\sum_i x^{(i)}\\
x=x-\mu\\
\sigma^2=\frac1m\sum_i \left(x^{(i)}\right)^2 \quad\text{(elementwise)}\\
x=\frac{x}{\sigma^2}
\] - Can we normalize \(a^{[2]}\) so as to train \(w^{[3]},b^{[3]}\) faster?
- Normalize not \(a^{[2]}\) but \(z^{[2]}\)
- In practice, normalizing z rather than a is the recommended default
- Implementing Batch Norm
- Given some intermediate values \(z^{(1)},\dots,z^{(m)}\) in the NN
- \[
\mu=\frac1m\sum_i z^{(i)}\\
\sigma^2=\frac1m\sum_i (z_i-\mu)^2\\
z_{norm}^{(i)}=\frac{z^{(i)}-\boxed{\mu}}{\boxed{\sqrt{\sigma^2+\epsilon}}}\\
\tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
\] - \(\gamma,\beta\): learnable parameters of model
- If\[
\gamma=\boxed{\sqrt{\sigma^2+\epsilon}}\\
\beta=\boxed{\mu}
\] - then\[
\tilde z^{(i)}=z^{(i)}
\] - use \(\tilde z^{(i)}\) instead of \(z^{(i)}\) (see the NumPy sketch below)
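A minimal NumPy sketch of these equations (my own illustration, not course code), assuming Z holds one layer's pre-activations \(z^{(1)},\dots,z^{(m)}\) as columns:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)             # (n, 1) per-unit mean
    sigma2 = Z.var(axis=1, keepdims=True)          # (n, 1) per-unit variance
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)      # mean 0, variance 1
    return gamma * Z_norm + beta                   # learnable scale and shift

Z = np.random.randn(5, 32) * 3.0 + 7.0             # toy layer: 5 units, 32 examples
gamma, beta = np.ones((5, 1)), np.zeros((5, 1))
Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.var(axis=1))   # ~0 and ~1 while gamma=1, beta=0
```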
Fitting Batch Norm into a neural network
- Adding Batch Norm to a network
- \[
X
\overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
\overset{\beta^{[1]},\gamma^{[1]}}{\underset{Batch-norm}{\rightarrow}}
\tilde Z^{[1]}
\rightarrow
a^{[1]}=g^{[1]}(\tilde Z^{[1]})
\overset{w^{[2]},b^{[2]}}{\rightarrow}
Z^{[2]}
\overset{\beta^{[2]},\gamma^{[2]}}{\underset{Batch-norm}{\rightarrow}}
\tilde Z^{[2]}
\rightarrow
a^{[2]}\dots
\] - Parameters:\[
\begin{cases}
w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}\\
\beta^{[1]}, \gamma^{[1]}, \beta^{[2]}, \gamma^{[2]}, \dots, \beta^{[L]}, \gamma^{[L]}
\end{cases}
\] - Gradient descent treats \(\beta^{[l]},\gamma^{[l]}\) like any other parameters:\[
d\beta^{[l]},\quad \beta^{[l]}:=\beta^{[l]}-\alpha\, d\beta^{[l]}
\]
- The \(\beta\) here is not the hyperparameter \(\beta\) (momentum); the notation simply follows the Batch Norm paper.
- Frameworks implement batch norm in one line, so you normally do not need to implement it yourself:
- tf.nn.batch_normalization
- Working with mini-batches
- \[
\begin{array}{l}
X^{\{1\}}
\overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
\overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
\tilde Z^{[1]}
\rightarrow
a^{[1]}=g^{[1]}(\tilde Z^{[1]})
\overset{w^{[2]},b^{[2]}}{\rightarrow} Z^{[2]}\dots\\
X^{\{2\}}
\overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}
\overset{\beta^{[1]},\gamma^{[1]}}{\underset{B.N.}{\rightarrow}}
\tilde Z^{[1]}
\rightarrow\dots\\
X^{\{3\}}
\overset{w^{[1]},b^{[1]}}{\rightarrow} Z^{[1]}\dots\\
\vdots
\end{array}
\] - Parameters:\[
w^{[l]},
\underbrace{b^{[l]}}_{(n^{[l]},1)},
\underbrace{\beta^{[l]}}_{(n^{[l]},1)} ,
\underbrace{\gamma^{[l]} }_{(n^{[l]},1)}
\] - Because batch norm subtracts the mean of \(Z^{[l]}\), any constant \(b^{[l]}\) is cancelled out, so it can be dropped:\[
Z^{[l]}=w^{[l]}a^{[l-1]}+\cancel{b^{[l]}}\\
Z^{[l]}=w^{[l]}a^{[l-1]}\\
Z_{norm}^{[l]}\\
\tilde Z^{[l]}=\gamma^{[l]}Z_{norm}^{[l]}+\beta^{[l]}
\]
- Implementing gradient descent
- for t=1 … num Mini-batches
- Compute forward prop on \(X^{\{t\}}\)
- In each hidden layer, use BN to replace \(Z^{[l]}\) with \(\tilde Z^{[l]}\)
- Use backprop to compute \(dw^{[l]}, db^{[l]}, d\beta^{[l]}, d\gamma^{[l]}\)
- Update parameters: \(w^{[l]}, \beta^{[l]}, \gamma^{[l]}\)
- Works with momentum, RMSprop, and Adam as well (see the NumPy sketch below)
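A hedged, runnable NumPy sketch of this loop for a single batch-norm layer with an identity activation and a squared-error loss, so the gradients stay short. The data, layer sizes, and learning rate are made-up assumptions, and the batch-norm backward pass is the standard formula rather than anything taken from the course:

```python
import numpy as np

np.random.seed(0)
n_x, n_1, m, eps, alpha = 3, 4, 256, 1e-8, 0.01
X_data = np.random.randn(n_x, m)
Y_data = np.random.randn(n_1, m)

W = np.random.randn(n_1, n_x) * 0.1                 # no b: batch norm cancels it
gamma, beta = np.ones((n_1, 1)), np.zeros((n_1, 1))

batches = [(X_data[:, i:i+64], Y_data[:, i:i+64]) for i in range(0, m, 64)]
for epoch in range(50):
    for X, Y in batches:                            # for t = 1 ... num mini-batches
        mb = X.shape[1]
        # forward prop on X^{t}, replacing Z with Z_tilde via batch norm
        Z = W @ X
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        Z_norm = (Z - mu) / np.sqrt(var + eps)
        Z_tilde = gamma * Z_norm + beta
        A = Z_tilde                                 # identity activation for brevity
        # backprop: dW, dbeta, dgamma (standard batch-norm backward pass)
        dA = 2 * (A - Y) / mb
        dgamma = np.sum(dA * Z_norm, axis=1, keepdims=True)
        dbeta = np.sum(dA, axis=1, keepdims=True)
        dZ_norm = dA * gamma
        dZ = (mb * dZ_norm
              - np.sum(dZ_norm, axis=1, keepdims=True)
              - Z_norm * np.sum(dZ_norm * Z_norm, axis=1, keepdims=True)
              ) / (mb * np.sqrt(var + eps))
        dW = dZ @ X.T
        # plain gradient descent update (momentum / RMSprop / Adam would also work)
        W -= alpha * dW
        gamma -= alpha * dgamma
        beta -= alpha * dbeta
```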
Why does Batch Norm work?
- Learning on shifting input distribution
- covariate shift
- the data distribution differs between training and prediction
- Why is this a problem for neural networks?
- normalize to mean 0, variance 1
- however much the inputs to a layer shift, batch norm keeps their mean and variance fixed (governed by \(\beta^{[l]},\gamma^{[l]}\))
- this reduces the problem of shifting input values, so learning is more stable
- even while earlier layers keep learning, their effect on later layers is reduced
- each layer can learn somewhat independently of the others, which speeds up learning
- Batch Norm as regularization
- Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
- This adds some noise to the values \(z^{[l]}\) within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer’s activations.
- This has a slight regularization effect.
- Do not use batch norm for regularization; this is only a slight, unintended side effect.
Batch Norm at test time
- batch-norm in training:\[
\mu=\frac1m\sum_i z^{(i)}\\
\sigma^2=\frac1m\sum_i (z^{(i)}-\mu)^2\\
z_{norm}^{(i)}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}\\
\tilde z^{(i)}=\gamma z_{norm}^{(i)}+\beta
\] - At test time you may need to process a single example at a time, so there is no mini-batch to compute \(\mu,\sigma^2\) from
- \(\mu,\sigma^2\): estimate them with an exponentially weighted average across mini-batches
- \(X^{\{t\}}: \mu^{\{t\}[l]}, \sigma^{2\{t\}[l]} \rightarrow \mu,\sigma^2\)
- \[
z_{norm}=\frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}\\
\tilde z=\gamma z_{norm}+\beta
\]
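A small NumPy sketch of the test-time procedure (my own illustration; the layer values, \(\gamma\), \(\beta\), and the EWA decay are assumptions):

```python
import numpy as np

np.random.seed(1)
n, eps, decay = 4, 1e-8, 0.9
gamma, beta = np.ones((n, 1)), np.zeros((n, 1))
running_mu, running_var = np.zeros((n, 1)), np.ones((n, 1))

for t in range(100):                                 # loop over mini-batches X^{t}
    Z = np.random.randn(n, 64) * 2.0 + 3.0           # stand-in for this batch's z values
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = decay * running_mu + (1 - decay) * mu      # exponentially weighted
    running_var = decay * running_var + (1 - decay) * var   # averages of mu, sigma^2
    Z_tilde = gamma * (Z - mu) / np.sqrt(var + eps) + beta  # training uses batch stats

# test time: a single example, normalized with the running estimates
z = np.random.randn(n, 1) * 2.0 + 3.0
z_norm = (z - running_mu) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta
print(z_tilde.ravel())
```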
Multi-class classification
Softmax Regression
- Generalization of Logistic regression
- multiple classification
- Recognizing cats, dogs, and baby chicks
- \[
C=\text{#classes}=4 (0,\dots,3)\\
n^{[L]}=4=C\\
4: p(other|x),p(cat|x),p(dog|x),p(bc|x)\\
\] - \[
Z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]} \dots(4,1)
\] - Activation function:\[
t=e^{z^{[L]}}\\
a^{[L]}=\frac{e^{z^{[L]}}}{\sum_{i=1}^4 t_i}\\
a_i^{[L]}=\frac{t_i}{\sum_{j=1}^4 t_j}\\
\] - \[
Z^{[L]}=\begin{bmatrix}
5\\2\\-1\\3\end{bmatrix}\\
t=\begin{bmatrix}
e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}
=\begin{bmatrix}148.4\\7.4\\0.4\\20.1\end{bmatrix},\quad \sum_{j=1}^4 t_j=176.3\\
a^{[L]}=\frac{t}{176.3}
\]
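The same worked example checked in NumPy (a quick verification, not course code):

```python
import numpy as np

z_L = np.array([5.0, 2.0, -1.0, 3.0])
t = np.exp(z_L)                    # ~[148.4, 7.4, 0.4, 20.1]
a_L = t / t.sum()                  # softmax: ~[0.842, 0.042, 0.002, 0.114]
print(t.sum(), a_L)                # sum of t ~= 176.3
```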
Training a softmax classifier
- Understanding softmax
- \[
z^{[L]}=\begin{bmatrix}
5\\2\\-1\\3\end{bmatrix}\\
t=\begin{bmatrix}
e^5\\e^2\\e^{-1}\\e^3\end{bmatrix}\\
a^{[L]}=g^{[L]}(z^{[L]})=\begin{bmatrix}
e^5 / ( e^5+e^2+e^{-1}+e^3 )\\
e^2 / ( e^5+e^2+e^{-1}+e^3 )\\
e^{-1} / ( e^5+e^2+e^{-1}+e^3 )\\
e^3 / ( e^5+e^2+e^{-1}+e^3 )
\end{bmatrix}
= \begin{bmatrix}
0.842\\0.042\\0.002\\0.114\end{bmatrix}
\] - Softmax regression generalizes logistic regression to C classes.
- Loss function
- \[
y=\begin{bmatrix}0\\1\\0\\0\end{bmatrix}\dots cat\\
a^{[L]}=\hat y=\begin{bmatrix}0.3\\0.2\\0.1\\0.4\end{bmatrix}\\
L(\hat y, y)=-\sum_{j=1}^4 y_j \log \hat y_j\\
-y_2 \log \hat y_2=-\log \hat y_2 \dots\text{make }\hat y_2\text{ big}
\] - Gradient descent with softmax
- backprop: \[
dz^{[L]}=\frac{\partial J}{\partial z^{[L]}}=\hat y-y
\]
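A small NumPy check of the loss and gradient above, using the \(\hat y\) from the example (my own illustration):

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot label: cat
y_hat = np.array([0.3, 0.2, 0.1, 0.4])    # a^{[L]} from the example
loss = -np.sum(y * np.log(y_hat))         # cross-entropy: only -log(0.2) survives
dz_L = y_hat - y                          # backprop through softmax: dz^{[L]} = y_hat - y
print(loss, dz_L)                         # ~1.609, [0.3, -0.8, 0.1, 0.4]
```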
Introduction to programming frameworks
Deep learning frameworks
- frameworks
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- TensorFlow
- Theano
- Torch
- Choosing deep learning frameworks
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)
TensorFlow
- Motivating problem
- minimize cost function: \[
J(w,b)
\] - find: \[
\begin{align*}
J(w)&=w^2-10w+25\\
&=(w-5)^2\\
w&=5
\end{align*}
\]
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))      # 0.0

session.run(train)         # one step of gradient descent
print(session.run(w))      # 0.1

for i in range(1000):
    session.run(train)
print(session.run(w))      # 4.9999
w = tf.Variable(0, dtype=tf.float32)
#cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
cost = w**2 - 10*w + 25    # same cost function, using operators overloaded by TF
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
...
import numpy as np
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])     # data fed in at run time
#cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
#cost = w**2 - 10*w + 25
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]  # (w-5)**2 with these coefficients
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

# the with-block replaces session = tf.Session(); session.run(init); ...
with tf.Session() as session:
    session.run(init)
    print(session.run(w))                  # 0.0
    session.run(train, feed_dict={x: coefficients})
    print(session.run(w))                  # 0.1
    for i in range(1000):
        session.run(train, feed_dict={x: coefficients})
    print(session.run(w))                  # 4.9999
Programming assignment
Exploring the Tensorflow Library
- To summarize, remember to initialize your variables, create a session and run the operations inside the session.
To summarize, you now know how to:
- Create placeholders
- Specify the computation graph corresponding to operations you want to compute
- Create the session
- Run the session, using a feed dictionary if necessary to specify placeholder variables’ values.
What you should remember:
- Tensorflow is a programming framework used in deep learning
- The two main object classes in tensorflow are Tensors and Operators.
- When you code in tensorflow you have to take the following steps:
- Create a graph containing Tensors (Variables, Placeholders …) and Operations (tf.matmul, tf.add, …)
- Create a session
- Initialize the session
- Run the session to execute the graph
- You can execute the graph multiple times as you’ve seen in model()
- The backpropagation and optimization are automatically done when running the session on the “optimizer” object.