Key Concepts
- Use gradient checking to verify the correctness of your backpropagation implementation
- Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
- Recognize the importance of initialization in complex neural networks.
- Learn when and how to use regularization methods such as dropout or L2 regularization.
- Recognize the difference between train/dev/test sets
- Recall that different types of initializations lead to different results
- Diagnose the bias and variance issues in your model
[mathjax]
Setting up your Machine Learning Application
Train / Dev / Test
- Train / Dev / Test sets
- Applied ML is a highly iterative process
- #layers
- #hidden units
- learning rates
- activation functions
- Intuitions do not transfer well across domains
- NLP, speech recognition, advertising, security, logistics
- Intuition gained in one domain or application area often cannot be reused in another
- The same goes for hardware setup (CPU/GPU configuration)
- Even very experienced DL practitioners can almost never pick the best hyperparameter settings on the first try
- As a result, applying DL is a highly iterative process
- Going around this iterative loop efficiently is what matters
- Train / dev / test sets
- Data
- Training set
- Hold-out cross validation / development set / dev set
- test
- Workflow
- learn with training set
- model test with dev set
- test with test set
- Previous (small data) era (100–10,000 examples)
- 70/30 % train/test
- 60/20/20 % train/dev/test
- Big data era (1,000,000+ examples)
- The dev set only needs to be big enough to compare models and pick the better one
- The test set only needs to be big enough to give confidence in the final performance
- A split like 98/1/1 % is fine (a code sketch follows at the end of this section)
- Mismatched train/test distribution
- It is important that the dev and test sets come from the same distribution
- e.g. dev/test: photos uploaded by users of a mobile-phone camera app
- train: high-resolution images scraped from the web
- these can end up being different distributions
- keeping dev and test on the same distribution lets the team iterate on the algorithm faster
- train/test
- by convention people say "train/test", but in practice that split is really train/dev
- Not having a test set might be okay (only dev)
- the test set only exists to give an unbiased estimate of final performance, so if you don't need that estimate it can be skipped
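To make the big-data-era split concrete, here is a minimal NumPy sketch (assuming examples are stored one per column in `X` and `Y`; the function name and defaults are illustrative, not from the course):

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle examples (one per column) and split into train/dev/test sets."""
    m = X.shape[1]                                  # number of examples
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]

    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    n_train = m - n_dev - n_test                    # e.g. 98% when dev_frac = test_frac = 0.01

    return (X[:, :n_train], Y[:, :n_train],                                 # train
            X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev],   # dev
            X[:, n_train + n_dev:], Y[:, n_train + n_dev:])                 # test
```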
Bias / Variance
- Bias and Variance
- high bias
- underfitting
- just right
- high variance
- overfitting
- With 2 features you can visualize the decision boundary and see under/overfitting, but in higher dimensions you cannot, so look at the train and dev set errors instead
| | Case 1 | Case 2 | Case 3 | Case 4 |
| --- | --- | --- | --- | --- |
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
| Diagnosis | high variance | high bias, low variance (assuming human error ≈ 0%; if optimal (Bayes) error were 15%, this would be fine) | high bias and high variance | low bias and low variance |
- High bias and high variance
- Not merely a trade-off: both can hold at the same time
- e.g. a decision boundary that is mostly linear (high bias) but also overfits a few regions (high variance)
Basic Recipe for Machine Learning
- High bias? (look at training set performance)
- Bigger network
- Train longer
- (NN architecture search)
- High variance? (look at dev set performance)
- More data
- Regularization
- (NN architecture search)
- High bias? No, and high variance? No
- done
- Pre-DL era
- bias variance trade off
- we did not have many tools that could reduce one without hurting the other; in the DL era, a bigger network and more data can reduce bias and variance largely independently
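As a rough code-level summary of this recipe (the function, the error thresholds, and the action strings below are illustrative assumptions, not from the course):

```python
def basic_recipe(train_error, dev_error, target_error=0.01):
    """Map train/dev errors to the 'basic recipe' actions (illustrative thresholds)."""
    actions = []
    if train_error > target_error:              # high bias: poor training set performance
        actions += ["bigger network", "train longer", "(NN architecture search)"]
    if dev_error - train_error > target_error:  # high variance: large train -> dev gap
        actions += ["more data", "regularization", "(NN architecture search)"]
    return actions or ["done"]

print(basic_recipe(train_error=0.15, dev_error=0.16))  # high bias case
print(basic_recipe(train_error=0.01, dev_error=0.11))  # high variance case
```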
Regularizing your neural network
Reference: Machine Learning [Week 3/11] Logistic Regression
Regularization
- Logistic regression
- \(\min_{w,b} J(w,b)\)
- \[
\require{cancel}
J(w,b)=\frac1m \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})+\frac{\lambda}{2m} \|w\|_2^2\xcancel{+ \frac{\lambda}{2m}b^2}
\]- b is just one parameter among many, so the b term can be omitted
- \(L_2\) regularization:\[
\|w\|_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw
\]- the most commonly used form of regularization
- \(L_1\) regularization:\[
\frac{\lambda}{2m}\sum_{j=1}^{n_x}|w_j|=\frac{\lambda}{2m}\|w\|_1
\]- w becomes sparse, i.e. many of its components end up exactly 0
- this saves some memory, but in practice it helps only a little as regularization, so it is rarely used
- since lambda is a reserved word in Python, use lambd in code
- Neural network
- \[
J(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]})=\frac1m\sum_{i=1}^{m}L(\hat y^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\|w^{[l]}\|_F^2\\
\|w^{[l]}\|_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2
\]- F: Frobenius norm (the matrix counterpart of the L2 norm)
- \[
dw^{[l]}=(\text{backprop})+\frac{\lambda}{m} w^{[l]}\\
w^{[l]}:=w^{[l]}-\alpha\, dw^{[l]}
\] - Weight decay: \[
\begin{align*}
w^{[l]}&:=w^{[l]}-\alpha [(\text{backprop})+\frac{\lambda}{m} w^{[l]}]\\
&=w^{[l]}-\frac{\alpha \lambda}{m}w^{[l]}-\alpha(\text{backprop})
\end{align*}
\] - Each update multiplies \(w^{[l]}\) by a factor slightly less than 1 (e.g. 0.99), hence "weight decay": \[
(1-\frac{\alpha\lambda}{m})w^{[l]}
\]
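To make the formulas above concrete, here is a minimal NumPy sketch of the L2 cost term and the extra gradient term (assuming `parameters` stores the matrices under keys `W1, b1, ..., WL, bL`; `cross_entropy_cost` and `grads` are assumed to come from the usual forward/backward pass):

```python
import numpy as np

def compute_cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 penalty (lambda / 2m) * sum_l ||W[l]||_F^2 to the unregularized cost."""
    L = len(parameters) // 2                    # number of layers
    l2_term = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term

def add_l2_to_gradients(grads, parameters, lambd, m):
    """dW[l] = (backprop term) + (lambda / m) * W[l]; the weight-decay update follows."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        grads["dW" + str(l)] += (lambd / m) * parameters["W" + str(l)]
    return grads
```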
Why regularization reduces overfitting?
- \[
J(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]})=\frac1m\sum_{i=1}^{m}L(\hat y^{(i)},y^{(i)})+\frac \lambda {2m} \sum_{l=1}^{L}\| w^{[l]} \|_F^2
\]- Intuition 1: making \(\lambda\) large pushes \(w^{[l]}\approx0\)
- effectively a much smaller neural network
- almost logistic regression
- in reality the units are not zeroed out completely; their influence just shrinks
- Intuition 2: the network as a whole becomes closer to linear\[
z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}\\ g(z)=\tanh(z)\\
\lambda\uparrow \Rightarrow w^{[l]}\downarrow \Rightarrow z^{[l]} \text{ stays small, where } \tanh \text{ is roughly linear}\\
\text{every layer} \approx \text{linear}
\]
- Implementation tips
- Gradient descent debugging
- plot the cost function J against the iteration number and check that it decreases monotonically
- use the new definition of J, i.e. the one that includes the regularization term
Dropout Regularization
Implementing dropout(Inverted dropout)
Train while randomly "dropping out" nodes: a coin toss decides for each node whether it is kept.
Illustrate with layer l=3
keep_prob=0.8
d3=np.random.rand(a3.shape[0],a3.shape[1]) < keep_prob # dropout mask: True with probability keep_prob
a3=np.multiply(a3,d3) # a3 *= d3 (shuts off ~20% of units, since keep_prob=0.8)
a3 /= keep_prob # inverted dropout: keeps the expected value of a3 unchanged
Making predictions at test time
Do not use dropout at test time. It would only add noise to the predictions; averaging many dropped-out forward passes would give a similar result, but it is simply less efficient.
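A minimal sketch of the same idea as one layer's forward pass, making the train-time vs. test-time difference explicit (the `relu` helper and the shapes are assumptions for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer_forward(a_prev, W, b, keep_prob=0.8, training=True):
    """One hidden layer with inverted dropout applied only at training time."""
    a = relu(W @ a_prev + b)
    if training:
        d = np.random.rand(*a.shape) < keep_prob  # dropout mask
        a = a * d                                 # shut off ~(1 - keep_prob) of the units
        a = a / keep_prob                         # invert: keep the expected value of a
    return a                                      # test time: plain forward pass, no mask
```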
Why does drop-out work?
Intuition: Can’t rely on any one feature, so have to spread out weights.
Set keep_prob per layer based on the layer's size: bigger weight matrices (more units) get a lower keep_prob.
If you're more worried about some layers overfitting than others, you can set a lower keep_prob for those layers.
The downside: even more hyperparameters to search over using cross-validation.
Alternatively, apply dropout to some layers and not to others, so there is only a single keep_prob hyperparameter.
Implementation Tips
In computer vision the inputs are very high dimensional, so researchers there tend to use dropout almost by default; their intuition does not necessarily generalize to other domains, though. Do not add regularization unless the model is actually overfitting.
A weakness of dropout is that the cost function J is no longer well defined. Check with a plot that J decreases on every iteration (for example with dropout turned off, keep_prob = 1, first).
Other regularization methods
- Data augmentation
- flip images horizontally
- take random crops and rotations
- this acts as regularization and helps prevent overfitting
- Early stopping: plot error against #iterations
- Training error or J
- Dev set error
- stop at the point where dev set error is lowest
- Orthogonalization: keep the two tasks cleanly separated
- optimize the cost function J (gradient descent, ...)
- do not overfit (regularization, ...); early stopping mixes these two tasks, which is its main drawback
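A minimal sketch of early stopping, assuming hypothetical `train_one_epoch`, `dev_error`, and `model.get_weights()` / `model.set_weights()` helpers:

```python
def train_with_early_stopping(model, num_epochs=100, patience=5):
    """Stop once dev error has not improved for `patience` epochs; keep the best weights."""
    best_dev_error, best_weights, stale_epochs = float("inf"), None, 0
    for epoch in range(num_epochs):
        train_one_epoch(model)                   # hypothetical: one pass of gradient descent
        err = dev_error(model)                   # hypothetical: error on the dev set
        if err < best_dev_error:
            best_dev_error, best_weights, stale_epochs = err, model.get_weights(), 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                            # dev error stopped improving
    model.set_weights(best_weights)              # roll back to the weights with lowest dev error
    return model
```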
Setting up your optimization problem
Normalizing inputs
- Subtract the mean: \[
\mu=\frac1m\sum_{i=1}^m x^{(i)}\\
x:=x-\mu
\] - Normalize the variance: \[
\sigma^2=\frac1m\sum_{i=1}^m (x^{(i)})^2 \quad(\text{element-wise})\\
x := x / \sigma
\] - Use the same \(\mu\) and \(\sigma\) (computed on the training set) to normalize the test set: \(x \mapsto \frac{x-\mu}{\sigma}\).
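A minimal NumPy sketch (assuming examples are stored one per column; the small `eps` only guards against division by zero):

```python
import numpy as np

def normalize_train_test(X_train, X_test, eps=1e-8):
    """Zero-mean / unit-variance normalization using training-set statistics only."""
    mu = np.mean(X_train, axis=1, keepdims=True)     # per-feature mean
    sigma = np.std(X_train, axis=1, keepdims=True)   # per-feature standard deviation
    X_train_norm = (X_train - mu) / (sigma + eps)
    X_test_norm = (X_test - mu) / (sigma + eps)      # same mu / sigma as the training set
    return X_train_norm, X_test_norm, mu, sigma
```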
Vanishing / exploding gradients
- If \(L\) is large in a very deep NN and the weight matrices are slightly larger than the identity (entries > 1), the activations \(\hat y\) (and the gradients) explode, growing exponentially with \(L\).
- If \(L\) is large in a very deep NN and the weight matrices are slightly smaller than the identity (entries < 1), the activations \(\hat y\) (and the gradients) vanish, shrinking exponentially with \(L\).
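A toy worked example: assume, purely for illustration, a deep linear network in which every layer multiplies by the same scalar weight \(w\); then for \(L=50\):
\[
\hat y \approx w^{L} x:\qquad 1.5^{50}\approx 6.4\times 10^{8},\qquad 0.5^{50}\approx 8.9\times 10^{-16}
\]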
Weight Initialization for Deep Networks
- Single neuron example (sigmoid): \[
z=w_1 x_1+w_2 x_2+ \dots +w_n x_n(+b)\\
n\uparrow \rightarrow w\downarrow \\
Var(w_i)=\frac1n\\
w^{[l]}=np.random.randn(shape)*\underbrace{np.sqrt(\frac1{n^{[l-1]}})}
\] - Other choices of variance
- ReLU (He initialization):\[
\sqrt{\frac2{n^{[l-1]}}}\\
\]
- \(\tanh\) (Xavier initialization): \[
\sqrt{\frac1{n^{[l-1]}}}
\] - other:\[
\sqrt{\frac2{n^{[l]}+n^{[l-1]}}}\\
\]
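A minimal sketch of He initialization for a network described by `layer_dims = [n_x, n_1, ..., n_L]` (the function and variable names are illustrative):

```python
import numpy as np

def initialize_parameters_he(layer_dims, seed=3):
    """He initialization: W[l] has variance 2 / n^[l-1]; biases start at zero."""
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# For tanh layers, swap in the Xavier factor instead: np.sqrt(1.0 / layer_dims[l - 1])
```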
Numerical approximation of gradients
- checking your derivative computation
- \[
f(\theta)=\theta^3\\
\frac{f(\theta+\epsilon)- f(\theta-\epsilon)}{2\epsilon}\approx g(\theta)\\
\frac{1.01^3-0.99^3}{2(0.01)}=3.0001\approx 3\\
g(\theta)=3\theta^2=3\\
\text{approx error:} 0.0001
\] - \[
f'(\theta)= \lim_{\epsilon \to 0} \underbrace{\frac{f(\theta+\epsilon)- f(\theta-\epsilon)}{2\epsilon}}_{O(\epsilon^2):0.0001}\\
\] - \[
f'(\theta)= \lim_{\epsilon \to 0} \underbrace{\frac{f(\theta+\epsilon)- f(\theta)}{\epsilon}}_{O(\epsilon):0.01}\\
\] - The two-sided difference is more accurate than the one-sided difference.
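A quick numerical check of the two difference formulas for \(f(\theta)=\theta^3\) at \(\theta=1\), reproducing the numbers above:

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # 3.0001 -> error O(eps^2)
one_sided = (f(theta + eps) - f(theta)) / eps              # 3.0301 -> error O(eps)
exact = 3 * theta ** 2                                     # g(theta) = 3
print(two_sided - exact, one_sided - exact)                # ~1e-4 vs ~3e-2
```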
Gradient checking
- Gradient check for a neural network
- Take \(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]} \) and reshape into a big vector \(\theta\)
- concatenate: \(J(\theta)\)
- Take \(dW^{[1]},db^{[1]},\dots,dW^{[L]},db^{[L]} \) and reshape into a big vector \(d\theta\)
- concatenate: \(d\theta\) the gradient of \(J(\theta)\)
- Take \(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]} \) and reshape into a big vector \(\theta\)
- Gradient checking (Grad check)
- for each i: \[
\begin{align*}
d\theta_{approx}[i]&=\frac{J(\theta_1,\theta_2,\dots,\theta_i+\epsilon,\dots)- J(\theta_1,\theta_2,\dots,\theta_i-\epsilon,\dots) }{2\epsilon}\\
&\approx d\theta[i]=\frac{\partial J}{\partial \theta_i}\\
d\theta_{approx}&\approx d\theta (\leftarrow \text{ check it})
\end{align*}
\] - check the relative difference (Euclidean norms, with \(\epsilon=10^{-7}\)):\[
\frac{\| d\theta_{approx}-d\theta\|_2}{\|d\theta_{approx}\|_2+\|d\theta\|_2}\approx \begin{cases}
10^{-7} &(\leftarrow \text{great!})\\
10^{-5} &(\leftarrow \text{double-check the vector components})\\
10^{-3} &(\leftarrow \text{worry; there is probably a bug})
\end{cases}
\]
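A minimal sketch of the whole check, assuming `theta` and `dtheta` are already the flattened 1-D parameter and gradient vectors and `J` is a function that takes the flattened vector and returns the cost:

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare backprop gradients `dtheta` with a two-sided numerical approximation."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    difference = (np.linalg.norm(dtheta_approx - dtheta)
                  / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return difference  # ~1e-7 great, ~1e-5 double-check, ~1e-3 probably a bug
```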
Gradient Checking Implementation Notes
- Don’t use in training – only to debug
- because computing \(d\theta_{approx}[i]\) for every \(i\) is very slow, unlike backprop, which gives \(d\theta\) cheaply
- If algorithm fails grad check, look at components to try to identify bug.
- Remember regularization: the cost J used in the check must include the regularization term.
- Doesn’t work with dropout.
- dropout makes the cost function J ill-defined, so grad check cannot be run with it on
- e.g. set keep_prob = 1 while running grad check, then turn dropout back on
- Run at random initialization; perhaps again after some training.
Programming assignments
Initialization
- Zero initialization: What you should remember:
- The weights \(W^{[l]}\) should be initialized randomly to break symmetry.
- It is however okay to initialize the biases \(b^{[l]}\) to zeros. Symmetry is still broken so long as \(W^{[l]}\) is initialized randomly.
- Random initialization: In summary:
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!
- What you should remember from this notebook:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- He initialization works well for networks with ReLU activations.
Regularization
- What you should remember — the implications of L2-regularization on:
- The cost computation:
- A regularization term is added to the cost
- The backpropagation function:
- There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller (“weight decay”):
- Weights are pushed to smaller values.
- The cost computation:
- What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
- What we want you to remember from this notebook:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.
Gradient Checking
- What you should remember from this notebook:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.