Key Concepts
- Be able to apply a variety of activation functions in a neural network.
- Become fluent with Deep Learning notations and Neural Network Representations
- Build and train a neural network with one hidden layer.
- Build your first forward and backward propagation with a hidden layer
- Understand hidden units and hidden layers
- Apply random initialization to your neural network
[mathjax]
Lecture notes for Neural Networks and Deep Learning (deeplearning.ai)
Week 3: Shallow neural networks
Neural Network Overview
\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]
\[
\matrix{
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow\\
\\
}
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]
Neural Network Representation
- Input layer
- Layer of the feature values \(x_1, x_2, x_3\), etc.
- \(a^{[0]}=x\); "activation" refers to the values a layer passes on to the next layer
- Hidden layer
- Called "hidden" because its values are not observed in the training set
- $$a^{[1]}=\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
\vdots\\
a_4^{[1]}
\end{bmatrix}$$
- $$\underbrace{W^{[1]}}_{(4,3)}, \underbrace{b^{[1]}}_{(4,1)}$$ 4 is the number of hidden units, 3 is the number of feature values
- Output layer
- Outputs \(\hat y=a^{[2]}\)
- In logistic regression there is a single output, so no layer superscript is used; in a neural network the superscript indicates which layer a value comes from
- \(\underbrace{W^{[2]}}_{(1,4)}, \underbrace{b^{[2]}}_{(1,1)}\) 4 is the number of hidden units, 1 is the number of output units
- 2-layer NN
- The input layer is not counted, so this is a 2-layer network
- The input layer is called layer 0
Computing a Neural Network’s Output
- Computation at the first hidden node
- \(z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]}\)
- \(a_1^{[1]}=\sigma(z_1^{[1]})\), in general \(a_i^{[l]}=\sigma(z_i^{[l]})\)
- \(l: \text{layer}\)
- \(i: \text{node in the layer}\)
Vectorization
\[
\displaylines{
z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]},
a_1^{[1]}=\sigma(z_1^{[1]})\\
z_2^{[1]}=w_2^{[1]T}x+b_2^{[1]},
a_2^{[1]}=\sigma(z_2^{[1]})\\
z_3^{[1]}=w_3^{[1]T}x+b_3^{[1]},
a_3^{[1]}=\sigma(z_3^{[1]})\\
z_4^{[1]}=w_4^{[1]T}x+b_4^{[1]},
a_4^{[1]}=\sigma(z_4^{[1]})\\
}
\]
\[
\displaylines{
z^{[1]}
=\overbrace{\begin{bmatrix}
-w_1^{[1]T}-\\
-w_2^{[1]T}-\\
-w_3^{[1]T}-\\
-w_4^{[1]T}-
\end{bmatrix}}^{W^{[1]}}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+\begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}\\
b_4^{[1]}\\
\end{bmatrix}
=\begin{bmatrix}
w_1^{[1]T}x+b_1^{[1]}\\
w_2^{[1]T}x+b_2^{[1]}\\
w_3^{[1]T}x+b_3^{[1]}\\
w_4^{[1]T}x+b_4^{[1]}
\end{bmatrix}
=\begin{bmatrix}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{bmatrix}\\
z^{[1]}=W^{[1]}x+b^{[1]}
}
\]
\[
\displaylines{
a^{[1]}
=\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{bmatrix}
=\sigma(z^{[1]})
}
\]
Neural Network Representation learning
\[
\eqalign{
z^{[1]}&=W^{[1]}x+b^{[1]}\\
a^{[1]}&=\sigma({z^{[1]}})\\
z^{[2]}&=W^{[2]}a^{[1]}+b^{[2]}\\
a^{[2]}&=\sigma({z^{[2]}})
}
\]
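A minimal NumPy sketch of these four equations for a single example; the layer sizes (3 features, 4 hidden units, 1 output) and the random parameters are illustrative assumptions, not values from the course:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative layer sizes: 3 input features, 4 hidden units, 1 output unit
n_x, n_h, n_y = 3, 4, 1
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))

x = rng.standard_normal((n_x, 1))   # one training example as a column vector

z1 = W1 @ x + b1        # z^[1] = W^[1] x + b^[1], shape (4, 1)
a1 = sigmoid(z1)        # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2       # z^[2] = W^[2] a^[1] + b^[2], shape (1, 1)
a2 = sigmoid(z2)        # a^[2] = y_hat
```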
Vectorizing across multiple examples
\[
\eqalign{
x\rightarrow a^{[2]}&=\hat y\\
x^{(1)}\rightarrow a^{[2](1)}&=\hat y^{(1)}\\
x^{(2)}\rightarrow a^{[2](2)}&=\hat y^{(2)}\\
\vdots\\
x^{(m)}\rightarrow a^{[2](m)}&=\hat y^{(m)}}
\]
\[
[2]: \text{layer}\\
(i): \text{training example i}
\]
- forward propagation
- for i=1 to m
- \[
\eqalign{
z^{[1](i)}&=W^{[1]}x^{(i)}+b^{[1]}\\
a^{[1](i)}&=\sigma({z^{[1](i)}})\\
z^{[2](i)}&=W^{[2]}a^{[1](i)}+b^{[2]}\\
a^{[2](i)}&=\sigma({z^{[2](i)}})
}
\]
- \[
X=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
x^{(1)}&x^{(2)}&\dots&x^{(m)}\\
|&|&|&|
\end{bmatrix}}_{(n_x,m)}
\]
\[
\eqalign{
Z^{[1]}&=W^{[1]}X+b^{[1]}\\
A^{[1]}&=\sigma(Z^{[1]})\\
Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}&=\sigma(Z^{[2]})
}
\]
\[
Z^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
z^{[1](1)}&z^{[1](2)}&\dots&z^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
\[
A^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
a^{[1](1)}&a^{[1](2)}&\dots&a^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
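A sketch of the same forward pass vectorized over all m training examples; the shapes and data are illustrative, and NumPy broadcasting adds \(b^{[1]}, b^{[2]}\) to every column:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5           # illustrative sizes: 5 training examples
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))

X = rng.standard_normal((n_x, m))       # columns are training examples, shape (n_x, m)

Z1 = W1 @ X + b1                        # shape (n_h, m); b1 broadcasts across columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                       # shape (n_y, m)
A2 = sigmoid(Z2)                        # row of predictions y_hat^(1) ... y_hat^(m)
```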
Activation functions
- g(z)
- sigmoid function: $$a=\sigma(z)=\frac{1}{1+e^{-z}}$$
- tanh function: $$a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
- ReLU: $$a=\max(0,z)$$
- Leaky ReLU: $$a=\max(0.01z,z)$$
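The four activation functions above, sketched as element-wise NumPy functions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                   # same as (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)     # slope * z for z < 0, z otherwise
```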
Why do you need non-linear activation functions?
- hidden layer
- Use a non-linear activation such as ReLU, tanh, or Leaky ReLU, not a linear one
- output layer
- A linear activation can be used when \(\hat y\) is a real number (regression)
- hidden: linear, output: sigmoid
- is equivalent to logistic regression with no hidden layer
- a linear hidden layer computes nothing useful: the output is still just a linear function of the input x, because a composition of linear functions is itself linear
Derivatives of activation functions
- sigmoid activation function
- $$g(z)=\frac{1}{1+e^{-z}}$$
- $$\eqalign{
g'(z)&=\frac{d}{dz}g(z)\\
&=\text{slope of g(z) at z}\\
&=\frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}})\\
&=g(z)(1-g(z))=a(1-a)
}$$
- tanh activation function
- $$g(z)=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
- $$\eqalign{
g'(z)&=\frac{d}{dz}g(z)\\
&=\text{slope of g(z) at z}\\
&=1-(\tanh(z))^2\\
&=1-a^2
}$$
- ReLU
- $$g(z)=\max(0,z)$$
- $$g'(z)=\begin{cases}
0 & (z\lt 0)\\
1 & (z\geq 0)
\end{cases}$$
- Leaky ReLU
- $$g(z)=\max(0.01z,z)$$
- $$g'(z)=\begin{cases}
0.01 & (z\lt 0)\\
1 & (z\geq 0)
\end{cases}$$
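The corresponding derivatives, sketched in NumPy (the value returned at \(z=0\) for ReLU / Leaky ReLU follows the convention above):

```python
import numpy as np

def sigmoid_prime(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                  # g'(z) = a(1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2                   # g'(z) = 1 - a^2

def relu_prime(z):
    return np.where(z < 0, 0.0, 1.0)    # 0 for z < 0, 1 for z >= 0

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)  # slope for z < 0, 1 for z >= 0
```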
Gradient descent for Neural Networks
- Formulas for computing derivatives
- Forward propagation (4 equations)
- \[\eqalign{
Z^{[1]}&=W^{[1]}X+b^{[1]}\\
A^{[1]}&=g^{[1]}(Z^{[1]})\\
Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}&=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})
}\]
- For binary classification, \(g^{[2]}\) is the sigmoid function
- Backpropagation (6 equations)
- \[\eqalign{
dZ^{[2]}&=A^{[2]}-Y\\
dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims=True})\\
dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
db^{[1]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims=True})
}\]
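A sketch of one full gradient-descent iteration that combines the forward and backward passes above; the tanh hidden activation, layer sizes, random data, and learning rate are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m))     # binary labels
alpha = 0.1                             # learning rate (assumed)

# forward propagation (tanh hidden layer, sigmoid output)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backpropagation (the six equations above)
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # g^[1]'(Z1) = 1 - tanh(Z1)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient-descent update
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```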
Backpropagation intuition
Computing gradients
Logistic regression
\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]
\[
\begin{align}
L(a,y)&=-y\log a-(1-y)\log(1-a)\\
da&=\frac{d}{da} L(a,y)=-\frac{y}{a}+\frac{1-y}{1-a}\\
dz&=\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a} \cdot \frac{da}{dz}=da \cdot g'(z), \quad g(z)=\sigma(z)\\
&=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\cdot a(1-a)=a-y
\end{align}
\]
\[
\matrix{
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow\\
\\
}
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]
- \[\eqalign{
dz^{[2]}&=a^{[2]}-y\\
dW^{[2]}&=dz^{[2]}a^{[1]T}\\
db^{[2]}&=dz^{[2]}\\
dz^{[1]}&=W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})\\
dW^{[1]}&=dz^{[1]}x^T\\
db^{[1]}&=dz^{[1]}
}\]
- \[\eqalign{
dZ^{[2]}&=A^{[2]}-Y\\
dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims=True})\\
dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
db^{[1]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims=True})
}\]
Random Initialization
- When training a neural network, it is important to initialize the weights randomly
- Initializing them to 0 makes the hidden units symmetric: they all compute exactly the same function
\[\eqalign{
W^{[1]}&=\text{np.random.randn}(n_h,n_x)\times 0.01\\
b^{[1]}&=\text{np.zeros}((n_h,1))\\
W^{[2]}&=\text{np.random.randn}(n_y,n_h)\times 0.01\\
b^{[2]}&=\text{np.zeros}((n_y,1))
}\]
- Why the constant 0.01
- 0.01 is chosen with the slopes of sigmoid/tanh in mind
- with a large constant like 100, z lands in the flat regions of the activation and learning is slow
- for deeper networks you may want a constant other than 0.01
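A sketch of this initialization in NumPy (note that np.random.randn takes the dimensions as separate arguments, not a tuple; the layer sizes are illustrative):

```python
import numpy as np

n_x, n_h, n_y = 3, 4, 1                 # illustrative layer sizes

W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break symmetry
b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```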
Programming Assignment
- You’ve learnt to:
- Build a complete neural network with a hidden layer
- Make good use of a non-linear unit
- Implement forward propagation and backpropagation, and train a neural network
- See the impact of varying the hidden layer size, including overfitting.
- np.multiply vs np.dot
- np.multiply: Multiply arguments element-wise.
- np.dot: Dot product of two arrays.
- Arithmetic, matrix multiplication, and comparison operations
- A2 > 0.5
- equivalent to ndarray.__gt__(self, 0.5), applied element-wise, returning a boolean array
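A small example contrasting the two operations, plus the element-wise comparison used to turn the probabilities in A2 into 0/1 predictions (the array values are made up):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

elementwise = np.multiply(A, B)   # same as A * B  -> [[10, 40], [90, 160]]
matrix_prod = np.dot(A, B)        # same as A @ B  -> [[70, 100], [150, 220]]

A2 = np.array([[0.2, 0.7, 0.5]])
predictions = A2 > 0.5            # ndarray.__gt__ element-wise -> [[False, True, False]]
```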