Key Concepts
- Be able to apply a variety of activation functions in a neural network.
- Become fluent with Deep Learning notations and Neural Network Representations
- Build and train a neural network with one hidden layer.
- Build your first forward and backward propagation with a hidden layer
- Understand hidden units and hidden layers
- Apply random initialization to your neural network
[mathjax]
Lecture notes for Neural Networks and Deep Learning (deeplearning.ai)
Week 3: Shallow neural networks
Neural Network Overview
\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]
\[
\matrix{
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow\\
\\
}
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]
Neural Network Representation
- Input layer
- Layer of the feature values \(x_1, x_2, x_3\), etc.
- \(a^{[0]}=x\); "activation" refers to the values a layer passes on to the next layer
- Hidden layer
- Called "hidden" because its values are not observed in the training set
- $$a^{[1]}=\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
\vdots\\
a_4^{[1]}
\end{bmatrix}$$
- $$\underbrace{W^{[1]}}_{(4,3)}, \underbrace{b^{[1]}}_{(4,1)}$$ 4 is the number of hidden units, 3 is the number of feature values
- Output layer
- Outputs \(\hat y=a^{[2]}\)
- In logistic regression there is a single output, so no layer superscript is used; in a neural network the superscript indicates which layer a value comes from
- \(\underbrace{W^{[2]}}_{(1,4)}, \underbrace{b^{[2]}}_{(1,1)}\) 4 is the number of hidden units, 1 is the number of output units
- 2-layer NN
- The input layer is not counted, so this is a 2-layer network
- The input layer is called layer 0
Computing a Neural Network’s Output
- Computation at the first hidden node
- \(z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]}\)
- \(a_1^{[1]}=\sigma(z_1^{[1]})\), in general \(a_i^{[l]}=\sigma(z_i^{[l]})\)
- \(l: \text{layer}\)
- \(i: \text{node in the layer}\)
Vectorization
\[
\displaylines{
z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]},
a_1^{[1]}=\sigma(z_1^{[1]})\\
z_2^{[1]}=w_2^{[1]T}x+b_2^{[1]},
a_2^{[1]}=\sigma(z_2^{[1]})\\
z_3^{[1]}=w_3^{[1]T}x+b_3^{[1]},
a_3^{[1]}=\sigma(z_3^{[1]})\\
z_4^{[1]}=w_4^{[1]T}x+b_4^{[1]},
a_4^{[1]}=\sigma(z_4^{[1]})\\
}
\]
\[
\displaylines{
z^{[1]}
=\overbrace{\begin{bmatrix}
-w_1^{[1]T}-\\
-w_2^{[1]T}-\\
-w_3^{[1]T}-\\
-w_4^{[1]T}-
\end{bmatrix}}^{W^{[1]}}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+\begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}\\
b_4^{[1]}\\
\end{bmatrix}
=\begin{bmatrix}
w_1^{[1]T}x+b_1^{[1]}\\
w_2^{[1]T}x+b_2^{[1]}\\
w_3^{[1]T}x+b_3^{[1]}\\
w_4^{[1]T}x+b_4^{[1]}
\end{bmatrix}
=\begin{bmatrix}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{bmatrix}\\
z^{[1]}=W^{[1]}x+b^{[1]}
}
\]
\[
\displaylines{
a^{[1]}
=\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{bmatrix}
=\sigma(z^{[1]})
}
\]
Neural Network Representation learning
\[
\eqalign{
z^{[1]}&=W^{[1]}x+b^{[1]}\\
a^{[1]}&=\sigma({z^{[1]}})\\
z^{[2]}&=W^{[2]}a^{[1]}+b^{[2]}\\
a^{[2]}&=\sigma({z^{[2]}})
}
\]
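A minimal NumPy sketch of these four equations for a single example; the layer sizes (3 features, 4 hidden units, 1 output) and the random parameters are illustrative assumptions, not values from the course:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative layer sizes: 3 input features, 4 hidden units, 1 output unit
n_x, n_h, n_y = 3, 4, 1
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))

x = rng.standard_normal((n_x, 1))   # one training example as a column vector

z1 = W1 @ x + b1        # z^[1] = W^[1] x + b^[1], shape (4, 1)
a1 = sigmoid(z1)        # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2       # z^[2] = W^[2] a^[1] + b^[2], shape (1, 1)
a2 = sigmoid(z2)        # a^[2] = y_hat
```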
Vectorizing across multiple examples
\[
\eqalign{
x\rightarrow a^{[2]}&=\hat y\\
x^{(1)}\rightarrow a^{[2](1)}&=\hat y^{(1)}\\
x^{(2)}\rightarrow a^{[2](2)}&=\hat y^{(2)}\\
\vdots\\
x^{(m)}\rightarrow a^{[2](m)}&=\hat y^{(m)}}
\]
\[
[2]: \text{layer}\\
(i): \text{training example i}
\]
- forward propagation
- for i=1 to m
- \[
\eqalign{
z^{[1](i)}&=W^{[1]}x^{(i)}+b^{[1]}\\
a^{[1](i)}&=\sigma({z^{[1](i)}})\\
z^{[2](i)}&=W^{[2]}a^{[1](i)}+b^{[2]}\\
a^{[2](i)}&=\sigma({z^{[2](i)}})
}
\]
- \[
X=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
x^{(1)}&x^{(2)}&\dots&x^{(m)}\\
|&|&|&|
\end{bmatrix}}_{(n_x,m)}
\]
\[
\eqalign{
Z^{[1]}&=W^{[1]}X+b^{[1]}\\
A^{[1]}&=\sigma(Z^{[1]})\\
Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}&=\sigma(Z^{[2]})
}
\]
\[
Z^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
z^{[1](1)}&z^{[1](2)}&\dots&z^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
\[
A^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
a^{[1](1)}&a^{[1](2)}&\dots&a^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
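A sketch of the same forward pass vectorized over all m training examples; the shapes and data are illustrative, and NumPy broadcasting adds \(b^{[1]}, b^{[2]}\) to every column:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5           # illustrative sizes: 5 training examples
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))

X = rng.standard_normal((n_x, m))       # columns are training examples, shape (n_x, m)

Z1 = W1 @ X + b1                        # shape (n_h, m); b1 broadcasts across columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                       # shape (n_y, m)
A2 = sigmoid(Z2)                        # row of predictions y_hat^(1) ... y_hat^(m)
```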
Activation functions
- g(z)
- sigmoid function: $$a=\sigma(z)=\frac{1}{1+e^{-z}}$$
- tanh function: $$a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
- ReLU: $$a=\max(0,z)$$
- Leaky ReLU: $$a=\max(0.01z,z)$$
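The four activation functions above, sketched as element-wise NumPy functions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)                   # same as (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)     # slope * z for z < 0, z otherwise
```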
Why do you need non-linear activation functions?
- hidden layer
- Use a non-linear activation such as ReLU, tanh, or Leaky ReLU, not a linear one
- output layer
- A linear activation can be used when \(\hat y\) is a real number (regression)
- hidden: linear, output: sigmoid
- is equivalent to logistic regression with no hidden layer
- a linear hidden layer computes nothing useful: the output is still just a linear function of the input x, because a composition of linear functions is itself linear
Derivatives of activation functions
- sigmoid activation function
- $$g(z)=\frac{1}{1+e^{-z}}$$
- $$\eqalign{
g'(z)&=\frac{d}{dz}g(z)\\
&=\text{slope of g(z) at z}\\
&=\frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}})\\
&=g(z)(1-g(z))=a(1-a)
}$$
- tanh activation function
- $$g(z)=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
- $$\eqalign{
g'(z)&=\frac{d}{dz}g(z)\\
&=\text{slope of g(z) at z}\\
&=1-(\tanh(z))^2\\
&=1-a^2
}$$
- ReLU
- $$g(z)=\max(0,z)$$
- $$g'(z)=\begin{cases}
0 & (z\lt 0)\\
1 & (z\geq 0)
\end{cases}$$
- Leaky ReLU
- $$g(z)=\max(0.01z,z)$$
- $$g'(z)=\begin{cases}
0.01 & (z\lt 0)\\
1 & (z\geq 0)
\end{cases}$$
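The corresponding derivatives, sketched in NumPy (the value returned at \(z=0\) for ReLU / Leaky ReLU follows the convention above):

```python
import numpy as np

def sigmoid_prime(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                  # g'(z) = a(1 - a)

def tanh_prime(z):
    a = np.tanh(z)
    return 1 - a ** 2                   # g'(z) = 1 - a^2

def relu_prime(z):
    return np.where(z < 0, 0.0, 1.0)    # 0 for z < 0, 1 for z >= 0

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)  # slope for z < 0, 1 for z >= 0
```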
Gradient descent for Neural Networks
- Formulas for computing derivatives
- Forward propagation (4 equations)
- \[\eqalign{
Z^{[1]}&=W^{[1]}X+b^{[1]}\\
A^{[1]}&=g^{[1]}(Z^{[1]})\\
Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}&=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})
}\]
- For binary classification, \(g^{[2]}\) is the sigmoid function
- Backpropagation (6 equations)
- \[\eqalign{
dZ^{[2]}&=A^{[2]}-Y\\
dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims=True})\\
dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
db^{[1]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims=True})
}\]
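A sketch of one full gradient-descent iteration that combines the forward and backward passes above; the tanh hidden activation, layer sizes, random data, and learning rate are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m))     # binary labels
alpha = 0.1                             # learning rate (assumed)

# forward propagation (tanh hidden layer, sigmoid output)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backpropagation (the six equations above)
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # g^[1]'(Z1) = 1 - tanh(Z1)^2
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient-descent update
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```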
Backpropagation intuition
Computing gradients
Logistic regression
\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]
\[
\begin{align}
L(a,y)&=-y\log a-(1-y)\log(1-a)\\
da&=\frac{d}{da} L(a,y)=-\frac{y}{a}+\frac{1-y}{1-a}\\
dz&=\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a} \cdot \frac{da}{dz}=da \cdot g'(z), \quad g(z)=\sigma(z)\\
&=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)\cdot a(1-a)=a-y
\end{align}
\]
\[
\matrix{
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow\\
\\
}
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]
- \[\eqalign{
dz^{[2]}&=a^{[2]}-y\\
dW^{[2]}&=dz^{[2]}a^{[1]T}\\
db^{[2]}&=dz^{[2]}\\
dz^{[1]}&=W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})\\
dW^{[1]}&=dz^{[1]}x^T\\
db^{[1]}&=dz^{[1]}
}\]
- \[\eqalign{
dZ^{[2]}&=A^{[2]}-Y\\
dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
db^{[2]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims=True})\\
dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
db^{[1]}&=\frac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims=True})
}\]
Random Initialization
- When training a neural network, it is important to initialize the weights randomly
- Initializing them to 0 makes the hidden units symmetric: they all compute exactly the same function
\[\eqalign{
W^{[1]}&=\text{np.random.randn}(n_h,n_x)\times 0.01\\
b^{[1]}&=\text{np.zeros}((n_h,1))\\
W^{[2]}&=\text{np.random.randn}(n_y,n_h)\times 0.01\\
b^{[2]}&=\text{np.zeros}((n_y,1))
}\]
- Why the constant 0.01
- 0.01 is chosen with the slopes of sigmoid/tanh in mind
- with a large constant like 100, z lands in the flat regions of the activation and learning is slow
- for deeper networks you may want a constant other than 0.01
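A sketch of this initialization in NumPy (note that np.random.randn takes the dimensions as separate arguments, not a tuple; the layer sizes are illustrative):

```python
import numpy as np

n_x, n_h, n_y = 3, 4, 1                 # illustrative layer sizes

W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break symmetry
b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```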
Programming Assignment
- You’ve learnt to:
- Build a complete neural network with a hidden layer
- Make good use of a non-linear unit
- Implement forward propagation and backpropagation, and train a neural network
- See the impact of varying the hidden layer size, including overfitting.
- np.multiply vs np.dot
- np.multiply: Multiply arguments element-wise.
- np.dot: Dot product of two arrays.
- Arithmetic, matrix multiplication, and comparison operations
- A2 > 0.5
- equivalent to ndarray.__gt__(self, 0.5), applied element-wise, returning a boolean array
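A small example contrasting the two operations, plus the element-wise comparison used to turn the probabilities in A2 into 0/1 predictions (the array values are made up):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

elementwise = np.multiply(A, B)   # same as A * B  -> [[10, 40], [90, 160]]
matrix_prod = np.dot(A, B)        # same as A @ B  -> [[70, 100], [150, 220]]

A2 = np.array([[0.2, 0.7, 0.5]])
predictions = A2 > 0.5            # ndarray.__gt__ element-wise -> [[False, True, False]]
```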