
DL [Course 1/5] Neural Networks and Deep Learning [Week 3/4] Shallow neural networks

Key Concepts

  • Be able to apply a variety of activation functions in a neural network.
  • Become fluent with Deep Learning notations and Neural Network Representations
  • Build and train a neural network with one hidden layer.
  • Build your first forward and backward propagation with a hidden layer
  • Understand hidden units and hidden layers
  • Apply random initialization to your neural network

[mathjax]

Notes from the Neural Networks and Deep Learning (deeplearning.ai) course.

Week 3: Shallow neural networks

Neural Network Overview

\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]

\[
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]

Neural Network Representation

  • Input layer
    • Layer of the input feature values such as \(x_1, x_2, x_3\)
    • \(a^{[0]}=x\); "activation" refers to the values a layer passes on to the next layer
  • Hidden layer
    • Called "hidden" because its true values are not observed in the training set
    • $$a^{[1]}=\begin{bmatrix}
      a_1^{[1]}\\
      a_2^{[1]}\\
      \vdots\\
      a_4^{[1]}
      \end{bmatrix}$$
    • $$\underbrace{W^{[1]}}_{(4,3)}, \underbrace{b^{[1]}}_{(4,1)}$$ where 4 is the number of hidden units and 3 is the number of input features
  • Output layer
    • Outputs \(\hat y=a^{[2]}\)
    • Logistic regression has a single output, so no superscript is needed; in a neural network the superscript \([l]\) indicates which layer a value belongs to
    • \(\underbrace{W^{[2]}}_{(1,4)}, \underbrace{b^{[2]}}_{(1,1)}\) where 4 is the number of hidden units and 1 is the number of output units
  • 2-layer NN
    • The input layer is not counted, so this network has 2 layers (see the shape check below)
    • The input layer is referred to as layer 0
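As a quick check of these shapes, here is a minimal NumPy sketch; the 3-4-1 layer sizes follow the example above, and the variable names W1, b1, W2, b2 are my own, used only for illustration (zeros are used here just for the shape check; random initialization comes later in these notes):

```python
import numpy as np

# Layer sizes from the example above: 3 input features, 4 hidden units, 1 output unit
n_x, n_h, n_y = 3, 4, 1

W1 = np.zeros((n_h, n_x))   # (4, 3)
b1 = np.zeros((n_h, 1))     # (4, 1)
W2 = np.zeros((n_y, n_h))   # (1, 4)
b2 = np.zeros((n_y, 1))     # (1, 1)

assert W1.shape == (4, 3) and b1.shape == (4, 1)
assert W2.shape == (1, 4) and b2.shape == (1, 1)
```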

Computing a Neural Network’s Output

  • Computation steps for the first hidden unit
    • \(z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]}\)
    • \(a_i^{[l]}=\sigma(z_i^{[l]})\)
    • \(l: \text{layer}\)
    • \(i: \text{unit (neuron) in the layer}\)

Vectorization

\[
\displaylines{
z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]},
a_1^{[1]}=\sigma(z_1^{[1]})\\
z_2^{[1]}=w_2^{[1]T}x+b_2^{[1]},
a_2^{[1]}=\sigma(z_2^{[1]})\\
z_3^{[1]}=w_3^{[1]T}x+b_3^{[1]},
a_3^{[1]}=\sigma(z_3^{[1]})\\
z_4^{[1]}=w_4^{[1]T}x+b_4^{[1]},
a_4^{[1]}=\sigma(z_4^{[1]})\\
}
\]
\[
\displaylines{
z^{[1]}
=\overbrace{\begin{bmatrix}
-w_1^{[1]T}-\\
-w_2^{[1]T}-\\
-w_3^{[1]T}-\\
-w_4^{[1]T}-
\end{bmatrix}}^{W^{[1]}}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+\begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}\\
b_4^{[1]}\\
\end{bmatrix}
=\begin{bmatrix}
w_1^{[1]T}x+b_1^{[1]}\\
w_2^{[1]T}x+b_2^{[1]}\\
w_3^{[1]T}x+b_3^{[1]}\\
w_4^{[1]T}x+b_4^{[1]}
\end{bmatrix}
=\begin{bmatrix}
z_1^{[1]}\\
z_2^{[1]}\\
z_3^{[1]}\\
z_4^{[1]}
\end{bmatrix}\\
z^{[1]}=W^{[1]}x+b^{[1]}
}
\]
\[
\displaylines{
a^{[1]}
=\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}\\
a_4^{[1]}
\end{bmatrix}
=\sigma(z^{[1]})
}
\]

Neural Network Representation learning

\[
\eqalign{
z^{[1]}&=W^{[1]}x+b^{[1]}\\
a^{[1]}&=\sigma({z^{[1]}})\\
z^{[2]}&=W^{[2]}a^{[1]}+b^{[2]}\\
a^{[2]}&=\sigma({z^{[2]}})
}
\]
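A minimal NumPy sketch of these four equations for a single example; the `sigmoid` helper and the parameter values below are assumptions added so the sketch runs end to end:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Parameters for a 3-4-1 network; random values only so the sketch is runnable
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))   # a single training example as a column vector

z1 = W1 @ x + b1      # (4, 1)
a1 = sigmoid(z1)      # (4, 1)
z2 = W2 @ a1 + b2     # (1, 1)
a2 = sigmoid(z2)      # (1, 1), i.e. y_hat
```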

Vectorizing across multiple examples

\[
\eqalign{
x\rightarrow a^{[2]}&=\hat y\\
x^{(1)}\rightarrow a^{[2](1)}&=\hat y^{(1)}\\
x^{(2)}\rightarrow a^{[2](2)}&=\hat y^{(2)}\\
\vdots\\
x^{(m)}\rightarrow a^{[2](m)}&=\hat y^{(m)}}
\]
\[
\eqalign{
[2]&: \text{layer}\\
(i)&: \text{training example } i
}
\]

  • forward propagation
  • for i=1 to m
    • \[
      \eqalign{
      z^{[1](i)}&=W^{[1]}x^{(i)}+b^{[1]}\\
      a^{[1](i)}&=\sigma({z^{[1](i)}})\\
      z^{[2](i)}&=W^{[2]}a^{[1](i)}+b^{[2]}\\
      a^{[2](i)}&=\sigma({z^{[2](i)}})
      }
      \]

\[
X=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
x^{(1)}&x^{(2)}&\dots&x^{(m)}\\
|&|&|&|
\end{bmatrix}}_{(n_x,m)}
\]
\[
\eqalign{
Z^{[1]}&=W^{[1]}X+b^{[1]}\\
A^{[1]}&=\sigma(Z^{[1]})\\
Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
A^{[2]}&=\sigma(Z^{[2]})
}
\]
\[
Z^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
z^{[1](1)}&z^{[1](2)}&\dots&z^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
\[
A^{[1]}=
\underbrace{
\begin{bmatrix}
|&|&|&|\\
a^{[1](1)}&a^{[1](2)}&\dots&a^{[1](m)}\\
|&|&|&|
\end{bmatrix}}_{\text{training examples}}\rbrace \text{hidden units}
\]
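A minimal sketch of the vectorized forward pass over m examples, where X has shape (n_x, m) and \(b^{[1]}\) is broadcast across columns; the parameter names and random values are my own, used only to make the sketch runnable:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 3, 4, 1, 5

X = rng.standard_normal((n_x, m))    # columns are training examples x^(1) ... x^(m)
W1 = rng.standard_normal((n_h, n_x)); b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)); b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1      # (n_h, m); b1 broadcasts across the m columns
A1 = sigmoid(Z1)      # (n_h, m); column i is a^[1](i)
Z2 = W2 @ A1 + b2     # (n_y, m)
A2 = sigmoid(Z2)      # (n_y, m); column i is y_hat^(i)
```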

Activation functions

  • g(z)
    • sigmoid function: $$a=\sigma(z)=\frac{1}{1+e^{-z}}$$
    • tanh function:$$a=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
    • ReLU: $$a=\max(0,z)$$
    • Leaky ReLU: $$a=\max(0.01z,z)$$ (all four are sketched in NumPy below)
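A minimal NumPy sketch of the four activation functions listed above; the function names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)               # same as (e^z - e^-z)/(e^z + e^-z), numerically stable built-in

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)
```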

Why do you need non-linear activation functions?

  • Hidden layer
    • Use a non-linear activation such as ReLU, tanh, or Leaky ReLU; not a linear (identity) function
  • Output layer
    • A linear activation can be used when \(\hat y\) is a real-valued quantity (i.e. regression)
  • Hidden layer: linear, output layer: sigmoid
    • Equivalent to logistic regression with no hidden layer
    • A linear hidden layer computes nothing useful
    • In the end the output is still just a logistic function of a linear combination of the input \(x\); composing linear functions gives another linear function, as the substitution below shows
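To see why, substitute a linear hidden activation \(g^{[1]}(z)=z\) into the forward-propagation equations; the whole network collapses to a single logistic regression with parameters \(W'\) and \(b'\):

\[
\eqalign{
a^{[1]}&=z^{[1]}=W^{[1]}x+b^{[1]}\\
a^{[2]}&=\sigma(W^{[2]}a^{[1]}+b^{[2]})
=\sigma(\underbrace{W^{[2]}W^{[1]}}_{W'}x+\underbrace{W^{[2]}b^{[1]}+b^{[2]}}_{b'})
}
\]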

Derivatives of activation functions

  • sigmoid activation function
    • $$g(z)=\frac{1}{1+e^{-z}}$$
    • $$\eqalign{
      g'(z)&=\frac{d}{dz}g(z)\\
      &=\text{slope of g(z) at z}\\
      &=\frac{1}{1+e^{-z}}(1-\frac{1}{1+e^{-z}})\\
      &=g(z)(1-g(z))=a(1-a)
      }$$
  • tanh activation function
    • $$g(z)=\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
    • $$\eqalign{
      g'(z)&=\frac{d}{dz}g(z)\\
      &=\text{slope of g(z) at z}\\
      &=1-(\tanh(z))^2\\
      &=1-a^2
      }$$
  • ReLU
    • $$g(z)=\max(0,z)$$
    • $$g'(z)=\begin{cases}
      0 & (z\lt 0)\\
      1 & (z\geq 0)
      \end{cases}$$
  • Leaky ReLU
    • $$g(z)=\max(0.01z,z)$$
    • $$g'(z)=\begin{cases}
      0.01 & (z\lt 0)\\
      1 & (z\geq 0)
      \end{cases}$$ (all four derivatives are sketched in NumPy below)
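A minimal NumPy sketch of these derivatives, assuming z is a NumPy array; the function names are my own:

```python
import numpy as np

def sigmoid_derivative(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                  # g'(z) = a(1 - a)

def tanh_derivative(z):
    a = np.tanh(z)
    return 1 - a ** 2                   # g'(z) = 1 - a^2

def relu_derivative(z):
    return (z >= 0).astype(float)       # 0 for z < 0, 1 for z >= 0 (course convention at z = 0)

def leaky_relu_derivative(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)  # slope for z < 0, 1 for z >= 0
```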

Gradient descent for Neural Networks

  • Formulas for computing derivatives
  • Forward propagation (4 equations)
    • \[\eqalign{
      Z^{[1]}&=W^{[1]}x+b^{[1]}\\
      A^{[1]}&=g^{[1]}(Z^{[1]})\\
      Z^{[2]}&=W^{[2]}A^{[1]}+b^{[2]}\\
      A^{[2]}&=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})
      }\]
    • For binary classification, \(g^{[2]}\) is the sigmoid function
  • Backpropagation (6 equations); a NumPy sketch of one full update step follows this list
    • \[\eqalign{
      dZ^{[2]}&=A^{[2]}-Y\\
      dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
      db^{[2]}&=\frac{1}{m} np.sum(dZ^{[2]}, axis=1, keepdims=True)\\
      dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
      dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
      db^{[1]}&=\frac{1}{m} np.sum(dZ^{[1]}, axis=1, keepdims=True)
      }\]
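A minimal NumPy sketch of one gradient-descent iteration using exactly these forward and backward formulas, with tanh in the hidden layer and sigmoid in the output layer; the function and variable names are my own, not from the course code:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(X, Y, W1, b1, W2, b2, learning_rate=0.01):
    """One forward/backward pass and parameter update for a 2-layer network.
    X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}."""
    m = X.shape[1]

    # Forward propagation (4 equations)
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                     # g[1] = tanh
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                     # g[2] = sigmoid for binary classification

    # Backpropagation (6 equations)
    dZ2 = A2 - Y
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # g[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    return W1, b1, W2, b2
```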

Backpropagation intuition

Computing gradients

Logistic regression

\[
\matrix{x\\w\\b}\rbrace\leftrightarrow
\overbrace{
\underbrace{
\boxed{z=w^{\mathrm{T}}x+b}
}_{ \color{red}{ dz}}
\leftrightarrow
\underbrace{
\boxed{a=\sigma(z)}
}_{ \color{red}{ da}}}^{\text{Node}}
\leftrightarrow
\boxed{L(a,y)}
\]

\[
\begin{align}
L(a,y)&=-y\log a-(1-y)\log(1-a)\\
da&=\frac{d}{da} L(a,y)=-\frac{y}{a}+\frac{1-y}{1-a}\\
dz&=da \cdot g'(z), \quad g(z)=\sigma(z)\\
&=\left(-\frac{y}{a}+\frac{1-y}{1-a}\right)a(1-a)=a-y\\
\frac{\partial L}{\partial z}&=\frac{\partial L}{\partial a} \cdot \frac{d a}{d z}
=\frac{\partial L}{\partial a} \cdot \frac{d}{d z}g(z)
\end{align}
\]

\[
\matrix{x\\W^{[1]}\\b^{[1]}}\rbrace\leftrightarrow
\overbrace{
\matrix{
\underbrace{
\boxed{z^{[1]}=W^{[1]}x+b^{[1]}}
}_{\color{red}{dz^{[1]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[1]} =\sigma(z^{[1]})}
}_{ \color{red}{ da^{[1]}} }\\
\underbrace{
W^{[2]}}_{ \color{red}{ dW^{[2]}}} \\
\underbrace{
b^{[2]}}_{ \color{red}{ db^{[2]}}} } \rbrace }^{\text{Node}^{[1]}}
\leftrightarrow
\overbrace{
\underbrace{
\boxed{z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}}
}_{ \color{red}{ dz^{[2]}} }
\leftrightarrow
\underbrace{
\boxed{a^{[2]} =\sigma(z^{[2]})}
}_{ \color{red}{ da^{[2]}} }}^{\text{Node}^{[2]}}
\leftrightarrow
\boxed{L(a^{[2]},y)}
\]

  • \[\eqalign{
    dz^{[2]}&=a^{[2]}-y\\
    dW^{[2]}&=dz^{[2]}a^{[1]T}\\
    db^{[2]}&=dz^{[2]}\\
    dz^{[1]}&=W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})\\
    dW^{[1]}&=dz^{[1]}x^T\\
    db^{[1]}&=dz^{[1]}
    }\]
  • \[\eqalign{
    dZ^{[2]}&=A^{[2]}-Y\\
    dW^{[2]}&=\frac{1}{m}dZ^{[2]}A^{[1]T}\\
    db^{[2]}&=\frac{1}{m} np.sum(dZ^{[2]}, axis=1, keepdims=True)\\
    dZ^{[1]}&=W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\\
    dW^{[1]}&=\frac{1}{m} dZ^{[1]}X^T\\
    db^{[1]}&=\frac{1}{m} np.sum(dZ^{[1]}, axis=1, keepdims=True)
    }\]

Random Initialization

  • When training a neural network, it is important to initialize the weights randomly
  • If the weights are initialized to zero, the hidden units are symmetric and all compute exactly the same function, so they never learn different features

\[\eqalign{
W^{[1]}&=np.random.randn(n_h,n_x)\times 0.01\\
b^{[1]}&=np.zeros((n_h,1))\\
W^{[2]}&=np.random.randn(n_y,n_h)\times 0.01\\
b^{[2]}&=np.zeros((n_y,1))
}\]

  • Why the constant 0.01
    • 0.01 is chosen with the slope of sigmoid/tanh in mind
    • With a large constant like 100, \(z\) lands in the flat (saturated) region of the activation and learning is slow
    • For deeper networks you may want to choose something other than 0.01 (a sketch of this initialization follows this list)
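A minimal NumPy sketch of this initialization; the function name is my own, and note that np.random.randn takes the dimensions as separate arguments rather than a tuple:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break the symmetry between hidden units; zero biases are fine."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # randn takes the dimensions as separate arguments
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```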

Programming Assignment

  • You’ve learned to:
    • Build a complete neural network with a hidden layer
    • Make good use of a non-linear unit
    • Implement forward propagation and backpropagation, and train a neural network
    • See the impact of varying the hidden layer size, including overfitting
