
DL [Course 4/5] Convolutional Neural Networks [Week 1/4] Foundations of Convolutional Neural Networks

Key Concepts

  • Understand the convolution operation
  • Understand the pooling operation
  • Remember the vocabulary used in convolutional neural networks (padding, stride, filter, …)
  • Build a convolutional neural network for image multi-class classification

[mathjax]

Convolutional Neural Networks

Computer Vision

  • Problem
    • Image Classification
    • Object detection
    • Neural Style Transfer
  • Deep Learning on large images
    • 64x64x3=12,288
    • 1000x1000x3=3,000,000
      • \(W^{[1]}: (1000, 3\text{M})\)
      • 3 billion parameters!!

Edge Detection Example

  • Computer Vision Problem
    • how to detect edges
  • Vertical edge detection (* means “convolution”)
    • 6×6 * 3×3 = 4×4\[
      \begin{bmatrix}
      3&0&1&2&7&4\\
      1&5&8&9&3&1\\
      2&7&2&5&1&3\\
      0&1&3&1&7&8\\
      4&2&1&6&2&8\\
      2&4&5&2&3&9
      \end{bmatrix} *
      \begin{bmatrix}
      1&0&-1\\
      1&0&-1\\
      1&0&-1\\
      \end{bmatrix}=
      \begin{bmatrix}
      -5&-4&0&8\\
      -10&-2&2&3\\
      0&-2&-4&-7\\
      -3&-2&-3&-16
      \end{bmatrix}
      \]
    • 3×1 + 1×1 + 2×1 + 0×0 + 5×0 + 7×0 + 1×(-1) + 8×(-1) + 2×(-1) = -5, and so on for each output cell
    • Why it works \[
      \begin{bmatrix}
      10&10&10&0&0&0\\
      10&10&10&0&0&0\\
      10&10&10&0&0&0\\
      10&10&10&0&0&0\\
      10&10&10&0&0&0\\
      10&10&10&0&0&0
      \end{bmatrix} *
      \begin{bmatrix}
      1&0&-1\\
      1&0&-1\\
      1&0&-1\\
      \end{bmatrix}=
      \begin{bmatrix}
      0&30&30&0\\
      0&30&30&0\\
      0&30&30&0\\
      0&30&30&0
      \end{bmatrix}
      \]
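To make the operation concrete, here is a minimal NumPy sketch of the convolution used above (in the DL sense, i.e. a cross-correlation with no filter flipping). The helper name conv2d_valid is just something introduced here for illustration; the numbers are the 6×6 example from this section.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution as used in DL: slide the kernel over the image
    (no flipping), multiply element-wise and sum. Output is (n-f+1) x (n-f+1)."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
print(conv2d_valid(image, vertical))  # 4x4 result; the top-left entry is -5, as computed above
```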

More Edge Detection

  • Vertical and Horizontal Edge Detection
  • Vertical\[
    \begin{bmatrix}
    1&0&-1\\
    1&0&-1\\
    1&0&-1\\
    \end{bmatrix}
    \]
  • Horizontal\[
    \begin{bmatrix}
    1&1&1\\
    0&0&0\\
    -1&-1&-1\\
    \end{bmatrix}
    \]
  • Learning to detect edges
  • Sobel filter\[
    \begin{bmatrix}
    1&0&-1\\
    2&0&-2\\
    1&0&-1
    \end{bmatrix}
    \]
  • Scharr filter\[
    \begin{bmatrix}
    3&0&-3\\
    10&0&-10\\
    3&0&-3
    \end{bmatrix}
    \]
  • Filter as learnable weights\[
    \begin{bmatrix}
    w_1&w_2&w_3\\
    w_4&w_5&w_6\\
    w_7&w_8&w_9
    \end{bmatrix}
    \]
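For reference, the filters above written as NumPy arrays (a small sketch; nothing here is specific to any library, and the hypothetical conv2d_valid helper from the previous sketch could be applied to any of them):

```python
import numpy as np

# Hand-designed 3x3 edge filters from this section.
vertical   = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
horizontal = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])
sobel      = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])    # puts more weight on the middle row
scharr     = np.array([[3, 0, -3], [10, 0, -10], [3, 0, -3]])

# In a ConvNet the 9 numbers w_1..w_9 are not hand-picked but learned by backprop:
learned = np.random.randn(3, 3)
```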

Padding

Valid and Same convolutions

  • Problems that padding addresses
    • shrinking output (every convolution makes the output smaller)
    • throwing away information from the edges of the image
  • Practical
    • p=1: pad all around with an extra border of 1 pixel (so a 6×6 input with p=1 becomes 8×8)
    • “zero-padding”: the padded cells have value 0
  • “Valid” convolutions (no padding)
    • (6×6)*(3×3)=(4×4)
    • (n×n) * (f×f) = (n-f+1) × (n-f+1)
  • “Same” convolutions
    • (6×6) -> padding -> (8×8) * (3×3) = (6×6)
    • (n×n) * (f×f) = (n+2p-f+1) × (n+2p-f+1)
    • p=(f-1)/2
    • Pad so that output size is the same as the input size.
  • f is usually odd
    • padding can then be symmetric (the same amount on all sides)
    • the filter has a central pixel to mark its position
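A small sketch of zero-padding with p = (f-1)/2 (the “same” case), using np.pad; same_pad is a hypothetical helper name:

```python
import numpy as np

def same_pad(image, f):
    """Zero-pad so that a stride-1 'valid' convolution preserves the input size.
    Assumes f is odd, so p = (f - 1) / 2 is an integer."""
    p = (f - 1) // 2
    return np.pad(image, p, mode="constant", constant_values=0)

x = np.ones((6, 6))
print(same_pad(x, 3).shape)  # (8, 8): a 3x3 valid convolution on this gives back 6x6
```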

Strided Convolutions

  • stride=2: move the filter 2 positions at a time (i.e. skip one position on each step); a small helper computing the resulting size appears after the summary below
  • Summary of convolutions: \[
    \text{image: }n\times n,\quad \text{filter: }f\times f\\
    \text{padding: }p,\quad \text{stride: }s\\
    \text{output size: }
    \left\lfloor\frac{n+2p-f}{s}+1\right\rfloor \times \left\lfloor\frac{n+2p-f}{s}+1\right\rfloor
    \]
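The output-size formula as a small helper (conv_output_size is a name introduced here for illustration):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1, per the summary above."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # 4  (valid convolution)
print(conv_output_size(6, 3, p=1))       # 6  (same convolution)
print(conv_output_size(7, 3, p=0, s=2))  # 3  (strided convolution)
```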

Technical note on cross-correlation vs. convolution

  • Convolution in math textbooks
    • the filter is first flipped (mirrored) on both the vertical and horizontal axes before the products are summed
    • this makes the operation associative: (A*B)*C = A*(B*C)
  • Convolution in deep learning
    • what is really cross-correlation (no flipping) is called “convolution” by convention
    • in DL the flipping step is simply omitted; since the filter values are learned anyway, it makes no practical difference

Convolutions Over Volume

  • Convolution on RGB images
    • height, width, #channels
    • (6×6×3) * (3×3×3) = (4×4)
  • Multiple filters
    • \[
      (6,6,3) \begin{cases}
      *\overbrace{(3,3,3)}^{\text{vertical edge}}=(4,4)\\
      *\overbrace{(3,3,3)}^{\text{horizontal edge}}=(4,4)
      \end{cases}\rightarrow (4,4,2)
      \]
    • Summary:\[
      (n \times n \times n_c) * (f \times f \times n_c) \longrightarrow (n-f+1) \times (n-f+1) \times \underbrace{n_c'}_{\#\text{filters}}\\
      (6 \times 6 \times 3) * (3 \times 3 \times 3) \longrightarrow (4 \times 4 \times 2)
      \]
  • Summary
    • With multiple filters you can detect many features at once: vertical edges, horizontal edges, and in practice 10, 128, or several hundred different features.
    • The number of output channels equals the number of filters applied.
    • Instead of “channels”, people often call this dimension the depth of the 3D volume; both terms are common in the literature, but “depth” can be confusing because it also refers to the depth of the neural network itself.
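A minimal NumPy sketch of convolving over a volume with several filters (the helper conv_over_volume and the random filter values are illustrative only):

```python
import numpy as np

def conv_over_volume(volume, filters):
    """volume: (n, n, n_c); filters: list of (f, f, n_c) arrays.
    Each filter produces one (n-f+1, n-f+1) slice of the output, so the
    number of output channels equals the number of filters."""
    n, f = volume.shape[0], filters[0].shape[0]
    out = np.zeros((n - f + 1, n - f + 1, len(filters)))
    for c, filt in enumerate(filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(volume[i:i+f, j:j+f, :] * filt)
    return out

rgb = np.random.randn(6, 6, 3)
f1 = np.random.randn(3, 3, 3)  # e.g. a vertical-edge detector across all 3 channels
f2 = np.random.randn(3, 3, 3)  # e.g. a horizontal-edge detector
print(conv_over_volume(rgb, [f1, f2]).shape)  # (4, 4, 2)
```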

One Layer of a Convolutional Network

  • Example of a layer\[
    \overbrace{(6,6,3)}^{a^{[0]}} \begin{cases}
    *\overbrace{(3,3,3)}^{w^{[1]}}\rightarrow ReLU(\overbrace{(4,4)}^{ w^{[1]}a^{[0]} } +b_1)\\
    *(3,3,3)\rightarrow ReLU((4,4)+b_2)
    \end{cases}
    \rightarrow \overbrace{(4,4,2)}^{a^{[1]}}\\
    z^{[1]}=w^{[1]}a^{[0]}+b^{[1]}\\
    a^{[1]}=g(z^{[1]})
    \]
  • If you have 10 filters that are 3x3x3 in one layer of a NN, how many parameters does that layer have?
    • each filter: 3×3×3 weights + 1 bias = 28 parameters
    • 28 × 10 filters = 280 parameters
    • However large the input image is, the number of parameters stays fixed at 280 (see the sketch below).
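The parameter count as a one-liner (conv_layer_params is a hypothetical helper):

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

print(conv_layer_params(3, 3, 10))  # 280, regardless of the input image size
```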

Summary of notation

  • If layer l is a convolution layer:
  • filter size: \(f^{[l]}\)
  • padding: \(p^{[l]}\)
  • stride: \(s^{[l]}\)
  • number of filters: \(n_c^{[l]}\)
  • Input: \(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}\)
  • Output: \(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}\)
  • Output volume size (height): \(n_H^{[l]}=\left\lfloor\frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor \)
  • Output volume size (width): \(n_W^{[l]}=\left\lfloor\frac{n_W^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor \)
  • Each filter: \( f^{[l]} \times f^{[l]} \times n_c^{[l-1]}\)
  • Activations: \( a^{[l]} \rightarrow n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]} \)
  • Activations for a mini-batch of m examples (used in vectorized gradient descent etc.): \( A^{[l]} \rightarrow m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]} \)
  • Weights: \( f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}\)
  • bias: \( n_c^{[l]} \) parameters, stored as a \((1,1,1,n_c^{[l]})\) tensor

Simple Convolutional Network Example

  • Input \(x\): a 39x39x3 image (“is this a cat?”) \[
    n_H^{[0]}=n_W^{[0]}=39\\
    n_c^{[0]}=3
    \]
  • ↓\[
    f^{[1]}=3\\
    s^{[1]}=1\\
    p^{[1]}=0\dots \text{valid convolutions}\\
    \text{10 filters}
    \]
  • 37x37x10 \(a^{[1]}\)\[
    n_H^{[1]}=n_W^{[1]}=37\\
    n_c^{[1]}=10
    \]
  • ↓\[
    f^{[2]}=5\\
    s^{[2]}=2\dots \text{shrinks faster}\\
    p^{[2]}=0\\
    \text{20 filters}
    \]
  • 17x17x20 \(a^{[2]}\)\[
    n_H^{[2]}=n_W^{[2]}=17\\
    n_c^{[2]}=20
    \]
  • ↓\[
    f^{[3]}=5\\
    s^{[3]}=2\\
    p^{[3]}=0\\
    \text{40 filters}
    \]
  • 7x7x40 \(a^{[3]}\)
  • ↓\[
    \text{1960 units}\\
    \text{logistic regression/softmax}
    \]
  • \(\hat y\)
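A quick sanity check of these shapes, re-applying the output-size formula layer by layer (a sketch; out_size is a hypothetical helper):

```python
import math

def out_size(n, f, p, s):
    return math.floor((n + 2 * p - f) / s) + 1

n, n_c = 39, 3
layers = [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]  # (f, p, s, #filters) per layer
for l, (f, p, s, k) in enumerate(layers, start=1):
    n, n_c = out_size(n, f, p, s), k
    print(f"a[{l}]: {n}x{n}x{n_c}")
# a[1]: 37x37x10, a[2]: 17x17x20, a[3]: 7x7x40  ->  7*7*40 = 1960 units
```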

Types of layer in a convolutional network

  • Convolution (CONV)
  • Pooling (POOL)
  • Fully connected (FC)

Pooling Layers

  • Max pooling
    • Input: 4×4; split it into four 2×2 regions and take the max of each, producing a 2×2 output (see the pooling sketch after the summary below)
      • Hyperparameter
        • f=2
        • s=2
    • If a feature is present anywhere inside the (max pooling) window, the max is large; if it is not present, the max stays small
    • Input 5×5
      • f=3, s=1
      • Output 3×3
  • Average pooling
    • Take the average instead of the max

Summary of pooling

  • Hyperparameters:
    • f: filter size
    • s: stride
    • Max or average pooling
    • p: usually 0
  • \[
    n_H \times n_W \times n_c\\
    \downarrow\\
    \left\lfloor\frac{n_H-f}{s}+1\right\rfloor\times\left\lfloor \frac{n_W-f}{s}+1 \right\rfloor\times n_c
    \]
  • pooling layer
    • a fixed function
    • no parameters for backpropagation to learn
    • there is actually nothing to learn
    • hyperparameters are set once, by hand or via cross-validation
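A minimal NumPy sketch of max pooling (max_pool is a name introduced here; swap np.max for np.mean to get average pooling):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling applied to each channel independently; there are no parameters to learn."""
    n_h, n_w, n_c = x.shape
    out_h, out_w = (n_h - f) // s + 1, (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j, :] = x[i*s:i*s+f, j*s:j*s+f, :].max(axis=(0, 1))
    return out

x = np.random.randn(4, 4, 3)
print(max_pool(x, f=2, s=2).shape)  # (2, 2, 3): n_c is unchanged
```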

CNN Example

  • Neural Network example(LeNet-5)
    • x: 32x32x3
    • f=5,s=1
    • conv1: 28x28x6
    • maxpool f=2,s=2
    • pool1: 14x14x6
      • conv1 + pool1 = layer 1 (in the ConvNet convention: pooling has no weights, so conv + pool is counted as one layer)
    • f=5,s=1
    • conv2: 10x10x16
    • maxpool f=2,s=2
    • pool2 : 5x5x16
      • conv2+pool2=layer2
    • 400 units (= 5×5×16 flattened)
    • FC3: 120 units\[
      w^{[3]}: (120,400)\\
      b^{[3]}: (120)
      \]
    • FC4: 84 units
    • softmax
    • 10 outputs
  • guideline
    • Don't invent your own hyperparameters from scratch; look at the literature and choose settings that are known to work well.
    • As the NN gets deeper, n_H and n_W gradually decrease while the number of channels increases, as in the table below.
  Layer             Activation shape   Activation size   # parameters
  Input             (32,32,3)          3,072             0
  CONV1 (f=5,s=1)   (28,28,8)          6,272             608 = (5*5*3+1)*8
  POOL1             (14,14,8)          1,568             0
  CONV2 (f=5,s=1)   (10,10,16)         1,600             3,216 = (5*5*8+1)*16
  POOL2             (5,5,16)           400               0
  FC3               (120,1)            120               48,120 = 400*120+120
  FC4               (84,1)             84                10,164 = 120*84+84
  Softmax           (10,1)             10                850 = 84*10+10
  • Max pooling layers don’t have any parameters.
  • Conv layers tend to have relatively few parameters.
  • A lot of the parameters tend to be in the fully connected layers of the NN.
  • Activation size tends to go down gradually as you go deeper in the NN. If it drops too quickly, that's usually not great for performance either.
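The parameter column of the table can be reproduced with two small helpers (hypothetical names, shown only to make the arithmetic explicit):

```python
def conv_params(f, n_c_prev, k):  # k filters of size f x f x n_c_prev, plus k biases
    return (f * f * n_c_prev + 1) * k

def fc_params(n_in, n_out):       # weight matrix (n_out, n_in) plus n_out biases
    return n_in * n_out + n_out

print(conv_params(5, 3, 8))    # CONV1:     608
print(conv_params(5, 8, 16))   # CONV2:   3,216
print(fc_params(400, 120))     # FC3:    48,120
print(fc_params(120, 84))      # FC4:    10,164
print(fc_params(84, 10))       # Softmax:   850
```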

Why Convolutions?

  • Parameter sharing
    • A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.
  • Sparsity of connections
    • In each layer, each output value depends only on a small number of inputs.
    • This helps prevent overfitting.
  • Putting it together
    • Training set \((x^{(1)},y^{(1)}),\dots (x^{(m)},y^{(m)} )\)
    • \[
      X\rightarrow (Conv*\rightarrow Pool)\rightarrow \underbrace{FC*}_{w,b} \rightarrow softmax \rightarrow \hat y
      \]
    • Cost\[
      J=\frac1m \sum_{i=1}^m L(\hat y^{(i)},y^{(i)})
      \]
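For a softmax output the loss L is the cross-entropy, so the cost above can be computed in a few lines of NumPy (a sketch; cost is a hypothetical helper and y is assumed one-hot):

```python
import numpy as np

def cost(y_hat, y):
    """J = (1/m) * sum_i L(y_hat_i, y_i), with L the cross-entropy loss.
    y_hat, y: arrays of shape (m, #classes); y is one-hot."""
    m = y.shape[0]
    losses = -np.sum(y * np.log(y_hat), axis=1)  # per-example loss
    return np.sum(losses) / m

y     = np.array([[1, 0], [0, 1]])
y_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
print(cost(y_hat, y))  # average cross-entropy over the 2 examples
```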

Programming assignments

Convolutional Neural Networks

Create placeholders

Initialize parameters

Forward propagation

Implements the forward propagation for the model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
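The original assignment implements this pipeline with the TensorFlow 1.x API (placeholders, tf.nn.conv2d, and so on). As a rough modern equivalent, a tf.keras sketch of the same CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED sequence could look like the following; the filter counts, kernel sizes, and input shape here are illustrative assumptions, not necessarily the assignment's exact values.

```python
import tensorflow as tf

# Illustrative hyperparameters (assumed for this sketch).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, kernel_size=4, padding="same", activation="relu",
                           input_shape=(64, 64, 3)),                    # CONV2D -> RELU
    tf.keras.layers.MaxPool2D(pool_size=8, strides=8, padding="same"),  # MAXPOOL
    tf.keras.layers.Conv2D(16, kernel_size=2, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=4, strides=4, padding="same"),
    tf.keras.layers.Flatten(),                                          # FLATTEN
    tf.keras.layers.Dense(6, activation="softmax"),                     # FULLYCONNECTED
])
model.summary()
```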

Window, kernel, filter

The words “window”, “kernel”, and “filter” are used to refer to the same thing. This is why the parameter ksize refers to “kernel size”, and we use (f,f) to refer to the filter size. Both “kernel” and “filter” refer to the “window.”

Compute cost

Model
