Key Concepts
- Understand the convolution operation
- Understand the pooling operation
- Remember the vocabulary used in convolutional neural networks (padding, stride, filter, …)
- Build a convolutional neural network for multi-class image classification
[mathjax]
Convolutional Neural Networks
Computer Vision
- Problems
- Image Classification
- Object detection
- Neural Style Transfer
- Deep Learning on large images
- 64x64x3=12,288
- 1000x1000x3=3,000,000
- \(W^{[1]}: (1000, 3\text{M})\) for a first hidden layer with 1000 units
- 3 billion parameters!!
Edge Detection Example
- Computer Vision Problem
- how to detect edges
- Vertical edge detection (* denotes convolution)
- 6×6 * 3×3 = 4×4\[
\begin{bmatrix}
3&0&1&2&7&4\\
1&5&8&9&3&1\\
2&7&2&5&1&3\\
0&1&3&1&7&8\\
4&2&1&6&2&8\\
2&4&5&2&3&9
\end{bmatrix} *
\begin{bmatrix}
1&0&-1\\
1&0&-1\\
1&0&-1\\
\end{bmatrix}=
\begin{bmatrix}
-5&-4&0&8\\
-10&-2&2&3\\
0&-2&-4&-7\\
-3&-2&-3&-16
\end{bmatrix}
\] - 3×1+1×1+2×1+0×0+5×0+7×0+1×(-1)+8×(-1)+2×(-1)=-5 …
- Why it works (a NumPy sketch follows this list) \[
\begin{bmatrix}
10&10&10&0&0&0\\
10&10&10&0&0&0\\
10&10&10&0&0&0\\
10&10&10&0&0&0\\
10&10&10&0&0&0\\
10&10&10&0&0&0
\end{bmatrix} *
\begin{bmatrix}
1&0&-1\\
1&0&-1\\
1&0&-1\\
\end{bmatrix}=
\begin{bmatrix}
0&30&30&0\\
0&30&30&0\\
0&30&30&0\\
0&30&30&0
\end{bmatrix}
\]
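As a check, here is a minimal NumPy sketch of this operation (a cross-correlation slide over the grid); the helper name `conv2d_valid` is illustrative, not from the course:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image, no padding."""
    n, f = image.shape[0], kernel.shape[0]     # assumes square image and kernel
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)  # bright left, dark right
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)   # vertical edge detector

print(conv2d_valid(image, kernel))
# Columns of 30 mark the vertical edge in the middle of the image:
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]]
```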
More Edge Detection
- Vertical and Horizontal Edge Detection
- Vertical\[
\begin{bmatrix}
1&0&-1\\
1&0&-1\\
1&0&-1\\
\end{bmatrix}
\] - Horizontal\[
\begin{bmatrix}
1&1&1\\
0&0&0\\
-1&-1&-1\\
\end{bmatrix}
\] - Learning to detect edges
- Sobel filter\[
\begin{bmatrix}
1&0&-1\\
2&0&-2\\
1&0&-1
\end{bmatrix}
\] - Scharr filter\[
\begin{bmatrix}
3&0&-3\\
10&0&-10\\
3&0&-3
\end{bmatrix}
\] - treating the filter entries as learnable weights\[
\begin{bmatrix}
w_1&w_2&w_3\\
w_4&w_5&w_6\\
w_7&w_8&w_9
\end{bmatrix}
\]
Padding
Valid and Same convolutions
- Padding Problem
- shrinking output
- throwing away info from edge
- Practical
- p=1: pad all around with an extra border of 1 pixel (so a 6×6 input with p=1 becomes 8×8)
- “Zero-padding”: the padded values are 0
- “Valid” convolutions (no padding)
- (6×6)*(3×3)=(4×4)
- (n×n)*(f×f)=(n-f+1)×(n-f+1)
- “Same” convolutions
- (6×6) → padding → (8×8)*(3×3)=(6×6)
- (n×n)*(f×f)=(n+2p-f+1)×(n+2p-f+1)
- p=(f-1)/2
- Pad so that output size is the same as the input size.
- f is usually odd
- padding can then be symmetric
- and the filter has a center pixel to mark its position (a zero-padding sketch follows this list)
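A small sketch of zero-padding with `np.pad`, assuming square inputs; it just verifies that p=(f-1)/2 keeps the output size unchanged:

```python
import numpy as np

n, f = 6, 3
p = (f - 1) // 2                    # p = 1 for a 3x3 filter ("same" padding)

image = np.random.rand(n, n)
padded = np.pad(image, p, mode='constant', constant_values=0)  # zero-padding

print(padded.shape)                 # (8, 8): the 6x6 input with a 1-pixel border
print(n + 2 * p - f + 1)            # 6: the output size equals the input size
```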
Strided Convolutions
- stride=2: the filter jumps two positions (skipping one) at each step
- Summary of convolutions (a small helper function follows this list): \[
\text{image: }n\times n \text{ filter: }f\times f\\
\text{padding: }p, \text{ stride: }s\\
\text{output size: }
\left\lfloor\frac{n+2p-f}{s}+1\right\rfloor \times \left\lfloor\frac{n+2p-f}{s}+1\right\rfloor
\]
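This formula is easy to wrap in a helper; a minimal sketch (the name `conv_output_size` is illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Implements floor((n + 2p - f) / s) + 1 from the formula above."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4: "valid" convolution
print(conv_output_size(6, 3, p=1))   # 6: "same" convolution
print(conv_output_size(7, 3, s=2))   # 3: strided convolution
```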
Technical note on cross-correlation vs. convolution
- Convolution in math textbook
- the filter is mirrored (flipped) on both the vertical and horizontal axes before being applied
- this makes convolution associative: (A*B)*C=A*(B*C)
- Convolution in Deep Learning
- cross-correlation is what deep learning calls “convolution”
- in deep learning the mirroring step is skipped; since the filters are learned anyway, it makes no difference (a SciPy check follows this list)
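SciPy exposes both operations, which makes the distinction easy to verify; a short check (`convolve2d` flips the kernel, `correlate2d` does not):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

conv = convolve2d(image, kernel, mode='valid')     # textbook convolution (flips kernel)
corr = correlate2d(image, kernel, mode='valid')    # what deep learning calls "convolution"
flipped = correlate2d(image, kernel[::-1, ::-1], mode='valid')

print(np.allclose(conv, flipped))   # True: convolution = cross-correlation with a flipped kernel
print(np.allclose(conv, corr))      # False in general
```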
Convolutions Over Volume
- Convolution on RGB images
- height, width, #channels
- (6x6x3)*(3x3x3)=(4,4)
- Multiple filters
- \[
(6,6,3) \begin{cases}
*\overbrace{(3,3,3)}^{\text{vertical edge}}=(4,4)\\
*\overbrace{(3,3,3)}^{\text{horizontal edge}}=(4,4)
\end{cases}\rightarrow (4,4,2)
\] - Summary:\[
(n \times n \times n_c) * (f \times f \times n_c) \longrightarrow (n-f+1) \times (n-f+1) \times \underbrace{n_c'}_{\#\text{filters}}\\
(6 \times 6 \times 3) * (3 \times 3 \times 3) \longrightarrow (4 \times 4 \times 2)
\]
- Summary
- With multiple filters (10, 128, even several hundred), a layer can detect that many different features (vertical edges, horizontal edges, …).
- The number of output channels equals the number of filters applied.
- The last dimension is called the number of “channels”; people also call it the “depth” of the 3D volume. Both terms are common in the literature, but “depth” can be confusing because it also refers to the depth of the neural network.
One Layer of a Convolutional Network
- Example of layer\[
\overbrace{(6,6,3)}^{a^{[0]}} \begin{cases}
*\overbrace{(3,3,3)}^{w^{[1]}}\rightarrow ReLU(\overbrace{\overbrace{(4,4)}^{ w^{[1]}a^{[0]} }+b_1}^{z^{[1]}})\\
*(3,3,3)\rightarrow ReLU((4,4)+b_2)
\end{cases}
\rightarrow \overbrace{(4,4,2)}^{a^{[1]}}\\
z^{[1]}=w^{[1]}a^{[0]}+b^{[1]}\\
a^{[1]}=g(z^{[1]})
\] - If you have 10 filters that are 3×3×3 in one layer of a NN, how many parameters does that layer have? (a forward-pass sketch follows this list)
- 3×3×3 + 1 bias = 28 parameters per filter
- 28 × 10 filters = 280 parameters
- However large the input image, the number of parameters stays fixed at 280.
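A minimal NumPy sketch of this forward step (valid convolution, stride 1) under the shapes used above; `conv_layer_forward` is an illustrative name, not the course's function:

```python
import numpy as np

def conv_layer_forward(a_prev, W, b):
    """One conv layer: valid convolution, stride 1, then ReLU.
    a_prev: (n, n, n_c_prev), W: (f, f, n_c_prev, n_filters), b: (n_filters,)."""
    n, f, n_filters = a_prev.shape[0], W.shape[0], W.shape[3]
    n_out = n - f + 1
    z = np.zeros((n_out, n_out, n_filters))
    for k in range(n_filters):                 # one 3D filter per output channel
        for i in range(n_out):
            for j in range(n_out):
                z[i, j, k] = np.sum(a_prev[i:i+f, j:j+f, :] * W[:, :, :, k]) + b[k]
    return np.maximum(z, 0)                    # a = ReLU(z)

a0 = np.random.rand(6, 6, 3)
W1 = np.random.randn(3, 3, 3, 10)              # ten 3x3x3 filters
b1 = np.random.randn(10)

print(conv_layer_forward(a0, W1, b1).shape)    # (4, 4, 10)
print(W1.size + b1.size)                       # 280 parameters, as counted above
```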
Summary of notation
- If layer l is a convolution layer:
- filter size: \(f^{[l]}\)
- padding: \(p^{[l]}\)
- stride: \(s^{[l]}\)
- number of filters: \(n_c^{[l]}\)
- Input: \(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}\)
- Output: \(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}\)
- Output volume size (height): \(n_H^{[l]}=\left\lfloor\frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor \)
- Output volume size (width): \(n_W^{[l]}=\left\lfloor\frac{n_W^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor \)
- Each filter: \( f^{[l]} \times f^{[l]} \times n_c^{[l-1]}\)
- Activations: \( a^{[l]} \rightarrow n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]} \)
- Activations over a batch of m examples (for gradient descent etc.): \( A^{[l]} \rightarrow m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]} \)
- Weights: \( f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}\)
- bias: \( n_c^{[l]} \) values, stored with shape \((1,1,1,n_c^{[l]})\)
Simple Convolutional Network Example
- 39x39x3 input \(x\): is the image a cat? \[
n_H^{[0]}=n_W^{[0]}=39\\
n_c^{[0]}=3\\
\] - ↓\[
f^{[1]}=3\\
s^{[1]}=1\\
p^{[1]}=0\dots \text{valid convolutions}\\
\text{10 filters}
\] - 37x37x10 \(a^{[1]}\)\[
n_H^{[1]}=n_W^{[1]}=37\\
n_c^{[1]}=10
\] - ↓\[
f^{[2]}=5\\
s^{[2]}=2\dots \text{shrinks the volume faster}\\
p^{[2]}=0\\
\text{20 filters}
\] - 17x17x20 \(a^{[2]}\)\[
n_H^{[2]}=n_W^{[2]}=17\\
n_c^{[2]}=20
\] - ↓\[
f^{[3]}=5\\
s^{[3]}=2\\
p^{[3]}=0\\
\text{40 filters}
\] - 7x7x40 \(a^{[3]}\)
- ↓\[
\text{flatten }7\times7\times40\text{ into }1960\text{ units}\\
\text{logistic regression/softmax}
\] - \(\hat y\) (a shape-tracing sketch follows this list)
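Tracing the shapes through this network with the output-size formula, as a sanity check (`conv_output_size` is the illustrative helper from the strided-convolutions section):

```python
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

n, n_c = 39, 3
for f, s, n_filters in [(3, 1, 10), (5, 2, 20), (5, 2, 40)]:
    n, n_c = conv_output_size(n, f, s=s), n_filters   # p = 0: valid convolutions
    print(f"{n}x{n}x{n_c}")   # 37x37x10, then 17x17x20, then 7x7x40
print(n * n * n_c)            # 1960 units fed into logistic regression/softmax
```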
Types of layer in a convolutional network
- Convolution (CONV)
- Pooling (POOL)
- Fully connected (FC)
Pooling Layers
- Max pooling
- Input: 4×4; split it into four 2×2 regions and take the max of each to produce a 2×2 output
- Hyperparameters
- f=2
- s=2
- Intuition
- if a feature is present anywhere inside the (max pooling) filter window, the max stays high; if the feature is absent, the max stays small
- Input 5×5
- f=3, s=1
- Output 3×3
- Average pooling
- take the average instead of the max (a pooling sketch follows this list)
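A minimal single-channel max-pooling sketch in NumPy (`max_pool` here is an illustrative helper; in practice it is applied independently to each channel):

```python
import numpy as np

def max_pool(a, f=2, s=2):
    """Max pooling on a single channel; output size follows floor((n-f)/s)+1."""
    n_out = (a.shape[0] - f) // s + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            out[i, j] = a[i*s:i*s+f, j*s:j*s+f].max()
    return out

a = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

print(max_pool(a))   # [[9. 2.] [6. 3.]]: the max of each 2x2 region
```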
Summary of pooling
- Hyperparameters:
- f: filter size
- s: stride
- Max or average pooling
- p: usually 0
- \[
n_H \times n_W \times n_c\\
\downarrow\\
\left\lfloor\frac{n_H-f}{s}+1\right\rfloor\times\left\lfloor \frac{n_W-f}{s}+1 \right\rfloor\times n_c
\] - pooling layer
- fixed function
- no parameters for backpropagation to learn
- there is actually nothing to learn
- hyperparameters set once by hand or cross-validation
CNN Example
- Neural Network example(LeNet-5)
- x: 32x32x3
- f=5,s=1
- conv1: 28x28x6
- maxpool f=2,s=2
- pool1: 14x14x6
- conv1 + pool1 = layer 1 (in the ConvNet convention of counting only layers with weights)
- f=5,s=1
- conv2: 10x10x16
- maxpool f=2,s=2
- pool2 : 5x5x16
- conv2+pool2=layer2
- flatten: 400 units (5x5x16)
- FC3: 120units\[
w^{[3]}: (120,400)\\
b^{[3]}: (120)
\] - FC4: 84units
- softmax
- 10outputs
- guideline
- don't invent your own hyperparameters; look at the literature and choose settings that have worked well
- as the NN gets deeper, n_H and n_W shrink while the number of channels grows
| | Activation shape | Activation size | #parameters |
| --- | --- | --- | --- |
| Input | (32,32,3) | 3,072 | 0 |
| CONV1 (f=5, s=1) | (28,28,8) | 6,272 | 608 = (5×5×3+1)×8 |
| POOL1 | (14,14,8) | 1,568 | 0 |
| CONV2 (f=5, s=1) | (10,10,16) | 1,600 | 3,216 = (5×5×8+1)×16 |
| POOL2 | (5,5,16) | 400 | 0 |
| FC3 | (120,1) | 120 | 48,120 = 400×120+120 |
| FC4 | (84,1) | 84 | 10,164 = 120×84+84 |
| Softmax | (10,1) | 10 | 850 = 84×10+10 |
- Max pooling layers don’t have any parameters.
- Conv layers tend to have relatively few parameters.
- A lot of the parameters tend to be in the fully connected layers of the NN.
- Activation size tends to go down gradually as you go deeper in the NN; if it drops too quickly, that's usually not great for performance (a parameter-count check follows this list).
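The #parameters column can be reproduced with two one-line formulas; a quick check (the helper names are illustrative):

```python
def conv_params(f, n_c_prev, n_filters):
    return (f * f * n_c_prev + 1) * n_filters   # +1: one bias per filter

def fc_params(n_in, n_out):
    return n_in * n_out + n_out                 # weights plus biases

print(conv_params(5, 3, 8))    # 608   (CONV1)
print(conv_params(5, 8, 16))   # 3216  (CONV2)
print(fc_params(400, 120))     # 48120 (FC3)
print(fc_params(120, 84))      # 10164 (FC4)
print(fc_params(84, 10))       # 850   (Softmax)
```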
Why Convolutions?
- Parameter sharing
- A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
- Sparsity of connections
- In each layer, each output value depends only on a small number of inputs.
- both reduce the number of parameters and help prevent overfitting
- Putting it together
- Training set \((x^{(1)},y^{(1)}),\dots (x^{(m)},y^{(m)} )\)
- \[
X\rightarrow (\text{Conv}\rightarrow \text{Pool})\rightarrow \dots \rightarrow \underbrace{\text{FC}}_{w,b} \rightarrow \text{softmax} \rightarrow \hat y
\] - Cost\[
J=\frac1m \sum_{i=1}^m L(\hat y^{(i)},y^{(i)})
\]
Programming assignments
Convolutional Neural Networks
Create placeholders
Initialize parameters
Forward propagation
- Z: conv2d of the input (X, or the previous layer's P)
- A: relu of Z
- P: max_pool of A
- F: flatten of P
- Z: fully_connected of F (a TensorFlow-style sketch follows the model line below)
Implements the forward propagation for the model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
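A sketch of this forward pass in the TensorFlow 1.x style the assignment uses; the ksize/stride values and the 6 output units here are illustrative assumptions, not necessarily the assignment's exact settings:

```python
import tensorflow as tf  # TensorFlow 1.x, as in the original assignment

def forward_propagation(X, parameters):
    W1, W2 = parameters['W1'], parameters['W2']
    # CONV2D -> RELU -> MAXPOOL
    Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME')
    A1 = tf.nn.relu(Z1)
    P1 = tf.nn.max_pool(A1, ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1], padding='SAME')
    # CONV2D -> RELU -> MAXPOOL
    Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding='SAME')
    A2 = tf.nn.relu(Z2)
    P2 = tf.nn.max_pool(A2, ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1], padding='SAME')
    # FLATTEN -> FULLYCONNECTED (no activation: softmax is applied in the cost)
    F = tf.contrib.layers.flatten(P2)
    Z3 = tf.contrib.layers.fully_connected(F, 6, activation_fn=None)
    return Z3
```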
Window, kernel, filter
The words “window”, “kernel”, and “filter” are used to refer to the same thing. This is why the parameter ksize refers to “kernel size”, and we use (f,f) to refer to the filter size. Both “kernel” and “filter” refer to the “window”.