# DL [Course 4/5] Convolutional Neural Networks [Week 2/4] Deep convolutional models: case studies

Key Concepts

• Understand and Implement a Residual network
• Clone a repository from github and use transfer learning
• Analyze the dimensionality reduction of a volume in a very deep network
• Understand multiple foundational papers of convolutional neural networks
• Implement a skip-connection in your network
• Build a deep neural network using Keras

$$\require{cancel}$$

## Case studies

### Why look at case studies?

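• A good way to build intuition for designing ConvNets is to read papers and study existing architectures
• An architecture that works well on one task often turns out to work well on other tasks too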
• Classic networks
• LeNet-5
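• Its roots go back to the late 1980s (the LeNet-5 paper itself was published in 1998)
• AlexNet
• Frequently cited
• VGG
• Frequently cited
• ResNet (conv residual network)
• 152 layers
• Tricks and ideas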
• Inception

### Classic Networks

#### LeNet-5

• steps
• image: 32,32,1
• filter: f=5, s=1, no padding
• 32-5+1=28
• 6 channel
• 28,28,6
• avg_pool: f=2, s=2
• If you built a modern variant, you would probably use max pooling instead.
• 28/2=14
• 14,14,6
• filter: f=5, s=1
• 14+1-5=10
• 16 channel
• 10,10,16
• avg_pool: f=2,s=2
• 10/2=5
• 5,5,16=400units
• FC
• 120units
• FC
• 84units
• 10 pattern classifier
• a modern variant would probably use softmax
• $$\hat y$$
• LeNet-5
• has 60k parameters
• modern NN has 10 million to 100 million parameters
• $$n_H,n_W \downarrow n_C \uparrow$$
• conv pool conv pool fc fc output
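• sigmoid/tanh (a modern variant would use ReLU)
• The original paper wires filters to channels in a crazy complicated way: different filters look at different channels of the input block.
• The original network actually applies a sigmoid non-linearity after the pooling layer.
• When reading the paper, focus on section 2.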
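
The 60k-parameter figure can be roughly checked from the shapes listed above. The sketch below is only an approximation: it assumes every conv filter sees all input channels, whereas the original paper connects filters to subsets of channels, so the exact total in the paper differs slightly. The helper names are made up for illustration.

```python
def conv_params(f, n_c_in, n_c_out):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * n_c_in + 1) * n_c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out


# Layer shapes taken from the steps listed in the notes.
lenet5_layers = {
    "conv1 (5x5, 1 -> 6)": conv_params(5, 1, 6),
    "conv2 (5x5, 6 -> 16)": conv_params(5, 6, 16),
    "fc1 (400 -> 120)": fc_params(5 * 5 * 16, 120),
    "fc2 (120 -> 84)": fc_params(120, 84),
    "output (84 -> 10)": fc_params(84, 10),
}
lenet5_total = sum(lenet5_layers.values())

for name, count in lenet5_layers.items():
    print(f"{name}: {count}")
print("total:", lenet5_total)  # roughly 60k
```

Most of the parameters sit in the first fully connected layer, which is typical for this conv/pool/FC pattern.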

#### AlexNet

• steps
• 227,227,3
• filter: f=11,s=4
• 55,55,96
• max_pool: f=3,s=2
• 27,27,96
• same: f=5
• 27,27,256
• max_pool: f=3,s=2
• 13,13,256
• same: f=3
• 13,13,384
• same: f=3
• 13,13,384
• same: f=3
• 13,13,256
• max_pool: f=3,s=2
• 6,6,256
• FC 9216
• FC 4096
• FC 4096
• Softmax 1000
• better than LeNet-5
• 60 million parameters
• many more hidden units
• ReLU activation
• ImageNet Classification with Deep Convolutional Neural Networks
• ImageNet Classification with Deep Convolutional Neural Networks (slide)
• When this paper was written, GPUs were slower, so it used a complicated scheme to train across two GPUs.
• Local Response Normalization(LRN)
• For one position (h, w), normalize across all the channels c.
• The motivation: you don’t want too many neurons with a very high activation.
• But subsequently, many researchers have found that this doesn’t help that much.
• In the history of deep learning, it was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning and showed them that it really works in computer vision. It then went on to have a huge impact beyond computer vision as well.
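
The volume sizes in the AlexNet steps above follow from the usual output-size formula floor((n + 2p - f) / s) + 1. The sketch below traces them; the layer table simply restates the steps from the notes, and `out_size` and `same_pad` are illustrative helper names.

```python
def out_size(n, f, s=1, p=0):
    """Output height/width of a conv or pool layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_pad(f):
    """Padding that keeps the size unchanged for stride 1 and odd f."""
    return (f - 1) // 2


# (name, filter size, stride, padding, output channels or None to keep channels)
alexnet_layers = [
    ("conv 11x11, s=4", 11, 4, 0, 96),
    ("max-pool 3x3, s=2", 3, 2, 0, None),
    ("conv 5x5, same", 5, 1, same_pad(5), 256),
    ("max-pool 3x3, s=2", 3, 2, 0, None),
    ("conv 3x3, same", 3, 1, same_pad(3), 384),
    ("conv 3x3, same", 3, 1, same_pad(3), 384),
    ("conv 3x3, same", 3, 1, same_pad(3), 256),
    ("max-pool 3x3, s=2", 3, 2, 0, None),
]

n, c = 227, 3
alexnet_shapes = [(n, n, c)]
for name, f, s, p, n_c in alexnet_layers:
    n = out_size(n, f, s, p)
    c = c if n_c is None else n_c
    alexnet_shapes.append((n, n, c))
    print(f"{name:>18}: {n},{n},{c}")

flattened = n * n * c
print("flattened units:", flattened)  # 9216, the size of the first FC input
```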

#### VGG-16

• CONV=3×3 filter,s=1,same
• MAX-POOL=2×2,s=2
• steps
• 224,224,3
• [CONV 64] x2 (2 conv layers)
• 224,224,64
• POOL
• 112,112,64
• [CONV 128] x2
• 112,112,128
• POOL
• 56,56,128
• [CONV 256]x3
• 56,56,256
• POOL
• 28,28,256
• [CONV 512]x3
• 28,28,512
• POOL
• 14,14,512
• [CONV 512]x3
• 14,14,512
• POOL
• 7,7,512
• FC 4096
• FC 4096
• Softmax 1000
• VGG-16
• 16 refers to the fact that this has 16 layers that have weights.
• About 138 million parameters, which is large even by modern standards.
• The simplicity and uniformity of the architecture made it quite appealing.
• Doubling the number of filters at every stack of conv layers (64, 128, 256, 512) was another simple principle used to design the architecture of this network.
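
The 138 million figure can be reproduced from the layer list above. This is a minimal sketch that assumes every conv and FC layer has a bias term; the helper names are illustrative.

```python
def conv3x3_params(n_c_in, n_c_out):
    """Weights plus one bias per filter for a 3x3 convolution."""
    return (3 * 3 * n_c_in + 1) * n_c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out


# (number of conv layers in the stack, filters per layer)
vgg16_stacks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

n_c, size = 3, 224
vgg16_conv_total = 0
for n_layers, n_filters in vgg16_stacks:
    for _ in range(n_layers):
        vgg16_conv_total += conv3x3_params(n_c, n_filters)
        n_c = n_filters
    size //= 2  # each stack ends with a 2x2, s=2 max-pool

vgg16_fc_total = (
    fc_params(size * size * n_c, 4096)
    + fc_params(4096, 4096)
    + fc_params(4096, 1000)
)
vgg16_total = vgg16_conv_total + vgg16_fc_total

print(size, n_c)    # final volume is 7 x 7 x 512
print(vgg16_total)  # about 138 million
```

The 13 conv layers plus 3 FC layers are the 16 layers with weights; the first FC layer alone accounts for roughly 100 million of the parameters.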

### ResNets

• Very deep neural networks are difficult to train
• vanishing and exploding gradient types of problems.
• Residual Block
• main path: $a^{[l]}\rightarrow \boxed{\circ\\\circ\\\circ}\overbrace{\rightarrow}^{a^{[l+1]}}\boxed{\circ\\\circ\\\circ}\rightarrow a^{[l+2]}\\ a^{[l]}\rightarrow Linear \rightarrow \ ReLU \overbrace{\rightarrow}^{a^{[l+1]}} Linear \rightarrow ReLU \rightarrow a^{[l+2]}$
• $z^{[l+1]}=W^{[l+1]}a^{[l]}+b^{[l+1]}\\ a^{[l+1]}=g(z^{[l+1]})\\ z^{[l+2]}=W^{[l+2]}a^{[l+1]}+b^{[l+2]}\\ a^{[l+2]}=g(z^{[l+2]})$
• short path / skip connection:$a^{[l]}\rightarrow \overbrace{ \bullet \rightarrow \boxed{\circ\\\circ\\\circ}\rightarrow\boxed{\circ\\\circ\\\circ} \oplus}^{\text{short path / skip connection}} \rightarrow a^{[l+2]}\\ a^{[l]}\rightarrow \overbrace{\bullet \rightarrow Linear \rightarrow \ ReLU \rightarrow Linear \rightarrow \oplus}^{\text{short path / skip connection}} \rightarrow ReLU \rightarrow a^{[l+2]}$
• $z^{[l+1]}=W^{[l+1]}a^{[l]}+b^{[l+1]}\\ a^{[l+1]}=g(z^{[l+1]})\\ z^{[l+2]}=W^{[l+2]}a^{[l+1]}+b^{[l+2]}\\ \xcancel{a^{[l+2]}=g(z^{[l+2]})}\\ a^{[l+2]}=g(z^{[l+2]}+a^{[l]})$
• It allows you to take the activation from one layer and suddenly feed it to another layer even much deeper in the neural network.
• Residual Network
• Plain Network$x \rightarrow\boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow a^{[l]}$
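• In a plain network (no residual connections), as the number of layers grows the training error tends to decrease for a while and then start increasing again
• In theory a deeper network should only do better, but in practice the optimization algorithm has a much harder time training a very deep plain network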
• Residual Network (5 residual blocks stacked)$x \rightarrow \overbrace{\bullet \rightarrow\boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ }}^{\text{short path}} \rightarrow \overbrace{ \bullet \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ} }^{\text{short path}} \rightarrow \overbrace{ \bullet \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } }^{\text{short path}} \rightarrow \overbrace{ \bullet \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } }^{\text{short path}} \rightarrow \overbrace{\bullet \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } }^{\text{short path}} \rightarrow a^{[l]}$
• With a Residual Network, the training error can keep going down as the network gets deeper, even at 100 layers. Some research explores networks with more than 1000 layers, although these are rarely used in practice.
• This really helps with the vanishing and exploding gradient problems and allows you to train much deeper neural networks without any really appreciable loss in performance.
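
The residual block equations above can be sketched in a few lines of plain Python. This is a minimal illustration on small vectors, not the course's Keras implementation; `relu`, `linear`, and `residual_block` are names chosen here, and the weights are arbitrary example values.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, a, b):
    """z = W a + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(linear(W1, a_l, b1))                 # a^{[l+1]} = g(z^{[l+1]})
    z_l2 = linear(W2, a_l1, b2)                      # z^{[l+2]}
    return relu([z + a for z, a in zip(z_l2, a_l)])  # a^{[l+2]} = g(z^{[l+2]} + a^{[l]})


a_l = [1.0, 2.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[0.5, 0.0], [0.0, -1.0]], [0.0, 0.0]

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2)  # [1.5, 0.0]
```

Without the skip connection the same weights would give `relu([0.5, -2.0])`, so the only change is the `+ a` term added before the final ReLU.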

### Why ResNets Work

• Why do residual networks work? $x\rightarrow\boxed{BigNN}\rightarrow a^{[l]}\\ x\rightarrow\boxed{BigNN}\rightarrow \overbrace{a^{[l]} \rightarrow \boxed{\circ\\\circ\\\circ}\rightarrow \boxed{\circ\\\circ\\\circ}}^{\text{short path}}\rightarrow a^{[l+2]}$
• With ReLU, every activation satisfies $$a\geq0$$, so: $$\begin{aligned} a^{[l+2]}&=g(z^{[l+2]}+a^{[l]})\\ &=g(\underbrace{W^{[l+2]}a^{[l+1]}+b^{[l+2]}}_{=0\text{ if }W^{[l+2]}=0,\ b^{[l+2]}=0}+a^{[l]})\\ &=g(a^{[l]})\\ &=a^{[l]}\quad(\text{ReLU, since } a^{[l]}\geq 0) \end{aligned}$$
• The identity function is easy for a residual block to learn.
• So adding these two layers to your neural network doesn’t hurt its ability to do as well as the simpler network without the two extra layers.
• This assumes that $$z^{[l+2]},a^{[l]}$$ have the same dimension, which is why ResNets make heavy use of same convolutions.
• In case the input and output have different dimensions, add a matrix $$W_s$$ on the skip connection. $$\begin{aligned}a^{[l+2]}&=g(z^{[l+2]}+ W_s a^{[l]})\\ &=g(\underbrace{W^{[l+2]}a^{[l+1]}+b^{[l+2]}}_{=0\text{ if }W^{[l+2]}=0,\ b^{[l+2]}=0}+W_s a^{[l]})\\ &=g(W_s a^{[l]})\\ &=W_s a^{[l]}\quad(\text{ReLU, when } W_s a^{[l]}\geq 0) \end{aligned}$$ $$\text{If: } a^{[l]}\in \Bbb R^{128},\ a^{[l+2]}\in \Bbb R^{256}\quad \text{then: } W_s\in \Bbb R^{256\times 128}$$
• $$W_s$$ could be a matrix of parameters that is learned
• $$W_s$$ could be a fixed matrix that just implements zero padding: it takes $$a^{[l]}$$ and zero-pads it to be 256-dimensional.
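
The two cases in the derivation above can be checked numerically. The sketch below uses tiny dimensions (2 and 4 instead of 128 and 256) and sets the second layer's weights and bias to zero so that z^{[l+2]} = 0; the function names are illustrative only.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, a):
    return [sum(w * x for w, x in zip(row, a)) for row in W]


def residual_output(z_l2, a_l, W_s=None):
    """a^{[l+2]} = g(z^{[l+2]} + W_s a^{[l]}); W_s=None means a plain identity skip."""
    skip = a_l if W_s is None else matvec(W_s, a_l)
    return relu([z + s for z, s in zip(z_l2, skip)])


a_l = [0.5, 2.0]  # post-ReLU activations, so every entry is >= 0

# Case 1: W^{[l+2]} = 0 and b^{[l+2]} = 0 give z^{[l+2]} = 0, so the block is the identity.
same_dim = residual_output([0.0, 0.0], a_l)

# Case 2: the output has 4 units, so W_s is 4 x 2; here it is a fixed zero-padding matrix.
W_s = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.0],
]
padded = residual_output([0.0, 0.0, 0.0, 0.0], a_l, W_s)

print(same_dim)  # [0.5, 2.0]
print(padded)    # [0.5, 2.0, 0.0, 0.0]
```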

### Networks in Networks and 1×1 Convolutions

• What does a 1×1 convolution do?
• If c=1, there is little benefit: it just multiplies the image by a single number
• image: 6,6,1
• filter: 1,1,1
• 6,6,1
• If c=32
• image:6,6,32
• filter: 1,1,32
• $$32 \rightarrow \text{#filters} \dots n_c^{[l+1]}$$
• 6,6,#filters
• 1×1 convolution?
• At each (height, width) position, it takes the 32 channel values, multiplies them by the filter's 32 weights, sums the products, applies ReLU, and writes the result to the corresponding output position.
• A 1×1 convolution is basically a fully connected neural network applied at each of the 36 (6×6) positions.
• Sometimes called Network in Network
• Even though the details of the architecture aren’t widely used, this idea has strongly influenced many other neural network architectures, including the inception network.
• Using 1×1 convolutions
• 28,28,192
• ReLU: CONV 1,1,32
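• If you use 192 filters (output 28,28,192), the layer keeps the number of channels and just adds non-linearity, which allows you to learn a more complex function.
• With 32 filters, the effect is to shrink the number of channels from 192 to 32.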
• 28,28,32
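
A 1×1 convolution can be written as a per-position dot product across channels followed by ReLU. The sketch below is a plain-Python illustration on a tiny 2×2×3 volume with made-up weights (the notes use 28×28×192 shrunk to 32 channels); `conv1x1_relu` is a name chosen here.

```python
def conv1x1_relu(volume, filters, biases):
    """volume[h][w] holds the n_C channel values at one position.

    Each filter is a list of n_C weights; the output has one channel per filter.
    """
    return [
        [
            [max(0.0, sum(w * x for w, x in zip(f, pixel)) + b)
             for f, b in zip(filters, biases)]
            for pixel in row
        ]
        for row in volume
    ]


# A 2 x 2 x 3 volume shrunk to 2 x 2 x 2 by two 1x1 filters.
volume = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[2.0, 2.0, 2.0], [4.0, 0.0, 1.0]],
]
filters = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]

out = conv1x1_relu(volume, filters, biases)
print(out[0][0])  # [0.0, 3.0]
```

Height and width are untouched; only the channel dimension changes, from the number of input channels to the number of filters.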

### Inception Network Motivation

• When designing a ConvNet layer, should you use f=1, f=3, f=5, or a POOL layer? The Inception Network says: why not do them all?
• Motivation for inception network
• 28,28,192
• Inception module
• Same 1,1
• 28,28,64
• Same 3,3
• 28,28,128
• Same 5,5
• 28,28,32
• MAX-POOL (same padding, s=1)
• 28,28,32
• Channel concat: 28,28,256 (64+128+32+32)
• Basic idea
• Rather than picking one filter size or pooling, do them all, concatenate all the outputs, and let the network learn whatever parameters it wants and whatever combination of these filter sizes it wants to use.
• The problem of computational cost
• 28,28,192
• CONV 5,5,same,32
• 28,28,32
• 32 filters, each of size 5,5,192, output 28,28,32
• Cost 28*28*32*5*5*192=120M multiplications (possible, but an expensive operation)
• Using 1×1 convolution
• 28,28,192
• CONV 1,1,16 (each filter is 1,1,192)
• 28*28*16*192=2.4M
• 28,28,16
• bottleneck layer(192->16->32)
• CONV 5,5,32 (each filter is 5,5,16)
• 28*28*32*5*5*16=10.0M
• 28,28,32
• Total cost 12.4M, about 10 times cheaper than 120M
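
The cost comparison above is just a product of the output volume size and the filter size. A small sketch, counting multiplications only and assuming same padding with stride 1 (the function name is illustrative):

```python
def conv_multiplications(n_h, n_w, n_c_in, f, n_filters):
    """Multiplications for a same-padded, stride-1 convolution."""
    return n_h * n_w * n_filters * f * f * n_c_in


direct = conv_multiplications(28, 28, 192, f=5, n_filters=32)
reduce_1x1 = conv_multiplications(28, 28, 192, f=1, n_filters=16)
conv_5x5 = conv_multiplications(28, 28, 16, f=5, n_filters=32)
bottleneck = reduce_1x1 + conv_5x5

print(direct)                # 120422400
print(reduce_1x1, conv_5x5)  # 2408448 10035200
print(bottleneck)            # 12443648
print(round(direct / bottleneck, 1))
```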

### Inception Network

#### Inception module

• Previous Activation(192) → {parallel}
• 1×1 CONV(64) →CC
• 1×1 CONV(96) →
• 3×3 CONV(128) →CC
• 1×1 CONV(16) →
• 5×5 CONV(32) →CC
• MAX-POOL(3×3,s=1,same)(192) →
• 1×1 CONV(32) →CC
• 32 filter, 1x1x192
• → Channel Concat
• 28,28,256 (64+128+32+32 channels)

#### Inception network

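• The Channel Concat output of one block is connected as the Previous Activation of the next block
• In some places a MAX-POOL after the Channel Concat changes the height and width, and its output is connected as the Previous Activation of the next block
• Side branches take a hidden layer, apply a few layers (conv, FC) followed by a softmax, and try to make the prediction from that hidden layer
• These side branches are thought to have a regularizing effect and help prevent overfitting
• The Inception paper actually cites the "we need to go deeper" meme as motivation.
• This Inception network was developed by authors at Google, who called it GoogLeNet, spelled that way as an homage to LeNet.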

## Practical advices for using ConvNets

### Using Open-Source Implementation

• Sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of them. Downloading their open-source implementation and weights allows you to do transfer learning using these networks.

• dataset

### Transfer Learning

• Cat classification: Tigger, Misty, others (3 class)
• Open-source NN and Weights (pre-trained NN)
• ImageNet dataset
• x→NN→softmax→1/1000 class$x \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \underbrace{\circ}_{softmax} \rightarrow \hat y \{1000\}$
• When you don’t have a lot of pictures of Tigger and Misty (small training set)
• Replace the original softmax unit with your own
• x→NN→{softmax→3class(Tigger,Misty,None)}$x \rightarrow \underbrace{\boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \overbrace{\boxed{\circ\\\circ\\\circ }}^{save\\to\\disk}}_{freeze=1} \rightarrow \underbrace{\circ}_{myown\\softmax} \rightarrow \hat y \{T,M,N\}$
• The parameters of the pretrained layers are kept fixed (frozen).
• trainableParameter=0
• freeze=1
• Only the parameters of your own softmax layer are trained.
• Trick to speed up training
• Pre-compute the activations (features) of the last frozen layer for every training image and save them to disk, so you don’t need to recompute them every epoch.
• Take any input image X, compute a feature vector with the frozen NN as a fixed function, and then train a shallow softmax model on this feature vector to make a prediction.
• When you have a larger training set of Tigger, Misty, and neither images
• freeze fewer layers$x \rightarrow \underbrace{\boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} }_{freeze=1} \rightarrow \underbrace{\overbrace{\boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ }}^{original\\or\\myown} \rightarrow \underbrace{\circ}_{myown\\softmax}}_{train} \rightarrow \hat y \{T,M,N\}$
• Finally, when you have a very large training set of Tigger, Misty, and neither images
• use the open-source network and its weights just as initialization and train the whole network$x \rightarrow \underbrace{\boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ} \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \boxed{\circ\\\circ\\\circ } \rightarrow \underbrace{\circ}_{myown\\softmax} }_{train} \rightarrow \hat y \{T,M,N\}$
• In computer vision, transfer learning is something you should almost always do, unless you have an exceptionally large dataset that lets you train everything from scratch yourself.
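
The small-data recipe above (freeze the network, cache its features once, train only a new softmax) can be sketched without any framework. Everything here is a toy stand-in: `frozen_features` is a hypothetical fixed function playing the role of the pretrained layers, the "images" are pairs of numbers, and the head is trained with plain SGD on cross-entropy.

```python
import math


def frozen_features(x):
    """Hypothetical stand-in for the frozen pretrained layers (never updated)."""
    d = x[0] - x[1]
    return [max(0.0, d), max(0.0, -d), max(0.0, 1.0 - abs(d))]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def scores(W, f):
    return [sum(w * v for w, v in zip(row, f)) for row in W]


def train_softmax_head(features, labels, n_classes, lr=0.1, epochs=50):
    """Plain SGD on cross-entropy; only the head's weights W are trained."""
    W = [[0.0] * len(features[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = softmax(scores(W, f))
            for k in range(n_classes):
                grad = p[k] - (1.0 if k == y else 0.0)  # dLoss/dz_k
                for j, v in enumerate(f):
                    W[k][j] -= lr * grad * v
    return W


def predict(W, x):
    s = scores(W, frozen_features(x))
    return s.index(max(s))


# Toy "images" (2 numbers each) with labels 0=Tigger, 1=Misty, 2=neither.
images = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0], [1.0, 1.0], [2.0, 2.0]]
labels = [0, 0, 1, 1, 2, 2]

# Compute the frozen features once and reuse them in every epoch.
cached = [frozen_features(x) for x in images]
W_head = train_softmax_head(cached, labels, n_classes=3)

print([predict(W_head, x) for x in images])  # [0, 0, 1, 1, 2, 2]
```

In a real setting the cached features would be written to disk, and the frozen function would be the pretrained network with its original softmax removed.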

### Data Augmentation

• In computer vision there is almost never enough data, so data augmentation usually helps
• Common augmentation method
• Mirroring
• Random Cropping
• Not a perfect data augmentation: a random crop might not look much like a cat.
• But in practice it works well and is worthwhile, so long as your random crops are reasonably large subsets of the actual image.
• Rotation
• Shearing
• Local warping
• Color shifting
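• e.g. add (+20, -20, +20) to the R, G, B channels
• Motivation: maybe the sunlight was a bit yellow, or the indoor illumination was a bit more yellow. Color shifting makes your learning algorithm more robust to such changes in the colors of your images.
• PCA (Principal Component Analysis)
• PCA Color Augmentation
• If the image is mostly purple (mainly red and blue tints, very little green), it adds and subtracts a lot to R and B and relatively little to G, which keeps the overall tint the same
• Implementing distortions during training
• Load images and apply the distortions on the fly
• Group the distorted images into a mini-batch
• Pass the mini-batch to training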
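
Mirroring, random cropping, and color shifting from the list above are simple array operations. The sketch below works on an image stored as nested lists of (R, G, B) tuples, which is an assumption made here to stay dependency-free; real pipelines would use an image library.

```python
import random


def mirror(img):
    """Horizontal flip: reverse every row of pixels."""
    return [row[::-1] for row in img]


def random_crop(img, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def color_shift(img, dr, dg, db):
    """Add an offset to each channel and clip to the valid 0..255 range."""
    def clip(v):
        return max(0, min(255, v))
    return [[(clip(r + dr), clip(g + dg), clip(b + db)) for r, g, b in row]
            for row in img]


# A tiny 2 x 3 RGB image.
img = [
    [(10, 20, 30), (40, 50, 60), (70, 80, 90)],
    [(15, 25, 35), (45, 55, 65), (250, 5, 240)],
]

flipped = mirror(img)
cropped = random_crop(img, 2, 2, random.Random(0))
shifted = color_shift(img, 20, -20, 20)

print(flipped[0])     # [(70, 80, 90), (40, 50, 60), (10, 20, 30)]
print(shifted[1][2])  # (255, 0, 255)
```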

### State of Computer Vision

#### Data vs. hand-engineering

• Lots of data
• Simpler algorithm
• less hand-engineering
• Speech recognition
• much data relative to the complexity of the problem
• Image recognition
• feels like we still wish we had more data
• Object detection
• labeling bounding boxes is more expensive, so there is even less data
• Little data
• More hand-engineering(“hacks”)
• Two sources of knowledge
• Labeled data
• (x,y) supervised learning
• Hand engineering features/network architecture/other components
• Computer vision has relied on hand-engineering, historically and still today, because without enough data people spend their time designing and refining network architectures to get good performance
• When you don’t have enough data, hand-engineering is a very difficult, very skillful task that requires a lot of insight. Someone who is insightful with hand-engineering will get better performance, and doing that hand-engineering is a great contribution to a project when you don’t have enough data.
• It’s just that when you have lots of data, I wouldn’t spend time hand-engineering; I would spend time building up the learning system instead.

#### Tips for doing well on benchmarks/winning competitions

• Ensembling
• Train several networks independently and average their outputs
• Randomly initialize, say, 3, 5, or 7 networks, train all of them, and then average their outputs.
• It is important to average their outputs $$\hat y$$; don’t average their weights, that won’t work.
• With 7 networks you get 7 different predictions to average, and this might make you do 1% or 2% better on some benchmark.
• But ensembling means that to test each image you might need to run it through anywhere from 3 to 15 different networks, which is quite typical. This slows down running time by a factor of 3 to 15, sometimes even more.
• So ensembling is one of those tips that people use for doing well on benchmarks and winning competitions, but it is almost never used in production to serve actual customers.
• problem
• you need to keep all these different networks around takes up a lot more computer memory
• Multi-crop at test time
• Run classifier on multiple versions of test images and average results
• 10-crop
• (1 center + 4 corners) × 2 (the image and its mirror)
• Run these 10 images through your classifier and then average the results
• You may not need as many as 10 crops; a few crops can be enough
• this might get you a little bit better performance in a production system.
• But this is another technique that is used much more for doing well on benchmarks than in actual production systems.
• problem
• at least you keep just one network around so doesn’t suck up as much memory, but still slows down your run time quite a bit.
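
Both tips come down to averaging output vectors. The sketch below shows output averaging for an ensemble and the crop positions for 10-crop; the probability vectors are invented example values and the function names are illustrative.

```python
def average_outputs(predictions):
    """Average class-probability vectors (one per network or per crop)."""
    n = len(predictions)
    return [sum(col) / n for col in zip(*predictions)]


def ten_crop_boxes(img_h, img_w, crop_h, crop_w):
    """Return (top, left, mirrored) for center + 4 corners, on the image and its mirror."""
    positions = [
        ((img_h - crop_h) // 2, (img_w - crop_w) // 2),  # center
        (0, 0),                                          # top-left
        (0, img_w - crop_w),                             # top-right
        (img_h - crop_h, 0),                             # bottom-left
        (img_h - crop_h, img_w - crop_w),                # bottom-right
    ]
    return [(top, left, mirrored)
            for mirrored in (False, True)
            for top, left in positions]


# Outputs of four hypothetical networks for one image with three classes.
y_hats = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.5, 0.125, 0.375],
    [0.75, 0.125, 0.125],
]
ensemble = average_outputs(y_hats)
boxes = ten_crop_boxes(256, 256, 224, 224)

print(ensemble)    # [0.5, 0.25, 0.25]
print(len(boxes))  # 10
```

The same `average_outputs` call applies to multi-crop: run one network on each of the 10 crops and average the 10 resulting vectors.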

#### Use open source code

• Use architectures of networks published in the literature
• Use open source implementations if possible
• Use pretrained models and fine-tune on your dataset

## Practice questions

• Suppose you have an input volume of dimension 64x64x16. How many parameters would a single 1×1 convolutional filter have(including the bias)?
• IN: 64,64,16
• CONV: one filter with f=1, s=1; the filter spans all 16 input channels (1,1,16)
• n-f+1=64
• nc=1
• a: 64,64,1
• z=w·a_in+b at each position, with 16 weights w and 1 bias b
• 16+1=17 parameters
• Suppose you have an input volume of dimension nH,nW,nC. Which of the following statements do you agree with? (Assume that “1×1 convolutional layer” below always uses a stride of 1 and no padding.)
• You can use a 1×1 convolutional layer to reduce nC but not nH,nW
• You can use a pooling layer to reduce nH,nW but not nC
• Which ones of the following statements on Inception Networks are true?
• Making an inception network deeper (by stacking more inception blocks together) should not hurt training set performance
• A single inception block allows the network to use a combination of 1×1,3×3,5×5 convolutions and pooling
• Inception networks incorporates a variety of network architectures (similar to dropout, which randomly chooses a network architecture on each step) and thus has a similar regularizing effect as dropout.
• Inception blocks usually use 1×1 convolutions to reduce the input data volume’s size before applying 3×3,5×5 convolutions.

## Programming assignments

#### Keras tutorial

• Why are we using Keras?
• Keras was developed to enable deep learning engineers to build and experiment with different models very quickly.
• Just as TensorFlow is a higher-level framework than Python, Keras is an even higher-level framework and provides additional abstractions.
• Being able to go from idea to result with the least possible delay is key to finding good models.
• However, Keras is more restrictive than the lower-level frameworks, so there are some very complex models that you would still implement in TensorFlow rather than in Keras.
• That being said, Keras will work fine for many common models.

#### Residual Networks

• What you should remember
• Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
• The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
• There are two main types of blocks: The identity block and the convolutional block.
• Very deep Residual Networks are built by stacking these blocks together.