
# DL [Course 4/5] Convolutional Neural Networks [Week 4/4] Special applications: Face recognition & Neural style transfer

[mathjax]

## Face Recognition

### What is face recognition?

#### Face verification vs. face recognition

• Verification
• Input image, name/ID
• Output whether the input image is that of the claimed person
• Recognition
• Has a database of K persons
• Get an input image
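• Output the ID if the image is any of the K persons (or “not recognized”)
• Liveness detection
• Determine whether the subject is a live human; reject photographs
• Supervised learning
• Face recognition is harder than face verification
• 99% accuracy still means a 1% error rate
• An acceptable accuracy is 99.9% or higher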
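To see why 99% accuracy may not be enough for recognition, here is a rough arithmetic sketch. It assumes (this is not stated in the notes) that recognition compares the input against each of K database entries independently, and that each comparison is correct with the verification accuracy. The database size K = 100 is a hypothetical value chosen for illustration.

```python
def prob_any_error(per_comparison_accuracy: float, k: int) -> float:
    """Chance that at least one of k independent comparisons is wrong."""
    return 1.0 - per_comparison_accuracy ** k


# Hypothetical database size (an assumption for illustration only).
K = 100
for accuracy in (0.99, 0.999):
    p = prob_any_error(accuracy, K)
    print(f"accuracy={accuracy}: P(at least one error over {K} comparisons) = {p:.3f}")
```

Under these assumptions the chance of at least one mistake drops sharply when the per-comparison accuracy goes from 99% to 99.9%, which is one way to read the requirement above.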

### One Shot Learning

• Learning from one example to recognize the person again
• image->CNN->softmax(5)
• Doesn’t work well because such a small training set is not enough to train the network
• What if a new person joins your team? You would have to retrain the ConvNet every time.
• Learning a “similarity” function
• d(img1,img2) = degree of difference between images
• d(img1,img2) ≤ τ: same person
• d(img1,img2) > τ: different persons
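The thresholding rule above can be sketched as follows. This is a minimal illustration, not the course's implementation: the images are represented by small vectors, `d` is assumed to be a squared Euclidean distance, and the names `verify`, `recognize`, and `database` are hypothetical.

```python
import numpy as np


def d(img1: np.ndarray, img2: np.ndarray) -> float:
    """Degree of difference between two images (assumed: squared L2 distance)."""
    return float(np.sum((img1 - img2) ** 2))


def verify(img: np.ndarray, claimed_img: np.ndarray, tau: float) -> bool:
    """Verification (1:1): same person if the difference is at most tau."""
    return d(img, claimed_img) <= tau


def recognize(img: np.ndarray, database: dict, tau: float):
    """Recognition (1:K): ID of the closest person, or None for "not recognized"."""
    best_id, best_dist = None, float("inf")
    for person_id, reference in database.items():
        dist = d(img, reference)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else None


# Hypothetical two-person database; a new person is added with one
# reference image and no retraining.
database = {"alice": np.array([0.0, 0.0]), "bob": np.array([1.0, 1.0])}
print(recognize(np.array([0.1, 0.0]), database, tau=0.5))
```

Because the database is just a mapping from ID to one reference image, adding a team member is a single insertion, which addresses the retraining problem noted above.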

### Siamese Network

• Siamese Network
$\displaylines{ \boxed{img👨: x^{(1)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(1)})\\128}\\ \boxed{img👩:x^{(2)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(2)})\\128}\\ }$
• DeepFace paper
• Goal of learning: $\boxed{img}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(1)})}$
• Parameters of NN define an encoding $$f(x^{(i)})$$
• 128 dimensional
• Learn parameters so that:
• If $$x^{(i)}, x^{(j)}$$ are the same person, $$\| f(x^{(i)}) - f(x^{(j)}) \|^2$$ is small.
• If $$x^{(i)}, x^{(j)}$$ are different persons, $$\| f(x^{(i)}) - f(x^{(j)}) \|^2$$ is large.
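The key point of the Siamese setup is that both images pass through the same function with the same parameters. Here is a toy sketch of that idea; the single random linear layer `W` followed by `tanh` is only a stand-in for the CNN and fully connected layers, and the 32x32x3 image size is an assumption for illustration.

```python
import numpy as np

IMAGE_SHAPE = (32, 32, 3)  # hypothetical input size
rng = np.random.default_rng(0)
# One shared set of parameters, used for every image.
W = rng.standard_normal((128, int(np.prod(IMAGE_SHAPE)))) * 0.01


def f(x: np.ndarray) -> np.ndarray:
    """Toy encoder: maps an image to a 128-dimensional encoding."""
    return np.tanh(W @ x.ravel())


def squared_distance(x_i: np.ndarray, x_j: np.ndarray) -> float:
    """|| f(x_i) - f(x_j) ||^2, computed with the same f for both inputs."""
    return float(np.sum((f(x_i) - f(x_j)) ** 2))


x1 = rng.random(IMAGE_SHAPE)
x2 = rng.random(IMAGE_SHAPE)
print(f(x1).shape, squared_distance(x1, x2))
```

Training would adjust `W` (in a real system, all CNN parameters) so that this distance is small for the same person and large for different persons.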

### Triplet Loss

#### Learning Objective

• Images:
• 👧Anchor,👩Positive: the same person
• 👧Anchor,👵Negative: different persons
• Always looking at three images at a time (A, P, N)
• Want: $\displaylines{ \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)} \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\ \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 \leq 0 }$
• Trivial solution: if every encoding is the same (e.g., all zeros), then f(A)-f(P)=0 and f(A)-f(N)=0, and the inequality is satisfied.
• So we need to make sure the network doesn’t set all the encodings equal to each other.
• Modify the objective with a hyperparameter (margin) $$\alpha$$: $\displaylines{ \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)}+\alpha \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\ \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2+\alpha \leq 0 }$

#### Loss function

$\eqalign{ L(A,P,N)&=\max( \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 + \alpha, 0)\\ J&=\sum_{i=1}^m L(A^{(i)}, P^{(i)}, N^{(i)}) }$
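The loss and cost above translate almost directly into NumPy. This is a minimal sketch that assumes the encodings are already computed and stacked as arrays of shape (m, 128); the default margin `alpha=0.2` is an assumed value, not one given in these notes.

```python
import numpy as np


def triplet_loss(f_a: np.ndarray, f_p: np.ndarray, f_n: np.ndarray,
                 alpha: float = 0.2) -> np.ndarray:
    """Per-triplet loss max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=-1)
    d_an = np.sum((f_a - f_n) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + alpha, 0.0)


def triplet_cost(f_a: np.ndarray, f_p: np.ndarray, f_n: np.ndarray,
                 alpha: float = 0.2) -> float:
    """Cost J: sum of the per-triplet losses over the m training triplets."""
    return float(np.sum(triplet_loss(f_a, f_p, f_n, alpha)))


# The trivial all-zero encoding is penalized: every triplet costs alpha.
zeros = np.zeros((3, 128))
print(triplet_cost(zeros, zeros, zeros))
```

The last lines show why the margin matters: with identical encodings the loss is `alpha` per triplet rather than zero, so the degenerate solution no longer minimizes the cost.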

• Training set: e.g., 10k pictures of 1k persons
• With just one picture of each person, you can’t actually train this system; you need several pictures of the same person to form anchor-positive pairs.
• After training, of course, you can apply the system to one-shot learning.

#### Choosing the triplets A,P,N

• During training, if A, P, N are chosen randomly, $$d(A,P)+\alpha\leq d(A,N)$$ is easily satisfied.
• Choose triplets that are “hard” to train on: $\displaylines{ d(A,P)+\alpha\leq d(A,N) \\ d(A,P) \approx d(A,N) }$
• This increases the computational efficiency of the learning algorithm.
• FaceNet Paper
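One simple way to pick “hard” triplets is to keep only those that still violate the margin, i.e. where d(A,P) + α > d(A,N). The sketch below does this by brute force over a small set of encodings. It is an illustration under that assumption and not necessarily the mining procedure used in the FaceNet paper; the function name and the default `alpha` are hypothetical.

```python
import numpy as np


def hard_triplets(encodings: np.ndarray, labels: list, alpha: float = 0.2) -> list:
    """Return index triplets (a, p, n) with d(A,P) + alpha > d(A,N)."""
    # Pairwise squared distances between all encodings.
    diff = encodings[:, None, :] - encodings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)

    triplets = []
    count = len(labels)
    for a in range(count):
        for p in range(count):
            if p == a or labels[p] != labels[a]:
                continue  # the positive must be another image of the same person
            for n in range(count):
                if labels[n] == labels[a]:
                    continue  # the negative must be a different person
                if dist[a, p] + alpha > dist[a, n]:
                    triplets.append((a, p, n))
    return triplets


encodings = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [10.0, 0.0]])
labels = ["x", "x", "y", "y"]
print(hard_triplets(encodings, labels))
```

Triplets whose negative is already far away contribute zero loss and zero gradient, so skipping them avoids wasted computation.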

#### Training set using triplet loss

• (Anchor, Positive, Negative)…
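• Companies train on datasets of 1 million to 100 million images
• Some of them publish their trained parameters online
• Use those parameters rather than training from scratch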

### Face Verification and Binary Classification

$\left. \begin{array}{r} \boxed{👨\\ x^{(i)}}\rightarrow\boxed{CNN}\rightarrow \rightarrow \underbrace{\fc}_{f(x^{(i)})}\\ \boxed{👩\\x^{(j)}}\rightarrow\boxed{CNN}\rightarrow\rightarrow \underbrace{\fc}_{f(x^{(j)})} \end{array} \right\} \rightarrow \underbrace{\circ}_{sigmoid} \rightarrow\hat y$

$\hat y = \sigma \left( \sum_{k=1}^{128} w_k \underbrace{\left| f(x^{(i)})_k - f(x^{(j)})_k \right|}_{ \color{green}{ \frac{(f(x^{(i)})_k - f(x^{(j)})_k)^2}{ f(x^{(i)})_k + f(x^{(j)})_k }: \chi^2 \text{ similarity} } } + b \right)$

• The networks applied to $$x^{(i)}$$ and $$x^{(j)}$$ share the same (tied) parameters.
• Computational trick
• Precomputing some of these encodings can save significant computation.
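The prediction formula above can be sketched in a few lines. The weights `w` and bias `b` would be learned by logistic regression on top of the encodings; the values used below are hypothetical placeholders chosen only so the example runs.

```python
import numpy as np


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))


def same_person_probability(enc_i: np.ndarray, enc_j: np.ndarray,
                            w: np.ndarray, b: float) -> float:
    """y_hat = sigmoid(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b)."""
    features = np.abs(enc_i - enc_j)  # element-wise absolute difference
    return float(sigmoid(np.dot(w, features) + b))


# Hypothetical parameters: larger differences push the output towards 0.
w = -np.ones(128)
b = 0.0
enc_a = np.ones(128)
enc_b = np.zeros(128)
print(same_person_probability(enc_a, enc_a, w, b))
print(same_person_probability(enc_a, enc_b, w, b))
```

For the computational trick, the encodings of the database images can be stored once, so at query time only the new image has to pass through the network before calling a function like this.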

## Neural Style Transfer

### What is neural style transfer?

• Content+Style→Generated Images
• ヽ( ツ )丿 ＋🎨→😏

### What are deep ConvNets learning?

#### Visualizing what a deep network is learning

• Pick a unit in layer 1.
• Find the nine image patches that maximize the unit’s activation.
• Repeat for other units.
• Paper
• layer 1
• ❏〼─│
• edge, angle
• layer 2
• ◆△
• more complex shapes and patterns
• layer 3
• ○👤
• rounder shape
• cars
• person
• textures like honeycomb shapes
• layer 4
• 🐶🐧
• dog
• water
• bird legs
• layer 5
• 😺🐶🐩🎹🌸

### Cost Function

#### Neural style transfer cost function

Content C + Style S => Generated image G: $J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)$

#### Find the generated image G

• Initialize G randomly: $G: 100\times100\times 3$
• Use gradient descent to minimize $$J(G)$$: $G := G-\frac{\partial}{\partial G}J(G)$
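The two steps above amount to gradient descent on the pixels of G rather than on network weights. Here is a toy sketch of that loop. The real J(G) needs a pretrained ConvNet, so a simple quadratic cost with a known gradient stands in for it, and the `learning_rate` factor is an assumption added for numerical stability (the update rule in the notes omits it).

```python
import numpy as np


def gradient_descent_on_image(grad_fn, shape=(100, 100, 3),
                              learning_rate=0.1, steps=200, seed=0):
    """Start from a random image G and repeatedly step against dJ/dG."""
    rng = np.random.default_rng(seed)
    G = rng.random(shape)  # initialize G randomly
    for _ in range(steps):
        G = G - learning_rate * grad_fn(G)
    return G


# Toy stand-in cost J(G) = 0.5 * ||G - target||^2, whose gradient is G - target.
target = np.full((100, 100, 3), 0.5)


def toy_gradient(G):
    return G - target


G = gradient_descent_on_image(toy_gradient)
print(np.max(np.abs(G - target)))
```

In an actual style transfer run, `grad_fn` would come from automatic differentiation of the content and style costs with respect to G.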

### Content Cost Function

$J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)$

• Say you use hidden layer $$l$$ to compute content cost.
• Shallow (small) $$l$$: forces G to have pixel values very similar to C
• deep $$l$$: “If there is a dog in your content image, then make sure there is a dog somewhere in your generated image.”
• Use pre-trained ConvNet. (E.g., VGG network)
• Let $$a^{[l](C)}$$ and $$a^{[l](G)}$$ be the activation of layer $$l$$ on the images
• If $$a^{[l](C)}$$ and $$a^{[l](G)}$$ are similar, both images have similar content
• $J_{content}(C,G)= \frac12 \| a^{[l](C)} - a^{[l](G)} \|^2$
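Given the two activation volumes, the content cost is a single line. The sketch below assumes the activations of layer l have already been extracted from the pretrained network as arrays of shape (n_H, n_W, n_C); the random arrays are placeholders for real activations.

```python
import numpy as np


def content_cost(a_C: np.ndarray, a_G: np.ndarray) -> float:
    """J_content(C, G) = 1/2 * || a^[l](C) - a^[l](G) ||^2."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))


# Placeholder activations standing in for a real layer's output.
rng = np.random.default_rng(0)
a_C = rng.random((4, 4, 3))
a_G = rng.random((4, 4, 3))
print(content_cost(a_C, a_G))
```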

### Style Cost Function

#### Meaning of the “style” of an image

• Say you are using layer l’s activation to measure “style.”
• Define style as correlation between activations across channels.
• How correlated are the activations across different channels?

#### Style matrix

Let $$a^{[l]}_{i,j,k} =$$ activation at $$(i,j,k)$$. $$G^{[l]}$$ is $$n_c^{[l]} \times n_c^{[l]}$$

i: height, j: width, k: channel

$$G^{[l]}_{kk'}$$ measures how correlated the activations in channel $$k$$ and channel $$k'$$ are.

$\displaylines{ G_{kk'}^{[l]\color{green}{(S)}} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(S)}} a_{i,j,k'}^{[l]\color{green}{(S)}}\\ G_{kk'}^{[l]\color{green}{(G)}} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(G)}} a_{i,j,k'}^{[l]\color{green}{(G)}} }$

Style matrix G: the Gram matrix in linear algebra

It is an unnormalized cross-covariance.
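The double sum over positions (i, j) becomes a matrix product once the activation volume is unrolled into a 2D array. A minimal sketch, assuming channel-last activations of shape (n_H, n_W, n_C):

```python
import numpy as np


def gram_matrix(a: np.ndarray) -> np.ndarray:
    """Style matrix: G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C)  # one row per spatial position
    return unrolled.T @ unrolled          # shape (n_C, n_C)


rng = np.random.default_rng(0)
a = rng.random((3, 4, 5))  # placeholder activations
G = gram_matrix(a)
print(G.shape)
```

The diagonal entries G[k, k] measure how active channel k is overall, while the off-diagonal entries capture which channels tend to fire together.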

#### Style cost function

\begin{align} J_{style}^{[l]}(S,G) &= \frac{1}{ (2n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]} )^{2} } \sum_{k}\sum_{k'}\left ( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^{2}\\ J_{style}(S,G)&=\sum_l \lambda^{[l]} J_{style}^{[l]}(S,G)\\ J(G)&=\alpha J_{content}(C,G)+\beta J_{style}(S,G) \end{align}

Gradient descent then tries to minimize this cost function $$J(G)$$.
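The three formulas above can be sketched in NumPy as follows. The activations are assumed to be channel-last arrays of shape (n_H, n_W, n_C) taken from a pretrained network, and the default `alpha` and `beta` values are hypothetical, since the notes do not specify them.

```python
import numpy as np


def gram_matrix(a: np.ndarray) -> np.ndarray:
    """G[k, k'] = sum over positions of a[..., k] * a[..., k']."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C)
    return unrolled.T @ unrolled


def layer_style_cost(a_S: np.ndarray, a_G: np.ndarray) -> float:
    """Style cost of one layer, normalized by (2 * n_H * n_W * n_C)^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2) / (2 * n_H * n_W * n_C) ** 2)


def style_cost(acts_S: list, acts_G: list, lambdas: list) -> float:
    """Weighted sum of the per-layer style costs."""
    return sum(lam * layer_style_cost(s, g)
               for s, g, lam in zip(acts_S, acts_G, lambdas))


def total_cost(j_content: float, j_style: float,
               alpha: float = 10.0, beta: float = 40.0) -> float:
    """J(G) = alpha * J_content + beta * J_style (alpha, beta are assumed values)."""
    return alpha * j_content + beta * j_style


a_S = np.ones((2, 2, 1))
a_G = np.zeros((2, 2, 1))
print(layer_style_cost(a_S, a_G))
```

Using several layers with weights `lambdas` lets the style cost take both low-level and high-level correlations into account.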

### 1D and 3D Generalizations

#### Convolutions in 2D and 1D

• 2D
• 2D input image 14×14
• 2D filter 5×5
• 14x14x3 * 5x5x3 => 10x10 per filter (10x10x16 with 16 filters)
• 1D
• 1D EKG (electrocardiogram) signal from a single electrode, 14 dimensions
• filter 5
• 14x1 * 5x1 => 10x16 (with 16 filters)
• For many applications with 1D sequence data, you would actually use a recurrent neural network, LSTM, and others.
• 3D
• CT scan
• 3D volume 14x14x14 * 3D filter 5x5x5
• 14x14x14x1 * 5x5x5x1 => 10x10x10x16 (with 16 filters)
• 10x10x10x16 * 5x5x5x16 => 6x6x6x32 (with 32 filters)
• Movie
• where the different slices could be different slices in time through a movie.
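The shapes in the 1D, 2D, and 3D cases all follow the same rule. The helper below is a small sketch that assumes a "valid" convolution (no padding) with stride 1, which is what the numbers above imply: each spatial dimension n becomes n - f + 1, and the channel dimension becomes the number of filters.

```python
def conv_output_shape(input_shape: tuple, filter_size: int, num_filters: int) -> tuple:
    """Output shape of a valid, stride-1 convolution on a channel-last input."""
    *spatial, _channels = input_shape
    return tuple(n - filter_size + 1 for n in spatial) + (num_filters,)


print(conv_output_shape((14, 14, 3), 5, 16))       # 2D image
print(conv_output_shape((14, 1), 5, 16))           # 1D signal
print(conv_output_shape((14, 14, 14, 1), 5, 16))   # 3D volume
print(conv_output_shape((10, 10, 10, 16), 5, 32))  # second 3D layer
```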

### Programming assignments

• Art Generation with Neural Style Transfer
• Neural Style Transfer is an algorithm that, given a content image C and a style image S, can generate an artistic image.
• It uses representations (hidden layer activations) based on a pretrained ConvNet.
• The content cost function is computed using one hidden layer’s activations.
• The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
• Optimizing the total cost function results in synthesizing new images.
• Face Recognition
• Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
• The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
• The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.