[mathjax]
Face Recognition
What is face recognition?
Face verification vs. face recognition
- Verification
- Input image, name/ID
- Output whether the input image is that of the claimed person
- Recognition
- Has a database of K persons
- Get an input image
- Output the ID if the image is any of the K persons (or “not recognized”)
- Liveness detection
- Determine whether the input is a live human; reject photographs
- Supervised learning
- Face recognition is harder than face verification
- 99% verification accuracy still means a 1% error rate per comparison
- With a database of K persons, the acceptable accuracy is 99.9% or higher
One Shot Learning
- Learning from one example to recognize the person again
- image->CNN->softmax(5)
- This doesn’t work well: such a small training set is not enough to train a robust network
- What if a new person joins your team? You would have to retrain the ConvNet (image->CNN->softmax) every time
- Learning a “similarity” function (a minimal verification sketch follows this list)
- d(img1, img2) = degree of difference between the two images
- d(img1, img2) <= τ # same person
- d(img1, img2) > τ # different persons
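As a minimal sketch of how verification uses d, here is plain NumPy code that compares two encodings against a threshold. The 128-dimensional encodings, the helper names, and the value of τ are illustrative assumptions; a real f(x) comes from a trained CNN.

```python
import numpy as np

def d(f1, f2):
    """Degree of difference: squared L2 distance between two encodings."""
    return np.sum((f1 - f2) ** 2)

def verify(f1, f2, tau=0.7):  # tau is a hypothetical threshold
    """Same person if the distance is at most tau."""
    return d(f1, f2) <= tau

# Toy usage with random 128-d encodings
enc_a = np.random.randn(128)
enc_b = enc_a + 0.01 * np.random.randn(128)  # near-duplicate encoding
print(verify(enc_a, enc_b))  # True: near-duplicates have tiny distance
```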
Siamese Network
- Siamese network
\[
\displaylines{
\boxed{img👨: x^{(1)}}\rightarrow\boxed{CNN}\rightarrow\boxed{FC}\rightarrow \underbrace{\boxed{FC}}_{f(x^{(1)})\in\mathbb{R}^{128}}\\
\boxed{img👩: x^{(2)}}\rightarrow\boxed{CNN}\rightarrow\boxed{FC}\rightarrow \underbrace{\boxed{FC}}_{f(x^{(2)})\in\mathbb{R}^{128}}
}
\] - DeepFace paper (Taigman et al., 2014)
- Goal of learning\[
\boxed{img}\rightarrow\boxed{CNN}\rightarrow\boxed{FC}\rightarrow \underbrace{\boxed{FC}}_{f(x^{(i)})}
\] - Parameters of the NN define an encoding \(f(x^{(i)})\)
- 128 dimensional
- Learn parameters so that:
- If \(x^{(i)}, x^{(j)} \) are the same person, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is small.
- If \(x^{(i)}, x^{(j)} \) are different persons, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is large.
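A minimal Keras sketch of such an encoding network follows. The architecture, input size (96×96×3), and layer sizes are placeholders, not the course’s actual model; only the 128-dimensional, weight-shared embedding idea comes from the notes.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_encoder(input_shape=(96, 96, 3), embedding_dim=128):
    """Tiny stand-in CNN that maps an image to a 128-d encoding f(x)."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 7, strides=2, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(embedding_dim)(x)
    # L2-normalize so encodings are comparable by squared distance
    out = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return Model(inp, out)

encoder = make_encoder()
# Siamese usage: the SAME network (same weights) embeds both images
x1 = tf.random.normal((1, 96, 96, 3))
x2 = tf.random.normal((1, 96, 96, 3))
dist = tf.reduce_sum(tf.square(encoder(x1) - encoder(x2)))  # d(x1, x2)
```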
Triplet Loss
Learning Objective
- Images:
- 👧Anchor,👩Positive: the same person
- 👧Anchor,👵Negative: different persons
- We always look at three images at a time: the pairs (A,P) and (A,N)
- Want:\[
\displaylines{
\underbrace{\| f(A)-f(P) \|^2}_{d(A,P)} \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
\| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 \leq 0
}
\] - Trivial solution: output the same encoding for every image, so f(A)-f(P)=0 and f(A)-f(N)=0
- The margin below makes sure the network doesn’t set all the encodings equal to each other
- Modify the objective with a hyperparameter (margin) \(\alpha\):\[
\displaylines{
\underbrace{\| f(A)-f(P) \|^2}_{d(A,P)}+\alpha \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
\| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2+\alpha \leq 0
}
\]
Loss function
\[
\eqalign{
L(A,P,N)&=\max( \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 + \alpha, 0)\\
J&=\sum_{i=1}^m L(A^{(i)}, P^{(i)}, N^{(i)})
}
\]
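This loss translates directly to TensorFlow; a sketch, where the batch layout and the margin value α = 0.2 are assumptions:

```python
import tensorflow as tf

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0),
    summed over a batch of (anchor, positive, negative) encodings."""
    pos_dist = tf.reduce_sum(tf.square(f_a - f_p), axis=-1)  # d(A, P)
    neg_dist = tf.reduce_sum(tf.square(f_a - f_n), axis=-1)  # d(A, N)
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```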
- Training set: e.g., 10k pictures of 1k persons, so there are multiple pictures of each person to form (A,P) pairs
- With just one picture of each person, you couldn’t actually train this system
- But after training, you can of course apply the learned encoding to one-shot learning
Choosing the triplets A,P,N
- During training, if A, P, N are chosen randomly, \(d(A,P)+\alpha\leq d(A,N)\) is easily satisfied, so gradient descent learns little
- Choose triplets that are “hard” to train on (a mining sketch follows below):\[
\displaylines{
d(A,P)+\alpha\leq d(A,N) \\
d(A,P)\approx d(A,N)
}
\] - This increases the computational efficiency of the learning algorithm
- FaceNet paper (Schroff et al., 2015)
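The sketch below illustrates “semi-hard” negative selection in NumPy, loosely following the FaceNet idea; the pool layout, function name, and fallback rule are assumptions, not the paper’s exact procedure.

```python
import numpy as np

def pick_semi_hard_negative(f_a, f_p, f_neg_pool, alpha=0.2):
    """Pick a negative with d(A,P) < d(A,N) < d(A,P) + alpha when one
    exists (the triplet is then 'hard' but not pathological); otherwise
    fall back to the closest negative. f_neg_pool: (n, 128) encodings."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_neg_pool - f_a) ** 2, axis=1)
    semi_hard = np.where((d_an > d_ap) & (d_an < d_ap + alpha))[0]
    if semi_hard.size > 0:
        return int(semi_hard[np.argmin(d_an[semi_hard])])
    return int(np.argmin(d_an))
```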
Training set using triplet loss
- (Anchor, Positive, Negative)…
- Companies train on datasets of 1M to 100M images
- Many release their trained parameters online
- Rather than training from scratch, use those pretrained parameters
Face Verification and Binary Classification
\[
\left.
\begin{array}{r}
\boxed{👨\ x^{(i)}}\rightarrow\boxed{CNN}\rightarrow \underbrace{\boxed{FC}}_{f(x^{(i)})}\\
\boxed{👩\ x^{(j)}}\rightarrow\boxed{CNN}\rightarrow \underbrace{\boxed{FC}}_{f(x^{(j)})}
\end{array}
\right\} \rightarrow \underbrace{\circ}_{sigmoid} \rightarrow\hat y
\]
\[
\hat y = \sigma \left(\sum_{k=1}^{128} w_k \underbrace{\left| f(x^{(i)})_k - f(x^{(j)})_k \right|}_{
\color{green}{
\text{or } \frac{(f(x^{(i)})_k- f(x^{(j)})_k)^2}{ f(x^{(i)})_k+ f(x^{(j)})_k }:\ \chi^2\ \text{similarity}
}
} + b\right)
\]
- The two networks share (tie) their parameters, so the encodings of images i and j come from the same function (a sketch follows below)
- Computational trick
- Precomputing the encodings of the database images can save significant computation
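A NumPy sketch of this binary-classification head. The weights w, b would be learned with logistic regression on pairs of encodings; the epsilon in the χ² variant is a stability assumption not present in the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def verify_pair(f_i, f_j, w, b):
    """y_hat = sigma(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b)."""
    return sigmoid(np.dot(w, np.abs(f_i - f_j)) + b)

def chi2_features(f_i, f_j, eps=1e-8):
    """Alternative chi^2 similarity features (eps avoids division by zero)."""
    return (f_i - f_j) ** 2 / (f_i + f_j + eps)
```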
Neural Style Transfer
What is neural style transfer?
- Content+Style→Generated Images
- ヽ( ツ )丿 +🎨→😏
What are deep ConvNets learning?
Visualizing what a deep network is learning
- Pick a unit in layer 1.
- Find the nine image patches that maximize the unit’s activation.
- Repeat for other units.
- Paper: Zeiler & Fergus (2013), Visualizing and Understanding Convolutional Networks
- layer 1
- ❏〼─│
- edge, angle
- layer 2
- ◆△
- more complex shapes and patterns
- layer 3
- ○👤
- rounder shapes
- cars
- people
- textures like honeycomb patterns
- layer 4
- 🐶🐧
- dog
- water
- bird legs
- layer 5
- 😺🐶🐩🎹🌸
Cost Function
Neural style transfer cost function
Content C + Style S => Generated image G\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]
Find the generated image G
- Initialize G randomly\[
G: 100\times100\times 3
\] - Use gradient descent to minimize \(J(G)\) \[
G := G-\frac{\partial}{\partial G}J(G)
\]
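A minimal TensorFlow sketch of treating the generated image G itself as the trainable variable. The placeholder cost J stands in for αJ_content + βJ_style (in practice computed from a pretrained ConvNet such as VGG); the optimizer and learning rate are assumptions.

```python
import tensorflow as tf

def J(G):
    # Placeholder for alpha*J_content(C,G) + beta*J_style(S,G)
    return tf.reduce_mean(tf.square(G))

G = tf.Variable(tf.random.uniform((1, 100, 100, 3)))  # initialize G randomly
opt = tf.keras.optimizers.Adam(learning_rate=0.02)

for step in range(100):
    with tf.GradientTape() as tape:
        cost = J(G)
    grad = tape.gradient(cost, G)      # dJ/dG
    opt.apply_gradients([(grad, G)])   # G := G - (scaled) dJ/dG
```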
Content Cost Function
\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]
- Say you use hidden layer \(l\) to compute content cost.
- shallow \(l\): forces G to match C almost pixel by pixel (very similar C and G)
- deep \(l\): “If there is a dog in your content image, then make sure there is a dog somewhere in your generated image.”
- Use pre-trained ConvNet. (E.g., VGG network)
- Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activation of layer \(l\) on the images
- If \(a^{[l](C)}\) and \(a^{[l](G)}\) are similar, both images have similar content
- \[
J_{content}(C,G)= \frac12 \| a^{[l](C)} - a^{[l](G)} \|^2
\]
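This translates directly to code; a one-line TensorFlow sketch using the note’s 1/2 normalization (the programming assignment may normalize differently):

```python
import tensorflow as tf

def content_cost(a_C, a_G):
    """J_content(C,G) = 1/2 * ||a[l](C) - a[l](G)||^2 for a chosen layer l.
    a_C, a_G: activation tensors of shape (1, n_H, n_W, n_C)."""
    return 0.5 * tf.reduce_sum(tf.square(a_C - a_G))
```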
Style Cost Function
Meaning of the “style” of an image
- Say you are using layer \(l\)’s activation to measure “style.”
- Define style as correlation between activations across channels.
- How correlated are the activations across different channels?
Intuition about style of an image
Style matrix
Let \(a^{[l]}_{i,j,k} =\) activation at \((i,j,k)\). \(G^{[l]}\) is \(n_c^{[l]} \times n_c^{[l]}\)
i indexes height, j width, k channels
\(G^{[l]}_{kk'}\) measures how correlated the activations in channel \(k\) are with those in channel \(k'\).
\[
\eqalign{
G_{kk'}^{[l]\color{green}{(S)}} &= \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(S)}} a_{i,j,k'}^{[l]\color{green}{(S)}}\\
G_{kk'}^{[l]\color{green}{(G)}} &= \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(G)}} a_{i,j,k'}^{[l]\color{green}{(G)}}
}
\]
Style matrix G: called the Gram matrix in linear algebra
It is an unnormalized cross-covariance between channels
Style cost function
\[
\begin{align}
J_{style}^{[l]}(S,G) &= \frac{1}{ (2n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]} )^{2} } \sum_{k}\sum_{k'}\left ( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^{2}\\
J_{style}(S,G)&=\sum_l \lambda^{[l]} J_{style}^{[l]}(S,G)\\
J(G)&=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\end{align}
\]
Gradient descent then minimizes this overall cost \(J(G)\); a sketch of the Gram matrix and per-layer style cost follows.
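A TensorFlow sketch of the Gram matrix and the per-layer style cost above; the (n_H, n_W, n_C) activation layout is an assumption.

```python
import tensorflow as tf

def gram_matrix(a):
    """G[l]_{kk'} = sum_{i,j} a_{i,j,k} * a_{i,j,k'}.
    a: activations of shape (n_H, n_W, n_C) -> (n_C, n_C) Gram matrix."""
    n_H, n_W, n_C = a.shape
    A = tf.reshape(a, (n_H * n_W, n_C))       # rows = spatial positions
    return tf.matmul(A, A, transpose_a=True)  # channel-by-channel products

def layer_style_cost(a_S, a_G):
    """J_style^[l] with the 1/(2 n_H n_W n_C)^2 normalization."""
    n_H, n_W, n_C = a_S.shape
    norm = (2.0 * float(n_H * n_W * n_C)) ** 2
    return tf.reduce_sum(tf.square(gram_matrix(a_S) - gram_matrix(a_G))) / norm
```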
1D and 3D Generalizations
Convolutions in 2D and 1D
- 2D
- 2D input image 14×14
- 2D filter 5×5
- 14x14x3 * 5x5x3 => 10x10 (with 16 filters: 10x10x16)
- 1D
- 1D EKG signal (electrocardiogram, one electrode), 14 samples
- filter of width 5
- 14x1 * 5x1 => 10x16 (with 16 filters)
- For most 1D sequence-data applications you would actually use a recurrent neural network, such as an LSTM
- 3D
- CT scan
- 3D volume 14x14x14 * 3D filter 5x5x5
- 14x14x14x1 * 5x5x5x1 => 10x10x10x16
- 10x10x10x16 * 5x5x5x16 => 6x6x6x32
- Movie
- where the different slices could be different slices in time through a movie.
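The output shapes above can be checked with Keras layers; a small sketch (the batch dimension and random inputs are just for demonstration):

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1D: a 14-sample signal, 16 filters of width 5 -> length 10, 16 channels
x1d = tf.random.normal((1, 14, 1))
print(layers.Conv1D(filters=16, kernel_size=5)(x1d).shape)  # (1, 10, 16)

# 3D: a 14x14x14 volume (e.g., a CT scan), 16 filters of size 5x5x5
x3d = tf.random.normal((1, 14, 14, 14, 1))
y = layers.Conv3D(filters=16, kernel_size=5)(x3d)
print(y.shape)  # (1, 10, 10, 10, 16)

# Stacking: 32 filters of 5x5x5 over 16 channels -> 6x6x6x32
print(layers.Conv3D(filters=32, kernel_size=5)(y).shape)  # (1, 6, 6, 6, 32)
```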
Programming assignments
- Art Generation with Neural Style Transfer
- Neural style transfer is an algorithm that, given a content image C and a style image S, generates an artistic image
- It uses representations (hidden layer activations) based on a pretrained ConvNet.
- The content cost function is computed using one hidden layer’s activations.
- The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
- Optimizing the total cost function results in synthesizing new images.
- Face Recognition
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.