DL [Course 4/5] Convolutional Neural Networks [Week 4/4] Special applications: Face recognition & Neural style transfer

Face Recognition

What is face recognition?

Face verification vs. face recognition

  • Verification
    • Input image, name/ID
    • Output whether the input image is that of the claimed person
  • Recognition
    • Has a database of K persons
    • Get an input image
    • Output the ID if the image is any of the K persons (or “not recognized”)
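  • Liveness detection
    • Checks whether the input is a live human, and rejects photos
    • Can be implemented with supervised learning
  • Face recognition is harder than face verification
    • A verifier with 99% accuracy still makes a mistake 1% of the time
    • For recognition, an acceptable accuracy is 99.9% or higher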
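
A rough calculation shows why 99% is not enough for recognition. The sketch below assumes, for illustration only, a database of K = 100 persons and that each of the K comparisons is an independent verification; neither the value of K nor the independence assumption comes from the notes above.

```python
def prob_all_comparisons_correct(accuracy: float, k: int) -> float:
    """Chance that k independent verifications are all correct."""
    return accuracy ** k

# Hypothetical database size, chosen only for illustration.
K = 100
p_99 = prob_all_comparisons_correct(0.99, K)    # roughly 0.37
p_999 = prob_all_comparisons_correct(0.999, K)  # roughly 0.90

print(f"99% verifier:   {p_99:.2f}")
print(f"99.9% verifier: {p_999:.2f}")
```

Under these assumptions, a 99% verifier gets all 100 comparisons right only about a third of the time, which is why recognition needs a much more accurate verifier.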

One Shot Learning

  • Learning from one example to recognize the person again
    • image->CNN->softmax(5)
      • Doesn’t work well, because such a small training set is not enough to train a robust network
      • What if a new person joins your team? Retrain the ConvNet every time?
  • Learning a “similarity” function
    • d(img1,img2) = degree of difference between images
    • d(img1,img2) <= τ: same person
    • d(img1,img2) > τ: different persons

Siamese Network

  • Siamese Network\[
    \boxed{img👨: x^{(1)}}\rightarrow\boxed{CNN}\rightarrow\boxed{\circ\\\circ\\\circ}\rightarrow \underbrace{\boxed{ \circ\\\circ\\\vdots\\\circ}}_{f(x^{(1)})\\128}\\
    \boxed{img👩:x^{(2)}}\rightarrow\boxed{CNN}\rightarrow\boxed{\circ\\\circ\\\circ}\rightarrow \underbrace{\boxed{ \circ\\\circ\\\vdots\\\circ}}_{f(x^{(2)})\\128}\\
    \]
  • Deep Face Papers
  • Goal of learning\[
    \boxed{img}\rightarrow\boxed{CNN}\rightarrow\boxed{\circ\\\circ\\\circ}\rightarrow \underbrace{\boxed{ \circ\\\circ\\\vdots\\\circ}}_{f(x^{(1)})}
    \]
  • Parameters of NN define an encoding \(f(x^{(i)})\)
    • 128 dimensional
  • Learn parameters so that:
    • If \(x^{(i)}, x^{(j)} \) are the same person, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is small.
    • If \(x^{(i)}, x^{(j)} \) are different persons, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is large.
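
The threshold rule from the One Shot Learning section can be applied directly to these encodings. The following is a minimal sketch, assuming the encodings f(x) have already been computed by some trained network (not shown here). The function names, the toy 2-dimensional encodings, and the value of `tau` are hypothetical.

```python
import numpy as np

def d(enc1, enc2):
    """Squared Euclidean distance between two encodings."""
    diff = np.asarray(enc1, dtype=float) - np.asarray(enc2, dtype=float)
    return float(np.dot(diff, diff))

def verify(enc, claimed_enc, tau):
    """1:1 check: is the input the claimed person?"""
    return d(enc, claimed_enc) <= tau

def recognize(enc, database, tau):
    """1:K check: return the closest ID, or "not recognized"."""
    best_id, best_dist = None, float("inf")
    for person_id, ref_enc in database.items():
        dist = d(enc, ref_enc)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else "not recognized"

# Toy 2-D encodings standing in for the 128-D output of the CNN.
database = {"alice": [0.0, 0.0], "bob": [1.0, 1.0]}
print(recognize([0.1, 0.0], database, tau=0.5))   # alice
print(recognize([5.0, 5.0], database, tau=0.5))   # not recognized
```

When a new person joins, only one new entry is added to `database`; the network itself is not retrained.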

Triplet Loss

Learning Objective

  • Images:
    • 👧Anchor,👩Positive: the same person
    • 👧Anchor,👵Negative: different persons
  • Always look at three images at a time: (A, P, N)
  • Want:\[
    \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)} \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
    \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 \leq 0
    \]
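  • Trivial solution: if f outputs the same vector for every image, then f(A)-f(P)=0 and f(A)-f(N)=0, and the inequality is satisfied
    • So the objective has to make sure that the network doesn’t set all the encodings equal to each other.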
  • Modify objective with hyperparameter(margin) \(\alpha\):\[
    \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)}+\alpha \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
    \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2+\alpha \leq 0
    \]

Loss function

\[
L(A,P,N)=\max( \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 + \alpha, 0)\\
J=\sum_{i=1}^m L(A^{(i)}, P^{(i)}, N^{(i)})
\]

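  • Training set: e.g., 10k pictures of 1k persons
    • With just one picture of each person, you can’t actually train this system; you need several pictures of the same person to form Anchor/Positive pairs.
    • After training, however, you can apply the system to one-shot learning, where only one picture of each person is available.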
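
The loss L and the cost J can be written in a few lines of NumPy. This is a minimal sketch, assuming the encodings are already computed and stacked as arrays of shape (m, 128); the function name and the default `alpha=0.2` are illustrative choices, not values given in these notes.

```python
import numpy as np

def triplet_cost(f_a, f_p, f_n, alpha=0.2):
    """Sum of max(d(A,P) - d(A,N) + alpha, 0) over m triplets.

    f_a, f_p, f_n: arrays of shape (m, n_dims) holding the encodings
    of the anchors, positives and negatives.
    """
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)   # ||f(A) - f(P)||^2
    d_an = np.sum((f_a - f_n) ** 2, axis=1)   # ||f(A) - f(N)||^2
    per_triplet = np.maximum(d_ap - d_an + alpha, 0.0)
    return float(np.sum(per_triplet))

# Toy 2-D encodings: the first triplet satisfies the margin, the second does not.
f_a = np.array([[0.0, 0.0], [0.0, 0.0]])
f_p = np.array([[1.0, 0.0], [1.0, 0.0]])
f_n = np.array([[3.0, 0.0], [0.5, 0.0]])
print(triplet_cost(f_a, f_p, f_n))
```

Only the second triplet contributes to the cost, because the first one already satisfies d(A,P) + alpha <= d(A,N).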

Choosing the triplets A,P,N

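  • During training, if A, P, N are chosen randomly, \(d(A,P)+\alpha\leq d(A,N)\) is easily satisfied, so the network learns little from those triplets.
  • Choose triplets that are “hard” to train on. The goal is the first line; hard triplets are those where the second line holds:\[
    d(A,P)+\alpha\leq d(A,N) \\
    d(A,P)\approx d(A,N)
    \]
    • This increases the computational efficiency of the learning algorithm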
  • FaceNet Paper
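
One simple way to pick such triplets is to keep only those that still violate the margin. The sketch below is a simplified illustration under that assumption, not the exact selection procedure of the FaceNet paper; the function name, the brute-force triple loop, and `alpha=0.2` are hypothetical choices.

```python
import numpy as np

def select_hard_triplets(encodings, labels, alpha=0.2):
    """Return index triplets (a, p, n) with d(A,P) + alpha > d(A,N)."""
    encodings = np.asarray(encodings, dtype=float)
    n_items = len(labels)
    # Pairwise squared distances between all encodings.
    diff = encodings[:, None, :] - encodings[None, :, :]
    dist = np.sum(diff ** 2, axis=2)

    triplets = []
    for a in range(n_items):
        for p in range(n_items):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be another image of the same person
            for n in range(n_items):
                if labels[n] == labels[a]:
                    continue  # negative must be a different person
                if dist[a, p] + alpha > dist[a, n]:
                    triplets.append((a, p, n))
    return triplets

# Toy 2-D encodings; labels are person IDs.
encodings = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [10.0, 0.0]]
labels = [0, 0, 1, 1]
print(select_hard_triplets(encodings, labels))
```

Triplets whose negative is far away, such as (0, 1, 3) here, are dropped because they would contribute zero loss.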

Training set using triplet loss

  • (Anchor, Positive, Negative)…
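  • Companies train on datasets of 1 million to 100 million images
  • Some of them publish their trained parameters online
  • It is usually better to use those pretrained parameters than to train from scratch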

Face Verification and Binary Classification

\[
\left.
\begin{array}{r}
\boxed{👨\\ x^{(i)}}\rightarrow\boxed{CNN}\rightarrow \underbrace{\boxed{ \circ\\\circ\\\vdots\\\circ}}_{f(x^{(i)})}\\
\boxed{👩\\x^{(j)}}\rightarrow\boxed{CNN}\rightarrow \underbrace{\boxed{ \circ\\\circ\\\vdots\\\circ}}_{f(x^{(j)})}
\end{array}
\right\} \rightarrow \underbrace{\circ}_{sigmoid} \rightarrow\hat y
\]

\[
\hat y = \sigma \left(\sum_{k=1}^{128} w_k \underbrace{|f(x^{(i)})_k - f(x^{(j)})_k|}_{
\color{green}{
\frac{(f(x^{(i)})_k- f(x^{(j)})_k)^2}{ f(x^{(i)})_k+ f(x^{(j)})_k }:\chi^2 \text{ similarity}
}
}+b\right)
\]

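  • The two CNNs (for inputs i and j) share the same parameters, as in the Siamese network.
  • Computational trick
    • Precomputing the encodings of the database images can save significant computation: at query time, only the new image’s encoding has to be computed.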
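
The formula for \(\hat y\) and the precomputation trick can be sketched as follows. This assumes 128-dimensional encodings that are already available; the random encodings, the placeholder weights `w` and `b` (untrained), and the names `database` and `same_person_probability` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f_i[k] - f_j[k]| + b)."""
    features = np.abs(np.asarray(f_i) - np.asarray(f_j))
    return float(sigmoid(np.dot(w, features) + b))

rng = np.random.default_rng(0)

# Encodings of known persons are computed once and stored,
# so the CNN does not have to be rerun on them for every query.
database = {
    "employee_1": rng.normal(size=128),
    "employee_2": rng.normal(size=128),
}

# Placeholder parameters; in practice w and b are learned.
w = -np.ones(128)
b = 0.0

query = database["employee_1"]  # stand-in for the encoding of a new photo
p_same = same_person_probability(query, database["employee_1"], w, b)
p_diff = same_person_probability(query, database["employee_2"], w, b)
print(p_same, p_diff)
```

With negative weights, a larger element-wise difference pushes the output toward 0, i.e. toward "different persons".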

Neural Style Transfer

What is neural style transfer?

  • Content+Style→Generated Images
  • ヽ( ツ )丿 +🎨→😏

What are deep ConvNets learning?

Visualizing what a deep network is learning

  • Pick a unit in layer 1.
  • Find the nine image patches that maximize the unit’s activation.
  • Repeat for other units.
  • Paper
  • layer 1
    • ❏〼─│
    • edge, angle
  • layer 2
    • ◆△
    • more complex shapes and patterns
  • layer 3
    • ○👤
    • rounder shape
    • cars
    • person
    • textures like honeycomb shapes
  • layer 4
    • 🐶🐧
    • dog
    • water
    • bird legs
  • layer 5
    • 😺🐶🐩🎹🌸

Cost Function

Neural style transfer cost function

Content C + Style S => Generated Image G\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]

Find the generated image G

  • Initialize G randomly\[
    G: 100\times100\times 3
    \]
  • Use gradient descent to minimize \(J(G)\) \[
    G := G-\frac{\partial}{\partial G}J(G)
    \]
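
The update loop itself is short. The sketch below only illustrates the mechanics of updating the pixels of G: it uses a stand-in quadratic cost instead of the real style transfer cost, and it adds an explicit learning rate, which the update rule above leaves implicit. The names `target`, `cost`, `grad`, and the value `learning_rate = 0.1` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in target so that the cost has a simple closed-form gradient.
target = rng.random((100, 100, 3))

def cost(G):
    """Toy cost: J(G) = 1/2 * ||G - target||^2 (not the real J(G))."""
    return 0.5 * float(np.sum((G - target) ** 2))

def grad(G):
    """Gradient of the toy cost with respect to the pixels of G."""
    return G - target

# Initialize G randomly, with shape 100 x 100 x 3.
G = rng.random((100, 100, 3))
learning_rate = 0.1

initial_cost = cost(G)
for _ in range(50):
    # G := G - learning_rate * dJ/dG
    G = G - learning_rate * grad(G)
final_cost = cost(G)

print(initial_cost, final_cost)
```

In a real implementation, the gradient would come from backpropagating the content and style costs through the pretrained network to the pixels of G.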

Content Cost Function

\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]

  • Say you use hidden layer \(l\) to compute content cost.
    • small \(l\) (shallow layer): G is forced to be very similar to C, almost pixel by pixel
    • deep \(l\): “If there is a dog in your content image, then make sure there is a dog somewhere in your generated image.”
  • Use pre-trained ConvNet. (E.g., VGG network)
  • Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activation of layer \(l\) on the images
  • If \(a^{[l](C)}\) and \(a^{[l](G)}\) are similar, both images have similar content
  • \[
    J_{content}(C,G)= \frac12 \| a^{[l](C)} - a^{[l](G)} \|^2
    \]
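
In code, the content cost is an element-wise squared difference between two activation volumes. This is a minimal sketch, assuming the activations of layer \(l\) have already been extracted from a pretrained network as arrays of shape (n_H, n_W, n_C); the function name is hypothetical.

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content(C, G) = 1/2 * ||a_C - a_G||^2 over all entries."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

# Toy activations: 2 x 2 spatial positions, 3 channels.
a_C = np.ones((2, 2, 3))
a_G = np.zeros((2, 2, 3))
print(content_cost(a_C, a_G))  # 6.0
```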

Style Cost Function

Meaning of the “style” of an image

  • Say you are using layer l’s activation to measure “style.”
  • Define style as correlation between activations across channels.
  • How correlated are the activations across different channels?

Intuition about style of an image

Style matrix

Let \(a^{[l]}_{i,j,k} =\) activation at \((i,j,k)\). \(G^{[l]}\) is \(n_c^{[l]} \times n_c^{[l]}\)

Indices: i runs over height (H), j over width (W), k over channels (C).

\(G^{[l]}_{kk'}\) measures how correlated the activations in channel \(k\) and channel \(k'\) are.

\[
G_{kk'}^{[l]\color{green}{(S)}} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(S)}} a_{i,j,k'}^{[l]\color{green}{(S)}}\\
G_{kk'}^{[l]\color{green}{(G)}} = \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(G)}} a_{i,j,k'}^{[l]\color{green}{(G)}}
\]

Style matrix G: called the Gram matrix in linear algebra

It is an unnormalized cross-covariance of the channel activations.
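
The double sum over \(i\) and \(j\) becomes a single matrix product once the spatial dimensions are flattened. A minimal sketch, assuming activations stored as an array of shape (n_H, n_W, n_C); the function name is hypothetical.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix: G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    a_flat = a.reshape(n_H * n_W, n_C)   # one row per spatial position
    return a_flat.T @ a_flat             # shape (n_C, n_C)

rng = np.random.default_rng(0)
a = rng.random((4, 5, 3))
G_style = gram_matrix(a)
print(G_style.shape)  # (3, 3)
```

The result is symmetric, and its diagonal entries measure how active each channel is on its own.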

Style cost function

\[
\begin{align}
J_{style}^{[l]}(S,G) &= \frac{1}{ (2n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]} )^{2} } \sum_{k}\sum_{k'}\left ( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^{2}\\
J_{style}(S,G)&=\sum_l \lambda^{[l]} J_{style}^{[l]}(S,G)\\
J(G)&=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\end{align}
\]

Gradient descent on G then tries to minimize this cost function \(J(G)\).
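
The three formulas above translate into NumPy as follows. This is a sketch under the assumption that the per-layer activations of S and G are available as arrays of shape (n_H, n_W, n_C); the function names and the default weights `alpha=10.0`, `beta=40.0` are illustrative, not values stated in these notes.

```python
import numpy as np

def gram_matrix(a):
    """G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    a_flat = a.reshape(n_H * n_W, n_C)
    return a_flat.T @ a_flat

def layer_style_cost(a_S, a_G):
    """Style cost of one layer, normalized by (2 * n_H * n_W * n_C)^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum of the per-layer style costs."""
    return sum(
        lam * layer_style_cost(a_S, a_G)
        for lam, a_S, a_G in zip(lambdas, acts_S, acts_G)
    )

def total_cost(j_content, j_style, alpha=10.0, beta=40.0):
    """J(G) = alpha * J_content + beta * J_style."""
    return alpha * j_content + beta * j_style

# Toy activations for two layers: 1 x 1 spatial size, 2 channels.
a_S = np.ones((1, 1, 2))
a_G = np.zeros((1, 1, 2))
j_style = style_cost([a_S, a_S], [a_G, a_G], lambdas=[0.5, 0.5])
print(j_style)  # 0.25
```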

1D and 3D Generalizations

Convolutions in 2D and 1D

  • 2D
    • 2D input image 14×14
    • 2D filter 5×5
    • 14x14x3 * 5x5x3 => 10×10 per filter
  • 1D
    • 1D EKG (electrocardiogram) signal from one electrode, 14 dimensions
    • 1D filter of length 5
    • 14×1 * 5×1 => 10×16 (with 16 filters)
  • For 1D sequence data, you would more often use a recurrent neural network, an LSTM, or other sequence models.
  • 3D
    • CT scan
      • 3D volume 14x14x14 * 3D filter 5x5x5
      • 14x14x14x1 * 5x5x5x1 => 10x10x10x16 (with 16 filters)
      • 10x10x10x16 * 5x5x5x16 => 6x6x6x32 (with 32 filters)
    • Movie
      • where the different slices could be different slices in time through a movie.
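
All the shapes above follow the same rule: each spatial dimension shrinks from n to n - f + 1, and the last dimension becomes the number of filters. The helper below is a small sketch of that arithmetic, assuming stride 1 and no padding; the function name is hypothetical.

```python
def conv_output_shape(input_shape, filter_shape, num_filters):
    """Output shape of a stride-1, unpadded convolution in any dimension.

    input_shape:  spatial dims followed by channels, e.g. (14, 14, 3)
    filter_shape: spatial dims followed by channels, e.g. (5, 5, 3)
    """
    *spatial_in, channels_in = input_shape
    *spatial_f, channels_f = filter_shape
    if channels_in != channels_f:
        raise ValueError("filter channels must match input channels")
    if len(spatial_in) != len(spatial_f):
        raise ValueError("filter and input must have the same rank")
    spatial_out = tuple(n - f + 1 for n, f in zip(spatial_in, spatial_f))
    return spatial_out + (num_filters,)

print(conv_output_shape((14, 1), (5, 1), 16))                   # (10, 16)
print(conv_output_shape((14, 14, 3), (5, 5, 3), 16))            # (10, 10, 16)
print(conv_output_shape((14, 14, 14, 1), (5, 5, 5, 1), 16))     # (10, 10, 10, 16)
print(conv_output_shape((10, 10, 10, 16), (5, 5, 5, 16), 32))   # (6, 6, 6, 32)
```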

Programming assignments

  • Art Generation with Neural Style Transfer
    • Neural Style Transfer is an algorithm that, given a content image C and a style image S, can generate an artistic image.
    • It uses representations (hidden layer activations) based on a pretrained ConvNet.
    • The content cost function is computed using one hidden layer’s activations.
    • The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
    • Optimizing the total cost function results in synthesizing new images.
  • Face Recognition
    • Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
    • The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
    • The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.
