
DL [Course 4/5] Convolutional Neural Networks [Week 4/4] Special applications: Face recognition & Neural style transfer

[mathjax]

Face Recognition

What is face recognition?

Face verification vs. face recognition

  • Verification
    • Input image, name/ID
    • Output whether the input image is that of the claimed person
  • Recognition
    • Has a database of K persons
    • Get an input image
    • Output ID if the image is any of the K persons (or “not recognized”)
  • Liveness detection
    • Determine whether the input is a live human; reject photos of a face
    • This is a supervised learning problem
  • Face recognition is harder than face verification
    • 99% verification accuracy still means a 1% error rate per comparison, which is too high once you compare against K persons
    • Acceptable accuracy is 99.9% or higher

One Shot Learning

  • Learning from one example to recognize the person again
    • image->CNN->softmax(5)
      • doesn’t work well because such a small training set is not enough
      • what if a new person joins your team? Retrain the ConvNet every time?
  • Learning a “similarity” function (see the sketch after this list)
    • d(img1,img2) = degree of difference between the two images
    • d(img1,img2) <= τ # same
    • d(img1,img2) > τ # different
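A minimal sketch of how the similarity function d and the threshold τ might be used, both for verification and for recognition against a database of K persons. The value of TAU, the function names, and the dictionary layout are illustrative assumptions, not part of the lecture.

```python
TAU = 0.7  # hypothetical threshold tau, tuned on a validation set

def verify(d_value, tau=TAU):
    """Verification: same person if d(img1, img2) <= tau."""
    return d_value <= tau

def recognize(d_to_database, tau=TAU):
    """Recognition: d_to_database maps each of the K person IDs to
    d(input image, stored image of that person)."""
    best_id = min(d_to_database, key=d_to_database.get)
    return best_id if d_to_database[best_id] <= tau else "not recognized"
```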

Siamese Network

  • Siamese network
    \[
    \displaylines{
    \boxed{img👨: x^{(1)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{\substack{f(x^{(1)})\\ 128}}\\
    \boxed{img👩: x^{(2)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{\substack{f(x^{(2)})\\ 128}}
    }
    \]
  • DeepFace paper (Taigman et al.)
  • Goal of learning\[
    \boxed{img}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(1)})}
    \]
  • Parameters of NN define an encoding \(f(x^{(i)})\)
    • 128 dimensional
  • Learn parameters so that (a distance sketch follows after this list):
    • If \(x^{(i)}, x^{(j)} \) are the same person, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is small.
    • If \(x^{(i)}, x^{(j)} \) are different persons, \( \| f(x^{(i)}) - f(x^{(j)}) \|^2 \) is large.
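A small NumPy sketch of the squared-distance comparison between two 128-dimensional encodings. The encoder `model` in the usage comment is a hypothetical stand-in for the Siamese CNN, not an actual API.

```python
import numpy as np

def distance(f_x1, f_x2):
    """Squared L2 distance between two 128-dimensional encodings f(x1), f(x2)."""
    return float(np.sum((np.asarray(f_x1) - np.asarray(f_x2)) ** 2))

# Hypothetical usage with an encoder `model` that maps an image to a
# 128-dimensional vector (the last FC layer of the Siamese CNN):
#   d = distance(model(img1), model(img2))
#   small d -> same person, large d -> different persons
```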

Triplet Loss

Learning Objective

  • Images:
    • 👧Anchor,👩Positive: the same person
    • 👧Anchor,👵Negative: different persons
  • always looking at 3 images at a time (A, P, N)
  • Want:\[
    \displaylines{
    \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)} \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
    \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 \leq 0
    }
    \]
  • Trivial solution: if f outputs the same encoding for every image (e.g. all zeros), then f(A)-f(P)=0 and f(A)-f(N)=0 and the inequality is trivially satisfied.
    • The margin below makes sure the network doesn’t set all the encodings equal to each other.
  • Modify objective with hyperparameter(margin) \(\alpha\):\[
    \displaylines{
    \underbrace{\| f(A)-f(P) \|^2}_{d(A,P)}+\alpha \leq \underbrace{\| f(A)-f(N) \|^2}_{d(A,N)}\\
    \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2+\alpha \leq 0
    }
    \]

Loss function

\[
\eqalign{
L(A,P,N)&=\max( \| f(A)-f(P) \|^2 - \| f(A)-f(N) \|^2 + \alpha, 0)\\
J&=\sum_{i=1}^m L(A^{(i)}, P^{(i)}, N^{(i)})
}
\]
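A minimal NumPy sketch of the triplet loss above, assuming the encodings f(A), f(P), f(N) have already been computed by the network; alpha=0.2 is just an illustrative margin value.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)  # d(A,P)
    d_an = np.sum((f_a - f_n) ** 2)  # d(A,N)
    return max(d_ap - d_an + alpha, 0.0)

def total_cost(triplets, alpha=0.2):
    """J = sum over the m training triplets of L(A^(i), P^(i), N^(i))."""
    return sum(triplet_loss(f_a, f_p, f_n, alpha) for f_a, f_p, f_n in triplets)
```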

  • Training set: e.g. 10k pictures of 1k persons
    • You need multiple pictures per person to form (A, P) pairs; with just one picture of each person you couldn’t train this system.
    • After training, though, you apply the learned encoding to one-shot learning with a single image of a new person.

Choosing the triplets A,P,N

  • During training, if A, P, N are chosen randomly, \(d(A,P)+\alpha\leq d(A,N)\) is easily satisfied.
  • Choose triplets that are “hard” to train on (a possible mining strategy is sketched after this list):\[
    \displaylines{
    d(A,P)+\alpha\leq d(A,N) \\
    d(A,P) \approx d(A,N)
    }
    \]
    • Hard triplets give informative gradients and increase the computational efficiency of the learning algorithm.
  • FaceNet Paper
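One possible mining strategy, sketched with NumPy. This is a simplification for illustration, not the exact procedure from the FaceNet paper; the function name and the fallback rule are assumptions.

```python
import numpy as np

def pick_hard_negative(f_a, f_p, candidate_negatives, alpha=0.2):
    """Pick a 'hard' negative: prefer semi-hard negatives with
    d(A,P) < d(A,N) < d(A,P) + alpha, so the margin constraint is not
    trivially satisfied. candidate_negatives: encodings of other persons."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_ans = np.array([np.sum((f_a - f_n) ** 2) for f_n in candidate_negatives])
    semi_hard = np.where((d_ans > d_ap) & (d_ans < d_ap + alpha))[0]
    # fall back to the overall hardest negative if no semi-hard one exists
    idx = semi_hard[np.argmin(d_ans[semi_hard])] if semi_hard.size else np.argmin(d_ans)
    return candidate_negatives[int(idx)]
```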

Training set using triplet loss

  • (Anchor, Positive, Negative)…
  • Companies train on datasets of 1 million to 100 million images
  • Many of them publish the trained parameters online
  • Rather than training from scratch, use those pretrained parameters

Face Verification and Binary Classification

\[
\left.
\begin{array}{r}
\boxed{👨: x^{(i)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(i)})}\\
\boxed{👩: x^{(j)}}\rightarrow\boxed{CNN}\rightarrow\fc\rightarrow \underbrace{\fc}_{f(x^{(j)})}
\end{array}
\right\} \rightarrow \underbrace{\circ}_{sigmoid} \rightarrow\hat y
\]

\[
\hat y = \sigma \left(\sum_{k=1}^{128} w_k \underbrace{|f(x^{(i)})_k - f(x^{(j)})_k|}_{
\color{green}{
\frac{(f(x^{(i)})_k - f(x^{(j)})_k)^2}{ f(x^{(i)})_k + f(x^{(j)})_k }:\ \chi^2\ \text{similarity}
}
} + b\right)
\]

  • The two CNNs for images i and j are tied: they share the same parameters.
  • Computational trick (see the sketch after this list)
    • Precomputing the encodings of the database images can save significant computation at test time.
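A minimal sketch of the binary-classification head above, assuming the two 128-dimensional encodings have already been (pre)computed by the tied CNNs; `w` and `b` are the learned logistic-regression parameters, and the function name is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def verify_probability(f_xi, f_xj, w, b):
    """y_hat = sigma( sum_k w_k * |f(x^(i))_k - f(x^(j))_k| + b )."""
    features = np.abs(f_xi - f_xj)   # element-wise |difference|, shape (128,)
    # Alternative feature: chi-squared similarity
    #   features = (f_xi - f_xj) ** 2 / (f_xi + f_xj)
    return sigmoid(np.dot(w, features) + b)
```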

Neural Style Transfer

What is neural style transfer?

  • Content+Style→Generated Images
  • ヽ( ツ )丿 +🎨→😏

What are deep ConvNets learning?

Visualizing what a deep network is learning

  • Pick a unit in layer 1.
  • Find the nine image patches that maximize the unit’s activation (a rough sketch of this search follows after this list).
  • Repeat for other units.
  • Paper: Zeiler & Fergus, “Visualizing and Understanding Convolutional Networks”
  • layer 1
    • ❏〼─│
    • edge, angle
  • layer 2
    • ◆△
    • more complex shapes and patterns
  • layer 3
    • ○👤
    • rounder shape
    • cars
    • person
    • textures like honeycomb shapes
  • layer 4
    • 🐶🐧
    • dog
    • water
    • bird legs
  • layer 5
    • 😺🐶🐩🎹🌸
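A rough sketch of the patch search, much simplified compared with the paper: slide a window the size of the unit's receptive field over a set of images and keep the nine patches with the largest activation. `unit_activation` (a function returning the chosen unit's scalar activation for a patch) and the patch size/stride are hypothetical placeholders.

```python
import numpy as np

def top_patches(images, unit_activation, patch_size=11, top_k=9):
    """Return the top_k image patches that maximize one hidden unit's activation."""
    scored = []
    for img in images:
        h, w = img.shape[:2]
        for i in range(0, h - patch_size + 1, patch_size):
            for j in range(0, w - patch_size + 1, patch_size):
                patch = img[i:i + patch_size, j:j + patch_size]
                scored.append((float(unit_activation(patch)), patch))
    scored.sort(key=lambda t: t[0], reverse=True)   # largest activation first
    return [patch for _, patch in scored[:top_k]]
```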

Cost Function

Neural style transfer cost function

Content C + Style S => Generated image G\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]

Find the generated image G

  • Initialize G randomly (the optimization loop is sketched after this list)\[
    G: 100\times100\times 3
    \]
  • Use gradient descent to minimize \(J(G)\) \[
    G := G-\frac{\partial}{\partial G}J(G)
    \]
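A minimal TensorFlow sketch of this loop: treat the generated image G as a trainable variable and repeatedly step it in the direction that reduces J(G). The wrappers `content_cost_of` and `style_cost_of`, the Adam optimizer, the step count, and the α, β values are all illustrative assumptions, not the lecture's exact settings.

```python
import tensorflow as tf

# Generated image G, initialized randomly (shape 1 x 100 x 100 x 3, values in [0, 1]).
generated = tf.Variable(tf.random.uniform((1, 100, 100, 3)))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)  # Adam instead of plain GD

alpha, beta = 10.0, 40.0  # illustrative weights on the content and style costs

for step in range(200):
    with tf.GradientTape() as tape:
        # content_cost_of(G) and style_cost_of(G) are hypothetical wrappers that run
        # the pretrained ConvNet on `generated`, extract the chosen layers'
        # activations, and evaluate J_content(C, G) and J_style(S, G) as sketched
        # in the following sections.
        J = alpha * content_cost_of(generated) + beta * style_cost_of(generated)
    grad = tape.gradient(J, generated)              # dJ/dG
    optimizer.apply_gradients([(grad, generated)])  # update G to reduce J(G)
```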

Content Cost Function

\[
J(G)=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\]

  • Say you use hidden layer \(l\) to compute the content cost.
    • shallow \(l\): forces G to match C almost pixel for pixel
    • deep \(l\): only requires high-level content to match: “If there is a dog in your content image, then make sure there is a dog somewhere in your generated image.”
  • Use pre-trained ConvNet. (E.g., VGG network)
  • Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activation of layer \(l\) on the images
  • If \(a^{[l](C)}\) and \(a^{[l](G)}\) are similar, both images have similar content
  • \[
    J_{content}(C,G)= \frac12 \| a^{[l](C)} - a^{[l](G)} \|^2
    \]
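A one-line TensorFlow sketch of the content cost above; `a_C` and `a_G` are assumed to be the layer-\(l\) activations of C and G taken from a pretrained ConvNet such as VGG.

```python
import tensorflow as tf

def compute_content_cost(a_C, a_G):
    """J_content(C,G) = 1/2 * || a^[l](C) - a^[l](G) ||^2, summing over all
    entries of the layer-l activation volumes (shape (1, n_H, n_W, n_C))."""
    return 0.5 * tf.reduce_sum(tf.square(a_C - a_G))
```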

Style Cost Function

Meaning of the “style” of an image

  • Say you are using layer l’s activation to measure “style.”
  • Define style as correlation between activations across channels.
  • How correlated are the activations across different channels?

Intuition about style of an image

Style matrix

Let \(a^{[l]}_{i,j,k} =\) activation at \((i,j,k)\). \(G^{[l]}\) is \(n_c^{[l]} \times n_c^{[l]}\)

i: H, j: W, k:C

\(G^{[l]}_{kk'}\) measures how correlated the activations in channel \(k\) and channel \(k'\) are.

\[
\eqalign{
G_{kk'}^{[l]\color{green}{(S)}} &= \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(S)}} a_{i,j,k'}^{[l]\color{green}{(S)}}\\
G_{kk'}^{[l]\color{green}{(G)}} &= \sum_{i=1}^{n_H}\sum_{j=1}^{n_W} a_{i,j,k}^{[l]\color{green}{(G)}} a_{i,j,k'}^{[l]\color{green}{(G)}}
}
\]

The style matrix G is called a Gram matrix in linear algebra.

It is an unnormalized cross-covariance between channels.
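A small TensorFlow sketch of the style (Gram) matrix for one layer; `a` is assumed to be that layer's activation with shape (n_H, n_W, n_C).

```python
import tensorflow as tf

def gram_matrix(a):
    """G^[l] with entries G_kk' = sum over positions (i, j) of a[i,j,k] * a[i,j,k']."""
    a_unrolled = tf.reshape(a, (-1, a.shape[-1]))               # (n_H * n_W, n_C)
    return tf.matmul(a_unrolled, a_unrolled, transpose_a=True)  # (n_C, n_C)
```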

Style cost function

\[
\begin{align}
J_{style}^{[l]}(S,G) &= \frac{1}{ (2n_{H}^{[l]} n_{W}^{[l]} n_{C}^{[l]} )^{2} } \sum_{k}\sum_{k'}\left ( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^{2}\\
J_{style}(S,G)&=\sum_l \lambda^{[l]} J_{style}^{[l]}(S,G)\\
J(G)&=\alpha J_{content}(C,G)+\beta J_{style}(S,G)
\end{align}
\]

Gradient descent then minimizes this overall cost function \(J(G)\).
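A sketch of the per-layer style cost and the weighted sum over layers, reusing the `gram_matrix` helper from the sketch above; the list-of-pairs interface and the λ weights are illustrative assumptions.

```python
import tensorflow as tf

def layer_style_cost(a_S, a_G):
    """J_style^[l](S,G) = 1/(2 n_H n_W n_C)^2 * sum_{k,k'} (G_kk'^(S) - G_kk'^(G))^2.
    a_S, a_G: layer-l activations of shape (n_H, n_W, n_C) for S and G."""
    n_H, n_W, n_C = a_S.shape
    GS, GG = gram_matrix(a_S), gram_matrix(a_G)
    return tf.reduce_sum(tf.square(GS - GG)) / float((2 * n_H * n_W * n_C) ** 2)

def compute_style_cost(layer_activations, lambdas):
    """J_style(S,G) = sum_l lambda^[l] * J_style^[l](S,G);
    layer_activations: list of (a_S, a_G) pairs for the chosen layers."""
    return sum(lam * layer_style_cost(a_S, a_G)
               for lam, (a_S, a_G) in zip(lambdas, layer_activations))
```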

1D and 3D Generalizations

Convolutions in 2D and 1D

  • 2D
    • 2D input image 14×14
    • 2D filter 5×5
    • 14x14x3 * 5x5x3 => 10×10
  • 1D
    • 1D EKG (electrocardiogram) signal from one electrode, 14 samples
    • filter of size 5
    • 14×1 * 5×1 => 10×16 (with 16 filters)
  • For 1D sequence data, people more commonly use recurrent neural networks, LSTMs, and similar sequence models.
  • 3D
    • CT scan
      • 3D volume 14x14x14 * 3D filter 5x5x5
      • 14x14x14x1 * 5x5x5x1 => 10x10x10x16
      • 10x10x10x16 * 5x5x5x16 => 6x6x6x32
    • Movie
      • The different slices can be different frames in time through a movie (the shape arithmetic is sketched after this list).
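A quick Keras sketch to check the shape arithmetic above (valid padding, stride 1, so each output dimension is input size minus filter size plus 1); the random inputs are placeholders.

```python
import tensorflow as tf

x1d = tf.random.normal((1, 14, 1))           # 1D: 14 samples (e.g. EKG), 1 channel
y1d = tf.keras.layers.Conv1D(16, 5)(x1d)     # 16 filters of size 5   -> (1, 10, 16)

x3d = tf.random.normal((1, 14, 14, 14, 1))   # 3D: 14x14x14 volume (e.g. CT), 1 channel
y3d = tf.keras.layers.Conv3D(16, 5)(x3d)     # 16 filters of 5x5x5x1  -> (1, 10, 10, 10, 16)
y3d2 = tf.keras.layers.Conv3D(32, 5)(y3d)    # 32 filters of 5x5x5x16 -> (1, 6, 6, 6, 32)

print(y1d.shape, y3d.shape, y3d2.shape)
```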

Programming assignments

  • Art Generation with Neural Style Transfer
    • Neural style transfer is an algorithm that, given a content image C and a style image S, generates an artistic image.
    • It uses representations (hidden layer activations) based on a pretrained ConvNet.
    • The content cost function is computed using one hidden layer’s activations.
    • The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
    • Optimizing the total cost function results in synthesizing new images.
  • Face Recognition
    • Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
    • The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
    • The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.
