
DL [Course 5/5] Sequence Models [Week 3/3] Sequence models & Attention mechanism

Various sequence to sequence architectures

Basic Models

Sequence to sequence model

\(x\): Jane visite l’Afrique en septembre

\(y\): Jane is visiting Africa in September.

Image captioning

\[
\displaylines{
💺😺
\underbrace{\rightarrow}_{11×11,s=4}
\boxed{(55,55,96)}
\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}
\boxed{(27,27,96)}
\underbrace{\rightarrow}_{5×5,\text{same}}\\
\boxed{(27,27,256)}
\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}
\boxed{(13,13,256)}
\underbrace{\rightarrow}_{3×3,\text{same}}
\boxed{(13,13,384)}
\underbrace{\rightarrow}_{3×3}\\
\boxed{(13,13,384)}
\underbrace{\rightarrow}_{3×3}
\boxed{(13,13,256)}
\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}
\boxed{(6,6,256)}\\
=\text{FC}_{9216}
\rightarrow \text{FC}_{4096}
\rightarrow \text{FC}_{4096}
\rightarrow \underbrace{\boxed{y^{<1>},y^{<2>},\dots,y^{<T_y>}}}_{\text{A cat sitting on a chair}}
}
\]

Picking the most likely sentence

Machine translation as building a conditional language model

Language model:
  • \( P(y^{<1>},\dots,y^{<T_y>})\)
  • The machine translation model is very similar to the language model, except that instead of always starting from a vector of all zeros, the decoder starts from the encoder’s representation of the input sentence.
Machine translation:
  • \( P(y^{<1>},\dots,y^{<T_y>}\mid x^{<1>},\dots,x^{<T_x>})\)
  • “conditional language model”
    • Encoder: maps the input sentence to an encoding
    • Decoder: language model

Finding the most likely translation

  • Jane visite l’Afrique en septembre.
    • Jane is visiting Africa in September.
    • Jane is going to be visiting Africa in September.
    • In September, Jane will visit Africa.
    • Her African friend welcomed Jane in September.

\[
\mathop{\arg\max}_{y^{<1>},\dots,y^{<T_y>}} P(\underbrace{y^{<1>},\dots,y^{<T_y>}}_{\text{English}}\mid \underbrace{x}_{\text{French}})
\]

Why not a greedy search?

  • Jane is visiting Africa in September.
    • better translation
  • Jane is going to be visiting Africa in September.
    • not a bad translation but verbose
  • \( P(\text{Jane is going}|x) > P(\text{Jane is visit}|x) \)
    • “going” is a more common English word, so the greedy choice looks better locally
  • And, of course, the total number of combinations of words in the English sentence is exponentially large.
  • So this is a huge space of possible sentences, and it’s impossible to rate them all, which is why the most common approach is to use an approximate search algorithm.

Beam Search

Beam search algorithm

  • \( P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>}) \)
    • jane visits africa in september.<EOS>
    • The outcome of this process is that beam search adds one word at a time until it selects <EOS> as the best next symbol.
  • If the beam width is 1, beam search essentially becomes the greedy search algorithm.
  • Length normalization is a small change to the beam search algorithm that can help get much better results.
  • Beam search maximizes the probability in the first formula below: the product of all the conditional probabilities, where \(T_y\) is the total number of words in the output.
  • In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations.
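The procedure above can be sketched in a few lines of Python; the toy next-word distribution below is a hypothetical stand-in for the decoder RNN’s softmax output (the words and probabilities are made up for illustration):

```python
import math

def beam_search(next_word_probs, beam_width=3, max_len=10):
    """Keep the beam_width most likely partial sentences at each step.

    next_word_probs(prefix) returns {word: P(word | x, prefix)};
    it stands in for the decoder RNN.
    """
    beams = [((), 0.0)]  # (prefix, log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), logp + math.log(p)))
        # keep only the beam_width best partial sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_width]:
            if prefix[-1] == "<EOS>":
                completed.append((prefix, logp))
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

# Toy conditional distributions; "jane visits africa <EOS>" has the
# highest total probability (0.7 * 0.6 * 0.9 * 1.0 = 0.378).
TOY = {
    (): {"jane": 0.7, "in": 0.3},
    ("jane",): {"visits": 0.6, "is": 0.4},
    ("jane", "visits"): {"africa": 0.9, "september": 0.1},
    ("jane", "visits", "africa"): {"<EOS>": 1.0},
    ("jane", "is"): {"visiting": 1.0},
    ("jane", "is", "visiting"): {"africa": 1.0},
    ("jane", "is", "visiting", "africa"): {"<EOS>": 1.0},
    ("in",): {"<EOS>": 1.0},
    ("jane", "visits", "september"): {"<EOS>": 1.0},
}

best, logp = beam_search(TOY.get, beam_width=2)
```

With beam width 2, the shorter but less probable candidates are pruned and the most likely full sentence survives.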

Refinements to Beam Search

Length normalization

\[
\displaylines{
\arg\max_y \overbrace{\prod_{t=1}^{T_y} P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}^{P(y^{<1>},\dots,y^{<T_y>}\mid x)=P(y^{<1>}\mid x)\,P(y^{<2>}\mid x,y^{<1>})\cdots P(y^{<T_y>}\mid x,y^{<1>},\dots,y^{<T_y-1>})}\\
\arg\max_y \underbrace{\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}_{T_y=1,2,3,\dots,30}\\
\frac{1}{T_y^\alpha}\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})
}
\]

  • Probabilities are less than 1; multiplying many of them together can result in numerical underflow.
  • Instead of maximizing this product, we take logs and maximize the sum.
  • \(\alpha =0.7\)
    • softer approach
  • \(\alpha =1\)
    • completely normalizing by length
  • \(\alpha =0\)
    • no normalization
  • Finally, among all of these candidate sentences, you pick the one that achieves the highest value on this normalized log-probability objective.
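A minimal sketch of this objective with toy per-step probabilities (chosen for illustration): with \(\alpha=0\) the unnormalized log-probability sum favors the shorter candidate, while \(\alpha=1\) fully normalizes by length and lets the longer one win.

```python
import math

def normalized_score(step_probs, alpha=0.7):
    """(1 / T_y**alpha) * sum_t log P(y<t> | x, y<1>,...,y<t-1>).

    step_probs holds the per-step conditional probabilities the model
    assigned to one candidate translation.
    """
    T_y = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (T_y ** alpha)

# A short candidate vs. a longer one (toy numbers):
short = [0.5, 0.5]
longer = [0.7, 0.7, 0.7, 0.7]

# alpha=0: no normalization -> short sentence scores higher.
# alpha=1: completely normalized by length -> longer sentence scores higher.
```

This is exactly the short-translation bias described above: every extra factor less than 1 lowers the raw product, so without normalization beam search prefers to stop early.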

Beam search discussion

Beam width B?

  • Larger B
    • considers more possibilities, so it tends to find better sentences
    • but more computationally expensive, because many more hypotheses are kept around
    • slower, and memory requirements grow
  • Smaller B
    • worse results
    • faster
  • in production system
    • B=10 is not uncommon
    • B=100 is very large
  • in research system (publish, best possible result)
    • B=1000,3000 is not uncommon
    • domain dependent

Unlike exact search algorithms such as BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum of \(\arg\max_y P(y\mid x)\).

Error analysis in beam search

Example

  • Jane visite l’Afrique en septembre.
  • Human: Jane visits Africa in September. \(y^*\)
  • Algorithm: Jane visited Africa last September. \(\hat y\)
  • the model consists of 2 components
    • RNN: encoder and decoder
    • Beam search
  • when we get an error (a bad translation)
    • which component is more to blame?
  • should we increase B, or gather more training data?

Error analysis on beam search

  • Case 1: \(P(y^*|x) \gt P(\hat y|x)\)
    • Beam search chose \(\hat y\), but \(y^*\) attains a higher \(P(y|x)\).
    • Conclusion: Beam search is at fault.
  • Case 2: \(P(y^*|x) \leq P(\hat y|x)\)
    • \(y^*\) is a better translation than \(\hat y\), but the RNN predicted \(P(y^*|x) \leq P(\hat y|x)\).
    • Conclusion: The RNN model is at fault (rather than the search algorithm).
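The attribution rule in the two cases above fits in a one-line helper (a sketch; in practice \(P(y^*|x)\) and \(P(\hat y|x)\) come from scoring both sentences with the RNN):

```python
def at_fault(p_star, p_hat):
    """Attribute an error, given P(y*|x) and P(yhat|x) from the RNN,
    for a case where the human translation y* is better than yhat.
    """
    # Case 1: the RNN ranks y* higher, yet beam search returned yhat
    #         -> the search is at fault.
    # Case 2: the RNN itself gives the worse translation a higher
    #         probability -> the model is at fault.
    return "beam search" if p_star > p_hat else "RNN model"
```

Run this over every error in the dev set and tally the verdicts, as in the table below.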

Error analysis process

| Human | Algorithm | \(P(y^*\mid x)\) | \(P(\hat y\mid x)\) | At fault? |
| --- | --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | \(2\times10^{-10}\) | \(1\times10^{-10}\) | B |
| … | … | … | … | R |
| … | … | … | … | B |

Figure out what fraction of errors is “due to” beam search vs. the RNN model.

Bleu Score

Evaluating machine translation

  • French: Le chat est sur le tapis.
  • Reference 1: The cat is on the mat.
  • Reference 2: There is a cat on the mat.
  • MT output: the the the the the the the.
  • Precision: 7/7
    • every one of these 7 words appears in Reference 1 or Reference 2
    • this is not a particularly useful measure
  • Modified precision: 2/7
    • each word gets credit only up to the maximum number of times it appears in any one reference sentence (“the” appears at most twice, in Reference 1)

Bleu score on bigrams

  • Example:
    • Reference 1: The cat is on the mat.
    • Reference 2: there is a cat on the mat.
    • MT output: The cat the cat on the mat.
| Bigram | Count | Count_clip |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |

Modified bigram precision: 4/6

Bleu score on unigrams

  • Example:
    • Reference 1: The cat is on the mat.
    • Reference 2: there is a cat on the mat.
    • MT output: The cat the cat on the mat.

\[
\displaylines{
p_1=\frac{\sum_{\text{unigram} \in \hat y} \text{Count}_{clip}(\text{unigram})}{\sum_{\text{unigram} \in \hat y} \text{Count}(\text{unigram})}\\
p_n=\frac{\sum_{n\text{-gram} \in \hat y} \text{Count}_{clip}(n\text{-gram})}{\sum_{n\text{-gram} \in \hat y} \text{Count}(n\text{-gram})}
}
\]

  • If the MT output is exactly the same as one of the references
    • \(p_1 = p_2 = 1.0\)

Bleu details

  • \(p_n =\) Bleu score on n-grams only
  • Combined Bleu score:
    • \(BP \cdot \exp(\frac14 \sum_{n=1}^4 \log p_n)\)
  • BP: brevity penalty
  • \[
    BP=\begin{cases}
    1 & (\text{if MT output length} \gt \text{reference output length})\\
    \exp(1-\text{reference output length}/\text{MT output length}) & (\text{otherwise})
    \end{cases}
    \]
  • Bleu score was revolutionary for MT
    • gave a pretty good, by no means perfect, but pretty good single real number evaluation metric
  • Open-source implementations exist
  • Today it is used to evaluate many text-generation systems
    • translation
    • image captioning
  • It is not used for speech recognition
    • because speech recognition has a single ground-truth transcript
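The modified precision and combined score can be sketched as follows (a simplified implementation: real Bleu implementations handle zero n-gram counts and the choice of reference length for the brevity penalty more carefully; the shortest reference is used here for simplicity):

```python
import math
from collections import Counter

def ngrams(words, n):
    """All n-grams (as tuples) of a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    """p_n: clipped n-gram count / total n-gram count in the candidate.
    Each n-gram gets credit up to the max number of times it appears
    in any single reference sentence."""
    cand = Counter(ngrams(candidate, n))
    clipped = sum(min(count, max(Counter(ngrams(r, n))[g] for r in references))
                  for g, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Combined Bleu = BP * exp((1/4) sum_n log p_n); assumes every p_n > 0."""
    avg_log_p = sum(math.log(modified_precision(candidate, references, n))
                    for n in range(1, max_n + 1)) / max_n
    ref_len = min(len(r) for r in references)   # simplification, see lead-in
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(avg_log_p)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
mt = "the cat the cat on the mat".split()
p2 = modified_precision(mt, refs, 2)   # 4/6, matching the bigram table above
```

The degenerate output “the the the the the the the” gets the unigram modified precision of 2/7 computed earlier.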

Attention Model Intuition

The problem of long sequences

  • Encoder-Decoder architecture
    • it works quite well for short sentences
    • for very long sentences, maybe longer than 30–40 words, performance comes down

Attention Model Intuition

Attention Model

Attention model

\[
\begin{align}
a^{<t’>}&=(\overrightarrow a^{<t’>}, \overleftarrow a^{<t’>})\\
\sum_{t’} \alpha^{<t,t’>}&=1\\
c^{<1>}&=\sum_{t’}\alpha^{<1,t’>}a^{<t’>}\\
\alpha^{<t,t’>}&=\text{ amount of attention }y^{<t>} \text{ should pay to } a^{<t’>}
\end{align}
\]

Computing attention \(\alpha^{<t,t’>}\)

\[
\displaylines{
\alpha^{<t,t’>}=\text{ amount of attention }y^{<t>} \text{ should pay to } a^{<t’>}\\
\alpha^{<t,t’>}=\frac{\exp(e^{<t,t’>})}{\sum_{t’=1}^{T_x}\exp(e^{<t,t’>})}\\
\begin{matrix}
s^{<t-1>}\rightarrow \\
a^{<t’>}\rightarrow
\end{matrix}
\boxed{\text{small NN}} \rightarrow e^{<t,t’>}
}
\]
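The softmax over \(t'\) and the resulting context vector can be sketched directly (the scores \(e^{<t,t'>}\) are taken as given here; in the model they come from the small network fed with \(s^{<t-1>}\) and \(a^{<t'>}\), and the numbers below are illustrative):

```python
import math

def attention_weights(scores):
    """alpha<t,t'> = exp(e<t,t'>) / sum_tau exp(e<t,tau>): a softmax over t'."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [w / total for w in exps]

def context_vector(alpha, activations):
    """c<t> = sum_{t'} alpha<t,t'> * a<t'>: weighted sum of encoder activations."""
    dim = len(activations[0])
    return [sum(w * a[d] for w, a in zip(alpha, activations)) for d in range(dim)]

e = [2.0, 1.0, 0.1]                      # e<t,t'> for one output step t
alpha = attention_weights(e)             # sums to 1 over t'
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (bidirectional) encoder activations a<t'>
c = context_vector(alpha, a)
```

The weights sum to 1 by construction, matching \(\sum_{t'} \alpha^{<t,t'>}=1\) above, and the largest score receives the most attention.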

Attention examples

  • date normalization
    • July 20th 1969
    • 23 April, 1564

Speech recognition – Audio data

Speech recognition

Speech recognition problem

  • x: audio clip
  • y: transcript
    • the quick brown fox
    • phonemes: de ku wik bra…
    • once upon a time, speech recognition systems were built using phonemes, hand-engineered basic units of sound
  • academic data sets on speech recognition: ~300h
    • 3,000h is a reasonable size
    • commercial systems: 10,000h, sometimes 100,000h or more

CTC cost for speech recognition

(Connectionist temporal classification)

  • bidirectional LSTM or bidirectional GRU
  • the number of input timesteps is very large
    • 10 sec of audio at 100 Hz = 1,000 inputs
  • “the quick brown fox” (19 characters, including spaces)
  • CTC cost function allows the RNN to generate an output like:
    • ttt_h_eee_____ ____qqqq___
    • the q
  • Basic rule: collapse repeated characters not separated by a blank
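The collapse rule can be written in two steps, deduplicate then drop blanks (using `_` for the CTC blank, as in the example above):

```python
import itertools

def ctc_collapse(raw):
    """Collapse repeated characters, then remove blanks ('_').

    'ttt_h_eee' -> 't_h_e' -> 'the'. Repeats separated by a blank
    survive deduplication, which is how CTC can emit double letters.
    """
    deduped = "".join(ch for ch, _ in itertools.groupby(raw))
    return deduped.replace("_", "")
```

So a 1,000-step output string can still collapse to a short transcript like “the quick brown fox”.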

Trigger Word Detection

What is trigger word detection?

  • Amazon Echo (Alexa)
  • Baidu DuerOS (xiaodu nihao)
  • Apple Siri (Hey Siri)
  • Google Home (Okay Google)

Trigger word detection algorithm

  • The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on the best algorithm for trigger word detection.
  • \(x^{<t>}\) comes from an audio clip; maybe compute spectrogram features, then pass them through an RNN
  • right after someone finishes saying the trigger word
    • set \(y^{<t>}=1\)
    • this creates a very imbalanced training set with many more zeros than ones, which is hard to train
  • instead of setting only a single timestep to output 1, you can make it output 1 at several consecutive timesteps
    • for a fixed period of time, before reverting back to zero
    • this slightly evens out the ratio of ones to zeros
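The labeling trick above can be sketched as follows (the 50-step window and the step counts are illustrative choices, not values from the course):

```python
def make_labels(num_steps, trigger_end_steps, ones_duration=50):
    """Build targets y<t>: output 1 for ones_duration consecutive steps
    after each trigger-word end, instead of a single 1, to even out
    the ratio of ones to zeros in the training labels."""
    y = [0] * num_steps
    for end in trigger_end_steps:
        for t in range(end, min(end + ones_duration, num_steps)):
            y[t] = 1
    return y

# One trigger word ending at step 300 in a 1,000-step clip:
y = make_labels(1000, trigger_end_steps=[300], ones_duration=50)
```

A single-step label would give 1 one per 1,000 zeros; the window makes the positive class 50× more frequent.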

Conclusion

Conclusion and thank you

Specialization outline

  • Neural Networks and Deep Learning
  • Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
  • Structuring Machine Learning Projects
  • Convolutional Neural Networks
  • Sequence Models

Programming assignments

Neural Machine Translation

  • Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
  • An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
  • A network using an attention mechanism can translate from inputs of length \(T_x\) to outputs of length \(T_y\), where \(T_x\) and \(T_y\) can be different.
  • You can visualize attention weights \(\alpha^{<t,t’>}\) to see what the network is paying attention to while generating each output.

Trigger word detection

  • Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
  • Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
  • An end-to-end deep learning approach can be used to build a very effective trigger word detection system.
