## Various sequence to sequence architectures

### Basic Models

#### Sequence to sequence model

\(x\): Jane visite l’Afrique en septembre

\(y\): Jane is visiting Africa in September.

- Papers: Sutskever et al. (2014), “Sequence to Sequence Learning with Neural Networks”; Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

#### Image captioning

\[

\displaylines{

💺😺

\underbrace{\rightarrow}_{11×11,s=4}

\boxed{(55,55,96)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(27,27,96)}

\underbrace{\rightarrow}_{5×5,\text{same}}\\

\boxed{(27,27,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}\\

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(6,6,256)}\\

=\boxed{FC_{9216}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \underbrace{\boxed{y^{<1>},y^{<2>},\dots,y^{<T_y>}}}_{\text{A cat sitting on a chair}}

}

\]

- Papers: Mao et al. (2014), “Deep Captioning with Multimodal Recurrent Neural Networks”; Vinyals et al. (2014), “Show and Tell: Neural Image Caption Generator”; Karpathy and Fei-Fei (2015), “Deep Visual-Semantic Alignments for Generating Image Descriptions”
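A shape-check sketch of this pipeline (assuming PyTorch; the notes name no framework, and the local response normalization from the original AlexNet is omitted):

```python
import torch
import torch.nn as nn

# AlexNet-style encoder; comments track the (H, W, C) shapes from the diagram
encoder = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),     # (227,227,3) -> (55,55,96)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (27,27,96)
    nn.Conv2d(96, 256, kernel_size=5, padding=2),   # "same" -> (27,27,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (13,13,256)
    nn.Conv2d(256, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),  # -> (13,13,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (6,6,256)
    nn.Flatten(),                                   # -> 9216
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
)
features = encoder(torch.randn(1, 3, 227, 227))     # shape (1, 4096)
```

For captioning, this 4096-dimensional encoding replaces the RNN decoder's initial all-zeros state, and the decoder then generates \(y^{<1>},\dots,y^{<T_y>}\).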

### Picking the most likely sentence

#### Machine translation as building a conditional language model

##### Language model:

- \( P(y^{<1>},\dots,y^{<T_y>})\)
- The machine translation model is very similar to the language model, except that instead of always starting from a vector of all zeros, the decoder starts from the encoder’s representation of the input sentence.

##### Machine translation:

- \( P(y^{<1>},\dots,y^{<T_y>}|x^{<1>},\dots,x^{<T_x>})\)
- “conditional language model”
- Encoder: maps the input sentence to an encoding
- Decoder: a language model conditioned on that encoding (a minimal sketch follows)
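A minimal sketch of the encoder-decoder idea (my own illustration, assuming PyTorch and GRU layers; real systems add attention, covered below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder maps x<1..Tx> to a fixed encoding; the decoder is a
    conditional language model started from that encoding."""
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, x, y_in):
        _, h = self.encoder(self.src_emb(x))        # encoding of the French sentence
        d, _ = self.decoder(self.tgt_emb(y_in), h)  # decoder conditioned on it
        return self.out(d)                          # scores for P(y<t> | x, y<1..t-1>)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 9)))
```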

#### Finding the most likely translation

- Jane visite l’Afrique en septembre.
- Jane is visiting Africa in September.
- Jane is going to be visiting Africa in September.
- In September, Jane will visit Africa.
- Her African friend welcomed Jane in September.

\[

\arg\max_{y^{<1>},\dots,y^{<T_y>}} P(\underbrace{y^{<1>},\dots,y^{<T_y>}}_{\text{English}}\mid \underbrace{x}_{\text{French}})

\]

#### Why not a greedy search?

- Jane is visiting Africa in September.
- better translation

- Jane is going to be visiting Africa in September.
- not a bad translation but verbose

- \( P(\text{Jane is going}|x) > P(\text{Jane is visiting}|x) \)
- “going” is a more common English word, so after picking “Jane is”, a greedy search is likely to prefer it

- And, of course, the total number of combinations of words in the English sentence is exponentially large.
- So this is a huge space of possible sentences, and it’s impossible to rate them all, which is why the most common thing to do is to use an approximate search algorithm.

### Beam Search

#### Beam search algorithm

- \( P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>}) \)
- jane visits africa in september.<EOS>
- The outcome of this process is that beam search adds one word at a time until it selects <EOS> as the most likely next symbol.

- If the beam width is 1, beam search essentially becomes the greedy search algorithm.
- Length normalization is a small change to the beam search algorithm that can help get much better results.
- Beam search maximizes the probability in the first formula below: the product of all the conditional word probabilities, where \(T_y\) is the total number of words in the output.
- In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations. (A minimal sketch of the algorithm follows.)
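A minimal beam-search sketch (my own illustration, not the course’s code; `step_log_probs` is a hypothetical stand-in for the decoder, returning `log P(word | x, prefix)` for each candidate word):

```python
import math

def beam_search(step_log_probs, eos="<EOS>", beam_width=3, max_len=30, alpha=0.7):
    """Keep the B most likely partial sentences at each step; score with
    summed log-probabilities and length-normalize at the end."""
    beams = [([], 0.0)]              # (partial sentence, sum of log P)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # a beam that emits <EOS> is finished
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    completed += beams               # in case max_len was hit first
    # normalized log-probability objective: score / T_y ** alpha
    return max(completed, key=lambda c: c[1] / len(c[0]) ** alpha)
```

With `beam_width=1` this reduces to greedy search.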

### Refinements to Beam Search

#### Length normalization

\[

\displaylines{

\arg\max_y \overbrace{\prod_{t=1}^{T_y} P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}^{P(y^{<1>},\dots,y^{<T_y>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})\cdots P(y^{<T_y>}|x,y^{<1>},\dots,y^{<T_y-1>})}\\

\arg\max_y \underbrace{\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}_{T_y=1,2,3,\dots,30}\\

\frac{1}{T_y^\alpha}\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})

}

\]

- Probabilities are less than 1; multiplying many of them together can result in numerical underflow.
- Instead of maximizing this product, we take logs and maximize the sum.
- \(\alpha =0.7\)
- softer approach

- \(\alpha =1\)
- completely normalizing by length

- \(\alpha =0\)
- no normalization

- Finally, among all of the candidate sentences, you pick the one that achieves the highest value on this normalized log-probability objective (a numeric sketch follows).
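A tiny numeric illustration of the underflow point (my own example, not from the lecture):

```python
import math

# the product of many probabilities underflows to 0.0 in floating point
probs = [0.01] * 300
print(math.prod(probs))                   # 0.0 (true value is 1e-600)

# the sum of logs stays perfectly representable
log_score = sum(math.log(p) for p in probs)
print(log_score)                          # about -1381.55

# normalized log-probability objective with alpha = 0.7
alpha = 0.7
print(log_score / len(probs) ** alpha)    # comparable across output lengths
```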

#### Beam search discussion

Beam width B?

- Larger B
- considers more possibilities, so it tends to find a better sentence
- but more computationally expensive, because you’re also keeping a lot more possibilities around
- slower, and memory requirements grow

- Smaller B
- worse result
- faster

- in production system
- B=10 is not uncommon
- B=100 is very large

- in research system (publish, best possible result)
- B=1000,3000 is not uncommon
- domain dependent

Unlike exact search algorithms like BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum of \(\arg\max_y P(y|x)\).

### Error analysis in beam search

#### Example

- Jane visite l’Afrique en septembre.
- Human: Jane visits Africa in September. \(y^*\)
- Algorithm: Jane visited Africa last September. \(\hat y\)
- The model consists of 2 components:
- RNN: encoder and decoder
- Beam search

- Given an error (a bad translation)
- which component is more to blame?

- Increase B? Or get more training data?

#### Error analysis on beam search

- Case 1: \(P(y^*|x) \gt P(\hat y|x)\)
- Beam search chose \(\hat y\), but \(y^*\) attains a higher \(P(y|x)\).
- Conclusion: Beam search is at fault.

- Case 2: \(P(y^*|x) \leq P(\hat y|x)\)
- \(y^*\) is a better translation than \(\hat y\). But RNN predicted \(P(y^*|x) \lt P(\hat y|x)\).
- Conclusion: The RNN model is at fault (rather than the search algorithm).

#### Error analysis process

| Human | Algorithm | \(P(y^*\mid x)\) | \(P(\hat y\mid x)\) | At fault? |
| --- | --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | \(2\times 10^{-10}\) | \(1\times 10^{-10}\) | B |
| … | … | … | … | R |
| … | … | … | … | B |

Figure out what fraction of errors are “due to” beam search vs. the RNN model (a small tallying sketch follows).
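A tiny tallying sketch (my own illustration; only the first probability pair comes from the table above, the rest are made up):

```python
def at_fault(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x) -> beam search ("B") is at fault.
    Case 2: otherwise -> the RNN model ("R") is at fault."""
    return "B" if p_star > p_hat else "R"

errors = [(2e-10, 1e-10), (3e-12, 9e-12), (5e-11, 4e-11)]
faults = [at_fault(ps, ph) for ps, ph in errors]
print(faults.count("B") / len(faults))  # fraction of errors due to beam search
```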

### Bleu Score

#### Evaluating machine translation

- French: Le chat est sur le tapis.
- Reference 1: The cat is on the mat.
- Reference 2: There is a cat on the mat.
- MT output: the the the the the the the.
- Precision: 7/7
- every one of these 7 words (“the”) appears in Reference 1 or Reference 2

- this is not a particularly useful measure.

- Modified precision: 2/7 (“the” appears at most twice in Reference 1; counting against Reference 2 alone would give 1/7)
- credit each word only up to the maximum number of times it appears in the reference sentences.

- Bleu: Bilingual Evaluation Understudy
- Paper: Papineni et al. (2002), “Bleu: a Method for Automatic Evaluation of Machine Translation”

#### Bleu score on bigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

| Bigram | Count | Count_clip |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |

Modified bigram precision: \(p_2 = 4/6\)

#### Bleu score on unigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

\[

\displaylines{

p_1=\frac{\sum_{\text{unigram} \in \hat y} Count_{clip}(\text{unigram})}{\sum_{\text{unigram} \in \hat y} Count(\text{unigram})}\\

p_n=\frac{\sum_{\text{n-gram} \in \hat y} Count_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat y} Count(\text{n-gram})}

}

\]

- If the MT output is exactly the same as one of the references:
- \(p_1 = p_2 = 1.0\) (see the sketch below)
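A sketch of the clipped-count computation behind \(p_n\) (my own illustration, assuming lower-cased whitespace tokenization with periods stripped):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """p_n: clip each candidate n-gram count at its maximum count
    over the reference sentences."""
    tokens = lambda s: s.lower().replace(".", "").split()
    ngrams = lambda ws: Counter(zip(*(ws[i:] for i in range(n))))
    cand = ngrams(tokens(candidate))
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(tokens(ref)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("The cat the cat on the mat.", refs, 2))   # 4/6
print(modified_precision("the the the the the the the.", refs, 1))  # 2/7
```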

#### Bleu details

- \(p_n =\) Bleu score on n-grams only
- Combined Bleu score:
- \(BP \cdot \exp\left(\frac14 \sum_{n=1}^4 p_n\right)\) (the original Bleu paper puts \(\log p_n\) inside the sum)

- BP: brevity penalty
- \[

BP=\begin{cases}

1 & \text{if MT output length} \gt \text{reference output length}\\

\exp(1-\text{reference output length}/\text{MT output length}) & \text{otherwise}

\end{cases}

\]

- The Bleu score was revolutionary for MT
- it gave a pretty good, by no means perfect, single real-number evaluation metric

- Open-source implementations exist (a combined-score sketch follows below)
- Today it is used to evaluate many text generation systems
- translation
- image captioning

- Bleu is not used for speech recognition
- because speech recognition has a single ground-truth transcript
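Building on `modified_precision` above, a combined-score sketch (my own illustration; it keeps \(\log p_n\) as in the Bleu paper, assumes all \(p_n > 0\), and for simplicity uses the shortest reference length in the brevity penalty):

```python
import math

def bleu(candidate, references, max_n=4):
    """BP * exp(mean of log p_n over n = 1..4)."""
    c = len(candidate.split())
    r = min(len(ref.split()) for ref in references)  # simplified reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    mean_log_p = sum(math.log(modified_precision(candidate, references, n))
                     for n in range(1, max_n + 1)) / max_n
    return bp * math.exp(mean_log_p)
```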

### Attention Model Intuition

#### The problem of long sequences

- Encoder-Decoder architecture
- it works quite well for short sentences
- for very long sentences, maybe longer than 30-40 words, performance comes down, because the encoder must memorize the whole sentence before decoding

#### Attention Model Intuition

### Attention Model

Attention model

\[

\begin{align}

a^{<t'>}&=(\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\

\sum_{t'} \alpha^{<t,t'>}&=1\\

c^{<1>}&=\sum_{t'}\alpha^{<1,t'>}a^{<t'>}\\

\alpha^{<t,t'>}&=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}

\end{align}

\]

#### Computing attention \(\alpha^{<t,t'>}\)

\[

\displaylines{

\alpha^{<t,t'>}=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}\\

\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}\\

\begin{matrix}

s^{<t-1>}\rightarrow \\

a^{<t'>}\rightarrow

\end{matrix}

\boxed{\text{small NN}} \rightarrow e^{<t,t'>}

}

\]

- Papers: Bahdanau et al. (2014), “Neural Machine Translation by Jointly Learning to Align and Translate”; Xu et al. (2015), “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”
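A numpy sketch of one attention step (my own illustration; `W` and `v` stand in for the small network’s parameters, and all shapes are made up):

```python
import numpy as np

def attention_step(a, s_prev, W, v):
    """Score each encoder activation a[t'] against the previous decoder
    state s_prev with a one-hidden-layer network, softmax the scores into
    alphas (which sum to 1), and return the weighted-sum context."""
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_t])) for a_t in a])
    alphas = np.exp(e - e.max())   # numerically stable softmax
    alphas /= alphas.sum()         # sum over t' of alpha<t,t'> = 1
    context = alphas @ a           # c<t> = sum over t' of alpha<t,t'> * a<t'>
    return context, alphas

# toy shapes: T_x = 5 encoder steps, activations of size 4, decoder state of size 3
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 4)); s_prev = rng.normal(size=3)
W = rng.normal(size=(8, 7)); v = rng.normal(size=8)
context, alphas = attention_step(a, s_prev, W, v)
```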

#### Attention examples

- date normalization
- July 20th 1969
- 23 April, 1564

## Speech recognition – Audio data

### Speech recognition

#### Speech recognition problem

- x: audio clip
- y: transcript
- the quick brown fox
- phonemes: de ku wik bra…
- Once upon a time, speech recognition systems were built using phonemes, hand-engineered basic units of sound.

- academic datasets on speech recognition: ~300 h
- 3,000 h is a reasonable size
- commercial systems: 10,000 h, sometimes 100,000 h or more

#### CTC cost for speech recognition

(Connectionist temporal classification)

- bidirectional LSTM or bidirectional GRU
- the number of input time steps is very large
- 10 s of audio at 100 Hz = 1,000 inputs

- “the quick brown fox” (19 characters, including spaces)
- The CTC cost function allows the RNN to generate an output like:
- ttt_h_eee_____ ____qqqq___
- which collapses to “the q”

- Basic rule: collapse repeated characters not separated by a blank (“_”), then remove the blanks (see the sketch below)
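A sketch of that collapsing rule (my own illustration):

```python
def ctc_collapse(seq, blank="_"):
    """Collapse repeated characters not separated by a blank, then remove
    the blanks (the special CTC 'blank' symbol, not the space character)."""
    out, prev = [], None
    for ch in seq:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

print(ctc_collapse("ttt_h_eee_____ ____qqqq___"))  # -> "the q"
```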

### Trigger Word Detection

#### What is trigger word detection?

- Amazon Echo (Alexa)
- Baidu DuerOS (xiaodunihao)
- Apple Siri (Hey Siri)
- Google Home (Okay Google)

#### Trigger word detection algorithm

- The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on the best algorithm for trigger word detection.
- \(x^{<t>}\): features from an audio clip, maybe computed as spectrogram features, then passed through an RNN
- right after someone finishes saying the trigger word, set the target
- \(y^{<t>}=1\)
- This creates a very imbalanced training set, with many more zeros than ones, and it’s hard to train.

- Instead of setting only a single time step to output one, you can make it **output ones for several time steps**, a fixed period of time, before reverting back to zero. This **slightly evens out the ratio of ones to zeros** (see the labeling sketch below).
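A labeling sketch (my own illustration; `ones_len=50`, `T_y=1375`, and the trigger positions are assumed values, not from the lecture):

```python
import numpy as np

def make_targets(T_y, trigger_end_steps, ones_len=50):
    """y<t> = 1 for a fixed window of steps right after each trigger word
    ends, instead of a single 1, to even out the ones/zeros ratio."""
    y = np.zeros(T_y, dtype=int)
    for t in trigger_end_steps:
        y[t : t + ones_len] = 1   # numpy clips the slice at T_y
    return y

y = make_targets(T_y=1375, trigger_end_steps=[420, 900])
print(y.sum())  # 100 ones out of 1375 steps
```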

## Conclusion

### Conclusion and thank you

#### Specialization outline

- Neural Networks and Deep Learning
- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- Structuring Machine Learning Projects
- Convolutional Neural Networks
- Sequence Models

## Programming assignments

### Neural Machine Translation

- Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
- An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
- A network using an attention mechanism can translate from inputs of length \(T_x\) to outputs of length \(T_y\), where \(T_x\) and \(T_y\) can be different.
- You can visualize the attention weights \(\alpha^{<t,t'>}\) to see what the network is paying attention to while generating each output.

### Trigger word detection

- Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
- Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
- An end-to-end deep learning approach can be used to build a very effective trigger word detection system.