
# DL [Course 5/5] Sequence Models [Week 3/3] Sequence models & Attention mechanism

## Various sequence to sequence architectures

### Basic Models

#### Sequence to sequence model

$$x$$: Jane visite l’Afrique en septembre

$$y$$: Jane is visiting Africa in September.

#### Image captioning

$\displaylines{ 💺😺 \underbrace{\rightarrow}_{11×11,s=4} \boxed{(55,55,96)} \underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2} \boxed{(27,27,96)} \underbrace{\rightarrow}_{5×5,\text{same}}\\ \boxed{(27,27,256)} \underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2} \boxed{(13,13,256)} \underbrace{\rightarrow}_{3×3,\text{same}} \boxed{(13,13,384)} \underbrace{\rightarrow}_{3×3}\\ \boxed{(13,13,384)} \underbrace{\rightarrow}_{3×3} \boxed{(13,13,256)} \underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2} \boxed{(6,6,256)}\\ \rightarrow \text{FC}_{9216} \rightarrow \text{FC}_{4096} \rightarrow \text{FC}_{4096} \rightarrow \underbrace{\boxed{y^{<1>},y^{<2>},\dots,y^{<T_y>}}}_{\text{A cat sitting on a chair}} }$

• paper

### Picking the most likely sentence

#### Machine translation as building a conditional language model

##### Language model:
• $$P(y^{<1>},\dots,y^{<T_y>})$$
• The machine translation model is very similar to the language model, except that instead of always starting with a vector of all zeros, the decoder starts from the encoder’s representation of the input sentence
##### Machine translation:
• $$P(y^{<1>},\dots,y^{<T_y>}|x^{<1>},\dots,x^{<T_x>})$$
• “conditional language model”
• Encoder: maps the input sentence to an encoding that initializes the decoder
• Decoder: language model

#### Finding the most likely translation

• Jane visite l’Afrique en septembre.
• Jane is visiting Africa in September.
• Jane is going to be visiting Africa in September.
• In September, Jane will visit Africa.
• Her African friend welcomed Jane in September.

$\underset{y^{<1>},\dots,y^{<T_y>}}{\arg \max}\ P(\underbrace{y^{<1>},\dots,y^{<T_y>}}_{\text{English}}| \underbrace{x}_{\text{French}})$

#### Why not a greedy search?

• Jane is visiting Africa in September.
• better translation
• Jane is going to be visiting Africa in September.
• not a bad translation but verbose
• $$P(\text{Jane is going}|x) > P(\text{Jane is visiting}|x)$$
• because “going” is a more common English word
• And, of course, the total number of combinations of words in the English sentence is exponentially large.
• This is a huge space of possible sentences, and it’s impossible to rate them all, which is why the most common approach is an approximate search algorithm.

### Beam Search

#### Beam search algorithm

• $$P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})$$
• jane visits africa in september.<EOS>
• The outcome of this process is that beam search adds one word at a time until it selects <EOS> as the most likely next symbol.
• If the beam width is 1, this essentially becomes the greedy search algorithm.
• Length normalization is a small change to the beam search algorithm that can help get much better results.
• Beam search maximizes the product $$\prod_{t=1}^{T_y} P(y^{<t>}|x,y^{<1>},\dots,y^{<t-1>})$$, where $$T_y$$ is the total number of words in the output (the first formula in the next section).
• In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations.
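The procedure above can be sketched in plain Python. This is a minimal sketch, not the course’s implementation: `step_log_probs` is a hypothetical callable standing in for the decoder network, returning `log P(next token | x, prefix)` for each vocabulary token.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos_id):
    """Toy beam search. Keeps only the `beam_width` most likely
    partial sentences at each step; `step_log_probs(prefix)` is a
    stand-in for the decoder network (an assumption, not the
    course's API)."""
    beams = [([], 0.0)]           # (prefix, cumulative log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # keep only the B best partial sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # crude tie-break: compare finished and unfinished by raw score
    return max(finished + beams, key=lambda c: c[1])
```

Setting `beam_width=1` makes this equivalent to greedy search, which is the point made above.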

### Refinements to Beam Search

#### Length normalization

$\displaylines{ \arg \max_y \overbrace{\prod_{t=1}^{T_y} P(y^{<t>}| x,y^{<1>},\dots,y^{<t-1>})}^{P(y^{<1>},\dots,y^{<T_y>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})\dots P(y^{<T_y>}|x,y^{<1>},\dots,y^{<T_y-1>})}\\ \arg \max_y \underbrace{\sum_{t=1}^{T_y} \log P(y^{<t>}| x,y^{<1>},\dots,y^{<t-1>})}_{T_y=1,2,3,\dots,30}\\ \frac{1}{T_y^\alpha}\sum_{t=1}^{T_y} \log P(y^{<t>}| x,y^{<1>},\dots,y^{<t-1>})\\ }$

• Probabilities are less than 1; a product of many of them can get too small and result in numerical underflow.
• Instead of maximizing this product, we take logs and maximize the sum of log probabilities.
• $$\alpha =0.7$$
• softer approach
• $$\alpha =1$$
• completely normalizing by length
• $$\alpha =0$$
• no normalization
• Finally, of all the sentences considered, pick the one that achieves the highest value on this normalized log-probability objective.
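As a small worked example of the normalized objective (the probability values are illustrative, not from the course):

```python
import math

def normalized_score(token_log_probs, alpha=0.7):
    """Length-normalized log-likelihood used to re-rank beam search
    candidates: (1 / Ty^alpha) * sum_t log P(y<t> | x, y<1..t-1>)."""
    Ty = len(token_log_probs)
    return sum(token_log_probs) / (Ty ** alpha)

# Without normalization, a short low-probability sentence can beat
# a longer one simply because it has fewer factors below 1.
short = [math.log(0.2)] * 2   # P = 0.04
long = [math.log(0.4)] * 4    # P = 0.0256
```

With `alpha=0.7` the longer candidate wins; with `alpha=0` the objective reduces to the raw sum of log probabilities.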

#### Beam search discussion

Beam width B?

• Larger B
• considers more possibilities, so it tends to find better sentences
• but more computationally expensive, because you’re also keeping a lot more possibilities around.
• slower and memory requirements will grow.
• Smaller B
• worse result
• faster
• in production system
• B=10 is not uncommon
• B=100 is very large
• in research system (publish, best possible result)
• B=1000,3000 is not uncommon
• domain dependent

Unlike exact search algorithms like BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum for $$\arg \max_y P(y|x)$$.

### Error analysis in beam search

#### Example

• Jane visite l’Afrique en septembre.
• Human: Jane visits Africa in September. $$y^*$$
• Algorithm: Jane visited Africa last September. $$\hat y$$
• the model consists of 2 components:
• RNN: encoder and decoder
• Beam search
• given an error (a bad translation), which component is more to blame?
• should we increase B, or collect more training data?

#### Error analysis on beam search

• Case 1: $$P(y^*|x) \gt P(\hat y|x)$$
• Beam search chose $$\hat y$$. But $$y^*$$ attains a higher $$P(y|x)$$.
• Conclusion: Beam search is at fault.
• Case 2: $$P(y^*|x) \leq P(\hat y|x)$$
• $$y^*$$ is a better translation than $$\hat y$$. But RNN predicted $$P(y^*|x) \lt P(\hat y|x)$$.
• Conclusion: RNN model is at fault.(rather than to the search algorithm)

#### Error analysis process

Figure out what fraction of errors is “due to” beam search vs. the RNN model.
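The bookkeeping for this error analysis can be sketched as follows; each example carries a pair of model probabilities $$P(y^*|x)$$ and $$P(\hat y|x)$$ evaluated on dev-set errors (the numeric values in the usage below are hypothetical):

```python
def attribute_errors(examples):
    """Attribute each dev-set error to beam search or the RNN.
    Case 1: P(y*|x) > P(y_hat|x)  -> beam search is at fault.
    Case 2: P(y*|x) <= P(y_hat|x) -> the RNN model is at fault.
    `examples` is a list of (p_star, p_hat) pairs."""
    beam_faults = sum(1 for p_star, p_hat in examples if p_star > p_hat)
    rnn_faults = len(examples) - beam_faults
    return beam_faults / len(examples), rnn_faults / len(examples)
```

If most errors fall in Case 1, increasing B is worth trying; if most fall in Case 2, work on the model (architecture, regularization, more data) instead.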

### Bleu Score

#### Evaluating machine translation

• French: Le chat est sur le tapis.
• Reference 1: The cat is on the mat.
• Reference 2: There is a cat on the mat.
• MT output: the the the the the the the.
• Precision: 7/7
• every one of these 7 words appears in R1 or R2
• this is not a particularly useful measure
• Modified precision: 2/7
• each word gets credit only up to the maximum number of times it appears in any one reference: “the” appears twice in R1 and once in R2, so the cap is 2 (R1 gives 2/7; R2 alone would give 1/7)

#### Bleu score on bigrams

• Example:
• Reference 1: The cat is on the mat.
• Reference 2: there is a cat on the mat.
• MT output: The cat the cat on the mat.

Modified bigram precision: 4/6

#### Bleu score on unigrams

• Example:
• Reference 1: The cat is on the mat.
• Reference 2: there is a cat on the mat.
• MT output: The cat the cat on the mat.

$\displaylines{ p_1=\frac{\sum_{\text{unigram} \in \hat y} \text{Count}_{\text{clip}}(\text{unigram})}{\sum_{\text{unigram} \in \hat y} \text{Count}(\text{unigram})}\\ p_n=\frac{\sum_{\text{n-gram} \in \hat y} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat y} \text{Count}(\text{n-gram})} }$

• if the MT output is exactly the same as a reference:
• $$p_1, p_2 = 1.0$$
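The clipped-count formula above can be sketched directly; the tokenization (lowercased, whitespace-split) is an illustrative choice:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n: each candidate n-gram gets
    credit at most as many times as it appears in any single
    reference sentence."""
    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand = ngrams(candidate, n)
    clipped = 0
    for gram, count in cand.items():
        max_ref = max(ngrams(ref, n)[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand.values())
```

On the lecture examples this reproduces the 2/7 modified unigram precision and the 4/6 bigram precision.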

#### Bleu details

• $$p_n =$$ Bleu score on n-grams only
• Combined Bleu score:
• $$BP \cdot \exp(\frac14 \sum_{n=1}^4 p_n)$$
• BP: brevity penalty
• $BP=\begin{cases} 1 & (\text{MT\_output\_length} \gt \text{reference\_output\_length})\\ \exp(1-\text{reference\_output\_length}/\text{MT\_output\_length}) & (\text{otherwise}) \end{cases}$
• Bleu score was revolutionary for MT
• gave a pretty good, by no means perfect, but pretty good single real number evaluation metric
• Open-source implementations exist
• Today it is used to evaluate many text generation systems
• translation
• image captioning
• Bleu is not used for speech recognition
• because speech recognition has a single ground-truth transcript
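A minimal sketch of the brevity penalty and the combined score, following the slide’s formula above (note that the standard BLEU definition averages $$\log p_n$$ rather than $$p_n$$):

```python
import math

def brevity_penalty(mt_len, ref_len):
    """BP = 1 if the MT output is longer than the reference,
    else exp(1 - ref_len / mt_len), penalizing short outputs."""
    if mt_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / mt_len)

def combined_bleu(p_ns, mt_len, ref_len):
    """Combined score BP * exp((1/4) * sum_n p_n), as written on the
    course slide (standard BLEU uses log p_n inside the sum)."""
    bp = brevity_penalty(mt_len, ref_len)
    return bp * math.exp(sum(p_ns) / len(p_ns))
```

The brevity penalty is what stops a system from gaming the precision-based score by emitting very short, very precise outputs.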

### Attention Model Intuition

#### The problem of long sequences

• Encoder-Decoder architecture
• it works quite well for short sentences
• for very long sentences, maybe longer than 30–40 words, performance comes down

### Attention Model

Attention model

\begin{align} a^{<t'>}&=(\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\ \sum_{t'} \alpha^{<t,t'>}&=1\\ c^{<1>}&=\sum_{t'}\alpha^{<1,t'>}a^{<t'>}\\ \alpha^{<t,t'>}&=\text{ amount of attention }y^{<t>} \text{ should pay to } a^{<t'>} \end{align}

#### Computing attention $$\alpha^{<t,t'>}$$

$\displaylines{ \alpha^{<t,t'>}=\text{ amount of attention }y^{<t>} \text{ should pay to } a^{<t'>}\\ \alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}\\ \begin{matrix} s^{<t-1>}\rightarrow \\ a^{<t'>}\rightarrow \end{matrix} \boxed{\text{small NN}} \rightarrow e^{<t,t'>} }$

• paper
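The softmax over the alignment energies $$e^{<t,t'>}$$ and the resulting context vector can be sketched with plain-Python lists for clarity (in practice these are tensor operations):

```python
import math

def attention_weights(energies):
    """Softmax over the alignment energies e<t,t'> for one output
    step t, so the attention weights alpha<t,t'> sum to 1 over t'."""
    m = max(energies)                       # subtract max for numerical stability
    exps = [math.exp(e - m) for e in energies]
    s = sum(exps)
    return [x / s for x in exps]

def context(alphas, activations):
    """Context vector c<t> = sum over t' of alpha<t,t'> * a<t'>,
    where each a<t'> is the encoder activation vector at step t'."""
    dim = len(activations[0])
    return [sum(a * act[d] for a, act in zip(alphas, activations))
            for d in range(dim)]
```

The energies themselves come from the small network shown above, fed with $$s^{<t-1>}$$ and $$a^{<t'>}$$; that network is learned jointly with the rest of the model.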

#### Attention examples

• date normalization
• July 20th 1969
• 23 April, 1564

## Speech recognition – Audio data

### Speech recognition

#### Speech recognition problem

• x: audio clip
• y: transcript
• the quick brown fox
• phonemes: de ku wik bra…
• once upon a time, speech systems were built from phonemes, hand-engineered basic units of sound
• academic datasets on speech recognition: ~300 h
• 3000 h is a reasonable size
• commercial systems: 10,000 h, sometimes 100,000 h or more

#### CTC cost for speech recognition

(Connectionist temporal classification)

• bidirectional LSTM or bidirectional GRU
• the number of input timesteps is very large
• 10 s of audio at 100 Hz = 1000 inputs
• “the quick brown fox” (19 characters including spaces)
• CTC cost function allows the RNN to generate an output like:
• ttt_h_eee_____ ____qqqq___
• the q
• Basic rule: collapse repeated characters not separated by blank
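The collapse rule can be sketched as follows; the underscore stands for the CTC blank, matching the example above:

```python
def ctc_collapse(chars, blank='_'):
    """Collapse a CTC output string: merge runs of repeated
    characters, then drop blanks. Repeated characters that are
    separated by a blank stay distinct."""
    out = []
    prev = None
    for c in chars:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)
```

This is why the 1000-step output above can still represent the 19-character transcript: the network is free to pad with repeats and blanks.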

### Trigger Word Detection

#### What is trigger word detection?

• Amazon Echo (Alexa)
• Baidu DuerOS (xiaodunihao)
• Apple Siri (Hey Siri)

#### Trigger word detection algorithm

• The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on what’s the best algorithm for trigger word detection.
• $$x^{<t>}$$: from an audio clip, maybe compute spectrogram features, then pass them through an RNN
• right after someone finishes saying the trigger word:
• $$y^{<t>}=1$$
• This creates a very imbalanced training set with far more zeros than ones, which is hard to train.
• Instead of setting only a single timestep’s output to 1, you can set the output to 1 for several consecutive timesteps
• for a fixed period of time before reverting back to zero
• this slightly evens out the ratio of ones to zeros
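The labeling scheme can be sketched as follows; `ones_width=50` is an illustrative choice, not a value from the course:

```python
def make_labels(num_steps, trigger_end_steps, ones_width=50):
    """Target labels for trigger word detection: after each timestep
    where the trigger word ends, output 1 for `ones_width` steps
    (instead of a single 1) to even out the ones/zeros ratio."""
    y = [0] * num_steps
    for end in trigger_end_steps:
        for t in range(end, min(end + ones_width, num_steps)):
            y[t] = 1
    return y
```

The labels stay mostly zero, but each detection now contributes `ones_width` positive targets instead of one, which makes the training signal less sparse.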

## Conclusion

### Conclusion and thank you

#### Specialization outline

• Neural Networks and Deep Learning
• Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
• Structuring Machine Learning Projects
• Convolutional Neural Networks
• Sequence Models

## Programming assignments

### Neural Machine Translation

• Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
• An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
• A network using an attention mechanism can translate from inputs of length $$T_x$$ to outputs of length $$T_y$$, where $$T_x$$ and $$T_y$$ can be different.
• You can visualize attention weights $$\alpha^{<t,t'>}$$ to see what the network is paying attention to while generating each output.

### Trigger word detection

• Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
• Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
• An end-to-end deep learning approach can be used to build a very effective trigger word detection system.