## Various sequence to sequence architectures

### Basic Models

#### Sequence to sequence model

\(x\): Jane visite l’Afrique en septembre

\(y\): Jane is visiting Africa in September.

- Papers: Sutskever et al. (2014), “Sequence to Sequence Learning with Neural Networks”; Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

#### Image captioning

\[

\displaylines{

💺😺

\underbrace{\rightarrow}_{11×11,s=4}

\boxed{(55,55,96)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(27,27,96)}

\underbrace{\rightarrow}_{5×5,\text{same}}\\

\boxed{(27,27,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}\\

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(6,6,256)}\\

=\boxed{FC_{9216}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \underbrace{\boxed{y^{<1>},y^{<2>},\dots,y^{<T_y>}}}_{\text{A cat sitting on a chair}}

}

\]

- Papers: Mao et al. (2014), “Deep Captioning with Multimodal Recurrent Neural Networks”; Vinyals et al. (2014), “Show and Tell: Neural Image Caption Generator”; Karpathy and Fei-Fei (2015), “Deep Visual-Semantic Alignments for Generating Image Descriptions”
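A shape-check sketch of this pipeline (assuming PyTorch; the notes name no framework, and the local response normalization from the original AlexNet is omitted):

```python
import torch
import torch.nn as nn

# AlexNet-style encoder; comments track the (H, W, C) shapes from the diagram
encoder = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),     # (227,227,3) -> (55,55,96)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (27,27,96)
    nn.Conv2d(96, 256, kernel_size=5, padding=2),   # "same" -> (27,27,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (13,13,256)
    nn.Conv2d(256, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),  # -> (13,13,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (6,6,256)
    nn.Flatten(),                                   # -> 9216
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
)
features = encoder(torch.randn(1, 3, 227, 227))     # shape (1, 4096)
```

For captioning, this 4096-dimensional encoding replaces the RNN decoder's initial all-zeros state, and the decoder then generates \(y^{<1>},\dots,y^{<T_y>}\).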

### Picking the most likely sentence

#### Machine translation as building a conditional language model

##### Language model:

- \( P(y^{<1>},\dots,y^{<T_y>})\)
- The machine translation model is very similar to the language model, except that instead of always starting from a vector of all zeros, the decoder starts from the encoder’s representation of the input sentence.

##### Machine translation:

- \( P(y^{<1>},\dots,y^{<T_y>}|x^{<1>},\dots,x^{<T_x>})\)
- “conditional language model”
- Encoder: maps the input sentence to an encoding
- Decoder: a language model conditioned on that encoding (a minimal sketch follows)
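A minimal sketch of the encoder-decoder idea (my own illustration, assuming PyTorch and GRU layers; real systems add attention, covered below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder maps x<1..Tx> to a fixed encoding; the decoder is a
    conditional language model started from that encoding."""
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, x, y_in):
        _, h = self.encoder(self.src_emb(x))        # encoding of the French sentence
        d, _ = self.decoder(self.tgt_emb(y_in), h)  # decoder conditioned on it
        return self.out(d)                          # scores for P(y<t> | x, y<1..t-1>)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 9)))
```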

#### Finding the most likely translation

- Jane visite l’Afrique en septembre.
- Jane is visiting Africa in September.
- Jane is going to be visiting Africa in September.
- In September, Jane will visit Africa.
- Her African friend welcomed Jane in September.

\[

\arg\max_{y^{<1>},\dots,y^{<T_y>}} P(\underbrace{y^{<1>},\dots,y^{<T_y>}}_{\text{English}}\mid \underbrace{x}_{\text{French}})

\]

#### Why not a greedy search?

- Jane is visiting Africa in September.
- better translation

- Jane is going to be visiting Africa in September.
- not a bad translation but verbose

- \( P(\text{Jane is going}|x) > P(\text{Jane is visiting}|x) \)
- “going” is a more common English word, so after picking “Jane is”, a greedy search is likely to prefer it

- And, of course, the total number of combinations of words in the English sentence is exponentially large.
- So this is a huge space of possible sentences, and it’s impossible to rate them all, which is why the most common thing to do is to use an approximate search algorithm.

### Beam Search

#### Beam search algorithm

- \( P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>}) \)
- jane visits africa in september.<EOS>
- The outcome of this process is that beam search adds one word at a time until it selects <EOS> as the most likely next symbol.

- If the beam width is 1, beam search essentially becomes the greedy search algorithm.
- Length normalization is a small change to the beam search algorithm that can help get much better results.
- Beam search maximizes the probability in the first formula below: the product of all the conditional word probabilities, where \(T_y\) is the total number of words in the output.
- In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations. (A minimal sketch of the algorithm follows.)
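A minimal beam-search sketch (my own illustration, not the course’s code; `step_log_probs` is a hypothetical stand-in for the decoder, returning `log P(word | x, prefix)` for each candidate word):

```python
import math

def beam_search(step_log_probs, eos="<EOS>", beam_width=3, max_len=30, alpha=0.7):
    """Keep the B most likely partial sentences at each step; score with
    summed log-probabilities and length-normalize at the end."""
    beams = [([], 0.0)]              # (partial sentence, sum of log P)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # a beam that emits <EOS> is finished
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    completed += beams               # in case max_len was hit first
    # normalized log-probability objective: score / T_y ** alpha
    return max(completed, key=lambda c: c[1] / len(c[0]) ** alpha)
```

With `beam_width=1` this reduces to greedy search.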

### Refinements to Beam Search

#### Length normalization

\[

\displaylines{

\arg\max_y \overbrace{\prod_{t=1}^{T_y} P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}^{P(y^{<1>},\dots,y^{<T_y>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})\cdots P(y^{<T_y>}|x,y^{<1>},\dots,y^{<T_y-1>})}\\

\arg\max_y \underbrace{\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}_{T_y=1,2,3,\dots,30}\\

\frac{1}{T_y^\alpha}\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})

}

\]

- Probabilities are less than 1; multiplying many of them together can result in numerical underflow.
- Instead of maximizing this product, we take logs and maximize the sum.
- \(\alpha =0.7\)
- softer approach

- \(\alpha =1\)
- completely normalizing by length

- \(\alpha =0\)
- no normalization

- Finally, among all of the candidate sentences, you pick the one that achieves the highest value on this normalized log-probability objective (a numeric sketch follows).
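A tiny numeric illustration of the underflow point (my own example, not from the lecture):

```python
import math

# the product of many probabilities underflows to 0.0 in floating point
probs = [0.01] * 300
print(math.prod(probs))                   # 0.0 (true value is 1e-600)

# the sum of logs stays perfectly representable
log_score = sum(math.log(p) for p in probs)
print(log_score)                          # about -1381.55

# normalized log-probability objective with alpha = 0.7
alpha = 0.7
print(log_score / len(probs) ** alpha)    # comparable across output lengths
```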

#### Beam search discussion

Beam width B?

- Larger B
- considers more possibilities, so it tends to find a better sentence
- but more computationally expensive, because you’re also keeping a lot more possibilities around
- slower, and memory requirements grow

- Smaller B
- worse result
- faster

- in production system
- B=10 is not uncommon
- B=100 is very large

- in research system (publish, best possible result)
- B=1000,3000 is not uncommon
- domain dependent

Unlike exact search algorithms like BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum of \(\arg\max_y P(y|x)\).

### Error analysis in beam search

#### Example

- Jane visite l’Afrique en septembre.
- Human: Jane visits Africa in September. \(y^*\)
- Algorithm: Jane visited Africa last September. \(\hat y\)
- The model consists of 2 components:
- RNN: encoder and decoder
- Beam search

- Given an error (a bad translation)
- which component is more to blame?

- Increase B? Or get more training data?

#### Error analysis on beam search

- Case 1: \(P(y^*|x) \gt P(\hat y|x)\)
- Beam search chose \(\hat y\), but \(y^*\) attains a higher \(P(y|x)\).
- Conclusion: Beam search is at fault.

- Case 2: \(P(y^*|x) \leq P(\hat y|x)\)
- \(y^*\) is a better translation than \(\hat y\). But RNN predicted \(P(y^*|x) \lt P(\hat y|x)\).
- Conclusion: The RNN model is at fault (rather than the search algorithm).

#### Error analysis process

| Human | Algorithm | \(P(y^*\mid x)\) | \(P(\hat y\mid x)\) | At fault? |
| --- | --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | \(2\times 10^{-10}\) | \(1\times 10^{-10}\) | B |
| … | … | … | … | R |
| … | … | … | … | B |

Figure out what fraction of errors are “due to” beam search vs. the RNN model (a small tallying sketch follows).
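A tiny tallying sketch (my own illustration; only the first probability pair comes from the table above, the rest are made up):

```python
def at_fault(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x) -> beam search ("B") is at fault.
    Case 2: otherwise -> the RNN model ("R") is at fault."""
    return "B" if p_star > p_hat else "R"

errors = [(2e-10, 1e-10), (3e-12, 9e-12), (5e-11, 4e-11)]
faults = [at_fault(ps, ph) for ps, ph in errors]
print(faults.count("B") / len(faults))  # fraction of errors due to beam search
```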

### Bleu Score

#### Evaluating machine translation

- French: Le chat est sur le tapis.
- Reference 1: The cat is on the mat.
- Reference 2: There is a cat on the mat.
- MT output: the the the the the the the.
- Precision: 7/7
- every one of these 7 words (“the”) appears in Reference 1 or Reference 2

- this is not a particularly useful measure.

- Modified precision: 2/7 (“the” appears at most twice in Reference 1; counting against Reference 2 alone would give 1/7)
- credit each word only up to the maximum number of times it appears in the reference sentences.

- Bleu: Bilingual Evaluation Understudy
- Paper: Papineni et al. (2002), “Bleu: a Method for Automatic Evaluation of Machine Translation”

#### Bleu score on bigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

| Bigram | Count | Count_clip |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |

Modified bigram precision: \(p_2 = 4/6\)

#### Bleu score on unigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

\[

\displaylines{

p_1=\frac{\sum_{\text{unigram} \in \hat y} Count_{clip}(\text{unigram})}{\sum_{\text{unigram} \in \hat y} Count(\text{unigram})}\\

p_n=\frac{\sum_{\text{n-gram} \in \hat y} Count_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat y} Count(\text{n-gram})}

}

\]

- If the MT output is exactly the same as one of the references:
- \(p_1 = p_2 = 1.0\) (see the sketch below)
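A sketch of the clipped-count computation behind \(p_n\) (my own illustration, assuming lower-cased whitespace tokenization with periods stripped):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """p_n: clip each candidate n-gram count at its maximum count
    over the reference sentences."""
    tokens = lambda s: s.lower().replace(".", "").split()
    ngrams = lambda ws: Counter(zip(*(ws[i:] for i in range(n))))
    cand = ngrams(tokens(candidate))
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(tokens(ref)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("The cat the cat on the mat.", refs, 2))   # 4/6
print(modified_precision("the the the the the the the.", refs, 1))  # 2/7
```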

#### Bleu details

- \(p_n =\) Bleu score on n-grams only
- Combined Bleu score:
- \(BP \cdot \exp\left(\frac14 \sum_{n=1}^4 p_n\right)\) (the original Bleu paper puts \(\log p_n\) inside the sum)

- BP: brevity penalty
- \[

BP=\begin{cases}

1 & \text{if MT output length} \gt \text{reference output length}\\

\exp(1-\text{reference output length}/\text{MT output length}) & \text{otherwise}

\end{cases}

\]

- The Bleu score was revolutionary for MT
- it gave a pretty good, by no means perfect, single real-number evaluation metric

- Open-source implementations exist (a combined-score sketch follows below)
- Today it is used to evaluate many text generation systems
- translation
- image captioning

- Bleu is not used for speech recognition
- because speech recognition has a single ground-truth transcript
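Building on `modified_precision` above, a combined-score sketch (my own illustration; it keeps \(\log p_n\) as in the Bleu paper, assumes all \(p_n > 0\), and for simplicity uses the shortest reference length in the brevity penalty):

```python
import math

def bleu(candidate, references, max_n=4):
    """BP * exp(mean of log p_n over n = 1..4)."""
    c = len(candidate.split())
    r = min(len(ref.split()) for ref in references)  # simplified reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    mean_log_p = sum(math.log(modified_precision(candidate, references, n))
                     for n in range(1, max_n + 1)) / max_n
    return bp * math.exp(mean_log_p)
```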

### Attention Model Intuition

#### The problem of long sequences

- Encoder-Decoder architecture
- it works quite well for short sentences
- for very long sentences, maybe longer than 30-40 words, performance comes down, because the encoder must memorize the whole sentence before decoding

#### Attention Model Intuition

### Attention Model

Attention model

\[

\begin{align}

a^{<t'>}&=(\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\

\sum_{t'} \alpha^{<t,t'>}&=1\\

c^{<1>}&=\sum_{t'}\alpha^{<1,t'>}a^{<t'>}\\

\alpha^{<t,t'>}&=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}

\end{align}

\]

#### Computing attention \(\alpha^{<t,t'>}\)

\[

\displaylines{

\alpha^{<t,t'>}=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}\\

\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}\\

\begin{matrix}

s^{<t-1>}\rightarrow \\

a^{<t'>}\rightarrow

\end{matrix}

\boxed{\text{small NN}} \rightarrow e^{<t,t'>}

}

\]

- Papers: Bahdanau et al. (2014), “Neural Machine Translation by Jointly Learning to Align and Translate”; Xu et al. (2015), “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”
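A numpy sketch of one attention step (my own illustration; `W` and `v` stand in for the small network’s parameters, and all shapes are made up):

```python
import numpy as np

def attention_step(a, s_prev, W, v):
    """Score each encoder activation a[t'] against the previous decoder
    state s_prev with a one-hidden-layer network, softmax the scores into
    alphas (which sum to 1), and return the weighted-sum context."""
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_t])) for a_t in a])
    alphas = np.exp(e - e.max())   # numerically stable softmax
    alphas /= alphas.sum()         # sum over t' of alpha<t,t'> = 1
    context = alphas @ a           # c<t> = sum over t' of alpha<t,t'> * a<t'>
    return context, alphas

# toy shapes: T_x = 5 encoder steps, activations of size 4, decoder state of size 3
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 4)); s_prev = rng.normal(size=3)
W = rng.normal(size=(8, 7)); v = rng.normal(size=8)
context, alphas = attention_step(a, s_prev, W, v)
```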

#### Attention examples

- date normalization
- July 20th 1969
- 23 April, 1564

## Speech recognition – Audio data

### Speech recognition

#### Speech recognition problem

- x: audio clip
- y: transcript
- the quick brown fox
- phonemes: de ku wik bra…
- Once upon a time, speech recognition systems were built using phonemes, hand-engineered basic units of sound.

- academic datasets on speech recognition: ~300 h
- 3,000 h is a reasonable size
- commercial systems: 10,000 h, sometimes 100,000 h or more

#### CTC cost for speech recognition

(Connectionist temporal classification)

- bidirectional LSTM or bidirectional GRU
- the number of input time steps is very large
- 10 s of audio at 100 Hz = 1,000 inputs

- “the quick brown fox” (19 characters, including spaces)
- The CTC cost function allows the RNN to generate an output like:
- ttt_h_eee_____ ____qqqq___
- which collapses to “the q”

- Basic rule: collapse repeated characters not separated by a blank (“_”), then remove the blanks (see the sketch below)
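A sketch of that collapsing rule (my own illustration):

```python
def ctc_collapse(seq, blank="_"):
    """Collapse repeated characters not separated by a blank, then remove
    the blanks (the special CTC 'blank' symbol, not the space character)."""
    out, prev = [], None
    for ch in seq:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

print(ctc_collapse("ttt_h_eee_____ ____qqqq___"))  # -> "the q"
```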

### Trigger Word Detection

#### What is trigger word detection?

- Amazon Echo (Alexa)
- Baidu DuerOS (xiaodunihao)
- Apple Siri (Hey Siri)
- Google Home (Okay Google)

#### Trigger word detection algorithm

- The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on the best algorithm for trigger word detection.
- \(x^{<t>}\): features from an audio clip, maybe computed as spectrogram features, then passed through an RNN
- right after someone finishes saying the trigger word, set the target
- \(y^{<t>}=1\)
- This creates a very imbalanced training set, with many more zeros than ones, and it’s hard to train.

- Instead of setting only a single time step to output one, you can make it **output ones for several time steps**, a fixed period of time, before reverting back to zero. This **slightly evens out the ratio of ones to zeros** (see the labeling sketch below).
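A labeling sketch (my own illustration; `ones_len=50`, `T_y=1375`, and the trigger positions are assumed values, not from the lecture):

```python
import numpy as np

def make_targets(T_y, trigger_end_steps, ones_len=50):
    """y<t> = 1 for a fixed window of steps right after each trigger word
    ends, instead of a single 1, to even out the ones/zeros ratio."""
    y = np.zeros(T_y, dtype=int)
    for t in trigger_end_steps:
        y[t : t + ones_len] = 1   # numpy clips the slice at T_y
    return y

y = make_targets(T_y=1375, trigger_end_steps=[420, 900])
print(y.sum())  # 100 ones out of 1375 steps
```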

## Conclusion

### Conclusion and thank you

#### Specialization outline

- Neural Networks and Deep Learning
- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- Structuring Machine Learning Projects
- Convolutional Neural Networks
- Sequence Models

## Programming assignments

### Neural Machine Translation

- Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
- An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
- A network using an attention mechanism can translate from inputs of length \(T_x\) to outputs of length \(T_y\), where \(T_x\) and \(T_y\) can be different.
- You can visualize the attention weights \(\alpha^{<t,t'>}\) to see what the network is paying attention to while generating each output.

### Trigger word detection

- Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
- Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
- An end-to-end deep learning approach can be used to build a very effective trigger word detection system.