## Various sequence to sequence architectures

### Basic Models

#### Sequence to sequence model

\(x\): Jane visite l’Afrique en septembre

\(y\): Jane is visiting Africa in September.

- Papers: Sutskever et al. (2014), “Sequence to Sequence Learning with Neural Networks”; Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

#### Image captioning

\[

\displaylines{

💺😺

\underbrace{\rightarrow}_{11×11,s=4}

\boxed{(55,55,96)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(27,27,96)}

\underbrace{\rightarrow}_{5×5,\text{same}}\\

\boxed{(27,27,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}\\

\boxed{(13,13,384)}

\underbrace{\rightarrow}_{3×3,\text{same}}

\boxed{(13,13,256)}

\underbrace{\rightarrow}_{\text{Max-pool: }3×3,s=2}

\boxed{(6,6,256)}\\

=\boxed{FC_{9216}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \boxed{FC_{4096}}

\rightarrow \underbrace{\boxed{y^{<1>},y^{<2>},\dots,y^{<T_y>}}}_{\text{A cat sitting on a chair}}

}

\]

- Papers: Mao et al. (2014), “Deep Captioning with Multimodal Recurrent Neural Networks”; Vinyals et al. (2014), “Show and Tell: Neural Image Caption Generator”; Karpathy and Fei-Fei (2015), “Deep Visual-Semantic Alignments for Generating Image Descriptions”
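A shape-check sketch of this pipeline (assuming PyTorch; the notes name no framework, and the local response normalization from the original AlexNet is omitted):

```python
import torch
import torch.nn as nn

# AlexNet-style encoder; comments track the (H, W, C) shapes from the diagram
encoder = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),     # (227,227,3) -> (55,55,96)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (27,27,96)
    nn.Conv2d(96, 256, kernel_size=5, padding=2),   # "same" -> (27,27,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (13,13,256)
    nn.Conv2d(256, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),  # -> (13,13,384)
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),  # -> (13,13,256)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),          # -> (6,6,256)
    nn.Flatten(),                                   # -> 9216
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
)
features = encoder(torch.randn(1, 3, 227, 227))     # shape (1, 4096)
```

For captioning, this 4096-dimensional encoding replaces the RNN decoder's initial all-zeros state, and the decoder then generates \(y^{<1>},\dots,y^{<T_y>}\).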

### Picking the most likely sentence

#### Machine translation as building a conditional language model

##### Language model:

- \( P(y^{<1>},\dots,y^{<T_y>})\)
- The machine translation model is very similar to the language model, except that instead of always starting from a vector of all zeros, the decoder starts from the encoder’s representation of the input sentence.

##### Machine translation:

- \( P(y^{<1>},\dots,y^{<T_y>}|x^{<1>},\dots,x^{<T_x>})\)
- “conditional language model”
- Encoder: maps the input sentence to an encoding
- Decoder: a language model conditioned on that encoding (a minimal sketch follows)
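A minimal sketch of the encoder-decoder idea (my own illustration, assuming PyTorch and GRU layers; real systems add attention, covered below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder maps x<1..Tx> to a fixed encoding; the decoder is a
    conditional language model started from that encoding."""
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, x, y_in):
        _, h = self.encoder(self.src_emb(x))        # encoding of the French sentence
        d, _ = self.decoder(self.tgt_emb(y_in), h)  # decoder conditioned on it
        return self.out(d)                          # scores for P(y<t> | x, y<1..t-1>)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 9)))
```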

#### Finding the most likely translation

- Jane visite l’Afrique en septembre.
- Jane is visiting Africa in September.
- Jane is going to be visiting Africa in September.
- In September, Jane will visit Africa.
- Her African friend welcomed Jane in September.

\[

\arg\max_{y^{<1>},\dots,y^{<T_y>}} P(\underbrace{y^{<1>},\dots,y^{<T_y>}}_{\text{English}}\mid \underbrace{x}_{\text{French}})

\]

#### Why not a greedy search?

- Jane is visiting Africa in September.
- better translation

- Jane is going to be visiting Africa in September.
- not a bad translation but verbose

- \( P(\text{Jane is going}|x) > P(\text{Jane is visiting}|x) \)
- “going” is a more common English word, so after picking “Jane is”, a greedy search is likely to prefer it

- And, of course, the total number of combinations of words in the English sentence is exponentially large.
- So this is a huge space of possible sentences, and it’s impossible to rate them all, which is why the most common thing to do is to use an approximate search algorithm.

### Beam Search

#### Beam search algorithm

- \( P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>}) \)
- jane visits africa in september.<EOS>
- The outcome of this process is that beam search adds one word at a time until it selects <EOS> as the most likely next symbol.

- If the beam width is 1, beam search essentially becomes the greedy search algorithm.
- Length normalization is a small change to the beam search algorithm that can help get much better results.
- Beam search maximizes the probability in the first formula below: the product of all the conditional word probabilities, where \(T_y\) is the total number of words in the output.
- In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations. (A minimal sketch of the algorithm follows.)
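A minimal beam-search sketch (my own illustration, not the course’s code; `step_log_probs` is a hypothetical stand-in for the decoder, returning `log P(word | x, prefix)` for each candidate word):

```python
import math

def beam_search(step_log_probs, eos="<EOS>", beam_width=3, max_len=30, alpha=0.7):
    """Keep the B most likely partial sentences at each step; score with
    summed log-probabilities and length-normalize at the end."""
    beams = [([], 0.0)]              # (partial sentence, sum of log P)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # a beam that emits <EOS> is finished
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    completed += beams               # in case max_len was hit first
    # normalized log-probability objective: score / T_y ** alpha
    return max(completed, key=lambda c: c[1] / len(c[0]) ** alpha)
```

With `beam_width=1` this reduces to greedy search.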

### Refinements to Beam Search

#### Length normalization

\[

\displaylines{

\arg\max_y \overbrace{\prod_{t=1}^{T_y} P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}^{P(y^{<1>},\dots,y^{<T_y>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})\cdots P(y^{<T_y>}|x,y^{<1>},\dots,y^{<T_y-1>})}\\

\arg\max_y \underbrace{\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})}_{T_y=1,2,3,\dots,30}\\

\frac{1}{T_y^\alpha}\sum_{t=1}^{T_y} \log P(y^{<t>}\mid x,y^{<1>},\dots,y^{<t-1>})

}

\]

- Probabilities are less than 1; multiplying many of them together can result in numerical underflow.
- Instead of maximizing this product, we take logs and maximize the sum.
- \(\alpha =0.7\)
- softer approach

- \(\alpha =1\)
- completely normalizing by length

- \(\alpha =0\)
- no normalization

- Finally, among all of the candidate sentences, you pick the one that achieves the highest value on this normalized log-probability objective (a numeric sketch follows).
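A tiny numeric illustration of the underflow point (my own example, not from the lecture):

```python
import math

# the product of many probabilities underflows to 0.0 in floating point
probs = [0.01] * 300
print(math.prod(probs))                   # 0.0 (true value is 1e-600)

# the sum of logs stays perfectly representable
log_score = sum(math.log(p) for p in probs)
print(log_score)                          # about -1381.55

# normalized log-probability objective with alpha = 0.7
alpha = 0.7
print(log_score / len(probs) ** alpha)    # comparable across output lengths
```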

#### Beam search discussion

Beam width B?

- Larger B
- considers more possibilities, so it tends to find a better sentence
- but more computationally expensive, because you’re also keeping a lot more possibilities around
- slower, and memory requirements grow

- Smaller B
- worse result
- faster

- in production system
- B=10 is not uncommon
- B=100 is very large

- in research system (publish, best possible result)
- B=1000,3000 is not uncommon
- domain dependent

Unlike exact search algorithms like BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum of \(\arg\max_y P(y|x)\).

### Error analysis in beam search

#### Example

- Jane visite l’Afrique en septembre.
- Human: Jane visits Africa in September. \(y^*\)
- Algorithm: Jane visited Africa last September. \(\hat y\)
- The model consists of 2 components:
- RNN: encoder and decoder
- Beam search

- Given an error (a bad translation)
- which component is more to blame?

- Increase B? Or get more training data?

#### Error analysis on beam search

- Case 1: \(P(y^*|x) \gt P(\hat y|x)\)
- Beam search chose \(\hat y\), but \(y^*\) attains a higher \(P(y|x)\).
- Conclusion: Beam search is at fault.

- Case 2: \(P(y^*|x) \leq P(\hat y|x)\)
- \(y^*\) is a better translation than \(\hat y\). But RNN predicted \(P(y^*|x) \lt P(\hat y|x)\).
- Conclusion: The RNN model is at fault (rather than the search algorithm).

#### Error analysis process

| Human | Algorithm | \(P(y^*\mid x)\) | \(P(\hat y\mid x)\) | At fault? |
| --- | --- | --- | --- | --- |
| Jane visits Africa in September. | Jane visited Africa last September. | \(2\times 10^{-10}\) | \(1\times 10^{-10}\) | B |
| … | … | … | … | R |
| … | … | … | … | B |

Figure out what fraction of errors are “due to” beam search vs. the RNN model (a small tallying sketch follows).
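A tiny tallying sketch (my own illustration; only the first probability pair comes from the table above, the rest are made up):

```python
def at_fault(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x) -> beam search ("B") is at fault.
    Case 2: otherwise -> the RNN model ("R") is at fault."""
    return "B" if p_star > p_hat else "R"

errors = [(2e-10, 1e-10), (3e-12, 9e-12), (5e-11, 4e-11)]
faults = [at_fault(ps, ph) for ps, ph in errors]
print(faults.count("B") / len(faults))  # fraction of errors due to beam search
```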

### Bleu Score

#### Evaluating machine translation

- French: Le chat est sur le tapis.
- Reference 1: The cat is on the mat.
- Reference 2: There is a cat on the mat.
- MT output: the the the the the the the.
- Precision: 7/7
- every one of these 7 words (“the”) appears in Reference 1 or Reference 2

- this is not a particularly useful measure.

- Modified precision: 2/7 (“the” appears at most twice in Reference 1; counting against Reference 2 alone would give 1/7)
- credit each word only up to the maximum number of times it appears in the reference sentences.

- Bleu: Bilingual Evaluation Understudy
- Paper: Papineni et al. (2002), “Bleu: a Method for Automatic Evaluation of Machine Translation”

#### Bleu score on bigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

| Bigram | Count | Count_clip |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |

Modified bigram precision: \(p_2 = 4/6\)

#### Bleu score on unigrams

- Example:
- Reference 1: The cat is on the mat.
- Reference 2: there is a cat on the mat.
- MT output: The cat the cat on the mat.

\[

\displaylines{

p_1=\frac{\sum_{\text{unigram} \in \hat y} Count_{clip}(\text{unigram})}{\sum_{\text{unigram} \in \hat y} Count(\text{unigram})}\\

p_n=\frac{\sum_{\text{n-gram} \in \hat y} Count_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat y} Count(\text{n-gram})}

}

\]

- If the MT output is exactly the same as one of the references:
- \(p_1 = p_2 = 1.0\) (see the sketch below)
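A sketch of the clipped-count computation behind \(p_n\) (my own illustration, assuming lower-cased whitespace tokenization with periods stripped):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """p_n: clip each candidate n-gram count at its maximum count
    over the reference sentences."""
    tokens = lambda s: s.lower().replace(".", "").split()
    ngrams = lambda ws: Counter(zip(*(ws[i:] for i in range(n))))
    cand = ngrams(tokens(candidate))
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(tokens(ref)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision("The cat the cat on the mat.", refs, 2))   # 4/6
print(modified_precision("the the the the the the the.", refs, 1))  # 2/7
```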

#### Bleu details

- \(p_n =\) Bleu score on n-grams only
- Combined Bleu score:
- \(BP \cdot \exp\left(\frac14 \sum_{n=1}^4 p_n\right)\) (the original Bleu paper puts \(\log p_n\) inside the sum)

- BP: brevity penalty
- \[

BP=\begin{cases}

1 & \text{if MT output length} \gt \text{reference output length}\\

\exp(1-\text{reference output length}/\text{MT output length}) & \text{otherwise}

\end{cases}

\]

- The Bleu score was revolutionary for MT
- it gave a pretty good, by no means perfect, single real-number evaluation metric

- Open-source implementations exist (a combined-score sketch follows below)
- Today it is used to evaluate many text generation systems
- translation
- image captioning

- Bleu is not used for speech recognition
- because speech recognition has a single ground-truth transcript
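Building on `modified_precision` above, a combined-score sketch (my own illustration; it keeps \(\log p_n\) as in the Bleu paper, assumes all \(p_n > 0\), and for simplicity uses the shortest reference length in the brevity penalty):

```python
import math

def bleu(candidate, references, max_n=4):
    """BP * exp(mean of log p_n over n = 1..4)."""
    c = len(candidate.split())
    r = min(len(ref.split()) for ref in references)  # simplified reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    mean_log_p = sum(math.log(modified_precision(candidate, references, n))
                     for n in range(1, max_n + 1)) / max_n
    return bp * math.exp(mean_log_p)
```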

### Attention Model Intuition

#### The problem of long sequences

- Encoder-Decoder architecture
- it works quite well for short sentences
- for very long sentences, maybe longer than 30-40 words, performance comes down, because the encoder must memorize the whole sentence before decoding

#### Attention Model Intuition

### Attention Model

Attention model

\[

\begin{align}

a^{<t'>}&=(\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\

\sum_{t'} \alpha^{<t,t'>}&=1\\

c^{<1>}&=\sum_{t'}\alpha^{<1,t'>}a^{<t'>}\\

\alpha^{<t,t'>}&=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}

\end{align}

\]

#### Computing attention \(\alpha^{<t,t'>}\)

\[

\displaylines{

\alpha^{<t,t'>}=\text{amount of attention } y^{<t>} \text{ should pay to } a^{<t'>}\\

\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}\\

\begin{matrix}

s^{<t-1>}\rightarrow \\

a^{<t'>}\rightarrow

\end{matrix}

\boxed{\text{small NN}} \rightarrow e^{<t,t'>}

}

\]

- Papers: Bahdanau et al. (2014), “Neural Machine Translation by Jointly Learning to Align and Translate”; Xu et al. (2015), “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”
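A numpy sketch of one attention step (my own illustration; `W` and `v` stand in for the small network’s parameters, and all shapes are made up):

```python
import numpy as np

def attention_step(a, s_prev, W, v):
    """Score each encoder activation a[t'] against the previous decoder
    state s_prev with a one-hidden-layer network, softmax the scores into
    alphas (which sum to 1), and return the weighted-sum context."""
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_t])) for a_t in a])
    alphas = np.exp(e - e.max())   # numerically stable softmax
    alphas /= alphas.sum()         # sum over t' of alpha<t,t'> = 1
    context = alphas @ a           # c<t> = sum over t' of alpha<t,t'> * a<t'>
    return context, alphas

# toy shapes: T_x = 5 encoder steps, activations of size 4, decoder state of size 3
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 4)); s_prev = rng.normal(size=3)
W = rng.normal(size=(8, 7)); v = rng.normal(size=8)
context, alphas = attention_step(a, s_prev, W, v)
```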

#### Attention examples

- date normalization
- July 20th 1969
- 23 April, 1564

## Speech recognition – Audio data

### Speech recognition

#### Speech recognition problem

- x: audio clip
- y: transcript
- the quick brown fox
- phonemes: de ku wik bra…
- Once upon a time, speech recognition systems were built using phonemes, hand-engineered basic units of sound.

- academic datasets on speech recognition: ~300 h
- 3,000 h is a reasonable size
- commercial systems: 10,000 h, sometimes 100,000 h or more

#### CTC cost for speech recognition

(Connectionist temporal classification)

- bidirectional LSTM or bidirectional GRU
- the number of input time steps is very large
- 10 s of audio at 100 Hz = 1,000 inputs

- “the quick brown fox” (19 characters, including spaces)
- The CTC cost function allows the RNN to generate an output like:
- ttt_h_eee_____ ____qqqq___
- which collapses to “the q”

- Basic rule: collapse repeated characters not separated by a blank (“_”), then remove the blanks (see the sketch below)
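A sketch of that collapsing rule (my own illustration):

```python
def ctc_collapse(seq, blank="_"):
    """Collapse repeated characters not separated by a blank, then remove
    the blanks (the special CTC 'blank' symbol, not the space character)."""
    out, prev = [], None
    for ch in seq:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

print(ctc_collapse("ttt_h_eee_____ ____qqqq___"))  # -> "the q"
```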

### Trigger Word Detection

#### What is trigger word detection?

- Amazon Echo (Alexa)
- Baidu DuerOS (xiaodunihao)
- Apple Siri (Hey Siri)
- Google Home (Okay Google)

#### Trigger word detection algorithm

- The literature on trigger word detection algorithms is still evolving, so there isn’t wide consensus yet on the best algorithm for trigger word detection.
- \(x^{<t>}\): features from an audio clip, maybe computed as spectrogram features, then passed through an RNN
- right after someone finishes saying the trigger word, set the target
- \(y^{<t>}=1\)
- This creates a very imbalanced training set, with many more zeros than ones, and it’s hard to train.

- Instead of setting only a single time step to output one, you can make it **output ones for several time steps**, a fixed period of time, before reverting back to zero. This **slightly evens out the ratio of ones to zeros** (see the labeling sketch below).
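A labeling sketch (my own illustration; `ones_len=50`, `T_y=1375`, and the trigger positions are assumed values, not from the lecture):

```python
import numpy as np

def make_targets(T_y, trigger_end_steps, ones_len=50):
    """y<t> = 1 for a fixed window of steps right after each trigger word
    ends, instead of a single 1, to even out the ones/zeros ratio."""
    y = np.zeros(T_y, dtype=int)
    for t in trigger_end_steps:
        y[t : t + ones_len] = 1   # numpy clips the slice at T_y
    return y

y = make_targets(T_y=1375, trigger_end_steps=[420, 900])
print(y.sum())  # 100 ones out of 1375 steps
```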

## Conclusion

### Conclusion and thank you

#### Specialization outline

- Neural Networks and Deep Learning
- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- Structuring Machine Learning Projects
- Convolutional Neural Networks
- Sequence Models

## Programming assignments

### Neural Machine Translation

- Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
- An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
- A network using an attention mechanism can translate from inputs of length \(T_x\) to outputs of length \(T_y\), where \(T_x\) and \(T_y\) can be different.
- You can visualize the attention weights \(\alpha^{<t,t'>}\) to see what the network is paying attention to while generating each output.

### Trigger word detection

- Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
- Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
- An end-to-end deep learning approach can be used to build a very effective trigger word detection system.