
DL [Course 5/5] Sequence Models [Week 2/3] Natural Language Processing & Word Embeddings

Introduction to Word Embeddings

Word Representation

Word representation

\[
V=[\text{a}, \text{aaron}, \dots, \text{zulu}, \text{<UNK>}], \quad |V|=10{,}000
\]

1-hot representation:

\[
\begin{array}{cccccc}
\text{Man} & \text{Woman} & \text{King} & \text{Queen} & \text{Apple} & \text{Orange}\\
\text{(5391)} & \text{(9853)} & \text{(4914)} & \text{(7157)} & \text{(456)} & \text{(6257)}\\
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix} &
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix} &
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix} &
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix} &
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix} &
\begin{bmatrix}0\\ \vdots\\ 1\\ \vdots\\ 0\end{bmatrix}\\
O_{5391} & O_{9853} & O_{4914} & O_{7157} & O_{456} & O_{6257}
\end{array}
\]

I want a glass of orange _____.
I want a glass of apple _____.

  • With 1-hot vectors, the inner product between any two different word vectors is zero, so a model that has learned "orange juice" cannot generalize to "apple juice".

Featurized representation: word embedding

\[
\begin{array}{l|cccccc}
 & \text{Man (5391)} & \text{Woman (9853)} & \text{King (4914)} & \text{Queen (7157)} & \text{Apple (456)} & \text{Orange (6257)}\\
\hline
\text{Gender} & -1 & 1 & -0.95 & 0.97 & 0.00 & 0.01\\
\text{Royal} & 0.01 & 0.02 & 0.93 & 0.95 & -0.01 & 0.00\\
\text{Age} & 0.03 & 0.02 & 0.7 & 0.69 & 0.03 & -0.02\\
\text{Food} & 0.04 & 0.01 & 0.02 & 0.01 & 0.95 & 0.97\\
 & e_{5391} & e_{9853} & & & &
\end{array}
\]

  • The dimension of word vectors is usually much smaller than the size of the vocabulary. The most common sizes range between 50 and 400.

Visualizing word embeddings

  • t-SNE
    • a non-linear dimensionality reduction technique
    • takes the 300-dimensional feature vector (embedding) of each word
    • and maps it into a two-dimensional space so that you can visualize it
    • e.g., 300D to 2D (a sketch follows below)
  • Word embeddings have been one of the most important ideas in NLP (Natural Language Processing).
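
A minimal sketch of this 300D → 2D visualization, assuming scikit-learn and matplotlib are available and that the embeddings sit in a NumPy array `E` with one row per word; the word list and random vectors below are placeholders, not real trained embeddings.

```python
# t-SNE visualization sketch: map 300-dimensional word embeddings down to 2D.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings_2d(E, words):
    # Non-linear dimensionality reduction, used for visualization only.
    tsne = TSNE(n_components=2, perplexity=5, random_state=0)
    E_2d = tsne.fit_transform(E)
    plt.scatter(E_2d[:, 0], E_2d[:, 1])
    for (x, y), w in zip(E_2d, words):
        plt.annotate(w, (x, y))
    plt.show()

# Random vectors standing in for real 300D embeddings:
words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)
plot_embeddings_2d(E, words)
```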

Using word embeddings

Named entity recognition example

Transfer learning and word embeddings

  1. Learn word embeddings from a large text corpus (1–100B words), or download a pre-trained embedding online (see the loading sketch below).
  2. Transfer the embedding to a new task with a smaller training set (say, 100k words).
  3. Optional: continue to fine-tune the word embeddings with the new data (usually only worthwhile if the new data set is reasonably large).
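
Step 1 is often just downloading pre-trained vectors. Below is a sketch of reading a GloVe-style text file (one word followed by its vector per line) into a Python dict; the file name `glove.6B.50d.txt` is an assumption for illustration, not something given in these notes.

```python
import numpy as np

def read_glove_vecs(glove_file):
    """Read a GloVe-style text file: each line is 'word v1 v2 ... vd'."""
    word_to_vec_map = {}
    with open(glove_file, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)
    return word_to_vec_map

# Usage (file name is an assumption):
# word_to_vec_map = read_glove_vecs("glove.6B.50d.txt")
# e_king = word_to_vec_map["king"]
```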

Relation to face encoding (embedding)

\[
\displaylines{
x^{(i)} 👨 \rightarrow \boxed{\text{CNN}} \rightarrow \dots \rightarrow \underbrace{\circ}_{f(x^{(i)})} \rightarrow \\
x^{(j)} 👩 \rightarrow \boxed{\text{CNN}} \rightarrow \dots \rightarrow \underbrace{\circ}_{f(x^{(j)})} \rightarrow
}
\circ \rightarrow \hat y
\]

  • "Encoding" and "embedding" mean fairly similar things.
    • Face recognition: the network computes an encoding for any face picture you input, even one it has never seen before.
    • Word embeddings: there is a fixed vocabulary (say \(e_1,\dots,e_{10{,}000}\)), and anything outside it maps to <UNK>.

Properties of word embeddings

Analogies

\[
\begin{array}{l|cccccc}
 & \text{Man (5391)} & \text{Woman (9853)} & \text{King (4914)} & \text{Queen (7157)} & \text{Apple (456)} & \text{Orange (6257)}\\
\hline
\text{Gender} & -1 & 1 & -0.95 & 0.97 & 0.00 & 0.01\\
\text{Royal} & 0.01 & 0.02 & 0.93 & 0.95 & -0.01 & 0.00\\
\text{Age} & 0.03 & 0.02 & 0.7 & 0.69 & 0.03 & -0.02\\
\text{Food} & 0.04 & 0.01 & 0.02 & 0.01 & 0.95 & 0.97\\
 & e_{5391} & e_{9853} & & & &
\end{array}
\]

\[
\displaylines{
e_{man}-e_{woman}\approx \begin{bmatrix}-2\\ 0\\ 0\\ 0\end{bmatrix}\\
e_{king}-e_{queen}\approx \begin{bmatrix}-2\\ 0\\ 0\\ 0\end{bmatrix}\\
}
\]

Analogies using word vectors

  • In the 300D space:
    • \(e_{man}-e_{woman} \approx e_{king}-e_{?}\)
    • Find the word w: \( \arg\max_w \; sim(e_w,\ e_{king}-e_{man}+e_{woman}) \)
    • requiring an exact match, analogy reasoning like this reaches only about 30–75% accuracy in the literature.
  • t-SNE maps 300D to 2D
    • it is a non-linear mapping,
    • so the parallelogram relationship should not be expected to survive the mapping.

Cosine similarity

\( sim(e_w,e_{king}-e_{man}+e_{woman}) \)

\( sim(u,v) = \frac{u^T v}{\|u\|_2 \|v\|_2} \)
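
A sketch of both formulas above, assuming `word_to_vec_map` is a dict from word to NumPy vector as in the earlier loading sketch; the function names are my own, not from the course.

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u·v / (||u||_2 ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Find w maximizing sim(e_w, e_b - e_a + e_c), e.g. man:woman as king:?"""
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a + e_c
    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue  # skip the input words themselves
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# complete_analogy("man", "woman", "king", word_to_vec_map)  # ideally "queen"
```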

Embedding matrix

\[
\overbrace{
\begin{bmatrix}
& & & \color{purple}■ & & & \\
& & & \color{green}■ & & & \\
& & & \color{yellow}■ & & & \\
& & & \color{yellow}■ & E & & \\
& & & \color{yellow}■ & & & \\
& & & \color{yellow}■ & & &
\end{bmatrix}}^\text{a aaron … orange … zulu <UNK>}_{(300,10000)}
\begin{bmatrix}
0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\
\end{bmatrix}_{(10000,1)}
\]

\[
\displaylines{
E \cdot O_{6257} = \begin{bmatrix}
\color{purple}■\\ \color{green}■\\ \color{yellow}{\displaylines{■\\ ■\\ ■\\ ■}}
\end{bmatrix}_{(300, 1)} = e_{6257}\\
E \cdot O_j = e_j (\text{embedding for word j})
}
\]

Computing \( E \cdot O_j \) as a full matrix–vector multiplication is computationally wasteful, because \(O_j\) is a very high-dimensional vector that is almost entirely zeros.

In practice, a specialized function is used to look up the embedding (column \(j\) of \(E\)) directly, as in the check below.
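
A small NumPy check of this point: multiplying the (300, 10000) matrix E by the one-hot vector O_j and reading column j directly give the same e_j, but the lookup skips roughly three million multiplications by zero. All values here are random placeholders.

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)        # embedding matrix (300, 10000)

j = 6257                                        # index of "orange" in this example
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                    # one-hot vector O_j

e_via_matmul = E @ o_j                          # computationally wasteful
e_via_lookup = E[:, j]                          # what is done in practice

print(np.allclose(e_via_matmul, e_via_lookup))  # True
```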

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

Neural language model

I want a glass of orange ____.

\[
\begin{array}{cccccc}
I & o_{4343} & \rightarrow & E & \rightarrow & e_{4343}\rightarrow\\
want & o_{9665} & \rightarrow & E & \rightarrow & e_{9665}\rightarrow\\
a & o_{1} & \rightarrow & \color{yellow}E & \rightarrow & e_{1}\rightarrow\\
glass & o_{3852} & \rightarrow & \color{yellow}E & \rightarrow &e_{3852}\rightarrow\\
of & o_{6163} & \rightarrow & \color{yellow}E & \rightarrow &e_{6163}\rightarrow\\
orange & o_{6257} & \rightarrow & \color{yellow}E & \rightarrow & \underbrace{e_{6257}}_{\cancel{1800}\ \color{yellow}1200}\rightarrow\\
\end{array}
\underbrace{\circ}_{\uparrow w^{[1]}, b^{[1]}} \rightarrow \underbrace{\circ}_{\text{softmax} \leftarrow w^{[2]},b^{[2]}}
\]

  • The input is an 1800-dimensional vector obtained by taking the 6 embedding vectors (6 × 300) and stacking them together.
    • With a fixed history of just the previous 4 words (a hyperparameter of the algorithm),
    • the network instead takes a 1200-dimensional feature vector (4 × 300). A Keras-style sketch follows this list.
  • paper
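
A minimal Keras-style sketch of the fixed-history language model above, assuming a 10,000-word vocabulary, 300-dimensional embeddings, and a 4-word history; the hidden-layer size (128) is illustrative, not from the course.

```python
import tensorflow as tf

vocab_size, emb_dim, history = 10000, 300, 4

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(history,), dtype="int32"),   # indices of the previous 4 words
    tf.keras.layers.Embedding(vocab_size, emb_dim),           # shared embedding matrix E
    tf.keras.layers.Flatten(),                                 # stack 4 embeddings -> 1200-dim vector
    tf.keras.layers.Dense(128, activation="relu"),             # hidden layer (W[1], b[1]); size illustrative
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # softmax over the vocabulary (W[2], b[2])
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```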

Other context/target pairs

I want a glass of orange juice to go along with my cereal.

  • Context choices:
    • Last 4 words
    • 4 words on left & right
      • a glass of orange ? to go along with
    • Last 1 word
      • orange → ?
    • Nearby 1 word
      • glass → ?
      • this is the idea of the skip-gram model

Word2Vec

Skip-gram

I want a glass of orange juice to go along with my cereal.

Model

\[
\displaylines{
\text{Vocab size}=10{,}000\\
\text{Context } c\ (\text{"orange"}) \rightarrow \text{Target } t\ (\text{"juice"})\\
o_c \rightarrow E \rightarrow e_c \rightarrow \circ(\text{softmax}) \rightarrow \hat y\\
\text{softmax}: p(t|c)=\frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}\\
\theta_t: \text{parameter associated with output } t\\
\mathcal{L}(\hat y, y)=-\sum_{i=1}^{10,000} y_i \log \hat y_i\\
y=\begin{bmatrix}
0\\\vdots\\1\\\vdots\\0
\end{bmatrix}
}
\]
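
A NumPy sketch of the forward pass above: look up e_c, then compute the softmax p(t|c) over all 10,000 targets. Here `theta` is the output-layer parameter matrix, and all values (including the target index) are random placeholders.

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)       # input embeddings e_j (columns)
theta = np.random.randn(emb_dim, vocab_size)   # output parameters theta_t (columns)

def p_t_given_c(c):
    """softmax: p(t|c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c)"""
    e_c = E[:, c]                              # embedding lookup for context word c
    logits = theta.T @ e_c                     # theta_j^T e_c for every j
    logits -= logits.max()                     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probs = p_t_given_c(6257)                      # c = "orange"
t = 3                                          # hypothetical index of the target word "juice"
loss = -np.log(probs[t])                       # cross-entropy L(y_hat, y)
```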

Problems with softmax classification

\[
\displaylines{
p(t|c)=\frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}\\
}
\]


  • The denominator requires summing over the entire 10,000-word vocabulary every time \(p(t|c)\) is evaluated, which is slow (hierarchical softmax is one classic speed-up; negative sampling, below, is another).
  • How to sample the context c?
    • Sampling uniformly from the training text lets extremely frequent words dominate: the, of, a, and, to…
    • while words that don’t appear often are rarely trained: orange, apple, durian…
    • In practice, heuristics are used to balance the sampling between common and rare words.

Negative Sampling

Defining a new learning problem

I want a glass of orange juice to go along with my cereal.

Context (x)   Word (x)   Target? (y)
orange        juice      1
orange        king       0
orange        book       0
orange        the        0
orange        of         0

k=4 (king, book, the, of)

Model

\[
\displaylines{
\text{softmax}: p(t|c)=\frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000} e^{\theta_j^T e_c}}\\
P(y=1|c,t)=\sigma(\theta_t^T e_c)
}
\]

  • before: one 10,000-way softmax problem per training example
  • now: 10,000 binary classification problems, of which only k + 1 are updated on each iteration (sketch below)
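
A NumPy sketch of the new learning problem: for one positive (c, t) pair plus k sampled negatives, only the k + 1 logistic units are involved instead of a full softmax. All vectors here are random placeholders; σ is the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

emb_dim, k = 300, 4
e_c = np.random.randn(emb_dim)              # embedding of the context word ("orange")
theta_pos = np.random.randn(emb_dim)        # theta_t for the true target ("juice"), label y = 1
theta_negs = np.random.randn(k, emb_dim)    # theta_t for the k sampled negative words, label y = 0

# Binary cross-entropy over just these k + 1 classifiers.
loss = -np.log(sigmoid(theta_pos @ e_c))
loss += -np.sum(np.log(sigmoid(-(theta_negs @ e_c))))   # log(1 - sigmoid(z)) = log(sigmoid(-z))
print(loss)
```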

Selecting negative examples

Context (x)   Word (x)   Target? (y)
orange        juice      1
orange        king       0
orange        book       0
orange        the        0
orange        of         0

t → king, book, the, of

\[
P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j=1}^{10,000} f(w_j)^{3/4}}
\]

  • The power 3/4
    • a heuristic value (see the sketch after this list)
    • Mikolov et al. sample negative words proportional to the word frequency raised to the power 3/4,
    • which lies somewhere in between the observed distribution of English text and the uniform distribution.
  • If you want to run the algorithm
    • use an open-source implementation
    • or use pre-trained word vectors.
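
A sketch of the sampling heuristic above: negative words are drawn with probability proportional to f(w_i)^(3/4), which sits between the empirical unigram distribution and the uniform distribution. The word frequencies below are made up for illustration.

```python
import numpy as np

# Made-up unigram frequencies f(w_i) for a tiny vocabulary.
words = ["the", "of", "orange", "king", "book", "juice"]
freqs = np.array([0.30, 0.20, 0.01, 0.005, 0.004, 0.002])

weights = freqs ** 0.75                 # f(w_i)^(3/4)
p = weights / weights.sum()             # P(w_i)

# Draw k = 4 negative words for one positive (context, target) pair.
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=4, replace=True, p=p)
print(negatives)
```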

GloVe word vectors

GloVe (global vectors for word representation)

I want a glass of orange juice to go along with my cereal.

\( X_{ij}=X_{tc}= \) the number of times word \(i\) appears in the context of word \(j\)

  • Depending on the definition of context and target words, \(X_{ij}=X_{ji}\) may or may not hold.
    • If c and t count as co-occurring whenever they appear within ±10 words of each other, the relationship is symmetric.
    • If the context is always the word immediately before the target word, it is not symmetric.
  • \(X_{ij}\) is a count that captures how often words i and j appear with each other, or close to each other.
  • paper

Model

\[\text{minimize} \sum_{i=1}^{10,000}\sum_{j=1}^{10,000} \underbrace{f(X_{ij})}_{\text{weighting term}}(\theta_i^T e_j + b_i + b'_j -\log X_{ij})^2
\]

  • Weighting term \(f(X_{ij})\)
    • there are heuristics for choosing this weighting function f (see the sketch after this list)
    • if \( X_{ij}=0 \), then \(\log X_{ij}\) is undefined (\(-\infty\))
    • requiring \( f(0)=0 \) makes such pairs drop out of the sum (using the convention \(0 \cdot \log 0 = 0\))
    • the weighting also gives less frequent words (e.g., durian) a meaningful amount of weight,
    • while not giving unduly large weight to very frequent words (this, is, of, a…), so the model does not learn only from extremely common word pairs.
  • \(\theta, e\)
    • the roles of \(\theta\) and \(e\) are now completely symmetric: \(\theta_i\) and \(e_j\) play interchangeable roles
    • initialize \(\theta\) and \(e\) uniformly at random, run gradient descent to minimize the objective, and then for every word take the average:
    • \( e_w^{final}=\frac{e_w+\theta_w}{2}\)
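
A NumPy sketch of a single term of the objective above. The weighting function shown is a hypothetical choice that satisfies f(0) = 0 and caps the weight of very frequent pairs; the constants x_max = 100 and alpha = 0.75 are common choices, assumed here rather than taken from these notes.

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """Heuristic weighting: f(0) = 0, grows for rarer pairs, capped at 1 for frequent ones."""
    if x == 0:
        return 0.0
    return min((x / x_max) ** alpha, 1.0)

def glove_term(theta_i, e_j, b_i, b_j, X_ij):
    """One term of the objective: f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2."""
    if X_ij == 0:
        return 0.0                      # f(0) = 0, so the undefined log(0) never appears
    diff = theta_i @ e_j + b_i + b_j - np.log(X_ij)
    return f_weight(X_ij) * diff ** 2

# Example with random 300-dimensional vectors standing in for learned parameters:
theta_i, e_j = np.random.randn(300), np.random.randn(300)
print(glove_term(theta_i, e_j, 0.0, 0.0, X_ij=12))
```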

A note on the featurization view of word embeddings

\[
\displaylines{
\text{minimize} \sum_{i=1}^{10,000}\sum_{j=1}^{10,000} f(X_{ij})({\color{blue} \theta_i^T e_j} + b_i + b'_j -\log X_{ij})^2\\
(A\theta_i)^T(A^{-T}e_j)=\theta_i^T \cancel{A^T A^{-T}} e_j=\theta_i^T e_j
}
\]

  • The learned features are not guaranteed to lie along easily human-interpretable axes.
  • Each learned feature might be a combination of gender, royal, age, food, and all the other features.
  • Even so, under an arbitrary invertible linear transformation of the features (as shown above), the parallelogram map used for analogies still works.

Applications using Word Embeddings

Sentiment Classification

Sentiment classification problem

x (review text)                                                      y (rating)
The dessert is excellent                                             ★★★★☆
Service was quite slow                                               ★★☆☆☆
Good for a quick meal, but nothing special                           ★★★☆☆
Completely lacking in good taste, good service, and good ambience.   ★☆☆☆☆

  • You may not have a huge labeled data set.
    • Training sets of 10,000 to 100,000 words would not be uncommon.

Simple sentiment classification model

The dessert is excellent. ★★★★☆

\[
\begin{array}{cccccc}
The & o_{8928} & \rightarrow & E & \rightarrow & e_{8928}\rightarrow\\
dessert & o_{2468} & \rightarrow & E & \rightarrow & e_{2468}\rightarrow\\
is & o_{4694} & \rightarrow & E & \rightarrow & e_{4694}\rightarrow\\
excellent & o_{3180} & \rightarrow & \underbrace{E}_{\uparrow 100B} & \rightarrow &e_{3180}\rightarrow
\end{array}
\underbrace{\text{Avg.}}_{\uparrow 300} \rightarrow \underbrace{\circ}_{\text{softmax} \leftarrow ★} \rightarrow \hat y
\]

  • By using the average operation, this particular algorithm works for reviews of any length, short or long (see the sketch after this list).
  • Problem with very negative reviews:
    • "Completely lacking in good taste, good service, and good ambience." contains the word "good" three times; averaging ignores word order, so the review can be misclassified as positive.
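
A sketch of the averaging model above, assuming `word_to_vec_map` from the earlier loading sketch and 5 output classes (star ratings). The softmax parameters W and b would be learned with gradient descent on the labeled reviews; the function names are my own.

```python
import numpy as np

def sentence_to_avg(sentence, word_to_vec_map):
    """Average the embeddings of all words in the sentence (works for any length)."""
    words = sentence.lower().split()
    return np.mean([word_to_vec_map[w] for w in words], axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_stars(sentence, word_to_vec_map, W, b):
    """y_hat = softmax(W @ avg + b); W has shape (5, embedding_dim)."""
    avg = sentence_to_avg(sentence, word_to_vec_map)
    return softmax(W @ avg + b)
```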

RNN for sentiment classification

  • A many-to-one RNN takes the word sequence into account, so it handles examples like "Completely lacking in good taste…" much better.
  • Because the word embeddings can be trained from a much larger data set, the model generalizes better, even to words that never appeared in your labeled sentiment training set. A Keras-style sketch follows.
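
A Keras-style sketch of a many-to-one RNN classifier over word indices; the vocabulary size, sequence length, layer sizes, and dropout rate are illustrative assumptions, not values from the course notes.

```python
import tensorflow as tf

vocab_size, emb_dim, max_len, num_classes = 10000, 50, 20, 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,), dtype="int32"),
    # Embedding layer; in practice initialized from pre-trained word vectors and
    # often frozen (trainable=False) when the labeled sentiment set is small.
    tf.keras.layers.Embedding(vocab_size, emb_dim),
    # Many-to-one: read the whole word sequence, keep only the final hidden state.
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # e.g. 1-5 star ratings
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```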

Debiasing word embeddings

The problem of bias in word embeddings

Man:Woman as King:Queen

Man:Computer_programmer as Woman:Homemaker (bad)

Father:Doctor as Mother:Nurse (bad)

Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.

Addressing bias in word embeddings

  • 1. Identify the bias direction.
    • \( e_{he}-e_{she}\)
    • \( e_{male}-e_{female}\)
    • average these (and similar) difference vectors to get the bias direction \(g\)
  • 2. Neutralize: for every word that is not definitional, project it to get rid of the bias component (see the sketch after this list).
  • 3. Equalize pairs.
    • grandmother – grandfather
      • before equalizing, the distance between babysitter and grandmother is actually smaller than the distance between babysitter and grandfather,
      • which may reinforce an unhealthy, or at least undesirable, bias that grandmothers end up babysitting more than grandfathers.
    • girl – boy…
  • One last detail:
    • How do you decide which words to neutralize?
      • e.g., "beard" is statistically associated with men, but that is closer to definitional than to bias, so it is not neutralized.
    • The authors train a classifier to decide which words are gender-specific (definitional) and which are not;
      • most words in the English language are not definitional, meaning that gender is not part of their definition.
  • paper
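
A sketch of the neutralize step (step 2), assuming `word_to_vec_map` from the earlier loading sketch; the bias direction g would come from step 1.

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g:
    e_bias = (e·g / ||g||^2) * g ;  e_debiased = e - e_bias."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g
    return e - e_bias

# Bias direction from step 1 (here from a single pair; in practice an average
# over several pairs such as he-she, male-female):
# g = word_to_vec_map["woman"] - word_to_vec_map["man"]
# e_debiased = neutralize(word_to_vec_map["babysitter"], g)
```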

Programming assignments

Operations on word vectors – Debiasing

After this assignment you will be able to:

  • Load pre-trained word vectors, and measure similarity using cosine similarity
  • Use word embeddings to solve word analogy problems such as Man is to Woman as King is to __.
  • Modify word embeddings to reduce their gender bias

Emojify

Using word vectors to improve emoji lookups

  • In many emoji interfaces, you need to remember that ❤️ is the “heart” symbol rather than the “love” symbol.
    • In other words, you’ll have to remember to type “heart” to find the desired emoji, and typing “love” won’t bring up that symbol.
  • We can make a more flexible emoji interface by using word vectors!
  • When using word vectors, you’ll see that even if your training set explicitly relates only a few words to a particular emoji, your algorithm will be able to generalize and associate additional words in the test set to the same emoji.
    • This works even if those additional words don’t even appear in the training set.
    • This allows you to build an accurate classifier mapping from sentences to emojis, even using a small training set.

What you’ll build

  1. In this exercise, you’ll start with a baseline model (Emojifier-V1) using word embeddings.
  2. Then you will build a more sophisticated model (Emojifier-V2) that further incorporates an LSTM.
