## Visualize Word Embeddings Using Text Scatter Plots

This example shows how to visualize word embeddings using 2-D and 3-D t-SNE and text scatter plots.

Word embeddings map words in a vocabulary to real vectors. The vectors attempt to capture the semantics of the words, so that similar words have similar vectors. Some embeddings also capture relationships between words like "Italy is to France as Rome is to Paris". In vector form, this relationship is $\mathit{Italy}-\mathit{Rome}+\mathit{Paris}=\mathit{France}$.
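
As a toy illustration of this arithmetic, the following sketch uses hand-made three-dimensional vectors. The values in `M` are hypothetical and chosen for clarity (the columns loosely stand for "Italian", "French", and "country"); they do not come from any trained embedding.

```matlab
% Toy vocabulary with hand-made vectors (hypothetical values, not a trained embedding).
words = ["Italy" "France" "Rome" "Paris"];
M = [1 0 1;    % Italy
     0 1 1;    % France
     1 0 0;    % Rome
     0 1 0];   % Paris

% Analogy arithmetic: Italy - Rome + Paris.
vec = M(1,:) - M(3,:) + M(4,:);

% Find the vocabulary word whose vector is closest in Euclidean distance.
d = vecnorm(M - vec,2,2);
[~,idx] = min(d);
nearest = words(idx)   % "France"
```

In a real embedding the result of the arithmetic rarely matches a word vector exactly, so the answer is the nearest word rather than an exact match.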

To reproduce the results in this example, set rng to 'default'.

rng('default')

### Load Pretrained Word Embedding

Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding
emb =
wordEmbedding with properties:

Dimension: 300
Vocabulary: [1×999994 string]

Explore the word embedding using word2vec and vec2word. Convert the words Italy, Rome, and Paris to vectors using word2vec.

italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");
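
Before converting words, it can help to confirm that they are in the embedding vocabulary. The sketch below uses isVocabularyWord for that check. The variable names and the made-up word "notARealWord123" are illustrative assumptions; word2vec is expected to return NaN values for words it does not know, so filtering first avoids propagating NaN into later arithmetic.

```matlab
% Assumes the fastText support package used in this example is installed.
emb = fastTextWordEmbedding;

% "notARealWord123" is a made-up word used only to illustrate the check.
queryWords = ["Italy" "Rome" "Paris" "notARealWord123"];

% Logical mask of the words that are in the embedding vocabulary.
tf = isVocabularyWord(emb,queryWords);

% Convert only the known words; M has one row per known word.
knownWords = queryWords(tf);
M = word2vec(emb,knownWords);
size(M)
```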

Compute the vector given by italy - rome + paris. This vector encapsulates the semantic meaning of the word Italy, without the semantics of the word Rome, and also includes the semantics of the word Paris.

vec = italy - rome + paris
vec = 1×300 single row vector

0.1606   -0.0690    0.1183   -0.0349    0.0672    0.0907   -0.1820   -0.0080    0.0320   -0.0936   -0.0329   -0.1548    0.1737   -0.0937   -0.1619    0.0777   -0.0843    0.0066    0.0600   -0.2059   -0.0268    0.1350   -0.0900    0.0314    0.0686   -0.0338    0.1841    0.1708    0.0276    0.0719   -0.1667    0.0231    0.0265   -0.1773   -0.1135    0.1018   -0.2339    0.1008    0.1057   -0.1118    0.2891   -0.0358    0.0911   -0.0958   -0.0184    0.0740   -0.1081    0.0826    0.0463    0.0043

Find the closest word in the embedding to vec using vec2word.

word = vec2word(emb,vec)
word =
"France"

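To see how close the competing candidates are, vec2word can also return the top k matches together with their distances. This is a minimal sketch: the choice k = 5 and the variable names are assumptions, and no particular ranking beyond the first match shown above is implied.

```matlab
% Assumes the fastText support package used in this example is installed.
emb = fastTextWordEmbedding;
vec = word2vec(emb,"Italy") - word2vec(emb,"Rome") + word2vec(emb,"Paris");

% Return the k closest words and their distances to vec.
k = 5;   % illustrative choice
[nearestWords,dist] = vec2word(emb,vec,k);

% Display the candidates and distances side by side.
tbl = table(nearestWords(:),dist(:),'VariableNames',{'Word','Distance'})
```
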
### Create 2-D Text Scatter Plot

Visualize the word embedding by creating a 2-D text scatter plot using tsne and textscatter.

Convert the first 5000 words to vectors using word2vec. V is a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

5000         300

Embed the word vectors in two-dimensional space using tsne. This function may take a few minutes to run. If you want to display the convergence information, then set the 'Verbose' name-value pair to 1.

XY = tsne(V);
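
The following sketch shows a few tsne options that are commonly adjusted. It runs on small random stand-in data so that it finishes quickly on its own; in this example you would pass the matrix V of word vectors instead. The cosine distance and the perplexity value are illustrative assumptions, not settings used to produce the plots on this page.

```matlab
rng('default')

% Stand-in data: 200 random vectors of length 300 (not real word vectors).
V = randn(200,300,'single');

% 'Verbose',1 prints convergence information while tsne runs.
% 'Distance' and 'Perplexity' are shown only to illustrate the syntax.
[XY,loss] = tsne(V,'Distance','cosine','Perplexity',30,'Verbose',1);

size(XY)   % one 2-D coordinate per input row
loss       % Kullback-Leibler divergence of the final embedding
```

Because t-SNE starts from a random initialization, the coordinates differ between runs unless the random number generator is reset first.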

Plot the words at the coordinates specified by XY in a 2-D text scatter plot. For readability, textscatter by default displays only a subset of the input words as text and shows the remaining words as markers.

figure
textscatter(XY,words)
title("Word Embedding t-SNE Plot")
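
To show more (or fewer) words as text, you can adjust the 'TextDensityPercentage' option of textscatter. The sketch below uses random stand-in coordinates and generated labels so that it runs on its own; in this example you would pass XY and words from the preceding steps. The value 90 is an arbitrary illustration.

```matlab
rng('default')

% Stand-in data: 200 random 2-D points with generated labels.
XY = randn(200,2);
words = "word" + (1:200)';

% Higher percentages display more labels as text and fewer as markers.
figure
ts = textscatter(XY,words,'TextDensityPercentage',90);
title("Text Scatter Plot with Denser Labels")
```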

Zoom in on a section of the plot.

xlim([-18 -5])
ylim([11 21])

### Create 3-D Text Scatter Plot

Visualize the word embedding by creating a 3-D text scatter plot using tsne and textscatter.

Convert the first 5000 words to vectors using word2vec. V is a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

5000         300

Embed the word vectors in a three-dimensional space using tsne by specifying the number of dimensions to be three. This function may take a few minutes to run. If you want to display the convergence information, then you can set the 'Verbose' name-value pair to 1.

XYZ = tsne(V,'NumDimensions',3);

Plot the words at the coordinates specified by XYZ in a 3-D text scatter plot.

figure
ts = textscatter3(XYZ,words);
title("3-D Word Embedding t-SNE Plot")

Zoom in on a section of the plot.

xlim([12.04 19.48])
ylim([-2.66 3.40])
zlim([10.03 14.53])

### Perform Cluster Analysis

Convert the first 5000 words to vectors using word2vec. V is a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

5000         300

Discover 25 clusters using kmeans.

cidx = kmeans(V,25,'Distance','sqeuclidean');
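
To inspect what a clustering contains, you can list the words assigned to each cluster index. The sketch below uses a tiny hand-made data set (hypothetical 2-D vectors, not taken from the embedding) so that the grouping is easy to verify by eye; the 'Replicates' option is added here as an assumption to make the result less sensitive to the random initialization.

```matlab
rng('default')

% Toy data: two well-separated groups of hand-made 2-D vectors.
words = ["red" "green" "blue" "cat" "dog" "horse"];
V = [0.9 0.1; 1.0 0.0; 0.8 0.2; 0.1 0.9; 0.0 1.0; 0.2 0.8];

% Run k-means several times and keep the best solution.
cidx = kmeans(V,2,'Distance','sqeuclidean','Replicates',5);

% Print the words in each cluster.
for k = 1:2
    fprintf("Cluster %d: %s\n",k,join(words(cidx == k),", "));
end
```

With the word vectors from this example, the same loop over 1:25 and `words(cidx == k)` lists the members of each of the 25 clusters. The cluster numbering itself is arbitrary and can change between runs.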

Visualize the clusters in a text scatter plot using the 2-D t-SNE data coordinates calculated earlier.

figure
textscatter(XY,words,'ColorData',categorical(cidx));
title("Word Embedding t-SNE Plot")

Zoom in on a section of the plot.

xlim([13 24])
ylim([-47 -35])