Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding

Pre-trained English Word Embedding Model for Machine Learning and Deep Learning with Text

9 Downloads

Updated 18 Mar 2020

This add-on provides a pre-trained word embedding and sentence classification model based on fastText for use in machine learning and deep learning algorithms. fastText is an open-source library that provides efficient and scalable tools for text analytics. For more information on the pre-trained word vector model, see: https://fasttext.cc/docs/en/english-vectors.html

Opening the fasttext.mlpkginstall file from your operating system or from within MATLAB will initiate the installation process for the release you have.
This mlpkginstall file is functional for R2018a and beyond.
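After installing, one way to confirm the model is available is to check that the loader function is on the path (a minimal sketch; the error message is illustrative, not from this page):

% Check that the support package function can be found
if isempty(which('fastTextWordEmbedding'))
    error('fastTextWordEmbedding not found; install this support package first.')
end
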
Usage Example:
% Load the trained model
emb = fastTextWordEmbedding;

% Find the top 10 closest words to “impedance” according to this word embedding
impedanceVec = word2vec(emb,"impedance");
vec2word(emb, impedanceVec,10)

ans =

10×1 string array

"impedance"
"impedances"
"capacitance"
"Impedance"
"resistor"
"impedence"
"inductance"
"voltage"
"S-parameters"
"ohms"

Comments and Ratings (11)

Jon Cherrie

@Alexander -- an alternative that might work for you is to download 'wiki-news-300d-1M.vec.zip' from https://fasttext.cc/docs/en/english-vectors.html, extract it, and then use readWordEmbedding to read the .vec file into MATLAB. These two approaches should be equivalent:

>> emb = fastTextWordEmbedding;

>> unzip("wiki-news-300d-1M.vec.zip");
>> emb = readWordEmbedding("wiki-news-300d-1M.vec");

Note that other word vectors are available from https://fasttext.cc/docs/en/english-vectors.html if 'wiki-news-300d-1M' doesn't match your requirements.

Alexander Doud

Third-party assets fail to load during installation, as mentioned by Jiajun. I reinstalled the Text Analytics Toolbox as recommended by the link shown at the time of the error, and the issue still persists with this add-on. Not sure who to contact other than support@mathworks.com, as this seems to be an issue with connecting to the third-party assets. Apologies if I am misunderstanding, but in the absence of other recommendations I will reach out to MathWorks support. Dr. Alex Doud

Kenta

@Jiajun, please contact support@mathworks.com. They will happily help you, and this comment section is not really a good place for back-and-forth troubleshooting.

@Peter: If you look into the Mikolov papers on the topic in detail, you will see that they get their famous “king” - “man” + “woman” ≈ “queen” result not by simply looking at the closest word, but only after prohibiting certain answers from being returned, including the three input words. If you write your code the same way, it will return “queen” (a short sketch of this follows the link below). Their full list of forbidden words is more complicated, however; it requires a lot of fine-tuning and easily causes the program to introduce biases the vector data itself does not even have (such as the famous problems with “man is to doctor as woman is to X”).

For more on this, please check out https://arxiv.org/abs/1905.09866
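
For reference, a minimal sketch of the simple exclusion step described above, assuming the embedding from this page is installed (the banned list here is just the three input words, far simpler than Mikolov's):

>> emb = fastTextWordEmbedding;
>> target = word2vec(emb,"king") - word2vec(emb,"man") + word2vec(emb,"woman");
>> candidates = vec2word(emb, target, 10);           % ten nearest neighbours
>> banned = ["king" "man" "woman"];
>> candidates(~ismember(lower(candidates), banned))  % "queen" now ranks first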

Jiajun Wei

Why can't I install this add-on?

Peter Krammer

I think that something is wrong. Look at the most common example in many scientific papers ("king" - "man" + "woman" -> "queen"):

% Vectors for the three input words
manVec = word2vec(emb,"man");
womanVec = word2vec(emb,"woman");
kingVec = word2vec(emb,"king");

% Analogy arithmetic, then the five nearest words
answer = kingVec - manVec + womanVec;
res1 = vec2word(emb, answer,5)

% Euclidean distance of each returned word from the answer vector
(vecnorm((word2vec(emb,res1) - answer)'))'

The five nearest words, with their distances from the answer vector, are:

"king"       1.1425352
"queen"      1.5177922
"monarch"    1.7698069
"kings"      1.7606366
"princess"   1.7804255

It is a surprise for me that "king" is first and "queen" is only second. I think this is a problem. What do you suggest? Are you sure that a vector length of 300 is enough? Or am I doing something incorrect? Thank you.

P.S. I tested different forms of the words ("man", "Man", "MAN"), and also the average 0.5 * (word2vec(emb,"man") + word2vec(emb,"Man")), but the first result is never "queen".

@Peter,
To add words to the embedding vocabulary, follow the steps below to create a new embedding object after reading in the pre-trained one:
>> emb = fastTextWordEmbedding;
>> vocab = emb.Vocabulary;
>> mat = word2vec(emb,vocab);
>> newvocab = [vocab "sample1" "sample2"];
>> newmat = [mat ; randn(2,300)];
>> newemb = wordEmbedding(newvocab,newmat);
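
A quick sanity check on the extended embedding (using the hypothetical "sample1"/"sample2" words from the snippet above, and assuming the constructor call works in your release):

>> word2vec(newemb,"sample1")   % returns the random 1-by-300 row appended above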

Peter Mayhew

Is it possible to add additional words to the pretrained vocabulary? If so, how is this done?

Chengchao Lu

MATLAB Release Compatibility
Created with R2018a
Compatible with R2018a to R2020a
Platform Compatibility
Windows macOS Linux