Build multiword language models and analyze them with machine learning

An n-gram is a sequence of n successive items in a text document; the items may be words, numbers, symbols, or punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation. For example, in the following sentence:

“Word clouds from string arrays and word clouds from bag-of-words models and LDA topics can be created using Text Analytics Toolbox.”

“Word clouds” is a 2-gram (bigram), “from string arrays” is a 3-gram (trigram), “using Text Analytics Toolbox” is a 4-gram, and so on. The choice of n depends on the application and on the length of the phrases that are common in that application.
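
The n-grams named above can be extracted by sliding a window of length n over the tokens of the sentence. The following is a minimal Python sketch, not Text Analytics Toolbox code. The helper names tokenize and ngrams are hypothetical, and the tokenization rule (lowercase everything, keep hyphenated words whole, drop punctuation) is an assumption made for illustration.

```python
import re

def tokenize(text):
    # Assumed rule: lowercase, keep hyphenated words such as
    # "bag-of-words" as one token, and drop punctuation.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Word clouds from string arrays and word clouds from "
            "bag-of-words models and LDA topics can be created using "
            "Text Analytics Toolbox.")

tokens = tokenize(sentence)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
fourgrams = ngrams(tokens, 4)

print(bigrams[0])     # ('word', 'clouds')
print(trigrams[2])    # ('from', 'string', 'arrays')
print(fourgrams[-1])  # ('using', 'text', 'analytics', 'toolbox')
```

A text with T tokens yields T - n + 1 n-grams, so larger values of n produce fewer but more specific phrases.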

N-gram modeling is one of many techniques for converting text from an unstructured format to a structured format; word embedding techniques such as word2vec are an alternative. A language model that incorporates n-grams can be created by counting the number of times each unique n-gram appears in a document. This is known as a bag-of-n-grams model. For the example sentence above, the bag-of-n-grams model for n=2, ignoring letter case, would include entries such as the following:

n-gram                 Count
Word clouds            2
String arrays          1
Bag-of-words models    1
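
The counts in the table can be reproduced by tallying each unique bigram. The following is a minimal Python sketch using the standard library's Counter, not the Text Analytics Toolbox implementation. The function name bag_of_ngrams is hypothetical, and the tokenization (lowercasing, hyphenated words kept whole, punctuation dropped) is an assumption.

```python
import re
from collections import Counter

def bag_of_ngrams(text, n):
    # Assumed tokenization: lowercase, hyphenated words kept whole,
    # punctuation dropped.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    # Join each window of n tokens into one string and count occurrences.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

sentence = ("Word clouds from string arrays and word clouds from "
            "bag-of-words models and LDA topics can be created using "
            "Text Analytics Toolbox.")

bag = bag_of_ngrams(sentence, 2)
for gram in ("word clouds", "string arrays", "bag-of-words models"):
    print(gram, bag[gram])
# word clouds 2
# string arrays 1
# bag-of-words models 1
```

The full bag also contains bigrams that the table omits, such as "clouds from", which appears twice under the same assumptions.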

Once the language model is built, it can then be used with machine learning algorithms to build predictive models for text analytics applications. To learn more about n-grams and building models with text data, see Text Analytics Toolbox™, for use with MATLAB®.
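
To be used by a machine learning algorithm, a bag-of-n-grams model is typically arranged as a numeric matrix with one row per document and one column per unique n-gram. The following is a minimal Python sketch of that step, not Text Analytics Toolbox code. The two example documents and the function name bigram_counts are hypothetical, and whitespace tokenization is assumed for brevity.

```python
from collections import Counter

def bigram_counts(text):
    # Assumed tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Pair each token with its successor to form bigrams.
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical documents, made up for illustration.
docs = ["the service was not good", "the service was very good"]

bags = [bigram_counts(d) for d in docs]

# Shared vocabulary: every bigram seen in any document, in sorted order.
vocab = sorted(set().union(*bags))

# One row per document, one column per bigram in the vocabulary.
matrix = [[bag[gram] for gram in vocab] for bag in bags]

for row in matrix:
    print(row)
# [1, 1, 1, 0, 1, 0]
# [0, 1, 1, 1, 0, 1]
```

The two rows differ in the columns for ('not', 'good') and ('very', 'good'). A single-word model would miss this distinction, since "good" appears once in each document; capturing it is one reason word sequences matter for tasks such as sentiment analysis.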

See also: natural language processing, sentiment analysis, word2vec, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Predictive Maintenance Toolbox™