What Is a Bag-of-Words?

The bag-of-words (BoW) model is one of the simplest feature extraction techniques, used in many natural language processing (NLP) applications such as text classification, sentiment analysis, and topic modeling. A bag-of-words model is built by counting how many times each unique feature, such as a word or symbol, occurs in a document.
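The counting step can be illustrated with a minimal sketch in plain Python. It is independent of the MATLAB functions discussed in this article, and the helper name bag_of_words and its tokenization rule (lowercase runs of letters, digits, or apostrophes, with punctuation dropped) are assumptions made for illustration only.

```python
import re
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Count occurrences of each token in a single document."""
    # Assumption: a token is a lowercase run of letters, digits, or apostrophes.
    # Punctuation is discarded here, although a real model may keep symbols.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)


bag = bag_of_words("The cat sat on the mat. The mat was warm.")
print(bag["the"], bag["mat"], bag["cat"])  # prints: 3 2 1
```

The result maps each unique token to its count, which is all the information a bag-of-words model retains about the document.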

Example

In this example, the MATLAB® function bagOfWords creates a bag-of-words model from a collection of abstracts of math papers published on arXiv. One of the easiest ways to visualize the model is by plotting a word cloud using the MATLAB function wordcloud(bag). Words displayed in bigger fonts and in orange are the most dominant (frequent) in the bag-of-words model.
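A word cloud is essentially a picture of the most frequent words. As a rough, language-neutral sketch of that ranking (plain Python, not the MATLAB wordcloud or topkwords functions), the hypothetical helper top_k_words below sums counts over a few made-up tokenized documents. The data is invented for illustration and is not the arXiv collection.

```python
from collections import Counter


def top_k_words(documents, k):
    """Return the k most frequent words across a list of tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)  # add this document's word counts to the total
    return counts.most_common(k)


# Hypothetical tokenized documents, invented for this sketch.
docs = [
    ["matrix", "norm", "bound"],
    ["matrix", "rank"],
    ["matrix", "norm"],
]
top = top_k_words(docs, 2)
print(top)  # prints: [('matrix', 3), ('norm', 2)]
```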

When to Use Bag-of-Words Models

Bag-of-words is easy to understand and implement. As a result, it is often the first method used to build models with text data. However, bag-of-words has several limitations, including:

Lack of context: Bag-of-words models do not preserve the order in which features appear in a document, which can discard important information. For example, “is this a good day” and “this is a good day” contain the same words with the same counts, so they produce identical bag-of-words representations even though one is a question and the other is a statement.
Unpredictable model quality: Including all features from a document in a bag-of-words model can increase the model size, resulting in sparsity and numerical instabilities. Careful preprocessing of the document text is often required to build a useful bag-of-words model.
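Both limitations can be seen in a short plain-Python sketch. The helpers bag and remove_infrequent are hypothetical names used only here. In particular, remove_infrequent is a simplified stand-in for the kind of preprocessing mentioned above, and its threshold rule (keep words that occur at least min_count times) is an assumption that may differ from any particular library function.

```python
from collections import Counter


def bag(text):
    """Build a bag-of-words from whitespace-separated lowercase tokens."""
    return Counter(text.lower().split())


# Word order is lost: the question and the statement give the same bag.
question = bag("is this a good day")
statement = bag("this is a good day")
same_bag = question == statement
print(same_bag)  # prints: True


def remove_infrequent(counts, min_count):
    """Keep only words that occur at least min_count times (assumed rule)."""
    return Counter({w: c for w, c in counts.items() if c >= min_count})


# Rare tokens such as typos inflate the vocabulary without adding much signal.
corpus = bag("good day good day good weather typo1 typo2")
pruned = remove_infrequent(corpus, 2)
print(len(corpus), len(pruned))  # prints: 5 2
```

Pruning rare words shrinks the vocabulary from five entries to two in this toy case, which is the kind of size reduction that preprocessing aims for on real text.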

Alternatives to Bag-of-Words Models

Several alternative models avoid some of the limitations of bag-of-words:

bag-of-n-grams: counts sequences of n consecutive features (n-grams) instead of single features, which preserves some local word order
term frequency–inverse document frequency: weights each feature count by how rare the feature is across the document collection, so that the weight reflects the feature's importance to a document
word embedding: maps features to dense numerical vectors, using models such as word2vec, GloVe, and fastText
transformer models: use pretrained deep learning models for transfer learning
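The first two alternatives are simple enough to sketch in plain Python. The helper names bag_of_ngrams and tf_idf are hypothetical, and the weighting used here, count × log(N / document frequency), is one common tf-idf variant chosen as an assumption. Libraries differ in their exact formulas and normalization.

```python
import math
from collections import Counter


def bag_of_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(documents):
    """Weight each term count by log(N / number of documents containing it)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


question = "is this a good day".split()
statement = "this is a good day".split()

# Unlike single-word bags, the bigram bags of the two sentences differ.
bigrams_differ = bag_of_ngrams(question, 2) != bag_of_ngrams(statement, 2)
print(bigrams_differ)  # prints: True

# Terms found in every document get weight 0; rarer terms get more weight.
weights = tf_idf([question, ["a", "bad", "day"]])
print(weights[0]["a"], weights[0]["good"] > 0)  # prints: 0.0 True
```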

However, bag-of-words is easy to understand and implement and is sufficient for many use cases. To learn more about bag-of-words and other modeling techniques for text data, see Text Analytics Toolbox™ for use with MATLAB.

Examples and How To

Prepare Text Data for Analysis - Example
Create Simple Text Model for Classification - Example
Analyze Text Data Using Topic Models - Example
bagOfWords: Bag-of-words model - Function
tfidf: Term frequency–inverse document frequency (tf-idf) matrix - Function
topkwords: Most important words in bag-of-words model - Function
removeInfrequentWords: Remove words with low counts from bag-of-words model - Function
bagOfNgrams: Bag-of-n-grams model - Function