Bag-of-Words

What Is a Bag-of-Words?

The bag-of-words (BoW) model is one of the simplest feature extraction techniques, used in many natural language processing (NLP) applications such as text classification, sentiment analysis, and topic modeling. A bag-of-words model is built by counting how many times each unique feature, such as a word or symbol, occurs in a document.
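The counting step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the MATLAB bagOfWords function; the helper names tokenize and bag_of_words, the naive whitespace tokenizer, and the two sample documents are assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(documents):
    """Return (vocabulary, counts); counts[i][j] is how often vocabulary[j] occurs in documents[i]."""
    per_doc = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of every token seen in any document.
    vocabulary = sorted(set().union(*per_doc))
    # A Counter returns 0 for tokens it has not seen, so absent words count as zero.
    counts = [[c[word] for word in vocabulary] for c in per_doc]
    return vocabulary, counts

documents = ["the cat sat on the mat", "the dog sat"]
vocabulary, counts = bag_of_words(documents)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row of counts is the feature vector for one document; the zeros show how quickly such vectors become sparse as the vocabulary grows.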

Example

In this example, the MATLAB® function bagOfWords creates a bag-of-words model from a collection of abstracts of math papers published on arXiv. One of the easiest ways to visualize the model is to plot a word cloud using the MATLAB function wordcloud(bag). Words displayed in larger fonts and in orange are the most frequent in the bag-of-words model.

Word cloud from a bag-of-words model.
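The ranking that a word cloud conveys visually can also be read directly from the counts. The following Python sketch is an assumption-laden stand-in for the MATLAB workflow: the function top_words and the three short sample "abstracts" are hypothetical and are not the arXiv data used in the figure.

```python
from collections import Counter

def top_words(documents, k):
    """Return the k most frequent lowercase, whitespace-separated tokens across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return totals.most_common(k)

# Hypothetical sample text, invented for this illustration.
abstracts = [
    "spectral graph theory for random graph models",
    "graph coloring and random walks",
    "random graph limits",
]
print(top_words(abstracts, 2))  # [('graph', 4), ('random', 3)]
```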

When to Use Bag-of-Words Models

Bag-of-words is easy to understand and implement. As a result, it is often the first method used to build models with text data. However, bag-of-words has several limitations, including:

  • Lack of context: Bag-of-words models do not preserve the order in which features appear in a document, which can discard important information. For example, “is this a good day” and “this is a good day” produce identical bag-of-words representations, even though one is a question and the other is a statement.
  • Unpredictable model quality: Including all features from a document in a bag-of-words model can increase the model size, resulting in sparsity and numerical instabilities. Careful preprocessing of the document text is often required to build a useful bag-of-words model.
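Both limitations can be seen in a small Python sketch. It assumes a naive tokenizer and a tiny, hypothetical stop-word list (STOP_WORDS); the helper bag is illustrative only and does not reproduce any particular library's preprocessing.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"is", "this", "a", "the"}

def bag(text, stop_words=frozenset()):
    """Count lowercase tokens, stripping surrounding punctuation and skipping stop words."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return Counter(t for t in tokens if t and t not in stop_words)

# Word order is discarded: both sentences produce the same counts.
question = bag("is this a good day")
statement = bag("this is a good day")
print(question == statement)  # True

# Removing stop words shrinks the vocabulary, leaving fewer distinct features.
raw = bag("This is a good day. The day is good!")
cleaned = bag("This is a good day. The day is good!", STOP_WORDS)
print(len(raw), len(cleaned))  # 6 2
```

Which words to drop is a modeling decision; an overly aggressive stop-word list can remove features that matter for the task.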

Alternatives to Bag-of-Words Models

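Several alternative models avoid some of the inherent limitations of bag-of-words. For example, n-gram models retain local word order, and word embeddings such as word2vec capture semantic similarity between words.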
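Counting n-grams (sequences of n adjacent words) instead of single words is one way to keep some word order. The following Python sketch shows the idea with bigrams; the function bag_of_ngrams is a hypothetical helper written for this example, and it uses a naive whitespace tokenizer.

```python
from collections import Counter

def bag_of_ngrams(text, n=2):
    """Count contiguous n-token sequences in lowercase, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

question = bag_of_ngrams("is this a good day")
statement = bag_of_ngrams("this is a good day")

# Unlike single-word counts, the bigram counts differ between the two sentences.
print(question == statement)      # False
print(question[("is", "this")])   # 1
print(statement[("is", "this")])  # 0
```

The trade-off is a larger vocabulary: the number of distinct n-grams grows much faster than the number of distinct words, which worsens sparsity.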

However, bag-of-words is easy to understand and implement and is sufficient for many use cases. To learn more about bag-of-words and other modeling techniques for text data, see Text Analytics Toolbox™ for use with MATLAB.

See also: natural language processing, text analytics, sentiment analysis, word2vec, text mining with MATLAB, lemmatization, stemming, n-gram, data science, deep learning