containsNgrams

Check if n-gram is member of documents

    Description

    tf = containsNgrams(documents,ngrams) returns a logical array with one element per document: 1 (true) where the document contains any of the n-grams in ngrams, and 0 (false) otherwise.

    tf = containsNgrams(documents,ngrams,IgnoreCase=flag) also specifies whether to ignore letter case when checking n-grams.

    Examples

    Create an array of tokenized documents.

    documents = tokenizedDocument([
        "an example of a short sentence" 
        "a second short sentence"]);

    Check for documents containing the n-gram ["a" "short"].

    tf = containsNgrams(documents,["a" "short"])
    tf = 2x1 logical array
    
       1
       0
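
    The IgnoreCase syntax from the description can be sketched with the same documents. This is a minimal illustration, not part of the original example set; the variable names tfExact and tfAnyCase are hypothetical, and the n-gram ["A" "short"] is chosen only because it differs from the document text by letter case.

```matlab
% Same documents as in the example above.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% With IgnoreCase=false, "A" does not match the lowercase token "a",
% so neither document contains the n-gram ["A" "short"].
tfExact = containsNgrams(documents,["A" "short"],IgnoreCase=false);

% With IgnoreCase=true, ["A" "short"] matches "a short" in the first
% document. The second document has "a second short", so the two words
% are not adjacent there and it still does not match.
tfAnyCase = containsNgrams(documents,["A" "short"],IgnoreCase=true);
```

    Passing the flag explicitly in both calls keeps the sketch independent of the option's default value.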
    
    

    Input Arguments

    Input documents, specified as a tokenizedDocument array.

    N-grams to check, specified as one of these values:

    • String array

    • Character vector

    • Cell array of character vectors

    • pattern array

    If ngrams is a string array, cell array, or pattern array, then it has size numNgrams-by-maxN, where numNgrams is the number of n-grams and maxN is the length of the largest n-gram. If ngrams is a character vector, then it represents a single word (unigram).

    The value of ngrams(i,j) corresponds to the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of ngrams must be empty.

    If ngrams contains multiple n-grams or patterns, then the function returns 1 where any of the n-grams appear in the corresponding document.

    Example: ["An" ""; "An" "example"; "example" ""]

    Data Types: string | char | cell
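
    The numNgrams-by-maxN layout described above can be sketched as follows. This is an illustrative example under stated assumptions: the third document and the variable names tf and tfShort are hypothetical and were added only to show a nonmatching case and the character vector form.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "nothing relevant here"]);

% A 2-by-2 string array holding two n-grams (numNgrams = 2, maxN = 2).
% Row 1 is the unigram "example", padded with an empty string.
% Row 2 is the bigram "a second".
ngrams = ["example" ""; "a" "second"];

% A document matches if it contains any of the rows: the first document
% contains "example", the second contains "a second", the third neither.
tf = containsNgrams(documents,ngrams);

% A character vector is treated as a single word (unigram).
tfShort = containsNgrams(documents,'short');
```

    Padding shorter n-grams with empty strings keeps the array rectangular, which is why every row must have maxN entries.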

    Option to ignore case (the IgnoreCase argument), specified as one of these values:

    • 0 (false) – Treat candidate matches that differ only by letter case as nonmatching.

    • 1 (true) – Treat candidate matches that differ only by letter case as matching.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical

    Version History

    Introduced in R2022a