replaceNgrams

Replace n-grams in documents

Syntax

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true)

Description

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams) updates the specified documents by replacing each n-gram in oldNgrams with the corresponding n-gram in newNgrams. By default, the function is case sensitive.

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true) replaces the n-grams in oldNgrams, ignoring case when matching them against the documents.
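As a minimal sketch of the case-insensitive syntax, the following replaces both "ny" and "NY" with the bigram ["New" "York"]. The sample sentence is made up for illustration and is not part of the original page.

```matlab
% Illustrative input (hypothetical sentence): the abbreviation appears
% once in lowercase and once in uppercase.
documents = tokenizedDocument("Flights from ny to NY are rare.");

% A single old unigram and the bigram that replaces it.
oldNgrams = "NY";
newNgrams = ["New" "York"];

% With 'IgnoreCase' set to true, matching does not depend on letter case.
newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true)
```

Without the 'IgnoreCase' option, only the exact-case token "NY" would be expected to match, because the function is case sensitive by default.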

Examples

Replace N-grams In Documents

Use the replaceNgrams function to replace abbreviations with their corresponding expanded forms.

Create an array of tokenized documents.

str = [ ...
    "Currently in Cambridge, MA."
    "Next stop, NY!"];
documents = tokenizedDocument(str)

documents = 
  2×1 tokenizedDocument:

    6 tokens: Currently in Cambridge , MA .
    5 tokens: Next stop , NY !

Replace the tokens "MA" and "NY" with "Massachusetts" and ["New" "York"] respectively. If the n-grams have different lengths, you must pad the rows with the empty string "". In this case, you must pad "Massachusetts" with a single empty string "".

oldNgrams = [
    "MA"
    "NY"];
newNgrams = [
    "Massachusetts" ""
    "New" "York"];
documents = replaceNgrams(documents,oldNgrams,newNgrams)

documents = 
  2×1 tokenizedDocument:

    6 tokens: Currently in Cambridge , Massachusetts .
    6 tokens: Next stop , New York !

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

`oldNgrams` — N-grams to replace
string array | character vector | cell array of character vectors

N-grams to replace, specified as a string array, character vector, or a cell array of character vectors.

If oldNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If oldNgrams is a character vector, then it represents a single word (unigram).

The value of oldNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of oldNgrams must be padded with the empty string "".

For example, to specify both the unigram "Massachusetts", and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".

Data Types: string | char | cell
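The padding rule for oldNgrams can be sketched as follows. This example runs the page's main example in reverse, collapsing expanded names into abbreviations. The sample sentences are made up for illustration.

```matlab
% Illustrative documents (hypothetical sentences).
documents = tokenizedDocument([ ...
    "I moved from Massachusetts to New York."
    "New York is larger than Massachusetts."]);

% The unigram in row 1 is padded with "" so that both rows have
% maxN = 2 columns, matching the bigram in row 2.
oldNgrams = [
    "Massachusetts" ""
    "New" "York"];

% One replacement per row of oldNgrams. Both are unigrams,
% so no padding is needed here.
newNgrams = [
    "MA"
    "NY"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```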

`newNgrams` — New n-grams
string array | character vector | cell array of character vectors

New n-grams, specified as a string array, character vector, or a cell array of character vectors.

If newNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If newNgrams is a character vector, then it represents a single word (unigram).

The value of newNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of newNgrams must be padded with the empty string "".

newNgrams must have one row, or the same number of rows as oldNgrams.

Data Types: string | char | cell
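The single-row case for newNgrams can be sketched as follows. This assumes that a one-row newNgrams is used as the replacement for every row of oldNgrams, which is how the sizing rule above reads but is not stated outright on this page. The sample sentence and the placeholder token "STATE" are made up for illustration.

```matlab
% Illustrative document (hypothetical sentence).
documents = tokenizedDocument("Offices in MA and NY are closed.");

% Two n-grams to replace, one per row.
oldNgrams = [
    "MA"
    "NY"];

% A single-row replacement. Assumption: it applies to every
% row of oldNgrams.
newNgrams = "STATE";

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams)
```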

Output Arguments

`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2019a
