Language Considerations
Text Analytics Toolbox™ supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. This table summarizes how to use Text Analytics Toolbox features for other languages.
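Feature | Language Consideration | Workaround |
---|---|---|
Tokenization | The tokenizedDocument function has built-in rules for English, Japanese, German, and Korean only. For English and German text, the 'unicode' tokenization method detects tokens using rules based on Unicode Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese and Korean text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3]. | For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'. For more information, see tokenizedDocument. |
Stop word removal | The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only. | To remove stop words from other languages, use removeWords and specify your own stop words to remove. |
Sentence detection | The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function. | For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails. For more information, see addSentenceDetails. |
Word clouds | For string input, the wordcloud and wordCloudCounts functions use English, Japanese, German, and Korean tokenization, stop word removal, and word normalization. | For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud. To specify word sizes in wordcloud, input your data as a table or arrays containing the unique words and corresponding sizes. For more information, see wordcloud. |
Word embeddings | File input to the trainWordEmbedding function requires words separated by whitespace. | For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding. To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'. For more information, see trainWordEmbedding. |
Keyword extraction | The rakeKeywords function supports English, Japanese, German, and Korean text only. | The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, it uses punctuation characters and the stop words given by stopWords as delimiters. For other languages, specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options. For more information, see rakeKeywords. |
Keyword extraction | The textrankKeywords function supports English, Japanese, German, and Korean text only. | The textrankKeywords function identifies candidate keywords based on their part-of-speech tags, which are given by addPartOfSpeechDetails for the supported languages only. For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters. For more information, see rakeKeywords. |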
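The tokenization and stop word workarounds from the table can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the Spanish tokens and the stop word list are made-up sample data, and in practice you would produce the tokens with a tokenizer suited to your language.

```matlab
% Hypothetical pretokenized Spanish text: one cell per document,
% each cell holding a string array of words.
tokens = {
    ["el" "gato" "duerme" "en" "la" "casa"]
    ["la" "casa" "es" "grande"]};

% Skip the built-in tokenization rules by setting 'TokenizeMethod' to 'none'.
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Illustrative stop word list (an assumption); supply your own list
% for the target language.
customStopWords = ["el" "la" "en" "es"];
documents = removeWords(documents,customStopWords);
```

After these steps, documents is an ordinary tokenizedDocument array, so the language-independent functions described below accept it as input.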
Language-Independent Features
Word and N-Gram Counting
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
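As a rough sketch of word and n-gram counting on text in a language without built-in support, the example below builds both models from manually tokenized documents. The Spanish tokens are invented sample data, and the choice of bigrams is only for illustration.

```matlab
% Hypothetical pretokenized documents (sample data, not from a real corpus).
tokens = {
    ["el" "gato" "duerme" "en" "la" "casa"]
    ["la" "casa" "es" "grande"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Count individual words.
bag = bagOfWords(documents);

% Count bigrams (n-grams of length 2).
ngrams = bagOfNgrams(documents,'NgramLengths',2);

% List the most frequent words in the collection.
tbl = topkwords(bag,3);
```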
Modeling and Prediction
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
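The following sketch fits small LDA and LSA models to a bag-of-words model built from manually tokenized text. The documents are invented, and the topic and component counts are arbitrary choices for a toy data set this small. Real data would need far more documents for meaningful topics.

```matlab
rng("default")  % make the LDA fit reproducible

% Hypothetical pretokenized documents (sample data only).
tokens = {
    ["gato" "duerme" "casa"]
    ["casa" "grande" "jardin"]
    ["gato" "come" "pescado"]
    ["perro" "come" "carne"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
bag = bagOfWords(documents);

% Fit an LDA model with an illustrative number of topics.
numTopics = 2;
ldaMdl = fitlda(bag,numTopics,'Verbose',0);

% Fit an LSA model with an illustrative number of components.
lsaMdl = fitlsa(bag,2);

% Per-document topic probabilities from the LDA model.
topicMixtures = transform(ldaMdl,bag);
```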
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
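A minimal sketch of training a word embedding from manually tokenized documents is shown below. The data is invented and far too small to give useful vectors, and the 'Dimension', 'MinCount', and 'NumEpochs' values are illustrative assumptions chosen so that the toy example runs quickly.

```matlab
% Hypothetical pretokenized documents (sample data only).
tokens = {
    ["el" "gato" "duerme" "en" "la" "casa"]
    ["la" "casa" "es" "grande"]
    ["el" "gato" "come" "en" "la" "casa"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Train a small embedding. 'MinCount' is lowered to 1 so that words
% appearing only once in this tiny data set are not discarded.
emb = trainWordEmbedding(documents, ...
    'Dimension',10, ...
    'MinCount',1, ...
    'NumEpochs',5, ...
    'Verbose',0);

% Look up the vector for one word.
vec = word2vec(emb,"casa");
```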
References
[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/
[2] Boundary Analysis. https://unicode-org.github.io/icu/userguide/boundaryanalysis/
[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
See Also
stopWords | removeWords | normalizeWords | bagOfWords | bagOfNgrams | tokenizedDocument | fitlda | fitlsa | wordcloud | addSentenceDetails | addLanguageDetails