mecabOptions

Options for MeCab tokenization

Description

A mecabOptions object specifies additional options for tokenizing Japanese and Korean text.

To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.

Creation

Syntax

options = mecabOptions

options = mecabOptions(PropertyName=Value)

Description

options = mecabOptions creates a MeCab tokenization option set with the default values for tokenizing Japanese.

options = mecabOptions(PropertyName=Value) also sets Properties using one or more name-value arguments.

Properties

`Model` — Path to trained model
string scalar | character vector

Path to trained model (MeCab dictionary), specified as a string scalar or a character vector.

The default value is a path to the internal dictionary for Japanese tokenization.

Example: "C:\myDict"

Data Types: char | string

`UserModel` — Files containing model extensions
`""` (default) | string array | character vector | cell array of character vectors

Files containing model extensions (MeCab user dictionary .dic files), specified as a string array, a character vector, or a cell array of character vectors.

Example: "C:\myFile.dic"

Data Types: char | string | cell

`LemmaExtractor` — Function extracting lemma from MeCab reply
`@textanalytics.ja.mecabToLemma` (default) | function handle

Function extracting lemma from MeCab reply, specified as a function handle.

The function must have the form lemmata = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format, without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output lemmata is a string array of the same size as words containing the extracted lemmata.

The default lemma extractor is the textanalytics.ja.mecabToLemma function.

Data Types: function_handle
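
The function form described above can be illustrated with a minimal sketch of a custom lemma extractor. The function name baseFormLemma is hypothetical, and the code assumes IPADIC-style feature lines in which the seventh comma-separated field holds the base form (or "*" when none is available). Other dictionaries may use a different layout, so check the feature format of your model before relying on this.

```matlab
function lemmata = baseFormLemma(words, info)
% Hypothetical lemma extractor (not part of the toolbox).
% Assumption: each element of info.Feature is a comma-separated line
% whose 7th field is the base form, as in IPADIC-style dictionaries.
lemmata = words;                          % fall back to the surface form
for k = 1:numel(words)
    fields = split(info.Feature(k), ",");
    if numel(fields) >= 7 && fields(7) ~= "*"
        lemmata(k) = fields(7);           % use the dictionary base form
    end
end
end
```

To use such a function, pass its handle when creating the options, for example options = mecabOptions('LemmaExtractor',@baseFormLemma).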

`POSExtractor` — Function extracting part-of-speech information from MeCab reply
`@textanalytics.ja.mecabToPOS` (default) | function handle

Function extracting part-of-speech information from MeCab reply, specified as a function handle.

The function must have the form posTags = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format, without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output posTags is a categorical array of the same size as words containing the extracted part-of-speech tags from the following categories:

adjective
adposition
adverb
auxiliary-verb
coord-conjunction
determiner
interjection
noun
numeral
pronoun
proper-noun
punctuation
symbol
verb
other

The default part-of-speech information extractor is the textanalytics.ja.mecabToPOS function.

Data Types: function_handle
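
As a rough illustration of the required form, the following sketch maps the first field of each feature line to one of the categories listed above. The function name coarsePOS is hypothetical, the mapping covers only a few IPADIC-style top-level tags, and the choice of "adposition" for particles is an illustrative assumption rather than the behavior of the default extractor.

```matlab
function posTags = coarsePOS(words, info)
% Hypothetical part-of-speech extractor (not part of the toolbox).
% Assumption: the first comma-separated field of each feature line is
% an IPADIC-style top-level part-of-speech label.
cats = ["adjective" "adposition" "adverb" "auxiliary-verb" ...
    "coord-conjunction" "determiner" "interjection" "noun" "numeral" ...
    "pronoun" "proper-noun" "punctuation" "symbol" "verb" "other"];
tags = repmat("other", size(words));      % default for unmapped labels
for k = 1:numel(words)
    % Append a comma so that extractBefore always finds a delimiter.
    first = extractBefore(info.Feature(k) + ",", ",");
    switch first
        case "名詞"
            tags(k) = "noun";
        case "動詞"
            tags(k) = "verb";
        case "形容詞"
            tags(k) = "adjective";
        case "副詞"
            tags(k) = "adverb";
        case "助詞"
            tags(k) = "adposition";
    end
end
% Fix the category set so that the output always uses the expected categories.
posTags = categorical(tags, cats);
end
```

Passing the fixed category list to categorical keeps the set of categories stable even when a document contains only a few of the tags.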

`NERExtractor` — Function extracting named entity information from MeCab reply
`@textanalytics.ja.mecabToNER` (default) | function handle

Function extracting named entity information from MeCab reply, specified as a function handle.

The function must have the form entities = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format, without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output entities is a categorical array of the same size as words containing the extracted entities from the following categories:

non-entity
person
organization
location
other

The default named entity information extractor is the textanalytics.ja.mecabToNER function.

Data Types: function_handle
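
The simplest function that satisfies this form labels every token as a non-entity. The sketch below is a hypothetical placeholder (the name noEntities is not part of the toolbox) that may be useful as a starting point for a custom extractor or when entity labels are not needed.

```matlab
function entities = noEntities(words, info) %#ok<INUSD>
% Hypothetical named entity extractor (not part of the toolbox).
% Labels every token "non-entity"; info is accepted but unused.
cats = ["non-entity" "person" "organization" "location" "other"];
entities = categorical(repmat("non-entity", size(words)), cats);
end
```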

Examples

Create MeCab Options Object

Create a MecabOptions object containing the default options for Japanese tokenization.

options = mecabOptions

options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: ""
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER

Specify MeCab User Dictionary for Tokenization

Tokenize Japanese text using custom MeCab options.

Create a string array of Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];

Create a MecabOptions object and specify a user model as a .dic file using the 'UserModel' option.

options = mecabOptions('UserModel','myFile.dic')

options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: "myFile.dic"
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER

Tokenize the text with the specified options by passing them as the value of the 'TokenizeMethod' option.

documents = tokenizedDocument(str,'TokenizeMethod',options)

documents = 
  4×1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。

Version History

Introduced in R2019b
